Performance Analysis of Nonlinear Echo State Network Readouts in Signal Processing Tasks

Levy Boccato, Diogo C. Soriano, Romis Attux and Fernando José Von Zuben
Department of Computer Engineering and Industrial Automation (DCA)

School of Electrical and Computer Engineering (FEEC)
University of Campinas (UNICAMP)

Campinas, São Paulo, Brazil
Email: {lboccato, soriano, attux, vonzuben}@dca.fee.unicamp.br

Abstract—Echo state networks (ESNs) characterize an attractive alternative to conventional recurrent neural network (RNN) approaches, as they offer the possibility of preserving, to a certain extent, the processing capability of a recurrent architecture and, at the same time, of simplifying the training process. However, the original ESN architecture cannot fully explore the potential of the RNN, given that only the second-order statistics of the signals are effectively used. In order to overcome this constraint, distinct proposals promote the use of a nonlinear readout aiming to explore higher-order available information while still maintaining a closed-form solution in the least-squares sense. In this work, we review two proposals of nonlinear readouts - a Volterra filter structure and an extreme learning machine - and analyze the performance of these architectures in the context of two relevant signal processing tasks: supervised channel equalization and chaotic time series prediction. The obtained results reveal that the nonlinear readout can be decisive in the process of approximating the desired signal. Additionally, we discuss the possibility of combining both ideas of nonlinear readouts, and preliminary results indicate that a performance improvement can be attained.

I. INTRODUCTION

Recurrent neural networks (RNNs) are structures naturally adapted to deal with complex problems characterized by the existence of dynamical behavior, such as time series prediction, dynamical system identification and adaptive filtering [1]. This is a direct consequence of two factors: 1) the presence of feedback connections, which allow the development of an internal memory of the signal over time; and 2) the flexibility provided by nonlinear processing elements.

In spite of this acknowledged potential, RNNs face well-known difficulties inherent to their canonical training process: slow convergence, gradient decay and the possibility of reaching unstable behaviors. In order to overcome these drawbacks, new recurrent structures with a simpler training stage have been proposed, defining a promising set of alternatives that includes the so-called echo state networks (ESNs) [2], [3].

A key element of the ESN structure is the use of fixed weights in all recurrent connections of the RNN, defining a stage called dynamical reservoir. In addition to that, the dynamical patterns engendered by the reservoir are linearly combined in a readout stage, which can be properly designed with the aid of any linear regression method [3]. This shows that, in essence, an ESN is built from two key elements - the dynamical reservoir and the readout stage - which is emblematic of an important and active research field known as reservoir computing (RC) [3], [4].

Among the proposals concerning the design of the dynamical reservoir, we highlight, aside from Jaeger's classical recipe [2], the work of Ozturk et al. [5], which evoked the concept of Kautz filters in order to establish an analytical procedure to define reservoir connection weights capable of generating a more diverse repertoire of dynamics. On the other hand, a linear combiner has been the usual choice to play the role of the readout structure in view of the possibility of obtaining, in closed form, optimal solutions in the least-squares sense. A structure of this kind, however, cannot make use of the higher-order statistics of the reservoir signals, which means that, in certain circumstances, relevant information concerning the echo states might not be duly explored. This evidence motivated the investigation of alternative strategies for combining the reservoir signals in order to achieve a more accurate approximation of the desired input-output behavior.

In particular, the possibility of using a nonlinear structure in the output layer emerges as an interesting option. Recently, two works have proposed distinct, albeit conceptually related, schemes for building a nonlinear ESN readout: Butcher et al. [6] introduced the idea of using an extreme learning machine (ELM) in the role of readout, and analyzed the performance of this novel architecture considering polynomial approximation and spoken digit recognition tasks; Boccato et al. [7] replaced the linear combiner at the output with a Volterra filter structure and, aiming to avoid an excessive growth in the number of coefficients, employed the Principal Component Analysis (PCA) technique. The results obtained in supervised channel equalization revealed that a significant performance improvement can be achieved with this nonlinear readout.

Both ideas share two most relevant features: (i) the problem of adjusting the coefficients of the nonlinear readout structure is linear with respect to the free parameters, which means that an optimal solution can be obtained in the least-squares sense and the simplicity of the original ESN training process is preserved; (ii) the proposed architectures can be easily adapted to different reservoir design methods. These features encourage a detailed study of the potential of nonlinear echo state network readouts. In this work, we will review the main aspects of the proposals of Butcher et al. [6] and Boccato et al. [7], and carry out a comparative analysis of the performance achieved with each proposal in the context of two classical and most relevant information processing tasks: supervised channel equalization and chaotic time series prediction. Additionally, we will analyze the promising possibility of combining both readouts in a cascade, thereby establishing a single readout based on both structures.

This paper is organized as follows: Section II presents the basic echo state network proposed by Jaeger [2] and the alternative proposal of Ozturk et al. [5], as well as the nonlinear readouts proposed by Butcher et al. [6] and Boccato et al. [7]; the definition of the problems studied in this work - supervised channel equalization and chaotic time series prediction - is found in Section III. Next, Section IV exhibits the set of results obtained with each ESN architecture in both tasks, highlighting the benefits acquired with a more flexible output layer and comparing the available options for a nonlinear readout. Finally, some concluding remarks and perspectives of future work are presented in Section V.

II. ECHO STATE NETWORKS

Echo state networks, proposed by Jaeger [2], can be seen as a trade-off solution which aims to unite two important features: not only do these networks maintain, to a certain extent, the processing capability inherent to recurrent structures, but they also allow a significant simplification of the training process.

In order to actualize such potential advantages, these networks employ two distinct processing layers: (i) a highly interconnected recurrent layer of nonlinear processing elements (NPEs), named reservoir, which engenders a rich repertoire of dynamical behaviors; (ii) a readout, which is responsible for linearly combining the activations of the reservoir NPEs to produce the network outputs. The basic ESN architecture is depicted in Figure 1.

Fig. 1. The basic echo state network architecture (inputs u(n), states x(n), outputs y(n); weight matrices W_in, W, W_out and W_back).

The vector x(n) = [x_1(n) ... x_N(n)]^T contains the network states, which are updated according to the following expression:

    x(n+1) = f(W_in u(n+1) + W x(n) + W_back y(n)),    (1)

where u(n) = [u_1(n) ... u_K(n)]^T contains the set of K inputs of the network, W_in ∈ ℝ^(N×K) specifies the coefficients of the linear combinations involving the input signals, W ∈ ℝ^(N×N) brings the recurrent connection weights within the reservoir, W_back ∈ ℝ^(N×L) contains the weights associated with the output feedback connections, and f(·) = (f_1(·), ..., f_N(·)) denotes the activation functions of the internal units.

The network states are then linearly combined in order to produce the network output vector y(n) ∈ ℝ^L, which can be expressed as follows:

    y(n+1) = W_out x(n+1),    (2)

where W_out ∈ ℝ^(L×N) is the output weight matrix.

An essential conceptual element developed in the study performed by Jaeger is the echo state property [2], which establishes that, when certain conditions involving specific characteristics of the reservoir weight matrix W are satisfied, the reservoir dynamics tend to reflect the recent history of the input signals, which means that the network states become a nonlinear transformation of the current input and of an emergent memory of the input history as well.

Since the presence of a useful memory is ensured by the echo state property, the reservoir weight matrix W can be, in theory, arbitrarily chosen, as long as it fulfills the imposed conditions for the existence of echo states.¹ Based on these observations, the strategy introduced by Jaeger [2] consists in randomly creating a weight matrix W, satisfying the echo state property, which exhibits a certain degree of sparseness that intuitively contributes to decouple groups of neurons, favoring the emergence of individual dynamical behaviors.

After the dynamical reservoir has been properly designed, as well as the input weight matrix W_in, which does not influence the echo state property, the network training process boils down to adjusting the coefficients of the linear readout. Thus, any linear regression method can be used to determine the optimal output weights.
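To make this pipeline concrete, the following sketch (in Python with NumPy; the experiments reported later were run in Matlab) builds a random reservoir, iterates the state update of Equation (1) with no output feedback, and fits the linear readout of Equation (2) by least squares. The sizes, the spectral-radius rescaling and the placeholder signals are illustrative assumptions, not the exact setup of [2]:

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, T = 100, 1, 1100                     # reservoir size, inputs, samples (assumed)

    # Random reservoir rescaled to spectral radius 0.8 (echo state heuristic)
    W = rng.uniform(-1.0, 1.0, (N, N))
    W *= 0.8 / np.abs(np.linalg.eigvals(W)).max()
    W_in = rng.choice([-1.0, 1.0], (N, K))     # +/-1 input weights

    u = rng.uniform(-1.0, 1.0, (T, K))         # placeholder input signal
    d = np.roll(u[:, 0], 1)                    # placeholder desired signal

    # State update, Eq. (1), with W_back = 0
    X = np.zeros((T, N))
    x = np.zeros(N)
    for n in range(T):
        x = np.tanh(W_in @ u[n] + W @ x)
        X[n] = x

    # Linear readout, Eq. (2): closed-form least-squares solution for W_out
    washout = 100                              # discard initial transient
    W_out, *_ = np.linalg.lstsq(X[washout:], d[washout:], rcond=None)
    y = X @ W_out                              # network output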

However, the arbitrary choice of the recurrent connection weights, albeit justified in intuitive terms, does not necessarily lead to a sound performance. Moreover, the network capability of approximating the desired signal can vary significantly when different weight matrices are considered, even when these matrices share the same spectral radius. This observation reveals the vital role of the reservoir design within the ESN approach, and it represented the main motivation of the work by Ozturk et al. [5]. By analyzing the dynamics of the linearized version of the system described in Equation 1, the authors proposed an unsupervised procedure to design the reservoir: the reservoir weight matrix assumes a pre-defined canonical form so that the poles of the linearized system are uniformly spread within the unit circle, which contributes to the occurrence of a more diverse set of reservoir dynamics. Additionally, Ozturk et al. [5] also introduced the use of the average state entropy (ASE) as a measure of the dynamical diversity produced by the reservoir.

¹ It is important to mention that the echo state property is expressed in terms of the largest singular value of the internal weight matrix W, which must be smaller than one. Interestingly, in all experiments Jaeger carried out, when the largest absolute eigenvalue of the weight matrix W is smaller than one (i.e., the spectral radius of W is smaller than one), the existence of echo states is ensured in practice. This explains why this weaker heuristic condition is usually employed.

In both aforementioned ESN approaches, the output layer corresponds to a linear combiner, which offers an advantage in terms of training simplicity. Nevertheless, this structure cannot make use of the higher-order statistics of the information coming from the nonlinear dynamics of the reservoir. Therefore, in order to overcome this bottleneck, more sophisticated readout schemes have been proposed, like those presented by Butcher et al. [6] and Boccato et al. [7].

A. Nonlinear readouts

The perspective of using nonlinear readouts for echo state networks is encouraged by the possibility of effectively exploiting the statistics of the network states with the purpose of better approximating the desired signal. However, it is crucial that the output layer remain linear with respect to the free parameters, so that the simplicity of the training process is preserved and a closed-form solution in the least-squares (or Wiener) sense can still be found [1].

In this context, the idea of Boccato et al. [7] consisted in replacing the original linear combiner with the Volterra filter structure. Hence, each network output is computed as follows:

    y_l(n) = h_0^(l) + Σ_{i=1}^{N} h_1^(l)(i) x_i(n)
             + Σ_{i=1}^{N} Σ_{j=1}^{N} h_2^(l)(i,j) x_i(n) x_j(n)
             + Σ_{i=1}^{N} Σ_{j=1}^{N} Σ_{k=1}^{N} h_3^(l)(i,j,k) x_i(n) x_j(n) x_k(n) + ...,    (3)

where h_z^(l)(k_1, ..., k_z) denote the filter coefficients, called Volterra kernels, with z = 1, ..., M_o representing the order of the corresponding polynomial terms; M_o assumes a finite value when the polynomial expansion is truncated. This approach meets the expected requirements regarding the training process, and is capable of using higher-order statistics of the network states in the computation of the optimal solution, as remarked in [7].

Additionally, aiming to mitigate the curse of dimensionality [8], Boccato et al. [7] applied the Principal Component Analysis (PCA) technique before the reservoir dynamics are transmitted to the Volterra filter. Hence, by using only a few principal components, the corresponding number of coefficients to be adjusted can be significantly reduced. This architecture has been analyzed in the context of supervised channel equalization, and promising results have been reported in [7].
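As an illustration of this readout, the sketch below (continuing the previous one; the second-order truncation and N_pc = 10 are assumptions made for the example) compresses the states with PCA and builds the polynomial features of Equation (3), so that the Volterra kernels are again obtained by least squares:

    # PCA on the reservoir states, keeping N_pc principal components
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Npc = 10
    Z = Xc @ Vt[:Npc].T

    # Volterra features truncated at second order: bias, linear and quadratic terms
    quad = np.einsum('ti,tj->tij', Z, Z).reshape(T, -1)
    Phi = np.hstack([np.ones((T, 1)), Z, quad])

    # The kernels h_0, h_1, h_2 of Eq. (3) solve a linear least-squares problem
    h, *_ = np.linalg.lstsq(Phi[washout:], d[washout:], rcond=None)
    y_volterra = Phi @ h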

Based on a different framework, the architecture proposed by Butcher et al. [6] employs extreme learning machines (ELMs) in the readout. ELMs are single-hidden-layer feedforward neural networks (SLFNs) presenting the peculiar characteristic that only the output weights are adjusted, with the aid of the generalized inverse of the hidden layer activation matrix, whereas the input weights, as well as the biases of the hidden units, can be randomly chosen [9], [10].²

While the hybrid architecture proposed by Butcher et al. [6] uses two static SLFNs to form the output layer, which receive, respectively, the ESN inputs and states, in this work we shall consider a simplified version, in which a single ELM performs a nonlinear mapping of the reservoir dynamics, similarly to what occurs in the proposal of Boccato et al. [7]. In mathematical terms, the activations of the ELM hidden units are given by:

    x_h(n) = f_h(W_h x(n) + b),    (4)

where W_h is the randomly-defined input matrix and b contains the random bias of each hidden unit. Then, the ELM outputs are obtained by means of linear combinations of the hidden activations, as shown in the following expression:

    y(n) = β x_h(n),    (5)

where β specifies the coefficients of such combinations. As we can observe, the simplicity of the training process is preserved, since the problem of obtaining the optimal values of β_i is linear with respect to the free parameters.
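A minimal sketch of this simplified R-ESN/ELM readout follows (again continuing the first sketch; N_h and the uniform initialization of W_h and b are assumptions): the hidden layer is random and fixed, and only β is fitted.

    Nh = 100                                   # hidden units (assumed)
    W_h = rng.uniform(-1.0, 1.0, (Nh, N))      # random input weights, never trained
    b = rng.uniform(-1.0, 1.0, Nh)             # random biases

    Xh = np.tanh(X @ W_h.T + b)                # hidden activations, Eq. (4)
    beta, *_ = np.linalg.lstsq(Xh[washout:], d[washout:], rcond=None)
    y_elm = Xh @ beta                          # ELM output, Eq. (5)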

Another relevant characteristic existing in both approaches presented in this section is the flexibility with respect to different reservoir design strategies. In fact, both architectures can be easily adapted to use the original idea of [2] as well as the alternative procedure introduced by [5], but we shall restrict the analysis to the first method, which, in the performed experiments, led to the best results. Henceforth, we shall refer to each ESN architecture by the following acronyms: R-ESN is associated with the original ESN conceived by Jaeger, ASE-ESN is related to the proposal of Ozturk et al., R-PVESN refers to the ESN architecture with the Volterra filter, and, finally, R-ESN/ELM refers to the ESN using an ELM in the readout. Now, we proceed to the description of the main aspects of the applications considered in this work.

III. APPLICATIONS

In order to assess the potential advantages acquired with the use of nonlinear structures in the readout, the echo state networks presented in the previous section shall be applied in the context of: (i) supervised channel equalization, in which the ESNs shall play the role of the equalizer, aiming at recovering the original information transmitted through a communication channel, and (ii) chaotic time series prediction, in which the ESNs attempt to predict the state of a dynamical system exhibiting chaotic behavior.

A. Channel Equalization

Fundamentally, the problem of channel equalization consists of recovering an information signal of interest from received measurements that correspond to distorted versions of such signal due to the action of the physical environment used for the transmission. This inevitable distortion is usually modeled as the resulting effect of a possibly noisy linear/nonlinear system (channel), as shown in Figure 2, in which s(n) represents the signal of interest and r(n) is the received signal.

² A rigorous proof showing that the input weights, along with the biases of the hidden neurons of SLFNs, can be randomly assigned when the activation functions in the hidden layer are infinitely differentiable can be found in [10].

Fig. 2. Model of a communication system: the signal of interest s(n) passes through the channel, producing the received signal r(n).

A classical strategy to counterbalance the channel effects is to process the received signal with the aid of an especially-tailored filter, known as equalizer, which, ideally, should reverse the action of the channel, offering an undistorted version of the original signal, i.e.,

    y(n) = k s(n - d),    (6)

where y(n) is the equalizer output, k is a constant gain and d denotes the equalization delay. This condition is commonly referred to as zero-forcing [1].

A successful application of this strategy involves crucial choices concerning: 1) the filtering structure, 2) an equalization criterion, and 3) an efficient algorithm to adjust the parameters of the filter according to the selected criterion. In this work, since echo state networks shall operate as equalizers, the filtering structure will be both nonlinear and recurrent, features that can be decisive even in solving certain linearly-generated problems [11].

In a supervised learning framework, the canonical formulation of the equalization problem relies on the idea of selecting the filter parameters that minimize the mean-squared error between the desired signal and the equalizer output:

    J_MSE = E{[d(n) - y(n)]^2},    (7)

where E{·} denotes statistical expectation, y(n) represents the equalizer output and d(n) is the desired signal, which corresponds to s(n) or a delayed version thereof. Interestingly, since, in the ESN approach, the outputs are obtained through linear combinations of the reservoir states, the problem of adjusting the coefficients of such combinations in order to minimize (7) admits a single optimal solution, which evokes the conceptual framework of Wiener filtering [1].

B. Chaotic Time Series Prediction

Chaotic systems are, in simple terms, systems described by a deterministic set of nonlinear equations (usually difference or differential equations) that exhibit aperiodic solutions with strong sensitivity to the initial conditions. In practice, chaotic behavior can be found, for instance, in many biological, mechanical, electrical and physical systems [12]. The complexity of this dynamical regime, together with such a vast practical domain, explains why the prediction of experimental time series generated by chaotic systems is a classical (and challenging) task in time series analysis [13], [14].

In this context, among the techniques developed to predict chaotic time series [15], [14], recurrent neural networks have received great attention due to their flexibility and nonlinear structure [16]. For this reason, a performance analysis of ESNs with different nonlinear readouts is carried out here for this signal processing task.

In simple terms, the chaotic time series were generated using two classical chaotic models: the logistic map and the Lorenz model. The former stands out by exhibiting an impulsive autocorrelation function for specific parameters (the case analyzed here), while the latter is characterized by a relatively long-correlated time series. Within this framework, the prediction problem basically consists in estimating a future state l(n+h) given the current one l(n) (and also, in some scenarios, some delayed samples), which is analyzed here for a different number of inputs and for different prediction horizons (h).

IV. EXPERIMENTAL RESULTS

In this section, we present the methodology used to train and test the ESNs, as well as the main results achieved by each architecture in both signal processing tasks. All the experiments were carried out within the Matlab® environment.

A. Methodology

In both applications, the basic measure used to analyze the performance of the ESNs for equalization and prediction of chaotic time series is the mean squared error (MSE) between the desired signal d(n) and the network output y_ESN(n), as follows:

    MSE = (1/T_s) Σ_{i=1}^{T_s} [d(i) - y_ESN(i)]^2.    (8)

In fact, an average MSE (AMSE) considering 20 independent experiments is calculated, which offers a more consistent performance evaluation of the networks. Additionally, the standard deviation of the MSE values obtained in such experiments is also presented, which complements the analysis based on the AMSE values with a notion of the variability of the performance of each ESN architecture.
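For reference, Equation (8) and the averaging over repetitions amount to the following (continuing the earlier sketches; mse_runs would collect one value per independent experiment, of which a single run is shown here):

    def mse(d, y):
        """Mean squared error of Eq. (8) over the retained samples."""
        return float(np.mean((d - y) ** 2))

    mse_runs = [mse(d[washout:], y[washout:])]   # one entry per repetition (20 in the paper)
    amse, std = np.mean(mse_runs), np.std(mse_runs)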

With respect to the network training process, we adopted the same setup used in [2], [5] and [7]. For all networks, the input weights (W_in)_ij were set to -1 or +1 with equal probability, and no output feedback is allowed (W_back = 0). The reservoir connection weights W_ij are set to -0.4, 0.4 or 0 with probabilities of 0.025, 0.025 and 0.95, respectively, in the case of R-ESN, R-PVESN and R-ESN/ELM [2], whereas, in the case of ASE-ESN, the reservoir weight matrix assumes the canonical form displayed in [5], with a spectral radius equal to 0.8.
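This initialization can be reproduced directly; a sketch (matrix sizes as in the first example) is:

    # W_in entries are +/-1 with equal probability; reservoir weights are
    # -0.4, 0.4 or 0 with probabilities 0.025, 0.025 and 0.95 (R-ESN case)
    W_in = rng.choice([-1.0, 1.0], (N, K))
    W = rng.choice([-0.4, 0.4, 0.0], (N, N), p=[0.025, 0.025, 0.95])
    W_back = np.zeros((N, 1))                  # no output feedback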

B. Application: Channel Equalization

Let the source signal s(n) be composed of i.i.d. samples of a binary alphabet {+1, -1} (BPSK modulation). The echo state networks shall play the role of equalizers using only the instantaneous received sample r(n) to estimate the original information s(n) without accounting for any equalization delay, which defines a challenging equalization scenario in view of the fact that a single input is available for the equalizer (K = 1). In a scenario of this kind, the ability of ESNs to create an internal memory of the input history is potentially essential. Additionally, T_s = 1100 samples are used during the training and test phases, with the first hundred samples discarded due to possible transient effects.

1) First Scenario: The first channel we shall consider is described by the following transfer function:

    H(z) = 0.407 + 0.815 z^-1 + 0.407 z^-2.    (9)

A peculiar feature of this channel is that there are coincident states: the input sequences of symbols (s(n) = -1, s(n-1) = -1, s(n-2) = 1) and (s(n) = 1, s(n-1) = -1, s(n-2) = -1) produce the received sample r(n) = -0.815, whereas the sequences (s(n) = 1, s(n-1) = 1, s(n-2) = -1) and (s(n) = -1, s(n-1) = 1, s(n-2) = 1) lead to r(n) = 0.815. In this case, feedforward structures will not be able to distinguish among the possible sequences whenever this state occurs. As demonstrated by Montalvao et al. [17], the use of recurrent structures is absolutely necessary in order to adequately separate the channel states in this case.
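These coincident states can be checked by enumerating all BPSK triples through the channel; a quick verification sketch:

    from itertools import product

    taps = [0.407, 0.815, 0.407]               # impulse response of H(z) in Eq. (9)
    for s in product([1, -1], repeat=3):       # (s(n), s(n-1), s(n-2))
        r = sum(t * si for t, si in zip(taps, s))
        print(s, round(r, 3))
    # (-1, -1, 1) and (1, -1, -1) both yield r = -0.815, while
    # (1, 1, -1) and (-1, 1, 1) both yield r = 0.815.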

The ESNs have been trained and tested according to the previously described methodology, and the AMSE values obtained by each network, along with the respective standard deviations, are presented in Table I.

TABLE I
AMSE VALUES OBTAINED WITH EACH ECHO STATE NETWORK CONSIDERING H(z) = 0.407 + 0.815z^-1 + 0.407z^-2.

Linear channel                              AMSE
Network      Parameter     Training             Test
R-ESN        N = 10        4.46(±2.33)e-01      4.40(±2.38)e-01
             N = 40        1.75(±0.78)e-02      1.75(±0.70)e-02
             N = 60        9.33(±3.88)e-03      1.05(±0.48)e-02
             N = 100       5.87(±2.16)e-03      9.90(±5.19)e-03
ASE-ESN      N = 10        3.95(±1.59)e-01      4.02(±1.76)e-01
             N = 40        1.37(±0.91)e-01      1.56(±1.09)e-01
             N = 60        8.68(±7.07)e-02      1.13(±0.97)e-01
             N = 100       3.03(±1.84)e-02      6.10(±4.30)e-02
R-PVESN      N_pc = 3      6.13(±1.68)e-02      5.92(±1.54)e-02
             N_pc = 6      1.05(±0.80)e-02      9.09(±5.79)e-03
             N_pc = 8      1.44(±1.34)e-03      2.22(±1.87)e-03
             N_pc = 10     1.90(±5.28)e-04      7.58(±15.8)e-04
R-ESN/ELM    N_h = 10      2.34(±1.36)e-01      2.43(±1.38)e-01
             N_h = 40      1.22(±0.84)e-02      1.13(±0.68)e-02
             N_h = 60      3.29(±3.74)e-03      3.47(±3.56)e-03
             N_h = 100     3.07(±6.53)e-04      4.83(±9.99)e-04

For both architectures with nonlinear readouts, the number of neurons within the dynamical reservoir is set to N = 20, whereas the number of principal components (N_pc) and of ELM hidden neurons (N_h) assume different values.

In this scenario, the mean-squared error associated with the optimal feedforward equalizer, one that can perfectly resolve all states except the aforementioned coincident states, is 0.5 [18]. Therefore, it is possible to affirm that the results displayed in Table I attest that the presence of recurrent connections within the reservoir is essential for the equalizer to correctly distinguish the coincident states, since the AMSE values obtained with the ESNs are significantly smaller than the feedforward threshold value of 0.5. Moreover, the results shown in Table I also reveal an important improvement in the performance of the ESNs when nonlinear readouts, such as Volterra filters and ELMs, are used.

2) Second Scenario: Now, we consider the case in which the channel is nonlinear, as shown in the following expression:

    y_channel(k) = y_lin(k) - 0.8 y_lin^2(k) - 0.3 y_lin^3(k),    (10)

where y_lin(k) denotes the output of the FIR channel whose transfer function is given by H(z) = 0.5 + z^-1.
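In simulation form (a sketch for a BPSK input, continuing the earlier examples and assuming zero initial channel memory), the channel reads:

    s = rng.choice([-1.0, 1.0], 1100)                    # BPSK source
    y_lin = 0.5 * s + np.concatenate(([0.0], s[:-1]))    # H(z) = 0.5 + z^-1
    r = y_lin - 0.8 * y_lin**2 - 0.3 * y_lin**3          # nonlinearity of Eq. (10)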

Even though it does not seem to be a challenging scenario, this channel presents the peculiar characteristic that, for any equalization delay, the channel states cannot be linearly separated. This means that, even in the absence of noise, it is not possible to reconstruct the signal transmitted through this channel with the aid of linear filters.

As in the previous scenario, we adopted N = 20 for both R-PVESN and R-ESN/ELM. For this scenario, the obtained results are exhibited in Table II. The results shown in Table II highlight the advantages obtained with the use of nonlinear readouts: in fact, the AMSE values achieved by R-PVESN and R-ESN/ELM are many orders of magnitude smaller than those obtained with the conventional ESNs.

TABLE II
AMSE VALUES OBTAINED WITH EACH ECHO STATE NETWORK CONSIDERING THE NONLINEAR CHANNEL.

Nonlinear channel                           AMSE
Network      Parameter     Training             Test
R-ESN        N = 10        4.54(±2.19)e-01      4.60(±2.20)e-01
             N = 40        5.24(±5.52)e-03      5.35(±5.34)e-03
             N = 60        1.09(±0.80)e-03      1.66(±1.54)e-03
             N = 100       4.40(±4.50)e-04      8.26(±6.64)e-04
ASE-ESN      N = 10        1.77(±0.88)e-01      1.80(±0.86)e-01
             N = 40        9.83(±4.31)e-03      1.75(±2.25)e-02
             N = 60        3.98(±2.36)e-03      1.27(±2.76)e-02
             N = 100       1.49(±0.81)e-03      5.29(±3.13)e-03
R-PVESN      N_pc = 3      9.05(±6.43)e-02      9.50(±7.02)e-02
             N_pc = 5      5.19(±8.38)e-05      5.96(±9.56)e-05
             N_pc = 8      4.49(±8.85)e-09      1.11(±2.11)e-08
             N_pc = 10     2.08(±5.51)e-11      5.95(±21.3)e-09
R-ESN/ELM    N_h = 10      6.01(±4.70)e-02      6.16(±4.81)e-02
             N_h = 40      4.19(±8.13)e-06      4.35(±8.28)e-06
             N_h = 60      6.88(±13.6)e-08      7.84(±15.0)e-08
             N_h = 100     1.67(±3.00)e-10      2.84(±5.95)e-10

Motivated by the encouraging results exposed so far, we also investigated the possibility of using both nonlinear readouts in a cascade. Hence, a new ESN architecture has been considered: in this case, the reservoir states are fed into a single hidden layer containing N_h nonlinear neurons, whose activations yield the inputs of the Volterra filter, as sketched below.
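A sketch of this cascade, reusing the pieces introduced in Section II-A (the placement of a PCA stage between the hidden layer and the Volterra expansion is our reading of the parameter values quoted next, not something the text states explicitly), is:

    # Cascade readout: random hidden layer over the reservoir states,
    # PCA on the hidden activations, then a second-order Volterra expansion
    Nh, Npc = 20, 10
    W_h = rng.uniform(-1.0, 1.0, (Nh, X.shape[1]))
    b = rng.uniform(-1.0, 1.0, Nh)
    Xh = np.tanh(X @ W_h.T + b)

    Xc = Xh - Xh.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:Npc].T
    quad = np.einsum('ti,tj->tij', Z, Z).reshape(len(Z), -1)
    Phi = np.hstack([np.ones((len(Z), 1)), Z, quad])
    h_casc, *_ = np.linalg.lstsq(Phi[washout:], d[washout:], rcond=None)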

Based on preliminary experiments, we adopted the values N = 10, N_h = 20 and N_pc = 10 for the number of reservoir neurons, hidden neurons and principal components, respectively. The AMSE value obtained by this extended architecture corresponds to 1.39e-13 for the training set and 2.24e-13 for the test dataset. These results strongly suggest that there is additional information that can be explored with the combined readout structure. Now, we proceed to the application of the studied ESNs to chaotic prediction tasks.

C. Application: Chaotic Time Series Prediction

In this work, we have considered two benchmark nonlinear dynamical systems for the performance analysis of the ESN architectures presented in Section II: the logistic map and the Lorenz system.

The first scenario corresponds to a discrete-time dynamical system whose state l(n) evolves according to the following expression:

    l(n+1) = μ l(n) (1 - l(n)),    (11)

where μ is a positive constant. In particular, for μ = 4.0, the system exhibits chaotic behavior for almost all initial conditions, which leads to aperiodic time series with an impulsive autocorrelation function. Thus, in this study, we adopted μ = 4 and the initial condition l(0) = 0.49.
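The series used here can be generated directly from Equation (11); a sketch with the values quoted above:

    import numpy as np

    mu, T = 4.0, 1100
    l = np.empty(T)
    l[0] = 0.49                                # initial condition from the text
    for n in range(T - 1):
        l[n + 1] = mu * l[n] * (1.0 - l[n])    # logistic map, Eq. (11)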

The second scenario involves a three-dimensional continuous-time nonlinear dynamical system; the equations that govern the Lorenz oscillator are:

    dx/dt = σ(y - x)
    dy/dt = x(ρ - z) - y    (12)
    dz/dt = xy - βz,

where σ, ρ and β are positive constants. For the standard values σ = 10, β = 8/3 and ρ = 28, the system exhibits chaotic behavior. In this work, the Lorenz system is initialized at the point (-0.2028, 1.815, 22.646) to avoid transient effects, and, instead of predicting all three coordinates of the state, only the first coordinate (x(t)) shall be predicted by the ESNs. Moreover, the corresponding time series is normalized in order to have zero mean and unit variance, and the AMSE values are computed in this domain.
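A sketch of the series generation (using SciPy; the 0.045 s sampling step is an assumption based on the basic horizon quoted in Section IV-C2) is:

    import numpy as np
    from scipy.integrate import solve_ivp

    def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        x, y, z = state
        return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

    t_eval = np.arange(2200) * 0.045           # assumed sampling step of 0.045 s
    sol = solve_ivp(lorenz, (0.0, t_eval[-1]), [-0.2028, 1.815, 22.646],
                    t_eval=t_eval, rtol=1e-9, atol=1e-9)
    x_series = (sol.y[0] - sol.y[0].mean()) / sol.y[0].std()   # zero mean, unit variance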

In both aforementioned scenarios, the error in the prediction offered by each ESN shall be monitored as the number of past samples n_s available at the network input is increased. Moreover, we also analyze how the prediction accuracy is affected when the prediction horizon h increases. Starting from the initial condition, the subsequent T_s = 1100 samples are used to train the ESNs, but the first hundred samples are not taken into account in the MSE computation, to avoid transient effects. The test set is composed of the next T_s = 1100 samples. In the following, we display the results obtained by each ESN for the logistic map and the Lorenz system. It is important to mention that the presentation of the results is restricted to a specific parameter setting for each ESN architecture, namely the one that led to the best performance.

1) Logistic Map: Firstly, we examined the influence of the number of input samples on the prediction of the subsequent state l(n+1). In other words, we focused on the case in which the prediction horizon is equal to h = 1, while n_s varies. In Table III, the AMSE values obtained with each ESN architecture are presented considering the following parameter setup: N = 100 for both R-ESN and ASE-ESN, N = 20 and N_pc = 10 for R-PVESN, and, finally, N = 5 and N_h = 100 for R-ESN/ELM.

TABLE III
AMSE VALUES OBTAINED WITH EACH ECHO STATE NETWORK CONSIDERING h = 1.

                                            AMSE
Network      Inputs        Training             Test
R-ESN        n_s = 1       4.88(±2.56)e-03      6.63(±3.08)e-03
             n_s = 3       3.00(±1.51)e-03      4.20(±1.88)e-03
             n_s = 5       2.12(±0.55)e-02      2.79(±0.57)e-02
ASE-ESN      n_s = 1       6.39(±2.32)e-03      9.97(±3.04)e-03
             n_s = 3       6.29(±2.64)e-03      1.26(±0.50)e-02
             n_s = 5       2.89(±0.93)e-02      4.26(±1.14)e-02
R-PVESN      n_s = 1       1.39(±4.88)e-07      4.16(±12.6)e-07
             n_s = 3       5.08(±8.41)e-07      3.83(±7.56)e-06
             n_s = 5       3.05(±3.58)e-05      1.08(±0.89)e-04
R-ESN/ELM    n_s = 1       4.84(±13.4)e-10      7.07(±19.8)e-10
             n_s = 3       9.54(±31.6)e-05      1.35(±0.44)e-04
             n_s = 5       6.87(±6.52)e-03      1.97(±3.86)e-02

It is possible to observe in Table III that the performance of the ESNs deteriorates as additional inputs become available. This is probably due to the fact that, in view of the impulsive correlation profile of the time series in question, the additional inputs, which are not particularly useful even in ideal terms, interfere with the dynamical behavior of the reservoir, behaving in practice like "noise".

We can also notice in Table III that the ESN architectures characterized by the use of nonlinear readouts have achieved significantly better results. Indeed, the AMSE values obtained with these architectures, especially the R-ESN/ELM, are many orders of magnitude smaller than those obtained with R-ESN and ASE-ESN. Curiously, the performance of R-ESN/ELM deteriorates the most as n_s is increased. Finally, we would like to mention that similar conclusions can be drawn when distinct values of the prediction horizon are considered.

Now, we investigate the impact of the prediction horizon on the performance of each ESN when a single input sample is available (n_s = 1). Due to the profile of the autocorrelation function of the sole state variable, the expected behavior is that, as h is increased, the prediction error significantly deteriorates.

As expected, the prediction error associated with each ESN increases as the prediction horizon grows. However, we can also observe in Table IV that, while the R-ESN and ASE-ESN are not capable of adequately predicting the future state when h > 1, the ESN architectures using nonlinear readouts achieve satisfactory results in such cases, especially the R-ESN/ELM, which was the only successful architecture when h = 3.

Finally, as analyzed in Section IV-B2, we also combined both nonlinear readout proposals in a cascade, and examined the performance of this hybrid architecture in the prediction of the logistic map. Based on preliminary experiments, we adopted N = 5, N_h = 100 and N_pc = 10, and the following AMSE values have been achieved considering the test dataset: for h = 1, AMSE = 1.05 × 10^-12; for h = 2, AMSE = 2.05 × 10^-9; and, for h = 3, AMSE = 7.19 × 10^-5. As we can notice, by adding a Volterra filter structure to the output of the extreme learning machine, it was possible to improve the prediction accuracy when compared to R-ESN/ELM.

TABLE IV
AMSE VALUES OBTAINED WITH EACH ECHO STATE NETWORK CONSIDERING n_s = 1.

                                            AMSE
Network      Horizon       Training             Test
R-ESN        h = 1         4.88(±2.56)e-03      6.63(±3.08)e-03
             h = 2         1.07(±0.03)e-01      1.33(±0.04)e-01
             h = 3         1.16(±0.01)e-01      1.42(±0.03)e-01
ASE-ESN      h = 1         6.39(±2.32)e-03      9.97(±3.04)e-03
             h = 2         1.14(±0.01)e-01      1.43(±0.04)e-01
             h = 3         1.16(±0.01)e-01      1.42(±0.03)e-01
R-PVESN      h = 1         1.39(±4.88)e-07      4.16(±12.6)e-07
             h = 2         7.71(±24.0)e-05      2.39(±5.12)e-04
             h = 3         2.94(±3.00)e-02      3.34(±12.2)e-01
R-ESN/ELM    h = 1         4.84(±13.4)e-10      7.07(±19.8)e-10
             h = 2         6.74(±20.0)e-07      9.95(±29.0)e-07
             h = 3         3.18(±6.46)e-03      4.18(±8.33)e-03

To conclude the analysis in this scenario, Figure 3 displays the correct state sequence of the logistic map, as well as the predicted series offered by each ESN architecture. For the sake of brevity, we omit the predicted series related to the R-ESN/ELM, since the AMSE values associated with R-PVESN and R-ESN/ELM are quite similar. It is possible to visualize in Figure 3 the performance improvement acquired with the nonlinear readouts.

2) Lorenz System: Similarly to the analysis performed in the context of the logistic map, we examined the performance of each ESN in the prediction of the state of the Lorenz system. In all experiments, the following values for each network parameter have been adopted: N = 100 for both R-ESN and ASE-ESN, N = 20 and N_pc = 10 for R-PVESN, and, finally, N = 20 and N_h = 100 for R-ESN/ELM.

Firstly, we display in Table V the AMSE values associated with each ESN as the number of inputs is increased, considering a time horizon equal to h = 0.045 seconds.

Some interesting remarks can be drawn from Table V: (i) unlike the performance behavior verified for the logistic map, we can observe that using additional inputs may lead to a performance improvement, which follows from the slow correlation decay of the Lorenz time series: in fact, the best performance for almost all ESN architectures was reached with n_s = 4; (ii) the application of an ELM as the ESN readout did not offer any benefits when compared to the conventional linear combiner; (iii) on the other hand, the best results are associated with R-PVESN, although the difference with respect to the AMSE values obtained with R-ESN and ASE-ESN is not as pronounced as in the logistic map.

These observations attest that the use of a nonlinear readout can bring interesting improvements in performance, but they also indicate that not every nonlinear structure is going to succeed, depending on the application involved.

Finally, considering the best case in terms of the number of inputs (n_s = 4), we monitored the performance of each ESN as the time horizon is increased. In Table VI, the AMSE obtained with the ESNs is displayed, and the column associated with the prediction horizon gives the integer multiplying the basic time step h = 0.045.

Fig. 3. Predicted series ("original" vs. "predicted", state value over time instants 0-100) by each ESN architecture for the logistic map considering h = 1 and n_s = 1: (a) R-ESN; (b) ASE-ESN; (c) R-PVESN.

As can be observed, for all values of the time horizon, the best performance is obtained with the ESN architecture characterized by a Volterra filter structure as the readout. The extreme learning machine, in contrast, was not capable of improving the performance of the ESN in the prediction of the state of the Lorenz system. Nonetheless, based on the results obtained in the context of chaotic prediction, it is possible to assert that nonlinear readouts are interesting options to increase the processing capability of an echo state network.

TABLE V
AMSE VALUES OBTAINED WITH EACH ECHO STATE NETWORK CONSIDERING h = 0.045 s.

                                            AMSE
Network      Inputs        Training             Test
R-ESN        n_s = 1       1.02(±0.60)e-04      1.75(±1.38)e-04
             n_s = 3       2.30(±1.12)e-05      3.61(±1.69)e-05
             n_s = 4       1.87(±0.47)e-05      3.85(±1.20)e-05
             n_s = 6       2.19(±0.95)e-05      3.81(±1.47)e-05
ASE-ESN      n_s = 1       2.02(±0.32)e-03      3.98(±1.64)e-03
             n_s = 3       1.60(±0.45)e-04      2.69(±0.87)e-04
             n_s = 4       4.60(±0.84)e-05      1.02(±0.19)e-04
             n_s = 6       3.92(±1.22)e-05      8.09(±2.54)e-05
R-PVESN      n_s = 1       4.52(±14.0)e-05      3.23(±3.19)e-04
             n_s = 3       1.55(±2.32)e-06      5.87(±14.6)e-05
             n_s = 4       1.74(±1.59)e-07      3.63(±3.93)e-06
             n_s = 6       8.36(±7.46)e-07      9.93(±9.84)e-06
R-ESN/ELM    n_s = 1       7.28(±4.00)e-04      1.78(±1.13)e-03
             n_s = 3       2.72(±2.13)e-04      5.83(±4.83)e-04
             n_s = 4       5.43(±2.91)e-05      1.40(±0.71)e-04
             n_s = 6       2.47(±1.64)e-04      5.65(±4.28)e-04

TABLE VI
AMSE VALUES OBTAINED WITH EACH ECHO STATE NETWORK CONSIDERING n_s = 4.

                                            AMSE
Network      Horizon       Training             Test
R-ESN        1             1.22(±0.60)e-05      2.57(±1.65)e-05
             3             9.90(±2.70)e-05      2.26(±0.70)e-04
             5             3.61(±0.60)e-04      1.34(±0.34)e-03
             7             3.34(±1.06)e-03      1.96(±0.54)e-02
             9             3.15(±0.81)e-02      1.15(±0.21)e-01
ASE-ESN      1             2.68(±0.77)e-05      6.69(±2.79)e-05
             3             1.70(±0.48)e-04      4.52(±1.71)e-04
             5             6.17(±1.05)e-04      2.77(±1.19)e-03
             7             5.03(±1.33)e-03      2.82(±0.54)e-02
             9             4.08(±0.79)e-02      1.47(±0.30)e-01
R-PVESN      1             3.76(±3.04)e-07      5.39(±6.37)e-06
             3             7.07(±7.68)e-07      9.65(±13.1)e-06
             5             2.46(±1.70)e-06      2.88(±2.42)e-05
             7             2.33(±1.88)e-04      2.72(±1.15)e-03
             9             6.10(±3.14)e-03      6.53(±1.47)e-02
R-ESN/ELM    1             1.18(±0.85)e-04      3.03(±2.87)e-04
             3             4.32(±4.47)e-04      9.41(±10.3)e-04
             5             9.27(±7.27)e-04      2.17(±1.87)e-03
             7             3.07(±1.81)e-03      7.81(±4.22)e-03
             9             2.33(±0.96)e-02      1.04(±0.41)e-01

V. CONCLUSION

In this work, we have studied two recent echo state network architectures which are characterized by the use of nonlinear readouts: (i) the proposal of Butcher et al. [6], in which the linear combiner at the output is replaced by an extreme learning machine; and (ii) the proposal of Boccato et al. [7], which employs a Volterra filter structure along with a stage based on PCA. These architectures have been analyzed in the context of two relevant signal processing tasks: channel equalization and chaotic time series prediction.

Based on the obtained results, it is possible to affirm that the use of a more flexible structure in the output layer of ESNs, like Volterra filters and extreme learning machines, can significantly improve the performance of echo state networks. Furthermore, since both approaches are based on structures that are linear with respect to the free parameters, the simplicity of the ESN training process is duly preserved. Additionally, we have also addressed the possibility of combining both ESN readout structures in a cascade, which has provided promising results for both signal processing tasks considered here.

Motivated by this evidence, we intend to study the use of Volterra filters in the output layer of ELMs, since, in theory, the same benefits brought by the nonlinear structure in the context of echo state networks [7] could be attained in the feedforward case. Moreover, the use of alternative design methods for the reservoir, e.g., those based on the idea of dynamical systems, remains an interesting direction for future work.

ACKNOWLEDGMENT

This work was supported by FAPESP and CNPq.

REFERENCES

[1] S. Haykin, Adaptive Filter Theory, 3rd ed. NJ: Prentice Hall, 1996.
[2] H. Jaeger, "The echo state approach to analyzing and training recurrent neural networks," German National Research Center for Information Technology, Tech. Rep. 148, 2001.
[3] M. Lukosevicius and H. Jaeger, "Reservoir computing approaches to recurrent neural network training," Computer Science Review, vol. 3, pp. 127-149, 2009.
[4] D. Verstraeten, B. Schrauwen, M. D'Haene, and D. Stroobandt, "An experimental unification of reservoir computing methods," Neural Networks, vol. 20, no. 3, pp. 391-403, 2007.
[5] M. C. Ozturk, D. Xu, and J. C. Principe, "Analysis and design of echo state networks," Neural Computation, vol. 19, pp. 111-138, 2007.
[6] J. Butcher, D. Verstraeten, B. Schrauwen, C. Day, and P. Haycock, "Extending reservoir computing with random static projections: a hybrid between extreme learning and RC," in Proceedings of the 18th ESANN, 2010, pp. 303-308.
[7] L. Boccato, A. Lopes, R. Attux, and F. J. Von Zuben, "An echo state network architecture based on Volterra filtering and PCA with application to the channel equalization problem," in Proceedings of the Int. Joint Conf. on Neural Networks, IJCNN 2011, 2011, pp. 580-587.
[8] R. E. Bellman, Dynamic Programming. Princeton University Press, 1957.
[9] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proceedings of the Int. Joint Conf. on Neural Networks, IJCNN 2004, 2004, pp. 985-990.
[10] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, pp. 489-501, 2006.
[11] T. Adali, "Why a nonlinear solution for a linear problem?" in Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, 1999, pp. 157-165.
[12] S. H. Strogatz, Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry and Engineering. Westview Press, 2000.
[13] H. D. I. Abarbanel, Analysis of Observed Chaotic Data. Springer, 1997.
[14] H. Kantz and T. Schreiber, Nonlinear Time Series Analysis, 2nd ed. Cambridge University Press, 2004.
[15] J. D. Farmer and J. J. Sidorowich, "Predicting chaotic time series," Physical Review Letters, vol. 59, pp. 845-848, 1987.
[16] D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction. Wiley-Interscience, 2001.
[17] J. Montalvao, B. Dorizzi, and J. C. M. Mota, "Some theoretical limits of efficiency of linear and nonlinear equalizers," Journal of Communications and Information Systems, vol. 14, pp. 85-92, 1999.
[18] R. Ferrari, R. Suyama, R. Lopes, R. Attux, and J. Romano, "An optimal MMSE fuzzy predictor for SISO and MIMO blind equalization," in First IAPR Workshop on Cognitive Information Processing, 2008, pp. 86-91.