Fuzzy Neural Network Technique for System State Forecasting





Dezhi Li, Member, IEEE, Wilson Wang, Senior Member, IEEE, and Fathy Ismail

Abstract—In many system state forecasting applications, the prediction is performed based on multiple datasets, each corresponding to a distinct system condition. The traditional methods dealing with multiple datasets (e.g., vector autoregressive moving average models and neural networks) have some shortcomings, such as limited modeling capability and opaque reasoning operations. To tackle these problems, a novel fuzzy neural network (FNN) is proposed in this paper to effectively extract information from multiple datasets, so as to improve forecasting accuracy. The proposed predictor consists of both autoregressive (AR) node modeling and nonlinear node modeling; AR models/nodes are used to capture the linear correlation of the datasets, and the nonlinear correlation of the datasets is modeled with nonlinear neuron nodes. A novel particle swarm technique [i.e., the Laplace particle swarm (LPS) method] is proposed to facilitate parameter estimation of the predictor and improve modeling accuracy. The effectiveness of the developed FNN predictor and the associated LPS method is verified by a series of tests related to Mackey–Glass data forecast, exchange rate data prediction, and gear system prognosis. Test results show that the developed FNN predictor and the LPS method can capture the dynamics of multiple datasets effectively and track system characteristics accurately.

Index Terms—Fuzzy neural predictors, machinery condition prognosis, multiple dimensional datasets, particle swarm optimization.

I. Introduction

SYSTEM STATE prognosis is an important research and development area, which aims to forecast the future states of a dynamic system based on its past observations. Several types of forecasting methods have been proposed in the literature. The classical approaches for time series forecasting are mainly based on analytical models, such as autoregressive (AR) models, autoregressive moving average (ARMA) models [1], and vector autoregressive moving average (VARMA) models [2]. These classical models describe the underlying relationship among datasets so as to extrapolate future states of a dynamic system, and have been used in some forecasting applications, such as river flow [3], wind speed [4], electricity demand load [5], [6], and economic analysis [7], [8]. However, these classical approaches have some limitations in applications. For example, although the VARMA model can cope with multivariate time series generated simultaneously (e.g., multiple measurements of an internal combustion engine using vibration, pressure, thermal, and acoustic sensors), it is difficult to apply a VARMA model to multiple datasets generated one after another.

Manuscript received June 30, 2012; revised January 7, 2013; accepted April 8, 2013. Date of publication June 3, 2013; date of current version September 11, 2013. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and eMech Systems Inc. This paper was recommended by Associate Editor T.-H. S. Li.

D. Li and F. Ismail are with the Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]).

W. Wang is with the Department of Mechanical Engineering, Lakehead University, Thunder Bay, ON P7B 5E1, Canada.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2013.2259229

Although a group of AR or ARMA models can deal with multiple datasets that are generated separately, with each individual AR/ARMA model tackling one dataset, the AR/ARMA model can predict only the linear patterns of a dataset, not its nonlinear patterns. On the other hand, nonlinear patterns (e.g., impulses and transients) are usually associated with machinery faults, which are more important in applications such as diagnostics and prognostics. The AR model estimates future states of a dynamic system based only on its past observations, while an ARMA model deploys both past observations and past innovations for system state forecasting. Since the linear pattern of a training dataset may differ from that of the predicted dataset, applying AR/ARMA models (fitted to the training datasets) to the predicted dataset may cause large estimation errors (innovations). Consequently, a fitted ARMA model, which feeds these past innovations back into its forecasts, may deliver poor forecasting performance. From this aspect, an AR model is superior to an ARMA model in capturing the deterministic pattern of an individual dataset in multiple-dataset applications, where each dataset corresponds to a different system condition.

The alternative approach to dealing with multiple datasets from different sources (e.g., sensors) is the use of soft-computing tools, such as neural networks (NNs) [9], [10], neuro-fuzzy systems [23], [24], and evolutionary computation [25], [26]. However, an NN predictor has an opaque reasoning structure, and its convergence is usually slow and not guaranteed [11], [12]. Consequently, a black-box NN-based predictor may not generate accurate forecasting results when modeling complex dynamic systems with both linear and nonlinear patterns.

To tackle these challenges, the objective of this paper is to develop a new approach, namely, a fuzzy neural network (FNN) technique, for the prognosis of complex dynamic systems, especially those with multiple training datasets. The linear and nonlinear characteristics of the predicted datasets will be modeled separately in forecasting reasoning. It is new in the following aspects:

1) The FNN predictor applies both linear AR nodes and nonlinear nodes to model different system characteristics. Each AR node characterizes the linear correlation of a dataset to improve modeling accuracy; nonlinear nodes are used to model nonlinear properties of the training datasets.

2) A novel particle swarm (PS) technique [i.e., the Laplace particle swarm (LPS) method] is proposed to maximize the log-likelihood function of each AR node and optimize the system parameters.

3) The developed FNN predictor is implemented for machinery system prognosis.

The remainder of this paper is organized as follows. The proposed FNN predictor is discussed in Section II. Section III presents the theoretical foundation of the proposed LPS optimization. In Section IV, the effectiveness of the proposed FNN predictor and training technique is first examined by simulation tests, and then they are implemented for gear system prognosis.

II. Fuzzy Neural Network Predictor

As stated in Section I, although both AR and ARMA models can catch the linear (but not nonlinear) patterns of multiple datasets, an AR model is superior to an ARMA model in the sense that large past innovations (estimation errors) in an ARMA model may deteriorate its forecasting performance. Although NNs can be pretrained and used to track system characteristics, they have an opaque reasoning architecture and cannot effectively represent the features of multiple training datasets. To properly tackle these modeling problems, a novel FNN predictor is proposed in this section to provide a more efficient modeling strategy for complex systems with multiple training datasets. Each AR model is implemented as a linear node to capture the linear pattern of a dataset, whereas nonlinear features are modeled by nonlinear nodes. Fig. 1 describes the architecture of the developed FNN predictor (not for training), which is a five-layer feed-forward network.

Layer 1 is the input layer. The input $x_{t-i}$, $i = 1, 2, \ldots, p_{\max}$, is the value of the target dataset for prediction at time lag $i$; $p_{\max}$ is the number of input data used for prediction.

Layer 2 is the AR node layer, which models the linear patterns of the training datasets. Each AR node in this layer contains a linear AR model expressed as

$$L_i = \theta_{1,i}x_{t-1} + \theta_{2,i}x_{t-2} + \cdots + \theta_{p_i,i}x_{t-p_i} + \alpha_{i,t} \tag{1}$$

where $\theta_{j,i}$ are linear parameters, $j = 1, 2, \ldots, p_i$; $i = 1, 2, \ldots, m$; $m$ is the number of AR nodes; $\alpha_{i,t}$ is the estimation error of the $i$th AR node at time $t$; and $p_i$ is the model order of the $i$th AR node. The number of AR nodes depends on the number of training datasets, which corresponds to the number of information sources (e.g., sensors).

Fig. 1. Network architecture of the proposed FNN predictor.

The number of inputs to each AR node can take small integer values (e.g., 2 or 3) to simplify the predictor structure and reduce the computational burden of the training process. To guarantee sufficient input data for all the AR nodes in layer 2, the number of input data from the target dataset is given as $p_{\max} = \max\{p_1, p_2, \ldots, p_m\}$. $\omega_i$, $i = 1, 2, \ldots, m$, is the synaptic weight between the $i$th AR node and the output node; $\omega_i$ is a user-defined coefficient that determines the contribution of the $i$th AR node's estimation output to the final output. All the AR nodes, associated with the weights $\omega_i$, compose the linear node modeling; the linear node modeling output $Y_L$ is given as

$$Y_L = \frac{\sum_{i=1}^{m} \omega_i (L_i - \alpha_{i,t})}{\sum_{i=1}^{m} \omega_i}. \tag{2}$$

Then the linear estimation error at time $t$ can be estimated as $\varphi_t = x_t - Y_L$.

The nonlinear nodes, associated with the weights $\omega_j$, can be considered nonlinear node modeling in the form of a radial basis function (RBF). It aims to forecast and compensate for the linear estimation error $\varphi_t$ based on the past estimation errors $\{\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n}\}$, which are the inputs to layer 3; $n$ is the number of nodes in layer 3.

Layer 4 is the nonlinear node layer, which models the nonlinear patterns of the training datasets. A Gaussian function is used as the membership function (MF) in each node in this layer because it is simple to implement and can properly characterize the local information of a nonlinear process; the node output can be formulated as

$$N_j = \exp\!\left[-\frac{1}{2}\left(\frac{\Phi - \upsilon_j}{\sigma_j}\right)^2\right] \tag{3}$$

where $\Phi = \{\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n}\}$ represents the linear estimation error vector, and $\upsilon_j$ and $\sigma_j$ are the respective center and spread of the Gaussian function $N_j$.

Fig. 2. Flowchart illustrating the training and forecasting processes of the proposed FNN predictor.

The output of the nonlinear node modeling, $Y_N$, is determined by

$$Y_N = \frac{\sum_{j=1}^{n} \omega_j N_j}{\sum_{j=1}^{n} \omega_j} \tag{4}$$

where $\omega_j$, $j = 1, 2, \ldots, n$, is the synaptic weight between a nonlinear node and the output node.

Layer 5 is the output layer. The FNN output $Y$ is formulated as $Y = Y_L + Y_N$.

Layers 3 and 4 together can be considered an NN-based nonlinear approximator with inputs $\{\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n}\}$ and output $Y_N$.
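To make the data flow through layers 1 to 5 concrete, here is a minimal sketch of the forecasting pass (1)-(4) in hypothetical Python/NumPy; the function name `fnn_forward` and the argument layout are illustrative, not from the paper. At forecast time the innovation $\alpha_{i,t}$ in (1) is unknown, so each AR node contributes only its deterministic part, which is exactly the term $L_i - \alpha_{i,t}$ in (2). The paper indexes both the error lags and the nonlinear nodes by $n$; the sketch keeps them as separate dimensions of the `centers` array, and interprets the vector Gaussian in (3) as a product of per-dimension Gaussians.

```python
import numpy as np

def fnn_forward(x_lags, err_lags, theta, w_ar, centers, spreads, w_rbf):
    """One-step FNN output Y = Y_L + Y_N per (1)-(4).

    x_lags   : [x_{t-1}, ..., x_{t-p_max}], past values of the target dataset
    err_lags : [phi_{t-1}, ..., phi_{t-n}], past linear estimation errors
    theta    : list of m coefficient arrays; theta[i] has length p_i
    w_ar     : m user-defined AR-node weights omega_i
    centers, spreads : (n_nodes, n) Gaussian centers and spreads
    w_rbf    : n_nodes nonlinear-node weights omega_j
    """
    # Linear node modeling: deterministic part of each AR node (1),
    # i.e., the term L_i - alpha_{i,t} entering the weighted average (2).
    L = np.array([th @ x_lags[:len(th)] for th in theta])
    Y_L = np.sum(w_ar * L) / np.sum(w_ar)                    # (2)

    # Nonlinear node modeling on the error vector Phi (3)
    d2 = np.sum(((err_lags - centers) / spreads) ** 2, axis=1)
    N = np.exp(-0.5 * d2)
    Y_N = np.sum(w_rbf * N) / np.sum(w_rbf)                  # (4)

    return Y_L + Y_N                                         # layer-5 output
```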

The parameters of the developed FNN predictor should be properly trained in order to achieve an optimal mapping between the input spaces and the output space. The parameters are optimized by a novel training technique discussed in Section III.

The prediction accuracy depends on the order of the AR models and the number of nonlinear nodes. Therefore, these parameters need to be carefully selected by expertise or by statistical tools (e.g., the cross-validation method [27]).

A flowchart of the training process and forecasting operation of the FNN is illustrated in Fig. 2. Since each training dataset comes from a specific information source (e.g., a different sensor) and may possess distinct linear patterns, several AR models (nodes) are used to capture the linear characteristics of the different datasets. The characteristics of the residuals from different AR nodes, however, can be modeled jointly by the nonlinear nodes in the NN, so duplicated features of different residuals are automatically accommodated by one nonlinear node through the training process.

In general, a dataset is composed of linear patterns and nonlinear patterns, and it may be difficult to accurately predict both types of patterns using a single technique. The classical AR predictor is able to capture linear patterns of a given dataset; however, it cannot effectively catch nonlinear patterns of the data. On the other hand, an NN-based approximator (e.g., RBF) can model nonlinear patterns through training; however, it is not an efficient predictor for linear patterns and requires more computation time to achieve the desired accuracy. The proposed FNN predictor aims to integrate the advantages of these two classical methods. In the FNN, the training data are processed in two steps: AR nodes are used to model the linear patterns of the given dataset, while nonlinear patterns are characterized by an NN model. Both the linear process and the nonlinear process can then be modeled with appropriate approaches, so the forecasting accuracy can be improved.

III. Laplace Particle Swarm Training Technique

Before the discussion of the proposed LPS-based maximum likelihood estimation (MLE), a brief description of the classical PS method is given.

A. Classical Particle Swarm Method

The classical PS method optimizes the objective function by simultaneously maintaining several candidate solutions in the search space. The quality of each candidate solution is represented by a fitness grade: the higher the fitness grade, the better the candidate solution optimizes the objective function. A particle starts with one candidate solution, which evolves according to certain update rules at each time step. The particle also records the highest fitness grade it has achieved thus far; the candidate solution corresponding to this highest fitness grade is referred to as the local best candidate solution. Among all of the local best candidate solutions, the one with the highest fitness is called the global best candidate solution [20]. In each iteration, the increment of the candidate solution of particle $i$ is updated by

$$v_i(t+1) = \alpha v_i(t) + \beta_1\gamma_1[\hat{x}_i(t) - x_i(t)] + \beta_2\gamma_2[\hat{x}(t) - x_i(t)] \tag{5}$$

where $v_i(t)$ is the velocity of particle $i$ at time $t$ and $x_i(t)$ is the position of particle $i$ (candidate solution) at time $t$. $\alpha$, $\beta_1$, and $\beta_2$ are user-defined parameters ($0 \le \alpha \le 1.2$, $0 \le \beta_1 \le 2$, $0 \le \beta_2 \le 2$); $\gamma_1$ and $\gamma_2$ are random parameters over $[0, 1]$. $\hat{x}_i(t)$ is the local best candidate solution of particle $i$ at time $t$, and $\hat{x}(t)$ is the global best candidate solution at time $t$. Consequently, each particle will be updated by

$$x_i(t+1) = x_i(t) + v_i(t+1). \tag{6}$$

Lower and upper boundaries of the particle, $x_{\min}$ and $x_{\max}$, are set to prevent a particle from moving beyond the search space. If a particle moves beyond the boundaries, its velocity will be updated by

$$v_i(t) = \tau v_i(t) \tag{7}$$

where $\tau \in [0, 1]$ is a coefficient used to adjust velocity [20].

To conduct the PS, the fitness of each particle is evaluated first, and then the local best candidate solutions and the global best candidate solution are updated. Finally, the velocity and position of each particle are updated using (5) and (6), respectively [20].

Although the classical PS has some merits, such as low computational complexity and easy programming [13], [14], it has several limitations. For instance, it cannot search the parameter space comprehensively because the search direction of the $i$th particle depends only on the local best candidate solution $\hat{x}_i(t)$ and the global best candidate solution $\hat{x}(t)$, and that direction may not cover other search directions.
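For reference in the comparisons below, here is a minimal sketch of one classical PS iteration per (5)-(7), in hypothetical Python; the parameter defaults, the fitness-maximization convention, and the clipping of out-of-range positions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def ps_step(x, v, x_best, fit_best, fitness, alpha=0.7, b1=1.5, b2=1.5,
            x_min=-10.0, x_max=10.0, tau=0.5):
    """One iteration of the classical PS update (5)-(7).

    x, v     : (n_particles, dim) positions and velocities
    x_best   : (n_particles, dim) local best candidate solutions
    fit_best : (n_particles,) fitness grades of the local bests
    fitness  : objective function (higher is better) taking one position
    """
    g_best = x_best[np.argmax(fit_best)]                 # global best
    r1, r2 = rng.random(x.shape), rng.random(x.shape)    # gamma_1, gamma_2
    v = alpha * v + b1 * r1 * (x_best - x) + b2 * r2 * (g_best - x)  # (5)
    x_new = x + v                                        # (6)
    out = (x_new < x_min) | (x_new > x_max)
    v[out] *= tau                    # damp velocity components that left the box (7)
    x = np.clip(x_new, x_min, x_max)                     # assumed boundary handling
    f = np.apply_along_axis(fitness, 1, x)
    improved = f > fit_best                              # refresh local bests
    x_best[improved], fit_best[improved] = x[improved], f[improved]
    return x, v, x_best, fit_best
```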

B. Proposed Laplace Particle Swarm Technique

To overcome the aforementioned problems of the classical PS method, the proposed LPS diversifies the search directions so as to improve training efficiency. To cover the entire search space, the search direction of the $i$th particle is guided by the particle's local best candidate solution $\hat{x}_i(t)$, the global best candidate solution $\hat{x}(t)$, and a randomly selected local best candidate solution $\hat{x}_j(t)$, where $i \neq j$. The search step length is adaptively determined by a random number drawn from the following Laplace distribution:

$$F(x) = \begin{cases} \dfrac{1}{2}\exp\!\left(\dfrac{x-\eta}{\omega}\right), & \text{if } x \le \eta \\[6pt] 1 - \dfrac{1}{2}\exp\!\left(\dfrac{\eta-x}{\omega}\right), & \text{if } x > \eta \end{cases} \tag{8}$$

where $\eta$ and $\omega$ denote the center and slope of the Laplace cumulative distribution function, respectively.

Let $u$ be a uniformly distributed random number over $[0, 1]$ and let $\eta = 0$. The random number $\lambda$ drawn from the Laplace distribution is then

$$\lambda = \begin{cases} \omega \ln(2u), & \text{if } u < 0.5 \\ -\omega \ln(2-2u), & \text{if } u \ge 0.5. \end{cases} \tag{9}$$

To guarantee convergence of the algorithm, a constraint function $\xi$ is introduced to mitigate the impact of the Laplace coefficient $\lambda$:

$$\xi = \left(1 - \frac{I_c}{I_t}\right)^{b} \tag{10}$$

where $I_c$ and $I_t$ are the indices of the current iteration and the total number of iterations, respectively, and $b \in [0.5, 5]$ is the strength factor.

Given a particle $x_i(t)$ and a randomly selected local best candidate solution $\hat{x}_j(t)$ ($i \neq j$), the velocity update expression of the LPS is

$$v_i(t+1) = \lambda\xi\,[\hat{x}_j(t) - x_i(t)] + \beta_1\gamma_1[\hat{x}_i(t) - x_i(t)] + \beta_2\gamma_2[\hat{x}(t) - x_i(t)] \tag{11}$$

where $x_i(t)$ is again updated using (6).
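A matching sketch of the LPS update for one particle (hypothetical Python; `lps_velocity` and `laplace_step` are illustrative names, and the parameter defaults are assumptions). Inverse-CDF sampling realizes (9), and the constraint function (10) shrinks the Laplace term as iterations progress, which is what pushes the search toward convergence.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_step(omega=1.0):
    """Draw a Laplace(0, omega) step length via inverse-CDF sampling (9)."""
    u = rng.random()
    return omega * np.log(2 * u) if u < 0.5 else -omega * np.log(2 - 2 * u)

def lps_velocity(x_i, v_i, lbest_i, lbest_j, gbest, it, total_it,
                 b=2.0, b1=1.5, b2=1.5, omega=1.0):
    """LPS velocity update (11) for one particle.

    lbest_i : particle i's local best; lbest_j : a random other local best;
    gbest   : global best; it, total_it : I_c and I_t in (10).
    """
    xi = (1.0 - it / total_it) ** b                  # convergence constraint (10)
    lam = laplace_step(omega)                        # Laplace step length (9)
    r1, r2 = rng.random(x_i.shape), rng.random(x_i.shape)
    return (lam * xi * (lbest_j - x_i)               # random-neighbor term
            + b1 * r1 * (lbest_i - x_i)              # cognitive term
            + b2 * r2 * (gbest - x_i))               # social term (11)
```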

C. Derivation of the Log-Likelihood Function

To conduct the MLE, a log-likelihood function should be established properly. In this subsection, one of the classic methods to derive the log-likelihood function of an AR model (node) is discussed, based on the Kalman filter [15]. The log-likelihood function can be regarded as the objective function of the PS or LPS algorithm; the purpose of using the PS or LPS algorithm is to maximize this objective function so as to derive the optimized system parameters of the AR node. To formulate the log-likelihood function with respect to the $i$th AR node as represented in (1), (1) can be transformed into the state-space representation

$$x_{t+1} = A x_t + R \varepsilon_{t+1} \tag{12}$$

$$y_t = Z' x_t \tag{13}$$

where $x_t$ is a $p_i \times 1$ state vector, $A$ is a $p_i \times p_i$ matrix, and $R$ and $Z$ are $p_i \times 1$ vectors defined as

$$A = \begin{bmatrix} \theta_{1,i} & 1 & 0 & 0 & \cdots & 0 \\ \theta_{2,i} & 0 & 1 & 0 & \cdots & 0 \\ \theta_{3,i} & 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & & \ddots & \vdots \\ \theta_{p_i-1,i} & 0 & 0 & 0 & \cdots & 1 \\ \theta_{p_i,i} & 0 & 0 & 0 & \cdots & 0 \end{bmatrix}, \quad R = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad Z = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

The Kalman filter is applied in this case to construct the likelihood function of the system and recursively compute $x_{t+1|t}$, where $x_{t+1|t} = E_t[x_{t+1} \mid y_1, y_2, \ldots, y_t; x_0]$ is the estimated value of $x_{t+1}$ based on the historical observations $(y_1, y_2, \ldots, y_t)$ (observations in the $i$th dataset) and the initial condition $x_0$. The mean squared error matrix is derived as

$$P_{t+1|t} = E[(x_{t+1} - x_{t+1|t})(x_{t+1} - x_{t+1|t})']. \tag{14}$$

The innovations are computed by

$$\alpha_{i,t} = y_t - E[y_t \mid y_1, \ldots, y_{t-1}; x_0] = y_t - Z' x_{t|t-1}. \tag{15}$$

The innovation variance $\rho_t$ is computed by

$$\rho_t = E[(y_t - Z' x_{t|t-1})(y_t - Z' x_{t|t-1})'] = Z' P_{t|t-1} Z. \tag{16}$$

The mean squared error matrix $P_{t+1|t}$ is updated by

$$P_{t+1|t} = A\left(P_{t|t-1} - \frac{P_{t|t-1} Z Z' P_{t|t-1}}{\rho_t}\right) A' + R R' \sigma^2 \tag{17}$$

where $\sigma^2$ is the variance of the noise term $\varepsilon_{t+1}$.

Given the initial conditions $P_{1|0} = E(x_t x_t')$ and $x_0 = 0$, the log-likelihood function of the observation vector $y_1, y_2, \ldots, y_M$ is

$$l = -\left[ M \ln \sum_{t=1}^{M} \frac{\alpha_{i,t}^2}{\rho_t} + \sum_{t=1}^{M} \ln \rho_t \right] \tag{18}$$

where $\alpha_{i,t}$ is the innovation at time instant $t$ and $M$ is the length of the $i$th dataset.

Details of the derivation of the log-likelihood function can be found in [16] and [17]. Once the log-likelihood function is obtained for a particular AR node, the proposed LPS is used to minimize the negative log-likelihood function, so as to search for the globally optimal parameters $\theta_{j,i}$ of this AR node.

D. The Hybrid Training Strategy

Based on investigations by the authors' research group, a hybrid training strategy can reduce the search dimension compared with a single-pass training approach [19]; it can prevent possible trapping in local optima and improve training convergence. To optimize the parameters of the developed FNN predictor, a hybrid training method is adopted in this paper to optimize the system parameters separately: the proposed LPS-MLE is used to optimize the parameters of the AR nodes (layer 2 in Fig. 1), while the classical gradient descent (GD) algorithm is employed to update the parameters of the nonlinear nodes and the weights $\omega_j$.

When using the proposed LPS-MLE to train the linear node parameters, one AR node is fitted to each training dataset to derive a series of linear estimation errors for the $i$th dataset, $\alpha_{i,t}$, $t = 1, 2, \ldots, M$, by (15), where $M$ is the length of the $i$th dataset. After fitting $m$ AR models to $m$ datasets, $m$ sets of estimation errors $\alpha_{1,t}, \alpha_{2,t}, \ldots, \alpha_{m,t}$ can be determined, which are fed into the nonlinear node modeling system one after another. In each series $\alpha_{i,t}$, the lagged values $\{\alpha_{i,t-1}, \alpha_{i,t-2}, \ldots, \alpha_{i,t-n}\}$ are the inputs to the nonlinear node modeling system for predicting the linear estimation error $\alpha_{i,t}$ at time instant $t$, $t = 1, 2, \ldots, M$.

The hybrid training process is summarized as follows.

1) The nonlinear node parameters and weights $\omega_j$, $j = 1, 2, \ldots, n$, are initialized over $(0, 1)$.

2) The parameters $\theta_{j,i}$, $j = 1, 2, \ldots, p_i$, of the $i$th AR node are estimated simultaneously using the LPS-MLE, and the linear estimation errors $\alpha_{i,t}$ are computed using (15). When the maximum number of LPS generations is reached, the estimation of $\theta_{j,i}$ in the $i$th AR node and the collection of $\alpha_{i,t}$ terminate. Note that the parameters $\theta_{j,i}$ of different AR nodes are optimized separately.

3) After the training of the AR node parameters, the $m$ sets of estimation errors $\{\alpha_{1,t}\}, \{\alpha_{2,t}\}, \ldots, \{\alpha_{m,t}\}$ are fed to the nonlinear node modeling system. For the $i$th set of estimation errors $\alpha_{i,t}$, $t = 1, 2, \ldots, M$, the inputs take the form $\{\alpha_{i,t-1}, \alpha_{i,t-2}, \ldots, \alpha_{i,t-n}\}$, with $\alpha_{i,t}$ as the desired output. The nonlinear node parameters and weight factors ($\omega_j$, $j = 1, 2, \ldots, n$) are estimated using the classical GD algorithm [21]. When the maximum number of GD iterations is reached, the training process stops.

After the training process, the target dataset $\{x_{t-1}, x_{t-2}, \ldots, x_{t-p_{\max}}\}$ is input to the FNN predictor (as shown in Fig. 1) to forecast the future states of the system.
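Here is a sketch of how the two training stages connect (hypothetical Python): `lps_optimize` and `gd_train` are placeholders for the LPS-MLE routine of Sections III-B and III-C and a standard GD trainer; they are assumptions, not the authors' code. For brevity the residuals are computed directly from the fitted AR equation (1) rather than via the Kalman recursion.

```python
import numpy as np

def hybrid_train(datasets, n_lags, lps_optimize, gd_train):
    """Hybrid training: LPS-MLE per AR node, then GD on the nonlinear nodes.

    datasets     : list of m training series
    n_lags       : n, number of error lags fed to the nonlinear nodes
    lps_optimize : returns AR coefficients maximizing the log-likelihood (18)
    gd_train     : gradient-descent trainer for the RBF-style nodes
    """
    thetas, error_sets = [], []
    for y in datasets:                        # stage 1: one AR node per dataset
        theta = lps_optimize(y)               # LPS-MLE parameter estimate
        thetas.append(theta)
        p = len(theta)
        resid = np.array([y[t] - theta @ y[t - p:t][::-1]
                          for t in range(p, len(y))])   # estimation errors
        error_sets.append(resid)

    # stage 2: sliding windows of n past errors predict the current error
    X, d = [], []
    for resid in error_sets:                  # datasets fed one after another
        for t in range(n_lags, len(resid)):
            X.append(resid[t - n_lags:t][::-1])   # {a_{t-1}, ..., a_{t-n}}
            d.append(resid[t])                    # desired output a_t
    rbf_params = gd_train(np.array(X), np.array(d))
    return thetas, rbf_params
```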

IV. Performance Evaluation and Applications

A. Overview

The effectiveness of the proposed FNN predictor and the LPS training technique is verified in this section, first by simulation; then the FNN predictor is implemented for gear system prognosis.

Fig. 3. Flowchart of the hybrid training process.

Fig. 4. Network architecture of the RBF predictor.

As stated in the Introduction, the current methods for dealing with multiple datasets are mainly based on artificial NNs. Since the nonlinear node modeling in the FNN can be considered an implementation of the RBF, the RBF predictor is used here to compare forecasting accuracy with the proposed FNN. The architecture of the RBF is shown in Fig. 4. It is a three-layer feedforward network. The nodes in layer 1 pass the input data $x_{t-j}$ to the next layer, where $j$ is the time lag ($j = 1, 2, \ldots, l$) and $l$ is the number of nodes in layer 1. Gaussian MFs are used in layer 2 as activation functions:

$$O_k = \exp\!\left[-\frac{1}{2}\left(\frac{x - c_k}{a_k}\right)^2\right] \tag{19}$$

where $x$ is the input vector of $x_{t-j}$, $j = 1, 2, \ldots, l$; $c_k$ is the center of the $k$th Gaussian function; $a_k$ is the spread of the $k$th Gaussian function; $k = 1, 2, \ldots, q$; and $q$ is the number of nodes in layer 2. Layer 3 is the output layer (one node in this case) with the output

$$Y_{\mathrm{RBF}} = \frac{\sum_{k=1}^{q} w_k O_k}{\sum_{k=1}^{q} w_k} \tag{20}$$

where $w_k$ is the synaptic weight between layers 2 and 3. The RBF is trained by the commonly used GD algorithm.

To make a comparison, a widely used adaptive neuro-fuzzy inference system (ANFIS) predictor is also applied here to Mackey–Glass data forecasting [21]. The first layer of the ANFIS predictor is the input layer, which has three input nodes $x_{t-j}$, $j = 1, 2, 3$. The second layer employs two sigmoidal MFs and one Gaussian MF for each input. The third layer is the operation layer, where each node conducts a distinct linear combination of the three inputs. The fourth layer is the output layer, which uses the centroid method to generate one output. The node parameters in layer 2 are trained by the GD method; the node parameters in layer 3 and the weights between layers 3 and 4 are trained by recursive least-squares estimation [21].

B. Simulation-1: Mackey–Glass Data Forecasting

First, the effectiveness of the proposed techniques is examined using benchmark datasets from the Mackey–Glass equation given in (21). Datasets from the Mackey–Glass equation are chaotic, nonperiodic, and nonconvergent, and are commonly used in the time series forecasting literature to evaluate the performance of a predictor [22]:

$$\frac{dx(t)}{dt} = \frac{0.2\,x(t-\tau)}{1 + x^{10}(t-\tau)} - 0.1\,x(t). \tag{21}$$

In this simulation test, three datasets generated from (21) with different initial conditions, x(0) = 1.3, x(0) = 1.5, and x(0) = 1.7, are used for training, and a dataset with initial condition x(0) = 1.9 is used for testing. Other conditions for these datasets include τ = 30, dt = 1, and x(t) = 0 for t < 0. Each dataset contains 800 data samples. The comparison test covers four aspects.

1) The performance of the developed FNN predictor will be compared with that of the classical RBF predictor.

2) The performance of the developed FNN predictor will also be compared with that of the ANFIS predictor.

3) The performance of the FNN predictor based on a single training dataset (i.e., one AR node in layer 2 in Fig. 1) will be compared with that of the FNN using multiple training datasets (i.e., multiple AR nodes in layer 2).

4) The efficiency of the proposed LPS-MLE training technique will be examined. The FNN predictor trained by the hybrid training of PS-MLE and GD is denoted by FNN-PS, while the FNN predictor trained by LPS-MLE and GD is denoted by FNN-LPS.

Two types of FNN structures are employed, represented by the respective number of nodes from layers 1 to 5: 3-3-3-3-1 (three AR nodes for three training datasets) and 1-1-3-3-1 (one AR node for a single training dataset). $p_i = 3$ and $p_{\max} = 3$ are set in both cases.

Fig. 5 demonstrates the one-step-ahead forecasting results by the RBF predictor (with 3-5-1 nodes from layers 1 to 3) trained by GD, and by the FNN-LPS and FNN-PS with three training datasets (i.e., three AR nodes). Table I summarizes the training and prediction errors of the related predictors. It can be seen that the FNN predictors [Fig. 5(b) and (c)] outperform the RBF predictor [Fig. 5(a)] due to their more efficient modeling strategy; the RBF generates the largest training error and prediction error.

Fig. 5. Forecasting results of the Mackey–Glass data. Blue solid line: real data to estimate. Red dotted line: prediction performance of different schemes by (a) RBF, (b) FNN-PS, and (c) FNN-LPS.

TABLE I
Performance Comparison of the Related Predictors in Terms of the Mean Square Errors Based on Mackey–Glass Data

Predictors (# of AR nodes) | Training error | Prediction error (10^-4)
---------------------------|----------------|-------------------------
RBF                        | 0.0345         | 212
ANFIS                      | 0.0056         | 23
FNN-PS (1)                 | 0.0034         | 7.5037
FNN-LPS (1)                | 0.0016         | 5.7655
FNN-PS (3)                 | 0.0182         | 5.5880
FNN-LPS (3)                | 0.0053         | 1.4898

The FNN-PS and FNN-LPS predictors trained by three datasets (i.e., three AR nodes) outperform the corresponding predictors trained by only one training dataset (i.e., one AR node); if more training datasets are used properly, the forecasting accuracy can be improved significantly.

On the other hand, as stated in Section III, the parameters of each AR node are optimized by the MLE, so the search efficiency of the PS and LPS algorithms can be compared by minimizing the negative log-likelihood function [i.e., $-l$ in (18)] with the same number of iterations (100 in this case). Fig. 6 shows the performance of the PS and the LPS in searching for the global minima of the negative log-likelihood functions related to the three Mackey–Glass training datasets (i.e., with three AR nodes). It is seen that the LPS provides better performance than the PS due to its more efficient search mechanism.


Fig. 6. Performance of PS and LPS searching for the global minimum of the negative log-likelihood functions of three AR node membership functions corresponding to three Mackey–Glass training datasets with different initial conditions: (a) x(0) = 1.3; (b) x(0) = 1.5; and (c) x(0) = 1.7. The blue solid line is the performance of PS and the red dotted line is the performance of LPS.

TABLE II
Performance Comparison of the Related Predictors in Terms of the Mean Square Errors Based on Exchange Rate Data

Predictors | Training error | Prediction error (10^-5)
-----------|----------------|-------------------------
RBF        | 0.0038         | 7.5531
FNN-PS     | 0.0017         | 1.8151
FNN-LPS    | 0.0013         | 0.3835

It can be concluded that the FNN predictor is able to capture the characteristics of a dynamic system (the Mackey–Glass data in this case) and outperforms RBF-based predictors, and that the proposed LPS technique is more efficient than the PS algorithm in global searching.

The convergence of the nonlinear modeling training, in terms of mean square errors, for FNN-PS(1), FNN-LPS(1), FNN-PS(3), and FNN-LPS(3) is shown in Fig. 7; the quantity in parentheses denotes the number of AR nodes (or training datasets) used in the corresponding predictor. In this case, there are three Gaussian MF nodes in layer 4 (Fig. 1); each Gaussian MF has three dimensions because there are three inputs (layer 3 in Fig. 1). The Gaussian MFs of one of the three nonlinear nodes are shown in Fig. 8.

Fig. 7. Convergence of the training in terms of mean square errors (MSE) for Mackey–Glass data forecasting with respect to (a) one training dataset and (b) three training datasets. The blue dotted line is the error convergence of the FNN-PS predictor and the red solid line is the error convergence of the FNN-LPS predictor.

Fig. 8. Examples of the Gaussian membership functions with respect to the nonlinear modeling training process in Mackey–Glass data forecasting: (a) before training using FNN-LPS(1); (b) after training using FNN-LPS(1); (c) before training using FNN-LPS(3); and (d) after training using FNN-LPS(3).

C. Simulation-2: Exchange Rate Forecasting

Next, the effectiveness of the developed FNN predictor is examined using currency exchange rate data, which are nonlinear and nonperiodic. The daily exchange rates of U.S. dollars versus Canadian dollars in 2009 and 2010 are used as two training datasets, and the daily exchange rate data in 2011 are used for testing the forecasting performance of the predictors. Each training dataset contains 502 data samples and the test dataset contains 245 data samples. The structure of the FNN is 2-2-3-3-1 in terms of the number of layer nodes, with $p_i = 3$, $i = 1, 2$, and $p_{\max} = 3$; the RBF predictor has the same settings as in the previous simulation.

Fig. 9. Forecasting results of exchange rate data of U.S. dollars versus Canadian dollars in 2011. The blue solid line is the real data to estimate; the red dotted line is the prediction performance of different schemes by (a) RBF, (b) FNN-PS, and (c) FNN-LPS.

Fig. 9 shows the one-step-ahead forecasting results of the related predictors, and Table II summarizes the test results. It can be seen that the FNN predictors [Fig. 9(b) and (c)] outperform the RBF [Fig. 9(a)] in terms of prediction accuracy. The performance of the PS and the LPS in searching for the global minimum of the negative log-likelihood function is compared in Fig. 10; the proposed LPS is superior to the PS due to its more advanced search strategy. In addition, the FNN-LPS catches the dynamic characteristics of the system (the exchange rate in this case) more effectively and outperforms the FNN-PS predictor. The convergence of the training in terms of mean square errors for the FNN-PS and FNN-LPS predictors is shown in Fig. 11.

Fig. 10. Performance of PS and LPS searching for the global minimum of the negative log-likelihood functions of two AR node membership functions corresponding to two exchange rate training datasets: (a) U.S. dollars versus Canadian dollars in 2009 and (b) U.S. dollars versus Canadian dollars in 2010. The blue solid line is the performance of PS and the red dotted line is the performance of LPS.

Fig. 11. Convergence of the training in terms of the mean square errors (MSE) for exchange rate data forecasting. The blue dotted line is the error convergence of the FNN-PS predictor and the red solid line is the error convergence of the FNN-LPS predictor.

D. Gear System Prognosis

Gear systems are commonly used in rotary machinery, such as vehicles and various industrial facilities. Gear system prognostic information can be used in system condition monitoring to recognize the occurrence of a gear defect at its earliest stage, so as to prevent machinery performance degradation, malfunction, or even catastrophic failures (e.g., in aircraft).

Fig. 12 schematically shows the experimental setup used for the gear system monitoring tests. The system is driven by a 3-hp AC motor. The load is provided by a heavy-duty magnetic brake unit (PHC-5). A two-stage gearbox is tested in this paper. A magnetic pick-up sensor (BH1522-010) is mounted in the radial direction of gear #4 to provide a reference signal (one pulse per tooth period) for time synchronous average filtering. The gap between the magnetic pick-up sensor and the gear top land is in the range of 0.5-1.5 mm. The motor rotation is controlled by a speed controller (Delta VFD-B) that allows the tested gear system to operate in the range of 20-4200 rpm. The speed of the drive motor and the load of the magnetic brake system are adjusted automatically to simulate different speed/load operating conditions. The vibration signals are measured using two accelerometers (ICP-SN99096) mounted at the ends of the gearbox housing. The signals from different sensors are collected using a data acquisition board.


Fig. 12. Experimental setup for gear system monitoring. (1): tested gear with simulated defects. (2)-(4): healthy transmission gears.

Fig. 13. Gear conditions tested (for gear #1 in Fig. 12). (a) Healthy gear. (b) Pitted gear. (c) Cracked gear.

The preconditioned signals are then fed to a computer for further processing.

In this paper, the fault detection of the gear system is conducted gear by gear. The measured vibration from the experimental setup is an overall signal associated with different vibratory sources, such as the shafts, bearings, and gears. Because the gear signal is periodic in nature, it can be separated out by applying a time synchronous filtering process based on the reference signature provided by the magnetic sensor. As a result, the signal components nonsynchronous with the rotation of the gear of interest can be removed. Each gear signal is obtained and represented over one full revolution (i.e., the signal average) [18]. In this case, an index based on the continuous wavelet amplitude measurement is used as an example for gear fault diagnosis/prognosis. Details of the related techniques can be found in [19].

Three gear cases are tested in this paper for gear #1, as shown in Fig. 13: (a) healthy gears, (b) pitted gears, and (c) cracked gears. Many tests have been conducted corresponding to different gear conditions; three typical examples are described below.

In gear system prognosis, the predictors are first trained using the healthy gear dataset and the pitted gear dataset (i.e., with two AR nodes); the cracked gear dataset is used for testing. The healthy gear data and pitted gear data contain 344 and 302 samples, respectively.

Fig. 14. Performance of cracked gear prognosis. The blue solid line is the real data to estimate; the red dotted line is the prediction performance of different schemes by (a) RBF, (b) FNN-PS, and (c) FNN-LPS.

TABLE III
Performance Comparison of the Related Predictors in Terms of the Mean Square Errors Based on Cracked Gear Prognosis Data

Predictors | Training error | Prediction error
-----------|----------------|-----------------
RBF        | 0.8218         | 0.7841
FNN-PS     | 0.0625         | 0.0397
FNN-LPS    | 0.0427         | 0.0225

In the cracked gear tests, the tests are conducted under load levels from 0.5 to 3 hp and motor speeds from 100 to 3600 rpm. The monitoring time step is set at 1.0 h; that is, the monitoring systems are applied automatically every hour for the condition monitoring operation, performed gear by gear. After about 130 h, a transverse cut of 10% of the tooth root thickness is introduced into one tooth of gear #1, as illustrated in Fig. 13(c), to simulate an initial fatigue crack. The test is then resumed and continued until the damaged tooth breaks off, about 153 h later.

The cracked gear data contain 283 samples. The settings of the FNN and RBF predictors are the same as in the previous simulation case. Fig. 14 shows the one-step-ahead forecasting performance of the related predictors, and Table III summarizes the test results. It can be seen that the FNN predictors [Fig. 14(b) and (c)] yield better performance than the RBF [Fig. 14(a)] due to their efficient information extraction strategy. From Fig. 15, it is clear that the proposed LPS technique can significantly enhance the search ability of the predictors compared with the classical PS method. The developed FNN-LPS predictor outperforms the FNN-PS; it can capture and track the dynamic behavior of the gear system quickly and efficiently. The convergence of the training for the FNN-PS and FNN-LPS predictors is shown in Fig. 16.

Fig. 15. Performance of PS and LPS searching for the global minimum of the negative log-likelihood functions of two AR node membership functions corresponding to (a) the healthy gear dataset and (b) the pitted gear dataset. The blue solid line is the performance of PS and the red dotted line is the performance of LPS.

Fig. 16. Convergence of the training in terms of the mean square errors (MSE) for gear system prognosis. The blue dotted line is the error convergence of the FNN-PS predictor and the red solid line is the error convergence of the FNN-LPS predictor.

V. Conclusion

An FNN predictor was developed in this paper to properly model multiple training datasets in order to improve the accuracy of system state prognosis. The FNN predictor integrates the advantages of both linear and nonlinear modeling approaches in dealing with multiple datasets that have different characteristics. A new training technique, the LPS, was proposed to improve training efficiency and convergence. The effectiveness of the proposed FNN predictor and the related training technique was verified using simulation tests. The developed FNN predictor was also implemented for gear system prognosis. Test results showed that the FNN-LPS predictor is an efficient forecasting tool, especially for systems with multiple datasets; it can capture the dynamic behavior of a system quickly and accurately. The LPS technique is an efficient training algorithm for improving training convergence.

References

[1] S. Lu, K. H. Ju, and K. H. Chon, "A new algorithm for linear and nonlinear ARMA model parameter estimation using affine geometry," IEEE Trans. Biomed. Eng., vol. 48, no. 10, pp. 1116–1124, 2001.

[2] H. Lütkepohl, New Introduction to Multiple Time Series Analysis. Berlin/Heidelberg, Germany: Springer-Verlag, 2005.

[3] K. Mohammadi, H. R. Eslami, and R. Kahawita, "Parameter estimation of an ARMA model for river flow forecasting using goal programming," J. Hydrol., vol. 331, no. 1–2, pp. 293–299, 2006.

[4] E. Erdem and J. Shi, "ARMA based approaches for forecasting the tuple of wind speed and direction," Appl. Energy, vol. 88, no. 4, pp. 1405–1414, 2011.

[5] J. Nowicka-Zagrajek and R. Weron, "Modeling electricity loads in California: ARMA models with hyperbolic noise," Signal Process., vol. 82, no. 12, pp. 1903–1915, 2002.

[6] N. Haldrup, F. S. Nielsen, and M. Nielsen, "A vector autoregressive model for electricity prices subject to long memory and regime switching," Energy Econ., vol. 32, no. 5, pp. 1044–1058, 2010.

[7] C. Kascha and K. Mertens, "Business cycle analysis and VARMA models," J. Econ. Dyn. Control, vol. 33, no. 2, pp. 267–282, 2009.

[8] F. F. R. Ramos, "Forecasts of market shares from VAR and BVAR models: A comparison of their accuracy," Int. J. Forecasting, vol. 19, no. 1, pp. 95–110, 2003.

[9] M. Han and Y. Wang, "Analysis and modeling of multivariate chaotic time series based on neural network," Expert Syst. Appl., vol. 36, no. 2, pp. 1280–1290, 2009.

[10] L. Su, "Prediction of multivariate chaotic time series with local polynomial fitting," Comput. Math. Appl., vol. 59, no. 2, pp. 737–744, 2010.

[11] P. A. Fishwick, "Neural network models in simulation: A comparison with traditional modeling approaches," in Proc. Winter Simul. Conf., 1989, pp. 702–710.

[12] Z. Tang, C. Almeida, and P. A. Fishwick, "Time series forecasting using neural networks vs. Box–Jenkins methodology," Simulation, vol. 57, no. 5, pp. 303–310, 1991.

[13] A. Arefi and M. R. Haghifam, "A modified particle swarm optimization for correlated phenomena," Appl. Soft Comput., vol. 11, no. 8, pp. 4640–4654, 2011.

[14] M. M. Noel, "A new gradient based particle swarm optimization algorithm for accurate computation of global minimum," Appl. Soft Comput., vol. 12, no. 1, pp. 353–359, 2012.

[15] G. Welch and G. Bishop, An Introduction to the Kalman Filter. Chapel Hill, NC, USA: University of North Carolina, 2006.

[16] C. Hevia, “Maximum likelihood estimation of an ARMA(p,q) model,”The World Bank, DECRG, 2008.

[17] E. J. Hannan and A. J. McDougall, "Regression procedures for ARMA estimation," J. Am. Stat. Assoc., vol. 83, no. 409, pp. 490–498, 1988.

[18] W. Wang and D. Kanneg, "An integrated classifier for condition monitoring in transmission systems," Mech. Syst. Signal Process., vol. 23, no. 4, pp. 1298–1312, 2009.

[19] W. Wang, "An enhanced diagnostic system for gear system monitoring," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 1, pp. 102–112, 2008.

[20] M. Clerc, Particle Swarm Optimization. London, U.K.: ISTE Ltd., 2006.

[21] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Upper Saddle River, NJ, USA: Prentice Hall, 1997.

[22] M. Mackey and L. Glass, "Oscillation and chaos in physiological control systems," Science, vol. 197, no. 4300, pp. 287–289, 1977.

[23] L. Yu and Y. Zhang, "Evolutionary fuzzy neural networks for hybrid financial prediction," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 35, no. 2, pp. 244–249, 2005.

[24] D. G. Stavrakoudis and J. B. Theocharis, "Pipelined recurrent fuzzy neural networks for nonlinear adaptive speech prediction," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 5, pp. 1305–1320, 2007.


[25] C. Lin, C. Chen, and C. Lin, "A hybrid of cooperative particle swarm optimization and cultural algorithm for neural fuzzy networks and its prediction applications," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 39, no. 1, pp. 55–68, 2009.

[26] Y. Chen, B. Yang, A. Abraham, and L. Peng, "Automatic design of hierarchical Takagi–Sugeno type fuzzy systems using evolutionary algorithms," IEEE Trans. Fuzzy Syst., vol. 15, no. 3, pp. 385–397, 2007.

[27] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. 14th Int. Joint Conf. Artif. Intell., 1995, vol. 2, pp. 1137–1143.

Dezhi Li (M'10) received the B.Sc. degree in electrical engineering from Shandong University, Jinan, China, in 2008, and the M.Sc. degree in control engineering from Lakehead University, Thunder Bay, ON, Canada, in 2010.

From 2010 to 2011, he was a Research Associate at Lakehead University. He is currently a Ph.D. candidate with the Department of Mechanical and Mechatronics Engineering, University of Waterloo. His research interests include signal processing, machinery condition monitoring, mechatronic systems, linear/nonlinear system control, and artificial intelligence.

Wilson Wang (M'04–SM'07) received the M.E. degree in industrial engineering from the University of Toronto, Toronto, ON, Canada, in 1998, and the Ph.D. degree in mechatronics engineering from the University of Waterloo, Waterloo, ON, Canada, in 2002.

From 2002 to 2004, he was a Senior Scientist with Mechworks Systems Inc. He joined Lakehead University, Thunder Bay, ON, Canada, in 2004, where he is currently a Professor with the Department of Mechanical Engineering. His research interests include signal processing, artificial intelligence, machinery condition monitoring, intelligent control, mechatronics, and bioinformatics.

Fathy Ismail received the B.Sc. and M.Sc. degrees in mechanical and production engineering from Alexandria University, Alexandria, Egypt, in 1970 and 1974, respectively, and the Ph.D. degree from McMaster University, Hamilton, ON, Canada, in 1983.

He joined the University of Waterloo, Waterloo, ON, Canada, in 1983, and is currently a Professor with the Department of Mechanical and Mechatronics Engineering. He has served as Chair of the Department and Associate Dean of the Faculty of Engineering for graduate studies. His research interests include machining dynamics, high-speed machining, modeling of structures from modal analysis testing, and machinery health condition monitoring and diagnosis.