
An intelligent forecasting model based on robust wavelet ν-support vector machine

Qi Wu a,b,*, Rob Law b

a Jiangsu Key Laboratory for Design and Manufacture of Micro-Nano Biomedical Instruments, Southeast University, Nanjing 211189, China
b School of Hotel and Tourism Management, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

Keywords:
Support vector machine
Wavelet kernel
Robust loss function
Particle swarm optimization
Forecast

Abstract

Product demand series exhibit small samples, seasonality, nonlinearity, randomness and fuzziness, and the existing support vector kernels cannot approximate the random curve of the demand time series in the L2(R^n) space (the space of square-integrable functions). A robust loss function is also proposed to overcome the shortcoming of the ε-insensitive loss function in handling hybrid noises. A novel robust wavelet support vector machine (RW ν-SVM) is proposed based on wavelet theory and a modified support vector machine. A particle swarm optimization algorithm is designed to select the optimal parameters of the RW ν-SVM model within the permitted constraint ranges. The results of an application to car demand forecasting show that the forecasting approach based on the RW ν-SVM model is effective and feasible; a comparison with other methods is also given, which shows that the proposed method is better than Wν-SVM, Wg-SVM and other traditional methods. © 2010 Elsevier Ltd. All rights reserved.

    1. Introduction

Applications of time series prediction can be found in economic and business planning, inventory and production control, weather forecasting, signal processing and many other fields (Box & Jenkins, 1994; Engle, 1984; Hornik, Stinchcombe, & White, 1989; Hill, Connor, & Remus, 1996; Tuan & Lanh, 1981; Tong, 1983; Tang, Almedia, & Fishwick, 1991; Zhang, 2001). Product demand forecasting, as an application of time series forecasting, concerns a complex dynamic system whose demand behavior is affected by many factors, many of which are random, nonlinear, seasonal and uncertain. The mapping between these influencing factors and the demand series is nonlinear and difficult to describe with definite mathematical models.

For linear series, Box and Jenkins (1994) developed the autoregressive integrated moving average (ARIMA) methodology for forecasting time series events. A basic tenet of the ARIMA modeling approach is the assumption of linearity among the variables. However, there are many time series events for which the assumption of linearity may not hold, and ARIMA models cannot effectively capture and explain nonlinear relationships. When ARIMA models are applied to processes that are nonlinear, forecasting errors often increase greatly as the forecasting horizon becomes longer. To improve the forecasting of nonlinear time series events, researchers have developed alternative modeling approaches, including nonlinear regression models, the bilinear model (Tuan & Lanh, 1981), the threshold autoregressive model (Tong, 1983), and the autoregressive conditional heteroscedastic (ARCH) model (Engle, 1984). Although these methods improve over linear models for some specific cases, they tend to be application specific, lack generality, and are harder to implement (Zhang, 2001).

For nonlinear series, the artificial neural network (ANN) is a general-purpose model that has been used as a universal function approximator: it is supposed to be able to model easily any type of parametric or non-parametric process, including automatically and optimally transforming the input data. These claims have led to increasing interest in neural networks (Hornik et al., 1989). Researchers have used the ANN methodology to forecast a number of nonlinear time series events (Hill et al., 1996; Tang et al., 1991; Tang & Fishwick, 1993). The effectiveness of neural network models and their performance in comparison to traditional forecasting methods have also been the subject of many studies (Gorr, 1994; Zhang, Patuwo, & Hu, 1998). Bell, Ribar, and Verchio (1989) compare back-propagation networks against regression models in predicting commercial bank failures. The neural network model performs well in failure prediction, and the expected misclassification costs of the neural network models are found to be lower than those of the logistic regression model. Roy and Cosset (1990) also use neural network and logistic regression models to predict country risk ratings from economic and political indicators. The neural network models have lower mean absolute error in their predictions and react more evenly to the indicators


than their logistic counterparts. Duliba (1991) compares neural network models with four types of regression models in predicting the financial performance of transportation companies. She finds that the neural network model outperforms the random-effects regression model but not the fixed-effects model. Though neural networks are more powerful than regression methods for time series prediction, their drawback is that designing an efficient architecture and choosing the parameters involved require long processing times. In fact, learning neural network weights can be considered a hard optimization problem whose learning time scales exponentially as the problem size grows. To overcome this disadvantage, a new approach should be explored.

Recently, a novel machine learning technique, the support vector machine (SVM), has drawn much attention in the fields of pattern classification and regression forecasting. SVM was first introduced by Vapnik (1995). It is a learning method for classifiers grounded in statistical learning theory. The algorithm derives from the linear classifier and solves the two-class classification problem; it was later applied to nonlinear problems, that is, it finds the optimal (large-margin) hyperplane to classify the sample set. It is an approximate implementation of the structural risk minimization (SRM) principle in statistical learning theory, rather than of the empirical risk minimization (ERM) method (Kwok, 1999).

Compared with traditional neural networks, SVM uses structural risk minimization to avoid problems such as overfitting, the curse of dimensionality, and local minima. For small sample sets the algorithm generalizes well, and SVM has also been successfully used for machine learning with large and high-dimensional data sets. These attractive properties make SVM a promising technique. This is due to the fact that the generalization property of an SVM does not depend on the complete training data but only on a subset thereof, the so-called support vectors. SVM has now been applied in many fields, such as handwriting recognition, three-dimensional object recognition, face recognition, text image recognition, voice recognition, and regression analysis (Carbonneau, Laframbois, & Vahidov, 2008; Trontl, Smuc, & Pevec, 2007; Wohlberg, Tartakovsky, & Guadagnini, 2006).

For pattern recognition and regression analysis, the nonlinear ability of SVM is achieved through kernel mapping, and the kernel function must satisfy the condition of Mercer's theorem. The Gaussian function is a widely used kernel that shows good generalization ability. However, with the kernel functions used so far, the SVM cannot approximate an arbitrary curve in the L2(R^n) space (the space of square-integrable functions), because the kernel functions in current use do not form a complete orthonormal basis. Consequently, the SVM cannot approximate every curve in the L2(R^n) space and, similarly, the regression SVM cannot approximate every function.

Accordingly, we need to find a new kernel function that can build a complete basis through horizontal floating (translation) and flexing (dilation). Such functions already exist: the wavelet functions. Based on wavelet decomposition, this paper proposes a kind of allowable support vector kernel function, named the wavelet kernel function, and proves that such a kernel function exists. The Morlet and Mexican hat wavelet kernel functions are approximately orthonormal bases of the L2(R^n) space. Based on wavelet analysis and the conditions of the support vector kernel function, a Morlet or Mexican hat wavelet kernel function for the support vector regression machine is proposed, which is a kind of approximately orthonormal function. This kernel function can simulate almost any curve in the space of square-integrable functions, thus enhancing the generalization ability of the SVR. Khandoker, Lai, Begg, and Palaniswami (2007) and Widodo and Yang (2008) research the wavelet ε-support vector machine, and much research indicates that the performance of ν-SVM is better than that of ε-SVM. According to the wavelet kernel function and regularization theory, a ν-support vector machine on the wavelet kernel function (Wν-SVM) is proposed in this paper.

However, the standard SVM encounters certain difficulties in real applications, and improved SVMs have been put forward to solve concrete problems (Kwok, 1999). Though the standard SVM adopting the ε-insensitive loss function has good generalization capability in some applications, it is difficult for it to handle the Gaussian and normally distributed noise parts of a series. Therefore, this paper focuses on modeling a new wavelet SVM that can penalize the Gaussian noise parts of a series.

Based on the RW ν-SVM, an intelligent forecasting approach for car demand series with nonlinear and uncertain characteristics is proposed in this paper. Section 2 constructs an intelligent forecasting model based on a new ν-support vector regression machine with wavelet kernel function and robust loss function (RW ν-SVM) and the particle swarm optimization (PSO) algorithm. Section 3 gives two algorithms to solve the intelligent forecasting problem. Section 4 gives an application of the intelligent forecasting system based on the RW ν-SVM model. Section 5 draws the conclusions.

2. Robust wavelet ν-support vector machine (RW ν-SVM)

     2.1. Support vector machine

SVMs represent a novel neural network technique that has gained ground in classification, forecasting and regression analysis. One of their key properties is that training an SVM is equivalent to solving a linearly constrained quadratic programming problem, whose solution is unique and globally optimal. Therefore, unlike other network training techniques, SVMs circumvent the problem of getting stuck at local minima. Another advantage of SVMs is that the solution to the optimization problem depends only on a subset of the training data points, which are referred to as the support vectors.

Let us consider a set of data points (x_1, y_1), (x_2, y_2), ..., (x_l, y_l), generated independently and randomly from an unknown function. Specifically, x_i is a column vector of attributes, y_i is a scalar representing the dependent variable, and l denotes the number of data points in the training set. SVM approximates such an unknown function by mapping x into a higher-dimensional space through a function φ and determining a linear maximum-margin hyperplane. In particular, the smallest distance to such a hyperplane is called the margin of separation; the hyperplane is an optimal separating hyperplane if the margin is maximized. The data points located exactly the margin distance away from the hyperplane are called the support vectors.

Mathematically, SVM utilizes a hyperplane of the form f(x) = w · x + b = 0, where the coefficients w and b are estimated by minimizing a regularized risk function:

\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} L_\varepsilon(y_i),   (1)

where \|w\|^2 is the regularization term, \sum_{i=1}^{l} L_\varepsilon(y_i) is the empirical error, and C > 0 is an arbitrary penalty parameter called the regularization constant. Basically, SVM penalizes f(x_i) when it departs from y_i by means of an ε-insensitive loss function:

L_\varepsilon(y_i) = \begin{cases} 0 & \text{if } |f(x_i) - y_i| \le \varepsilon, \\ |f(x_i) - y_i| - \varepsilon & \text{otherwise}, \end{cases}   (2)

where the parameter ε defines


the margin of separation to the hyperplane. The ε-insensitive loss function is illustrated in Fig. 1.

The minimization of expression (1) is implemented by introducing the slack variables ξ_i and ξ_i*. Specifically, the ν-support vector regression (ν-SVM) solves the following quadratic programming problem:

\min_{w,\xi^{(*)},\varepsilon,b} \; \tau(w,\xi^{(*)},\varepsilon) = \frac{1}{2}\|w\|^2 + C\left(\nu\varepsilon + \frac{1}{l}\sum_{i=1}^{l}(\xi_i + \xi_i^*)\right)   (3)

subject to \; (w \cdot x_i + b) - y_i \le \varepsilon + \xi_i,   (4)

y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i^*,   (5)

\xi_i^{(*)} \ge 0, \quad \varepsilon \ge 0.   (6)

The solution to this minimization problem is of the form

f(x) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) K(x_i, x) + b,   (7)

where \alpha_i and \alpha_i^* are the Lagrange multipliers associated with the constraints (w \cdot x_i + b) - y_i \le \varepsilon + \xi_i and y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i^*, respectively. The function K(x_i, x_j) = \phi(x_i)'\phi(x_j) represents a kernel, i.e., the inner product of the two vectors x_i and x_j in the feature space \phi(x_i), \phi(x_j).

Well-known kernel functions are K(x_i, x_j) = x_i' x_j (linear), K(x_i, x_j) = (\gamma x_i' x_j + r)^d, \gamma > 0 (polynomial), K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \gamma > 0 (radial basis function), and K(x_i, x_j) = \tanh(\gamma x_i' x_j + r) (sigmoid). The radial kernel is a popular choice in the SVM literature.
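For concreteness, the four kernels above can be written directly in code. The following is a minimal sketch in Python/NumPy; the function names and default parameter values are ours, not from the paper:

```python
import numpy as np

def linear_kernel(x, z):
    # K(x, z) = x'z
    return np.dot(x, z)

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=3):
    # K(x, z) = (gamma * x'z + r)^d, gamma > 0
    return (gamma * np.dot(x, z) + r) ** d

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2), gamma > 0
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=1.0, r=0.0):
    # K(x, z) = tanh(gamma * x'z + r); a valid Mercer kernel
    # only for certain (gamma, r) combinations.
    return np.tanh(gamma * np.dot(x, z) + r)
```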

     2.2. The conditions of wavelet support vector’s kernel function

    The support vector’s kernel function can be described as not

    only the product of point, such as   K ( x, x0) = K ( x  x0), but also the

    horizontal floating function, such as  K ( x, x0) = K ( x  x0). In fact, if a

    function satisfied condition of Mercer, it is the allowable support

    vector kernel function.

Lemma 1. A symmetric function K(x, x') is a kernel function of SVM if and only if, for every function u ≠ 0 satisfying \int_{R^d} u^2(\xi)\,d\xi < \infty, the following condition holds:

\iint K(x, x')\, u(x)\, u(x')\, dx\, dx' \ge 0.   (8)

This lemma provides a simple method to build kernel functions.

For a horizontal floating function, which can hardly be factored into two identical functions, we give the condition for it to be an allowable kernel.

Lemma 2. A horizontal floating function K(x - x') is an allowable support vector kernel function if and only if the Fourier transform of K(x) satisfies

F[K](\omega) = (2\pi)^{-n/2} \int_{R^n} \exp(-j(\omega \cdot x))\, K(x)\, dx \ge 0.   (9)

If the wavelet function ψ(x) satisfies the conditions ψ(x) ∈ L^2(R) ∩ L^1(R) and \hat{\psi}(0) = 0, where \hat{\psi} is the Fourier transform of ψ(x), the wavelet function group can be defined as

\psi_{a,m}(x) = (a)^{-1/2}\, \psi\!\left(\frac{x - m}{a}\right),   (10)

where a is the so-called scaling parameter, m is the horizontal floating coefficient, and ψ(x) is called the "mother wavelet". The translation parameter m ∈ R and the dilation a > 0 may be continuous or discrete. For a function f(x) ∈ L^2(R), the wavelet transform of f(x) can be defined as

W(a, m) = (a)^{-1/2} \int_{-\infty}^{+\infty} f(x)\, \psi^*\!\left(\frac{x - m}{a}\right) dx,   (11)

where ψ*(x) stands for the complex conjugate of ψ(x).

The wavelet transform W(a, m) can be considered a function of the translation m at each scale a. Eq. (11) indicates that wavelet analysis is a time-frequency, or time-scale, analysis. Unlike the short-time Fourier transform, the wavelet transform can perform multi-scale analysis of a signal through dilation and translation, so it can extract the time-frequency features of a signal effectively.
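To make Eq. (11) concrete, the sketch below approximates W(a, m) for a sampled signal by direct numerical integration, using the real-valued Morlet mother wavelet of Eq. (17) (for which ψ* = ψ); the discretization, the toy signal and the parameter values are our illustrative assumptions:

```python
import numpy as np

def morlet(x, omega0=5.0):
    # Mother wavelet psi(x) = cos(omega0 * x) * exp(-x^2 / 2), cf. Eq. (17).
    return np.cos(omega0 * x) * np.exp(-x ** 2 / 2)

def wavelet_transform(f, t, a, m, omega0=5.0):
    # W(a, m) = a^{-1/2} * integral of f(x) * psi((x - m) / a) dx, cf. Eq. (11),
    # approximated by the trapezoidal rule over the sample grid t.
    psi = morlet((t - m) / a, omega0)
    return a ** -0.5 * np.trapz(f * psi, t)

# Example: one (scale, translation) coefficient of a noisy seasonal series.
t = np.linspace(0.0, 10.0, 1000)
f = np.sin(2 * np.pi * t) + 0.1 * np.random.randn(t.size)
print(wavelet_transform(f, t, a=0.5, m=5.0))
```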

The wavelet transform is also reversible, which provides the possibility of reconstructing the original signal. A classical inversion formula for f(x) is

f(x) = C_\psi^{-1} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} W(a, m)\, \psi_{a,m}(x)\, \frac{da}{a^2}\, dm,   (12)

where

C_\psi = \int_{-\infty}^{+\infty} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\, d\omega < \infty.


We can build the horizontal floating kernel function as follows:

K(x, x') = \prod_{i=1}^{d} \psi\!\left(\frac{x_i - x_i'}{a_i}\right),   (16)

where a_i is the scaling parameter of the wavelet, a_i > 0. Because the wavelet kernel function must satisfy the conditions of Lemma 2, the number of wavelet kernel functions that can be expressed by existing functions is small. We now give an existing wavelet kernel function, the Morlet wavelet kernel function, and prove that it satisfies the condition of an allowable support vector kernel function. The Morlet wavelet function is defined as

\psi(x) = \cos(\omega_0 x)\, \exp\!\left(-\frac{x^2}{2}\right).   (17)

Theorem 1. The Morlet wavelet kernel function

K(x, x') = \prod_{i=1}^{n} \cos\!\left(\omega_0\, \frac{x_i - x_i'}{a}\right) \exp\!\left(-\frac{(x_i - x_i')^2}{2a^2}\right)   (18)

is an allowable support vector kernel function.

Proof. According to Lemma 2, we only need to prove that

F[K](\omega) = (2\pi)^{-n/2} \int_{R^n} \exp(-j(\omega \cdot x))\, K(x)\, dx \ge 0,   (19)

where K(x) = \prod_{i=1}^{n} \psi(x_i / a) = \prod_{i=1}^{n} \cos(\omega_0 x_i / a)\, \exp(-x_i^2 / 2a^2) and j denotes the imaginary unit. We have

\int_{R^n} \exp(-j\,\omega \cdot x)\, K(x)\, dx
= \int_{R^n} \exp(-j\,\omega \cdot x) \prod_{i=1}^{n} \cos\!\left(\frac{\omega_0 x_i}{a}\right) \exp\!\left(-\frac{x_i^2}{2a^2}\right) dx
= \prod_{i=1}^{n} \int_{-\infty}^{+\infty} \exp(-j\,\omega_i x_i)\, \frac{\exp(j\,\omega_0 x_i / a) + \exp(-j\,\omega_0 x_i / a)}{2}\, \exp\!\left(-\frac{x_i^2}{2a^2}\right) dx_i
= \prod_{i=1}^{n} \frac{|a|\sqrt{2\pi}}{2} \left( \exp\!\left(-\frac{(\omega_0 - \omega_i a)^2}{2}\right) + \exp\!\left(-\frac{(\omega_0 + \omega_i a)^2}{2}\right) \right).   (20)

Substituting formula (20) into Eq. (19), we obtain

F[K](\omega) = \prod_{i=1}^{n} \frac{|a|}{2} \left( \exp\!\left(-\frac{(\omega_0 - \omega_i a)^2}{2}\right) + \exp\!\left(-\frac{(\omega_0 + \omega_i a)^2}{2}\right) \right),   (21)

and since a ≠ 0, we have

F[K](\omega) \ge 0.   (22)

If we use the wavelet kernel function as the support vector kernel function, the regression estimation equation of Wν-SVM is defined as

f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) \prod_{j=1}^{d} \psi\!\left(\frac{x_j - x_{i,j}}{a}\right) + b.   (23)

For wavelet analysis and theory, see Krantz (1994) and Liu and Di (1992). □
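A direct implementation of the Morlet wavelet kernel of Eq. (18) is straightforward. The sketch below is our own; the defaults a = 0.27 (the optimum found in Section 4) and ω_0 = 1.75 (a common choice for Morlet wavelet kernels) are illustrative assumptions:

```python
import numpy as np

def morlet_wavelet_kernel(x, z, a=0.27, omega0=1.75):
    # K(x, z) = prod_i cos(omega0*(x_i - z_i)/a) * exp(-(x_i - z_i)^2 / (2 a^2)),
    # cf. Eq. (18). x and z are 1-D feature vectors; a > 0 is the scale.
    u = (np.asarray(x, dtype=float) - np.asarray(z, dtype=float)) / a
    return np.prod(np.cos(omega0 * u) * np.exp(-u ** 2 / 2))

def gram_matrix(X, a=0.27, omega0=1.75):
    # Kernel (Gram) matrix over a sample set X of shape (l, d).
    l = X.shape[0]
    K = np.empty((l, l))
    for i in range(l):
        for j in range(l):
            K[i, j] = morlet_wavelet_kernel(X[i], X[j], a, omega0)
    return K
```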

     2.3. Robust loss function

However, it is difficult for the standard wavelet ν-SVM to deal with the hybrid noise of time series. To remedy this shortcoming of the ε-insensitive loss of the standard wavelet ν-SVM, a new hybrid function composed of the Gaussian function, the Laplace function and the ε-insensitive loss function is constructed as the loss function of ν-SVM; it is called the robust loss function and is defined as follows:

L(\xi) = \begin{cases} 0, & |\xi| \le \varepsilon, \\ \frac{1}{2}(|\xi| - \varepsilon)^2, & \varepsilon < |\xi| \le \varepsilon_\mu, \\ \mu(|\xi| - \varepsilon) - \frac{1}{2}\mu^2, & |\xi| > \varepsilon_\mu, \end{cases}   (24)

where \varepsilon_\mu = \varepsilon + \mu and ξ denotes the slack variables. The middle part of the robust loss curve is a quadratic error curve, used to inhibit (penalize) noise with the feature of a Gaussian distribution. The linear part is generally used to inhibit (penalize) singular points and large-magnitude noises in the time series. The curve of the robust loss function, which is divided into these three parts, is illustrated in Fig. 2. The proposed robust loss function integrates the advantages of the Gaussian loss function, the Laplace loss function and the ε-insensitive loss function, and gives the support vector machine better robustness and good generalization ability.
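Under the three-branch form of Eq. (24) above (insensitive zone, quadratic zone, linear zone, joined so that the loss is continuous and smooth at |ξ| = ε + μ), the robust loss can be coded as follows; a minimal sketch with illustrative default parameters:

```python
import numpy as np

def robust_loss(xi, eps=0.1, mu=1.0):
    # Eq. (24): zero inside the eps-insensitive zone; quadratic (Gaussian-like)
    # penalty for eps < |xi| <= eps + mu; linear (Laplace-like) penalty beyond,
    # with the constant -mu^2/2 matching value and slope at |xi| = eps + mu.
    r = np.abs(xi)
    quad = 0.5 * (r - eps) ** 2
    lin = mu * (r - eps) - 0.5 * mu ** 2
    return np.where(r <= eps, 0.0, np.where(r <= eps + mu, quad, lin))
```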

2.4. Robust wavelet ν-support vector machine

Integrating the wavelet kernel function, the robust loss function and the ν-support vector machine, a robust wavelet support vector machine is proposed in this part. The parameter b is taken into account in the confidence interval of RW ν-SVM; the new optimization problem is then reformulated as

\min_{w,\xi^{(*)},\varepsilon,b} \; \frac{1}{2}\left(\|w\|^2 + b^2\right) + C\left(\nu\varepsilon + \sum_{i \in I_1} \frac{1}{2}\left(\xi_i^2 + \xi_i^{*2}\right) + \frac{1}{l}\sum_{i \in I_2} \mu\left(\xi_i + \xi_i^*\right)\right)   (25)

subject to \; (w \cdot x_i + b) - y_i \le \varepsilon + \xi_i,   (26)

y_i - (w \cdot x_i + b) \le \varepsilon + \xi_i^*,   (27)

\xi_i^{(*)} \ge 0, \quad \varepsilon \ge 0.   (28)

Problem (25) is a quadratic programming (QP) problem. By introducing Lagrange multipliers, a Lagrangian function can be defined as follows:

L(w, b, \alpha^{(*)}, \beta, \xi^{(*)}, \varepsilon, \eta^{(*)}) = \frac{1}{2}\|w\|^2 + \frac{1}{2}b^2 + C\nu\varepsilon + C \sum_{i \in I_1} \frac{1}{2}\left(\xi_i^2 + \xi_i^{*2}\right) + \frac{C}{l} \sum_{i \in I_2} \mu\left(\xi_i + \xi_i^*\right) - \beta\varepsilon - \sum_{i \in I_2} \left(\eta_i \xi_i + \eta_i^* \xi_i^*\right) - \sum_{i=1}^{l} \alpha_i \left(\varepsilon + \xi_i + w \cdot x_i + b - y_i\right) - \sum_{i=1}^{l} \alpha_i^* \left(\varepsilon + \xi_i^* - w \cdot x_i - b + y_i\right),   (29)

Fig. 2. Robust loss function.


where \alpha_i^{(*)}, \eta_i^{(*)}, \beta \ge 0 are Lagrange multipliers. Differentiating the Lagrangian function (29) with respect to w, b, ε and ξ^{(*)}, we have

\partial L / \partial w = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \left(\alpha_i - \alpha_i^*\right) x_i,   (30)

\partial L / \partial b = 0 \;\Rightarrow\; \sum_{i=1}^{l} \left(\alpha_i - \alpha_i^*\right) = b,   (31)

\partial L / \partial \varepsilon = 0 \;\Rightarrow\; \beta = C\nu - \sum_{i=1}^{l} \left(\alpha_i + \alpha_i^*\right),   (32)

\partial L / \partial \xi_i^{(*)} = 0 \;\Rightarrow\; \eta_i^{(*)} = C\mu / l - \alpha_i^{(*)}.   (33)

By substituting (30)–(33) into (29), we can obtain the corresponding dual form of problem (25) as follows:

\min_{\alpha,\alpha^* \in R^l} \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \left(\alpha_i - \alpha_i^*\right)\left(\alpha_j - \alpha_j^*\right)\left(K(x_i, x_j) + 1\right) - \sum_{i=1}^{l} y_i \left(\alpha_i - \alpha_i^*\right) + \frac{1}{2C} \sum_{i=1}^{l} \left(\alpha_i^2 + \alpha_i^{*2}\right)

s.t. \; e^T\left(\alpha + \alpha^*\right) \le C\nu, \quad 0 \le \alpha_i, \alpha_i^* \le \min\left(C\nu,\; C\mu / l\right).   (34)

Representing formula (34) in matrix form, we have

\min_{\alpha,\alpha^* \in R^l} \; \frac{1}{2} \left[\alpha^T, (\alpha^*)^T\right] \begin{bmatrix} Q + E/C & -Q \\ -Q & Q + E/C \end{bmatrix} \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix} + \left[-y^T, y^T\right] \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix}

s.t. \; e^T\left(\alpha + \alpha^*\right) \le C\nu, \quad 0 \le \alpha_i, \alpha_i^* \le \min\left(C\nu,\; C\mu / l\right),   (35)

where Q_{ij} = K(x_i, x_j) + 1, E is the identity matrix, e = [1, ..., 1]^T, and α and α* are the Lagrange multipliers, which are nonnegative.

Eq. (35) can be transformed into the compact formulation

\min \; \frac{1}{2} \tilde{\alpha}^T H \tilde{\alpha} + \tilde{y}^T \tilde{\alpha}

s.t. \; e^T\left(\alpha + \alpha^*\right) \le C\nu, \quad 0 \le \alpha_i, \alpha_i^* \le \min\left(C\nu,\; C\mu / l\right),   (36)

where \tilde{\alpha} = \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix}, \; H = \begin{bmatrix} Q + E/C & -Q \\ -Q & Q + E/C \end{bmatrix}, \; \tilde{y} = \begin{bmatrix} -y \\ y \end{bmatrix}.

The output regression function of RW ν-SVM is as follows:

f(x) = \sum_{i=1}^{l} \left(\alpha_i - \alpha_i^*\right) \left( \prod_{j=1}^{d} \psi\!\left(\frac{x_j - x_{i,j}}{a}\right) + 1 \right).   (37)

It is obvious that RW ν-SVM (which has one constraint condition fewer than the standard Wν-SVM) has a more concise dual problem. There is no parameter b in the estimation function Eq. (37), which reduces the complexity of the model.
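With the multipliers α^(*) obtained from the dual (36), the RW ν-SVM output of Eq. (37) can be evaluated without a separate bias term. A minimal sketch (names are ours; the kernel argument can be, e.g., the morlet_wavelet_kernel sketched earlier):

```python
def rw_svm_predict(x, X_train, alpha, alpha_star, kernel):
    # Eq. (37): f(x) = sum_i (alpha_i - alpha_i*) * (K(x_i, x) + 1).
    # The "+1" absorbs the bias b = sum_i (alpha_i - alpha_i*) from Eq. (31),
    # so no explicit b appears in the estimation function.
    return sum((ai - asi) * (kernel(xi, x) + 1.0)
               for ai, asi, xi in zip(alpha, alpha_star, X_train))
```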

2.5. The optimization algorithm for the unknown parameters of the RW ν-SVM model

Determining the unknown parameters of the RW ν-SVM is a complicated process; in fact, it is a multivariable optimization problem in a continuous space. An appropriate parameter combination can improve the degree to which the model approximates the original series. Therefore, it is necessary to select an intelligent algorithm to obtain the optimal parameters of the proposed models. The parameters of RW ν-SVM have a great effect on its generalization performance: an appropriate parameter combination corresponds to high generalization performance of the RW ν-SVM. The PSO algorithm is considered an excellent technique for solving combinatorial optimization problems (Krusienski, 2006; Yamaguchi, 2007). The PSO algorithm, introduced by Kenedy and Eberhart (1995), is used to determine the parameter combination of RW ν-SVM.

Similarly to evolutionary computation techniques, PSO uses a set of particles representing potential solutions to the problem under consideration. The swarm consists of m particles; each particle has a position X_i = {x_{i1}, x_{i2}, ..., x_{im}} and a velocity V_i = {v_{i1}, v_{i2}, ..., v_{im}}, and moves through an n-dimensional search space. According to the global variant of the PSO algorithm, each particle moves towards its best previous position and towards the best particle g in the swarm. Let us denote the best previously visited position of the ith particle (the one giving its best fitness value) as p_c_i = {p_c_{i1}, p_c_{i2}, ..., p_c_{im}}, and the best previously visited position of the swarm (the one giving the best fitness) as p_g = {p_g_1, p_g_2, ..., p_g_n}.

The change of position of each particle from one iteration to the next can be computed from the distance between the current position and its previous best position and the distance between the current position and the best position of the swarm. The updates of velocity and particle position are then obtained using the following equations:

v_{ij}^{k+1} = w\, v_{ij}^{k} + c_1 r_1 \left(p\_c_{ij} - x_{ij}^{k}\right) + c_2 r_2 \left(p\_g_{j} - x_{ij}^{k}\right),   (38)

x_{ij}^{k+1} = x_{ij}^{k} + v_{ij}^{k+1},   (39)

where w is called the inertia weight and is employed to control the impact of the previous history of velocities on the current one. Accordingly, the parameter w regulates the trade-off between the global and local exploration abilities of the swarm. A large inertia weight facilitates global exploration, while a small one tends to facilitate local exploration. A suitable value of the inertia weight w usually provides balance between the global and local exploration abilities and consequently reduces the number of iterations required to locate the optimal solution. Here k = 1, 2, ..., K_max denotes the iteration number, c_1 is the cognition learning factor, c_2 is the social learning factor, and r_1 and r_2 are random numbers uniformly distributed in [0, 1].

Thus, each particle flies through potential solutions towards p_c_i^k and p_g^k in a navigated way, while still exploring new areas through the stochastic mechanism to escape from local optima. Since there is no actual mechanism for controlling the velocity of a particle, it is necessary to impose a maximum value V_max on it. If the velocity exceeds this threshold, it is set equal to V_max, which controls the maximum travel distance at each iteration to prevent the particle from flying past good solutions.
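A single update step following Eqs. (38) and (39), with the V_max clamp just described, can be sketched as follows (array shapes and default values are our assumptions):

```python
import numpy as np

def pso_step(x, v, p_c, p_g, w=0.9, c1=2.0, c2=2.0, v_max=1.0):
    # x, v: (n_particles, dim) positions and velocities.
    # p_c: per-particle best positions; p_g: swarm-best position of shape (dim,).
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    v = w * v + c1 * r1 * (p_c - x) + c2 * r2 * (p_g - x)   # Eq. (38)
    v = np.clip(v, -v_max, v_max)                           # V_max limit
    x = x + v                                               # Eq. (39)
    return x, v
```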

     2.6. The intelligence forecasting system

In forecasting product demand series, two of the key problems are how to deal with noise and nonstationarity. A potential solution to these two problems is a mixture of experts (ME) architecture, illustrated in Fig. 3. The ME architecture is generalized into a two-stage architecture to handle the nonstationarity in the data. In the first stage, a mixture of experts including an evolutionary algorithm, partial least squares and k-nearest neighbors compete to optimize the model in the second stage. To evaluate the forecasting capacity of the second-stage model, the fitness function of the ME architecture is designed as follows:

fitness = \frac{1}{l} \sum_{i=1}^{l} \left( \frac{\hat{y}_i - y_i}{y_i} \right)^2,   (40)

where l is the size of the selected sample, \hat{y}_i denotes the forecast value of the selected sample, and y_i is the original data of the selected sample.
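Eq. (40) is a mean squared relative error; in code (a minimal sketch, names ours):

```python
import numpy as np

def fitness(y_pred, y_true):
    # Eq. (40): (1/l) * sum(((y_hat_i - y_i) / y_i)^2) over the selected sample.
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return np.mean(((y_pred - y_true) / y_true) ** 2)
```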


3. Intelligent forecasting method based on RW ν-SVM and PSO

The ME architecture is an intelligent forecasting system that can handle the noise and nonstationarity of time series and construct the nonlinear relation in a high-dimensional space effectively. Following this idea, the particle swarm optimization algorithm can be described as follows:

Algorithm 1

Step (1) Data preparation: the training and testing sets are represented as Tr and Te, respectively.

Step (2) Particle initialization and PSO parameter setting: generate the initial particles. Set the PSO parameters, including the number of particles (n), particle dimension (m), maximal number of iterations (k_max), error limit of the fitness function, velocity limit (V_max), and inertia weight for particle velocity (w). Set the iteration variable k = 0 and perform the training process of Steps 3–7.

Step (3) Set the iteration variable k = k + 1.

Step (4) Compute the fitness function value of each particle. Take each particle's best position so far as its individual extremum point, and take the particle with the minimal fitness value as the global extremum point.

Step (5) Stopping condition check: if a stopping criterion (the predefined maximum number of iterations or the error accuracy of the fitness function) is met, go to Step 7; otherwise, go to the next step.

Step (6) Update the particle positions by formulas (38) and (39) to form the new particle swarm, and go to Step 3.

Step (7) End the training procedure and output the optimal parameters (C, ν, a).
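Putting Steps 1–7 together, Algorithm 1 amounts to a standard PSO loop over candidate (C, ν, a) vectors, here combined with the linearly decreasing inertia weight of Eq. (41) from Section 4. The sketch below is our own; fitness_of stands for a hypothetical callback that trains an RW ν-SVM with a candidate parameter vector and returns the fitness of Eq. (40), and all default values are assumptions:

```python
import numpy as np

def algorithm1(fitness_of, bounds, n_particles=20, k_max=100, c1=2.0, c2=2.0,
               w_max=0.9, w_min=0.1, v_max=1.0, tol=5e-4):
    # bounds: array of shape (3, 2) with [low, high] rows for (C, nu, a).
    lo, hi = bounds[:, 0], bounds[:, 1]
    x = lo + np.random.rand(n_particles, len(lo)) * (hi - lo)     # Step 2
    v = np.zeros_like(x)
    p_c = x.copy()                                                # individual extrema
    p_fit = np.array([fitness_of(p) for p in x])
    g = p_c[np.argmin(p_fit)]                                     # global extremum
    for k in range(1, k_max + 1):                                 # Steps 3-6
        w = w_max - (w_max - w_min) * k / k_max                   # Eq. (41)
        r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
        v = np.clip(w * v + c1 * r1 * (p_c - x) + c2 * r2 * (g - x),
                    -v_max, v_max)                                # Eq. (38) + V_max
        x = np.clip(x + v, lo, hi)                                # Eq. (39), in bounds
        fit = np.array([fitness_of(p) for p in x])                # Step 4
        better = fit < p_fit
        p_c[better], p_fit[better] = x[better], fit[better]
        g = p_c[np.argmin(p_fit)]
        if p_fit.min() <= tol:                                    # Step 5
            break
    return g                                                      # Step 7: (C, nu, a)
```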

On the basis of the RW ν-SVM model, we can summarize a demand forecasting algorithm as follows.

Algorithm 2

Step (1) Initialize the original data by normalization and fuzzification, and then form the training and testing sets.

Step (2) Apply the wavelet transform to the demand series at different scales and select the wavelet function ψ and scale scope a_i that best match the original series.

Step (3) Compute the wavelet kernel function by (16) and construct the QP problem (34) of the RW ν-SVM.

Step (4) Go to Algorithm 1 to get the optimal parameter combination vector (C, ν, a), then solve the optimization problem (36) and obtain the parameters α^(*).

Step (5) For a new demand task, extract the product characteristics and form a set of input variables x.

Step (6) Compute the forecasting result f(x) by (37).

    4. Experiments

To illustrate the proposed intelligent forecasting method, the forecasting of a car demand series is studied. The car is a type of consumption product influenced by macroeconomic factors in the manufacturing system, and its demand behavior is usually driven by many uncertain factors. Some factors with large influencing weights are gathered to develop a factor list, as shown in Table 1. The first four factors are expressed as linguistic information and the last two factors are expressed as numerical data.

In our experiments, the car demand series is selected from past demand records of a typical company. The detailed characteristic data and demand series of these cars compose the corresponding training and testing sample sets. In forecasting the car scale series, six influencing factors, viz., brand famous degree (BF), performance parameter (PP), form beauty (FB), sales experience (SE), dweller deposit (DD) and oil price (OP), are taken into account; the first four factors are linguistic information, and the last two are numerical information. All the linguistic information of the gathered influencing factors is processed with fuzzy logic to form numerical information.

Fig. 3. The intelligence forecasting system based on RW ν-SVM and PSO.

The proposed forecasting model has been implemented in the Matlab 7.1 programming language. The experiments are made on a 1.80 GHz Core(TM)2 CPU personal computer (PC) with 1.0 GB memory under Microsoft Windows XP Professional. Some criteria, such as the mean absolute error (MAE), mean absolute percentage error (MAPE) and mean square error (MSE), are adopted to evaluate the performance of the intelligence forecasting system. The initial parameters of the intelligence forecasting system are given as follows: inertia weight w_0 = 0.9; positive acceleration constants c_1 = c_2 = 2; μ = 1; the fitness accuracy of the normalized samples is equal to 0.0005.

Table 1
Influencing factors of car demand forecast.

Product characteristics     Unit           Expression              Weight
Brand famous degree (BF)    Dimensionless  Linguistic information  0.9
Performance parameter (PP)  Dimensionless  Linguistic information  0.8
Form beauty (FB)            Dimensionless  Linguistic information  0.8
Sales experience (SE)       Dimensionless  Linguistic information  0.5
Dweller deposit (DD)        Dimensionless  Numerical information   0.8
Oil price (OP)              Dimensionless  Numerical information   0.4

Fig. 4. Mexican hat wavelet transform of the demand time series over different scales.

Fig. 5. Morlet wavelet transform of the demand time series over different scales.

The wavelet transforms of the original scale series at different scales are obtained by means of Steps 1 and 2 of Algorithm 2. The candidate wavelet functions are the Morlet, Haar, Mexican hat and Gaussian wavelets. To reduce the length of this paper, only the representative Morlet and Mexican hat wavelet transforms at different scales are given in Figs. 4 and 5. The Mexican hat wavelet transform matches the original demand series best over the scale range from 0.01 to 2 among all the given wavelet transforms.

Therefore, the Mexican hat wavelet is chosen as the kernel function of the RW ν-SVM model, and the three parameter ranges are determined as follows:

\nu \in [0, 1], \quad a \in [0.001, 2] \quad \text{and} \quad C \in \left[ \frac{\max(x_{i,j}) - \min(x_{i,j})}{l} \times 10^{-3},\; \frac{\max(x_{i,j}) - \min(x_{i,j})}{l} \times 10^{3} \right].

The optimal combination of parameters obtained by Algorithm 1 is C = 525.57, ν = 0.82 and a = 0.27. Fig. 6 illustrates the forecasting result for the original car demand series given by Algorithm 2.

To analyze the forecasting capability of the RW ν-SVM model, the comparison models (the wavelet ν-support vector machine with Gaussian loss function (Wg-SVM) and the wavelet ν-support vector machine (Wν-SVM)) are trained on the original demand series, and the forecasting results of each model for the last 12 months (the testing sample) are shown in Table 2. The linear inertia weight of standard PSO is adopted:

w = w_{max} - \frac{w_{max} - w_{min}}{k_{max}}\, k,   (41)

where w_max = 0.9 is the maximal inertia weight, w_min = 0.1 is the minimal inertia weight, and k is the iteration number of the controlling procedure.

To evaluate the forecasting errors of these models, a comparison among the different forecasting approaches is shown in Table 3, which gives the error index distribution of the four models. The indexes (MAE, MAPE and MSE) of the Wg-SVM model are better than those of the Wν-SVM model, and the indexes of RW ν-SVM are better than those of both Wν-SVM and Wg-SVM. It is obvious that the robust loss function can improve the generalization ability of the support vector machine.

The experimental results show that the regression precision of RW ν-SVM is improved by adopting the wavelet kernel and the robust loss function, compared with the Wg-SVM and Wν-SVM models and with the ν-SVM whose kernel function is the Gaussian function, under the same conditions.
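The error indexes in Table 3 follow directly from the testing-sample values in Table 2; for instance, the RW ν-SVM row can be checked with a few lines, using the definitions MAE = (1/l)Σ|ŷ − y|, MAPE = (1/l)Σ|ŷ − y|/y and MSE = (1/l)Σ(ŷ − y)² (our verification sketch):

```python
import numpy as np

real = np.array([2967, 3268, 3300, 1891, 3489, 3544,
                 2708, 1513, 3411, 3672, 3483, 1523])
rw_svm = np.array([2967, 3257, 3286, 1922, 3519, 3472,
                   2734, 1611, 3467, 3610, 3453, 1620])

err = rw_svm - real
print(np.mean(np.abs(err)))         # MAE  ~ 43.9167
print(np.mean(np.abs(err) / real))  # MAPE ~ 0.0194
print(np.mean(err ** 2))            # MSE  ~ 2910.9, reported as 2910
```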

Fig. 6. The car demand forecasting results from the RW ν-SVM model.

Table 2
Comparison of forecasting results from four different models.

Model       1     2     3     4     5     6     7     8     9     10    11    12
Real value  2967  3268  3300  1891  3489  3544  2708  1513  3411  3672  3483  1523
ν-SVM       2971  3240  3269  1964  3439  3448  2754  1661  3489  3587  3433  1669
Wν-SVM      2953  3257  3286  1981  3456  3465  2736  1678  3472  3605  3451  1687
Wg-SVM      2962  3258  3286  1936  3511  3465  2726  1632  3464  3605  3450  1641
RW ν-SVM    2967  3257  3286  1922  3519  3472  2734  1611  3467  3610  3453  1620

Table 3
Error statistics of the four forecasting models.

Model       MAE      MAPE    MSE
ν-SVM       69.5833  0.0309  6662
Wν-SVM      63.1667  0.0303  6673
Wg-SVM      48.5833  0.0223  3822
RW ν-SVM    43.9167  0.0194  2910

5. Conclusion

In this paper, a new version of the wavelet support vector regression machine, named RW ν-SVM, is proposed to set up the nonlinear system of product demand series by integrating wavelet theory, the robust loss function and the ν-SVM. The new forecasting model based on RW ν-SVM and PSO, named PSO RW ν-SVM, is presented to approximate arbitrary demand curves in L2 space. The simulation results indicate that RW ν-SVM provides better forecasting precision for product demand series.

The performance of the RW ν-SVM is evaluated using the car demand data, and the simulation results demonstrate that RW ν-SVM is effective in dealing with uncertain data and hybrid noises. Moreover, the particle swarm optimization algorithm presented here is shown to be suitable for seeking the optimal parameters of the RW ν-SVM.

Compared with Wν-SVM and Wg-SVM, RW ν-SVM has the best indexes (MAE, MAPE and MSE). RW ν-SVM can overcome the "curse of dimensionality" and has some other attractive properties, such as strong learning capability on small samples, good generalization performance under hybrid noises, insensitivity to noise or outliers, and automatic selection of the optimal parameters. Moreover, the wavelet transform can reduce the noise in the data while preserving its detail and resolution. Therefore, in the process of establishing the forecasting models, much of the uncertain information in the scale data is not neglected but considered wholly within the wavelet kernel function. The forecasting accuracy is improved by adopting the wavelet technique.

     Acknowledgements

This research was partly supported by the National Natural Science Foundation of China under Grant 60904043, a research grant funded by the Hong Kong Polytechnic University, the China Postdoctoral Science Foundation (20090451152), the Jiangsu Planned Projects for Postdoctoral Research Funds (0901023C) and the Southeast University Planned Projects for Postdoctoral Research Funds.

    References

Bell, T., Ribar, G., & Verchio, J. (1989). Neural nets vs logistic regression. Presented at the University of Southern California expert systems symposium (Nov.).
Box, G. E. P., & Jenkins, G. M. (1994). Time series analysis: Forecasting and control (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Carbonneau, R., Laframbois, K., & Vahidov, R. (2008). Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184(3), 1140–1154.
Duliba, K. (1991). Contrasting neural nets with regression in predicting performance. In Proceedings of the 24th international conference on system science, Hawaii (Vol. 4, pp. 163–170).
Engle, R. F. (1984). Combining competing forecasts of inflation using a bivariate ARCH model. Journal of Economic Dynamics and Control, 18(2), 151–165.
Gorr, W. L. (1994). Research prospective on neural forecasting. International Journal of Forecasting, 10(1), 1–4.
Hill, T., Connor, M. O., & Remus, W. (1996). Neural network models for time series forecasts. Management Science, 42(7), 1082–1092.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Kenedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE international conference on neural networks (pp. 1942–1948).
Khandoker, A. H., Lai, D. T. H., Begg, R. K., & Palaniswami, M. (2007). Wavelet-based feature extraction for support vector machines for screening balance impairments in the elderly. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 15(4), 587–597.
Krantz, S. G. (1994). Wavelet: Mathematics and application. Boca Raton, FL: CRC.
Krusienski, D. J. (2006). A modified particle swarm optimization algorithm for adaptive filtering. In IEEE international symposium on circuits and systems, Kos, Greece (pp. 137–140).
Kwok, J. T. (1999). Moderating the outputs of support vector machine classifiers. IEEE Transactions on Neural Networks, 10(5), 1018–1031.
Liu, G. Z., & Di, S. L. (1992). Wavelet analysis and application. Xi'an, China: Xidian Univ. Press.
Roy, J., & Cosset, J. (1990). Forecasting country risk ratings using a neural network. In Proceedings of the 23rd international conference on system science, Hawaii (Vol. 4, pp. 327–334).
Tang, Z., Almedia, C., & Fishwick, P. A. (1991). Time series forecasting using neural networks vs. Box–Jenkins methodology. Simulation, 57(5), 303–310.
Tang, Z., & Fishwick, P. A. (1993). Feedforward neural nets as models for time series forecasting. ORSA Journal of Computing, 5(4), 374–385.
Tong, H. (1983). Threshold models in non-linear time series analysis. New York: Springer-Verlag.
Trontl, K., Smuc, T., & Pevec, D. (2007). Support vector regression model for the estimation of γ-ray buildup factors for multi-layer shields. Annals of Nuclear Energy, 34(12), 939–952.
Tuan, D. P., & Lanh, T. T. (1981). On the first-order bilinear time series model. Journal of Applied Probability, 18(3), 617–627.
Vapnik, V. (1995). The nature of statistical learning. New York: Springer.
Widodo, A., & Yang, B. S. (2008). Wavelet support vector machine for induction machine fault diagnosis based on transient current signal. Expert Systems with Applications, 35(1–2), 307–316.
Wohlberg, B., Tartakovsky, D. M., & Guadagnini, A. (2006). Subsurface characterization with support vector machines. IEEE Transactions on Geoscience and Remote Sensing, 44(1), 47–57.
Yamaguchi, T. (2007). Adaptive particle swarm optimization: Self-coordinating mechanism with updating information. In IEEE international conference on systems, man and cybernetics, Taipei, Taiwan (pp. 2303–2308).
Zhang, G. P. (2001). An investigation of neural networks for linear time-series forecasting. Computers and Operations Research, 28(12), 1183–1202.
Zhang, G., Patuwo, E. B., & Hu, M. Y. (1998). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14(1), 35–62.
