
    Experiments with neural networks for modeling of nonlinear dynamical systems:

    Design problems

    Ewa Skubalska-Rafajłowicz

    Wrocław University of Technology, Wrocław, Poland


    Summary

    Introduction

    Preliminaries

    Motivations for using random projections

    Model 1 – NFIR with random projections

    Information matrix for Model 1

    D-optimal inputs for Model 1

    Simulation studies for Model 1

    Brief outline of Model 2 – NARX with random projections

    Experiment design for Model 2

    Conclusions


    Our aims in this lecture include:

    • providing a brief introduction to modeling dynamical systems by neural networks,

    • focusing on neural networks with random projections, on their estimation and on optimal input signals.


    Artificial neural networks at a glance:
    Neural networks are (general) nonlinear black-box structures.

    • Classical neural network architectures (feed-forward structures):

    multi-layer perceptron with one hidden layer and sigmoid activation functions (MLP), radial basis function networks (RBF), orthogonal (activation function) neural networks (wavelet networks, Fourier networks), spline networks.

    • Other classes of NN: support vector machines, Kohonen nets (based on vector quantization), networks based on the Kolmogorov representation theorem (Sprecher networks and others).


    Dynamical neural networks:

    1.  networks with lateral or feedback connections:
    Hopfield networks (associative memory), Elman nets (context sensitive);

    2.  networks with external dynamics –

    NARMAX, NARX, NFIR (see below).

    Aims of modeling dynamical systems:

    • simulation – estimate the outputs when only the inputs are available,

    • prediction – the last observations of the outputs yn, . . . , yn−p are also given.


    Further remarks on NN:

    NN are usually nonlinear in the parameters.

    Parametric vs. nonparametric approach? If we allow a net to grow with the number of observations, neural networks are sufficiently rich to approximate smooth functions. In practice, finite structures are considered, leading to a parametric approach of optimizing the weights.

    Learning (training) == selecting the weights by nonlinear LMS (e.g., using the Levenberg–Marquardt local optimization algorithm).

    Our idea 1: some simplifications may lead to linear-in-the-parameters network structures.


    Further remarks on NN 2:

    By a feed-forward NN we mean a real-valued function on Rd:

    gM(x; w̄, θ̄) = θ0 + Σ_{j=1}^{M} θj ϕ(w̄jT x),

    where x ∈ K ⊂ Rd and K is a compact set.

    The activation function ϕ(t) is a sigmoid function, e.g.:

    logistic – 1/(1 + exp(−t)),
    hyperbolic tangent – tanh(t) = (1 − exp(−2t))/(1 + exp(−2t)),
    arctan – (2/π) arctan(t).
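    For concreteness, here is a minimal NumPy sketch of such a one-hidden-layer network; all weights below are made up purely for illustration, and this is only a sketch, not code from the lecture.

```python
import numpy as np

def logistic(t):
    """Logistic sigmoid 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def g_M(x, W, theta, theta0, phi=np.tanh):
    """One-hidden-layer feed-forward network:
    g_M(x) = theta0 + sum_j theta[j] * phi(W[j] @ x),
    with W of shape (M, d), theta of shape (M,), x of shape (d,)."""
    return theta0 + theta @ phi(W @ x)

# Tiny usage example with made-up weights (d = 3 inputs, M = 5 hidden units).
rng = np.random.default_rng(0)
d, M = 3, 5
W = rng.normal(size=(M, d))
theta, theta0 = rng.normal(size=M), 0.0
print(g_M(rng.normal(size=d), W, theta, theta0, phi=logistic))
```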


    Further remarks on NN 3:

    Universal approximation property (UAP):

    Certain classes of neural network models have been shown to be universal approximators, in the sense that:

    – for every continuous function f : Rd → R,
    – for every compact K ⊂ Rd and every ε > 0,

    a network of appropriate size M and a corresponding set of parameters (weights) w̄, θ̄ exist such that

    sup_{x∈K} |f(x) − gM(w̄, θ̄; x)| < ε.


    Further remarks on NN 4:

    Having a learning sequence (xn, yn), n = 1, 2, . . . , N, the weights w̄, θ̄ are usually selected by minimization of

    Σ_{n=1}^{N} (yn − gM(w̄, θ̄; xn))²   (1)

    w.r.t. w̄, θ̄. This is frequently complicated – due to spurious local minima – requiring an iterative optimization process to search for the global minimum. It can be unreliable when the dimensions of w̄, θ̄ are large, as in modeling dynamical systems.
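    As an illustration only (not the lecture's own code), such a nonlinear least-squares fit can be run with SciPy's Levenberg–Marquardt solver on a small, made-up data set; the tiny network and its parameter packing below are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
d, M, N = 2, 4, 200                      # input dim, hidden units, samples

def unpack(p):
    """Split the flat parameter vector into (W, theta, theta0)."""
    W = p[:M * d].reshape(M, d)
    theta = p[M * d:M * d + M]
    return W, theta, p[-1]

def model(p, X):
    W, theta, theta0 = unpack(p)
    return theta0 + np.tanh(X @ W.T) @ theta

# Made-up learning sequence (x_n, y_n).
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.05 * rng.normal(size=N)

p0 = 0.1 * rng.normal(size=M * d + M + 1)
fit = least_squares(lambda p: model(p, X) - y, p0, method="lm")  # Levenberg-Marquardt
print("residual sum of squares:", np.sum(fit.fun ** 2))
```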


    Motivations for using random projections

    To motivate our further discussion, consider the well-known, simple finite impulse response (FIR) model:

    yn = Σ_{j=1}^{J} αj un−j + εn,   n = 1, 2, . . . , N,   (3)

    where the yn are outputs observed with the noise εn, while the un form an input signal, which is observed (or even designed) in order to estimate the αj. Suppose that our system has a long memory – it needs J ≈ 10³ for adequate modeling, e.g., for chaotic systems. Is it reasonable to estimate ∼10³ parameters, even if N is very large?

    The alternative idea: project the vector of past un onto random directions and select only those projections that are relevant for proper modeling.


    Random projections

    Projection: x → v = Sx, Rd → Rk, k ≪ d.
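    A minimal sketch of this projection step applied to a vector of lagged inputs, with illustrative sizes (r past inputs compressed to K random directions); each direction is normalized to unit length, as assumed later for Model 1.

```python
import numpy as np

rng = np.random.default_rng(2)
r, K = 2000, 50                        # long memory vs. few projections (K << r)
u = rng.normal(size=5000)              # some scalar input signal
n = 3000                               # current time index

u_past = u[n - 1:n - 1 - r:-1]         # [u_{n-1}, u_{n-2}, ..., u_{n-r}]

S = rng.normal(size=(K, r))            # rows are random directions
S /= np.linalg.norm(S, axis=1, keepdims=True)   # normalize each direction
v = S @ u_past                         # K-dimensional vector of projections
print(v.shape)                         # (50,)
```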


    Model 1 – with internal projections

    For T denoting the transposition, define:

    ūn = [un−1, un−2, . . . , un−r]T.

    Above, r ≥ 1 is treated as large (hundreds, say) – we discuss models with long memory.

    Model 1. For n = r+1, r+2, . . . , N,

    yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn,   (4)

    where the inner products s̄kT ūn are the internal projections and the εn are i.i.d. random errors with a Gaussian distribution, zero mean and finite variance: E(εn) = 0, σ² = E(εn²) < ∞.


    Model 1 – assumptions

    • θ̄ = [θ1, θ2, . . . , θK]T – the vector of unknown parameters, to be estimated from the observations (un, yn), n = 1, 2, . . . , N, of inputs un and outputs yn.

    • ϕ : R → [−1, 1] – a sigmoidal function, nondecreasing, with limt→∞ ϕ(t) = 1 and limt→−∞ ϕ(t) = −1, e.g., ϕ(t) = 2 arctan(t)/π. We also admit ϕ(t) = t. When the observations are properly scaled, we may approximate ϕ(t) ≈ t.

    • s̄k = sk/||sk||, where the sk are r × 1 i.i.d. random vectors with E(sk) = 0 and cov(sk) = Ir. The sk are mutually independent of the εn (a simulation sketch under these assumptions follows below).
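    Under these assumptions, data from Model 1 can be generated as in the following sketch; the "true" θk, the sizes and the noise level are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
r, K, N = 200, 10, 1000
phi = lambda t: 2.0 * np.arctan(t) / np.pi          # sigmoidal phi: R -> (-1, 1)

s = rng.normal(size=(K, r))                          # s_k ~ N(0, I_r)
s /= np.linalg.norm(s, axis=1, keepdims=True)        # s_bar_k = s_k / ||s_k||
theta = rng.normal(size=K)                           # "true" theta_k (illustrative)

u = rng.normal(size=N)                               # input signal
y = np.full(N, np.nan)
for n in range(r, N):
    u_bar = u[n - 1::-1][:r]                         # [u_{n-1}, ..., u_{n-r}]
    y[n] = theta @ phi(s @ u_bar) + 0.1 * rng.normal()   # Model 1 output + noise
```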


    Model 1: yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn, with internal projections s̄kT ūn.

    • K is large, but K ≪ r.


    We derive Fisher's information matrix (FIM):

    Case A) the FIM is exact if ϕ(t) = t,
    Case B) the FIM is approximate if ϕ(t) ≈ t.

    Rewrite Model 1 as

    yn = θ̄T x̄n + εn,   (5)

    x̄nT := [ϕ(s̄1T ūn), ϕ(s̄2T ūn), . . . , ϕ(s̄KT ūn)],

    x̄n = ST ūn (with equality in Case A and approximately in Case B),   (6)

    where S := [s̄1, s̄2, . . . , s̄K]. Then, for σ = 1, the FIM is

    MN(S) = Σ_{n=r+1}^{N} x̄n x̄nT = ST (Σ_{n=r+1}^{N} ūn ūnT) S.   (7)
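    A small numerical check of (7) in Case A (ϕ(t) = t), with made-up sizes: the information matrix computed directly from the regressors x̄n coincides with ST(Σ ūn ūnT)S.

```python
import numpy as np

rng = np.random.default_rng(4)
r, K, N = 50, 8, 400
S = rng.normal(size=(r, K))                      # columns are the directions s_bar_k
S /= np.linalg.norm(S, axis=0, keepdims=True)

u = rng.normal(size=N)
U = np.array([u[n - 1::-1][:r] for n in range(r, N)])   # rows are u_bar_n

X = U @ S                                        # Case A regressors x_bar_n = S^T u_bar_n
M_direct = X.T @ X                               # sum_n x_bar_n x_bar_n^T
M_factored = S.T @ (U.T @ U) @ S                 # S^T (sum_n u_bar_n u_bar_n^T) S
print(np.allclose(M_direct, M_factored))         # True
```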


    Averaged FIM (AFIM):

    Define the correlation matrix of the lagged input vectors:

    Ru = lim_{N→∞} (N − r)−1 Σ_{n=r+1}^{N} ūn ūnT,   (8)

    which is well defined for stationary and ergodic sequences. The AFIM is defined as

    M(Ru) = ES [ lim_{N→∞} (N − r)−1 MN(S) ].   (9)

    AFIM – final form:

    M(Ru) = ES [ ST Ru S ].   (10)


    Problem statement:

    The constraint on the input signal power:

    diag. el. [Ru] = lim_{N→∞} N−1 Σ_{i=1}^{N} ui² ≤ 1.   (11)

    Problem statement – D-optimal experiment design for Model 1

    Assume that the class of Models 1 is sufficiently rich to include the unknown system. Motivated by the relationships between the estimation accuracy and the FIM, find Ru*, under the constraint (11), such that

    max_{Ru} Det[M(Ru)] = Det[M(Ru*)].   (12)


    Remarks on averaging w.r.t. S

    As usual when averages are involved, one can consider different ways of combining the averaging with the matrix inversion and the calculation of determinants. The way selected above is manageable. Other possibilities: the minimization, w.r.t. Ru, of either

    Det[ ES( N MN−1(S) ) ],   as N → ∞,   (13)

    or

    ES[ Det( N MN−1(S) ) ],   as N → ∞.   (14)

    The above dilemma is formally very similar to the one that arises when we consider a Bayesian prior on unknown parameters.


    D-optimal experiment design for Model 1

    Result 1

    max_{Ru} Det[M(Ru)], under (11), is attained when all off-diagonal elements of Ru* are zero.

    This follows from the known fact that for a symmetric, positive definite A: Det A ≤ Π_{k=1}^{r} akk, with equality iff A is a diagonal matrix. Selecting Ru* = Ir – the r × r identity matrix – we obtain

    M(Ru*) = ES[ST S] = IK,

    because E(s̄jT s̄k) = 0 for k ≠ j and = 1 for k = j.

    Remark
    For N → ∞, sequences un with Ru = Ir can be generated as i.i.d. N(0, 1). For finite N, pseudorandom binary signals are known for which Ru ≈ Ir.
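    An illustrative check of the Remark: for an i.i.d. N(0, 1) input the sample version of Ru is close to Ir, and ST Ru S then has unit diagonal, with off-diagonal terms that average to zero over S; all sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
r, K, N = 20, 5, 50_000
u = rng.normal(size=N)                                   # i.i.d. N(0, 1) input

U = np.array([u[n - 1::-1][:r] for n in range(r, N)])    # lagged input vectors
Ru = U.T @ U / (N - r)                                   # sample version of (8)
print(np.abs(Ru - np.eye(r)).max())                      # small: Ru is close to I_r

S = rng.normal(size=(r, K))
S /= np.linalg.norm(S, axis=0, keepdims=True)            # normalized random directions
print(S.T @ Ru @ S)   # approx S^T S: unit diagonal, off-diagonals average to 0 over S
```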


    Summary – yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn (internal projections s̄kT ūn).

    1.  Select r ≥ 1 – the expected length of the system memory.

    2.  Generate an input signal un with Ru* = Ir.

    3.  Observe the pairs (un, yn), n = 1, 2, . . . , N.

    4.  Select K (≪ N) random projections s̄k and estimate θ̄ by LSQ.


    Summary (continued) – yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn.

    • Test H0 : θk = 0 for each component of θ̂. Form K – the set of those k for which H0 : θk = 0 is rejected.

    • Re-estimate the θk, k ∈ K, by LSQ – denote the results by θ̃k.

    • Form the final model for prediction:

    ŷn = Σ_{k∈K} θ̃k ϕ(s̄kT ūn)   (16)

    and validate it on data that were not used for its estimation (a code sketch of the whole procedure follows below).
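    A hedged, end-to-end sketch of steps 1–4 and the rejection stage above (LSQ fit, Student's t-test on each θ̂k, re-fit on the retained terms); the "unknown system", the sizes and the thresholds are all illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
r, K, N = 100, 30, 3000
phi = lambda t: 2.0 * np.arctan(t) / np.pi

# Steps 2-3: input with R_u close to I_r, outputs from some "unknown" toy system.
u = rng.normal(size=N)
y = np.array([0.8 * u[n - 1] - 0.5 * u[n - 3] + 0.1 * rng.normal() for n in range(N)])

# Step 4: random projections and a first LSQ estimate of theta.
S = rng.normal(size=(r, K))
S /= np.linalg.norm(S, axis=0, keepdims=True)
idx = np.arange(r, N)
X = phi(np.array([u[n - 1::-1][:r] for n in idx]) @ S)   # regressor matrix
theta_hat, *_ = np.linalg.lstsq(X, y[idx], rcond=None)

# t-test of H0: theta_k = 0, then re-estimate on the retained terms only.
dof = len(idx) - K
sigma2 = np.sum((y[idx] - X @ theta_hat) ** 2) / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
pval = 2 * stats.t.sf(np.abs(theta_hat / se), dof)
keep = pval < 0.05
theta_tilde, *_ = np.linalg.lstsq(X[:, keep], y[idx], rcond=None)
y_hat = X[:, keep] @ theta_tilde            # in practice, validate on fresh data instead
print(f"kept {keep.sum()} of {K} projection terms")
```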


    Remarks on estimating Model 1

    1.  A nice feature of the above method is its computational simplicity.

    2.  It can also be used when the experiment is not active, i.e., when the un are only observed.

    3.  The method can also be applied to estimating a regression function with a large number of regressors – replace ūn by them.

    4.  For testing H0 : θk = 0 we can use the standard Student's t-test.


    Simulation studies – the system behaviour


    Complicated dynamics caused by the PRBS input.



    Sampled output y(nτ) + noise.


    Simulation studies – estimated model:

    yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn,   (19)

    i.e., ϕ(ν) = ν (ϕ(ν) = arctan(0.1 ν) provides very similar results), where

    • K = 50 – the number of random projections,

    • r = 2000 – the number of past inputs (r = dim(ūn)), which are projected by s̄k ∼ N(0, Ir), normalized to unit length.


    Estimated model – one step ahead prediction vs. testing data.

    Estimated model – one step ahead prediction vs. testing data, after rejecting 21 terms with parameters having p-value > 0.05 in Student's t-test (29 terms remain).

    What if a time series is generated from nonlinear and chaotic ODEs?

    Consider the well-known chaotic Lorenz system, perturbed by an (interpolated) PRBS u(t):

    ẋ(t) = 100 u(t) − 5 (x(t) − y(t)),   (20)
    ẏ(t) = x(t) (−z(t) + 26.5) − y(t),   (21)
    ż(t) = x(t) y(t) − z(t).   (22)

    Our aim: select and estimate models in order

    A) to predict x(tn) from χn = x(tn) + εn,
    B) to predict y(tn) from ηn = y(tn) + εn,

    without using the knowledge about (20)–(22).
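    A sketch of simulating (20)–(22) with SciPy; a random ±1 signal held piecewise constant is used here as a crude stand-in for the interpolated PRBS, whose exact form is not specified in the slides, and the noise level is invented.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(7)
dt, n_steps = 0.01, 10_000
u_seq = rng.choice([-1.0, 1.0], size=n_steps)      # crude stand-in for a PRBS

def u(t):
    """Piecewise-constant input, one symbol per sampling interval dt."""
    return u_seq[min(int(t / dt), n_steps - 1)]

def forced_lorenz(t, w):
    x, y, z = w
    return [100.0 * u(t) - 5.0 * (x - y),          # eq. (20)
            x * (-z + 26.5) - y,                   # eq. (21)
            x * y - z]                             # eq. (22)

t_eval = np.arange(n_steps) * dt
sol = solve_ivp(forced_lorenz, (0.0, t_eval[-1]), [1.0, 1.0, 1.0],
                t_eval=t_eval, max_step=dt)
chi = sol.y[0] + 0.1 * rng.normal(size=n_steps)    # noisy x(t_n), case A)
eta = sol.y[1] + 0.1 * rng.normal(size=n_steps)    # noisy y(t_n), case B)
```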

    The system behaviour – phase plot


    One step ahead prediction from the estimated model output vs. testing data (2000 recorded observations).

    Very accurate prediction! So far, so good? Let's consider a real challenge: prediction of the y(t) component.


    Simulated 10⁴ noisy observations: ∼8000 used for learning, ∼2000 for testing the prediction abilities.

    Estimated model – Case B):


    yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn,   (24)

    where

    • K = 200 – the number of random projections,

    • r = 750 – the number of past inputs (r = dim(ūn)), which are projected by s̄k ∼ N(0, Ir), normalized to unit length.

    Note that we use a 2.66 times smaller number of projections than in case A), while r is 30% larger.

    Estimated model – response vs. learning data:


    Case B), the first 1000 observations.

    The fit is not perfect, but retains almost all oscillations.


    Estimated model – response vs. learning data, Case B), the next 2000 observations.


    Estimated model – one step ahead prediction, Case B), all testing data.

    The prediction is far from precise, but it still retains the general shape of the highly chaotic sequence.


    Model 2 – experiment design 2


    Mimicking the proof from Goodwin and Payne (Thm. 6.4.9) and using the assumed properties of S, we obtain:

    Theorem 2.
    Assume ρ(0) > σ and that the unknown system can be adequately described by (25). Select ϕ(ν) = ν. Then Det(M̄) is maximized when A is the unit matrix, which holds if ρ(0) = 1 and ρ(k) = 0 for k > 0, i.e., when the system output is an uncorrelated sequence.

    Model 2 – experiment design 3 – Remarks


    The condition ρ(k) = 0 for k > 0 can formally be ensured by the following minimum variance control law (negative feedback from past yn):

    un = −β0−1 Σ_{l=1}^{L} βl ϕ(s̄lT ȳn) + ηn,   (27)

    where the ηn form a set-point sequence, which should be an i.i.d. sequence with zero mean and variance (ρ(0) − σ)/β0². In practice, the realization of (27) can cause trouble (a not quite adequate model, unknown β, etc.), but random projections are expected to reduce these difficulties, due to their smoothing effect.
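    A sketch of how the feedback law (27) could be evaluated at each step, assuming estimates of β0, the βl and the projection directions are available; every quantity below is hypothetical.

```python
import numpy as np

def control_input(y_past, S_y, beta, beta0, eta_n, phi=np.tanh):
    """Feedback law (27):
    u_n = -(1/beta0) * sum_l beta[l] * phi(S_y[l] @ y_past) + eta_n,
    where y_past = [y_{n-1}, ..., y_{n-p}] and the rows of S_y are the s_bar_l."""
    return -(beta @ phi(S_y @ y_past)) / beta0 + eta_n

# Illustrative call with made-up quantities.
rng = np.random.default_rng(8)
p, L = 20, 4
S_y = rng.normal(size=(L, p))
S_y /= np.linalg.norm(S_y, axis=1, keepdims=True)
beta, beta0 = rng.normal(size=L), 1.5
print(control_input(rng.normal(size=p), S_y, beta, beta0, eta_n=0.3))
```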

    Model 3 = Model 1 + Model 2


    The estimation and rejection procedure for the combined model

    yn = Σ_{l=1}^{L} βl ϕ(s̄lT ȳn) + Σ_{l=L+1}^{L+K} βl ϕ(s̄lT ūn) + εn   (28)

    is the same as above (a sketch of the regressor construction follows below).

    Open problem: design a D-optimal input signal for the estimation of the βl in (28), under an input and/or output power constraint.

    One can expect that it is possible to derive the equivalence theorem, assuming ϕ(ν) = ν.
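    A sketch of assembling the regressor matrix for (28), stacking projections of past outputs and past inputs so that the same LSQ-plus-rejection procedure can be reused; all sizes and signals below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(9)
p, r, L, K, N = 10, 50, 5, 8, 2000
phi = lambda t: 2.0 * np.arctan(t) / np.pi

u = rng.normal(size=N)
y = rng.normal(size=N)                  # placeholder for recorded system outputs

S_y = rng.normal(size=(p, L)); S_y /= np.linalg.norm(S_y, axis=0, keepdims=True)
S_u = rng.normal(size=(r, K)); S_u /= np.linalg.norm(S_u, axis=0, keepdims=True)

start = max(p, r)
rows = []
for n in range(start, N):
    y_bar = y[n - 1::-1][:p]            # past outputs [y_{n-1}, ..., y_{n-p}]
    u_bar = u[n - 1::-1][:r]            # past inputs  [u_{n-1}, ..., u_{n-r}]
    rows.append(np.concatenate([phi(y_bar @ S_y), phi(u_bar @ S_u)]))
X = np.array(rows)                      # (N - start) x (L + K) regressor matrix
beta_hat, *_ = np.linalg.lstsq(X, y[start:], rcond=None)
print(beta_hat.shape)                   # (L + K,)
```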

    Concluding remarks


    1.  Random projections of past inputs and/or outputs have turned out to be a powerful tool for modeling systems with long memory.

    2.  The proposed Model 1 provides very good prediction for linear dynamic systems, while for quickly changing chaotic systems it is able to predict the general shape of the output signal.

    3.  D-optimal input signals can be designed for Models 1 and 2, mimicking the proofs of the classical results for linear systems without projections.

    Concluding remarks 2


    – Despite the similarities of Model 1 to PPR (projection pursuit regression), random projections plus rejection of spurious terms lead to a much simpler estimation procedure.

    – A similar procedure can be used for regression function estimation when we have a large number of candidate terms while the number of observations is not sufficient for their estimation – projections allow us to estimate the common impact of several terms.

    – Model 2 without an input signal can be used for predicting time series such as sunspots.

    PARTIAL BIBLIOGRAPHY

    D. Aeyels, Generic observability of differentiable systems, SIAM J. Control and Optimization, vol. 19, pp. 595–603, 1981.


    D. Aeyels, On the number of samples necessary to achieve observability, Systems and Control Letters, vol. 1, no. 2, pp. 92–94, August 1981.

    M. Casdagli, Nonlinear Prediction of Chaotic Time Series, Physica D, 35(3):335–356, 1989.

    I.J. Leontaritis and S.A. Billings, Input-Output Parametric Models for Nonlinear Systems, Part I: Deterministic Nonlinear Systems, International Journal of Control, 41(2):303–328, 1985.

    A.U. Levin and K.S. Narendra, Control of non-linear dynamical systems using neural networks: controllability and stabilization, IEEE Transactions on Neural Networks, 4:192–206, March 1993.

    A.U. Levin and K.S. Narendra, Recursive identification using feedforward neural networks, International Journal of Control, 61(3):533–547, 1995.


    A. Juditsky, H. Hjalmarsson, A. Benveniste, B. Delyon, L. Ljung, J. Sjöberg and Q. Zhang, Nonlinear Black-box Models in System Identification: Mathematical Foundations, Automatica, 31(12):1725–1750, 1995.

    L. Ljung, System Identification – Theory for the User, Prentice-Hall, NJ, 2nd edition, 1999.

    R. Pintelon and J. Schoukens, System Identification. A Frequency Domain Approach, IEEE Press, New York, 2001.

    T. Söderström and P. Stoica, System Identification, Prentice Hall, Englewood Cliffs, NJ, 1989.

    J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson and A. Juditsky, Non-linear Black-box Modeling in System Identification: a Unified Overview, Automatica, 31:1691–1724, 1995.

    G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems, 2 (1989) 303–314.


    K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2 (1989) 359–366.

    L. Györfi, M. Kohler, A. Krzyżak and H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer, New York, 2002.

    L.K. Jones, A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training, Annals of Statistics, 20:608–613, 1992.

    W.B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Contemporary Mathematics, 26 (1984) 189–206.

    J. Matoušek, On variants of the Johnson–Lindenstrauss lemma, Random Structures and Algorithms, 33(2):142–156, 2008.


    E.J. Candès and T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies?, IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006.

    F. Takens, Detecting strange attractors in fluid turbulence, in: D. Rand and L.S. Young (eds.), Dynamical Systems and Turbulence, Springer, Berlin, 1981.

    J. Stark, Delay embeddings for forced systems. I. Deterministic forcing, Journal of Nonlinear Science, 9 (1999) 255–332.

    J. Stark, D.S. Broomhead, M.E. Davies and J. Huke, Takens Embedding Theorems for Forced and Stochastic Systems, Nonlinear Analysis: Theory, Methods and Applications, 30(8):5303–5314, 1997.
