
    Experiments with neural networks for modeling of nonlinear dynamical systems:

    Design problems

    Ewa Skubalska-Rafajłowicz

    Wrocław University of Technology, Wrocław, Poland


    Summary

    Introduction

    Preliminaries

    Motivations for using random projections

    Model 1 – NFIR with random projections

    Information matrix for Model 1

    D-optimal inputs for Model 1

    Simulation studies for Model 1

    Brief outline of Model 2 – NARX with random projections

    Experiment design for Model 2

    Conclusions


    Our aims in this lecture include:

    • providing a brief introduction to modeling dynamical systems by neural networks,

    • focusing on neural networks with random projections, on their estimation and on optimal input signals.


    Artificial neural networks at a glance:
    Neural networks are (general) nonlinear black-box structures.

    • Classical neural network architectures (feed-forward structures):

    multi-layer perceptron with one hidden layer and sigmoid activation functions (MLP), radial basis function networks (RBF), orthogonal (activation function) neural networks (wavelet networks, Fourier networks), spline networks.

    • Other classes of NN: support vector machines, Kohonen nets (based on vector quantization), networks based on the Kolmogorov representation theorem (Sprecher networks and others).


    Dynamical neural networks:

    1.  networks with lateral or feedback connections:
    Hopfield networks (associative memory), Elman nets (context sensitive);

    2.  networks with external dynamics –

    NARMAX, NARX, NFIR (see below).

    Aims of modeling dynamical systems:

    • simulation – estimate the outputs when only the inputs are available,

    • prediction – the last observations of the outputs yn, . . . , yn−p are also given.


    Further remarks on NN:

    NN are usually nonlinear in the parameters.

    Parametric vs. nonparametric approach? If we allow a net to grow with the number of observations, neural networks are sufficiently rich to approximate smooth functions. In practice, finite structures are considered, leading to a parametric approach of optimizing the weights.

    Learning (training) == selecting the weights by nonlinear LMS (e.g., using the Levenberg–Marquardt local optimization algorithm).

    Our idea 1: some simplifications may lead to linear-in-the-parameters network structures.


    Further remarks on NN 2:

    By a feed-forward NN we mean a real-valued function on Rd:

    gM(x; w̄, θ̄) = θ0 + Σ_{j=1}^{M} θj ϕ(w̄jT x),

    where x ∈ K ⊂ Rd and K is a compact set.

    The activation function ϕ(t) is a sigmoid function, e.g.:

    logistic – 1/(1 + exp(−t)),
    hyperbolic tangent – tanh(t) = (1 − exp(−2t))/(1 + exp(−2t)),
    arctan – (2/π) arctan(t).
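    For concreteness, here is a minimal NumPy sketch of such a one-hidden-layer network; all weights below are made up purely for illustration, and this is only a sketch, not code from the lecture.

```python
import numpy as np

def logistic(t):
    """Logistic sigmoid 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def g_M(x, W, theta, theta0, phi=np.tanh):
    """One-hidden-layer feed-forward network:
    g_M(x) = theta0 + sum_j theta[j] * phi(W[j] @ x),
    with W of shape (M, d), theta of shape (M,), x of shape (d,)."""
    return theta0 + theta @ phi(W @ x)

# Tiny usage example with made-up weights (d = 3 inputs, M = 5 hidden units).
rng = np.random.default_rng(0)
d, M = 3, 5
W = rng.normal(size=(M, d))
theta, theta0 = rng.normal(size=M), 0.0
print(g_M(rng.normal(size=d), W, theta, theta0, phi=logistic))
```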


    Further remarks on NN 3:

    Universal approximation property (UAP):

    Certain classes of neural network models have been shown to be universal approximators, in the sense that:

    – for every continuous function f : Rd → R,
    – for every compact K ⊂ Rd and every ε > 0,

    a network of appropriate size M and a corresponding set of parameters (weights) w̄, θ̄ exist such that

    sup_{x∈K} |f(x) − gM(w̄, θ̄; x)| < ε.


    Further remarks on NN 4:

    Having a learning sequence (xn, yn), n = 1, 2, . . . , N, the weights w̄, θ̄ are usually selected by minimization of

    Σ_{n=1}^{N} (yn − gM(w̄, θ̄; xn))²   (1)

    w.r.t. w̄, θ̄. This is frequently complicated – due to spurious local minima – requiring an iterative optimization process to search for the global minimum. It can be unreliable when the dimensions of w̄, θ̄ are large, as in modeling dynamical systems.
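    As an illustration only (not the lecture's own code), such a nonlinear least-squares fit can be run with SciPy's Levenberg–Marquardt solver on a small, made-up data set; the tiny network and its parameter packing below are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
d, M, N = 2, 4, 200                      # input dim, hidden units, samples

def unpack(p):
    """Split the flat parameter vector into (W, theta, theta0)."""
    W = p[:M * d].reshape(M, d)
    theta = p[M * d:M * d + M]
    return W, theta, p[-1]

def model(p, X):
    W, theta, theta0 = unpack(p)
    return theta0 + np.tanh(X @ W.T) @ theta

# Made-up learning sequence (x_n, y_n).
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.05 * rng.normal(size=N)

p0 = 0.1 * rng.normal(size=M * d + M + 1)
fit = least_squares(lambda p: model(p, X) - y, p0, method="lm")  # Levenberg-Marquardt
print("residual sum of squares:", np.sum(fit.fun ** 2))
```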


    Motivations for using random projections

    To motivate our further discussion, consider the well-known, simple finite impulse response (FIR) model:

    yn = Σ_{j=1}^{J} αj un−j + εn,   n = 1, 2, . . . , N,   (3)

    where the yn are outputs observed with the noise εn, while the un form an input signal, which is observed (or even designed) in order to estimate the αj. Suppose that our system has a long memory – it needs J ≈ 10³ for adequate modeling, e.g., for chaotic systems. Is it reasonable to estimate ∼10³ parameters, even if N is very large?

    The alternative idea: project the vector of past un onto random directions and select only those projections that are relevant for proper modeling.


    Random projections

    Projection: x → v = Sx, Rd → Rk, k ≪ d.
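    A minimal sketch of this projection step applied to a vector of lagged inputs, with illustrative sizes (r past inputs compressed to K random directions); each direction is normalized to unit length, as assumed later for Model 1.

```python
import numpy as np

rng = np.random.default_rng(2)
r, K = 2000, 50                        # long memory vs. few projections (K << r)
u = rng.normal(size=5000)              # some scalar input signal
n = 3000                               # current time index

u_past = u[n - 1:n - 1 - r:-1]         # [u_{n-1}, u_{n-2}, ..., u_{n-r}]

S = rng.normal(size=(K, r))            # rows are random directions
S /= np.linalg.norm(S, axis=1, keepdims=True)   # normalize each direction
v = S @ u_past                         # K-dimensional vector of projections
print(v.shape)                         # (50,)
```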


    Model 1 – with internal projections

    For T denoting the transposition, define:

    ūn = [un−1, un−2, . . . , un−r]T.

    Above, r ≥ 1 is treated as large (hundreds, say) – we discuss models with long memory.

    Model 1. For n = r+1, r+2, . . . , N,

    yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn,   (4)

    where the inner products s̄kT ūn are the internal projections and the εn are i.i.d. random errors with a Gaussian distribution, zero mean and finite variance: E(εn) = 0, σ² = E(εn²) < ∞.


    Model 1 – assumptions

    • θ̄ = [θ1, θ2, . . . , θK]T – the vector of unknown parameters, to be estimated from the observations (un, yn), n = 1, 2, . . . , N, of inputs un and outputs yn.

    • ϕ : R → [−1, 1] – a sigmoidal function, nondecreasing, with limt→∞ ϕ(t) = 1 and limt→−∞ ϕ(t) = −1, e.g., ϕ(t) = 2 arctan(t)/π. We also admit ϕ(t) = t. When the observations are properly scaled, we may approximate ϕ(t) ≈ t.

    • s̄k = sk/||sk||, where the sk are r × 1 i.i.d. random vectors with E(sk) = 0 and cov(sk) = Ir. The sk are mutually independent of the εn (a simulation sketch under these assumptions follows below).
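    Under these assumptions, data from Model 1 can be generated as in the following sketch; the "true" θk, the sizes and the noise level are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
r, K, N = 200, 10, 1000
phi = lambda t: 2.0 * np.arctan(t) / np.pi          # sigmoidal phi: R -> (-1, 1)

s = rng.normal(size=(K, r))                          # s_k ~ N(0, I_r)
s /= np.linalg.norm(s, axis=1, keepdims=True)        # s_bar_k = s_k / ||s_k||
theta = rng.normal(size=K)                           # "true" theta_k (illustrative)

u = rng.normal(size=N)                               # input signal
y = np.full(N, np.nan)
for n in range(r, N):
    u_bar = u[n - 1::-1][:r]                         # [u_{n-1}, ..., u_{n-r}]
    y[n] = theta @ phi(s @ u_bar) + 0.1 * rng.normal()   # Model 1 output + noise
```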


    Model 1: yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn, with internal projections s̄kT ūn.

    • K is large, but K ≪ r.


    We derive Fisher's information matrix (FIM):

    Case A) the FIM is exact if ϕ(t) = t,
    Case B) the FIM is approximate if ϕ(t) ≈ t.

    Rewrite Model 1 as

    yn = θ̄T x̄n + εn,   (5)

    x̄nT := [ϕ(s̄1T ūn), ϕ(s̄2T ūn), . . . , ϕ(s̄KT ūn)],

    x̄n = ST ūn (with equality in Case A and approximately in Case B),   (6)

    where S := [s̄1, s̄2, . . . , s̄K]. Then, for σ = 1, the FIM is

    MN(S) = Σ_{n=r+1}^{N} x̄n x̄nT = ST (Σ_{n=r+1}^{N} ūn ūnT) S.   (7)
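    A small numerical check of (7) in Case A (ϕ(t) = t), with made-up sizes: the information matrix computed directly from the regressors x̄n coincides with ST(Σ ūn ūnT)S.

```python
import numpy as np

rng = np.random.default_rng(4)
r, K, N = 50, 8, 400
S = rng.normal(size=(r, K))                      # columns are the directions s_bar_k
S /= np.linalg.norm(S, axis=0, keepdims=True)

u = rng.normal(size=N)
U = np.array([u[n - 1::-1][:r] for n in range(r, N)])   # rows are u_bar_n

X = U @ S                                        # Case A regressors x_bar_n = S^T u_bar_n
M_direct = X.T @ X                               # sum_n x_bar_n x_bar_n^T
M_factored = S.T @ (U.T @ U) @ S                 # S^T (sum_n u_bar_n u_bar_n^T) S
print(np.allclose(M_direct, M_factored))         # True
```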


    Averaged FIM (AFIM):

    Define the correlation matrix of the lagged input vectors:

    Ru = lim_{N→∞} (N − r)−1 Σ_{n=r+1}^{N} ūn ūnT,   (8)

    which is well defined for stationary and ergodic sequences. The AFIM is defined as

    M(Ru) = ES [ lim_{N→∞} (N − r)−1 MN(S) ].   (9)

    AFIM – final form:

    M(Ru) = ES [ ST Ru S ].   (10)


    Problem statement:

    The constraint on the input signal power:

    diag. el. [Ru] = lim_{N→∞} N−1 Σ_{i=1}^{N} ui² ≤ 1.   (11)

    Problem statement – D-optimal experiment design for Model 1

    Assume that the class of Models 1 is sufficiently rich to include the unknown system. Motivated by the relationships between the estimation accuracy and the FIM, find Ru*, under the constraint (11), such that

    max_{Ru} Det[M(Ru)] = Det[M(Ru*)].   (12)


    Remarks on averaging w.r.t. S

    As usual when averages are involved, one can consider different ways of combining the averaging with the matrix inversion and the calculation of determinants. The way selected above is manageable. Other possibilities: the minimization, w.r.t. Ru, of either

    Det[ ES( N MN−1(S) ) ],   as N → ∞,   (13)

    or

    ES[ Det( N MN−1(S) ) ],   as N → ∞.   (14)

    The above dilemma is formally very similar to the one that arises when we consider a Bayesian prior on unknown parameters.


    D-optimal experiment design for Model 1

    Result 1

    max_{Ru} Det[M(Ru)], under (11), is attained when all off-diagonal elements of Ru* are zero.

    This follows from the known fact that for a symmetric, positive definite A: Det A ≤ Π_{k=1}^{r} akk, with equality iff A is a diagonal matrix. Selecting Ru* = Ir – the r × r identity matrix – we obtain

    M(Ru*) = ES[ST S] = IK,

    because E(s̄jT s̄k) = 0 for k ≠ j and = 1 for k = j.

    Remark
    For N → ∞, sequences un with Ru = Ir can be generated as i.i.d. N(0, 1). For finite N, pseudorandom binary signals are known for which Ru ≈ Ir.
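    An illustrative check of the Remark: for an i.i.d. N(0, 1) input the sample version of Ru is close to Ir, and ST Ru S then has unit diagonal, with off-diagonal terms that average to zero over S; all sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
r, K, N = 20, 5, 50_000
u = rng.normal(size=N)                                   # i.i.d. N(0, 1) input

U = np.array([u[n - 1::-1][:r] for n in range(r, N)])    # lagged input vectors
Ru = U.T @ U / (N - r)                                   # sample version of (8)
print(np.abs(Ru - np.eye(r)).max())                      # small: Ru is close to I_r

S = rng.normal(size=(r, K))
S /= np.linalg.norm(S, axis=0, keepdims=True)            # normalized random directions
print(S.T @ Ru @ S)   # approx S^T S: unit diagonal, off-diagonals average to 0 over S
```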


    Summary – yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn (internal projections s̄kT ūn).

    1.  Select r ≥ 1 – the expected length of the system memory.

    2.  Generate an input signal un with Ru* = Ir.

    3.  Observe the pairs (un, yn), n = 1, 2, . . . , N.

    4.  Select K (≪ N) random projections s̄k and estimate θ̄ by LSQ.


    Summary (continued) – yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn.

    • Test H0 : θk = 0 for each component of θ̂. Form K – the set of those k for which H0 : θk = 0 is rejected.

    • Re-estimate the θk, k ∈ K, by LSQ – denote the results by θ̃k.

    • Form the final model for prediction:

    ŷn = Σ_{k∈K} θ̃k ϕ(s̄kT ūn)   (16)

    and validate it on data that were not used for its estimation (a code sketch of the whole procedure follows below).
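    A hedged, end-to-end sketch of steps 1–4 and the rejection stage above (LSQ fit, Student's t-test on each θ̂k, re-fit on the retained terms); the "unknown system", the sizes and the thresholds are all illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
r, K, N = 100, 30, 3000
phi = lambda t: 2.0 * np.arctan(t) / np.pi

# Steps 2-3: input with R_u close to I_r, outputs from some "unknown" toy system.
u = rng.normal(size=N)
y = np.array([0.8 * u[n - 1] - 0.5 * u[n - 3] + 0.1 * rng.normal() for n in range(N)])

# Step 4: random projections and a first LSQ estimate of theta.
S = rng.normal(size=(r, K))
S /= np.linalg.norm(S, axis=0, keepdims=True)
idx = np.arange(r, N)
X = phi(np.array([u[n - 1::-1][:r] for n in idx]) @ S)   # regressor matrix
theta_hat, *_ = np.linalg.lstsq(X, y[idx], rcond=None)

# t-test of H0: theta_k = 0, then re-estimate on the retained terms only.
dof = len(idx) - K
sigma2 = np.sum((y[idx] - X @ theta_hat) ** 2) / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
pval = 2 * stats.t.sf(np.abs(theta_hat / se), dof)
keep = pval < 0.05
theta_tilde, *_ = np.linalg.lstsq(X[:, keep], y[idx], rcond=None)
y_hat = X[:, keep] @ theta_tilde            # in practice, validate on fresh data instead
print(f"kept {keep.sum()} of {K} projection terms")
```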


    Remarks on estimating Model 1

    1.  A nice feature of the above method is its computational simplicity.

    2.  It can also be used when the experiment is not active, i.e., when the un are only observed.

    3.  The method can also be applied to estimating a regression function with a large number of regressors – replace ūn by them.

    4.  For testing H0 : θk = 0 we can use the standard Student's t-test.


    Simulation studies – the system behaviour


    Complicated dynamics caused by the PRBS input.



    Sampled output y(nτ) + noise.


    Simulation studies – estimated model:

    yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn,   (19)

    i.e., ϕ(ν) = ν (ϕ(ν) = arctan(0.1 ν) provides very similar results), where

    • K = 50 – the number of random projections,

    • r = 2000 – the number of past inputs (r = dim(ūn)), which are projected by s̄k ∼ N(0, Ir), normalized to unit length.


    Estimated model – one step ahead prediction vs. testing data.

    Estimated model – one step ahead prediction vs. testing data, after rejecting 21 terms with parameters having p-value > 0.05 in Student's t-test (29 terms remain).

    What if a time series is generated from nonlinear and chaotic ODEs?

    Consider the well-known chaotic Lorenz system, perturbed by an (interpolated) PRBS u(t):

    ẋ(t) = 100 u(t) − 5 (x(t) − y(t)),   (20)
    ẏ(t) = x(t) (−z(t) + 26.5) − y(t),   (21)
    ż(t) = x(t) y(t) − z(t).   (22)

    Our aim: select and estimate models in order

    A) to predict x(tn) from χn = x(tn) + εn,
    B) to predict y(tn) from ηn = y(tn) + εn,

    without using the knowledge about (20)–(22).
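    A sketch of simulating (20)–(22) with SciPy; a random ±1 signal held piecewise constant is used here as a crude stand-in for the interpolated PRBS, whose exact form is not specified in the slides, and the noise level is invented.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(7)
dt, n_steps = 0.01, 10_000
u_seq = rng.choice([-1.0, 1.0], size=n_steps)      # crude stand-in for a PRBS

def u(t):
    """Piecewise-constant input, one symbol per sampling interval dt."""
    return u_seq[min(int(t / dt), n_steps - 1)]

def forced_lorenz(t, w):
    x, y, z = w
    return [100.0 * u(t) - 5.0 * (x - y),          # eq. (20)
            x * (-z + 26.5) - y,                   # eq. (21)
            x * y - z]                             # eq. (22)

t_eval = np.arange(n_steps) * dt
sol = solve_ivp(forced_lorenz, (0.0, t_eval[-1]), [1.0, 1.0, 1.0],
                t_eval=t_eval, max_step=dt)
chi = sol.y[0] + 0.1 * rng.normal(size=n_steps)    # noisy x(t_n), case A)
eta = sol.y[1] + 0.1 * rng.normal(size=n_steps)    # noisy y(t_n), case B)
```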

    The system behaviour – phase plot


    One step ahead prediction from the estimated model output vs. testing data (2000 recorded observations).

    Very accurate prediction! So far, so good? Let's consider a real challenge: prediction of the y(t) component.


    Simulated 10⁴ noisy observations: ∼8000 used for learning, ∼2000 for testing the prediction abilities.

    Estimated model – Case B):


    yn = Σ_{k=1}^{K} θk ϕ(s̄kT ūn) + εn,   (24)

    where

    • K = 200 – the number of random projections,

    • r = 750 – the number of past inputs (r = dim(ūn)), which are projected by s̄k ∼ N(0, Ir), normalized to unit length.

    Note that we use a 2.66 times smaller number of projections than in case A), while r is 30% larger.

    Estimated model – response vs. learning data:


    Case B), the first 1000 observations.

    The fit is not perfect, but retains almost all oscillations.


    Estimated model – response vs. learning data, Case B), the next 2000 observations.


    Estimated model – one step ahead prediction, Case B), all testing data.

    The prediction is far from precise, but it still retains the general shape of the highly chaotic sequence.


    Model 2 – experiment design 2


    Mimicking the proof from Goodwin and Payne (Thm. 6.4.9) and using the assumed properties of S, we obtain:

    Theorem 2.
    Assume ρ(0) > σ and that the unknown system can be adequately described by (25). Select ϕ(ν) = ν. Then Det(M̄) is maximized when A is the unit matrix, which holds if ρ(0) = 1 and ρ(k) = 0 for k > 0, i.e., when the system output is an uncorrelated sequence.

    Model 2 – experiment design 3 – Remarks


    The condition ρ(k) = 0 for k > 0 can formally be ensured by the following minimum variance control law (negative feedback from past yn):

    un = −β0−1 Σ_{l=1}^{L} βl ϕ(s̄lT ȳn) + ηn,   (27)

    where the ηn form a set-point sequence, which should be an i.i.d. sequence with zero mean and variance (ρ(0) − σ)/β0². In practice, the realization of (27) can cause trouble (a not quite adequate model, unknown β, etc.), but random projections are expected to reduce these difficulties, due to their smoothing effect.
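    A sketch of how the feedback law (27) could be evaluated at each step, assuming estimates of β0, the βl and the projection directions are available; every quantity below is hypothetical.

```python
import numpy as np

def control_input(y_past, S_y, beta, beta0, eta_n, phi=np.tanh):
    """Feedback law (27):
    u_n = -(1/beta0) * sum_l beta[l] * phi(S_y[l] @ y_past) + eta_n,
    where y_past = [y_{n-1}, ..., y_{n-p}] and the rows of S_y are the s_bar_l."""
    return -(beta @ phi(S_y @ y_past)) / beta0 + eta_n

# Illustrative call with made-up quantities.
rng = np.random.default_rng(8)
p, L = 20, 4
S_y = rng.normal(size=(L, p))
S_y /= np.linalg.norm(S_y, axis=1, keepdims=True)
beta, beta0 = rng.normal(size=L), 1.5
print(control_input(rng.normal(size=p), S_y, beta, beta0, eta_n=0.3))
```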

    Model 3 = Model 1 + Model 2


    The estimation and rejection procedure for the combined model

    yn = Σ_{l=1}^{L} βl ϕ(s̄lT ȳn) + Σ_{l=L+1}^{L+K} βl ϕ(s̄lT ūn) + εn   (28)

    is the same as above (a sketch of the regressor construction follows below).

    Open problem: design a D-optimal input signal for the estimation of the βl in (28), under an input and/or output power constraint.

    One can expect that it is possible to derive the equivalence theorem, assuming ϕ(ν) = ν.
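    A sketch of assembling the regressor matrix for (28), stacking projections of past outputs and past inputs so that the same LSQ-plus-rejection procedure can be reused; all sizes and signals below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(9)
p, r, L, K, N = 10, 50, 5, 8, 2000
phi = lambda t: 2.0 * np.arctan(t) / np.pi

u = rng.normal(size=N)
y = rng.normal(size=N)                  # placeholder for recorded system outputs

S_y = rng.normal(size=(p, L)); S_y /= np.linalg.norm(S_y, axis=0, keepdims=True)
S_u = rng.normal(size=(r, K)); S_u /= np.linalg.norm(S_u, axis=0, keepdims=True)

start = max(p, r)
rows = []
for n in range(start, N):
    y_bar = y[n - 1::-1][:p]            # past outputs [y_{n-1}, ..., y_{n-p}]
    u_bar = u[n - 1::-1][:r]            # past inputs  [u_{n-1}, ..., u_{n-r}]
    rows.append(np.concatenate([phi(y_bar @ S_y), phi(u_bar @ S_u)]))
X = np.array(rows)                      # (N - start) x (L + K) regressor matrix
beta_hat, *_ = np.linalg.lstsq(X, y[start:], rcond=None)
print(beta_hat.shape)                   # (L + K,)
```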

    Concluding remarks


    1.  Random projections of past inputs and/or outputs have turned out to be a powerful tool for modeling systems with long memory.

    2.  The proposed Model 1 provides very good prediction for linear dynamic systems, while for quickly changing chaotic systems it is able to predict the general shape of the output signal.

    3.  D-optimal input signals can be designed for Models 1 and 2, mimicking the proofs of the classical results for linear systems without projections.

    Concluding remarks 2


    – Despite the similarities of Model 1 to PPR (projection pursuit regression), random projections plus rejection of spurious terms lead to a much simpler estimation procedure.

    – A similar procedure can be used for regression function estimation when we have a large number of candidate terms while the number of observations is not sufficient for their estimation – projections allow us to estimate the common impact of several terms.

    – Model 2 without an input signal can be used for predicting time series such as sunspots.

    PARTIAL BIBLIOGRAPHY

    D. Aeyels, Generic observability of differentiable systems, SIAM J. Control and Optimization, vol. 19, pp. 595–603, 1981.


    D. Aeyels, On the number of samples necessary to achieve observability, Systems and Control Letters, vol. 1, no. 2, pp. 92–94, August 1981.

    M. Casdagli, Nonlinear Prediction of Chaotic Time Series, Physica D, 35(3):335–356, 1989.

    I.J. Leontaritis and S.A. Billings, Input-Output Parametric Models for Nonlinear Systems, Part I: Deterministic Nonlinear Systems, International Journal of Control, 41(2):303–328, 1985.

    A.U. Levin and K.S. Narendra, Control of non-linear dynamical systems using neural networks: controllability and stabilization, IEEE Transactions on Neural Networks, 4:192–206, March 1993.

    A.U. Levin and K.S. Narendra, Recursive identification using feedforward neural networks, International Journal of Control, 61(3):533–547, 1995.


    A. Juditsky, H. Hjalmarsson, A. Benveniste, B. Delyon, L. Ljung, J. Sjöberg and Q. Zhang, Nonlinear Black-box Models in System Identification: Mathematical Foundations, Automatica, 31(12):1725–1750, 1995.

    L. Ljung, System Identification – Theory for the User, Prentice-Hall, NJ, 2nd edition, 1999.

    R. Pintelon and J. Schoukens, System Identification. A Frequency Domain Approach, IEEE Press, New York, 2001.

    T. Söderström and P. Stoica, System Identification, Prentice Hall, Englewood Cliffs, NJ, 1989.

    J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson and A. Juditsky, Non-linear Black-box Modeling in System Identification: a Unified Overview, Automatica, 31:1691–1724, 1995.

    G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems, 2 (1989) 303–314.


    K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2 (1989) 359–366.

    L. Györfi, M. Kohler, A. Krzyżak and H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer, New York, 2002.

    L.K. Jones, A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training, Annals of Statistics, 20:608–613, 1992.

    W.B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Contemporary Mathematics, 26 (1984) 189–206.

    J. Matoušek, On variants of the Johnson–Lindenstrauss lemma, Random Structures and Algorithms, 33(2):142–156, 2008.


    E.J. Candès and T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies?, IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006.

    F. Takens, Detecting strange attractors in fluid turbulence, in: D. Rand and L.S. Young (eds.), Dynamical Systems and Turbulence, Springer, Berlin, 1981.

    J. Stark, Delay embeddings for forced systems. I. Deterministic forcing, Journal of Nonlinear Science, 9 (1999) 255–332.

    J. Stark, D.S. Broomhead, M.E. Davies and J. Huke, Takens Embedding Theorems for Forced and Stochastic Systems, Nonlinear Analysis: Theory, Methods and Applications, 30(8):5303–5314, 1997.
