Lectures 24&25: Recurrent Neural Networks - Sharif


Transcript of Lectures 24&25: Recurrent Neural Networks - Sharif

Page 1: Lectures 24&25: Recurrent Neural Networks - Sharif

In the Name of God

Lectures 24&25: Recurrent Neural Networks

Page 2: Lectures 24&25: Recurrent Neural Networks - Sharif

Sequence Learning

• MLP & RBF networks are static networks, i.e. they learn a mapping from a single input signal to a single output response, for an arbitrarily large number of pairs.
• Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of (signal, sequence) pairs.
• Typically the input signal to a dynamic network is an element of the sequence, and the network then produces the rest of the sequence as its response.
• To learn sequences we need to include some form of memory (short-term memory) in the network.

Page 3: Lectures 24&25: Recurrent Neural Networks - Sharif

History of recurrent neural networks (RNN)

• z-transform
• A single neuron -> an adaptive filter
• Widrow & Hoff (1960): least-mean-square algorithm (LMS) = delta rule
• The need for temporal processing arises in numerous applications [prediction modeling (Box & Jenkins, 1976), noise cancellation (Widrow & Stearns, 1985), adaptive equalization (Proakis, 1989), adaptive control (Narendra & Annaswamy, 1989), system identification (Ljung, 1987)]
• The recurrent network: "Automata Studies", Kleene 1954
• Kalman filter theory (Rudolf E. Kalman, 1960)
• Controllability and observability (Zadeh & Desoer, 1963), (Kailath, 1980), (Sontag, 1990), (Lewis & Syrmos, 1995)

Page 4: Lectures 24&25: Recurrent Neural Networks - Sharif

History of recurrent neural networks (RNN)

• The NARX model (Leontaritis & Billings, 1985)
• The NARX model in the context of neural networks (Chen et al., 1990)
• Recurrent network architectures (Jordan, 1986)
• Omlin and Giles (1996) showed that using second-order recurrent networks the correct classification of temporal sequences of finite length is guaranteed.
• The idea behind back-propagation through time (Minsky & Papert, 1969), Werbos (1974), Rumelhart (1986).
• The real-time recurrent learning algorithm (Williams & Zipser, 1989) <- compare with McBride & Narendra (1965): system identification for tuning the parameters of an arbitrary dynamical system.
• System identification (Ljung, 1987), (Ljung & Glad, 1994)

Page 5: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs: temporal processing

• How do we build time into the operation of a neural network?
• Implicit representation: The temporal structure of the input signal is embedded in the spatial structure of the network.
• Explicit representation: Time is given its own particular representation.
• For a neural network to be dynamic, it must be given memory.
• The primary role of memory is to transform a static network into a dynamic one.

Page 6: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs: temporal processing

• In the implicit form, we assume that the environment from which we collect examples of (input signal, output sequence) is stationary. For the explicit form the environment could be non-stationary, i.e. the network can track the changes in the structure of the signal.
• Implicit Representation of Time: We combine a memory structure in the input layer of the network with a static network model.
• Explicit Representation of Time: We explicitly allow the network to code time, by generalizing the network weights from scalars to vectors, as in TBP (Temporal Backpropagation).

• Typical forms of memories that are used are the Tapped Delay Line and the Gamma Memory family.

Page 7: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs: temporal processing

• Memory may be divided into short-term and long-term.
• Long-term memory is built into a neural network through supervised learning.
• Short-term memory is built through time delays.
• Short-term memory can be implemented in continuous time or in discrete time.
• In the z-transform, z⁻¹ is the unit delay operator.
• We may define a discrete-time memory as a linear time-invariant, single-input multiple-output system (causal, normalized).

• Taps: The junction points, to which the output terminals of the memory are connected, are commonly called taps.

Page 8: Lectures 24&25: Recurrent Neural Networks - Sharif

A sample RNN

• A sample RNN architecture:

• Short-term memory structure:
• Tapped-Delay-Line Memory
• Gamma Memory

Page 9: Lectures 24&25: Recurrent Neural Networks - Sharif

A sample RNN

• So, both nonlinearity and time should be considered while modeling time-series data.
• In the previous architecture:

• Static network that can be of any type such as MLP or RBF accounts for the nonlinearity in the modeling

• The memory provided by delay operators accounts for time

Page 10: Lectures 24&25: Recurrent Neural Networks - Sharif

Short-term memory

• Suppose the input and output are z-transformed.
• Memory depth is defined as the first time moment of the overall impulse response of the memory g_p(n) (generating kernel).
• Memory resolution is defined as the number of taps in the memory structure per unit time.
• p is the order of the memory.
• The most commonly used form of short-term memory is the tapped-delay-line memory.

Page 11: Lectures 24&25: Recurrent Neural Networks - Sharif

Tapped-delay-line memory

• It consists of p unit delay operators.
• The generating kernel is g(n) = δ(n-1), where δ(n) is the unit impulse.

Page 12: Lectures 24&25: Recurrent Neural Networks - Sharif

Gamma memory

• In the gamma memory, each section of the memory structure consists of a feedback loop with a unit delay and an adjustable parameter μ.

• g_p(n) = μ(1-μ)^(n-1), 0 < μ < 2, n ≥ 1; g_p(n) represents a discrete version of the integrand of the gamma function.
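As a concrete illustration (not from the slides), the following minimal numpy sketch builds the two short-term memory structures discussed above: a tapped-delay line of order p, and a gamma memory whose stages are assumed to follow the usual leaky-integrator recursion x_k(n) = (1-μ) x_k(n-1) + μ x_{k-1}(n-1). All names and sizes are illustrative.

```python
import numpy as np

def tapped_delay_line(x, p):
    """Return the memory vector [x(n), x(n-1), ..., x(n-p)] for each n.
    Values before the start of the signal are taken as zero."""
    N = len(x)
    taps = np.zeros((N, p + 1))
    for n in range(N):
        for l in range(p + 1):
            taps[n, l] = x[n - l] if n - l >= 0 else 0.0
    return taps

def gamma_memory(x, p, mu):
    """Gamma memory sketch: each stage is a leaky integrator,
    x_k(n) = (1 - mu) * x_k(n-1) + mu * x_{k-1}(n-1), with x_0(n) = x(n)."""
    N = len(x)
    state = np.zeros((N, p + 1))
    state[:, 0] = x
    for n in range(1, N):
        for k in range(1, p + 1):
            state[n, k] = (1 - mu) * state[n - 1, k] + mu * state[n - 1, k - 1]
    return state

if __name__ == "__main__":
    x = np.sin(0.3 * np.arange(50))
    print(tapped_delay_line(x, p=4).shape)      # (50, 5)
    print(gamma_memory(x, p=4, mu=0.5).shape)   # (50, 5)
```

For μ = 1 each gamma stage reduces to a pure unit delay, so the tapped-delay line is recovered as a special case.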

Page 13: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs: two sample structures

• Network Architectures for Temporal Processing:

• NETtalk: converts English text into English phonemes, devised by Sejnowski and Rosenberg (1987)

• Time-delay Neural Networks (TDNN): A very popular neural network that performs temporal processing

Page 14: Lectures 24&25: Recurrent Neural Networks - Sharif

NETtalk

• NETtalk was the first massively parallel network that converts English text into phonemes (for the purpose of speech synthesis).
• Input layer: 203 sensory nodes
• Hidden layer: 80 neurons
• Output layer: 26 neurons (1 neuron for each English phoneme)
• Activation function: logistic
• Input: a 7-letter window of the text (each letter is encoded)
• As a result, it incorporates "time" in the processing.

Page 15: Lectures 24&25: Recurrent Neural Networks - Sharif

NETtalk

• The performance of NETtalk exhibited some similarities with observed human performance:
• The more words the network learned, the better it was at generalizing and correctly pronouncing new words.
• The performance of the network degraded very slowly as synaptic connections in the network were damaged.
• Relearning after damage to the network was much faster than learning during the original training.

Page 16: Lectures 24&25: Recurrent Neural Networks - Sharif

Time-delay neural networks

• In the explicit time representation, neurons have a spatio-temporal structure, i.e. a synapse arriving at a neuron is not a scalar number but a vector of weights, which is used for convolution of the time-delayed input signal of a previous neuron with the synapse.

• A schematic representation of a neuron is given below:

Page 17: Lectures 24&25: Recurrent Neural Networks - Sharif

Time-delay neural networks

• The time-delay neural network (TDNN) is a multilayer feedforward network.
• Its hidden neurons and output neurons are replicated across time.
• It was devised to capture explicitly the concept of time symmetry as encountered in the recognition of an isolated word (phoneme) using a spectrogram.

Page 18: Lectures 24&25: Recurrent Neural Networks - Sharif

Time-delay neural networks

• The time-delay neural network (TDNN) is a multilayer feedforward network. The following are characteristics of the TDNN for phoneme recognition:
• The input layer consists of 192 (16 by 12) sensory nodes encoding the spectrogram (time-frequency response).
• The hidden layer contains 10 copies of 8 hidden neurons; and the output layer contains 6 copies of 4 output neurons.
• The various replicas of a hidden neuron apply the same set of synaptic weights to narrow (three-time-step) windows of the spectrogram.
• Similarly, the various replicas of an output neuron apply the same set of synaptic weights to narrow (five-time-step) windows of the pseudospectrogram computed by the hidden layer.
• In a performance evaluation involving the use of test data from three speakers, the TDNN achieved an average recognition score of 98.5 percent.

Page 19: Lectures 24&25: Recurrent Neural Networks - Sharif

Focused time-delayed feedforward network (FTDFN)

• Temporal pattern recognition (or modeling time series) requires:
• processing of patterns that evolve over time,
• the response at a particular instant of time depending not only on the present value of the input but also on its past values.
• A nonlinear filter can be built on a static neural network. The network is stimulated through a short-term memory.
• The represented structure can be implemented at the level of a single neuron or a network of neurons.

Page 20: Lectures 24&25: Recurrent Neural Networks - Sharif

Focused neuronal filter

• The entire memory structure is located at the input end of the unit.
• The output of the filter, in response to the input x(n) and its past values x(n-1), ..., x(n-p), is given by

y_j(n) = φ( Σ_{l=0}^{p} w_j(l) x(n-l) + b_j )

where φ(.) is the activation function of neuron j, the w_j(l) are its synaptic weights, and b_j is the bias.
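A minimal sketch of this filter equation, with tanh standing in for the activation φ; the weights and the test signal are illustrative, not taken from the slides.

```python
import numpy as np

def focused_neuronal_filter(x, w, b, phi=np.tanh):
    """y_j(n) = phi( sum_{l=0..p} w[l] * x(n-l) + b ).
    x: 1-D input signal, w: (p+1,) tap weights, b: scalar bias."""
    p = len(w) - 1
    y = np.zeros(len(x))
    for n in range(len(x)):
        taps = np.array([x[n - l] if n - l >= 0 else 0.0 for l in range(p + 1)])
        y[n] = phi(np.dot(w, taps) + b)
    return y

# usage: a sinusoid filtered by a random order-5 neuronal filter
x = np.sin(0.2 * np.arange(100))
w = np.random.default_rng(0).normal(0, 0.1, 6)   # p = 5
print(focused_neuronal_filter(x, w, b=0.0)[:5])
```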

Page 21: Lectures 24&25: Recurrent Neural Networks - Sharif

FTDFN

• A focused time-delayed feedforward network is shown in the figure below.
• It is a more powerful nonlinear filter consisting of:
• a tapped-delay-line memory of order p, and
• a multilayer perceptron.

Page 22: Lectures 24&25: Recurrent Neural Networks - Sharif

Training FTDFN

• To train the filter, the standard backpropagation algorithm can be used.
• At time n, the "temporal pattern" applied to the input layer of the network is the signal vector x(n) = [x(n), x(n-1), ..., x(n-p)]^T.

• x(n) may be viewed as a description of the state of the nonlinear filter at time n.

• An epoch consists of a sequence of states, the number of which is determined by the memory order p and the size N of the training sample.

Page 23: Lectures 24&25: Recurrent Neural Networks - Sharif

An example of FTDFN

• x(n) = sin(n + sin(n²)) is approximated by the focused TDFN.
• Parameters: p = 20, hidden layer = 10 neurons, the activation

function of hidden neurons is logistic, output layer = 1 neuron, activation function of output layer is linear, learning-rate parameter = 0.01, no momentum constant.
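A rough numpy sketch of this experiment, assuming a single logistic hidden layer trained by plain stochastic backpropagation as stated on the slide (p = 20, 10 hidden neurons, linear output, learning rate 0.01, no momentum); the signal length, weight initialization, and number of passes are not given on the slide and are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, hidden, lr = 20, 10, 0.01

n = np.arange(0, 500, dtype=float)
x = np.sin(n + np.sin(n ** 2))

# tapped-delay-line inputs [x(t-1), ..., x(t-p)] and targets x(t)
X = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
d = x[p:]

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

W1, b1 = rng.normal(0, 0.1, (hidden, p)), np.zeros(hidden)
W2, b2 = rng.normal(0, 0.1, hidden), 0.0

for _ in range(20):                       # a few passes over the data
    for xi, di in zip(X, d):
        h = sigmoid(W1 @ xi + b1)
        y = W2 @ h + b2
        e = di - y                        # linear output neuron
        gh = e * W2 * h * (1 - h)         # back-propagated hidden gradient
        W2 += lr * e * h;            b2 += lr * e
        W1 += lr * np.outer(gh, xi); b1 += lr * gh

pred = np.array([W2 @ sigmoid(W1 @ xi + b1) + b2 for xi in X])
print("MSE:", np.mean((pred - d) ** 2))
```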

Page 24: Lectures 24&25: Recurrent Neural Networks - Sharif

Universal myopic mapping theorem

• The block labeled h_i(n) represents multiple convolutions in the time domain, that is, a bank of linear filters operating in parallel.
• The h_i are drawn from a large set of real-valued kernels, each one of which represents the impulse response of a linear filter.
• The block labeled N represents a static (i.e., memoryless) nonlinear feedforward network such as an ordinary MLP.
• The structure is a universal dynamic mapper.

Page 25: Lectures 24&25: Recurrent Neural Networks - Sharif

Universal myopic mapping theorem

• Myopic: unable to see distant objects clearly.
• Theorem: Any shift-invariant myopic dynamic map (i.e. stationary environments) can be uniformly approximated arbitrarily well by a structure consisting of two functional blocks: a bank of linear filters feeding a static neural network.
• The TDFN can be viewed as a special case of the structure considered in this theorem.

• The structure is inherently stable, provided that the linear filters are themselves stable.

Page 26: Lectures 24&25: Recurrent Neural Networks - Sharif

FIR filters

• The interpretation of the focused neuronal filter:
• The combination of:
• unit delay elements and
• associated synaptic weights
• may be viewed as a finite-duration impulse response (FIR) filter of order p.

Page 27: Lectures 24&25: Recurrent Neural Networks - Sharif

Neuronal filter as a nonlinear FIR filter

• The spatio-temporal model of the following figure is referred to as a multiple-input neuronal filter.
• It can also be considered as a distributed neuronal filter, in the sense that the filtering action is distributed across different points in space.

Page 28: Lectures 24&25: Recurrent Neural Networks - Sharif

Distributed TDFN

• The universal myopic mapping theorem provides the mathematical justification for focused TDFNs.
• It is limited to maps that are shift-invariant.
• As a result, focused TDFNs are suitable for use in stationary environments.
• Distributed time-delayed feedforward networks: distributed in the sense that the implicit influence of time is distributed throughout the network.
• The construction of such a network is based on the multiple-input neuronal filter as the spatio-temporal model of a neuron.

Page 29: Lectures 24&25: Recurrent Neural Networks - Sharif

Distributed TDFN

• Let w_ji(l) denote the weight connected to the l-th tap of the FIR filter modeling the synapse that connects the output of neuron i to neuron j. The index l ranges from 0 to p, where p is the order of the FIR filter. According to this model, the signal s_ji(n) appearing at the output of the i-th synapse of neuron j is given by the convolution sum:

s_ji(n) = Σ_{l=0}^{p} w_ji(l) x_i(n-l), where n denotes discrete time.

• It can be rewritten in matrix form by introducing the following definitions for the state vector and weight vector of synapse i:

x_i(n) = [x_i(n), x_i(n-1), ..., x_i(n-p)]^T
w_ji = [w_ji(0), w_ji(1), ..., w_ji(p)]^T

s_ji(n) = w_ji^T x_i(n)

• Summing the contributions of the complete set of m_0 synapses depicted in the model:

v_j(n) = Σ_{i=1}^{m_0} s_ji(n) + b_j = Σ_{i=1}^{m_0} w_ji^T x_i(n) + b_j

y_j(n) = φ(v_j(n))

where v_j(n) denotes the induced local field of neuron j, b_j is the externally applied bias, and φ(.) denotes the nonlinear activation function of the neuron.
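A small numpy sketch of the multiple-input (FIR-synapse) neuron defined by the convolution sums above; tanh stands in for φ, and the signal and weight values are illustrative.

```python
import numpy as np

def fir_synapse_neuron(X, W, b, phi=np.tanh):
    """Multiple-input neuronal filter: each synapse i is an FIR filter of order p.
    X: (m0, N) input signals x_i(n);  W: (m0, p+1) tap weights w_ji(l);
    returns y_j(n) = phi( sum_i sum_l W[i, l] * X[i, n-l] + b )."""
    m0, N = X.shape
    p = W.shape[1] - 1
    y = np.zeros(N)
    for n in range(N):
        v = b
        for i in range(m0):
            for l in range(p + 1):
                if n - l >= 0:
                    v += W[i, l] * X[i, n - l]
        y[n] = phi(v)
    return y

# usage: 3 input signals feeding FIR synapses of order 4
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 50))
W = rng.normal(0, 0.1, (3, 5))
print(fir_synapse_neuron(X, W, b=0.0)[:5])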

Page 30: Lectures 24&25: Recurrent Neural Networks - Sharif

Training

Training a distributed TDFN network:

• We need a supervised learning algorithm.
• The actual response of each neuron in the output layer is compared with a desired response at each time instant.
• The sum of squared errors produced by the network: E(n) = ½ Σ_j e_j²(n), where the index j refers to a neuron in the output layer only,
• e_j(n) is the error signal, e_j(n) = d_j(n) - y_j(n).
• The goal is to minimize a cost function defined as the value of E(n) computed over all time: E_total = Σ_n E(n).

Page 31: Lectures 24&25: Recurrent Neural Networks - Sharif

Temporal Backpropagation

• The approach is based on an approximation to the method of steepest descent.
• ∂E_total/∂w_ji = Σ_n ∂E_total/∂v_j(n) · ∂v_j(n)/∂w_ji
• w_ji(n+1) = w_ji(n) - η ∂E_total/∂v_j(n) · ∂v_j(n)/∂w_ji
→ w_ji(n+1) = w_ji(n) + η δ_j(n) x_i(n), where the local gradient is δ_j(n) = -∂E_total/∂v_j(n).
• The explicit form of the local gradient δ_j(n) depends on whether neuron j lies in the output layer or in a hidden layer of the network.

Page 32: Lectures 24&25: Recurrent Neural Networks - Sharif

Temporal Backpropagation

Neuron j in the output layer:
• It is similar to the original BP algorithm:
• w_ji(n+1) = w_ji(n) + η δ_j(n) x_i(n)
• δ_j(n) = e_j(n) φ'(v_j(n))

Page 33: Lectures 24&25: Recurrent Neural Networks - Sharif

Temporal Backpropagation

Neuron j in a hidden layer:
• w_ji(n+1) = w_ji(n) + η δ_j(n) x_i(n)
• We define A as the set of all neurons whose inputs are fed by neuron j in a forward manner.
• Let v_r(k) denote the induced local field of neuron r that belongs to the set A.
• δ_j(n) = -∂E_total/∂v_j(n)
  = -Σ_{r∈A} Σ_k ∂E_total/∂v_r(k) · ∂v_r(k)/∂v_j(n)
  = Σ_{r∈A} Σ_k δ_r(k) · ∂v_r(k)/∂y_j(n) · ∂y_j(n)/∂v_j(n)
  = φ'(v_j(n)) Σ_{r∈A} Σ_k δ_r(k) · ∂v_r(k)/∂y_j(n)
• We have v_r(k) = Σ_{j=0}^{m_0} Σ_{l=0}^{p} w_rj(l) y_j(k-l).
• Since the above convolution is commutative, v_r(k) = Σ_{j=0}^{m_0} Σ_{l=0}^{p} y_j(l) w_rj(k-l).

Page 34: Lectures 24&25: Recurrent Neural Networks - Sharif

Temporal Backpropagation

• Thus,
∂v_r(k)/∂y_j(n) = w_rj(k-n) ; n ≤ k ≤ n+p
∂v_r(k)/∂y_j(n) = 0 ; otherwise
• and
δ_j(n) = φ'(v_j(n)) Σ_{r∈A} Σ_{l=0}^{p} δ_r(n+l) w_rj(l)
• Let us define ∆_r(n) = [δ_r(n), δ_r(n+1), ..., δ_r(n+p)]^T
• We have
δ_j(n) = φ'(v_j(n)) Σ_{r∈A} ∆_r^T(n) w_rj
w_ji(n+1) = w_ji(n) + η δ_j(n) x_i(n)
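To make the hidden-layer recursion concrete, here is a small sketch (under the sign convention used above) that computes δ_j(n) = φ'(v_j(n)) Σ_{r∈A} ∆_r^T(n) w_rj given the local gradients of the downstream neurons; the array shapes, zero-padding, and tanh derivative are illustrative assumptions.

```python
import numpy as np

def hidden_local_gradient(v_j, delta_r, W_rj, dphi=lambda v: 1.0 - np.tanh(v) ** 2):
    """delta_j(n) = phi'(v_j(n)) * sum_{r in A} Delta_r(n)^T w_rj, where
    Delta_r(n) = [delta_r(n), ..., delta_r(n+p)].
    v_j:     (N,)      induced local fields of hidden neuron j
    delta_r: (R, N+p)  local gradients of the R downstream neurons, zero-padded at the end
    W_rj:    (R, p+1)  FIR synapse weights from neuron j to each downstream neuron r"""
    R, p_plus_1 = W_rj.shape
    p = p_plus_1 - 1
    N = len(v_j)
    delta_j = np.zeros(N)
    for n in range(N):
        s = sum(np.dot(delta_r[r, n:n + p + 1], W_rj[r]) for r in range(R))
        delta_j[n] = dphi(v_j[n]) * s
    return delta_j

# usage with illustrative shapes: 2 downstream neurons, FIR order p = 3, N = 10
rng = np.random.default_rng(0)
print(hidden_local_gradient(rng.standard_normal(10),
                            rng.standard_normal((2, 13)),
                            rng.normal(0, 0.1, (2, 4))).shape)
```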

Page 35: Lectures 24&25: Recurrent Neural Networks - Sharif

Temporal Backpropagation

Page 36: Lectures 24&25: Recurrent Neural Networks - Sharif

Temporal Backpropagation

• The symmetry between the forward propagation of states and the backward propagation of error terms is preserved.
• As a result, the sense of parallel distributed processing is maintained.
• Each unique weight of a synaptic filter is used only once in the calculation of the δs; there is no redundant use of terms as experienced in the instantaneous gradient method.
• We may formulate the causal form of the temporal back-propagation algorithm.

Page 37: Lectures 24&25: Recurrent Neural Networks - Sharif

Dynamically driven RNNs

• Recurrent networks are those with one or more feedback loops.
• The feedback can be of a local or global kind.
• Input-output mapping networks: a recurrent network responds temporally to an externally applied input signal → dynamically driven recurrent network.
• The application of feedback enables recurrent networks to acquire state representations, which makes them suitable devices for such diverse applications as nonlinear prediction and modeling, adaptive equalization, speech processing, plant control, and automobile engine diagnostics.
• Successful early models of recurrent networks are:
• Jordan Network
• Elman Network

Page 38: Lectures 24&25: Recurrent Neural Networks - Sharif

Jordan Network

• The Jordan Network has the structure of an MLP and additional context units. The output neurons feed back to the context neurons in a 1-1 fashion. The context units also feed back to themselves.
• The network is trained by using the backpropagation algorithm.

• A schematic is shown in the following figure:

Page 39: Lectures 24&25: Recurrent Neural Networks - Sharif

Elman Network

• The Elman Network also has the structure of an MLP and additional context units. The hidden neurons feed back to the context neurons in a 1-1 fashion.
• It is also called the Simple Recurrent Network (SRN).
• The network is trained by using the backpropagation algorithm.
• A schematic is shown in the following figure:
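Alongside the schematic, a minimal sketch of one Elman forward step may help: the context units simply hold a copy of the previous hidden activations. The tanh hidden units, linear output, and all sizes here are illustrative assumptions; training (standard BP, treating the context as an extra input at each step) is omitted.

```python
import numpy as np

def elman_step(u, context, W_in, W_ctx, W_out, b_h, b_o):
    """One forward step of an Elman (simple recurrent) network."""
    h = np.tanh(W_in @ u + W_ctx @ context + b_h)   # hidden layer
    y = W_out @ h + b_o                              # output layer (linear here)
    return y, h                                      # new context = current hidden state

# usage: 3 inputs, 5 hidden/context units, 2 outputs (sizes are illustrative)
rng = np.random.default_rng(0)
W_in, W_ctx = rng.normal(0, 0.3, (5, 3)), rng.normal(0, 0.3, (5, 5))
W_out, b_h, b_o = rng.normal(0, 0.3, (2, 5)), np.zeros(5), np.zeros(2)
context = np.zeros(5)
for u in rng.standard_normal((10, 3)):      # a short input sequence
    y, context = elman_step(u, context, W_in, W_ctx, W_out, b_h, b_o)
print(y)
```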

Page 40: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs

• More complex forms of recurrent networks are possible. We can start by extending an MLP as a basic building block.
• Typical paradigms of complex recurrent models are:
• Nonlinear Autoregressive with Exogenous Inputs Network (NARX)
• The State-Space Model
• The Recurrent Multilayer Perceptron (RMLP)

Page 41: Lectures 24&25: Recurrent Neural Networks - Sharif

Input-output recurrent model

• Nonlinear autoregressive with exogenous inputs model (NARX):

y(n+1) = F( y(n), ..., y(n-q+1), u(n), ..., u(n-q+1) )

• The model has a single input that is applied to a tapped-delay-line memory of q units.

• It has a single output that is fed back to the input via another tapped-delay-line memory also of q units.

• The system is indeed single-input and single-output.
• We previously saw the basic structure (without

feedback from the output)!

Page 42: Lectures 24&25: Recurrent Neural Networks - Sharif

Input-output recurrent model

Page 43: Lectures 24&25: Recurrent Neural Networks - Sharif

Input-output recurrent model

• The contents of these two tapped-delay-line memories are used to feed the input layer of the multilayer perceptron.
• u(n): the present value of the model input.
• y(n+1): the corresponding value of the model output.
• The signal vector applied to the input layer of the multilayer perceptron consists of a data window made up as follows:
• present and past values of the input (exogenous inputs),
• delayed values of the output (regressed).

Page 44: Lectures 24&25: Recurrent Neural Networks - Sharif

Input-output recurrent model

• Consider a recurrent network with a single input and a single output.
• y(n+q) = Φ(x(n), u_q(n)), where q is the dimensionality of the system and Φ: R^2q → R.
• Provided that the recurrent network is observable, x(n) = Ω(y_q(n), u_{q-1}(n)), where Ω is another nonlinear mapping.
• y(n+q) = F(y_q(n), u_q(n)), where u_{q-1}(n) is contained in u_q(n) as its first (q-1) elements, and the nonlinear mapping F: R^2q → R takes care of both Φ and Ω.

y(n+1) = F(y(n), ..., y(n-q+1), u(n), ..., u(n-q+1))
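A minimal sketch of running the NARX recursion above, with an untrained single-hidden-layer MLP standing in for F; in practice the MLP would be fitted (e.g. by backpropagation, feeding back either measured outputs or the model's own predictions). All weights and sizes are illustrative.

```python
import numpy as np

def narx_step(y_hist, u_hist, mlp):
    """y(n+1) = F(y(n),...,y(n-q+1), u(n),...,u(n-q+1)).
    y_hist, u_hist: the q most recent outputs/inputs (most recent first);
    mlp: any static network mapping R^{2q} -> R."""
    return mlp(np.concatenate([y_hist, u_hist]))

# a tiny random MLP standing in for F
rng = np.random.default_rng(0)
q = 3
W1, b1 = rng.normal(0, 0.3, (8, 2 * q)), np.zeros(8)
w2, b2 = rng.normal(0, 0.3, 8), 0.0
mlp = lambda z: w2 @ np.tanh(W1 @ z + b1) + b2

u = rng.standard_normal(50)      # exogenous input sequence
y = np.zeros(50)
for n in range(q, 49):
    y[n + 1] = narx_step(y[n - q + 1:n + 1][::-1], u[n - q + 1:n + 1][::-1], mlp)
print(y[-5:])
```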

Page 45: Lectures 24&25: Recurrent Neural Networks - Sharif

State-space model

• The hidden neurons define the state of the network.
• The output of the hidden layer is fed back to the input layer via a bank of unit delays.
• The input layer consists of a concatenation of feedback nodes and source nodes.
• The network is connected to the external environment via the source nodes.
• The order of the model: the number of unit delays used to feed the output of the hidden layer back to the input layer.

x(n+1) = f(x(n), u(n))
y(n) = C x(n)

Page 46: Lectures 24&25: Recurrent Neural Networks - Sharif

State-space model

x(n+1) = f(x(n), u(n))
y(n) = C x(n)

• where f is a suitable nonlinear function characterizing the hidden layer. x is the state vector, as produced by the hidden layer; it has q components. y is the output vector, with p components. The input vector is u, with m components.

Page 47: Lectures 24&25: Recurrent Neural Networks - Sharif

State-space model

• The simple recurrent network (SRN) differs from the main model by:
• replacing the output layer by a nonlinear one, and
• omitting the bank of unit delays at the output.

Page 48: Lectures 24&25: Recurrent Neural Networks - Sharif

State-space model

• The state of a dynamical system is defined as a set of quantities that summarizes all the information about the past behavior of the system that is needed to uniquely describe its future behavior, except for the purely external effects arising from the applied input (excitation).

• Let the q-by-1 vector x(n) denote the state of a nonlinear discrete-time system.

• Let the m-by-1 vector u(n) denote the input applied to the system.
• Let the p-by-1 vector y(n) denote the corresponding output of the system.

Page 49: Lectures 24&25: Recurrent Neural Networks - Sharif

State-space model

• The dynamic behavior of the system (noise-free) is described as:
• x(n+1) = φ(W_a x(n) + W_b u(n)) (the process equation),
• y(n) = C x(n) (the measurement equation),
where W_a is a q-by-q matrix, W_b is a q-by-(m+1) matrix, C is a p-by-q matrix; and φ: R^q → R^q is a diagonal map described by φ: [x_1, x_2, …, x_q]^T → [φ(x_1), φ(x_2), …, φ(x_q)]^T for some memoryless component-wise nonlinearity φ: R → R.
• The spaces R^m, R^q, and R^p are called the input space, state space, and output space → m-input, p-output recurrent model of order q.
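A minimal sketch of iterating the process and measurement equations above. The only added assumption is how the bias is handled: since W_b is q-by-(m+1), a constant 1 is appended to u(n); sizes and weight values are illustrative.

```python
import numpy as np

def state_space_rnn(U, Wa, Wb, C, phi=np.tanh):
    """x(n+1) = phi(Wa x(n) + Wb u(n)),  y(n) = C x(n).
    U: (N, m+1) inputs with a bias component appended; returns Y of shape (N, p)."""
    q = Wa.shape[0]
    x = np.zeros(q)
    Y = []
    for u in U:
        Y.append(C @ x)
        x = phi(Wa @ x + Wb @ u)
    return np.array(Y)

# usage: m = 2 inputs (+ bias), q = 4 states, p = 1 output
rng = np.random.default_rng(0)
Wa = rng.normal(0, 0.4, (4, 4))
Wb = rng.normal(0, 0.4, (4, 3))
C = rng.normal(0, 0.4, (1, 4))
U = np.hstack([rng.standard_normal((20, 2)), np.ones((20, 1))])   # append bias input
print(state_space_rnn(U, Wa, Wb, C).shape)   # (20, 1)
```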

Page 50: Lectures 24&25: Recurrent Neural Networks - Sharif

Recurrent multilayer perceptron

• One or more hidden layers
• Feedback around each layer
• The general structure of a static MLP network
• The output is calculated as follows (assuming that x_I, x_II, and x_O are the first, second, and output layer outputs):
• The functions φ_I(•), φ_II(•), and φ_O(•) denote the activation functions of the corresponding layers.

x_I(n+1) = φ_I(x_I(n), u(n))
x_II(n+1) = φ_II(x_II(n), x_I(n+1)),
...,
x_O(n+1) = φ_O(x_O(n), x_K(n+1))

Page 51: Lectures 24&25: Recurrent Neural Networks - Sharif

Second-order network

• Second-order neuron: When the induced local field v_k is computed using multiplications, we refer to the neuron as a second-order neuron.
• A second-order recurrent network:
• v_k(n) = b_k + Σ_i Σ_j w_kij x_i(n) u_j(n)
• x_k(n+1) = φ(v_k(n)) = 1 / (1 + exp(-v_k(n)))
• The pair {x_i(n), u_j(n)} represents {state, input}, and a positive weight w_kij represents the presence of the transition {state, input} → {next state}, while a negative weight represents the absence of the transition. The state transition is described by δ(x_i, u_j) = x_k.

• Second-order networks are used for representing and learning deterministic finite-state automata (DFA).
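A minimal sketch of the second-order update above, driving the state with one-hot-encoded input symbols as is typical when learning a DFA; the weight values, state size, and input string are illustrative.

```python
import numpy as np

def second_order_step(x, u, W, b):
    """v_k(n) = b_k + sum_ij W[k, i, j] * x_i(n) * u_j(n)
       x_k(n+1) = 1 / (1 + exp(-v_k(n)))"""
    v = b + np.einsum('kij,i,j->k', W, x, u)
    return 1.0 / (1.0 + np.exp(-v))

# usage: 4 state units, 2 input symbols (one-hot encoded)
rng = np.random.default_rng(0)
W, b = rng.normal(0, 1.0, (4, 4, 2)), np.zeros(4)
x = np.array([1.0, 0.0, 0.0, 0.0])      # initial state
for symbol in [0, 1, 1, 0]:             # an input string over {0, 1}
    u = np.eye(2)[symbol]
    x = second_order_step(x, u, W, b)
print(x)
```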

Page 52: Lectures 24&25: Recurrent Neural Networks - Sharif

State-space model

x(n+1) = φ(W_a x(n) + W_b u(n))
y(n) = C x(n)

W_a is a q-by-q matrix, W_b is a q-by-(m+1) matrix, C is a p-by-q matrix;
φ: R^q → R^q is a diagonal map described by
φ: [x_1, x_2, …, x_q]^T → [φ(x_1), φ(x_2), …, φ(x_q)]^T

Page 53: Lectures 24&25: Recurrent Neural Networks - Sharif

More on dynamical systems

• Three prominent features of dynamical systems (such as recurrent neural networks):
• Stability (discussed)
• Controllability: Controllability is concerned with whether or not we can control the dynamic behavior of the recurrent network.
• Observability: Observability is concerned with whether or not we can observe the result of the control applied to the recurrent network.

Page 54: Lectures 24&25: Recurrent Neural Networks - Sharif

Controllability and observability

• A recurrent network is said to be controllable if an initial state is steerable to any desired state within a finite number of time steps.
• A recurrent network is said to be observable if the state of the network can be determined from a finite set of input/output measurements.

• A state x̄ is said to be an equilibrium state if, for an input ū, it satisfies the condition x̄ = φ(W_a x̄ + w_b ū).
• Set x̄ = 0 and ū = 0 → 0 = φ(0).
• Linearize x̄ = φ(W_a x̄ + w_b ū) by expanding it as a Taylor series around x̄ = 0 and ū = 0 and retaining first-order terms:
• Δx(n+1) = φ'(0) W_a Δx(n) + φ'(0) w_b Δu(n), where Δx(n) and Δu(n) are small displacements and the q-by-q matrix φ'(0) is the Jacobian of φ(v) with respect to its argument v.
• With A = φ'(0) W_a and b = φ'(0) w_b, the linearized system is

Δx(n+1) = A Δx(n) + b Δu(n) and Δy(n) = c^T Δx(n)

Page 55: Lectures 24&25: Recurrent Neural Networks - Sharif

Controllability and observability

• The linearized system represented by Δx(n+1) = A Δx(n) + b Δu(n) is controllable if the matrix M_c = [A^{q-1} b, …, A b, b] is of rank q, that is, full rank, because then the linearized process equation above would have a unique solution.
• The matrix M_c is called the controllability matrix of the linearized system.
• In a similar way, Δy(n) = c^T Δx(n) → M_O = [c, A^T c, …, (A^T)^{q-1} c].
• The linearized system represented by Δx(n+1) = A Δx(n) + b Δu(n) and Δy(n) = c^T Δx(n) is observable if the matrix M_O is of rank q, that is, full rank.
• The matrix M_O is called the observability matrix of the linearized system.
• Consider a recurrent network and its linearized version around the origin. If the linearized system is controllable, then the recurrent network is locally controllable around the origin.
• Consider a recurrent network and its linearized version around the origin. If the linearized system is observable, then the recurrent network is locally observable around the origin.
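A small numpy sketch of the rank tests above; the linearization (A, b, c) is random here purely for illustration, whereas in the recurrent-network setting it would come from A = φ'(0) W_a, b = φ'(0) w_b.

```python
import numpy as np

def controllability_matrix(A, b):
    """M_c = [A^{q-1} b, ..., A b, b] for dx(n+1) = A dx(n) + b du(n)."""
    q = A.shape[0]
    return np.column_stack([np.linalg.matrix_power(A, k) @ b for k in range(q - 1, -1, -1)])

def observability_matrix(A, c):
    """M_O = [c, A^T c, ..., (A^T)^{q-1} c] for dy(n) = c^T dx(n)."""
    q = A.shape[0]
    return np.column_stack([np.linalg.matrix_power(A.T, k) @ c for k in range(q)])

# usage: full rank q means locally controllable / observable around the origin
rng = np.random.default_rng(0)
q = 4
A, b, c = rng.standard_normal((q, q)), rng.standard_normal(q), rng.standard_normal(q)
print(np.linalg.matrix_rank(controllability_matrix(A, b)) == q)
print(np.linalg.matrix_rank(observability_matrix(A, c)) == q)
```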

Page 56: Lectures 24&25: Recurrent Neural Networks - Sharif

Nonlinear autoregressive with exogenous input model

• y(n+q) = Φ(x(n), u_q(n))
• x(n) = Ω(y_q(n), u_{q-1}(n))
• y(n+q) = Φ(Ω(y_q(n), u_{q-1}(n)), u_q(n)) = F(y_q(n), u_q(n))
Or
• y(n+q) = F(y(n), …, y(n-q+1), u(n), …, u(n-q+1))

Page 57: Lectures 24&25: Recurrent Neural Networks - Sharif

Nonlinear autoregressive with exogenous input model

NARX network with q = 3 hidden neurons


Page 58: Lectures 24&25: Recurrent Neural Networks - Sharif

Computational power of RNNs

• Theorem I) All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions.
• The Turing machine is a computing device with:
1. control unit
2. linear tape
3. read-write head

Page 59: Lectures 24&25: Recurrent Neural Networks - Sharif

Computational power of RNNs

• Theorem II) NARX networks with one layer of hidden neurons with bounded, one-sided saturated activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.
• Bounded, one-sided saturated (BOSS) activation functions:
1) a ≤ φ(x) ≤ b, a ≠ b, for all x ∈ R.
2) There exist values s and S such that φ(x) = S for all x ≤ s.
3) φ(x_1) ≠ φ(x_2) for some x_1 and x_2.
• → NARX networks with one hidden layer of neurons with BOSS activation functions and a linear output neuron are Turing equivalent.

Page 60: Lectures 24&25: Recurrent Neural Networks - Sharif

Training of RNNs

• The training of recurrent networks (with dynamics in the neurons) can be done with two methods:
• Back-Propagation Through Time
• Real-Time Recurrent Learning
• We can train a recurrent network with either epochwise or continuous training operation. However, an epoch in recurrent networks does not mean the presentation of all learning patterns but rather denotes the length of a single sequence that we use for training. So, an epoch in a recurrent network corresponds to presenting only one pattern to the network.

• At the end of an epoch the network stabilizes.

Page 61: Lectures 24&25: Recurrent Neural Networks - Sharif

Training of RNNs

Epochwise back-propagation through time:
• E_total(n_0, n_1) = ½ Σ_{n=n_0}^{n_1} Σ_{j∈A} e_j²(n)

Truncated back-propagation through time in real-time fashion:
• E(n) = ½ Σ_{j∈A} e_j²(n)
• We save only the relevant history of input data and network state for a fixed number of time steps → the truncation depth.

Real-time recurrent learning (RTRL):
• concatenated input-feedback layer
• processing layer of computational nodes
• e(n) = d(n) - y(n), E_total = Σ_n E(n)

Decoupled extended Kalman filter (DEKF):
• Observing the unknown states based on extended Kalman filtering.

Page 62: Lectures 24&25: Recurrent Neural Networks - Sharif

Training of RNNs: heuristics

• The lexicographic order of training samples should be followed, with the shortest strings of symbols being presented to the network first.
• The training should begin with a small training sample, and then its size should be incrementally increased as the training proceeds.
• The synaptic weights of the network should be updated only if the absolute error on the training sample currently being processed by the network is greater than some prescribed criterion.
• The use of weight decay during training is recommended; weight decay is a crude form of complexity regularization.

Page 63: Lectures 24&25: Recurrent Neural Networks - Sharif

Backpropagation through time

• The backpropagation-through-time (BPTT) algorithm for training an RNN is an extension of the standard BP algorithm.
• Let N denote an RNN required to learn a temporal task, starting from time n_0 all the way up to time n.
• Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N.

Page 64: Lectures 24&25: Recurrent Neural Networks - Sharif

Backpropagation through time

• N* is related to N as follows:
• For each time step in the interval (n_0, n), the network N* has a layer containing K neurons, where K is the number of neurons contained in the network N.
• In every layer of the network N* there is a copy of each neuron in the network N.
• For each time step l ∈ [n_0, n], the synaptic connection from neuron i in layer l to neuron j in layer l+1 of the network N* is a copy of the synaptic connection from neuron i to neuron j in the network N.

Page 65: Lectures 24&25: Recurrent Neural Networks - Sharif

Backpropagation through time

• The following example explains the idea of unfolding:
• We assume that we have a network with two neurons which is unfolded for a number of steps, n:

Page 66: Lectures 24&25: Recurrent Neural Networks - Sharif

Backpropagation through time

• Let the dataset used for training the network be partitioned into independent epochs, with each epoch representing a temporal pattern of interest. Let n_0 denote the start time of an epoch and n_1 denote its end time.
• We can define the following cost function:

E_total(n_0, n_1) = ½ Σ_{n=n_0}^{n_1} Σ_{j∈A} e_j²(n)

• where A is the set of indices j pertaining to those neurons in the network for which desired responses are specified, and e_j(n) is the error signal at the output of such a neuron measured with respect to some desired response.

Page 67: Lectures 24&25: Recurrent Neural Networks - Sharif

Backpropagation through time

• The algorithm proceeds as follows:
1) For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch. The initial state doesn't have to be the same for each epoch of training. Rather, what is important is for the initial state of the new epoch to be different from the state reached by the network at the end of the previous epoch;
2) First, a single forward pass of the data through the network for the interval (n_0, n_1) is performed. The complete record of input data, network state (i.e. synaptic weights), and desired responses over this interval is saved;

Page 68: Lectures 24&25: Recurrent Neural Networks - Sharif

Backpropagation through time

3) A single backward pass over this past record is performed to compute the values of the local gradients

δ_j(n) = -∂E_total(n_0, n_1) / ∂v_j(n)

for all j ∈ A and n_0 < n ≤ n_1. This computation is performed by:

δ_j(n) = φ'(v_j(n)) e_j(n)                                  for n = n_1
δ_j(n) = φ'(v_j(n)) [ e_j(n) + Σ_{k∈A} w_kj δ_k(n+1) ]      for n_0 < n < n_1

where φ'(•) is the derivative of an activation function with respect to its argument, and v_j(n) is the induced local field of neuron j.

The use of the above formula is repeated, starting from time n_1 and working back, step by step, to time n_0; the number of steps involved here is equal to the number of time steps contained in the epoch.

Page 69: Lectures 24&25: Recurrent Neural Networks - Sharif

Backpropagation through time

4) Once the computation of back-propagation has been performed back to time n_0+1, the following adjustment is applied to the synaptic weight w_ji of neuron j:

Δw_ji = -η ∂E_total/∂w_ji = η Σ_{n=n_0+1}^{n_1} δ_j(n) x_i(n-1)

where η is the learning-rate parameter and x_i(n-1) is the input applied to the i-th synapse of neuron j at time n-1.

• There is a potential problem with the method, called the Vanishing Gradients Problem, i.e. the corrections calculated for the weights are not large enough when using methods based on steepest descent.
• However, this is currently a research problem and one has to consult the literature for details.
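The sketch below strings steps 2-4 together for a tiny fully recurrent network in which every neuron receives a desired response (so the set A is all neurons). The network structure, toy task, learning rate, and number of epochs are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def phi(v):  return np.tanh(v)
def dphi(v): return 1.0 - np.tanh(v) ** 2

def bptt_epoch(W, Win, u, d, eta=0.01):
    """One epoch of epochwise BPTT for v(n) = W y(n-1) + Win u(n), y(n) = phi(v(n)).
    u: (T, m) inputs, d: (T, K) desired responses for all K neurons."""
    T, K = d.shape
    y, v = np.zeros((T + 1, K)), np.zeros((T + 1, K))
    for n in range(1, T + 1):                    # step 2: forward pass, record everything
        v[n] = W @ y[n - 1] + Win @ u[n - 1]
        y[n] = phi(v[n])
    e = d - y[1:]                                # error signals e_j(n)
    delta = np.zeros((T + 2, K))
    for n in range(T, 0, -1):                    # step 3: backward pass, n_1 down to n_0+1
        back = e[n - 1] if n == T else e[n - 1] + W.T @ delta[n + 1]
        delta[n] = dphi(v[n]) * back
    # step 4: accumulate delta_j(n) times the signal applied to each synapse
    W   = W   + eta * sum(np.outer(delta[n], y[n - 1]) for n in range(1, T + 1))
    Win = Win + eta * sum(np.outer(delta[n], u[n - 1]) for n in range(1, T + 1))
    return W, Win, float(np.mean(e ** 2))

# toy task: reproduce the (small-amplitude) input on every neuron
rng = np.random.default_rng(0)
W, Win = rng.normal(0, 0.3, (3, 3)), rng.normal(0, 0.3, (3, 1))
u = 0.5 * np.sin(0.5 * np.arange(30)).reshape(-1, 1)
d = np.tile(u, (1, 3))
for _ in range(300):
    W, Win, mse = bptt_epoch(W, Win, u, d)
print("final MSE:", mse)
```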

Page 70: Lectures 24&25: Recurrent Neural Networks - Sharif

So, …,

• Dynamic networks learn sequences, in contrast to the static mappings of MLP and RBF networks.

• Time representation takes place explicitly or implicitly.

• The implicit form includes time-delayed versions of the input vector and the use of a static network model afterwards, or the use of recurrent networks.
• The explicit form uses a generalization of the MLP model where a synapse is now modeled as a weight vector and not as a single number. The synapse activation is no longer the product of the synapse's weight with the output of a previous neuron but rather the inner product of the synapse's weight vector with the time-delayed state vector of the previous neuron.

Page 71: Lectures 24&25: Recurrent Neural Networks - Sharif

So, …,

• The extended MLP networks with explicit temporal structure are trained with the Temporal Backpropagation algorithm.
• The recurrent networks include a number of simple and complex architectures. In the simpler cases we train the networks using the standard Backpropagation algorithm.
• In the more complex cases we first unfold the network in time and then train it using the Backpropagation Through Time algorithm.

Page 72: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs: applications

Stock Price Pattern Analysis using RNN
• Two hidden layers
• The input layer consists of two sets of units
• The first set represents current stock data
• The second set is called the context layer
• It is used to represent the temporal context
• It holds a copy of the first hidden unit's activity level at the previous time step

• K. Kamijo, T. Tanigawa, "Stock Price Pattern Recognition: A Recurrent Neural Network Approach"

Page 73: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs: applications

Active Noise Control
• An acoustic noise source creates undesirable noise in a surrounding area. The goal of the active noise suppression system is to reduce the undesirable noise at a particular location by using a loudspeaker to produce "anti-noise" that attenuates the unwanted noise by destructive interference.
• In order for a system of this type to work effectively, it is critical that the ANC system be able to predict (and then cancel) unwanted sound in the noise control zone.
• L.R. Medsker et al., "Recurrent Neural Network Design and Applications", CRC Press, 2001.

Page 74: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs: applications

Adaptive Robot Control
• RNN-controlled robots have become the subject of interest in a number of fields of research.
• L.R. Medsker et al., "Recurrent Neural Network Design and Applications", CRC Press, 2001.

Page 75: Lectures 24&25: Recurrent Neural Networks - Sharif

RNNs: applications

Adaptive Robot Control
• All network architectures tested.
• Best performance
• L.R. Medsker et al., "Recurrent Neural Network Design and Applications", CRC Press, 2001.

Page 76: Lectures 24&25: Recurrent Neural Networks - Sharif

Reservoir Computing

• Inspired by cortical columns
 Create desired output from a pool of complex dynamics derived from the input!
• Echo state networks
 Jaeger 2001
 Analog neurons
• Liquid state machines
 Maass et al. 2002
 Spiking neurons
• Backpropagation-Decorrelation
 Steil 2004

Page 77: Lectures 24&25: Recurrent Neural Networks - Sharif

Echo State Networks

• ESN
 Vertices: artificial neurons
 Edges: synaptic connections
 Connections are sparse, directed, weighted, recurrent, and random.
• Echo state property  Fading memory

• Update formulas

• Linear learning on the Readout (i.e., output layer)
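A compact sketch of the two items above (the reservoir update and linear learning on the readout), under common ESN assumptions: tanh reservoir units, a sparse random recurrent matrix rescaled to spectral radius below 1 for the echo-state (fading-memory) property, and a ridge-regression readout. The task, sizes, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_in = 100, 1

W = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.1)   # sparse recurrent weights
W *= 0.9 / max(abs(np.linalg.eigvals(W)))                       # spectral radius -> 0.9
W_in = rng.uniform(-0.5, 0.5, (N, n_in))

u = np.sin(0.2 * np.arange(1000)).reshape(-1, 1)   # input signal
d = np.roll(u, -5)                                  # toy task: predict 5 steps ahead

X = np.zeros((len(u), N))                           # collect reservoir states
x = np.zeros(N)
for n in range(len(u)):
    x = np.tanh(W @ x + W_in @ u[n])                # reservoir update formula
    X[n] = x

washout, lam = 100, 1e-6                            # discard transient; ridge parameter
A, Y = X[washout:], d[washout:]
W_out = np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ Y)   # linear readout
print("train MSE:", np.mean((A @ W_out - Y) ** 2))
```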

Page 78: Lectures 24&25: Recurrent Neural Networks - Sharif

Main Question

• Parameters of the reservoir are selected randomly:
 Weights
 Spectral radius
 Sparseness
 Weight distribution
 Transfer functions
• How can a reservoir be optimized for a specific task?

Page 79: Lectures 24&25: Recurrent Neural Networks - Sharif

Tasks

• NARMA (2x2)

• Parameters:
 Scaling factor of input and feedback weights
 Desired spectral radius
 Kernel elements

[Figure: NRMSE versus n (0 to 80); the NRMSE axis spans roughly 0.29 to 0.38, with a minimum of 0.3203.]

Page 80: Lectures 24&25: Recurrent Neural Networks - Sharif

Reading

• S. Haykin, Neural Networks: A Comprehensive Foundation, 2007 (Chapters 13 and 15).