8/14/2019 Lecture Neural Nets 2010
1/148
1
Intelligent ControlReti Neurali Per LidentificazionePredizione Ed Il ontrollo
Lecture 1:
Introduction to Neural Networks
8/14/2019 Lecture Neural Nets 2010
2/148
8/14/2019 Lecture Neural Nets 2010
3/148
3
Topics in Neural Networks
Introduction
8/14/2019 Lecture Neural Nets 2010
4/148
4
Lecture Outline
1. Introduction (2)i. Course introductionii. Introduction to neural networkiii. Issues in Neural network
2. Simple Neural Network (3)i. Perceptronii. Adaline
3. Multilayer Perceptron (4)i. Basicsii. Dynamics
4. Radial Basis Networks (5)
8/14/2019 Lecture Neural Nets 2010
5/148
5
Introduction to Neural Networks
8/14/2019 Lecture Neural Nets 2010
6/148
8/14/2019 Lecture Neural Nets 2010
7/1487
Artificial Neuronbias
i
j
neti =jwijyj + b
8/14/2019 Lecture Neural Nets 2010
8/1488
Artificial Neuron
Input/Output Signal may be.
Real value.
Unipolar {0, 1}. Bipolar {-1, +1}.
Weight : wijstrength of connection.
Note that wijrefers to the weight fromunitj to unit i(not the other way round).
8/14/2019 Lecture Neural Nets 2010
9/1489
Artificial Neuron
The bias bis a constant that can be written as
wi0y0with y0 = band wi0 = 1 such that
The function f is the units activation function.
In the simplest case, f is the identity function,
and the units output is just its net input. This
is called a linear unit.
Other activation functions are : step function,
sigmoid function and Gaussian function.
n
j
jiji ywnet
0
8/14/2019 Lecture Neural Nets 2010
10/14810
Funes de Ativao
8/14/2019 Lecture Neural Nets 2010
11/14811
Artificial Neural Networks (ANN)
Inp
utvector
weight
weight
Activation
function
Ou
tput(vector)
Activation
function
Signal
routing
8/14/2019 Lecture Neural Nets 2010
12/14812
Historical Development of ANN
William James (1890) : Describes in words and figuressimple distributed networks and Hebbian learning
McCulloch & Pitts (1943) : Binary threshold units that
perform logical operations (they proof universalcomputation)
Hebb (1949) : formulation of a physiological (local)learning rule
Roseblatt (1958) : The perceptrona first real learningmachine
Widrow & Hoff (1960) : ADALINE and the Widrow-Hoffsupervised learning rule
8/14/2019 Lecture Neural Nets 2010
13/14813
Historical Development of ANN
Kohonen (1982) : Self-organizing maps
Hopfield (1982): Hopfield Networks
Rumelhart, Hinton & Williams (1986) :Back-propagation & Multilayer Perceptron
Broomhead & Lowe (1988) : Radial
basis functions (RBF)
Vapnik (1990) -- support vector machine
8/14/2019 Lecture Neural Nets 2010
14/148
14
WhenShouldANN Solution BeConsidered ?
The solution to the problem cannot be explicitly described
by an algorithm, a set of equations, or a set of rules.
There is some evidence that an input-output mapping exists
between a set of input and output variables.
There should be a large amount of data available to train
the network.
8/14/2019 Lecture Neural Nets 2010
15/148
15
ProblemsThat Can Lead to PoorPerformance ?
The network has to distinguish between very similar cases
with a very high degree of accuracy.
The train data does not represent the ranges of cases that
the network will encounter in practice.
The network has a several hundred inputs.
The main discriminating factors are not present in the
available data. E.g. trying to assess the loan applicationwithout having knowledge of the applicant's salaries.
The network is required to implement a very complex
function.
8/14/2019 Lecture Neural Nets 2010
16/148
16
Applicationsof Artificial Neural Networks
Manufacturing : fault diagnosis, fraud detection.
Retailing : fraud detection, forecasting, data
mining. Finance : fraud detection, forecasting, data mining.
Engineering : fault diagnosis, signal/imageprocessing.
Production : fault diagnosis, forecasting.
Sales & Marketing : forecasting, data mining.
8/14/2019 Lecture Neural Nets 2010
17/148
17
Data Pre-processing
Neural networks very rarelyoperate on the raw
data. An initial pre-processingstage is essential.Some examples are as follows:
Feature extraction of images: For example, the analysis of X-raysrequires pre-processing to extract features which may be of interest
within a specified region. Representing input variables with numbers. For example "+1" is the
person is married, "0" if divorced, and "-1" if single. Another exampleis representing the pixels of an image: 255 = bright white, 0 = black.
To ensure the generalization capability of a neural network, the data
should be encoded in form which allows for interpolation.
8/14/2019 Lecture Neural Nets 2010
18/148
18
Data Pre-processing
Categorical Variable
A categorical variable is a variable that can belong to
one of a number of discrete categories. For example,
red, green, blue.
Categorical variables are usually encoded using 1 out-of
n coding. e.g. for three colours, red = (1 0 0), green
=(0 1 0) Blue =(0 0 1).
If we used red = 1, green = 2, blue = 3, then this type
of encoding imposes an ordering on the values of the
variables which does not exist.
8/14/2019 Lecture Neural Nets 2010
19/148
19
Data Pre-processing
CONTINUOUS VARIABLES
A continuous variable can be directly applied to
a neural network. However, if the dynamic range
of input variables are not approximately the
same, it is better to normalizeall input variables
of the neural network.
8/14/2019 Lecture Neural Nets 2010
20/148
20
Example of Normalized Input Vector
Input vector : (2 4 5 6 10 4)t
Mean of vector :
Standard deviation :
Normalized vector :
Mean of normalized vector is zero
Standard deviation of normalized vector is unity
167.56
1 6
1
i
ix
714.2)(16
16
1
2
iix
tiN
xx 43.078.131.006.043.017.1
8/14/2019 Lecture Neural Nets 2010
21/148
21
Simple Neural Networks
Lecture 3:
Simple Perceptron
8/14/2019 Lecture Neural Nets 2010
22/148
22
Outlines
The Perceptron
Linearly separable problem
Network structure
Perceptron learning ruleConvergence of Perceptron
8/14/2019 Lecture Neural Nets 2010
23/148
23
The perceptron was a simple model of ANN introducedby Rosenblatt of MIT in the 1960 with the idea oflearning.
Perceptron is designed to accomplish a simple patternrecognition task: after learning with real value training data
{x(i), d(i), i =1,2, , p} where d(i) = 1 or -1
For a new signal (pattern)x(i+1),the perceptron is
capable of telling you to which class the new signalbelongs
x(i+1) perceptron 1 or 1
THE PERCEPTRON
8/14/2019 Lecture Neural Nets 2010
24/148
24
Perceptron
Linear threshold unit (LTU)
x1
x2
xn
.
.
.
w1
w2
wn
w0=b
x0=1
x=i=0nwixi
1 if i=0nwixi >0o(x)=-1 otherwise
o
{
8/14/2019 Lecture Neural Nets 2010
25/148
25
Decision Surface of a Perceptron
+
++
+-
-
-
-x1
x2
+
+-
-
x1
x2
Perceptron is able to represent some useful functions
AND (x1,x2) choose weights w0=-1.5, w1=1, w2=1
But functions that are not linearly separable (e.g.
XOR) are not representable
AND
w0
w2
w1
8/14/2019 Lecture Neural Nets 2010
26/148
26
m
i
ii
m
i
ii xwfbxwfy01
)()(
where f is the hard limiter function i.e.
01
01
1
1m
i
ii
m
iii
bxwif
bxwify
We can always treat the bias bas another weight with
inputs equal 1
Mathematically the Perceptron is
8/14/2019 Lecture Neural Nets 2010
27/148
27
01
m
i
ii bxw
Why is the network capable of solving linearly
separable problem ?
0
1
m
i
ii bxw
01
m
i
ii bxw
8/14/2019 Lecture Neural Nets 2010
28/148
28
Learning rule
An algorithm to update the weights wso that finallythe input patterns lie on both sides of the line decidedby the perceptron
Let tbe the time, at t = 0, we have
0)0( xw
8/14/2019 Lecture Neural Nets 2010
29/148
29
Learning rule
An algorithm to update the weights wso that finallythe input patterns lie on both sides of the line decided by theperceptronLet tbe the time, at t = 1
0)1( xw
8/14/2019 Lecture Neural Nets 2010
30/148
30
Learning rule
An algorithm to update the weights wso that finallythe input patterns lie on both sides of the line decided by theperceptronLet tbe the time, at t = 2
0)2( xw
8/14/2019 Lecture Neural Nets 2010
31/148
31
Learning rule
An algorithm to update the weights wso that finallythe input patterns lie on both sides of the line decided by theperceptronLet tbe the time, at t = 3
0)3( xw
8/14/2019 Lecture Neural Nets 2010
32/148
32
)())]()(()()[()()1(
)(1)(1)(
txtxtwsigntdttwtw
classintxifclassintxiftd
Perceptron learning rule
In Math
Where (t) is the learning rate >0,
+1 if x>0sign(x)= hard limiter function
1 if x
8/14/2019 Lecture Neural Nets 2010
33/148
Exemplo 1
Logical OR function
So lets set w0= -0.6, w1= 0.2 andw2= 0.3.
x0 = 1.
33
X1 X2 R
-1 -1 -1
1 -1 1
-1 1 1
1 1 1
8/14/2019 Lecture Neural Nets 2010
34/148
Exemplo 2
Logical E function
34
X1 X2 R
-1 -1 -1
1 -1 -1
-1 1 -1
1 1 1
8/14/2019 Lecture Neural Nets 2010
35/148
35
In words:
If the classification is right, do not update the
weights
If the classification is not correct, update the
weight towards the opposite direction so that the
output move close to the right directions.
8/14/2019 Lecture Neural Nets 2010
36/148
36
Perceptron convergence theorem(Rosenblatt, 1962)
Let the subsets of training vectors be linearly separable. Thenafter finite steps of learning we have
lim w(t) = w which correctly separate the samples.
The idea of proof is that to consider ||w(t+1)-w||-||w(t)-w||which is a decrease function of t
8/14/2019 Lecture Neural Nets 2010
37/148
37
Summary of Perceptron learning
Variables and parametersx(t) = (m+1) dim. input vectors at time t= ( b, x1 (t), x2 (t), .... , xm (t) )
w(t) = (m+1) dim. weight vectors= ( 1 , w1 (t), .... , wm (t) )
b = bias
y(t) = actual responset = learning rate parameter, a +ve constant < 1d(t) = desired response
8/14/2019 Lecture Neural Nets 2010
38/148
38
Summary of Perceptron learning
Data { (x(i), d(i)), i=1,,p}
Present the data to the network once a point
could be cyclic :(x(1), d(1)), (x(2), d(2)),, (x(p), d(p)),(x(p+1), d(p+1)),
or randomly(Hence we mix time t with i here)
8/14/2019 Lecture Neural Nets 2010
39/148
39
1. Initialization Set w(0)=0. Then perform the followingcomputation for time step t=1,2,...
2. Activation At time step t, activate the perceptron by applying
input vector x(t)and desired response d(t)
3. Computation of actual response Compute the actual responseof the perceptron
y(t) = sign ( w(t) x(t) )
where signis the sign function
4. Adaptation of weight vector Update the weight vector of theperceptron
w(t+1) = w(t)+ t [ d(t) - y(t) ] x(t)5. Continuation
Summary of Perceptron learning (algorithm)
8/14/2019 Lecture Neural Nets 2010
40/148
40
Questions remain
Where or when to stop?
By minimizing the generalization error
For training data{(x(i), d(i)), i=1,p}
How to define training error after t steps of learning?
E(t)= pi=1[d(i)-sign(w(t) .x(i)]
2
8/14/2019 Lecture Neural Nets 2010
41/148
41
After
learning
t steps
E(t) = 0
8/14/2019 Lecture Neural Nets 2010
42/148
42
How to define generalization error?
For a new signal {x(t+1),d(t+1)}, we have
Eg= [d(t+1)-sign (x(t+1) w (t)) ]2.
After
learning
t steps
8/14/2019 Lecture Neural Nets 2010
43/148
43
We next turn to ADALINE learning,
from which we can understandthe learning rule, and more general the
Back-Propagation (BP) learning
8/14/2019 Lecture Neural Nets 2010
44/148
44
Simple Neural Network
ADALINE Learning
8/14/2019 Lecture Neural Nets 2010
45/148
45
Outlines
ADALINE
Gradient descending learning
Modes of training
8/14/2019 Lecture Neural Nets 2010
46/148
46
Unhappy over Perceptron Training
When a perceptron gives the right answer, no
learning takes place
Anything below the threshold is interpreted
as no, even it is just below the threshold.
It might be better to train the neuron basedon how far below the threshold it is.
8/14/2019 Lecture Neural Nets 2010
47/148
47
ADALINEis an acronym for ADAptive LINear Element
(or ADAptive LInear NEuron) developed by Bernard
Widrow and Marcian Hoff (1960).
There are several variations of Adaline. One has
threshold same as perceptron and another just a bare
linear function.
TheAdaline learningrule is also known as the least-
mean-squares (LMS) rule, the delta rule, or the Widrow-Hoff rule.
It is a training rule that minimizes the output error
using (approximate) gradient descent method.
ADALINE
Replace the step function in the perceptron with a
8/14/2019 Lecture Neural Nets 2010
48/148
48
Replace the step function in the perceptron with acontinuous (differentiable) function f, e.g the simplest islinear function
With or without the threshold, theAdalineis trainedbased on the output of the function f rather than the finaloutput.
f (x)
(Adaline)
8/14/2019 Lecture Neural Nets 2010
49/148
49
After each training pattern x(i) is presented, the correction toapply to the weights is proportional to the error.
E (i,t) = [ d(i) f(w(t) x(i)) ] 2 i=1,...,p
N.B. If f is a linear function f(w(t) x(i)) = w(t) x(i)
Summing together, our purpose is to find wwhich minimizes
E (t) = i
E(i,t)
8/14/2019 Lecture Neural Nets 2010
50/148
50
To find gw(t+1) = w(t)+g( E(w(t)) )
so that wautomatically tends to the
global minima of E(w).
w(t+1) = w(t)- E(w(t))t
(see figure below)
General Approach gradient descent method
8/14/2019 Lecture Neural Nets 2010
51/148
51
Gradientdirection is the direction of uphillforexample, in the Figure, at position 0.4, the
gradientis uphill ( F is E, consider one dim case )
Gradient direction
F(0.4)
F
In gradient descent algorithm we have
8/14/2019 Lecture Neural Nets 2010
52/148
52
In gradient descent algorithm, we have
w(t+1) = w(t) F(w(t)) t
therefore the ball goes downhill sinceF(w(t))is downhill direction
Gradient direction
w(t)
I di t d t l ith h
8/14/2019 Lecture Neural Nets 2010
53/148
53
Gradient direction
w(t+1)
In gradient descent algorithm, we have
w(t+1) = w(t) F(w(t)) t
therefore the ball goes downhill sinceF(w(t))is downhill direction
8/14/2019 Lecture Neural Nets 2010
54/148
54
Graduallythe ball will stop at a local minimawhere thegradient is zero
Gradient direction
w(t+k)
8/14/2019 Lecture Neural Nets 2010
55/148
55
In words
Gradient method could be thought of as a ball rolling downfrom a hill: the ball will roll down and finally stop at the valley
Thus, the weights are adjusted by
wj(t+1) = wj(t) +t [d(i) - f(w(t) x(i)) ] xj(i)f
This corresponds to gradient descent on the quadratic errorsurface E
When f =1, we have the perceptron learning rule (we have ingeneral f>0 in neural networks). The ball moves in the rightdirection.
8/14/2019 Lecture Neural Nets 2010
56/148
56
About LMS Learning Rule
Suppose we have a set of input vectors, {x1, X2, . . . . XL},each having its own, perhaps unique, correct or desired outputvalue, dk, k = 1,...,L. The problem of finding a single weightvector that can successfully associate each input vector with itsdesired output value is no longer simple.
Calculation of w*. To begin, let's state the problem a little
differently: Given examples, (x1, d1), (x2, d2),..., (XL,dL), of someprocessing function that associates input vectors, xk, with (ormaps to) the desired output values, dk, what is the best weightvector, w*. Consider that the output can be write in vectornotation:
y = wtx If the actual output value is ykfor the kth input vector, then the
corresponding error term is ek= dkyk
8/14/2019 Lecture Neural Nets 2010
57/148
57
More About LMS
The mean squared error, or expectation value ofthe error, is defined by
L
k
kk
L 1
22 1
where L is the number of input vectors in the training set.We can expand the mean squared error as follows
wpwRwwdwxdwxxwd
t
k
t
kk
t
kk
t
kk
2
22
22
Where R = (xkxkt.), is called the input correlation matrix, and a
vector p = (dkx
k
t)
8/14/2019 Lecture Neural Nets 2010
58/148
58
More
To find the weight vector corresponding to theminimum mean squared error, we differentiate(w), evaluate the result at w*, and set the resultequal to zero:
pRw
pRw
pRw
pRww
w
1*
*
* 022
22
8/14/2019 Lecture Neural Nets 2010
59/148
59
Sequential mode (on-line, stochastic, orper-pattern) :
Weights updated after each pattern ispresented (Perceptron is in this class)
Batch mode(off-line or per-epoch) : Weightsupdated after all patterns are presented
Two types of network training:
Comparison Perceptron and
8/14/2019 Lecture Neural Nets 2010
60/148
60
Comparison Perceptron andGradient Descent Rules
Perceptron learning rule guaranteed to succeed if Training examples are linearly separable
Sufficiently small learning rate
Linear unit training rule uses gradient descentguaranteed to converge to hypothesis with
minimum squared error given sufficiently small
learning rate Even when training data contains noise
Even when training data not separable by Hyperplane
R i f P
8/14/2019 Lecture Neural Nets 2010
61/148
61
Renaissance of Perceptron
Perceptron
Support VectorMachine
Multi-LayerPerceptron
Learning Theory, 90
Back-Propagation, 80
8/14/2019 Lecture Neural Nets 2010
62/148
62
Summary of Previous Lectures
Perceptron
W(t+1)= W(t)+(t) [ d(t) - sign (w(t) . x)] x
Adaline (Gradient descent method)
W(t+1)= W(t)+(t) [ d(t) - f(w(t) . x)] x f
8/14/2019 Lecture Neural Nets 2010
63/148
63
Multi-Layer Perceptron (MLP)
Idea: Credit assignment problem
Problem of assigning credit or blame to
individual elements involving in forming overallresponse of a learning system (hidden units)
In neural networks, problem relates to dividingwhich weights should be altered, by how muchand in which direction.
E l Th l t k
8/14/2019 Lecture Neural Nets 2010
64/148
64
xn
x1
x2
Input Output
Example: Three-layernetworks
Input layer Hidden layer Output layer
Signal routing
P ti f hit t
8/14/2019 Lecture Neural Nets 2010
65/148
65
Properties of architecture
No connections within a layer
No direct connections between input and output layersFully connected between layers
Often more than 2 layers
Number of output units need not equal number of input units
Number of hidden units per layer can be more or less than
input or output units
f w x bi ij j i
j
m
( )1
Each unit is a perceptron
8/14/2019 Lecture Neural Nets 2010
66/148
66
BP(Back Propagation)
8/14/2019 Lecture Neural Nets 2010
67/148
67
MultiLayer Perceptron I
Back Propagating Learning
BP learning algorithm
8/14/2019 Lecture Neural Nets 2010
68/148
68
BP learning algorithmSolution tocredit assignment problemin MLP
Rumelhart, Hinton and Williams (1986)
BP has two phases:
Forward pass phase: computes functional signal,feedforward propagation of input pattern signals throughnetwork
Backward pass phase: computes error signal,propagation of error (difference between actual and desiredoutput values) backwards through network starting at outputunits
BP Learning for Simplest MLP
8/14/2019 Lecture Neural Nets 2010
69/148
69
I
w(t)
W(t)
y
OBP Learning for Simplest MLP
Task: Data {I, d} to minimize
E = (d - o)2/2= [d - f(W(t)y(t))]2/2
= [d - f(W(t)f(w(t)I))]2/2
Error function at the output unit
Weight at time t is w(t) and W(t),
intend to find the weight w and W at time t+1
Where y = f(w(t)I), output of the hidden unit
8/14/2019 Lecture Neural Nets 2010
70/148
8/14/2019 Lecture Neural Nets 2010
71/148
71
yytWfodtW
tdWdf
dfdEtW
tdW
dEtWtW
))((')()(
)()(
)()()1(
Backward Pass Phase
I
w(t)
W(t)
y
O
o = f ( W(t) y )E = (d - o)2/2
B k d h
8/14/2019 Lecture Neural Nets 2010
72/148
72
ytW
yytWfodtW
tdW
df
df
dE
tW
tdW
dEtWtW
)(
))((')()(
)()(
)()()1(
Backward pass phase
I
w(t)
W(t)
y
O
where = ( d-o) f
Backward pass phase
8/14/2019 Lecture Neural Nets 2010
73/148
73
IItwftWtw
tdw
dytWytWfodtw
tdw
dy
dy
dEtw
tdwdEtwtw
))((')()(
)()())((')()(
)(
)(
)()()1(
I
w(t)
W(t)
y
O
Backward pass phase
o= f ( W(t) y )= f ( W(t) f( w(t) I) )
8/14/2019 Lecture Neural Nets 2010
74/148
74
General Two LayerNetworkI inputs, Ooutputs, wconnections for input
units, Wconnections for output units, yis the
activity of input unit
net (t) = network input to the unit at time t
Ww
I O
Input units
Output units
y
Forward pass
8/14/2019 Lecture Neural Nets 2010
75/148
75
Forward pass
Weights are fixed during forward & backward pass at timet
1. Compute values for hidden units
2. compute values for output units
net t w t I t
y f net t
j ji ii
j j
( ) ( ) ( )
( ( ))
Net t W t y
O f Net t
k kj jj
k k
( ) ( )
( ( ))
Ii
wji(t)
Wkj(t)
yj
Ok
Backward Pass
8/14/2019 Lecture Neural Nets 2010
76/148
76
Recall delta rule , error measure for pattern n is
We want to know how to modify weights in order to decrease Ewhere
both for hidden units and output units
This can be rewritten as product of two terms using chain rule
E t d t O tk kk
( ) ( ( ) ( ))
1
22
1
)(
)()()1(
tw
tEtwtw
ij
ijij
8/14/2019 Lecture Neural Nets 2010
77/148
77
)(
)(
)(
)(
)(
)(
tw
tnet
tnet
tE
tw
tE
ij
j
jij
How error for pattern changes as function of changein network input to unitj
How net input to unitjchanges as a function ofchange in weight w
both for hidden units and output units
Term A
Term B
Summaryi ht d t l l
8/14/2019 Lecture Neural Nets 2010
78/148
78
weight updates are local
output unit
hidden unit
)()()()1(
)()()()1(
tyttWtW
tIttwtw
jkkjkj
ijjiji
k
ikjkj
ijjiji
tIWttnetf
tIttwtw
)()())(('
)()()()1(
)())(('))()((
)()()()1(
tytNetftOtd
tyttWtW
jkkk
jkkjkj
Once weight changes are computed for all units, weights areupdated at same time (bias included as weights here)
We now compute the derivative of the activation function f ( ).
(hidden unit)
(output unit)
Activation Functions
8/14/2019 Lecture Neural Nets 2010
79/148
79
Activation Functionsto compute we need to find the derivative of activation
function f
to find derivative the activation function must be smooth
Sigmoidal (logistic) function-common in MLP
where k is a positive constant. The sigmoidal function givesvalue in range of 0 to 1
Input-output function of a neuron (rate coding assumption)
))(exp(1
1))((
tnetktnetf
i
i
Shape of sigmoidal function
8/14/2019 Lecture Neural Nets 2010
80/148
80
Note: when net = 0, f = 0.5
8/14/2019 Lecture Neural Nets 2010
81/148
Returning to local error gradientsin BP algorithm we have
8/14/2019 Lecture Neural Nets 2010
82/148
82
g g gfor output units
For hidden units we have
))(1)(())()((
))(('))()(()(
tOtkOtOtd
tNetftOtdt
iiii
iiii
k
kikii
k
kikii
Wttytky
Wttnetft
)())(1)((
)())((')(
Since degree of weight change is proportional to derivative ofactivation function, weight changes will be greatest when unitsreceives mid-range functional signal than at extremes
Summary of BP learning algorithm
8/14/2019 Lecture Neural Nets 2010
83/148
83
Summary of BP learning algorithm
Set learning rate
Set initial weight values (incl.. biases): w, W
Loop until stopping criteria satisfied:
present input pattern to input units
compute functional signal for hidden unitscompute functional signal for output units
present Target response to output unitscompute error signal for output units
compute error signal for hidden unitsupdate all weights at same timeincrement n to n+1 and select next I and dend loop
8/14/2019 Lecture Neural Nets 2010
84/148
Exercise For the network shown in figure, calculate the expressions for the
weight changes using the EBP algorithm in an on-line learning mode.The training data, consisting of the input pattern vectorsx = [x1x2]and the output desired responses d = [d1d2]
84
Network training: T i i t h t dl til t i it i t
8/14/2019 Lecture Neural Nets 2010
85/148
85
Training set shown repeatedly until stopping criteria are met
Each full presentation of all patterns = epoch
Randomise order of training patterns presented for eachepoch in order to avoid correlation between consecutive
training pairs being learnt (order effects)
Two types of network training:
Sequential mode(on-line, stochastic, or per-pattern)
Weights updated after each pattern is presented
Batch mode(off-line or per -epoch)
Advantages and disadvantages of different
8/14/2019 Lecture Neural Nets 2010
86/148
86
g gmodes
Sequential mode: Less storage for each weighted connection
Random order of presentation and updating per pattern
means search of weight space is stochastic--reducing risk of
local minima able to take advantage of any redundancy intraining set (i.e.. same pattern occurs more than once in
training set, esp. for large training sets)
Simpler to implement
Batch mode: Faster learning than sequential mode
8/14/2019 Lecture Neural Nets 2010
87/148
87
MultiLayer Perceptron II
Dynamics of MultiLayer Perceptron
Summary of Network Training
8/14/2019 Lecture Neural Nets 2010
88/148
88
Forward phase: I(t), w(t),net(t),y(t),W(t),Net(t),O(t)
Backward phase:
Output unit
Input unit
k
ikjkj
ijijji
tItWttnetf
tIttwtw
)()()())(('
)()()()1(
)())(('))()(()()()()1(
tytNetftOtdtyttWtW
jkkk
jkkjkj
8/14/2019 Lecture Neural Nets 2010
89/148
8/14/2019 Lecture Neural Nets 2010
90/148
90
Network training:
Despite this sound theoretical foundation concerning
the representational capabilities of neural networks,
and notwithstanding the success of the EBP learning
algorithm, there are many practical drawbacks to theEBP algorithm.
The most troublesome is the usually long training
process, which does not ensure that the absolute
minimum of the cost function (the best performanceof the network) will be achieved.
N t k t i i
8/14/2019 Lecture Neural Nets 2010
91/148
Network training:
The algorithm may become stuck at some localminimum, and such a termination with a suboptimalsolution will require repetition of the whole learning
process by changing the structure or some of thelearning parameters that influence the iterativescheme.
As in many other scientific and engineering
disciplines, so in the field of artificial neural networks,the theory (or at least part of it) was established onlyafter a number of practical neural networkapplications had been implemented.
91
Advantages and disadvantages of different
8/14/2019 Lecture Neural Nets 2010
92/148
92
g gmodes
Sequential mode: Less storage for each weighted connection
Random order of presentation and updating per pattern
means search of weight space is stochastic--reducing risk of
local minima able to take advantage of any redundancy intraining set (i.e.. same pattern occurs more than once in
training set, esp. for large training sets)
Simpler to implement
Batch mode:Faster learning than sequential mode
Goals of Neural Network Training
8/14/2019 Lecture Neural Nets 2010
93/148
93
g
To give the correct output for input
training vector (Learning)
To give good responses to new unseeninput patterns (Generalization)
Training and Testing Problems
8/14/2019 Lecture Neural Nets 2010
94/148
94
Training and Testing Problems
Stuck neurons: Degree of weight change is proportionalto derivative of activation function, weight changes will begreatest when units receives mid-range functional signal thanat extremes neuron. To avoid stuck neurons weightsinitialization should give outputs of all neurons approximate 0.5
Insufficient number of training patterns: In thiscase, the training patterns will be learnt instead of theunderlying relationship between inputs and output, i.e. networkjust memorizing the patterns.
Too few hidden neurons: network will not produce a
good model of the problem.Over-fitting: the training patterns will be learnt insteadof the underlying function between inputs and output becauseof too many of hidden neurons. This means that the networkwill have a poor generalization capability.
8/14/2019 Lecture Neural Nets 2010
95/148
95
Dynamics of BP learningAim is to minimise an error function over all trainingpatterns by adapting weights in MLP
Recalling the typical error function is the mean
squared error as follows
E(t)=
The idea is to reduce E(t) to global minimum point.
p
k
kk tOtd1
2))()((
2
1
8/14/2019 Lecture Neural Nets 2010
96/148
96
Dynamics of BP learning
In single layer perceptronwith linear activationfunctions, the error function is simple, describedby a smooth parabolic surface with a single
minimum
8/14/2019 Lecture Neural Nets 2010
97/148
8/14/2019 Lecture Neural Nets 2010
98/148
8/14/2019 Lecture Neural Nets 2010
99/148
Selecting Initial Weight Values
8/14/2019 Lecture Neural Nets 2010
100/148
100
Selecting Initial Weight Values
Choice of initial weight values is important as thisdecides starting position in weight space. That is,
how far away from global minimum
Aim is to select weight values which produce
midrange function signals
Select weight values randomly from uniform
probability distribution
Normalise weight values so number of weightedconnections per unit produces midrange function
signal
Convergence of Backprop
8/14/2019 Lecture Neural Nets 2010
101/148
101
Avoid local minumum with fast convergence:
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights
Nature of convergence
Initialize weights near zero or initial networksnear-linear
Increasingly non-linear functions possible astraining progresses
Use of Available Data Set for Training
8/14/2019 Lecture Neural Nets 2010
102/148
102
Use of Available Data Set for Training
Training setuse to update the weights.Patterns in this set are repeatedly in random
order. The weight update equation areapplied after a certain number of patterns.
Validation setuse to decide when to stoptraining only by monitoring the error.
Test setUse to test the performance of theneural network. It should not be used as partof the neural network development cycle.
The available data set is normally split into three
sets as follows:
Earlier Stopping - Good Generalization
R i t h t i th
8/14/2019 Lecture Neural Nets 2010
103/148
103
Running too many epochs may overtrainthenetwork and result in overfittingand perform
poorly in generalization. Keep a hold-out validation set and test accuracy
after every epoch. Maintain weights for bestperforming network on the validation set and stoptraining when error increases increases beyondthis.
No. of epochs
errorTraining set
Validation set
Model Selection by Cross validation
8/14/2019 Lecture Neural Nets 2010
104/148
104
Model Selection by Cross-validation
Too few hidden unitsprevent the network from learningadequately fitting the data and learning the concept.
Too many hidden unitsleads to overfitting.
Similar cross-validation methodscan be used to determinean appropriate number of hidden units by using the optimaltest error to select the model with optimal number of hiddenlayers and nodes.
No. of epochs
errorTraining set
Validation set
8/14/2019 Lecture Neural Nets 2010
105/148
105
Genetic Algorithms
lternative training algorithm
History Background
8/14/2019 Lecture Neural Nets 2010
106/148
106
Idea of evolutionary computing was introduced in the 1960s by I.Rechenbergin his work "Evolution strategies" (Evolutionsstrategiein
original). His idea was then developed by other researchers. Genetic
Algorithms(GAs) were invented by John Hollandand developed by him
and his students and colleagues. This lead to Holland's book "Adaption in
Natural and Artificial Systems" published in 1975.
In 1992 John Kozahas used genetic algorithm to evolve programsto
perform certain tasks. He called his method Genetic Programming"(GP). LISP programs were used, because programs in this language can
expressed in the form of a "parse tree", which is the object the GA works
on.
Biological Background
8/14/2019 Lecture Neural Nets 2010
107/148
107
g gChromosome.
All living organisms consist of cells. In each cell there is the same set of
chromosomes. Chromosomes are strings of DNAand serves as a model for
the whole organism.A chromosome consist of genes, blocks of DNA. Each
gene encodes a particular protein. Basically can be said, that each gene
encodes a trait, for example color of eyes. Possible settings for a trait (e.g.
blue, brown) are called alleles. Each gene has its own position in the
chromosome. This position is called locus.
Complete set of genetic material (all chromosomes) is called genome.Particular set of genes in genome is called genotype. The genotype is with
later development after birth base for the organism's phenotype, its physical
and mental characteristics, such as eye color, intelligence etc.
Biological Background
http://cs.felk.cvut.cz/~xobitko/ga/dna.htmlhttp://cs.felk.cvut.cz/~xobitko/ga/dna.html8/14/2019 Lecture Neural Nets 2010
108/148
108
Reproduction.
During reproduction, first occurs recombination(orcrossover). Genes from parents form in some way the
whole new chromosome. The new created offspring can
then be mutated. Mutationmeans, that the elements ofDNA are a bit changed. This changes are mainly caused by
errors in copying genes from parents.
The fitnessof an organism is measured by success of the
organism in its life.
Evolutionary Computation
8/14/2019 Lecture Neural Nets 2010
109/148
109
Evolutionary Computation
Based on evolution as it occurs in nature Lamarck, Darwin, Wallace: evolution of species, survival
of the fittest
Mendel: genetics provides inheritance mechanism
Hence genetic algorithms
Essentially a massively parallel search procedure
Start with random population of individuals
Gradually move to better individuals
A Simple Genetic Algorithm
8/14/2019 Lecture Neural Nets 2010
110/148
110
p g
Optimization task: find the maximum of f(x)
for example f(x)=xsin(x) x [0,p]genotype: binary string s [0,1]5 e.g. 11010, 01011, 10001mapping : genotype phenotype
binary integer encoding: x = si 2n-i-1/ (2n-1)
genotype integ. phenotype fitness prop. fitness11010 26 2.6349 1.2787 30%
01011 11 1.1148 1.0008 24%10001 17 1.7228 1.7029 40%00101 5 0.5067 0.2459 6%
Initial population
5
1
n
i
Some Other Issues RegardingE l ti C ti
8/14/2019 Lecture Neural Nets 2010
111/148
111
Evolutionary Computing
Evolution according to Lamarck. Individual adapts during lifetime.
Adaptations inherited by children.
In nature, genes dont change; but for computations we could
allow this...
Baldwin effect. Individuals ability to learn has positive effect on evolution.
It supports a more diverse gene pool.
Thus, more experimentation with genes possible.
Bacteria and virus. New evolutionary computing strategies.
8/14/2019 Lecture Neural Nets 2010
112/148
112
Radial Basis Functions
Radial Basis Functions
Introduction fit=ajuste
8/14/2019 Lecture Neural Nets 2010
113/148
Introduction fit=ajuste
The back-propagation algorithm for the design ofa MLP (under surpervision) may be viewed as theapplication of a recursive technique know instatistic as stochastic approximation.
In this moment we take a completely differentapproach by viewing the design of a neuralnetwork as a curve fitting (approximation)
problemin a high-dimensional space. According with this view-point, learning is
equivalent to find a surface in multidimensionalspace that provide a best fit to the training data.
113
Radial-basis function (RBF) networks
8/14/2019 Lecture Neural Nets 2010
114/148
114
RBF = radial-basis function: a function whichdepends only on the radial distance from a point
XOR problem
quadratically separable
Radial-basis function (RBF) networks
S RBF f ti t ki th f
8/14/2019 Lecture Neural Nets 2010
115/148
115
So RBFs are functions taking the form
where is a nonlinear activation function, xis the
input and xiis the ithposition, prototype, basis or
centrevector.
The idea is thatpoints near the centres will have
similar outputs (i.e. if x ~ xi then f (x) ~f (xi))
since they should have similar properties.
The simplest isthe linear RBF : (x) =||x xi||
||)(|| ixx
Radial Functions
8/14/2019 Lecture Neural Nets 2010
116/148
Radial Functions
Characteristic feature-their responsedecreases (or increases) monotonically withdistance from a central point.
The center, the distance scale, and the
precise shape of the radial functionare parameters of the model, all fixed if it islinear.
Typical radial functions are:
The Gaussian RBF (monotonically decreases withdistance from the center).
A multiquadric RBF (monotonically increases withdistance from the center).
Typical RBFs include(a) Multiquadrics
8/14/2019 Lecture Neural Nets 2010
117/148
117
(a) Multiquadrics
for some c>0
(b) Inverse multiquadrics
for some c>0
(c) Gaussian
for some >0
2/122)()( crr
2/122 )()( crr
)2
exp()(2
2
rr
8/14/2019 Lecture Neural Nets 2010
118/148
118nonlocalized functions localized functions
8/14/2019 Lecture Neural Nets 2010
119/148
Starting point: exact interpolation
8/14/2019 Lecture Neural Nets 2010
120/148
120
Starting point: exact interpolation
Each input pattern x must be mapped onto atarget value d
8/14/2019 Lecture Neural Nets 2010
121/148
y1
Single-layer networks1 y)1||y-x1||
8/14/2019 Lecture Neural Nets 2010
122/148
122
yp
Input
y1y2
Output
Input layer : N y)N||y-xN||
wj
d
output = wii (y - xi)
adjustable parameters are weights wj
number of hidden units = number of data points
Form of the basis functions decided in advance
8/14/2019 Lecture Neural Nets 2010
123/148
Large = 1
8/14/2019 Lecture Neural Nets 2010
124/148
124
g
Small = 0.2
8/14/2019 Lecture Neural Nets 2010
125/148
125
Problems with exact interpolationcan produce poor generalisation performance as only data
8/14/2019 Lecture Neural Nets 2010
126/148
126
a p odu poo g a a o p o a a o y da apoints constrain mapping
Overfitting problem
Bishop(1995) example
Underlying function f(x)=0.5+0.4sine(2x)sampled randomly for 30 points
added Gaussian noise to each data point
30 data points 30 hidden RBF units
fits all data points but creates oscillations due added noiseand unconstrained between data points
8/14/2019 Lecture Neural Nets 2010
127/148
127
All Data Points 5 Basis functions
8/14/2019 Lecture Neural Nets 2010
128/148
Learning Algorithms
8/14/2019 Lecture Neural Nets 2010
129/148
Redes Neurais e
Lgica Fuzzy Radial Basis Function Network 129
g g
Parametersto be learnt are: centers
spreads
weights
Different learning algorithms
Learning Algorithm 1
8/14/2019 Lecture Neural Nets 2010
130/148
Neurais e Lgica
Fuzzy Radial Basis Function Network 130
Centers are selected at random center locationsare chosen randomly from the
training set
Spreadsare chosen by normalization:
1m
maxdcentersofnumber
centers2anybetweendistanceMaximum
1
2
i2max
12
i
m,1i
txd
mexptx
i
Learning Algorithm 1
8/14/2019 Lecture Neural Nets 2010
131/148
Redes Neurais e
Lgica Fuzzy Radial Basis Function Network 131
Weightsare found by means of pseudo-inverse method
1
2
ij2
max
1j ij i
m2,1,i,N...,2,1,j
txd
mexp
dw
...,
Desired response
Pseudo-inverse of
8/14/2019 Lecture Neural Nets 2010
132/148
Learning Algorithm 2: Centers K-means clustering algorithm for centers
8/14/2019 Lecture Neural Nets 2010
133/148
Redes Neurais e
Lgica Fuzzy Radial Basis Function Network 133
1 Initialization: tk(0) random k = 1, , m1
2 Sampling: draw x from input space C3 Similaritymatching: find index of best center
4 Updating: adjust centers
5. Continuation: increment nby 1, goto 2and continueuntil no noticeable changes of centers occur
)n(tx(n)minargk(x) kk
k(x)kif)n(tx(n))n(t kk
otherwise)n(tk )1n(tk
Learning Algorithm 3
8/14/2019 Lecture Neural Nets 2010
134/148
Redes Neurais e
Lgica Fuzzy Radial Basis Function Network 134
Supervised learning of all the parameters
using the gradient descent method
Modify centers
j
tjt
tj
E
N
1i
2i
2
1e
Instantaneous error function
Learning
rate forjt
Depending on the specific function can be computed using
the chain rule of calculus
Learning Algorithm 3
8/14/2019 Lecture Neural Nets 2010
135/148
Redes Neurais e
Lgica Fuzzy Radial Basis Function Network 135
Modify spreads
Modify output weights
j
j
E
j
ij
ijijw
w
E
Design Radial Basis Networks
8/14/2019 Lecture Neural Nets 2010
136/148
g
136
net = newrb(P,T,goal,spread,MN,DF)
P R-by-Q matrix of Q input vectors
T S-by-Q matrix of Q target class
vectorsgoal Mean squared error goal
(default = 0.0)
spread Spread of radial basis functions(default = 1.0)
MN Maximum number of neurons(default is Q)
DF Number of neurons to add betweendisplays (default = 25)
Examples
8/14/2019 Lecture Neural Nets 2010
137/148
p
137
Here you design a radial basis network, given inputs
Pand targetsT.
P = [1 2 3];T = [2.0 4.1 5.9];
net = newrb(P,T);
The network is simulated for a new input.
P = 1.5;Y = sim(net,P)
Comparison with multilayer NNRBF N t k d t f l ( li )
8/14/2019 Lecture Neural Nets 2010
138/148
Redes Neurais e
Lgica Fuzzy Radial Basis Function Network 138
RBF-Networksare used to perform complex (non-linear)pattern classification tasks.
Comparison between RBFnetworksand multilayerperceptrons:
Both are examples of non-linear layered feed-forward
networks.
Both are universal approximators.
Hidden layers: RBF networks have one singlehidden layer. MLP networks may have morehidden layers.
Comparison with multilayer NN
8/14/2019 Lecture Neural Nets 2010
139/148
Redes Neurais e
Lgica Fuzzy Radial Basis Function Network 139
Neuron Models: The computation nodes in the hidden layer of a RBF network are
different. They serve a different purpose from those in the outputlayer.
Typically computation nodes of MLP in a hidden or output layershare a common neuron model.
Linearity: The hidden layer of RBF is non-linear, the output layer of RBF is
linear.
Hidden and output layers of MLP are usually non-linear.
Comparison with multilayer NN
8/14/2019 Lecture Neural Nets 2010
140/148
Redes Neurais e
Lgica Fuzzy Radial Basis Function Network 140
Activation functions: The argument of activation function of each hidden unit in aRBF NN computes the Euclidean distance between inputvector and the center of that unit.
The argument of the activation function of each hidden unitin a MLP computes the inner product of input vector and the
synaptic weight vector of that unit. Approximations:
RBF NN using Gaussian functions construct localapproximations to non-linear I/O mapping.
MLP NN construct globalapproximations to non-linear I/O
mapping.
8/14/2019 Lecture Neural Nets 2010
141/148
141
pplication ExamplesNonlinear Identification, Prediction and Control
Nonlinear System Identification
8/14/2019 Lecture Neural Nets 2010
142/148
142
Target function: yp(k+1) = f(.)
Identified function: yNET(k+1) = F(.)
Estimation error: e(k+1)
Nonlinear System Neural Control
8/14/2019 Lecture Neural Nets 2010
143/148
143
d: reference/desired response
y: system output/desired output
u: system input/controller output
: desired controller input
u*:NN output
e: controller/network error
The goal of training is to find an
appropriate plant control u from
the desired response d. The weights
are adjusted based on the difference
between the outputs of the networksI & II to minimise e. If network I is
trained so that y = d, then u = u*.
Networks act as inverse dynamics
identifiers.
Nonlinear System Identification
8/14/2019 Lecture Neural Nets 2010
144/148
144
Neural network
input generationPm
Nonlinear System Identification
8/14/2019 Lecture Neural Nets 2010
145/148
145
Neural network targetTm
Neural network response(angle & velocity)
8/14/2019 Lecture Neural Nets 2010
146/148
Model Reference Control
8/14/2019 Lecture Neural Nets 2010
147/148
147
Neural controller + nonlinear system diagram
Neural controller, reference model, neural model
Matlab NNtool GUI (Graphical User Interface)
8/14/2019 Lecture Neural Nets 2010
148/148
Top Related