8/3/2019 Martin Brown - Multi-Layer Perceptrons
1/16
EE-M016 2005/6: IS L9&10 1/16, v3.0
Lecture 9&10: Multi-Layer Perceptrons
[Figure: MLP network diagram with inputs x1, x2, input bias x0 = 1, hidden nodes h1, h2, hidden bias h0 = 1, and output y]
Dr Martin Brown
Room: E1k
Email: [email protected]
Telephone: 0161 306 4672
http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/
Lecture 9&10: Outline
Layered sigmoidal models (multi-layer perceptrons, MLPs)
1. Network structure and modelling abilities
2. Gradient descent for MLPs (error back propagation, EBP)
3. Example: learning the XOR solution
4. Variations on/extensions to basic non-linear gradient descent parameter estimation
MLPs are non-linear in both their:
Inputs/features, so the models can form non-linear decision boundaries and non-linear regression surfaces
Parameters, so gradient descent can only be shown to converge to a local minimum
Lecture 9&10: Resources
These slides are largely self-contained, but extra, background
material can be found in:
Machine Learning, T Mitchell, McGraw Hill, 1997
Machine Learning, Neural and Statistical Classification, D Michie, DJ Spiegelhalter and CC Taylor, 1994: http://www.amsta.leeds.ac.uk/~charles/statlog/
In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back propagation (EBP); just search on Google.
Advanced text:
Information Theory, Inference and Learning Algorithms, D
MacKay, Cambridge University Press, 2003
Multi-Layer Perceptron Networks
Layered perceptron networks (with bi-polar/binary outputs) can realize any logical function; however, there is no simple way to estimate the parameters or generalise the (single-layer) Perceptron convergence procedure.
Multi-layer perceptron (MLP) networks are a class of models formed from layered sigmoidal nodes, which can be used for regression or classification purposes.
They are commonly trained using gradient descent on a mean squared error performance function, using a technique known as error back propagation to calculate the gradients.
They have been widely applied to many prediction and classification problems over the past 15 years.
Multi-Layer Perceptron Networks
Use 2 or more layers of parameters, where:
Empty circles represent sigmoidal (tanh) nodes
Solid circles represent real signals (inputs, biases & outputs)
Arrows represent adjustable parameters
Multi-Layer Perceptron networks can have:
Any number of layers of parameters (but generally just 2)
Any number of outputs (but generally just 1)
Any number of nodes in the hidden layers (see Slide 14)
[Figure: two-layer MLP with inputs x1, x2, input bias x0 = 1, hidden nodes h1, h2 (hidden layer, parameters $\theta^h$), hidden bias h0 = 1, and output y (output layer, parameters $\theta^o$)]
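The two-layer structure above can be sketched as a forward pass (a minimal illustration with our own variable names, assuming tanh nodes in both layers and small random initial parameters):

```python
import numpy as np

def forward(x, theta_h, theta_o):
    """Forward pass of a two-layer MLP with tanh nodes."""
    x_b = np.concatenate(([1.0], x))    # prepend input bias x0 = 1
    h = np.tanh(theta_h @ x_b)          # hidden node outputs h1, h2
    h_b = np.concatenate(([1.0], h))    # prepend hidden bias h0 = 1
    y = np.tanh(theta_o @ h_b)          # scalar network output
    return y, h_b

rng = np.random.default_rng(0)
theta_h = 0.1 * rng.standard_normal((2, 3))  # 2 hidden nodes, bias + 2 inputs each
theta_o = 0.1 * rng.standard_normal(3)       # 1 output node, bias + 2 hidden inputs
y, h_b = forward(np.array([1.0, -1.0]), theta_h, theta_o)
```

Each arrow in the diagram corresponds to one entry of `theta_h` or `theta_o`; the bias entries play the role of x0 and h0.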
Exemplar Model Outputs
[Figure: response surface of an MLP with two hidden nodes. The surface resembles an impulse ridge because one sigmoid is subtracted from the other. This is a learnt solution to the XOR classification problem.]
[Figure: non-linear regression surface generated by an MLP with three hidden nodes and a linear transfer function in the output layer.]
Gradient Descent Parameter Estimation
All of the model's parameters can be stacked into a single vector $\theta$, then gradient descent learning is used:
$\theta_{k+1} = \theta_k - \eta \left. \frac{\partial p}{\partial \theta} \right|_{\theta_k}$
$\theta_0$ are small, random values
The performance function is non-linear in $\theta$:
No direct solution
Local minima are possible
The learning rate $\eta$ is difficult to estimate because the local Hessian (second derivative matrix) varies in parameter space
Performance function (mean squared error):
$p(t) = \tfrac{1}{2}\big(y(t) - \hat{y}(\mathbf{x}(t), \theta)\big)^2, \qquad p = \tfrac{1}{2}\sum_{t=1}^{T}\big(y(t) - \hat{y}(\mathbf{x}(t), \theta)\big)^2$
[Figure: gradient descent steps $\theta_k \to \theta_{k+1}$ on the performance surface]
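The stacked-parameter update above amounts to a single vector operation per step (a minimal sketch; `grad` and `eta` are our own names for the gradient function and learning rate $\eta$, and the quadratic test function is purely illustrative):

```python
import numpy as np

def gd_step(theta, grad, eta=0.05):
    """One gradient descent step: theta_{k+1} = theta_k - eta * dp/dtheta."""
    return theta - eta * grad(theta)

# Illustrative quadratic bowl p(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = gd_step(theta, lambda th: th, eta=0.1)
# theta shrinks towards the minimum at the origin
```

For a convex bowl like this, any sufficiently small $\eta$ converges; for an MLP the surface is non-convex, so only convergence to a local minimum can be expected.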
Output Layer Gradient Calculation
Gradient descent update:
$\theta^o_{k+1} = \theta^o_k - \eta \left. \frac{\partial p}{\partial \theta^o} \right|_{\theta^o_k}$
For the tth training pattern:
$p(t) = \tfrac{1}{2}\big(y(t) - \hat{y}(t)\big)^2, \qquad \hat{y} = f(u^o), \qquad u^o = (\theta^o)^T \mathbf{h}$
Using the chain rule:
$\frac{\partial p}{\partial \theta^o} = \frac{\partial p}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial u^o} \frac{\partial u^o}{\partial \theta^o} = -\big(y(t) - \hat{y}(t)\big) f'(u^o(t)) \mathbf{h}(t)$
Giving an update rule:
$\theta^o_{k+1} = \theta^o_k + \eta \big(y(t) - \hat{y}(t)\big) f'(u^o(t)) \mathbf{h}(t)$
This is the same as the derivation for a single-layer sigmoidal model, as described in lecture 7&8, with the hidden layer outputs $\mathbf{h}$ (including the bias h0 = 1) acting as the output node's inputs.
[Figure: network schematic with the hidden layer feeding the output layer]
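The output-layer update rule can be sketched as follows (variable names are our own; tanh is assumed as the transfer function f, so f'(u) = 1 - tanh(u)^2):

```python
import numpy as np

def output_layer_update(theta_o, h_b, y_target, eta=0.05):
    """Gradient descent update for the output layer parameters.

    h_b      : hidden layer outputs with the bias h0 = 1 prepended
    y_target : desired output y(t)
    """
    u_o = theta_o @ h_b                 # output node activation u^o
    y_hat = np.tanh(u_o)                # model output y_hat = f(u^o)
    fprime = 1.0 - y_hat ** 2           # tanh'(u) = 1 - tanh(u)^2
    delta = (y_target - y_hat) * fprime
    return theta_o + eta * delta * h_b  # theta + eta*(y - y_hat)*f'(u^o)*h

theta_o = np.zeros(3)
h_b = np.array([1.0, 0.5, -0.5])
theta_new = output_layer_update(theta_o, h_b, y_target=1.0)
```

Starting from zero parameters, u_o = 0 and f'(0) = 1, so the first step simply moves the parameters in the direction of h_b scaled by the error.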
Hidden Layer Gradient Calculation
Analyze the path by which altering the jth hidden node's parameter vector $\theta^h_j$ affects the model's output:
$\mathbf{x} \xrightarrow{\theta^h_j} u^h_j \xrightarrow{f} y^h_j \xrightarrow{\theta^o_j} u^o \xrightarrow{f} \hat{y}$
where:
$y^h_j = f(u^h_j), \qquad u^h_j = (\theta^h_j)^T \mathbf{x}$
By the chain rule:
$\frac{\partial p}{\partial \theta^h_j} = \frac{\partial p}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial u^o} \frac{\partial u^o}{\partial y^h_j} \frac{\partial y^h_j}{\partial u^h_j} \frac{\partial u^h_j}{\partial \theta^h_j}$
Gradient expression (error back propagation):
$\frac{\partial p}{\partial \theta^h_j} = -\big(y(t) - \hat{y}(t)\big) f'(u^o) \, \theta^o_j \, f'(u^h_j) \, \mathbf{x}(t) = -\big(y(t) - \hat{y}(t)\big) \delta^h_j \mathbf{x}(t)$
where $\delta^h_j = f'(u^o) \, \theta^o_j \, f'(u^h_j)$.
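The back-propagated hidden-layer gradient can be sketched as (names are our own; tanh transfer functions are assumed in both layers):

```python
import numpy as np

def hidden_layer_update(theta_h_j, theta_o_j, x_b, u_h_j, err, fprime_o, eta=0.05):
    """EBP update for the jth hidden node's parameter vector.

    err      : output error (y(t) - y_hat(t))
    fprime_o : f'(u^o), transfer function derivative at the output node
    theta_o_j: output weight connecting hidden node j to the output node
    """
    y_h_j = np.tanh(u_h_j)
    fprime_h = 1.0 - y_h_j ** 2              # tanh'(u^h_j)
    delta_h_j = fprime_o * theta_o_j * fprime_h   # back-propagated sensitivity
    return theta_h_j + eta * err * delta_h_j * x_b

new_theta = hidden_layer_update(
    theta_h_j=np.zeros(3), theta_o_j=1.0,
    x_b=np.array([1.0, 1.0, -1.0]), u_h_j=0.0,
    err=1.0, fprime_o=1.0)
```

Note how the output error reaches the hidden node only through the single output weight $\theta^o_j$, which is exactly the path traced by the chain rule above.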
MLP Iterative Parameter Estimation
Randomly initialise all parameters in the network (to small values). For each parameter update:
present each input pattern to the network & get the output
calculate the update for each parameter according to:
$\Delta \theta^l_{ij} = \eta \, \delta^l_j \big(y(t) - \hat{y}(t)\big) x^l_i(t)$
where:
output layer: $\delta^o = f'(u^o)$
hidden layer: $\delta^h_j = \delta^o \, \theta^o_j \, f'(u^h_j)$
calculate the average parameter updates
update the weights
Stop when steps > max_steps, or MSE < tolerance, or the test MSE is at a minimum.
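The full procedure above, applied to the bipolar XOR data, might look like this (a sketch under our own implementation choices: 2 tanh hidden nodes, a tanh output, per-pattern LMS updates, and η = 0.5; convergence to a good minimum is not guaranteed for every initialisation):

```python
import numpy as np

# Bipolar XOR data: inputs and targets in {-1, +1}
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
Y = np.array([-1., 1., 1., -1.])

rng = np.random.default_rng(1)
theta_h = 0.5 * rng.standard_normal((2, 3))  # hidden layer: 2 nodes, bias + 2 inputs
theta_o = 0.5 * rng.standard_normal(3)       # output layer: bias + 2 hidden outputs
eta = 0.5                                    # learning rate (our own choice)

def mse():
    err = 0.0
    for x, y in zip(X, Y):
        h_b = np.concatenate(([1.0], np.tanh(theta_h @ np.concatenate(([1.0], x)))))
        err += 0.5 * (y - np.tanh(theta_o @ h_b)) ** 2
    return err / len(X)

mse_start = mse()
for step in range(2000):                     # per-pattern (LMS) updates
    for x, y in zip(X, Y):
        x_b = np.concatenate(([1.0], x))
        u_h = theta_h @ x_b
        h_b = np.concatenate(([1.0], np.tanh(u_h)))
        y_hat = np.tanh(theta_o @ h_b)
        err = y - y_hat
        fprime_o = 1.0 - y_hat ** 2              # tanh'(u^o)
        theta_o_old = theta_o.copy()             # pre-update weights for EBP
        theta_o = theta_o + eta * err * fprime_o * h_b
        delta_h = fprime_o * theta_o_old[1:] * (1.0 - np.tanh(u_h) ** 2)
        theta_h = theta_h + eta * err * np.outer(delta_h, x_b)
mse_end = mse()
```

With this seed the MSE typically falls well below its initial value; as the lecture notes, the descent is non-monotonic and a local minimum is always possible.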
Example: Learning the XOR Problem
[Figure: performance history for the XOR data and an MLP with 2 hidden nodes. Note its non-monotonic behaviour and the large number of iterations. η = 0.05, update after each datum.]
[Figure: learning histories for the 9 parameters in the MLP. Note that even when the MSE goes up, the parameters are heading towards their optimal values.]
Example: Trained XOR Model
The trained optimal model has a ridge where the target is 1, and plateaus out in the regions where the target is -1. Note that all inputs and targets are bipolar {-1, 1}, rather than binary.
Basic Variations on Parameter Estimation
Parameter updates can be performed:
After each pattern is presented (LMS)
After the complete data set has been presented (Batch)
Generally, convergence is smoother in the latter case, though overall convergence may be slower.
When to stop learning is typically decided by monitoring the performance and stopping when an acceptable level is reached, before the parameters become too large.
The learning rate needs to be carefully selected to ensure stable learning along the parameter trajectory within a reasonable time period.
Generally, input features are scaled to zero mean, unit variance (or to lie in [-1, 1]).
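The zero-mean, unit-variance scaling mentioned above can be done per feature as follows (a minimal sketch; the function name is our own):

```python
import numpy as np

def standardise(X):
    """Scale each input feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0        # guard against constant features
    return (X - mu) / sigma

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Xs = standardise(X)
```

The same mean and standard deviation computed on the training set should also be applied to the test and validation inputs, so all three sets see an identical transformation.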
Selecting the Size of the Hidden Layer
In building a non-linear model such as an MLP, the labelled data may be divided into 3 sets:
Training: used to learn the optimal parameter values
Testing: used to compare different model structures
Validation: used to get a final performance figure
The aim is to select a model that performs well on the test set, and use the validation set to obtain a final performance estimate.
[Figure: performance versus number of hidden nodes for the training (parameter estimation), testing (model selection) and validation (final performance) sets; the test-set curve is used to select the number of nodes in the hidden layer.]
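The three-way split described above can be sketched as follows (the 60/20/20 proportions and function name are our own illustrative choices):

```python
import numpy as np

def split_data(X, y, frac_train=0.6, frac_test=0.2, seed=0):
    """Shuffle and split labelled data into training / testing / validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random shuffle of pattern indices
    n_train = int(frac_train * len(X))
    n_test = int(frac_test * len(X))
    tr, te, va = np.split(idx, [n_train, n_train + n_test])
    return (X[tr], y[tr]), (X[te], y[te]), (X[va], y[va])

X = np.arange(20).reshape(10, 2).astype(float)
y = np.arange(10)
train, test, val = split_data(X, y)
```

Parameters are fitted on `train`, the hidden layer size is chosen by comparing test-set performance across candidate structures, and `val` is touched only once, for the final figure.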
Lecture 9&10: Conclusions
Multi-layer perceptrons are universal approximators: they can model any continuous function arbitrarily closely, given a sufficient number of hidden nodes (existence proof only).
They are used for both classification and regression problems, although with regression a linear transfer function is often used in the output layer so that the output is unbounded.
They are trained using gradient descent, which suffers from all the well-known disadvantages.
The training technique is sometimes known as error back propagation, because the output error is fed backwards to form the gradient signal of the hidden layer(s).
The number of hidden nodes, and the learning rate, need to be found experimentally, often using separate training, testing and validation data sets.
Lecture 9&10: Laboratory Session
Make sure you have the single-layer sigmoid algorithm, trained using gradient descent, working (see lab 7&8). This forms the main part of your assignment.
Extend this procedure to implement an MLP to solve the XOR problem. You should note that the output layer is equivalent to a single-layer sigmoid, and that all you have to add is the output and parameter update calculations for the hidden layer.
Make sure this works by monitoring the MSE and showing that it tends to 0 as the number of iterations increases; you'll need two hidden nodes.
Draw the logical function boundaries for each node to verify that the output is correct.