8/3/2019 Martin Brown - Multi-Layer Perceptrons
1/16
EE-M016 2005/6: IS L9&10 1/16, v3.0
Lecture 9&10: Multi-Layer Perceptrons
[Figure: MLP network diagram with inputs x1, x2, input bias x0 = 1, hidden nodes h1, h2, hidden bias h0 = 1, and output y]
Dr Martin Brown
Room: E1k
Email: [email protected]
Telephone: 0161 306 4672
http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/
Lecture 9&10: Outline
Layered sigmoidal models (multi-layer perceptrons, MLPs)
1. Network structure and modelling abilities
2. Gradient descent for MLPs (error back propagation, EBP)
3. Example: learning the XOR solution
4. Variations on/extensions to basic non-linear gradient descent parameter estimation
MLPs are non-linear in both their:
Inputs/features, so the models can form non-linear decision boundaries and non-linear regression surfaces
Parameters, so gradient descent can only be shown to converge to a local minimum
Lecture 9&10: Resources
These slides are largely self-contained, but extra, background
material can be found in:
Machine Learning, T Mitchell, McGraw Hill, 1997
Machine Learning, Neural and Statistical Classification, D Michie, DJ Spiegelhalter and CC Taylor, 1994: http://www.amsta.leeds.ac.uk/~charles/statlog/
In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back propagation (EBP); just search on Google.
Advanced text:
Information Theory, Inference and Learning Algorithms, D
MacKay, Cambridge University Press, 2003
Multi-Layer Perceptron Networks
Layered perceptron networks (with bi-polar/binary outputs) can realize any logical function; however, there is no simple way to estimate the parameters or generalise the (single-layer) Perceptron convergence procedure.
Multi-layer perceptron (MLP) networks are a class of models formed from layered sigmoidal nodes, which can be used for regression or classification purposes.
They are commonly trained using gradient descent on a mean squared error performance function, using a technique known as error back propagation to calculate the gradients.
They have been widely applied to many prediction and classification problems over the past 15 years.
Multi-Layer Perceptron Networks
Use 2 or more layers of parameters, where:
Empty circles represent sigmoidal (tanh) nodes
Solid circles represent real signals (inputs, biases & outputs)
Arrows represent adjustable parameters
Multi-Layer Perceptron networks can have:
Any number of layers of parameters (but generally just 2)
Any number of outputs (but generally just 1)
Any number of nodes in the hidden layers (see Slide 14)
[Figure: two-layer MLP with inputs x1, x2, input bias x0 = 1, hidden nodes h1, h2 (hidden layer, parameters $\theta^h$), hidden bias h0 = 1, and output y (output layer, parameters $\theta^o$)]
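The two-layer structure above can be sketched as a forward pass (a minimal illustration with our own variable names, assuming tanh nodes in both layers and small random initial parameters):

```python
import numpy as np

def forward(x, theta_h, theta_o):
    """Forward pass of a two-layer MLP with tanh nodes."""
    x_b = np.concatenate(([1.0], x))    # prepend input bias x0 = 1
    h = np.tanh(theta_h @ x_b)          # hidden node outputs h1, h2
    h_b = np.concatenate(([1.0], h))    # prepend hidden bias h0 = 1
    y = np.tanh(theta_o @ h_b)          # scalar network output
    return y, h_b

rng = np.random.default_rng(0)
theta_h = 0.1 * rng.standard_normal((2, 3))  # 2 hidden nodes, bias + 2 inputs each
theta_o = 0.1 * rng.standard_normal(3)       # 1 output node, bias + 2 hidden inputs
y, h_b = forward(np.array([1.0, -1.0]), theta_h, theta_o)
```

Each arrow in the diagram corresponds to one entry of `theta_h` or `theta_o`; the bias entries play the role of x0 and h0.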
Exemplar Model Outputs
[Figure: response surface of an MLP with two hidden nodes. The surface resembles an impulse ridge because one sigmoid is subtracted from the other. This is a learnt solution to the XOR classification problem.]
[Figure: non-linear regression surface generated by an MLP with three hidden nodes and a linear transfer function in the output layer.]
Gradient Descent Parameter Estimation
All of the model's parameters can be stacked into a single vector $\theta$, then gradient descent learning is used:
$\theta_{k+1} = \theta_k - \eta \left. \frac{\partial p}{\partial \theta} \right|_{\theta_k}$
$\theta_0$ are small, random values
The performance function is non-linear in $\theta$:
No direct solution
Local minima are possible
The learning rate $\eta$ is difficult to estimate because the local Hessian (second derivative matrix) varies in parameter space
Performance function (mean squared error):
$p(t) = \tfrac{1}{2}\big(y(t) - \hat{y}(\mathbf{x}(t), \theta)\big)^2, \qquad p = \tfrac{1}{2}\sum_{t=1}^{T}\big(y(t) - \hat{y}(\mathbf{x}(t), \theta)\big)^2$
[Figure: gradient descent steps $\theta_k \to \theta_{k+1}$ on the performance surface]
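The stacked-parameter update above amounts to a single vector operation per step (a minimal sketch; `grad` and `eta` are our own names for the gradient function and learning rate $\eta$, and the quadratic test function is purely illustrative):

```python
import numpy as np

def gd_step(theta, grad, eta=0.05):
    """One gradient descent step: theta_{k+1} = theta_k - eta * dp/dtheta."""
    return theta - eta * grad(theta)

# Illustrative quadratic bowl p(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = gd_step(theta, lambda th: th, eta=0.1)
# theta shrinks towards the minimum at the origin
```

For a convex bowl like this, any sufficiently small $\eta$ converges; for an MLP the surface is non-convex, so only convergence to a local minimum can be expected.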
Output Layer Gradient Calculation
Gradient descent update:
$\theta^o_{k+1} = \theta^o_k - \eta \left. \frac{\partial p}{\partial \theta^o} \right|_{\theta^o_k}$
For the tth training pattern:
$p(t) = \tfrac{1}{2}\big(y(t) - \hat{y}(t)\big)^2, \qquad \hat{y} = f(u^o), \qquad u^o = (\theta^o)^T \mathbf{h}$
Using the chain rule:
$\frac{\partial p}{\partial \theta^o} = \frac{\partial p}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial u^o} \frac{\partial u^o}{\partial \theta^o} = -\big(y(t) - \hat{y}(t)\big) f'(u^o(t)) \mathbf{h}(t)$
Giving an update rule:
$\theta^o_{k+1} = \theta^o_k + \eta \big(y(t) - \hat{y}(t)\big) f'(u^o(t)) \mathbf{h}(t)$
This is the same as the derivation for a single-layer sigmoidal model, as described in lecture 7&8, with the hidden layer outputs $\mathbf{h}$ (including the bias h0 = 1) acting as the output node's inputs.
[Figure: network schematic with the hidden layer feeding the output layer]
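The output-layer update rule can be sketched as follows (variable names are our own; tanh is assumed as the transfer function f, so f'(u) = 1 - tanh(u)^2):

```python
import numpy as np

def output_layer_update(theta_o, h_b, y_target, eta=0.05):
    """Gradient descent update for the output layer parameters.

    h_b      : hidden layer outputs with the bias h0 = 1 prepended
    y_target : desired output y(t)
    """
    u_o = theta_o @ h_b                 # output node activation u^o
    y_hat = np.tanh(u_o)                # model output y_hat = f(u^o)
    fprime = 1.0 - y_hat ** 2           # tanh'(u) = 1 - tanh(u)^2
    delta = (y_target - y_hat) * fprime
    return theta_o + eta * delta * h_b  # theta + eta*(y - y_hat)*f'(u^o)*h

theta_o = np.zeros(3)
h_b = np.array([1.0, 0.5, -0.5])
theta_new = output_layer_update(theta_o, h_b, y_target=1.0)
```

Starting from zero parameters, u_o = 0 and f'(0) = 1, so the first step simply moves the parameters in the direction of h_b scaled by the error.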
Hidden Layer Gradient Calculation
Analyze the path by which altering the jth hidden node's parameter vector $\theta^h_j$ affects the model's output:
$\mathbf{x} \xrightarrow{\theta^h_j} u^h_j \xrightarrow{f} y^h_j \xrightarrow{\theta^o_j} u^o \xrightarrow{f} \hat{y}$
where:
$y^h_j = f(u^h_j), \qquad u^h_j = (\theta^h_j)^T \mathbf{x}$
By the chain rule:
$\frac{\partial p}{\partial \theta^h_j} = \frac{\partial p}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial u^o} \frac{\partial u^o}{\partial y^h_j} \frac{\partial y^h_j}{\partial u^h_j} \frac{\partial u^h_j}{\partial \theta^h_j}$
Gradient expression (error back propagation):
$\frac{\partial p}{\partial \theta^h_j} = -\big(y(t) - \hat{y}(t)\big) f'(u^o) \, \theta^o_j \, f'(u^h_j) \, \mathbf{x}(t) = -\big(y(t) - \hat{y}(t)\big) \delta^h_j \mathbf{x}(t)$
where $\delta^h_j = f'(u^o) \, \theta^o_j \, f'(u^h_j)$.
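The back-propagated hidden-layer gradient can be sketched as (names are our own; tanh transfer functions are assumed in both layers):

```python
import numpy as np

def hidden_layer_update(theta_h_j, theta_o_j, x_b, u_h_j, err, fprime_o, eta=0.05):
    """EBP update for the jth hidden node's parameter vector.

    err      : output error (y(t) - y_hat(t))
    fprime_o : f'(u^o), transfer function derivative at the output node
    theta_o_j: output weight connecting hidden node j to the output node
    """
    y_h_j = np.tanh(u_h_j)
    fprime_h = 1.0 - y_h_j ** 2              # tanh'(u^h_j)
    delta_h_j = fprime_o * theta_o_j * fprime_h   # back-propagated sensitivity
    return theta_h_j + eta * err * delta_h_j * x_b

new_theta = hidden_layer_update(
    theta_h_j=np.zeros(3), theta_o_j=1.0,
    x_b=np.array([1.0, 1.0, -1.0]), u_h_j=0.0,
    err=1.0, fprime_o=1.0)
```

Note how the output error reaches the hidden node only through the single output weight $\theta^o_j$, which is exactly the path traced by the chain rule above.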
MLP Iterative Parameter Estimation
Randomly initialise all parameters in the network (to small values). For each parameter update:
present each input pattern to the network & get the output
calculate the update for each parameter according to:
$\Delta \theta^l_{ij} = \eta \, \delta^l_j \big(y(t) - \hat{y}(t)\big) x^l_i(t)$
where:
output layer: $\delta^o = f'(u^o)$
hidden layer: $\delta^h_j = \delta^o \, \theta^o_j \, f'(u^h_j)$
calculate the average parameter updates
update the weights
Stop when steps > max_steps, or MSE < tolerance, or the test MSE is at a minimum.
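The full procedure above, applied to the bipolar XOR data, might look like this (a sketch under our own implementation choices: 2 tanh hidden nodes, a tanh output, per-pattern LMS updates, and η = 0.5; convergence to a good minimum is not guaranteed for every initialisation):

```python
import numpy as np

# Bipolar XOR data: inputs and targets in {-1, +1}
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
Y = np.array([-1., 1., 1., -1.])

rng = np.random.default_rng(1)
theta_h = 0.5 * rng.standard_normal((2, 3))  # hidden layer: 2 nodes, bias + 2 inputs
theta_o = 0.5 * rng.standard_normal(3)       # output layer: bias + 2 hidden outputs
eta = 0.5                                    # learning rate (our own choice)

def mse():
    err = 0.0
    for x, y in zip(X, Y):
        h_b = np.concatenate(([1.0], np.tanh(theta_h @ np.concatenate(([1.0], x)))))
        err += 0.5 * (y - np.tanh(theta_o @ h_b)) ** 2
    return err / len(X)

mse_start = mse()
for step in range(2000):                     # per-pattern (LMS) updates
    for x, y in zip(X, Y):
        x_b = np.concatenate(([1.0], x))
        u_h = theta_h @ x_b
        h_b = np.concatenate(([1.0], np.tanh(u_h)))
        y_hat = np.tanh(theta_o @ h_b)
        err = y - y_hat
        fprime_o = 1.0 - y_hat ** 2              # tanh'(u^o)
        theta_o_old = theta_o.copy()             # pre-update weights for EBP
        theta_o = theta_o + eta * err * fprime_o * h_b
        delta_h = fprime_o * theta_o_old[1:] * (1.0 - np.tanh(u_h) ** 2)
        theta_h = theta_h + eta * err * np.outer(delta_h, x_b)
mse_end = mse()
```

With this seed the MSE typically falls well below its initial value; as the lecture notes, the descent is non-monotonic and a local minimum is always possible.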
Example: Learning the XOR Problem
[Figure: performance history for the XOR data and an MLP with 2 hidden nodes. Note its non-monotonic behaviour and the large number of iterations. η = 0.05, update after each datum.]
[Figure: learning histories for the 9 parameters in the MLP. Note that even when the MSE goes up, the parameters are heading towards their optimal values.]
Example: Trained XOR Model
The trained optimal model has a ridge where the target is 1, and plateaus out in the regions where the target is -1. Note that all inputs and targets are bipolar {-1, 1}, rather than binary.
Basic Variations on Parameter Estimation
Parameter updates can be performed:
After each pattern is presented (LMS)
After the complete data set has been presented (Batch)
Generally, convergence is smoother in the latter case, though overall convergence may be slower.
When to stop learning is typically decided by monitoring the performance and stopping when an acceptable level is reached, before the parameters become too large.
The learning rate needs to be carefully selected to ensure stable learning along the parameter trajectory within a reasonable time period.
Generally, input features are scaled to zero mean, unit variance (or to lie in [-1, 1]).
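The zero-mean, unit-variance scaling mentioned above can be done per feature as follows (a minimal sketch; the function name is our own):

```python
import numpy as np

def standardise(X):
    """Scale each input feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0        # guard against constant features
    return (X - mu) / sigma

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Xs = standardise(X)
```

The same mean and standard deviation computed on the training set should also be applied to the test and validation inputs, so all three sets see an identical transformation.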
Selecting the Size of the Hidden Layer
In building a non-linear model such as an MLP, the labelled data may be divided into 3 sets:
Training: used to learn the optimal parameter values
Testing: used to compare different model structures
Validation: used to get a final performance figure
The aim is to select a model that performs well on the test set, and use the validation set to obtain a final performance estimate.
[Figure: performance versus number of hidden nodes for the training (parameter estimation), testing (model selection) and validation (final performance) sets; the test-set curve is used to select the number of nodes in the hidden layer.]
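The three-way split described above can be sketched as follows (the 60/20/20 proportions and function name are our own illustrative choices):

```python
import numpy as np

def split_data(X, y, frac_train=0.6, frac_test=0.2, seed=0):
    """Shuffle and split labelled data into training / testing / validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random shuffle of pattern indices
    n_train = int(frac_train * len(X))
    n_test = int(frac_test * len(X))
    tr, te, va = np.split(idx, [n_train, n_train + n_test])
    return (X[tr], y[tr]), (X[te], y[te]), (X[va], y[va])

X = np.arange(20).reshape(10, 2).astype(float)
y = np.arange(10)
train, test, val = split_data(X, y)
```

Parameters are fitted on `train`, the hidden layer size is chosen by comparing test-set performance across candidate structures, and `val` is touched only once, for the final figure.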
Lecture 9&10: Conclusions
Multi-layer perceptrons are universal approximators: they can model any continuous function arbitrarily closely, given a sufficient number of hidden nodes (existence proof only).
They are used for both classification and regression problems, although with regression a linear transfer function is often used in the output layer so that the output is unbounded.
They are trained using gradient descent, which suffers from all the well-known disadvantages.
The training technique is sometimes known as error back propagation, because the output error is fed backwards to form the gradient signal of the hidden layer(s).
The number of hidden nodes, and the learning rate, need to be found experimentally, often using separate training, testing and validation data sets.
Lecture 9&10: Laboratory Session
Make sure you have the single-layer sigmoid algorithm, trained using gradient descent, working (see lab 7&8). This forms the main part of your assignment.
Extend this procedure to implement an MLP to solve the XOR problem. You should note that the output layer is equivalent to a single-layer sigmoid, and that all you have to add is the output and parameter update calculations for the hidden layer.
Make sure this works by monitoring the MSE and showing that it tends to 0 as the number of iterations increases; you'll need two hidden nodes.
Draw the logical function boundaries for each node to verify that the output is correct.