Martin Brown - Multi-Layer Perceptrons


    Lecture 9&10: Multi-Layer Perceptrons

    [Figure: a two-layer MLP with inputs x1, x2 and bias x0 = 1, hidden nodes h1, h2 with bias h0 = 1, and a single output y]

    Dr Martin Brown

    Room: E1k
    Email: [email protected]

    Telephone: 0161 306 4672

    http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/


    Lecture 9&10: Outline

    Layered sigmoidal models (multi-layer perceptrons, MLPs)

    1. Network structure and modelling abilities

    2. Gradient descent for MLPs (error back propagation, EBP)

    3. Example: learning the XOR solution

    4. Variations on/extensions to basic, non-linear gradient descent parameter estimation

    MLPs are non-linear in both:

    Inputs/features, so the models can form non-linear decision boundaries and non-linear regression surfaces

    Parameters, so gradient descent can only be shown to converge to a local minimum


    Lecture 9&10: Resources

    These slides are largely self-contained, but extra background material can be found in:

    Machine Learning, T Mitchell, McGraw Hill, 1997

    Machine Learning, Neural and Statistical Classification, D Michie, DJ Spiegelhalter and CC Taylor, 1994: http://www.amsta.leeds.ac.uk/~charles/statlog/

    In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back propagation (EBP); just search on Google

    Advanced text:

    Information Theory, Inference and Learning Algorithms, D MacKay, Cambridge University Press, 2003


    Multi-Layer Perceptron Networks

    Layered perceptron networks (with bi-polar/binary outputs) can realize any logical function; however, there is no simple way to estimate the parameters or to generalise the (single-layer) Perceptron convergence procedure

    Multi-layer perceptron (MLP) networks are a class of models

    that are formed from layered sigmoidal nodes, which can be used for regression or classification purposes.

    They are commonly trained using gradient descent on a mean squared error performance function, using a technique known as error back propagation in order to calculate the gradients.

    Widely applied to many prediction and classification problems over the past 15 years.


    Multi-Layer Perceptron Networks

    Use 2 or more layers of parameters where:

    Empty circles represent sigmoidal (tanh) nodes

    Solid circles represent real signals (inputs, biases & outputs)

    Arrows represent adjustable parameters

    Multi-Layer Perceptron networks can have:

    Any number of layers of parameters (but generally just 2)

    Any number of outputs (but generally just 1)

    Any number of nodes in the hidden layers (see Slide 14)

    [Figure: the two-layer MLP with inputs x1, x2 (bias x0 = 1), hidden nodes h1, h2 (bias h0 = 1) and output y, with the hidden-layer and output-layer parameters labelled $\theta^h$ and $\theta^o$]

    Exemplar Model Outputs

    MLP with two hidden nodes. The response surface resembles an impulse ridge because one sigmoid is subtracted from the other. This is a learnt solution to the XOR classification problem.

    This non-linear regression surface is generated by an MLP with three hidden nodes, and a linear transfer function in the output layer.


    Gradient Descent Parameter Estimation

    All of the model's parameters can be stacked up into a single vector $\theta$, and then gradient descent learning (the update below) is used:

    $\theta_0$ contains small, random values

    Performance function(s) are non-linear in $\theta$

    No direct solution

    Local minima are possible

    The learning rate is difficult to estimate because the local Hessian (second derivative matrix) varies across parameter space

    Gradient descent update:

        $\hat{\theta}_{k+1} = \hat{\theta}_k - \eta \left.\frac{\partial p}{\partial \theta}\right|_{\hat{\theta}_k}$

    Performance functions (per pattern, and summed over the data set):

        $p(t) = \tfrac{1}{2}\big(y(t) - \hat{y}(\mathbf{x}(t), \theta)\big)^2$

        $P = \tfrac{1}{2}\sum_t \big(y(t) - \hat{y}(\mathbf{x}(t), \theta)\big)^2$

    [Figure: performance surface with a single gradient descent step from $\hat{\theta}_k$ to $\hat{\theta}_{k+1}$]
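    A tiny illustrative sketch (not from the slides) of this generic update rule, applied to a toy quadratic performance function so that it runs stand-alone; grad_p stands in for the gradient $\partial p/\partial\theta$ of whatever model is being trained.

    import numpy as np

    def gradient_descent(grad_p, theta0, eta=0.05, n_steps=1000):
        """Generic gradient descent: theta_{k+1} = theta_k - eta * grad_p(theta_k)."""
        theta = np.array(theta0, dtype=float)
        for _ in range(n_steps):
            theta -= eta * grad_p(theta)    # step against the gradient
        return theta

    # Toy example: P(theta) = 0.5 * ||theta - 3||^2, whose gradient is theta - 3
    theta_hat = gradient_descent(lambda th: th - 3.0, theta0=np.zeros(2))
    print(theta_hat)    # converges towards [3, 3]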


    Output Layer Gradient Calculation

    Gradient descent update:

        $\hat{\theta}^o_{k+1} = \hat{\theta}^o_k - \eta\,\frac{\partial p}{\partial \theta^o}$

    For the t-th training pattern (writing $\mathbf{h}(t)$ for the hidden-layer outputs, including the bias $h_0 = 1$, which are the inputs to the output layer):

        $p(t) = \tfrac{1}{2}\big(y(t) - \hat{y}(t)\big)^2, \quad \hat{y} = f(u^o), \quad u^o = \theta^{oT}\mathbf{h}(t)$

    Using the chain rule:

        $\frac{\partial p}{\partial \theta^o} = \frac{\partial p}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial u^o}\,\frac{\partial u^o}{\partial \theta^o} = -\big(y(t) - \hat{y}(t)\big)\,f'(u^o)\,\mathbf{h}(t)$

    Giving an update rule:

        $\hat{\theta}^o_{k+1} = \hat{\theta}^o_k + \eta\,\big(y(t) - \hat{y}(t)\big)\,f'(u^o)\,\mathbf{h}(t)$

    Same as the derivation for a single-layer sigmoidal model, as described in lecture 7&8.


    Hidden Layer Gradient Calculation

    Analyze the path by which altering the jth hidden node's parameter vector $\theta^h_j$ affects the model's output, where:

        $u^h_j = \theta^{hT}_j\mathbf{x}(t), \quad h_j = f(u^h_j)$

    By the chain rule:

        $\frac{\partial p}{\partial \theta^h_j} = \frac{\partial p}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial u^o}\,\frac{\partial u^o}{\partial h_j}\,\frac{\partial h_j}{\partial u^h_j}\,\frac{\partial u^h_j}{\partial \theta^h_j}$

    Gradient expression (back error propagation):

        $\frac{\partial p}{\partial \theta^h_j} = -\big(y(t) - \hat{y}(t)\big)\,f'(u^o)\,\theta^o_j\,f'(u^h_j)\,\mathbf{x}(t) = -\big(y(t) - \hat{y}(t)\big)\,\delta^h_j\,\mathbf{x}(t)$

    [Figure: the signal path through the jth hidden node, $\mathbf{x} \rightarrow u^h_j \rightarrow h_j \rightarrow u^o \rightarrow \hat{y}$]

    MLP Iterative Parameter Estimation

    Randomly initialise all parameters in the network (to small values)

    For each parameter update:

        present each input pattern to the network & get the output

        calculate the update for each parameter according to:

            $\Delta\theta^l_{ij} = \eta\,\big(y(t) - \hat{y}(t)\big)\,\delta^l_j\,x^l_i(t)$

        where $x^l_i(t)$ is the ith input to layer l (the hidden-node output $h_i$ for the output layer), and:

            output layer: $\delta^o = f'(u^o)$

            hidden layer: $\delta^h_j = \delta^o\,\theta^o_j\,f'(u^h_j)$

        calculate the average parameter updates

        update the weights

    Stop when steps > max_steps or MSE < tolerance or the test MSE is at a minimum
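    Putting the steps above together, here is one possible batch-style EBP sketch for the bipolar XOR problem. It is an illustration of the procedure, not the lab solution; all names and settings (learning rate, stopping thresholds) are assumptions.

    import numpy as np

    def train_mlp_xor(n_hidden=2, eta=0.05, max_steps=20000, tol=1e-3, seed=0):
        # Bipolar XOR data: inputs and targets in {-1, +1}
        X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
        y = np.array([-1, 1, 1, -1], dtype=float)

        rng = np.random.default_rng(seed)
        theta_h = rng.normal(scale=0.1, size=(n_hidden, 3))   # hidden layer (bias + 2 inputs)
        theta_o = rng.normal(scale=0.1, size=n_hidden + 1)    # output layer (bias + hidden nodes)

        for step in range(max_steps):
            d_theta_h = np.zeros_like(theta_h)
            d_theta_o = np.zeros_like(theta_o)
            perf = 0.0
            for x_t, y_t in zip(X, y):                        # present each pattern
                x_aug = np.concatenate(([1.0], x_t))
                h = np.tanh(theta_h @ x_aug)
                h_aug = np.concatenate(([1.0], h))
                y_hat = np.tanh(theta_o @ h_aug)
                err = y_t - y_hat
                delta_o = 1.0 - y_hat ** 2                    # f'(u^o)
                delta_h = delta_o * theta_o[1:] * (1.0 - h ** 2)
                d_theta_o += eta * err * delta_o * h_aug      # accumulate updates
                d_theta_h += eta * err * np.outer(delta_h, x_aug)
                perf += 0.5 * err ** 2                        # per-pattern performance p(t)
            theta_o += d_theta_o / len(X)                     # average parameter updates
            theta_h += d_theta_h / len(X)                     # then update the weights
            if perf / len(X) < tol:                           # stop on error tolerance
                break
        return theta_h, theta_o, perf / len(X)

    theta_h, theta_o, final_perf = train_mlp_xor()
    print("final average performance:", final_perf)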


    Example: Learning the XOR Problem

    Performance history for the XOR data and an MLP with 2 hidden nodes. Note its non-monotonic behaviour, and the large number of iterations required ($\eta$ = 0.05, update after each datum).

    Learning histories for the 9 parameters in the MLP. Note that even when the MSE goes up, the parameters are heading towards optimal values.


    Example: Trained XOR Model

    The trained optimal model has a ridge where the target is +1 and plateaus out in the regions where the target is -1. Note that all inputs and targets are bipolar {-1, +1}, rather than binary.


    Basic Variations on Parameter Estimation

    Parameter updates can be performed:

    After each pattern is presented (LMS)

    After the complete data set has been presented (Batch)

    Generally, convergence is smoother in the latter case, though overall convergence may be slower

    Deciding when to stop learning is typically done by monitoring the performance and stopping when an acceptable level is reached, before the parameters become too large.

    The learning rate needs to be carefully selected to ensure stable learning along the parameter trajectory within a reasonable time period

    Generally, input features are scaled to zero mean, unit variance (or to lie within [-1, 1])
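    As a small illustration of that last point, one way to standardise the input features to zero mean and unit variance (the function name standardise is invented):

    import numpy as np

    def standardise(X):
        """Scale each input feature (column) to zero mean, unit variance."""
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0.0] = 1.0          # guard against constant features
        return (X - mean) / std, mean, std

    X = np.array([[1.0, 200.0], [2.0, 240.0], [3.0, 260.0]])
    X_scaled, mean, std = standardise(X)
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # approximately zeros and ones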


    Selecting the Size of the Hidden Layer

    In building a non-linear model such as an MLP, the labelled data may be divided into 3 sets:

    Training: used to learn the optimal parameter values

    Testing: used to compare different model structures

    Validation: used to get a final performance figure

    The aim is to select the model that performs well on the test set, and to use the validation set to obtain a final performance estimate. Use this procedure to select the number of nodes in the hidden layer.

    [Figure: performance versus number of hidden nodes for the training (parameter estimation), testing (model selection) and validation (final performance) sets]
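    A sketch of how this selection loop might look in code; train_and_test is a hypothetical callable that would train an MLP with the given number of hidden nodes and return its test-set MSE, and the stand-in version below exists only so the example runs.

    import numpy as np

    def select_hidden_nodes(X, y, candidates, train_and_test, seed=0):
        """Split data into training/testing/validation sets, train one model per
        candidate hidden-layer size, and pick the size with the best test MSE."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_train, n_test = int(0.6 * len(X)), int(0.2 * len(X))
        train, test, val = np.split(idx, [n_train, n_train + n_test])

        test_mse = {n: train_and_test(X[train], y[train], X[test], y[test], n)
                    for n in candidates}
        best = min(test_mse, key=test_mse.get)      # model selection on the test set
        # The final, unbiased performance figure comes from the validation set
        val_mse = train_and_test(X[train], y[train], X[val], y[val], best)
        return best, test_mse, val_mse

    # Stand-in for illustration only: ignores the data and simply prefers 3 hidden nodes
    def fake_train_and_test(X_tr, y_tr, X_te, y_te, n_hidden):
        return abs(n_hidden - 3) * 0.1 + 0.05

    X = np.random.default_rng(1).normal(size=(100, 2))
    y = np.zeros(100)
    print(select_hidden_nodes(X, y, candidates=[1, 2, 3, 4, 5],
                              train_and_test=fake_train_and_test))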


    Lecture 9&10: Conclusions

    Multi-layer perceptrons are universal approximators: they can model any continuous function arbitrarily closely, given a sufficient number of hidden nodes (existence proof only)

    Used for both classification and regression problems, although with regression a linear transfer function is often used in the output layer so that the output is unbounded.

    Trained using gradient descent, which suffers from all the well-known disadvantages.

    Sometimes known as error back propagation because the output error is fed backwards to form the gradient signal of the hidden layer(s).

    The number of hidden nodes and the learning rate need to be found experimentally, often using separate training, testing and validation data sets


    Lecture 9&10: Laboratory Session

    Make sure you have the single-layer sigmoid algorithm, trained using gradient descent, working (see lab 7&8). This forms the main part of your assignment.

    Extend this procedure to implement an MLP to solve the XOR problem. You should note that the output layer is equivalent to a single-layer sigmoid, and that all you have to add is the output and parameter update calculations for the hidden layer.

    Make sure this works by monitoring the MSE and showing that it tends to 0 as the number of iterations increases; you'll need two hidden nodes.

    Draw the logical function boundaries for each node to verify that the output is correct.
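    For the final check, a self-contained helper (names invented, tanh nodes assumed) that prints the trained network's output for each bipolar XOR pattern; pass in the hidden- and output-layer parameters from your own implementation.

    import numpy as np

    def check_xor(theta_h, theta_o):
        """Print the network output for each bipolar XOR pattern.
        theta_h: trained hidden-layer parameters, one row per hidden node (bias weight first)
        theta_o: trained output-layer parameter vector (bias weight first)
        """
        X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
        targets = np.array([-1, 1, 1, -1], dtype=float)
        for x_t, y_t in zip(X, targets):
            x_aug = np.concatenate(([1.0], x_t))
            h = np.tanh(theta_h @ x_aug)
            y_hat = np.tanh(theta_o @ np.concatenate(([1.0], h)))
            print(f"x = {x_t}, target = {y_t:+.0f}, output = {y_hat:+.3f}")

    Note that hidden node j switches sign along the line $\theta^{hT}_j [1, x_1, x_2] = 0$ in input space, so plotting these lines gives the logical function boundaries referred to above.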