Lecture Neural Nets 2010


  • 8/14/2019 Lecture Neural Nets 2010

    Intelligent Control: Neural Networks for Identification, Prediction and Control

    Lecture 1: Introduction to Neural Networks


    Topics in Neural Networks

    Introduction


    Lecture Outline

    1. Introduction (2)
       i. Course introduction
       ii. Introduction to neural networks
       iii. Issues in neural networks

    2. Simple Neural Network (3)
       i. Perceptron
       ii. Adaline

    3. Multilayer Perceptron (4)
       i. Basics
       ii. Dynamics

    4. Radial Basis Networks (5)


    Introduction to Neural Networks


    Artificial Neuron

    [Figure: unit i receives the outputs y_j of units j through weights w_ij, plus a bias b.]

    $net_i = \sum_j w_{ij} y_j + b$


    Artificial Neuron

    Input/output signals may be:

    Real-valued.

    Unipolar {0, 1}.

    Bipolar {-1, +1}.

    Weight $w_{ij}$: strength of the connection.

    Note that $w_{ij}$ refers to the weight from unit j to unit i (not the other way round).


    Artificial Neuron

    The bias $b$ is a constant that can be written as $w_{i0} y_0$ with $y_0 = b$ and $w_{i0} = 1$, such that

    $net_i = \sum_{j=0}^{n} w_{ij} y_j$

    The function $f$ is the unit's activation function.

    In the simplest case, $f$ is the identity function, and the unit's output is just its net input. This is called a linear unit.

    Other activation functions are: step function, sigmoid function and Gaussian function.
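    The net input and the activation functions named above can be sketched in a few lines. This is only an illustration: the weights and inputs are made-up values, the step function is taken in its bipolar form, and the Gaussian is assumed to be exp(-net^2).

```python
import math

def net_input(weights, inputs, bias=0.0):
    """net_i = sum_j w_ij * y_j + b for a single unit i."""
    return sum(w * y for w, y in zip(weights, inputs)) + bias

def identity(net):      # linear unit: output equals net input
    return net

def step(net):          # bipolar step function (assumed form)
    return 1.0 if net > 0 else -1.0

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def gaussian(net):      # assumed form exp(-net^2)
    return math.exp(-net * net)

w_i = [0.5, -0.25, 1.0]   # hypothetical weights w_ij
y = [1.0, 2.0, 0.5]       # hypothetical outputs y_j of the sending units
net = net_input(w_i, y, bias=0.1)
for f in (identity, step, sigmoid, gaussian):
    print(f.__name__, f(net))
```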


    Activation Functions


    Artificial Neural Networks (ANN)

    [Figure: network diagram with the labels input vector, weight, activation function, signal routing, weight, activation function, output (vector).]


    Historical Development of ANN

    William James (1890): describes in words and figures simple distributed networks and Hebbian learning

    McCulloch & Pitts (1943): binary threshold units that perform logical operations (they prove universal computation)

    Hebb (1949): formulation of a physiological (local) learning rule

    Rosenblatt (1958): the perceptron, a first real learning machine

    Widrow & Hoff (1960): ADALINE and the Widrow-Hoff supervised learning rule


    Historical Development of ANN

    Kohonen (1982) : Self-organizing maps

    Hopfield (1982): Hopfield Networks

    Rumelhart, Hinton & Williams (1986): Back-propagation & Multilayer Perceptron

    Broomhead & Lowe (1988): Radial basis functions (RBF)

    Vapnik (1990s): support vector machine


    When Should an ANN Solution Be Considered?

    The solution to the problem cannot be explicitly described

    by an algorithm, a set of equations, or a set of rules.

    There is some evidence that an input-output mapping exists

    between a set of input and output variables.

    There should be a large amount of data available to train

    the network.


    Problems That Can Lead to Poor Performance

    The network has to distinguish between very similar cases with a very high degree of accuracy.

    The training data do not represent the range of cases that the network will encounter in practice.

    The network has several hundred inputs.

    The main discriminating factors are not present in the available data, e.g. trying to assess a loan application without knowing the applicant's salary.

    The network is required to implement a very complex function.


    Applications of Artificial Neural Networks

    Manufacturing: fault diagnosis, fraud detection.

    Retailing: fraud detection, forecasting, data mining.

    Finance: fraud detection, forecasting, data mining.

    Engineering: fault diagnosis, signal/image processing.

    Production: fault diagnosis, forecasting.

    Sales & Marketing: forecasting, data mining.


    Data Pre-processing

    Neural networks very rarely operate on the raw data. An initial pre-processing stage is essential. Some examples are as follows:

    Feature extraction of images: for example, the analysis of X-rays requires pre-processing to extract features which may be of interest within a specified region.

    Representing input variables with numbers: for example, "+1" if the person is married, "0" if divorced, and "-1" if single. Another example is representing the pixels of an image: 255 = bright white, 0 = black.

    To ensure the generalization capability of a neural network, the data should be encoded in a form which allows for interpolation.


    Data Pre-processing

    Categorical Variable

    A categorical variable is a variable that can belong to

    one of a number of discrete categories. For example,

    red, green, blue.

    Categorical variables are usually encoded using 1-out-of-n coding, e.g. for three colours, red = (1 0 0), green = (0 1 0), blue = (0 0 1).

    If we used red = 1, green = 2, blue = 3, then this type

    of encoding imposes an ordering on the values of the

    variables which does not exist.
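    A minimal sketch of 1-out-of-n coding for the colour example. The helper name `one_of_n` is hypothetical and chosen only for this illustration.

```python
def one_of_n(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]

colours = ["red", "green", "blue"]
for c in colours:
    print(c, one_of_n(c, colours))
```

    Each category gets its own input unit, so no ordering between red, green and blue is implied.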


    Data Pre-processing

    CONTINUOUS VARIABLES

    A continuous variable can be directly applied to a neural network. However, if the dynamic ranges of the input variables are not approximately the same, it is better to normalize all input variables of the neural network.


    Example of Normalized Input Vector

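    Input vector: $x = (2\ 4\ 5\ 6\ 10\ 4)^t$

    Mean of vector: $\mu = \frac{1}{6} \sum_{i=1}^{6} x_i = 5.167$

    Standard deviation: $\sigma = \sqrt{\frac{1}{6-1} \sum_{i=1}^{6} (x_i - \mu)^2} = 2.714$

    Normalized vector: $x_N = \frac{x - \mu}{\sigma} = (-1.17\ \ -0.43\ \ -0.06\ \ 0.31\ \ 1.78\ \ -0.43)^t$

    Mean of normalized vector is zero

    Standard deviation of normalized vector is unity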
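    The numbers in this example can be reproduced with the standard library. The sketch assumes the sample standard deviation (division by n - 1), which is what gives 2.714 for this vector.

```python
import statistics

x = [2, 4, 5, 6, 10, 4]
mean = statistics.mean(x)       # 31 / 6
std = statistics.stdev(x)       # sample standard deviation, divides by n - 1
x_norm = [(xi - mean) / std for xi in x]

print(round(mean, 3), round(std, 3))
print([round(v, 2) for v in x_norm])
```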


    Simple Neural Networks

    Lecture 3:

    Simple Perceptron


    Outlines

    The Perceptron

    Linearly separable problem

    Network structure

    Perceptron learning rule

    Convergence of Perceptron


    THE PERCEPTRON

    The perceptron was a simple model of ANN introduced by Rosenblatt of Cornell in 1958 with the idea of learning.

    The perceptron is designed to accomplish a simple pattern recognition task: after learning with real-valued training data

    {x(i), d(i), i = 1, 2, ..., p} where d(i) = 1 or -1

    for a new signal (pattern) x(i+1), the perceptron is capable of telling you to which class the new signal belongs:

    x(i+1) -> perceptron -> 1 or -1


    Perceptron

    Linear threshold unit (LTU)

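    [Figure: inputs x_1, x_2, ..., x_n with weights w_1, w_2, ..., w_n, plus x_0 = 1 with weight w_0 = b, feeding a summation unit with output o.]

    $\sum_{i=0}^{n} w_i x_i$

    $o(x) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$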


    Decision Surface of a Perceptron

    [Figure: two plots of + and - examples in the (x1, x2) plane, and an AND unit with weights w0, w1, w2.]

    The perceptron is able to represent some useful functions.

    AND(x1, x2): choose weights w0 = -1.5, w1 = 1, w2 = 1

    But functions that are not linearly separable (e.g. XOR) are not representable.
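    A small sketch checks the AND weights above on inputs in {0, 1} and then searches a coarse grid of weights for XOR. The helper `ltu` and the grid are illustrative choices, not part of the lecture; the search only illustrates (it does not prove) that no weight vector reproduces XOR.

```python
from itertools import product

def ltu(weights, inputs):
    """Linear threshold unit; weights[0] is w0, paired with x0 = 1."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1

and_weights = [-1.5, 1.0, 1.0]
for x1, x2 in product((0, 1), repeat=2):
    print(x1, x2, ltu(and_weights, (x1, x2)))

# Try every (w0, w1, w2) on a grid from -3 to 3 in steps of 0.5 for XOR.
grid = [i * 0.5 for i in range(-6, 7)]
xor = {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): -1}
found = any(all(ltu(w, x) == t for x, t in xor.items())
            for w in product(grid, repeat=3))
print("XOR representable on this grid:", found)
```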


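    Mathematically the Perceptron is

    $y = f\left(\sum_{i=1}^{m} w_i x_i + b\right) = f\left(\sum_{i=0}^{m} w_i x_i\right)$

    where f is the hard limiter function, i.e.

    $y = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} w_i x_i + b > 0 \\ -1 & \text{if } \sum_{i=1}^{m} w_i x_i + b \le 0 \end{cases}$

    We can always treat the bias b as another weight with input equal to 1.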


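    Why is the network capable of solving linearly separable problems?

    [Figure: the line $\sum_{i=1}^{m} w_i x_i + b = 0$ separates the region where $\sum_{i=1}^{m} w_i x_i + b > 0$ from the region where $\sum_{i=1}^{m} w_i x_i + b < 0$.]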


    Learning rule

    An algorithm to update the weights w so that finally the input patterns lie on both sides of the line decided by the perceptron.

    Let t be the time. At t = 0, we have

    $w(0) \cdot x = 0$

    [Figure sequence: at t = 1, 2, 3 the decision line is updated to $w(1) \cdot x = 0$, $w(2) \cdot x = 0$, $w(3) \cdot x = 0$.]


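    Perceptron learning rule

    In math:

    $d(t) = \begin{cases} +1 & \text{if } x(t) \text{ in class } + \\ -1 & \text{if } x(t) \text{ in class } - \end{cases}$

    $w(t+1) = w(t) + \eta(t) [d(t) - \mathrm{sign}(w(t) \cdot x(t))] x(t)$

    where $\eta(t)$ is the learning rate > 0, and sign is the hard limiter function

    $\mathrm{sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x \le 0 \end{cases}$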


    Example 1

    Logical OR function

    X1   X2   R
    -1   -1   -1
     1   -1    1
    -1    1    1
     1    1    1

    So let's set w0 = -0.6, w1 = 0.2 and w2 = 0.3, with x0 = 1.
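    A sketch of how the perceptron learning rule could be run on this table, taking the weights above as the starting point. The learning rate of 0.1, the cyclic presentation order and the function name `train_perceptron` are assumptions made for the illustration.

```python
def sign(v):
    return 1 if v > 0 else -1

def train_perceptron(w, patterns, eta=0.1, max_epochs=100):
    """Apply w <- w + eta * (d - sign(w.x)) * x until an epoch has no errors."""
    w = list(w)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, d in patterns:
            y = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:
                errors += 1
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
        if errors == 0:
            return w, epoch
    return w, max_epochs

# Each pattern is ((x0, x1, x2), R) with x0 = 1, taken from the OR table.
or_patterns = [((1, -1, -1), -1), ((1, 1, -1), 1), ((1, -1, 1), 1), ((1, 1, 1), 1)]
w, epochs = train_perceptron([-0.6, 0.2, 0.3], or_patterns)
print("weights:", w, "epochs:", epochs)
```

    Because OR is linearly separable, the loop ends with an error-free epoch well before `max_epochs`.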


    Example 2

    Logical AND function

    X1   X2   R
    -1   -1   -1
     1   -1   -1
    -1    1   -1
     1    1    1


    In words:

    If the classification is right, do not update the weights.

    If the classification is not correct, update the weights in the opposite direction so that the output moves closer to the right value.


    Perceptron convergence theorem (Rosenblatt, 1962)

    Let the subsets of training vectors be linearly separable. Then after a finite number of learning steps we have

    lim w(t) = w, which correctly separates the samples.

    The idea of the proof is to consider ||w(t+1) - w|| - ||w(t) - w||, which shows that the distance ||w(t) - w|| is a decreasing function of t.


    Summary of Perceptron learning

    Variables and parameters

    x(t) = (m+1)-dim. input vector at time t = ( b, x1(t), x2(t), ..., xm(t) )

    w(t) = (m+1)-dim. weight vector = ( 1, w1(t), ..., wm(t) )

    b = bias

    y(t) = actual response

    $\eta_t$ = learning rate parameter, a positive constant < 1

    d(t) = desired response


    Summary of Perceptron learning

    Data {(x(i), d(i)), i = 1, ..., p}

    Present the data to the network one point at a time. The order

    could be cyclic: (x(1), d(1)), (x(2), d(2)), ..., (x(p), d(p)), (x(p+1), d(p+1)), ...

    or random. (Hence we mix time t with i here.)


    Summary of Perceptron learning (algorithm)

    1. Initialization: set w(0) = 0. Then perform the following computation for time step t = 1, 2, ...

    2. Activation: at time step t, activate the perceptron by applying input vector x(t) and desired response d(t).

    3. Computation of actual response: compute the actual response of the perceptron

       $y(t) = \mathrm{sign}(w(t) \cdot x(t))$

       where sign is the sign function.

    4. Adaptation of weight vector: update the weight vector of the perceptron

       $w(t+1) = w(t) + \eta_t [d(t) - y(t)] x(t)$

    5. Continuation: increment time step t by one and go back to step 2.


    Questions remain

    Where or when to stop?

    By minimizing the generalization error

    For training data {(x(i), d(i)), i = 1, ..., p}

    How to define training error after t steps of learning?

    $E(t) = \sum_{i=1}^{p} [d(i) - \mathrm{sign}(w(t) \cdot x(i))]^2$


    [Figure: after learning t steps, E(t) = 0.]


    How to define generalization error?

    For a new signal {x(t+1),d(t+1)}, we have

    $E_g = [d(t+1) - \mathrm{sign}(x(t+1) \cdot w(t))]^2$

    [Figure: after learning t steps.]


    We next turn to ADALINE learning, from which we can understand the learning rule and, more generally, Back-Propagation (BP) learning.


    Simple Neural Network

    ADALINE Learning


    Outlines

    ADALINE

    Gradient descent learning

    Modes of training


    Unhappy over Perceptron Training

    When a perceptron gives the right answer, no learning takes place.

    Anything below the threshold is interpreted as "no", even if it is just below the threshold.

    It might be better to train the neuron based on how far below the threshold it is.


    ADALINE

    ADALINE is an acronym for ADAptive LINear Element (or ADAptive LInear NEuron), developed by Bernard Widrow and Marcian Hoff (1960).

    There are several variations of Adaline. One has a threshold, the same as the perceptron, and another is just a bare linear function.

    The Adaline learning rule is also known as the least-mean-squares (LMS) rule, the delta rule, or the Widrow-Hoff rule.

    It is a training rule that minimizes the output error using an (approximate) gradient descent method.


    Replace the step function in the perceptron with a continuous (differentiable) function f; the simplest choice is a linear function.

    With or without the threshold, the Adaline is trained based on the output of the function f rather than the final output.

    [Figure: Adaline unit with activation f(x).]


    After each training pattern x(i) is presented, the correction to apply to the weights is proportional to the error.

    $E(i,t) = [d(i) - f(w(t) \cdot x(i))]^2, \quad i = 1, ..., p$

    N.B. If f is a linear function, $f(w(t) \cdot x(i)) = w(t) \cdot x(i)$

    Summing together, our purpose is to find w which minimizes

    $E(t) = \sum_i E(i,t)$


    General approach: gradient descent method

    To find g in

    $w(t+1) = w(t) + g(E(w(t)))$

    so that w automatically tends to the global minimum of E(w):

    $w(t+1) = w(t) - \eta_t \nabla E(w(t))$

    (see figure below)


    The gradient direction is the uphill direction. For example, in the figure, at position 0.4, the gradient points uphill (F is E; consider the one-dimensional case).

    [Figure: curve F with the gradient direction F'(0.4) marked.]


    In the gradient descent algorithm, we have

    $w(t+1) = w(t) - \eta_t F'(w(t))$

    therefore the ball goes downhill, since $-F'(w(t))$ is the downhill direction.

    [Figure: gradient direction at w(t).]


    Gradually the ball will stop at a local minimum, where the gradient is zero.

    [Figure: gradient direction at w(t+k).]
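    The rolling-ball picture can be tried on a one-dimensional example. The error surface F, the starting point 0.4 and the constant learning rate are assumptions chosen for this sketch; they are not the curve shown in the lecture's figure.

```python
def F(w):
    return (w - 2.0) ** 2 + 1.0       # hypothetical one-dimensional error surface

def dF(w):
    return 2.0 * (w - 2.0)            # its gradient F'(w)

w, eta = 0.4, 0.1
for t in range(100):
    w = w - eta * dF(w)               # w(t+1) = w(t) - eta * F'(w(t))

print("w:", w, "F(w):", F(w), "F'(w):", dF(w))
```

    At the starting point the gradient is negative, so each step moves w to the right, and the steps shrink as the gradient approaches zero near w = 2.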


    In words

    The gradient method can be thought of as a ball rolling down a hill: the ball will roll down and finally stop in the valley.

    Thus, the weights are adjusted by

    $w_j(t+1) = w_j(t) + \eta_t [d(i) - f(w(t) \cdot x(i))] x_j(i) f'$

    This corresponds to gradient descent on the quadratic error surface E.

    When f' = 1, we have the perceptron learning rule (we have in general f' > 0 in neural networks). The ball moves in the right direction.
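    The weight adjustment above can be sketched for a linear unit (f' = 1), updating after every pattern. The bipolar AND data, the zero starting weights, the learning rate 0.1 and the 50 epochs are illustrative assumptions.

```python
def adaline_epoch(w, patterns, eta):
    """One sequential pass of the delta rule with linear f, so f' = 1."""
    for x, d in patterns:
        y = sum(wi * xi for wi, xi in zip(w, x))            # f(w.x) = w.x
        w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
    return w

def squared_error(w, patterns):
    return sum((d - sum(wi * xi for wi, xi in zip(w, x))) ** 2
               for x, d in patterns)

# Bipolar AND with x0 = 1 as the bias input (data chosen for illustration).
patterns = [((1, -1, -1), -1), ((1, 1, -1), -1), ((1, -1, 1), -1), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]
print("E before:", squared_error(w, patterns))
for _ in range(50):
    w = adaline_epoch(w, patterns, eta=0.1)
print("E after:", squared_error(w, patterns), "w:", w)
```

    Unlike the perceptron rule, the weights keep changing even when every pattern is already on the correct side of the threshold, because the error is measured on the linear output.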


    About LMS Learning Rule

    Suppose we have a set of input vectors, {x_1, x_2, ..., x_L}, each having its own, perhaps unique, correct or desired output value, d_k, k = 1, ..., L. The problem of finding a single weight vector that can successfully associate each input vector with its desired output value is no longer simple.

    Calculation of w*. To begin, let's state the problem a little differently: given examples (x_1, d_1), (x_2, d_2), ..., (x_L, d_L) of some processing function that associates input vectors, x_k, with (or maps to) the desired output values, d_k, what is the best weight vector, w*? Consider that the output can be written in vector notation:

    $y = w^t x$

    If the actual output value is y_k for the kth input vector, then the corresponding error term is $\varepsilon_k = d_k - y_k$


    More About LMS

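    The mean squared error, or expectation value of the error, is defined by

    $\langle \varepsilon_k^2 \rangle = \frac{1}{L} \sum_{k=1}^{L} \varepsilon_k^2$

    where L is the number of input vectors in the training set. We can expand the mean squared error $\xi = \langle \varepsilon_k^2 \rangle$ as follows:

    $\xi = \langle (d_k - w^t x_k)^2 \rangle = \langle d_k^2 \rangle + w^t \langle x_k x_k^t \rangle w - 2 \langle d_k x_k^t \rangle w = \langle d_k^2 \rangle + w^t R w - 2 p^t w$

    where $R = \langle x_k x_k^t \rangle$ is called the input correlation matrix, and p is the vector $p = \langle d_k x_k \rangle$.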


    More

    To find the weight vector corresponding to the minimum mean squared error, we differentiate the mean squared error $\xi(w)$, evaluate the result at w*, and set the result equal to zero:

    $\frac{\partial \xi(w)}{\partial w} = 2 R w - 2 p$

    $2 R w^* - 2 p = 0$

    $R w^* = p$

    $w^* = R^{-1} p$
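    As a check on the closed-form result, the sketch below builds R and p for a small data set and solves R w* = p. The bipolar AND data and the hand-written `solve` helper (Gaussian elimination, assuming R is invertible) are illustrative choices only.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w

# Input vectors x_k (with x0 = 1) and desired outputs d_k for bipolar AND.
xs = [(1, -1, -1), (1, 1, -1), (1, -1, 1), (1, 1, 1)]
ds = [-1, -1, -1, 1]
L, n = len(xs), len(xs[0])

R = [[sum(x[i] * x[j] for x in xs) / L for j in range(n)] for i in range(n)]  # <x x^t>
p = [sum(d * x[i] for x, d in zip(xs, ds)) / L for i in range(n)]             # <d x>
w_star = solve(R, p)
print("R:", R)
print("p:", p)
print("w*:", w_star)
```

    For this data set R is the identity matrix, so w* equals p.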


    Two types of network training:

    Sequential mode (on-line, stochastic, or per-pattern): weights updated after each pattern is presented (Perceptron is in this class)

    Batch mode (off-line or per-epoch): weights updated after all patterns are presented


    Comparison of Perceptron and Gradient Descent Rules

    Perceptron learning rule guaranteed to succeed if:
    Training examples are linearly separable
    Sufficiently small learning rate

    Linear unit training rule uses gradient descent; guaranteed to converge to hypothesis with minimum squared error:
    Given sufficiently small learning rate
    Even when training data contains noise
    Even when training data not separable by hyperplane


    Renaissance of Perceptron

    [Figure: the Perceptron leads to the Multi-Layer Perceptron (Back-Propagation, 1980s) and to the Support Vector Machine (Learning Theory, 1990s).]


    Summary of Previous Lectures

    Perceptron

    $w(t+1) = w(t) + \eta(t) [d(t) - \mathrm{sign}(w(t) \cdot x)] x$

    Adaline (Gradient descent method)

    $w(t+1) = w(t) + \eta(t) [d(t) - f(w(t) \cdot x)] x f'$


    Multi-Layer Perceptron (MLP)

    Idea: Credit assignment problem

    Problem of assigning credit or blame to the individual elements (hidden units) involved in forming the overall response of a learning system.

    In neural networks, the problem relates to deciding which weights should be altered, by how much and in which direction.


    Example: Three-layer networks

    [Figure: inputs x1, x2, ..., xn pass through an input layer, a hidden layer and an output layer to the output; signal routing runs from input to output.]


    Properties of architecture

    No connections within a layer

    No direct connections between input and output layers

    Fully connected between layers

    Often more than 2 layers

    Number of output units need not equal number of input units

    Number of hidden units per layer can be more or less than input or output units

    Each unit is a perceptron:

    $y_i = f\left(\sum_{j=1}^{m} w_{ij} x_j + b_i\right)$


    BP(Back Propagation)


    MultiLayer Perceptron I

    Back Propagating Learning


    BP learning algorithm

    Solution to the credit assignment problem in MLP

    Rumelhart, Hinton and Williams (1986)

    BP has two phases:

    Forward pass phase: computes the functional signal, i.e. feedforward propagation of input pattern signals through the network

    Backward pass phase: computes the error signal, i.e. propagation of the error (difference between actual and desired output values) backwards through the network, starting at the output units


    BP Learning for Simplest MLP

    [Figure: input I, weight w(t), hidden output y, weight W(t), output O.]

    Task: given data {I, d}, minimize

    $E = (d - o)^2 / 2 = [d - f(W(t) y(t))]^2 / 2 = [d - f(W(t) f(w(t) I))]^2 / 2$

    (error function at the output unit), where $y = f(w(t) I)$ is the output of the hidden unit.

    The weights at time t are w(t) and W(t); we intend to find the weights w and W at time t+1.


    Backward pass phase

    [Figure: input I, weight w(t), hidden output y, weight W(t), output O.]

    $o = f(W(t) y), \quad E = (d - o)^2 / 2$

    $W(t+1) = W(t) - \eta \frac{dE}{dW(t)} = W(t) - \eta \frac{dE}{df} \frac{df}{dW(t)} = W(t) + \eta (d - o) f'(W(t) y) y = W(t) + \eta \Delta y$

    where $\Delta = (d - o) f'$


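    Backward pass phase

    [Figure: input I, weight w(t), hidden output y, weight W(t), output O.]

    $o = f(W(t) y) = f(W(t) f(w(t) I))$

    $w(t+1) = w(t) - \eta \frac{dE}{dw(t)} = w(t) - \eta \frac{dE}{dy} \frac{dy}{dw(t)} = w(t) + \eta (d - o) f'(W(t) y) W(t) \frac{dy}{dw(t)} = w(t) + \eta \Delta W(t) f'(w(t) I) I$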
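    The two update formulas for the simplest MLP can be tried numerically. The sketch assumes a logistic f and made-up values for w, W, I, d and the learning rate; the names `bp_step` and `error` are hypothetical.

```python
import math

def f(z):
    return 1.0 / (1.0 + math.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

def error(w, W, I, d):
    o = f(W * f(w * I))
    return (d - o) ** 2 / 2

def bp_step(w, W, I, d, eta):
    y = f(w * I)                                        # forward pass: hidden unit
    o = f(W * y)                                        # forward pass: output unit
    Delta = (d - o) * f_prime(W * y)                    # error signal at the output unit
    new_W = W + eta * Delta * y
    new_w = w + eta * Delta * W * f_prime(w * I) * I    # uses the old W(t)
    return new_w, new_W

w, W, I, d = 0.5, -0.3, 1.0, 0.9                        # hypothetical starting values
before = error(w, W, I, d)
for t in range(200):
    w, W = bp_step(w, W, I, d, eta=0.5)
print("E before:", before, "E after:", error(w, W, I, d))
```

    Both weights are updated from the values at time t, which is why `new_w` is computed with the old W rather than `new_W`.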


    General Two Layer Network

    I inputs, O outputs, w connections for input units, W connections for output units, y is the activity of input unit

    net(t) = network input to the unit at time t

    [Figure: I, weights w, input units with activity y, weights W, output units, O.]


    Forward pass

    Weights are fixed during forward & backward pass at time t

    1. Compute values for hidden units

       $net_j(t) = \sum_i w_{ji}(t) I_i(t)$

       $y_j = f(net_j(t))$

    2. Compute values for output units

       $Net_k(t) = \sum_j W_{kj}(t) y_j$

       $O_k = f(Net_k(t))$

    [Figure: I_i, weight w_ji(t), y_j, weight W_kj(t), O_k.]

    Backward Pass

    Recall the delta rule: the error measure for pattern n is

    $E(t) = \frac{1}{2} \sum_{k} (d_k(t) - O_k(t))^2$

    We want to know how to modify weights in order to decrease E, where

    $w_{ji}(t+1) - w_{ji}(t) \propto -\frac{\partial E(t)}{\partial w_{ji}(t)}$

    both for hidden units and output units.

    This can be rewritten as a product of two terms using the chain rule.


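    $\frac{\partial E(t)}{\partial w_{ji}(t)} = \frac{\partial E(t)}{\partial net_j(t)} \cdot \frac{\partial net_j(t)}{\partial w_{ji}(t)}$

    Term A, $\partial E(t) / \partial net_j(t)$: how the error for the pattern changes as a function of a change in the network input to unit j

    Term B, $\partial net_j(t) / \partial w_{ji}(t)$: how the net input to unit j changes as a function of a change in weight w

    both for hidden units and output units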


    Summary: weight updates are local

    $w_{ji}(t+1) = w_{ji}(t) + \eta \delta_j(t) I_i(t)$    (hidden unit)

    $W_{kj}(t+1) = W_{kj}(t) + \eta \Delta_k(t) y_j(t)$    (output unit)

    output unit

    $W_{kj}(t+1) = W_{kj}(t) + \eta \Delta_k(t) y_j(t) = W_{kj}(t) + \eta (d_k(t) - O_k(t)) f'(Net_k(t)) y_j(t)$

    hidden unit

    $w_{ji}(t+1) = w_{ji}(t) + \eta \delta_j(t) I_i(t) = w_{ji}(t) + \eta f'(net_j(t)) \sum_k \Delta_k(t) W_{kj} I_i(t)$

    Once weight changes are computed for all units, weights are updated at the same time (bias included as weights here).

    We now compute the derivative of the activation function f( ).


    Activation Functions

    To compute $\delta_j$ and $\Delta_k$ we need to find the derivative of the activation function f.

    To find the derivative, the activation function must be smooth.

    Sigmoidal (logistic) function, common in MLP:

    $f(net_i(t)) = \frac{1}{1 + \exp(-k \, net_i(t))}$

    where k is a positive constant. The sigmoidal function gives values in the range of 0 to 1.

    Input-output function of a neuron (rate coding assumption)

    Shape of sigmoidal function


    Note: when net = 0, f = 0.5
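    A short sketch of the logistic function and its derivative, using the identity f' = k f (1 - f) for this function. The function names and the sample points are illustrative.

```python
import math

def sigmoid(net, k=1.0):
    return 1.0 / (1.0 + math.exp(-k * net))

def sigmoid_prime(net, k=1.0):
    fx = sigmoid(net, k)
    return k * fx * (1.0 - fx)       # derivative expressed through the output itself

for net in (-4.0, 0.0, 4.0):
    print(net, sigmoid(net), sigmoid_prime(net))
```

    The derivative is largest at net = 0, where f = 0.5, and falls off towards both extremes.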


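    Returning to the local error gradients in the BP algorithm, we have for output units

    $\Delta_i(t) = (d_i(t) - O_i(t)) f'(Net_i(t)) = (d_i(t) - O_i(t)) \, k \, O_i(t) (1 - O_i(t))$

    For hidden units we have

    $\delta_i(t) = f'(net_i(t)) \sum_k \Delta_k(t) W_{ki} = k \, y_i(t) (1 - y_i(t)) \sum_k \Delta_k(t) W_{ki}$

    Since the degree of weight change is proportional to the derivative of the activation function, weight changes will be greatest when a unit receives a mid-range functional signal rather than a signal at the extremes.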


    Summary of BP learning algorithm

    Set learning rate $\eta$

    Set initial weight values (incl. biases): w, W

    Loop until stopping criteria satisfied:

        present input pattern to input units
        compute functional signal for hidden units
        compute functional signal for output units
        present target response to output units
        compute error signal for output units
        compute error signal for hidden units
        update all weights at same time
        increment n to n+1 and select next I and d

    end loop
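    The loop above can be sketched for a small two-layer network. Everything concrete here is an assumption made for illustration: logistic units with k = 1, a 3-2-1 network whose third input is a constant 1 acting as the bias, the OR patterns, the learning rate, a fixed number of iterations as the stopping criterion, and the function names.

```python
import math
import random

def f(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, W, I):
    """w[j][i]: input-to-hidden weights, W[k][j]: hidden-to-output weights."""
    y = [f(sum(wj[i] * I[i] for i in range(len(I)))) for wj in w]
    O = [f(sum(Wk[j] * y[j] for j in range(len(y)))) for Wk in W]
    return y, O

def error(w, W, I, d):
    _, O = forward(w, W, I)
    return 0.5 * sum((dk - Ok) ** 2 for dk, Ok in zip(d, O))

def bp_update(w, W, I, d, eta):
    y, O = forward(w, W, I)                                      # functional signals
    Delta = [(d[k] - O[k]) * O[k] * (1 - O[k]) for k in range(len(O))]
    delta = [y[j] * (1 - y[j]) * sum(Delta[k] * W[k][j] for k in range(len(O)))
             for j in range(len(y))]
    # All weights are updated at the same time, from the values at time t.
    new_W = [[W[k][j] + eta * Delta[k] * y[j] for j in range(len(y))]
             for k in range(len(O))]
    new_w = [[w[j][i] + eta * delta[j] * I[i] for i in range(len(I))]
             for j in range(len(y))]
    return new_w, new_W

random.seed(0)
n_in, n_hid, n_out = 3, 2, 1                                     # hypothetical sizes
w = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
W = [[random.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_out)]
patterns = [([0, 0, 1], [0]), ([0, 1, 1], [1]), ([1, 0, 1], [1]), ([1, 1, 1], [1])]
for n in range(2000):                                            # fixed-length loop
    I, d = patterns[n % len(patterns)]
    w, W = bp_update(w, W, I, d, eta=0.5)
print("total error:", sum(error(w, W, I, d) for I, d in patterns))
```

    A real stopping criterion would monitor the error rather than run a fixed number of iterations; the fixed loop only keeps the sketch short.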


    Exercise

    For the network shown in the figure, calculate the expressions for the weight changes using the EBP algorithm in an on-line learning mode. The training data consist of the input pattern vectors x = [x1 x2] and the desired output responses d = [d1 d2].

    Network training:


    Training set shown repeatedly until stopping criteria are met

    Each full presentation of all patterns = epoch

    Randomise the order of training patterns presented for each epoch, in order to avoid correlation between consecutive training pairs being learnt (order effects)

    Two types of network training:

    Sequential mode (on-line, stochastic, or per-pattern): weights updated after each pattern is presented

    Batch mode (off-line or per-epoch): weights updated after all patterns are presented

    Advantages and disadvantages of different modes

    Sequential mode:

    Less storage for each weighted connection

    Random order of presentation and updating per pattern means search of weight space is stochastic, reducing risk of local minima

    Able to take advantage of any redundancy in the training set (i.e. same pattern occurs more than once in training set, esp. for large training sets)

    Simpler to implement

    Batch mode:

    Faster learning than sequential mode


    MultiLayer Perceptron II

    Dynamics of MultiLayer Perceptron

    Summary of Network Training


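    Forward phase: I(t), w(t), net(t), y(t), W(t), Net(t), O(t)

    Backward phase:

    Output unit

    $W_{kj}(t+1) = W_{kj}(t) + \eta \Delta_k(t) y_j(t) = W_{kj}(t) + \eta (d_k(t) - O_k(t)) f'(Net_k(t)) y_j(t)$

    Input unit

    $w_{ji}(t+1) = w_{ji}(t) + \eta \delta_j(t) I_i(t) = w_{ji}(t) + \eta f'(net_j(t)) \sum_k \Delta_k(t) W_{kj}(t) I_i(t)$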


    Network training:

    Despite this sound theoretical foundation concerning the representational capabilities of neural networks, and notwithstanding the success of the EBP learning algorithm, there are many practical drawbacks to the EBP algorithm.

    The most troublesome is the usually long training process, which does not ensure that the absolute minimum of the cost function (the best performance of the network) will be achieved.


    Network training:

    The algorithm may become stuck at some local minimum, and such a termination with a suboptimal solution will require repetition of the whole learning process by changing the structure or some of the learning parameters that influence the iterative scheme.

    As in many other scientific and engineering disciplines, so in the field of artificial neural networks, the theory (or at least part of it) was established only after a number of practical neural network applications had been implemented.


    Goals of Neural Network Training


    To give the correct output for input training vectors (Learning)

    To give good responses to new unseen input patterns (Generalization)


    Training and Testing Problems

    Stuck neurons: The size of a weight change is proportional to the derivative of the activation function, so weight changes are greatest when a unit receives a mid-range function signal and smallest at the extremes, where the neuron saturates. To avoid stuck neurons, weight initialization should give all neurons outputs of approximately 0.5.

    Insufficient number of training patterns: In this case, the training patterns will be learnt instead of the underlying relationship between inputs and output, i.e. the network just memorizes the patterns.

    Too few hidden neurons: The network will not produce a good model of the problem.

    Over-fitting: The training patterns will be learnt instead of the underlying function between inputs and output because there are too many hidden neurons. This means that the network will have a poor generalization capability.
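    The stuck-neuron effect can be checked numerically. The sketch below assumes the logistic sigmoid as activation function (an assumption; the slide only implies it through the mid-range output of 0.5) and prints its derivative at mid-range and at the extremes.

```python
import math

def sigmoid(net):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_derivative(net):
    """Derivative of the logistic function, written in terms of its output."""
    y = sigmoid(net)
    return y * (1.0 - y)

# The derivative peaks at net = 0 (output 0.5) and vanishes for large |net|.
for net in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(net, round(sigmoid(net), 4), round(sigmoid_derivative(net), 6))
```

    Because the weight change is scaled by this derivative, a unit driven to the extremes receives almost no correction.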

    Dynamics of BP learning

    The aim is to minimise an error function over all training patterns by adapting the weights in the MLP.

    Recall that the typical error function is the mean squared error:

    $$E(t) = \frac{1}{2} \sum_{k=1}^{p} \bigl(d_k(t) - O_k(t)\bigr)^2$$

    The idea is to reduce E(t) to its global minimum.

    In a single-layer perceptron with linear activation functions, the error function is simple: it is described by a smooth parabolic surface with a single minimum.

    Selecting Initial Weight Values

    The choice of initial weight values is important because it decides the starting position in weight space, that is, how far from the global minimum training begins.

    The aim is to select weight values which produce midrange function signals.

    Select weight values randomly from a uniform probability distribution.

    Normalise the weight values by the number of weighted connections per unit, so that each unit produces a midrange function signal.
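    A minimal sketch of such an initialization follows. The 1/sqrt(fan_in) scaling is one common heuristic and is an assumption here; the slides do not state which normalization is meant. The name `init_weights` and the logistic output at the end are likewise illustrative.

```python
import math
import random

def init_weights(fan_in, rng, scale=1.0):
    """Draw fan_in weights uniformly from [-scale/sqrt(fan_in), +scale/sqrt(fan_in)]."""
    limit = scale / math.sqrt(fan_in)
    return [rng.uniform(-limit, limit) for _ in range(fan_in)]

rng = random.Random(42)
weights = init_weights(fan_in=100, rng=rng)

# With inputs in [-1, 1] the net input stays small, so a sigmoid unit
# starts out close to its mid-range output of 0.5.
inputs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
net = sum(w * x for w, x in zip(weights, inputs))
output = 1.0 / (1.0 + math.exp(-net))
print(net, output)
```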

    Convergence of Backprop

    Avoid local minima while keeping convergence fast:

    Add momentum

    Use stochastic gradient descent

    Train multiple nets with different initial weights

    Nature of convergence:

    Initialize weights near zero, so the initial network is near-linear

    Increasingly non-linear functions become possible as training progresses

    Use of Available Data Set for Training

    The available data set is normally split into three sets as follows:

    Training set: used to update the weights. Patterns in this set are presented repeatedly in random order. The weight update equations are applied after a certain number of patterns.

    Validation set: used only to decide when to stop training, by monitoring the error.

    Test set: used to test the performance of the neural network. It should not be used as part of the neural network development cycle.
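    The three-way split can be sketched as follows. The 60/20/20 proportions, the fixed seed and the name `split_data` are assumptions for illustration; the lecture does not prescribe particular fractions.

```python
import random

def split_data(patterns, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the patterns and cut them into training, validation and test sets."""
    shuffled = patterns[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]        # whatever remains is held out
    return train, val, test

patterns = list(range(50))                   # stand-in for 50 input/target pairs
train_set, val_set, test_set = split_data(patterns)
print(len(train_set), len(val_set), len(test_set))
```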

    Early Stopping - Good Generalization

    Running too many epochs may overtrain the network, resulting in overfitting and poor generalization.

    Keep a hold-out validation set and test accuracy after every epoch. Maintain the weights of the network that performs best on the validation set, and stop training when the validation error increases beyond this best value.

    [Figure: error versus number of epochs for the training set and the validation set]
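    The stopping rule can be sketched on a recorded sequence of validation errors. The `patience` parameter (how many epochs without improvement are tolerated) is an assumption added for this example, and the error values are made up.

```python
def early_stopping(val_errors, patience=3):
    """Return (best_epoch, best_error, stop_epoch) for a sequence of validation errors."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error    # a real trainer would save the weights here
        elif epoch - best_epoch >= patience:
            return best_epoch, best_error, epoch     # validation error kept rising: stop
    return best_epoch, best_error, len(val_errors) - 1

val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70]
print(early_stopping(val_errors))
```

    The weights saved at `best_epoch`, not the ones from the final epoch, are the ones to keep.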

    Model Selection by Cross-validation

    Too few hidden units prevent the network from adequately fitting the data and learning the concept.

    Too many hidden units lead to overfitting.

    Similar cross-validation methods can be used to determine an appropriate number of hidden units, by selecting the model (number of hidden layers and nodes) that gives the lowest error on held-out validation data.

    [Figure: error versus number of epochs for the training set and the validation set]

    Genetic Algorithms

    Alternative training algorithm

    Historical Background

    The idea of evolutionary computing was introduced in the 1960s by I. Rechenberg in his work "Evolution strategies" ("Evolutionsstrategie" in the original). His idea was then developed by other researchers. Genetic Algorithms (GAs) were invented by John Holland and developed by him and his students and colleagues. This led to Holland's book "Adaptation in Natural and Artificial Systems", published in 1975.

    In 1992 John Koza used genetic algorithms to evolve programs to perform certain tasks. He called his method "Genetic Programming" (GP). LISP programs were used because programs in this language can be expressed in the form of a "parse tree", which is the object the GA works on.

    Biological Background

    Chromosome.

    All living organisms consist of cells. Each cell contains the same set of chromosomes. Chromosomes are strings of DNA and serve as a model for the whole organism. A chromosome consists of genes, which are blocks of DNA. Each gene encodes a particular protein. Basically, it can be said that each gene encodes a trait, for example eye color. Possible settings for a trait (e.g. blue, brown) are called alleles. Each gene has its own position in the chromosome. This position is called the locus.

    The complete set of genetic material (all chromosomes) is called the genome. A particular set of genes in the genome is called the genotype. The genotype, together with later development after birth, is the basis for the organism's phenotype: its physical and mental characteristics, such as eye color, intelligence, etc.

    http://cs.felk.cvut.cz/~xobitko/ga/dna.html

    Reproduction.

    During reproduction, recombination (or crossover) occurs first. Genes from the parents combine in some way to form a whole new chromosome. The newly created offspring can then be mutated. Mutation means that the elements of DNA are changed slightly. These changes are mainly caused by errors in copying genes from the parents.

    The fitness of an organism is measured by the success of the organism in its life.
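    In a genetic algorithm these two operators are usually applied to bit strings. The sketch below shows single-point crossover and bit-flip mutation; the choice of these particular operator variants, the crossover point and the mutation rate are assumptions for illustration.

```python
import random

def crossover(parent_a, parent_b, point):
    """Single-point crossover: the head of one parent joined to the tail of the other."""
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )

child_a, child_b = crossover("11010", "01011", point=2)
print(child_a, child_b)
print(mutate(child_a, rate=0.1, rng=random.Random(1)))
```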

    Evolutionary Computation

    Based on evolution as it occurs in nature

    Lamarck, Darwin, Wallace: evolution of species, survival of the fittest

    Mendel: genetics provides the inheritance mechanism

    Hence genetic algorithms

    Essentially a massively parallel search procedure

    Start with a random population of individuals

    Gradually move to better individuals

    A Simple Genetic Algorithm

    Optimization task: find the maximum of f(x), for example $f(x) = x \sin(x)$, $x \in [0, \pi]$

    genotype: binary string $s \in \{0,1\}^5$, e.g. 11010, 01011, 10001

    mapping: genotype → phenotype

    binary integer encoding, with n = 5:

    $$x = \pi \cdot \frac{\sum_{i=0}^{n-1} s_i \, 2^{n-i-1}}{2^n - 1}$$

    Initial population

    genotype   integer   phenotype   fitness   prop. fitness
    11010      26        2.6349      1.2787    30%
    01011      11        1.1148      1.0008    24%
    10001      17        1.7228      1.7029    40%
    00101       5        0.5067      0.2459     6%
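    The initial-population table can be recomputed with a short sketch. It assumes that the leftmost bit of the genotype is the most significant one, which is the reading consistent with the integer column; the function names are illustrative.

```python
import math

def decode(genotype):
    """Map a bit string to a phenotype x in [0, pi] (binary integer encoding)."""
    n = len(genotype)
    return math.pi * int(genotype, 2) / (2 ** n - 1)

def fitness(x):
    """Objective function from the slide: f(x) = x * sin(x)."""
    return x * math.sin(x)

population = ["11010", "01011", "10001", "00101"]
scores = [fitness(decode(g)) for g in population]
total = sum(scores)

# One row per individual: genotype, integer, phenotype, fitness, proportional fitness.
for g, f in zip(population, scores):
    print(g, int(g, 2), round(decode(g), 4), round(f, 4), f"{f / total:.0%}")
```

    The proportional fitness is the share of the total fitness held by each individual, which is the quantity typically used as the selection probability in roulette-wheel selection.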

    Some Other Issues Regarding Evolutionary Computing

    Evolution according to Lamarck.

    Individual adapts during lifetime.

    Adaptations inherited by children.

    In nature, genes don't change; but for computations we could allow this...

    Baldwin effect.

    Individual's ability to learn has a positive effect on evolution.

    It supports a more diverse gene pool.

    Thus, more experimentation with genes is possible.

    Bacteria and virus.

    New evolutionary computing strategies.

    Radial Basis Functions

    Introduction

    The back-propagation algorithm for the design of an MLP (under supervision) may be viewed as the application of a recursive technique known in statistics as stochastic approximation.

    Here we take a completely different approach, viewing the design of a neural network as a curve-fitting (approximation) problem in a high-dimensional space.

    According to this viewpoint, learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data.

    Radial-basis function (RBF) networks

    RBF = radial-basis function: a function which depends only on the radial distance from a point

    XOR problem

    quadratically separable

    So RBFs are functions taking the form

    $$\phi(\|x - x_i\|)$$

    where $\phi$ is a nonlinear activation function, $x$ is the input and $x_i$ is the i-th position, prototype, basis or centre vector.

    The idea is that points near the centres will have similar outputs (i.e. if $x \approx x_i$ then $f(x) \approx f(x_i)$), since they should have similar properties.

    The simplest is the linear RBF: $\phi(x) = \|x - x_i\|$

    Radial Functions

    Characteristic feature: their response decreases (or increases) monotonically with distance from a central point.

    The center, the distance scale, and the precise shape of the radial function are parameters of the model, all fixed if it is linear.

    Typical radial functions are:

    The Gaussian RBF (monotonically decreases with distance from the center).

    A multiquadric RBF (monotonically increases with distance from the center).

    Typical RBFs include

    (a) Multiquadrics

    $$\phi(r) = (r^2 + c^2)^{1/2}$$ for some $c > 0$

    (b) Inverse multiquadrics

    $$\phi(r) = (r^2 + c^2)^{-1/2}$$ for some $c > 0$

    (c) Gaussian

    $$\phi(r) = \exp\left(-\frac{r^2}{2\sigma^2}\right)$$ for some $\sigma > 0$
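    The three radial functions translate directly into code. This is a minimal sketch; the default values c = 1 and sigma = 1 and the helper `radius` are assumptions made for the example.

```python
import math

def multiquadric(r, c=1.0):
    """phi(r) = (r^2 + c^2)^(1/2), increases with r."""
    return math.sqrt(r * r + c * c)

def inverse_multiquadric(r, c=1.0):
    """phi(r) = (r^2 + c^2)^(-1/2), decreases with r."""
    return 1.0 / math.sqrt(r * r + c * c)

def gaussian(r, sigma=1.0):
    """phi(r) = exp(-r^2 / (2 sigma^2)), decreases with r."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))

def radius(x, centre):
    """Euclidean distance ||x - centre|| between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, centre)))

r = radius([3.0, 4.0], [0.0, 0.0])
print(r, multiquadric(r), inverse_multiquadric(r), gaussian(r))
```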

    [Figure: nonlocalized functions and localized functions]

    Starting point: exact interpolation

    Each input pattern x must be mapped onto a target value d

    Single-layer networks

    [Figure: network with inputs $y_1, y_2, \dots, y_p$, a layer of N basis functions $\phi_1(y) = \phi(\|y - x_1\|), \dots, \phi_N(y) = \phi(\|y - x_N\|)$, weights $w_j$, and one output compared with the target d]

    $$\text{output} = \sum_i w_i \, \phi_i(y - x_i)$$

    adjustable parameters are the weights $w_j$

    number of hidden units = number of data points

    form of the basis functions is decided in advance
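    Exact interpolation amounts to solving a square linear system for the weights, with one basis function centred on each data point. The sketch below assumes Gaussian basis functions with a made-up width and made-up targets; the small Gaussian-elimination solver is included only to keep the example self-contained.

```python
import math

def gaussian(r, sigma):
    return math.exp(-(r * r) / (2.0 * sigma * sigma))

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w

xs = [0.0, 1.0, 2.0, 3.0]      # data points, which double as the RBF centres
ds = [0.5, 0.9, 0.1, 0.6]      # target values (made up for illustration)
sigma = 0.7                    # width of the basis functions (assumed)

# Phi[j][i] = phi(|x_j - x_i|): one hidden unit per data point.
Phi = [[gaussian(abs(x - c), sigma) for c in xs] for x in xs]
weights = solve(Phi, ds)

def network(y):
    """Weighted sum of the basis functions evaluated at input y."""
    return sum(w * gaussian(abs(y - c), sigma) for w, c in zip(weights, xs))

print([round(network(x), 6) for x in xs])
```

    The network reproduces every target exactly at the data points; nothing constrains its output between them.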

    [Figure: exact interpolation with a large width, σ = 1]

    [Figure: exact interpolation with a small width, σ = 0.2]

    Problems with exact interpolation

    Can produce poor generalisation performance, as only the data points constrain the mapping

    Overfitting problem

    Bishop (1995) example

    Underlying function $f(x) = 0.5 + 0.4 \sin(2\pi x)$, sampled randomly for 30 points

    Gaussian noise added to each data point

    30 data points, 30 hidden RBF units

    Fits all data points but creates oscillations, due to the added noise and because the mapping is unconstrained between data points

    [Figure: two panels, "All Data Points" and "5 Basis functions"]

    Learning Algorithms

    Redes Neurais e Lógica Fuzzy: Radial Basis Function Network

    Parameters to be learnt are:

    centers

    spreads

    weights

    Different learning algorithms

    Learning Algorithm 1

    Centers are selected at random: center locations are chosen randomly from the training set.

    Spreads are chosen by normalization:

    $$\sigma = \frac{\text{maximum distance between any 2 centers}}{\sqrt{\text{number of centers}}} = \frac{d_{\max}}{\sqrt{m_1}}$$

    The activation function of hidden unit i is then

    $$\phi_i\left(\|x - t_i\|^2\right) = \exp\left(-\frac{m_1}{d_{\max}^2} \, \|x - t_i\|^2\right), \quad i = 1, \dots, m_1$$

    Weights are found by means of the pseudo-inverse method:

    $$w = \Phi^{+} d$$

    where d is the desired response and $\Phi^{+}$ is the pseudo-inverse of the matrix $\Phi = \{\phi_{ji}\}$ with

    $$\phi_{ji} = \exp\left(-\frac{m_1}{d_{\max}^2} \, \|x_j - t_i\|^2\right), \quad j = 1, 2, \dots, N, \quad i = 1, 2, \dots, m_1$$
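    Learning Algorithm 1 can be sketched end to end for one-dimensional inputs and two centers. The sketch relies on the fact that, when Φ has full column rank, the pseudo-inverse solution equals the least-squares solution of the normal equations; with two centers that 2x2 system is solved by Cramer's rule. The data, the seed and the function names are assumptions made for this example.

```python
import math
import random

def design_matrix(xs, centres):
    """phi_ji = exp(-(m1 / d_max^2) * (x_j - t_i)^2) for 1-D inputs."""
    m1 = len(centres)
    d_max = max(abs(a - b) for a in centres for b in centres)
    return [[math.exp(-(m1 / d_max ** 2) * (x - t) ** 2) for t in centres] for x in xs]

def least_squares_2(Phi, d):
    """w = (Phi^T Phi)^-1 Phi^T d for a two-column Phi, via Cramer's rule."""
    a = sum(row[0] * row[0] for row in Phi)
    b = sum(row[0] * row[1] for row in Phi)
    c = sum(row[1] * row[1] for row in Phi)
    p = sum(row[0] * t for row, t in zip(Phi, d))
    q = sum(row[1] * t for row, t in zip(Phi, d))
    det = a * c - b * b
    return [(c * p - b * q) / det, (a * q - b * p) / det]

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
d = [0.1, 0.4, 0.9, 1.0, 0.8, 0.3, 0.1]       # made-up desired responses
centres = random.Random(0).sample(xs, 2)      # centers drawn at random from the training set
Phi = design_matrix(xs, centres)
w = least_squares_2(Phi, d)

# Residual of the fit at each training point.
residual = [sum(wi * p for wi, p in zip(w, row)) - t for row, t in zip(Phi, d)]
print(centres, w)
```

    With only two hidden units the fit is no longer exact; the weights minimise the sum of squared residuals instead.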

    Learning Algorithm 2: Centers

    K-means clustering algorithm for centers

    1. Initialization: $t_k(0)$ random, $k = 1, \dots, m_1$

    2. Sampling: draw x from the input space

    3. Similarity matching: find the index of the best center

    $$k(x) = \arg\min_k \|x(n) - t_k(n)\|$$

    4. Updating: adjust the centers

    $$t_k(n+1) = \begin{cases} t_k(n) + \eta \, [x(n) - t_k(n)] & \text{if } k = k(x) \\ t_k(n) & \text{otherwise} \end{cases}$$

    5. Continuation: increment n by 1, go to 2 and continue until no noticeable changes of the centers occur
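    The steps above can be sketched for one-dimensional data. Two simplifications are assumed: the initial centers are passed in explicitly instead of being drawn at random, and the loop runs for a fixed number of steps instead of testing for convergence. The learning rate and the sample values are made up.

```python
import random

def kmeans_online(samples, centres, lr=0.1, steps=500, seed=0):
    """Online k-means: move only the closest center toward each drawn sample."""
    rng = random.Random(seed)
    centres = centres[:]                       # 1. initialization (copy of the given start)
    for _ in range(steps):
        x = rng.choice(samples)                # 2. sampling
        k = min(range(len(centres)),           # 3. similarity matching
                key=lambda i: abs(x - centres[i]))
        centres[k] += lr * (x - centres[k])    # 4. updating the winning center only
    return centres                             # 5. fixed step count instead of a convergence test

samples = [-0.2, 0.1, 0.3, 9.8, 10.1, 10.4]    # two well-separated clusters
centres = kmeans_online(samples, centres=[0.1, 9.8])
print(centres)
```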

    Learning Algorithm 3

    Supervised learning of all the parameters using the gradient descent method.

    Modify centers:

    $$\Delta t_j = -\eta_{t_j} \frac{\partial E}{\partial t_j}$$

    where $\eta_{t_j}$ is the learning rate for $t_j$ and E is the instantaneous error function

    $$E = \frac{1}{2} \sum_{i=1}^{N} e_i^2$$

    Depending on the specific function, the derivative can be computed using the chain rule of calculus.

    Modify spreads:

    $$\Delta \sigma_j = -\eta_{\sigma_j} \frac{\partial E}{\partial \sigma_j}$$

    Modify output weights:

    $$\Delta w_{ij} = -\eta_{ij} \frac{\partial E}{\partial w_{ij}}$$
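    For a concrete case, the sketch below applies these updates to a single Gaussian unit y = w * exp(-(x - t)^2 / (2 sigma^2)) with scalar input. The Gaussian form, the shared learning rate, the toy data and the starting values are all assumptions; the three gradients follow from the chain rule applied to E = 1/2 * sum(e_i^2) with e_i = d_i - y(x_i).

```python
import math

def model(x, w, t, sigma):
    """Output of one Gaussian RBF unit with weight w, center t and spread sigma."""
    return w * math.exp(-((x - t) ** 2) / (2.0 * sigma ** 2))

def error(data, w, t, sigma):
    """E = 1/2 * sum of squared errors over the data set."""
    return 0.5 * sum((d - model(x, w, t, sigma)) ** 2 for x, d in data)

def gradient_step(data, w, t, sigma, lr):
    """One batch gradient-descent step on the weight, the center and the spread."""
    gw = gt = gs = 0.0
    for x, d in data:
        phi = math.exp(-((x - t) ** 2) / (2.0 * sigma ** 2))
        e = d - w * phi
        gw += -e * phi                                   # dE/dw
        gt += -e * w * phi * (x - t) / sigma ** 2        # dE/dt
        gs += -e * w * phi * (x - t) ** 2 / sigma ** 3   # dE/dsigma
    return w - lr * gw, t - lr * gt, sigma - lr * gs

# Toy data generated by a known unit (w = 2, t = 1, sigma = 0.5).
data = [(x / 2.0, model(x / 2.0, 2.0, 1.0, 0.5)) for x in range(-2, 7)]
w, t, sigma = 1.0, 0.5, 1.0
initial_error = error(data, w, t, sigma)
for _ in range(500):
    w, t, sigma = gradient_step(data, w, t, sigma, lr=0.02)
final_error = error(data, w, t, sigma)
print(initial_error, final_error)
```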

    Design Radial Basis Networks

    net = newrb(P,T,goal,spread,MN,DF)

    P       R-by-Q matrix of Q input vectors
    T       S-by-Q matrix of Q target class vectors
    goal    Mean squared error goal (default = 0.0)
    spread  Spread of radial basis functions (default = 1.0)
    MN      Maximum number of neurons (default is Q)
    DF      Number of neurons to add between displays (default = 25)

    Examples

    Here you design a radial basis network, given inputs P and targets T.

    P = [1 2 3];
    T = [2.0 4.1 5.9];
    net = newrb(P,T);

    The network is simulated for a new input.

    P = 1.5;
    Y = sim(net,P)

    Comparison with multilayer NN

    RBF networks are used to perform complex (non-linear) pattern classification tasks.

    Comparison between RBF networks and multilayer perceptrons:

    Both are examples of non-linear layered feed-forward networks.

    Both are universal approximators.

    Hidden layers: RBF networks have a single hidden layer. MLP networks may have more hidden layers.

    Neuron models: The computation nodes in the hidden layer of an RBF network are different from those in the output layer and serve a different purpose. Typically, the computation nodes of an MLP, whether in a hidden or an output layer, share a common neuron model.

    Linearity: The hidden layer of an RBF network is non-linear and its output layer is linear. The hidden and output layers of an MLP are usually non-linear.

    Activation functions: The argument of the activation function of each hidden unit in an RBF NN is the Euclidean distance between the input vector and the center of that unit. The argument of the activation function of each hidden unit in an MLP is the inner product of the input vector and the synaptic weight vector of that unit.

    Approximations: RBF NNs using Gaussian functions construct local approximations to a non-linear I/O mapping. MLP NNs construct global approximations to a non-linear I/O mapping.

    Application Examples

    Nonlinear Identification, Prediction and Control

    Nonlinear System Identification

    Target function: yp(k+1) = f(.)

    Identified function: yNET(k+1) = F(.)

    Estimation error: e(k+1)

    Nonlinear System Neural Control

    d: reference/desired response

    y: system output/desired output

    u: system input/controller output

    : desired controller input

    u*: NN output

    e: controller/network error

    The goal of training is to find an appropriate plant control u from the desired response d. The weights are adjusted based on the difference between the outputs of networks I and II to minimise e. If network I is trained so that y = d, then u = u*. The networks act as inverse dynamics identifiers.

    Nonlinear System Identification

    [Figure: neural network input generation, Pm]

    [Figure: neural network target, Tm]

    [Figure: neural network response (angle & velocity)]

    Model Reference Control

    Neural controller + nonlinear system diagram

    Neural controller, reference model, neural model

    Matlab NNtool GUI (Graphical User Interface)