Chapter 6: Artificial Neural Networks and Genetic Algorithms (ANN and GA)


    Neural Networks: Definition

    Neural computing is the study of networks of adaptable nodes

    which, through a process of learning from task examples, store

    experiential knowledge and make it available for use.


    What Are Neural Networks?

    A computing model, inspired by the mammalian neural system,

    composed of many simple, highly interconnected processing

    units.

Neural network models are algorithms for cognitive tasks, such as learning and optimization, which are in a loose sense based on

    concepts derived from research into the nature of the brain.


    What Are Neural Networks?

A neural network model is a directed graph with the following properties:

A state variable ni is associated with each node i.

A real-valued weight wij is associated with each link from node i to node j.

A real-valued bias θi is associated with each node i.

A transfer function fi(nj, wij, θi) is defined for each node i, which determines the state of node i.


    What Can ANN Do?

    Biological

    Modeling the retina

    Modeling brain disorders (ADD)

    Business

    Evaluate probability of oil in geological formation

    Identify and filter promotion and job applicants

    Mine corporate databases for business rules

    Financial

    Assessing credit risk

    Identify forgeries

    Interpret handwritten forms

    Predict portfolio and stock values


    What Can ANN Do?

    Manufacturing

    Automated robot control systems

    Control material flow

    Optimize production lines

    Quality inspection

    Medical

    Analyze speech in hearing aids

    Diagnose and prescribe treatment by symptoms

    Monitor surgery and recovery

Read X-rays and CT/PET scans


    What Can ANN Do?

    Military

    Classify radar and sonar signals

    Target acquisition and tracking

    Analyze intelligence inputs

    Optimizing scarce resources

    Signal processing

    Adaptive Noise Canceling

    Zip Code Reader

    Speech Recognition


    A Brief History

    First concepts

    Turing 1936

    McCulloch & Pitts 1943

    Hebb 1949

    Early steps 1950s - 1960s

    The perceptron

    ADALINE and MADALINE

    Excessive hype


    A Brief History

    Stunted growth 1969-1981

Perceptrons by Minsky and Papert

    Continued work

    Renewed interest

    The Hopfield model 1982

    Backpropagation rediscovered 1985 (first 1974 by

    Werbos)

    Radial Basis Functions - Broomhead & Lowe 1988


    A Quick Word About The Brain


    The Biological Neuron

Key parts: the cell body, the dendrites (which receive signals), the axons (which transmit signals) and the synapses (the junctions between neurons).


    Computers And The Brain

    We do not understand the brain

    The ANN model is only loosely based on the brain

The ANN model is a metaphor for the brain


    Computers vs. Neural Networks

Von Neumann machines           Neural networks
Few, strong processors         ~10^11 simple neurons
Serial processing              Parallel processing
Central control                No central control
~10^-9 s cycle time            ~10^-3 s cycle time
Bit data                       Voltage data
Not fault tolerant             Very robust
Fast numeric operations        Slow numeric operations
Slow high-level operations     Fast high-level operations
Learning?                      Learning!


    Building Blocks Of The Model

    The processing element

    The connections

    Learning methods


    Processing Element Building Block

    The basic building block of a neural network is the

    processing element (or node or unit).

A generalised node embodies the following elements:

inputs (plus bias)

    weights

    transfer function

    combining function

    activation function

    output(s)


    The function of a single node

    The job of a processing element is to receive a number of

    inputs (either from the external world or from other nodes

    or from itself) and to distribute a single output (either to

    the external world or to other nodes).


    Some Input Functions

    Weighted Summation

net = w1·x1 + w2·x2 + ... + wn·xn + bias

    where wi is the weight associated with the connection

    between an input and the processing element


    Some Input Functions

    Multiplication (or Product)

net = (w1·x1) * (w2·x2) * ... * (wn·xn)

similar to the weighted summation but the summation is replaced by the product

Maximum, Minimum, Majority

net = maxi (wi·xi)

net = mini (wi·xi)

net = 1 if Σi wi·xi > 0, else -1 (majority)
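
As an illustration only, here is a minimal Python sketch of these input (combining) functions; the function names and the use of plain lists are my own choices, not part of the slides.

    # Sketch of the input (combining) functions listed above.
    def weighted_sum(x, w, bias=0.0):
        return sum(wi * xi for wi, xi in zip(w, x)) + bias

    def product(x, w):
        net = 1.0
        for wi, xi in zip(w, x):
            net *= wi * xi
        return net

    def maximum(x, w):
        return max(wi * xi for wi, xi in zip(w, x))

    def minimum(x, w):
        return min(wi * xi for wi, xi in zip(w, x))

    def majority(x, w):
        # +1 if the weighted inputs sum to a positive value, otherwise -1
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

    x = [1.0, 0.5, 0.3]
    w = [-0.2, 0.04, 2.35]
    print(weighted_sum(x, w), product(x, w), maximum(x, w), minimum(x, w), majority(x, w))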


    Some Activation Functions

    Sigmoid

    maps an input into a value between zero and one

    Linear

where no transformation is applied to the outcome of the combining function

Hyperbolic tangent

similar to the sigmoid but the mapping is between -1 and 1

Step

where the transfer value equals 1 if the outcome of the combining function is greater than some threshold, otherwise it equals 0
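
A small sketch of these four activation functions, assuming the usual formulas (logistic sigmoid, identity, hyperbolic tangent, and a 0/1 step with threshold 0); the names are illustrative only.

    import math

    def sigmoid(net):                  # maps net into (0, 1)
        return 1.0 / (1.0 + math.exp(-net))

    def linear(net):                   # no transformation
        return net

    def tanh_act(net):                 # maps net into (-1, 1)
        return math.tanh(net)

    def step(net, threshold=0.0):      # 1 above the threshold, else 0
        return 1 if net > threshold else 0

    for f in (sigmoid, linear, tanh_act, step):
        print(f.__name__, f(0.705))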



    Closer Look At Transfer Functions

    Unipolar

    Sigmoid

Threshold

    Bipolar

    Sigmoid

    Sign


    The Connections

The connections are the only part of a neural network that changes during learning

    Connections may be either inhibitory or excitatory

    Connection strengths are expressed by weights


    The role of the weights

    Each input or node is connected to a processing element

    Graphically this is represented by an arc

    Each arc has a weight. The weight simply determines the

    influence (or strength) of an input to a processing element

Neuro-computing is concerned with identification of the correct set of weights


    An example of a single node

Assume a processing element receives 3 inputs: 1, 0.5 and 0.3.

If the combining function is the weighted summation and the weights are -0.2, 0.04 and 2.35,

then the result of the combining function is 1·(-0.2) + 0.5·0.04 + 0.3·2.35 = 0.525.


    An example of a single node

If the activation function is linear, f(x) = x, then the output is 0.525.


    An example of a single node

If the activation function is the sigmoid, f(x) = 1 / (1 + exp(-x)), then the output is 1 / (1 + exp(-0.525)) ≈ 0.628.
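
The computation in this worked example can be checked with a few lines of Python (a sketch; the variable names are mine).

    import math

    inputs  = [1.0, 0.5, 0.3]
    weights = [-0.2, 0.04, 2.35]

    net = sum(w * x for w, x in zip(weights, inputs))   # combining function: ~0.525
    print(net)                                          # linear activation returns net itself
    print(1.0 / (1.0 + math.exp(-net)))                 # sigmoid activation: ~0.628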


    Neural Networks Layers

    NN can be constructed using a number of processing

    elements

Rather than a chaotic construction, it is generally preferable to build neural networks using layers.

A neural network will have an input layer, an output layer and, in between, zero, one or more hidden layers.


    Neural Network Layers 2

    Depending on where a processing element is placed, it is

    categorised as an input, hidden or output processing

    element

Typically, but not necessarily, each processing element in a layer has the same transfer function

A NN with a 4-3-2 configuration is a 2- or 3-layer NN (depending on whether the input layer is counted) with 4 input nodes, 3 hidden nodes and 2 output nodes


    The Role of the Input Layer

    An input processing element receives input from the external

    world and simply sends the actual input to the processing

    elements of the next layer


    The Role of the Hidden Layer

    A hidden processing element receives its input from the

    nodes of the previous layer and the transformation of the

    input is sent to the next layer

    A hidden layer may be seen as a pre-processor


    The Role of the Output Layer

An output processing element delivers to the external world the representation of the original input after the transformations have taken place


    Connectivity Matters

    A number of different networks can be constructed - differ in

    terms of the connectivity pattern and the number of layers

Networks with no hidden layers are called single-layer networks

Networks with one or more hidden layers are called multi-layer networks

    If all connections lead from input to output then it is called

    a feed-forward network

    If there are connections in the opposite direction then it is

    called a feedback or recurrent network


Artificial Neural Network Models

    Single layer

    feedforward

    Multi layer

    feedforward

Recurrent (feedback)


    Calculations of a multi-layer feed-forward

    neural network

(Figure: a feed-forward network with inputs x1 and x2, hidden nodes x3 and x4, and output node x5; the connection weights shown are +1 and -1 and the node thresholds are 1.5 and 0.5.)


    Learning Laws

As we saw on the previous slide, the output with the current weights is wrong if we want to perform AND.

This brings us to the problem of finding the correct set of weights.

The process of identifying the correct set of weights is called the learning process, and it is characterised by a learning law.


    Learning Laws 2

    The purpose of a learning law is to locate the set of weights

    which will give correct answers for all the inputs

The learning is achieved by employing an algorithm which iteratively changes the weights of the connections in response to every set of inputs until the correct weights have been located


    Learning Laws 3

Most learning laws are based on Hebb's rule, which states that

    if two units are simultaneously active, increase the

    strength of the connection between them

    This rule is the basis for most learning laws used today

(Kohonen learning, Boltzmann learning, the Delta rule)


    Some Learning Rules

Hebbian learning rule: Δwij = c · f(neti) · xj

Perceptron learning rule: Δwij = c · [di − sgn(neti)] · xj

Delta learning rule: Δwij = c · (di − oi) · f′(neti) · xj

Widrow-Hoff learning rule: Δwij = c · (di − neti) · xj

where neti = Σj wij·xj is the weighted input of node i, c the learning constant, di the desired output and oi = f(neti) the actual output.
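
A compact sketch of the four update rules in the notation reconstructed above, assuming a learning constant c, an input vector x, a desired output d, and tanh as an example activation for the rules that need f and f′; each function returns the change for the single weight wij.

    import math

    def net_of(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def sgn(v):
        return 1.0 if v > 0 else -1.0

    def hebbian(c, w, x, j, f=math.tanh):
        return c * f(net_of(w, x)) * x[j]

    def perceptron(c, w, x, d, j):
        return c * (d - sgn(net_of(w, x))) * x[j]

    def delta(c, w, x, d, j, f=math.tanh):
        o = f(net_of(w, x))
        f_prime = 1.0 - o * o            # derivative of tanh
        return c * (d - o) * f_prime * x[j]

    def widrow_hoff(c, w, x, d, j):
        return c * (d - net_of(w, x)) * x[j]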


    Learning Methods

    Supervised approach

    a neural network is given a set of inputs and also the

    correct output


    Learning Methods 2

    Unsupervised approach

    a neural network is given a set of inputs and no outputs.

    The network attempts to generate its own classes


    Learning Methods 3

    Reinforcement approach

    a neural network is given a set of inputs and no outputs.

The network generates an output and only then is it told whether the produced output was correct or not

    Learn by doing


    Single-Layer Perceptrons

Network architecture: inputs x1, x2, x3 with weights w1, w2, w3, a bias weight w0 and output y.

y = signum(net) or y = step(net)

net = Σi xi·wi − θ
    = Σi xi·wi + w0, where w0 = −θ
    = Σi xi·wi, where the sum now runs from i = 0 and x0 = 1

signum(net) = 1 if net > 0, else −1

step(net) = 1 if net > 0, else 0


    Example I - The AND Function

A single node with inputs X1 and X2, weights W1 = 1 and W2 = 1, threshold 2 and output O.

Desired mapping: (1,1) ---> 1, all other inputs ---> 0 (the node fires when the weighted sum reaches the threshold).
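
A short sketch checking this AND node, assuming the reading of the figure in which the node fires when the weighted sum reaches the threshold (W1 = W2 = 1, threshold 2); the helper name is mine.

    def and_node(x1, x2, w1=1.0, w2=1.0, theta=2.0):
        net = x1 * w1 + x2 * w2
        return 1 if net >= theta else 0          # fires when the weighted sum reaches the threshold

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, '->', and_node(x1, x2))   # only (1, 1) gives 1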


    Single-Layer Perceptrons

If the response is correct, no modification takes place; otherwise the weights are updated by the perceptron learning rule:

Δwij = c · [di − sgn(neti)] · xj

An entire pass through all of the input training vectors is called an epoch. When such an entire pass of the training set has occurred without error, training is complete.
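
Below is a sketch of single-layer perceptron training with this rule, assuming bipolar (+1/−1) targets, a signum output, zero initial weights and a learning constant c = 0.1; it stops after an error-free epoch. An illustration, not code from the slides.

    def sgn(v):
        return 1 if v > 0 else -1

    def train_perceptron(samples, n_inputs, c=0.1, max_epochs=100):
        weights = [0.0] * (n_inputs + 1)          # weights[0] is the bias weight
        for epoch in range(max_epochs):
            errors = 0
            for x, d in samples:                  # d is the desired output (+1 or -1)
                x = [1.0] + list(x)               # leading 1 pairs with the bias weight
                y = sgn(sum(w * xi for w, xi in zip(weights, x)))
                if y != d:                        # modify weights only on a wrong response
                    errors += 1
                    for j in range(len(weights)):
                        weights[j] += c * (d - y) * x[j]
            if errors == 0:                       # an error-free epoch: training is complete
                break
        return weights

    and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
    print(train_perceptron(and_samples, n_inputs=2))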


    Limitations

    Perceptron networks have several limitations.

    First, the output values of a perceptron can take on only one

    of two values (True or False).

Second, perceptrons can only classify linearly separable sets of vectors. If a straight line or plane can be drawn to

    separate the input vectors into their correct categories, the

    input vectors are linearly separable and the perceptron will

    find the solution. If the vectors are not linearly separable

    learning will never reach a point where all vectors are

    classified properly.

    The most famous example is the boolean XOR problem.


    The XOR problem

In the 1960s perceptrons created a great deal of interest, until

M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge MA, 1969

showed that single-layer perceptrons can only be used for toy problems, since they cannot represent a simple XOR function.


    The XOR problem 2

The task is to assign a binary input vector to class 0 if the vector has an even number of 1s, and to class 1 otherwise.

A two-input binary XOR truth table:

x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   0


    The XOR problem 3

    Recall that the output of a perceptron is given as follows:

    1 if the weighted input is greater than 0

    0 otherwise

The first input of XOR is (0, 0) with desired output 0, hence the weighted input must be less than or equal to zero in order to get the desired output:

0·w1 + 0·w2 + 1·w0 ≤ 0

w0 ≤ 0


    The XOR problem 4

The second input of XOR is (0, 1) with desired output 1, hence the weighted input must be greater than zero in order to get the desired output:

0·w1 + 1·w2 + 1·w0 > 0

w2 + w0 > 0


    The XOR problem 5

The third input of XOR is (1, 0) with desired output 1, hence the weighted input must be greater than zero in order to get the desired output:

1·w1 + 0·w2 + 1·w0 > 0

w1 + w0 > 0


    The XOR problem 6

The fourth input of XOR is (1, 1) with desired output 0, hence the weighted input must be less than or equal to zero in order to get the desired output:

1·w1 + 1·w2 + 1·w0 ≤ 0

w1 + w2 + w0 ≤ 0


    The XOR problem 7

In summary, the perceptron requires satisfying the following four inequalities:

w0 ≤ 0
w2 + w0 > 0
w1 + w0 > 0
w1 + w2 + w0 ≤ 0

The first inequality tells us that w0 must be less than or equal to zero. Adding the 2nd and 3rd gives w1 + w2 + 2·w0 > 0, so w1 + w2 + w0 > −w0 ≥ 0, which contradicts the 4th inequality, which says that this sum must be negative or zero.
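
The contradiction can also be confirmed by brute force: the sketch below scans a grid of candidate weights and finds no (w0, w1, w2) that satisfies all four inequalities (the grid range and step size are arbitrary choices).

    def satisfies_all(w0, w1, w2):
        return (w0 <= 0 and
                w2 + w0 > 0 and
                w1 + w0 > 0 and
                w1 + w2 + w0 <= 0)

    grid = [i / 10.0 for i in range(-50, 51)]     # -5.0 to 5.0 in steps of 0.1
    found = [(w0, w1, w2) for w0 in grid for w1 in grid for w2 in grid
             if satisfies_all(w0, w1, w2)]
    print(found)                                  # prints [] -- no solution on the grid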


    Linear Separability

For binary inputs and outputs using the step function, the output is 1 if the net input is positive and 0 otherwise.

The equation net_input = 0 represents, for two inputs, a line (the decision boundary).

If there are weights such that all of the training input vectors for which the correct response is 1 lie on one side of the decision line and all of the training input vectors for which the correct response is 0 lie on the other side of the boundary, then the problem is linearly separable.



    The XOR problem 8

    The XOR problem is not linearly separable

    We can not use a single-layer perceptron to construct a

    straight line to partition the two dimensional input

    space into two regions, each containing only data

    points of the same class

(Figure: the four XOR input points on the unit square; the class-1 points (0,1) and (1,0) cannot be separated from the class-0 points (0,0) and (1,1) by a single straight line.)


    Multi-Layer Perceptrons

    The lack of suitable training methods for multi-layer

    perceptrons (MLPs) led to a waning of interest until the

    reformulation of the backpropagation training method

Previous work used signum or step activation functions, which are non-differentiable; now continuous activation functions are employed


    Multi-Layer Perceptrons 2

    All nodes (or neurons) perform the same function on

    incoming signals

    a composite of the weighted sum and a differentiable

nonlinear activation function, together known as the transfer function


    Multi Layer Feedforward Networks

    The layers that are neither input nor output are called hidden

    layers

Hidden layers extract high-order statistics and in a way provide an overall view of the input data.

The output of each layer is used as input to the next layer.

There is no theoretical limit on connections between non-neighboring layers.


    MLP Architecture 2-2-1

(Figure: a 2-2-1 MLP with inputs x1 and x2 at the input level, hidden units h1 and h2 at the intermediate (hidden) level, and output y at the output level.)


    Activation Functions

Logistic function

f(net) = 1 / (1 + e^(-net))

Hyperbolic tangent function

f(net) = tanh(net/2) = (1 - e^(-net)) / (1 + e^(-net)) = (2 / (1 + e^(-net))) - 1 = (e^(net/2) - e^(-net/2)) / (e^(net/2) + e^(-net/2))

Identity function

f(net) = net

where net is the weighted input
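
The chain of equalities can be checked numerically; note that the last form uses half-argument exponentials, since (e^net - e^(-net)) / (e^net + e^(-net)) would equal tanh(net) rather than tanh(net/2). A quick sketch:

    import math

    for net in (-2.0, -0.5, 0.0, 1.0, 3.0):
        a = math.tanh(net / 2.0)
        b = (1 - math.exp(-net)) / (1 + math.exp(-net))
        c = 2.0 / (1 + math.exp(-net)) - 1
        d = (math.exp(net / 2) - math.exp(-net / 2)) / (math.exp(net / 2) + math.exp(-net / 2))
        assert abs(a - b) < 1e-12 and abs(a - c) < 1e-12 and abs(a - d) < 1e-12
    print("all four expressions agree")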


    Activation Functions 2

Logistic and hyperbolic tangent functions

approximate the step and signum functions respectively, but they provide smooth, non-zero derivatives with respect to the input signals

referred to as squashing functions, since the inputs to these functions are squashed into the range [0,1] or [-1,1]

referred to as sigmoidal functions because of their S-shaped curves

the hyperbolic tangent is sometimes referred to as the bipolar sigmoidal

the logistic is sometimes referred to as the binary sigmoidal


    Activation Functions Graphs

(Figure: graphs of the logistic function and the hyperbolic tangent function.)


    Identity Activation Function

Identity function

it is usually employed for nodes of the output layer to approximate a continuous-valued function not limited to [0,1] or [-1,1]

such nodes are referred to as linear nodes

(Figure: graph of the identity function.)


    Binary and Bipolar Sigmoid Derivatives

Binary sigmoid: f(net) = 1 / (1 + e^(-net)), with derivative f′(net) = f(net) · [1 - f(net)]

Bipolar sigmoid: f(net) = (2 / (1 + e^(-net))) - 1, with derivative f′(net) = 0.5 · [1 + f(net)] · [1 - f(net)]
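
A quick numerical check of both derivative formulas against a central finite-difference approximation (a sketch; the step size 1e-6 is arbitrary).

    import math

    def binary(net):   return 1.0 / (1.0 + math.exp(-net))
    def bipolar(net):  return 2.0 / (1.0 + math.exp(-net)) - 1.0

    h = 1e-6
    for net in (-1.5, 0.0, 0.8):
        fd_binary  = (binary(net + h) - binary(net - h)) / (2 * h)
        fd_bipolar = (bipolar(net + h) - bipolar(net - h)) / (2 * h)
        f, g = binary(net), bipolar(net)
        assert abs(fd_binary - f * (1 - f)) < 1e-6
        assert abs(fd_bipolar - 0.5 * (1 + g) * (1 - g)) < 1e-6
    print("derivative formulas confirmed numerically")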


Learning

Learning target:

minimize the difference between actual outputs and target outputs

Learning rule:

Steepest descent (back-propagation)

Conjugate gradient method

All optimization methods using the first derivative

Derivative-free optimization


    MLP and the backpropagation algorithm

(Figure: input signals xk pass through weights wki to hidden units hi and through weights wij to the output yj; the error between yj and the desired output oj is propagated backwards from the output layer through the hidden layer towards the input layer.)


    Backpropagation Algorithm

    0 Initialise Weights

    1 While Stopping condition is false, do steps 2 to 9


    Backpropagation Algorithm 2

    2 For each training pair, do steps 3 to 8

    Feedforward pass

3 Each input unit receives its input signal and broadcasts this signal to all units in the layer above (the hidden units)

4 Each hidden unit sums its weighted input signals, applies

    its activation function to compute its output signal and

    sends this signal to all units in the layer above (output

    units)

    5 Each output unit sums its weighted input signals and

    applies its activation function to compute its output signal

    End of Feedforward Pass


    Backpropagation Algorithm 3

    Backward Pass

6 Each output unit receives a target pattern corresponding to the input training pattern, computes its error information term, calculates its weight and bias correction terms, and sends its error information term to units in the layer below

7 Each hidden unit sums the error information terms it receives from units in the layer above, multiplies this sum by the derivative of its activation function to calculate its own error information term, and calculates its weight and bias correction terms

    End of Backward pass


    Backpropagation Algorithm 4

    Updating Pass

    8 Each output unit updates its bias and weights. Each

    hidden unit updates its bias and weights.

    End of Updating pass

    9 Test stopping criterion
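
The sketch below condenses steps 0-9 for a single-hidden-layer network with logistic units, squared error and per-pattern updates, trained here on XOR; the layer sizes, learning rate, random seed and stopping tolerance are illustrative choices, not taken from the slides.

    import math, random

    random.seed(0)

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    def train(patterns, n_in, n_hidden, lr=0.5, max_epochs=10000, tol=0.01):
        # Step 0: initialise weights (bias weights included) with small random values.
        w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        for _ in range(max_epochs):                        # Step 1: while not stopped
            sse = 0.0
            for x, t in patterns:                          # Step 2: each training pair
                x = [1.0] + list(x)                        # leading 1 pairs with the bias weight
                # Steps 3-5: feedforward pass
                h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
                hb = [1.0] + h
                y = sigmoid(sum(w * hi for w, hi in zip(w_out, hb)))
                # Step 6: error information term of the output unit
                d_out = (t - y) * y * (1.0 - y)
                # Step 7: error information terms of the hidden units
                d_hid = [d_out * w_out[i + 1] * h[i] * (1.0 - h[i]) for i in range(n_hidden)]
                # Step 8: update biases and weights
                for i in range(n_hidden + 1):
                    w_out[i] += lr * d_out * hb[i]
                for i in range(n_hidden):
                    for j in range(n_in + 1):
                        w_hid[i][j] += lr * d_hid[i] * x[j]
                sse += (t - y) ** 2
            if sse < tol:                                  # Step 9: test stopping criterion
                break
        return w_hid, w_out

    xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    w_hid, w_out = train(xor, n_in=2, n_hidden=2)
    for x, t in xor:
        xb = [1.0] + list(x)
        h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_hid]
        print(x, t, round(sigmoid(sum(w * hi for w, hi in zip(w_out, h))), 3))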



    Problems

    How to determine the architecture?

    How to determine the parameters?

    How to get global optima?

    ... ...


    GA and ANN

    Three levels:

connection weights: introduce an adaptive and global approach to training

architectures: adapt the topologies to different tasks without human intervention, and thus provide an approach to automatic ANN design, as both ANN connection weights and structures can be evolved

learning rules: learning to learn, an adaptive process of automatic discovery of novel learning rules


    Evolution of connection weights

    Weight training in ANNs is usually formulated as

    minimization of an error function, such as the mean

    square error between target and actual outputs averaged

over all examples, by iteratively adjusting connection weights.

    BP often gets trapped in a local minimum of the error

    function and is incapable of finding a global minimum if the

    error function is multimodal and/or nondifferentiable.

    GA can be used effectively in the evolution to find a near-optimal set of connection weights globally without

    computing gradient information.


    Typical cycle of the evolution of the

    connection weights

    1 Decode each individual in the current generation into a set

    of connection weights and construct a corresponding ANN

    with the weights

2 Evaluate each ANN by computing its total mean square error between actual and target outputs. The fitness of an

    individual is determined by the error. A regularization term

    may be included in the fitness function to penalize large

    weights.

    3 Select parents for reproduction based on their fitness

    4 Apply genetic operators, such as crossover and mutation,

    to parents to generate offspring, which form the next

    generation
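
A toy version of this cycle, for a fixed single-node logistic ANN learning AND: real-valued chromosomes hold the connection weights, fitness is the negative mean square error, and tournament selection, one-point crossover and Gaussian mutation stand in for the genetic operators; all of these particular choices are assumptions made for illustration.

    import math, random

    random.seed(1)
    DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # the AND function

    def fitness(chrom):
        # Step 1: the chromosome [w0, w1, w2] is decoded as the weight set of a
        # one-node logistic ANN; Step 2: fitness is the negative mean square error.
        err = 0.0
        for (x1, x2), t in DATA:
            net = chrom[0] + chrom[1] * x1 + chrom[2] * x2
            y = 1.0 / (1.0 + math.exp(-net))
            err += (t - y) ** 2
        return -err / len(DATA)

    def tournament(pop, fits, k=3):
        return pop[max(random.sample(range(len(pop)), k), key=lambda i: fits[i])]

    def crossover(a, b):
        cut = random.randint(1, len(a) - 1)
        return a[:cut] + b[cut:]

    def mutate(chrom, rate=0.2, sigma=0.3):
        return [g + random.gauss(0, sigma) if random.random() < rate else g for g in chrom]

    pop = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(30)]
    for generation in range(100):
        fits = [fitness(c) for c in pop]                              # Steps 1-2
        pop = [mutate(crossover(tournament(pop, fits),                # Steps 3-4
                                tournament(pop, fits))) for _ in pop]
    best = max(pop, key=fitness)
    print(best, fitness(best))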


    Representation

    Binary or real number

    Put connection weights to the same node together. Nodes in

    ANN are in essence feature extractors and detectors.

Separating inputs to the same node far apart would increase the difficulty of constructing useful feature

    detectors because they might be destroyed by crossover

    operators.

Permutation problem: the mapping from the representation to the actual ANN is many-to-one, since two ANNs that order their hidden nodes differently in their chromosomes will still be functionally equivalent. This makes the crossover operator very inefficient in producing good offspring.


    Comparison between GA and BP

    GA can handle the global search problem better. It can be

    used to train many different networks regardless of their

architecture and saves a lot of human effort in developing different training algorithms for different types of ANN.

    GA makes it easier to generate ANN with some special

    characteristics.

    GA is much less sensitive to initial conditions of training.

    There is no clear winner in terms of the best training

    algorithm.


    Hybrid training

Combine the GA's global search ability with a local search's ability to fine-tune: the GA can be used to locate a good region in the search space, and then a local search procedure is used to find a near-optimal solution in this region.


    The evolution of architecture

    The architecture of an ANN includes its topological structure,

    i.e., connectivity, and the transfer function of each node in

    the ANN.

The architecture has a significant impact on a network's information processing capabilities. Given a learning task,

    an ANN with only a few connections and linear nodes may

    not be able to perform the task at all due to its limited

    capability, while an ANN with a large number of

connections and nonlinear nodes may overfit noise in the training data and fail to have good generalization ability.


    Traditional way to design the architecture

    There is no systematic way to design a near-optimal

    architecture for a given task automatically.

    A constructive algorithm starts with a minimal network

    (network with minimal number of hidden layers, nodes and

connections) and adds new layers, nodes and connections when necessary during training.

    A destructive algorithm starts with a maximal network

    (network with maximal number of hidden layers, nodes

    and connections) and deletes unnecessary layers, nodes

and connections during training.

    Such structural hill climbing methods are susceptible to

    becoming trapped at structural local optima. They only

    investigate restricted topological subsets rather than the

complete class of network architectures.


    Typical cycle of the evolution of

    architecture

    1 Decode each individual in the current generation into an

    architecture.

    2 Train each ANN with the decoded architecture by a

predefined learning rule, starting from different sets of random initial connection weights and learning rule

    parameters.

    3 Compute the fitness of each individual according to the

    above training result and other performance criteria such

    as the complexity of the architecture.

    4 Select parents from the population based on their fitness.

    5 Apply search operators to the parents and generate

    offspring which form the next generation.


    The direct encoding scheme

An N×N matrix C = (c(i,j)) can represent an ANN architecture with N nodes, where c(i,j) indicates the presence or absence of the connection from node i to node j.

Such an encoding scheme can handle both feedforward and recurrent ANNs.
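
A minimal sketch of the direct encoding, flattening the N×N connectivity matrix into a binary chromosome and decoding it back; the helper names and the 4-node example matrix are mine (an upper-triangular matrix gives a feedforward network, while a full matrix may encode recurrent connections).

    N = 4   # number of nodes

    def encode(matrix):
        # Flatten the N x N connectivity matrix row by row into a bit string.
        return [matrix[i][j] for i in range(N) for j in range(N)]

    def decode(chromosome):
        return [[chromosome[i * N + j] for j in range(N)] for i in range(N)]

    # c(i, j) = 1 means there is a connection from node i to node j.
    feedforward = [[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 0, 1],
                   [0, 0, 0, 0]]
    chrom = encode(feedforward)
    assert decode(chrom) == feedforward
    print(chrom)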


    A feedforward ANN


    A recurrent ANN


    Notes about direct encoding scheme

    It is straightforward to implement.

    Training error, training time, complexity can be used in the

    fitness function

A large ANN would require a very large matrix and thus increase the computation time of the evolution. Domain

    knowledge can be used to reduce the search space

    The permutation problem still exists


    The indirect encoding scheme

    Only some characteristics of an architecture are encoded to

reduce the length of the chromosome. The details about each connection in an ANN are either predefined according

    to prior knowledge or specified by a set of deterministic

    development rules.


    Parametric representation

    ANN architectures may be specified by a set of parameters

    such as the number of hidden layers, the number of

    hidden nodes in each layer, the number of connections

    between two layers, etc.

    In general the parametric representation method will be most

    suitable when we know what kind of architectures we are

    trying to find.


    Example of pattern recognition

Input   Output      Input   Output
0000    00          0100    00
1100    00          1000    00
1001    01          0001    01
1101    01          0101    01
0010    11          1010    11
0110    11          1110    11
0011    10          0111    10
1011    10          1111    10

    In fact the first two bits of the input are noise and the output

    is the Gray code of the last two bits of the input.
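
That claim is easy to verify with a short sketch: the Gray code of a 2-bit value b1 b0 is b1 followed by b1 XOR b0.

    for n in range(16):
        bits = format(n, '04b')              # 4-bit input; the first two bits are noise
        b1, b0 = int(bits[2]), int(bits[3])  # the last two bits
        print(bits, '->', f"{b1}{b1 ^ b0}")  # expected output: Gray code of the last two bits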


    Chromosome

We use a 16-bit chromosome.

The first 2 bits stand for the learning rate: 0.5, 0.25, 0.125, 0.0625

The next 2 bits stand for the momentum: 0.9, 0.8, 0.7, 0.6

The next 2 bits stand for the range of the initial weights: 1, 0.5, 0.25, 0.125

The next 5 bits are used for the 1st hidden layer: the first bit indicates whether there is a hidden layer and the other 4 bits stand for the number of hidden units.

The last 5 bits are used for the 2nd hidden layer: the first bit indicates whether there is a hidden layer and the other 4 bits stand for the number of hidden units.
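
A sketch of how such a 16-bit chromosome could be decoded, following the field layout above; the bit ordering inside each field and the helper name are assumptions made for illustration.

    LEARNING_RATES = [0.5, 0.25, 0.125, 0.0625]
    MOMENTA        = [0.9, 0.8, 0.7, 0.6]
    WEIGHT_RANGES  = [1.0, 0.5, 0.25, 0.125]

    def decode(bits):                               # bits: string of 16 '0'/'1' characters
        assert len(bits) == 16
        lr       = LEARNING_RATES[int(bits[0:2], 2)]
        momentum = MOMENTA[int(bits[2:4], 2)]
        w_range  = WEIGHT_RANGES[int(bits[4:6], 2)]
        hidden = []
        for field in (bits[6:11], bits[11:16]):     # two 5-bit hidden-layer fields
            if field[0] == '1':                     # first bit: is the layer present?
                hidden.append(int(field[1:], 2))    # remaining 4 bits: number of units
        return lr, momentum, w_range, hidden

    print(decode('0100101100110100'))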


    Evolution and result

    Only use the first 8 samples for evolution.

    Use 7 of these 8 samples for training the ANN and the other

    one is used to get the fitness.

Finally we get a 4-1-4-2 ANN (structure and weights).

In order to check the final result we use the other 8 samples and compare with a 4-16-16-2 ANN which is trained by BP.


    Developmental rule representation

    Development rules, which are used to construct architectures,

    are encoded in chromosomes.

    A development rule is usually described by a recursive

equation or a production system. How can we get such a set of rules to construct an ANN? One

    answer is to evolve them. We can encode the whole rule

    set as an individual (Pittsburgh approach) or encode each

    rule as an individual (Michigan approach)


    Examples of some development rules


    Development of an ANN architecture



    Simultaneous evolution of architectures &

    weights


    Evolution of learning rules

    An ANN training algorithm may have different performance

    when applied to different architectures. The design of

    training rules, more fundamentally the learning rules used

    to adjust weights, depends on the type of architectures

    under investigation. Different variants of the Hebbian

    learning rule have been proposed to deal with different

    architectures. It is desirable to develop an automatic and

    systematic way to adapt the learning rule to an

architecture and the task to be performed. Designing a learning rule manually often implies that some

    assumptions, which are not necessarily true in practice,

    have to be made.



    Typical cycle of the evolution of learning

    rule

    1 Decode each individual in the current generation into a

    learning rule

    2 Construct a set of ANNs with randomly generated

architectures and initial connection weights, and train them using the decoded learning rule.

    3 Calculate the fitness of each individual according to the

    average training result

    4 Select parents from the current generation according to

    their fitness

    5 Apply search operators to parents to generate offspring

    which form the new generation


    Evolution of algorithm parameters

The adaptive adjustment of BP's parameters through evolution could be considered the first attempt at the evolution of learning rules.

Some researchers used a GA to find parameters for BP, but the ANN's architecture was predefined. The parameters evolved in this case tend to be optimized towards that architecture rather than being generally applicable to learning.

Some researchers encoded BP's parameters in chromosomes together with the ANN's architecture.


    Evolution of learning rules

    The evolution of learning rules has to work on the dynamic

    behavior of an ANN.

Trying to develop a universal representation scheme which can specify any kind of dynamic behavior is clearly impractical.

    Two basic assumptions which have often been made on

    learning rules are 1) weight-updating depends only on

    local information such as the activation of the input node,

the activation of the output node, the current connection weight, etc.; 2) the learning rule is the same for all

    connections in an ANN


    Learning rule

A learning rule can be described by a function of local variables with a set of coefficients.

There are three major issues involved in the evolution of learning rules: 1) determination of a subset of terms of this function; 2) representation of the coefficients as chromosomes; and 3) the GA used to evolve these chromosomes.


Other combinations of GA and ANN

    Evolution of input features: finding a near-optimal set of input

    features to an ANN

    ANN as fitness estimator: the time-consuming fitness

evaluation based on real systems is replaced by fast fitness evaluation based on an ANN

    Evolving ANN ensembles: combining different individuals in

    the population to form an integrated system is expected to

    produce better results.

    A general framework for GA and ANN
