Neural Networks - Lecture 4.ppt

Neural Networks by Muhammad Amjad

Transcript of Neural Networks - Lecture 4.ppt

  • RECURRENT NETWORKS: Introduction

    Suppose we want to predict tomorrow's stock market average y(t+1) based on today's economic indicators x_i(t). Now suppose we want the prediction to be based on today's and yesterday's economic indicators, x_i(t) and x_i(t-1)

    In this case we can simply augment the number of inputs

    *

  • RECURRENT NETWORKS: Introduction

    However, if we wish the network to consider an arbitrary-sized window of time in the past (e.g. sometimes two days, sometimes 20 days), then a different solution is required

    Recurrent networks provide one such solution

    They are used to learn sequential or time-varying patterns, in which the current value of the pattern is dependent upon its past history

    *

  • RECURRENT NETWORKS: Architecture

    A specific group of units receives feedback signals from the previous time step

    These units are known as context units. The weights on the feedback connections to the context units are fixed (e.g. at 1)

    *

  • RECURRENT NETWORKS: Architecture

    Each context unit is feed-forward connected to its own hidden unit and to all the other hidden units

    Number of context units = number of hidden units. Each hidden unit has a feedback connection to one context unit only

    *

  • RECURRENT NETWORKS: Training of Weights

    At time t, the activations of the context units are the activations (output signals) of the hidden units at the previous time step

    The weights from the context units to the hidden units are trained in exactly the same manner as the weights from the input units to the hidden units

    Thus, at any time step, the training algorithm is the same as for standard backpropagation
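    To make the training setup concrete, here is a minimal NumPy sketch of one forward step of an Elman network; the layer sizes, names, and the logistic activation are illustrative assumptions, not taken from the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class ElmanNet:
        """Minimal Elman (simple recurrent) network: one hidden layer
        plus context units holding the previous hidden activations."""

        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            # Trainable weights: input->hidden, context->hidden, hidden->output
            self.W_in = rng.normal(0, 0.1, (n_hidden, n_in))
            self.W_ctx = rng.normal(0, 0.1, (n_hidden, n_hidden))
            self.W_out = rng.normal(0, 0.1, (n_out, n_hidden))
            # Context units start at zero; the feedback weights are fixed at 1
            self.context = np.zeros(n_hidden)

        def step(self, x):
            # Hidden units see the current input AND the previous hidden state
            h = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
            y = sigmoid(self.W_out @ h)
            self.context = h.copy()   # feedback connection with fixed weight 1
            return y

    At each time step, backpropagation would update W_in, W_ctx and W_out exactly as in a feed-forward net, treating the context activations as just another set of inputs.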

    *

  • RECURRENT NETWORKS: Utilization

    After training, a recurrent net can be presented with an arbitrary-sized sequence, and it will predict the next element of the sequence

    It can also be trained to classify or label a sequence: we give an arbitrary-sized portion of the sequence as input, and the network tries to tell us the name of the sequence

    *

  • RECURRENT NETWORKS: Example: Next valid letter in a string of characters

    Suppose that strings of characters are generated by a small finite-state grammar. Each string begins with the symbol B and ends with the symbol E. At each decision point, either path can be taken with equal probability

    Two examples of possible strings are B P V V E and B T X X T T V P S E
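    Both example strings are consistent with the well-known Reber grammar; as an illustration (the slides do not name the grammar), here is a small Python generator for it:

    import random

    # Transition table of the (assumed) Reber grammar: state -> list of
    # (symbol, next_state) choices, each taken with equal probability.
    REBER = {
        0: [('B', 1)],
        1: [('T', 2), ('P', 3)],
        2: [('S', 2), ('X', 4)],
        3: [('T', 3), ('V', 5)],
        4: [('X', 3), ('S', 6)],
        5: [('P', 4), ('V', 6)],
        6: [('E', None)],
    }

    def generate_string():
        state, out = 0, []
        while state is not None:
            symbol, state = random.choice(REBER[state])
            out.append(symbol)
        return ''.join(out)

    # Both B P V V E and B T X X T T V P S E are possible outputs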

    *

  • RECURRENT NETWORKS: Example: Architecture used

    *

  • RECURRENT NETWORKS: Example: Training

    The training patterns for the neural net consisted of 60,000 randomly generated strings

    The string length ranged from 3 to 30 letters (plus the Begin and End symbols at the beginning and end of strings)

    The letters in the string are presented sequentially to the net, one by one, respecting the order in which they appear

    Each letter in the string becomes an input in its turn, and its successor is the target output (a sketch of this pairing follows)
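    As a sketch of this input/target pairing (the one-hot representation is an assumption; the slides do not specify the encoding):

    LETTERS = ['B', 'T', 'S', 'X', 'V', 'P', 'E']

    def one_hot(letter):
        v = [0.0] * len(LETTERS)
        v[LETTERS.index(letter)] = 1.0
        return v

    def training_pairs(string):
        """Each letter is an input; its successor is the target output."""
        return [(one_hot(a), one_hot(b)) for a, b in zip(string, string[1:])]

    # training_pairs('BPVVE') yields (B->P), (P->V), (V->V), (V->E)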

    *

  • RECURRENT NETWORKS: Example: Training

    The training algorithm for the finite-state grammar is:

    *

  • RECURRENT NETWORKS: Example: Training

    String: B T X S E

    *

  • RECURRENT NETWORKS: Example: Testing & Utilization

    After training, we hope that the net has learnt the grammar

    Given a valid sequence of letters, it can be used to predict the next valid letter in the sequence

    An interesting application is determining whether a given string is a valid string

    As each symbol is presented, the net predicts the possible valid successors of that symbol (output units with activations of 0.3 or more are counted as valid successors)

    *

  • RECURRENT NETWORKS: Example: Testing & Utilization

    If the next letter in the string is not one of the predicted valid successors, the string is rejected. If all the letters pass the test, then the string is accepted as valid

    The reported results for 70,000 random strings, 0.3% (210) of which were valid, are that the net correctly rejected 99.7% of the invalid strings and accepted 100% of the valid strings

    The net also performed perfectly on a set of extremely long strings (100 or more characters in length)
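    A minimal sketch of this accept/reject procedure, reusing the one_hot helper and an ElmanNet-style net with a step() method from the earlier sketches (the 0.3 threshold is from the slides; everything else is illustrative):

    THRESHOLD = 0.3  # output activations >= 0.3 count as valid successors

    def is_valid_string(net, string):
        """Feed the string one letter at a time; reject as soon as a letter
        is not among the predicted valid successors of its predecessor."""
        for current, nxt in zip(string, string[1:]):
            outputs = net.step(one_hot(current))  # one activation per letter
            valid = {LETTERS[i] for i, a in enumerate(outputs) if a >= THRESHOLD}
            if nxt not in valid:
                return False
        return True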

    *

  • RECURRENT NETWORKS: Example: Testing & Utilization

    Note that this grammar is such that context can be established with the help of two samples

    If the context is deeper, then a more complex network may be required

    *

  • RECURRENT NETWORKS: Example: Testing & Utilization

    The example may be adapted to other problems

    For example, the letters may be replaced by the features composing the letters, or by the words composing a sentence

    *

  • RECURRENT NETWORKS: Conclusion

    The architecture that we have studied is called a simple recurrent network, as it contains a single hidden layer

    It is also called an Elman network, and is similar to an architecture proposed by Jordan

    It is implemented in MATLAB's Neural Network Toolbox

    *

  • References

    Read Section 7.2.3, page 372, of Laurene Fausett

    *

  • ART-1: Introduction

    An unsupervised method of clustering that allows an increase in the number of clusters only if required

    Developed by Carpenter & Grossberg in 1987

    ART-1 is designed for clustering binary-valued vectors; ART-2 is for continuous-valued inputs

    *

  • ART-1: Architecture

    *

  • ART-1: Architecture

    The architecture of the computational units for ART-1 consists of F1 units (input and interface units), F2 units (cluster units), and a reset unit

    The reset unit implements user control over the degree of similarity of patterns placed on the same cluster

    *

  • ART-1: Architecture

    Each unit in the F1(a) input layer is connected only to the corresponding unit in the F1(b) interface layer

    Each unit in the input and interface layers is connected to the reset unit. The reset unit is connected to every F2 (cluster) unit

    *

  • ART-1: Architecture

    Each unit of the interface layer is connected to each unit in the cluster layer by two weighted pathways

    The interface unit X_i is connected to the cluster unit Y_j by the bottom-up weight b_ij. Similarly, unit Y_j is connected to unit X_i by the top-down weight t_ji

    The cluster layer is a competitive layer in which only the uninhibited node with the largest net input has a non-zero activation

    *

  • ART-1: Training

    Notation:

    n      number of components in the input vector
    m      maximum number of clusters that can be formed
    b_ij   bottom-up weights (from interface unit X_i to cluster unit Y_j)
    t_ji   binary top-down weights (from cluster unit Y_j to interface unit X_i)
    ρ      vigilance parameter
    s      binary input vector (an n-tuple)
    x      activation vector for the interface layer (binary)
    ||x||  norm of vector x, defined as the sum of the components x_i

    *

  • ART-1: Training: Competition-to-Learn Phase

    A binary input vector s is presented to the input layer, and the signals are sent to the corresponding X units. These interface units then broadcast to the cluster layer over connection pathways with bottom-up weights

    Each cluster unit computes its net input and the units compete for the right to be active

    The unit with the largest net input sets its activation to 1; all others have an activation of zero. Let the index of the winning unit be J. This winning unit becomes the candidate to learn the input pattern

    *

  • ART-1: Training: Similarity Check

    A signal is then sent down from the cluster layer to the interface layer (multiplied by the top-down weights). The interface X units remain on only if they receive non-zero signals from both the input and cluster units

    The activation vector x of the interface layer has the states of the individual units as its components; a unit is 1 ("on") if it receives non-zero inputs. The norm ||x|| gives the number of components in which the top-down weight vector t_J for the winning cluster unit and the input vector s are both 1. This quantity is sometimes referred to as the match

    *

  • ART-1: Training: Weights Update

    If the ratio of ||x|| to ||s|| is greater than or equal to the vigilance parameter ρ, the weights (top-down and bottom-up) for the winning cluster unit are adjusted

    The use of the ratio allows an ART-1 net to respond to relative differences. This reflects the fact that a difference of one component in vectors with only a few non-zero components is much more significant than a difference of one component in vectors with many non-zero components
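    The update equations themselves are not reproduced in the transcript; in Fausett's fast-learning formulation (with a fixed parameter L > 1, commonly L = 2) they are:

    b_{iJ}^{new} = \frac{L\, x_i}{L - 1 + \lVert x \rVert}, \qquad t_{Ji}^{new} = x_i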

    *

  • ART-1: Training: Search for another unit if the 1st candidate is rejected

    However, if the ratio is less than the vigilance parameter, the candidate unit is rejected and another candidate unit must be chosen

    The winning cluster unit becomes inhibited, so it cannot be chosen again as a candidate on this learning trial, and the activations of the input and interface units are reset to zero. The same input vector again sends its signal to the interface units, which again send it as the bottom-up signal to the cluster layer, and the competition is repeated (but without the participation of any inhibited units)

    *

  • ART-1: Training: Search for another unit if the 1st candidate is rejected

    The process continues until either a satisfactory match is found (a candidate is accepted) or all units are inhibited

    The action to be taken if all units are inhibited must be specified by the user: reduce the value of the vigilance parameter (thus allowing less similar patterns to be placed on the same cluster), increase the number of cluster units, or simply designate the current input pattern as an outlier that cannot be clustered

    *

  • ART-1: Training: Search for another unit if the 1st candidate is rejected

    At the end of each presentation of a pattern, all cluster units are returned to inactive status and are available to participate in the next competition

    *

  • ART-1: Training (the algorithm and architecture details are given on the slides)

    *

  • ART-1: Training: Example

    The ART-1 algorithm is used to group 4 vectors into at most 3 clusters (a runnable sketch follows)

    The vectors are: [1 1 0 0], [0 0 0 1], [1 0 0 0], [0 0 1 1]
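    A runnable sketch of the whole procedure on these four vectors (fast learning after Fausett, Section 5.2; the vigilance value rho = 0.4 and the parameter L = 2 are assumed, since the transcript does not state them):

    import numpy as np

    def art1(patterns, n_clusters, rho=0.4, L=2.0):
        n = len(patterns[0])
        b = np.full((n_clusters, n), 1.0 / (1.0 + n))  # bottom-up weights
        t = np.ones((n_clusters, n))                   # top-down weights
        assignments = []
        for s in map(np.array, patterns):
            inhibited = set()
            while True:
                # Competition: uninhibited cluster with largest net input wins
                net = b @ s
                for j in inhibited:
                    net[j] = -1.0
                J = int(np.argmax(net))
                # Similarity check against the vigilance parameter
                x = s * t[J]                           # component-wise AND
                if x.sum() / s.sum() >= rho:
                    b[J] = L * x / (L - 1.0 + x.sum()) # fast-learning update
                    t[J] = x
                    assignments.append(J)
                    break
                inhibited.add(J)                       # reject; search again
                if len(inhibited) == n_clusters:
                    assignments.append(None)           # outlier: no cluster fits
                    break
        return assignments

    print(art1([[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]],
               n_clusters=3))
    # With these assumed parameters the first and third vectors share one
    # cluster and the second and fourth share another: [0, 1, 0, 1]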

    *

  • ART-1: Training: Example (worked step by step on the slides)

    *

  • References

    Read Section 5.2 of Laurene Fausett

    *

  • RBF NETWORKS: Architecture

    *

  • RBF NETWORKS: Architecture

    Radial Basis Function Networks consist of:
    - a hidden layer of k Gaussian neurons
    - a set of weights w_i

    For Gaussian basis functions, the activation of hidden neuron j has the standard form φ_j(x) = exp( -||x - μ_j||² / (2σ_j²) ), where μ_j is the centre and σ_j the width of the neuron

    *

  • RBF NETWORKS: Training

    RBFNs are trained using a two-step process (a sketch follows):

    1. Determine the k Gaussian neurons using Kohonen unsupervised learning

    2. Determine the weights w_i using supervised learning (backpropagation)
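    A minimal sketch of the two-step procedure, with plain k-means standing in for Kohonen clustering and a least-squares solve standing in for backpropagation on the output weights (both substitutions are illustrative simplifications):

    import numpy as np

    def rbf_activations(X, centres, sigma):
        # Gaussian basis: phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2))
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def train_rbf(X, y, k=5, sigma=1.0, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1 (unsupervised): place the k centres
        centres = X[rng.choice(len(X), k, replace=False)].astype(float)
        for _ in range(iters):
            labels = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(2).argmin(1)
            for j in range(k):
                if np.any(labels == j):
                    centres[j] = X[labels == j].mean(axis=0)
        # Step 2 (supervised): fit the output weights on the RBF activations
        Phi = rbf_activations(X, centres, sigma)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return centres, w

    def predict(X, centres, sigma, w):
        return rbf_activations(X, centres, sigma) @ w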

    *

  • RBF NETWORKS: Architecture

    The figure shows the inputs feeding a hidden layer of RBF neurons, with the outputs formed as linear combinations of the RBF activations

    There can be several outputs m, and each output is independent of the other outputs: the network can be viewed as m independent networks

    *

  • References

    Engelbrecht, Chapter 5

    *

  • HOPFIELD NETWORK: Architecture

    *

  • HOPFIELD NETWORK: Utilization

    *

  • HOPFIELD NETWORK: Training by Hebb's Rule

    Consider 2 nodes which have (1, 1) activations for the pattern to be stored

    If there is a positive weight between them, then each reinforces the other (each one makes a positive contribution to the activation of the other)

    For (0, 1) or (1, 0) activations, negative weights reinforce each other's activations

    *

  • HOPFIELD NETWORK: Training by Hebb's Rule

    This is a shortcut for Hebb's rule: it gives a positive weight if the two activations are (1, 1) or (0, 0), and a negative weight if they are (1, 0) or (0, 1)
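    The shortcut formula itself is not reproduced in the transcript; for stored binary patterns s^{(p)} it is presumably the standard form

    w_{ij} = \sum_p \bigl(2 s_i^{(p)} - 1\bigr)\bigl(2 s_j^{(p)} - 1\bigr), \qquad w_{ii} = 0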

    *

  • HOPFIELD NETWORK: Training by Hebb's Rule: Bipolar activations

    This is again a shortcut for Hebb's rule: the formula gives positive weights for (1, 1) or (-1, -1) activations and negative weights for (1, -1) or (-1, 1) activations
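    Written out, the bipolar shortcut is w_{ij} = \sum_p s_i^{(p)} s_j^{(p)} with w_{ii} = 0. As an illustration, here is a minimal storage-and-recall sketch (the asynchronous update order and the helper names are assumptions):

    import numpy as np

    def store(patterns):
        """Hebbian shortcut for bipolar patterns: sum of outer products,
        with the self-connections zeroed out."""
        n = len(patterns[0])
        W = np.zeros((n, n))
        for s in map(np.array, patterns):
            W += np.outer(s, s)
        np.fill_diagonal(W, 0)
        return W

    def recall(W, probe, sweeps=10, seed=0):
        """Asynchronous updates: each unit takes the sign of its net input."""
        rng = np.random.default_rng(seed)
        y = np.array(probe, dtype=float)
        for _ in range(sweeps):
            for i in rng.permutation(len(y)):
                net = W[i] @ y
                if net != 0:
                    y[i] = 1.0 if net > 0 else -1.0
        return y

    W = store([[1, -1, 1, -1, 1, -1]])
    print(recall(W, [1, -1, 1, -1, -1, -1]))  # the corrupted bit is repaired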

    *

  • HOPFIELD NETWORK: Storage Capacity
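    The capacity figures were given on a slide that is not reproduced here; the estimates commonly cited (e.g. in Fausett) are that a binary Hopfield net of n units can store roughly P ≈ 0.15 n patterns, and a bipolar net roughly P ≈ n / (2 log2 n)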

    *

  • References

    Laurene Fausett, Section 3.4.4; Kevin Gurney, Chapters 5 & 6

    *

  • BOLTZMANN MACHINE: Introduction

    Introduced in 1983 by Hinton & Sejnowski

    The architecture of a Boltzmann machine consists of a set of units and a set of bidirectional connections between pairs of units

    Not all units are connected; however, if two units are connected then their connection is bidirectional and the weights in both directions are the same, i.e. w_ij = w_ji

    A unit may also have a self-connection, with weight w_ii

    *

  • BOLTZMANN MACHINE: Architecture

    A Boltzmann machine can be used to solve the Travelling Salesman Problem (TSP)

    Units are arranged in a two-dimensional array

    The units within each row are fully interconnected

    Similarly, the units within each column are fully interconnected

    *

  • BOLTZMANN MACHINE: Architecture

    Architecture for the 10-city TSP

    *

  • BOLTZMANN MACHINE: Architecture

    The state x_i of a unit X_i is either on (1) or off (0)

    The objective of the neural net is to maximize the consensus function over all the units of the net

    *

  • BOLTZMANN MACHINE: Architecture

    The net attempts to find this maximum by letting each unit try to change its state from on to off or vice versa

    The change in consensus if unit X_i changes its state is ΔC(i) = [1 - 2x_i] [ w_ii + Σ_{j≠i} w_ij x_j ], where x_i is the current state of X_i

    The coefficient [1 - 2x_i] will be +1 if unit X_i is currently off and -1 if it is on

    *

  • BOLTZMANN MACHINE: Architecture

    However, unit X_i does not necessarily change its state, even if doing so would increase the consensus of the net

    The update is done probabilistically, so that the chances of the net getting stuck in a local maximum are reduced
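    The acceptance probability is not reproduced in the transcript; in the standard formulation (as in Fausett) the change is accepted with probability A(i, T) = 1 / (1 + exp(-ΔC(i)/T)), where T is the temperature. A minimal sketch of one probabilistic update step (array names are illustrative; x is a 0/1 NumPy vector):

    import numpy as np

    rng = np.random.default_rng(0)

    def boltzmann_step(W, x, T):
        """Attempt to flip one randomly chosen unit, accepting the flip
        with probability 1 / (1 + exp(-dC / T))."""
        i = rng.integers(len(x))
        # Change in consensus if unit i flips:
        # dC = (1 - 2 x_i) * (w_ii + sum_{j != i} w_ij x_j)
        dC = (1 - 2 * x[i]) * (W[i, i] + W[i] @ x - W[i, i] * x[i])
        if rng.random() < 1.0 / (1.0 + np.exp(-dC / T)):
            x[i] = 1 - x[i]
        return x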

    *


  • BOLTZMANN MACHINE: Architecture

    The connection pattern for the net is shown on the slide

    *

    *

  • BOLTZMANN MACHINE: Architecture

    Consider the relation between the penalty weight -p and the bonus weight b

    Allowing a unit U_ij to turn on should be encouraged only if no other units are on in that column or row (no city can be visited twice, nor can two different cities be visited at the same time)

    If we set p > b, then this objective will be achieved: if a unit turns on while another unit is already on in its row or column, the consensus function will have a negative change

    *

  • BOLTZMANN MACHINE: Architecture

    Now let us consider the relation between the constraint weight b and the distance weights

    Let d denote the maximum distance between any two cities on the tour

    Suppose that no unit is on in column j or in row i

    Allowing U_ij to turn on should be encouraged, and the weights should be set so that the consensus will increase if it turns on

    *

  • BOLTZMANN MACHINE: Architecture

    The change in consensus will be b - d_{i,k1} - d_{i,k2}

    where k1 indicates the city visited at stage j-1 of the tour, and k2 denotes the city visited at stage j+1 (and city i is visited at stage j)

    *

  • BOLTZMANN MACHINE: Architecture

    Since d is the maximum possible distance, the minimum change in consensus b - d_{i,k1} - d_{i,k2} will be equal to b - 2d

    Since the change in consensus should be positive, we can take b > 2d for the net to function properly

    *

  • BOLTZMANN MACHINE: Setting the weights

    The weights for a Boltzmann machine are set manually

    Their values are chosen such that the net will tend to make transitions towards a maximum of the consensus function

    *

  • BOLTZMANN MACHINE: Algorithm (given on the slides)

    *

  • BOLTZMANN MACHINE: Algorithm: Initial Temperature

    The initial temperature should be taken large enough so that the probability of accepting a change of state is approximately 0.5, regardless of whether the change is beneficial or detrimental

    *

  • BOLTZMANN MACHINE: Example

    The Boltzmann machine was used to solve the TSP

    The starting configuration had approximately half the units on

    The parameters were: T_0 = 20, b = 60, and p = 70

    The cooling schedule was T_new = 0.9 T_old after each epoch

    An epoch consisted of each unit attempting to change its value

    *

  • BOLTZMANN MACHINE: Example

    The experiment was repeated 100 times with different initial configurations

    Typically, valid tours were produced in 10 or fewer epochs, and for all 100 configurations valid tours were found in 20 or fewer epochs

    In these experiments, it was rare for the net to change its configuration once a valid tour was found

    *

  • BOLTZMANN MACHINE: Example

    The five tours of least length that were found are shown on the slide

    *

  • References

    Laurene Fausett, Section 7.1.1

    *
