Neural Networks - Lecture 4.ppt

Neural Networks by Muhammad Amjad

Transcript of Neural Networks - Lecture 4.ppt

  • RECURRENT NETWORKS: Introduction

    Suppose we want to predict tomorrow's stock market average y(t+1) based on today's economic indicators x_i(t). Now suppose we want the prediction to be based on today's and yesterday's economic indicators, x_i(t) and x_i(t-1)

    In this case we can simply augment the number of inputs

    *

  • RECURRENT NETWORKS: Introduction

    However, if we wish the network to consider an arbitrary-sized window of time in the past (e.g. sometimes two days, sometimes 20 days), then a different solution is required

    Recurrent networks provide one such solution

    They are used to learn sequential or time-varying patterns, in which the current value of the pattern is dependent upon its past history

    *

  • RECURRENT NETWORKS: Architecture

    A specific group of units receives feedback signals from the previous time step

    These units are known as context units. The weights on the feedback connections to the context units are fixed (e.g. at 1)

    *

  • RECURRENT NETWORKS: Architecture

    Each context unit is feed-forward connected to its own hidden unit and to all the other hidden units

    Number of context units = number of hidden units. Each hidden unit has a feedback connection to one context unit only

    *

  • RECURRENT NETWORKS: Training of Weights

    At time t, the activations of the context units are the activations (output signals) of the hidden units at the previous time step

    The weights from the context units to the hidden units are trained in exactly the same manner as the weights from the input units to the hidden units

    Thus, at any time step, the training algorithm is the same as for standard backpropagation
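    To make the training setup concrete, here is a minimal NumPy sketch of one forward step of an Elman network; the layer sizes, names, and the logistic activation are illustrative assumptions, not taken from the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class ElmanNet:
        """Minimal Elman (simple recurrent) network: one hidden layer
        plus context units holding the previous hidden activations."""

        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            # Trainable weights: input->hidden, context->hidden, hidden->output
            self.W_in = rng.normal(0, 0.1, (n_hidden, n_in))
            self.W_ctx = rng.normal(0, 0.1, (n_hidden, n_hidden))
            self.W_out = rng.normal(0, 0.1, (n_out, n_hidden))
            # Context units start at zero; the feedback weights are fixed at 1
            self.context = np.zeros(n_hidden)

        def step(self, x):
            # Hidden units see the current input AND the previous hidden state
            h = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
            y = sigmoid(self.W_out @ h)
            self.context = h.copy()   # feedback connection with fixed weight 1
            return y

    At each time step, backpropagation would update W_in, W_ctx and W_out exactly as in a feed-forward net, treating the context activations as just another set of inputs.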

    *

  • RECURRENT NETWORKS: Utilization

    After training, a recurrent net can be presented with an arbitrary-sized sequence, and it will predict the next element of the sequence

    It can also be trained to classify or label a sequence: we give an arbitrary-sized portion of the sequence as input, and the network tries to tell us the name of the sequence

    *

  • RECURRENT NETWORKS: Example: Next valid letter in a string of characters

    Suppose that strings of characters are generated by a small finite-state grammar. Each string begins with the symbol B and ends with the symbol E. At each decision point, either path can be taken with equal probability

    Two examples of possible strings are B P V V E and B T X X T T V P S E
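    Both example strings are consistent with the well-known Reber grammar; as an illustration (the slides do not name the grammar), here is a small Python generator for it:

    import random

    # Transition table of the (assumed) Reber grammar: state -> list of
    # (symbol, next_state) choices, each taken with equal probability.
    REBER = {
        0: [('B', 1)],
        1: [('T', 2), ('P', 3)],
        2: [('S', 2), ('X', 4)],
        3: [('T', 3), ('V', 5)],
        4: [('X', 3), ('S', 6)],
        5: [('P', 4), ('V', 6)],
        6: [('E', None)],
    }

    def generate_string():
        state, out = 0, []
        while state is not None:
            symbol, state = random.choice(REBER[state])
            out.append(symbol)
        return ''.join(out)

    # Both B P V V E and B T X X T T V P S E are possible outputs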

    *

  • RECURRENT NETWORKS: Example: Architecture used

    *

  • RECURRENT NETWORKS: Example: Training

    The training patterns for the neural net consisted of 60,000 randomly generated strings

    The string length ranged from 3 to 30 letters (plus the Begin and End symbols at the beginning and end of strings)

    The letters in the string are presented sequentially to the net, one by one, respecting the order in which they appear

    Each letter in the string becomes an input in its turn, and its successor is the target output (a sketch of this pairing follows)
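    As a sketch of this input/target pairing (the one-hot representation is an assumption; the slides do not specify the encoding):

    LETTERS = ['B', 'T', 'S', 'X', 'V', 'P', 'E']

    def one_hot(letter):
        v = [0.0] * len(LETTERS)
        v[LETTERS.index(letter)] = 1.0
        return v

    def training_pairs(string):
        """Each letter is an input; its successor is the target output."""
        return [(one_hot(a), one_hot(b)) for a, b in zip(string, string[1:])]

    # training_pairs('BPVVE') yields (B->P), (P->V), (V->V), (V->E)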

    *

  • RECURRENT NETWORKS: Example: Training

    The training algorithm for the finite-state grammar is:

    *

  • RECURRENT NETWORKS: Example: Training

    String: B T X S E

    *

  • RECURRENT NETWORKS: Example: Testing & Utilization

    After training, we hope that the net has learnt the grammar

    Given a valid sequence of letters, it can be used to predict the next valid letter in the sequence

    An interesting application is determining whether a given string is a valid string

    As each symbol is presented, the net predicts the possible valid successors of that symbol (output units with activations of 0.3 or more are counted as valid successors)

    *

  • RECURRENT NETWORKS: Example: Testing & Utilization

    If the next letter in the string is not one of the predicted valid successors, the string is rejected. If all the letters pass the test, then the string is accepted as valid

    The reported results for 70,000 random strings, 0.3% (210) of which were valid, are that the net correctly rejected 99.7% of the invalid strings and accepted 100% of the valid strings

    The net also performed perfectly on a set of extremely long strings (100 or more characters in length)
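    A minimal sketch of this accept/reject procedure, reusing the one_hot helper and an ElmanNet-style net with a step() method from the earlier sketches (the 0.3 threshold is from the slides; everything else is illustrative):

    THRESHOLD = 0.3  # output activations >= 0.3 count as valid successors

    def is_valid_string(net, string):
        """Feed the string one letter at a time; reject as soon as a letter
        is not among the predicted valid successors of its predecessor."""
        for current, nxt in zip(string, string[1:]):
            outputs = net.step(one_hot(current))  # one activation per letter
            valid = {LETTERS[i] for i, a in enumerate(outputs) if a >= THRESHOLD}
            if nxt not in valid:
                return False
        return True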

    *

  • RECURRENT NETWORKS: Example: Testing & Utilization

    Note that this grammar is such that context can be established with the help of two samples

    If the context is deeper, then a more complex network may be required

    *

  • RECURRENT NETWORKS: Example: Testing & Utilization

    The example may be adapted to other problems

    For example, the letters may be replaced by the features composing the letters, or by the words composing a sentence

    *

  • RECURRENT NETWORKS: Conclusion

    The architecture that we have studied is called a simple recurrent network, as it contains a single hidden layer

    It is also called an Elman network, and is similar to an architecture proposed by Jordan

    It is implemented in MATLAB's Neural Network Toolbox

    *

  • References

    Read Section 7.2.3, page 372, of Laurene Fausett

    *

  • ART-1: Introduction

    An unsupervised method of clustering that allows an increase in the number of clusters only if required

    Developed by Carpenter & Grossberg in 1987

    ART-1 is designed for clustering binary-valued vectors; ART-2 is for continuous-valued inputs

    *

  • ART-1: Architecture

    *

  • ART-1: Architecture

    The architecture of the computational units for ART-1 consists of F1 units (input and interface units), F2 units (cluster units), and a reset unit

    The reset unit implements user control over the degree of similarity of patterns placed on the same cluster

    *

  • ART-1: Architecture

    Each unit in the F1(a) input layer is connected only to the corresponding unit in the F1(b) interface layer

    Each unit in the input and interface layers is connected to the reset unit. The reset unit is connected to every F2 (cluster) unit

    *

  • ART-1: Architecture

    Each unit of the interface layer is connected to each unit in the cluster layer by two weighted pathways

    The interface unit X_i is connected to the cluster unit Y_j by the bottom-up weight b_ij. Similarly, unit Y_j is connected to unit X_i by the top-down weight t_ji

    The cluster layer is a competitive layer in which only the uninhibited node with the largest net input has a non-zero activation

    *

  • ART-1: Training

    Notation:

    n      number of components in the input vector
    m      maximum number of clusters that can be formed
    b_ij   bottom-up weights (from interface unit X_i to cluster unit Y_j)
    t_ji   binary top-down weights (from cluster unit Y_j to interface unit X_i)
    ρ      vigilance parameter
    s      binary input vector (an n-tuple)
    x      activation vector for the interface layer (binary)
    ||x||  norm of vector x, defined as the sum of the components x_i

    *

  • ART-1: Training: Competition-to-Learn Phase

    A binary input vector s is presented to the input layer, and the signals are sent to the corresponding X units. These interface units then broadcast to the cluster layer over connection pathways with bottom-up weights

    Each cluster unit computes its net input and the units compete for the right to be active

    The unit with the largest net input sets its activation to 1; all others have an activation of zero. Let the index of the winning unit be J. This winning unit becomes the candidate to learn the input pattern

    *

  • ART-1: Training: Similarity Check

    A signal is then sent down from the cluster layer to the interface layer (multiplied by the top-down weights). The interface X units remain on only if they receive non-zero signals from both the input and cluster units

    The activation vector x of the interface layer has the states of the individual units as its components; a unit is 1 ("on") if it receives non-zero inputs. The norm ||x|| gives the number of components in which the top-down weight vector t_J for the winning cluster unit and the input vector s are both 1. This quantity is sometimes referred to as the match

    *

  • ART-1: Training: Weights Update

    If the ratio of ||x|| to ||s|| is greater than or equal to the vigilance parameter ρ, the weights (top-down and bottom-up) for the winning cluster unit are adjusted

    The use of the ratio allows an ART-1 net to respond to relative differences. This reflects the fact that a difference of one component in vectors with only a few non-zero components is much more significant than a difference of one component in vectors with many non-zero components
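    The update equations themselves are not reproduced in the transcript; in Fausett's fast-learning formulation (with a fixed parameter L > 1, commonly L = 2) they are:

    b_{iJ}^{new} = \frac{L\, x_i}{L - 1 + \lVert x \rVert}, \qquad t_{Ji}^{new} = x_i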

    *

  • ART-1: Training: Search for another unit if the 1st candidate is rejected

    However, if the ratio is less than the vigilance parameter, the candidate unit is rejected and another candidate unit must be chosen

    The winning cluster unit becomes inhibited, so it cannot be chosen again as a candidate on this learning trial, and the activations of the input and interface units are reset to zero. The same input vector again sends its signal to the interface units, which again send it as the bottom-up signal to the cluster layer, and the competition is repeated (but without the participation of any inhibited units)

    *

  • ART-1: Training: Search for another unit if the 1st candidate is rejected

    The process continues until either a satisfactory match is found (a candidate is accepted) or all units are inhibited

    The action to be taken if all units are inhibited must be specified by the user: reduce the value of the vigilance parameter (thus allowing less similar patterns to be placed on the same cluster), increase the number of cluster units, or simply designate the current input pattern as an outlier that cannot be clustered

    *

  • ART-1: Training: Search for another unit if the 1st candidate is rejected

    At the end of each presentation of a pattern, all cluster units are returned to inactive status and are available to participate in the next competition

    *

  • ART-1: Training (the algorithm and architecture details are given on the slides)

    *

  • ART-1: Training: Example

    The ART-1 algorithm is used to group 4 vectors into at most 3 clusters (a runnable sketch follows)

    The vectors are: [1 1 0 0], [0 0 0 1], [1 0 0 0], [0 0 1 1]
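    A runnable sketch of the whole procedure on these four vectors (fast learning after Fausett, Section 5.2; the vigilance value rho = 0.4 and the parameter L = 2 are assumed, since the transcript does not state them):

    import numpy as np

    def art1(patterns, n_clusters, rho=0.4, L=2.0):
        n = len(patterns[0])
        b = np.full((n_clusters, n), 1.0 / (1.0 + n))  # bottom-up weights
        t = np.ones((n_clusters, n))                   # top-down weights
        assignments = []
        for s in map(np.array, patterns):
            inhibited = set()
            while True:
                # Competition: uninhibited cluster with largest net input wins
                net = b @ s
                for j in inhibited:
                    net[j] = -1.0
                J = int(np.argmax(net))
                # Similarity check against the vigilance parameter
                x = s * t[J]                           # component-wise AND
                if x.sum() / s.sum() >= rho:
                    b[J] = L * x / (L - 1.0 + x.sum()) # fast-learning update
                    t[J] = x
                    assignments.append(J)
                    break
                inhibited.add(J)                       # reject; search again
                if len(inhibited) == n_clusters:
                    assignments.append(None)           # outlier: no cluster fits
                    break
        return assignments

    print(art1([[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]],
               n_clusters=3))
    # With these assumed parameters the first and third vectors share one
    # cluster and the second and fourth share another: [0, 1, 0, 1]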

    *

  • ART-1: Training: Example (worked step by step on the slides)

    *

  • References

    Read Section 5.2 of Laurene Fausett

    *

  • RBF NETWORKS: Architecture

    *

  • RBF NETWORKS: Architecture

    Radial Basis Function Networks consist of:
    - a hidden layer of k Gaussian neurons
    - a set of weights w_i

    For Gaussian basis functions, the activation of hidden neuron j has the standard form φ_j(x) = exp( -||x - μ_j||² / (2σ_j²) ), where μ_j is the centre and σ_j the width of the neuron

    *

  • RBF NETWORKS: Training

    RBFNs are trained using a two-step process (a sketch follows):

    1. Determine the k Gaussian neurons using Kohonen unsupervised learning

    2. Determine the weights w_i using supervised learning (backpropagation)
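    A minimal sketch of the two-step procedure, with plain k-means standing in for Kohonen clustering and a least-squares solve standing in for backpropagation on the output weights (both substitutions are illustrative simplifications):

    import numpy as np

    def rbf_activations(X, centres, sigma):
        # Gaussian basis: phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2))
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def train_rbf(X, y, k=5, sigma=1.0, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1 (unsupervised): place the k centres
        centres = X[rng.choice(len(X), k, replace=False)].astype(float)
        for _ in range(iters):
            labels = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(2).argmin(1)
            for j in range(k):
                if np.any(labels == j):
                    centres[j] = X[labels == j].mean(axis=0)
        # Step 2 (supervised): fit the output weights on the RBF activations
        Phi = rbf_activations(X, centres, sigma)
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return centres, w

    def predict(X, centres, sigma, w):
        return rbf_activations(X, centres, sigma) @ w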

    *

  • RBF NETWORKS: Architecture

    The figure shows the inputs feeding a hidden layer of RBF neurons, with the outputs formed as linear combinations of the RBF activations

    There can be several outputs m, and each output is independent of the other outputs: the network can be viewed as m independent networks

    *

  • References

    Engelbrecht, Chapter 5

    *

  • HOPFIELD NETWORK: Architecture

    *

  • HOPFIELD NETWORK: Utilization

    *

  • HOPFIELD NETWORK: Training by Hebb's Rule

    Consider 2 nodes which have (1, 1) activations for the pattern to be stored

    If there is a positive weight between them, then each reinforces the other (each one makes a positive contribution to the activation of the other)

    For (0, 1) or (1, 0) activations, negative weights reinforce each other's activations

    *

  • HOPFIELD NETWORK: Training by Hebb's Rule

    This is a shortcut for Hebb's rule: it gives a positive weight if the two activations are (1, 1) or (0, 0), and a negative weight if they are (1, 0) or (0, 1)
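    The shortcut formula itself is not reproduced in the transcript; for stored binary patterns s^{(p)} it is presumably the standard form

    w_{ij} = \sum_p \bigl(2 s_i^{(p)} - 1\bigr)\bigl(2 s_j^{(p)} - 1\bigr), \qquad w_{ii} = 0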

    *

  • HOPFIELD NETWORK: Training by Hebb's Rule: Bipolar activations

    This is again a shortcut for Hebb's rule: the formula gives positive weights for (1, 1) or (-1, -1) activations and negative weights for (1, -1) or (-1, 1) activations
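    Written out, the bipolar shortcut is w_{ij} = \sum_p s_i^{(p)} s_j^{(p)} with w_{ii} = 0. As an illustration, here is a minimal storage-and-recall sketch (the asynchronous update order and the helper names are assumptions):

    import numpy as np

    def store(patterns):
        """Hebbian shortcut for bipolar patterns: sum of outer products,
        with the self-connections zeroed out."""
        n = len(patterns[0])
        W = np.zeros((n, n))
        for s in map(np.array, patterns):
            W += np.outer(s, s)
        np.fill_diagonal(W, 0)
        return W

    def recall(W, probe, sweeps=10, seed=0):
        """Asynchronous updates: each unit takes the sign of its net input."""
        rng = np.random.default_rng(seed)
        y = np.array(probe, dtype=float)
        for _ in range(sweeps):
            for i in rng.permutation(len(y)):
                net = W[i] @ y
                if net != 0:
                    y[i] = 1.0 if net > 0 else -1.0
        return y

    W = store([[1, -1, 1, -1, 1, -1]])
    print(recall(W, [1, -1, 1, -1, -1, -1]))  # the corrupted bit is repaired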

    *

  • HOPFIELD NETWORK: Storage Capacity
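    The capacity figures were given on a slide that is not reproduced here; the estimates commonly cited (e.g. in Fausett) are that a binary Hopfield net of n units can store roughly P ≈ 0.15 n patterns, and a bipolar net roughly P ≈ n / (2 log2 n)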

    *

  • References

    Laurene Fausett, Section 3.4.4; Kevin Gurney, Chapters 5 & 6

    *

  • BOLTZMANN MACHINE: Introduction

    Introduced in 1983 by Hinton & Sejnowski

    The architecture of a Boltzmann machine consists of a set of units and a set of bidirectional connections between pairs of units

    Not all units are connected; however, if two units are connected then their connection is bidirectional and the weights in both directions are the same, i.e. w_ij = w_ji

    A unit may also have a self-connection, with weight w_ii

    *

  • BOLTZMANN MACHINE: Architecture

    A Boltzmann machine can be used to solve the Travelling Salesman Problem (TSP)

    Units are arranged in a two-dimensional array

    The units within each row are fully interconnected

    Similarly, the units within each column are fully interconnected

    *

  • BOLTZMANN MACHINE: Architecture

    Architecture for the 10-city TSP

    *

  • BOLTZMANN MACHINE: Architecture

    The state x_i of a unit X_i is either on (1) or off (0)

    The objective of the neural net is to maximize the consensus function over all the units of the net

    *

  • BOLTZMANN MACHINE: Architecture

    The net attempts to find this maximum by letting each unit try to change its state from on to off or vice versa

    The change in consensus if unit X_i changes its state is ΔC(i) = [1 - 2x_i] [ w_ii + Σ_{j≠i} w_ij x_j ], where x_i is the current state of X_i

    The coefficient [1 - 2x_i] will be +1 if unit X_i is currently off and -1 if it is on

    *

  • BOLTZMANN MACHINE: Architecture

    However, unit X_i does not necessarily change its state, even if doing so would increase the consensus of the net

    The update is done probabilistically, so that the chances of the net getting stuck in a local maximum are reduced
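    The acceptance probability is not reproduced in the transcript; in the standard formulation (as in Fausett) the change is accepted with probability A(i, T) = 1 / (1 + exp(-ΔC(i)/T)), where T is the temperature. A minimal sketch of one probabilistic update step (array names are illustrative; x is a 0/1 NumPy vector):

    import numpy as np

    rng = np.random.default_rng(0)

    def boltzmann_step(W, x, T):
        """Attempt to flip one randomly chosen unit, accepting the flip
        with probability 1 / (1 + exp(-dC / T))."""
        i = rng.integers(len(x))
        # Change in consensus if unit i flips:
        # dC = (1 - 2 x_i) * (w_ii + sum_{j != i} w_ij x_j)
        dC = (1 - 2 * x[i]) * (W[i, i] + W[i] @ x - W[i, i] * x[i])
        if rng.random() < 1.0 / (1.0 + np.exp(-dC / T)):
            x[i] = 1 - x[i]
        return x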

    *


  • BOLTZMANN MACHINE: Architecture

    The connection pattern for the net is shown on the slide

    *

    *

  • BOLTZMANN MACHINE: Architecture

    Consider the relation between the penalty weight -p and the bonus weight b

    Allowing a unit U_ij to turn on should be encouraged only if no other units are on in that column or row (no city can be visited twice, nor can two different cities be visited at the same time)

    If we set p > b, then this objective will be achieved: if a unit turns on while another unit is already on in its row or column, the consensus function will have a negative change

    *

  • BOLTZMANN MACHINE: Architecture

    Now let us consider the relation between the constraint weight b and the distance weights

    Let d denote the maximum distance between any two cities on the tour

    Suppose that no unit is on in column j or in row i

    Allowing U_ij to turn on should be encouraged, and the weights should be set so that the consensus will increase if it turns on

    *

  • BOLTZMANN MACHINE: Architecture

    The change in consensus will be b - d_{i,k1} - d_{i,k2}

    where k1 indicates the city visited at stage j-1 of the tour, and k2 denotes the city visited at stage j+1 (and city i is visited at stage j)

    *

  • BOLTZMANN MACHINE: Architecture

    Since d is the maximum possible distance, the minimum change in consensus b - d_{i,k1} - d_{i,k2} will be equal to b - 2d

    Since the change in consensus should be positive, we can take b > 2d for the net to function properly

    *

  • BOLTZMANN MACHINE: Setting the weights

    The weights for a Boltzmann machine are set manually

    Their values are chosen such that the net will tend to make transitions towards a maximum of the consensus function

    *

  • BOLTZMANN MACHINE: Algorithm (given on the slides)

    *

  • BOLTZMANN MACHINE: Algorithm: Initial Temperature

    The initial temperature should be taken large enough so that the probability of accepting a change of state is approximately 0.5, regardless of whether the change is beneficial or detrimental

    *

  • BOLTZMANN MACHINE: Example

    The Boltzmann machine was used to solve the TSP

    The starting configuration had approximately half the units on

    The parameters were: T_0 = 20, b = 60, and p = 70

    The cooling schedule was T_new = 0.9 T_old after each epoch

    An epoch consisted of each unit attempting to change its value

    *

  • BOLTZMANN MACHINE: Example

    The experiment was repeated 100 times with different initial configurations

    Typically, valid tours were produced in 10 or fewer epochs, and for all 100 configurations valid tours were found in 20 or fewer epochs

    In these experiments, it was rare for the net to change its configuration once a valid tour was found

    *

  • BOLTZMANN MACHINE: Example

    The five tours of least length that were found are shown on the slide

    *

  • References

    Laurene Fausett, Section 7.1.1

    *
