Lecture Neural Nets 2010

8/14/2019 Lecture Neural Nets 2010

1/148

1

Intelligent ControlReti Neurali Per LidentificazionePredizione Ed Il ontrollo

Lecture 1:

Introduction to Neural Networks


2/148


3/148

3

Topics in Neural Networks

Introduction


4/148

4

Lecture Outline

1. Introduction (2)i. Course introductionii. Introduction to neural networkiii. Issues in Neural network

2. Simple Neural Network (3)i. Perceptronii. Adaline

3. Multilayer Perceptron (4)i. Basicsii. Dynamics

4. Radial Basis Networks (5)


5/148

5

Introduction to Neural Networks


6/148


7/1487

Artificial Neuronbias

i

j

neti =jwijyj + b


8/1488

Artificial Neuron

Input/Output Signal may be.

Real value.

Unipolar {0, 1}. Bipolar {-1, +1}.

Weight : wijstrength of connection.

Note that wijrefers to the weight fromunitj to unit i(not the other way round).


9/1489

Artificial Neuron

The bias bis a constant that can be written as

wi0y0with y0 = band wi0 = 1 such that

The function f is the units activation function.

In the simplest case, f is the identity function,

and the units output is just its net input. This

is called a linear unit.

Other activation functions are : step function,

sigmoid function and Gaussian function.

n

j

jiji ywnet

0


10/14810

Funes de Ativao


11/14811

Artificial Neural Networks (ANN)

Inp

utvector

weight

weight

Activation

function

Ou

tput(vector)

Activation

function

Signal

routing


12/14812

Historical Development of ANN

William James (1890) : Describes in words and figuressimple distributed networks and Hebbian learning

McCulloch & Pitts (1943) : Binary threshold units that

perform logical operations (they proof universalcomputation)

Hebb (1949) : formulation of a physiological (local)learning rule

Roseblatt (1958) : The perceptrona first real learningmachine

Widrow & Hoff (1960) : ADALINE and the Widrow-Hoffsupervised learning rule


13/14813

Historical Development of ANN

Kohonen (1982) : Self-organizing maps

Hopfield (1982): Hopfield Networks

Rumelhart, Hinton & Williams (1986) :Back-propagation & Multilayer Perceptron

Broomhead & Lowe (1988) : Radial

basis functions (RBF)

Vapnik (1990) -- support vector machine


14/148

14

WhenShouldANN Solution BeConsidered ?

The solution to the problem cannot be explicitly described

by an algorithm, a set of equations, or a set of rules.

There is some evidence that an input-output mapping exists

between a set of input and output variables.

There should be a large amount of data available to train

the network.


15/148

15

ProblemsThat Can Lead to PoorPerformance ?

The network has to distinguish between very similar cases

with a very high degree of accuracy.

The train data does not represent the ranges of cases that

the network will encounter in practice.

The network has a several hundred inputs.

The main discriminating factors are not present in the

available data. E.g. trying to assess the loan applicationwithout having knowledge of the applicant's salaries.

The network is required to implement a very complex

function.


16/148

16

Applicationsof Artificial Neural Networks

Manufacturing : fault diagnosis, fraud detection.

Retailing : fraud detection, forecasting, data

mining. Finance : fraud detection, forecasting, data mining.

Engineering : fault diagnosis, signal/imageprocessing.

Production : fault diagnosis, forecasting.

Sales & Marketing : forecasting, data mining.


17/148

17

Data Pre-processing

Neural networks very rarelyoperate on the raw

data. An initial pre-processingstage is essential.Some examples are as follows:

Feature extraction of images: For example, the analysis of X-raysrequires pre-processing to extract features which may be of interest

within a specified region. Representing input variables with numbers. For example "+1" is the

person is married, "0" if divorced, and "-1" if single. Another exampleis representing the pixels of an image: 255 = bright white, 0 = black.

To ensure the generalization capability of a neural network, the data

should be encoded in form which allows for interpolation.


18/148

18

Data Pre-processing

Categorical Variable

A categorical variable is a variable that can belong to

one of a number of discrete categories. For example,

red, green, blue.

Categorical variables are usually encoded using 1 out-of

n coding. e.g. for three colours, red = (1 0 0), green

=(0 1 0) Blue =(0 0 1).

If we used red = 1, green = 2, blue = 3, then this type

of encoding imposes an ordering on the values of the

variables which does not exist.


19/148

19

Data Pre-processing

CONTINUOUS VARIABLES

A continuous variable can be directly applied to

a neural network. However, if the dynamic range

of input variables are not approximately the

same, it is better to normalizeall input variables

of the neural network.


20/148

20

Example of Normalized Input Vector

Input vector : (2 4 5 6 10 4)t

Mean of vector :

Standard deviation :

Normalized vector :

Mean of normalized vector is zero

Standard deviation of normalized vector is unity

167.56

1 6

1

i

ix

714.2)(16

16

1

2

iix

tiN

xx 43.078.131.006.043.017.1


21/148

21

Simple Neural Networks

Lecture 3:

Simple Perceptron


22/148

22

Outlines

The Perceptron

Linearly separable problem

Network structure

Perceptron learning ruleConvergence of Perceptron


23/148

23

The perceptron was a simple model of ANN introducedby Rosenblatt of MIT in the 1960 with the idea oflearning.

Perceptron is designed to accomplish a simple patternrecognition task: after learning with real value training data

{x(i), d(i), i =1,2, , p} where d(i) = 1 or -1

For a new signal (pattern)x(i+1),the perceptron is

capable of telling you to which class the new signalbelongs

x(i+1) perceptron 1 or 1

THE PERCEPTRON


24/148

24

Perceptron

Linear threshold unit (LTU)

x1

x2

xn

.

.

.

w1

w2

wn

w0=b

x0=1

x=i=0nwixi

1 if i=0nwixi >0o(x)=-1 otherwise

o

{


25/148

25

Decision Surface of a Perceptron

+

++

+-

-

-

-x1

x2

+

+-

-

x1

x2

Perceptron is able to represent some useful functions

AND (x1,x2) choose weights w0=-1.5, w1=1, w2=1

But functions that are not linearly separable (e.g.

XOR) are not representable

AND

w0

w2

w1


26/148

26

m

i

ii

m

i

ii xwfbxwfy01

)()(

where f is the hard limiter function i.e.

01

01

1

1m

i

ii

m

iii

bxwif

bxwify

We can always treat the bias bas another weight with

inputs equal 1

Mathematically the Perceptron is


27/148

27

01

m

i

ii bxw

Why is the network capable of solving linearly

separable problem ?

0

1

m

i

ii bxw

01

m

i

ii bxw


28/148

28

Learning rule

An algorithm to update the weights wso that finallythe input patterns lie on both sides of the line decidedby the perceptron

Let tbe the time, at t = 0, we have

0)0( xw


29/148

29

Learning rule

An algorithm to update the weights wso that finallythe input patterns lie on both sides of the line decided by theperceptronLet tbe the time, at t = 1

0)1( xw


30/148

30

Learning rule


0)2( xw


31/148

31

Learning rule


0)3( xw


32/148

32

)())]()(()()[()()1(

)(1)(1)(

txtxtwsigntdttwtw

classintxifclassintxiftd

Perceptron learning rule

In Math

Where (t) is the learning rate >0,

+1 if x>0sign(x)= hard limiter function

1 if x


33/148

Exemplo 1

Logical OR function

So lets set w0= -0.6, w1= 0.2 andw2= 0.3.

x0 = 1.

33

X1 X2 R

-1 -1 -1

1 -1 1

-1 1 1

1 1 1


34/148

Exemplo 2

Logical E function

34

X1 X2 R

-1 -1 -1

1 -1 -1

-1 1 -1

1 1 1


35/148

35

In words:

If the classification is right, do not update the

weights

If the classification is not correct, update the

weight towards the opposite direction so that the

output move close to the right directions.


36/148

36

Perceptron convergence theorem(Rosenblatt, 1962)

Let the subsets of training vectors be linearly separable. Thenafter finite steps of learning we have

lim w(t) = w which correctly separate the samples.

The idea of proof is that to consider ||w(t+1)-w||-||w(t)-w||which is a decrease function of t


37/148

37

Summary of Perceptron learning

Variables and parametersx(t) = (m+1) dim. input vectors at time t= ( b, x1 (t), x2 (t), .... , xm (t) )

w(t) = (m+1) dim. weight vectors= ( 1 , w1 (t), .... , wm (t) )

b = bias

y(t) = actual responset = learning rate parameter, a +ve constant < 1d(t) = desired response


38/148

38

Summary of Perceptron learning

Data { (x(i), d(i)), i=1,,p}

Present the data to the network once a point

could be cyclic :(x(1), d(1)), (x(2), d(2)),, (x(p), d(p)),(x(p+1), d(p+1)),

or randomly(Hence we mix time t with i here)


39/148

39

1. Initialization Set w(0)=0. Then perform the followingcomputation for time step t=1,2,...

2. Activation At time step t, activate the perceptron by applying

input vector x(t)and desired response d(t)

3. Computation of actual response Compute the actual responseof the perceptron

y(t) = sign ( w(t) x(t) )

where signis the sign function

4. Adaptation of weight vector Update the weight vector of theperceptron

w(t+1) = w(t)+ t [ d(t) - y(t) ] x(t)5. Continuation

Summary of Perceptron learning (algorithm)


40/148

40

Questions remain

Where or when to stop?

By minimizing the generalization error

For training data{(x(i), d(i)), i=1,p}

How to define training error after t steps of learning?

E(t)= pi=1[d(i)-sign(w(t) .x(i)]

2


41/148

41

After

learning

t steps

E(t) = 0


42/148

42

How to define generalization error?

For a new signal {x(t+1),d(t+1)}, we have

Eg= [d(t+1)-sign (x(t+1) w (t)) ]2.

After

learning

t steps


43/148

43

We next turn to ADALINE learning,

from which we can understandthe learning rule, and more general the

Back-Propagation (BP) learning


44/148

44

Simple Neural Network

ADALINE Learning


45/148

45

Outlines

ADALINE

Gradient descending learning

Modes of training


46/148

46

Unhappy over Perceptron Training

When a perceptron gives the right answer, no

learning takes place

Anything below the threshold is interpreted

as no, even it is just below the threshold.

It might be better to train the neuron basedon how far below the threshold it is.


47/148

47

ADALINEis an acronym for ADAptive LINear Element

(or ADAptive LInear NEuron) developed by Bernard

Widrow and Marcian Hoff (1960).

There are several variations of Adaline. One has

threshold same as perceptron and another just a bare

linear function.

TheAdaline learningrule is also known as the least-

mean-squares (LMS) rule, the delta rule, or the Widrow-Hoff rule.

It is a training rule that minimizes the output error

using (approximate) gradient descent method.

ADALINE

Replace the step function in the perceptron with a


48/148

48

Replace the step function in the perceptron with acontinuous (differentiable) function f, e.g the simplest islinear function

With or without the threshold, theAdalineis trainedbased on the output of the function f rather than the finaloutput.

f (x)

(Adaline)


49/148

49

After each training pattern x(i) is presented, the correction toapply to the weights is proportional to the error.

E (i,t) = [ d(i) f(w(t) x(i)) ] 2 i=1,...,p

N.B. If f is a linear function f(w(t) x(i)) = w(t) x(i)

Summing together, our purpose is to find wwhich minimizes

E (t) = i

E(i,t)


50/148

50

To find gw(t+1) = w(t)+g( E(w(t)) )

so that wautomatically tends to the

global minima of E(w).

w(t+1) = w(t)- E(w(t))t

(see figure below)

General Approach gradient descent method


51/148

51

Gradientdirection is the direction of uphillforexample, in the Figure, at position 0.4, the

gradientis uphill ( F is E, consider one dim case )

Gradient direction

F(0.4)

F

In gradient descent algorithm we have


52/148

52

In gradient descent algorithm, we have

w(t+1) = w(t) F(w(t)) t

therefore the ball goes downhill sinceF(w(t))is downhill direction

Gradient direction

w(t)

I di t d t l ith h


53/148

53

Gradient direction

w(t+1)

In gradient descent algorithm, we have

w(t+1) = w(t) F(w(t)) t

therefore the ball goes downhill sinceF(w(t))is downhill direction


54/148

54

Graduallythe ball will stop at a local minimawhere thegradient is zero

Gradient direction

w(t+k)


55/148

55

In words

Gradient method could be thought of as a ball rolling downfrom a hill: the ball will roll down and finally stop at the valley

Thus, the weights are adjusted by

wj(t+1) = wj(t) +t [d(i) - f(w(t) x(i)) ] xj(i)f

This corresponds to gradient descent on the quadratic errorsurface E

When f =1, we have the perceptron learning rule (we have ingeneral f>0 in neural networks). The ball moves in the rightdirection.


56/148

56

About LMS Learning Rule

Suppose we have a set of input vectors, {x1, X2, . . . . XL},each having its own, perhaps unique, correct or desired outputvalue, dk, k = 1,...,L. The problem of finding a single weightvector that can successfully associate each input vector with itsdesired output value is no longer simple.

Calculation of w*. To begin, let's state the problem a little

differently: Given examples, (x1, d1), (x2, d2),..., (XL,dL), of someprocessing function that associates input vectors, xk, with (ormaps to) the desired output values, dk, what is the best weightvector, w*. Consider that the output can be write in vectornotation:

y = wtx If the actual output value is ykfor the kth input vector, then the

corresponding error term is ek= dkyk


57/148

57

More About LMS

The mean squared error, or expectation value ofthe error, is defined by

L

k

kk

L 1

22 1

where L is the number of input vectors in the training set.We can expand the mean squared error as follows

wpwRwwdwxdwxxwd

t

k

t

kk

t

kk

t

kk

2

22

22

Where R = (xkxkt.), is called the input correlation matrix, and a

vector p = (dkx

k

t)


58/148

58

More

To find the weight vector corresponding to theminimum mean squared error, we differentiate(w), evaluate the result at w*, and set the resultequal to zero:

pRw

pRw

pRw

pRww

w

1*

*

* 022

22


59/148

59

Sequential mode (on-line, stochastic, orper-pattern) :

Weights updated after each pattern ispresented (Perceptron is in this class)

Batch mode(off-line or per-epoch) : Weightsupdated after all patterns are presented

Two types of network training:

Comparison Perceptron and


60/148

60

Comparison Perceptron andGradient Descent Rules

Perceptron learning rule guaranteed to succeed if Training examples are linearly separable

Sufficiently small learning rate

Linear unit training rule uses gradient descentguaranteed to converge to hypothesis with

minimum squared error given sufficiently small

learning rate Even when training data contains noise

Even when training data not separable by Hyperplane

R i f P


61/148

61

Renaissance of Perceptron

Perceptron

Support VectorMachine

Multi-LayerPerceptron

Learning Theory, 90

Back-Propagation, 80


62/148

62

Summary of Previous Lectures

Perceptron

W(t+1)= W(t)+(t) [ d(t) - sign (w(t) . x)] x

Adaline (Gradient descent method)

W(t+1)= W(t)+(t) [ d(t) - f(w(t) . x)] x f


63/148

63

Multi-Layer Perceptron (MLP)

Idea: Credit assignment problem

Problem of assigning credit or blame to

individual elements involving in forming overallresponse of a learning system (hidden units)

In neural networks, problem relates to dividingwhich weights should be altered, by how muchand in which direction.

E l Th l t k


64/148

64

xn

x1

x2

Input Output

Example: Three-layernetworks

Input layer Hidden layer Output layer

Signal routing

P ti f hit t


65/148

65

Properties of architecture

No connections within a layer

No direct connections between input and output layersFully connected between layers

Often more than 2 layers

Number of output units need not equal number of input units

Number of hidden units per layer can be more or less than

input or output units

f w x bi ij j i

j

m

( )1

Each unit is a perceptron


66/148

66

BP(Back Propagation)


67/148

67

MultiLayer Perceptron I

Back Propagating Learning

BP learning algorithm


68/148

68

BP learning algorithmSolution tocredit assignment problemin MLP

Rumelhart, Hinton and Williams (1986)

BP has two phases:

Forward pass phase: computes functional signal,feedforward propagation of input pattern signals throughnetwork

Backward pass phase: computes error signal,propagation of error (difference between actual and desiredoutput values) backwards through network starting at outputunits

BP Learning for Simplest MLP


69/148

69

I

w(t)

W(t)

y

OBP Learning for Simplest MLP

Task: Data {I, d} to minimize

E = (d - o)2/2= [d - f(W(t)y(t))]2/2

= [d - f(W(t)f(w(t)I))]2/2

Error function at the output unit

Weight at time t is w(t) and W(t),

intend to find the weight w and W at time t+1

Where y = f(w(t)I), output of the hidden unit


70/148


71/148

71

yytWfodtW

tdWdf

dfdEtW

tdW

dEtWtW

))((')()(

)()(

)()()1(

Backward Pass Phase

I

w(t)

W(t)

y

O

o = f ( W(t) y )E = (d - o)2/2

B k d h


72/148

72

ytW

yytWfodtW

tdW

df

df

dE

tW

tdW

dEtWtW

)(

))((')()(

)()(

)()()1(

Backward pass phase

I

w(t)

W(t)

y

O

where = ( d-o) f

Backward pass phase


73/148

73

IItwftWtw

tdw

dytWytWfodtw

tdw

dy

dy

dEtw

tdwdEtwtw

))((')()(

)()())((')()(

)(

)(

)()()1(

I

w(t)

W(t)

y

O

Backward pass phase

o= f ( W(t) y )= f ( W(t) f( w(t) I) )


74/148

74

General Two LayerNetworkI inputs, Ooutputs, wconnections for input

units, Wconnections for output units, yis the

activity of input unit

net (t) = network input to the unit at time t

Ww

I O

Input units

Output units

y

Forward pass


75/148

75

Forward pass

Weights are fixed during forward & backward pass at timet

1. Compute values for hidden units

2. compute values for output units

net t w t I t

y f net t

j ji ii

j j

( ) ( ) ( )

( ( ))

Net t W t y

O f Net t

k kj jj

k k

( ) ( )

( ( ))

Ii

wji(t)

Wkj(t)

yj

Ok

Backward Pass


76/148

76

Recall delta rule , error measure for pattern n is

We want to know how to modify weights in order to decrease Ewhere

both for hidden units and output units

This can be rewritten as product of two terms using chain rule

E t d t O tk kk

( ) ( ( ) ( ))

1

22

1

)(

)()()1(

tw

tEtwtw

ij

ijij


77/148

77

)(

)(

)(

)(

)(

)(

tw

tnet

tnet

tE

tw

tE

ij

j

jij

How error for pattern changes as function of changein network input to unitj

How net input to unitjchanges as a function ofchange in weight w

both for hidden units and output units

Term A

Term B

Summaryi ht d t l l


78/148

78

weight updates are local

output unit

hidden unit

)()()()1(

)()()()1(

tyttWtW

tIttwtw

jkkjkj

ijjiji

k

ikjkj

ijjiji

tIWttnetf

tIttwtw

)()())(('

)()()()1(

)())(('))()((

)()()()1(

tytNetftOtd

tyttWtW

jkkk

jkkjkj

Once weight changes are computed for all units, weights areupdated at same time (bias included as weights here)

We now compute the derivative of the activation function f ( ).

(hidden unit)

(output unit)

Activation Functions


79/148

79

Activation Functionsto compute we need to find the derivative of activation

function f

to find derivative the activation function must be smooth

Sigmoidal (logistic) function-common in MLP

where k is a positive constant. The sigmoidal function givesvalue in range of 0 to 1

Input-output function of a neuron (rate coding assumption)

))(exp(1

1))((

tnetktnetf

i

i

Shape of sigmoidal function


80/148

80

Note: when net = 0, f = 0.5


81/148

Returning to local error gradientsin BP algorithm we have


82/148

82

g g gfor output units

For hidden units we have

))(1)(())()((

))(('))()(()(

tOtkOtOtd

tNetftOtdt

iiii

iiii

k

kikii

k

kikii

Wttytky

Wttnetft

)())(1)((

)())((')(

Since degree of weight change is proportional to derivative ofactivation function, weight changes will be greatest when unitsreceives mid-range functional signal than at extremes

Summary of BP learning algorithm


83/148

83

Summary of BP learning algorithm

Set learning rate

Set initial weight values (incl.. biases): w, W

Loop until stopping criteria satisfied:

present input pattern to input units

compute functional signal for hidden unitscompute functional signal for output units

present Target response to output unitscompute error signal for output units

compute error signal for hidden unitsupdate all weights at same timeincrement n to n+1 and select next I and dend loop


84/148

Exercise For the network shown in figure, calculate the expressions for the

weight changes using the EBP algorithm in an on-line learning mode.The training data, consisting of the input pattern vectorsx = [x1x2]and the output desired responses d = [d1d2]

84

Network training: T i i t h t dl til t i it i t


85/148

85

Training set shown repeatedly until stopping criteria are met

Each full presentation of all patterns = epoch

Randomise order of training patterns presented for eachepoch in order to avoid correlation between consecutive

training pairs being learnt (order effects)

Two types of network training:

Sequential mode(on-line, stochastic, or per-pattern)

Weights updated after each pattern is presented

Batch mode(off-line or per -epoch)

Advantages and disadvantages of different


86/148

86

g gmodes

Sequential mode: Less storage for each weighted connection

Random order of presentation and updating per pattern

means search of weight space is stochastic--reducing risk of

local minima able to take advantage of any redundancy intraining set (i.e.. same pattern occurs more than once in

training set, esp. for large training sets)

Simpler to implement

Batch mode: Faster learning than sequential mode


87/148

87

MultiLayer Perceptron II

Dynamics of MultiLayer Perceptron

Summary of Network Training


88/148

88

Forward phase: I(t), w(t),net(t),y(t),W(t),Net(t),O(t)

Backward phase:

Output unit

Input unit

k

ikjkj

ijijji

tItWttnetf

tIttwtw

)()()())(('

)()()()1(

)())(('))()(()()()()1(

tytNetftOtdtyttWtW

jkkk

jkkjkj


89/148


90/148

90

Network training:

Despite this sound theoretical foundation concerning

the representational capabilities of neural networks,

and notwithstanding the success of the EBP learning

algorithm, there are many practical drawbacks to theEBP algorithm.

The most troublesome is the usually long training

process, which does not ensure that the absolute

minimum of the cost function (the best performanceof the network) will be achieved.

N t k t i i


91/148

Network training:

The algorithm may become stuck at some localminimum, and such a termination with a suboptimalsolution will require repetition of the whole learning

process by changing the structure or some of thelearning parameters that influence the iterativescheme.

As in many other scientific and engineering

disciplines, so in the field of artificial neural networks,the theory (or at least part of it) was established onlyafter a number of practical neural networkapplications had been implemented.

91

Advantages and disadvantages of different


92/148

92

g gmodes

Sequential mode: Less storage for each weighted connection

Random order of presentation and updating per pattern

means search of weight space is stochastic--reducing risk of

local minima able to take advantage of any redundancy intraining set (i.e.. same pattern occurs more than once in

training set, esp. for large training sets)

Simpler to implement

Batch mode:Faster learning than sequential mode

Goals of Neural Network Training


93/148

93

g

To give the correct output for input

training vector (Learning)

To give good responses to new unseeninput patterns (Generalization)

Training and Testing Problems


94/148

94

Training and Testing Problems

Stuck neurons: Degree of weight change is proportionalto derivative of activation function, weight changes will begreatest when units receives mid-range functional signal thanat extremes neuron. To avoid stuck neurons weightsinitialization should give outputs of all neurons approximate 0.5

Insufficient number of training patterns: In thiscase, the training patterns will be learnt instead of theunderlying relationship between inputs and output, i.e. networkjust memorizing the patterns.

Too few hidden neurons: network will not produce a

good model of the problem.Over-fitting: the training patterns will be learnt insteadof the underlying function between inputs and output becauseof too many of hidden neurons. This means that the networkwill have a poor generalization capability.


95/148

95

Dynamics of BP learningAim is to minimise an error function over all trainingpatterns by adapting weights in MLP

Recalling the typical error function is the mean

squared error as follows

E(t)=

The idea is to reduce E(t) to global minimum point.

p

k

kk tOtd1

2))()((

2

1


96/148

96

Dynamics of BP learning

In single layer perceptronwith linear activationfunctions, the error function is simple, describedby a smooth parabolic surface with a single

minimum


97/148


98/148


99/148

Selecting Initial Weight Values


100/148

100

Selecting Initial Weight Values

Choice of initial weight values is important as thisdecides starting position in weight space. That is,

how far away from global minimum

Aim is to select weight values which produce

midrange function signals

Select weight values randomly from uniform

probability distribution

Normalise weight values so number of weightedconnections per unit produces midrange function

signal

Convergence of Backprop


101/148

101

Avoid local minumum with fast convergence:

Add momentum

Stochastic gradient descent

Train multiple nets with different initial weights

Nature of convergence

Initialize weights near zero or initial networksnear-linear

Increasingly non-linear functions possible astraining progresses

Use of Available Data Set for Training


102/148

102

Use of Available Data Set for Training

Training setuse to update the weights.Patterns in this set are repeatedly in random

order. The weight update equation areapplied after a certain number of patterns.

Validation setuse to decide when to stoptraining only by monitoring the error.

Test setUse to test the performance of theneural network. It should not be used as partof the neural network development cycle.

The available data set is normally split into three

sets as follows:

Earlier Stopping - Good Generalization

R i t h t i th


103/148

103

Running too many epochs may overtrainthenetwork and result in overfittingand perform

poorly in generalization. Keep a hold-out validation set and test accuracy

after every epoch. Maintain weights for bestperforming network on the validation set and stoptraining when error increases increases beyondthis.

No. of epochs

errorTraining set

Validation set

Model Selection by Cross validation


104/148

104

Model Selection by Cross-validation

Too few hidden unitsprevent the network from learningadequately fitting the data and learning the concept.

Too many hidden unitsleads to overfitting.

Similar cross-validation methodscan be used to determinean appropriate number of hidden units by using the optimaltest error to select the model with optimal number of hiddenlayers and nodes.

No. of epochs

errorTraining set

Validation set


105/148

105

Genetic Algorithms

lternative training algorithm

History Background


106/148

106

Idea of evolutionary computing was introduced in the 1960s by I.Rechenbergin his work "Evolution strategies" (Evolutionsstrategiein

original). His idea was then developed by other researchers. Genetic

Algorithms(GAs) were invented by John Hollandand developed by him

and his students and colleagues. This lead to Holland's book "Adaption in

Natural and Artificial Systems" published in 1975.

In 1992 John Kozahas used genetic algorithm to evolve programsto

perform certain tasks. He called his method Genetic Programming"(GP). LISP programs were used, because programs in this language can

expressed in the form of a "parse tree", which is the object the GA works

on.

Biological Background


107/148

107

g gChromosome.

All living organisms consist of cells. In each cell there is the same set of

chromosomes. Chromosomes are strings of DNAand serves as a model for

the whole organism.A chromosome consist of genes, blocks of DNA. Each

gene encodes a particular protein. Basically can be said, that each gene

encodes a trait, for example color of eyes. Possible settings for a trait (e.g.

blue, brown) are called alleles. Each gene has its own position in the

chromosome. This position is called locus.

Complete set of genetic material (all chromosomes) is called genome.Particular set of genes in genome is called genotype. The genotype is with

later development after birth base for the organism's phenotype, its physical

and mental characteristics, such as eye color, intelligence etc.

Biological Background
http://cs.felk.cvut.cz/~xobitko/ga/dna.htmlhttp://cs.felk.cvut.cz/~xobitko/ga/dna.html


108/148

108

Reproduction.

During reproduction, first occurs recombination(orcrossover). Genes from parents form in some way the

whole new chromosome. The new created offspring can

then be mutated. Mutationmeans, that the elements ofDNA are a bit changed. This changes are mainly caused by

errors in copying genes from parents.

The fitnessof an organism is measured by success of the

organism in its life.

Evolutionary Computation


109/148

109

Evolutionary Computation

Based on evolution as it occurs in nature Lamarck, Darwin, Wallace: evolution of species, survival

of the fittest

Mendel: genetics provides inheritance mechanism

Hence genetic algorithms

Essentially a massively parallel search procedure

Start with random population of individuals

Gradually move to better individuals

A Simple Genetic Algorithm


110/148

110

p g

Optimization task: find the maximum of f(x)

for example f(x)=xsin(x) x [0,p]genotype: binary string s [0,1]5 e.g. 11010, 01011, 10001mapping : genotype phenotype

binary integer encoding: x = si 2n-i-1/ (2n-1)

genotype integ. phenotype fitness prop. fitness11010 26 2.6349 1.2787 30%

01011 11 1.1148 1.0008 24%10001 17 1.7228 1.7029 40%00101 5 0.5067 0.2459 6%

Initial population

5

1

n

i

Some Other Issues RegardingE l ti C ti


111/148

111

Evolutionary Computing

Evolution according to Lamarck. Individual adapts during lifetime.

Adaptations inherited by children.

In nature, genes dont change; but for computations we could

allow this...

Baldwin effect. Individuals ability to learn has positive effect on evolution.

It supports a more diverse gene pool.

Thus, more experimentation with genes possible.

Bacteria and virus. New evolutionary computing strategies.


112/148

112

Radial Basis Functions

Radial Basis Functions

Introduction fit=ajuste


113/148

Introduction fit=ajuste

The back-propagation algorithm for the design ofa MLP (under surpervision) may be viewed as theapplication of a recursive technique know instatistic as stochastic approximation.

In this moment we take a completely differentapproach by viewing the design of a neuralnetwork as a curve fitting (approximation)

problemin a high-dimensional space. According with this view-point, learning is

equivalent to find a surface in multidimensionalspace that provide a best fit to the training data.

113

Radial-basis function (RBF) networks


114/148

114

RBF = radial-basis function: a function whichdepends only on the radial distance from a point

XOR problem

quadratically separable

Radial-basis function (RBF) networks

S RBF f ti t ki th f


115/148

115

So RBFs are functions taking the form

where is a nonlinear activation function, xis the

input and xiis the ithposition, prototype, basis or

centrevector.

The idea is thatpoints near the centres will have

similar outputs (i.e. if x ~ xi then f (x) ~f (xi))

since they should have similar properties.

The simplest isthe linear RBF : (x) =||x xi||

||)(|| ixx

Radial Functions


116/148

Radial Functions

Characteristic feature-their responsedecreases (or increases) monotonically withdistance from a central point.

The center, the distance scale, and the

precise shape of the radial functionare parameters of the model, all fixed if it islinear.

Typical radial functions are:

The Gaussian RBF (monotonically decreases withdistance from the center).

A multiquadric RBF (monotonically increases withdistance from the center).

Typical RBFs include(a) Multiquadrics


117/148

117

(a) Multiquadrics

for some c>0

(b) Inverse multiquadrics

for some c>0

(c) Gaussian

for some >0

2/122)()( crr

2/122 )()( crr

)2

exp()(2

2

rr


118/148

118nonlocalized functions localized functions


119/148

Starting point: exact interpolation


120/148

120

Starting point: exact interpolation

Each input pattern x must be mapped onto atarget value d


121/148

y1

Single-layer networks1 y)1||y-x1||


122/148

122

yp

Input

y1y2

Output

Input layer : N y)N||y-xN||

wj

d

output = wii (y - xi)

adjustable parameters are weights wj

number of hidden units = number of data points

Form of the basis functions decided in advance


123/148

Large = 1


124/148

124

g

Small = 0.2


125/148

125

Problems with exact interpolationcan produce poor generalisation performance as only data


126/148

126

a p odu poo g a a o p o a a o y da apoints constrain mapping

Overfitting problem

Bishop(1995) example

Underlying function f(x)=0.5+0.4sine(2x)sampled randomly for 30 points

added Gaussian noise to each data point

30 data points 30 hidden RBF units

fits all data points but creates oscillations due added noiseand unconstrained between data points


127/148

127

All Data Points 5 Basis functions


128/148

Learning Algorithms


129/148

Redes Neurais e

Lgica Fuzzy Radial Basis Function Network 129

g g

Parametersto be learnt are: centers

spreads

weights

Different learning algorithms

Learning Algorithm 1


130/148

Neurais e Lgica

Fuzzy Radial Basis Function Network 130

Centers are selected at random center locationsare chosen randomly from the

training set

Spreadsare chosen by normalization:

1m

maxdcentersofnumber

centers2anybetweendistanceMaximum

1

2

i2max

12

i

m,1i

txd

mexptx

i



131/148

Redes Neurais e


Weightsare found by means of pseudo-inverse method

1

2

ij2

max

1j ij i

m2,1,i,N...,2,1,j

txd

mexp

dw

...,

Desired response

Pseudo-inverse of


132/148

Learning Algorithm 2: Centers K-means clustering algorithm for centers


133/148

Redes Neurais e


1 Initialization: tk(0) random k = 1, , m1

2 Sampling: draw x from input space C3 Similaritymatching: find index of best center

4 Updating: adjust centers

5. Continuation: increment nby 1, goto 2and continueuntil no noticeable changes of centers occur

)n(tx(n)minargk(x) kk

k(x)kif)n(tx(n))n(t kk

otherwise)n(tk )1n(tk



134/148

Redes Neurais e


Supervised learning of all the parameters

using the gradient descent method

Modify centers

j

tjt

tj

E

N

1i

2i

2

1e

Instantaneous error function

Learning

rate forjt

Depending on the specific function can be computed using

the chain rule of calculus



135/148

Redes Neurais e


Modify spreads

Modify output weights

j

j

E

j

ij

ijijw

w

E

Design Radial Basis Networks


136/148

g

136

net = newrb(P,T,goal,spread,MN,DF)

P R-by-Q matrix of Q input vectors

T S-by-Q matrix of Q target class

vectorsgoal Mean squared error goal

(default = 0.0)

spread Spread of radial basis functions(default = 1.0)

MN Maximum number of neurons(default is Q)

DF Number of neurons to add betweendisplays (default = 25)

Examples


137/148

p

137

Here you design a radial basis network, given inputs

Pand targetsT.

P = [1 2 3];T = [2.0 4.1 5.9];

net = newrb(P,T);

The network is simulated for a new input.

P = 1.5;Y = sim(net,P)

Comparison with multilayer NNRBF N t k d t f l ( li )


138/148

Redes Neurais e


RBF-Networksare used to perform complex (non-linear)pattern classification tasks.

Comparison between RBFnetworksand multilayerperceptrons:

Both are examples of non-linear layered feed-forward

networks.

Both are universal approximators.

Hidden layers: RBF networks have one singlehidden layer. MLP networks may have morehidden layers.

Comparison with multilayer NN


139/148

Redes Neurais e


Neuron Models: The computation nodes in the hidden layer of a RBF network are

different. They serve a different purpose from those in the outputlayer.

Typically computation nodes of MLP in a hidden or output layershare a common neuron model.

Linearity: The hidden layer of RBF is non-linear, the output layer of RBF is

linear.

Hidden and output layers of MLP are usually non-linear.

Comparison with multilayer NN


140/148

Redes Neurais e


Activation functions: The argument of activation function of each hidden unit in aRBF NN computes the Euclidean distance between inputvector and the center of that unit.

The argument of the activation function of each hidden unitin a MLP computes the inner product of input vector and the

synaptic weight vector of that unit. Approximations:

RBF NN using Gaussian functions construct localapproximations to non-linear I/O mapping.

MLP NN construct globalapproximations to non-linear I/O

mapping.


141/148

141

pplication ExamplesNonlinear Identification, Prediction and Control

Nonlinear System Identification


142/148

142

Target function: yp(k+1) = f(.)

Identified function: yNET(k+1) = F(.)

Estimation error: e(k+1)

Nonlinear System Neural Control


143/148

143

d: reference/desired response

y: system output/desired output

u: system input/controller output

: desired controller input

u*:NN output

e: controller/network error

The goal of training is to find an

appropriate plant control u from

the desired response d. The weights

are adjusted based on the difference

between the outputs of the networksI & II to minimise e. If network I is

trained so that y = d, then u = u*.

Networks act as inverse dynamics

identifiers.



144/148

144

Neural network

input generationPm



145/148

145

Neural network targetTm

Neural network response(angle & velocity)


146/148

Model Reference Control


147/148

147

Neural controller + nonlinear system diagram

Neural controller, reference model, neural model

Matlab NNtool GUI (Graphical User Interface)


148/148

Lecture Neural Nets 2010

Documents

Transcript of Lecture Neural Nets 2010