Back Propagation
Amir Ali Hooshmandan, Mehran Najafi, Mohamad Ali Honarpisheh


Contents
• What is it?
• History
• Architecture
• Activation Function
• Learning Algorithm
• EBP Heuristics
• How Long to Train
• Virtues and Limitations of BP
• About Initialization
• Accelerating Training
• An Application
• Different Problems Require Different Learning Rate Adaptive Methods

What is it?
• A supervised learning algorithm
• Based on the error-correction learning rule
• A generalization of adaptive filtering

Algorithm

History
• 1986 – Rumelhart
  – Paper: "Why are 'what' and 'where' processed by separate cortical visual systems?"
  – Book: Parallel Distributed Processing: Explorations in the Microstructure of Cognition
• Parker
  – Optimal algorithms for adaptive networks: second order back propagation, second order direct propagation
• 1974 & 1969

Architecture

(Network diagram: input layer i with signals x_i, hidden layer j with outputs z_j, output layer k; weights v_{i,j} connect input to hidden units and w_{j,k} connect hidden to output units.)

Activation Function

f(v_j(n)) = 1 / (1 + exp(-a v_j(n)))

f'(v_j(n)) = a y_j(n) [1 - y_j(n)]

where y_j(n) is the output of neuron j.

Characteristics: continuous, differentiable, monotonically non-decreasing, and its derivative can be computed directly from the neuron's output.
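A minimal sketch of this activation function and its derivative; the function names, the default slope a = 1, and the numerical check are my own choices, not taken from the slides:

```python
import numpy as np

def logistic(v, a=1.0):
    """f(v) = 1 / (1 + exp(-a * v))."""
    return 1.0 / (1.0 + np.exp(-a * v))

def logistic_deriv(y, a=1.0):
    """Derivative expressed through the neuron's output: f'(v) = a * y * (1 - y)."""
    return a * y * (1.0 - y)

# sanity check against a numerical derivative
v = np.linspace(-4, 4, 9)
h = 1e-6
numeric = (logistic(v + h) - logistic(v - h)) / (2 * h)
print(np.allclose(logistic_deriv(logistic(v)), numeric))  # True
```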

Learning Algorithm

e_k(n) = d_k(n) - y_k(n)

Energy in this error:

Energy of the error produced by output neuron k in epoch n:   (1/2) e_k(n)^2

Total energy of the output of the net:   E(n) = (1/2) Σ_{k∈C} e_k(n)^2

Learning Algorithm (cont'd)

Purpose: minimizing E(n). We therefore need the gradient ∂E(n)/∂w_{j,k}(n).

E(n) = (1/2) Σ_{k∈C} e_k(n)^2   ⇒   ∂E(n)/∂e_k(n) = e_k(n)

e_k(n) = d_k(n) - y_k(n)   ⇒   ∂e_k(n)/∂y_k(n) = -1

y_k(n) = f(y_in_k(n)), where y_in_k(n) is the local field   ⇒   ∂y_k(n)/∂y_in_k(n) = f'(y_in_k(n))

y_in_k(n) = Σ_{j∈H} w_{j,k}(n) z_j(n)   ⇒   ∂y_in_k(n)/∂w_{j,k}(n) = z_j(n)

Learning Algorithm (cont'd)

Chain rule in the derivation:

∂E(n)/∂w_{j,k}(n) = ∂E(n)/∂e_k(n) · ∂e_k(n)/∂y_k(n) · ∂y_k(n)/∂y_in_k(n) · ∂y_in_k(n)/∂w_{j,k}(n)

∂E(n)/∂w_{j,k}(n) = e_k(n) · (-1) · f'(y_in_k(n)) · z_j(n)

Local gradient:   δ_k(n) = -∂E(n)/∂y_in_k(n)

Computing Δv_{i,j} for non-output layers

Problem: a hidden neuron has no target of its own, so there is no direct error term for it; it contributes to the errors of many output neurons.

We need another way to compute δ_j:

δ_j(n) = -∂E(n)/∂z_in_j(n)

δ_j(n) = -∂E(n)/∂z_j(n) · ∂z_j(n)/∂z_in_j(n)   (chain rule)

Computing Δv_{i,j} for non-output layers (cont'd)

E(n) = (1/2) Σ_{k∈C} e_k(n)^2   ⇒   ∂E(n)/∂z_j(n) = Σ_{k∈C} e_k(n) · ∂e_k(n)/∂z_j(n)

∂e_k(n)/∂z_j(n) = ∂e_k(n)/∂y_in_k(n) · ∂y_in_k(n)/∂z_j(n)

e_k(n) = d_k(n) - y_k(n) = d_k(n) - f(y_in_k(n))   ⇒   ∂e_k(n)/∂y_in_k(n) = -f'(y_in_k(n))

y_in_k(n) = Σ_{j∈H} w_{j,k}(n) z_j(n)   ⇒   ∂y_in_k(n)/∂z_j(n) = w_{j,k}(n)

Computing Δv_{i,j} for non-output layers (cont'd)

∂E(n)/∂z_j(n) = Σ_{k∈C} e_k(n) · ∂e_k(n)/∂z_j(n) = -Σ_{k∈C} e_k(n) f'(y_in_k(n)) w_{j,k}(n) = -Σ_{k∈C} δ_k(n) w_{j,k}(n)

z_j(n) = f(z_in_j(n))   ⇒   ∂z_j(n)/∂z_in_j(n) = f'(z_in_j(n))

Therefore:

δ_j(n) = -∂E(n)/∂z_j(n) · ∂z_j(n)/∂z_in_j(n) = f'(z_in_j(n)) Σ_{k∈C} δ_k(n) w_{j,k}(n)

Computing Weight Correction

(Weight correction) = (learning rate parameter) × (local gradient) × (input signal of the previous-layer neuron)

Δw_{j,k}(n) = η δ_k(n) z_j(n)   (hidden-to-output weights)

Δv_{i,j}(n) = η δ_j(n) x_i(n)   (input-to-hidden weights)
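As a check on these formulas, here is a minimal NumPy sketch that computes both local gradients and both weight corrections for a single training pattern. The names (x, d, v, w, eta), the logistic activation, and the omission of bias terms are illustrative assumptions, not the presenters' code:

```python
import numpy as np

def f(u):                                  # logistic activation
    return 1.0 / (1.0 + np.exp(-u))

def deltas_and_corrections(x, d, v, w, eta=0.2):
    z_in = x @ v                           # hidden local fields  z_in_j
    z = f(z_in)                            # hidden outputs       z_j
    y_in = z @ w                           # output local fields  y_in_k
    y = f(y_in)                            # network outputs      y_k

    e = d - y                              # e_k = d_k - y_k
    delta_k = e * y * (1.0 - y)            # output local gradients
    delta_j = (delta_k @ w.T) * z * (1.0 - z)  # back-propagated hidden gradients

    dw = eta * np.outer(z, delta_k)        # Δw_{j,k} = η δ_k z_j
    dv = eta * np.outer(x, delta_j)        # Δv_{i,j} = η δ_j x_i
    return dw, dv

# toy usage on random data
rng = np.random.default_rng(0)
x = rng.normal(size=3)                     # one input pattern
d = np.array([1.0])                        # its target
v = rng.normal(size=(3, 4))                # input -> hidden weights
w = rng.normal(size=(4, 1))                # hidden -> output weights
dw, dv = deltas_and_corrections(x, d, v, w)
```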

Training Algorithm

Step 0: Initialize weights (set to random values with zero mean and unit variance).
Step 1: While the stopping condition is false, do Steps 2-9.
Step 2: For each training pair, do Steps 3-8.

Feedforward:
Step 3: Each input unit (X_i, i = 1, ..., n) receives input signal x_i and broadcasts this signal to all units in the layer above (the hidden units).

Training Algorithm (cont'd)

Step 4: Each hidden unit (Z_j, j = 1, ..., p) sums its weighted input signals,

z_in_j = v_{0j} + Σ_{i=1..n} x_i v_{ij}

applies its activation function to compute its output signal,

z_j = f(z_in_j)

and sends this signal to all units in the layer above (the output units).

Training Algorithm (cont'd)

Step 5: Each output unit (Y_k, k = 1, ..., m) sums its weighted input signals,

y_in_k = w_{0k} + Σ_{j=1..p} z_j w_{jk}

and applies its activation function to compute its output signal,

y_k = f(y_in_k).

Training Algorithm (cont'd)

Backpropagation of error:
Step 6: Each output unit (Y_k, k = 1, ..., m) receives a target pattern corresponding to the input training pattern and computes its error information term,

δ_k = (t_k - y_k) f'(y_in_k),

calculates its weight correction term (used to update w_{jk} later),

Δw_{jk} = α δ_k z_j,

calculates its bias correction term (used to update w_{0k} later),

Δw_{0k} = α δ_k,

and sends δ_k to units in the layer below.

Training Algorithm (cont'd)

Step 7: Each hidden unit (Z_j, j = 1, ..., p) sums its delta inputs from the units in the layer above,

δ_in_j = Σ_{k=1..m} δ_k w_{jk},

multiplies by the derivative of its activation function to calculate its error information term,

δ_j = δ_in_j f'(z_in_j),

calculates its weight correction term (used to update v_{ij} later),

Δv_{ij} = α δ_j x_i,

and calculates its bias correction term (used to update v_{0j} later),

Δv_{0j} = α δ_j.

Training Algorithm (cont'd)

Update weights and biases:
Step 8: Each output unit (Y_k, k = 1, ..., m) updates its bias and weights (j = 0, ..., p):

w_{jk}(new) = w_{jk}(old) + Δw_{jk}

Each hidden unit (Z_j, j = 1, ..., p) updates its bias and weights (i = 0, ..., n):

v_{ij}(new) = v_{ij}(old) + Δv_{ij}

Step 9: Test the stopping condition.
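Putting Steps 0-9 together, here is a minimal NumPy sketch that trains a small network on XOR with bipolar targets (as in the momentum comparison later). The layer sizes, tanh activation, learning rate, epoch limit, and stopping threshold are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[-1], [1], [1], [-1]], dtype=float)             # bipolar targets

n_in, n_hid, n_out = 2, 4, 1
V = rng.normal(0.0, 1.0, (n_in + 1, n_hid))   # Step 0: input->hidden weights (+bias row)
W = rng.normal(0.0, 1.0, (n_hid + 1, n_out))  # Step 0: hidden->output weights (+bias row)
eta = 0.2

def tanh_deriv(y):                             # derivative of tanh via its output
    return 1.0 - y ** 2

for epoch in range(10000):                              # Step 1
    sq_err = 0.0
    for x, t in zip(X, T):                              # Step 2
        xb = np.append(x, 1.0)                          # Step 3: broadcast input (+bias input)
        z = np.tanh(xb @ V)                             # Step 4: hidden outputs
        zb = np.append(z, 1.0)
        y = np.tanh(zb @ W)                             # Step 5: network outputs
        delta_k = (t - y) * tanh_deriv(y)               # Step 6: output error terms
        delta_j = (W[:-1] @ delta_k) * tanh_deriv(z)    # Step 7: hidden error terms
        W += eta * np.outer(zb, delta_k)                # Step 8: update weights and biases
        V += eta * np.outer(xb, delta_j)
        sq_err += float((t - y) @ (t - y))
    if sq_err < 0.05:                                   # Step 9: stopping condition
        break
```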

EBP Heuristics

• Number of hidden layers:
  – Theoretical and simulation results show that there is no need for more than two hidden layers.
  – One or two hidden layers?
    • Chester (1990): "Why two hidden layers are better than one"
    • Gallant (1990): "Never try a multilayer model for fitting data until you have first tried a single-layer model"
  – Both architectures are theoretically able to approximate any continuous function to the desired degree of accuracy.

EBP Heuristics (cont'd)

• Number of hidden layers (cont'd):
  – It is difficult to say which topology is better; it depends on:
    • Size of the NN
    • Learning time
    • Implementability in hardware
    • Accuracy
  – Solving the problem first with a one-hidden-layer NN seems appropriate.

EBP Heuristics (cont'd)

– Every adjustable network parameter of the cost function should have its own individual learning rate parameter.
– Every learning rate parameter should be allowed to vary from one iteration to the next.
– Increase the LR of a weight whose derivative keeps the same sign for several consecutive iterations.
– Decrease the LR of a weight whose derivative alternates in sign.

How Long to Train

• The aim is to balance generalization and memorization (minimizing the cost function is not necessarily a good idea).
  – Hecht-Nielsen (1990): use two disjoint sets for training:
    • Training set
    • Training-testing set
  – As long as the error on the training-testing set decreases, training continues.
  – When that error begins to increase, the net is starting to memorize.

Virtues and Limitations of BP

• Connectionism
  – Biological reasons (limits of the analogy):
    • Real neurons do not switch freely between excitatory and inhibitory roles the way model weights do
    • There are no global connections as in an MLP
    • There is no backward propagation of error in real neurons
  – Useful in parallel hardware implementations
  – Fault tolerance

Virtues... (cont'd)

• Computational efficiency
  – The computational complexity of an algorithm is measured in terms of multiplications, additions, etc.
  – A learning algorithm is said to be computationally efficient when its complexity is polynomial in the number of adjustable parameters.
  – The BP algorithm is computationally efficient: in an MLP with a total of W weights, its complexity is linear in W.

Virtues... (cont'd)

• Convergence
  – Saarinen (1992): local convergence rates of the BP algorithm are linear.
  – The error surface may be too flat or too curved, or the step may point in the wrong direction.
• Local minima
  – BP learning is basically a hill-climbing (gradient descent) technique.
  – It can therefore get stuck in local minima (isolated valleys).

About Initialization

About Init... (cont'd)

• Other issues:
  – Initialization of the output-layer (OL) weights should not result in small weights:
    • If the output layer weights are small, then so is the contribution of the HL neurons to the output error, and consequently the effect of the hidden layer weights is not visible enough.
    • If the OL weights are too small, the deltas (for the HLs) also become very small, which in turn leads to small initial changes in the hidden layer weights.

About Init... (cont'd)

• Initialization with random numbers is very important for avoiding the effects of symmetry in the network: all HL neurons should start with guaranteed different weights.
  – If they have similar (or, even worse, the same) weights, they will perform similarly (the same) on all data pairs, changing their weights in similar (the same) directions.
  – This makes the whole learning process unnecessarily long (or learning will be the same for all neurons, and there will practically be no learning).
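A tiny sketch of this symmetry problem: two hidden neurons that start with identical incoming and outgoing weights receive identical gradients, so they can never diverge. All names, sizes, and values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)                          # one input pattern
t = 1.0                                         # its target
V = np.tile(rng.normal(size=(3, 1)), (1, 2))    # two hidden neurons, identical columns
w = np.array([0.3, 0.3])                        # identical hidden -> output weights

z = np.tanh(x @ V)                              # identical hidden outputs
y = np.tanh(z @ w)
delta_k = (t - y) * (1 - y ** 2)                # output local gradient
delta_j = delta_k * w * (1 - z ** 2)            # hidden local gradients (identical)
grad_V = np.outer(x, delta_j)                   # gradient w.r.t. V

print(np.allclose(grad_V[:, 0], grad_V[:, 1]))  # True: both neurons get the same update
```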

Nguyen-Widrow Initialization

• Two-layer NNs have been proven capable of approximating arbitrary functions.
  – How does this work?
  – And a method for speeding up the training process...

Behavior of hidden nodes

• For simplicity, a two-layer network with one input is trained to approximate a function of one variable, d(x), with x as the input, using the BP algorithm.
  – The output is a weighted sum of the hidden-unit responses, i.e. of terms of the form tanh(w_i x + w_bi).
  – Sigmoid function (tanh):
    • Approximately linear with slope 1 for x between -1 and 1.
    • Saturates to -1 or 1 as x becomes large in magnitude.
    • Each term in the above sum is a linear function of x over a small interval.
  – The size of each interval is determined by w_i.
  – The location of each interval is determined by w_bi.
  – The network learns to implement the desired function by building piecewise-linear approximations.
  – The pieces are summed to form the complete approximation.

(Random initialization)

Improving Learning Speed

• Main idea:
  – Divide the desired region into small intervals.
  – Set the weights so that each hidden node is assigned to its own interval at the start of training.
  – Training then proceeds as before.

Improving... (cont'd)

• Desired region: (-1, 1), which has length 2.
• H hidden units:
  – So each hidden unit is responsible for an interval of length 2/H.
  – sigmoid(w_i x + w_bi) is approximately linear over the interval where |w_i x + w_bi| < 1, which has length 2/w_i; therefore w_i = H.
  – It is preferable to have the intervals overlap slightly, so a smaller slope is used: w_i = 0.7 H.
  – For w_bi: a uniform random value between -|w_i| and |w_i|, which places each unit's interval inside the desired region.

(Figures: training of a net initialized as discussed, vs. a net whose weights are initialized to random values between -0.5 and 0.5.)

• The improvement is greatest when a large number of hidden units is used with a complicated desired response.
• Training time decreased from 2 days to 4 hours for the Truck-Backer-Upper problem.
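A sketch of this initialization for one input and H hidden tanh units, using the 0.7 overlap factor discussed above; the function name, random sign choice, and seed are my own assumptions:

```python
import numpy as np

def nguyen_widrow_1d(H, rng=None):
    """Give each of H hidden units a slope of magnitude 0.7*H (random sign) and a
    bias that places its roughly 2/(0.7*H)-wide linear region inside (-1, 1)."""
    rng = rng or np.random.default_rng(0)
    w = 0.7 * H * np.where(rng.random(H) < 0.5, -1.0, 1.0)   # slopes w_i
    wb = rng.uniform(-np.abs(w), np.abs(w))                  # biases w_bi
    return w, wb

# usage: hidden-unit outputs for one input value x
w, wb = nguyen_widrow_1d(H=10)
x = 0.3
hidden = np.tanh(w * x + wb)
```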

Momentum

• Weight change direction: a combination of the current gradient and the previous gradient.
  – Advantage: reduces the influence of outliers.
  – Does not adjust the LR directly.

Δw_{jk}(t+1) = α δ_k z_j + μ Δw_{jk}(t)

or, equivalently,

w_{jk}(t+1) = w_{jk}(t) + α δ_k z_j + μ [w_{jk}(t) - w_{jk}(t-1)]

μ is the momentum parameter, in the range from 0 to 1.

Momentum (cont'd)

– Allows large weight adjustments as long as the corrections keep pointing in the same direction.
– The weight change forms an exponentially weighted sum of past gradient terms:

Δw_{jk}(t+1) = α Σ_{s=0..t} μ^(t-s) δ_k(s) z_j(s)

– BP vs. MOM on the XOR function with bipolar representation:

Method | α   | μ   | #Epochs
BP     | 0.2 | –   | 387
MOM    | 0.2 | 0.9 | 38
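A minimal sketch of the momentum update above; the array shapes, learning rate, and momentum value are illustrative:

```python
import numpy as np

def momentum_step(grad_term, prev_dw, alpha=0.2, mu=0.9):
    """New weight change: alpha * (delta_k * z_j term) + mu * previous weight change."""
    return alpha * grad_term + mu * prev_dw

# usage: carry prev_dw from one iteration to the next
prev_dw = np.zeros((5, 1))
grad_term = 0.1 * np.ones((5, 1))    # stand-in for the outer product z_j * delta_k
dw = momentum_step(grad_term, prev_dw)
prev_dw = dw                         # remembered for the next step
```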

Delta-Bar-Delta

• Allows each weight to have its own learning rate.
• Lets learning rates vary over time.
• Two heuristics are used to determine appropriate changes:
  – If the weight change is in the same direction for several time steps, the LR for that weight should be increased.
  – If the direction of the weight change alternates, the LR should be decreased.
• Note: these heuristics won't always improve performance.

DBD (cont'd)

• The DBD rule consists of:
  – a weight update rule
  – a LR update rule
• The DBD rule changes each weight using that weight's own learning rate, and uses information from the current and past derivatives to form a smoothed "delta-bar".

DBD (cont'd)

• The first heuristic is implemented by increasing the LR by a constant amount κ:

α_{jk}(t+1) = α_{jk}(t) + κ

• The second heuristic is implemented by decreasing the LR by a proportion φ of its current value:

α_{jk}(t+1) = (1 - φ) α_{jk}(t)

• The LR therefore increases linearly and decreases exponentially.
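A sketch of a per-weight learning-rate update along these lines, following Jacobs' delta-bar-delta formulation; the parameter names kappa, phi, xi and their default values are my assumptions, not taken from the slides:

```python
import numpy as np

def dbd_update(w, lr, delta_bar, grad, kappa=0.01, phi=0.1, xi=0.7):
    """One delta-bar-delta step for an array of weights with per-weight learning rates."""
    same_sign = delta_bar * grad > 0
    opp_sign = delta_bar * grad < 0
    lr = np.where(same_sign, lr + kappa,               # 1st heuristic: linear increase
         np.where(opp_sign, lr * (1.0 - phi), lr))     # 2nd heuristic: exponential decrease
    w = w - lr * grad                                  # per-weight gradient step
    delta_bar = (1.0 - xi) * grad + xi * delta_bar     # running "delta-bar" of derivatives
    return w, lr, delta_bar
```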

DBD (cont'd)

Results for XOR...

Computer Network Intrusion Detection via Neural Networks

Goals
• Show that neural network (NN) techniques can be used to detect intruders logging on to a computer network.
• Compare the performance of four neural network methods in intrusion detection.

Methods
• The NN techniques used are:
  1. gradient descent back propagation (BP)
  2. gradient descent BP with momentum
  3. variable learning rate gradient descent BP
  4. conjugate gradient BP (CGP)

Background on Intrusion Detection Systems (IDS)

• Information assurance is a field that deals with protecting information on computers or computer networks from being compromised.
• Intrusion detection: detecting unauthorized users accessing the information on those computers.
• Current intrusion detection techniques cannot detect new and novel attacks.
• The relevance of NNs to intrusion detection becomes apparent when one views the intrusion detection problem as a pattern classification problem.

Pattern Classification Problem

• By building profiles of authorized computer users, one can train the NN to classify incoming computer traffic as authorized or not authorized.
• The task of intrusion detection is to construct a model that captures a set of user attributes and determines whether that set of attributes belongs to an authorized user or to an intruder.

Problem Definition

• The attribute set consists of the unique characteristics of the user logging onto a computer network:
  – authorized user and intruder
• The problem can be stated as: y = f(x)
  – where x = input vector consisting of a user's attributes, and y ∈ {authorized user, intruder}.
  – We want to map the input set x to an output y.

Solving the Intrusion Detection Problem Using Back Propagation

• Multilayer perceptron with two hidden layers.
• The error of the model is: e = d - y, where d = desired output and y = actual output.

Continue

• Activation functions: sigmoidal.
• Users in the UNIX OS environment can be profiled via four attributes: command, host, time, execution time.
• For simplicity in testing the back propagation methods, a user profile data file was generated without profile drift.

Continue

• The generated data used here was organized into two parts.
  – Training data: 90% authorized traffic, 10% intrusion traffic.
  – Testing data: 98% authorized traffic, 2% intrusion traffic.


Continue

• The objective is to train the neural networks to detect intrusion traffic with the fewest number of intrusion samples.
  – File1 consists of 5 CUs in each input sample.
  – File2 consists of 6 CUs in each input sample.
  – File3 consists of 7 CUs in each input sample.
  – Each CU has 4 elements.
• Three kinds of error back propagation neural network are investigated, along with their results.

Gradient Descent BP (GD)

• This method updates the network weights and biases in the direction in which the performance function decreases most rapidly, i.e. the negative of the gradient. The new weight vector w_{k+1} is adjusted according to:

w_{k+1} = w_k - α g_k

• α is the learning rate and g_k is the gradient of the error with respect to the weight vector.
• The negative sign indicates that the new weight vector w_{k+1} moves in a direction opposite to that of the gradient.

Gradient Descent BP with Momentum (GDM)

• Weight changes are made equal to the sum of a fraction of the last weight change and the new change suggested by the gradient descent BP rule.
• Advantages:
  1. Momentum allows a network to respond not only to the local gradient, but also to recent trends in the error surface.
  2. Momentum allows the network to ignore small features in the error surface.
  3. Without momentum a network may get stuck in a shallow local minimum; with momentum it can slide through such a minimum.
• The momentum parameter μ can be any number between 0 and 1.

Variable Learning Rate BP with Momentum (GDX)

• The learning rate parameter determines how fast the BP method converges to the minimum solution.
  – If the learning rate is too large, the algorithm becomes unstable.
  – If the learning rate is too small, the algorithm takes a long time to converge.
• To speed up convergence, variable learning rate gradient descent BP uses a larger learning rate α when the neural network model is far from the solution and a smaller α when it is near the solution.
• The new weight vector w_{k+1} is adjusted as in gradient descent with momentum above, but with a varying α_k.
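A sketch of that adaptive-rate idea applied to a single gradient step: grow α while the error keeps falling, shrink it (and reject the step) when the error jumps. The increase/decrease factors and the error-growth tolerance are illustrative choices, not values from the paper:

```python
def variable_lr_step(w, grad, alpha, prev_err, err_fn,
                     inc=1.05, dec=0.7, max_growth=1.04):
    """Take one step w -> w - alpha*grad, adapting alpha from the resulting error."""
    w_new = w - alpha * grad
    err = err_fn(w_new)
    if err > prev_err * max_growth:        # error grew too much: reject step, shrink alpha
        return w, alpha * dec, prev_err
    if err < prev_err:                     # error fell: accept step, grow alpha
        return w_new, alpha * inc, err
    return w_new, alpha, err               # accept step, keep alpha unchanged

# toy usage on a 1-D quadratic error surface
err_fn = lambda w: float((w - 3.0) ** 2)
w, alpha, err = 0.0, 0.5, err_fn(0.0)
for _ in range(50):
    grad = 2.0 * (w - 3.0)
    w, alpha, err = variable_lr_step(w, grad, alpha, err, err_fn)
```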

Conjugate Gradient BP (CGP)

• In conjugate gradient algorithms a search is performed along conjugate directions, which generally produces faster convergence than steepest descent directions.
• A search is made along the conjugate gradient direction to determine the step size that will minimize the performance function along that line.

Performance Comparison

• Result when the input is File1
• Result when the input is File2
• Result when the input is File3

Results From Tables

1. The gradient descent with momentum method was not as good at separating the intrusion traffic from the authorized traffic as the gradient descent method.
2. The gradient descent with momentum methods were able to classify intrusion traffic versus authorized traffic, but with higher MSE values.
3. The number of samples used as inputs affected the classification performance.

Output Values of the Two Classes

Results From Tables

• The gradient descent, the gradient descent with momentum, and the variable learning rate gradient descent with momentum methods converged to an MSE of exp(-3); they were able to classify the intrusion traffic from the authorized traffic.
• The input sample that yielded the best performance for all five methods contained 6 CUs.
• The number of neurons used in the hidden layer depended on the number of CUs in the input samples. In these cases, the NN topology {24, 10, 1} yielded the best results.
• When the input data files were File1 and File3 (i.e. input samples of 5 or 7 CUs), the results were not as good as when the input was File2 (a memorization vs. generalization problem).
• Fourth, the conjugate gradient descent had the best performance.

Different Problems Require Different Learning Rate Adaptive Methods

General parameter settings

• Network architecture: standard feedforward neural network.
• The maximum number of parameters was constant across all tasks.
• AFs (activation functions) for non-input units: standard hyperbolic tangent.
• Correctness: all output units must produce the correct answer.
• Seven networks were trained for each configuration.
• Fixed total number of iterations.

Different Problems

• Each algorithm was tested on three different tasks:
  – Parity bit: decide whether the number of activated input units is odd or even. In this study the input layer is composed of nine units, each with two states, giving 512 different patterns. Only one output unit is needed.
  – N-M-N encoder: the encoder task consists of reproducing the same output activation pattern as the input one. In the activation patterns, one unit is active and the others are not. The complexity of this task lies in the fact that the number of hidden units is smaller than the number of input and output units (M < N); here M = 7, N = 16.
  – Texture: the task consists of detecting the orientation, either horizontal or vertical, of stripes defined by texture in an image.

Algorithm-dependent parameter settings

• The main problem with AMs (adaptive methods) is that they require many parameters that are, like the LR, problem dependent. It is impossible to test all possible parameter combinations. Usually, to compare AMs on a given task, authors state that they tried to find "good" parameter settings (so the ease of finding these parameters matters).
• For MOM, the free parameter is the momentum factor. Tested values: 0, 0.5, 0.7, 0.9 and 0.99.
• For DBD, the other free parameter is the increase factor.

Results for Parity-Bit

(Table: rows are LRs, columns are free-parameter values.)

Each element represents the number of times the AM solved the task (max = 7) for a given parameter combination. The rows correspond to different LRs and the columns to different free-parameter values. MOM and DBD achieve a performance of 6/7.

Results for Encoder

For DBD and MOM, although many parameter combinations solved the task, none resulted in one hundred percent efficacy.

Results for Texture

• No parameter combination was found that could solve the texture task using MOM.
• DBD solved the task only if the initial LR was 4^(-2).
• With the proper initial LR, the other DBD parameter (the incremental constant) showed greater flexibility.

Discussion

• The first and obvious conclusion that can be drawn from these results is that no AM attained a better performance than all the others on all tasks.
• MOM and DBD behaved similarly when used on the encoder and parity tasks.
• The only task on which they clearly differed was the texture task, where MOM never solved the task, as opposed to DBD.

Comparison Of NN & SVMComparison Of NN & SVM• Problem : Recognizing Young-Old Gait Patterns

– The gaits of 12 young and 12 elderly participants were recorded and analyzed.

– 24 gait parameters (features) were extracted for training and testing the NN and SVM systems.

– NNs have been employed to classify normal and pathological gait… with good success.

– SVMs have emerged as a new and powerful technique for learning from data and in particular for solving classification and regression problems with reported better performance.

Gait Parameters

• Three types of gait parameters:
  – Basic gait data (9 variables): walking speed, stride length, ...
  – Kinetic data (5 variables): foot-ground forces, ...
  – Kinematic data (10 variables): knee and ankle joint angles, ...

Experimental Results

• A total of 24 subjects were used, of which 20 subjects' data were used to train the NN and SVM, and the remaining 4 subjects were used to test the generalization ability of both techniques.
• Due to the small sample size, a cross-validation technique was employed. In this way, all 24 subjects appeared in the testing phase of the NN and SVM models (6 groups, 4 subjects in each).
• Each algorithm was tested 20 times.