Artificial Neural Networks (Cont.)

Chapter 4

• Perceptron
• Gradient Descent
• Multilayer Networks
• Backpropagation Algorithm


Review: The Main Idea of Gradient Descent

Goal: minimizing the error:
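The error formula on the original slide is not legible in this transcript; the standard sum-of-squared-errors definition used in this chapter, together with the gradient-descent weight update, is (an assumed reconstruction):

$$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2, \qquad w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$$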


Review: The Main Idea of Gradient Descent


Gradient Descent

Derive the equations for updating the weights of the simple neuron using the gradient descent technique
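For a simple linear unit with output $o = \sum_i w_i x_i$, a standard sketch of this derivation (the delta rule reviewed above) is:

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)(-x_{i,d}), \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}$$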


Gradient Descent (with Linear Transfer function)


Review: Batch vs. Incremental Gradient Descent


Gradient Descent (with Linear Transfer function)

Is it batch or incremental mode?
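A minimal sketch contrasting the two modes for a single linear unit; the data, layer sizes, and learning rate are made-up values for illustration:

```python
import numpy as np

def batch_update(w, X, t, eta=0.05):
    """One batch-mode step: accumulate the gradient over ALL examples,
    then change the weights once."""
    o = X @ w                      # outputs for every example
    grad = -(t - o) @ X            # dE/dw for E = 1/2 * sum (t - o)^2
    return w - eta * grad

def incremental_update(w, X, t, eta=0.05):
    """One incremental (stochastic) epoch: update the weights after
    each individual example."""
    for x_d, t_d in zip(X, t):
        o_d = x_d @ w
        w = w + eta * (t_d - o_d) * x_d
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy inputs
t = np.array([1.0, 2.0, 3.0])                        # toy targets
w = np.zeros(2)
print(batch_update(w, X, t))
print(incremental_update(w, X, t))
```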


Forward Equation

f: the transfer function of a neuron

o: the output of a neuron

X: the input array

W^{(1)}: the n × p weight matrix for the first (hidden) layer

W^{(2)}: the weight matrix for the second (output) layer
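A minimal forward-pass sketch under these definitions, assuming a sigmoid transfer function and example layer sizes (both assumptions for illustration):

```python
import numpy as np

def f(z):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-z))

n, p, m = 3, 4, 2                  # inputs, hidden units, outputs (example sizes)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(n, p))       # first-layer (hidden) weight matrix, n x p
W2 = rng.normal(size=(p, m))       # second-layer (output) weight matrix, p x m

X = rng.normal(size=n)             # one input array

o1 = f(X @ W1)                     # hidden-layer outputs o^(1)
o2 = f(o1 @ W2)                    # network outputs o^(2)
print(o1, o2)
```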


Backpropagation Learning Rule

• Each weight is changed by (if all the neurons are sigmoid units):

\Delta w_{ji}^{(l)} = \eta \, \delta_j^{(l)} \, o_i^{(l-1)}

\delta_j^{(2)} = o_j^{(2)} \bigl(1 - o_j^{(2)}\bigr) \bigl(t_j - o_j^{(2)}\bigr)   if j is an output unit

\delta_j^{(1)} = o_j^{(1)} \bigl(1 - o_j^{(1)}\bigr) \sum_k \delta_k^{(2)} w_{kj}^{(2)}   if j is a hidden unit

where l is the layer number, η is a constant called the learning rate, t_j is the correct (teacher) output for output unit j, and δ_j^{(l)} is the error measure for unit j in the l-th layer.

Usually the output neurons are linear and the hidden neurons are sigmoid units. What will change in the above equations in that case?
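A minimal sketch of the rule above for a tiny 2-2-1 network with sigmoid units everywhere; the weights, inputs, and learning rate are illustrative values:

```python
import numpy as np

eta = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# forward pass (2 inputs -> 2 hidden -> 1 output)
x  = np.array([1.0, 0.5])
W1 = np.array([[0.2, -0.1], [0.4, 0.3]])   # w_ji^(1), shape (hidden, inputs)
W2 = np.array([[0.5, -0.3]])               # w_kj^(2), shape (outputs, hidden)
o1 = sigmoid(W1 @ x)                        # hidden outputs o_j^(1)
o2 = sigmoid(W2 @ o1)                       # network outputs o_j^(2)
t  = np.array([1.0])                        # teacher output t_j

# delta for output units:  o(1 - o)(t - o)
delta2 = o2 * (1 - o2) * (t - o2)

# delta for hidden units:  o(1 - o) * sum_k delta_k^(2) w_kj^(2)
delta1 = o1 * (1 - o1) * (W2.T @ delta2)

# weight updates:  delta_w_ji^(l) = eta * delta_j^(l) * o_i^(l-1)
W2 += eta * np.outer(delta2, o1)
W1 += eta * np.outer(delta1, x)
```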


• Compute the weight changes using the BP learning rule for the following condition:

Backpropagation Learning Rule

\Delta w_{ji}^{(l)} = \eta \, \delta_j^{(l)} \, o_i^{(l-1)}

\delta_j^{(2)} = o_j^{(2)} \bigl(1 - o_j^{(2)}\bigr) \bigl(t_j - o_j^{(2)}\bigr)   if j is an output unit

\delta_j^{(1)} = o_j^{(1)} \bigl(1 - o_j^{(1)}\bigr) \sum_k \delta_k^{(2)} w_{kj}^{(2)}   if j is a hidden unit

Current output: o_j = 0.2; correct output: t_j = 1.0

[Figure: network diagram with input, hidden, and output layers]


Error Backpropagation

• First calculate the error of the output units and use it to change the top layer of weights.

Current output: o_j = 0.2; correct output: t_j = 1.0

\delta_j^{(2)} = o_j^{(2)} \bigl(1 - o_j^{(2)}\bigr) \bigl(t_j - o_j^{(2)}\bigr) = 0.2 (1 - 0.2)(1.0 - 0.2) = 0.128

[Figure: network diagram with input, hidden, and output layers; the output-unit delta is computed at the top layer]

Update weights into j:

\Delta w_{ji}^{(2)} = \eta \, \delta_j^{(2)} \, o_i^{(1)}

\delta_j^{(2)} = o_j^{(2)} \bigl(1 - o_j^{(2)}\bigr) \bigl(t_j - o_j^{(2)}\bigr)
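A quick numeric check of the output-unit delta in the example above:

```python
# delta_j^(2) = o_j (1 - o_j) (t_j - o_j) for the example o_j = 0.2, t_j = 1.0
o_j, t_j = 0.2, 1.0
delta_j = o_j * (1 - o_j) * (t_j - o_j)
print(delta_j)   # 0.128
```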


Error Backpropagation

• Next calculate the error for each hidden unit based on the errors of the output units it feeds into.

[Figure: network diagram with input, hidden, and output layers; output-unit errors are propagated back to the hidden layer]

\delta_j^{(1)} = o_j^{(1)} \bigl(1 - o_j^{(1)}\bigr) \sum_k \delta_k^{(2)} w_{kj}^{(2)}


Error Backpropagation

• Finally update bottom layer of weights based on errors calculated for hidden units.

[Figure: network diagram with input, hidden, and output layers; hidden-unit deltas drive the bottom-layer weight updates]

Update weights into j:

\Delta w_{ji}^{(1)} = \eta \, \delta_j^{(1)} \, o_i^{(0)}

\delta_j^{(1)} = o_j^{(1)} \bigl(1 - o_j^{(1)}\bigr) \sum_k \delta_k^{(2)} w_{kj}^{(2)}

o_i^{(0)} = x_i   (the network inputs)


Comments on Training Algorithm

• Not guaranteed to converge to zero training error; it may converge to a local optimum or oscillate indefinitely.

• However, in practice, it does converge to low error for many large networks on real data.

• Many epochs (thousands) may be required, hours or days of training for large networks.

• To avoid local-minima problems, run several trials starting with different random weights (random restarts); a minimal sketch follows this list.

– Take results of trial with lowest training set error.

– Build a committee of results from multiple trials.
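A minimal sketch of the random-restarts strategy referenced above; `train_network` and `training_error` are hypothetical helpers standing in for whatever training loop and error measure are in use:

```python
import numpy as np

# Run several training trials from different random initial weights and
# keep the trial with the lowest training-set error (random restarts).
# `train_network` and `training_error` are hypothetical helpers.
def random_restarts(n_trials, train_network, training_error):
    best_weights, best_error = None, float("inf")
    for trial in range(n_trials):
        rng = np.random.default_rng(trial)   # different random seed per trial
        weights = train_network(rng)         # train from random initial weights
        error = training_error(weights)
        if error < best_error:
            best_weights, best_error = weights, error
    return best_weights
```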


Representational Power

• Boolean functions: Any Boolean function can be represented by a two-layer network with sufficient hidden units.

• Continuous functions: Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.

– Sigmoid functions can act as a set of basis functions for composing more complex functions, like sine waves in Fourier analysis.

• Arbitrary function: Any function can be approximated to arbitrary accuracy by a three-layer network.


How many hidden layers and hidden units per layer?

• Theoretically, one hidden layer (possibly with many hidden units) is sufficient for most problems

• There are no theoretical results on the minimum necessary # of hidden units (either problem-dependent or problem-independent)

• Practical rule of thumb:

–n = # of input units; p = # of hidden units

–For binary/bipolar data: p = 2n

–For real data: p >> 2n

• Multiple hidden layers with fewer units may be trained faster for similar quality in some applications


Data sets to handle over-fitting & # of hidden neurons

• Training set:

– A set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier.

• Validation set:

– A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network or handle over-fitting.

• Test set (completely unseen data):

– A set of examples used only to assess the performance [generalization] of a fully specified classifier.

• Usually the data set is split 60%-20%-20% or 70%-20%-10% (training-validation-test)
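A sketch of a 60%-20%-20% train/validation/test split; the array `data` is a stand-in for the full labelled data set:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))          # 100 examples, 5 attributes each

indices = rng.permutation(len(data))      # shuffle before splitting
n_train = int(0.6 * len(data))
n_val   = int(0.2 * len(data))

train_set = data[indices[:n_train]]
val_set   = data[indices[n_train:n_train + n_val]]
test_set  = data[indices[n_train + n_val:]]
print(len(train_set), len(val_set), len(test_set))   # 60 20 20
```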


Over-training/over-fitting

• The meaning of over-fitting:

– The trained net fits the training samples very well (total error almost zero), but not new input patterns.

• Over-training may become serious if:

– Training samples were not obtained properly

– Training samples have noise

• Control over-training for better generalization:

– Cross-validation: divide the samples into two sets,

- 80% into a training set: used to train the network

- 20% into a test set: used to validate training results

– Periodically test the trained net with the test samples, stop training when the test results start to deteriorate, repeat the process many times, and report the average results.


Over-Training Prevention

• Running too many epochs can result in over-fitting.

• Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.

[Figure: error vs. number of training epochs; error on training data keeps falling while error on test data eventually rises.]
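A sketch of this early-stopping loop; `train_one_epoch` and `validation_error` are hypothetical helpers:

```python
# Stop training when the error on a hold-out validation set stops improving.
# `train_one_epoch` and `validation_error` are hypothetical helpers.
def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    best_weights, best_err, bad_epochs = weights, float("inf"), 0
    for epoch in range(max_epochs):
        weights = train_one_epoch(weights)        # one pass over training data
        err = validation_error(weights)           # error on the hold-out set
        if err < best_err:
            best_weights, best_err, bad_epochs = weights, err, 0
        else:
            bad_epochs += 1                       # validation error went up
            if bad_epochs >= patience:            # stop after repeated increases
                break
    return best_weights                           # weights with lowest val. error
```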


Determining the Best Number of Hidden Units

• Too few hidden units prevents the network from adequately fitting the data.

• Too many hidden units can result in over-fitting.

• Use internal cross-validation to empirically determine an optimal number of hidden units.

[Figure: error vs. number of hidden units; training error keeps falling while test error rises once there are too many hidden units.]
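A sketch of this internal cross-validation over candidate hidden-layer sizes; `train_network` and `validation_error` are hypothetical helpers:

```python
# Pick the number of hidden units p with the lowest validation error.
# `train_network(p)` and `validation_error(net)` are hypothetical helpers.
def select_hidden_units(candidates, train_network, validation_error):
    best_p, best_err = None, float("inf")
    for p in candidates:                    # e.g. [2, 4, 8, 16, 32]
        net = train_network(p)              # train a net with p hidden units
        err = validation_error(net)         # evaluate on the validation set
        if err < best_err:
            best_p, best_err = p, err
    return best_p
```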


Learning Rate

• Adaptive learning rate to speed up the training process
• There are many different approaches
• One of them:

[Figure: training error as a function of the weights, annotated (translated from Persian): "decrease the weight changes" / "increase the gradient changes".]
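The specific adaptation rule on the slide is not legible in this transcript; one common heuristic (the "bold driver" rule, an assumption here, not necessarily the slide's method) grows the learning rate while the training error keeps falling and shrinks it when the error rises:

```python
# "Bold driver" adaptive learning rate; an illustrative assumption,
# not necessarily the rule shown on the original slide.
def adapt_learning_rate(eta, prev_error, new_error, grow=1.05, shrink=0.5):
    if new_error < prev_error:
        return eta * grow     # error decreased: take slightly bigger steps
    else:
        return eta * shrink   # error increased: cut the step size sharply

eta = 0.1
eta = adapt_learning_rate(eta, prev_error=0.80, new_error=0.72)  # -> 0.105
eta = adapt_learning_rate(eta, prev_error=0.72, new_error=0.90)  # -> 0.0525
```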


Momentum

• Improves gradient descent's ability to escape local minima
• Adds a percentage of the last movement to the current movement

\Delta w_{ij}(k) = -\eta \frac{\partial E}{\partial w_{ij}} + \alpha \, \Delta w_{ij}(k-1)
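A minimal sketch of the momentum update above; the learning rate, momentum factor, and gradient values are illustrative:

```python
import numpy as np

eta, alpha = 0.1, 0.9

def momentum_step(w, grad, prev_step):
    # delta_w(k) = -eta * dE/dw + alpha * delta_w(k-1)
    step = -eta * grad + alpha * prev_step
    return w + step, step

w = np.array([0.5, -0.3])
prev_step = np.zeros_like(w)
grad = np.array([0.2, -0.1])          # stand-in for dE/dw at step k
w, prev_step = momentum_step(w, grad, prev_step)
print(w, prev_step)
```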


Typical Learning Curve

[Figure: "Sum-Squared Network Error for 224 Epochs"; sum-squared error (log scale, 10^-4 to 10^1) vs. epoch (0 to 200).]


Typical learning with adaptive learning rate

[Figure: "Training for 103 Epochs"; two panels: sum-squared error (log scale) vs. epoch, and learning rate vs. epoch.]


Typical Learning with adaptive learning rate plus momentum

[Figure: "Training for 85 Epochs"; two panels: sum-squared error (log scale) vs. epoch, and learning rate vs. epoch.]


Hidden Unit Representations

• Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.

• On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors.

• However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.


Appropriate problems for NN

• Instances are represented by many attribute-value pairs

• The target function output may be discrete-valued, real-valued or a vector of several real- or discrete-valued attributes

• The data set may contain errors

• Long training times are acceptable

• Fast evaluation (test phase) is required

• The ability of humans to understand the learned target function is not important


Successful Applications

• Text to Speech (NetTalk)

• Fraud detection

• Financial Applications

• Chemical Plant Control

• Automated Vehicles

• Game Playing

– Neurogammon (a neural-network backgammon program)

• Handwriting recognition


More Issues in Neural Nets

• More efficient training methods:
– Quickprop
– Conjugate gradient (exploits 2nd derivative)

• Learning the proper network architecture:
– Grow network until able to fit data
– Shrink large network until unable to fit data

• Recurrent networks that use feedback and can learn finite state machines with “backpropagation through time.”


More Issues in Neural Nets (cont.)

• More biologically plausible learning algorithms based on Hebbian learning.

• Unsupervised Learning

– Self-Organizing Feature Maps (SOMs)

• Reinforcement Learning

– Frequently used as function approximators for learning value functions.


Assignment 2

• Tom Mitchell problems: 4.1, 4.2, 4.5, 4.7, 4.8, 4.10


Remaining Course Plan

• Chapter 9. Genetic Algorithms

• Chapter 13. Reinforcement Learning

• Clustering

• Dimension Reduction Algorithms

• Support Vector Machine

• Cellular Automata

• Other biologically-inspired optimization algorithms (Ant Colony, PSO, Simulated Annealing, …)

• Active Learning