15 Machine Learning Multilayer Perceptron


Transcript of 15 Machine Learning Multilayer Perceptron

Page 1: 15 Machine Learning Multilayer Perceptron

Neural Networks: Multilayer Perceptron

Andres Mendez-Vazquez

December 12, 2015

1 / 94

Page 2: 15 Machine Learning Multilayer Perceptron

Outline

1 Introduction
    The XOR Problem

2 Multi-Layer Perceptron
    Architecture
    Back-propagation
    Gradient Descent
    Hidden-to-Output Weights
    Input-to-Hidden Weights
    Total Training Error
    About Stopping Criteria
    Final Basic Batch Algorithm

3 Using Matrix Operations to Simplify
    Using Matrix Operations to Simplify the Pseudo-Code
    Generating the Output zk
    Generating zk
    Generating the Weights from Hidden to Output Layer
    Generating the Weights from Input to Hidden Layer
    Activation Functions

4 Heuristic for Multilayer Perceptron
    Maximizing information content
    Activation Function
    Target Values
    Normalizing the inputs
    Virtues and limitations of Back-Propagation Layer

2 / 94

Page 4: 15 Machine Learning Multilayer Perceptron

Do you remember?

The Perceptron has the following problem
Given that the perceptron is a linear classifier,

it is clear that

it will never be able to classify data that is not linearly separable.

4 / 94

Page 6: 15 Machine Learning Multilayer Perceptron

Example: XOR Problem

The Problem

[Figure: the four XOR input points, with (0,0) and (1,1) in one class and (0,1) and (1,0) in the other; no single line separates the two classes]

5 / 94

Page 7: 15 Machine Learning Multilayer Perceptron

The Perceptron cannot solve it

Because
The perceptron is a linear classifier!!!

Thus
Something needs to be done!!!

Maybe
Add an extra layer!!!

6 / 94

Page 10: 15 Machine Learning Multilayer Perceptron

A little bit of history

It was first cited by Vapnik
Vapnik cites (Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550) as the first publication of the backpropagation algorithm in his book "Support Vector Machines."

It was first used by
Arthur E. Bryson and Yu-Chi Ho, who described it as a multi-stage dynamic system optimization method in 1969.

However
It was not until 1974 and later, when it was applied in the context of neural networks through the work of Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, that it gained recognition.

7 / 94

Page 13: 15 Machine Learning Multilayer Perceptron

Then

Something Notable
It led to a "renaissance" in the field of artificial neural network research.

Nevertheless
During the 2000s it fell out of favour, but it has returned in the 2010s, now able to train much larger networks using huge modern computing power such as GPUs.

8 / 94

Page 16: 15 Machine Learning Multilayer Perceptron

Multi-Layer Perceptron (MLP)
Multi-Layer Architecture

[Figure: network with an input layer, a hidden layer using a sigmoid activation function, and an output layer using an identity activation function, compared against the target]

10 / 94

Page 17: 15 Machine Learning Multilayer Perceptron

Information Flow

We have the following information flow

[Figure: function signals flow forward through the network; error signals flow backward]

11 / 94

Page 18: 15 Machine Learning Multilayer Perceptron

Explanation

Problems with Hidden Layers
1 They increase the complexity of training.
2 It is necessary to think about a "Long and Narrow" network vs. a "Short and Fat" network.

Intuition for One Hidden Layer
1 For every input case or region, that region can be delimited by hyperplanes on all sides using hidden units in the first hidden layer.
2 A hidden unit in the second layer then ANDs them together to bound the region.

Advantages
It has been proven that an MLP with one hidden layer can approximate any nonlinear function of the input.

12 / 94

Page 23: 15 Machine Learning Multilayer Perceptron

The Process

We have something like this

[Figure: two-layer construction (Layer 1 and Layer 2) illustrating how hidden units delimit a region that the next layer bounds]

13 / 94

Page 25: 15 Machine Learning Multilayer Perceptron

Remember!!! The Quadratic Learning Error function

Cost Function: our well-known error at pattern m

J (m) = (1/2) ek^2 (m)   (1)

Delta Rule or Widrow-Hoff Rule

∆wkj (m) = −η ek (m) xj (m)   (2)

Actually, this is known as Gradient Descent

wkj (m + 1) = wkj (m) + ∆wkj (m)   (3)

15 / 94
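As a quick illustration, here is a minimal NumPy sketch of one delta-rule update for a single linear unit. The sign convention assumed here is e = z − t, so that ∆w = −η e x decreases J = (1/2) e^2; all names (eta, w, x, t) are illustrative rather than taken from the slides.

import numpy as np

# One delta-rule (Widrow-Hoff) update for a single linear unit.
eta = 0.1                        # learning rate
w = np.zeros(3)                  # weights, d = 3 features
x = np.array([1.0, 0.5, -0.2])   # one input pattern
t = 1.0                          # desired output for this pattern

z = w @ x                        # linear unit output
e = z - t                        # error at this pattern (assumed sign convention)
J = 0.5 * e**2                   # cost at this pattern, as in Eq. (1)
w = w + (-eta * e * x)           # Eqs. (2)-(3): w(m+1) = w(m) + Delta_w(m)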

Page 28: 15 Machine Learning Multilayer Perceptron

Back-propagation

Setup
Let tk be the k-th target (or desired) output and zk be the k-th computed output, with k = 1, . . . , c, and let w represent all the weights of the network.

Training Error for a single Pattern or Sample!!!

J (w) = (1/2) ∑ (k=1 to c) (tk − zk)^2 = (1/2) ‖t − z‖^2   (4)

16 / 94
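For concreteness, a tiny NumPy check of Eq. (4) for a single pattern, with illustrative values for t and z:

import numpy as np

t = np.array([1.0, 0.0, 0.0])     # target vector for one pattern
z = np.array([0.8, 0.1, 0.3])     # network output for the same pattern
J = 0.5 * np.sum((t - z) ** 2)    # Eq. (4)
assert np.isclose(J, 0.5 * np.linalg.norm(t - z) ** 2)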

Page 31: 15 Machine Learning Multilayer Perceptron

Gradient Descent

Gradient Descent
The back-propagation learning rule is based on gradient descent.

Reducing the Error
The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:

∆w = −η ∂J/∂w   (5)

Where
η is the learning rate, which indicates the relative size of the change in weights:

w (m + 1) = w (m) + ∆w (m)   (6)

where m indexes the m-th pattern presented.
18 / 94

Page 35: 15 Machine Learning Multilayer Perceptron

Multilayer Architecture: hidden-to-output weights

[Figure: network diagram (input, hidden, output, target) highlighting the weights from the hidden layer to the output layer]

20 / 94

Page 36: 15 Machine Learning Multilayer Perceptron

Observation about the activation function

Hidden output is equal to

yj = f ( ∑ (i=1 to d) wji xi )

Output is equal to

zk = f ( ∑ (j=1 to nH) wkj yj )

21 / 94
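A minimal forward-pass sketch of these two equations, assuming a logistic f at both layers and illustrative sizes (d inputs, nH hidden units, c outputs); bias terms are omitted, as in the formulas above.

import numpy as np

def f(v):
    # logistic activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-v))

d, nH, c = 4, 3, 2
rng = np.random.default_rng(0)
W_ih = rng.normal(size=(nH, d))   # w_ji: input-to-hidden weights
W_ho = rng.normal(size=(c, nH))   # w_kj: hidden-to-output weights

x = rng.normal(size=d)            # one input pattern
y = f(W_ih @ x)                   # y_j = f(sum_i w_ji x_i)
z = f(W_ho @ y)                   # z_k = f(sum_j w_kj y_j)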

Page 38: 15 Machine Learning Multilayer Perceptron

Hidden–to-Output Weights

Error on the hidden-to-output weights

∂J/∂wkj = (∂J/∂netk) · (∂netk/∂wkj) = −δk · (∂netk/∂wkj)   (7)

netk
It describes how the overall error changes with the activation of the unit's net:

netk = ∑ (j=1 to nH) wkj yj = wk^T · y   (8)

Now

δk = −∂J/∂netk = −(∂J/∂zk) · (∂zk/∂netk) = (tk − zk) f ′ (netk)   (9)

22 / 94

Page 41: 15 Machine Learning Multilayer Perceptron

Hidden–to-Output Weights

Why?

zk = f (netk)   (10)

Thus

∂zk/∂netk = f ′ (netk)   (11)

Since netk = wk^T · y, therefore:

∂netk/∂wkj = yj   (12)

23 / 94

Page 44: 15 Machine Learning Multilayer Perceptron

Finally

The weight update (or learning rule) for the hidden-to-output weights is:

∆wkj = η δk yj = η (tk − zk) f ′ (netk) yj   (13)

24 / 94
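A short NumPy sketch of Eq. (13) applied to all hidden-to-output weights of one pattern at once; the logistic f, its derivative, and all sizes are illustrative assumptions.

import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)              # derivative of the logistic

eta = 0.1
nH, c = 3, 2
rng = np.random.default_rng(1)
W_ho = rng.normal(size=(c, nH))       # w_kj
y = rng.random(nH)                    # hidden outputs for one pattern
t = np.array([1.0, 0.0])              # targets for that pattern

net_k = W_ho @ y
z = f(net_k)
delta_k = (t - z) * f_prime(net_k)    # Eq. (9)
W_ho += eta * np.outer(delta_k, y)    # Eq. (13): dw_kj = eta * delta_k * y_j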

Page 46: 15 Machine Learning Multilayer Perceptron

Multi-Layer Architecture: Input-to-Hidden weights

[Figure: network diagram (input, hidden, output, target) highlighting the weights from the input layer to the hidden layer]

26 / 94

Page 47: 15 Machine Learning Multilayer Perceptron

Input-to-Hidden Weights

Error on the Input-to-Hidden weights

∂J/∂wji = (∂J/∂yj) · (∂yj/∂netj) · (∂netj/∂wji)   (14)

Thus

∂J/∂yj = ∂/∂yj [ (1/2) ∑ (k=1 to c) (tk − zk)^2 ]
       = −∑ (k=1 to c) (tk − zk) ∂zk/∂yj
       = −∑ (k=1 to c) (tk − zk) (∂zk/∂netk) · (∂netk/∂yj)
       = −∑ (k=1 to c) (tk − zk) (∂f (netk)/∂netk) · wkj

27 / 94

Page 49: 15 Machine Learning Multilayer Perceptron

Input–to-Hidden Weights

Finally

∂J/∂yj = −∑ (k=1 to c) (tk − zk) f ′ (netk) · wkj   (15)

Remember

δk = −∂J/∂netk = (tk − zk) f ′ (netk)   (16)

28 / 94

Page 50: 15 Machine Learning Multilayer Perceptron

What is ∂yj/∂netj?

First

netj = ∑ (i=1 to d) wji xi = wj^T · x   (17)

Then

yj = f (netj)

Then

∂yj/∂netj = ∂f (netj)/∂netj = f ′ (netj)

29 / 94

Page 53: 15 Machine Learning Multilayer Perceptron

Then, we can define δj

By defining the sensitivity for a hidden unit:

δj = f ′ (netj) ∑ (k=1 to c) wkj δk   (18)

Which means that:
"The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights wkj, all multiplied by f ′ (netj)."

30 / 94

Page 56: 15 Machine Learning Multilayer Perceptron

What about ∂netj/∂wji?

We have that

∂netj/∂wji = ∂(wj^T · x)/∂wji = ∂( ∑ (i=1 to d) wji xi )/∂wji = xi

31 / 94

Page 57: 15 Machine Learning Multilayer Perceptron

Finally

The learning rule for the input-to-hidden weights is:

∆wji = η xi δj = η [ ∑ (k=1 to c) wkj δk ] f ′ (netj) xi   (19)

32 / 94
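A matching NumPy sketch of Eqs. (18)-(19) for one pattern, continuing the same illustrative setup: the output sensitivities δk are propagated back through W_ho to form δj, which then updates the input-to-hidden weights.

import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)

eta = 0.1
d, nH, c = 4, 3, 2
rng = np.random.default_rng(2)
W_ih = rng.normal(size=(nH, d))    # w_ji
W_ho = rng.normal(size=(c, nH))    # w_kj
x = rng.normal(size=d)             # one input pattern
t = np.array([1.0, 0.0])           # its targets

net_j = W_ih @ x                   # forward pass
y = f(net_j)
net_k = W_ho @ y
z = f(net_k)

delta_k = (t - z) * f_prime(net_k)             # output sensitivities, Eq. (9)
delta_j = f_prime(net_j) * (W_ho.T @ delta_k)  # hidden sensitivities, Eq. (18)
W_ih += eta * np.outer(delta_j, x)             # Eq. (19): dw_ji = eta * delta_j * x_i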

Page 58: 15 Machine Learning Multilayer Perceptron

Basically, the entire training process has the following steps

Initialization
Assuming that no prior information is available, pick the synaptic weights and thresholds.

Forward Computation
Compute the induced function signals of the network by proceeding forward through the network, layer by layer.

Backward Computation
Compute the local gradients of the network.

Finally
Adjust the weights!!!

33 / 94

Page 63: 15 Machine Learning Multilayer Perceptron

Now, Calculating Total Change

We have that
The Total Training Error is the sum over the errors of the N individual patterns.

The Total Training Error

J = ∑ (p=1 to N) Jp = (1/2) ∑ (p=1 to N) ∑ (k=1 to c) (tk^p − zk^p)^2 = (1/2) ∑ (p=1 to N) ‖t^p − z^p‖^2   (20)

35 / 94
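As a small illustration of Eq. (20), assuming the targets and the outputs are stored as c × N matrices with one column per pattern (names and values are illustrative):

import numpy as np

rng = np.random.default_rng(3)
c, N = 2, 5
T = rng.random((c, N))                 # targets, one column per pattern
Z = rng.random((c, N))                 # network outputs, same layout
J_total = 0.5 * np.sum((T - Z) ** 2)   # sum over patterns of the per-pattern J_p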

Page 65: 15 Machine Learning Multilayer Perceptron

About the Total Training Error

Remarks
A weight update may reduce the error on the single pattern being presented, but it can increase the error on the full training set.
However, given a large number of such individual updates, the total error of equation (20) decreases.

36 / 94

Page 68: 15 Machine Learning Multilayer Perceptron

Now, we want the training to stop

Therefore
It is necessary to have a way to stop when the change in the weights is small enough!!!

A simple way to stop the training
The algorithm terminates when the change in the criterion function J (w) is smaller than some preset value Θ:

∆J (w) = |J (w (t + 1)) − J (w (t))|   (21)

There are other stopping criteria that lead to better performance than this one.

38 / 94

Page 72: 15 Machine Learning Multilayer Perceptron

Other Stopping Criteria

Norm of the Gradient
The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold:

‖∇w J (m)‖ < Θ   (22)

Rate of change in the average error per epoch
The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small:

| (1/N) ∑ (p=1 to N) Jp | < Θ   (23)

39 / 94
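A sketch of how these two tests might look inside a training loop; grad (the concatenated weight gradient for the epoch), J_per_pattern, and theta are illustrative names, not from the slides.

import numpy as np

theta = 1e-4
grad = np.array([1e-5, -2e-5, 3e-6])             # concatenated gradient at epoch m
J_per_pattern = np.array([0.02, 0.015, 0.018])   # J_p for each of the N patterns

stop_by_gradient = np.linalg.norm(grad) < theta           # Eq. (22)
stop_by_avg_error = abs(np.mean(J_per_pattern)) < theta   # Eq. (23), as written on the slide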

Page 74: 15 Machine Learning Multilayer Perceptron

About the Stopping Criteria

Observations
1 Before training starts, the error on the training set is high.
  - Through the learning process, the error becomes smaller.
2 The error per pattern depends on the amount of training data and on the expressive power (such as the number of weights) of the network.
3 The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase.
4 A validation set is used in order to decide when to stop training.
  - We do not want to over-fit the network and decrease the generalization power of the classifier: "we stop training at a minimum of the error on the validation set."

40 / 94

Page 80: 15 Machine Learning Multilayer Perceptron

Some More Terminology

Epoch
As with other types of backpropagation, 'learning' is a supervised process that occurs with each cycle or 'epoch' through a forward activation flow of outputs, and the backwards error propagation of weight adjustments.

In our case
I am using the batch sum of all correcting weights to define that epoch.

41 / 94

Page 83: 15 Machine Learning Multilayer Perceptron

Final Basic Batch Algorithm

Perceptron(X)
 1  Initialize random w, number of hidden units nH, number of outputs c, stopping criterion Θ, learning rate η, epoch m = 0
 2  do
 3      m = m + 1
 4      for s = 1 to N
 5          x (m) = X (:, s)
 6          for k = 1 to c
 7              δk = (tk − zk) f ′ (wk^T · y)
 8              for j = 1 to nH
 9                  netj = wj^T · x;  yj = f (netj)
10                  wkj (m) = wkj (m) + η δk yj (m)
11          for j = 1 to nH
12              δj = f ′ (netj) ∑ (k=1 to c) wkj δk
13              for i = 1 to d
14                  wji (m) = wji (m) + η δj xi (m)
15  until ‖∇w J (m)‖ < Θ
16  return w (m)
43 / 94
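Below is a minimal, self-contained NumPy sketch in the spirit of this pseudo-code: one hidden layer, logistic activations, per-pattern updates inside each epoch, and a crude stop when the updates become small. It is an illustration under those assumptions, not a literal transcription of the slide; in particular, bias terms are added by appending a constant 1 to the input and to the hidden output (the slide's pseudo-code does not show biases), and every name (train_mlp, W_ih, W_ho, ...) is mine.

import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))            # logistic activation

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)                       # f'(v) = f(v)(1 - f(v))

def train_mlp(X, T, nH=3, eta=1.0, theta=1e-3, max_epochs=20000, seed=0):
    """X: d x N inputs, T: c x N targets.  Per-pattern (on-line) updates."""
    d, N = X.shape
    c = T.shape[0]
    rng = np.random.default_rng(seed)
    W_ih = rng.normal(scale=0.5, size=(nH, d + 1))   # +1 column: input bias (assumption)
    W_ho = rng.normal(scale=0.5, size=(c, nH + 1))   # +1 column: hidden bias (assumption)

    for epoch in range(max_epochs):
        sq_update = 0.0
        for s in range(N):
            x = np.append(X[:, s], 1.0)              # augmented input
            t = T[:, s]
            net_j = W_ih @ x                         # forward pass
            y = np.append(f(net_j), 1.0)             # augmented hidden output
            net_k = W_ho @ y
            z = f(net_k)

            delta_k = (t - z) * f_prime(net_k)                     # output sensitivities
            delta_j = f_prime(net_j) * (W_ho[:, :nH].T @ delta_k)  # hidden sensitivities

            dW_ho = eta * np.outer(delta_k, y)       # hidden-to-output update
            dW_ih = eta * np.outer(delta_j, x)       # input-to-hidden update
            W_ho += dW_ho
            W_ih += dW_ih
            sq_update += np.sum(dW_ho**2) + np.sum(dW_ih**2)

        if np.sqrt(sq_update) < theta:               # crude gradient-norm stop
            break
    return W_ih, W_ho

# Usage: the XOR problem from the introduction (d = 2, N = 4, c = 1).
X = np.array([[0., 0., 1., 1.],
              [0., 1., 0., 1.]])
T = np.array([[0., 1., 1., 0.]])
W_ih, W_ho = train_mlp(X, T, nH=3, eta=1.0)
Xa = np.vstack([X, np.ones(4)])
Z = f(W_ho @ np.vstack([f(W_ih @ Xa), np.ones(4)]))
print(np.round(Z, 2))    # with enough epochs this typically approaches [[0, 1, 1, 0]]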

Page 102: 15 Machine Learning Multilayer Perceptron

Example of Architecture to be used
Given the following architecture and assuming N samples:

[Figure: a network with d inputs, nH hidden units and a single output, used in the matrix derivation below]

45 / 94

Page 104: 15 Machine Learning Multilayer Perceptron

Generating the output zk

Given the input

X = [ x1  x2  · · ·  xN ]   (24)

Where
xi is a vector of features:

xi = (x1i, x2i, . . . , xdi)^T   (25)

47 / 94

Page 106: 15 Machine Learning Multilayer Perceptron

Therefore
We must have the following matrix of input-to-hidden weights

W_IH = [ w11   w12   · · ·  w1d
         w21   w22   · · ·  w2d
         ...   ...   . . .  ...
         wnH1  wnH2  · · ·  wnHd ]
     = [ w1^T
         w2^T
         ...
         wnH^T ]   (26)

given that wj = (wj1, wj2, . . . , wjd)^T.

Thus
We can create the netj for all the inputs simply by

netj = W_IH X = [ w1^T x1    w1^T x2    · · ·  w1^T xN
                  w2^T x1    w2^T x2    · · ·  w2^T xN
                  ...        ...        . . .  ...
                  wnH^T x1   wnH^T x2   · · ·  wnH^T xN ]   (27)

48 / 94
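A small NumPy sketch of Eqs. (26)-(27): stacking the row vectors wj^T as the rows of W_IH turns the hidden pre-activations for all N samples into a single matrix product (all sizes are illustrative):

import numpy as np

d, nH, N = 4, 3, 5
rng = np.random.default_rng(4)
X = rng.normal(size=(d, N))       # Eq. (24): one column x_i per sample
W_IH = rng.normal(size=(nH, d))   # Eq. (26): row j holds w_j^T

net_j = W_IH @ X                  # Eq. (27): an nH x N matrix of w_j^T x_i values
assert net_j.shape == (nH, N)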

Page 108: 15 Machine Learning Multilayer Perceptron

Now, we need to generate the yk

We apply the activation function element by element to netj:

y1 = [ f (w1^T x1)    f (w1^T x2)    · · ·  f (w1^T xN)
       f (w2^T x1)    f (w2^T x2)    · · ·  f (w2^T xN)
       ...            ...            . . .  ...
       f (wnH^T x1)   f (wnH^T x2)   · · ·  f (wnH^T xN) ]   (28)

IMPORTANT about overflows!!!
Be careful about the numeric stability of the activation function.
In the case of Python, we can use the functions provided by scipy.special.

49 / 94
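For instance, scipy.special.expit evaluates the logistic function 1/(1 + e^(-v)) in a vectorized, numerically stable way; if the slope parameter α is used, it has to be folded into netj first:

import numpy as np
from scipy.special import expit   # stable logistic: expit(v) = 1 / (1 + exp(-v))

alpha = 1.0
net_j = np.array([[-800.0, 0.0, 800.0]])   # values that would overflow a naive exp
y1 = expit(alpha * net_j)                  # -> [[0.0, 0.5, 1.0]], no overflow warnings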

Page 111: 15 Machine Learning Multilayer Perceptron

However, we can create a Sigmoid function

It is possible to use the following pseudo-code (a try-and-catch is used to catch the overflow of exp; 1.0 refers to the floating-point representation of the reals):

Sigmoid(x)
 1  try
 2      return 1.0 / (1.0 + exp{−αx})
 3  catch OVERFLOW
 4      if x < 0
 5          return 0
 6      else
 7          return 1

50 / 94
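A runnable Python version of the same idea: instead of try/catch, it branches on the sign of x so that exp never receives a large positive argument (alpha plays the role of the slope a; the function name and vectorized interface are my own choices).

import numpy as np

def sigmoid(x, alpha=1.0):
    # Numerically stable logistic 1 / (1 + exp(-alpha * x)) for array inputs.
    v = alpha * np.asarray(x, dtype=float)
    out = np.empty_like(v)
    pos = v >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-v[pos]))   # exp of a non-positive number: safe
    ev = np.exp(v[~pos])                       # exp of a negative number: safe
    out[~pos] = ev / (1.0 + ev)                # algebraically the same expression
    return out

print(sigmoid(np.array([-1000.0, 0.0, 1000.0])))   # [0.  0.5 1. ] with no overflow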

Page 116: 15 Machine Learning Multilayer Perceptron

For this, we get netk

For this, we obtain W_HO:

W_HO = ( wo11  wo12  · · ·  wo1nH ) = ( wo^T )   (29)

Thus

netk = ( wo11  wo12  · · ·  wo1nH ) [ f (w1^T x1)    f (w1^T x2)    · · ·  f (w1^T xN)
                                      f (w2^T x1)    f (w2^T x2)    · · ·  f (w2^T xN)
                                      ...            ...            . . .  ...
                                      f (wnH^T x1)   f (wnH^T x2)   · · ·  f (wnH^T xN) ]   (30)

where the columns of the matrix are denoted yk1, yk2, . . . , ykN.

In matrix notation

netk = ( wo^T yk1   wo^T yk2   · · ·  wo^T ykN )   (31)

52 / 94

Page 120: 15 Machine Learning Multilayer Perceptron

Now, we have

Thus, we have zk (in our case k = 1, but it could be a range of values)

zk = ( f (wo^T yk1)   f (wo^T yk2)   · · ·  f (wo^T ykN) )   (32)

Thus, we generate a vector of differences

d = t − zk = ( t1 − f (wo^T yk1)   t2 − f (wo^T yk2)   · · ·  tN − f (wo^T ykN) )   (33)

where t = ( t1  t2  · · ·  tN ) is a row vector of the desired outputs, one per sample.

54 / 94

Page 122: 15 Machine Learning Multilayer Perceptron

Now, we multiply element wise

We have the following vector of derivatives of the net

Df = ( η f ′ (wo^T yk1)   η f ′ (wo^T yk2)   · · ·  η f ′ (wo^T ykN) )   (34)

where η is the step rate.

Finally, by element-wise multiplication (Hadamard Product)

d = ( η [t1 − f (wo^T yk1)] f ′ (wo^T yk1)   η [t2 − f (wo^T yk2)] f ′ (wo^T yk2)   · · ·   η [tN − f (wo^T ykN)] f ′ (wo^T ykN) )

55 / 94

Page 124: 15 Machine Learning Multilayer Perceptron

Tile d

Tile downward

dtile = [ d
          d
          ...
          d ]   (nH rows)   (35)

Finally, we multiply element-wise against y1 (Hadamard Product)

∆wtemp1j = y1 ◦ dtile   (36)

56 / 94

Page 126: 15 Machine Learning Multilayer Perceptron

We obtain the total ∆w1j

We sum along the rows of ∆wtemp1j

∆w1j = [ η [t1 − f (wo^T yk1)] f ′ (wo^T yk1) y11 + · · · + η [tN − f (wo^T ykN)] f ′ (wo^T ykN) y1N
         ...
         η [t1 − f (wo^T yk1)] f ′ (wo^T yk1) ynH1 + · · · + η [tN − f (wo^T ykN)] f ′ (wo^T ykN) ynHN ]   (37)

where yhm = f (wh^T xm), with h = 1, 2, . . . , nH and m = 1, 2, . . . , N.

57 / 94

Page 127: 15 Machine Learning Multilayer Perceptron

Finally, we update the first weights

We have then

W_HO (t + 1) = W_HO (t) + ∆w1j^T (t)   (38)

58 / 94
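A NumPy sketch of this whole hidden-to-output step, Eqs. (32)-(38), for the single-output case (k = 1); the names are illustrative, and broadcasting plays the role of the tiling in Eq. (35).

import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)

eta = 0.1
d, nH, N = 4, 3, 5
rng = np.random.default_rng(5)
X = rng.normal(size=(d, N))                    # d x N input matrix, Eq. (24)
t = rng.integers(0, 2, size=N).astype(float)   # row vector of desired outputs
W_IH = rng.normal(size=(nH, d))                # Eq. (26)
W_HO = rng.normal(size=(1, nH))                # Eq. (29), single output unit

Y1 = f(W_IH @ X)                               # Eq. (28): nH x N hidden outputs
net_k = W_HO @ Y1                              # Eq. (31): 1 x N
z_k = f(net_k)                                 # Eq. (32)
d_row = eta * (t - z_k) * f_prime(net_k)       # Eqs. (33)-(34) combined, 1 x N
dW_temp = Y1 * d_row                           # Eq. (36); broadcasting replaces the tiling
dW_1j = dW_temp.sum(axis=1)                    # Eq. (37): sum along the rows, length nH
W_HO = W_HO + dW_1j[np.newaxis, :]             # Eq. (38)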

First

We multiply element wise W_{HO} and \Delta w_{1j}

T = \Delta w_{1j}^T \circ W_{HO}^T    (39)

Now, we obtain the element wise derivative of net_j

Dnet_j = \begin{pmatrix} f'(w_1^T x_1) & f'(w_1^T x_2) & \cdots & f'(w_1^T x_N) \\ f'(w_2^T x_1) & f'(w_2^T x_2) & \cdots & f'(w_2^T x_N) \\ \vdots & \vdots & \ddots & \vdots \\ f'(w_{n_H}^T x_1) & f'(w_{n_H}^T x_2) & \cdots & f'(w_{n_H}^T x_N) \end{pmatrix}    (40)


Thus

We tile T to the right

T_{tile} = \underbrace{\begin{pmatrix} T & T & \cdots & T \end{pmatrix}}_{N \text{ columns}}    (41)

Now, we multiply element wise together with \eta

P_t = \eta \left( Dnet_j \circ T_{tile} \right)    (42)

where \eta is a constant multiplied against the result of the Hadamard product (the result is an n_H \times N matrix).


Finally

We use the transpose of X, which is an N \times d matrix

X^T = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}    (43)

Finally, we get an n_H \times d matrix

\Delta w_{ij} = P_t X^T    (44)

Thus, given W_{IH}

W_{IH}(t + 1) = W_{IH}(t) + \Delta w_{ij}^T(t)    (45)


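A matching NumPy sketch for the input-to-hidden update, equations (39)-(45). Here X is assumed to be the d x N matrix of inputs (one column per sample), W_IH the input-to-hidden weight matrix stored as n_H x d so that its rows are the w_h, and dW_1j the correction from equation (37); these names, shapes and the orientation of W_IH are my assumptions chosen so that the dimensions in the slides work out.

```python
import numpy as np

def f(v):
    return 1.0 / (1.0 + np.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)                      # logistic derivative, a = 1 (assumption)

rng = np.random.default_rng(2)
d_in, n_H, N, eta = 4, 3, 5, 0.1              # illustrative sizes
X = rng.standard_normal((d_in, N))            # inputs, one column per sample
W_IH = rng.standard_normal((n_H, d_in))       # input-to-hidden weights (rows are w_h)
W_HO = rng.standard_normal((1, n_H))          # hidden-to-output weights
dW_1j = rng.standard_normal((n_H, 1))         # hidden-to-output correction from Eq. (37)

T = dW_1j * W_HO.T                            # Eq. (39): elementwise product, n_H x 1
T_tile = np.tile(T, (1, N))                   # Eq. (41): tile to the right, N columns
Dnet_j = f_prime(W_IH @ X)                    # Eq. (40): f'(w_h^T x_m), n_H x N
P_t = eta * (Dnet_j * T_tile)                 # Eq. (42)
dW_ij = P_t @ X.T                             # Eqs. (43)-(44): an n_H x d matrix
W_IH = W_IH + dW_ij                           # Eq. (45); no transpose needed with this orientation
print(dW_ij.shape)                            # (3, 4)
```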


We have different activation functions

The two most important:
1 Sigmoid (logistic) function.
2 Hyperbolic tangent function.


Logistic Function

This non-linear function has the following definition for a neuron j

f_j(v_j(n)) = \frac{1}{1 + \exp\{-a v_j(n)\}},  with a > 0 and -\infty < v_j(n) < \infty    (46)

Example


The differential of the sigmoid function

Now if we differentiate, we have

f_j'(v_j(n)) = a \left[ \frac{1}{1 + \exp\{-a v_j(n)\}} \right] \left[ 1 - \frac{1}{1 + \exp\{-a v_j(n)\}} \right] = \frac{a \exp\{-a v_j(n)\}}{\left(1 + \exp\{-a v_j(n)\}\right)^2}

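A small sketch of equation (46) and its derivative, with a finite-difference check of the analytic form; the particular slope value a = 2 is only an illustration.

```python
import numpy as np

def logistic(v, a=2.0):
    # Eq. (46): f(v) = 1 / (1 + exp(-a v)), a > 0
    return 1.0 / (1.0 + np.exp(-a * v))

def logistic_prime(v, a=2.0):
    s = logistic(v, a)
    return a * s * (1.0 - s)      # analytic derivative: a f(v) (1 - f(v))

v = np.linspace(-3, 3, 7)
h = 1e-6
numeric = (logistic(v + h) - logistic(v - h)) / (2 * h)    # central difference
print(np.allclose(numeric, logistic_prime(v), atol=1e-6))  # True
```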

The outputs finish as

For the output neurons (using the logistic activation with a = 1)

\delta_k = (t_k - z_k) f'(net_k) = (t_k - f_k(v_k(n))) \, f_k(v_k(n)) \, (1 - f_k(v_k(n)))

For the hidden neurons

\delta_j = f_j(v_j(n)) \, (1 - f_j(v_j(n))) \sum_{k=1}^{c} w_{kj} \delta_k

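A sketch of the two delta formulas for a single pattern, assuming logistic units with a = 1 so that f'(net) = f(net)(1 - f(net)); the network sizes and variable names are illustrative assumptions.

```python
import numpy as np

def f(v):
    # Logistic activation with a = 1 (assumption), so f'(v) = f(v)(1 - f(v))
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(3)
d_in, n_H, c = 3, 4, 2                   # illustrative sizes: inputs, hidden, outputs
x = rng.standard_normal(d_in)            # one input pattern
W_ji = rng.standard_normal((n_H, d_in))  # input-to-hidden weights
W_kj = rng.standard_normal((c, n_H))     # hidden-to-output weights
t = np.array([0.9, 0.1])                 # targets for the c = 2 outputs

y = f(W_ji @ x)                          # hidden outputs f(net_j)
z = f(W_kj @ y)                          # network outputs z_k = f(net_k)
delta_k = (t - z) * z * (1.0 - z)        # output deltas: (t_k - z_k) f_k (1 - f_k)
delta_j = y * (1.0 - y) * (W_kj.T @ delta_k)  # hidden deltas: f_j (1 - f_j) sum_k w_kj delta_k
print(delta_k, delta_j)
```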

Hyperbolic tangent function

Another commonly used form of sigmoidal non-linearity is the hyperbolic tangent function

f_j(v_j(n)) = a \tanh(b v_j(n))    (47)

Example


The differential of the hyperbolic tangent

We have

f_j'(v_j(n)) = ab \, \text{sech}^2(b v_j(n)) = ab \left( 1 - \tanh^2(b v_j(n)) \right)

BTW: I leave it to you to figure out the outputs.

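A sketch of equation (47) and its derivative, with a numerical check and a check of antisymmetry; the constants a = 1.7159 and b = 2/3 are values often recommended in the literature (e.g., by LeCun), used here only as an assumed example.

```python
import numpy as np

A, B = 1.7159, 2.0 / 3.0               # commonly recommended constants (illustrative choice)

def tanh_act(v, a=A, b=B):
    # Eq. (47): f(v) = a tanh(b v)
    return a * np.tanh(b * v)

def tanh_act_prime(v, a=A, b=B):
    # f'(v) = a b sech^2(b v) = a b (1 - tanh^2(b v))
    return a * b * (1.0 - np.tanh(b * v) ** 2)

v = np.linspace(-3, 3, 7)
h = 1e-6
numeric = (tanh_act(v + h) - tanh_act(v - h)) / (2 * h)
print(np.allclose(numeric, tanh_act_prime(v), atol=1e-6))  # True
print(np.allclose(tanh_act(-v), -tanh_act(v)))             # True: the function is antisymmetric
```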


Maximizing information content

Two ways of achieving this (LeCun, 1993):
Use an example that results in the largest training error.
Use an example that is radically different from all those previously used.

For this
Randomize the samples presented to the multilayer perceptron when not doing batch training (a minimal sketch is given after this slide).

Or use an emphasizing scheme
By using the error, identify the difficult vs. easy patterns:
Use them to train the neural network.

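A minimal sketch of the randomization heuristic for on-line (non-batch) training: reshuffle the order in which the N samples are presented at every epoch. The training step itself is a placeholder; the data and names are illustrative.

```python
import numpy as np

def train_one_sample(x, t):
    # Placeholder for one on-line back-propagation step (not shown here)
    pass

rng = np.random.default_rng(4)
N = 10
X = rng.standard_normal((N, 3))        # illustrative data set, one row per sample
T = rng.integers(0, 2, size=N)         # illustrative targets

for epoch in range(5):
    order = rng.permutation(N)         # new random presentation order each epoch
    for m in order:
        train_one_sample(X[m], T[m])
```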

However

Be careful about the emphasizing scheme
The distribution of examples within an epoch presented to the network is distorted.
The presence of an outlier or a mislabeled example can have a catastrophic consequence on the performance of the algorithm.

Definition of Outlier
An outlier is an observation that lies outside the overall pattern of a distribution (Moore and McCabe, 1999).



Activation Function

We say that
An activation function f(v) is antisymmetric (i.e., an odd function) if f(-v) = -f(v).

It seems to be
That the multilayer perceptron learns faster using an antisymmetric function.

Example: The hyperbolic tangent



Target Values

Important
It is important that the target values be chosen within the range of the sigmoid activation function.

Specifically
The desired response for a neuron in the output layer of the multilayer perceptron should be offset by some amount \epsilon.


For example

Given a limiting value
We have then:
If we have a limiting value +a, we set t = a - \epsilon.
If we have a limiting value -a, we set t = -a + \epsilon.

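A tiny sketch of this offset rule for a tanh unit that saturates at +/-a. The values a = 1.7159 and epsilon = 0.7159 (which make the targets exactly +/-1) are one common convention and are assumed here for illustration.

```python
import numpy as np

a = 1.7159                 # assumed saturation level of the tanh activation
eps = 0.7159               # assumed offset; with this choice the targets become +/-1

labels = np.array([+1, -1, -1, +1])                 # illustrative class labels
targets = np.where(labels > 0, a - eps, -a + eps)   # t = a - eps or t = -a + eps
print(targets)                                      # [ 1. -1. -1.  1.]
```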


Normalizing the inputs

Something Important (LeCun, 1993)
Each input variable should be preprocessed so that:
The mean value, averaged over the entire training set, is close to zero.
Or it is small compared to its standard deviation.

Example: Mean Value


The normalization must include two other measures

Uncorrelated
We can use principal component analysis.

Example


In addition

Quite interesting
The decorrelated input variables should be scaled so that their covariances are approximately equal.

Why
This ensures that the different synaptic weights in the network learn at approximately the same speed.

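A sketch of the three preprocessing steps discussed here: remove the mean of each input variable, decorrelate with PCA, and rescale so the decorrelated variables have approximately equal covariances (here, unit variance). This is a generic whitening recipe consistent with the slides, with illustrative data, not code from the lecture.

```python
import numpy as np

rng = np.random.default_rng(5)
A_mix = np.array([[2.0, 0.5, 0.0],
                  [0.0, 1.0, 0.3],
                  [0.0, 0.0, 0.2]])
X = rng.standard_normal((200, 3)) @ A_mix     # correlated inputs, rows = samples

# 1) Mean removal: each input variable gets (close to) zero mean.
Xc = X - X.mean(axis=0)

# 2) Decorrelation via PCA (eigenvectors of the covariance matrix).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_decorr = Xc @ eigvecs                       # decorrelated components

# 3) Covariance equalization: scale each component to unit variance.
X_white = X_decorr / np.sqrt(eigvals)         # assumes all eigenvalues are > 0

print(np.round(np.cov(X_white, rowvar=False), 3))   # approximately the identity matrix
```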

There are other heuristics

Such as:
Initialization
Learning from hints
Learning rates
etc.


In addition

In section 4.15, Simon Haykin
We have the following techniques:

Network growing
You start with a small network and add neurons and layers to accomplish the learning task.

Network pruning
Start with a large network, then prune weights that are not necessary in an orderly fashion.



Virtues and limitations of Back-Propagation Layer

Something Notable
The back-propagation algorithm has emerged as the most popular algorithm for the training of multilayer perceptrons.

It has two distinct properties:
It is simple to compute locally.
It performs stochastic gradient descent in weight space when doing pattern-by-pattern training.


Connectionism

Back-propagation
It is an example of a connectionist paradigm that relies on local computations to discover the processing capabilities of neural networks.

This form of restriction
It is known as the locality constraint.


Why this is advocated in Artificial Neural Networks

First
Artificial neural networks that perform local computations are often held up as metaphors for biological neural networks.

Second
The use of local computations permits a graceful degradation in performance due to hardware errors, and therefore provides the basis for a fault-tolerant network design.

Third
Local computations favor the use of parallel architectures as an efficient method for the implementation of artificial neural networks.


However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

First
The reciprocal synaptic connections between the neurons of a multilayer perceptron may assume weights that are excitatory or inhibitory.
In the real nervous system, neurons usually appear to be one or the other.

Second
In a multilayer perceptron, hormonal and other types of global communications are ignored.


However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

Third
In back-propagation learning, a synaptic weight is modified by a presynaptic activity and an error (learning) signal independent of postsynaptic activity.
There is evidence from neurobiology to suggest otherwise.

Fourth
In a neurobiological sense, the implementation of back-propagation learning requires the rapid transmission of information backward along an axon.
It appears highly unlikely that such an operation actually takes place in the brain.


However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989)

Fifth
Back-propagation learning implies the existence of a "teacher," which in the context of the brain would presumably be another set of neurons with novel properties.
The existence of such neurons is biologically implausible.


Computational Efficiency

Something Notable
The computational complexity of an algorithm is usually measured in terms of the number of multiplications, additions, and storage involved in its implementation.

This is the electrical engineering approach!!!

Taking into account the total number of synapses, W, including biases
We have \Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k) y_j (Backward Pass)

We have that for this step:
1 We need to calculate net_k, which is linear in the number of weights.
2 We need to calculate y_j = f(net_j), which is linear in the number of weights.


Computational Efficiency

Now the Forward Pass

\Delta w_{ji} = \eta x_i \delta_j = \eta f'(net_j) \left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] x_i

We have that for this step
\left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] takes, thanks to the previously computed \delta_k's, time linear in the number of weights.

Clearly, all of this requires memory
In addition, the derivatives of the activation functions must be calculated, although we assume this takes constant time.


We have that

The complexity of the multilayer perceptron (back-propagation) is

O(W), i.e., linear in the total number of weights W.
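To make the O(W) statement concrete, a small sketch counting the total number of synaptic weights W (including biases) for a d-n_H-c single-hidden-layer network; each pass then costs a number of multiply-adds proportional to W. The layer sizes below are illustrative.

```python
def total_weights(d, n_H, c):
    # Weights plus biases: input->hidden has (d + 1) * n_H, hidden->output has (n_H + 1) * c.
    return (d + 1) * n_H + (n_H + 1) * c

W = total_weights(d=20, n_H=10, c=3)
print(W)   # 243; back-propagation is O(W) per presented pattern
```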

Exercises

We have from Neural Networks by Haykin:
4.2, 4.3, 4.6, 4.8, 4.16, 4.17, 3.7