Multi-Layer Networks and Backpropagation Algorithm


Page 1: Multi-Layer Networks and Backpropagation Algorithm

Multi-Layer Networks and Backpropagation Algorithm

M. Soleymani

Sharif University of Technology

Fall 2017

Most slides have been adapted from Fei Fei Li lectures, cs231n, Stanford 2017

and some from Hinton lectures, "NN for Machine Learning" course, 2015.

Page 2: Multi-Layer Networks and Backpropagation Algorithm

Reasons to study neural computation

• Neuroscience: To understand how the brain actually works.
– It's very big and very complicated and made of stuff that dies when you poke it around, so we need to use computer simulations.

• AI: To solve practical problems by using novel learning algorithms inspired by the brain.
– Learning algorithms can be very useful even if they are not how the brain actually works.

Page 3: Multi-Layer Networks and Backpropagation Algorithm
Page 4: Multi-Layer Networks and Backpropagation Algorithm

A typical cortical neuron

• Gross physical structure:
– There is one axon that branches.
– There is a dendritic tree that collects input from other neurons.

• Axons typically contact dendritic trees at synapses.
– A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.

• Spike generation:
– There is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at synapses to depolarize the cell membrane.

Page 5: Multi-Layer Networks and Backpropagation Algorithm

A mathematical model for biological neurons

π‘₯1 𝑀1

𝑀1π‘₯1

𝑀2π‘₯2

𝑀3π‘₯3

Page 6: Multi-Layer Networks and Backpropagation Algorithm

How the brain works

• Each neuron receives inputs from other neurons.

• The effect of each input line on the neuron is controlled by a synaptic weight.

• The synaptic weights adapt so that the whole network learns to perform useful computations.
– Recognizing objects, understanding language, making plans, controlling the body.

• You have about 10^11 neurons, each with about 10^4 weights.
– A huge number of weights can affect the computation in a very short time. Much better bandwidth than a workstation.

Page 7: Multi-Layer Networks and Backpropagation Algorithm

Be very careful with your brain analogies!

• Biological neurons:
– Many different types
– Dendrites can perform complex non-linear computations
– Synapses are not a single weight but a complex non-linear dynamical system
– Rate code may not be adequate

[Dendritic Computation. London and Hausser]

Page 8: Multi-Layer Networks and Backpropagation Algorithm

Binary threshold neurons

• McCulloch-Pitts (1943): influenced Von Neumann.
– First compute a weighted sum of the inputs.
– Send out a spike of activity if the weighted sum exceeds a threshold.
– McCulloch and Pitts thought that each spike is like the truth value of a proposition and each neuron combines truth values to compute the truth value of another proposition!

[Figure: inputs $input_1, \dots, input_d$ with weights $w_1, \dots, w_d$ feed a summation unit $\Sigma$ followed by an activation function $f$, producing $f\!\left(\sum_i w_i x_i\right)$.]

Page 9: Multi-Layer Networks and Backpropagation Algorithm

McCulloch-Pitts neuron: binary threshold

• Neuron, unit, or processing element:

$y = \begin{cases} 1 & z \geq \theta \\ 0 & z < \theta \end{cases}$, where $z = \sum_{i=1}^{d} w_i x_i$ and $\theta$ is the activation threshold.

• Equivalently, using a bias $b = -\theta$:

$y = \begin{cases} 1 & \sum_{i} w_i x_i + b \geq 0 \\ 0 & \text{otherwise} \end{cases}$

[Figure: two equivalent diagrams of the binary McCulloch-Pitts neuron, one with inputs $x_1, \dots, x_d$, weights $w_1, \dots, w_d$ and threshold $\theta$, the other with the bias $b = -\theta$ included as an extra weight.]

Page 10: Multi-Layer Networks and Backpropagation Algorithm

AND & OR networks


• For -1 and 1 inputs:
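The slide's tables and figures are not captured in this transcript; below is a minimal sketch (assuming inputs in {-1, +1} and the bias form b = -θ from the previous slide) of a single threshold unit implementing AND and OR:

import numpy as np

def threshold_neuron(x, w, b):
    # Binary threshold (McCulloch-Pitts) unit: fires iff w . x + b >= 0.
    return 1 if np.dot(w, x) + b >= 0 else 0

# AND over {-1, +1} inputs: fires only when both inputs are +1.
and_w, and_b = np.array([1.0, 1.0]), -1.5
# OR over {-1, +1} inputs: fires when at least one input is +1.
or_w, or_b = np.array([1.0, 1.0]), 1.5

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    x = np.array(x, dtype=float)
    print(x, threshold_neuron(x, and_w, and_b), threshold_neuron(x, or_w, or_b))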

Page 11: Multi-Layer Networks and Backpropagation Algorithm

Sigmoid neurons

• These give a real-valued output that is a smooth and bounded function of their total input.

• Typically they use the logistic function $y = \frac{1}{1 + e^{-z}}$.
– They have nice derivatives.

Page 12: Multi-Layer Networks and Backpropagation Algorithm

Rectified Linear Units (ReLU)

• They compute a linear weighted sum of their inputs.

• The output is a non-linear function of the total input: $f(z) = \max(0, z)$.

Page 13: Multi-Layer Networks and Backpropagation Algorithm

Adjusting weights

• Types of single layer networks:
– Perceptron (Rosenblatt, 1962)
– ADALINE (Widrow and Hoff, 1960)

Page 14: Multi-Layer Networks and Backpropagation Algorithm

The standard Perceptron architecture

• Learn how to weight each of the feature activations to get desirable outputs.

• If the output is above some threshold, decide that the input vector is a positive example of the target class.

Page 15: Multi-Layer Networks and Backpropagation Algorithm

The perceptron convergence procedure

• Perceptron trains binary output neurons as classifiers.

• Pick training cases (until convergence):
– If the output unit is correct, leave its weights alone.
– If the output unit incorrectly outputs a zero, add the input vector to its weight vector.
– If the output unit incorrectly outputs a 1, subtract the input vector from its weight vector.

• This is guaranteed to find a set of weights that gets the right answer for all the training cases if any such set exists.

Page 16: Multi-Layer Networks and Backpropagation Algorithm

Adjusting weights

• Weight update for a training pair $(\boldsymbol{x}^{(n)}, y^{(n)})$:

– Perceptron: if $\mathrm{sign}(\boldsymbol{w}^T \boldsymbol{x}^{(n)}) \neq y^{(n)}$ then $\Delta\boldsymbol{w} = \boldsymbol{x}^{(n)} y^{(n)}$, else $\Delta\boldsymbol{w} = \boldsymbol{0}$

– ADALINE: $\Delta\boldsymbol{w} = \eta\left(y^{(n)} - \boldsymbol{w}^T \boldsymbol{x}^{(n)}\right)\boldsymbol{x}^{(n)}$

• Widrow-Hoff, LMS, or delta rule: $\boldsymbol{w}^{t+1} = \boldsymbol{w}^{t} - \eta\,\nabla E_n\!\left(\boldsymbol{w}^{t}\right)$, with $E_n(\boldsymbol{w}) = \left(y^{(n)} - \boldsymbol{w}^T \boldsymbol{x}^{(n)}\right)^2$

Page 17: Multi-Layer Networks and Backpropagation Algorithm

How to learn the weights: multi class example

Page 18: Multi-Layer Networks and Backpropagation Algorithm

How to learn the weights: multi class example

• If correct: no change.

• If wrong:
– Lower the score of the wrong answer (by removing the input from the weight vector of the wrong answer).
– Raise the score of the target (by adding the input to the weight vector of the target class).

Page 24: Multi-Layer Networks and Backpropagation Algorithm

Single layer networks as template matching

• The weights for each class act as a template (sometimes also called a prototype) for that class.
– The winner is the most similar template.

• The ways in which hand-written digits vary are much too complicated to be captured by simple template matches of whole shapes.

• To capture all the allowable variations of a digit we need to learn the features that it is composed of.

Page 25: Multi-Layer Networks and Backpropagation Algorithm

The history of perceptrons

• They were popularised by Frank Rosenblatt in the early 1960s.
– They appeared to have a very powerful learning algorithm.
– Lots of grand claims were made for what they could learn to do.

• In 1969, Minsky and Papert published a book called "Perceptrons" that analyzed what they could do and showed their limitations.
– Many people thought these limitations applied to all neural network models.

Page 26: Multi-Layer Networks and Backpropagation Algorithm

What binary threshold neurons cannot do

• A binary threshold output unit cannot even tell if two single-bit features are the same!

• A geometric view of what binary threshold neurons cannot do:
– The positive and negative cases cannot be separated by a plane.

Page 27: Multi-Layer Networks and Backpropagation Algorithm

What binary threshold neurons cannot do

• Positive cases (same): (1,1) -> 1; (0,0) -> 1

• Negative cases (different): (1,0) -> 0; (0,1) -> 0

• The four input-output pairs give four inequalities that are impossible to satisfy simultaneously:
– $w_1 + w_2 \geq \theta$
– $0 \geq \theta$
– $w_1 < \theta$
– $w_2 < \theta$

Page 28: Multi-Layer Networks and Backpropagation Algorithm

Discriminating simple patterns under translation with wrap-around

• Suppose we just use pixels as the features.

• A binary decision unit cannot discriminate patterns with the same number of on pixels
– if the patterns can translate with wrap-around!

Page 29: Multi-Layer Networks and Backpropagation Algorithm

Sketch of a proof

• For pattern A, use training cases in all possible translations.
– Each pixel will be activated by 4 different translations of pattern A.
– So the total input received by the decision unit over all these patterns will be four times the sum of all the weights.

• For pattern B, use training cases in all possible translations.
– Each pixel will be activated by 4 different translations of pattern B.
– So the total input received by the decision unit over all these patterns will be four times the sum of all the weights.

• But to discriminate correctly, every single case of pattern A must provide more input to the decision unit than every single case of pattern B.

• This is impossible if the sums over cases are the same.

Page 30: Multi-Layer Networks and Backpropagation Algorithm

Networks with hidden units

• Networks without hidden units are very limited in the input-output mappings they can learn to model.
– More layers of linear units do not help. It's still linear.
– Fixed output non-linearities are not enough.

• We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
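To see why more layers of linear units do not help: stacking two linear layers collapses to a single linear map,

$W_2\left(W_1 \boldsymbol{x}\right) = \left(W_2 W_1\right)\boldsymbol{x},$

which is still linear in $\boldsymbol{x}$.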

Page 31: Multi-Layer Networks and Backpropagation Algorithm

Feed-forward neural networks

• Also called Multi-Layer Perceptron (MLP)

Page 32: Multi-Layer Networks and Backpropagation Algorithm

General approximator

• If the decision boundary is smooth, then a 3-layer network (i.e., one with 2 hidden layers) can come arbitrarily close to the target classifier.

Page 33: Multi-Layer Networks and Backpropagation Algorithm

MLP with Different Number of Layers

Decision regions of an MLP with unit-step activation functions (the decision region is the one found by an output unit):

– Single layer (no hidden layer): half-space; region found by a hyperplane.
– Two layers (one hidden layer): polyhedral (open or closed) region; intersection of half-spaces.
– Three layers (two hidden layers): arbitrary regions; union of polyhedra.

Page 34: Multi-Layer Networks and Backpropagation Algorithm

Beyond linear models

Page 35: Multi-Layer Networks and Backpropagation Algorithm

Beyond linear models

Page 36: Multi-Layer Networks and Backpropagation Algorithm

MLP with single hidden layer

• Two-layer MLP (the number of layers of adaptive weights is counted):

$o_k(\boldsymbol{x}) = \psi\!\left(\sum_{j=0}^{M} w_{jk}^{[2]} z_j\right) \;\Rightarrow\; o_k(\boldsymbol{x}) = \psi\!\left(\sum_{j=0}^{M} w_{jk}^{[2]}\, \phi\!\left(\sum_{i=0}^{d} w_{ij}^{[1]} x_i\right)\right)$

for $i = 0, \dots, d$; $j = 1, \dots, M$; $k = 1, \dots, K$, with $x_0 = 1$ and $z_0 = 1$ as bias units.

[Figure: input layer $x_0, x_1, \dots, x_d$, hidden units $z_j = \phi\!\left(\sum_i w_{ij}^{[1]} x_i\right)$, and outputs $o_1, \dots, o_K$ obtained by applying $\psi$ to $\sum_j w_{jk}^{[2]} z_j$.]
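A minimal numpy sketch of the forward computation above (tanh for φ and the identity for ψ are illustrative choices, not taken from the slides):

import numpy as np

def mlp_forward(x, W1, W2, phi=np.tanh, psi=lambda s: s):
    # Forward pass of a one-hidden-layer MLP; weight matrices include the bias columns.
    x = np.concatenate(([1.0], x))   # x_0 = 1 (bias input)
    z = phi(W1 @ x)                  # hidden activations z_1 .. z_M
    z = np.concatenate(([1.0], z))   # z_0 = 1 (bias hidden unit)
    return psi(W2 @ z)               # outputs o_1 .. o_K

d, M, K = 3, 5, 2
rng = np.random.default_rng(0)
W1 = rng.normal(size=(M, d + 1))     # W1[j, i] plays the role of w_ij^[1]
W2 = rng.normal(size=(K, M + 1))     # W2[k, j] plays the role of w_jk^[2]
print(mlp_forward(rng.normal(size=d), W1, W2))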

Page 37: Multi-Layer Networks and Backpropagation Algorithm

learns to extract features

• An MLP with one hidden layer is a generalized linear model:

$o_k(\boldsymbol{x}) = \psi\!\left(\sum_{j=1}^{M} w_{jk}^{[2]} f_j(\boldsymbol{x})\right), \qquad f_j(\boldsymbol{x}) = \phi\!\left(\sum_{i=0}^{d} w_{ij}^{[1]} x_i\right)$

– The form of the non-linearity (the basis functions $f_j$) is adapted from the training data (not fixed in advance).

• Each $f_j$ is defined by parameters that can also be adapted during training.

• Thus, we don't need expert knowledge or time-consuming tuning of hand-crafted features.

Page 38: Multi-Layer Networks and Backpropagation Algorithm

Deep networks

• That deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation,
– despite the fact that their representational power is equal.

• In practice, 3-layer neural networks will usually outperform 2-layer nets, but going even deeper rarely helps much more.
– This is in stark contrast to Convolutional Networks.

Page 39: Multi-Layer Networks and Backpropagation Algorithm

How to adjust weights for multi layer networks?

• We need multiple layers of adaptive, non-linear hidden units. But how can we train such nets?
– We need an efficient way of adapting all the weights, not just the last layer.
– Learning the weights going into hidden units is equivalent to learning features.
– This is difficult because nobody is telling us directly what the hidden units should do.

Page 40: Multi-Layer Networks and Backpropagation Algorithm
Page 41: Multi-Layer Networks and Backpropagation Algorithm

Gradient descent

• We want $\nabla_{\boldsymbol{W}} L(\boldsymbol{W})$.

• Numerical gradient:
– slow :(
– approximate :(
– easy to write :)

• Analytic gradient:
– fast :)
– exact :)
– error-prone :(

• In practice: derive the analytic gradient, then check your implementation with the numerical gradient (a gradient check).
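A minimal sketch of such a check, using centered differences (the loss L(w) = ||w||^2 is just an illustrative stand-in with known analytic gradient 2w):

import numpy as np

def numerical_gradient(f, w, h=1e-5):
    # Centered-difference estimate of df/dw, one coordinate at a time.
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        fp = f(w)
        w.flat[i] = old - h
        fm = f(w)
        w.flat[i] = old
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

w = np.random.randn(4)
num = numerical_gradient(lambda v: np.sum(v ** 2), w)
print(np.max(np.abs(num - 2 * w)))   # should be tiny (~1e-9)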

Page 42: Multi-Layer Networks and Backpropagation Algorithm

Training multi-layer networks

• Backpropagation
– Training algorithm that is used to adjust the weights in multi-layer networks (based on the training data).
– The backpropagation algorithm is based on gradient descent.
– Use the chain rule and dynamic programming to efficiently compute gradients.

Page 43: Multi-Layer Networks and Backpropagation Algorithm

Computational graphs

Page 44: Multi-Layer Networks and Backpropagation Algorithm

Backpropagation: a simple example
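The worked example itself appears in figures that are not captured in this transcript; a representative walkthrough in the cs231n style (values chosen here for illustration) is:

# Worked example: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule, starting from df/df = 1)
dq = z             # df/dq = z        -> -4
dz = q             # df/dz = q        ->  3
dx = dq * 1.0      # dq/dx = 1, so df/dx = -4
dy = dq * 1.0      # dq/dy = 1, so df/dy = -4
print(f, dx, dy, dz)   # -12.0 -4.0 -4.0 3.0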

Page 51: Multi-Layer Networks and Backpropagation Algorithm

How to propagate the gradients backward

Page 52: Multi-Layer Networks and Backpropagation Algorithm

How to propagate the gradients backward

Page 53: Multi-Layer Networks and Backpropagation Algorithm

Another example

Page 66: Multi-Layer Networks and Backpropagation Algorithm

Another example

Gradient of each input = [local gradient] x [upstream gradient]:
– x0: [2] x [0.2] = 0.4
– w0: [-1] x [0.2] = -0.2

Page 67: Multi-Layer Networks and Backpropagation Algorithm

Derivative of sigmoid function
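The derivation on these slides is not captured in the transcript; it follows directly from the definition of the logistic function:

$\displaystyle \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma}{dx} = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} = \left(\frac{1 + e^{-x} - 1}{1 + e^{-x}}\right)\left(\frac{1}{1 + e^{-x}}\right) = \bigl(1 - \sigma(x)\bigr)\,\sigma(x)$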

Page 69: Multi-Layer Networks and Backpropagation Algorithm

Patterns in backward flow

• add gate: gradient distributor

• max gate: gradient router

Page 70: Multi-Layer Networks and Backpropagation Algorithm

Gradients add at branches

Page 71: Multi-Layer Networks and Backpropagation Algorithm

Simple chain rule

• $y = g(x)$, $z = f(g(x)) = f(y)$

• Chain rule: $\displaystyle \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$

Page 72: Multi-Layer Networks and Backpropagation Algorithm

Multiple paths chain rule
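The formula itself is not captured in the transcript; the standard multiple-paths chain rule being referred to is, for $z$ depending on $x$ through intermediates $y_1, \dots, y_n$:

$\displaystyle \frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}$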

Page 73: Multi-Layer Networks and Backpropagation Algorithm

Modularized implementation: forward / backward API
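The code on these slides is not captured in the transcript; a minimal sketch of the kind of gate object this API describes (a hypothetical MultiplyGate in the cs231n style):

class MultiplyGate:
    # One node of a computational graph: z = x * y.
    def forward(self, x, y):
        # Compute the result and cache the inputs needed for the backward pass.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: local gradients (y and x) times the upstream gradient dz.
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)     # z = -12.0
dx, dy = gate.backward(1.0)     # dx = -4.0, dy = 3.0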

Page 76: Multi-Layer Networks and Backpropagation Algorithm

Caffe Layers

Page 77: Multi-Layer Networks and Backpropagation Algorithm

Caffe sigmoid layer

Page 78: Multi-Layer Networks and Backpropagation Algorithm

Output as a composite function

$\text{Output} = a^{[L]} = f\!\left(z^{[L]}\right) = f\!\left(W^{[L]} a^{[L-1]}\right) = f\!\left(W^{[L]} f\!\left(W^{[L-1]} a^{[L-2]}\right)\right) = \dots = f\!\left(W^{[L]} f\!\left(W^{[L-1]} \cdots f\!\left(W^{[2]} f\!\left(W^{[1]} x\right)\right)\right)\right)$

For convenience, we use the same activation function $f$ for all layers. However, output-layer neurons most commonly do not need an activation function (they output class scores or real-valued targets).

[Figure: a chain of alternating matrix multiplications and activations: $x \xrightarrow{W^{[1]}} z^{[1]} \xrightarrow{f} a^{[1]} \xrightarrow{W^{[2]}} z^{[2]} \to \cdots \to a^{[L-1]} \xrightarrow{W^{[L]}} z^{[L]} \xrightarrow{f} a^{[L]} = \text{output}$.]

Page 79: Multi-Layer Networks and Backpropagation Algorithm

Backpropagation: Notation

• $\boldsymbol{a}^{[0]} \leftarrow$ input

• output $\leftarrow \boldsymbol{a}^{[L]}$

[Figure: a chain of layers, each applying $f(\cdot)$; layer $l$ maps $\boldsymbol{a}^{[l-1]}$ to $\boldsymbol{z}^{[l]}$ and then $\boldsymbol{a}^{[l]} = f\!\left(\boldsymbol{z}^{[l]}\right)$.]

Page 80: Multi-Layer Networks and Backpropagation Algorithm

Backpropagation: Last layer gradient

πœ•πΈπ‘›

πœ•π‘Šπ‘–π‘—[𝐿]

=πœ•πΈπ‘›πœ•π‘Ž[𝐿]

πœ•π‘Ž[𝐿]

πœ•π‘Šπ‘–π‘—[𝐿]

πœ•πΈ

πœ•π‘Ž[𝐿]= 2(𝑦 βˆ’ π‘Ž[𝐿])

πœ•π‘Ž[𝐿]

πœ•π‘Šπ‘–π‘—[𝐿]

= 𝑓′ 𝑧𝑗[𝐿]

πœ•π‘§π‘—[𝐿]

πœ•π‘Šπ‘–π‘—[𝐿]

= 𝑓′ 𝑧𝑗[𝐿]

π‘Žπ‘–[πΏβˆ’1]

π‘Žπ‘–[π‘™βˆ’1]

𝑧𝑗[𝑙]

π‘Žπ‘—[𝑙]

𝑓

π‘Žπ‘–[𝑙]

= 𝑓 𝑧𝑖[𝑙]

𝑧𝑗[𝑙]

=

𝑗=0

𝑀

𝑀𝑖𝑗[𝑙]π‘Žπ‘–[π‘™βˆ’1]

For squared error loss:

𝐸𝑛 = π‘Ž[𝐿]

βˆ’ 𝑦𝑛

2

𝑀𝑖𝑗[𝑙]

Page 81: Multi-Layer Networks and Backpropagation Algorithm

Backpropagation:

$\displaystyle \frac{\partial E_n}{\partial w_{ij}^{[l]}} = \frac{\partial E_n}{\partial a_j^{[l]}} \times \frac{\partial a_j^{[l]}}{\partial w_{ij}^{[l]}} = \delta_j^{[l]} \times f'\!\left(z_j^{[l]}\right) \times a_i^{[l-1]}$

$\displaystyle \delta_j^{[l]} = \frac{\partial E_n}{\partial a_j^{[l]}}$ is the sensitivity of the error to $a_j^{[l]}$.

The sensitivity vectors can be obtained by running a backward process in the network architecture (hence the name backpropagation).

As before, $a_j^{[l]} = f\!\left(z_j^{[l]}\right)$ and $z_j^{[l]} = \sum_{i} w_{ij}^{[l]} a_i^{[l-1]}$.

Page 82: Multi-Layer Networks and Backpropagation Algorithm

𝛿𝑖[π‘™βˆ’1]

from 𝛿𝑖[𝑙]

82

We will compute 𝜹[π‘™βˆ’1]

from 𝜹[𝑙]

:

𝛿𝑖[π‘™βˆ’1]

=πœ•πΈπ‘›

πœ•π‘Žπ‘–[π‘™βˆ’1]

=

𝑗=1

𝑑[𝑙]

πœ•πΈπ‘›

πœ•π‘Žπ‘—[𝑙]

Γ—πœ•π‘Žπ‘—

[𝑙]

πœ•π‘§π‘—[𝑙]

Γ—πœ•π‘§π‘—

[𝑙]

πœ•π‘Žπ‘–[π‘™βˆ’1]

=

𝑗=1

𝑑[𝑙]

πœ•πΈπ‘›

πœ•π‘Žπ‘—[𝑙]

Γ— 𝑓′ 𝑧𝑖[𝑙]

Γ— 𝑀𝑖𝑗[𝑙]

=

𝑗=1

𝑑[𝑙]

𝛿𝑗[𝑙]

Γ— 𝑓′ 𝑧𝑖[𝑙]

Γ— 𝑀𝑖𝑗[𝑙]

= 𝑓′ 𝑧𝑖[𝑙]

Γ—

𝑗=1

𝑑[𝑙]

𝛿𝑗[𝑙]

Γ— 𝑀𝑖𝑗[𝑙]

π‘Žπ‘–[π‘™βˆ’1]

= 𝑓 𝑧𝑖[π‘™βˆ’1]

𝑧𝑗[𝑙]

=

𝑗=0

𝑀

𝑀𝑖𝑗[𝑙]π‘Žπ‘–[π‘™βˆ’1]

π‘Žπ‘–[π‘™βˆ’1]

π‘Žπ‘—[𝑙]

𝑓′ 𝑧𝑖[π‘™βˆ’1]

π‘Žπ‘–[π‘™βˆ’1]

𝑧𝑗[𝑙]

π‘Žπ‘—[𝑙]

𝑓

𝑀𝑖𝑗[𝑙]

𝑀𝑖𝑗[𝑙]

𝛿𝑗[𝑙]

𝛿𝑗[π‘™βˆ’1]

Page 83: Multi-Layer Networks and Backpropagation Algorithm

Find and save $\boldsymbol{\delta}^{[L]}$

• Called the error, computed recursively in a backward manner.

• For the final layer $l = L$:

$\displaystyle \delta_j^{[L]} = \frac{\partial E_n}{\partial a_j^{[L]}}$

Page 84: Multi-Layer Networks and Backpropagation Algorithm

Backpropagation of Errors

With squared-error loss $E_n = \sum_j \left(a_j^{[L]} - y_j^{(n)}\right)^2$ and one hidden layer ($L = 2$):

$\displaystyle \delta_j^{[2]} = \frac{\partial E_n}{\partial a_j^{[2]}} = 2\left(a_j^{[2]} - y_j^{(n)}\right)$

$\displaystyle \delta_i^{[1]} = \sum_{j=1}^{K} \delta_j^{[2]} \, f'\!\left(z_j^{[2]}\right) w_{ij}^{[2]}$

[Figure: the output errors $\delta_1^{[2]}, \delta_2^{[2]}, \dots, \delta_K^{[2]}$ are propagated back through the weights $w_{ij}^{[2]}$ to form $\delta_i^{[1]}$ at the hidden layer.]
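A minimal numpy sketch putting the equations above together for a one-hidden-layer network with squared-error loss (sigmoid units and the omission of biases are illustrative simplifications; delta is the sensitivity ∂E_n/∂a defined above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_one_example(x, y, W1, W2):
    # Gradients of E = sum_j (a2_j - y_j)^2 for a one-hidden-layer MLP.
    # W1: (M, d) first-layer weights, W2: (K, M) second-layer weights (biases omitted).
    # Forward pass
    z1 = W1 @ x
    a1 = sigmoid(z1)
    z2 = W2 @ a1
    a2 = sigmoid(z2)

    # Backward pass: delta[l] = dE/da[l]
    delta2 = 2.0 * (a2 - y)                        # dE/da2
    delta1 = W2.T @ (delta2 * a2 * (1.0 - a2))     # dE/da1 = sum_j delta2_j f'(z2_j) W2[j, :]

    # Weight gradients: dE/dw_ij^[l] = delta_j^[l] * f'(z_j^[l]) * a_i^[l-1]
    dW2 = np.outer(delta2 * a2 * (1.0 - a2), a1)
    dW1 = np.outer(delta1 * a1 * (1.0 - a1), x)
    return dW1, dW2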

Page 85: Multi-Layer Networks and Backpropagation Algorithm

Gradients for vectorized code

Page 86: Multi-Layer Networks and Backpropagation Algorithm

Vectorized operations
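The examples on these slides are figures not captured here; a representative vectorized case, a node computing y = W x, and its gradients (note that each gradient has the same shape as the corresponding variable):

import numpy as np

# Forward: y = W @ x, with W of shape (m, n) and x of shape (n,)
W = np.random.randn(4, 3)
x = np.random.randn(3)
y = W @ x

# Backward: given an upstream gradient dL/dy of shape (m,),
#   dL/dW = outer(dy, x)  -> shape (m, n), the same shape as W
#   dL/dx = W.T @ dy      -> shape (n,),   the same shape as x
dy = np.random.randn(4)
dW = np.outer(dy, x)
dx = W.T @ dy
assert dW.shape == W.shape and dx.shape == x.shape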

Page 91: Multi-Layer Networks and Backpropagation Algorithm
Page 92: Multi-Layer Networks and Backpropagation Algorithm
Page 93: Multi-Layer Networks and Backpropagation Algorithm
Page 94: Multi-Layer Networks and Backpropagation Algorithm
Page 95: Multi-Layer Networks and Backpropagation Algorithm
Page 96: Multi-Layer Networks and Backpropagation Algorithm
Page 97: Multi-Layer Networks and Backpropagation Algorithm

Always check: the gradient with respect to a variable should have the same shape as the variable.

Page 98: Multi-Layer Networks and Backpropagation Algorithm
Page 99: Multi-Layer Networks and Backpropagation Algorithm

SVM example
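The worked SVM example is not captured in the transcript; a minimal, illustrative stand-in computing the multiclass SVM (hinge) loss for one example and its gradient with respect to the scores:

import numpy as np

def svm_loss(scores, y):
    # Multiclass SVM (hinge) loss for one example and its gradient w.r.t. the scores.
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0
    loss = margins.sum()

    dscores = (margins > 0).astype(float)   # 1 for each class with a positive margin
    dscores[y] = -dscores.sum()             # the correct class collects minus the count
    return loss, dscores

loss, dscores = svm_loss(np.array([3.2, 5.1, -1.7]), y=0)
print(loss, dscores)    # 2.9 [-1.  1.  0.]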

Page 100: Multi-Layer Networks and Backpropagation Algorithm

Summary

• Neural nets may be very large: impractical to write down the gradient formula by hand for all parameters.

• Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates.

• Implementations maintain a graph structure, where the nodes implement the forward() / backward() API.
– forward: compute the result of an operation and save any intermediates needed for gradient computation in memory.
– backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs.

Page 101: Multi-Layer Networks and Backpropagation Algorithm

Converting error derivatives into a learning procedure

• The backpropagation algorithm is an efficient way of computing the error derivative dE/dw for every weight on a single training case.

• To get a fully specified learning procedure, we still need to make a lot of other decisions about how to use these error derivatives:
– Optimization issues: How do we use the error derivatives on individual cases to discover a good set of weights?
– Generalization issues: How do we ensure that the learned weights work well for cases we did not see during training?

Page 102: Multi-Layer Networks and Backpropagation Algorithm

Optimization issues in using the weight derivatives

• How to initialize the weights

• How often to update the weights
– Batch size

• How much to update
– Use a fixed learning rate?
– Adapt the global learning rate?
– Adapt the learning rate on each connection separately?
– Don't use steepest descent?

• …

Page 103: Multi-Layer Networks and Backpropagation Algorithm

Overfitting: The downside of using powerful models

• A model is convincing when it fits a lot of data surprisingly well.
– It is not surprising that a complicated model can fit a small amount of data well.

Page 104: Multi-Layer Networks and Backpropagation Algorithm

Ways to reduce overfitting

• A large number of different methods have been developed:
– Weight decay
– Weight sharing
– Early stopping
– Model averaging
– Dropout
– Generative pre-training

Page 105: Multi-Layer Networks and Backpropagation Algorithm

Resources

• Deep Learning Book, Chapter 6.

• Please see the following note:
– http://cs231n.github.io/optimization-2/