Multilayer Perceptron Backpropagation Summary
Transcript of Multilayer Perceptron Backpropagation Summary
C. Chadoulos, P. Kaplanoglou, S. Papadopoulos, Prof. Ioannis Pitas
Aristotle University of Thessaloniki
www.aiia.csd.auth.gr
Version 2.8
Biological Neuron
• Basic computational unit of the brain.
• Main parts:
  - Dendrites: act as inputs.
  - Soma: main body of the neuron.
  - Axon: acts as output.
• Neurons connect with other neurons via synapses.
Artificial Neurons
• Artificial neurons are mathematical models loosely inspired by their biological counterparts.
• Incoming signals through the axons serve as the input vector:
  $\mathbf{x} = [x_1, x_2, \dots, x_n]^T, \quad x_i \in \mathbb{R}$.
• The synaptic weights are grouped in a weight vector:
  $\mathbf{w} = [w_1, w_2, \dots, w_n]^T, \quad w_i \in \mathbb{R}$.
• Synaptic integration is modeled as the inner product:
  $z = \sum_{i=1}^{n} w_i x_i = \mathbf{w}^T \mathbf{x}$.
[Figure: artificial neuron with inputs $x_1, x_2, \dots, x_n$, synaptic weights $w_1, \dots, w_n$, summation output $z$, and a threshold unit.]
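The synaptic integration above is just a vector inner product. A minimal NumPy sketch follows; the function name, weight values, and zero threshold are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def neuron_output(w, x, threshold=0.0):
    """Synaptic integration z = w^T x, followed by a hard threshold."""
    z = np.dot(w, x)                   # inner product w^T x
    return z, float(z >= threshold)    # (pre-activation, fired or not)

# Example: 3 inputs and 3 illustrative synaptic weights
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, 0.1])
z, fired = neuron_output(w, x)
print(f"z = {z:.2f}, fired = {fired}")
```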
Perceptron
• Simplest mathematical model of a neuron (Rosenblatt; McCulloch & Pitts).
• Inputs are binary, $x_i \in \{0, 1\}$.
• Produces a single binary output, $y_i \in \{0, 1\}$, through an activation function $f(\cdot)$.
• The output $y_i$ signifies whether the neuron will fire or not.
• Firing threshold:
  $\mathbf{w}^T\mathbf{x} \geq -b \;\Leftrightarrow\; \mathbf{w}^T\mathbf{x} + b \geq 0$.
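The firing rule $\mathbf{w}^T\mathbf{x} + b \geq 0$ can be sketched in a few lines; the AND-like weights and bias below are illustrative choices, not taken from the slides:

```python
import numpy as np

def perceptron_fire(w, x, b):
    """Return 1 if the neuron fires (w^T x + b >= 0), else 0."""
    return int(np.dot(w, x) + b >= 0)

# AND-like behaviour over binary inputs, with illustrative weights and bias
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", perceptron_fire(w, np.array(x), b))
```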
MLP Architecture
• Multilayer perceptrons (feed-forward neural networks) typically consist of $L + 1$ layers, with $k_l$ neurons in each layer $l = 0, \dots, L$.
• The first layer (input layer, $l = 0$) contains $n$ inputs, where $n$ is the dimensionality of the input sample vector.
• The $L - 1$ hidden layers $l = 1, \dots, L - 1$ can contain any number of neurons.
MLP Architecture
• The output layer $l = L$ typically contains either one neuron (especially when dealing with regression problems) or $m$ neurons, where $m$ is the number of classes.
• Thus, neural networks can be described by the following characteristic numbers:
  1. Depth: number of hidden layers $L$.
  2. Width: number of neurons per layer, $k_l$, $l = 1, \dots, L$.
  3. Number of output neurons: $k_L = 1$ or $k_L = m$.
MLP Forward Propagation
• Following the same procedure for the $l$-th hidden layer, we get, for its $i$-th neuron:
  $z_i^{(l)} = \sum_{j=1}^{k_{l-1}} w_{ji}^{(l)} a_j^{(l-1)} + w_{0i}^{(l)}, \qquad a_i^{(l)} = f(z_i^{(l)})$.
• In compact form:
  $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})$,
  where row $i$ of $\mathbf{W}^{(l)}$ holds the weights $w_{ji}^{(l)}$ of neuron $i$ and $\mathbf{b}^{(l)}$ holds its bias $w_{0i}^{(l)}$ (see the code sketch below).
[Figure: neurons of layer $l$ computing pre-activations $z_1^{(l)}, \dots, z_{k_l}^{(l)}$ and activations $a_i^{(l)} = f(z_i^{(l)})$.]
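A hedged sketch of the compact forward-propagation rule, applying $\mathbf{a}^{(l)} = f(\mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})$ layer by layer. The layer sizes, random initialization, and sigmoid activation are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Compute a^(l) = f(W^(l) a^(l-1) + b^(l)) for every layer l."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b    # pre-activation z^(l)
        a = f(z)         # activation a^(l)
    return a

# Illustrative MLP with k_0 = 4 inputs, k_1 = 5 hidden neurons, k_2 = 3 outputs
rng = np.random.default_rng(0)
sizes = [4, 5, 3]
weights = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
biases = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
print(forward(rng.standard_normal(4), weights, biases))
```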
MLP Activation Functions
• The previous theorem establishes MLPs as universal approximators, under the condition that the activation function $f(\cdot)$ of each neuron is continuous.
• Continuous (and differentiable) activation functions:
  - Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}, \quad f: \mathbb{R} \to (0, 1)$.
  - ReLU: $f_{ReLU}(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}, \quad f: \mathbb{R} \to \mathbb{R}^{+}$.
  - Arctangent: $f(x) = \tan^{-1}(x), \quad f: \mathbb{R} \to (-\pi/2, \pi/2)$.
  - Hyperbolic tangent: $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \quad f: \mathbb{R} \to (-1, +1)$.
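The four activation functions listed above translate directly to NumPy; the function names below are mine:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # range (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # range [0, +inf)

def arctan(x):
    return np.arctan(x)               # range (-pi/2, pi/2)

def tanh(x):
    return np.tanh(x)                 # range (-1, +1)

x = np.linspace(-3, 3, 7)
for f in (sigmoid, relu, arctan, tanh):
    print(f.__name__, np.round(f(x), 3))
```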
Softmax Layer
• It is the last layer in a neural network classifier.
• The response of neuron $i$ in the softmax layer $L$ is calculated with regard to the value of its activation function $a_i = f(z_i)$:
  $\hat{y}_i = g(a_i) = \frac{e^{a_i}}{\sum_{j=1}^{k_L} e^{a_j}}, \quad g: \mathbb{R} \to (0, 1), \quad i = 1, \dots, k_L$.
• The responses of the softmax neurons sum up to one:
  $\sum_{i=1}^{k_L} \hat{y}_i = 1$.
• Better representation for mutually exclusive class labels.
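A short sketch of the softmax response. Subtracting the maximum activation before exponentiating is a standard numerical-stability trick that is not mentioned on the slide:

```python
import numpy as np

def softmax(a):
    """y_hat_i = exp(a_i) / sum_j exp(a_j); the responses sum to one."""
    e = np.exp(a - np.max(a))   # shift for numerical stability
    return e / np.sum(e)

a = np.array([2.0, 1.0, 0.1])
y_hat = softmax(a)
print(y_hat, y_hat.sum())       # probabilities, sum == 1.0
```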
MLP Example
• Example architecture with $L = 2$ layers: $n = k_0$ input features, $k_1$ neurons at the first layer and $k_2$ output units.
• Each neuron computes:
  $a_i^{(l)} = f_i\!\left(\sum_{j=1}^{k_{l-1}} w_{ji}^{(l)}\, a_j^{(l-1)} + w_{0i}^{(l)}\right)$.
[Figure: fully connected two-layer MLP. The inputs $x_1, \dots, x_{k_0}$ (layer $l = 0$, "dendrites") feed through synapses $w_{ji}^{(1)}$ into the first-layer neurons $z_1^{(1)}, \dots, z_{k_1}^{(1)}$; their activations $a_1^{(1)}, \dots, a_{k_1}^{(1)}$ feed through synapses $w_{ji}^{(2)}$ into the output neurons $z_1^{(2)}, \dots, z_{k_2}^{(2)}$, producing the outputs $\hat{y}_1 = a_1^{(2)}, \dots, \hat{y}_{k_2} = a_{k_2}^{(2)}$. Bias weights $w_{0i}^{(l)}$ connect to a constant input 1.] A code sketch of this example follows below.
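The two-layer example could be instantiated as follows; the concrete dimensions ($k_0 = 4$, $k_1 = 8$, $k_2 = 3$), the tanh hidden activation, and the softmax output are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
k0, k1, k2 = 4, 8, 3            # input features, hidden neurons, output classes

# Layer l = 1: weights W1 (k1 x k0) and biases b1 (the w_0i^(1) terms)
W1, b1 = rng.standard_normal((k1, k0)), np.zeros(k1)
# Layer l = 2: weights W2 (k2 x k1) and biases b2
W2, b2 = rng.standard_normal((k2, k1)), np.zeros(k2)

def two_layer_mlp(x):
    a1 = np.tanh(W1 @ x + b1)   # hidden activations a^(1)
    z2 = W2 @ a1 + b2           # output pre-activations z^(2)
    e = np.exp(z2 - z2.max())
    return e / e.sum()          # softmax outputs y_hat

print(two_layer_mlp(rng.standard_normal(k0)))
```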
MLP Training
• Mean Square Error (MSE):
  $J(\boldsymbol{\theta}) = J(\mathbf{W}, \mathbf{b}) = \frac{1}{N}\sum_{i=1}^{N} \|\hat{\mathbf{y}}_i - \mathbf{y}_i\|^2$.
• It is suitable for regression and classification.
• Categorical Cross Entropy Error:
  $J_{CCE} = -\sum_{i=1}^{N}\sum_{j=1}^{m} y_{ij} \log \hat{y}_{ij}$.
• It is suitable for classifiers that use softmax output layers.
• Exponential Loss:
  $J(\boldsymbol{\theta}) = \sum_{i=1}^{N} e^{-\beta\, \mathbf{y}_i^T \hat{\mathbf{y}}_i}$.
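The MSE and categorical cross-entropy costs above amount to a few lines of NumPy (a sketch; the small constant inside the logarithm is my addition to avoid $\log 0$):

```python
import numpy as np

def mse(y_hat, y):
    """J = (1/N) * sum_i ||y_hat_i - y_i||^2 over N samples (rows)."""
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))

def categorical_cross_entropy(y_hat, y, eps=1e-12):
    """J_CCE = -sum_i sum_j y_ij * log(y_hat_ij)."""
    return -np.sum(y * np.log(y_hat + eps))

y     = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)   # one-hot targets
y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])    # softmax outputs
print(mse(y_hat, y), categorical_cross_entropy(y_hat, y))
```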
MLP Training
• Differentiation: setting the gradient of the cost to zero,
  $\frac{\partial J}{\partial \boldsymbol{\theta}} = \mathbf{0}$,
  can provide the critical points:
  - Minima, maxima, saddle points.
• Analytical computations of this kind are usually impossible.
• We need to resort to numerical optimization methods.
MLP Training
• Iteratively search the parameter space for the optimal values. In each iteration, apply a small correction to the previous value:
  $\boldsymbol{\theta}(t + 1) = \boldsymbol{\theta}(t) + \Delta\boldsymbol{\theta}$.
• In the neural network case, jointly optimize $\mathbf{W}, \mathbf{b}$:
  $\mathbf{W}(t + 1) = \mathbf{W}(t) + \Delta\mathbf{W}$,
  $\mathbf{b}(t + 1) = \mathbf{b}(t) + \Delta\mathbf{b}$.
• What is an appropriate choice for the correction term?
MLP Training
• Gradient Descent, as applied to Artificial Neural Networks, uses the negative gradient of the cost, scaled by a learning rate $\mu$, as the correction term.
• Equations for the $i$-th neuron in the $l$-th layer:
  $w_{ji}^{(l)}(t + 1) = w_{ji}^{(l)}(t) - \mu \frac{\partial J}{\partial w_{ji}^{(l)}}, \qquad b_i^{(l)}(t + 1) = b_i^{(l)}(t) - \mu \frac{\partial J}{\partial b_i^{(l)}}$.
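A generic gradient-descent step matching the update equations above; the toy objective and the way gradients are supplied (in practice, by backpropagation) are illustrative:

```python
import numpy as np

def gradient_descent_step(params, grads, mu):
    """theta(t+1) = theta(t) - mu * dJ/dtheta, applied to every parameter array."""
    return [p - mu * g for p, g in zip(params, grads)]

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = [np.array([1.0, -2.0]), np.array([0.5])]
for _ in range(50):
    grads = [2.0 * p for p in theta]
    theta = gradient_descent_step(theta, grads, mu=0.1)
print(theta)   # both parameter arrays shrink towards zero
```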
Backpropagation
• We can now state the four essential equations for the implementation of the backpropagation algorithm:
  (BP1) $\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} J \odot \dfrac{\partial f(\mathbf{z}^{(L)})}{\partial \mathbf{z}^{(L)}}$
  (BP2) $\boldsymbol{\delta}^{(l)} = \left(\mathbf{W}^{(l+1)} \boldsymbol{\delta}^{(l+1)}\right) \odot \dfrac{\partial f(\mathbf{z}^{(l)})}{\partial \mathbf{z}^{(l)}}$
  (BP3) $\dfrac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l)}$
  (BP4) $\dfrac{\partial J}{\partial w_{ji}^{(l)}} = \delta_i^{(l)} a_j^{(l-1)}$
Backpropagation Algorithm

Algorithm: Neural Network Training with Gradient Descent and Backpropagation
Input: learning rate $\mu$, dataset $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$
1. Random initialization of the network parameters $(\mathbf{W}, \mathbf{b})$.
2. Define the cost function $J = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$.
3. While the stopping criteria are not met ($\nabla J \neq \mathbf{0}$), do:
   a. Forward pass of the whole batch: $\hat{\mathbf{y}} = f_{MLP}(\mathbf{x}; \mathbf{W}, \mathbf{b})$, $\quad J = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$.
   b. Output-layer deltas: $\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} J \odot \dfrac{\partial f(\mathbf{z}^{(L)})}{\partial \mathbf{z}^{(L)}}$.
   c. For $l = L - 1, \dots, 1$ do:
      $\boldsymbol{\delta}^{(l)} = \left(\mathbf{W}^{(l+1)} \boldsymbol{\delta}^{(l+1)}\right) \odot \dfrac{\partial f(\mathbf{z}^{(l)})}{\partial \mathbf{z}^{(l)}}$,
      $\dfrac{\partial J}{\partial w_{ji}^{(l)}} = \delta_i^{(l)} a_j^{(l-1)}, \qquad \dfrac{\partial J}{\partial b_i^{(l)}} = \delta_i^{(l)}$.
   d. Update the parameters: $\mathbf{W} \leftarrow \mathbf{W} - \mu \nabla_{\mathbf{W}} J(\mathbf{W}, \mathbf{b})$, $\quad \mathbf{b} \leftarrow \mathbf{b} - \mu \nabla_{\mathbf{b}} J(\mathbf{W}, \mathbf{b})$.
4. End.
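A compact NumPy sketch of the training loop above, assuming sigmoid activations, an MSE cost, and a fixed number of epochs instead of the $\nabla J = \mathbf{0}$ stopping criterion; the XOR-style toy data, layer sizes, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def train(X, Y, sizes, mu=1.0, epochs=5000, seed=0):
    """Full-batch gradient descent with backpropagation, cost J = (1/N) sum ||y_hat - y||^2."""
    rng = np.random.default_rng(seed)
    W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
    b = [np.zeros((sizes[l + 1], 1)) for l in range(len(sizes) - 1)]
    N = X.shape[1]                                   # samples are columns
    for _ in range(epochs):
        # Forward pass of the whole batch, storing z^(l) and a^(l)
        a, zs, activations = X, [], [X]
        for Wl, bl in zip(W, b):
            z = Wl @ a + bl
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)
        # (BP1) output-layer deltas for the MSE cost
        delta = (2.0 / N) * (activations[-1] - Y) * sigmoid_prime(zs[-1])
        grads_W, grads_b = [None] * len(W), [None] * len(b)
        grads_W[-1] = delta @ activations[-2].T                 # (BP4)
        grads_b[-1] = delta.sum(axis=1, keepdims=True)          # (BP3)
        # (BP2) propagate the deltas backwards through the hidden layers
        for l in range(len(W) - 2, -1, -1):
            delta = (W[l + 1].T @ delta) * sigmoid_prime(zs[l])
            grads_W[l] = delta @ activations[l].T
            grads_b[l] = delta.sum(axis=1, keepdims=True)
        # Gradient-descent updates: W <- W - mu * dJ/dW, b <- b - mu * dJ/db
        W = [Wl - mu * g for Wl, g in zip(W, grads_W)]
        b = [bl - mu * g for bl, g in zip(b, grads_b)]
    return W, b

# XOR-style toy problem: 2 inputs, 4 hidden neurons, 1 output
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)
W, b = train(X, Y, sizes=[2, 4, 1])

a = X
for Wl, bl in zip(W, b):
    a = sigmoid(Wl @ a + bl)
print(np.round(a, 2))   # outputs should move towards [0, 1, 1, 0]
```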
Optimization Review
• Batch Gradient Descent problems:
  3) Ill conditioning: it takes into account only first-order information (the 1st derivative). The lack of information about the curvature of the cost function may lead to zig-zagging motions during optimization, resulting in very slow convergence.
  4) It may depend too strongly on the initialization of the parameters.
  5) The learning rate must be chosen very carefully.
Optimization Review
• Stochastic Gradient Descent (SGD):
  1. Passing the whole dataset through the network just for a small update of the parameters is inefficient.
  2. Most cost functions used in Machine Learning and Deep Learning applications can be decomposed into a sum over training samples. Instead of computing the gradient over the whole training set, we decompose the training set $\{(\mathbf{x}_i, \mathbf{y}_i),\; i = 1, \dots, N\}$ into $B$ mini-batches (possibly of the same cardinality $m$) and estimate $\nabla_{\boldsymbol{\theta}} J$ from the mini-batch gradients $\nabla_{\boldsymbol{\theta}} J_b$, $b = 1, \dots, B$:
  $\nabla_{\boldsymbol{\theta}} J = \frac{1}{N}\sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} J_{\mathbf{x}_i} \;\approx\; \nabla_{\boldsymbol{\theta}} J_b = \frac{1}{m}\sum_{i=1}^{m} \nabla_{\boldsymbol{\theta}} J_{\mathbf{x}_i}$,
  where the last sum runs over the $m$ samples of mini-batch $b$.
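A sketch of the mini-batch gradient estimate described above: the dataset is shuffled, split into mini-batches of $m$ samples, and each update uses only one batch's gradient. The linear least-squares toy problem and all names are illustrative assumptions:

```python
import numpy as np

def minibatches(X, Y, m, rng):
    """Shuffle the dataset and split it into mini-batches of (at most) m samples."""
    idx = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], m):
        batch = idx[start:start + m]
        yield X[batch], Y[batch]

# Toy problem: fit theta in y = X @ theta with squared error, via mini-batch SGD
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true

theta, mu, m = np.zeros(3), 0.05, 32
for epoch in range(20):
    for Xb, Yb in minibatches(X, Y, m, rng):
        grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - Yb)   # gradient from one mini-batch
        theta -= mu * grad                                # SGD update
print(theta)   # close to theta_true
```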
Optimization Review
• Adaptive Learning Rate Algorithms:
• Up to this point, the learning rate was given as an input to the algorithm and was kept fixed throughout the whole optimization procedure.
• To overcome the difficulties of using a fixed rate, adaptive learning rate algorithms assign a different learning rate to each separate model parameter.
Optimization Review
• Adaptive Learning Rate Algorithms:
  1. AdaGrad algorithm:
     i.   Initialize the gradient accumulation variable $\mathbf{r} = \mathbf{0}$.
     ii.  Compute and accumulate the gradient: $\mathbf{g} \leftarrow \frac{1}{m} \nabla_{\boldsymbol{\theta}} \sum_{i} J\!\left(f(\mathbf{x}_i; \boldsymbol{\theta}), \mathbf{y}_i\right)$, $\quad \mathbf{r} \leftarrow \mathbf{r} + \mathbf{g} \odot \mathbf{g}$.
     iii. Compute the update $\Delta\boldsymbol{\theta} \leftarrow -\dfrac{\mu}{\delta + \sqrt{\mathbf{r}}} \odot \mathbf{g}$ (element-wise, with $\delta$ a small constant for numerical stability) and apply it: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \Delta\boldsymbol{\theta}$.
• Parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate; $\odot$ denotes element-wise multiplication.
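The AdaGrad steps above could be sketched as follows; the toy objective, learning rate, and the constant $\delta$ (added to prevent division by zero) are illustrative:

```python
import numpy as np

def adagrad_step(theta, grad, r, mu=0.1, delta=1e-8):
    """AdaGrad: accumulate squared gradients in r, scale each parameter's step by 1/sqrt(r)."""
    r = r + grad * grad                                 # r <- r + g (element-wise) g
    theta = theta - mu / (delta + np.sqrt(r)) * grad    # delta_theta = -mu/(delta+sqrt(r)) * g
    return theta, r

# Toy objective J(theta) = ||theta||^2, gradient 2 * theta
theta = np.array([5.0, -3.0])
r = np.zeros_like(theta)          # gradient accumulation variable
for _ in range(200):
    grad = 2.0 * theta
    theta, r = adagrad_step(theta, grad, r)
print(theta)   # moves towards the minimum at zero
```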
Generalization
• Good performance on the training set alone is not a satisfying indicator of a good model.
• We are interested in building models that perform well on data that has not been used in the training process. Bad generalization mainly occurs in 2 forms:
1. Overfitting:
• Overfitting can be detected when there is a great discrepancy between the performance of the model on the training set and on any other set. It can happen due to excessive model complexity.
• An example of overfitting would be trying to model a quadratic equation with a 10th-degree polynomial.
Generalization
2. Underfitting:
• On the other hand, underfitting occurs when a model cannot accurately capture the underlying structure of the data.
• Underfitting can be detected by very low performance on the training set.
• Example: trying to model a non-linear equation using linear ones.
Revisiting Activation Functions
• To overcome the shortcomings of the sigmoid activation function, the Rectified Linear Unit (ReLU) was introduced.
• Very easy to compute.
• Due to its quasi-linear form, it leads to faster convergence (it does not saturate at high values).
• Units with zero derivative (negative inputs) will not activate at all.
Training on Large Scale Datasets
• A large number of training samples, in the magnitude of hundreds of thousands.
• Problem: Datasets do not fit in memory.
• Solution: Use the mini-batch SGD method.
• Many classes, in the magnitude of hundreds up to one thousand.
• Problem: Difficult to converge using the MSE error.
• Solution: Use the Categorical Cross Entropy (CCE) loss on a Softmax output.
Towards Deep Learning
• Increasing the network depth (number of layers $L$) can result in negligible weight updates in the first layers, because the corresponding deltas become very small or vanish completely.
• Problem: Vanishing gradients.
• Solution: Replacing the sigmoid with an activation function without an upper bound, like a rectifier (a.k.a. ramp function, ReLU).
• Full connectivity has high demands for memory and computations.
• Very deep fully connected DNNs are difficult to implement.
• New architectures come into play (Convolutional Neural Networks, Deep Autoencoders, etc.).