Multilayer Perceptron Backpropagation Summary


C. Chadoulos, P. Kaplanoglou, S. Papadopoulos, Prof. Ioannis Pitas

Aristotle University of Thessaloniki

[email protected]

www.aiia.csd.auth.gr

Version 2.8


Biological Neuron

⚫ Basic computational unit of the brain.

⚫ Main parts:

• Dendrites ➔ act as inputs.

• Soma ➔ main body of the neuron.

• Axon ➔ acts as output.

⚫ Neurons connect with other neurons via synapses.

Artificial Neurons

⚫ Artificial neurons are mathematical models loosely inspired by their biological counterparts.

⚫ Incoming signals through the axons serve as the input vector:

$\mathbf{x} = [x_1, x_2, \dots, x_n]^T, \quad x_i \in \mathbb{R}.$

⚫ The synaptic weights are grouped in a weight vector:

$\mathbf{w} = [w_1, w_2, \dots, w_n]^T, \quad w_i \in \mathbb{R}.$

⚫ Synaptic integration is modeled as the inner product:

$z = \sum_{i=1}^{n} w_i x_i = \mathbf{w}^T \mathbf{x}.$

[Figure: artificial neuron model — inputs $x_1, \dots, x_n$ with synaptic weights $w_1, \dots, w_n$ are summed into $z$ and passed through a threshold activation function; the parts are labeled by analogy to dendrites, synapses, cell body, and axon.]

Perceptron

⚫ Simplest mathematical model of a neuron (Rosenblatt; McCulloch & Pitts).

⚫ Inputs are binary, $x_i \in \{0, 1\}$.

⚫ Produces a single binary output, $y_i \in \{0, 1\}$, through an activation function $f(\cdot)$.

⚫ The output $y_i$ signifies whether the neuron will fire or not.

⚫ Firing threshold:

$\mathbf{w}^T \mathbf{x} \geq -b \;\Rightarrow\; \mathbf{w}^T \mathbf{x} + b \geq 0.$
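To make the firing rule concrete, here is a minimal Python sketch of a perceptron; the specific weights, bias, and the AND-gate example are illustrative assumptions, not values from the slides:

```python
import numpy as np

def perceptron_fire(x, w, b):
    """Return 1 if the weighted input exceeds the firing threshold, else 0."""
    z = np.dot(w, x) + b          # synaptic integration plus bias
    return 1 if z >= 0 else 0     # threshold (step) activation

# Illustrative example: a 2-input perceptron acting as an AND gate (assumed values).
w = np.array([1.0, 1.0])          # assumed synaptic weights
b = -1.5                          # assumed bias (negative firing threshold)
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_fire(np.array(x), w, b))
```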

MLP Architecture

⚫ Multilayer perceptrons (feed-forward neural networks) typically consist of $L + 1$ layers, with $k_l$ neurons in layer $l$, $l = 0, \dots, L$.

⚫ The first layer (input layer, $l = 0$) contains $n$ inputs, where $n$ is the dimensionality of the input sample vector.

⚫ The $L - 1$ hidden layers $l = 1, \dots, L - 1$ can contain any number of neurons.

MLP Architecture

⚫ The output layer $l = L$ typically contains either one neuron (especially when dealing with regression problems) or $m$ neurons, where $m$ is the number of classes.

⚫ Thus, neural networks can be described by the following characteristic numbers:

1. Depth: the number of hidden layers $L$.

2. Width: the number of neurons per layer, $k_l$, $l = 1, \dots, L$.

3. Output size: $k_L = 1$ (regression) or $k_L = m$ (classification with $m$ classes).

MLP Forward Propagation

⚫ Following the same procedure for the $l$-th hidden layer, we get:

$z_j^{(l)} = \sum_{k=1}^{k_{l-1}} w_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}, \qquad a_j^{(l)} = f\big(z_j^{(l)}\big), \quad j = 1, \dots, k_l.$

⚫ In compact form:

$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = f\big(\mathbf{z}^{(l)}\big).$

[Figure: the pre-activations $z_1^{(l)}, \dots, z_{k_l}^{(l)}$ of layer $l$ are passed through the activation function $f(\cdot)$ to produce the activations $a_1^{(l)}, \dots, a_{k_l}^{(l)}$.]
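As a concrete illustration of the forward pass, here is a minimal NumPy sketch; the layer sizes, the choice of a sigmoid activation, and the random initialization are assumptions for the example, not prescriptions from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Propagate input x through all layers: a^(l) = f(W^(l) a^(l-1) + b^(l))."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b          # compact form: z^(l) = W^(l) a^(l-1) + b^(l)
        a = f(z)               # a^(l) = f(z^(l))
    return a

# Illustrative network: n = 3 inputs, one hidden layer of 4 neurons, 2 outputs (assumed sizes).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases  = [np.zeros(4), np.zeros(2)]
print(forward(np.array([0.5, -1.0, 2.0]), weights, biases))
```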

MLP Activation Functions

⚫ The previously stated theorem verifies that MLPs are universal approximators, under the condition that the activation function $\phi(\cdot)$ of each neuron is continuous.

⚫ Continuous (and differentiable) activation functions:

⚫ Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}, \quad \mathbb{R} \to (0, 1)$

⚫ ReLU: $\mathrm{ReLU}(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}, \quad \mathbb{R} \to \mathbb{R}_{+}$

⚫ Arctangent: $f(x) = \tan^{-1}(x), \quad \mathbb{R} \to (-\pi/2, \pi/2)$

⚫ Hyperbolic tangent: $f(x) = \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \quad \mathbb{R} \to (-1, +1)$
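A minimal NumPy sketch of these activation functions, for illustration only (the function names and the sample grid are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # R -> (0, 1)

def relu(x):
    return np.maximum(0.0, x)              # R -> [0, inf)

def arctan(x):
    return np.arctan(x)                    # R -> (-pi/2, pi/2)

def tanh(x):
    return np.tanh(x)                      # R -> (-1, 1)

x = np.linspace(-3, 3, 7)                  # illustrative sample grid
for f in (sigmoid, relu, arctan, tanh):
    print(f.__name__, np.round(f(x), 3))
```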

Softmax Layer

• It is the last layer in a neural network classifier.

• The response of neuron $i$ in the softmax layer $L$ is calculated with regard to the value of its activation function, $a_i = f(z_i)$:

$\hat{y}_i = g(a_i) = \dfrac{e^{a_i}}{\sum_{k=1}^{k_L} e^{a_k}}: \mathbb{R} \to (0, 1), \quad i = 1, \dots, k_L.$

• The responses of the softmax neurons sum up to one:

$\sum_{i=1}^{k_L} \hat{y}_i = 1.$

• This gives a better representation for mutually exclusive class labels.
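A minimal softmax sketch in NumPy; subtracting the maximum activation before exponentiating is a standard numerical-stability trick assumed here, not something stated on the slides:

```python
import numpy as np

def softmax(a):
    """Map activations a_1..a_kL to responses that are positive and sum to one."""
    e = np.exp(a - np.max(a))   # shift by max(a) for numerical stability (added safeguard)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])   # illustrative activations of the last layer
y_hat = softmax(a)
print(y_hat, y_hat.sum())       # the responses sum to 1
```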

MLP Example

• Example architecture with $L = 2$ layers and $n = m_0$ input features, $m_1$ neurons at the first layer, and $m_2$ output units:

$a_j^{(l)} = f_l\!\left(\sum_{i=1}^{m_{l-1}} w_{ij}^{(l)}\, a_i^{(l-1)} + w_{0j}^{(l)}\right).$

[Figure: a fully connected two-layer network. The input layer $l = 0$ holds $x_1, \dots, x_n$; layer $l = 1$ computes $z_1^{(1)}, \dots, z_{m_1}^{(1)}$ with weights $w_{ij}^{(1)}$ and bias weights $w_{0j}^{(1)}$; the output layer $l = 2$ produces $\hat{y}_1 = a_1^{(2)}, \dots, \hat{y}_{m_2} = a_{m_2}^{(2)}$ with weights $w_{ij}^{(2)}$ and bias weights $w_{0j}^{(2)}$. The connections play the roles of dendrites, synapses, and axons.]
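Written out for the $L = 2$ example above, the full forward mapping can be summarized as follows (a sketch in the slide's notation, with the biases collected into the terms $w_{0j}^{(l)}$):

$a_j^{(1)} = f_1\!\left(\sum_{i=1}^{m_0} w_{ij}^{(1)} x_i + w_{0j}^{(1)}\right), \quad j = 1, \dots, m_1,$

$\hat{y}_j = a_j^{(2)} = f_2\!\left(\sum_{i=1}^{m_1} w_{ij}^{(2)} a_i^{(1)} + w_{0j}^{(2)}\right), \quad j = 1, \dots, m_2.$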

MLP Training

• Mean Square Error (MSE):

$J(\boldsymbol{\theta}) = J(\mathbf{W}, \mathbf{b}) = \dfrac{1}{N} \sum_{i=1}^{N} \| \hat{\mathbf{y}}_i - \mathbf{y}_i \|^2.$

• It is suitable for regression and classification.

• Categorical Cross Entropy Error:

$J_{CCE} = -\sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \log \hat{y}_{ij}.$

• It is suitable for classifiers that use softmax output layers.

• Exponential Loss:

$J(\boldsymbol{\theta}) = \sum_{i=1}^{N} e^{-\beta \mathbf{y}_i^T \hat{\mathbf{y}}_i}.$
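A minimal sketch of these three losses for a batch of $N$ samples in NumPy; one-hot targets are assumed for the cross-entropy, and the small epsilon guarding the logarithm is an added numerical safeguard, not part of the slides:

```python
import numpy as np

def mse(y_hat, y):
    # J = (1/N) * sum_i ||y_hat_i - y_i||^2, rows are samples
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))

def categorical_cross_entropy(y_hat, y, eps=1e-12):
    # J = -sum_i sum_j y_ij * log(y_hat_ij), y assumed one-hot
    return -np.sum(y * np.log(y_hat + eps))

def exponential_loss(y_hat, y, beta=1.0):
    # J = sum_i exp(-beta * y_i^T y_hat_i)
    return np.sum(np.exp(-beta * np.sum(y * y_hat, axis=1)))

y     = np.array([[1.0, 0.0], [0.0, 1.0]])   # illustrative targets
y_hat = np.array([[0.8, 0.2], [0.3, 0.7]])   # illustrative predictions
print(mse(y_hat, y), categorical_cross_entropy(y_hat, y), exponential_loss(y_hat, y))
```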

MLP Training

• Differentiation: setting $\dfrac{\partial J}{\partial \mathbf{W}} = \dfrac{\partial J}{\partial \mathbf{b}} = \mathbf{0}$ can provide the critical points:

• Minima, maxima, saddle points.

• Analytical computations of this kind are usually impossible.

• We need to resort to numerical optimization methods.

MLP Training

• Iteratively search the parameter space for the optimal values. In each iteration, apply a small correction to the previous value:

$\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) + \Delta\boldsymbol{\theta}.$

• In the neural network case, jointly optimize $\mathbf{W}, \mathbf{b}$:

$\mathbf{W}(t+1) = \mathbf{W}(t) + \Delta\mathbf{W},$

$\mathbf{b}(t+1) = \mathbf{b}(t) + \Delta\mathbf{b}.$

• What is an appropriate choice for the correction term?

MLP Training

[Figure: steepest descent on a function surface.]

MLP Training

• Gradient Descent, as applied to Artificial Neural Networks:

• Equations for the $j$-th neuron in the $l$-th layer:

$w_{jk}^{(l)}(t+1) = w_{jk}^{(l)}(t) - \mu \dfrac{\partial J}{\partial w_{jk}^{(l)}}, \qquad b_j^{(l)}(t+1) = b_j^{(l)}(t) - \mu \dfrac{\partial J}{\partial b_j^{(l)}}.$

Backpropagation

⚫ We can now state the four essential equations for the implementation of the backpropagation algorithm:

$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} J \circ \dfrac{d f(\mathbf{z}^{(L)})}{d \mathbf{z}^{(L)}}$  (BP1)

$\boldsymbol{\delta}^{(l)} = \big(\mathbf{W}^{(l+1)}\big)^T \boldsymbol{\delta}^{(l+1)} \circ \dfrac{d f(\mathbf{z}^{(l)})}{d \mathbf{z}^{(l)}}$  (BP2)

$\dfrac{\partial J}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} a_k^{(l-1)}$  (BP3)

$\dfrac{\partial J}{\partial b_j^{(l)}} = \delta_j^{(l)}$  (BP4)

Backpropagation Algorithm

Algorithm: Neural Network Training with Gradient Descent and Backpropagation

Input: learning rate $\mu$, dataset $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$

Random initialization of the network parameters $(\mathbf{W}, \mathbf{b})$

Define the cost function $J = \dfrac{1}{N} \sum_{i=1}^{N} \| \mathbf{y}_i - \hat{\mathbf{y}}_i \|^2$

while optimization criteria not met do

    Forward pass of the whole batch: $\hat{\mathbf{y}} = \mathbf{f}_{NN}(\mathbf{x}; \mathbf{W}, \mathbf{b})$

    $J = \dfrac{1}{N} \sum_{i=1}^{N} \| \mathbf{y}_i - \hat{\mathbf{y}}_i \|^2$

    $\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}} J \circ \dfrac{d f(\mathbf{z}^{(L)})}{d \mathbf{z}^{(L)}}$    (BP1)

    for $l = L - 1, \dots, 1$ do

        $\boldsymbol{\delta}^{(l)} = \big(\mathbf{W}^{(l+1)}\big)^T \boldsymbol{\delta}^{(l+1)} \circ \dfrac{d f(\mathbf{z}^{(l)})}{d \mathbf{z}^{(l)}}$    (BP2)

        $\dfrac{\partial J}{\partial w_{jk}^{(l)}} = \delta_j^{(l)} a_k^{(l-1)}$    (BP3)

        $\dfrac{\partial J}{\partial b_j^{(l)}} = \delta_j^{(l)}$    (BP4)

    end for

    $\mathbf{W} \leftarrow \mathbf{W} - \mu \nabla_{\mathbf{W}} J(\mathbf{W}, \mathbf{b})$

    $\mathbf{b} \leftarrow \mathbf{b} - \mu \nabla_{\mathbf{b}} J(\mathbf{W}, \mathbf{b})$

end while

Optimization Review

⚫ Batch Gradient Descent problems:

3) Ill-conditioning: it takes into account only first-order information (the 1st derivative). Having no information about the curvature of the cost function may lead to zig-zagging motions during optimization, resulting in very slow convergence.

4) It may depend too strongly on the initialization of the parameters.

5) The learning rate must be chosen very carefully.

Optimization Review

⚫ Stochastic Gradient Descent (SGD):

1. Passing the whole dataset through the network just for a small update of the parameters is inefficient.

2. Most cost functions used in Machine Learning and Deep Learning applications can be decomposed into a sum over training samples. Instead of computing the gradient over the whole training set, we decompose the training set $\{(\mathbf{x}_i, \mathbf{y}_i), \ i = 1, \dots, N\}$ into $B$ mini-batches (possibly of the same cardinality $S$) and estimate $\nabla_{\boldsymbol{\theta}} J$ from the mini-batch gradients $\nabla_{\boldsymbol{\theta}} J_b$, $b = 1, \dots, B$:

$\dfrac{1}{S} \sum_{s=1}^{S} \nabla_{\boldsymbol{\theta}} J_{\mathbf{x}_s} \approx \dfrac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} J_{\mathbf{x}_i} = \nabla_{\boldsymbol{\theta}} J \;\Rightarrow\; \nabla_{\boldsymbol{\theta}} J \approx \dfrac{1}{S} \sum_{s=1}^{S} \nabla_{\boldsymbol{\theta}} J_{\mathbf{x}_s}.$
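A minimal sketch of mini-batch sampling for SGD; the gradient function, batch size, and the linear-regression example used to exercise it are placeholders for illustration:

```python
import numpy as np

def sgd(theta, X, Y, grad_fn, mu=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: estimate the full gradient from S samples per update."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(N)                     # shuffle the training set
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]      # one mini-batch of size <= S
            g = grad_fn(theta, X[idx], Y[idx])         # (1/S) * sum of per-sample gradients
            theta = theta - mu * g                     # gradient descent step
    return theta

# Illustrative use: linear regression, so the mini-batch gradient has a closed form.
X = np.random.default_rng(1).standard_normal((200, 3))
true_w = np.array([1.0, -2.0, 0.5])                    # assumed ground-truth weights
Y = X @ true_w
grad_fn = lambda w, Xb, Yb: 2.0 * Xb.T @ (Xb @ w - Yb) / len(Yb)
print(np.round(sgd(np.zeros(3), X, Y, grad_fn, mu=0.05, epochs=50), 2))
```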

Optimization Review

• Adaptive Learning Rate Algorithms:

• Up to this point, the learning rate was given as an input to the algorithm and was kept fixed throughout the whole optimization procedure.

• To overcome the difficulties of using a fixed rate, adaptive learning rate algorithms assign a different learning rate to each separate model parameter.

Optimization Review

• Adaptive Learning Rate Algorithms:

1. AdaGrad algorithm:

i. Initialize the gradient accumulation variable $\mathbf{r} = \mathbf{0}$.

ii. Compute and accumulate the gradient: $\mathbf{g} \leftarrow \dfrac{1}{m} \nabla_{\boldsymbol{\theta}} \sum_i J\big(f(\mathbf{x}^{(i)}; \boldsymbol{\theta}), \mathbf{y}^{(i)}\big)$, $\quad \mathbf{r} \leftarrow \mathbf{r} + \mathbf{g} \circ \mathbf{g}$.

iii. Compute the update $\Delta\boldsymbol{\theta} \leftarrow -\dfrac{\mu}{\varepsilon + \sqrt{\mathbf{r}}} \circ \mathbf{g}$ and apply it: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \Delta\boldsymbol{\theta}$.

Parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their learning rate. Here $\circ$ denotes element-wise multiplication and the square root is applied element-wise.
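A minimal sketch of the AdaGrad update; the gradient function, step count, hyperparameters, and the quadratic toy objective are illustrative, and epsilon guards against division by zero as in step iii:

```python
import numpy as np

def adagrad(theta, grad_fn, steps=100, mu=0.1, eps=1e-8):
    """AdaGrad: per-parameter learning rates shrink with accumulated squared gradients."""
    r = np.zeros_like(theta)                   # i.  gradient accumulation variable r = 0
    for _ in range(steps):
        g = grad_fn(theta)                     # ii. compute the (mini-batch) gradient
        r += g * g                             #     accumulate r <- r + g ∘ g
        delta = -mu / (eps + np.sqrt(r)) * g   # iii. per-parameter scaled update
        theta = theta + delta                  #      apply the update
    return theta

# Illustrative use: minimize J(theta) = ||theta - [1, -2]||^2 (assumed toy objective).
grad_fn = lambda th: 2.0 * (th - np.array([1.0, -2.0]))
print(np.round(adagrad(np.zeros(2), grad_fn, steps=500), 2))
```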

Generalization

• Good performance on the training set alone is not a satisfying indicator of a good model.

• We are interested in building models that can perform well on data that has not been used in the training process. Bad generalization mainly occurs in two forms:

1. Overfitting:

• Overfitting can be detected when there is a great discrepancy between the performance of the model on the training set and its performance on any other set. It can happen due to excessive model complexity.

• An example of overfitting would be trying to model a quadratic equation with a 10th-degree polynomial.

Generalization

2. Underfitting:

• On the other hand, underfitting occurs when a model cannot accurately capture the underlying structure of the data.

• Underfitting can be detected by very low performance on the training set.

• Example: trying to model a non-linear equation using linear ones.

Revisiting Activation Functions

• To overcome the shortcomings of the sigmoid activation function, the Rectified Linear Unit (ReLU) was introduced.

• It is very easy to compute.

• Due to its quasi-linear form, it leads to faster convergence (it does not saturate at high values).

• Units with a zero derivative will not activate at all.

Training on Large Scale Datasets

• A large number of training samples, in the magnitude of hundreds of thousands.

• Problem: datasets do not fit in memory.

• Solution: use the mini-batch SGD method.

• Many classes, in the magnitude of hundreds up to one thousand.

• Problem: it is difficult to converge using the MSE error.

• Solution: use the Categorical Cross Entropy (CCE) loss on a softmax output.

Towards Deep Learning

⚫ Increasing the network depth (number of layers) $L$ can result in negligible weight updates in the first layers, because the corresponding deltas become very small or vanish completely.

⚫ Problem: vanishing gradients.

⚫ Solution: replace the sigmoid with an activation function without an upper bound, such as a rectifier (a.k.a. ramp function, ReLU).

• Full connectivity has high demands in memory and computation.

• Very deep fully connected DNNs are difficult to implement.

• New architectures come into play (Convolutional Neural Networks, Deep Autoencoders, etc.).

Bibliography

[HAY2009] S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009.

[BIS2006] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[GOO2016] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.

[THEO2011] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Elsevier, 2011.

[ZUR1992] J. M. Zurada, Introduction to Artificial Neural Systems, Vol. 8, St. Paul: West Publishing Company, 1992.

[ROS1958] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review 65.6 (1958): 386.

[JAIN1996] A. K. Jain, J. Mao, K. M. Mohiuddin, "Artificial neural networks: A tutorial," Computer 29.3 (1996): 31-44.

[YEG2009] B. Yegnanarayana, Artificial Neural Networks, PHI Learning Pvt. Ltd., 2009.

[DAN2013] G. Daniel, Principles of Artificial Neural Networks, Vol. 7, World Scientific, 2013.

[HOP1988] J. J. Hopfield, "Artificial neural networks," IEEE Circuits and Devices Magazine 4.5 (1988): 3-10.

[SZE2013] C. Szegedy, A. Toshev, D. Erhan, "Deep neural networks for object detection," Advances in Neural Information Processing Systems, 2013.

[SAL2016] T. Salimans, D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," Advances in Neural Information Processing Systems, 2016.

[MII2019] R. Miikkulainen, et al., "Evolving deep neural networks," Artificial Intelligence in the Age of Neural Networks and Brain Computing, Academic Press, 2019, pp. 293-312.

Q & A

Thank you very much for your attention!

Contact: Prof. I. Pitas

[email protected]