Optimization for Deep Networks

Ishan Misra

Transcript of Optimization for Deep Networks

Page 1: Optimization for Deep Networks

Optimization for Deep Networks
Ishan Misra

Page 2: Optimization for Deep Networks

Overview

• Vanilla SGD

• SGD + Momentum

• NAG

• Rprop

• AdaGrad

• RMSProp

• AdaDelta

• Adam

Page 3: Optimization for Deep Networks

More tricks

• Batch Normalization

• Natural Gradients (Natural Neural Networks)

Page 4: Optimization for Deep Networks

Gradient (Steepest) Descent

• Move in the opposite direction of the gradient
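As a minimal sketch (not from the slides), here is gradient descent on a toy quadratic loss in NumPy; the matrix `A`, vector `b`, the step size, and the iteration count are arbitrary choices for illustration.

```python
import numpy as np

# Toy quadratic loss f(theta) = 0.5 * ||A theta - b||^2 and its gradient.
A = np.array([[3.0, 0.2], [0.2, 1.0]])
b = np.array([1.0, -1.0])

def grad(theta):
    return A.T @ (A @ theta - b)

theta = np.zeros(2)
lr = 0.1                                 # step size / learning rate
for _ in range(100):
    theta = theta - lr * grad(theta)     # move opposite to the gradient

print(theta)                             # approaches the minimizer of f
```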

Page 5: Optimization for Deep Networks

Conjugate Gradient Methods

• See Møller, 1993 [A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning] and Martens, 2010 [Deep Learning via Hessian-Free Optimization]

Page 6: Optimization for Deep Networks

Notation

𝜃 : parameters of the network

𝑓(𝜃) : loss function of the network parameters

Page 7: Optimization for Deep Networks

Properties of Loss function for SGD

Loss function over all samples must decompose into a loss function per sample
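In symbols (notation assumed here, not shown on the slide): with 𝑁 samples and per-sample losses 𝑓ᵢ, this means 𝑓(𝜃) = (1/𝑁) Σᵢ 𝑓ᵢ(𝜃), so the full gradient is the average of per-sample gradients and can be estimated from a single sample or mini-batch.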

Page 8: Optimization for Deep Networks

Vanilla SGD

𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂 ∇𝑓(𝜃(𝑡)), with the gradient computed on the current sample or mini-batch

𝜃 : parameters of the network
𝑓 : loss function of the network parameters
𝑡 : iteration number
𝜂 : step size / learning rate
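A minimal sketch of the update above, assuming the per-sample decomposition from the previous slide; the toy linear-regression loss and all constants are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # toy inputs
true_w = np.arange(5.0)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def sample_grad(w, i):
    # Gradient of the per-sample squared error 0.5 * (x_i . w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(5)
eta = 0.01                                # step size / learning rate
for _ in range(5000):
    i = rng.integers(len(X))              # pick one sample at random
    w = w - eta * sample_grad(w, i)       # theta(t+1) = theta(t) - eta * grad f_i
```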

Page 9: Optimization for Deep Networks

SGD + Momentum

• Plain SGD can make erratic updates on non-smooth loss functions
  • Consider an outlier example which “throws off” the learning process

• Maintain some history of updates

• Physics example
  • A moving ball acquires “momentum”, at which point it becomes less sensitive to the direct force (the gradient)

Page 10: Optimization for Deep Networks

SGD + Momentum

𝑣(𝑡+1) = 𝜇 𝑣(𝑡) − 𝜂 ∇𝑓(𝜃(𝑡))
𝜃(𝑡+1) = 𝜃(𝑡) + 𝑣(𝑡+1)

𝜃 : parameters of the network
𝑓 : loss function of the network parameters
𝑡 : iteration number
𝜂 : step size / learning rate
𝜇 : momentum
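A sketch of one common form of the momentum update (heavy-ball style, matching the equations above up to convention); the ill-conditioned quadratic and the hyperparameters are illustrative.

```python
import numpy as np

A = np.diag([10.0, 1.0])                  # ill-conditioned toy quadratic loss 0.5 * theta^T A theta
def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)                  # velocity: decaying history of past updates
eta, mu = 0.01, 0.9                       # learning rate, momentum

for _ in range(200):
    v = mu * v - eta * grad(theta)        # accumulate a decaying sum of past gradients
    theta = theta + v
```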

Page 11: Optimization for Deep Networks

SGD + Momentum

• At iteration 𝑡 you add the update from iteration 𝑡−𝑛 weighted by 𝜇ⁿ

• You effectively multiply your updates by 1/(1−𝜇), since the geometric series 1 + 𝜇 + 𝜇² + … sums to 1/(1−𝜇)

Page 12: Optimization for Deep Networks

Nesterov Accelerated Gradient (NAG)

• Sutskever et al., 2013

• First make a jump as directed by momentum

• Then depending on where you land, correct the parameters

Page 13: Optimization for Deep Networks

NAG

𝑣(𝑡+1) = 𝜇 𝑣(𝑡) − 𝜂 ∇𝑓(𝜃(𝑡) + 𝜇 𝑣(𝑡))
𝜃(𝑡+1) = 𝜃(𝑡) + 𝑣(𝑡+1)

𝜃 : parameters of the network
𝑓 : loss function of the network parameters
𝑡 : iteration number
𝜂 : step size / learning rate
𝜇 : momentum
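A sketch of the NAG idea in the same toy setting as the momentum sketch: jump along the momentum direction first, then correct using the gradient at the look-ahead point. The constants are illustrative.

```python
import numpy as np

A = np.diag([10.0, 1.0])                  # same toy quadratic as before
def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
eta, mu = 0.01, 0.9

for _ in range(200):
    lookahead = theta + mu * v            # first make the jump directed by momentum
    v = mu * v - eta * grad(lookahead)    # then correct using the gradient where you land
    theta = theta + v
```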

Page 14: Optimization for Deep Networks

NAG vs Standard Momentum

Page 15: Optimization for Deep Networks

Why anything new (beyond Momentum/NAG)?

• How to set learning rate and decay of learning rates?

• Ideally want adaptive learning rates

Page 16: Optimization for Deep Networks

Why anything new (beyond Momentum/NAG)?

• Neurons in each layer learn differently
  • Gradient magnitudes vary across layers

• Early layers get “vanishing gradients”

• Should ideally use separate adaptive learning rates
  • One of the reasons for having “gain” or lr multipliers in Caffe

• Adaptive learning rate algorithms
  • Jacobs 1989: adapt the rate based on agreement in sign between the current gradient for a weight and the velocity for that weight

• Use larger mini-batches

Page 17: Optimization for Deep Networks

Adaptive Learning Rates

Page 18: Optimization for Deep Networks

Resilient Propagation (Rprop)

• Riedmiller and Braun 1993

• Address the problem of adaptive learning rate

• Increase the learning rate for a weight multiplicatively if signs of last two gradients agree

• Else decrease learning rate multiplicatively
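A per-weight sketch of the rule described above; the increase/decrease factors (1.2, 0.5) and the step-size bounds are commonly quoted defaults, assumed here rather than taken from the slide.

```python
import numpy as np

def rprop_step(theta, g, g_prev, step, inc=1.2, dec=0.5,
               step_min=1e-6, step_max=50.0):
    """One Rprop update; theta, g, g_prev, step are arrays of the same shape."""
    agree = g * g_prev > 0                        # signs of last two gradients agree
    disagree = g * g_prev < 0
    step = np.where(agree, np.minimum(step * inc, step_max), step)     # grow step multiplicatively
    step = np.where(disagree, np.maximum(step * dec, step_min), step)  # shrink step multiplicatively
    theta = theta - np.sign(g) * step             # update uses only the sign of the gradient
    return theta, step
```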

Page 19: Optimization for Deep Networks

Rprop Update

Page 20: Optimization for Deep Networks

Rprop Initialization

• Initialize all updates at iteration 0 to a constant value
  • If you set both learning rates to 1, you get the “Manhattan update rule”

• Rprop effectively divides the gradient by its magnitude
  • You never update using the gradient itself, only by its sign

Page 21: Optimization for Deep Networks

Problems with Rprop

• Consider a weight that gets updates of 0.1 in nine mini-batches, and −0.9 in the tenth mini-batch

• SGD would keep this weight roughly where it started

• Rprop would increment the weight nine times by 𝛿, and then for the tenth update decrease the weight by 𝛿
  • Effective update: 9𝛿 − 𝛿 = 8𝛿

• Across mini-batches we scale updates very differently

Page 22: Optimization for Deep Networks

Adaptive Gradient (AdaGrad)

• Duchi et al., 2010

• We need to scale updates across mini-batches similarly

• Use the accumulated history of gradients as an indicator for scaling
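A sketch of the AdaGrad scaling: accumulate squared gradients per parameter and divide the step by the square root of that sum; `eps` is a small constant for numerical stability (an assumption, not from the slide).

```python
import numpy as np

def adagrad_step(theta, g, sum_sq, eta=0.01, eps=1e-8):
    """One AdaGrad update; sum_sq accumulates squared gradients per parameter."""
    sum_sq = sum_sq + g * g                          # ever-growing sum of squared gradients
    theta = theta - eta * g / (np.sqrt(sum_sq) + eps)
    return theta, sum_sq
```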

Page 23: Optimization for Deep Networks

Problems with AdaGrad

• Lowers the update size very aggressively, since the accumulated sum of squared gradients only grows

Page 24: Optimization for Deep Networks

RMSProp = Rprop + SGD

• Tieleman & Hinton, 2012 (Coursera Lecture 6, slide 29)

• Scale updates similarly across mini-batches

• Scale by a decaying average of the squared gradient
  • Rather than the sum of squared gradients as in AdaGrad
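A sketch of the RMSProp step as described above; the decay of 0.9 follows the Coursera lecture notes, while the learning rate and `eps` are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(theta, g, mean_sq, eta=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update; mean_sq is a decaying average of squared gradients."""
    mean_sq = decay * mean_sq + (1.0 - decay) * g * g
    theta = theta - eta * g / (np.sqrt(mean_sq) + eps)
    return theta, mean_sq
```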

Page 25: Optimization for Deep Networks

RMSProp

• Has shown success for training Recurrent Models

• Using Momentum generally does not show much improvement

Page 26: Optimization for Deep Networks

Fancy RMSProp

• “No more pesky learning rates” – Schaul et al., 2013
  • Computes a diagonal Hessian and uses something similar to RMSProp

• Diagonal Hessian computation requires an additional forward-backward pass
  • Double the time of SGD

Page 27: Optimization for Deep Networks

Units of update

• SGD update is in terms of gradient

Page 28: Optimization for Deep Networks

Unitless updates

• Updates are not in units of parameters

Page 29: Optimization for Deep Networks

Hessian updates

Page 30: Optimization for Deep Networks

Hessian gives correct units
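As a one-line dimensional check (not reproduced from the slide): if 𝑓 is unitless, ∂𝑓/∂𝜃 has units of 1/𝜃, so the SGD step 𝜂 ∂𝑓/∂𝜃 is not in units of 𝜃; the Newton-style step (∂²𝑓/∂𝜃²)⁻¹ ∂𝑓/∂𝜃 has units of 𝜃² · (1/𝜃) = 𝜃, which is the right unit for a parameter update.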

Page 31: Optimization for Deep Networks

AdaDelta

• Zeiler, 2012

• Get updates that match units

• Keep properties from RMSProp

• Updates should be of the form Δ𝜃 = −(numerator in units of 𝜃) / (denominator in units of ∇𝑓) · ∇𝑓

Page 32: Optimization for Deep Networks

AdaDelta

• Approximate the denominator by sqrt(decaying average of squared gradients)

• Approximate the numerator by sqrt(decaying average of squared updates)
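A sketch assembling the two approximations above into the AdaDelta step (numerator: RMS of past updates; denominator: RMS of past gradients); `rho` and `eps` follow the paper’s notation, with typical values assumed.

```python
import numpy as np

def adadelta_step(theta, g, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One AdaDelta update; both running averages are per-parameter arrays."""
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
    # Step has units of the parameters: RMS of past updates over RMS of gradients.
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * g
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta * delta
    theta = theta + delta
    return theta, avg_sq_grad, avg_sq_delta
```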


Page 34: Optimization for Deep Networks

AdaDelta Update Rule

Page 35: Optimization for Deep Networks

Problems with AdaDelta

• The moving averages are biased toward their initial values (typically zero) early in training

Page 36: Optimization for Deep Networks

Problems with AdaDelta

• Not the most intuitive first shot at fixing the “unit updates” problem

• This may just be me nitpicking

• Why do the updates have a time delay?
  • There is some explanation in the paper, but I’m not convinced …

Page 37: Optimization for Deep Networks

Adam

• Kingma & Ba, 2015

• Keeps decaying averages of the gradient and of the squared gradient

• Bias correction

Page 38: Optimization for Deep Networks

Adam update rule

Updates are not in the correct units (the same “units of update” issue discussed earlier)

Simplification: the form shown does not include a decay over 𝛾₁ (the first-moment decay rate)
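A sketch of the Adam step with bias correction, written with the paper’s β₁, β₂ in place of the slide’s 𝛾; the default constants follow the paper.

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (1-indexed)."""
    m = beta1 * m + (1.0 - beta1) * g        # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g    # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```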

Page 39: Optimization for Deep Networks

AdaMax: a variant of Adam (based on the infinity norm)

Page 40: Optimization for Deep Networks

Adam Results – Logistic Regression

Page 41: Optimization for Deep Networks

Adam Results - MLP

Page 42: Optimization for Deep Networks

Adam Results – Conv Nets

Page 43: Optimization for Deep Networks

Visualization

Alec Radford

Page 44: Optimization for Deep Networks

Batch Normalization
Ioffe and Szegedy, 2015

Page 45: Optimization for Deep Networks

Distribution of input

• Having a fixed input distribution is known to help training of linear classifiers
  • Normalize inputs for SVMs

• Normalize inputs for Deep Networks

Page 46: Optimization for Deep Networks

Distribution of input at each layer

• Each layer would benefit if its input had a constant distribution

Page 47: Optimization for Deep Networks

Normalize the input to each layer!

Page 48: Optimization for Deep Networks

Is this normalization a good idea?

• Consider inputs to a sigmoid layer

• If its inputs are normalized, the sigmoid may never “saturate”, keeping the layer in its linear regime

Page 49: Optimization for Deep Networks

Modify normalization …

• Accommodate the identity transform: scale and shift the normalized value, 𝑦 = 𝛾 𝑥̂ + 𝛽

• 𝛾, 𝛽 : learned parameters
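A training-time sketch of the batch-norm transform for a fully connected layer: normalize each feature over the mini-batch, then apply the learned scale 𝛾 and shift 𝛽 (which can undo the normalization). Running statistics for inference are omitted.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). gamma, beta: (features,) learned parameters."""
    mu = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized input
    return gamma * x_hat + beta              # scale and shift (can recover the identity)
```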

Page 50: Optimization for Deep Networks

Batch Normalization Results

Page 51: Optimization for Deep Networks

Batch Normalization for ensembles

Page 52: Optimization for Deep Networks

Natural Gradients
Related work: PRONG, or Natural Neural Networks (Desjardins et al., 2015)

Page 53: Optimization for Deep Networks

Gradients and orthonormal coordinates

• In Euclidean space, we define length using orthonormal coordinates

Page 54: Optimization for Deep Networks

What happens on a manifold?

• Use metric tensors (generalizes dot product to manifolds)

G is generally a symmetric PSD matrix; it generalizes the dot product via ⟨𝑢, 𝑣⟩_G = 𝑢ᵀ G 𝑣.

It is a tensor because it transforms as G′ = Jᵀ G J under a change of coordinates, where J is the Jacobian of the coordinate change.

Page 55: Optimization for Deep Networks

Gradient descent revisited

• What exactly does a gradient descent update step do?

Page 56: Optimization for Deep Networks

Gradient descent revisited

• What exactly does a gradient descent update step do?

• It takes the steepest-descent step when distance is measured in orthonormal (Euclidean) coordinates

Page 57: Optimization for Deep Networks

Relationship: Natural Gradient vs. Gradient

F is the Fisher information matrix
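For reference (standard form, not an equation copied from the slide), the natural gradient step preconditions the gradient with the inverse Fisher matrix: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂 F⁻¹ ∇𝑓(𝜃(𝑡)).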

Page 58: Optimization for Deep Networks

PRONG: Projected Natural Gradient Descent

• Main idea is to avoid computing the inverse of the Fisher matrix

• Reparametrize the network so that the Fisher matrix is the identity
  • Hint: whitening …

Page 59: Optimization for Deep Networks

Reparametrization

• Each layer has two components

  • the weight matrix (the old 𝜃 contains all the weights), which is learned
  • a whitening matrix, which is estimated (rather than learned)

Page 60: Optimization for Deep Networks

Reparametrization

Page 61: Optimization for Deep Networks

Exact forms of weights

• 𝑈𝑖 is obtained from the (normalized) eigen-decomposition of the layer-input covariance Σ𝑖
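A sketch of the generic recipe for building a whitening matrix from the eigen-decomposition of a layer-input covariance; PRONG’s exact regularization and re-estimation schedule are described in the paper, so treat this as an assumption-laden illustration.

```python
import numpy as np

def whitening_matrix(activations, eps=1e-5):
    """activations: (batch, features). Returns U such that U @ x has ~identity covariance."""
    sigma = np.cov(activations, rowvar=False)          # layer-input covariance Sigma_i
    eigvals, eigvecs = np.linalg.eigh(sigma)           # Sigma = V diag(lambda) V^T
    # U = Lambda^{-1/2} V^T, so U Sigma U^T = I (up to the eps regularization)
    return np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
```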

Page 62: Optimization for Deep Networks

PRONG Update rule

Page 63: Optimization for Deep Networks

Similarity to Batch Normalization

Page 64: Optimization for Deep Networks

PRONG: Results

Page 65: Optimization for Deep Networks

PRONG: Results

Page 66: Optimization for Deep Networks

Thanks!

Page 67: Optimization for Deep Networks

References

• http://climin.readthedocs.org/en/latest/rmsprop.html

• http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

• Training a 3-Node Neural Network is NP-Complete – Blum and Rivest, 1993

• Equilibrated Adaptive Learning Rates for Non-convex Optimization – Dauphin et al., 2015

• Practical Recommendations for Gradient-Based Training of Deep Architectures – Bengio, 2012

• Efficient BackProp – LeCun et al., 1998

• Stochastic Gradient Descent Tricks – Leon Bottou