CS 4803 / 7643: Deep Learning
Dhruv Batra
Georgia Tech
Topics:
– Optimization
– Computing Gradients
Administrivia
• HW1 Reminder
– Due: 09/09, 11:59pm
(C) Dhruv Batra 2
Recap from last time
Regularization
Data loss: Model predictions should match training data
Regularization: Prevent the model from doing too well on training data

L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W), where λ = regularization strength (hyperparameter)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Occam's Razor: "Among competing hypotheses, the simplest is the best" (William of Ockham, 1285–1347)
Simple examples:
• L2 regularization: R(W) = Σ_k Σ_l W_{k,l}²
• L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
• Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)

More complex:
• Dropout
• Batch normalization
• Stochastic depth, fractional pooling, etc.
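The simple penalties above are one-liners in numpy. A minimal sketch (function names and the β default are mine, for illustration):

```python
import numpy as np

def l2_penalty(W):
    # L2 regularization: sum of squared weights
    return np.sum(W ** 2)

def l1_penalty(W):
    # L1 regularization: sum of absolute weights
    return np.sum(np.abs(W))

def elastic_net_penalty(W, beta=0.5):
    # Elastic net: per-entry mix of the L2 and L1 terms
    return np.sum(beta * W ** 2 + np.abs(W))

W = np.array([[1.0, -2.0], [0.5, 0.0]])
print(l2_penalty(W))           # 1 + 4 + 0.25 + 0 = 5.25
print(l1_penalty(W))           # 1 + 2 + 0.5 + 0 = 3.5
print(elastic_net_penalty(W))  # 0.5 * 5.25 + 3.5 = 6.125
```

In training, one of these R(W) terms is scaled by λ and added to the data loss.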
So far: Linear Classifiers
f(x) = Wx
Class scores
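As a sketch, the linear score function with CIFAR-10-sized shapes (3072 = 32×32×3 flattened pixels, 10 classes; the random weights are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3072)) * 0.01  # one row of weights per class
x = rng.standard_normal(3072)               # a flattened 32x32x3 image

scores = W @ x  # f(x) = Wx: one score per class
print(scores.shape)  # (10,)
```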
Hard cases for a linear classifier
• Class 1: first and third quadrants / Class 2: second and fourth quadrants
• Class 1: 1 ≤ L2 norm ≤ 2 / Class 2: everything else
• Class 1: three modes / Class 2: everything else
Feature Extraction
Image features vs Neural Nets
[Figure: hand-crafted feature extraction feeds a classifier f that outputs 10 numbers giving scores for classes, and only the classifier is trained; a neural net maps the image directly to 10 numbers giving scores for classes, and the whole network is trained end-to-end.]
Krizhevsky, Sutskever, and Hinton, “Imagenet classification with deep convolutional neural networks”, NIPS 2012.Figure copyright Krizhevsky, Sutskever, and Hinton, 2012. Reproduced with permission.
Neural networks: without the brain stuff

(Before) Linear score function: f = Wx
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1x)
x (3072) → W1 → h (100) → W2 → s (10)
(Before) Linear score function: f = Wx
(Now) 2-layer Neural Network: f = W2 max(0, W1x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1x))
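The 2-layer network above is only a few lines of numpy. A sketch with the slide's dimensions (3072 → 100 → 10; random weights for illustration, no biases):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3072)                 # input image, flattened
W1 = rng.standard_normal((100, 3072)) * 0.01  # first layer weights
W2 = rng.standard_normal((10, 100)) * 0.01    # second layer weights

h = np.maximum(0, W1 @ x)  # hidden layer: ReLU(W1 x), 100 units
s = W2 @ h                 # class scores: W2 max(0, W1 x), 10 numbers
print(h.shape, s.shape)    # (100,) (10,)
```

The elementwise max(0, ·) is the crucial nonlinearity: without it, W2(W1x) would collapse back into a single linear map.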
Multilayer Networks
• Cascaded "neurons"
• The output from one layer is the input to the next
• Each layer has its own set of weights
Image Credit: Andrej Karpathy, CS231n
[Figure: a biological neuron. Impulses are carried toward the cell body by the dendrites and away from the cell body by the axon, ending at the presynaptic terminal. Image by Felipe Perucho, licensed under CC-BY 3.0.]
Activation functions
• Sigmoid
• tanh
• ReLU
• Leaky ReLU
• Maxout
• ELU
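Numpy sketches of these activations (the α defaults are common choices rather than values from the slide, and Maxout is shown with just two linear pieces):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                      # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0, x)                # max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x, W1, b1, W2, b2):
    # Maxout: elementwise max of two learned linear functions
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # elementwise: 0.0, 0.0, 3.0
print(leaky_relu(x))  # elementwise: -0.02, 0.0, 3.0
```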
Plan for Today
• Optimization
• Computing Gradients
Optimization
Supervised Learning
• Input: x (images, text, emails, …)
• Output: y (spam or non-spam, …)
• (Unknown) Target Function
 – f: X → Y (the "true" mapping / reality)
• Data
 – (x1,y1), (x2,y2), …, (xN,yN)
• Model / Hypothesis Class
 – {h: X → Y}
 – e.g. y = h(x) = sign(wᵀx)
• Loss Function
 – How good is a model w.r.t. my data D?
• Learning = Search in hypothesis space
 – Find the best h in the model class.
Strategy: Follow the slope
What is slope?
• In 1 dimension, the derivative of a function: df(x)/dx = lim_{h→0} [f(x + h) − f(x)] / h
What is slope?
• In d dimensions, recall partial derivatives: ∂f(x)/∂x_i, the rate of change along coordinate i with all other coordinates held fixed
What is slope?
• The gradient is the vector of partial derivatives along each dimension: ∇f = (∂f/∂x_1, …, ∂f/∂x_d)
• Properties
 – The direction of steepest descent is the negative gradient
 – The slope in any direction is the dot product of that (unit) direction with the gradient
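Both properties are easy to check numerically on a toy function (the function f and the step size below are mine, for illustration):

```python
import numpy as np

def f(x):
    return np.sum(x ** 2)  # simple bowl; analytic gradient is 2x

x = np.array([1.0, 2.0])
grad = 2 * x               # the gradient: vector of partial derivatives

# Slope in a direction = dot product of the (unit) direction with the gradient
d = np.array([1.0, 0.0])   # unit vector along the first axis
h = 1e-5
slope = (f(x + h * d) - f(x)) / h
print(slope, d @ grad)     # both are approximately 2.0

# Moving along the negative gradient decreases f
assert f(x - 0.1 * grad) < f(x)
```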
[Figure: loss surface over weights (w1, w2); from the original W, step in the negative gradient direction.]
http://demonstrations.wolfram.com/VisualizingTheGradientVector/
Gradient Descent
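Vanilla gradient descent is just a short loop: evaluate the gradient, step in the negative gradient direction, repeat. A runnable sketch on a toy quadratic loss (the loss, its analytic gradient, and the step size are illustrative, not from the course):

```python
import numpy as np

def loss_fun(w):
    # toy quadratic loss, minimized at w = 3
    return np.sum((w - 3.0) ** 2)

def evaluate_gradient(w):
    # analytic gradient of the toy loss
    return 2.0 * (w - 3.0)

w = np.zeros(5)      # initial weights
step_size = 0.1      # learning rate (hyperparameter)
for _ in range(100):
    weights_grad = evaluate_gradient(w)
    w += -step_size * weights_grad  # parameter update: step downhill

print(loss_fun(w))   # near 0 after 100 steps
```

On a real model, evaluate_gradient would sum the per-example loss gradients over the whole training set, which is exactly the cost discussed next.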
Gradient Descent has a problem
• The full sum over all N training examples is expensive when N is large!
Stochastic Gradient Descent (SGD)
• Full sum expensive when N is large!
• Approximate the sum using a minibatch of examples (32 / 64 / 128 common)
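A minibatch SGD sketch on synthetic linear-regression data (the problem setup, learning rate, and iteration count are mine; only the gradient computation changes relative to full-batch descent):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
X = rng.standard_normal((N, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                         # synthetic regression targets

w = np.zeros(3)
step_size = 0.05
batch_size = 64                        # 32 / 64 / 128 common
for _ in range(500):
    idx = rng.integers(0, N, size=batch_size)   # sample a minibatch
    Xb, yb = X[idx], y[idx]
    # gradient of the mean squared error on the minibatch only
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)
    w += -step_size * grad

print(np.round(w, 2))  # approaches true_w = [1.0, -2.0, 0.5]
```

Each step is noisy, since the minibatch gradient only approximates the full-sum gradient, but each step costs a fraction of a full pass over the data.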
How do we compute gradients?
• Analytic or "Manual" Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
 – Forward mode AD
 – Reverse mode AD (aka "backprop")
Figure Credit: Baydin et al., https://arxiv.org/abs/1502.05767
Numerical gradient (evaluate the loss, then perturb one dimension at a time with h = 0.0001):

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] → loss 1.25347

W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] → loss 1.25322
 dW[0] ≈ (1.25322 − 1.25347) / 0.0001 = −2.5

W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] → loss 1.25353
 dW[1] ≈ (1.25353 − 1.25347) / 0.0001 = 0.6

W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …] → loss 1.25347
 dW[2] ≈ (1.25347 − 1.25347) / 0.0001 = 0

gradient dW: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, …]
Numerical vs Analytic Gradients
• Numerical gradient: slow :(, approximate :(, easy to write :)
• Analytic gradient: fast :), exact :), error-prone :(

In practice: derive the analytic gradient, then check your implementation against the numerical gradient. This is called a gradient check.
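A gradient check can be sketched as follows (the toy loss and helper names are mine; the centered difference (f(W+h) − f(W−h)) / 2h is more accurate than the one-sided version used in the walkthrough above):

```python
import numpy as np

def loss(W):
    return np.sum(W ** 2)  # toy loss with a known gradient

def analytic_grad(W):
    return 2 * W           # derived by hand

def numerical_grad(f, W, h=1e-5):
    # Perturb each dimension in turn, as in the slides.
    # Slow: two loss evaluations per dimension of W.
    grad = np.zeros_like(W)
    for i in np.ndindex(W.shape):
        old = W[i]
        W[i] = old + h
        fp = f(W)
        W[i] = old - h
        fm = f(W)
        W[i] = old               # restore the entry
        grad[i] = (fp - fm) / (2 * h)  # centered difference
    return grad

W = np.array([[0.34, -1.11], [0.78, 0.12]])
num = numerical_grad(loss, W)
ana = analytic_grad(W)
rel_error = np.max(np.abs(num - ana) / (np.abs(num) + np.abs(ana) + 1e-8))
print(rel_error)  # tiny, so the analytic gradient passes the check
```

A large relative error here would indicate a bug in the hand-derived analytic gradient.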