CS 1675: Intro to Machine Learning - University of Pittsburgh
![Page 1: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/1.jpg)
CS 1678: Intro to Deep Learning
Neural Network Basics
Prof. Adriana Kovashka, University of Pittsburgh
January 28, 2021
![Page 2: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/2.jpg)
Plan for this lecture (next few classes)
• Definition
  – Architecture
  – Basic operations
  – Biological inspiration
• Goals
  – Loss functions
• Training
  – Gradient descent
  – Backpropagation
• Hands-on exercise
![Page 3: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/3.jpg)
Definition
![Page 4: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/4.jpg)
Neural network definition
• Raw activations: aj = Σi wji xi (plus a bias term)
• Nonlinear activation function h (e.g. sigmoid, tanh, ReLU) applied to the raw activations: e.g. zj = ReLU(aj) = max(0, aj)
Figure from Christopher Bishop
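To make the two operations concrete, here is a minimal NumPy sketch (not from the lecture; all names, shapes, and values are made up) of computing raw activations for one layer and passing them through ReLU:

```python
import numpy as np

def relu(a):
    # elementwise nonlinearity h: ReLU(a) = max(0, a)
    return np.maximum(0.0, a)

x = np.array([0.5, -1.2, 3.0])          # example input (made up)
W = np.array([[0.1, -0.3, 0.2],
              [0.7,  0.5, -0.4]])       # 2 hidden units, 3 inputs
b = np.array([0.0, 0.1])                # bias term

a = W @ x + b        # raw activations
z = relu(a)          # activations after the nonlinearity h
print(a, z)
```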
![Page 5: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/5.jpg)
Neural network definition (continued)
• Layer 2: aj = Σi wji(1) xi, zj = h(aj)
• Layer 3 (final): ak = Σj wkj(2) zj
• Outputs: yk = σ(ak) (binary), or yk = exp(ak) / Σj exp(aj) (multiclass, softmax)
• Finally, composing the layers (binary): y(x, w) = σ( Σj wkj(2) h( Σi wji(1) xi ) )
![Page 6: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/6.jpg)
Sigmoid
tanh
ReLU
Leaky ReLU
Maxout
ELU
Activation functions
Fei-Fei Li, Andrej Karpathy, Justin Johnson, Serena Yeung
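For reference, a small NumPy sketch (my own, not the lecture's code) of the elementwise activations named above; Maxout is only noted in a comment because it is not a fixed elementwise function:

```python
import numpy as np

def sigmoid(a):    return 1.0 / (1.0 + np.exp(-a))
def tanh(a):       return np.tanh(a)
def relu(a):       return np.maximum(0.0, a)
def leaky_relu(a, alpha=0.01): return np.where(a > 0, a, alpha * a)
def elu(a, alpha=1.0):         return np.where(a > 0, a, alpha * (np.exp(a) - 1.0))
# Maxout is different: it takes the max over several linear functions of x,
# e.g. max(w1.x + b1, w2.x + b2), so it is not a fixed elementwise function.

a = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(a), leaky_relu(a), elu(a))
```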
![Page 7: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/7.jpg)
A multi-layer neural network…
• Is a non-linear classifier
• Can approximate any continuous function to arbitrary
accuracy given sufficiently many hidden units
Lana Lazebnik
![Page 8: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/8.jpg)
Inspiration: Neuron cells
• Neurons
• accept information from multiple inputs
• transmit information to other neurons
• Multiply inputs by weights along edges
• Apply some function to the set of inputs at each node
• If output of function over threshold, neuron “fires”
Text: HKUST, figures: Andrej Karpathy
![Page 9: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/9.jpg)
Biological analog
A biological neuron and an artificial neuron
Jia-bin Huang
![Page 10: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/10.jpg)
Hubel and Wiesel's architecture and a multi-layer neural network
Adapted from Jia-bin Huang
Biological analog
![Page 11: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/11.jpg)
Feed-forward networks
• Cascade neurons together
• Output from one layer is the input to the next
• Each layer has its own set of weights
HKUST
![Page 12: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/12.jpg)
Feed-forward networks
• Inputs multiplied by initial set of weights
HKUST
![Page 13: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/13.jpg)
Feed-forward networks
• Intermediate “predictions” computed at first
hidden layer
HKUST
![Page 14: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/14.jpg)
Feed-forward networks
• Intermediate predictions multiplied by second
layer of weights
• Predictions are fed forward through the
network to classify
HKUST
![Page 15: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/15.jpg)
Feed-forward networks
• Compute second set of intermediate
predictions
HKUST
![Page 16: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/16.jpg)
Feed-forward networks
• Multiply by final set of weights
HKUST
![Page 17: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/17.jpg)
Feed-forward networks
• Compute output (e.g. probability of a
particular class being present in the sample)
HKUST
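The whole cascade can be written as a short loop. This is a hedged sketch with made-up layer sizes and random weights, just to show how each layer's output becomes the next layer's input:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)),   # layer 1: 4 inputs -> 5 hidden units
           rng.standard_normal((3, 5)),   # layer 2: 5 hidden -> 3 hidden units
           rng.standard_normal((1, 3))]   # layer 3: 3 hidden -> 1 output

z = rng.standard_normal(4)                # input vector
for W in weights[:-1]:
    z = np.maximum(0.0, W @ z)            # intermediate "predictions" (ReLU)
p = sigmoid(weights[-1] @ z)              # output, e.g. probability of a class
print(p)
```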
![Page 18: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/18.jpg)
Deep neural networks
• Lots of hidden layers
• Depth = power (usually)
Figure from http://neuralnetworksanddeeplearning.com/chap5.html
Weights to learn! (annotation repeated at each layer of the figure)
![Page 19: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/19.jpg)
Goals
![Page 20: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/20.jpg)
How do we train deep networks?
• No closed-form solution for the weights (we can't set up a system A*w = b and solve for w)
• We will iteratively find a set of weights that allows the outputs to match the desired outputs
• We want to minimize a loss function (a
function of the weights in the network)
• For now, let’s simplify and assume there’s a
single layer of weights in the network, and no
activation function (i.e., output is a linear
combination of the inputs)
![Page 21: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/21.jpg)
Classification goal
Example dataset: CIFAR-10
10 labels
50,000 training images
each image is 32x32x3
10,000 test images.
Andrej Karpathy
![Page 22: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/22.jpg)
Classification scores
f(x, W): the image x is a [32x32x3] array of numbers 0...1 (3072 numbers total), W are the parameters, and the output is 10 numbers indicating class scores
Andrej Karpathy
![Page 23: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/23.jpg)
Linear classifier
f(x, W) = Wx + b
• x: the [32x32x3] image stretched into a 3072x1 column of numbers 0...1
• W: 10x3072 parameters, or "weights"
• b: 10x1 bias
• output: 10x1, i.e. 10 numbers indicating class scores
Andrej Karpathy
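A minimal sketch of this linear classifier with CIFAR-10 shapes (the variable names and random values are illustrative, not the course's starter code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(3072)                        # 32*32*3 image stretched into a column, values in 0...1
W = 0.01 * rng.standard_normal((10, 3072))  # 10 classes x 3072 pixels
b = np.zeros(10)

scores = W @ x + b                          # 10 numbers, one score per class
print(scores.shape)                         # (10,)
```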
![Page 24: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/24.jpg)
Linear classifier
Example with a 4-pixel image and 3 classes (cat/dog/ship)
Andrej Karpathy
![Page 25: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/25.jpg)
Linear classifier
Going forward: Loss function/Optimization
1. Define a loss function
that quantifies our
unhappiness with the
scores across the training
data.
2. Come up with a way of
efficiently finding the
parameters that minimize
the loss function.
(optimization)
Adapted from Andrej Karpathy
Scores for three example images (rows are classes; columns are the cat, car, and frog images):
cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1
![Page 26: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/26.jpg)
Linear classifier
Suppose: 3 training examples, 3 classes.
With some W the scores are:
Scores (rows are classes; columns are the cat, car, and frog images):
cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1
Adapted from Andrej Karpathy
![Page 27: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/27.jpg)
Linear classifier: Hinge loss
Suppose: 3 training examples, 3 classes.
With some W the scores are:
Scores (rows are classes; columns are the cat, car, and frog images):
cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1
Hinge loss:
Given an example (xi, yi), where xi is the image and yi is the (integer) label, and using the shorthand s = f(xi, W) for the scores vector, the loss has the form:
Li = Σ_{j≠yi} max(0, sj - syi + 1)
Adapted from Andrej Karpathy
Want: syi >= sj + 1, i.e. sj - syi + 1 <= 0
If that holds, the loss is 0; otherwise the loss is the magnitude of the violation.
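A small sketch of this per-example hinge loss (the helper name `hinge_loss` is mine); on the slide's cat-image scores it reproduces the 2.9 worked out on the next slide:

```python
import numpy as np

def hinge_loss(scores, y):
    # L_i = sum over j != y of max(0, s_j - s_y + 1)
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0                  # do not count the correct class
    return margins.sum()

# Scores for the cat image (classes: cat, car, frog), correct label 0:
print(hinge_loss(np.array([3.2, 5.1, -1.7]), 0))   # 2.9
```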
![Page 28: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/28.jpg)
Linear classifier: Hinge loss
Suppose: 3 training examples, 3 classes. With some W the scores are as above.
Hinge loss (as defined above): Li = Σ_{j≠yi} max(0, sj - syi + 1)
For the cat image (correct class: cat):
= max(0, 5.1 - 3.2 + 1)
+max(0, -1.7 - 3.2 + 1)
= max(0, 2.9) + max(0, -3.9)
= 2.9 + 0
= 2.9
Scores (rows are classes; columns are the cat, car, and frog images):
cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1
Losses: 2.9
Adapted from Andrej Karpathy
![Page 29: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/29.jpg)
Linear classifier: Hinge loss
Suppose: 3 training examples, 3 classes. With some W the scores are as above.
Hinge loss (as defined above): Li = Σ_{j≠yi} max(0, sj - syi + 1)
For the car image (correct class: car):
= max(0, 1.3 - 4.9 + 1)
+max(0, 2.0 - 4.9 + 1)
= max(0, -2.6) + max(0, -1.9)
= 0 + 0
= 0
Scores (rows are classes; columns are the cat, car, and frog images):
cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1
Losses: 2.9, 0
Adapted from Andrej Karpathy
![Page 30: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/30.jpg)
Linear classifier: Hinge loss
Suppose: 3 training examples, 3 classes. With some W the scores are as above.
Hinge loss (as defined above): Li = Σ_{j≠yi} max(0, sj - syi + 1)
For the frog image (correct class: frog):
= max(0, 2.2 - (-3.1) + 1)
+max(0, 2.5 - (-3.1) + 1)
= max(0, 5.3 + 1)
+ max(0, 5.6 + 1)
= 6.3 + 6.6
= 12.9
Scores (rows are classes; columns are the cat, car, and frog images):
cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1
Losses: 2.9, 0, 12.9
Adapted from Andrej Karpathy
![Page 31: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/31.jpg)
Linear classifier: Hinge loss
Scores (rows are classes; columns are the cat, car, and frog images):
cat    3.2    1.3    2.2
car    5.1    4.9    2.5
frog  -1.7    2.0   -3.1
Suppose: 3 training examples, 3 classes. With some W the scores are as above.
Hinge loss (as defined above): Li = Σ_{j≠yi} max(0, sj - syi + 1)
The full training loss is the mean over all examples in the training data: L = (1/N) Σi Li
Losses: 2.9, 0, 12.9, so L = (2.9 + 0 + 12.9)/3 = 15.8/3 ≈ 5.3
Adapted from Andrej Karpathy
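The same computation can be vectorized over the whole (toy) training set; this sketch uses the 3x3 score matrix from the slides and reproduces the per-example losses 2.9, 0, 12.9 and their mean of roughly 5.3:

```python
import numpy as np

# Rows = classes cat/car/frog, columns = the cat, car, and frog images; labels 0, 1, 2.
scores = np.array([[ 3.2, 1.3,  2.2],
                   [ 5.1, 4.9,  2.5],
                   [-1.7, 2.0, -3.1]])
labels = np.array([0, 1, 2])

correct = scores[labels, np.arange(3)]             # score of the true class, per column
margins = np.maximum(0.0, scores - correct + 1.0)  # broadcast over classes
margins[labels, np.arange(3)] = 0.0                # zero out the correct-class terms
per_example = margins.sum(axis=0)                  # [2.9, 0.0, 12.9]
print(per_example, per_example.mean())             # mean ~ 5.27 (the slide rounds to 5.3)
```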
![Page 32: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/32.jpg)
Linear classifier: Hinge loss
Slide from Fei-Fei, Johnson, Yeung
E.g. Suppose that we found a W such that L = 0.
Is this W unique?
No! 2W also has L = 0!
How do we choose between W and 2W?
![Page 33: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/33.jpg)
Weight Regularization
Full loss: L = (1/N) Σi Li(f(xi, W), yi) + λ R(W)
• Data loss: model predictions should match the training data
• Regularization: prevent the model from doing too well on the training data
• λ = regularization strength (hyperparameter)
Simple examples:
• L2 regularization: R(W) = Σk Σl Wk,l^2
• L1 regularization: R(W) = Σk Σl |Wk,l|
• Elastic net (L1 + L2): R(W) = Σk Σl (β Wk,l^2 + |Wk,l|)
More complex: dropout, batch normalization, stochastic depth / pooling, etc.
Why regularize?
- Express preferences over weights
- Make the model simple so it works on test data
Adapted from Fei-Fei, Johnson, Yeung
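A sketch of adding a regularization penalty to the data loss (λ, W, and the data-loss value below are placeholders for illustration):

```python
import numpy as np

def l2_penalty(W): return np.sum(W * W)
def l1_penalty(W): return np.sum(np.abs(W))

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((10, 3072))
data_loss = 5.27          # e.g. the mean hinge loss computed earlier
lam = 1e-3                # regularization strength (hyperparameter)

print(data_loss + lam * l2_penalty(W),   # full loss with L2 regularization
      data_loss + lam * l1_penalty(W))   # full loss with L1 regularization
```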
![Page 34: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/34.jpg)
Weight Regularization
Expressing preferences
L2 Regularization
L2 regularization likes to
“spread out” the weights
Slide from Fei-Fei, Johnson, Yeung
![Page 35: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/35.jpg)
Weight Regularization
Preferring simple models
[Figure: data points in the x-y plane fit by a simple curve f1 and a wiggly curve f2]
Regularization pushes against fitting the data
too well so we don’t fit noise in the data
![Page 36: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/36.jpg)
Another loss: Cross-entropy
Scores are unnormalized log probabilities of the classes: P(Y = k | X = xi) = exp(sk) / Σj exp(sj), where s = f(xi; W)
Want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class: Li = -log P(Y = yi | X = xi)
Example scores: cat 3.2, car 5.1, frog -1.7
Andrej Karpathy
![Page 37: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/37.jpg)
Another loss: Cross-entropy
unnormalized log probabilities (scores): cat 3.2, car 5.1, frog -1.7
→ exp → unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
→ normalize → probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)
Li = -log(0.13) = 0.89
Adapted from Fei-Fei, Johnson, Yeung
Aside:
- This is multinomial logistic regression
- Choose weights to maximize the likelihood of the observed x/y data
(Maximum Likelihood Estimation; more discussion in CS 1675)
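A short sketch of the softmax / cross-entropy computation on the slide's scores. The max-shift is a standard numerical-stability trick, not something shown on the slide; note also that `np.log` is the natural log (giving about 2.04 here), while the slide's 0.89 corresponds to log base 10:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])               # cat, car, frog
shifted = scores - scores.max()                   # stability trick (assumption, not on the slide)
probs = np.exp(shifted) / np.exp(shifted).sum()   # [~0.13, ~0.87, ~0.00]
loss = -np.log(probs[0])                          # negative log likelihood of the correct class ("cat")
print(probs.round(2), loss)                       # natural log: ~2.04; -log10(0.13) would be ~0.89
```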
![Page 38: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/38.jpg)
Another loss: Cross-entropy
Adapted from Fei-Fei, Johnson, Yeung
scores (unnormalized log-probabilities / logits): cat 3.2, car 5.1, frog -1.7
→ exp → unnormalized probabilities: 24.5, 164.0, 0.18 (probabilities must be >= 0)
→ normalize → probabilities: 0.13, 0.87, 0.00 (probabilities must sum to 1)
→ compare to the correct probabilities 1.00, 0.00, 0.00 (via the Kullback–Leibler divergence)
![Page 39: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/39.jpg)
Other losses
• Triplet loss (Schroff, FaceNet, CVPR 2015)
• Anything you want! (almost)
(a denotes the anchor, p the positive example, and n the negative example)
![Page 40: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/40.jpg)
Training
![Page 41: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/41.jpg)
To minimize loss, use gradient descent
Andrej Karpathy
![Page 42: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/42.jpg)
How to minimize the loss function?
In 1 dimension, the derivative of a function: df(x)/dx = lim h→0 [f(x + h) - f(x)] / h
In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.
Andrej Karpathy
![Page 43: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/43.jpg)
Loss gradients
• Denoted as (different notations): dL/dw, ∂L/∂w, ∇w L
• i.e. how does the loss change as a function
of the weights
• We want to change the weights in such a
way that makes the loss decrease as fast as
possible
![Page 44: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/44.jpg)
current W:
[0.34,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25347
gradient dW:
[?,
?,
?,
?,
?,
?,
?,
?,
?,…]
Andrej Karpathy
![Page 45: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/45.jpg)
current W:
[0.34,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25347
W + h (first dim):
[0.34 + 0.0001,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25322
gradient dW:
[?,
?,
?,
?,
?,
?,
?,
?,
?,…]
Andrej Karpathy
![Page 46: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/46.jpg)
gradient dW:
[-2.5,
?,
?,
?,?,
?,
?,?,
?,…]
(1.25322 - 1.25347)/0.0001
= -2.5
current W:
[0.34,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25347
W + h (first dim):
[0.34 + 0.0001,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25322
Andrej Karpathy
![Page 47: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/47.jpg)
gradient dW:
[-2.5,
?,
?,
?,
?,
?,
?,
?,
?,…]
current W:
[0.34,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25347
W + h (second dim):
[0.34,
-1.11 + 0.0001,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25353
Andrej Karpathy
![Page 48: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/48.jpg)
gradient dW:
[-2.5,
0.6,
?,
?,?,
?,
?,
?,
?,…]
current W:
[0.34,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25347
W + h (second dim):
[0.34,
-1.11 + 0.0001,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25353
(1.25353 - 1.25347)/0.0001
= 0.6
Andrej Karpathy
![Page 49: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/49.jpg)
gradient dW:
[-2.5,
0.6,
?,
?,
?,
?,
?,
?,
?,…]
current W:
[0.34,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25347
W + h (third dim):
[0.34,
-1.11,
0.78 + 0.0001,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25347
Andrej Karpathy
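The procedure on the last few slides (nudge one weight by h, recompute the loss, take the finite difference) can be written as a short loop; `loss_fn` below is a stand-in for the network's real loss, and the final example uses a made-up quadratic loss just to show the call:

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-4):
    grad = np.zeros_like(W)
    base = loss_fn(W)
    for i in range(W.size):
        W_plus = W.copy()
        W_plus.flat[i] += h
        # e.g. (1.25322 - 1.25347)/0.0001 = -2.5 for the first dimension on the slide
        grad.flat[i] = (loss_fn(W_plus) - base) / h
    return grad

# Tiny usage example with a made-up quadratic "loss":
W = np.array([0.34, -1.11, 0.78])
print(numerical_gradient(lambda w: np.sum(w ** 2), W))   # approximately 2W
```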
![Page 50: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/50.jpg)
This is silly. The loss is just a function of W; we want ∇W L.
Use calculus to write down an expression for the gradient:
dW = ∇W L = ...
Andrej Karpathy
![Page 51: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/51.jpg)
gradient dW:
[-2.5,
0.6,
0,
0.2,
0.7,
-0.5,
1.1,
1.3,
-2.1,…]
current W:
[0.34,
-1.11,
0.78,
0.12,
0.55,
2.81,
-3.1,
-1.5,
0.33,…]
loss 1.25347
dW = ...
(some function of the data and W)
Andrej Karpathy
![Page 52: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/52.jpg)
Gradient descent
• We'll update the weights: w ← w - η ∂L/∂w, where η is the learning rate
• Move in the direction opposite to the gradient
[Figure: loss L decreasing over time; in the (W_1, W_2) weight plane, a step from the original W in the negative gradient direction]
Figure from Andrej Karpathy
![Page 53: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/53.jpg)
Gradient descent
• Iteratively subtract the gradient with respect
to the model parameters (w)
• I.e., we’re moving in a direction opposite to
the gradient of the loss
• I.e., we’re moving towards smaller loss
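A sketch of the resulting update loop (the function and parameter names are mine; `eval_gradient` stands in for whatever computes dL/dW):

```python
import numpy as np

def gradient_descent(W, eval_gradient, learning_rate=1e-2, num_steps=100):
    for _ in range(num_steps):
        dW = eval_gradient(W)          # gradient of the loss w.r.t. the weights
        W = W - learning_rate * dW     # move opposite to the gradient
    return W

# Usage on a toy quadratic loss L(W) = ||W||^2, whose gradient is 2W:
W0 = np.array([0.34, -1.11, 0.78])
print(gradient_descent(W0, lambda W: 2 * W))   # moves toward the minimum at 0
```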
![Page 54: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/54.jpg)
Andrej Karpathy
Learning rate selection
The effects of step size (or “learning rate”)
![Page 55: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/55.jpg)
Comments on training algorithm
• Not guaranteed to converge to zero training error, may
converge to local optima or oscillate indefinitely.
• However, in practice, does converge to low error for
many large networks on real data.
• Local minima – not a huge problem in practice for deep
networks.
• Thousands of epochs (epoch = network sees all training
data once) may be required, hours or days to train.
• May be hard to set learning rate and to select number of
hidden units and layers.
• When in doubt, use validation set to decide on
design/hyperparameters.
• Neural networks had fallen out of fashion in the 90s and early 2000s; they now achieve significantly better performance (deep networks trained with dropout and lots of data).
Adapted from Ray Mooney, Carlos Guestrin, Dhruv Batra
![Page 56: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/56.jpg)
Gradient descent in multi-layer nets
• We'll update the weights: w ← w - η ∂L/∂w
• Move in the direction opposite to the gradient
• How to update the weights at all layers?
• Answer: backpropagation of error from
higher layers to lower layers
![Page 57: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/57.jpg)
Gradient descent in multi-layer nets
• How to update the weights at all layers?
• Answer: backpropagation of error from
higher layers to lower layers
Figure from Andrej Karpathy
![Page 58: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/58.jpg)
Backpropagation: Graphic example
First calculate error of output units and use this
to change the top layer of weights.
[Figure: three-layer network with input units i, hidden units j, and output units k; weights w(1) feed the hidden layer, weights w(2) feed the output layer]
Update weights into j (store the diff, update at the end)
Adapted from Ray Mooney, equations from Chris Bishop
![Page 59: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/59.jpg)
Backpropagation: Graphic example
Next calculate the error for hidden units based on the errors of the output units they feed into.
[Figure: three-layer network with input units i, hidden units j, and output units k]
Adapted from Ray Mooney, equations from Chris Bishop
![Page 60: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/60.jpg)
Backpropagation: Graphic example
Finally update the bottom layer of weights based on the errors calculated for the hidden units.
Update weights into i
[Figure: three-layer network with input units i, hidden units j, and output units k]
Adapted from Ray Mooney, equations from Chris Bishop
![Page 61: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/61.jpg)
Computing gradient for each weight
• We need to move weights in direction
opposite to gradient of loss wrt that weight:
wkj = wkj – η dL/dwkj (output layer)
wji = wji – η dL/dwji (hidden layer)
• Loss depends on weights in an indirect way,
so we’ll use the chain rule and compute:
dL/dwkj = dL/dyk dyk/dak dak/dwkj
dL/dwji = dL/dzj dzj/daj daj/dwji
![Page 62: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/62.jpg)
Gradient for output layer weights
• Loss depends on weights in an indirect way,
so we’ll use the chain rule and compute:
dL/dwkj = dL/dyk dyk/dak dak/dwkj
• How to compute each of these?
• dL/dyk: need to know the form of the error function
  • Example: if L = (yk - yk')^2, where yk' is the ground-truth label, then dL/dyk = 2(yk - yk')
• dyk/dak: need to know the output layer activation
  • If h(ak) = σ(ak), then dh(ak)/dak = σ(ak)(1 - σ(ak))
• dak/dwkj: this is just zj, since ak is a linear combination: ak = wk^T z = Σj wkj zj
![Page 63: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/63.jpg)
Gradient for hidden layer weights
• We’ll use the chain rule again and compute:
dL/dwji = dL/dzj dzj/daj daj/dwji
• Unlike the previous case (weights for output layer), the error (dL/dzj) is hard to compute
(indirect, need chain rule again)
• We’ll simplify the computation by doing it
step by step via backpropagation of error
• You could directly compute this term; you will get the same result as with backprop (try it as an exercise!)
![Page 64: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/64.jpg)
Backprop – rough notation
• The following is a framework, slightly imprecise
• Let us denote the inputs at a layer i by ini, the
linear combination of inputs computed at that layer as rawi, the activation as acti
• We define a new quantity that will roughly correspond to accumulated error, erri :
erri = d L / d acti * d acti / d rawi
• Then we can write the updates as
w = w – η * erri * ini
![Page 65: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/65.jpg)
Backprop – formulation
• We’ll write the weight updates as follows
➢wkj = wkj - η δk zj for output units
➢wji = wji - η δj xi for hidden units
• What are δk, δj? • They store error, gradient wrt raw activations (i.e. dL/da)
• They’re of the form dL/dzj dzj/daj
• The latter is easy to compute – just use derivative of
activation function
• The former is easy for output – e.g. (yk – yk’)
• It is harder to compute for hidden layers
• dL/dzj = ∑k wkj δk (where did this come from?)
Figure from Chris Bishop
![Page 66: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/66.jpg)
Deriving backprop (Bishop Eq. 5.56)
• In a neural network: aj = Σi wji zi, zj = h(aj)
• Gradient is (using the chain rule): ∂En/∂wji = (∂En/∂aj)(∂aj/∂wji)
• Denote the "errors" as: δj ≡ ∂En/∂aj
• Also: ∂aj/∂wji = zi, so ∂En/∂wji = δj zi
Equations from Bishop
![Page 67: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/67.jpg)
Deriving backprop (Bishop Eq. 5.56)
• For output units (identity output, L2 loss): δk = yk - tk
• For hidden units (using the chain rule again): δj = ∂En/∂aj = Σk (∂En/∂ak)(∂ak/∂aj)
• Backprop formula: δj = h'(aj) Σk wkj δk
Equations from Bishop, also see https://en.wikipedia.org/wiki/Chain_rule#Example
![Page 68: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/68.jpg)
Putting it all together
• Example: use sigmoid at hidden layer and
output layer, loss is L2 between
true/predicted labels
![Page 69: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/69.jpg)
Example algorithm for sigmoid, L2 error
• Initialize all weights to small random values
• Until convergence (e.g. all training examples’ error
small, or error stops decreasing) repeat:
• For each (x, y’=class(x)) in training set:
– Calculate network outputs: yk
– Compute errors (gradients wrt activations) for each unit:
» δk = yk (1-yk) (yk – yk’) for output units
» δj = zj (1-zj) ∑k wkj δk for hidden units
– Update weights:
» wkj = wkj - η δk zj for output units
» wji = wji - η δj xi for hidden units
Adapted from R. Hwa, R. Mooney
Recall: wji = wji – η dE/dzj dzj/daj daj/dwji
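A sketch of one step of this algorithm for a single hidden layer, with sigmoid activations at both layers and L2 error; the shapes, values, and the omission of bias terms are my simplifications:

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x  = rng.random(4)                       # input
t  = np.array([1.0, 0.0])                # target y'
W1 = 0.1 * rng.standard_normal((3, 4))   # hidden-layer weights w_ji
W2 = 0.1 * rng.standard_normal((2, 3))   # output-layer weights w_kj
eta = 0.5

# Forward pass
z = sigmoid(W1 @ x)                      # hidden activations z_j
y = sigmoid(W2 @ z)                      # outputs y_k

# Errors (gradients w.r.t. raw activations)
delta_k = y * (1 - y) * (y - t)          # output units
delta_j = z * (1 - z) * (W2.T @ delta_k) # hidden units

# Weight updates
W2 -= eta * np.outer(delta_k, z)         # w_kj <- w_kj - eta * delta_k * z_j
W1 -= eta * np.outer(delta_j, x)         # w_ji <- w_ji - eta * delta_j * x_i
```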
![Page 70: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/70.jpg)
Another example
• Two-layer network with tanh at the hidden layer: h(a) = tanh(a) = (e^a - e^(-a)) / (e^a + e^(-a))
• Derivative: h'(a) = 1 - h(a)^2
• Minimize: En = ½ Σk (yk - tk)^2
• Forward propagation: aj = Σi wji(1) xi, zj = tanh(aj), yk = Σj wkj(2) zj
![Page 71: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/71.jpg)
Another example
• Errors at the output (identity function at the output): δk = yk - tk
• Errors at the hidden units: δj = (1 - zj^2) Σk wkj δk
• Derivatives w.r.t. the weights: ∂En/∂wji(1) = δj xi and ∂En/∂wkj(2) = δk zj
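A matching sketch of this second example (tanh hidden units, identity outputs, L2 error), again with made-up shapes and values:

```python
import numpy as np

rng = np.random.default_rng(1)
x  = rng.random(4)
t  = rng.random(2)
W1 = 0.1 * rng.standard_normal((3, 4))
W2 = 0.1 * rng.standard_normal((2, 3))

# Forward propagation
a = W1 @ x
z = np.tanh(a)
y = W2 @ z                                  # identity activation at the output

# Errors
delta_k = y - t                             # output units
delta_j = (1 - z ** 2) * (W2.T @ delta_k)   # tanh'(a) = 1 - tanh(a)^2

# Derivatives w.r.t. the weights
dW2 = np.outer(delta_k, z)
dW1 = np.outer(delta_j, x)
```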
![Page 72: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/72.jpg)
Same example with graphic and math
First calculate error of output units and use this
to change the top layer of weights.
Update weights into j
[Figure: three-layer network with input units i, hidden units j, and output units k]
Adapted from Ray Mooney, equations from Chris Bishop
![Page 73: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/73.jpg)
Same example with graphic and math
Next calculate the error for hidden units based on the errors of the output units they feed into.
[Figure: three-layer network with input units i, hidden units j, and output units k]
Adapted from Ray Mooney, equations from Chris Bishop
![Page 74: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/74.jpg)
Same example with graphic and math
Finally update bottom layer of weights based on
errors calculated for hidden units.
Update weights into i
[Figure: three-layer network with input units i, hidden units j, and output units k]
Adapted from Ray Mooney, equations from Chris Bishop
![Page 75: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/75.jpg)
Another way of keeping track of error
Computation graphs
• Accumulate upstream/downstream gradients at each node
• One set (the local gradients) flows from inputs to outputs and can be computed during the forward pass, without evaluating the loss
• The other (the upstream gradients) flows from the outputs (the loss) back to the inputs
![Page 76: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/76.jpg)
Generic example
[Figure: a node f in a computation graph; activations flow forward through f]
Andrej Karpathy
![Page 77: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/77.jpg)
Generic example
[Figure: the node f's "local gradient"; gradients flow backward through f]
Andrej Karpathy
![Page 78: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/78.jpg)
Generic example (continued)
[Figure: gradients continue to flow backward, each multiplied by the node's "local gradient"]
Andrej Karpathy
![Page 79: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/79.jpg)
Generic example (continued)
Andrej Karpathy
![Page 80: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/80.jpg)
Generic example (continued)
Andrej Karpathy
![Page 81: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/81.jpg)
Generic example (continued)
Andrej Karpathy
![Page 82: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/82.jpg)
Another generic example
e.g. x = -2, y = 5, z = -4
Andrej Karpathy
![Page 83: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/83.jpg)
Another generic example
e.g. x = -2, y = 5, z = -4. Want: the gradients of f with respect to x, y, and z.
Andrej Karpathy
![Page 84: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/84.jpg)
Another generic example (continued)
Andrej Karpathy
![Page 85: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/85.jpg)
Another generic example (continued)
Andrej Karpathy
![Page 86: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/86.jpg)
Another generic example (continued)
Andrej Karpathy
![Page 87: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/87.jpg)
Another generic example (continued)
Andrej Karpathy
![Page 88: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/88.jpg)
Another generic example (continued)
Andrej Karpathy
![Page 89: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/89.jpg)
Another generic example (continued)
Andrej Karpathy
![Page 90: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/90.jpg)
Another generic example (continued)
Andrej Karpathy
![Page 91: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/91.jpg)
Another generic example (continued): chain rule
Andrej Karpathy
![Page 92: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/92.jpg)
Another generic example (continued)
Andrej Karpathy
![Page 93: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/93.jpg)
Another generic example (continued): chain rule
Andrej Karpathy
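A tiny sketch of the forward and backward passes for this example, assuming the standard f(x, y, z) = (x + y) · z graph that these slides build up (the intermediate name `q` is mine):

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass, saving intermediate activations
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: multiply upstream gradients by local gradients
df_df = 1.0
df_dq = z * df_df      # local gradient of * w.r.t. q is z  -> -4
df_dz = q * df_df      # local gradient of * w.r.t. z is q  ->  3
df_dx = 1.0 * df_dq    # local gradient of + w.r.t. x is 1  -> -4
df_dy = 1.0 * df_dq    # local gradient of + w.r.t. y is 1  -> -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```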
![Page 94: CS 1675: Intro to Machine Learning - University of Pittsburgh](https://reader034.fdocuments.in/reader034/viewer/2022050313/626f8afa13240a1eeb40c35c/html5/thumbnails/94.jpg)
Summary
• Feed-forward network architecture
• Training deep neural nets
  • We need an objective function that measures and guides us towards good performance
  • We need a way to minimize the loss function: gradient descent
  • We need backpropagation to propagate the error to all layers and change the weights at those layers
• Next: practices for preventing overfitting, training with little data, examining conditions for success, alternative optimization strategies