Lecture 2: Introduction to Deep Learning
Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound
Course Logistics
- Update on course registrations — 43 students registered (out of 40)
- Those who want to audit …
- Piazza registrations (all announcements and HW solutions will be there)
- Assignment 1 is out (due date Monday, Jan 14 @ 11:59pm)
- Microsoft Azure credits and tutorial
- TA office hours will be posted by tonight (there will be one this week and next)
Introduction to Deep Learning
There is a lot packed into today's lecture (excerpts from a few lectures of CS231n)
Covering: foundations and the most important aspects of supervised DNNs
Not covering: neuroscience background of deep learning, optimization (CPSC 340 & CPSC 540), and not a lot of theoretical underpinning
If you want more details, check out the CS231n lectures on-line
Linear regression (review)
[Figure: a training set and a held-out testing set; inputs (features): production costs, promotional costs, genre of the movie; outputs: box office first week, total revenue USA, total revenue international, total book sales]
y_j = \sum_i w_{ji} x_i + b_j
*slide adapted from V. Ordonez
Linear regression (review)
y_j = \sum_i w_{ji} x_i + b_j
Each output is a linear combination of the inputs plus a bias; easier to write in matrix form:
y = W^T x + b
*slide adapted from V. Ordonez
Linear regression (review)
y_j = \sum_i w_{ji} x_i + b_j, or in matrix form y = W^T x + b
Key to accurate prediction is learning parameters to minimize discrepancy with historical (training) data D_{train} = \{(x^{(d)}, y^{(d)})\}:
L(W, b) = \sum_{d=1}^{|D_{train}|} \ell(\hat{y}^{(d)}, y^{(d)}), e.g., the squared error L(W, b) = \sum_{d=1}^{|D_{train}|} \| \hat{y}^{(d)} - y^{(d)} \|^2
W^*, b^* = \arg\min_{W,b} L(W, b)
*slide adapted from V. Ordonez
Linear regression (review) — Learning w/ Least Squares
L(W, b) = \sum_{d=1}^{|D_{train}|} \| W^T x^{(d)} + b - y^{(d)} \|^2
W^*, b^* = \arg\min_{W,b} L(W, b)
Solution: set the derivatives to zero,
\frac{\partial L(W, b)}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_{d=1}^{|D_{train}|} \| W^T x^{(d)} + b - y^{(d)} \|^2 = 0
and after some operations: W^* = (X^T X)^{-1} X^T Y
*slide adapted from V. Ordonez
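The closed-form solution W* = (XᵀX)⁻¹XᵀY can be checked numerically; a minimal sketch with NumPy, where the data, shapes, and variable names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features
X_aug = np.hstack([X, np.ones((100, 1))])     # append 1s so the bias folds into W
W_true = np.array([[2.0], [-1.0], [0.5], [3.0]])
Y = X_aug @ W_true                            # noiseless targets, for clarity

# Normal equations; np.linalg.lstsq is preferred in practice for stability
W_star = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ Y
print(np.allclose(W_star, W_true))            # → True
```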
One-layer Neural Network
[Figure: input layer x_1 … x_5; output layer y_1, y_2; each output unit computes a weighted sum of the inputs followed by an activation function a(\cdot)]
Linear Activation: a(x) = x
One-layer Neural Network
[Figure: the same network, with layer parameters W_o, b_o]
This weighted sum + activation layer is also called a Multi-layer Perceptron (MLP) layer / Fully Connected (FC) layer
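The layer above computes y = a(Wᵀx + b); a minimal sketch in NumPy, where `fc_forward` and all the values are made up for illustration:

```python
import numpy as np

# One fully connected layer: y = a(W^T x + b), with the linear activation a(x) = x.
def fc_forward(x, W, b, a=lambda z: z):
    return a(W.T @ x + b)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # inputs x1..x5
W = np.ones((5, 2)) * 0.1                  # 5 inputs -> 2 outputs
b = np.array([0.0, 1.0])
y = fc_forward(x, W, b)
print(y)                                   # → [1.5 2.5]
```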
Multi-layer Neural Network
[Figure: input layer x_1 … x_5; 1st hidden layer (W_{h1}, b_{h1}); 2nd hidden layer (W_{h2}, b_{h2}); output layer (W_o, b_o) producing y_1, y_2]
Neural Network Intuition
Question: What is a Neural Network?
Answer: A complex mapping from an input (vector) to an output (vector)
Question: What class of functions should be considered for this mapping?
Answer: Compositions of simpler functions (a.k.a. layers); we will talk more about which specific functions next …
Question: What does a hidden unit do?
Answer: It can be thought of as a classifier or a feature.
Question: Why have many layers?
Answer: 1) More layers = more complex functional mapping; 2) More efficient due to distributed representation
*slide from Marc'Aurelio Ranzato
Multi-layer Neural Network
Recall the linear activation a(x) = x => the entire neural network is linear, which is not expressive.
Why? W_o (W_{h2} (W_{h1} x + b_{h1}) + b_{h2}) + b_o = [W_o W_{h2} W_{h1}] x + [W_o W_{h2} b_{h1} + W_o b_{h2} + b_o] = W' x + b'
[Figure: the same multi-layer network with parameters W_{h1}, b_{h1}, W_{h2}, b_{h2}, W_o, b_o]
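The collapse of stacked linear layers into a single affine map can be verified numerically; a sketch with made-up shapes and values, using the W x + b convention (rather than the Wᵀx + b convention of the earlier slides):

```python
import numpy as np

# Three linear layers applied in sequence vs. the equivalent single affine map.
rng = np.random.default_rng(1)
W_h1, b_h1 = rng.normal(size=(4, 5)), rng.normal(size=4)
W_h2, b_h2 = rng.normal(size=(3, 4)), rng.normal(size=3)
W_o,  b_o  = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=5)
deep = W_o @ (W_h2 @ (W_h1 @ x + b_h1) + b_h2) + b_o

W_prime = W_o @ W_h2 @ W_h1                       # collapsed weight W'
b_prime = W_o @ W_h2 @ b_h1 + W_o @ b_h2 + b_o    # collapsed bias b'
print(np.allclose(deep, W_prime @ x + b_prime))   # → True
```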
One-layer Neural Network
[Figure: the same one-layer network, now with a sigmoid activation]
Sigmoid Activation: a(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \quad a'(x) = \mathrm{sigmoid}(x)\,(1 - \mathrm{sigmoid}(x))
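The derivative formula a'(x) = sigmoid(x)(1 − sigmoid(x)) can be sanity-checked against a finite difference; a minimal sketch (the helper names are made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # a'(x) = sigmoid(x) * (1 - sigmoid(x))

x, h = 0.3, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
print(abs(numeric - sigmoid_grad(x)) < 1e-8)            # → True
```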
Light Theory: Neural Network as Universal Approximator
A neural network can approximate any continuous function arbitrarily well, for every value of possible inputs.
The guarantee is that by using enough hidden neurons we can always find a neural network g(x) whose output satisfies |g(x) - f(x)| < \epsilon, for an arbitrarily small \epsilon.
*slide adapted from http://neuralnetworksanddeeplearning.com/chap4.html
Light Theory: Neural Network as Universal Approximator
Let's start with a simple network: one hidden layer with two hidden neurons and a single output layer with one neuron (with sigmoid activations), computing g(x).
Let's look at the output of one (hidden) neuron as a function of its parameters (weight, bias).
*slide adapted from http://neuralnetworksanddeeplearning.com/chap4.html
Light Theory: Neural Network as Universal Approximator
By dialing up the weight (e.g., w = 999) we can actually create a "step" function. Location of the step: s = -b / w.
It is easier to work with sums of step functions, so we can assume that every neuron outputs a step function.
*slide adapted from http://neuralnetworksanddeeplearning.com/chap4.html
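The step-function behaviour is easy to see numerically; a sketch with the slide's w = 999 and a made-up step location s = 0.4:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w, s = 999.0, 0.4                 # large weight; step located at s = -b/w
b = -w * s
vals = sigmoid(w * np.array([0.39, 0.41]) + b)   # just left / right of s
print(vals)                       # ≈ [0, 1]: the neuron acts like a step
```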
Light Theory: Neural Network as Universal Approximator
The output neuron computes a weighted combination of step functions (assuming the bias for that layer is 0).
*slide adapted from http://neuralnetworksanddeeplearning.com/chap4.html
Light Theory: Neural Network as Universal Approximator
[Figure: weighted step functions are combined into "bumps" that build a piecewise-constant approximation of the target function]
This is a Riemann sum approximation.
*slide adapted from http://neuralnetworksanddeeplearning.com/chap4.html
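The Riemann-sum picture can be sketched in code: below, a made-up target f(x) = x² on [0, 1] is approximated by a sum of step functions (each "bump" is a difference of two steps); all names and the choice of n are illustrative:

```python
import numpy as np

# Approximate f on [0, 1] with n piecewise-constant "bumps" built from steps.
def step_sum_approx(f, xs, n):
    edges = np.linspace(0.0, 1.0, n + 1)
    g = np.zeros_like(xs)
    for left, right in zip(edges[:-1], edges[1:]):
        height = f((left + right) / 2)      # bump height = f at cell midpoint
        g += height * (np.heaviside(xs - left, 1.0)
                       - np.heaviside(xs - right, 1.0))
    return g

xs = np.linspace(0.0, 0.999, 500)
err = np.max(np.abs(step_sum_approx(lambda x: x**2, xs, 200) - xs**2))
print(err < 0.01)   # → True: more steps -> better approximation
```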
Light Theory: Neural Network as Universal Approximator
Conditions needed for the proof to hold: the activation function needs to be well defined, with
\lim_{x \to +\infty} a(x) = A, \quad \lim_{x \to -\infty} a(x) = B, \quad A \neq B
*slide adapted from http://neuralnetworksanddeeplearning.com/chap4.html
Light Theory: Neural Network as Universal Approximator
Universal Approximation Theorem: A single hidden layer can approximate any continuous function with compact support to arbitrary accuracy, when the width goes to infinity. [Hornik et al., 1989]
Universal Approximation Theorem (revised): A network of infinite depth with a hidden layer of size d + 1 neurons, where d is the dimension of the input space, can approximate any continuous function. [Lu et al., NIPS 2017]
Universal Approximation Theorem (further revised): A ResNet with a single hidden unit and infinite depth can approximate any continuous function. [Lin and Jegelka, NIPS 2018]
One-layer Neural Network (recap)
Sigmoid Activation: a(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \quad a'(x) = \mathrm{sigmoid}(x)\,(1 - \mathrm{sigmoid}(x))
Learning Parameters of One-layer Neural Network
L(W, b) = \sum_{d=1}^{|D_{train}|} \left( \mathrm{sigmoid}\left( W^T x^{(d)} + b \right) - y^{(d)} \right)^2
W^*, b^* = \arg\min_{W,b} L(W, b)
Solution: set the derivatives to zero,
\frac{\partial L(W, b)}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_{d=1}^{|D_{train}|} \left( \mathrm{sigmoid}\left( W^T x^{(d)} + b \right) - y^{(d)} \right)^2 = 0
Problem: No closed-form solution of \frac{\partial L(W, b)}{\partial w_{ji}} = 0
*slide adapted from V. Ordonez
Gradient Descent (review)
L(W, b) = \sum_{d=1}^{|D_{train}|} \left( \mathrm{sigmoid}\left( W^T x^{(d)} + b \right) - y^{(d)} \right)^2
1. Start from random values W_0, b_0
For k = 0 to the max number of iterations:
2. Compute the gradient of the loss with respect to the previous (initial) parameters: \nabla L(W, b)|_{W=W_k, b=b_k}
3. Re-estimate the parameters:
W_{k+1} = W_k - \lambda \left. \frac{\partial L(W, b)}{\partial W} \right|_{W=W_k, b=b_k}, \quad b_{k+1} = b_k - \lambda \left. \frac{\partial L(W, b)}{\partial b} \right|_{W=W_k, b=b_k}
\lambda is the learning rate
*slide adapted from V. Ordonez
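The loop above can be sketched for the one-layer sigmoid network with a squared loss; the data, learning rate, and iteration count are made up, and the gradient is averaged over the data rather than summed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic, noiseless data so gradient descent can drive the loss to ~0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true, b_true = np.array([0.5, -0.5, 0.25]), 0.1
y = sigmoid(X @ w_true + b_true)

w, b, lr = np.zeros(3), 0.0, 2.0          # lr plays the role of lambda
for _ in range(5000):
    p = sigmoid(X @ w + b)
    r = 2 * (p - y) * p * (1 - p) / len(y)  # chain rule through the sigmoid
    w -= lr * X.T @ r                       # step along -dL/dw
    b -= lr * r.sum()                       # step along -dL/db
loss = np.sum((sigmoid(X @ w + b) - y) ** 2)
print(loss)   # a very small number: the parameters were recovered
```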
Stochastic Gradient Descent (review)
Problem: For large datasets computing the sum
\frac{\partial L(W, b)}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_{d=1}^{|D_{train}|} \left( \mathrm{sigmoid}\left( W^T x^{(d)} + b \right) - y^{(d)} \right)^2
is expensive.
Solution: Compute an approximate gradient with mini-batches of much smaller size (as little as 1 example sometimes).
Problem: How do we compute the actual gradient?
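Why mini-batches work: a mini-batch gradient is an unbiased estimate of the full-batch gradient. A sketch on a least-squares loss (all sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)                                  # current parameters
full_grad = 2 * X.T @ (X @ w - y) / len(y)       # exact (full-batch) gradient

est = np.zeros(5)
for _ in range(500):
    idx = rng.choice(len(y), size=64, replace=False)        # one mini-batch
    est += 2 * X[idx].T @ (X[idx] @ w - y[idx]) / 64
est /= 500                                       # average mini-batch gradient

rel_err = np.linalg.norm(est - full_grad) / np.linalg.norm(full_grad)
print(rel_err < 0.1)   # → True: mini-batch gradients track the full gradient
```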
Numerical Differentiation
We can approximate the gradient numerically, using forward differencing:
\frac{\partial f(x)}{\partial x_i} \approx \frac{f(x + h \mathbf{1}_i) - f(x)}{h}
Even better, we can use central differencing:
\frac{\partial f(x)}{\partial x_i} \approx \frac{f(x + h \mathbf{1}_i) - f(x - h \mathbf{1}_i)}{2h}
(\mathbf{1}_i is a vector of all zeros, except for a 1 in the i-th location; h is small, e.g., h = 0.000001)
However, both of these suffer from rounding errors and are not good enough for learning (they are very good tools for checking the correctness of an implementation though).
*slide adapted from T. Chen, H. Shen, A. Krishnamurthy, CSE 599G1 lecture at UWashington
Numerical Differentiation
Applied to the loss (\mathbf{1}_{ij} is a matrix of all zeros, except for a 1 in the (i,j)-th location; \mathbf{1}_j a vector with a 1 in the j-th location):
\frac{\partial L(W, b)}{\partial w_{ij}} \approx \frac{L(W + h\mathbf{1}_{ij}, b) - L(W, b)}{h} \quad \text{or} \quad \frac{\partial L(W, b)}{\partial w_{ij}} \approx \frac{L(W + h\mathbf{1}_{ij}, b) - L(W - h\mathbf{1}_{ij}, b)}{2h}
\frac{\partial L(W, b)}{\partial b_j} \approx \frac{L(W, b + h\mathbf{1}_j) - L(W, b)}{h} \quad \text{or} \quad \frac{\partial L(W, b)}{\partial b_j} \approx \frac{L(W, b + h\mathbf{1}_j) - L(W, b - h\mathbf{1}_j)}{2h}
However, both of these suffer from rounding errors and are not good enough for learning (they are very good tools for checking the correctness of an implementation though, e.g., use h = 0.000001).
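A gradient check along these lines can be sketched as follows; the test function f and all names are made up, and the point is that central differencing is much more accurate than forward differencing:

```python
import numpy as np

def f(x):
    return np.sum(x ** 2) + np.prod(x)   # simple function with a known gradient

def numeric_grad(f, x, h, central=True):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h            # h times the 1_i vector
        if central:
            g[i] = (f(x + e) - f(x - e)) / (2 * h)
        else:
            g[i] = (f(x + e) - f(x)) / h
    return g

x = np.array([0.5, -1.0, 2.0])
exact = 2 * x + np.array([x[1]*x[2], x[0]*x[2], x[0]*x[1]])  # analytic gradient
fwd = numeric_grad(f, x, 1e-6, central=False)
cen = numeric_grad(f, x, 1e-6, central=True)
print(np.abs(cen - exact).max() < np.abs(fwd - exact).max())  # → True
```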
Symbolic Differentiation
The input function is represented as a computational graph (a symbolic tree).
Implements differentiation rules for composite functions:
Sum Rule: \frac{d(f(x) + g(x))}{dx} = \frac{df(x)}{dx} + \frac{dg(x)}{dx}
Product Rule: \frac{d(f(x) \cdot g(x))}{dx} = \frac{df(x)}{dx} g(x) + f(x) \frac{dg(x)}{dx}
Chain Rule: \frac{d(f(g(x)))}{dx} = \frac{df(g)}{dg} \cdot \frac{dg(x)}{dx}
Problem: For complex functions, expressions can be exponentially large; it is also difficult to deal with piece-wise functions (creates many symbolic cases).
[Figure: computational graph of y = f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2), with nodes v_0 … v_6]
*slide adapted from T. Chen, H. Shen, A. Krishnamurthy, CSE 599G1 lecture at UWashington
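Symbolic differentiation of the slide's example can be reproduced with SymPy (assuming SymPy is available; the variable names are chosen to match the slide):

```python
import sympy as sp

# y = f(x1, x2) = ln(x1) + x1*x2 - sin(x2)
x1, x2 = sp.symbols('x1 x2')
f = sp.log(x1) + x1 * x2 - sp.sin(x2)

df_dx1 = sp.diff(f, x1)    # expected: 1/x1 + x2
df_dx2 = sp.diff(f, x2)    # expected: x1 - cos(x2)
print(df_dx1, df_dx2)
```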
Automatic Differentiation (AutoDiff)
Intuition: Interleave symbolic differentiation and simplification.
Key Idea: apply symbolic differentiation at the elementary operation level, evaluate and keep intermediate results.
y = f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)
Success of deep learning owes A LOT to the success of AutoDiff algorithms (also to advances in parallel architectures, and large datasets, …)
*slide adapted from T. Chen, H. Shen, A. Krishnamurthy, CSE 599G1 lecture at UWashington
![Page 50: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/50.jpg)
Automatic Differentiation (AutoDiff)
Each node is an input, intermediate, or output variable
Computational graph (a DAG) with variable ordering from topological sort.
y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)
*slide adopted from T. Chen, H. Shen, A. Krishnamurthy CSE 599G1 lecture at UWashington
![Page 51: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/51.jpg)
Automatic Differentiation (AutoDiff)
Each node is an input, intermediate, or output variable
Computational graph (a DAG) with variable ordering from topological sort.
y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)
Let's see how we can evaluate a function using the computational graph (DNN inference)
v0 = x1
v1 = x2
v2 = ln(v0)
v3 = v0 · v1
v4 = sin(v1)
v5 = v2 + v3
v6 = v5 − v4
y = v6
Computational graph is governed by these equations
*slide adopted from T. Chen, H. Shen, A. Krishnamurthy CSE 599G1 lecture at UWashington
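The equations above translate line-for-line into code; here is a minimal sketch of evaluating the graph (function and variable names simply mirror the slide's v0…v6, and are not from the slides themselves):

```python
import math

def f_trace(x1, x2):
    """Evaluate y = ln(x1) + x1*x2 - sin(x2) step by step,
    mirroring the computational-graph equations v0..v6."""
    v0 = x1              # input node
    v1 = x2              # input node
    v2 = math.log(v0)    # ln
    v3 = v0 * v1         # product
    v4 = math.sin(v1)    # sin
    v5 = v2 + v3         # sum
    v6 = v5 - v4         # difference
    return v6            # y = v6

print(round(f_trace(2, 5), 3))  # 11.652
```

Each assignment corresponds to one node of the DAG, evaluated in topological order.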
![Page 56: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/56.jpg)
Automatic Differentiation (AutoDiff)

Each node is an input, intermediate, or output variable.
Computational graph (a DAG) with variable ordering from topological sort.

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

Let's see how we can evaluate a function using the computational graph (DNN inference).

Forward Evaluation Trace for f(2, 5):

v0 = x1 = 2
v1 = x2 = 5
v2 = ln(v0) = ln(2) = 0.693
v3 = v0 · v1 = 2 × 5 = 10
v4 = sin(v1) = sin(5) = −0.959
v5 = v2 + v3 = 0.693 + 10 = 10.693
v6 = v5 − v4 = 10.693 + 0.959 = 11.652
y = v6 = 11.652
![Page 58: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/58.jpg)
AutoDiff - Forward Mode

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

Let's see how we can evaluate a derivative using the computational graph (DNN learning), e.g.

∂f(x1, x2)/∂x1 |(x1=2, x2=5)

We will do this with forward mode first, by introducing a derivative of each variable node with respect to the input variable.
![Page 59: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/59.jpg)
AutoDiff - Forward Mode

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

Forward Derivative Trace (symbolic differentiation applied per elementary operation):

∂v0/∂x1 = 1
∂v1/∂x1 = 0
∂v2/∂x1 = (1/v0) · ∂v0/∂x1
∂v3/∂x1 = ∂v0/∂x1 · v1 + v0 · ∂v1/∂x1
∂v4/∂x1 = ∂v1/∂x1 · cos(v1)
∂v5/∂x1 = ∂v2/∂x1 + ∂v3/∂x1
∂v6/∂x1 = ∂v5/∂x1 − ∂v4/∂x1
∂y/∂x1 = ∂v6/∂x1
![Page 63: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/63.jpg)
AutoDiff - Forward Mode

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

Forward Derivative Trace, evaluated at (x1 = 2, x2 = 5):

∂v0/∂x1 = 1
∂v1/∂x1 = 0
∂v2/∂x1 = (1/v0) · ∂v0/∂x1 = 1/2 · 1 = 0.5   (Chain Rule)
∂v3/∂x1 = ∂v0/∂x1 · v1 + v0 · ∂v1/∂x1 = 1·5 + 2·0 = 5   (Product Rule)
∂v4/∂x1 = ∂v1/∂x1 · cos(v1) = 0 · cos(5) = 0
∂v5/∂x1 = ∂v2/∂x1 + ∂v3/∂x1 = 0.5 + 5 = 5.5
∂v6/∂x1 = ∂v5/∂x1 − ∂v4/∂x1 = 5.5 − 0 = 5.5
∂y/∂x1 = ∂v6/∂x1 = 5.5
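A standard way to implement this forward derivative trace is with dual numbers, which carry a (value, derivative) pair through the same graph. The sketch below is illustrative; the `Dual` class and helper names are our own, not from the slides:

```python
import math

class Dual:
    """Carries (value, derivative w.r.t. one chosen input) through the graph."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):  # sum rule
        return Dual(self.val + o.val, self.dot + o.dot)
    def __sub__(self, o):  # difference rule
        return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o):  # product rule
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

def d_log(x):  # chain rule through ln
    return Dual(math.log(x.val), x.dot / x.val)

def d_sin(x):  # chain rule through sin
    return Dual(math.sin(x.val), x.dot * math.cos(x.val))

def f(x1, x2):
    return d_log(x1) + x1 * x2 - d_sin(x2)

# One forward pass per input: seed the input we differentiate w.r.t. with dot=1
y = f(Dual(2.0, 1.0), Dual(5.0, 0.0))
print(round(y.dot, 1))  # df/dx1 at (2, 5) -> 5.5
```

To get ∂f/∂x2 we would need a second forward pass with the seeds swapped, which is exactly the m-passes-for-m-inputs cost discussed next.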
![Page 64: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/64.jpg)
AutoDiff - Forward Mode

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

We now have: ∂f(x1, x2)/∂x1 |(x1=2, x2=5) = 5.5

Still need: ∂f(x1, x2)/∂x2 |(x1=2, x2=5)
![Page 65: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/65.jpg)
AutoDiff - Forward Mode

y = f(x) : R^m → R^n

Forward mode needs m forward passes to get a full Jacobian (all gradients of the output with respect to each input), where m is the number of inputs.

Problem: a DNN typically has a large number of inputs (an image as an input, plus all the weights and biases of the layers = millions of inputs!) and very few outputs (many DNNs have n = 1).

Automatic differentiation in reverse mode computes all gradients in n backwards passes (so, for most DNNs, in a single backward pass: back propagation).
*slide adopted from T. Chen, H. Shen, A. Krishnamurthy CSE 599G1 lecture at UWashington
![Page 68: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/68.jpg)
AutoDiff - Reverse Mode

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2), with the forward evaluation trace values for f(2, 5) stored from the forward pass.

Traverse the original graph in the reverse topological order and for each node in the original graph introduce an adjoint node, which computes the derivative of the output with respect to the local node (using the Chain Rule):

v̄i = ∂yj/∂vi = Σ_{k ∈ pa(i)} (∂vk/∂vi) · (∂yj/∂vk) = Σ_{k ∈ pa(i)} (∂vk/∂vi) · v̄k

Each term combines a "local" derivative ∂vk/∂vi with an already-computed adjoint v̄k.
![Page 69: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/69.jpg)
AutoDiff - Reverse Mode

Backwards Derivative Trace (in reverse topological order):

v̄6 = ∂y/∂v6
v̄5 = v̄6 · ∂v6/∂v5
v̄4 = v̄6 · ∂v6/∂v4
v̄3 = v̄5 · ∂v5/∂v3
v̄2 = v̄5 · ∂v5/∂v2
v̄1 = v̄3 · ∂v3/∂v1 + v̄4 · ∂v4/∂v1
v̄0 = v̄3 · ∂v3/∂v0 + v̄2 · ∂v2/∂v0
![Page 76: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/76.jpg)
AutoDiff - Reverse Mode

Backwards Derivative Trace, evaluated at (x1 = 2, x2 = 5):

v̄6 = ∂y/∂v6 = 1
v̄5 = v̄6 · ∂v6/∂v5 = v̄6 · 1 = 1
v̄4 = v̄6 · ∂v6/∂v4 = v̄6 · (−1) = −1
v̄3 = v̄5 · ∂v5/∂v3 = v̄5 · 1 = 1
v̄2 = v̄5 · ∂v5/∂v2 = v̄5 · 1 = 1
v̄1 = v̄3 · ∂v3/∂v1 + v̄4 · ∂v4/∂v1 = v̄3 · v0 + v̄4 · cos(v1) = 1.716
v̄0 = v̄3 · ∂v3/∂v0 + v̄2 · ∂v2/∂v0 = v̄3 · v1 + v̄2 · (1/v0) = 5.5

So ∂f/∂x1 = v̄0 = 5.5 and ∂f/∂x2 = v̄1 = 1.716, both obtained in a single backward pass.
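The backwards derivative trace can be written out as an explicit two-sweep procedure: run the forward equations, store the intermediate values, then accumulate the adjoints in reverse. A minimal sketch (the adjoints v̄0…v̄6 are named b0…b6 here; naming is ours, not the slides'):

```python
import math

def f_grad(x1, x2):
    """Reverse-mode sweep for y = ln(x1) + x1*x2 - sin(x2).
    Returns (y, dy/dx1, dy/dx2) from ONE backward pass."""
    # forward evaluation trace (intermediate values stored for the backward pass)
    v0, v1 = x1, x2
    v2 = math.log(v0)
    v3 = v0 * v1
    v4 = math.sin(v1)
    v5 = v2 + v3
    v6 = v5 - v4
    # backwards derivative trace (adjoints, reverse topological order)
    b6 = 1.0                            # dy/dv6
    b5 = b6 * 1.0                       # dv6/dv5 = 1
    b4 = b6 * (-1.0)                    # dv6/dv4 = -1
    b3 = b5 * 1.0                       # dv5/dv3 = 1
    b2 = b5 * 1.0                       # dv5/dv2 = 1
    b1 = b3 * v0 + b4 * math.cos(v1)    # dv3/dv1 = v0, dv4/dv1 = cos(v1)
    b0 = b3 * v1 + b2 / v0              # dv3/dv0 = v1, dv2/dv0 = 1/v0
    return v6, b0, b1

y, dx1, dx2 = f_grad(2.0, 5.0)
print(round(dx1, 1), round(dx2, 3))  # 5.5 1.716
```

Unlike the forward mode above, both input gradients fall out of the same backward sweep, which is why reverse mode (back propagation) is the right fit for many-inputs, one-output DNNs.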
![Page 77: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/77.jpg)
Automatic Differentiation (AutoDiff)

y = f(x1, x2) = ln(x1) + x1x2 − sin(x2)

AutoDiff can be done at various granularities:
- Elementary function granularity: each node is an elementary operation (ln, ·, sin, +, −), as in the graph used above.
- Complex function granularity: a node can be a larger composite function (e.g., a single node computing y directly from x1, x2), with its local derivative implemented as a unit.
![Page 78: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/78.jpg)
Backpropagation Practical Issues

[Fully-connected network diagram: Input Layer (x1…x5), 1st Hidden Layer with parameters Wh1, bh1, 2nd Hidden Layer with parameters Wh2, bh2, Output Layer with parameters Wo, bo producing y1, y2]
Easier to deal with in vector form
![Page 79: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/79.jpg)
Backpropagation Practical Issues

y = f(W, b, x) = sigmoid(W · x + b)

[Graph: inputs x, W, b feeding the layer, producing output y]
![Page 80: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/80.jpg)
Backpropagation Practical Issues

y = f(W, b, x) = sigmoid(W · x + b)

"local" Jacobians (matrices of partial derivatives, e.g. of size |x| × |y|):

∂y/∂x, ∂y/∂W, ∂y/∂b

"backprop" gradient arriving from the loss, ∂L(·,·)/∂y, is pushed through the local Jacobians:

∂L(·,·)/∂x = (∂y/∂x) · ∂L(·,·)/∂y
∂L(·,·)/∂W = (∂y/∂W) · ∂L(·,·)/∂y
∂L(·,·)/∂b = (∂y/∂b) · ∂L(·,·)/∂y
![Page 81: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/81.jpg)
Jacobian of Sigmoid layer

Element-wise sigmoid layer: y = sigmoid(x), with x, y ∈ R^2048

What is the dimension of the Jacobian? What does it look like? (It is 2048 × 2048, but diagonal, since each output depends only on the corresponding input.)

If we are working with a mini-batch of 100 input-output pairs, technically the Jacobian is a 204,800 × 204,800 matrix.
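Because the sigmoid acts element-wise, that huge Jacobian is diagonal, so in practice it is never materialized: the backward pass reduces to an element-wise multiply. A pure-Python sketch (function names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_vjp(x, grad_y):
    """Backprop through an element-wise sigmoid layer.
    The full Jacobian dy/dx is diagonal with entries s * (1 - s),
    so the vector-Jacobian product is just an element-wise multiply."""
    s = [sigmoid(v) for v in x]
    return [g * si * (1.0 - si) for g, si in zip(grad_y, s)]

x = [0.0, 1.0, -2.0]
grad_y = [1.0, 1.0, 1.0]   # dL/dy flowing in from upstream
print(sigmoid_vjp(x, grad_y))
```

The same trick applies to any element-wise layer (ReLU, tanh, …): store the diagonal, multiply element-wise, never form the matrix.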
![Page 82: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/82.jpg)
Backpropagation: Common questions
Question: Does BackProp only work for certain layers? Answer: No, it works for any differentiable function

Question: What is the computational cost of BackProp? Answer: On average, about twice the cost of the forward pass

Question: Is BackProp a dual of forward propagation? Answer: Yes

[Figure: in the dual view, a Sum node in FProp becomes a Copy node in BackProp, and a Copy node becomes a Sum]
* Adopted from slides by Marc’Aurelio Ranzato
![Page 83: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/83.jpg)
Activation Function: Sigmoid

[Figure: Input Layer (x1 … x5), weighted sums, activation functions a(·), Output Layer (y1, y2)]

a(x) = sigmoid(x) = 1 / (1 + e^{-x})
a'(x) = sigmoid(x) (1 - sigmoid(x))

[Plot: Sigmoid Activation]
![Page 84: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/84.jpg)
Computational Graph: 1-layer network

[Figure: computational graph: inputs x_i, W, b feed a = W · x + b, then a sigmoid producing output o, then the loss l = MSE loss against the target y_i]
![Page 85: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/85.jpg)
![Page 86: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/86.jpg)
Activation Function: Sigmoid

a(x) = sigmoid(x) = 1 / (1 + e^{-x})
a'(x) = sigmoid(x) (1 - sigmoid(x))

[Plot: Sigmoid Activation]

Pros:
- Squishes everything into the range [0,1]
- Can be interpreted as a "probability"
- Has a well-defined gradient everywhere

Cons:
- Saturated neurons "kill" the gradients
- Not zero-centered
- Can be expensive to compute

* slide adopted from Li, Karpathy, Johnson’s CS231n at Stanford
![Page 87: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/87.jpg)
Activation Function: Sigmoid

a(x) = sigmoid(x) = 1 / (1 + e^{-x})
a'(x) = sigmoid(x) (1 - sigmoid(x))

[Plot: Sigmoid Activation]

Cons:
- Saturated neurons "kill" the gradients
- Not zero-centered
- Can be expensive to compute

Sigmoid Gate: a = sigmoid(x)

∂L/∂x = ∂sigmoid(x)/∂x · ∂L/∂a

When |x| is large, ∂sigmoid(x)/∂x ≈ 0, so almost no gradient flows back through a saturated gate.
* slide adopted from Li, Karpathy, Johnson’s CS231n at Stanford
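To see the saturation numerically, here is a small sketch (plain NumPy, not from the slides) evaluating the local sigmoid gradient at increasingly large inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # a'(x) = sigmoid(x) (1 - sigmoid(x))
    s = sigmoid(z)
    return s * (1.0 - s)

# The local gradient peaks at 0.25 (at x = 0) and vanishes as |x|
# grows: a saturated neuron passes almost nothing back.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {z:5.1f}   sigmoid'(x) = {sigmoid_grad(z):.6f}")
```

At x = 10 the local gradient is on the order of 1e-5, so whatever gradient arrives from above is effectively killed.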
![Page 88: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/88.jpg)
Activation Function: Tanh

[Plot: Tanh Activation]

* slide adopted from Li, Karpathy, Johnson’s CS231n at Stanford

a(x) = tanh(x) = 2 · sigmoid(2x) - 1 = 2 / (1 + e^{-2x}) - 1

Pros:
- Squishes everything in the range [-1,1]
- Centered around zero
- Has well defined gradient everywhere

Cons:
- Saturated neurons "kill" the gradients
![Page 89: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/89.jpg)
Activation Function: Rectified Linear Unit (ReLU)

[Plot: ReLU Activation]

* slide adopted from Li, Karpathy, Johnson’s CS231n at Stanford

a(x) = max(0, x)
a'(x) = { 1 if x ≥ 0; 0 if x < 0 }

Pros:
- Does not saturate (for x > 0)
- Computationally very efficient
- Converges faster in practice (e.g. 6 times faster)

Cons:
- Not zero centered
![Page 90: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/90.jpg)
Activation Function: Rectified Linear Unit (ReLU)

[Plot: ReLU Activation]

* slide adopted from Li, Karpathy, Johnson’s CS231n at Stanford

a(x) = max(0, x)
a'(x) = { 1 if x ≥ 0; 0 if x < 0 }

Question: What do ReLU layers accomplish?
Answer: A locally linear tiling: the network partitions the input space into regions, and the function is linear within each region.
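A small numerical check of this claim (a toy 1-hidden-layer net with arbitrary random weights, my construction): two inputs with the same ReLU activation pattern lie in the same region, and there the network has exactly the same slope:

```python
import numpy as np

# A tiny 1-hidden-layer ReLU net; weights are arbitrary, fixed by seed.
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def pattern(x):
    # Which ReLUs fire at x; this pattern defines the linear region.
    return W1 @ np.array([x]) + b1 > 0

def region_slope(x):
    # Inside a region, ReLU just masks rows of W1, so the network is
    # exactly linear in x with this slope.
    act = pattern(x).astype(float)
    return (W2 @ (act * W1[:, 0])).item()

x0, x1 = 0.30, 0.31
if np.array_equal(pattern(x0), pattern(x1)):
    print("same region, slopes identical:", region_slope(x0) == region_slope(x1))
else:
    print("points straddle a kink; slopes:", region_slope(x0), region_slope(x1))
```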
![Page 91: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/91.jpg)
Conditions needed to prove NN is a universal approximator: Activation function needs to be well defined

Recall:

lim_{x→∞} a(x) = A,   lim_{x→-∞} a(x) = B,   A ≠ B
*slide adopted from http://neuralnetworksanddeeplearning.com/chap4.html
![Page 92: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/92.jpg)
Fun Exercise: Try to prove that a network with ReLUs is still a universal approximator (not too difficult if you think about it visually)
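One visual argument, sketched in code (my construction, not from the slides): with hidden units of the form max(0, x - k_i), a single ReLU layer can realize any piecewise-linear interpolant of a continuous function on an interval, and refining the knots drives the error to zero:

```python
import numpy as np

# Piecewise-linear interpolation of a continuous target with one
# hidden ReLU layer:  f_hat(x) = f(k_0) + sum_i c_i * max(0, x - k_i).
# Each unit adds a slope change at its knot, so we can match f at
# every knot exactly.
f = np.cos                                    # any continuous target on [0, 3]
knots = np.linspace(0.0, 3.0, 31)
vals = f(knots)

slopes = np.diff(vals) / np.diff(knots)       # slope on each segment
c = np.diff(slopes, prepend=slopes[0])        # slope *changes* at inner knots
c[0] = slopes[0]                              # first unit sets the first slope

def f_hat(x):
    # One hidden ReLU layer with 30 units, plus an output bias.
    return vals[0] + np.sum(c * np.maximum(0.0, x - knots[:-1]))

xs = np.linspace(0.0, 3.0, 301)
err = max(abs(f_hat(x) - f(x)) for x in xs)
print(f"max error with 30 ReLU units: {err:.4f}")
```

Doubling the number of knots roughly quarters the error, which is the visual intuition behind the exercise.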
![Page 93: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/93.jpg)
Activation Function: Leaky / Parametrized ReLU
Leaky / Parametrized ReLU Activation

* slide adopted from Li, Karpathy, Johnson’s CS231n at Stanford

a(x) = { x if x ≥ 0; αx if x < 0 }

Pros:
- Does not saturate
- Computationally very efficient
- Converges faster in practice (e.g. 6x)

Leaky: α is fixed to a small value (e.g., 0.01)
Parametrized: α is optimized as part of the network (BackProp goes through it)
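A sketch of both variants (plain NumPy, my naming): the only difference between Leaky ReLU and PReLU is whether the backward pass also returns a gradient for α:

```python
import numpy as np

def prelu_forward(x, alpha):
    # a(x) = x if x >= 0, alpha * x otherwise
    return np.where(x >= 0, x, alpha * x)

def prelu_backward(x, alpha, dL_da):
    # Local gradients: da/dx = 1 or alpha; da/dalpha = 0 or x.
    dL_dx = dL_da * np.where(x >= 0, 1.0, alpha)
    # For PReLU, alpha is a parameter, so sum its gradient over inputs.
    dL_dalpha = np.sum(dL_da * np.where(x >= 0, 0.0, x))
    return dL_dx, dL_dalpha

x = np.array([-2.0, -0.5, 0.0, 1.5])
a = prelu_forward(x, 0.01)           # leaky ReLU: alpha fixed at 0.01
print(a)                             # values: -0.02, -0.005, 0.0, 1.5
dL_dx, dL_dalpha = prelu_backward(x, 0.01, np.ones_like(x))
print(dL_dx, dL_dalpha)
```

For Leaky ReLU you simply discard dL_dalpha; for PReLU you feed it to the optimizer like any other parameter gradient.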
![Page 94: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/94.jpg)
Computational Graph: 1-layer with PReLU

[Figure: computational graph: inputs x_i, W, b feed a = W · x + b, then PReLU(α, ·) with learned parameter α producing output o, then the loss l = MSE loss against the target y_i]
![Page 95: Topics in AI (CPSC 532L)lsigal/532S_2018W2/Lecture2.pdf · Universal Approximation Theorem: Single hidden layer can approximate any continuous function with compact support to arbitrary](https://reader033.fdocuments.in/reader033/viewer/2022042404/5f18768e209d69574d6fe9b0/html5/thumbnails/95.jpg)
Activation Functions: Review

Sigmoid:
a(x) = sigmoid(x) = 1 / (1 + e^{-x})
a'(x) = sigmoid(x) (1 - sigmoid(x))

Tanh:
a(x) = tanh(x) = 2 · sigmoid(2x) - 1 = 2 / (1 + e^{-2x}) - 1

ReLU:
a(x) = max(0, x)
a'(x) = { 1 if x ≥ 0; 0 if x < 0 }

Leaky / Parametrized ReLU:
a(x) = { x if x ≥ 0; αx if x < 0 }
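As a closing sanity check (my sketch, plain NumPy), all four activations from this review and their derivatives can be verified against finite differences:

```python
import numpy as np

# The four review activations with their derivatives, checked by
# finite differences at a few points (away from the ReLU kink at 0).
acts = {
    "sigmoid": (lambda x: 1 / (1 + np.exp(-x)),
                lambda x: (s := 1 / (1 + np.exp(-x))) * (1 - s)),
    "tanh":    (np.tanh,
                lambda x: 1 - np.tanh(x) ** 2),
    "relu":    (lambda x: np.maximum(0.0, x),
                lambda x: np.where(x >= 0, 1.0, 0.0)),
    "leaky":   (lambda x: np.where(x >= 0, x, 0.01 * x),
                lambda x: np.where(x >= 0, 1.0, 0.01)),
}

eps = 1e-6
for name, (a, da) in acts.items():
    for x in [-3.3, -0.7, 0.9, 2.1]:
        numeric = (a(x + eps) - a(x - eps)) / (2 * eps)
        assert abs(numeric - da(x)) < 1e-5, (name, x)
print("all activation derivatives match finite differences")
```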