DNN: Final Presentation
Ryan Jeong, Will Barton, Maricela Ramirez
Summer@ICERM
June-July 2020
Ryan Jeong, Will Barton, Maricela Ramirez (Summer@ICERM)DNN: Final Presentation June-July 2020 1 / 29
Outline
1 Deep Learning/Neural Networks Fundamentals
2 Universal Approximation Results with Neural Nets
3 Piecewise Linear Networks
4 On Expressivity vs. Learnability
Motivation
Sources: Jay Alammar, How GPT3 Works; European Go Federation
DL Fundamentals
Figure: Source: Virtual Labs, Multilayer Feedforward networks
General idea of a feedforward network:

z^(l) = W^(l) a^(l-1) + b^(l)

a^(l) = ReLU(z^(l))
DL Fundamentals
Activation function
Cost function
Minimize the cost function through backpropagation

Gradient Descent:

∂C_0/∂w^(l) = ∂C_0/∂a^(l) · ∂a^(l)/∂z^(l) · ∂z^(l)/∂w^(l)

∂C_0/∂b^(l) = ∂C_0/∂a^(l) · ∂a^(l)/∂z^(l) · ∂z^(l)/∂b^(l)
Stochastic Gradient Descent
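The chain-rule factorization above can be checked numerically on a single hypothetical ReLU neuron with quadratic cost; the finite-difference gradients should agree with the analytic ones:

```python
def relu(z):
    return max(z, 0.0)

def cost(w, b, x, y):
    a = relu(w * x + b)         # forward pass for one neuron
    return 0.5 * (a - y) ** 2   # quadratic cost C_0

# Chain rule: dC0/dw = dC0/da * da/dz * dz/dw (and dz/db = 1 for the bias)
x, y, w, b = 2.0, 0.0, 0.5, 0.1
z = w * x + b
a = relu(z)
dC_da = a - y
da_dz = 1.0 if z > 0 else 0.0   # ReLU derivative
dC_dw = dC_da * da_dz * x
dC_db = dC_da * da_dz * 1.0

# Finite-difference check of the analytic gradients
eps = 1e-6
num_dw = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
num_db = (cost(w, b + eps, x, y) - cost(w, b - eps, x, y)) / (2 * eps)
```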
Convolutional Networks
Motivation: computational issues with feedforward networks
Figure: Source: missinglink.ai
For image recognition problems, to which CNNs are largely applied:

Convolution layer: slide a small kernel of fixed weights along the image, performing the same computation at every position; activation is performed on the output of a convolution layer

Pooling layer: divides the result of a convolution into regions and computes a function on each one (usually just max); intuitively summarizes information and compresses dimension

Finishes with fully connected layers
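A minimal sketch of the two layer types, assuming a single-channel image and a "valid" sliding window (stride, padding, and multiple channels are omitted):

```python
import numpy as np

def conv2d(img, kernel):
    """Slide a small kernel over the image, performing the same
    computation at every position (cross-correlation, as in deep learning usage)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Divide the input into size x size regions and keep the max of each."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2*size, :W2*size].reshape(H2, size, W2, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.random((6, 6))
feat = np.maximum(0.0, conv2d(img, np.ones((3, 3)) / 9.0))  # ReLU on conv output
pooled = max_pool(feat)  # 4x4 feature map -> 2x2 summary
```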
SVD classification model
Initial model from linear algebraic techniques as a baseline
MNIST dataset: handwritten digit classification for digits 0 to 9
Split the data into 10 classes, by the label of each data point
Compute the SVD for each of the 10 new design matrices; the number of columns is a hyperparameter; yields matrix U_c for class c
Key idea behind the algorithm: PCA can be computed via the SVD

PCA: finding a lower-dimensional representation of the data that preserves the most variance

Basis vectors come from the SVD

Classification objective: argmin_{c=0,1,...,9} ||x − U_c(U_c^T x)||_2
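A toy sketch of the algorithm, assuming a small synthetic stand-in dataset rather than MNIST (the helper names are hypothetical):

```python
import numpy as np

def fit_class_bases(X_by_class, k):
    """For each class, compute U_c: the top-k left singular vectors of
    that class's design matrix (columns = training examples)."""
    bases = []
    for X in X_by_class:
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        bases.append(U[:, :k])  # the number of columns k is the hyperparameter
    return bases

def classify(x, bases):
    """argmin_c || x - U_c (U_c^T x) ||_2: pick the class whose subspace
    reconstructs x with the smallest residual."""
    residuals = [np.linalg.norm(x - U @ (U.T @ x)) for U in bases]
    return int(np.argmin(residuals))

# Two classes living near orthogonal directions, plus small noise
rng = np.random.default_rng(0)
d = 20
dir0, dir1 = np.zeros(d), np.zeros(d)
dir0[0], dir1[1] = 1.0, 1.0
X0 = np.outer(dir0, rng.random(30)) + 0.01 * rng.standard_normal((d, 30))
X1 = np.outer(dir1, rng.random(30)) + 0.01 * rng.standard_normal((d, 30))
bases = fit_class_bases([X0, X1], k=1)
pred = classify(dir0, bases)  # a point along class 0's direction
```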
Results from SVD Classification Algorithm
Neural networks for MNIST
Feedforward neural networks: different architectures

Best performance: 1 hidden layer with 128 units over 12 epochs, 97.8% accuracy

Convolutional neural network: two convolutional layers (convolution + max pooling) and three fully connected layers (one of them being the output)

CNN was the best model, achieving 98.35% accuracy after 3 epochs (nearing 99% accuracy after roughly 10 epochs)
Feedforward Network Results
Motivating questions in NN approximation theory
What class of functions can be approximated/expressed by a standard feedforward neural network of depth k?

How do we think about the shift from sigmoidal functions to rectified activations, which perform well in practice?

What subset of all functions that are provably approximable by neural networks are actually learnable in practice?
Classical Results: Shallow Neural Networks as Universal Approximators
Definition: say that a function F(x) ε-approximates a function f(x) on a shared domain X if

sup_{x ∈ X} |f(x) − F(x)| < ε

Definition: call a function σ(t) sigmoidal if

σ(t) → 1 as t → ∞ and σ(t) → 0 as t → −∞

Cybenko (1989): Shallow neural networks (one hidden layer) with continuous sigmoidal activations are universal approximators, i.e. capable of ε-approximating any continuous function defined on the unit hypercube.
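Cybenko's statement can be illustrated constructively for one concrete function: steep sigmoids behave like step functions, and summing scaled steps rebuilds f piece by piece. The target f, the number of knots N, and the steepness k below are all illustrative choices, not part of the theorem:

```python
import numpy as np

def sigmoid(t):
    # Numerically stable logistic function
    return np.exp(np.minimum(t, 0)) / (1.0 + np.exp(-np.abs(t)))

# Approximate f on [0,1] by a one-hidden-layer net
# F(x) = sum_i c_i * sigmoid(k * (x - x_i)):
# each steep sigmoid acts like a step at knot x_i,
# and the increments c_i rebuild f piecewise.
f = lambda x: np.sin(2 * np.pi * x)
N, k = 200, 5000.0
knots = np.linspace(0.0, 1.0, N + 1)
vals = f(knots)
c = np.concatenate(([vals[0]], np.diff(vals)))  # increments of f at each knot

def F(x):
    return sum(ci * sigmoid(k * (x - xi)) for ci, xi in zip(c, knots))

xs = np.linspace(0.01, 0.99, 500)
err = np.max(np.abs(f(xs) - F(xs)))  # sup-norm error on the interior
```

With this many knots the sup-norm error is governed by the knot spacing (plus negligible sigmoid tails), so it comes out well below 0.1.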
Classical Results: Shallow Neural Networks as Universal Approximators
Loosening the restrictions on the activation function still yields the same notion of ε-approximation:

Hornik (1990): extension to any continuous and bounded activation function; support extends beyond just the unit hypercube

Leshno (1993): extension to nonpolynomial activation functions

By extension, deeper neural networks of depth k also enjoy the same theoretical guarantees.
Improved Expressivity Results with Depth
Let k denote the depth of a network. The following results are due to Rolnick and Tegmark (2018).

Let f(x) be a multivariate polynomial of finite degree d, and N(x) be a network whose nonlinear activation has nonzero Taylor coefficients up to degree d. Then there exists a number of neurons m_k(f) sufficient to ε-approximate f(x), where m_k is independent of ε.
Let f(x) = x_1^{r_1} x_2^{r_2} ··· x_n^{r_n} be a monomial, let the network N(x) have a nonlinear activation with nonzero Taylor coefficients up to degree 2d, and let m_k(f) be defined as above. Then m_1(f) is exponential, but the count is linear in a log-depth network:

m_1(f) = ∏_{i=1}^{n} (r_i + 1)

m(f) ≤ ∑_{i=1}^{n} (7⌈log_2(r_i)⌉ + 4)
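The two counts can be evaluated directly; the exponents below are illustrative:

```python
import math

def m1(r):
    """Exact neuron count for a shallow (one-hidden-layer) network:
    prod_i (r_i + 1)."""
    return math.prod(ri + 1 for ri in r)

def deep_bound(r):
    """Upper bound for a log-depth network: sum_i (7*ceil(log2 r_i) + 4)."""
    return sum(7 * math.ceil(math.log2(ri)) + 4 for ri in r)

r = (5, 3)  # exponents for f(x) = x1^5 * x2^3
shallow = m1(r)       # (5+1)(3+1) = 24
deep = deep_bound(r)  # (7*3+4) + (7*2+4) = 43
```

For small exponents the shallow count can actually be smaller; the log-depth advantage shows up for large r_i (e.g. a single exponent of 1000 gives m_1 = 1001 versus a bound of 7·10 + 4 = 74).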
ReLU Networks as Partitioning Input Space
A ReLU activation function, used between affine transformations to introduce nonlinearities in the learned function.
Review: ReLU Networks as Partitioning Input Space
Known theorems:
ReLU networks are in bijection with the appropriate class of piecewise linear functions (up to isomorphism of the network)
For one-hidden-layer neural networks (p-dimensional input, q-dimensional hidden layer), one can combinatorially show an upper bound on the number of piecewise linear regions:

r(q, p) = ∑_{i=0}^{p} (q choose i)
Montufar et al. (2014): The maximal number of linear regions of the functions computed by a NN with n_0 input units and L hidden layers, with n_i ≥ n_0 rectifiers at the i-th layer, is lower bounded by

(∏_{i=1}^{L-1} ⌊n_i/n_0⌋^{n_0}) · (∑_{j=0}^{n_0} (n_L choose j))
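Both bounds are simple combinatorial expressions and can be evaluated directly; the widths below are illustrative:

```python
from math import comb, floor

def r_shallow(q, p):
    """Upper bound on linear regions for one hidden layer of q ReLUs
    on p-dimensional input: sum_{i=0}^{p} C(q, i)."""
    return sum(comb(q, i) for i in range(p + 1))

def montufar_lower_bound(n0, hidden):
    """Montufar et al. (2014) lower bound on the maximal number of linear
    regions, for input dimension n0 and hidden widths n_1..n_L (each >= n0)."""
    *early, nL = hidden
    prod = 1
    for ni in early:
        prod *= floor(ni / n0) ** n0
    return prod * sum(comb(nL, j) for j in range(n0 + 1))

shallow = r_shallow(q=3, p=2)                     # 1 + 3 + 3 = 7
deep = montufar_lower_bound(n0=2, hidden=[4, 4])  # 2^2 * (1 + 4 + 6) = 44
```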
Visualizing the hyperplanes to depth k
First layer: hyperplanes through input space
All subsequent layers: "bent hyperplanes" that bend at the established bent hyperplanes of previous layers
Figure 1 of Raghu et al. (2017)
Expressibility vs. Learnability: Theorems
The following is from the work of Hanin and Rolnick (2019a,b) and concerns ReLU networks. The results are stated informally.

For any line segment through input space, the average number of regions intersected is linear, not exponential, in the number of neurons.

Both the number of regions and the distance to the nearest region boundary stay roughly constant during training.
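The segment-crossing statement is easy to probe numerically. The sketch below (a minimal illustration of the idea, not Hanin and Rolnick's experimental setup; the network sizes are arbitrary) walks a segment through the input space of a small random ReLU net and counts region crossings via changes of the activation pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

# small random ReLU net: 2-D input, two hidden layers of width 16
W1, b1 = rng.standard_normal((16, 2)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((16, 16)), rng.standard_normal(16)

def activation_pattern(x):
    # which neurons are active determines the current linear region
    h1 = np.maximum(W1 @ x + b1, 0.0)
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return tuple((h1 > 0).tolist() + (h2 > 0).tolist())

# walk a line segment through input space; each change of activation
# pattern marks a crossing into a new linear region
p, q = np.array([-3.0, -3.0]), np.array([3.0, 3.0])
ts = np.linspace(0.0, 1.0, 20001)
patterns = [activation_pattern(p + t * (q - p)) for t in ts]
crossings = sum(a != b for a, b in zip(patterns, patterns[1:]))
print(crossings)
```

For random nets like this the count stays modest, consistent with an average that is linear in the 32 neurons rather than exponential.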
Expressibility vs. Learnability: Graphs
(a) Over Epochs (b) Over Test Accuracy
Figure: Normalization by squared number of neurons
Across different networks, the number of piecewise linear regions is $O(n^2)$ in the number of neurons $n$, and this does not change with greater depth.

Upshot: the empirical success of depth on certain problems is not because deep nets learn a complex function inaccessible to shallow networks.
Bridging the Gap
Two next questions (in the context of ReLU networks):

Are networks that learn an exponential number of linear regions "usable", or are they purely a theoretical guarantee?

How can we adjust the traditional deep learning pipeline to allow learning of piecewise linear functions in the exponential regime?
Sawtooth functions/triangular wave functions
A type of function that achieves a number of linear regions exponential in the number of nodes/depth.
(a) No Perturbation (b) Random Perturbation
Figure: Sawtooth functions
Sawtooth functions as feedforward neural networks
To construct a feedforward neural network that represents a sawtooth function with $2^n$ affine regions using $2n$ hidden layers:

Mirror map, defined on $[0, 1]$:

$$f(x) = \begin{cases} 2x & 0 \leq x \leq \frac{1}{2} \\ 2(1 - x) & \frac{1}{2} \leq x \leq 1 \end{cases}$$

As a two-layer neural network:

$$f(x) = \mathrm{ReLU}\!\left(2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}\!\left(x - \tfrac{1}{2}\right)\right)$$

Composing the mirror map with itself $n$ times yields a sawtooth function with $2^n$ equally spaced affine components on $[0, 1]$.
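The construction can be verified numerically; this quick sketch (not from the slides) composes the ReLU form of the mirror map and counts the affine pieces of the resulting sawtooth:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mirror(x):
    # mirror map on [0, 1] in ReLU form:
    # f(x) = ReLU(2 ReLU(x) - 4 ReLU(x - 1/2))
    return relu(2 * relu(x) - 4 * relu(x - 0.5))

def sawtooth(x, n):
    # n-fold composition of the mirror map: 2^n affine pieces on [0, 1]
    for _ in range(n):
        x = mirror(x)
    return x

xs = np.linspace(0.0, 1.0, 100001)
ys = sawtooth(xs, 3)
# count affine pieces via sign changes of the slope
slopes = np.sign(np.diff(ys))
pieces = 1 + int(np.sum(slopes[1:] != slopes[:-1]))
print(pieces)  # 8 = 2^3
```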
Questions regarding sawtooth functions
How robust are sawtooth functions to multiplicative weight perturbation of the form $w(1 + \epsilon)$? (The perturbations $\epsilon$ are zero-mean Gaussians; the experiments varied the variance.)

Can randomly initialized or perturbed networks re-learn the sawtooth function?
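A minimal version of the perturbation experiment can be sketched as follows (an illustrative reconstruction, not the project's exact code; writing each mirror block as a width-2 layer is one natural parameterization):

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

def sawtooth_net(xs, n, sigma=0.0):
    # one mirror block = ReLU(w2 . ReLU(W1 x + b1)); every weight w is
    # replaced by w * (1 + eps) with eps ~ N(0, sigma^2)
    h = xs
    for _ in range(n):
        W1 = np.array([1.0, 1.0]) * (1 + sigma * rng.standard_normal(2))
        b1 = np.array([0.0, -0.5]) * (1 + sigma * rng.standard_normal(2))
        w2 = np.array([2.0, -4.0]) * (1 + sigma * rng.standard_normal(2))
        H = relu(np.outer(W1, h) + b1[:, None])   # shape (2, len(xs))
        h = relu(w2 @ H)
    return h

def count_pieces(ys):
    s = np.sign(np.diff(ys))
    s = s[s != 0]
    return 1 + int(np.sum(s[1:] != s[:-1]))

xs = np.linspace(0.0, 1.0, 50001)
print(count_pieces(sawtooth_net(xs, 4, sigma=0.0)))  # 16 = 2^4 pieces
print(count_pieces(sawtooth_net(xs, 4, sigma=0.1)))  # structure degrades
```

With $\sigma = 0$ the full $2^4$ pieces appear; already at modest $\sigma$ the piece count typically collapses, matching the fragility the slides report.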
Perturbing the parameters
(a) All weights and biases (b) All weights
(c) All biases (d) First layer only
Figure: Another figure
Re-learning sawtooth networks from different initializations
(a) Mean squared error (b) Number of linear regions
Re-learning sawtooth networks from different initializations
(c) Variance = 0.01 (d) Variance = 0.05
(e) Variance = 0.1 (f) Variance = 0.5
Figure: Another figure
Implications
"Sawtooth networks" fall apart under even mild variance on the weights.

Even when starting from perturbed versions of the original function, the original sawtooth network is not recovered, implying that the set of parameters yielding exponential nets in the loss landscape is localized and challenging to learn.
Learning functions in the exponential regime (sketch)
How can we adjust the DL framework to encourage a feedforward model to learn functions currently in $\mathcal{F}_{\text{express}} \setminus \mathcal{F}_{\text{learn}}$?

First attempt: adding terms to the objective function to encourage the network to learn more complex functions ("anti-regularization").

For 2-D-input, two-hidden-layer ReLU nets, one can think about encouraging all hyperplanes/bent hyperplanes to intersect:

An inactive ReLU activation can be viewed as replacing the appropriate column of the outer weight vector by zeros:

$$w_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2 = w_2'(W_1 x + b_1) + b_2$$

For a fixed first-layer activation pattern, this yields a square matrix equation and a collection of inequalities, and one can add "perceptron-like" error terms to the objective function to encourage regions to intersect.
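As one concrete, hypothetical instantiation of this "anti-regularization" idea, the sketch below penalizes first-layer hyperplanes that stay far from the data via a perceptron-style hinge term; the penalty form, the `margin` parameter, and all names here are this sketch's assumptions, not the construction used in the project:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

# toy first layer of a ReLU net on 2-D inputs
W1, b1 = rng.standard_normal((8, 2)), rng.standard_normal(8)

def anti_reg_penalty(X, margin=0.1):
    """Hinge ('perceptron-like') term: small when every first-layer
    hyperplane W1[i].x + b1[i] = 0 passes within `margin` of some data
    point, so its region boundary actually shows up in the data."""
    pre = X @ W1.T + b1                   # (batch, neurons) preactivations
    dist = np.min(np.abs(pre), axis=0) / np.linalg.norm(W1, axis=1)
    return float(np.sum(relu(dist - margin)))

X = rng.uniform(-1.0, 1.0, size=(256, 2))
penalty = anti_reg_penalty(X)
# the training objective would then be task_loss + lam * penalty, with
# lam > 0 pushing region boundaries into the data region
print(penalty)
```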
Conclusion
Thank you for listening!