Post on 20-May-2020
Theory of Deep Learning, Neural Networks
Introduction to the main questions
Image source: https://commons.wikimedia.org/wiki/File:Blackbox3D-withGraphs.png
Main theoretical questions
I. Representation
II. Optimization
III. Generalization
Basics
Neural networks:
Class of parametric models in machine learning + algorithms to train them
Supervised learning
Unsupervised Learning
Reinforcement Learning
Basics: Supervised learning
Unknown function $f^*: \mathbb{R}^n \to \mathbb{R}^k$
e.g. 100x100 color images: $n = 100 \cdot 100 \cdot 3$
e.g. dog = 1, cat = -1: $k = 1$
Have: data $(x_i, y_i)$, $i = 1, \dots, N$, with $y_i = f^*(x_i)$
Aim: from $(x_i, y_i)$ produce an estimator $\hat{f}: \mathbb{R}^n \to \mathbb{R}^k$ s.t. $\hat{f} \approx f^*$
Assume: the data $x_i$, $i = 1, \dots, N$, are IID samples from an unknown distribution $P$
Basics: Supervised learning
Measure of success:
Risk $R(\hat{f}) = \mathbb{E}_{x \sim P}\, L(\hat{f}(x), f^*(x))$
for a loss function $L(\hat{y}, y)$, e.g.
$L(\hat{y}, y) = (\hat{y} - y)^2$ (L2 loss / mean squared error)
$L(\hat{y}, y) = \mathbf{1}\{\hat{y} \neq y\}$ (classification loss)
Basics: Supervised learning
Data set:
Train data $(x_i, y_i)$ (≈ 80%): use to produce $\hat{f}$
Test data $(\tilde{x}_i, \tilde{y}_i)$ (≈ 20%): use once to compute $\hat{R}(\hat{f}) \approx R(\hat{f})$
Can't compute the risk... but can estimate it:
if $(\tilde{x}_i, \tilde{y}_i)$, $i = 1, \dots, N$, were not used to produce $\hat{f}$, with $\tilde{x}_i \sim P$ IID and $\tilde{y}_i = f^*(\tilde{x}_i)$, then
$\hat{R}(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(\hat{f}(\tilde{x}_i), \tilde{y}_i) \to R(\hat{f})$ by the LLN
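To make the test-set risk estimate concrete, here is a minimal numpy sketch; the target function, the estimator, and the sample size are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    # Hypothetical ground-truth function (for illustration only).
    return np.sin(3 * x)

def f_hat(x):
    # Some estimator produced from training data (a crude polynomial stand-in).
    return 3 * x - 4.5 * x**3

# Held-out test points, drawn IID from the input distribution P (here uniform).
x_test = rng.uniform(-1, 1, size=1000)
y_test = f_star(x_test)

# Empirical risk on the test set with the L2 loss; by the LLN this
# converges to the true risk R(f_hat) as the number of test points grows.
risk_estimate = np.mean((f_hat(x_test) - y_test) ** 2)
print(f"estimated risk: {risk_estimate:.4f}")
```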
Basics: Parametric approach
Define parametric class of functions:
$\mathcal{H} = \{f_\theta : \theta \in \mathbb{R}^p\}$ (hypothesis set)
Choose $\hat{f} \in \mathcal{H}$ ⇔ choose $\theta \in \mathbb{R}^p$
Choose $\theta \in \mathbb{R}^p$ by minimizing $\theta \mapsto \hat{R}_{\text{train}}(f_\theta)$
→ Empirical Risk Minimization (ERM)
Ex 1: Linear regression: $\mathcal{H} = \{f_\theta(x) = \theta_1 \cdot x + \theta_2 : \theta = (\theta_1, \theta_2) \in \mathbb{R}^{n+1}\}$
Can represent linear functions.
Ex 2: Polynomial regression (n = 1): $\mathcal{H} = \{f_\theta(x) = \theta_{p-1} x^{p-1} + \theta_{p-2} x^{p-2} + \dots + \theta_0 : \theta \in \mathbb{R}^p\}$
Can represent non-linear functions.
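As a concrete illustration of ERM in Ex 2, here is a minimal numpy sketch of polynomial regression fit by least squares; the data-generating function and the degree are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data set (x_i, y_i) with y_i = f*(x_i); f* is made up here.
x = rng.uniform(-1, 1, size=50)
y = np.cos(2 * x)

# Hypothesis set H = {f_theta(x) = theta_0 + theta_1 x + ... + theta_{p-1} x^{p-1}}.
p = 5
X = np.vander(x, N=p, increasing=True)   # design matrix: columns 1, x, ..., x^{p-1}

# ERM with the L2 loss reduces to ordinary least squares.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

f_theta = lambda t: np.vander(t, N=p, increasing=True) @ theta
print("train MSE:", np.mean((f_theta(x) - y) ** 2))
```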
Basics: Parametric approach
Ex 3: Fully connected neural network: $\mathcal{H} = \{f_\theta(x) = W^{(L)} \sigma(W^{(L-1)} \sigma(\cdots \sigma(W^{(0)} x + b^{(0)}) \cdots) + b^{(L-1)}) + b^{(L)}\}$
$\theta = (W^{(L)}, b^{(L)}, \dots, W^{(0)}, b^{(0)}) \in \mathbb{R}^p$
Basics: Neural networks
$\sigma: \mathbb{R} \to \mathbb{R}$ non-linear function applied entrywise
e.g. $\sigma(x) = \tanh(x)$ (a sigmoid)
$\sigma(x) = x_+ = \max(x, 0)$ (rectified linear unit; ReLU)
Can represent non-linear functions
Basics: Neural networks
$x^{(0)} = x$
$x^{(l+1)} = \sigma(W^{(l)} x^{(l)} + b^{(l)})$, $l = 0, \dots, L-1$
$y = W^{(L)} x^{(L)} + b^{(L)}$
One neuron: $x^{(1)}_1 = \sigma(W^{(0)}_{1,\cdot} \cdot x + b^{(0)}_1)$
[Figure: fully connected network diagram]
Image source: neuralnetworksanddeeplearning.com
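The layer recursion above, written out as a minimal numpy sketch of the forward pass; the layer widths and the initialization scale are arbitrary choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)      # sigma(x) = x_+

# Layer widths n_0, ..., n_{L+1}; arbitrary example values.
widths = [3, 16, 16, 1]
params = []
for n_in, n_out in zip(widths[:-1], widths[1:]):
    W = rng.normal(0, 1 / np.sqrt(n_in), size=(n_out, n_in))
    b = np.zeros(n_out)
    params.append((W, b))

def forward(x, params):
    """x^(0)=x; x^(l+1)=sigma(W^(l) x^(l) + b^(l)); y = W^(L) x^(L) + b^(L)."""
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)
    W_L, b_L = params[-1]
    return W_L @ h + b_L

print(forward(np.array([0.1, -0.3, 0.7]), params))
```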
Basics: Neural networks
Ex 4: Convolutional neural networks
Layers with local connections, tied weights
→ far fewer parameters than a fully connected network
Adapted to:
• Images (2D)
• Time series (1D)
Represents "local" and "translation invariant" functions
Image source: neuralnetworksanddeeplearning.com
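A minimal sketch of a single 2D convolution (one filter, "valid" cross-correlation, no padding or stride) to illustrate local connections and tied weights; the sizes are arbitrary:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: one small kernel (tied weights) is
    slid over the image, so each output only sees a local patch."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

rng = np.random.default_rng(3)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))      # only 9 parameters, reused at every location
print(conv2d(image, kernel).shape)    # (6, 6)
```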
Basics: Neural networks
Ex 5: Residual Network (ResNet)
Connections that skip layers
Image source: https://commons.wikimedia.org/wiki/File:ResNets.svg
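A minimal sketch of one residual block in the simple form $x_{\text{out}} = x + F(x)$, with $F$ a small two-layer map; the width, activation, and initialization are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(z, 0.0)

d = 8
W1 = rng.normal(0, 1 / np.sqrt(d), size=(d, d))
W2 = rng.normal(0, 1 / np.sqrt(d), size=(d, d))

def residual_block(x):
    # Skip connection: the input is added back to the block's output,
    # so the block only has to learn a correction F(x).
    return x + W2 @ relu(W1 @ x)

print(residual_block(rng.normal(size=d)))
```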
Basics: Architecture
Architecture of a neural network:
Choice of
• # of layers
• size of layers
• types of layers: FC, Conv, Res
• ...
Classical NNs (80s-90s): few hidden layers → shallow
Modern NNs (2010+): many layers → deep learning
AlexNet (2012)
Dataset: ImageNet (1.2 million images)
Input space dimension: $n \approx 150$k
$x =$ [image], $y_i =$ prob. of category $i$ ("container ship", "mushroom", ...; $k = 1000$ categories)
8 layers, $p \approx 60$ million parameters
Error on test set: 37.5% (top-5: 17%)
Krizhevsky, Sutskever, Hinton. "ImageNet classification with deep convolutional neural networks." NIPS '12
Theory I: Representation (Classical)
Thm (Universal approximation theorem for shallow networks)
(Funahashi '89, Cybenko '89, Hornik '91)
A feedforward network with a single hidden layer and a suitable (e.g. sigmoidal) activation can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units.
Funahashi, Ken-Ichi. "On the approximate realization of continuous mappings by neural networks." Neural Networks 2.3 (1989)
Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2.4 (1989)
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989)
Hornik, Kurt. "Approximation capabilities of multilayer feedforward networks." Neural Networks 4.2 (1991)
Theory I: Representation (Modern)
Hypothesis (vague):
Depth is good because "naturally occurring" functions are hierarchical
→ Mathematical framework to study properties of convolutions: Stéphane Mallat (ENS)
Mallat, Stéphane. "Understanding deep convolutional networks." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374.2065 (2016)
Hypothesis (vague):
Depth allows for "more efficient" representation of functions (fewer parameters)
Thm (Eldan, Shamir '16): there is a simple function on $\mathbb{R}^d$ expressible by a small 3-layer network that cannot be approximated by any 2-layer network unless its width is exponential in $d$.
Theory I: Representation
Similar: Telgarsky '16, Abernethy et al. '15, Cohen et al. '16, Lu et al. '17, Rolnick & Tegmark '18
Telgarsky, Matus. "Benefits of depth in neural networks." '16
Abernethy, Jacob, Alex Kulesza, and Matus Telgarsky. "Representation Results & Algorithms for Deep Feedforward Networks." NIPS '15
Cohen, Nadav, Or Sharir, and Amnon Shashua. "On the expressive power of deep learning: A tensor analysis." COLT '16
Lu, Zhou, et al. "The expressive power of neural networks: A view from the width." NIPS '17
Rolnick, David, and Max Tegmark. "The power of deeper networks for expressing natural functions." '17
Basics: Optimization
Recall ERM: to pick $\theta$, minimize $\theta \mapsto \hat{R}_{\text{train}}(f_\theta)$
e.g. for linear regression:
$\hat{R}_{\text{train}}(f_\theta) = \frac{1}{N} \sum_{i=1}^{N} (\theta_1 \cdot x_i + \theta_2 - y_i)^2$
→ Quadratic in $\theta_1, \theta_2$ → can be minimized exactly (least squares)
With a different convex loss function:
$\hat{R}_{\text{train}}(f_\theta) = \frac{1}{N} \sum_{i=1}^{N} L(\theta_1 \cdot x_i + \theta_2, y_i)$
→ not quadratic, but convex
→ can use any of a number of optimization algorithms to find the global minimum (with theoretical guarantees)
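For a convex but non-quadratic loss there is no closed form, but plain gradient descent still finds the global minimum. A minimal sketch with a smooth convex loss (log-cosh, chosen here only for illustration) on invented 1D data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 0.5                          # toy linear ground truth

theta = np.zeros(2)                        # (theta_1, theta_2)
lr = 0.1
for _ in range(2000):
    pred = theta[0] * x + theta[1]
    # Convex (but non-quadratic) loss L(y_hat, y) = log cosh(y_hat - y);
    # its derivative w.r.t. y_hat is tanh(y_hat - y).
    g = np.tanh(pred - y)
    grad = np.array([np.mean(g * x), np.mean(g)])
    theta -= lr * grad                     # gradient descent reaches the global min

print(theta)                               # close to (2.0, 0.5)
```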
Basics: Optimization
Classical NN (80s, 90s):
$\theta \mapsto \hat{R}_{\text{train}}(f_\theta) = \frac{1}{N} \sum_{i=1}^{N} L\big(W^{(1)} \sigma(W^{(0)} x_i + b^{(0)}) + b^{(1)}, \, y_i\big)$
→ not convex
→ but differentiable in $\theta$!
Classical ML:
Define $\mathcal{H}$ so that $\theta \mapsto \hat{R}_{\text{train}}(f_\theta)$ is convex
(e.g. kernel methods)
(For classification → cross-entropy loss:
$L(\hat{y}, i) = -\log \hat{y}_i = -\log(\text{prob. of class } i)$)
Basics: Gradient descent
Gradient descent: $\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \hat{R}_{\text{train}}(f_{\theta_t})$
If convex, (essentially) guaranteed to converge to the global min.
If non-convex, only guaranteed to converge to a local min.
Basics: Gradient descent
Note: the gradient $\nabla_\theta \hat{R}_{\text{train}}(f_\theta)$ can be computed efficiently via the backpropagation algorithm (chain rule)
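A minimal sketch of backpropagation for the one-hidden-layer risk above (square loss, ReLU); the gradient is just the chain rule applied layer by layer, and all sizes are invented:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def loss_and_grads(W0, b0, W1, b1, X, y):
    """Forward pass, then backprop (chain rule) for
    R(theta) = (1/N) * sum_i (W1 sigma(W0 x_i + b0) + b1 - y_i)^2."""
    N = X.shape[0]
    z = X @ W0.T + b0          # pre-activations, shape (N, hidden)
    h = relu(z)                # hidden activations
    pred = h @ W1.T + b1       # outputs, shape (N, 1)
    err = pred - y[:, None]
    loss = np.mean(err ** 2)

    # Backward pass: propagate d loss / d pred back through the layers.
    d_pred = 2 * err / N                   # (N, 1)
    dW1 = d_pred.T @ h                     # (1, hidden)
    db1 = d_pred.sum(axis=0)
    d_h = d_pred @ W1                      # (N, hidden)
    d_z = d_h * (z > 0)                    # ReLU derivative
    dW0 = d_z.T @ X                        # (hidden, n)
    db0 = d_z.sum(axis=0)
    return loss, (dW0, db0, dW1, db1)

rng = np.random.default_rng(6)
X = rng.normal(size=(32, 3)); y = np.sin(X[:, 0])
W0 = rng.normal(size=(8, 3)); b0 = np.zeros(8)
W1 = rng.normal(size=(1, 8)); b1 = np.zeros(1)
print(loss_and_grads(W0, b0, W1, b1, X, y)[0])
```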
Basics: Stochastic gradient descent
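Stochastic gradient descent in a minimal form: each step uses the gradient on a random mini-batch instead of the full sum. A numpy sketch on an invented least-squares problem (batch size and learning rate are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N, n = 10_000, 5
X = rng.normal(size=(N, n))
theta_true = rng.normal(size=n)
y = X @ theta_true

theta = np.zeros(n)
lr, batch_size = 0.05, 32
for step in range(2000):
    # Pick a random mini-batch and estimate the gradient on it only:
    # much cheaper than the full-batch gradient, but unbiased on average.
    idx = rng.integers(0, N, size=batch_size)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / batch_size
    theta -= lr * grad

print(np.linalg.norm(theta - theta_true))   # small after enough steps
```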
Basics: Optimization
Classic NN situation (80s, 90s):
• Can optimize shallow, small networks
• No guarantees
• Sometimes find "good" local minima, sometimes not
• Performance of methods with guarantees is the same or better
Basics: Optimization
Modern NN situation:
• Have powerful hardware (GPUs) → can optimize deep, large networks
• Gradient computation becomes quite unstable for deep networks, but with
  • careful choice of the scale of the random initialization
  • SGD variants: momentum, ADAM, ...
  can reliably optimize deep networks
• With
  • batch normalization
  • ResNet architecture
  can optimize very deep networks (1000+ layers)
• Can mostly tune things to reach $\hat{R}_{\text{train}}(f_\theta) \approx 0$ → global min!
Theory II: Optimization
Q: Why do we reach $\hat{R}_{\text{train}}(f_\theta) \approx 0$ despite non-convexity?
Hypothesis (vague): local minima are rare / appear only in pathological cases
Theory II: Optimization
Mathematical evidence from toy problems:
Thm (Kawaguchi '16): for deep linear networks, "all local minima are global"
Similar: Soudry, Carmon '16; Haeffele, Vidal '15, '17; Ge, Lee, Ma '16 (matrix completion)
Soudry, Daniel, and Yair Carmon. "No bad local minima: Data independent training error guarantees for multilayer neural networks." '16
Haeffele, Benjamin D., and René Vidal. "Global optimality in tensor factorization, deep learning, and beyond." '15
Haeffele, Benjamin D., and René Vidal. "Global optimality in neural network training." CVPR '17
Ge, Rong, Jason D. Lee, and Tengyu Ma. "Matrix completion has no spurious local minimum." NIPS '16
Theory II: Optimization
→ (almost) implies that SGD successfully minimizes the empirical risk (caveat: saddle points)
But "all local minima are global" is definitely not true in general:
Thm (Auer et al. '96): a single neuron can have an empirical risk with exponentially (in the dimension) many local minima
Auer, Peter, Mark Herbster, and Manfred K. Warmuth. "Exponentially many local minima for single neurons." NIPS '96
Theory II: Optimization
Interesting empirical observation
In practice, the bottom of the landscape is a large, connected, flat region
(Sagun et al. '17,
Draxler et al. '18)
[Figure: "Wrong intuition" vs. "Right intuition" pictures of the loss landscape]
Sagun, Levent, et al. "Empirical analysis of the hessian of over-parametrized neural networks." '17
Draxler, Felix, et al. "Essentially no barriers in neural network energy landscape." '18 (image source)
Theory II: Optimization
Refined hypothesis: for a large net, there are few or no bad local minima
"large" = "overparameterized"
= many more parameters $p$ than constraints in $f_\theta(x_i) = y_i$, $i = 1, \dots, N$
Further refinement: there may be bad local minima, but the path (S)GD takes avoids them if the network is overparameterized
Theory II: Optimization
Rigorous mathematical evidence:
→ Scaling limits for the training dynamics in the limit of infinite width
→ The limiting dynamics provably converge
Two distinct scaling limits:
1. Limit is a kernel method: Jacot, Gabriel, Hongler '18
2. "Mean field" limit (only for one hidden layer):
   • Chizat, Bach '18
   • Mei, Montanari '18
   • Sirignano, Spiliopoulos '18
Different scaling limits arise from different scales of the random initialization
Q: Which scaling "reflects reality"?
Jacot, Arthur, Franck Gabriel, and Clément Hongler. "Neural tangent kernel: Convergence and generalization in neural networks." NIPS '18
Chizat, Lenaic, and Francis Bach. "On the global convergence of gradient descent for over-parameterized models using optimal transport." NIPS '18
Mei, Song, Andrea Montanari, and Phan-Minh Nguyen. "A mean field view of the landscape of two-layer neural networks." PNAS '18
Sirignano, Justin, and Konstantinos Spiliopoulos. "Mean field analysis of neural networks." '18
Basics: Generalization
Goal: find $\theta$ s.t. $R(f_\theta) \approx \hat{R}_{\text{test}}(f_\theta)$ is small
→ minimize $\hat{R}_{\text{train}}(f_\theta)$
Underfitting: $\hat{R}_{\text{train}}(f_\theta)$ large, $\hat{R}_{\text{test}}(f_\theta)$ large
• Because $\mathcal{H}$ is too restrictive
• Or because the minimization failed
Basics: Generalization
Overfitting: $\hat{R}_{\text{train}}(f_\theta)$ small, $\hat{R}_{\text{test}}(f_\theta)$ large
• Fits the training data well, but doesn't generalize to new, unseen data
Ideally want an $f_\theta$ that generalizes well:
$\hat{R}_{\text{test}}(f_\theta) - \hat{R}_{\text{train}}(f_\theta)$ is small
Image source: https://commons.wikimedia.org/wiki/File:Overfitted_Data.png
Basics: Capacity
Capacity
• "Richness" of $\mathcal{H}$
• E.g. polynomial regression, $\mathcal{H}_k = \{\text{polynomials of degree} \le k\}$:
  $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots \subset \mathcal{H}_k \subset \mathcal{H}_{k+1} \subset \cdots$
  ← less capacity | more capacity →
• Not enough capacity → underfit
• Too much capacity → may overfit
Basics: Overfitting/Underfitting
Basics: Regularization
Regularization:
To avoid overfitting while keeping the expressivity of $\mathcal{H}$
→ Constrain $\mathcal{H}$:
  → hard constraint: $\tilde{\mathcal{H}} = \{f_\theta \in \mathcal{H} : |\theta| \le C\}$
  → soft constraint: minimize $\hat{R}_{\text{train}}(f_\theta) + \lambda |\theta|^2$
    (L2 regularization / weight decay; see the sketch after this list)
→ Constrain the minimization: stop optimizing early
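A minimal sketch of the soft constraint (L2 regularization / weight decay) in the least-squares case, where it reduces to ridge regression with the closed form below; the value of $\lambda$ and the data are invented:

```python
import numpy as np

rng = np.random.default_rng(8)
N, n = 30, 20                               # few data points, many parameters
X = rng.normal(size=(N, n))
y = X[:, 0] + 0.1 * rng.normal(size=N)      # only the first coordinate matters

lam = 0.5
# Minimize (1/N) * ||X theta - y||^2 + lam * ||theta||^2  (weight decay);
# for least squares this has the closed form (X^T X / N + lam I)^{-1} X^T y / N.
theta = np.linalg.solve(X.T @ X / N + lam * np.eye(n), X.T @ y / N)
print(np.round(theta, 2))                   # coefficients shrunk toward 0
```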
Basics: Regularization
[Figure: fits ranging from more regularization to less regularization]
Basics: Generalization
[Figure: "Classical ML" vs. "Modern NN" train/test risk as regularization is varied]
Dropout: During training, set output of each neuron to 0 with prob. 50% (regularizer)
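A minimal sketch of dropout as described above: during training each activation is zeroed with probability 0.5; the "inverted dropout" rescaling of the survivors is an extra assumption of this sketch, and at test time the activations are left unchanged:

```python
import numpy as np

rng = np.random.default_rng(9)

def dropout(h, p_drop=0.5, train=True):
    """During training, set each activation to 0 with probability p_drop
    and rescale the survivors (inverted dropout); at test time, identity."""
    if not train:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = rng.normal(size=10)
print(dropout(h, train=True))    # roughly half the entries are zeroed
print(dropout(h, train=False))   # unchanged at test time
```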
Theory III: Generalization
Neyshabur, Tomioka, Srebro - In search of the real inductive bias: On the role of implicit regularization in deep learning (2014)
Empirical experiments:
MNIST data set: 60k images, 28x28 grayscale
H = # hidden neurons (capacity)
Image source: https://commons.wikimedia.org/wiki/File:MnistExamples.png
Theory III: Generalization
Zhang et al. - Understanding deep learning requires rethinking generalization (2016)
Typical nets can perfectly fit random data
Even with regularization
Theory III: Generalization
Q: Deep neural networks have enormous capacity. Why don't they catastrophically overfit?
Hypothesis: (S)GD has built-in implicit regularization
Q: What is mechanism behind implicit regularization?
• SGD finds flat local minima? (Keskar et al., Dinh et al.)
• SGD finds large margin solutions? (classifier)
• Information content of the weights is limited? (Tishby)
Conclusion
I. Representation
II. Optimization
III. Generalization