Transcript of “Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio
“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio
Mar. 2017.
At a loss: Landscaping loss, and how it affects generalization (Adam)
Flat Minima Generalize Well
N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” arXiv:1609.04836 [cs, math], Sep. 2016.
“We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known, sharp minima lead to poorer generalization.”
Flat Minima Can Generalize Well
When using stochastic gradient descent:
● batches of size 256 generalize better and reach flatter minima than batches of 10% of the training set
Flat Minima Generalize Well?
Is it really flat minima that generalize well?
Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney, “Hessian-based Analysis of Large Batch Training and Robustness to Adversaries,” arXiv:1802.08241 [cs, stat], Feb. 2018.
Which of the following are true?
1. Flat minima are the reason for good generalization.
2. Gradient descent methods gravitate to flat minima more easily.
   a. Thus models that generalize get chosen over models that do not, biasing the model selection process toward models that perform well at flat minima in the data regimes for which they were selected.
3. Something else
Why do flat minima generalize well?
Presuming that flat minima mark good generalization.
Do flat minima mark good generalization?
N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” arXiv:1609.04836 [cs, math], Sep. 2016.
Coarse Model for Minima
“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio
Mar. 2017.
Broad Strokes
Models can be tweaked after training to adjust their sharpness according to common metrics.
Common Flatness Metrics
1. Volume flatness
2. ϵ-sharpness
3. Hessian-based
Flatness 1
Volume flatness: the volume of the largest connected region around the minimum on which the loss stays within ϵ of its value there.
“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio
Flatness 1
Volume flatness is infinite for rectified neural networks: non-identifiability yields an unbounded set of observationally equivalent parameters, so the region around a minimum has infinite volume.
“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio
Flatness 2
ϵ-sharpness (Keskar et al. flatness):
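For reference, a sketch of the definition (Keskar et al. use a box constraint scaled by parameter magnitudes; this simplified ball version follows Dinh et al.'s restatement): ϵ-sharpness is the largest relative increase of the loss L within an ϵ-neighborhood of the minimum θ:

$$\phi_\epsilon(\theta) \;=\; \frac{\max_{\|\theta'-\theta\|_2 \,\le\, \epsilon} L(\theta') - L(\theta)}{1 + L(\theta)}$$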
“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio
“For rectified neural network every minimum is observationally equivalent to a minimum that generalizes as well but with high ϵ-sharpness. This also applies when using full-space ϵ-sharpness used by Keskar et al. (2017).”
Full-space ϵ-sharpness is related to the spectral norm of the Hessian.
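The connection is a standard second-order expansion (not stated on the slide): for small ϵ, the worst-case loss increase over a full-space ball is governed by the largest Hessian eigenvalue, i.e. the spectral norm:

$$\max_{\|\delta\|_2 \le \epsilon} L(\theta+\delta) - L(\theta) \;\approx\; \tfrac{1}{2}\,\epsilon^2\,\lambda_{\max}\!\left(\nabla^2 L(\theta)\right)$$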
“Sharp Minima Can Generalize For Deep Nets,” L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio
Flatness 2
What about other eigenvalues?
“We have not been able to show a similar problem with random subspace ϵ-sharpness used by Keskar et al. (2017)”
Hessian Flatness
Looks at the curvature of the loss with respect to the parameters.
Hessian Flatness Quantification
Values used:
● Spectral norm
● Trace (lower bounded by the spectral norm)
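The parenthetical holds because the Hessian H at a minimum is positive semi-definite, so its trace (the sum of its non-negative eigenvalues) dominates its largest eigenvalue:

$$\operatorname{tr}(H) \;=\; \sum_i \lambda_i \;\ge\; \lambda_{\max}(H) \;=\; \|H\|_2, \qquad \lambda_i \ge 0$$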
Alpha-Scale Transformation
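As defined in the paper for a one-hidden-layer rectified network $f_\theta(x) = \theta_2\,\mathrm{relu}(\theta_1 x)$, the transformation scales the two layers inversely; by the non-negative homogeneity of ReLU the function is unchanged:

$$T_\alpha(\theta_1, \theta_2) \;=\; (\alpha\,\theta_1,\; \alpha^{-1}\theta_2), \qquad f_{T_\alpha(\theta)} = f_\theta \quad \text{for all } \alpha > 0$$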
2-layer version
The spectral norm of the Hessian can be scaled arbitrarily.
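A minimal numpy sketch (not from the paper; the toy data, shapes, and helper names are made up) illustrating the effect: under T_α the network's outputs are unchanged, while the loss curvature along a direction in the second-layer weights grows like α², so Hessian-based sharpness can be inflated at will.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # toy inputs
y = rng.normal(size=(32, 1))   # toy targets
W1 = rng.normal(size=(4, 8))   # first-layer weights
W2 = rng.normal(size=(8, 1))   # second-layer weights

def forward(W1, W2):
    """One-hidden-layer rectified network."""
    return np.maximum(X @ W1, 0.0) @ W2

def loss(W1, W2):
    return np.mean((forward(W1, W2) - y) ** 2)

def curvature_along_w2(W1, W2, eps=1e-4):
    """Second-order finite difference of the loss along a fixed W2 direction."""
    d = np.ones_like(W2)
    return (loss(W1, W2 + eps * d) - 2 * loss(W1, W2) + loss(W1, W2 - eps * d)) / eps**2

for alpha in [1.0, 10.0, 100.0]:
    W1a, W2a = alpha * W1, W2 / alpha  # T_alpha: observationally equivalent weights
    assert np.allclose(forward(W1, W2), forward(W1a, W2a))
    print(f"alpha={alpha:6.1f}  curvature={curvature_along_w2(W1a, W2a):.3e}")
# The function is identical for every alpha, but the curvature grows ~alpha**2.
```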
Multi-layer version
Eigenvalues for n−1 of the layers can be scaled arbitrarily large, where n is the number of layers.
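As I read the paper's construction, the deep version scales each layer by its own factor, with the factors constrained to cancel so the function is preserved; this leaves n−1 free directions along which curvature can be inflated:

$$T_{\boldsymbol{\alpha}}(\theta_1, \dots, \theta_n) \;=\; (\alpha_1\theta_1, \dots, \alpha_n\theta_n), \qquad \prod_{k=1}^{n} \alpha_k = 1, \quad \alpha_k > 0$$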
Hessian Measures That Can Avoid This
Maybe a product of Hessian eigenvalues… it can still be scaled to be sharper, but relative sharpness is maintained.
Model re-parameterization & input space
Re-parameterizing the weights through a bijection changes the loss landscape without changing the underlying function.
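One way to see this (a chain-rule fact, not a quote from the slides): at a critical point the gradient term vanishes, so under a bijective re-parameterization $\theta = g(\eta)$ the Hessian transforms by congruence with the Jacobian $J_g$, and a suitable $g$ can make the same minimum look arbitrarily flat or sharp:

$$\nabla^2_\eta\, L(g(\eta)) \;=\; J_g^\top\, \nabla^2_\theta L\; J_g \qquad \text{at a minimum}$$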
Conclusion
Flat minima don’t necessarily generalize better than sharp ones.
Exploiting non-identifiability allows changing the loss surface without affecting the underlying function.
More care is needed to define flatness.
Bayesian method
S. L. Smith and Q. V. Le, “A Bayesian Perspective on Generalization and Stochastic Gradient Descent,” arXiv:1710.06451 [cs, stat], Oct. 2017.
Likely models balance the depth of a minimum (how low the loss gets) against its breadth (how flat it is).
Occam Factor
Roughly: the regularization constant divided by the product of the eigenvalues of the Hessian.
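Sketching Smith & Le's result from memory (treat the exact constants as indicative rather than authoritative): the evidence for a model at a minimum $\omega_0$ of the regularized cost $C$ is penalized by an Occam factor built from the L2 regularization coefficient $\lambda$ and the Hessian eigenvalues $\lambda_i$ there, so sharp minima (large $\lambda_i$) are penalized:

$$P(D \mid M) \;\approx\; e^{-C(\omega_0)} \prod_i \left(\frac{\lambda}{\lambda_i}\right)^{1/2}$$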
SGD Noise Scale
S. L. Smith and Q. V. Le, “A Bayesian Perspective on Generalization and Stochastic Gradient Descent,” arXiv:1710.06451 [cs, stat], Oct. 2017.
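Their noise scale ties the main training knobs together (again from memory, with $\epsilon$ the learning rate, $N$ the training set size, and $B$ the batch size; the approximation assumes $B \ll N$):

$$g \;=\; \epsilon\left(\frac{N}{B} - 1\right) \;\approx\; \frac{\epsilon N}{B}$$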
N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” arXiv:1609.04836 [cs, math], Sep. 2016.
Addendum
Warm-starting models... Check out: S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le, “Don’t Decay the Learning Rate, Increase the Batch Size,” arXiv:1711.00489 [cs, stat], Nov. 2017.
Is flatness a goose-chase?
Why focus on flatness rather than on training paradigms that directly affect generalization?
Can we optimize over the model itself to improve flatness as well? E.g., skip-connections and other online transformations.