Variance-based Stochastic Gradient Descent (vSGD) and ADAM: Learning Rate Optimizations


Variance-based Stochastic Gradient Descent (vSGD):
No More Pesky Learning Rates (Schaul et al., ICML 2013)

The idea
- Remove the need for hand-set learning rates by updating them optimally from estimates of the gradient variance and the diagonal Hessian values.
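A minimal sketch of the vSGD-style per-parameter rate, assuming fixed moving-average constants (the paper adapts the memory sizes per parameter) and assuming a diagonal-Hessian estimate `hess_diag` is supplied by some outside estimator; the names and defaults here are illustrative, not the paper's code.

```python
import numpy as np

def vsgd_update(theta, grad, hess_diag, state, tau=10.0):
    """Simplified vSGD-style step: per-parameter rate
    eta_i = g_bar_i^2 / (h_bar_i * v_bar_i), built from running
    averages of the gradient, squared gradient, and curvature."""
    rho = 1.0 / tau  # fixed memory constant; the paper adapts tau per parameter
    state["g_bar"] = (1 - rho) * state["g_bar"] + rho * grad
    state["v_bar"] = (1 - rho) * state["v_bar"] + rho * grad ** 2
    state["h_bar"] = (1 - rho) * state["h_bar"] + rho * np.abs(hess_diag)

    eta = state["g_bar"] ** 2 / (state["h_bar"] * state["v_bar"] + 1e-12)
    return theta - eta * grad

# Typical initialization (illustrative):
# state = {"g_bar": np.zeros_like(theta),
#          "v_bar": np.ones_like(theta),
#          "h_bar": np.ones_like(theta)}
```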

ADAM: A Method for Stochastic Optimization (Kingma & Ba, arXiv 2014)

The idea
- Establish and update a trust region within which the gradient is assumed to hold.
- Attempts to combine the robustness of AdaGrad to sparse gradients with the robustness of RMSProp to non-stationary objectives.
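The update itself, as a minimal NumPy sketch with the paper's default hyperparameters (the function name and the way state is passed around are illustrative):

```python
import numpy as np

def adam_update(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step: bias-corrected estimates of the first and second
    gradient moments scale a fixed step size per parameter."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```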

Alternative form: AdaMax
- In ADAM, the second moment is calculated as an exponentially weighted sum of squared gradients, and its square root is used in the update.
- Changing that power of two to a power of p and letting p go to infinity yields AdaMax, whose denominator is an exponentially weighted infinity norm (a decaying maximum of gradient magnitudes).
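A sketch of the AdaMax variant under the same conventions as the ADAM sketch above; the tiny constant added to the denominator is a safeguard of this sketch, not part of the paper's pseudocode:

```python
import numpy as np

def adamax_update(theta, grad, m, u, t, alpha=2e-3, beta1=0.9, beta2=0.999):
    """One AdaMax step: the L2-based second moment of ADAM is replaced
    by an exponentially weighted infinity norm (a decaying max)."""
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))   # infinity-norm moment; no bias correction needed
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + 1e-12)
    return theta, m, u
```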

Results

AdaGrad

The idea
- Decrease the update over time by penalizing parameters whose values move quickly (i.e., whose gradients have been large).

The problem
- The learning rate only ever decreases.
- Complex problems may need more freedom.
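A minimal sketch of the update being described makes the problem visible: the accumulator only ever grows, so the per-parameter step size only ever shrinks (names and the default step size are illustrative):

```python
import numpy as np

def adagrad_update(theta, grad, accum, alpha=0.01, eps=1e-8):
    """One AdaGrad step: squared gradients are accumulated over all time,
    so the effective learning rate is monotonically non-increasing."""
    accum = accum + grad ** 2
    theta = theta - alpha * grad / (np.sqrt(accum) + eps)
    return theta, accum
```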

Precursor to
- AdaDelta (Zeiler, arXiv 2012)
  - Uses the square root of an exponential moving average of squared gradients instead of an ever-growing accumulation.
  - Approximates a Hessian correction using the same kind of moving average over the weight updates (see the sketch after this list).
  - Removes the need for a learning rate.
- AdaSecant (Gulcehre et al., arXiv 2014)
  - Uses expected values to reduce variance.
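A minimal sketch of the AdaDelta update referenced above, using the decay rate and epsilon from Zeiler's experiments (function and variable names are illustrative):

```python
import numpy as np

def adadelta_update(theta, grad, eg2, edx2, rho=0.95, eps=1e-6):
    """One AdaDelta step: decaying averages of squared gradients and
    squared updates replace a hand-set learning rate, and their ratio
    acts as the approximate Hessian correction."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                  # avg of squared gradients
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad    # uses the *previous* avg of squared updates
    edx2 = rho * edx2 + (1 - rho) * dx ** 2                  # avg of squared updates
    return theta + dx, eg2, edx2
```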

Comparisons
- https://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html
- The default run doesn't include ADAM, but ADAM is implemented and can be added.
- Doesn't have Batch Normalization, vSGD, AdaMax, or AdaSecant.

Questions?