Variance-based Stochastic Gradient Descent (vSGD): No More Pesky Learning Rates, Schaul et al., ICML13

Transcript of Variance-based Stochastic Gradient Descent (vSGD)

Page 1

Variance-based Stochastic Gradient Descent (vSGD):
No More Pesky Learning Rates
Schaul et al., ICML13

Page 2

The idea
- Remove the need for setting learning rates by updating them optimally from the Hessian values and gradient variance estimates.
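As a rough illustration, here is a minimal sketch of the element-wise vSGD rule, where each parameter gets its own learning rate built from running estimates of the gradient mean (g_bar), the squared gradient (v_bar), and the diagonal Hessian (h_bar). The function name vsgd_step and the fixed memory constant tau are illustrative simplifications; the paper additionally adapts the memory size and estimates the curvature with its own procedure, both omitted here.

    import numpy as np

    def vsgd_step(theta, grad, g_bar, v_bar, h_bar, tau=10.0):
        # running averages of the gradient and of the squared gradient
        g_bar = (1.0 - 1.0 / tau) * g_bar + (1.0 / tau) * grad
        v_bar = (1.0 - 1.0 / tau) * v_bar + (1.0 / tau) * grad ** 2
        # per-parameter learning rate: (E[grad])^2 / (h * E[grad^2])
        lr = g_bar ** 2 / (h_bar * v_bar)
        return theta - lr * grad, g_bar, v_bar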

Page 3
Page 4
Page 5

ADAM: A Method For Stochastic Optimization

Kingma & Ba, arXiv14

Page 6

The idea
- Establish and update a trust region where the gradient is assumed to hold.
- Attempts to combine the robustness to sparse gradients of AdaGrad and the robustness of RMSProp to non-stationary objectives.
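A minimal NumPy sketch of one Adam step with the bias-corrected first and second moment estimates from Kingma & Ba; the default hyperparameters follow the paper, while the function name adam_step and the explicit state arguments are my own packaging. Initialize m and v to zeros shaped like theta and count t from 1.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
        v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)              # bias corrections (t starts at 1)
        v_hat = v / (1 - beta2 ** t)
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v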

Page 7

Alternative form: AdaMax
- The second moment is calculated as a sum of squares and its square root is used in the update in ADAM.
- Changing that from a power of two to a power of p as p goes to infinity yields AdaMax.
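Under the same naming assumptions as the Adam sketch above, taking the p → ∞ limit replaces the squared second moment with an exponentially weighted infinity norm u, which needs no bias correction. The small eps in the denominator is a numerical-safety constant of my own, not part of the paper's update.

    import numpy as np

    def adamax_step(theta, grad, m, u, t, lr=2e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad
        u = np.maximum(beta2 * u, np.abs(grad))   # infinity-norm moment, no bias correction needed
        return theta - (lr / (1 - beta1 ** t)) * m / (u + eps), m, u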

Page 8

Results

Page 10

The idea
- Decrease the update over time by penalizing quickly moving values.

Page 11

The problem
- The learning rate only ever decreases.
- Complex problems may need more freedom.
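The behavior described on these two pages (penalizing quickly moving values, with a step size that can only shrink) matches AdaGrad, which the ADAM slide also references; assuming that is the method in question, a minimal sketch makes the problem visible: the accumulator only ever grows, so the effective step lr / sqrt(accum) only ever falls.

    import numpy as np

    def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
        accum = accum + grad ** 2                 # sum of squared gradients never shrinks
        return theta - lr * grad / (np.sqrt(accum) + eps), accum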

Page 12

Precursor to
- AdaDelta (Zeiler, arXiv12)
  - Uses the square root of an exponential moving average of squares instead of just accumulating.
  - Approximates a Hessian correction using the same moving average over the weight updates.
  - Removes the need for a learning rate.
- AdaSecant (Gulcehre et al., arXiv14)
  - Uses expected values to reduce variance.
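A minimal sketch of one AdaDelta step along these lines: an exponential moving average of squared gradients in the denominator, an exponential moving average of squared updates in the numerator (standing in for the Hessian correction), and no global learning rate. The values of rho and eps are commonly quoted defaults, and the function name is mine.

    import numpy as np

    def adadelta_step(theta, grad, eg2, edx2, rho=0.95, eps=1e-6):
        eg2 = rho * eg2 + (1 - rho) * grad ** 2                     # moving average of squared gradients
        delta = -(np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps)) * grad  # unit-consistent step, no learning rate
        edx2 = rho * edx2 + (1 - rho) * delta ** 2                  # moving average of squared updates
        return theta + delta, eg2, edx2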

Page 13

Comparisons
- https://cs.stanford.edu/people/karpathy/convnetjs/demo/trainers.html
- Doesn’t have ADAM in the default run, but ADAM is implemented and can be added.
- Doesn’t have Batch Normalization, vSGD, AdaMax, or AdaSecant.

Page 14

Questions?