Semi-Stochastic Gradient Descent Methods


Jakub Konečný (joint work with Peter Richtárik), University of Edinburgh

Introduction

Large scale problem setting

Problems are often structured, frequently arising in machine learning.

Structure – the objective is a sum of many functions, and the number of functions n is BIG.

Examples

Linear regression (least squares)

Logistic regression (classification)
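To make the sum-of-functions structure concrete, here is a minimal Python sketch of the two example objectives, assuming data given as rows a_i with labels b_i and uniform 1/n averaging (the variable names and the averaging convention are illustrative assumptions, not shown on the slides):

```python
import numpy as np

def least_squares_loss(x, a_i, b_i):
    # One term of the sum: f_i(x) = 1/2 (a_i^T x - b_i)^2
    return 0.5 * (a_i @ x - b_i) ** 2

def logistic_loss(x, a_i, b_i):
    # One term of the sum: f_i(x) = log(1 + exp(-b_i * a_i^T x)), labels b_i in {-1, +1}
    return np.log1p(np.exp(-b_i * (a_i @ x)))

def objective(x, A, b, loss=least_squares_loss):
    # f(x) = (1/n) * sum_i f_i(x); n (the number of data rows) is BIG
    n = A.shape[0]
    return sum(loss(x, A[i], b[i]) for i in range(n)) / n
```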

Assumptions

Lipschitz continuity of the derivative of each summand

Strong convexity of the overall objective
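Spelled out in the standard form (the symbols f_i, f, L and μ are the usual names, assumed here because the slide formulas were not captured):

```latex
\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|
\quad \text{(each $f_i$ has an $L$-Lipschitz continuous gradient)}

f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \tfrac{\mu}{2}\|y - x\|^2
\quad \text{($f$ is $\mu$-strongly convex)}
```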

Gradient Descent (GD)

Update rule: take a step along the full gradient, with a fixed stepsize.

Fast (linear) convergence rate.

Alternatively, to reach a target accuracy we need a number of iterations proportional to the condition number times the logarithm of the inverse accuracy.

Complexity of a single iteration: n (measured in gradient evaluations).
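A minimal sketch of this update in Python, assuming a function grad_f that returns the full gradient (an average over all n component gradients) and a fixed stepsize h (all names are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, x0, h, num_iters):
    """GD sketch: x_{k+1} = x_k - h * grad f(x_k).

    grad_f(x) returns the full gradient, i.e. the average of all n component
    gradients, so every iteration costs n gradient evaluations.
    """
    x = x0.copy()
    for _ in range(num_iters):
        x = x - h * grad_f(x)
    return x
```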

Stochastic Gradient Descent (SGD)

Update rule: step along the gradient of a single randomly chosen function, scaled by a step-size parameter.

Why it works: the sampled gradient is an unbiased estimate of the full gradient.

Slow (sublinear) convergence.

Complexity of a single iteration: 1 (measured in gradient evaluations).
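A corresponding SGD sketch, assuming grad_fi(x, i) returns the gradient of the single component f_i and stepsize(k) returns the step-size parameter for iteration k (both names are illustrative):

```python
import numpy as np

def sgd(grad_fi, n, x0, stepsize, num_iters, seed=0):
    """SGD sketch: x_{k+1} = x_k - h_k * grad f_i(x_k), with i uniformly random.

    grad_fi(x, i) returns the gradient of the single component f_i, so each
    iteration costs only 1 gradient evaluation.  It works because the sampled
    gradient is unbiased: E_i[grad f_i(x)] = grad f(x).
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(num_iters):
        i = rng.integers(n)
        x = x - stepsize(k) * grad_fi(x, i)
    return x

# Example of a decaying schedule: stepsize = lambda k: 1.0 / (k + 1)
```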

Goal

GD: fast convergence, but n gradient evaluations in each iteration.

SGD: slow convergence, but the complexity of each iteration is independent of n.

Combine the strengths of both in a single algorithm.

Semi-Stochastic Gradient Descent

S2GD

Intuition

The gradient does not change drastically between nearby iterates, so we could reuse the information from an “old” gradient.

Modifying the “old” gradient

Imagine someone gives us a “good” point x and the full gradient ∇f(x).

The gradient at a point y near x can be expressed as the already computed gradient ∇f(x) plus the gradient change ∇f(y) − ∇f(x).

Approximation of the gradient: we can try to estimate the gradient change cheaply, using a single randomly sampled function.
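Written out in the usual SVRG/S2GD notation (assumed here, since the slide formulas were not captured):

```latex
\nabla f(y)
  = \underbrace{\nabla f(x)}_{\text{already computed}}
  + \underbrace{\bigl(\nabla f(y) - \nabla f(x)\bigr)}_{\text{gradient change}},
\qquad
g = \nabla f(x) + \nabla f_i(y) - \nabla f_i(x),
\quad
\mathbb{E}_i[g] = \nabla f(y).
```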

The S2GD Algorithm

Simplification: the size of the inner loop is random, following a geometric rule.
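A minimal Python sketch of an S2GD-style epoch, with the simplification already noted above reversed for brevity: a fixed inner-loop length m is used instead of the random, geometrically distributed length of the actual algorithm, and all names are illustrative:

```python
import numpy as np

def s2gd(grad_fi, n, x0, h, num_epochs, m, seed=0):
    """Semi-stochastic gradient descent (sketch).

    grad_fi(x, i): gradient of the i-th component f_i.
    Each epoch costs n evaluations for the full gradient plus
    2 component-gradient evaluations per cheap inner step.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(num_epochs):
        # Full gradient at the current "good" point x (expensive part).
        full_grad = sum(grad_fi(x, i) for i in range(n)) / n
        y = x.copy()
        # Inner loop: variance-reduced stochastic steps (cheap part).
        # In the algorithm described above the number of inner steps is
        # itself random (geometric rule); a fixed m keeps this sketch short.
        for _ in range(m):
            i = rng.integers(n)
            g = full_grad + grad_fi(y, i) - grad_fi(x, i)
            y = y - h * g
        x = y
    return x
```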

Theorem

Convergence rate: the expected suboptimality contracts by a constant factor per epoch, and the factor is a sum of two terms.

How to set the parameters (stepsize and inner-loop size)?

One term can be made arbitrarily small by decreasing the stepsize.

For any fixed stepsize, the other term can be made arbitrarily small by increasing the inner-loop size.
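For reference, the closely related SVRG analysis (Johnson & Zhang, 2013) gives a bound of exactly this two-term form; the S2GD bound has the same structure with different constants:

```latex
\mathbb{E}\bigl[f(x_j) - f(x_*)\bigr] \le c^{\,j}\,\bigl(f(x_0) - f(x_*)\bigr),
\qquad
c = \frac{1}{\mu h (1 - 2Lh)\, m} + \frac{2Lh}{1 - 2Lh}.
```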

Setting the parameters

Fix a target accuracy.

The accuracy is achieved by setting the stepsize, the number of inner iterations, and the number of epochs appropriately.

Total complexity (in gradient evaluations): # of epochs × (one full gradient evaluation + the cheap inner iterations).
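To make the accounting concrete, a small helper under the typical choices suggested by bounds of this form: roughly log(1/ε) epochs and an inner loop proportional to the condition number κ = L/μ (the constants below are illustrative, not the ones from the talk):

```python
import math

def s2gd_total_work(n, kappa, epsilon, inner_per_kappa=20):
    """Rough gradient-evaluation budget for S2GD-style methods.

    Assumes the typical choices j ~ log(1/eps) epochs and m ~ const * kappa
    inner iterations; each epoch costs n (full gradient) plus 2*m (inner
    steps).  The constants are purely illustrative.
    """
    j = math.ceil(math.log(1.0 / epsilon))   # number of epochs
    m = math.ceil(inner_per_kappa * kappa)   # inner-loop size
    return j * (n + 2 * m)                   # total gradient evaluations

# Example (illustrative): n = 10**6, kappa = 10**3, epsilon = 1e-6 gives
# 14 * (10**6 + 4*10**4), about 1.5e7 evaluations (roughly 15 passes over
# the data), versus n * kappa * log(1/eps), about 1.4e10 for plain GD.
```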

Complexity

S2GD total complexity: # of epochs × (full gradient evaluation + cheap inner iterations).

GD total complexity: # of iterations × complexity of a single iteration (n gradient evaluations).
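In O-notation, with κ = L/μ, the standard totals measured in component-gradient evaluations are:

```latex
\text{GD:}\quad \mathcal{O}\bigl(\kappa \log(1/\varepsilon)\bigr)\ \text{iterations}
  \times n\ \text{evaluations each}
  = \mathcal{O}\bigl(n\,\kappa \log(1/\varepsilon)\bigr),
\qquad
\text{S2GD:}\quad \mathcal{O}\bigl((n + \kappa)\log(1/\varepsilon)\bigr).
```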

Related Methods

SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
- Refreshes a single stochastic gradient in each iteration
- Needs to store a gradient for each function
- Similar convergence rate
- Cumbersome analysis

MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
- Similar to SAG, slightly worse performance
- Elegant analysis

Related Methods

SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
- Arises as a special case of S2GD

Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
- Extended to the proximal setting

EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
- Handles simple constraints
- Worse convergence rate

Experiment

Example problem, with