
Transcript of: A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

The talk is based on the work [2] and the slides [6].

Eduard Gorbunov¹   Filip Hanzely²   Peter Richtárik²,¹
¹ Moscow Institute of Physics and Technology, Russia
² King Abdullah University of Science and Technology, Saudi Arabia

INRIA, Paris, 18 October 2019


Table of Contents

1 Introduction

2 SGD

3 General analysis

4 Special cases

5 Discussion

6 Bibliography



The Problem

min_{x ∈ R^d} f(x) + R(x)   (1)

∙ f(x) — convex function with Lipschitz gradient
∙ R : R^d → R ∪ {+∞} — proximable (proper, closed, convex) regularizer



Structure of f in supervised learning theory and practice

We focus on the following situations:

∙ f(x) = E_{ξ∼𝒟}[f_ξ(x)]   (2)   or   f(x) = (1/n) Σ_{i=1}^n f_i(x)   (3)

∙ f(x) = (1/n) Σ_{i=1}^n f_i(x) with f_i(x) = (1/m) Σ_{j=1}^m f_{ij}(x)   (4)

Typical case: computing the exact gradient ∇f(x) is too expensive, but an unbiased estimator of ∇f(x) can be computed efficiently.
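For instance, in the finite-sum setting (3), averaging the gradients of a uniformly sampled mini-batch already gives such an unbiased estimator. A minimal sketch (the least-squares data and the batch size are illustrative assumptions, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite-sum problem: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
n, d = 100, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_fi(x, i):
    """Gradient of the i-th summand f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Exact gradient of f: average of the n summand gradients."""
    return A.T @ (A @ x - b) / n

def minibatch_grad(x, batch_size=10):
    """Unbiased estimator of grad f(x): average over a uniformly sampled batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    return np.mean([grad_fi(x, i) for i in idx], axis=0)

x = rng.normal(size=d)
# Averaging many independent estimates approaches the exact gradient (unbiasedness).
est = np.mean([minibatch_grad(x) for _ in range(5000)], axis=0)
print(np.linalg.norm(est - full_grad(x)))  # small
```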



Popular approach to solve problem (1)

x^{k+1} = prox_{γR}(x^k − γ g^k)   (5)

∙ g^k is an unbiased gradient estimator: E[g^k | x^k] = ∇f(x^k)
∙ prox_R(x) := argmin_u { (1/2)‖u − x‖² + R(u) }
∙ ‖·‖ — standard Euclidean norm

The prox operator:
∙ x ↦ prox_R(x) is a (single-valued) function
∙ ‖prox_R(x) − prox_R(y)‖ ≤ ‖x − y‖
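As a concrete instance of update (5), here is a minimal sketch assuming R(x) = λ‖x‖₁, whose prox is the soft-thresholding operator. The function and parameter names are illustrative; any of the estimators discussed below (SGD-SR, SAGA, SGD-star) can be plugged in as `grad_estimator`:

```python
import numpy as np

def prox_l1(x, step):
    """prox_{step * ||.||_1}(x): soft-thresholding, the prox of the l1 regularizer."""
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)

def prox_sgd(grad_estimator, x0, gamma, lam, iters=1000):
    """Proximal SGD (5): x^{k+1} = prox_{gamma R}(x^k - gamma g^k) with R = lam * ||.||_1."""
    x = x0.copy()
    for _ in range(iters):
        g = grad_estimator(x)                      # unbiased estimator of grad f(x)
        x = prox_l1(x - gamma * g, gamma * lam)    # gradient step followed by the prox
    return x
```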



Stochastic gradient

There are infinitely many ways of constructing an unbiased estimator with "good" properties.

∙ Flexibility to construct stochastic gradients that target desirable properties:
  ∙ convergence speed
  ∙ iteration cost
  ∙ overall complexity
  ∙ parallelizability
  ∙ communication cost, etc.
∙ Too many methods
∙ Hard to keep up with new results
∙ Challenges in terms of analysis
∙ Problems with fair comparison: different assumptions are used



Bregman divergence

By D_f(x, y) we denote the Bregman divergence associated with f:

D_f(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩.
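For concreteness, a small helper computing D_f for a differentiable f (purely illustrative):

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

# Example: for f(x) = 0.5*||x||^2 the Bregman divergence is 0.5*||x - y||^2.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(f, grad_f, x, y))  # 1.0 == 0.5 * ||x - y||^2
```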


Key assumption

Assumption 1

Let {x^k} be the random iterates produced by proximal SGD.

1. The stochastic gradients g^k are unbiased:

   E[g^k | x^k] = ∇f(x^k)   ∀k ≥ 0.   (6)

2. There exist non-negative constants A, B, C, D1, D2, ρ and a (possibly) random sequence {σ_k²}_{k≥0} such that the following two relations hold:

   E[‖g^k − ∇f(x*)‖² | x^k] ≤ 2A D_f(x^k, x*) + B σ_k² + D1,   (7)

   E[σ_{k+1}² | σ_k²] ≤ (1 − ρ) σ_k² + 2C D_f(x^k, x*) + D2,   ∀k ≥ 0.   (8)


Gradient Descent satisfies Assumption 1

Assumption 2. Assume that f is convex, i.e.

D_f(x, y) ≥ 0   ∀x, y ∈ R^d,

and L-smooth, i.e.

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖   ∀x, y ∈ R^d.

Assumption 2 implies that (see Nesterov's book [4])

‖∇f(x) − ∇f(y)‖² ≤ 2L D_f(x, y)   ∀x, y ∈ R^d.

Therefore, if f satisfies Assumption 2, then gradient descent satisfies Assumption 1 with

A = L, B = 0, D1 = 0, σ_k = 0, ρ = 1, C = 0, D2 = 0.


Other assumptions

Assumption 3 (Unique solution). Problem (1) has a unique minimizer x*.

Assumption 4 (μ-strong quasi-convexity). There exists μ > 0 such that f : R^d → R is μ-strongly quasi-convex, i.e.

f(x*) ≥ f(x) + ⟨∇f(x), x* − x⟩ + (μ/2)‖x* − x‖²   ∀x ∈ R^d.   (9)


Main result

Theorem 1. Let Assumptions 1, 3 and 4 be satisfied. Choose a constant M such that M > B/ρ. Choose a stepsize satisfying

0 < γ ≤ min{ 1/μ, 1/(A + CM) }.   (10)

Then the iterates {x^k}_{k≥0} of proximal SGD satisfy

E[V^k] ≤ max{ (1 − γμ)^k, (1 + B/M − ρ)^k } V^0 + (D1 + M D2) γ² / min{ γμ, ρ − B/M },   (11)

where the Lyapunov function V^k is defined by V^k := ‖x^k − x*‖² + M γ² σ_k².
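As a sanity check, one can evaluate the right-hand side of (11) directly. A minimal sketch (the constants below are placeholders, not values from the talk):

```python
def theorem1_bound(k, V0, mu, gamma, A, B, C, D1, D2, rho, M):
    """Right-hand side of (11) for the Lyapunov function V^k, under conditions (10)."""
    assert M > B / rho and 0 < gamma <= min(1.0 / mu, 1.0 / (A + C * M))
    contraction = max((1 - gamma * mu) ** k, (1 + B / M - rho) ** k)
    neighborhood = (D1 + M * D2) * gamma**2 / min(gamma * mu, rho - B / M)
    return contraction * V0 + neighborhood

# Plain-SGD-like constants: A = 2L, D1 = 2*sigma^2, all other constants trivial.
L, mu, sigma2 = 10.0, 1.0, 0.5   # assumed (illustrative) problem constants
print(theorem1_bound(k=1000, V0=1.0, mu=mu, gamma=1 / (4 * L),
                     A=2 * L, B=0.0, C=0.0, D1=2 * sigma2, D2=0.0, rho=1.0, M=1.0))
```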



Methods

(Slide content not captured in the transcript.)

Parameters

(Slide content not captured in the transcript.)


Proximal Gradient Descent

We checked that GD satisfies Assumption 1 when f is convex and L-smooth, with

A = L, B = 0, D1 = 0, σ_k = 0, ρ = 1, C = 0, D2 = 0.

We can choose M = 1 in Theorem 1 and get:

∙ 0 < γ ≤ 1/L (since μ ≤ L and C = 0)
∙ V^k = ‖x^k − x*‖² (since σ_k = 0)
∙ The rate: ‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖²
∙ In particular, for γ = 1/L we recover the standard rate for proximal gradient descent:

  k ≥ (L/μ) log(1/ε)   ⟹   ‖x^k − x*‖² ≤ ε ‖x^0 − x*‖².
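For example, the iteration count implied by the last bullet can be computed directly (a minimal sketch; the constants are placeholders):

```python
import math

def gd_iterations(L, mu, eps):
    """Smallest k with (1 - mu/L)^k <= eps; roughly (L/mu) * log(1/eps)."""
    return math.ceil(math.log(1.0 / eps) / -math.log(1.0 - mu / L))

print(gd_iterations(L=100.0, mu=1.0, eps=1e-6))  # about (L/mu) * log(1/eps) ≈ 1.4e3
```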


SGD-SR (see Gower et al. (2019), [3])

Stochastic reformulation:

min_{x ∈ R^d} f(x) + R(x),   f(x) = E_𝒟[f_ξ(x)],   f_ξ(x) := (1/n) Σ_{i=1}^n ξ_i f_i(x)   (12)

1. ξ ∼ 𝒟: E_𝒟[ξ_i] = 1 for all i
2. each f_i is a smooth, possibly non-convex function

Algorithm 1 SGD-SR
Input: learning rate γ > 0, starting point x^0 ∈ R^d, distribution 𝒟 over ξ ∈ R^n such that E_𝒟[ξ] is the vector of ones
1: for k = 0, 1, 2, ... do
2:   Sample ξ ∼ 𝒟
3:   g^k = ∇f_ξ(x^k)
4:   x^{k+1} = prox_{γR}(x^k − γ g^k)
5: end for
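A minimal sketch of Algorithm 1 with single-element sampling (ξ = e_j / p_j with probability p_j, so E_𝒟[ξ] is the vector of ones). The least-squares data, the absence of a regularizer, and the probability choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite-sum problem: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2, R = 0.
n, d = 200, 10
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

def sgd_sr(x0, gamma, probs, iters=20_000):
    """SGD-SR with single-element sampling: j ~ probs, xi = e_j / p_j,
    so g^k = grad f_xi(x^k) = grad f_j(x^k) / (n * p_j) is unbiased for grad f."""
    x = x0.copy()
    for _ in range(iters):
        j = rng.choice(n, p=probs)
        g = grad_fi(x, j) / (n * probs[j])
        x = x - gamma * g            # prox step omitted since R = 0 here
    return x

# Uniform sampling vs. importance sampling proportional to L_i = ||a_i||^2.
Li = np.sum(A**2, axis=1)
x_unif = sgd_sr(np.zeros(d), gamma=1e-3, probs=np.full(n, 1 / n))
x_imp = sgd_sr(np.zeros(d), gamma=1e-3, probs=Li / Li.sum())
print(f(x_unif), f(x_imp))
```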


SGD-SR (see Gower et al. (2019), [3])

Expected smoothness. We say that f is ℒ-smooth in expectation with respect to a distribution 𝒟 if there exists ℒ = ℒ(f, 𝒟) > 0 such that

E_𝒟[‖∇f_ξ(x) − ∇f_ξ(x*)‖²] ≤ 2ℒ D_f(x, x*)   (13)

for all x ∈ R^d, and we write (f, 𝒟) ∼ ES(ℒ).
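A rough Monte-Carlo sanity check of (13) for single-element uniform sampling over quadratic losses (all data below are illustrative assumptions; for this sampling, max_i L_i is a valid value of ℒ when each f_i is convex and L_i-smooth):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
x_star, _, _, _ = np.linalg.lstsq(A, b, rcond=None)   # minimizer of the unregularized f

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
gf = lambda x: A.T @ (A @ x - b) / n
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]

def es_ratio(x):
    """E_j ||grad f_j(x) - grad f_j(x*)||^2 / (2 D_f(x, x*)) for uniform sampling."""
    num = np.mean([np.sum((grad_fi(x, i) - grad_fi(x_star, i)) ** 2) for i in range(n)])
    den = 2 * (f(x) - f(x_star) - gf(x_star) @ (x - x_star))
    return num / den

ratios = [es_ratio(rng.normal(size=d)) for _ in range(200)]
# Empirical estimate of a valid L vs. the bound max_i L_i = max_i ||a_i||^2.
print(max(ratios), float(np.max(np.sum(A**2, axis=1))))
```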


SGD-SR (see Gower et al. (2019), [3])

Lemma 1 (generalization of Lemma 2.4 in [3]). If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟[‖∇f_ξ(x) − ∇f(x*)‖²] ≤ 4ℒ D_f(x, x*) + 2σ²,   (14)

where σ² := E_𝒟[‖∇f_ξ(x*) − ∇f(x*)‖²].

That is, SGD-SR satisfies Assumption 1 with

A = 2ℒ, B = 0, D1 = 2σ², σ_k = 0, ρ = 1, C = 0, D2 = 0.

If γ_k ≡ γ ≤ 1/(2ℒ), then Theorem 1 implies that

E‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖² + 2γσ²/μ.


SAGA (see Defazio, Bach & Lacoste-Julien (2014), [1])

Consider the finite-sum minimization problem

min_{x ∈ R^d} f(x) + R(x),   f(x) = (1/n) Σ_{i=1}^n f_i(x),   (15)

where each f_i is convex and L-smooth, and f is μ-strongly convex.

Algorithm 2 SAGA [1]
Input: learning rate γ > 0, starting point x^0 ∈ R^d
1: Set φ_j^0 = x^0 for each j ∈ [n]
2: for k = 0, 1, 2, ... do
3:   Sample j ∈ [n] uniformly at random
4:   Set φ_j^{k+1} = x^k and φ_i^{k+1} = φ_i^k for i ≠ j
5:   g^k = ∇f_j(φ_j^{k+1}) − ∇f_j(φ_j^k) + (1/n) Σ_{i=1}^n ∇f_i(φ_i^k)
6:   x^{k+1} = prox_{γR}(x^k − γ g^k)
7: end for
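A minimal sketch of the SAGA update above, with the table of stored gradients ∇f_i(φ_i^k) kept explicitly. The quadratic data, the ℓ1 regularizer, and the stepsize choice γ = 1/(6L) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]
prox_l1 = lambda x, s: np.sign(x) * np.maximum(np.abs(x) - s, 0.0)

def saga(x0, gamma, lam=0.0, iters=20_000):
    x = x0.copy()
    table = np.array([grad_fi(x, i) for i in range(n)])  # stored grad f_i(phi_i^k)
    avg = table.mean(axis=0)                             # running average of the table
    for _ in range(iters):
        j = rng.integers(n)
        new = grad_fi(x, j)
        g = new - table[j] + avg            # SAGA estimator (line 5 of Algorithm 2)
        avg += (new - table[j]) / n         # keep the average in sync with the table
        table[j] = new
        x = prox_l1(x - gamma * g, gamma * lam)
    return x

L = float(np.max(np.sum(A**2, axis=1)))     # each f_i is L_i-smooth with L_i = ||a_i||^2
x_hat = saga(np.zeros(d), gamma=1 / (6 * L), lam=0.1)
```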


SAGA (see Defazio, Bach & Lacoste-Julien (2014), [1])

Lemma 2. We have

E[‖g^k − ∇f(x*)‖² | x^k] ≤ 4L D_f(x^k, x*) + 2σ_k²   (16)

and

E[σ_{k+1}² | x^k] ≤ (1 − 1/n) σ_k² + (2L/n) D_f(x^k, x*),   (17)

where σ_k² = (1/n) Σ_{i=1}^n ‖∇f_i(φ_i^k) − ∇f_i(x*)‖².


SAGA (see Defazio, Bach & Lacoste-Julien (2014), [1])

That is, SAGA satisfies Assumption 1 with

A = 2L, B = 2, D1 = 0, ρ = 1/n, C = L/n, D2 = 0.

Theorem 1 with M = 4n implies that for γ = 1/(6L)

E[V^k] ≤ (1 − min{ μ/(6L), 1/(2n) })^k V^0.


SGD-star

Consider the same situation as for SGD-SR. Recall that for SGD-SR we got

E‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖² + 2γσ²/μ,   γ ∈ (0, 1/(2ℒ)].

Algorithm 3 SGD-star
Input: learning rate γ > 0, starting point x^0 ∈ R^d, distribution 𝒟 over ξ ∈ R^n such that E_𝒟[ξ] is the vector of ones
1: for k = 0, 1, 2, ... do
2:   Sample ξ ∼ 𝒟
3:   g^k = ∇f_ξ(x^k) − ∇f_ξ(x*) + ∇f(x*)
4:   x^{k+1} = prox_{γR}(x^k − γ g^k)
5: end for
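SGD-star is a theoretical device: it assumes the stochastic gradients at the solution, ∇f_ξ(x*), are available. A minimal sketch of the estimator on line 3 with uniform single-element sampling (the toy 1D problem is an illustrative assumption):

```python
import numpy as np

def sgd_star_step(x, x_star, j, grad_fi, grad_f_at_star, gamma):
    """One SGD-star step: g = grad f_j(x) - grad f_j(x*) + grad f(x*)."""
    g = grad_fi(x, j) - grad_fi(x_star, j) + grad_f_at_star
    return x - gamma * g          # compose with prox_{gamma R} if R != 0

# Tiny illustration: f_i(x) = 0.5*(x - c_i)^2 in 1D, so x* = mean(c) and grad f(x*) = 0.
c = np.array([0.0, 2.0, 4.0])
grad_fi = lambda x, i: x - c[i]
x_star = c.mean()
x = 10.0
for k in range(50):
    j = np.random.randint(len(c))
    x = sgd_star_step(x, x_star, j, grad_fi, grad_f_at_star=0.0, gamma=0.5)
print(x)   # converges to x* = 2.0; the variance-induced neighborhood disappears
```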


SGD-star

Lemma 3 (Lemma 2.4 from [3]). If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟[‖g^k − ∇f(x*)‖²] ≤ 4ℒ D_f(x^k, x*).   (18)

Thus, SGD-star satisfies Assumption 1 with

A = 2ℒ, B = 0, D1 = 0, σ_k = 0, ρ = 1, C = 0, D2 = 0.

If γ_k ≡ γ ≤ 1/(2ℒ), then Theorem 1 implies that

E‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖².


Quantitative definition of variance reduction

Variance-reduced methods. We say that SGD with stochastic gradients satisfying Assumption 1 is variance-reduced if D1 = D2 = 0.

∙ Variance-reduced methods converge to the exact solution.
∙ The sequence {σ_k²}_{k≥0} reflects the progress of the variance-reduction process.


Limitations and Extensions

1. We consider only the L-smooth, μ-strongly quasi-convex case.
2. It would be interesting to unify the theory for biased gradient estimators (e.g. SAG [7], SARAH [5], zero-order optimization, etc.).
3. Our analysis doesn't recover the best known rates for RCD-type methods with importance sampling.
4. An extension to iteration-dependent parameters A, B, C, D1, D2, ρ would cover new methods, such as SGD with decreasing stepsizes.
5. We consider only non-accelerated methods. It would be interesting to provide a unified analysis of stochastic methods with acceleration and momentum.


Bibliography

[1] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[2] Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. arXiv preprint arXiv:1905.11261, 2019.

[3] Robert M. Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.

[4] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. 2004.

[5] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621. PMLR, 2017.

[6] Peter Richtárik. A guided walk through the zoo of stochastic gradient descent methods. ICCOPT 2019 Summer School.

[7] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.