
Transcript of: A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

The talk is based on the work [2] and the slides [6].

Eduard Gorbunov¹   Filip Hanzely²   Peter Richtárik²,¹
¹ Moscow Institute of Physics and Technology, Russia
² King Abdullah University of Science and Technology, Saudi Arabia

INRIA, Paris, 18 October 2019


Table of Contents

1 Introduction

2 SGD

3 General analysis

4 Special cases

5 Discussion

6 Bibliography



The Problem

min_{x ∈ R^d} f(x) + R(x)   (1)

∙ f(x) — convex function with Lipschitz gradient
∙ R : R^d → R ∪ {+∞} — proximable (proper, closed, convex) regularizer



Structure of f in supervised learning theory and practice

We focus on the following situations:

∙ f(x) = E_{ξ∼𝒟}[f_ξ(x)]   (2)   or   f(x) = (1/n) Σ_{i=1}^n f_i(x)   (3)

∙ f(x) = (1/n) Σ_{i=1}^n f_i(x) with f_i(x) = (1/m) Σ_{j=1}^m f_{ij}(x)   (4)

Typical case: computing the exact gradient ∇f(x) is too expensive, but an unbiased estimator of ∇f(x) can be computed efficiently.
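For instance, in the finite-sum setting (3), averaging the gradients of a uniformly sampled mini-batch already gives such an unbiased estimator. A minimal sketch (the least-squares data and the batch size are illustrative assumptions, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite-sum problem: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
n, d = 100, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_fi(x, i):
    """Gradient of the i-th summand f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Exact gradient of f: average of the n summand gradients."""
    return A.T @ (A @ x - b) / n

def minibatch_grad(x, batch_size=10):
    """Unbiased estimator of grad f(x): average over a uniformly sampled batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    return np.mean([grad_fi(x, i) for i in idx], axis=0)

x = rng.normal(size=d)
# Averaging many independent estimates approaches the exact gradient (unbiasedness).
est = np.mean([minibatch_grad(x) for _ in range(5000)], axis=0)
print(np.linalg.norm(est - full_grad(x)))  # small
```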



Popular approach to solve problem (1)

x^{k+1} = prox_{γR}(x^k − γ g^k)   (5)

∙ g^k is an unbiased gradient estimator: E[g^k | x^k] = ∇f(x^k)
∙ prox_R(x) := argmin_u { (1/2)‖u − x‖² + R(u) }
∙ ‖·‖ — standard Euclidean norm

The prox operator:
∙ x ↦ prox_R(x) is a (single-valued) function
∙ ‖prox_R(x) − prox_R(y)‖ ≤ ‖x − y‖
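As a concrete instance of update (5), here is a minimal sketch assuming R(x) = λ‖x‖₁, whose prox is the soft-thresholding operator. The function and parameter names are illustrative; any of the estimators discussed below (SGD-SR, SAGA, SGD-star) can be plugged in as `grad_estimator`:

```python
import numpy as np

def prox_l1(x, step):
    """prox_{step * ||.||_1}(x): soft-thresholding, the prox of the l1 regularizer."""
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)

def prox_sgd(grad_estimator, x0, gamma, lam, iters=1000):
    """Proximal SGD (5): x^{k+1} = prox_{gamma R}(x^k - gamma g^k) with R = lam * ||.||_1."""
    x = x0.copy()
    for _ in range(iters):
        g = grad_estimator(x)                      # unbiased estimator of grad f(x)
        x = prox_l1(x - gamma * g, gamma * lam)    # gradient step followed by the prox
    return x
```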



Stochastic gradient

There are infinitely many ways of constructing an unbiased estimator with "good" properties.

∙ Flexibility to construct stochastic gradients that target desirable properties:
  ∙ convergence speed
  ∙ iteration cost
  ∙ overall complexity
  ∙ parallelizability
  ∙ communication cost, etc.
∙ Too many methods
∙ Hard to keep up with new results
∙ Challenges in terms of analysis
∙ Problems with fair comparison: different assumptions are used



Bregman divergence

By D_f(x, y) we denote the Bregman divergence associated with f:

D_f(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩.
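For concreteness, a small helper computing D_f for a differentiable f (purely illustrative):

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

# Example: for f(x) = 0.5*||x||^2 the Bregman divergence is 0.5*||x - y||^2.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman(f, grad_f, x, y))  # 1.0 == 0.5 * ||x - y||^2
```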


Key assumption

Assumption 1

Let {x^k} be the random iterates produced by proximal SGD.

1. The stochastic gradients g^k are unbiased:

   E[g^k | x^k] = ∇f(x^k)   ∀k ≥ 0.   (6)

2. There exist non-negative constants A, B, C, D1, D2, ρ and a (possibly) random sequence {σ_k²}_{k≥0} such that the following two relations hold:

   E[‖g^k − ∇f(x*)‖² | x^k] ≤ 2A D_f(x^k, x*) + B σ_k² + D1,   (7)

   E[σ_{k+1}² | σ_k²] ≤ (1 − ρ) σ_k² + 2C D_f(x^k, x*) + D2,   ∀k ≥ 0.   (8)


Gradient Descent satisfies Assumption 1

Assumption 2. Assume that f is convex, i.e.

D_f(x, y) ≥ 0   ∀x, y ∈ R^d,

and L-smooth, i.e.

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖   ∀x, y ∈ R^d.

Assumption 2 implies that (see Nesterov's book [4])

‖∇f(x) − ∇f(y)‖² ≤ 2L D_f(x, y)   ∀x, y ∈ R^d.

Therefore, if f satisfies Assumption 2, then gradient descent satisfies Assumption 1 with

A = L, B = 0, D1 = 0, σ_k = 0, ρ = 1, C = 0, D2 = 0.


Other assumptions

Assumption 3 (Unique solution). Problem (1) has a unique minimizer x*.

Assumption 4 (μ-strong quasi-convexity). There exists μ > 0 such that f : R^d → R is μ-strongly quasi-convex, i.e.

f(x*) ≥ f(x) + ⟨∇f(x), x* − x⟩ + (μ/2)‖x* − x‖²   ∀x ∈ R^d.   (9)


Main result

Theorem 1. Let Assumptions 1, 3 and 4 be satisfied. Choose a constant M such that M > B/ρ. Choose a stepsize satisfying

0 < γ ≤ min{ 1/μ, 1/(A + CM) }.   (10)

Then the iterates {x^k}_{k≥0} of proximal SGD satisfy

E[V^k] ≤ max{ (1 − γμ)^k, (1 + B/M − ρ)^k } V^0 + (D1 + M D2) γ² / min{ γμ, ρ − B/M },   (11)

where the Lyapunov function V^k is defined by V^k := ‖x^k − x*‖² + M γ² σ_k².
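As a sanity check, one can evaluate the right-hand side of (11) directly. A minimal sketch (the constants below are placeholders, not values from the talk):

```python
def theorem1_bound(k, V0, mu, gamma, A, B, C, D1, D2, rho, M):
    """Right-hand side of (11) for the Lyapunov function V^k, under conditions (10)."""
    assert M > B / rho and 0 < gamma <= min(1.0 / mu, 1.0 / (A + C * M))
    contraction = max((1 - gamma * mu) ** k, (1 + B / M - rho) ** k)
    neighborhood = (D1 + M * D2) * gamma**2 / min(gamma * mu, rho - B / M)
    return contraction * V0 + neighborhood

# Plain-SGD-like constants: A = 2L, D1 = 2*sigma^2, all other constants trivial.
L, mu, sigma2 = 10.0, 1.0, 0.5   # assumed (illustrative) problem constants
print(theorem1_bound(k=1000, V0=1.0, mu=mu, gamma=1 / (4 * L),
                     A=2 * L, B=0.0, C=0.0, D1=2 * sigma2, D2=0.0, rho=1.0, M=1.0))
```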



Methods

(Slide content not captured in the transcript.)

Parameters

(Slide content not captured in the transcript.)


Proximal Gradient Descent

We checked that GD satisfies Assumption 1 when f is convex and L-smooth, with

A = L, B = 0, D1 = 0, σ_k = 0, ρ = 1, C = 0, D2 = 0.

We can choose M = 1 in Theorem 1 and get:

∙ 0 < γ ≤ 1/L (since μ ≤ L and C = 0)
∙ V^k = ‖x^k − x*‖² (since σ_k = 0)
∙ The rate: ‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖²
∙ In particular, for γ = 1/L we recover the standard rate for proximal gradient descent:

  k ≥ (L/μ) log(1/ε)   ⟹   ‖x^k − x*‖² ≤ ε ‖x^0 − x*‖².
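For example, the iteration count implied by the last bullet can be computed directly (a minimal sketch; the constants are placeholders):

```python
import math

def gd_iterations(L, mu, eps):
    """Smallest k with (1 - mu/L)^k <= eps; roughly (L/mu) * log(1/eps)."""
    return math.ceil(math.log(1.0 / eps) / -math.log(1.0 - mu / L))

print(gd_iterations(L=100.0, mu=1.0, eps=1e-6))  # about (L/mu) * log(1/eps) ≈ 1.4e3
```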


SGD-SR (see Gower et al. (2019), [3])

Stochastic reformulation:

min_{x ∈ R^d} f(x) + R(x),   f(x) = E_𝒟[f_ξ(x)],   f_ξ(x) := (1/n) Σ_{i=1}^n ξ_i f_i(x)   (12)

1. ξ ∼ 𝒟: E_𝒟[ξ_i] = 1 for all i
2. each f_i is a smooth, possibly non-convex function

Algorithm 1 SGD-SR
Input: learning rate γ > 0, starting point x^0 ∈ R^d, distribution 𝒟 over ξ ∈ R^n such that E_𝒟[ξ] is the vector of ones
1: for k = 0, 1, 2, ... do
2:   Sample ξ ∼ 𝒟
3:   g^k = ∇f_ξ(x^k)
4:   x^{k+1} = prox_{γR}(x^k − γ g^k)
5: end for
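A minimal sketch of Algorithm 1 with single-element sampling (ξ = e_j / p_j with probability p_j, so E_𝒟[ξ] is the vector of ones). The least-squares data, the absence of a regularizer, and the probability choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite-sum problem: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2, R = 0.
n, d = 200, 10
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

def sgd_sr(x0, gamma, probs, iters=20_000):
    """SGD-SR with single-element sampling: j ~ probs, xi = e_j / p_j,
    so g^k = grad f_xi(x^k) = grad f_j(x^k) / (n * p_j) is unbiased for grad f."""
    x = x0.copy()
    for _ in range(iters):
        j = rng.choice(n, p=probs)
        g = grad_fi(x, j) / (n * probs[j])
        x = x - gamma * g            # prox step omitted since R = 0 here
    return x

# Uniform sampling vs. importance sampling proportional to L_i = ||a_i||^2.
Li = np.sum(A**2, axis=1)
x_unif = sgd_sr(np.zeros(d), gamma=1e-3, probs=np.full(n, 1 / n))
x_imp = sgd_sr(np.zeros(d), gamma=1e-3, probs=Li / Li.sum())
print(f(x_unif), f(x_imp))
```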


SGD-SR (see Gower et al. (2019), [3])

Expected smoothness. We say that f is ℒ-smooth in expectation with respect to a distribution 𝒟 if there exists ℒ = ℒ(f, 𝒟) > 0 such that

E_𝒟[‖∇f_ξ(x) − ∇f_ξ(x*)‖²] ≤ 2ℒ D_f(x, x*)   (13)

for all x ∈ R^d, and we write (f, 𝒟) ∼ ES(ℒ).
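A rough Monte-Carlo sanity check of (13) for single-element uniform sampling over quadratic losses (all data below are illustrative assumptions; for this sampling, max_i L_i is a valid value of ℒ when each f_i is convex and L_i-smooth):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
x_star, _, _, _ = np.linalg.lstsq(A, b, rcond=None)   # minimizer of the unregularized f

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
gf = lambda x: A.T @ (A @ x - b) / n
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]

def es_ratio(x):
    """E_j ||grad f_j(x) - grad f_j(x*)||^2 / (2 D_f(x, x*)) for uniform sampling."""
    num = np.mean([np.sum((grad_fi(x, i) - grad_fi(x_star, i)) ** 2) for i in range(n)])
    den = 2 * (f(x) - f(x_star) - gf(x_star) @ (x - x_star))
    return num / den

ratios = [es_ratio(rng.normal(size=d)) for _ in range(200)]
# Empirical estimate of a valid L vs. the bound max_i L_i = max_i ||a_i||^2.
print(max(ratios), float(np.max(np.sum(A**2, axis=1))))
```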


SGD-SR (see Gower et al. (2019), [3])

Lemma 1 (generalization of Lemma 2.4 in [3]). If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟[‖∇f_ξ(x) − ∇f(x*)‖²] ≤ 4ℒ D_f(x, x*) + 2σ²,   (14)

where σ² := E_𝒟[‖∇f_ξ(x*) − ∇f(x*)‖²].

That is, SGD-SR satisfies Assumption 1 with

A = 2ℒ, B = 0, D1 = 2σ², σ_k = 0, ρ = 1, C = 0, D2 = 0.

If γ_k ≡ γ ≤ 1/(2ℒ), then Theorem 1 implies that

E‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖² + 2γσ²/μ.


SAGA (see Defazio, Bach & Lacoste-Julien (2014), [1])

Consider the finite-sum minimization problem

min_{x ∈ R^d} f(x) + R(x),   f(x) = (1/n) Σ_{i=1}^n f_i(x),   (15)

where each f_i is convex and L-smooth, and f is μ-strongly convex.

Algorithm 2 SAGA [1]
Input: learning rate γ > 0, starting point x^0 ∈ R^d
1: Set φ_j^0 = x^0 for each j ∈ [n]
2: for k = 0, 1, 2, ... do
3:   Sample j ∈ [n] uniformly at random
4:   Set φ_j^{k+1} = x^k and φ_i^{k+1} = φ_i^k for i ≠ j
5:   g^k = ∇f_j(φ_j^{k+1}) − ∇f_j(φ_j^k) + (1/n) Σ_{i=1}^n ∇f_i(φ_i^k)
6:   x^{k+1} = prox_{γR}(x^k − γ g^k)
7: end for
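A minimal sketch of the SAGA update above, with the table of stored gradients ∇f_i(φ_i^k) kept explicitly. The quadratic data, the ℓ1 regularizer, and the stepsize choice γ = 1/(6L) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]
prox_l1 = lambda x, s: np.sign(x) * np.maximum(np.abs(x) - s, 0.0)

def saga(x0, gamma, lam=0.0, iters=20_000):
    x = x0.copy()
    table = np.array([grad_fi(x, i) for i in range(n)])  # stored grad f_i(phi_i^k)
    avg = table.mean(axis=0)                             # running average of the table
    for _ in range(iters):
        j = rng.integers(n)
        new = grad_fi(x, j)
        g = new - table[j] + avg            # SAGA estimator (line 5 of Algorithm 2)
        avg += (new - table[j]) / n         # keep the average in sync with the table
        table[j] = new
        x = prox_l1(x - gamma * g, gamma * lam)
    return x

L = float(np.max(np.sum(A**2, axis=1)))     # each f_i is L_i-smooth with L_i = ||a_i||^2
x_hat = saga(np.zeros(d), gamma=1 / (6 * L), lam=0.1)
```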


SAGA (see Defazio, Bach & Lacoste-Julien (2014), [1])

Lemma 2. We have

E[‖g^k − ∇f(x*)‖² | x^k] ≤ 4L D_f(x^k, x*) + 2σ_k²   (16)

and

E[σ_{k+1}² | x^k] ≤ (1 − 1/n) σ_k² + (2L/n) D_f(x^k, x*),   (17)

where σ_k² = (1/n) Σ_{i=1}^n ‖∇f_i(φ_i^k) − ∇f_i(x*)‖².


SAGA (see Defazio, Bach & Lacoste-Julien (2014), [1])

That is, SAGA satisfies Assumption 1 with

A = 2L, B = 2, D1 = 0, ρ = 1/n, C = L/n, D2 = 0.

Theorem 1 with M = 4n implies that for γ = 1/(6L)

E[V^k] ≤ (1 − min{ μ/(6L), 1/(2n) })^k V^0.


SGD-star

Consider the same situation as for SGD-SR. Recall that for SGD-SR we got

E‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖² + 2γσ²/μ,   γ ∈ (0, 1/(2ℒ)].

Algorithm 3 SGD-star
Input: learning rate γ > 0, starting point x^0 ∈ R^d, distribution 𝒟 over ξ ∈ R^n such that E_𝒟[ξ] is the vector of ones
1: for k = 0, 1, 2, ... do
2:   Sample ξ ∼ 𝒟
3:   g^k = ∇f_ξ(x^k) − ∇f_ξ(x*) + ∇f(x*)
4:   x^{k+1} = prox_{γR}(x^k − γ g^k)
5: end for
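SGD-star is a theoretical device: it assumes the stochastic gradients at the solution, ∇f_ξ(x*), are available. A minimal sketch of the estimator on line 3 with uniform single-element sampling (the toy 1D problem is an illustrative assumption):

```python
import numpy as np

def sgd_star_step(x, x_star, j, grad_fi, grad_f_at_star, gamma):
    """One SGD-star step: g = grad f_j(x) - grad f_j(x*) + grad f(x*)."""
    g = grad_fi(x, j) - grad_fi(x_star, j) + grad_f_at_star
    return x - gamma * g          # compose with prox_{gamma R} if R != 0

# Tiny illustration: f_i(x) = 0.5*(x - c_i)^2 in 1D, so x* = mean(c) and grad f(x*) = 0.
c = np.array([0.0, 2.0, 4.0])
grad_fi = lambda x, i: x - c[i]
x_star = c.mean()
x = 10.0
for k in range(50):
    j = np.random.randint(len(c))
    x = sgd_star_step(x, x_star, j, grad_fi, grad_f_at_star=0.0, gamma=0.5)
print(x)   # converges to x* = 2.0; the variance-induced neighborhood disappears
```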


SGD-star

Lemma 3 (Lemma 2.4 from [3]). If (f, 𝒟) ∼ ES(ℒ), then

E_𝒟[‖g^k − ∇f(x*)‖²] ≤ 4ℒ D_f(x^k, x*).   (18)

Thus, SGD-star satisfies Assumption 1 with

A = 2ℒ, B = 0, D1 = 0, σ_k = 0, ρ = 1, C = 0, D2 = 0.

If γ_k ≡ γ ≤ 1/(2ℒ), then Theorem 1 implies that

E‖x^k − x*‖² ≤ (1 − γμ)^k ‖x^0 − x*‖².


Quantitative definition of variance reduction

Variance-reduced methods. We say that SGD with stochastic gradients satisfying Assumption 1 is variance-reduced if D1 = D2 = 0.

∙ Variance-reduced methods converge to the exact solution.
∙ The sequence {σ_k²}_{k≥0} reflects the progress of the variance-reduction process.


Limitations and Extensions

1. We consider only the L-smooth, μ-strongly quasi-convex case.
2. It would be interesting to unify the theory for biased gradient estimators (e.g. SAG [7], SARAH [5], zero-order optimization, etc.).
3. Our analysis doesn't recover the best known rates for RCD-type methods with importance sampling.
4. An extension to iteration-dependent parameters A, B, C, D1, D2, ρ would cover new methods, such as SGD with decreasing stepsizes.
5. We consider only non-accelerated methods. It would be interesting to provide a unified analysis of stochastic methods with acceleration and momentum.


Bibliography

[1] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[2] Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. arXiv preprint arXiv:1905.11261, 2019.

[3] Robert M. Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.

[4] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. 2004.

[5] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621. PMLR, 2017.

[6] Peter Richtárik. A guided walk through the zoo of stochastic gradient descent methods. ICCOPT 2019 Summer School.

[7] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.