JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization...

15
JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin, Fellow, IEEE Abstract—Numerical optimization serves as one of the pillars of machine learning. To meet the demands of big data applications, lots of efforts have been done on designing theoretically and practically fast algorithms. This paper provides a comprehensive survey on accelerated first-order algorithms with a focus on stochastic algorithms. Specifically, the paper starts with reviewing the basic accelerated algorithms on deterministic convex opti- mization, then concentrates on their extensions to stochastic con- vex optimization, and at last introduces some recent developments on acceleration for nonconvex optimization. Index Terms—Machine learning, acceleration, convex optimiza- tion, nonconvex optimization, deterministic algorithms, stochastic algorithms. I. I NTRODUCTION Many machine learning problems can be formulated as the sum of n loss functions and one regularizer min xR p F (x) def = f (x)+ h(x) def = 1 n n X i=1 f i (x)+ h(x), (1) where f i (x) is the loss function, h(x) is typically a regularizer and n is the sample size. Examples of f i (x) include f i (x)= (y i - A T i x) 2 for the linear least squared loss and f i (x)= log(1 + exp(-y i A T i x)) for the logistic loss, where A i R p is the feature vector of the i-th sample and y i R is its target value or label. Representative examples of h(x) include the 2 regularizer h(x)= 1 2 kxk 2 and the 1 regularizer h(x)= kxk 1 . Problem (1) covers many famous models in machine learning, e.g., support vector machine (SVM) [1], logistic regression [2], LASSO [3], multi-layer perceptron [4], and so on. Optimization plays an indispensable role in machine learning, which involves the numerical computation of the optimal parameters with respect to a given learning model based on the training data. Note that the dimension p can be very high in many machine learning applications. In such a setting, computing the Hessian matrix of f to use in a second-order Li is sponsored by Zhejiang Lab (grant no. 2019KB0AB02). Lin is supported by NSF China (grant no.s 61625301 and 61731018), Major Research Project of Zhejiang Lab (grant no.s 2019KB0AC01 and 2019KB0AB02) and Beijing Academy of Artificial Intelligence. H. Li is with the Institute of Robotics and Automatic Information Systems, College of Artificial Intelligence, Nankai University, Tianjin, China (lihuan [email protected]). This work was done when Li was an assistant professor at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China. C. Fang is with the Department of Electrical Engineering, Princeton University ([email protected]). H. Li and C. Fang are equal contributors to this paper. Z. Lin is with Key Lab. of Machine Perception (MOE), School of EECS, Peking University, Beijing, China ([email protected]). Z. Lin is the corresponding author. algorithm is time-consuming. Thus, first-order optimization methods are usually preferred over high-order ones and they have been the main workhorse for a tremendous amount of machine learning applications. Gradient descent (GD) has been one of the most commonly used first-order method due to its simplicity to implement and low computational cost per iteration. Although practical and effective, GD converges slowly in many applications. 
To accelerate its convergence, there has been a surge of interest in accelerated gradient methods, where “accelerated” means that the convergence rate can be improved without much stronger assumptions or significant additional computational burden. Nesterov has proposed several accelerated gradient descent (AGD) methods in his celebrated works [5]–[8], which have provable faster convergence rates than the basic GD. Originating from Nesterov’s celebrated works, accelerated first-order methods have become a hot topic in the machine learning community, yielding great success [9]. In machine learning, the sample size n can be extremely large and computing the full gradient in GD or AGD is time consuming. So stochastic gradient methods are the coin of the realm to deal with big data, which only use a few randomly-chosen samples at each iteration. It motivates the extension of Nesterov’s accelerated methods from deterministic optimization to finite- sum stochastic optimization [10]–[20]. Due to the success of deep learning, in recent years there has been a trend to design and analyze efficient nonconvex optimization algorithms, especially with a focus on accelerated methods [21]–[27]. In this paper, we provide a comprehensive survey on the accelerated first-order algorithms. To proceed, we provide some notations and definitions that will be frequently used in this paper. A. Notations and Definitions We use uppercase bold letters to represent matrices, lower- case bold letters for vectors and non-bold letters for scalars. Denote by A i the i-th column of A, x i the i-th coordinate of x, and i f (x) the i-th coordinate of f (x). We denote by x k the value of x of an algorithm at the k-th iteration and x * any optimal solution of problem (1). For scalars, e.g., θ, we denote by {θ k } k=0 a sequence of real numbers and by θ 2 k the power of θ k . We study both convex and nonconvex problems in this paper. Definition 1: A function f (x) is μ-strongly convex, meaning that f (y) f (x)+ hξ, y - xi + μ 2 ky - xk 2 (2)

Transcript of JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization...

Page 1: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

Accelerated First-Order Optimization Algorithmsfor Machine Learning

Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin, Fellow, IEEE

Abstract—Numerical optimization serves as one of the pillars ofmachine learning. To meet the demands of big data applications,lots of efforts have been done on designing theoretically andpractically fast algorithms. This paper provides a comprehensivesurvey on accelerated first-order algorithms with a focus onstochastic algorithms. Specifically, the paper starts with reviewingthe basic accelerated algorithms on deterministic convex opti-mization, then concentrates on their extensions to stochastic con-vex optimization, and at last introduces some recent developmentson acceleration for nonconvex optimization.

Index Terms—Machine learning, acceleration, convex optimiza-tion, nonconvex optimization, deterministic algorithms, stochasticalgorithms.

I. INTRODUCTION

Many machine learning problems can be formulated as thesum of n loss functions and one regularizer

minx∈Rp

F (x)def= f(x) + h(x)

def=

1

n

n∑i=1

fi(x) + h(x), (1)

where fi(x) is the loss function, h(x) is typically a regularizerand n is the sample size. Examples of fi(x) include fi(x) =(yi − AT

i x)2 for the linear least squared loss and fi(x) =log(1 + exp(−yiAT

i x)) for the logistic loss, where Ai ∈ Rpis the feature vector of the i-th sample and yi ∈ R is its targetvalue or label. Representative examples of h(x) include the `2regularizer h(x) = 1

2‖x‖2 and the `1 regularizer h(x) = ‖x‖1.

Problem (1) covers many famous models in machine learning,e.g., support vector machine (SVM) [1], logistic regression [2],LASSO [3], multi-layer perceptron [4], and so on.

Optimization plays an indispensable role in machine learning,which involves the numerical computation of the optimalparameters with respect to a given learning model based on thetraining data. Note that the dimension p can be very highin many machine learning applications. In such a setting,computing the Hessian matrix of f to use in a second-order

Li is sponsored by Zhejiang Lab (grant no. 2019KB0AB02). Lin is supportedby NSF China (grant no.s 61625301 and 61731018), Major Research Projectof Zhejiang Lab (grant no.s 2019KB0AC01 and 2019KB0AB02) and BeijingAcademy of Artificial Intelligence.

H. Li is with the Institute of Robotics and Automatic InformationSystems, College of Artificial Intelligence, Nankai University, Tianjin, China(lihuan [email protected]). This work was done when Li was an assistant professorat the College of Computer Science and Technology, Nanjing University ofAeronautics and Astronautics, Nanjing, China.

C. Fang is with the Department of Electrical Engineering, PrincetonUniversity ([email protected]). H. Li and C. Fang are equal contributorsto this paper.

Z. Lin is with Key Lab. of Machine Perception (MOE), School ofEECS, Peking University, Beijing, China ([email protected]). Z. Lin is thecorresponding author.

algorithm is time-consuming. Thus, first-order optimizationmethods are usually preferred over high-order ones and theyhave been the main workhorse for a tremendous amount ofmachine learning applications.

Gradient descent (GD) has been one of the most commonlyused first-order method due to its simplicity to implementand low computational cost per iteration. Although practicaland effective, GD converges slowly in many applications. Toaccelerate its convergence, there has been a surge of interest inaccelerated gradient methods, where “accelerated” means thatthe convergence rate can be improved without much strongerassumptions or significant additional computational burden.Nesterov has proposed several accelerated gradient descent(AGD) methods in his celebrated works [5]–[8], which haveprovable faster convergence rates than the basic GD.

Originating from Nesterov’s celebrated works, acceleratedfirst-order methods have become a hot topic in the machinelearning community, yielding great success [9]. In machinelearning, the sample size n can be extremely large andcomputing the full gradient in GD or AGD is time consuming.So stochastic gradient methods are the coin of the realm to dealwith big data, which only use a few randomly-chosen samplesat each iteration. It motivates the extension of Nesterov’saccelerated methods from deterministic optimization to finite-sum stochastic optimization [10]–[20]. Due to the successof deep learning, in recent years there has been a trend todesign and analyze efficient nonconvex optimization algorithms,especially with a focus on accelerated methods [21]–[27].

In this paper, we provide a comprehensive survey on theaccelerated first-order algorithms. To proceed, we provide somenotations and definitions that will be frequently used in thispaper.

A. Notations and Definitions

We use uppercase bold letters to represent matrices, lower-case bold letters for vectors and non-bold letters for scalars.Denote by Ai the i-th column of A, xi the i-th coordinate ofx, and ∇if(x) the i-th coordinate of ∇f(x). We denote byxk the value of x of an algorithm at the k-th iteration and x∗

any optimal solution of problem (1). For scalars, e.g., θ, wedenote by {θk}∞k=0 a sequence of real numbers and by θ2k thepower of θk.

We study both convex and nonconvex problems in this paper.Definition 1: A function f(x) is µ-strongly convex, meaning

thatf(y) ≥ f(x) + 〈ξ,y − x〉+

µ

2‖y − x‖2 (2)

Page 2: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2

for all x and y, where ξ ∈ ∂f(x) is a subgradient of f .Especially, we allow µ = 0, in which case we call f(x) isnon-strongly convex.

Note that “non-strongly convex” is frequently used in thispaper. So a definition is appropriate. We often assume thatthe objective function is L-smooth, meaning that its gradientcannot change arbitrarily fast.

Definition 2: A function f is L-smooth if it satisfies

‖∇f(y)−∇f(x)‖ ≤ L‖y − x‖

for all x and y and some L ≥ 0.A vital property of a L-smooth function f is:

f(y) ≤ f(x) + 〈ξ,y − x〉+L

2‖y − x‖2,∀x,y.

Classically, the definition of a first-order algorithm inoptimization theory is based on an oracle that only returnsf(x) and ∇f(x) for a given x. Here, we adopt a much moregeneral sense that the oracle also returns the solution of somesimple proximal mapping.

Definition 3: The proximal mapping of a function h forsome some given z is defined as

Proxh(z) = argminx∈Rp

h(x) +1

2‖x− z‖2.

“Simple” means that the solution can be computed efficientlyand it does not dominate the computation time at each iterationof an algorithm, e.g., having a closed solution, which is typicalin machine learning. For example, in compressed sensing, weoften use Proxλ‖·‖1(z) = sign(z) max{0, |z|−λ}. In this paper,we only consider algorithms based on the proximal mappingof fi(x) or h(x) in (1), but not that of F (x).

In this paper, we use iteration complexity to describe theconvergence speed of a deterministic algorithm.

Definition 4: For convex problems, we define iterationcomplexity as the smallest number of iterations needed tofind an ε-optimal solution within a tolerance ε on the error tothe optimal objective, i.e., F (xk)− F (x∗) ≤ ε.

In nonconvex optimization, it is infeasible to describe theconvergence speed by F (x) − F (x∗) ≤ ε, since finding theglobal minima is NP-hard. Alternatively, we use the numberof iterations to find an ε-approximate stationary point.

Definition 5: We say that x is an ε-approximate first-orderstationary point of problem (1), if it satisfies ‖x− Proxh(x−∇f(x))‖ ≤ ε. It reduces to ‖∇f(x)‖ ≤ ε when h(x) = 0.

For nonconvex functions, first-order stationary points can beglobal minima, local minima, saddle points or local maxima.Sometimes, it is not enough to find first-order stationary pointsand it motivates us to pursuit high-order stationary points.

Definition 6: We say that x is an (ε,O(√ε))-approximate

second-order stationary point of problem (1) with h(x) = 0,if it satisfies ‖∇f(x)‖ ≤ ε and σmin(∇2f(x)) ≥ −O(

√ε),

where σmin(∇2f(x)) means the smallest singular value of theHessian matrix.

Intuitively speaking, ‖∇f(x)‖ = 0 and ∇2f(x) � 0 meansthat x is either a local minima or a higher-order saddle point.Since higher-order saddle points do not exist for many machinelearning problems and all local minima are global minima, e.g.,

in matrix sensing [28], matrix completion [29], robust PCA[30] and deep neural networks [31], [32], it is enough to findsecond-order stationary points for these problems.

For stochastic algorithms, to emphasize the dependence onthe sample size n, we use gradient complexity to describe theconvergence speed.

Definition 7: The gradient complexity of a stochastic al-gorithm is defined as the number of accessing the individ-ual gradients for searching an ε-optimal solution or an ε-approximate stationary point in expectation, i.e., replacingF (x), ‖∇f(x)‖ and σmin(∇2f(x)) by E[F (x)], E[‖∇f(x)‖],and E[σmin(∇2f(x))] in the above definitions, respectively.

Finally, we define the Bregman distance. The most commonlyused Bregman distance is D(x,y) = 1

2‖x− y‖2.Definition 8: Bregman distance is defined as

Dϕ(u,v) = ϕ(u)−(ϕ(v) +

⟨∇ϕ(v),u− v

⟩)(3)

for strongly convex ϕ and ∇ϕ(v) ∈ ∂ϕ(v).

II. BASIC ACCELERATED DETERMINISTIC ALGORITHMS

In this section, we discuss the speedup guarantees of thebasic accelerated gradient methods over the basic gradientdescent for deterministic convex optimization.

A. Gradient Descent

GD and its proximal variant have been one of the mostcommonly used first-order deterministic method. The latter oneconsists of the following iterations

xk+1 = Proxηh(xk − η∇f(xk)

),

where we assume that f is L-smooth. η is the step-size and itis usually set to 1

L . When the objective f(x) in (1) is L-smoothand µ-strongly convex, and h(x) is convex, gradient descentand its proximal variant converge linearly [33], described as

F (xk)−F (x∗) ≤(

1− µ

L

)kL‖x0−x∗‖2 = O

((1− µ

L

)k).

In other words, the iteration complexity of GD is O(Lµ log 1

ε

)to find an ε-optimal solution.

When f(x) is smooth and non-strongly convex, GD onlyobtains a sublinear rate [33] of

F (xk)− F (x∗) ≤ L

2(k + 1)‖x0 − x∗‖2 = O (1/k) .

In this case, the iteration complexity of GD becomes O(Lε

).

B. Heavy-Ball Method

The convergence speed of GD for strongly convex problemsis determined by the constant L/µ, which is known as thecondition number of f(x), and it is always greater or equal to1. When the condition number is very large, i.e., the problem isill-conditioned, GD converges slowly. Accelerated methods canspeed up over GD significantly for ill-conditioned problems.

Polyak’s heavy-ball method [34] was the first acceleratedgradient method. It counts for the history of iterates whencomputing the next iterate. The next iterate depends not only

Page 3: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3

on the current iterate, but also the previous ones. The proximalvariant of the heavy-ball method [35] is

xk+1 = Proxηh(xk − η∇f(xk) + β(xk − xk−1)

), (4)

where η = 4(√L+√µ)2

and β =(√L−√µ)2

(√L+√µ)2

. When f(x) isL-smooth and µ-strongly convex and h(x) is convex, andmoreover, f(x) is twice continuously differentiable, the heavy-ball method and its proximal variant have the following localaccelerated convergence rate [35]

F (xk)−F (x∗)≤O

(√L−√µ√L+√µ

)k≤O((1−√µ

L

)k).

So the iteration complexity of the heavy-ball method isO(√

Lµ log 1

ε

), which is significantly lower than that of the

basic GD when L/µ is large. The twice continuous differen-tiability is necessary to ensure the convergence. Otherwise,the heavy-ball method may fail to converge even for stronglyconvex problems [36].

When the strong convexity assumption is absent, currentlyonly the O (L/ε) iteration complexity is proved for theheavy-ball method [37], which is the same as the basicGD. Theoretically, it is unclear whether the O(1/k) rate istight. [37] numerically observed that O(1/k) is an accurateconvergence rate estimate for the Heavy-ball method. Next,we introduce Nesterov’s basic accelerated gradient methodsto further speedup the convergence for non-strongly convexproblems.

C. Nesterov’s Accelerated Gradient Method

Nesterov’s accelerated gradient methods have faster con-vergence rates than the basic GD for both strongly convexand non-strongly convex problems. In its simplest form, theproximal variant of Nesterov’s AGD [38] takes the form

yk = xk + βk(xk − xk−1), (5a)

xk+1 = Proxηh(yk − η∇f(yk)

). (5b)

Physically, AGD first adds an momentum, i.e., xk − xk−1,to the current point xk to generate an extrapolated point yk,and then performs a proximal gradient descent step at yk.Similar to the heavy-ball method, the iteration complexity of(5a)-(5b) is O

(√Lµ log 1

ε

)for problem (1) with L-smooth and

µ-strongly convex f(x) and convex h(x), by setting η = 1L ,

βk ≡√L−√µ√L+√µ

, and x0 = x−1 [33]. However, it does not needthe assumption of twice continuous differentiability of f(x).

Better than the heavy-ball method, for smooth and non-strongly convex problems, AGD has a faster sublinear ratedescribed as

F (xk)− F (x∗) ≤ O(1/k2

),

and the iteration complexity is improved to O(√

). One

often sets βk = θk(1−θk−1)θk−1

for non-strongly convex problems,where the positive sequence {θk}∞k=0 is obtained by solving

equation θ2k = (1 − θk)θ2k−1, which is initialized by θ0 = 1.Sometimes, one sets βk = k−1

k+2 for simplicity.Physically, the acceleration can be interpreted as adding

momentum to the iterates. Also, [39] derived a second-orderordinary differential equation to model scheme (5a)-(5b), [40]analyzed it via the notion of integral quadratic constraints [36]from the robust control theory and [41] further explained themechanism of acceleration from a continuous-time variationalpoint of view.

D. Other Variants and Extensions

Besides the basic AGD (5a)-(5b), Nesterov also proposedseveral other accelerated methods [6]–[8], [42], and Tsengfurther provided a unified analysis [43]. We briefly introducethe method in [6], which is easier to extend to many othervariants than (5a)-(5b). These variants include, e.g., acceleratedvariance reduction [10], accelerated randomized coordinatedescent [15], [16], and accelerated asynchronous algorithm[44]. This method consists of the following iterations

yk = (1− θk)xk + θkzk, (6a)

zk+1 = Proxh/(Lθk)

(zk − 1

Lθk∇f(yk)

), (6b)

xk+1 = (1− θk)xk + θkzk+1, (6c)

where θk is the same as that in (5a)-(5b), and we initializez0 = x0. Note that (5a)-(5b) and (6a)-(6c) produce the sameiterates yk and xk when h(x) = 0. To explain the mechanismof acceleration, [45] explicated (6a)-(6c) by linear coupling(step (6a)) of gradient descent (step (6c)) and mirror descent(step (6b)). [20] viewed (6a)-(6c) as an iterative buyer-suppliergame by rewriting it in an equivalent primal-dual form [46],[47].

Motivated by Nesterov’s celebrated work, some researchershave proposed other accelerated methods. [48] proposed ageometric descent method, which has a simple geometricinterpretation of acceleration. [49] explained the geometricdescent from the perspective of optimal average of quadraticlower models, which is related to Nesterov’s estimate sequencetechnique [33]. However, the methods in [48], [49] need aline-search step. They minimize f(x) exactly on the linebetween two points x and y. Thus, their methods are notrigorously “first-order” methods. [50] proposed a numericalprocedure for computing optimal tuning coefficients in a classof first-order algorithms, including Nesterov’s AGD. Motivatedby [50], [51] introduced several new optimized first-ordermethods whose coefficients are analytically found. [52]–[54]extended the accelerated methods to some complex compositeconvex optimization and structured convex optimization viathe gradient sliding technique, where an inner loop is used toskip some computations from time to time. The accelerationtechnique has also been used to solve linearly constrainedproblems [55]–[59]. However, many methods for constrainedproblems [46], [47], [60]–[64] need to solve an optimizationsubproblem exactly, thus they are not first-order methods either.

Page 4: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4

Method Strongly convex Non-strongly convex

GD O(Lµ log 1

ε

)[33] O

(Lε

)[33]

heavy-ball O(√

Lµ log 1

ε

)[35] O

(Lε

)[37]

AGD O(√

Lµ log 1

ε

)[8], [33] O

(√Lε

)[8], [33], [38], [51]

Lower Bounds O(√

Lµ log 1

ε

)[33], [66], [67] O

(√Lε

)[33], [66], [67]

TABLE IITERATION COMPLEXITY COMPARISONS BETWEEN GD, THE HEAVY-BALL

METHOD AND AGD, AS WELL AS THE LOWER BOUNDS.

E. Lower Bound

Can we find algorithms faster than AGD? Better yet, howfast can we solve problem (1), or its simplified case

minx∈Rp

f(x), (7)

to some accuracy ε, using methods only based on the infor-mation of ∇f(x)? A few existing bounds can answer thesequestions. The first lower bounds for first-order optimizationalgorithms were given in [65], and then were extended in [33].We introduce the widely used conclusion in [33]. Consider anyiterative first-order method generating a sequence of points{xt}kt=0 such that

xk ∈ x0 + Span{∇f(x0), · · · ,∇f(xk−1)}. (8)

[33] constructed a special L-smooth and µ-strongly convexfunction f(x) such that for any sequence satisfying (8), wehave

f(xk)− f(x∗) ≥ µ

2

(√L−√µ√L+√µ

)2k

‖x0 − x∗‖2.

It means that any first-order method satisfying (8) needs atleast O

(√Lµ log 1

ε

)iterations to achieve an ε-optimal solution

for the class of L-smooth and µ-strongly convex problems.Recalling the upper bound given in Section II-C, we can seethat it matches this lower bound. Thus, Nesterov’s AGDs areoptimal and they cannot be further accelerated up to constants.When the strong-convexity is absent, [33] constructed anotherL-smooth convex function f(x) such that for any methodsatisfying (8), we have

f(xk)− f(x∗) ≥ 3L

32(k + 1)2‖x0 − x∗‖2,

‖xk − x∗‖2 ≥ 1

32‖x0 − x∗‖2.

The iteration number k for the counterexample in [33]depends on the dimension p of the problem, e.g., k shouldsatisfy k ≤ 1

2 (p− 1) for non-strongly convex problems. [66],[68] proposed a different framework to establish the samelower bounds as [33], but, the iteration number in [66], [68]is dimension-independent. When considering the compositeproblem (1), we can use the results in [67] to give the samelower bounds as [33] for first-order methods that are onlybased on the information of ∇f(x) and Proxh(z). Although[67] studied the finite-sum problem, their conclusion can beused to (1) as long as f(x) 6= 0, h(x) 6= 0, and f(x) 6= h(x).

For better comparison of different methods, we list theiteration complexities as well as the lower bounds in Table I.

III. ACCELERATED STOCHASTIC ALGORITHMS

In machine learning, people often encounter big data withextremely large n in problem (1). Computing the full gradient off(x) in GD and AGD might be expensive. Stochastic gradientalgorithms might be the most common way to cope with bigdata. They sample, in each iteration, one or several gradientsfrom individual functions as an estimator of the full gradientof f . For example, consider the standard proximal StochasticGradient Descent (SGD), which uses one stochastic gradientat each iteration and proceeds as follows

xk+1 = Proxηkh(xk − ηk∇fik(xk)

),

where ηk denotes the step-size and ik is an index randomlysampled from {1, . . . , n} at iteration k. SGD often suffersfrom slow convergence. For example, when the objective isL-smooth and µ-strongly convex, SGD only obtains a sublinearrate [69] of

E[F (xk)]− F (x∗) ≤ O (1/k) .

In contrast, GD has the linear convergence. In the followingsections, we introduce several techniques to accelerate SGD.Especially, we discuss how Nesterov’s acceleration works instochastic optimization with finite n in problem (1), which isoften called the finite-sum problem.

A. Variance Reduction and Its Acceleration

The main challenge for SGD is the noise of the randomly-drawn gradients. The variance of the noisy gradient will nevergo to zero even if xk → x∗. As a result, one has to gradually cutdown the step-size in SGD to guarantee convergence, whichbrings down the convergence. A technique called VarianceReduction (VR) [70] was designed to reduce the negativeeffect of noise. For finite-sum objective functions, the VRtechnique reduces the variance to zero through the updates. Thefirst VR method might be Stochastic Average Gradient (SAG)[71], which uses the sum of the latest individual gradientsas an estimator of the descent direction. It requires O(np)memory storage and uses a biased gradient estimator. StochasticVariance Reduced Gradient (SVRG) [70] reduces the memorycost to O(p) and uses an unbiased gradient estimator. Later,SAGA [72] improves SAG by using an unbiased update viathe technique of SVRG. Other VR methods can be found in[73]–[78].

We take SVRG [79] as an example, which is relatively simpleand easy to implement. SVRG maintains a snapshot vector xs

after every m SGD iterations and keeps the gradient of theaverages gs = ∇f(xs). Then, it uses ∇f(xk) = ∇fik(xk)−∇fik(xs)+gs as the descent direction at every SGD iterations,and the expectation Eik [∇f(xk)] = ∇f(xk). Moreover, thevariance of the estimated gradient ∇f(xs,k) now can be upperbounded by the distance from the snapshot vector to the latestvariable, i.e., E[‖∇f(xk)−∇f(xk)‖2] ≤ L‖xk− xs‖2, whichis a crucial property of SVRG to guarantee the reduction ofvariance. Algorithm 1 gives the details of SVRG.

Page 5: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 5

Algorithm 1 SVRGInput x0, m = O(L/µ), and η = O(1/L).for s = 0, 1, 2, · · · do

x0 = xs,gs = ∇f(xs),for k = 0, · · · ,m do

Randomly sample ik from {1, . . . , n},∇f(xk) = ∇fik(xk)−∇fik(xs) + gs,xk+1 = Proxηh

(xk − η∇f(xk)

),

end forxs+1 = 1

m

∑mk=1 x

k,end for

For µ-strongly convex problem (1) with L-smooth fi(x),SVRG needs O

(Lµ log 1

ε

)inner iterations to reach an ε-optimal

solution in expectation. Each inner iteration needs to evaluatetwo stochastic gradients while each outer iteration needsadditional n individual gradient evaluations to compute gs.Thus, the gradient complexity of SVRG is O

((n+ L

µ

)log 1

ε

).

Recall that GD has the gradient complexity of O(nLµ log 1

ε

),

since it needs n individual gradient evaluations at each iteration.Thus, SVRG is superior to GD when L/µ > 1.

With the VR technique in hand, one can fuse it with Nes-terov’s acceleration technique to further accelerate stochasticalgorithms, e.g., [10]–[13], [80], [81]. We take Katyusha [10]as an example. Katyusha builds upon the combination of (6a)-(6c) and SVRG. Different from (6a) and (6c), Katyusha furtherintroduces a “negative momentum” with additional τ ′xs in(10a) and (10d), which prevents the extrapolation term frombeing far from the snapshot vector. Algorithm 2 gives thedetails of Katyusha.

Algorithm 2 Katyusha

Input x0 = z0 = x0, m = n, τ = min{√

nµ3L ,

12}, τ

′ = 12 ,

η = O( 1L ), and τ ′′ = µ

3τL + 1.for s = 0, 1, 2, · · · do

gs = ∇f(xs)for k = 0, · · · ,m do

yk = τzk + τ ′xs + (1− τ − τ ′)xk, (10a)Randomly sample ik from {1, . . . , n},∇f(yk) = ∇fik(yk)−∇fik(xs) + gs, (10b)

zk+1 = Proxηh/τ(zk − η/τ∇f(yk)

), (10c)

xk+1 = τzk+1 + τ ′xs + (1− τ − τ ′)xk, (10d)end forxs+1 =

(∑m−1k=0 (τ ′′)k

)−1∑m−1k=0 (τ ′′)kxk,

z0 = xm, x0 = xm,end for

For problems with smooth and convex fi(x) and µ-strongly convex h(x), the gradient complexity of Katyusha isO((n+

√nLµ

)log 1

ε

). When n ≤ O(L/µ), Katyusha further

accelerates SVRG. Comparing SVRG with Katyusha, we can

see that the only difference is that Katyusha uses the mechanismof AGD in (6a)-(6c). Thus, Nesterov’s acceleration techniquealso takes effect in finite-sum stochastic optimization.

We now describe several extensions of Katyusha with someadvanced topics.

1) Loopless Katyusha: Both SVRG and Katyusha havedouble loops, which make them a little complex to analyzeand implement. To remedy the double loops, [12] proposed aloopless SVRG and Katyusha for the simplified problem

minx∈Rp

f(x)def=

1

n

n∑i=1

fi(x) (11)

with smooth and convex fi(x), and strongly convex f(x).Specifically, at each iteration, with a small probability 1/n, themethods update the snapshot vector and perform a full pass overdata to compute the average gradient. With probability 1−1/n,the methods use the previous snapshot vector. The looplessSVRG and Katyusha enjoy the same gradient complexitiesas the original methods. We take the loopless SVRG as anexample, which consists of the following steps at each iteration,

∇f(xk) = ∇fik(xk)−∇fik(xk) +∇f(xk), (12a)

xk+1 = xk − η∇f(xk), (12b)

xk+1 =

{xk with probability 1/n,xk with probability 1− 1/n.

(12c)

When replacing (12a) and (12b) by the following steps, weget the loopless Katyusha,

yk = τzk + τ ′xk + (1− τ − τ ′)xk,∇f(yk) = ∇fik(yk)−∇fik(xk) +∇f(xk),

zk+1 =1

αµ/L+ 1

(αµL

yk + zk − α

L∇f(yk)

),

xk+1 = τzk+1 + τ ′xs + (1− τ − τ ′)xk,

where we set τ = min

{√2nµ3L , 1/2

}, τ ′ = 1/2, and α = 2

3τ .

2) Non-Strongly Convex Problems: When the strong con-vexity assumption is absent, the gradient complexities ofSGD and SVRG are O

(1ε2

)[82] and O

(n log 1

ε + Lε

)[83],

respectively. Katyusha improves the complexity of SVRG

to O

(n√

F (x0)−F (x∗)ε +

√nL‖x0−x∗‖2

ε

)[10], [11]. This

gradient complexity is not more advantageous over Nesterov’sfull batch AGD since they all need O

(n√ε

)individual gradient

evalutions. When applying reductions to extend the algorithmsdesigned for smooth and strongly convex problems to non-strong convex ones, e.g., the HOOD framework [84], thegradient complexity of Katyusha can be further improved toO(n log 1

ε +√

nLε

), which is

√n times faster than the full

batch AGD when high precision is required. On the other hand,[13] proposed a unified VR accelerated gradient method, whichemploys a direct acceleration scheme instead of employingany reduction to obtain the desired gradient complexity ofO(n log n+

√nLε

).

Page 6: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 6

Method Smooth and Strongly Convex Smooth and Non-strongly Convex

SGD O(1ε

)[69] O

(1ε2

)[82]

SVRG O((n+ L

µ

)log 1

ε

)O(n log 1

ε + Lε

)[12], [70]–[72] [83]

AccVR O((n+

√nLµ

)log 1

ε

)O

(n logn+

√nLε

)[10]–[13] [13]

Lower bounds O((n+

√nLµ

)log 1

ε

)[67] O

(n+

√nLε

)[67]

TABLE IIGRADIENT COMPLEXITY COMPARISONS BETWEEN SGD, SVRG AND

ACCELERATED VR METHODS (ACCVR), AS WELL AS THE LOWER BOUNDS.

3) Universal Catalyst Acceleration for First-Order ConvexOptimization: Another way to accelerate SVRG is to usethe universal Catalyst [85], which is a unified framework toaccelerate first-order methods. It builds upon the acceleratedproximal point method with inexactly computed proximalmapping. Analogous to (5a)-(5b), Catalyst takes the followingouter iterations

yk = xk + βk(xk − xk−1), (14a)

xk ≈ argminx∈Rp

{Gk(x)

def= F (x) +

γ

2‖x− yk‖2

}, (14b)

where βk =√γ+µ−√µ√γ+µ+

õ

for strongly convex problems, andit updates in the same way as that in (5a)-(5b) for non-strongly convex ones. We can use any linearly-convergentmethod that is only based on the information of ∇fi(x) andProxh(z) to approximately solve the subproblem in (14b).The subproblem often has a good condition number and socan be solved efficiently to a high precision. Take SVRG asan example. When we use SVRG to solve the subproblem,Catalyst accelerates SVRG to the gradient complexity ofO((n+

√nLµ

)log L

µ log 1ε

)for strongly convex problems

and O(√

nLε log 1

ε

)for non-strongly convex ones, by setting

γ = L−µn+1 − µ and the inner iteration number in step (14b)

as O((n+ L+γ

µ+γ

)log 1

ε

)with µ ≥ 0 and L ≥ (n + 2)µ.

Besides SVRG, Catalyst can also accelerate other methods, e.g.,SAG and SAGA. The price for generality is that the gradientcomplexities of Catalyst have an additional poly-logarithmicfactor compared with those of Katyusha.

4) Individually Nonconvex: Some problems in machinelearning can be written as minimizing strongly convex functionsthat are finite average of nonconvex ones [75], [77]. That is,each fi(x) in problem (11) is L-smooth and may be nonconvex,but their average f(x) is µ-strongly convex. Examples includethe core machinery for PCA and SVD. SVRG can also beused to solve this problem with the gradient complexity ofO((n+

√nLµ

)log 1

ε

)[80]. [80] further proposed a method

named KatyushaX to improve the gradient complexity toO((n+ n3/4

√Lµ

)log 1

ε

).

5) Lower Complexity Bounds: Similar to the lower boundsfor the class of deterministic first-order algorithms, there arealso lower bounds for the randomized first-order methods forfinite-sum problems. Considering problem (11), [67] proved

that for any first-order algorithm that is only based on the infor-mation of∇fi(x) and Proxfi(z), the lower bound for L-smoothand µ-strongly convex problems is O

((n+

√nLµ

)log 1

ε

).

When the strong convexity is absent, the lower bound becomesO(n+

√nLε

). For better comparison, we list the upper and

lower bounds in Table II.6) Application to Distributed Optimization: Variance reduc-

tion has also been applied to distributed optimization. Classicaldistributed algorithms include the distributed gradient descent(DGD) [86], EXTRA [87], the gradient-tracking-based methods[88]–[91], and distributed stochastic gradient descent (DSGD)[92], [93]. To further improve the convergence of stochasticdistributed algorithms, [94] combined EXTRA with SAGA,[95] combined gradient tracking with SAGA, and [95], [96]implemented gradient tracking in SVRG. See [97] for a detailedreview. It is an interesting work to implement accelerated VRin distributed optimization in the future.

B. Stochastic Coordinate Descent and Its Acceleration

In problem (1), we often assume that fi(x) is smooth andallow h(x) to be nondifferentiable. However, not all machinelearning problems satisfy this assumption. The typical exampleis SVM, which can be formulated as

minx∈Rp

F (x)def=

1

n

n∑i=1

max{

0, 1− yiATi x}︸ ︷︷ ︸

fi(x)

2‖x‖2︸ ︷︷ ︸h(x)

. (15)

We can see that each fi(x) is convex but nondifferentiable,and h(x) is smooth and strongly convex. In practice, we oftenminimize the negative of the dual of (15), written as

minu∈Rn

D(u)def=

1

∥∥∥∥∥Au

n

∥∥∥∥∥2

− 1

n

n∑i=1

ui+1

n

n∑i=1

I[0,1](ui), (16)

where Ai = yiAi, and I[0,1](u) = 0 if 0 ≤ u ≤ 1, and∞, otherwise. Motivated by (16), we consider the followingproblem in this section

minu∈Rn

D(u)def= Φ(u) +

n∑i=1

Ψi(ui). (17)

We assume that Φ(u) satisfies the coordinate-wise smoothcondition ‖∇iΦ(u) −∇iΦ(v)‖ ≤ Li‖u − v‖ for any u andv satisfying uj = vj ,∀j 6= i. We also assume that Φ(u) isµ-strongly convex with respect to norm ‖ · ‖L, i.e., replacing‖y−x‖2 in (2) by ‖y−x‖2L =

∑ni=1 Li(yi−xi)2. We require

Ψi(u) to be convex but can be nondifferentiable. Take problem(16) as an example, Li = ‖Ai‖2

n2µ , but the first term in (16) isnot strongly convex when n > p.

Stochastic Coordinate Descent (SCD) is a popular method tosolve problem (17). It first computes the partial derivative withrespect to one randomly chosen variable, and then updates thisvariable by a coordinate-wise gradient descent while keepingthe other variables unchanged. SCD is sketched as follows:

uk+1ik

= argminu

(Ψik(u) +

⟨∇ikΦ(uk), u

⟩+Lik2|u−ukik |

2

),

Page 7: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7

where ik is randomly sampled form {1, . . . , n}. In SCD, weoften assume that the proximal mapping of Ψi(u) can beefficiently computed with a closed solution. Also, we need tocompute ∇ikΦ(uk) efficiently. Take problem (16) for example,by keeping track of sk = Auk and updating sk+1 by sk −Aiku

kik

+ Aikuk+1ik

, SCD only uses one column of A periteration, i.e., one sample, to compute ∇ikΦ(uk) = 1

n2µATiksk.

Now, we come to the convergence rate of SCD [98], whichis described as

E[D(uk)]−D(u∗) ≤ min

{(1− µ

n

)k,

n

n+ k

}C (18)

in a unified style for strongly convex and non-strongly convexproblems, where C = D(u0)−D(u∗) + ‖u0 − u∗‖2L.

We can also perform Nesterov’s acceleration technique toaccelerate SCD by combing it with (6a)-(6c). When Φ(u) in(17) is strongly convex, the resultant method is called Acceler-ated randomized Proximal Coordinate Gradient (APCG) [16],and it is described in Algorithm 3. When the strong convexityassumption is absent, the method is called Accelerated ParallelPROXimal coordinate descent (APPROX) [15], and it is writtenin Algorithm 4, where θk > 0 is obtained by solving equationθ2k = (1−θk)θ2k−1, which is initialized as θ0 = 1/n. Specially,APPROX reduces to (6a)-(6c) when n = 1. Both APCG andAPPROX have a faster convergence rate than SCD, which isgiven as follows in a unified style

E[D(uk)]−D(u∗) ≤ min

{(1−√µ

n

)k,

(2n

2n+ k

)2}C,

where C is given in (18).

Algorithm 3 APCGInput u0 = z0.for k = 0, 1, · · · doyk =

1

1 +õ/n

(uk +

õ

nzk), (19a)

Randomly sample ik from {1, . . . , n}zk+1ik

= argminz

(Ψik(z) +

⟨∇ikΦ(yk), z

⟩(19b)

+Likõ

2

∥∥∥∥z − (1−√µ

n

)zkik −

õ

nykik

∥∥∥∥2),

zk+1j = zkj ,∀j 6= ik, (19c)

uk+1 = yk +√µ(zk+1 − zk

)+µ

n

(zk − yk

), (19d)

end for

1) Efficient Implementation: Both APCG and APPROX needto perform full-dimensional vector operations in steps (19a),(19d), (20a), and (20d), which make the per-iteration costhigher than that of SCD, where the latter one only needsto consider one dimension per iteration. This may cause theoverall computational cost of APCG and APPROX higher thanthat of the full AGD. To avoid such an situation, we can use achange of variables scheme, which is firstly proposed in [99]and then adopted by [15], [16]. Take APPROX as an example.We only need to introduce an auxiliary variable uk initialized

Algorithm 4 APPROXInput u0 = z0.for k = 0, 1, · · · do

yk = (1− θk)uk + θkzk, (20a)

Randomly sample ik from {1, . . . , n}zk+1ik

= argminz

(Ψik(z) +

⟨∇ikΦ(yk), z

⟩(20b)

+nθkLik

2‖z − zkik‖

2

),

zk+1j = zkj ,∀j 6= ik, (20c)

uk+1 = yk + nθk(zk+1 − zk

), (20d)

end for

at 0 and change the updates by the following ones at eachiteration

zk+1ik

= argminz

(Ψik(z) +

⟨∇ikΦ(zk + θ2ku

k), z⟩

+nθkLik

2‖z − zkik‖

2

),

uk+1ik

= ukik −1− nθkθ2k

(zk+1ik− zkik),

uk+1j = ukj , zk+1

j = zkj , ∀j 6= ik.

The partial gradient ∇ikΦ(zk + θ2kuk) can be efficiently

computed in a similar way to that of SCD discussed abovewith almost no more burden.

2) Applications to the Regularized Empirical Risk Mini-mization: Now, we consider the regularized empirical riskminimization problem, which is a special case of problem (1)and is described as

minx∈Rp

F (x)def=

1

n

n∑i=1

gi(ATi x) + h(x). (22)

We often normalize the columns of A to have unit norm.Motivated by (16), we minimize the negative of the dual of(22) as

minu∈Rn

D(u)def= h∗

(Au

n

)+

1

n

n∑i=1

g∗i (−ui), (23)

where h∗(u) = maxv{〈u,v〉 − h(v)} is the convex conjugateof h. For some applications in machine learning, e.g., SVM,∇ih∗(Au/n) and Proxg∗i (u) can be efficiently computed [100]and we can use APCG and APPROX to solve (23). Whengi is L-smooth and h is µ-strongly convex, APCG needsO((n+

√nLµ

)log 1

ε

)iterations to obtain an ε-approximate

expected dual gap E[F (xk)] + E[D(uk)] ≤ ε [16]. When thesmoothness assumption on gi is absent, the required iterationnumber of APPROX is O

(n log n+

√nε

)for the expected ε

dual gap [18].One limitation of the SCD-based methods is that they require

computing ∇ih∗(Au/n) and Proxg∗i (u), rather than Proxh(x)and ∇gi(AT

i x). In some applications, e.g., regularized logisticregression, the SCD-based methods need inner loops and theyare less efficient than the VR-based methods. However, for

Page 8: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8

other applications where the VR-based methods cannot be used,e.g., gi is nonsmooth, the SCD-based method may be a betterchoice, especially when h is chosen as the `2 regularizer andthe proximal mapping of gi is simple.

3) Restart for SVM under the Quadratic Growth Condition:In machine learning, some problems may satisfy a conditionthat is weaker than strong convexity and stronger than convexity,namely the quadratic growth condition [101]. For example,the dual problem of SVM [102]. Can we expect a fasterconvergence than the sublinear rate of O(1/k) or O(1/k2)?The answer is yes. Some studies have shown that the acceleratedmethods with restart [103]–[105] enjoy a linear convergenceunder the quadratic growth condition. Generally speaking, ifwe have an accelerated method with an O

(1k2

)rate at hand,

e.g., APPROX, we can run the method without any changeand restart it after several iterations with warm-starts. If we setthe restart period according to the quadratic growth conditionconstant, a similar constant to the condition number in thestrong convexity assumption, the resultant method convergeswith a linear rate, which is faster than the non-acceleratedcounterparts. Moreover, when the quadratic growth conditionconstant is unknown (it is often the case in practice), [18],[104], [105] showed that the method also converges linearly,but the rate may not be optimal.

4) Non-uniform Sampling: In problem (16), we have Li =‖Ai‖2n2µ . For the analysis in Section III-B2, we normalize the

columns of A to have unit norm and Li have the same valuesfor all i. When they are not normalized, a variety of works havefocused on the non-uniform sampling of the training sampleik [99], [106]–[108]. For example, [107] selected the i-thsample with probability proportional to

√Li and obtained better

performance than the uniform sampling scheme. Intuitivelyspeaking, when Li is large, the function is less smooth alongthe i-th coordinate, so we should sample it more often tobalance the overall convergence speed.

C. The Primal-Dual Method and Its Accelerated StochasticVariants

The VR-based methods and SCD-based methods performin the primal space and the dual space, respectively. In thissection, we introduce another common scheme, namely, theprimal-dual-based methods [46], [47], which perform both inthe primal space and the dual space. Consider problem (22).It can be written in the min-max form:

minx∈Rp

maxu∈Rn

1

n

n∑i=1

(⟨ATi x,ui

⟩− g∗i (ui)

)+ h(x). (24)

We first introduce the general primal-dual method with Breg-man distance to solve problem (24) [20], which consists of thefollowing steps at each iteration

xk=α(xk − xk−1) + xk, (25a)

uk+1=argmaxu

(1

n

⟨AT xk,u

⟩− 1

n

n∑i=1

g∗i (ui)−τD(u,uk)

),

(25b)

xk+1=argminx

(h(x)+

⟨x,

1

nAuk+1

⟩+η

2‖x−xk‖2

), (25c)

for constants α, τ , and η to be specified later and x−1 = x0.The primal-dual method alternately maximizes u in the dualspace and minimizes x in the primal space.

As explained in Section III, dealing with all the samplesat each iteration is time-consuming when n is large, so wewant to handle only one sample. Accordingly, we can sampleonly one ik randomly in (25b) at each iteration. The resultantmethod is described in Algorithm 5, and it reduces to theStochastic Primal Dual Coordinate (SPDC) method proposedin [19] when we take D(u, v) = 1

2 (u− v)2 as a special case.Combining the initialization s0 = 1

nAu0 and the update rules(26c) and (26e), we know sk = 1

nAuk.

Algorithm 5 SPDCInput x0 = x−1, τ = 2√

nµL, η = 2

√nµL, and α = 1 −

1

n+2√nL/µ

for k = 0, 1, · · · doxk = α(xk − xk−1) + xk, (26a)

uk+1ik

= argmaxu

(⟨ATikxk, u

⟩− g∗ik(u)− τD(u− ukik)

),

(26b)

uk+1j = ukj ,∀j 6= ik, (26c)

xk+1 = argminx

(h(x)+

⟨x, sk+(uk+1

ik−ukik)Aik

⟩+η

2‖x− xk‖2

), (26d)

sk+1 = sk +1

n(uk+1ik− ukik)Aik , (26e)

end for

Similar to APCG, when each gi is L-smooth, h is µ-stronglyconvex, and the columns of A are normalized to have unitnorm, SPDC needs O

((n+

√nLµ

)log 1

ε

)iterations to find

a solution such that E[‖xk − x∗‖2] ≤ ε.One limitation of SPDC is that it only applies to problems

when the proximal mappings of g∗i and h can be efficientlycomputed. In some applications, we want to use ∇gi, ratherthan Proxg∗i . To remedy this problem, [20] creatively usedthe Bregman distance induced by g∗i in (26b). Specifically,taking ϕ in (3) as g∗ik , letting zk−1ik

= ∇g∗ik(ukik) and defining

z =ATik

xk+τzk−1ik

1+τ , step (26b) reduces to

uk+1ik

= argmaxu

(⟨ATikxk + τ∇g∗ik(ukik), u

⟩− (1 + τ)g∗ik(u)

)= argmax

u

(〈z, u〉 − g∗ik(u)

)= ∇gik(z).

Then, we have z ∈ ∂g∗ik(uk+1ik

) and denote it as zkik . Thus, wecan replace steps (26b) and (26c) by the following two steps

zkj =

{ATj x

k+τzk−1j

1+τ , j = ik,

zk−1j , j 6= ik,

uk+1j =

{∇gj(zkj ), j = ik,ukj , j 6= ik.

Accordingly, the resultant method, named the RandomizedPrimal-Dual Gradient (RPDG) method [20], is only basedon ∇gj(z) and the proximal mapping of h(x). To find an

Page 9: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9

ε-optimal solution, it needs the same number of iterations asSPDC but each iteration has the same computational cost asthe VR-based methods, e.g., Katyusha.

1) Relation to Nesterov’s AGD: It is interesting to studythe relation between the primal-dual method and Nesterov’saccelerated gradient method. [20] proved that (25a)-(25c) withτk = (1− θk)/θk, ηk = Lθk, αk = θk/θk−1, and appropriateBregman distance reduces to (6a)-(6c) when solving (22), wherewe use adaptive parameters in (25a)-(25c). Thus, RPDG canalso be seen as an extension of Nsterov’s AGD to finite-sumstochastic optimization problems.

2) Non-strongly Convex Problems: When the strong con-vexity assumption on h(x) is absent, [76] studied the O

(1√ε

)iteration upper bound for the stochastic primal-dual hybridgradient algorithm, which is a variant of SPDC. However, noexplicit dependence on n was given in [76]. On the otherhand, the perturbation approach is a popular way to obtainsharp convergence results for non-strongly convex problems.Specifically, define a perturbation problem by adding a smallperturbation term ε‖x0 − x‖2 to problem (22), and solve itby RPDG, which is developed for strongly convex problems.However, the resultant gradient complexity has an additionalterm log 1

ε as compared with the lower bound in [67] andthe upper bound in [13]. Since the conditions in the HOODframework [84] may not be satisfied for RPDG due to the dualterm, currently the reduction approach introduced in SectionIII-A2 has not been applied to the primal-dual-based methodsto remove the additional poly-logarithmic factor.

IV. ACCELERATED NONCONVEX ALGORITHMS

In this section, we introduce the generalization of accel-eration to nonconvex problems. Specifically, Section IV-Aintroduces the deterministic algorithms and Section IV-B forthe stochastic ones.

A. Deterministic Algorithms

In the following two sections, we describe the algorithms tofind first-order and second-order stationary points, respectively.

1) Achieving First-Order Stationary Point: Gradient descentand its proximal variant are widely used in machine learning,both for convex and noncovnex applications. For nonconvexproblems, GD finds an ε-approximate first-order stationarypoint within O

(1ε2

)iterations [33].

Motivated by the success of heavy-ball method, [109] studiedits nonconvex extension with the name of iPiano. Specifically,consider problem (1) with smooth (possibly nonconvex) f(x)and convex h(x) (possibly nonsmooth), and the heavy-ballmethod (4) with β ∈ [0, 1) and η < 2(1−β)

L . [109] provedthat any limit point x∗ of xk is a critical point of (1), i.e.,0 ∈ ∇f(x∗) + ∂h(x∗). Moreover, the number of iterations tofind an ε-approximate first-order stationary point is O

(1ε2

).

Besides the heavy-ball method, some researchers studiedthe nonconvex accelerated gradient method extended fromNesterov’s AGD. For example, [110] studied the following

method for problem (1) with convex h(x):

yk = (1− θk)xk + θkzk, (27a)

zk+1 = Proxδkh(zk − δk∇f(yk)

), (27b)

xk+1 = Proxσkh(yk − σk∇f(yk)

), (27c)

which is motivated by (6a)-(6c). In fact, when h(x) = 0, δk =1Lθk

, and σk = 1L , (6a)-(6c) and (27a)-(27c) are equivalent.

[110] proved that (27a)-(27c) needs O(

1ε2

)iterations to find an

ε-approximate first-order stationary point by setting θk = 2k+1 ,

σk = 12L , and σk ≤ δk ≤ (1 + θk/4)σk. On the other hand,

when f(x) is also convex, (27a)-(27c) has the optimal O(

1√ε

)iteration complexity to find an ε-optimal solution by a differentsetting of δk = kσk

2 .Although (27a)-(27c) guarantees the convergence for non-

convex programming while maintaining the acceleration forconvex programming, one disadvantage is that the parametersettings for convex and noncovnex problems are different. Toaddress this issue, [111] proposed the following method:

yk = xk +θkθk−1

(zk − xk) +θk(1− θk−1)

θk−1(xk − xk−1),

zk+1 = Proxηh(yk − η∇f(yk)),

vk+1 = Proxηh(xk − η∇f(xk)),

xk+1 =

{zk+1, if F (zk+1) ≤ F (vk+1),vk+1, otherwise,

which is motivated by the monotone AGD proposed in [112].Intuitively, the first two steps perform a proximal AGD updatewith the same update rule of θk as that in (5a)-(5b), thethird step performs a proximal GD update, and the last stepchooses the one with the smaller objective. Similar to theheavy ball method, [112] proved that any limit point ofxk a critical point, and the method needs O

(1ε2

)iterations

to find an ε-approximate first-order stationary point. Onthe other hand, when both f(x) and h(x) are convex, thesame O

(1√ε

)iteration complexity as Nesterov’s AGD is

maintained. Moreover, the algorithm for convex programmingand nonconvex programming keeps the same parameters. Theprice paid is that the computational cost per iteration of theabove method is higher than that of (27a)-(27c).

Besides the above algorithms, Nesterov has also extended hisAGD to nonconvex programming [42]. Similar to the geometricdescent [48] discussed in Section II-D, Nesterov’s method alsoneeds a line search and thus it is not a rigorously “first-order”method.

We can see that none of the above algorithms have provableimprovement after adopting the technique of heavy-ball methodor Nesterov’s AGD. One may ask: can we find a provable fasteraccelerated gradient method for nonconvex programming?The answer is yes. [22] proposed a method which achievesan ε-approximate first-order stationary point within O

(1

ε7/4

)gradient and function evaluations. The algorithm in [22] iscomplex to implement, so we omit the details.

2) Achieving Second-Order Stationary Point: We first dis-cuss whether gradient descent can find the approximate second-order stationary point. To answer this question, [113] studieda simple variant of GD with appropriate perturbations and

Page 10: JOURNAL OF LA Accelerated First-Order Optimization ... · Accelerated First-Order Optimization Algorithms for Machine Learning Huan Li, Member, IEEE, Cong Fang, and Zhouchen Lin,

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10

showed that the method achieves an O(ε,O(√ε))-approximate

second-order stationary point within O(1/ε2) iterations, whereO hides the poly-logarithmic factors. We can see that this rateis exactly the rate of GD to first-order stationary point, withonly the additional log factor. The method proposed in [113]is given in Algorithm 6, where Uniform(B0(r)) means theperturbation uniformly sampled from a ball with radius r.

Algorithm 6 Perturbed GD

Input x0 = z, p= 0, T = O( 1√ε), r= O(ε), η=O( 1

L ), andε′=O(ε1.5).for k = 0, 1, · · · do

if ‖∇f(xk)‖ ≤ ε and k > p+ T (i.e., no perturbation inlast T steps) thenz = xk, p = kxk = xk + ξk, ξk ∼ Uniform(B0(r)),

end ifxk+1 = xk − η∇f(xk),if k = p+ T and f(z)− f(xk) ≤ ε′ then

break,end if

end for

Intuitively speaking, when the norm of the current gradientis small, it indicates that the current iterate is potentially neara saddle point or a local minimum. If it is near a saddle point,the uniformly distributed perturbation helps to escape it, whichis added at most once in every T iterations. On the other hand,when the objective almost does not decrease after T iterationsfrom last perturbation, it achieves the local minimum with highprobability and we can stop the algorithm.

Besides [113], [114] showed that the plain GD without per-turbations almost always escapes saddle points asymptotically.However, it may take exponential time [115].

Now, we come to the accelerated gradient method. Built uponAlgorithm 6 and (5a)-(5b), [24] proposed a variant of AGDwith perturbations and showed that the method needs O

(1

ε7/4

)iterations to achieve an (ε,O(

√ε))-approximate second-order

stationary point, which is faster than the perturbed GD. Wedescribe the method in Algorithm 7, where the NCE (NegativeCurvature Exploitation) step chooses xk+1 to be xk + δ orxk − δ whichever having a smaller objective f , where δ =svk/‖vk‖ for some constant s.

In the above scheme, the first “if” step is similar to theperturbation step in the perturbed GD. The following threesteps are similar to the AGD steps in (5a)-(5b), where vk is themomentum term in (5a). When the function has large negativecurvature between xk and yk, i.e., the second “if” conditionholds, NCE simply moves along the direction based on themomentum.

Besides [24], [21] and [23] also established the O(

1ε7/4

)gradient complexity to achieve an ε-approximate second-orderstationary point. [21] employed a combination of (regularized)AGD and the Lanczos method, and [23] proposed a careful im-plementation of the Nesterov-Polyak method, using acceleratedmethods for fast approximate matrix inversion.

At last, we compare the iteration complexity of the ac-

Algorithm 7 Perturbed AGD
Input x0, v0, T = O(1/ε^{1/4}), r = O(ε), β = O(1 − ε^{1/4}), η = O(1/L), γ = O(√ε) and s = O(√ε).
for k = 0, 1, · · · do
    if ‖∇f(xk)‖ ≤ ε and no perturbation in the last T steps then
        xk = xk + ξk, ξk ∼ Uniform(B0(r)),
    end if
    yk = xk + βvk,
    xk+1 = yk − η∇f(yk),
    vk+1 = xk+1 − xk,
    if f(xk) ≤ f(yk) + ⟨∇f(yk), xk − yk⟩ − (γ/2)‖xk − yk‖^2 then
        (xk+1, vk+1) = NCE(xk, vk),
    end if
end for
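A minimal Python sketch of Algorithm 7 follows, again under our own assumptions: f and grad_f are user-supplied oracles, the constants mirror the Input line, and the reset of the momentum to zero inside NCE as well as the guard on ‖vk‖ are simplifications rather than details taken from [24].

import numpy as np

def nce(f, x, v, s):
    # Negative Curvature Exploitation: move a distance s along +/- v/||v||,
    # whichever direction yields the smaller objective value.
    delta = s * v / np.linalg.norm(v)
    x_new = x + delta if f(x + delta) <= f(x - delta) else x - delta
    return x_new, np.zeros_like(v)                  # momentum reset (a simplification)

def perturbed_agd(f, grad_f, x0, eps, L, T, r, beta, gamma, s, max_iter=100000):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    eta = 1.0 / L
    last_perturb = -(T + 1)
    for k in range(max_iter):
        if np.linalg.norm(grad_f(x)) <= eps and k - last_perturb > T:
            xi = np.random.randn(*x.shape)
            xi *= r * np.random.rand() ** (1.0 / x.size) / np.linalg.norm(xi)
            x, last_perturb = x + xi, k             # xi ~ Uniform(B0(r))
        y = x + beta * v                            # momentum step, cf. (5a)
        g_y = grad_f(y)
        x_next = y - eta * g_y                      # gradient step, cf. (5b)
        v_next = x_next - x
        # large negative curvature between xk and yk: switch to NCE
        # (the check on ||v|| avoids the degenerate case v = 0)
        if np.linalg.norm(v) > 0 and f(x) <= f(y) + g_y.dot(x - y) - 0.5 * gamma * np.sum((x - y) ** 2):
            x_next, v_next = nce(f, x, v, s)
        x, v = x_next, v_next
    return x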

TABLE III
ITERATION COMPLEXITY COMPARISONS BETWEEN GRADIENT DESCENT AND ACCELERATED GRADIENT DESCENT FOR NONCONVEX PROBLEMS. WE HIDE THE POLY-LOGARITHMIC FACTORS IN O. WE ALSO HIDE n SINCE WE ONLY CONSIDER DETERMINISTIC OPTIMIZATION.

First-order Stationary Point                Second-order Stationary Point
Methods      Iteration Complexity           Methods                    Iteration Complexity
GD [33]      O(1/ε^2)                       Perturbed GD [113]         O(1/ε^2)
AGD [22]     O(1/ε^{7/4})                   AGD [21], [23], [24]       O(1/ε^{7/4})


B. Stochastic Algorithms

Due to the success of deep neural networks, in recent years people have become interested in stochastic algorithms for the nonconvex problem (1) or (11) with huge n, especially the accelerated variants. [116] empirically observed that the following plain stochastic AGD performs well when training deep neural networks:

yk = xk + βk(xk − xk−1),

xk+1 = yk − η∇fik(yk),

where βk is empirically set as

βk = min{1 − 2^{−1−log2(⌊k/250⌋+1)}, βmax},

and βmax is often chosen as 0.999 or 0.995.
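As a quick illustration, the schedule and the update above can be sketched in a few lines of Python; grad_fi below is an assumed stochastic gradient oracle for ∇fik(·), and the step size eta is left to the user.

import math
import numpy as np

def beta_schedule(k, beta_max=0.999):
    # beta_k = min{1 - 2^(-1 - log2(floor(k/250) + 1)), beta_max}
    return min(1.0 - 2.0 ** (-1.0 - math.log2(k // 250 + 1)), beta_max)

def plain_stochastic_agd(grad_fi, x0, n, eta, num_iters, beta_max=0.999):
    # yk = xk + beta_k (xk - x_{k-1});  x_{k+1} = yk - eta * grad f_{i_k}(yk)
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for k in range(num_iters):
        y = x + beta_schedule(k, beta_max) * (x - x_prev)
        i_k = np.random.randint(n)                  # sample a component uniformly
        x_prev, x = x, y - eta * grad_fi(y, i_k)
    return x

The schedule starts at β0 = 0.5 and increases toward βmax as k grows.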

In this section, we introduce stochastic nonconvex algorithms with stronger theoretical support than the above plain stochastic AGD. For simplicity, we consider problem (11) with each fi(x) being L-smooth.

1) Achieving First-Order Stationary Point: When we assume that the variance of the gradient is finite, SGD requires a gradient complexity of O(ε^{-4}) to achieve an ε-approximate first-order stationary point [33]. Similar to stochastic convex optimization, this bound can be further improved by VR. In fact, a slight variant of the SVRG algorithm [117]–[119] and also SAGA [120] achieve a gradient complexity of O((n + n^{2/3}ε^{-2}) ∧ ε^{-10/3}), where a ∧ b = min(a, b). This result means that when n → ∞, the VR technique can still guarantee a faster convergence rate in nonconvex stochastic optimization; we refer to this case as online optimization. However, this bound is still not optimal, and it can be further reduced to O((n + n^{1/2}ε^{-2}) ∧ ε^{-3}) by performing recursive VR [26], [78], [121]–[123]. We take the Stochastic Path-Integrated Differential Estimator (SPIDER) [26] algorithm as an example. SPIDER can be used for both the finite-sum problem (11) and the online problem. For simplicity, we consider the following simplified method for the finite-sum problem, as shown in Algorithm 8.

Algorithm 8 SPIDER
for k = 0 to K do
    vk = ∇f(xk)                                if mod(k, n) = 0,
    vk = ∇fik(xk) − ∇fik(xk−1) + vk−1          otherwise,
    ηk = min( ε/(L√n‖vk‖), 1/(2L√n) ),
    xk+1 = xk − ηk·vk.
end for

SPIDER is motivated by SVRG, but uses a different VR technique. To be more intuitive, we can compare SPIDER with the loopless SVRG (12a)-(12c). SPIDER takes steps along a direction based on the past accumulated stochastic gradient information, i.e.,

vk = ∑_{t=k0+1}^{k} (∇fit(xt) − ∇fit(xt−1)) + vk0

for the latest k0 such that mod(k0, n) = 0, where vk0 = ∇f(xk0). In contrast, SVRG takes steps only based on the information of the current stochastic gradient and the snapshot vector, i.e., vk = ∇fik(xk) − ∇fik(xs) + ∇f(xs). It was shown in [26] that the variance of vk is smaller, in order, than that of the SVRG estimator when xk moves slowly, which contributes to a provably faster convergence rate.
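For concreteness, a minimal Python sketch of this simplified finite-sum SPIDER is given below; full_grad and grad_fi stand for assumed oracles returning ∇f(·) and ∇fi(·), and the small constant added to ‖vk‖ only guards against division by zero.

import numpy as np

def spider(full_grad, grad_fi, x0, n, L, eps, num_iters):
    # Simplified SPIDER (Algorithm 8): restart the estimator with a full gradient
    # every n iterations, and update it recursively with stochastic gradient
    # differences in between.
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    v = full_grad(x)
    sqrt_n = np.sqrt(n)
    for k in range(num_iters):
        if k % n == 0:
            v = full_grad(x)                                 # vk = grad f(xk)
        else:
            i_k = np.random.randint(n)
            v = grad_fi(x, i_k) - grad_fi(x_prev, i_k) + v   # recursive VR update
        # normalized, conservative step size of order O(eps)
        eta = min(eps / (L * sqrt_n * (np.linalg.norm(v) + 1e-12)), 1.0 / (2 * L * sqrt_n))
        x_prev, x = x, x - eta * v
    return x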

The recursive VR technique was first proposed in the algorithm named SARAH [78], which was designed for convex optimization and then extended to nonconvex optimization [123]. For finite-sum smooth optimization, SPIDER and SARAH have almost the same algorithmic form; they differ in the step size. SARAH uses η = O(1/(L√n)), while SPIDER uses a normalized but more conservative step size (of the order O(ε)). After [26], some improved versions, such as SpiderBoost [122], also considered allowing a larger step size.

As for the lower bounds, [26] proved that the gradient complexity of O(n + n^{1/2}ε^{-2}) matches the lower bound under certain conditions. More recently, [124] showed that O(ε^{-3}) also matches the lower bound when n → ∞.

2) Achieving Second-Order Stationary Point: When the objective function is assumed to have a Lipschitz continuous Hessian matrix, acceleration has also been applied to find an approximate second-order stationary point. For example, [23], [27] converted the cubic regularization method [131] for finding a second-order stationary point into stochastic-gradient-based and Hessian-vector-product-based methods. [129], [130] proposed a generic saddle-point-escaping method called NEON, which approximates the Hessian-vector product by stochastic gradients. For the convergence rate, to find an (ε, O(ε^{0.5}))-approximate second-order stationary point in the finite-sum case, the VR and momentum techniques [21], [23] can reduce the gradient complexity to O(nε^{-1.5} + n^{3/4}ε^{-1.75}). In the online case, [125] first proved that noisy SGD escapes from saddle points in polynomial time. Later, [126] obtained a gradient complexity of O(ε^{-10}). This bound was finally improved by [128], in which the authors proved that noisy SGD can actually find a second-order stationary point within a gradient complexity of O(ε^{-3.5}). For the variants of SGD, by fusing negative curvature search with VR, [25] obtained a lower gradient complexity of O(ε^{-3.25}) for finding an (ε, O(ε^{0.25}))-approximate second-order stationary point. When using the SPIDER [26] technique, one can obtain a complexity of O(ε^{-3}) to find an (ε, O(ε^{0.5}))-approximate second-order stationary point. Table IV summarizes the gradient complexity comparisons of the existing algorithms.
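To illustrate the building block behind such gradient-only curvature exploitation, recall that a Hessian-vector product can be approximated with two gradient evaluations via the finite difference ∇²f(x)v ≈ (∇f(x + rv) − ∇f(x))/r; the sketch below shows this identity only and is not the full NEON procedure, and grad_f is an assumed (possibly stochastic) gradient oracle.

import numpy as np

def hessian_vector_product(grad_f, x, v, r=1e-4):
    # Finite-difference approximation: H(x) v ≈ (grad f(x + r v) - grad f(x)) / r.
    return (grad_f(x + r * v) - grad_f(x)) / r

def curvature_along(grad_f, x, v, r=1e-4):
    # Rayleigh quotient v^T H(x) v / ||v||^2 along the direction v; a sufficiently
    # negative value certifies a negative-curvature (escape) direction.
    hv = hessian_vector_product(grad_f, x, v, r)
    return float(v.dot(hv) / v.dot(v))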

V. DISCUSSION AND LIMITATION

Accelerated algorithms have been widely used in machine learning due to their provably faster convergence and simplicity of implementation. In this paper, we reviewed accelerated deterministic algorithms, accelerated stochastic algorithms and accelerated nonconvex algorithms for machine learning. Due to the space limit, our review is incomplete, as we have left some interesting topics out, e.g., acceleration for distributed optimization [132]–[142]. Its additional challenge over the non-distributed algorithms introduced in this paper is that we should pay attention to the agreement among different nodes, and modifications are required to extend the classical AGD to distributed optimization.

The efficiency of accelerated algorithms has been verified in practice for convex optimization, either deterministic or stochastic. However, in practice some complex accelerated nonconvex algorithms seem less efficient. One remarkable example is that, although they are proven to converge faster to first-order or second-order stationary points than SGD, they still cannot beat SGD or the plain stochastic AGD when training deep neural networks. There is still a gap between theory and practice for accelerated nonconvex optimization.

REFERENCES

[1] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning,vol. 20, pp. 273–297, 1995.

[2] J. Berkson, “Application of the logistic function to bio-assay,” Journalof the American Statistical Association, vol. 39, no. 227, pp. 357–365,1944.

[3] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journalof the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1,pp. 267–288, 1996.

[4] S. Haykin, Neural Networks: A Comprehensive Foundation. PrenticeHall, Englewood Cliffs, NJ, 2 ed., 1999.

[5] Y. Nesterov, “A method for unconstrained convex minimization problemwith the rate of convergence O(1/k2),” Doklady AN SSSR, vol. 269,pp. 543–547, 1983.

[6] Y. Nesterov, “On an approach to the construction of optimal methods of minimization of smooth convex functions,” Ekonomika i Matematicheskie Metody, vol. 24, pp. 509–517, 1988.


TABLE IV
GRADIENT COMPLEXITY COMPARISONS BETWEEN DIFFERENT ACCELERATED STOCHASTIC ALGORITHMS AND THEIR NON-ACCELERATED COUNTERPARTS TO FIND AN ε-APPROXIMATE FIRST-ORDER STATIONARY POINT OR AN (ε, O(ε^{0.5}))-APPROXIMATE SECOND-ORDER STATIONARY POINT FOR NONCONVEX PROBLEMS.

Type                    Algorithm                                                       Online              Finite-Sum
First-order             Original: SGD / GD [33]                                         O(ε^{-4})           O(nε^{-2})
Stationary Point        Acceleration: SVRG / SCSG / SAGA [117]–[120]                    O(ε^{-10/3})        O(n + n^{2/3}ε^{-2})
                        Acceleration: SPIDER / SpiderBoost / SARAH [26], [121]–[123]    O(ε^{-3})           O(n + n^{1/2}ε^{-2})
Second-order            Original: Perturbed GD / SGD [125]                              O(poly(d)ε^{-4})    not given
Stationary Point        Original: Perturbed GD / SGD [126]                              O(ε^{-10})          not given
(Hessian-smooth         Original: Perturbed GD / SGD [113], [127]                       O(ε^{-4})           O(nε^{-2})
required)               Original: Perturbed GD / SGD [128]                              O(ε^{-3.5})         not given
                        Acceleration: NEON+GD/SGD [129], [130]                          O(ε^{-4})           O(nε^{-2})
                        Acceleration: Perturbed AGD [24]                                not given           nε^{-1.75}
                        Acceleration: NEON+VR [25], [117]–[119]                         O(ε^{-3.5})         O(nε^{-1.5} + n^{2/3}ε^{-2})
                        Acceleration: NEON+VR+Momentum [21], [23], [27]                 O(ε^{-3.5})         O(nε^{-1.5} + n^{3/4}ε^{-1.75})
                        Acceleration: NEON+SPIDER [26]                                  O(ε^{-3})           O(n + n^{1/2}ε^{-2})

[7] Y. Nesterov, “Smooth minimization of non-smooth functions,” Mathe-matical Programming, vol. 103, pp. 127–152, 2005.

[8] Y. Nesterov, “Gradient methods for minimizing composite functions,”Mathematical Programming, vol. 140, pp. 125–161, 2013.

[9] Z. Lin, H. Li, and C. Fang, Accelerated Optimization in MachineLearning: First-Order Algorithms. Springer, 2020.

[10] Z. Allen-Zhu, “Katyusha: The first direct acceleration of stochasticgradient methods,” Journal of Machine Learning Research, vol. 18,no. 221, pp. 1–51, 2018.

[11] K. Zhou, F. Shang, and J. Cheng, “A simple stochastic variance reducedalgorithm with fast convergence rates,” in International Conference onMachine Learning (ICML), pp. 5975–5984, 2019.

[12] D. Kovalev, S. Horvath, and P. Richtarik, “Don’t jump through hoopsand remove those loops: SVRG and Katyusha are better without theouter loop,” in International Conference on Algorithmic Learning Theory(ALT), pp. 451–467, 2020.

[13] G. Lan, Z. Li, and Y. Zhou, “A unified variance-reduced acceleratedgradient method for convex optimization,” in Advances in NeuralInformation Processing Systems (NeurIPS), pp. 10462–10472, 2019.

[14] Y. Nesterov, “Efficiency of coordinate descent methods on huge-scaleoptimization problems,” SIAM Journal on Optimization, vol. 22, no. 2,pp. 341–362, 2012.

[15] O. Fercoq and P. Richtarik, “Accelerated, parallel, and proximalcoordinate descent,” SIAM Journal on Optimization, vol. 25, no. 4,pp. 1997–2023, 2015.

[16] Q. Lin, Z. Lu, and L. Xiao, “An accelerated randomized proximalcoordinate gradient method and its application to regularized empiricalrisk minimization,” SIAM Journal on Optimization, vol. 25, no. 4,pp. 2244–2273, 2015.

[17] S. Shalev-Shwartz and T. Zhang, “Accelerated proximal stochastic dualcoordinate ascent for regularized loss minimization,” MathematicalProgramming, vol. 155, pp. 105–145, 2016.

[18] H. Li and Z. Lin, “On the complexity analysis of the primal solutionsfor the accelerated randomized dual coordinate ascent,” Journal ofMachine Learning Research, vol. 21, no. 33, pp. 1–45, 2020.

[19] Y. Zhang and L. Xiao, “Stochastic primal-dual coordinate method forregularized empirical risk minimization,” Journal of Machine LearningResearch, vol. 18, no. 84, pp. 1–42, 2017.

[20] G. Lan and Y. Zhou, “An optimal randomized incremental gradientmethod,” Mathematical Programming, vol. 171, pp. 167–215, 2018.

[21] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, “Acceleratedmethods for nonconvex optimization,” SIAM Journal on Optimization,vol. 28, no. 2, pp. 1751–1772, 2018.

[22] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford, “Convex untilproven guilty: Dimension-free acceleration of gradient descent on non-convex functions,” in International Conference on Machine Learning(ICML), pp. 654–663, 2017.

[23] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma, “Findingapproximate local minima faster than gradient descent,” in ACMSymposium on Theory of Computing (STOC), pp. 1195–1199, 2017.

[24] C. Jin, P. Netrapalli, and M. I. Jordan, “Accelerated gradient descentescapes saddle points faster than gradient descent,” in Conference OnLearning Theory (COLT), pp. 1042–1085, 2018.

[25] Z. Allen-Zhu, “Natasha2: Faster non-convex optimization than SGD,”in Advances in Neural Information Processing Systems (NeurIPS),pp. 2675–2686, 2018.

[26] C. Fang, C. J. Li, Z. Lin, and T. Zhang, “SPIDER: Near-optimal non-convex optimization via stochastic path integrated differential estimator,”in Advances in Neural Information Processing Systems (NeurIPS),pp. 689–699, 2018.

[27] N. Tripuraneni, M. Stern, C. Jin, J. Regier, and M. I. Jordan, “Stochasticcubic regularization for fast nonconvex optimization,” in Advances inNeural Information Processing Systems (NeurIPS), pp. 2899–2908,2018.

[28] S. Bhojanapalli, B. Neyshabur, and N. Srebro, “Global optimality oflocal search for low rank matrix recovery,” in Advances in NeuralInformation Processing Systems (NIPS), pp. 3873–3881, 2016.

[29] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has no spurious localminimum,” in Advances in Neural Information Processing Systems(NIPS), pp. 2973–2981, 2016.

[30] R. Ge, C. Jin, and Y. Zheng, “No spurious local minima in nonconvexlow rank problems: A unified geometric analysis,” in InternationalConference on Machine Learning (ICML), pp. 1233–1242, 2017.

[31] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun,“The loss surfaces of multilayer networks,” in International Conferenceon Machine Learning (ICML), pp. 192–204, 2015.

[32] K. Kawaguchi, “Deep learning without poor local minima,” in Advancesin Neural Information Processing Systems (NIPS), pp. 586–594, 2016.

[33] Y. Nesterov, Introductory Lectures on Convex Optimization: A BasicCourse. Kluwer Academic, Boston, 2004.

[34] B. T. Polyak, “Some methods of speeding up the convergence of iterationmethods,” USSR Computational Mathematics and Mathematical Physics,vol. 4, no. 5, pp. 1–17, 1964.

[35] P. Ochs, T. Brox, and T. Pock, “iPiasco: Inertial proximal algorithmfor strongly convex optimization,” Journal of Mathematical Imagingand Vision, vol. 53, pp. 171–181, 2015.

[36] L. Lessard, B. Recht, and A. Packard, “Analysis and design ofoptimization algorithms via integral quadratic constraints,” SIAM Journalon Optimization, vol. 26, no. 1, pp. 57–95, 2016.

[37] E. Ghadimi, H. R. Feyzmahdavian, and M. Johansson, “Globalconvergence of the heavy-ball method for convex optimization,” inEuropean Control Conference (ECC), pp. 310–315, 2015.

[38] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholdingalgorithm for linear inverse problems,” SIAM Journal on ImagingSciences, vol. 2, no. 1, pp. 183–202, 2009.

[39] W. Su, S. Boyd, and E. Candes, “A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights,” Journal of Machine Learning Research, vol. 17, no. 153, pp. 1–43, 2016.

[40] S. Safavi, B. Joshi, G. Franca, and J. Bento, “An explicit convergencerate for Nesterov’s method from SDP,” in Innovations in TheoreticalComputer Science (ITCS), 2018.

[41] A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspectiveon accelerated methods in optimization,” Proceedings of the NationalAcademy of Sciences, vol. 113, no. 47, pp. 7351–7358, 2016.

[42] Y. Nesterov, A. Gasnikov, S. Guminov, and P. Dvurechensky, “Primal-dual accelerated gradient methods with small-dimensional relaxation oracle,” arXiv:1809.05895, 2018.

[43] P. Tseng, “On accelerated proximal gradient methods for convex-concaveoptimization,” tech. rep., University of Washington, Seattle, 2008.

[44] C. Fang, Y. Huang, and Z. Lin, “Accelerating asynchronous algorithms for convex optimization by momentum compensation,” arXiv:1802.09747, 2018.

[45] Z. Allen-Zhu and L. Orecchia, “Linear coupling: An ultimate unificationof gradient and mirror descent,” in Innovations in Theoretical ComputerScience (ITCS), 2017.

[46] A. Chambolle and T. Pock, “A first-order primal-dual algorithm forconvex problems with applications to imaging,” Journal of MathematicalImaging and Vision, vol. 40, pp. 120–145, 2011.

[47] A. Chambolle and T. Pock, “On the ergodic convergence rates of afirst-order primal-dual algorithm,” Mathematical Programming, vol. 159,pp. 253–287, 2016.

[48] S. Bubeck, Y. T. Lee, and M. Singh, “A geometric alternative toNesterov’s accelerated gradient descent,” arXiv:1506.08187, 2015.

[49] D. Drusvyatskiy, M. Fazel, and S. Roy, “An optimal first order methodbased on optimal quadratic averaging,” SIAM Journal on Optimization,vol. 28, no. 1, pp. 251–271, 2018.

[50] Y. Drori and M. Teboulle, “Performance of first-order methods forsmooth convex minimization: a novel approach,” Mathematical Pro-gramming, vol. 145, pp. 451–482, 2014.

[51] D. Kim and J. A. Fessler, “Optimized first-order methods for smoothconvex minimization,” Mathematical Programming, vol. 159, pp. 81–107, 2016.

[52] G. Lan, “Gradient sliding for composite optimization,” MathematicalProgramming, vol. 159, pp. 201–235, 2016.

[53] G. Lan and Y. Zhou, “Conditional gradient sliding for convex optimiza-tion,” SIAM Journal on Optimization, vol. 26, no. 2, pp. 1379–1409,2016.

[54] G. Lan and Y. Ouyang, “Accelerated gradient sliding for structuredconvex optimization,” preprint arXiv:1609.04905, 2016.

[55] G. Lan and R. D. Monteiro, “Iteration-complexity of first-orderpenalty methods for convex programming,” Mathematical Programming,vol. 138, pp. 115–139, 2013.

[56] Y. Chen, G. Lan, and Y. Ouyang, “Optimal primal-dual methods for aclass of saddle point problems,” SIAM Journal on Optimization, vol. 24,no. 4, pp. 1779–1814, 2014.

[57] Y. Xu, “Accelerated first-order primal-dual proximal methods forlinearly constrained composite convex programming,” SIAM Journalon Optimization, vol. 27, no. 3, pp. 1459–1484, 2017.

[58] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao Jr, “An acceleratedlinearized alternating direction method of multipliers,” SIAM Journalon Imaging Sciences, vol. 8, no. 1, pp. 644–681, 2015.

[59] H. Li and Z. Lin, “Accelerated alternating direction method ofmultipliers: an optimal O(1/K) nonergodic analysis,” Journal ofScientific Computing, vol. 79, pp. 671–699, 2019.

[60] J. Lu and M. Johansson, “Convergence analysis of approximate primalsolutions in dual first-order methods,” SIAM Journal on Optimization,vol. 26, no. 4, pp. 2430–2467, 2016.

[61] B. He and X. Yuan, “On the acceleration of augmented Lagrangianmethod for linearly constrained optimization,” Optimization Online,2010.

[62] T. Goldstein, B. O’Donoghue, S. Setzer, and R. Baraniuk, “Fastalternating direction optimization methods,” SIAM Journal on ImagingSciences, vol. 7, no. 3, pp. 1588–1623, 2014.

[63] P. Giselsson and S. Boyd, “Linear convergence and metric selectionfor Douglas-Rachford splitting and ADMM,” IEEE Transactions onAutomatic Control, vol. 62, no. 2, pp. 532–544, 2016.

[64] G. Franca and J. Bento, “An explicit rate bound for over-relaxedADMM,” in IEEE International Symposium on Information Theory(ISIT), pp. 2104–2108, 2016.

[65] A. Nemirovsky and D. Yudin, Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York, 1983.

[66] Y. Arjevani and O. Shamir, “On the iteration complexity of obliviousfirst-order optimization algorithms,” in International Conference onMachine Learning (ICML), pp. 654–663, 2016.

[67] B. Woodworth and N. Srebro, “Tight complexity bounds for optimizingcomposite objectives,” in Advances in Neural Information ProcessingSystems (NIPS), pp. 3639–3647, 2016.

[68] Y. Arjevani, S. Shalev-Shwartz, and O. Shamir, “On lower and upperbounds for smooth and strongly convex optimization,” Journal ofMachine Learning Research, vol. 17, no. 126, pp. 1–51, 2016.

[69] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods forlarge-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311,2018.

[70] R. Johnson and T. Zhang, “Accelerating stochastic gradient descentusing predictive variance reduction,” in Advances in Neural InformationProcessing Systems (NIPS), pp. 315–323, 2013.

[71] M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums withthe stochastic average gradient,” Mathematical Programming, vol. 162,pp. 83–112, 2017.

[72] A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incrementalgradient method with support for non-strongly convex compositeobjectives,” in Advances in Neural Information Processing Systems(NIPS), pp. 1646–1654, 2014.

[73] L. Zhang, M. Mahdavi, and R. Jin, “Linear convergence with conditionnumber independent access of full gradients,” in Advances in NeuralInformation Processing Systems (NIPS), pp. 980–988, 2013.

[74] A. Defazio, T. Caetano, and J. Domke, “Finito: A faster, permutableincremental gradient method for big data problems,” in InternationalConference on Machine Learning (ICML), pp. 1125–1133, 2014.

[75] J. Mairal, “Optimization with first-order surrogate functions,” inInternational Conference on Machine Learning (ICML), pp. 783–791,2013.

[76] A. Chambolle, M. J. Ehrhardt, P. Richtarik, and C.-B. Schonlieb,“Stochastic primal-dual hybrid gradient algorithm with arbitrary samplingand imaging applications,” SIAM Journal on Optimization, vol. 28, no. 4,pp. 2783–2808, 2018.

[77] S. Shalev-Shwartz, “SDCA without duality, regularization, and indi-vidual convexity,” in International Conference on Machine Learning(ICML), pp. 747–754, 2016.

[78] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takac, “SARAH: Anovel method for machine learning problems using stochastic recursivegradient,” in International Conference on Machine Learning (ICML),pp. 2613–2621, 2017.

[79] L. Xiao and T. Zhang, “A proximal stochastic gradient method withprogressive variance reduction,” SIAM Journal on Optimization, vol. 24,no. 4, pp. 2057–2075, 2014.

[80] Z. Allen-Zhu, “Katyusha X: Practical momentum method for stochas-tic sum-of-nonconvex optimization,” in International Conference onMachine Learning (ICML), pp. 179–185, 2019.

[81] A. Defazio, “A simple practical accelerated method for finite sums,” inAdvances in Neural Information Processing Systems (NIPS), pp. 676–684, 2016.

[82] O. Shamir and T. Zhang, “Stochastic gradient descent for non-smoothoptimization: Convergence results and optimal averaging schemes,” inInternational Conference on Machine Learning (ICML), pp. 71–79,2013.

[83] Z. Allen-Zhu and Y. Yuan, “Improved SVRG for non-strongly-convex orsum-of-non-convex objectives,” in International Conference on MachineLearning (ICML), pp. 1080–1089, 2016.

[84] Z. Allen-Zhu and E. Hazan, “Optimal black-box reductions betweenoptimization objectives,” in Advances in Neural Information ProcessingSystems (NIPS), pp. 1614–1622, 2016.

[85] H. Lin, J. Mairal, and Z. Harchaoui, “Catalyst acceleration for first-order convex optimization: from theory to practice,” Journal of MachineLearning Research, vol. 18, no. 212, pp. 1–54, 2018.

[86] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54,no. 1, pp. 48–61, 2009.

[87] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-orderalgorithm for decentralized consensus optimization,” SIAM Journal onOptimization, vol. 25, no. 2, pp. 944–966, 2015.

[88] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, “Augmented distributed gradientmethods for multi-agent optimization under uncoordinated constantstepsizes,” in IEEE Conference on Decision and Control (CDC),pp. 2055–2060, 2015.

[89] G. Qu and N. Li, “Harnessing smoothness to accelerate distributedoptimization,” IEEE Transactions on Control of Network Systems, vol. 5,no. 3, pp. 1245–1260, 2018.

[90] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergencefor distributed optimization over time-varying graphs,” SIAM Journalon Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.

[91] R. Xin, U. A. Khan, and S. Kar, “A linear algorithm for optimization overdirected graphs with geometric convergence,” IEEE Control SystemsLetters, vol. 2, no. 3, pp. 315–320, 2018.

[92] J. Chen and A. H. Sayed, “Diffusion adaptation strategies for distributedoptimization and learning over networks,” IEEE Transactions on SignalProcessing, vol. 60, no. 8, pp. 4289–4305, 2012.


[93] S. S. Ram, A. Nedic, and V. V. Veeravalli, “Distributed stochasticsubgradient projection algorithms for convex optimization,” Journal ofOptimization Theory and Applications, vol. 147, pp. 516–545, 2010.

[94] A. Mokhtari and A. Ribeiro, “DSA: Decentralized double stochastic averaging gradient algorithm,” Journal of Machine Learning Research, vol. 17, no. 61, pp. 1–35, 2016.

[95] R. Xin, U. A. Khan, and S. Kar, “Variance-reduced decentralized stochastic optimization with accelerated convergence,” preprint arXiv:1912.04230, 2019.

[96] B. Li, S. Cen, Y. Chen, and Y. Chi, “Communication-efficient dis-tributed optimization in networks with gradient tracking,” preprintarXiv:1909.05844 (to appear in Journal of Machine Learning Research),2020.

[97] R. Xin, S. Kar, and U. A. Khan, “Decentralized stochastic optimizationand machine learning: A unified variance-reduction framework forrobust performance and fast convergence,” IEEE Signal ProcessingMagazine, vol. 37, no. 3, pp. 102–113, 2020.

[98] Z. Lu and L. Xiao, “On the complexity analysis of randomized block-coordinate descent methods,” Mathematical Programming, vol. 152,pp. 615–642, 2015.

[99] Y. T. Lee and A. Sidford, “Efficient accelerated coordinate descentmethods and faster algorithms for solving linear systems,” in IEEESymposium on Foundations of Computer Science (FOCS), pp. 147–156,2013.

[100] S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascentmethods for regularized loss minimization,” Journal of MachineLearning Research, vol. 14, pp. 567–599, 2013.

[101] I. Necoara, Y. Nesterov, and F. Glineur, “Linear convergence of firstorder methods for non-strongly convex optimization,” MathematicalProgramming, vol. 175, pp. 69–107, 2019.

[102] P. W. Wang and C. J. Lin, “Iteration complexity of feasible descentmethods for convex optimization,” Journal of Machine LearningResearch, vol. 15, pp. 1523–1548, 2014.

[103] B. O’Donoghue and E. Candes, “Adaptive restart for accelerated gradientschemes,” Foundations of Computational Mathematics, vol. 15, pp. 715–732, 2015.

[104] O. Fercoq and Z. Qu, “Adaptive restart of accelerated gradient methodsunder local quadratic growth condition,” IMA Journal of NumericalAnalysis, vol. 39, no. 4, pp. 2069–2095, 2019.

[105] O. Fercoq and Z. Qu, “Restarting the accelerated coordinate descentmethod with a rough strong convexity estimate,” ComputationalOptimization and Applications, vol. 75, pp. 63–91, 2020.

[106] Z. Qu, P. Richtarik, and T. Zhang, “Quartz: Randomized dual coordinateascent with arbitrary sampling,” in Advances in Neural InformationProcessing Systems (NIPS), pp. 865–873, 2015.

[107] Z. Allen-Zhu, Z. Qu, P. Richtarik, and Y. Yuan, “Even faster acceleratedcoordinate descent using non-uniform sampling,” in InternationalConference on Machine Learning (ICML), pp. 1110–1119, 2016.

[108] Y. Nesterov and S. U. Stich, “Efficiency of the accelerated coordinatedescent method on structured optimization problems,” SIAM Journalon Optimization, vol. 27, no. 1, pp. 110–123, 2017.

[109] P. Ochs, Y. Chen, T. Brox, and T. Pock, “iPiano: Inertial proximalalgorithm for nonconvex optimization,” SIAM Journal on ImagingSciences, vol. 7, no. 2, pp. 1388–1419, 2014.

[110] S. Ghadimi and G. Lan, “Accelerated gradient methods for nonconvexnonlinear and stochastic programming,” Mathematical Programming,vol. 156, pp. 59–99, 2016.

[111] H. Li and Z. Lin, “Accelerated proximal gradient methods for nonconvexprogramming,” in Advances in Neural Information Processing Systems(NIPS), pp. 379–387, 2015.

[112] A. Beck and M. Teboulle, “Fast gradient-based algorithms for con-strained total variation image denoising and deblurring problems,” IEEETransactions on Image Processing, vol. 18, no. 11, pp. 2419–2434,2009.

[113] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, “Howto escape saddle points efficiently,” in International Conference onMachine Learning (ICML), pp. 1724–1732, 2017.

[114] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gradientdescent only converges to minimizers,” in Conference On LearningTheory (COLT), pp. 1246–1257, 2016.

[115] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, B. Poczos, and A. Singh,“Gradient descent can take exponential time to escape saddle points,” inAdvances in Neural Information Processing Systems (NIPS), pp. 1067–1077, 2017.

[116] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importanceof initialization and momentum in deep learning,” in InternationalConference on Machine Learning (ICML), pp. 1139–1147, 2013.

[117] Z. Allen-Zhu and E. Hazan, “Variance reduction for faster non-convexoptimization,” in International Conference on Machine Learning (ICML),pp. 699–707, 2016.

[118] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochas-tic variance reduction for nonconvex optimization,” in InternationalConference on Machine Learning (ICML), pp. 314–323, 2016.

[119] L. Lei, C. Ju, J. Chen, and M. I. Jordan, “Non-convex finite-sumoptimization via SCSG methods,” in Advances in Neural InformationProcessing Systems (NIPS), pp. 2348–2358, 2017.

[120] S. J. Reddi, S. Sra, B. Poczos, and A. Smola, “Proximal stochastic meth-ods for nonsmooth nonconvex finite-sum optimization,” in Advances inNeural Information Processing Systems (NIPS), pp. 1145–1153, 2016.

[121] D. Zhou, P. Xu, and Q. Gu, “Stochastic nested variance reduction fornonconvex optimization,” in Advances in Neural Information ProcessingSystems (NeurIPS), pp. 3925–3936, 2018.

[122] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh, “SpiderBoostand momentum: Faster stochastic variance reduction algorithms,”in Advances in Neural Information Processing Systems (NeurIPS),pp. 2403–2413, 2019.

[123] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng,and J. R. Kalagnanam, “Finite-sum smooth optimization with SARAH,”preprint arXiv:1901.07648, 2019.

[124] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, andB. Woodworth, “Lower bounds for non-convex stochastic optimization,”preprint arXiv:1912.02365, 2019.

[125] R. Ge, F. Huang, C. Jin, and Y. Yuan, “Escaping from saddle points –online stochastic gradient for tensor decomposition,” in Conference OnLearning Theory (COLT), pp. 797–842, 2015.

[126] H. Daneshmand, J. Kohler, A. Lucchi, and T. Hofmann, “Escapingsaddles with stochastic gradients,” in International Conference onMachine Learning (ICML), pp. 1155–1164, 2018.

[127] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “Stochasticgradient descent escapes saddle points efficiently,” arXiv:1902.04811,2019.

[128] C. Fang, Z. Lin, and T. Zhang, “Sharp analysis for nonconvex SGDescaping from saddle points,” in Conference On Learning Theory(COLT), pp. 1192–1234, 2019.

[129] Y. Xu, R. Jin, and T. Yang, “First-order stochastic algorithms forescaping from saddle points in almost linear time,” in Advances inNeural Information Processing Systems (NeurIPS), pp. 5530–5540,2018.

[130] Z. Allen-Zhu and Y. Li, “Neon2: Finding local minima via first-order oracles,” in Advances in Neural Information Processing Systems(NeurIPS), pp. 3716–3726, 2018.

[131] Y. Nesterov and B. T. Polyak, “Cubic regularization of Newton methodand its global performance,” Mathematical Programming, vol. 108,pp. 177–205, 2006.

[132] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulie, “Optimalalgorithms for smooth and strongly convex distributed optimization innetworks,” in International Conference on Machine Learning (ICML),pp. 3027–3036, 2017.

[133] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedic, “A dual approachfor optimal algorithms in distributed optimization over networks,”arXiv:1809.00710, 2018.

[134] H. Hendrikx, F. Bach, and L. Massoulie, “An accelerated decentralizedstochastic proximal algorithm for finite sums,” in Advances in NeuralInformation Processing Systems (NeurIPS), pp. 952–962, 2019.

[135] D. Jakovetic, J. Xavier, and J. M. F. Moura, “Fast distributed gradientmethods,” IEEE Transactions on Automatic Control, vol. 59, no. 5,pp. 1131–1146, 2014.

[136] H. Li, C. Fang, W. Yin, and Z. Lin, “A sharp convergence rate analysisfor distributed accelerated gradient methods,” arXiv:1810.01053, 2018.

[137] H. Li and Z. Lin, “Revisiting EXTRA for smooth distributed opti-mization,” SIAM Journal Optimization, vol. 30, no. 3, pp. 1795–1821,2020.

[138] G. Qu and N. Li, “Accelerated distributed Nesterov gradient descent,”IEEE Transactions on Automatic Control, vol. 65, no. 6, pp. 2566–2581,2020.

[139] R. Xin, D. Jakovetic, and U. A. Khan, “Distributed Nesterov gradientmethods over arbitrary graphs,” IEEE Signal Processing Letters, vol. 26,no. 8, pp. 1247–1251, 2019.

[140] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulie, “Optimalconvergence rates for convex distributed optimization in networks,”Journal of Machine Learning Research, vol. 20, no. 159, pp. 1–31,2019.


[141] C. Ma, M. Jaggi, F. E. Curtis, N. Srebro, and M. Takac, “Anaccelerated communication-efficient primal-dual optimization frameworkfor structured machine learning,” Optimization Methods and Software,pp. 1–25, 2019.

[142] D. Kovalev, A. Salim, and P. Richtarik, “Optimal and practicalalgorithms for smooth and strongly convex decentralized optimization,”preprint arXiv:2006.11773, 2020.

Huan Li received his Ph.D. degree from Peking University in 2019. He is currently an Assistant Researcher at the Institute of Robotics and Automatic Information Systems, College of Artificial Intelligence, Nankai University. His current research interests include optimization and machine learning.

Cong Fang received his Ph.D. degree from Peking University in 2019. He is currently a Postdoctoral Researcher at Princeton University. His research interests include machine learning and optimization.

Zhouchen Lin received the Ph.D. degree in Applied Mathematics from Peking University in 2000. He is currently a Professor at the Key Laboratory of Machine Perception (MOE), School of EECS, Peking University. His research interests include computer vision, image processing, machine learning, pattern recognition, and numerical optimization. He is an Associate Editor of IEEE Trans. Pattern Analysis and Machine Intelligence and International J. Computer Vision, an area chair of CVPR 2014/16/19/20/21, ICCV 2015, NIPS 2015/18/19/20/21, ICML 2020, AAAI 2019/20, IJCAI 2020 and ICLR 2021, and a Fellow of the IEEE and the IAPR.