IQN: AN INCREMENTAL QUASI-NEWTON METHOD WITH LOCAL SUPERLINEAR CONVERGENCE RATE∗

ARYAN MOKHTARI†, MARK EISEN†, AND ALEJANDRO RIBEIRO†

Abstract. This paper studies the problem of minimizing a global objective function which can be written as the average of a set of n smooth and strongly convex functions. This manuscript focuses on the case in which the number of functions n is extremely large, which is a common scenario in large-scale machine learning problems. Quasi-Newton methods, which build on the idea of approximating the Newton step using the first-order information of the objective function, are successful in reducing the computational complexity of Newton's method by avoiding the computation of the Hessian and its inverse at each iteration, while converging at a superlinear rate to the optimal argument. However, quasi-Newton methods are impractical for solving the finite sum minimization problem because they operate on the information of all n functions at each iteration and thus are not computationally affordable. This issue has been addressed by incremental quasi-Newton methods, which use the information of a subset of functions at each iteration, where the functions are chosen by a random or cyclic routine. Although incremental quasi-Newton methods are able to reduce the computational complexity of traditional (full-batch) quasi-Newton methods significantly, they fail to converge at a superlinear rate. In this paper, we propose the IQN method as the first incremental quasi-Newton method with a local superlinear convergence rate. In the IQN method, we compute and update the information of only a single function at each iteration – as in all incremental methods – and use the gradient (first-order) information to approximate the Newton direction without a computationally expensive inversion – as in all quasi-Newton methods. IQN differs from state-of-the-art incremental quasi-Newton methods in three criteria. First, the use of aggregated information of variables, gradients, and quasi-Newton Hessian approximations; second, the approximation of each individual function by its Taylor's expansion in which the linear and quadratic terms are evaluated with respect to the same iterate; and third, the use of a cyclic scheme to update the functions in lieu of a random selection routine. We use these fundamental properties of IQN to establish its local superlinear convergence rate. The presented numerical experiments match our theoretical results and justify the advantage of IQN relative to other incremental methods.

Key words. Large-scale optimization, stochastic optimization, quasi-Newton methods, incremental methods, superlinear convergence

1. Introduction. We study a large scale optimization problem where the objective function is expressed as an aggregation of a set of component objective functions. To be more precise, consider a variable $x \in \mathbb{R}^p$ and a function $f$ which is defined as the average of $n$ smooth and strongly convex functions labelled $f_i : \mathbb{R}^p \to \mathbb{R}$ for $i = 1, \ldots, n$. Our goal is to find the optimal argument $x^*$ as the solution to the problem

(1) $x^* = \operatorname{argmin}_{x \in \mathbb{R}^p} f(x) := \operatorname{argmin}_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} f_i(x),$

in a computationally efficient manner even when $n$ is large. Problems of this form arise in machine learning [3, 2, 30, 8], control [5, 7, 15], and wireless communications [27, 23, 24]. In this paper, we focus on the case in which the component functions $f_i$ are strongly convex and their gradients are Lipschitz continuous.

Much of the theory that has been developed to solve (1) is centered on the use of iterative descent methods. In particular, the gradient descent (GD) method—which relies on the computation of the full gradient at each iteration—is the workhorse for solving convex optimization problems. Although GD converges at a fast linear rate,

∗This work was supported by NSF CAREER CCF-0952867 and ONR N00014-12-1-0997.
†Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA ([email protected], [email protected], [email protected])



it has two major issues for solving the problem in (1). Firstly, it requires computing the gradients of $n$ functions at each iteration, which is computationally expensive and has a complexity on the order of $O(np)$. Secondly, its linear convergence constant depends on the problem condition number, and, consequently, it suffers from slow convergence in ill-conditioned problems. To resolve the second issue, Newton's method arises as a natural solution. Newton's method achieves a quadratic convergence rate in a local neighborhood of the optimal argument by using the function's curvature information. However, the computational complexity of Newton's method is on the order of $O(np^2 + p^3)$, where the first term comes from Hessian computation and the second term is the cost of Hessian inversion.

Quasi-Newton methods, which build on the idea of approximating the Newton step using the first-order information of the objective function [4, 22, 12], reduce the computational complexity of Newton's method to an overall cost on the order of $O(np + p^2)$ per iteration, where the first term corresponds to the cost of gradient computation and the second term indicates the computational complexity of updating the approximate Hessian inverse matrix. Quasi-Newton methods are preferable to gradient descent methods since they converge at a superlinear rate when the variable is in a local neighborhood of the optimal argument and the error of the Hessian approximation is sufficiently small. Although quasi-Newton methods reduce the computational complexity of second-order methods and enhance the convergence rate of first-order methods, they are not applicable to optimization problems of the form (1) when the number of component functions $n$ is large. This drawback also exists for deterministic first and second order methods. The natural approach to avoid full gradient computation is replacing the gradient $\nabla f(x) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x)$ by a stochastic approximation.

The first attempt to use stochastic gradients in place of gradients in the update of quasi-Newton methods was the work in [29], which introduces a stochastic (online) version of the BFGS quasi-Newton method as well as its limited memory variant. Although [29] provides numerical experiments illustrating the advantages of stochastic quasi-Newton methods, it fails to establish any theoretical guarantees. In [18], the authors show that the stochastic variant of BFGS might not be convergent because of having unbounded eigenvalues. They propose a regularized modification of stochastic BFGS which changes the proximity condition of BFGS to ensure that the eigenvalues of the Hessian inverse approximation are bounded and, consequently, that the algorithm is convergent. For the limited memory version of stochastic (online) BFGS, it has been shown in [19] that there is no need for regularization, since the eigenvalues of the Hessian inverse approximation matrices are uniformly bounded by constants that depend on the size of the memory. Using these bounds, the authors in [19] prove that the limited memory version of stochastic (online) BFGS proposed in [29] is almost surely convergent and has a sublinear convergence rate in expectation. The work in [6] proposes a limited memory stochastic quasi-Newton method that collects second-order information to compute the product of a subsampled Hessian and a stochastic gradient as the stochastic gradient variation required in the update of BFGS. This approach is in contrast to the works in [29, 18, 19], which define the stochastic gradient variation as the difference of two consecutive stochastic gradients, and it allows the noise of stochastic gradient computation to be separated from the error in the Hessian inverse approximation at the cost of computing a subset of Hessians.

Although the methods in [29, 18, 19, 6] are successful in expanding the application of quasi-Newton methods to stochastic settings, they suffer from slow sublinear convergence rates. This drawback is the outcome of the noise of stochastic approximations, which requires the use of diminishing stepsizes to reduce stochasticity. The works in [16, 20] attempt to resolve this issue by using the variance reduction technique proposed in [13]. The fundamental idea of the work in [13] is to reduce the stochastic gradient approximation noise by computing the exact gradient in an outer loop and using it in an inner loop for evaluating the stochastic gradient. The idea of variance reduction in [16, 20] is successful in achieving a linear convergence rate and improving the guaranteed sublinear convergence rates in [18, 19, 6]; however, these methods fail to recover the superlinear convergence rate of deterministic quasi-Newton methods. Hence, a fundamental question remained unanswered: is it possible to design an incremental quasi-Newton method that recovers the superlinear convergence rate of deterministic (full-batch) quasi-Newton algorithms?

In this paper, we show that the answer to this open problem is positive by proposing the first incremental quasi-Newton method (IQN) that has a local superlinear convergence rate. The proposed IQN method has a low computational cost per iteration of order $O(p^2)$ and only updates the information of a single function at each iteration, as in stochastic quasi-Newton methods.

There are three major differences between the IQN method and state-of-the-art incremental (stochastic) quasi-Newton methods that lead to the former's superlinear convergence rate. Firstly, the proposed IQN method uses the aggregated information of variables, gradients, and Hessian approximation matrices to reduce the noise of stochastic approximation in both the gradient and Hessian inverse approximations. This is in contrast to the variance-reduced stochastic quasi-Newton methods in [16, 20], which attempt to reduce the noise of gradient approximations only. Secondly, the IQN method approximates each instantaneous function $f_i$ by the sum of its first order approximation and a quadratic term, both evaluated with respect to the same iterate, whereas the traditional stochastic quasi-Newton methods in [29, 18, 19, 6] evaluate the linear and quadratic terms of the Taylor's expansion at different points. Thirdly, in IQN the index of the updated function is chosen in a cyclic fashion, rather than by the random selection scheme used in the incremental methods in [29, 18, 19, 6]. The cyclic routine in IQN allows us to bound the error at each iteration as a function of the errors of the last $n$ iterates, which is not achievable with a random scheme. These three properties together lead to an incremental quasi-Newton method with a local superlinear convergence rate.

We start the paper by recapping the BFGS quasi-Newton method and the Dennis-Moré condition, which is necessary and sufficient for the superlinear convergence of the BFGS method (Section 2). Then, we present the proposed Incremental Quasi-Newton method (IQN) as an incremental aggregated version of the traditional BFGS method (Section 3). We first explain the difference between the Taylor's expansion used in IQN and the one used in state-of-the-art incremental (stochastic) quasi-Newton methods. Further, we explain the mechanism for the aggregation of the functions' information and the scheme for updating the stored information. Moreover, we suggest an efficient mechanism to implement the proposed IQN method with computational complexity of the order $O(p^2)$ (Section 3.1). The convergence analysis of the IQN method is then presented (Section 4). We use the classic analysis of quasi-Newton methods to show that in a local neighborhood of the optimal solution the sequence of variables converges to the optimal argument $x^*$ linearly after each pass over the set of functions (Lemma 3). We use this result to show that for each component function $f_i$ the Dennis-Moré condition holds (Proposition 4). However, this condition is not sufficient to prove superlinear convergence of the sequence of errors $\|x^t - x^*\|$, since it does not guarantee the Dennis-Moré condition for the global objective $f$. To overcome this issue we introduce a novel convergence analysis approach which exploits the local linear convergence of IQN to present a more general version of the Dennis-Moré condition for each component function $f_i$ (Lemma 5). We exploit this result to establish superlinear convergence of the iterates generated by IQN (Theorem 6). In Section 6, we present numerical simulation results, comparing the performance of IQN to that of both first-order incremental methods and second order and quasi-Newton methods. We test the performance on a set of large scale regression problems and observe strong numerical gains in total computation time relative to existing methods.

1.1. Notation. Vectors are written as lowercase $x \in \mathbb{R}^p$ and matrices as uppercase $A \in \mathbb{R}^{p\times p}$. We use $\|x\|$ and $\|A\|$ to denote the Euclidean norm of the vector $x$ and the matrix $A$, respectively. Given a function $f$, its gradient and Hessian at the point $x$ are denoted as $\nabla f(x)$ and $\nabla^2 f(x)$, respectively.

2. BFGS Quasi-Newton Method. Consider the problem in (1) for a relatively large $n$. In a conventional optimization setting, it can be solved using a quasi-Newton method which iteratively updates a variable $x^t$ for $t = 0, 1, \ldots$ based on the general recursive expression

(2) $x^{t+1} = x^t - \eta^t (B^t)^{-1} \nabla f(x^t),$

where $\eta^t$ is a scalar stepsize and $B^t$ is a positive definite matrix which is an approximation of the exact Hessian of the objective function $\nabla^2 f(x^t)$. The stepsize $\eta^t$ is evaluated based on a line search routine for the global convergence of quasi-Newton methods. Our focus in this paper is on the local convergence of quasi-Newton methods, which requires the unit stepsize $\eta^t = 1$. Therefore, throughout the paper we assume that the variable $x^t$ is close to the optimal solution $x^*$ – we will formalize the notion of being close to the optimal solution – and the stepsize is $\eta^t = 1$.

The goal of quasi-Newton methods is to compute the Hessian approximation matrix $B^t$ and its inverse $(B^t)^{-1}$ by using only the first-order information, i.e., gradients, of the objective function. For this reason, the use of quasi-Newton methods is widespread, since in many applications the Hessian information required in Newton's method is either unavailable or too costly to evaluate. There are different approaches to approximate the Hessian, but the common feature among quasi-Newton methods is that the Hessian approximation matrix must satisfy the secant condition. To be more precise, consider $s^t$ and $y^t$ as the variable and gradient variations which are defined as

(3) $s^t := x^{t+1} - x^t, \qquad y^t := \nabla f(x^{t+1}) - \nabla f(x^t).$

Then, given the variable variation $s^t$ and gradient variation $y^t$, the Hessian approximation matrix in all quasi-Newton methods should satisfy the secant condition

(4) $B^{t+1} s^t = y^t,$

which is also called the quasi-Newton equation. This condition is fundamental in quasi-Newton methods since the exact Hessian $\nabla^2 f(x^t)$ satisfies this equality when the iterates $x^{t+1}$ and $x^t$ are close to each other. If we consider the matrix $B^{t+1}$ as the unknown matrix, the system of equations in (4) does not have a unique solution. Different quasi-Newton methods enforce different conditions on the matrix $B^{t+1}$ to come up with a unique update. This extra condition is typically a proximity condition that ensures that $B^{t+1}$ is close to the previous Hessian approximation matrix $B^t$ [4, 22, 12]. In particular, for the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, the update of the Hessian approximation matrix can be written as

(5) $B^{t+1} = B^t + \frac{y^t y^{tT}}{y^{tT} s^t} - \frac{B^t s^t s^{tT} B^t}{s^{tT} B^t s^t}.$

The BFGS method is popular not only for its strong numerical performance relative to the gradient descent method, but also because it is shown to exhibit a superlinear convergence rate [4], thereby providing a theoretical guarantee of superior performance. In fact, it can be shown that the BFGS update satisfies the condition

(6) $\lim_{t\to\infty} \frac{\|(B^t - \nabla^2 f(x^*)) s^t\|}{\|s^t\|} = 0,$

known as the Dennis-Moré condition, which is both necessary and sufficient for superlinear convergence [12]. This result solidifies quasi-Newton methods as a strong alternative to first order methods when exact second-order information is unavailable. However, implementation of the BFGS method is not feasible when the number of functions $n$ is large, due to its high computational complexity on the order of $O(np + p^2)$ per iteration. In the following section, we propose a novel incremental BFGS method that has a computational complexity of $O(p^2)$ per iteration and converges at a superlinear rate.
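To make the recursion in (2)-(5) concrete, the following sketch runs BFGS with unit stepsize on a small, well-conditioned quadratic. It is a minimal illustration rather than an implementation from the paper; the test function, the safeguard on tiny curvature pairs, and all variable names are our own choices.

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation as in (5)."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x with a mild condition number.
rng = np.random.default_rng(0)
p = 5
A = np.diag(rng.uniform(1.0, 1.5, size=p))        # exact Hessian
b = rng.standard_normal(p)
grad = lambda x: A @ x - b

x, B = np.ones(p), np.eye(p)                      # initial iterate and approximation
for t in range(20):
    x_new = x - np.linalg.solve(B, grad(x))       # quasi-Newton step (2) with unit stepsize
    s, y = x_new - x, grad(x_new) - grad(x)       # variable/gradient variations (3)
    if abs(y @ s) > 1e-12:                        # skip negligible curvature pairs (safeguard)
        B = bfgs_update(B, s, y)                  # new B satisfies the secant condition (4)
    x = x_new
print(np.linalg.norm(grad(x)))                    # gradient norm after 20 BFGS steps
```

After each update the secant condition $B^{t+1}s^t = y^t$ holds exactly (up to round-off), which can be checked with `np.allclose(B @ s, y)`.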

3. IQN: Incremental aggregated BFGS. We introduce an incremental aggregated BFGS algorithm, in which the most recently observed information of all functions $f_1, \ldots, f_n$ is used to compute the updated variable $x^{t+1}$, while only the information of a single function is updated at each iteration. The particular function is chosen by cyclically iterating through the $n$ functions. This approach is incremental since it updates the information of only one function at each iteration, and it is aggregated since it operates on the most recent aggregated information of all functions. We name the proposed method IQN as an abbreviation for Incremental Quasi-Newton method.

In the IQN method, we consider $z_1^t, \ldots, z_n^t$ as the copies of the variable $x$ at time $t$ associated with the functions $f_1, \ldots, f_n$, respectively. Likewise, recall $\nabla f_i(z_i^t)$ as the gradient corresponding to the $i$-th function. Further, consider $B_i^t$ as a positive definite matrix which approximates the $i$-th component Hessian $\nabla^2 f_i(x^t)$. We refer to $z_i^t$, $\nabla f_i(z_i^t)$, and $B_i^t$ as the information corresponding to the $i$-th function $f_i$ at step $t$. Note that the functions' information is stored in a shared memory as shown in Fig. 1. To introduce the IQN method, we first explain the mechanism for computing the updated variable $x^{t+1}$ using the stored information $\{z_i^t, \nabla f_i(z_i^t), B_i^t\}_{i=1}^{n}$. Then, we elaborate on the scheme for updating the information of the functions.

To derive the full variable update, consider the second order approximation of the objective function $f_i(x)$ centered around its current iterate $z_i^t$,

(7) $f_i(x) \approx f_i(z_i^t) + \nabla f_i(z_i^t)^T (x - z_i^t) + \frac{1}{2}(x - z_i^t)^T \nabla^2 f_i(z_i^t)(x - z_i^t).$

As in traditional quasi-Newton methods, we replace the $i$-th Hessian $\nabla^2 f_i(z_i^t)$ by $B_i^t$. Using the approximation matrices in place of Hessians, the complete (aggregate) function $f(x)$ can be approximated with

(8) $f(x) \approx \frac{1}{n}\sum_{i=1}^{n}\left[ f_i(z_i^t) + \nabla f_i(z_i^t)^T (x - z_i^t) + \frac{1}{2}(x - z_i^t)^T B_i^t (x - z_i^t)\right].$


[Figure 1 appears here.] Fig. 1. The updating scheme for variables, gradients, and Hessian approximation matrices of function $f_{i_t}$ at step $t$. The red arrows indicate the terms used in the update of $B_{i_t}^{t+1}$ using the BFGS update in (15). The black arrows show the updates of all variables and gradients. The terms $z_{i_t}^{t+1}$ and $\nabla f_{i_t}^{t+1}$ are updated as $x^{t+1}$ and $\nabla f_{i_t}(x^{t+1})$, respectively. All other $z_j^{t+1}$ and $\nabla f_j^{t+1}$ are set as $z_j^t$ and $\nabla f_j^t$, respectively.

Note that the right hand side of (8) is a quadratic approximation of the function $f$ based on the available information at step $t$. Hence, the updated iterate $x^{t+1}$ can be defined as the minimizer of the quadratic program in (8), explicitly given by

(9) $x^{t+1} = \left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\right)^{-1}\left[\frac{1}{n}\sum_{i=1}^{n} B_i^t z_i^t - \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(z_i^t)\right].$

First note that the update in (9) shows that the updated variable $x^{t+1}$ is a function of the stored information of all functions $f_1, \ldots, f_n$. Furthermore, we use the aggregated information of variables, gradients, and the quasi-Newton Hessian approximations to evaluate the updated variable. This is done so that the noise in approximating both gradients and Hessians vanishes as the sequence approaches the optimal argument.
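As a concrete reference point, the following sketch evaluates the update in (9) by direct aggregation. It is an unoptimized transcription under our own data layout (per-function iterates, gradients, and Hessian approximations stored in arrays); the function and variable names are ours, and the $O(np^2)$ cost of this version is exactly what the efficient implementation in Section 3.1 avoids.

```python
import numpy as np

def iqn_step_naive(z, grads, B):
    """Direct evaluation of the IQN update (9).

    z:     (n, p) array of per-function iterates z_i^t
    grads: (n, p) array of stored gradients grad f_i(z_i^t)
    B:     (n, p, p) array of Hessian approximations B_i^t
    Returns x^{t+1} = (mean_i B_i)^{-1} [ mean_i B_i z_i - mean_i grad_i ].
    """
    B_bar = B.mean(axis=0)                                # (1/n) sum_i B_i^t
    Bz_bar = np.einsum('ijk,ik->ij', B, z).mean(axis=0)   # (1/n) sum_i B_i^t z_i^t
    g_bar = grads.mean(axis=0)                            # (1/n) sum_i grad f_i(z_i^t)
    return np.linalg.solve(B_bar, Bz_bar - g_bar)
```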

Remark 1. Given the BFGS Hessian approximation matrices $\{B_i^t\}_{i=1}^{n}$ and gradients $\{\nabla f_i(z_i^t)\}_{i=1}^{n}$, one may consider an update more akin to traditional descent-based methods, i.e.,

(10) $x^{t+1} = x^t - \left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\right)^{-1} \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(z_i^t).$

To evaluate the advantage of the proposed update for IQN in (9) relative to the update in (10), we proceed to study the Taylor's expansion that leads to the update in (10). It can be shown that the update in (10) is the outcome of the following approximation

(11) $f(x) \approx \frac{1}{n}\sum_{i=1}^{n}\left[ f_i(z_i^t) + \nabla f_i(z_i^t)^T (x - z_i^t) + \frac{1}{2}(x - x^t)^T B_i^t (x - x^t)\right],$

which uses the iterate $z_i^t$ for the linear term of the approximation, while the quadratic term is approximated near the iterate $x^t$. This inconsistency in the Taylor's expansion of each function $f_i$ leads to an inaccurate second-order approximation, and subsequently a slower incremental quasi-Newton method.

So far we have discussed the procedure to compute the updated variable $x^{t+1}$ based on the information of the functions $f_1, \ldots, f_n$ at step $t$. Now it remains to show how we update the information of the functions $f_1, \ldots, f_n$ using the variable $x^{t+1}$. In the IQN method, at each iteration we only update the information of one function, chosen in a cyclic manner. In particular, if we define $i_t$ as the index of the function chosen at iteration $t$, we update the information corresponding to the function $f_{i_t}$ using the updated variable $x^{t+1}$, while the information corresponding to all other functions remains unchanged. This is equivalent to the update

(12) $z_{i_t}^{t+1} = x^{t+1}, \qquad z_i^{t+1} = z_i^t \quad \text{for all } i \neq i_t.$

The update in (12) shows that we store the updated variable $x^{t+1}$ as the variable associated with the function $f_{i_t}$, while the iterates of the remaining functions remain unchanged. Likewise, we update the table of gradients by substituting the old gradient $\nabla f_{i_t}(z_{i_t}^t)$ corresponding to the function $f_{i_t}$ with the gradient $\nabla f_{i_t}(x^{t+1})$ evaluated at the new iterate $x^{t+1}$. The rest of the gradients stored in the memory stay unchanged, i.e.,

(13) $\nabla f_{i_t}(z_{i_t}^{t+1}) = \nabla f_{i_t}(x^{t+1}), \qquad \nabla f_i(z_i^{t+1}) = \nabla f_i(z_i^t) \quad \text{for all } i \neq i_t.$

To update the curvature information, it would be ideal to compute the Hessian $\nabla^2 f_{i_t}(x^{t+1})$ and update the curvature information following the schemes for variables in (12) and gradients in (13). However, our focus is on applications in which the computation of the Hessian is either impossible or computationally expensive. Hence, to update the curvature approximation matrix $B_{i_t}^t$ corresponding to the function $f_{i_t}$, we use the BFGS step in (5). To do so, we define the variable and gradient variations associated with each individual function $f_i$ as

(14) $s_i^t := z_i^{t+1} - z_i^t, \qquad y_i^t := \nabla f_i(z_i^{t+1}) - \nabla f_i(z_i^t),$

respectively. Hence, the Hessian approximation $B_{i_t}^t$ corresponding to the function $f_{i_t}$ can be updated based on the update of BFGS as

(15) $B_i^{t+1} = B_i^t + \frac{y_i^t y_i^{tT}}{y_i^{tT} s_i^t} - \frac{B_i^t s_i^t s_i^{tT} B_i^t}{s_i^{tT} B_i^t s_i^t}, \qquad \text{for } i = i_t.$

Again, the Hessian approximation matrices for all other functions remain unchanged, i.e., $B_i^{t+1} = B_i^t$ for $i \neq i_t$. The system of updates in (12)-(15) explains the mechanism of updating the information of the function $f_{i_t}$ at step $t$. Notice that to update the Hessian approximation matrix for the $i_t$-th function there is no need to store the variations in (14), since the old variables $z_{i_t}^t$ and $\nabla f_{i_t}(z_{i_t}^t)$ are available in the memory and their updated versions $z_{i_t}^{t+1} = x^{t+1}$ and $\nabla f_{i_t}(z_{i_t}^{t+1}) = \nabla f_{i_t}(x^{t+1})$ are evaluated at step $t$; see Fig. 1 for more details.

Note that with the cyclic update scheme, the set of iterates $\{z_1^t, z_2^t, \ldots, z_n^t\}$ is equal to the set $\{x^t, x^{t-1}, \ldots, x^{t-n+1}\}$. Therefore, the set of variables used in the update of IQN is the set that contains the last $n$ iterates. This shows that the update of IQN in (9) uses the information of all the functions $f_1, \ldots, f_n$ to compute the updated variable $x^{t+1}$; however, it uses delayed variables, gradients, and Hessian approximations, unlike classic quasi-Newton methods, which use the updated variable $x^{t+1}$ for all functions. The use of delay allows IQN to update the information of a single function at each iteration, thus reducing the computational complexity relative to classic quasi-Newton methods.

Although the update in (9) is helpful in understanding the rationale behind the IQN method, it cannot be implemented at a low computational cost, since it requires computing the sums $\sum_{i=1}^{n} B_i^t$, $\sum_{i=1}^{n} B_i^t z_i^t$, and $\sum_{i=1}^{n} \nabla f_i(z_i^t)$ as well as the inverse $(\sum_{i=1}^{n} B_i^t)^{-1}$. In the following section, we introduce an efficient implementation of the IQN method that has a computational complexity of $O(p^2)$.


3.1. Efficient implementation of IQN. To see that the updating scheme in (9) requires evaluation of only a single gradient and Hessian approximation matrix per iteration, consider writing the update as

(16) $x^{t+1} = (B^t)^{-1}(u^t - g^t),$

where we define $B^t := \sum_{i=1}^{n} B_i^t$ as the aggregate Hessian approximation, $u^t := \sum_{i=1}^{n} B_i^t z_i^t$ as the aggregate Hessian-variable product, and $g^t := \sum_{i=1}^{n} \nabla f_i(z_i^t)$ as the aggregate gradient. Then, given that at step $t$ only the single index $i_t$ is updated, we can evaluate these variables for step $t+1$ as

(17) $B^{t+1} = B^t + \left(B_{i_t}^{t+1} - B_{i_t}^t\right),$

(18) $u^{t+1} = u^t + \left(B_{i_t}^{t+1} z_{i_t}^{t+1} - B_{i_t}^t z_{i_t}^t\right),$

(19) $g^{t+1} = g^t + \left(\nabla f_{i_t}(z_{i_t}^{t+1}) - \nabla f_{i_t}(z_{i_t}^t)\right).$

Thus, only $B_{i_t}^{t+1}$ and $\nabla f_{i_t}(z_{i_t}^{t+1})$ are required to be computed at step $t$.

Although the updates in (17)-(19) have low computational complexity, the update in (16) requires computing $(B^t)^{-1}$, which has a computational complexity of $O(p^3)$. This inversion can be avoided by simplifying the update in (17) as

(20) $B^{t+1} = B^t + \frac{y_{i_t}^t y_{i_t}^{tT}}{y_{i_t}^{tT} s_{i_t}^t} - \frac{B_{i_t}^t s_{i_t}^t s_{i_t}^{tT} B_{i_t}^t}{s_{i_t}^{tT} B_{i_t}^t s_{i_t}^t}.$

To derive the expression in (20) we have substituted the difference $B_{i_t}^{t+1} - B_{i_t}^t$ by its rank-two expression in (15). Given the matrix $(B^t)^{-1}$, by applying the Sherman-Morrison formula twice to the update in (20) we can compute $(B^{t+1})^{-1}$ as

(21) $(B^{t+1})^{-1} = U^t + \frac{U^t (B_{i_t}^t s_{i_t}^t)(B_{i_t}^t s_{i_t}^t)^T U^t}{s_{i_t}^{tT} B_{i_t}^t s_{i_t}^t - (B_{i_t}^t s_{i_t}^t)^T U^t (B_{i_t}^t s_{i_t}^t)},$

where the matrix $U^t$ is evaluated as

(22) $U^t = (B^t)^{-1} - \frac{(B^t)^{-1} y_{i_t}^t y_{i_t}^{tT} (B^t)^{-1}}{y_{i_t}^{tT} s_{i_t}^t + y_{i_t}^{tT} (B^t)^{-1} y_{i_t}^t}.$

The computational complexity of the updates in (21) and (22) is of the order $O(p^2)$, rather than the $O(p^3)$ cost of computing the inverse directly. Therefore, the overall cost of IQN is of the order $O(p^2)$ per iteration, which is substantially lower than the $O(np + p^2)$ cost of deterministic quasi-Newton methods.
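The bookkeeping in (14)-(22) can be written compactly as follows. This is a hedged sketch rather than the authors' implementation: the function signature, the choice to return the updated quantities rather than mutate tables in place, and all names are ours.

```python
import numpy as np

def iqn_bookkeeping(Binv, u, g, B_i, z_i, grad_i, z_new, grad_new):
    """O(p^2) update of the IQN aggregates for the cyclically chosen index i_t.

    Binv:      current (B^t)^{-1} with B^t = sum_i B_i^t
    u, g:      aggregates sum_i B_i^t z_i^t and sum_i grad f_i(z_i^t)
    B_i, z_i, grad_i: stored B_{i_t}^t, z_{i_t}^t, grad f_{i_t}(z_{i_t}^t)
    z_new, grad_new:  x^{t+1} and grad f_{i_t}(x^{t+1})
    Returns the updated (Binv, u, g, B_i).
    """
    s = z_new - z_i                       # variable variation (14)
    y = grad_new - grad_i                 # gradient variation (14)

    Bs = B_i @ s                          # BFGS update (15) of the i_t-th block
    B_i_new = B_i + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

    u = u + B_i_new @ z_new - B_i @ z_i   # aggregate updates (18) and (19)
    g = g + grad_new - grad_i

    By = Binv @ y                         # two Sherman-Morrison steps, as in (22) and (21)
    U = Binv - np.outer(By, By) / (y @ s + y @ By)
    UBs = U @ Bs
    Binv = U + np.outer(UBs, UBs) / (s @ Bs - Bs @ UBs)

    return Binv, u, g, B_i_new
```

With these quantities at hand, the next iterate is simply `x_new = Binv @ (u - g)`, matching (16).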

The complete IQN algorithm is outlined in Algorithm 1. Beginning with the initial variable $x^0$ and the gradient and Hessian estimates $\nabla f_i(x^0)$ and $B_i^0$ for all $i$, each variable copy $z_i^0$ is set to $x^0$ in Step 1 and initial values are set for $u^0$, $g^0$, and $(B^0)^{-1}$ in Step 2. For all $t$, in Step 4 the index $i_t$ of the next function to update is selected cyclically. The variable $x^{t+1}$ is computed according to the update in (16) in Step 5. In Step 6, the variable variation $s_{i_t}^{t+1}$ and gradient variation $y_{i_t}^{t+1}$ are evaluated as in (14) to compute the BFGS matrix $B_{i_t}^{t+1}$ from the update in (15). This information, as well as the updated variable and its gradient, is used in Step 7 to update $u^{t+1}$ and $g^{t+1}$ as in (18) and (19), respectively. The inverse matrix $(B^{t+1})^{-1}$ is also computed by following the expressions in (21) and (22). Finally, in Step 8, we update the variable, gradient, and Hessian approximation tables based on the policies in (12), (13), and (15), respectively.


Algorithm 1 Incremental Quasi-Newton (IQN) method

Require: $x^0$, $\{\nabla f_i(x^0)\}_{i=1}^{n}$, $\{B_i^0\}_{i=1}^{n}$
1: Set $z_1^0 = \cdots = z_n^0 = x^0$
2: Set $(B^0)^{-1} = (\sum_{i=1}^{n} B_i^0)^{-1}$, $u^0 = \sum_{i=1}^{n} B_i^0 x^0$, $g^0 = \sum_{i=1}^{n} \nabla f_i(x^0)$
3: for $t = 0, 1, 2, \ldots$ do
4:   Set $i_t = (t \bmod n) + 1$
5:   Compute $x^{t+1} = (B^t)^{-1}(u^t - g^t)$ [cf. (16)]
6:   Compute $s_{i_t}^{t+1}$, $y_{i_t}^{t+1}$ [cf. (14)], and $B_{i_t}^{t+1}$ [cf. (15)]
7:   Update $u^{t+1}$ [cf. (18)], $g^{t+1}$ [cf. (19)], and $(B^{t+1})^{-1}$ [cf. (21), (22)]
8:   Update the functions' information tables as in (12), (13), and (15)
9: end for
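For completeness, the following self-contained sketch runs Algorithm 1 on a small synthetic quadratic finite sum. It is an illustration under our own choices (the test functions, the curvature safeguard, and all names are assumptions, not part of the paper), and for readability it recomputes the aggregates directly at $O(np^2)$ per step instead of using the $O(p^2)$ bookkeeping of Section 3.1.

```python
import numpy as np

# Synthetic finite sum: f_i(x) = 0.5 x^T A_i x + b_i^T x, with well-conditioned A_i.
rng = np.random.default_rng(0)
n, p = 50, 4
A = np.stack([np.diag(rng.uniform(1.0, 1.5, size=p)) for _ in range(n)])
b = rng.standard_normal((n, p))
grad = lambda i, x: A[i] @ x + b[i]
x_star = np.linalg.solve(A.mean(axis=0), -b.mean(axis=0))   # exact minimizer, for reference

x = np.zeros(p)
z = np.tile(x, (n, 1))                          # per-function iterates z_i^0 = x^0 (Step 1)
G = np.stack([grad(i, x) for i in range(n)])    # stored gradients
B = np.stack([np.eye(p) for _ in range(n)])     # Hessian approximations B_i^0

for t in range(10 * n):                         # 10 effective passes over the data
    i = t % n                                   # cyclic index i_t (Step 4)
    B_bar = B.mean(axis=0)
    u_bar = np.einsum('ijk,ik->ij', B, z).mean(axis=0)
    g_bar = G.mean(axis=0)
    x = np.linalg.solve(B_bar, u_bar - g_bar)   # update (9)/(16) (Step 5)

    s, y = x - z[i], grad(i, x) - G[i]          # variations (14) (Step 6)
    if abs(y @ s) > 1e-12:                      # skip negligible curvature pairs (safeguard)
        Bs = B[i] @ s
        B[i] = B[i] + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)   # (15)
    z[i], G[i] = x, grad(i, x)                  # table updates (12)-(13) (Step 8)

print(np.linalg.norm(x - x_star))               # residual after 10 passes
```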

4. Convergence Analysis. In this section, we study the convergence rate of the proposed IQN method. We first establish its local linear convergence rate, then demonstrate limit properties of the Hessian approximations, and finally show that in a region local to the optimal point the sequence of residuals converges at a superlinear rate. To prove these results we make two main assumptions, both of which are standard in the analysis of quasi-Newton methods.

Assumption 1. There exist positive constants $0 < \mu \le L$ such that, for all $i$ and $x, \hat{x} \in \mathbb{R}^p$, we can write

(23) $\mu\|x - \hat{x}\|^2 \le (\nabla f_i(x) - \nabla f_i(\hat{x}))^T (x - \hat{x}) \le L\|x - \hat{x}\|^2.$

Assumption 2. There exists a positive constant $0 < L$ such that, for all $i$ and $x, \hat{x} \in \mathbb{R}^p$, we can write

(24) $\|\nabla^2 f_i(x) - \nabla^2 f_i(\hat{x})\| \le L\|x - \hat{x}\|.$

The lower bound in (23) implies that the functions $f_i$ are strongly convex with constant $\mu$, and the upper bound shows that the gradients $\nabla f_i$ are Lipschitz continuous with parameter $L$.

The condition in Assumption 2 states that the Hessians $\nabla^2 f_i$ are Lipschitz continuous with constant $L$. This assumption is commonly made in the analyses of Newton's method [21] and quasi-Newton algorithms [4, 22, 12]. According to Lemma 3.1 in [4], Lipschitz continuity of the Hessians with constant $L$ implies that for $i = 1, \ldots, n$ and arbitrary vectors $x, \hat{x}, \tilde{x} \in \mathbb{R}^p$ we can write

(25) $\left\|\nabla^2 f_i(\tilde{x})(x - \hat{x}) - \left(\nabla f_i(x) - \nabla f_i(\hat{x})\right)\right\| \le L\|x - \hat{x}\|\max\left\{\|x - \tilde{x}\|, \|\hat{x} - \tilde{x}\|\right\}.$

We use the inequality in (25) in the process of proving the convergence of IQN.

The goal of BFGS quasi-Newton methods is to approximate the objective function Hessian using first-order information. Likewise, in the incremental BFGS method, we aim to show that the Hessian approximation matrices for all functions $f_1, \ldots, f_n$ are close to the exact Hessian. In the following lemma, we study the difference between the $i$-th optimal Hessian matrix $\nabla^2 f_i(x^*)$ and its approximation $B_i^t$ over time.

Lemma 1. Consider the proposed IQN method in (9). Further, let $i$ be the index of the updated function at step $t$, i.e., $i = i_t$. Define the residual sequence for function $f_i$ as $\sigma_i^t := \max\{\|z_i^{t+1} - x^*\|, \|z_i^t - x^*\|\}$ and set $M = \nabla^2 f_i(x^*)^{-1/2}$. If Assumptions 1 and 2 hold and the condition $\sigma_i^t < 2m/(3L)$ is satisfied, then

(26) $\left\|B_i^{t+1} - \nabla^2 f_i(x^*)\right\|_M \le \left[\left(1 - \alpha\theta_i^{t\,2}\right)^{1/2} + \alpha_3\sigma_i^t\right]\left\|B_i^t - \nabla^2 f_i(x^*)\right\|_M + \alpha_4\sigma_i^t,$

where $\alpha$, $\alpha_3$, and $\alpha_4$ are some positive bounded constants and

(27) $\theta_i^t = \frac{\|M(B_i^t - \nabla^2 f_i(x^*)) s_i^t\|}{\|B_i^t - \nabla^2 f_i(x^*)\|_M \, \|M^{-1} s_i^t\|}$ for $B_i^t \neq \nabla^2 f_i(x^*)$, and $\theta_i^t = 0$ for $B_i^t = \nabla^2 f_i(x^*)$.

Proof. See Appendix A.

The result in (26) establishes an upper bound for the weighted norm $\|B_i^{t+1} - \nabla^2 f_i(x^*)\|_M$ with respect to its previous value $\|B_i^t - \nabla^2 f_i(x^*)\|_M$ and the sequence $\sigma_i^t := \max\{\|z_i^{t+1} - x^*\|, \|z_i^t - x^*\|\}$, when the variables are in a neighborhood of the optimal solution such that $\sigma_i^t < m/(3L)$. Indeed, the result in (26) holds only for the index $i = i_t$, and for the rest of the indices we have $\|B_i^{t+1} - \nabla^2 f_i(x^*)\|_M = \|B_i^t - \nabla^2 f_i(x^*)\|_M$ simply by definition of the cyclic update. Note that if the residual sequence $\sigma_i^t$ associated with $f_i$ approaches zero, we can simplify (26) as

(28) $\|B_i^{t+1} - \nabla^2 f_i(x^*)\|_M \lesssim \left(1 - \alpha\theta_i^{t\,2}\right)^{1/2}\|B_i^t - \nabla^2 f_i(x^*)\|_M.$

The equation in (28) implies that if $\theta_i^t$ is always strictly larger than zero, the sequence $\|B_i^{t+1} - \nabla^2 f_i(x^*)\|_M$ approaches zero. If not, then the sequence $\theta_i^t$ converges to zero, which implies the Dennis-Moré condition from (6), i.e.,

(29) $\lim_{t\to\infty} \frac{\|(B_i^t - \nabla^2 f_i(x^*)) s_i^t\|}{\|s_i^t\|} = 0.$

Therefore, under both conditions the result in (29) holds. This is true since the limit $\lim_{t\to\infty} \|B_i^{t+1} - \nabla^2 f_i(x^*)\|_M = 0$ yields the result in (29).

From this observation, we can recover the Dennis-Moré condition in (29) for all functions $f_i$ if the sequence $\sigma_i^t$ converges to zero. Hence, we proceed to show that the sequence $\|z_i^t - x^*\|$ is linearly convergent for each $i \in \{1, \ldots, n\}$. To achieve this goal we first prove an upper bound for the error $\|x^{t+1} - x^*\|$ of IQN in the following lemma.

Lemma 2. Consider the proposed IQN method in (9). If the conditions in Assumptions 1 and 2 hold, then the sequence of iterates generated by IQN satisfies the inequality

(30) $\|x^{t+1} - x^*\| \le \frac{L\Gamma^t}{n}\sum_{i=1}^{n}\left\|z_i^t - x^*\right\|^2 + \frac{\Gamma^t}{n}\sum_{i=1}^{n}\left\|\left(B_i^t - \nabla^2 f_i(x^*)\right)\left(z_i^t - x^*\right)\right\|,$

where $\Gamma^t := \|((1/n)\sum_{i=1}^{n} B_i^t)^{-1}\|$.

Proof. See Appendix B.

Lemma 2 shows that the residual $\|x^{t+1} - x^*\|$ is bounded above by a sum of quadratic and linear terms of the last $n$ residuals. This can eventually lead to a superlinear convergence rate by establishing that the linear terms converge to zero at a fast rate, leaving us with an upper bound of quadratic terms only. First, however, we establish a local linear convergence rate in the proceeding lemma to show that the sequence $\sigma_i^t$ converges to zero.


Lemma 3. Consider the proposed IQN method in (9). If Assumptions 1 and 2 hold, then, for any $r \in (0,1)$ there are positive constants $\varepsilon(r)$ and $\delta(r)$ such that if $\|x^0 - x^*\| < \varepsilon(r)$ and $\|B_i^0 - \nabla^2 f_i(x^*)\|_M < \delta(r)$ for $M = \nabla^2 f_i(x^*)^{-1/2}$ and $i = 1, 2, \ldots, n$, the sequence of iterates generated by IQN satisfies

(31) $\|x^t - x^*\| \le r^{\lfloor (t-1)/n \rfloor + 1}\|x^0 - x^*\|.$

Moreover, the sequences of norms $\{\|B_i^t\|\}$ and $\{\|(B_i^t)^{-1}\|\}$ are uniformly bounded.

Proof. See Appendix C.

The result in Lemma 3 shows that the sequence of iterates generated by IQN has a local linear convergence rate after each pass over all functions. Consequently, we obtain that the $i$-th residual sequence $\sigma_i^t$ is linearly convergent for all $i$. Note that Lemma 3 can be considered as the extension of Theorem 3.2 in [4] to the incremental setting. Following the arguments in (28) and (29), we use the summability of the sequence $\sigma_i^t$ along with the result in Lemma 1 to prove the Dennis-Moré condition for all functions $f_i$.

Proposition 4. Consider the proposed IQN method in (9). Assume that the hypotheses in Lemmata 1 and 3 are satisfied. Then, for all $i = 1, \ldots, n$ we can show that

(32) $\lim_{t\to\infty} \frac{\|(B_i^t - \nabla^2 f_i(x^*)) s_i^t\|}{\|s_i^t\|} = 0.$

Proof. See Appendix D.

The statement in Proposition 4 indicates that for each function $f_i$ the Dennis-Moré condition holds. In traditional quasi-Newton methods the Dennis-Moré condition is sufficient to show that the method is superlinearly convergent. However, the same argument does not hold for the proposed IQN method, since we cannot recover the Dennis-Moré condition for the global objective function $f$ from the result in Proposition 4. In other words, the result in (32) does not imply the limit in (6) required in the superlinear convergence analysis of quasi-Newton methods. Therefore, here we pursue a different approach and seek to prove that the linear terms $(B_i^t - \nabla^2 f_i(x^*))(z_i^t - x^*)$ in (30) converge to zero at a superlinear rate, i.e., for all $i$ we can write $\lim_{t\to\infty} \|(B_i^t - \nabla^2 f_i(x^*))(z_i^t - x^*)\|/\|z_i^t - x^*\| = 0$. If we establish this result, it follows from the result in Lemma 2 that the sequence of residuals $\|x^t - x^*\|$ converges to zero superlinearly.

We continue the analysis of the proposed IQN method by establishing a generalized limit property that follows from the Dennis-Moré criterion in (6). In the following lemma, we leverage the local linear convergence of the iterates $x^t$ to show that the vector $z_i^t - x^*$ lies in the null space of $B_i^t - \nabla^2 f_i(x^*)$ as $t$ approaches infinity.

Lemma 5. Consider the proposed IQN method in (9). Assume that the hypotheses in Lemmata 1 and 3 are satisfied. As $t$ goes to infinity, the following holds for all $i$:

(33) $\lim_{t\to\infty} \frac{\|(B_i^t - \nabla^2 f_i(x^*))(z_i^t - x^*)\|}{\|z_i^t - x^*\|} = 0.$


Proof. See Appendix E.

The result in Lemma 5 can thus be used in conjunction with Lemma 2 to show that the residual $\|x^{t+1} - x^*\|$ is bounded by a sum of quadratic terms of previous residuals and a term that converges to zero at a fast rate. This leads us to the main result, namely the local superlinear convergence of the sequence of residuals, stated in the following theorem.

Theorem 6. Consider the proposed IQN method in (9). Suppose that the conditions in the hypotheses of Lemmata 1 and 3 are valid. Then, the sequence of residuals $\|x^t - x^*\|$ satisfies

(34) $\lim_{t\to\infty} \frac{\|x^t - x^*\|}{\left(\|x^{t-1} - x^*\| + \cdots + \|x^{t-n} - x^*\|\right)/n} = 0.$

Proof. See Appendix F.

The result in (34) shows a mean-superlinear convergence rate for the sequence of iterates generated by IQN. To be more precise, it shows that the ratio of the error at step $t$ to the average of the last $n$ errors converges to zero. This is not equivalent to the classic R-superlinear convergence of full-batch quasi-Newton methods, i.e., $\lim_{t\to\infty} \|x^{t+1} - x^*\|/\|x^t - x^*\| = 0$. This is not a drawback of our analysis; it is caused by the fact that in incremental methods we cannot show that the sequence of residuals is monotonically decreasing.

5. Related Works. Various methods have been studied in the literature to improve the performance of traditional full-batch optimization algorithms. The most famous method for reducing the computational complexity of gradient descent (GD) is stochastic gradient descent (SGD), which uses the gradient of a single randomly chosen function to approximate the full gradient [2]. The incremental gradient descent (IGD) method is similar to SGD except that the function is chosen in a cyclic routine. Both SGD and IGD suffer from a slow sublinear convergence rate because of the noise of the gradient approximation. Incremental aggregated methods, which use memory to aggregate the gradients of all $n$ functions, are successful in reducing the noise of the gradient approximation and achieve a linear convergence rate [26, 28, 9, 13]. The work in [26] suggests a random selection of functions, which leads to the stochastic average gradient method (SAG), while the works in [1, 11, 17] use a cyclic scheme for choosing the functions.

Moving beyond first order information, there have been stochastic quasi-Newton methods that approximate Hessian information [29, 18, 19, 20, 10]. All of these stochastic quasi-Newton methods reduce the computational cost of quasi-Newton methods by updating only a randomly chosen single gradient or small subset of gradients at each iteration. However, they are not able to recover the superlinear convergence rate of quasi-Newton methods [4, 22, 12]. The incremental Newton method (NIM) in [25] is the only incremental method shown to have a superlinear convergence rate; however, the Hessian is not always available or computationally feasible. Moreover, the implementation of NIM requires computing the inverse of the incremental aggregated Hessian, which has a computational complexity of the order $O(p^3)$.

6. Numerical Results. We proceed by simulating the performance of IQN on a variety of machine learning problems on both artificial and real datasets. We compare the performance of IQN against a collection of well known first order stochastic and incremental algorithms—namely SAG, SAGA, and IAG. To begin, we look at a simple quadratic program, also equivalent to the solution of a linear least squares estimation problem. Consider the objective function to be minimized,

(35) $x^* = \operatorname{argmin}_{x\in\mathbb{R}^p} f(x) := \operatorname{argmin}_{x\in\mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2} x^T A_i x + b_i^T x.$

We generate $A_i \in \mathbb{R}^{p\times p}$ as a random positive definite matrix and $b_i \in \mathbb{R}^p$ as a random vector for all $i$. In particular, we set the matrices $A_i := \operatorname{diag}\{a_i\}$ and generate random vectors $a_i$ with the first $p/2$ elements chosen from $[1, 10^{\xi/2}]$ and the last $p/2$ elements chosen from $[10^{-\xi/2}, 1]$. The parameter $\xi$ is used to manually set the condition number of the quadratic program in (35), ranging from $\xi = 1$ (i.e., small condition number $10^2$) to $\xi = 2$ (i.e., large condition number $10^4$). The vectors $b_i$ are chosen uniformly at random from the box $[0, 10^3]^p$. The variable dimension is set to $p = 10$ and the number of functions to $n = 1000$. Given that we focus on local convergence, we use a constant step size of $\eta = 1$ for the proposed IQN method while choosing the largest step size that allows the other methods to converge.
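A minimal sketch of this data-generation recipe is given below; it follows the construction just described, but the function name, the seeding, and the use of diagonal storage are our own choices rather than the authors' code.

```python
import numpy as np

def make_quadratic_problem(n=1000, p=10, xi=1, seed=0):
    """Synthetic finite-sum quadratic (35): f_i(x) = 0.5 x^T A_i x + b_i^T x,
    with A_i = diag(a_i), the first p/2 entries of a_i drawn from [1, 10^(xi/2)],
    the last p/2 from [10^(-xi/2), 1], and b_i drawn uniformly from [0, 10^3]^p."""
    rng = np.random.default_rng(seed)
    hi = rng.uniform(1.0, 10.0 ** (xi / 2), size=(n, p // 2))
    lo = rng.uniform(10.0 ** (-xi / 2), 1.0, size=(n, p - p // 2))
    a = np.concatenate([hi, lo], axis=1)        # store only the diagonals a_i
    b = rng.uniform(0.0, 1.0e3, size=(n, p))
    grad_i = lambda i, x: a[i] * x + b[i]       # gradient of the i-th component
    x_star = -b.mean(axis=0) / a.mean(axis=0)   # minimizer of the average (diagonal case)
    return a, b, grad_i, x_star
```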

In Figure 2 we present a simulation of the convergence path of the normalized error $\|x^t - x^*\|/\|x^0 - x^*\|$ for the quadratic program. In the left image, we show a sample simulation path for all methods on the quadratic problem with a small condition number. Step sizes of $\eta = 5\times 10^{-5}$, $\eta = 10^{-4}$, and $\eta = 10^{-6}$ were used for SAG, SAGA, and IAG, respectively. These step sizes are tuned to compare the best performance of these methods with IQN. The proposed method reaches an error of $10^{-10}$ after 10 passes through the data. Alternatively, SAGA reaches an error of $10^{-5}$ only after 30 passes, while SAG and IAG do not reach $10^{-5}$ after 40 passes.

In the right image of Figure 2, we repeat the same simulation but with a larger condition number. In this case, SAG uses stepsize $\eta = 2\times 10^{-4}$ while the others remain the same. Observe that while the performance of IQN does not degrade with the larger condition number, the first order methods all suffer large degradation. SAG, SAGA, and IAG reach after 40 passes a normalized error of $6.5\times 10^{-3}$, $5.5\times 10^{-2}$, and $9.6\times 10^{-1}$, respectively. It can be seen that IQN significantly outperforms the first order methods for both condition number sizes, with the outperformance increasing for the larger condition number. This is an expected result, as first order methods often do not perform well for ill conditioned problems.

6.1. Logistic regression. We proceed to numerically evaluate the performance of IQN relative to existing methods on the classification of handwritten digits in the MNIST database [14]. In particular, we solve the binary logistic regression problem. A logistic regression takes as inputs $n$ training feature vectors $u_i \in \mathbb{R}^p$ with associated labels $v_i \in \{-1, 1\}$ and outputs a linear classifier $x$ to predict the label of unknown feature vectors. For the digit classification problem, each feature vector $u_i$ represents a vectorized image and the label $v_i$ its label as one of two digits. We evaluate for any training sample $i$ the probability of a label $v_i = 1$ given image $u_i$ as $P(v = 1 \mid u) = 1/(1 + \exp(-u^T x))$. The classifier $x$ is chosen to be the vector which maximizes the log likelihood across all $n$ samples. Given $n$ images $u_i$ with associated labels $v_i$, the optimization problem for logistic regression is written as

(36) $x^* = \operatorname{argmin}_{x\in\mathbb{R}^p} f(x) := \operatorname{argmin}_{x\in\mathbb{R}^p} \frac{\lambda}{2}\|x\|^2 + \frac{1}{n}\sum_{i=1}^{n} \log\left[1 + \exp(-v_i u_i^T x)\right],$

where the first term is a regularization term parametrized by λ ≥ 0.
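A sketch of the component function and gradient used by incremental methods on (36) is given below; including the full regularizer in every component so that the average matches (36) is our reading of the formulation, and the numerically stable forms are our own choices.

```python
import numpy as np

def logistic_fi(x, u_i, v_i, lam):
    """Component f_i of (36): (lambda/2)||x||^2 + log(1 + exp(-v_i u_i^T x)).
    Every component carries the regularizer, so the average over i equals (36)."""
    margin = v_i * (u_i @ x)
    return 0.5 * lam * (x @ x) + np.logaddexp(0.0, -margin)

def logistic_grad_fi(x, u_i, v_i, lam):
    """Gradient of f_i: lam * x - v_i * u_i * sigma(-v_i u_i^T x),
    where sigma is the logistic function, written via tanh for numerical stability."""
    margin = v_i * (u_i @ x)
    sigma_neg = 0.5 * (1.0 + np.tanh(-0.5 * margin))   # sigma(-margin)
    return lam * x - v_i * u_i * sigma_neg
```

Since each such $f_i$ is $\lambda$-strongly convex with Lipschitz continuous gradients and Hessians, this problem satisfies Assumptions 1 and 2.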


[Figure 2 appears here.] Fig. 2. Convergence results of the proposed IQN method in comparison to SAG, SAGA, and IAG. In the left image, we present a sample convergence path of the normalized error on the quadratic program with a small condition number. In the right image, we show the convergence path for the quadratic program with a large condition number. In all cases, IQN provides significant improvement over first order methods, with the difference increasing for larger condition number.

[Figure 3 appears here.] Fig. 3. Convergence results for a sample convergence path for the logistic regression problem on classifying handwritten digits in the MNIST database. IQN substantially outperforms the first order methods.

For our simulations we select from the MNIST dataset $n = 1000$ images with dimension $p = 784$ labelled as one of the digits "0" or "8" and fix the regularization parameter as $\lambda = 1/n$ and the stepsize as $\eta = 0.01$ for all first order methods. In Figure 3 we present the convergence path of IQN relative to existing methods in terms of the norm of the gradient. As in the case of the quadratic program, IQN outperforms all gradient-based methods. IQN reaches a gradient magnitude of $4.8\times 10^{-8}$ after 60 passes through the data, while SAGA reaches only a magnitude of $7.4\times 10^{-5}$ (all other methods perform even worse). Further note that while the first order methods begin to level out after 60 passes, the IQN method continues to descend. These results demonstrate the effectiveness of IQN on a practical machine learning problem with real world data.

Appendix A. Proof of Lemma 1. To prove the claim in Lemma 1, we first prove the following lemma, which is based on the result in [4, Lemma 5.2].

Lemma 7. Consider the proposed IQN method in (9). Let $M$ be a nonsingular symmetric matrix such that

(37) $\|M y_i^t - M^{-1} s_i^t\| \le \beta \|M^{-1} s_i^t\|,$

for some $\beta \in [0, 1/3]$ and vectors $s_i^t$ and $y_i^t$ in $\mathbb{R}^p$ with $s_i^t \neq 0$. Consider $i$ as the index of the updated function at step $t$, i.e., $i = i_t$, and let $B_i^t$ be symmetric and computed according to the update in (15). Then, there exist positive constants $\alpha$, $\alpha_1$, and $\alpha_2$ such that, for any symmetric $A \in \mathbb{R}^{p\times p}$, we have

(38) $\|B_i^{t+n} - A\|_M \le \left[\left(1-\alpha\theta^2\right)^{1/2} + \alpha_1 \frac{\|M y_i^t - M^{-1} s_i^t\|}{\|M^{-1} s_i^t\|}\right] \|B_i^t - A\|_M + \alpha_2 \frac{\|y_i^t - A s_i^t\|}{\|M^{-1} s_i^t\|},$

where $\alpha = (1-2\beta)/(1-\beta^2) \in [3/8, 1]$, $\alpha_1 = 2.5(1-\beta)^{-1}$, $\alpha_2 = 2(1 + 2\sqrt{p})\|M\|_F$, and

(39) $\theta = \frac{\|M(B_i^t - A) s_i^t\|}{\|B_i^t - A\|_M \|M^{-1} s_i^t\|}$ for $B_i^t \neq A$, and $\theta = 0$ for $B_i^t = A$.

Proof. First note that the Hessian approximation $B_i^{t+n}$ is equal to $B_i^{t+1}$ if the function $f_i$ is updated at step $t$. Considering this observation and the result of Lemma 5.2 in [4], the claim in (38) follows.

The result in Lemma 7 provides an upper bound for the difference between the Hessian approximation matrix $B_i^{t+n}$ and any positive definite matrix $A$ with respect to the difference between the previous Hessian approximation $B_i^t$ and the matrix $A$. The interesting choice for the arbitrary matrix $A$ is the Hessian of the $i$-th function at the optimal argument, i.e., $A = \nabla^2 f_i(x^*)$, which allows us to capture the difference between the sequence of Hessian approximation matrices for function $f_i$ and the Hessian $\nabla^2 f_i(x^*)$ at the optimal argument. We proceed to use the result in Lemma 7 for $M = \nabla^2 f_i(x^*)^{-1/2}$ and $A = \nabla^2 f_i(x^*)$ to prove the claim in (26). To do so, we first need to show that the condition in (37) is satisfied. Note that according to the conditions in Assumptions 1 and 2 we can write

(40) $\frac{\|y_i^t - \nabla^2 f_i(x^*) s_i^t\|}{\|\nabla^2 f_i(x^*)^{1/2} s_i^t\|} \le \frac{L\|s_i^t\|\max\{\|z_i^t - x^*\|, \|z_i^{t+1} - x^*\|\}}{\sqrt{m}\|s_i^t\|} = \frac{L}{\sqrt{m}}\sigma_i^t.$

This observation implies that the left hand side of the condition in (37) for $M = \nabla^2 f_i(x^*)^{-1/2}$ is bounded above by

(41) $\frac{\|M y_i^t - M^{-1} s_i^t\|}{\|M^{-1} s_i^t\|} \le \|\nabla^2 f_i(x^*)^{-1/2}\| \frac{\|y_i^t - \nabla^2 f_i(x^*) s_i^t\|}{\|\nabla^2 f_i(x^*)^{1/2} s_i^t\|} \le \frac{L}{m}\sigma_i^t.$

Thus, the condition in (37) is satisfied since $L\sigma_i^t/m < 1/3$. Replacing the upper bounds in (40) and (41) into the expression in (38) implies the claim in (26) with

(42) $\beta = \frac{L}{m}\sigma_i^t, \quad \alpha = \frac{1-2\beta}{1-\beta^2}, \quad \alpha_3 = \frac{5L}{2m(1-\beta)}, \quad \alpha_4 = \frac{2(1+2\sqrt{p})L}{\sqrt{m}}\|\nabla^2 f_i(x^*)^{-1/2}\|_F,$

and the proof is complete.

Appendix B. Proof of Lemma 2. Start by subtracting $x^*$ from both sides of (9) to obtain

(43) $x^{t+1} - x^* = \left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} B_i^t z_i^t - \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(z_i^t) - \frac{1}{n}\sum_{i=1}^{n} B_i^t x^*\right).$

As the gradient of $f$ at the optimal point is the vector zero, i.e., $(1/n)\sum_{i=1}^{n}\nabla f_i(x^*) = 0$, we can subtract $(1/n)\sum_{i=1}^{n}\nabla f_i(x^*)$ from the right hand side of (43) and rearrange terms to obtain

(44) $x^{t+1} - x^* = \left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\left(z_i^t - x^*\right) - \frac{1}{n}\sum_{i=1}^{n}\left(\nabla f_i(z_i^t) - \nabla f_i(x^*)\right)\right).$

The expression in (44) relates the residual at time $t+1$ to the previous $n$ residuals and the Hessian approximations $B_i^t$. To analyze this further, we can replace the approximations $B_i^t$ with the exact Hessians $\nabla^2 f_i(x^*)$ and the approximation differences $\nabla^2 f_i(x^*) - B_i^t$. To do so, we add and subtract $(1/n)\sum_{i=1}^{n}\nabla^2 f_i(x^*)\left(z_i^t - x^*\right)$ to the right hand side of (44) and rearrange terms to obtain

(45) $x^{t+1} - x^* = \left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\left[\nabla^2 f_i(x^*)\left(z_i^t - x^*\right) - \left(\nabla f_i(z_i^t) - \nabla f_i(x^*)\right)\right]\right) + \left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\left[B_i^t - \nabla^2 f_i(x^*)\right]\left(z_i^t - x^*\right)\right).$

We proceed to take the norms of both sides and use the triangle inequality to obtain an upper bound on the norm of the residual,

(46) $\|x^{t+1} - x^*\| \le \left\|\left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\right)^{-1}\right\| \frac{1}{n}\sum_{i=1}^{n}\left\|\nabla^2 f_i(x^*)\left(z_i^t - x^*\right) - \left(\nabla f_i(z_i^t) - \nabla f_i(x^*)\right)\right\| + \left\|\left(\frac{1}{n}\sum_{i=1}^{n} B_i^t\right)^{-1}\right\| \frac{1}{n}\sum_{i=1}^{n}\left\|\left[B_i^t - \nabla^2 f_i(x^*)\right]\left(z_i^t - x^*\right)\right\|.$

To obtain the quadratic term in (30) from the first term in (46), we use the Lipschitz continuity of the Hessians $\nabla^2 f_i$, which leads to the inequality

(47) $\left\|\nabla^2 f_i(x^*)\left(z_i^t - x^*\right) - \left(\nabla f_i(z_i^t) - \nabla f_i(x^*)\right)\right\| \le L\left\|z_i^t - x^*\right\|^2.$

Replacing the expression $\|\nabla^2 f_i(x^*)(z_i^t - x^*) - (\nabla f_i(z_i^t) - \nabla f_i(x^*))\|$ in (46) by the upper bound in (47), the claim in (30) follows.

Appendix C. Proof of Lemma 3. In this proof we use some steps from the proof of [4, Theorem 3.2]. To start, we use the fact that in a finite-dimensional vector space there always exists a constant $\eta > 0$ such that $\|A\| \le \eta\|A\|_M$. Let $\gamma = 1/m$ be an upper bound for the norm $\|\nabla^2 f(x^*)^{-1}\|$. Assume that $\varepsilon(r) = \varepsilon$ and $\delta(r) = \delta$ are chosen such that

(48) $\frac{(2\alpha_3\delta + \alpha_4)\varepsilon}{1-r} \le \delta \quad \text{and} \quad \gamma(1+r)\left[L\varepsilon + 2\eta\delta\right] \le r.$

Based on the assumption that ‖B0i −∇2fi(x

∗)‖M ≤ δ we can derive the upper bound‖B0

i − ∇2fi(x∗)‖ ≤ ηδ. This observation along with the inequality ‖∇2fi(x

∗)‖ ≤ Limplies that ‖B0

i ‖ ≤ ηδ + L. Therefore, we obtain ‖(1/n)∑ni=1 B0

i ‖ ≤ ηδ + L. Thesecond inequality in (48) implies that 2γ(1 + r)ηδ ≤ r. Based on this observation andthe inequalities ‖B0

i −∇2fi(x∗)‖ ≤ ηδ < 2ηδ and γ ≥ ‖∇2fi(x

∗)−1‖, we obtain fromBanach Lemma that ‖(B0

i )−1‖ ≤ (1+r)γ. Following the same argument for the matrix


((1/n)∑_{i=1}^n B_i^0)^{−1}, with the inequalities ‖(1/n)∑_{i=1}^n B_i^0 − (1/n)∑_{i=1}^n ∇²f_i(x∗)‖ ≤ (1/n)∑_{i=1}^n ‖B_i^0 − ∇²f_i(x∗)‖ ≤ ηδ and ‖∇²f(x∗)^{−1}‖ ≤ γ, we obtain that

‖((1/n) ∑_{i=1}^n B_i^0)^{−1}‖ ≤ (1 + r)γ.   (49)
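The Banach-lemma step behind (49) is easy to illustrate numerically. In the sketch below (random symmetric perturbation and an illustrative value of r; nothing here is taken from the paper beyond the inequality pattern), a positive definite matrix H standing in for (1/n)∑_i ∇²f_i(x∗) is perturbed by E with ‖E‖ = 2ηδ chosen so that 2γ(1 + r)ηδ ≤ r, and the inverse of the perturbed matrix indeed satisfies ‖(H + E)^{−1}‖ ≤ (1 + r)γ.

import numpy as np

rng = np.random.default_rng(2)
p, r = 6, 0.5
Q = rng.standard_normal((p, p))
H = Q @ Q.T + np.eye(p)                          # stands in for the average Hessian at x*
gamma = 1.0 / np.linalg.eigvalsh(H).min()        # gamma >= ||H^{-1}||
eta_delta = r / (2.0 * gamma * (1.0 + r))        # largest eta*delta allowed by the second condition in (48)

E = rng.standard_normal((p, p))
E = 0.5 * (E + E.T)
E *= 2.0 * eta_delta / np.linalg.norm(E, 2)      # symmetric perturbation with ||E|| = 2*eta*delta

assert np.linalg.norm(np.linalg.inv(H + E), 2) <= (1.0 + r) * gamma + 1e-12

The same perturbation argument applied matrix by matrix gives ‖(B_i^t)^{−1}‖ ≤ (1 + r)γ, which is how the analogous bound (55) is obtained later in this proof.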

This upper bound in conjunction with the result in (30) yields

‖x^1 − x∗‖ ≤ (1 + r)γ [ (L/n) ∑_{i=1}^n ‖z_i^0 − x∗‖² + (1/n) ∑_{i=1}^n ‖[B_i^0 − ∇²f_i(x∗)](z_i^0 − x∗)‖ ]
           = (1 + r)γ [ L‖x^0 − x∗‖² + (1/n) ∑_{i=1}^n ‖[B_i^0 − ∇²f_i(x∗)](x^0 − x∗)‖ ],   (50)

where the equality holds since z_i^0 = x^0 for all i.

Considering the assumptions that ‖x^0 − x∗‖ ≤ ε and ‖B_i^0 − ∇²f_i(x∗)‖ ≤ ηδ < 2ηδ, we can write

‖x^1 − x∗‖ ≤ (1 + r)γ[Lε + 2ηδ] ‖x^0 − x∗‖ ≤ r ‖x^0 − x∗‖.   (51)

Let's assume that i_0 = 1. Then, based on the result in (26), we obtain

‖B_1^1 − ∇²f_1(x∗)‖_M ≤ [ (1 − α(θ_1^0)²)^{1/2} + α_3 σ_1^0 ] ‖B_1^0 − ∇²f_1(x∗)‖_M + α_4 σ_1^0
                     ≤ (1 + α_3 ε)δ + α_4 ε ≤ δ + 2α_3 εδ + α_4 ε ≤ 2δ,   (52)

where the last inequality follows from the first condition in (48).

We proceed to the next iteration, which leads to the inequality

‖x^2 − x∗‖ ≤ (1 + r)γ [ (L/n) ∑_{i=1}^n ‖z_i^1 − x∗‖² + (1/n) ∑_{i=1}^n ‖[B_i^1 − ∇²f_i(x∗)](z_i^1 − x∗)‖ ]
           ≤ (1 + r)γ[Lε + 2ηδ] ( ((n − 1)/n) ‖x^0 − x∗‖ + (1/n) ‖x^1 − x∗‖ )
           ≤ r ( ((n − 1)/n) ‖x^0 − x∗‖ + (1/n) ‖x^1 − x∗‖ ) ≤ r ‖x^0 − x∗‖.   (53)

Since the updated index is i_1 = 2, we obtain

‖B_2^2 − ∇²f_2(x∗)‖_M ≤ [ (1 − α(θ_2^0)²)^{1/2} + α_3 σ_2^0 ] ‖B_2^0 − ∇²f_2(x∗)‖_M + α_4 σ_2^0
                     ≤ (1 + α_3 ε)δ + α_4 ε ≤ δ + 2α_3 εδ + α_4 ε ≤ 2δ.   (54)

With the same argument we can show that ‖B_t^t − ∇²f_t(x∗)‖_M ≤ 2δ and ‖x^t − x∗‖ ≤ ε hold for all iterates t = 1, . . . , n. Moreover, we have ‖x^t − x∗‖ ≤ r‖x^0 − x∗‖ for t = 1, . . . , n.

Now we use the results for the iterates t = 1, . . . , n as the base of our induction argument. To be more precise, let's assume that for the iterates t = jn + 1, jn + 2, . . . , jn + n


we know that the residuals are bounded above by ‖x^t − x∗‖ ≤ r^{j+1}‖x^0 − x∗‖ and the Hessian approximation matrices B_i^t satisfy the inequalities ‖B_i^t − ∇²f_i(x∗)‖ ≤ 2ηδ. Our goal is to show that for the iterates t = (j + 1)n + 1, (j + 1)n + 2, . . . , (j + 1)n + n the inequalities ‖x^t − x∗‖ ≤ r^{j+2}‖x^0 − x∗‖ and ‖B_i^t − ∇²f_i(x∗)‖ ≤ 2ηδ hold.

Based on the inequalities ‖B_i^t − ∇²f_i(x∗)‖ ≤ 2ηδ and ‖∇²f_i(x∗)^{−1}‖ ≤ γ, the same Banach lemma argument used for (49) shows that for all t = jn + 1, jn + 2, . . . , jn + n we can write

‖((1/n) ∑_{i=1}^n B_i^t)^{−1}‖ ≤ (1 + r)γ.   (55)

Using the result in (55) and the inequality in (30) for the iterate t = (j + 1)n + 1, we obtain

‖x^{(j+1)n+1} − x∗‖ ≤ (1 + r)γ (L/n) ∑_{i=1}^n ‖z_i^{(j+1)n} − x∗‖² + (1 + r)γ (1/n) ∑_{i=1}^n ‖[B_i^{(j+1)n} − ∇²f_i(x∗)](z_i^{(j+1)n} − x∗)‖.   (56)

Since the variables are updated in a cyclic fashion, the set of variables {z_i^{(j+1)n}}_{i=1}^{n} is equal to the set {x^{(j+1)n−i}}_{i=0}^{n−1}. By considering this relation and replacing the norms ‖[B_i^{(j+1)n} − ∇²f_i(x∗)](z_i^{(j+1)n} − x∗)‖ by their upper bounds 2ηδ‖z_i^{(j+1)n} − x∗‖, we can simplify the right hand side of (56) as

‖x^{(j+1)n+1} − x∗‖ ≤ (1 + r)γ [ (L/n) ∑_{i=1}^n ‖x^{jn+i} − x∗‖² + (2ηδ/n) ∑_{i=1}^n ‖x^{jn+i} − x∗‖ ].   (57)

Since ‖x^{jn+i} − x∗‖ ≤ ε for all i = 1, . . . , n, we obtain

‖x^{(j+1)n+1} − x∗‖ ≤ (1 + r)γ[Lε + 2ηδ] ( (1/n) ∑_{i=1}^n ‖x^{jn+i} − x∗‖ ).   (58)

According to the second inequality in (48) and the assumption that ‖x^t − x∗‖ ≤ r^{j+1}‖x^0 − x∗‖ for the iterates t = jn + 1, jn + 2, . . . , jn + n, we can replace the right hand side of (58) by the following upper bound

‖x^{(j+1)n+1} − x∗‖ ≤ r^{j+2} ‖x^0 − x∗‖.   (59)

Now we show that the updated Hessian approximation B_{i_t}^{(j+1)n+1} for t = (j + 1)n + 1 satisfies the inequality ‖B_{i_t}^{(j+1)n+1} − ∇²f_{i_t}(x∗)‖_M ≤ 2δ. According to the result in (26), we can write

‖B_{i_t}^{(j+1)n+1} − ∇²f_{i_t}(x∗)‖_M − ‖B_{i_t}^{jn+1} − ∇²f_{i_t}(x∗)‖_M ≤ α_3 σ_{i_t}^{jn+1} ‖B_{i_t}^{jn+1} − ∇²f_{i_t}(x∗)‖_M + α_4 σ_{i_t}^{jn+1}.   (60)

Now observe that σ_{i_t}^{jn+1} = max{‖x^{(j+1)n+1} − x∗‖, ‖x^{jn+1} − x∗‖} is bounded above by r^{j+1}‖x^0 − x∗‖. Applying this substitution into (60) and considering the conditions ‖B_{i_t}^{jn+1} − ∇²f_{i_t}(x∗)‖_M ≤ 2δ and ‖x^0 − x∗‖ ≤ ε leads to the inequality

‖B_{i_t}^{(j+1)n+1} − ∇²f_{i_t}(x∗)‖_M − ‖B_{i_t}^{jn+1} − ∇²f_{i_t}(x∗)‖_M ≤ r^{j+1} ε (2δα_3 + α_4).   (61)


By writing the expression in (61) for the previous iterations and summing the resulting inequalities recursively, we obtain that

‖B_{i_t}^{(j+1)n+1} − ∇²f_{i_t}(x∗)‖_M − ‖B_{i_t}^{0} − ∇²f_{i_t}(x∗)‖_M ≤ ε(2δα_3 + α_4) / (1 − r).   (62)

Based on the first inequality in (48), the right hand side of (62) is bounded above by δ. Moreover, the norm ‖B_{i_t}^{0} − ∇²f_{i_t}(x∗)‖_M is also upper bounded by δ. These two bounds imply that

‖B_{i_t}^{(j+1)n+1} − ∇²f_{i_t}(x∗)‖_M ≤ 2δ,   (63)

and consequently ‖B_{i_t}^{(j+1)n+1} − ∇²f_{i_t}(x∗)‖ ≤ 2ηδ. By following the steps from (56) to (63), we can show that for all iterates t = (j + 1)n + 1, (j + 1)n + 2, . . . , (j + 1)n + n the inequalities ‖x^t − x∗‖ ≤ r^{j+2}‖x^0 − x∗‖ and ‖B_i^t − ∇²f_i(x∗)‖ ≤ 2ηδ hold. Therefore, the induction is complete and the inequality in (31) holds. Moreover, the inequality ‖B_i^t − ∇²f_i(x∗)‖ ≤ 2ηδ holds for all i and all steps t. Therefore, the norms ‖B_i^t‖ and ‖(B_i^t)^{−1}‖, and consequently ‖(1/n)∑_{i=1}^n B_i^t‖ and ‖((1/n)∑_{i=1}^n B_i^t)^{−1}‖, are uniformly bounded.
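The mechanics of the induction can be reproduced with a scalar recursion. The sketch below (with arbitrary illustrative values of the cycle length n and of the contraction factor, written rho here) iterates the worst case of (58), e_{t+1} = rho · (average of the previous n residuals), and confirms the per-cycle bound e_t ≤ rho^{⌊(t−1)/n⌋+1} e_0 of the form established above.

import math

n, rho, T = 10, 0.5, 200        # illustrative cycle length and contraction factor
e = [1.0]                       # e_0 plays the role of ||x^0 - x*||
for t in range(T):
    window = [e[max(s, 0)] for s in range(t, t - n, -1)]   # residuals of the last n iterates
    e.append(rho * sum(window) / n)                        # worst case of (58)

for t in range(1, T + 1):
    assert e[t] <= rho ** (math.floor((t - 1) / n) + 1) * (1.0 + 1e-12)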

Appendix D. Proof of Theorem 4. According to the result in Lemma 3, we can show that the sequence of errors σ_i^t = max{‖z_i^{t+1} − x∗‖, ‖z_i^t − x∗‖} is summable for all i. To do so, consider the sum of the sequence σ_i^t, which is upper bounded by

∑_{t=0}^{∞} σ_i^t = ∑_{t=0}^{∞} max{‖z_i^{t+1} − x∗‖, ‖z_i^t − x∗‖} ≤ ∑_{t=0}^{∞} ‖z_i^{t+1} − x∗‖ + ∑_{t=0}^{∞} ‖z_i^t − x∗‖.   (64)

Note that the last time index i was chosen before time t belongs to the set {t − 1, . . . , t − n}. This observation, in association with the result in (31), implies that

∑_{t=0}^{∞} σ_i^t ≤ 2 ∑_{t=0}^{∞} r^{⌊(t−n−1)/n⌋+1} ‖x^0 − x∗‖ = 2 ∑_{t=0}^{∞} r^{⌊(t−1)/n⌋} ‖x^0 − x∗‖.   (65)

Simplifying the sum on the right hand side of (65) yields

∑_{t=0}^{∞} σ_i^t ≤ 2‖x^0 − x∗‖/r + 2n‖x^0 − x∗‖ ∑_{t=0}^{∞} r^t < ∞.   (66)
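The regrouping that turns (65) into (66) can be checked directly. The script below (illustrative values of r and n, not constants from the paper) compares a long partial sum of r^{⌊(t−1)/n⌋} with the closed form 1/r + n/(1 − r) implicit in (66).

import math

r, n, T = 0.7, 8, 20000                                   # illustrative constants
partial = sum(r ** math.floor((t - 1) / n) for t in range(T))
closed_form = 1.0 / r + n / (1.0 - r)                     # t = 0 contributes 1/r, each later block of n terms contributes n*r^j
assert abs(partial - closed_form) < 1e-9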

Thus, the sequence σ_i^t is summable for all i = 1, . . . , n. To complete the proof we use the following result, stated as Lemma 3.3 in [12].

Lemma 8. Let {φ_t} and {δ_t} be sequences of nonnegative numbers such that

φ_{t+1} ≤ (1 + δ_t) φ_t + δ_t   and   ∑_{t=1}^{∞} δ_t < ∞.   (67)

Then, the sequence {φ_t} converges.

Considering the results in Lemmata 1 and 8, and the fact that σ_i^t is summable, as

shown in (66), we obtain that the sequence ‖B_i^t − ∇²f_i(x∗)‖_M with M := ∇²f_i(x∗)^{−1/2} is convergent and the following limit exists,

lim_{t→∞} ‖∇²f_i(x∗)^{−1/2} B_i^t ∇²f_i(x∗)^{−1/2} − I‖_F = l,   (68)


where l is a nonnegative constant. Moreover, following the proof of Theorem 3.4 in [12], we can show that

α(θ_i^t)² ‖B_i^t − ∇²f_i(x∗)‖_M ≤ ‖B_i^t − ∇²f_i(x∗)‖_M − ‖B_i^{t+1} − ∇²f_i(x∗)‖_M + σ_i^t (α_3 ‖B_i^t − ∇²f_i(x∗)‖_M + α_4),   (69)

and, therefore, summing both sides implies,

∑_{t=0}^{∞} (θ_i^t)² ‖B_i^t − ∇²f_i(x∗)‖_M < ∞.   (70)

Replacing θ_i^t in (70) by its definition in (27) results in

∑_{t=0}^{∞} ‖M(B_i^t − ∇²f_i(x∗)) s_i^t‖² / ( ‖B_i^t − ∇²f_i(x∗)‖_M ‖M^{−1} s_i^t‖² ) < ∞.   (71)

Since the norm ‖B_i^t − ∇²f_i(x∗)‖_M is upper bounded and the eigenvalues of the matrix M = ∇²f_i(x∗)^{−1/2} are uniformly lower and upper bounded, we conclude from the result in (71) that

lim_{t→∞} ‖(B_i^t − ∇²f_i(x∗)) s_i^t‖² / ‖s_i^t‖² = 0,   (72)

which yields the claim in (32).
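The convergence mechanism supplied by Lemma 8 can also be seen on synthetic sequences. In the toy run below (arbitrary choices of φ_0 and of the summable perturbations δ_t, taken with equality in the recursion; these are not data from the paper), the sequence settles down and its successive increments become negligible, consistent with convergence of {φ_t}.

phi = 2.0                          # arbitrary phi_0
increments = []
for t in range(200):
    delta = 0.5 ** t               # summable perturbations, the sum over t equals 2
    new_phi = (1.0 + delta) * phi + delta
    increments.append(new_phi - phi)
    phi = new_phi
assert increments[-1] < 1e-12      # the sequence has essentially stopped moving, as Lemma 8 predicts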

Appendix E. Proof of Lemma 5. Consider the sets of variable variations S_1 = {s_i^{t+nτ}}_{τ=0}^{T} and S_2 = {s_i^{t+nτ}}_{τ=0}^{∞}. It is trivial to show that z_i^t − x∗ lies in the span of the set S_2, since the sequences of variables x^t and z_i^t converge to x∗ and we can write x∗ − z_i^t = ∑_{τ=0}^{∞} s_i^{t+nτ}. We proceed to show that the vector z_i^t − x∗ also lies in the span of the set S_1 when T is sufficiently large. To do so, we use a contradiction argument. Assume that the vector z_i^t − x∗ does not lie in the span of the set S_1; therefore, it can be decomposed as the sum of two vectors,

z_i^t − x∗ = v_∥^t + v_⊥^t,   (73)

where v_∥^t lies in the span of S_1 and v_⊥^t is a nonzero vector orthogonal to the span of S_1. Since we assume that z_i^t − x∗ does not lie in the span of S_1, the vector z_i^{t+nT} − x∗ also does not lie in this span, since z_i^{t+nT} − x∗ can be written as the sum z_i^{t+nT} − x∗ = z_i^t − x∗ + ∑_{τ=0}^{T} s_i^{t+nτ}.

These observations imply that we can also decompose the vector z_i^{t+nT} − x∗ as

z_i^{t+nT} − x∗ = v_∥^{t+nT} + v_⊥^{t+nT},   (74)

where v_∥^{t+nT} lies in the span of S_1 and v_⊥^{t+nT} is orthogonal to the span of S_1. Moreover, we obtain that v_⊥^{t+nT} is equal to v_⊥^t, i.e.,

v_⊥^{t+nT} = v_⊥^t.   (75)

This is true since z_i^{t+nT} − x∗ can be written as the sum of z_i^t − x∗ and a group of vectors that lie in the span of S_1. We assume that the norm ‖v_⊥^{t+nT}‖ = ‖v_⊥^t‖ = ε, where ε > 0 is a strictly positive constant. According to the linear convergence of the sequence ‖x^t − x∗‖ established in Lemma 3, we know that

‖z_i^{t+nT} − x∗‖ ≤ r^{⌊(t+nT−1)/n⌋+1} ‖x^0 − x∗‖ ≤ r^T ‖x^0 − x∗‖.   (76)


If we pick T large enough that r^T ‖x^0 − x∗‖ < ε, then we obtain ‖z_i^{t+nT} − x∗‖ < ε, which contradicts the assumption ‖v_⊥^{t+nT}‖ = ‖v_⊥^t‖ = ε, since the orthogonal component satisfies ‖v_⊥^{t+nT}‖ ≤ ‖z_i^{t+nT} − x∗‖. Thus, the vector z_i^t − x∗ lies in the span of the set S_1.

Since the vector z_i^t − x∗ is in the span of S_1, we can write the normalized vector (z_i^t − x∗)/‖z_i^t − x∗‖ as a linear combination of the set of normalized vectors {s_i^{t+nτ}/‖s_i^{t+nτ}‖}_{τ=0}^{T}. This property allows us to write

lim_{t→∞} ‖(B_i^t − H_i^∗)(z_i^t − x∗)‖ / ‖z_i^t − x∗‖ = lim_{t→∞} ‖(B_i^t − H_i^∗) (z_i^t − x∗)/‖z_i^t − x∗‖‖ = lim_{t→∞} ‖(B_i^t − H_i^∗) ∑_{τ=0}^{T} a_τ s_i^{t+nτ}/‖s_i^{t+nτ}‖‖,   (77)

where a_τ is the coefficient of the vector s_i^{t+nτ}/‖s_i^{t+nτ}‖ when we write (z_i^t − x∗)/‖z_i^t − x∗‖ as a linear combination of the normalized vectors {s_i^{t+nτ}/‖s_i^{t+nτ}‖}_{τ=0}^{T}. Since the time index of the difference B_i^t − H_i^∗ does not match that of the directions s_i^{t+nτ}, we add and subtract the term B_i^{t+nτ} to the expression B_i^t − H_i^∗ and use the triangle inequality to write

lim_{t→∞} ‖(B_i^t − H_i^∗)(z_i^t − x∗)‖ / ‖z_i^t − x∗‖ ≤ lim_{t→∞} [ ‖∑_{τ=0}^{T} a_τ (B_i^{t+nτ} − H_i^∗) s_i^{t+nτ}/‖s_i^{t+nτ}‖‖ + ‖∑_{τ=0}^{T} a_τ (B_i^t − B_i^{t+nτ}) s_i^{t+nτ}/‖s_i^{t+nτ}‖‖ ].   (78)

We first simplify the first term on the right hand side of (78). Using the triangle inequality and the result in Proposition 4, we can write

lim_{t→∞} ‖∑_{τ=0}^{T} a_τ (B_i^{t+nτ} − H_i^∗) s_i^{t+nτ}/‖s_i^{t+nτ}‖‖ ≤ lim_{t→∞} ∑_{τ=0}^{T} |a_τ| ‖(B_i^{t+nτ} − H_i^∗) s_i^{t+nτ}‖ / ‖s_i^{t+nτ}‖ = ∑_{τ=0}^{T} |a_τ| lim_{t→∞} ‖(B_i^{t+nτ} − H_i^∗) s_i^{t+nτ}‖ / ‖s_i^{t+nτ}‖ = 0.   (79)

Hence, based on the results in (78) and (79), to prove the claim in (33) it remains to show that

lim_{t→∞} ‖∑_{τ=0}^{T} a_τ (B_i^t − B_i^{t+nτ}) s_i^{t+nτ}/‖s_i^{t+nτ}‖‖ = 0.   (80)

To reach this goal, we first study the limit of the difference between the Hessian approximation matrices at two consecutive updates of function i, lim_{t→∞} ‖B_i^t − B_i^{t+n}‖. Note that if we set A = B_i^t in (38), we obtain

‖B_i^{t+n} − B_i^t‖_M ≤ α_2 ‖y_i^t − B_i^t s_i^t‖ / ‖M^{−1} s_i^t‖,   (81)

where M = (H_i^∗)^{−1/2}. By adding and subtracting the term H_i^∗ s_i^t and using the result in (32), we can show that the difference ‖B_i^{t+n} − B_i^t‖_M approaches zero asymptotically.


In particular,

lim_{t→∞} ‖B_i^{t+n} − B_i^t‖_M ≤ α_2 lim_{t→∞} ‖y_i^t − B_i^t s_i^t‖ / ‖M^{−1} s_i^t‖
                              ≤ α_2 lim_{t→∞} ‖y_i^t − H_i^∗ s_i^t‖ / ‖M^{−1} s_i^t‖ + α_2 lim_{t→∞} ‖(H_i^∗ − B_i^t) s_i^t‖ / ‖M^{−1} s_i^t‖.   (82)

Since ‖y_i^t − H_i^∗ s_i^t‖ is bounded above by L‖s_i^t‖ max{‖z_i^t − x∗‖, ‖z_i^{t+1} − x∗‖} and the eigenvalues of the matrix M are uniformly bounded, we obtain that the first limit on the right hand side of (82) converges to zero. Further, the result in (32) shows that the second limit on the right hand side of (82) also converges to zero. Therefore,

lim_{t→∞} ‖B_i^{t+n} − B_i^t‖_M = 0.   (83)

Following the same argument, we can show that for any two consecutive Hessian approximation matrices the difference approaches zero asymptotically. Thus, we obtain

lim_{t→∞} ‖B_i^t − B_i^{t+nτ}‖_M ≤ lim_{t→∞} ‖∑_{u=0}^{τ−1} (B_i^{t+nu} − B_i^{t+n(u+1)})‖_M ≤ ∑_{u=0}^{τ−1} lim_{t→∞} ‖B_i^{t+nu} − B_i^{t+n(u+1)}‖_M = 0.   (84)

Observing the result in (84), we can show that

lim_{t→∞} ‖∑_{τ=0}^{T} a_τ (B_i^t − B_i^{t+nτ}) s_i^{t+nτ}/‖s_i^{t+nτ}‖‖ ≤ ∑_{τ=0}^{T} |a_τ| lim_{t→∞} ‖(B_i^t − B_i^{t+nτ}) s_i^{t+nτ}‖ / ‖s_i^{t+nτ}‖ ≤ ∑_{τ=0}^{T} |a_τ| lim_{t→∞} ‖B_i^t − B_i^{t+nτ}‖ = 0.   (85)

Therefore, the result in (80) holds. The claim in (33) follows by combining the results in (78), (79), and (80).
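The geometric fact driving the contradiction in (73)-(76), namely that the component of z_i^t − x∗ orthogonal to span(S_1) can never exceed ‖z_i^t − x∗‖ in norm, can be illustrated with random data (all quantities below are synthetic stand-ins, not iterates of the algorithm):

import numpy as np

rng = np.random.default_rng(3)
p, T = 8, 3
S1 = rng.standard_normal((p, T + 1))             # columns play the role of s_i^{t+n*tau}, tau = 0,...,T
v = rng.standard_normal(p)                       # plays the role of z_i^t - x*

coeff, *_ = np.linalg.lstsq(S1, v, rcond=None)
v_par = S1 @ coeff                               # projection of v onto span(S1)
v_perp = v - v_par                               # orthogonal component

assert abs(v_par @ v_perp) < 1e-10               # the split in (73) is orthogonal
assert np.linalg.norm(v_perp) <= np.linalg.norm(v) + 1e-12   # ||v_perp|| cannot exceed ||z_i^t - x*||

Once ‖z_i^{t+nT} − x∗‖ is driven below ε by the linear rate in (76), a fixed orthogonal component of norm ε is therefore impossible, which is exactly the contradiction used above.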

Appendix F. Proof of Theorem 6. Consider the following result from Lemma 2,

‖x^{t+1} − x∗‖ ≤ (LΓ^t/n) ∑_{i=1}^n ‖z_i^t − x∗‖² + (Γ^t/n) ∑_{i=1}^n ‖(B_i^t − ∇²f_i(x∗))(z_i^t − x∗)‖.   (86)

Divide both sides of (86) by the average of the last n residuals, (1/n)∑_{i=1}^n ‖z_i^t − x∗‖, to obtain

‖x^{t+1} − x∗‖ / ((1/n)∑_{i=1}^n ‖z_i^t − x∗‖) ≤ LΓ^t ∑_{i=1}^n ‖z_i^t − x∗‖² / (∑_{i=1}^n ‖z_i^t − x∗‖) + Γ^t ∑_{i=1}^n ‖(B_i^t − ∇²f_i(x∗))(z_i^t − x∗)‖ / (∑_{i=1}^n ‖z_i^t − x∗‖).   (87)

Since each error ‖z_i^t − x∗‖ is a lower bound for the sum of errors ∑_{i=1}^n ‖z_i^t − x∗‖, we can replace the denominator of the i-th summand on the right hand side of (87) by ‖z_i^t − x∗‖,


which implies

‖x^{t+1} − x∗‖ / ((1/n)∑_{i=1}^n ‖z_i^t − x∗‖) ≤ LΓ^t ∑_{i=1}^n ‖z_i^t − x∗‖²/‖z_i^t − x∗‖ + Γ^t ∑_{i=1}^n ‖(B_i^t − ∇²f_i(x∗))(z_i^t − x∗)‖/‖z_i^t − x∗‖
                                            = LΓ^t ∑_{i=1}^n ‖z_i^t − x∗‖ + Γ^t ∑_{i=1}^n ‖(B_i^t − ∇²f_i(x∗))(z_i^t − x∗)‖/‖z_i^t − x∗‖.   (88)

Considering the fact that Γ^t is upper bounded, that the errors ‖z_i^t − x∗‖ converge to zero, and the limit in (33), computing the limit of both sides of (88) yields

lim_{t→∞} ‖x^{t+1} − x∗‖ / ((1/n)∑_{i=1}^n ‖z_i^t − x∗‖) = 0,   (89)

which yields the claim in (34).
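To close, the superlinear-type ratio in (89) can be observed on a toy instance. The following sketch is a simplified incremental quasi-Newton loop written from the aggregate update in (43) together with standard BFGS updates of the B_i; it is an illustration under convenient assumptions (quadratic components, initialization close to x∗ and to the true Hessians, a fixed cyclic order), not a faithful reimplementation of the paper's algorithm. The printed ratios ‖x^{t+1} − x∗‖ / ((1/n)∑_i ‖z_i^t − x∗‖) are expected to decay toward zero, in line with (89).

import numpy as np

rng = np.random.default_rng(4)
p, n = 5, 4
A, c = [], []
for _ in range(n):
    Q = rng.standard_normal((p, p))
    A.append(Q @ Q.T + 2.0 * np.eye(p))
    c.append(rng.standard_normal(p))
x_star = np.linalg.solve(sum(A), sum(A[i] @ c[i] for i in range(n)))   # minimizer of the average quadratic

def grad(i, x):
    return A[i] @ (x - c[i])

x0 = x_star + 0.1 * rng.standard_normal(p)                # local initialization
z = [x0.copy() for _ in range(n)]                         # z_i^0 = x^0
B = [A[i] + 0.1 * rng.standard_normal((p, p)) for i in range(n)]
B = [0.5 * (Bi + Bi.T) for Bi in B]                       # small symmetric perturbations of the true Hessians

for t in range(8 * n):
    i = t % n                                             # cyclic index selection
    B_bar = sum(B) / n
    x_new = np.linalg.solve(B_bar, sum(B[j] @ z[j] - grad(j, z[j]) for j in range(n)) / n)
    ratio = np.linalg.norm(x_new - x_star) / (sum(np.linalg.norm(z[j] - x_star) for j in range(n)) / n)
    if i == 0:
        print(f"cycle {t // n}: ratio = {ratio:.3e}")
    s, y = x_new - z[i], grad(i, x_new) - grad(i, z[i])
    if np.linalg.norm(s) > 1e-12:                         # BFGS update of B_i, skipped once converged
        Bs = B[i] @ s
        B[i] = B[i] + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
    z[i] = x_new

assert np.linalg.norm(x_new - x_star) < np.linalg.norm(x0 - x_star)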

REFERENCES

[1] D. Blatt, A. O. Hero, and H. Gauchman, A convergent incremental gradient method with a constant step size, SIAM Journal on Optimization, 18 (2007), pp. 29–51.

[2] L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT'2010, (2010), pp. 177–186.

[3] L. Bottou and Y. Le Cun, On-line learning for very large datasets, Applied Stochastic Models in Business and Industry, 21 (2005), pp. 137–151.

[4] C. G. Broyden, J. E. Dennis Jr., and J. J. Moré, On the local and superlinear convergence of quasi-Newton methods, IMA J. Appl. Math., 12 (1973), pp. 223–245.

[5] F. Bullo, J. Cortes, and S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms, Princeton University Press, 2009.

[6] R. H. Byrd, S. Hansen, J. Nocedal, and Y. Singer, A stochastic quasi-Newton method for large-scale optimization, SIAM Journal on Optimization, 26 (2016), pp. 1008–1031.

[7] Y. Cao, W. Yu, W. Ren, and G. Chen, An overview of recent progress in the study of distributed multi-agent coordination, IEEE Transactions on Industrial Informatics, 9 (2013), pp. 427–438.

[8] V. Cevher, S. Becker, and M. Schmidt, Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics, IEEE Signal Processing Magazine, 31 (2014), pp. 32–43.

[9] A. Defazio, F. R. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, 2014, pp. 1646–1654.

[10] R. M. Gower, D. Goldfarb, and P. Richtarik, Stochastic block BFGS: Squeezing more curvature out of data, arXiv preprint arXiv:1603.09649, (2016).

[11] M. Gurbuzbalaban, A. Ozdaglar, and P. Parrilo, On the convergence rate of incremental aggregated gradient algorithms, arXiv preprint arXiv:1506.02081, (2015).

[12] J. E. Dennis Jr. and J. J. Moré, A characterization of superlinear convergence and its application to quasi-Newton methods, Mathematics of Computation, 28 (1974), pp. 549–560.

[13] R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems 26, Lake Tahoe, Nevada, United States, 2013, pp. 315–323.

[14] Y. LeCun, C. Cortes, and C. J. Burges, MNIST handwritten digit database, AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, (2010).

[15] C. G. Lopes and A. H. Sayed, Diffusion least-mean squares over adaptive networks: Formulation and performance analysis, IEEE Transactions on Signal Processing, 56 (2008), pp. 3122–3136.

[16] A. Lucchi, B. McWilliams, and T. Hofmann, A variance reduced stochastic Newton method, arXiv preprint arXiv:1503.08316, (2015).

[17] A. Mokhtari, M. Gurbuzbalaban, and A. Ribeiro, Surpassing gradient descent provably: A cyclic incremental method with linear convergence rate, arXiv preprint arXiv:1611.00347, (2016).


[18] A. Mokhtari and A. Ribeiro, RES: Regularized stochastic BFGS algorithm, IEEE Transactions on Signal Processing, 62 (2014), pp. 6089–6104.

[19] A. Mokhtari and A. Ribeiro, Global convergence of online limited memory BFGS, Journal of Machine Learning Research, 16 (2015), pp. 3151–3181.

[20] P. Moritz, R. Nishihara, and M. I. Jordan, A linearly-convergent stochastic L-BFGS algorithm, in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pp. 249–258.

[21] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, vol. 87, Springer Science & Business Media, 2013.

[22] M. J. D. Powell, Some global convergence properties of a variable metric algorithm for minimization without exact line search, Academic Press, London, UK, 2 ed., 1971.

[23] A. Ribeiro, Ergodic stochastic optimization algorithms for wireless communication and networking, IEEE Transactions on Signal Processing, 58 (2010), pp. 6369–6386.

[24] A. Ribeiro, Optimal resource allocation in wireless communication and networking, EURASIP Journal on Wireless Communications and Networking, 2012 (2012), pp. 1–19.

[25] A. Rodomanov and D. Kropotov, A superlinearly-convergent proximal Newton-type method for the optimization of finite sums, in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 2597–2605.

[26] N. L. Roux, M. W. Schmidt, and F. R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in Advances in Neural Information Processing Systems 25, Lake Tahoe, Nevada, United States, 2012, pp. 2672–2680.

[27] I. Schizas, A. Ribeiro, and G. Giannakis, Consensus in ad hoc WSNs with noisy links - Part I: Distributed estimation of deterministic signals, IEEE Transactions on Signal Processing, 56 (2008), pp. 350–364.

[28] M. Schmidt, N. Le Roux, and F. Bach, Minimizing finite sums with the stochastic average gradient, Mathematical Programming, (2016), pp. 1–30, doi:10.1007/s10107-016-1030-6, http://dx.doi.org/10.1007/s10107-016-1030-6.

[29] N. N. Schraudolph, J. Yu, and S. Gunter, A stochastic quasi-Newton method for online convex optimization, in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, AISTATS 2007, San Juan, Puerto Rico, March 21-24, 2007, pp. 436–443.

[30] S. Shalev-Shwartz and N. Srebro, SVM optimization: Inverse dependence on training set size, in Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, 2008, pp. 928–935.