
Journal of Machine Learning Research 22 (2021) 1-63 Submitted 9/20; Revised 4/21; Published 5/21

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

Tian Tong [email protected]
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Cong Ma [email protected]
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA

Yuejie Chi [email protected]
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Editor: Qiang Liu

Abstract

Low-rank matrix estimation is a canonical problem that finds numerous applications in signal processing, machine learning, and imaging science. A popular approach in practice is to factorize the matrix into two compact low-rank factors, and then optimize these factors directly via simple iterative methods such as gradient descent and alternating minimization. Despite nonconvexity, recent literature has shown that these simple heuristics in fact achieve linear convergence when initialized properly for a growing number of problems of interest. However, upon closer examination, existing approaches can still be computationally expensive, especially for ill-conditioned matrices: the convergence rate of gradient descent depends linearly on the condition number of the low-rank matrix, while the per-iteration cost of alternating minimization is often prohibitive for large matrices.

The goal of this paper is to set forth a competitive algorithmic approach dubbed Scaled Gradient Descent (ScaledGD), which can be viewed as preconditioned or diagonally-scaled gradient descent, where the preconditioners are adaptive and iteration-varying with minimal computational overhead. With tailored variants for low-rank matrix sensing, robust principal component analysis, and matrix completion, we theoretically show that ScaledGD achieves the best of both worlds: it converges linearly at a rate independent of the condition number of the low-rank matrix, similar to alternating minimization, while maintaining the low per-iteration cost of gradient descent. Our analysis is also applicable to general loss functions that are restricted strongly convex and smooth over low-rank matrices. To the best of our knowledge, ScaledGD is the first algorithm that provably has such properties over a wide range of low-rank matrix estimation tasks. At the core of our analysis is the introduction of a new distance function that takes into account the preconditioners when measuring the distance between the iterates and the ground truth. Finally, numerical examples are provided to demonstrate the effectiveness of ScaledGD in accelerating the convergence of ill-conditioned low-rank matrix estimation in a wide range of applications.

Keywords: low-rank matrix factorization, scaled gradient descent, ill-conditioned matrix recovery, matrix sensing, robust PCA, matrix completion, general losses

©2021 Tian Tong, Cong Ma, Yuejie Chi.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-1067.html.


1. Introduction

Low-rank matrix estimation plays a critical role in fields such as machine learning, signal processing, imaging science, and many others. Broadly speaking, one aims to recover a rank-r matrix X⋆ ∈ R^{n1×n2} from a set of observations y = A(X⋆), where the operator A(·) models the measurement process. It is natural to minimize the least-squares loss function subject to a rank constraint:

\[
\operatorname*{minimize}_{X \in \mathbb{R}^{n_1 \times n_2}} \; f(X) := \frac{1}{2}\,\| \mathcal{A}(X) - y \|_2^2 \quad \text{subject to} \quad \operatorname{rank}(X) \le r, \tag{1}
\]

which is, however, computationally intractable in general due to the rank constraint. Moreover, as the size of the matrix increases, the costs involved in optimizing over the full matrix space (i.e. R^{n1×n2}) are prohibitive in terms of both memory and computation. To cope with these challenges, one popular approach is to parametrize X = LR^⊤ by two low-rank factors L ∈ R^{n1×r} and R ∈ R^{n2×r} that are more memory-efficient, and then to optimize over the factors instead:

\[
\operatorname*{minimize}_{L \in \mathbb{R}^{n_1 \times r},\, R \in \mathbb{R}^{n_2 \times r}} \; \mathcal{L}(L, R) := f(LR^\top). \tag{2}
\]

Although this leads to a nonconvex optimization problem over the factors, recent breakthroughs have shown that simple algorithms (e.g. gradient descent, alternating minimization), when properly initialized (e.g. via the spectral method), can provably converge to the true low-rank factors under mild statistical assumptions. These benign convergence guarantees hold for a growing number of problems such as low-rank matrix sensing, matrix completion, robust principal component analysis (robust PCA), phase synchronization, and so on.

However, upon closer examination, existing approaches such as gradient descent and alternating minimization are still computationally expensive, especially for ill-conditioned matrices. Take low-rank matrix sensing as an example: although the per-iteration cost is small, the iteration complexity of gradient descent scales linearly with respect to the condition number of the low-rank matrix X⋆ Tu et al. (2016); on the other end, while the iteration complexity of alternating minimization Jain et al. (2013) is independent of the condition number, each iteration requires inverting a linear system whose size is proportional to the dimension of the matrix, so the per-iteration cost is prohibitive for large-scale problems. These observations raise an important open question: can one design an algorithm with a per-iteration cost comparable to gradient descent, yet provably converging at a fast rate independent of the condition number, as alternating minimization does, for a wide variety of low-rank matrix estimation tasks?

1.1 Preconditioning helps: scaled gradient descent

In this paper, we answer this question affirmatively by studying the following scaled gradient descent (ScaledGD) algorithm to optimize (2). Given an initialization (L_0, R_0), ScaledGD proceeds as follows:

\[
\begin{aligned}
L_{t+1} &= L_t - \eta\, \nabla_L \mathcal{L}(L_t, R_t)\, (R_t^\top R_t)^{-1}, \\
R_{t+1} &= R_t - \eta\, \nabla_R \mathcal{L}(L_t, R_t)\, (L_t^\top L_t)^{-1},
\end{aligned} \tag{3}
\]

where η > 0 is the step size and ∇_L L(L_t, R_t) (resp. ∇_R L(L_t, R_t)) is the gradient of the loss function L with respect to the factor L_t (resp. R_t) at the t-th iteration. Compared to vanilla gradient descent, the search directions of the low-rank factors L_t, R_t in (3) are scaled by (R_t^⊤ R_t)^{-1} and (L_t^⊤ L_t)^{-1}, respectively. Intuitively, the scaling serves as a preconditioner, as in quasi-Newton type algorithms, with the hope of improving the quality of the search direction to allow larger step sizes. Since the computation of the Hessian is extremely expensive, it is necessary to design preconditioners that are both theoretically sound and practically cheap to compute. Such requirements are met by ScaledGD, where the preconditioners are computed by inverting two r × r matrices, whose size is much smaller than the dimension of the matrix factors. Therefore, each iteration of ScaledGD adds minimal overhead to the gradient computation and has order-wise the same per-iteration cost as gradient descent.


Moreover, the preconditioners are adaptive and iteration-varying. Another key property of ScaledGD is that it ensures the iterates are covariant with respect to the parameterization of the low-rank factors up to invertible transforms.
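To make the update concrete, the following is a minimal Python/numpy sketch of one ScaledGD step (3) for a generic loss; the function name and the use of linear solves in place of explicit inverses are our choices, not part of the paper.

```python
import numpy as np

def scaled_gd_step(L, R, grad_L, grad_R, eta):
    """One ScaledGD update (3): right-scale each factor's gradient by the
    inverse Gram matrix of the other factor."""
    # The Gram matrices are r x r, so forming and solving with them is cheap
    # compared with the gradient computation whenever r << n1, n2.
    L_new = L - eta * np.linalg.solve(R.T @ R, grad_L.T).T
    R_new = R - eta * np.linalg.solve(L.T @ L, grad_R.T).T
    return L_new, R_new
```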

While ScaledGD and its alternating variants have been proposed in Mishra et al. (2012); Mishra and Sepulchre (2016); Tanner and Wei (2016) for a subset of the problems we study, none of these prior works provides theoretical validation of the empirical success. In this work, we confirm theoretically that ScaledGD achieves linear convergence at a rate independent of the condition number of the matrix when initialized properly, e.g. using the standard spectral method, for several canonical problems: low-rank matrix sensing, robust PCA, and matrix completion. Table 1 summarizes the performance guarantees of ScaledGD in terms of both statistical and computational complexities, with comparisons to prior algorithms using the vanilla gradient method.

• Low-rank matrix sensing. As long as the measurement operator satisfies the standard restricted isometry property (RIP) with an RIP constant δ_{2r} ≲ 1/(√r κ), where κ is the condition number of X⋆, ScaledGD reaches ε-accuracy in O(log(1/ε)) iterations when initialized by the spectral method. This strictly improves the iteration complexity O(κ log(1/ε)) of gradient descent in Tu et al. (2016) under the same sample complexity requirement.

• Robust PCA. Under the deterministic corruption model Chandrasekaran et al. (2011), as long as the fraction α of corruptions per row / column satisfies α ≲ 1/(μ r^{3/2} κ), where μ is the incoherence parameter of X⋆, ScaledGD in conjunction with hard thresholding reaches ε-accuracy in O(log(1/ε)) iterations when initialized by the spectral method. This strictly improves the iteration complexity of projected gradient descent Yi et al. (2016).

• Matrix completion. Under the random Bernoulli observation model, as long as the sample complexity satisfies n1 n2 p ≳ (μκ² ∨ log n) μ n r² κ² with n = n1 ∨ n2, ScaledGD in conjunction with a properly designed projection operator reaches ε-accuracy in O(log(1/ε)) iterations when initialized by the spectral method. This improves the iteration complexity of projected gradient descent Zheng and Lafferty (2016) at the expense of requiring a larger sample size.

In addition, ScaledGD does not require any explicit regularization that balances the norms of the two low-rank factors, as required in Tu et al. (2016); Yi et al. (2016); Zheng and Lafferty (2016), and it removes the additional projection that maintains the incoherence properties in robust PCA Yi et al. (2016), thus unveiling the implicit regularization property of ScaledGD. To the best of our knowledge, this is the first factored gradient descent algorithm that achieves a fast convergence rate independent of the condition number of the low-rank matrix at near-optimal sample complexities without increasing the per-iteration computational cost. Our analysis is also applicable to general loss functions that are restricted strongly convex and smooth over low-rank matrices.

At the core of our analysis, we introduce a new distance metric (i.e. Lyapunov function) that accounts for the preconditioners, and carefully show the contraction of the ScaledGD iterates under the new distance metric. We expect that the ScaledGD algorithm can accelerate the convergence for other low-rank matrix estimation problems, as well as facilitate the design and analysis of other quasi-Newton first-order algorithms. As a teaser, Figure 1 illustrates the relative error of completing a 1000 × 1000 incoherent matrix of rank 10 with varying condition numbers from 20% of its entries, using either ScaledGD or vanilla GD with spectral initialization. Even for moderately ill-conditioned matrices, the convergence rate of vanilla GD slows down dramatically, while it is evident that ScaledGD converges at a rate independent of the condition number and therefore is much more efficient.

Remark 1 (ScaledGD for PSD matrices) When the low-rank matrix of interest is positive semi-definite (PSD), we factorize the matrix X ∈ R^{n×n} as X = LL^⊤, with L ∈ R^{n×r}. The update rule of ScaledGD simplifies to

\[
L_{t+1} = L_t - \eta\, \nabla_L \mathcal{L}(L_t)\, (L_t^\top L_t)^{-1}. \tag{4}
\]

We focus on the asymmetric case since the analysis is more involved with two factors; our theory applies to the PSD case without loss of generality.


| Algorithm | Sensing: sample complexity | Sensing: iteration complexity | Robust PCA: corruption fraction | Robust PCA: iteration complexity | Completion: sample complexity | Completion: iteration complexity |
|---|---|---|---|---|---|---|
| GD | nr²κ² | κ log(1/ε) | 1/(μr^{3/2}κ^{3/2} ∨ μrκ²) | κ log(1/ε) | (μ ∨ log n) μnr²κ² | κ log(1/ε) |
| ScaledGD (this paper) | nr²κ² | log(1/ε) | 1/(μr^{3/2}κ) | log(1/ε) | (μκ² ∨ log n) μnr²κ² | log(1/ε) |

Table 1: Comparisons of ScaledGD with gradient descent (GD) when tailored to various problems (with spectral initialization) Tu et al. (2016); Yi et al. (2016); Zheng and Lafferty (2016), where they have comparable per-iteration costs. Here, we say that the output X of an algorithm reaches ε-accuracy if it satisfies ‖X − X⋆‖_F ≤ ε σ_r(X⋆), n := n1 ∨ n2 = max{n1, n2}, and κ and μ are the condition number and incoherence parameter of X⋆.

[Figure 1 about here: relative error vs. iteration count.]

Figure 1: Performance of ScaledGD and vanilla GD for completing a 1000 × 1000 incoherent matrix of rank 10 with different condition numbers κ = 2, 10, 50, where each entry is observed independently with probability 0.2. Here, both methods are initialized via the spectral method. It can be seen that ScaledGD converges much faster than vanilla GD even for moderately large condition numbers.


1.2 Related work

Our work contributes to the growing literature on the design and analysis of provable nonconvex optimization procedures for high-dimensional signal estimation; see e.g. Jain and Kar (2017); Chen and Chi (2018); Chi et al. (2019) for recent overviews. A growing number of problems have been demonstrated to possess benign geometry that is amenable to optimization Mei et al. (2018), either globally or locally under appropriate statistical models. On one end, it is shown that there are no spurious local minima in the optimization landscape of matrix sensing and completion Ge et al. (2016); Bhojanapalli et al. (2016b); Park et al. (2017); Ge et al. (2017), phase retrieval Sun et al. (2018); Davis et al. (2017), dictionary learning Sun et al. (2015), kernel PCA Chen and Li (2019),


and linear neural networks Baldi and Hornik (1989); Kawaguchi (2016). Such landscape analysis facilitates the adoption of generic saddle-point escaping algorithms Nesterov and Polyak (2006); Ge et al. (2015); Jin et al. (2017) to ensure global convergence. However, the resulting iteration complexity is typically high. On the other end, local refinements with carefully-designed initializations often admit fast convergence, for example in phase retrieval Candès et al. (2015); Ma et al. (2019), matrix sensing Jain et al. (2013); Zheng and Lafferty (2015); Wei et al. (2016), matrix completion Sun and Luo (2016); Chen and Wainwright (2015); Ma et al. (2019); Chen et al. (2020a); Zheng and Lafferty (2016); Chen et al. (2020b), blind deconvolution Li et al. (2019); Ma et al. (2019), and robust PCA Netrapalli et al. (2014); Yi et al. (2016); Chen et al. (2020c), to name a few.

Existing approaches for asymmetric low-rank matrix estimation often require additional regularization terms to balance the two factors, either in the form of (1/2)‖L^⊤L − R^⊤R‖_F² Tu et al. (2016); Park et al. (2017) or (1/2)‖L‖_F² + (1/2)‖R‖_F² Zhu et al. (2018); Chen et al. (2020b,c), which ease the theoretical analysis but are often unnecessary for practical success, as long as the initialization is balanced. Some recent work studies unregularized gradient descent for low-rank matrix factorization and sensing, including Charisopoulos et al. (2021); Du et al. (2018); Ma et al. (2021). However, the iteration complexity of all these approaches scales at least linearly with respect to the condition number κ of the low-rank matrix, e.g. O(κ log(1/ε)), to reach ε-accuracy; therefore they converge slowly when the underlying matrix becomes ill-conditioned. In contrast, ScaledGD enjoys a local convergence rate of O(log(1/ε)), therefore incurring a much smaller computational footprint when κ is large. Last but not least, alternating minimization Jain et al. (2013); Hardt and Wootters (2014) (which alternately updates L_t and R_t) and singular value projection Netrapalli et al. (2014); Jain et al. (2010) (which operates in the matrix space) also converge at the rate O(log(1/ε)), but their per-iteration cost is much higher than that of ScaledGD. Another notable algorithm is the Riemannian gradient descent algorithm in Wei et al. (2016), which also converges at the rate O(log(1/ε)) under the same sample complexity for low-rank matrix sensing, but requires a higher memory complexity since it operates in the matrix space rather than the factor space.

From an algorithmic perspective, our approach is closely related to the alternating steepest descent (ASD) method in Tanner and Wei (2016) for low-rank matrix completion, which performs the proposed updates (3) for the low-rank factors in an alternating manner. Furthermore, the scaled gradient updates were also introduced in Mishra et al. (2012); Mishra and Sepulchre (2016) for low-rank matrix completion from the perspective of Riemannian optimization. However, none of Tanner and Wei (2016); Mishra et al. (2012); Mishra and Sepulchre (2016) offered statistical or computational guarantees for global convergence. Our analysis of ScaledGD can be viewed as providing justifications for these precursors. Moreover, we have systematically extended the framework of ScaledGD to a large number of low-rank matrix estimation tasks such as robust PCA.

1.3 Paper organization and notation

The rest of this paper is organized as follows. Section 2 describes the proposed ScaledGD method and details its application to low-rank matrix sensing, robust PCA, and matrix completion with theoretical guarantees in terms of both statistical and computational complexities, highlighting the role of a new distance metric. The convergence guarantee of ScaledGD under a general loss function is also presented. In Section 3, we outline the proof of our main results. Section 4 illustrates the excellent empirical performance of ScaledGD in a variety of low-rank matrix estimation problems. Finally, we conclude in Section 5.

Before continuing, we introduce the notation used throughout the paper. First of all, we use boldfaced symbols for vectors and matrices. For a vector v, we use ‖v‖_0 to denote its ℓ_0 counting norm, and ‖v‖_2 to denote the ℓ_2 norm. For any matrix A, we use σ_i(A) to denote its i-th largest singular value, and let A_{i,·} and A_{·,j} denote its i-th row and j-th column, respectively. In addition, ‖A‖_op, ‖A‖_F, ‖A‖_{1,∞}, ‖A‖_{2,∞}, and ‖A‖_∞ stand for the spectral norm (i.e. the largest singular value), the Frobenius norm, the ℓ_{1,∞} norm (i.e. the largest ℓ_1 norm of the rows), the ℓ_{2,∞} norm


(i.e. the largest ℓ_2 norm of the rows), and the entrywise ℓ_∞ norm (the largest magnitude of all entries) of a matrix A. We denote

\[
\mathcal{P}_r(A) := \mathop{\mathrm{argmin}}_{\tilde{A}\,:\, \operatorname{rank}(\tilde{A}) \le r} \| A - \tilde{A} \|_{\mathrm{F}}^2 \tag{5}
\]

as the rank-r approximation of A, which is given by the top-r SVD of A by the Eckart-Young-Mirsky theorem. We also use vec(A) to denote the vectorization of a matrix A. For matrices A, B of the same size, we use ⟨A, B⟩ = Σ_{i,j} A_{i,j} B_{i,j} = tr(A^⊤ B) to denote their inner product. The set of invertible matrices in R^{r×r} is denoted by GL(r). Let a ∨ b = max{a, b} and a ∧ b = min{a, b}. Throughout, f(n) ≲ g(n) or f(n) = O(g(n)) means |f(n)|/|g(n)| ≤ C for some constant C > 0 when n is sufficiently large; f(n) ≳ g(n) means |f(n)|/|g(n)| ≥ C for some constant C > 0 when n is sufficiently large. Last but not least, we use the terminology "with overwhelming probability" to denote that an event happens with probability at least 1 − c_1 n^{−c_2}, where c_1, c_2 > 0 are universal constants whose values may vary from line to line.

2. Scaled Gradient Descent for Low-Rank Matrix Estimation

This section is devoted to introducing ScaledGD and establishing its statistical and computational guarantees for various low-rank matrix estimation problems. Before we instantiate tailored versions of ScaledGD for concrete low-rank matrix estimation problems, we first pause to provide more insight into the update rule of ScaledGD by connecting it to the quasi-Newton method. Note that the update rule (3) of ScaledGD can be equivalently written in vectorized form as

\[
\operatorname{vec}(F_{t+1}) = \operatorname{vec}(F_t) - \eta
\begin{bmatrix}
(R_t^\top R_t)^{-1} \otimes I_{n_1} & 0 \\
0 & (L_t^\top L_t)^{-1} \otimes I_{n_2}
\end{bmatrix}
\operatorname{vec}(\nabla_F \mathcal{L}(F_t))
= \operatorname{vec}(F_t) - \eta\, H_t^{-1} \operatorname{vec}(\nabla_F \mathcal{L}(F_t)), \tag{6}
\]

where we denote F_t = [L_t^⊤, R_t^⊤]^⊤ ∈ R^{(n1+n2)×r}, and ⊗ denotes the Kronecker product. Here, the block diagonal matrix H_t is set to be

\[
H_t :=
\begin{bmatrix}
(R_t^\top R_t) \otimes I_{n_1} & 0 \\
0 & (L_t^\top L_t) \otimes I_{n_2}
\end{bmatrix}.
\]

The form (6) makes it apparent that ScaledGD can be interpreted as a quasi-Newton algorithm, where the inverse of H_t can be cheaply computed through inverting two rank-r matrices.
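As a sanity check on the vectorized form (6), the following numpy snippet (ours, using column-major vectorization, and the L-block only since the R-block is analogous) verifies that right-multiplying the gradient block by (R_t^⊤ R_t)^{-1} matches the Kronecker-structured preconditioner.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, r = 6, 2
G_L = rng.standard_normal((n1, r))   # stands in for the gradient w.r.t. L
R = rng.standard_normal((8, r))
P = np.linalg.inv(R.T @ R)           # r x r preconditioner block

# Factored form of the scaled direction vs. its Kronecker form in (6).
vec = lambda A: A.reshape(-1, order="F")     # column-major vectorization
direct = G_L @ P
kron_form = np.kron(P, np.eye(n1)) @ vec(G_L)

assert np.allclose(vec(direct), kron_form)
```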

2.1 Assumptions and error metric

Denote by U⋆ Σ⋆ V⋆^⊤ the compact singular value decomposition (SVD) of the rank-r matrix X⋆ ∈ R^{n1×n2}. Here U⋆ ∈ R^{n1×r} and V⋆ ∈ R^{n2×r} are composed of the r left and right singular vectors, respectively, and Σ⋆ ∈ R^{r×r} is a diagonal matrix consisting of the r singular values of X⋆ organized in non-increasing order, i.e. σ_1(X⋆) ≥ · · · ≥ σ_r(X⋆) > 0. Define

\[
\kappa := \sigma_1(X_\star) / \sigma_r(X_\star) \tag{7}
\]

as the condition number of X⋆. Define the ground-truth low-rank factors as

\[
L_\star := U_\star \Sigma_\star^{1/2}, \quad \text{and} \quad R_\star := V_\star \Sigma_\star^{1/2}, \tag{8}
\]

so that X⋆ = L⋆ R⋆^⊤. Correspondingly, denote the stacked factor matrix as

\[
F_\star := \begin{bmatrix} L_\star \\ R_\star \end{bmatrix} \in \mathbb{R}^{(n_1+n_2)\times r}. \tag{9}
\]


Next, we need the right metric to measure the progress of the ScaledGD iterates F_t := [L_t^⊤, R_t^⊤]^⊤. Obviously, the factored representation is not unique, in that for any invertible matrix Q ∈ GL(r), one has LR^⊤ = (LQ)(RQ^{-⊤})^⊤. Therefore, the reconstruction error metric needs to take this identifiability issue into account. More importantly, we need a diagonal scaling in the distance metric to properly account for the effect of preconditioning. To provide intuition, note that the update rule (3) can be viewed as finding the best local quadratic approximation of L(·) in the following sense:

\[
\begin{aligned}
(L_{t+1}, R_{t+1}) = \mathop{\mathrm{argmin}}_{L, R} \;\; & \mathcal{L}(L_t, R_t) + \langle \nabla_L \mathcal{L}(L_t, R_t),\, L - L_t \rangle + \langle \nabla_R \mathcal{L}(L_t, R_t),\, R - R_t \rangle \\
& + \frac{1}{2\eta} \left( \left\| (L - L_t)(R_t^\top R_t)^{1/2} \right\|_{\mathrm{F}}^2 + \left\| (R - R_t)(L_t^\top L_t)^{1/2} \right\|_{\mathrm{F}}^2 \right),
\end{aligned}
\]

where it differs from the common interpretation of gradient descent in that the quadratic approximation is taken with respect to a scaled norm. When L_t ≈ L⋆ and R_t ≈ R⋆ approach the ground truth, the scaling factors can be approximated by L_t^⊤ L_t ≈ Σ⋆ and R_t^⊤ R_t ≈ Σ⋆, leading to the following error metric

\[
\operatorname{dist}^2(F, F_\star) := \inf_{Q \in \mathrm{GL}(r)} \left\| (LQ - L_\star) \Sigma_\star^{1/2} \right\|_{\mathrm{F}}^2 + \left\| (RQ^{-\top} - R_\star) \Sigma_\star^{1/2} \right\|_{\mathrm{F}}^2. \tag{10}
\]

Correspondingly, we define the optimal alignment matrix Q between F and F⋆ as

\[
Q := \mathop{\mathrm{argmin}}_{Q \in \mathrm{GL}(r)} \left\| (LQ - L_\star) \Sigma_\star^{1/2} \right\|_{\mathrm{F}}^2 + \left\| (RQ^{-\top} - R_\star) \Sigma_\star^{1/2} \right\|_{\mathrm{F}}^2, \tag{11}
\]

whenever the minimum is achieved.¹ It turns out that for the ScaledGD iterates F_t, the optimal alignment matrices Q_t always exist (at least when properly initialized) and hence are well-defined. The design and analysis of this new distance metric are of crucial importance in obtaining the improved rate of ScaledGD; see Appendix A.1 for a collection of its properties. In comparison, the previously studied distance metrics (proposed mainly for GD) either do not include the diagonal scaling Ma et al. (2021); Tu et al. (2016), or only consider the ambiguity class up to orthonormal transforms Tu et al. (2016), which fail to unveil the benefit of ScaledGD.
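For intuition only, the scaled distance (10) can be evaluated numerically by locally searching over Q ∈ GL(r). The sketch below (our code, using a generic derivative-free optimizer started at the identity) approximates the infimum for small r; it is not how the analysis handles the alignment matrix.

```python
import numpy as np
from scipy.optimize import minimize

def scaled_dist(L, R, L_star, R_star, sigma_star):
    """Approximate dist(F, F_star) in (10) by locally minimizing over Q in
    GL(r), starting from the identity. sigma_star holds the r singular
    values of X_star. Illustrative only, and practical just for small r."""
    r = L.shape[1]
    w = np.sqrt(sigma_star)                      # acts as Sigma_star^{1/2}

    def objective(q):
        Q = q.reshape(r, r)
        if np.linalg.cond(Q) > 1e8:              # keep Q safely inside GL(r)
            return 1e12
        resid_L = (L @ Q - L_star) * w           # right-multiply by diag(w)
        resid_R = (R @ np.linalg.inv(Q).T - R_star) * w
        return np.sum(resid_L ** 2) + np.sum(resid_R ** 2)

    res = minimize(objective, np.eye(r).ravel(), method="Nelder-Mead")
    return np.sqrt(res.fun)
```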

2.2 Matrix sensing

Assume that we have collected a set of linear measurements about a rank-r matrix X⋆ ∈ R^{n1×n2}, given as

\[
y = \mathcal{A}(X_\star) \in \mathbb{R}^m, \tag{12}
\]

where A(X) = {⟨A_k, X⟩}_{k=1}^{m} : R^{n1×n2} ↦ R^m is the linear map modeling the measurement process. The goal of low-rank matrix sensing is to recover X⋆ from y, especially when the number of measurements m ≪ n1 n2, by exploiting the low-rank property. This problem has wide applications in medical imaging, signal processing, and data compression Candès and Plan (2011).

Algorithm. Writing X ∈ R^{n1×n2} in the factored form X = LR^⊤, we consider the following optimization problem:

\[
\operatorname*{minimize}_{F \in \mathbb{R}^{(n_1+n_2)\times r}} \; \mathcal{L}(F) = \frac{1}{2} \left\| \mathcal{A}(LR^\top) - y \right\|_2^2. \tag{13}
\]

Here, as before, F denotes the stacked factor matrix [L^⊤, R^⊤]^⊤. We suggest running ScaledGD (3) with the spectral initialization to solve (13), which performs the top-r SVD of A*(y), where A*(·) is the adjoint operator of A(·). The full algorithm is stated in Algorithm 1. The low-rank matrix can be estimated as X_T = L_T R_T^⊤ after running T iterations of ScaledGD.

1. If there are multiple minimizers, we can arbitrarily take one to be Q.


Algorithm 1 ScaledGD for low-rank matrix sensing with spectral initialization

Spectral initialization: Let U_0 Σ_0 V_0^⊤ be the top-r SVD of A*(y), and set

\[
L_0 = U_0 \Sigma_0^{1/2}, \quad \text{and} \quad R_0 = V_0 \Sigma_0^{1/2}. \tag{14}
\]

Scaled gradient updates: for t = 0, 1, 2, . . . , T − 1 do

\[
\begin{aligned}
L_{t+1} &= L_t - \eta\, \mathcal{A}^*\!\big(\mathcal{A}(L_t R_t^\top) - y\big) R_t (R_t^\top R_t)^{-1}, \\
R_{t+1} &= R_t - \eta\, \mathcal{A}^*\!\big(\mathcal{A}(L_t R_t^\top) - y\big)^\top L_t (L_t^\top L_t)^{-1}.
\end{aligned} \tag{15}
\]
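For concreteness, a minimal Python/numpy sketch of Algorithm 1 follows, assuming the sensing matrices A_k are stacked in a dense array `As` of shape (m, n1, n2); the function name, step size, and iteration count are illustrative choices of ours.

```python
import numpy as np

def scaled_gd_sensing(As, y, r, eta=0.5, T=200):
    """Sketch of Algorithm 1, assuming measurements y_k = <A_k, X_star> with
    the sensing matrices stacked in a dense array As of shape (m, n1, n2)."""
    # Spectral initialization (14): top-r SVD of A^*(y) = sum_k y_k A_k.
    Ay = np.tensordot(y, As, axes=1)
    U, s, Vt = np.linalg.svd(Ay, full_matrices=False)
    L = U[:, :r] * np.sqrt(s[:r])
    R = Vt[:r].T * np.sqrt(s[:r])
    for _ in range(T):
        # A(L R^T) - y, then A^* applied to the residual.
        residual = np.tensordot(As, L @ R.T, axes=([1, 2], [0, 1])) - y
        E = np.tensordot(residual, As, axes=1)
        # Scaled gradient updates (15), via r x r linear solves.
        L, R = (L - eta * np.linalg.solve(R.T @ R, (E @ R).T).T,
                R - eta * np.linalg.solve(L.T @ L, (E.T @ L).T).T)
    return L, R
```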

Theoretical guarantees. To understand the performance of ScaledGD for low-rank matrix sensing, we adopt a standard assumption on the sensing operator A(·), namely the Restricted Isometry Property (RIP).

Definition 2 (RIP Recht et al. (2010)) The linear map A(·) is said to obey the rank-r RIP with a constant δ_r ∈ [0, 1), if for all matrices M ∈ R^{n1×n2} of rank at most r, one has

\[
(1 - \delta_r)\|M\|_{\mathrm{F}}^2 \le \|\mathcal{A}(M)\|_2^2 \le (1 + \delta_r)\|M\|_{\mathrm{F}}^2.
\]

It is well-known that many measurement ensembles satisfy the RIP Recht et al. (2010); Candès and Plan (2011). For example, if the entries of the A_k's are i.i.d. Gaussian N(0, 1/m), then the RIP is satisfied for a constant δ_r as long as m is on the order of (n1 + n2) r / δ_r². With the RIP condition in place, the following theorem demonstrates that ScaledGD converges linearly, in terms of the new distance metric (cf. (10)), at a constant rate as long as the sensing operator A(·) has a sufficiently small RIP constant.

Theorem 3 Suppose that A(·) obeys the 2r-RIP with δ_{2r} ≤ 0.02/(√r κ). If the step size obeys 0 < η ≤ 2/3, then for all t ≥ 0, the iterates of the ScaledGD method in Algorithm 1 satisfy

\[
\operatorname{dist}(F_t, F_\star) \le (1 - 0.6\eta)^t\, 0.1\,\sigma_r(X_\star), \quad \text{and} \quad \left\| L_t R_t^\top - X_\star \right\|_{\mathrm{F}} \le (1 - 0.6\eta)^t\, 0.15\,\sigma_r(X_\star).
\]

Theorem 3 establishes that the distance dist(F_t, F⋆) contracts linearly at a constant rate, as long as the sample size satisfies m = O(nr²κ²) with Gaussian random measurements Recht et al. (2010), where we recall that n = n1 ∨ n2. To reach ε-accuracy, i.e. ‖L_t R_t^⊤ − X⋆‖_F ≤ ε σ_r(X⋆), ScaledGD takes at most T = O(log(1/ε)) iterations, which is independent of the condition number κ of X⋆. In comparison, alternating minimization with spectral initialization (AltMinSense) converges in O(log(1/ε)) iterations as long as m = O(nr³κ⁴) Jain et al. (2013), but its per-iteration cost is much higher.² On the other end, gradient descent with spectral initialization in Tu et al. (2016) converges in O(κ log(1/ε)) iterations as long as m = O(nr²κ²). Therefore, ScaledGD converges at a much faster rate than GD at the same sample complexity, while requiring a significantly lower per-iteration cost than AltMinSense.

Remark 4 Tu et al. (2016) suggested that one can employ a more expensive initialization scheme, e.g. performing multiple projected gradient descent steps over the low-rank matrix, to reduce the sample complexity. By seeding ScaledGD with the output of updates of the form X_{τ+1} = P_r(X_τ − A*(A(X_τ) − y)) after T_0 ≳ log(√r κ) iterations, where P_r(·) is defined in (5), ScaledGD succeeds with a sample size of O(nr), which is information-theoretically optimal.

2. The exact per-iteration complexity of AltMinSense depends on how the least-squares subproblems are solved with m equations and nr unknowns; see (Luo et al., 2020, Table 1) for detailed comparisons.
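A sketch of this warm start is given below, under the assumption (as in the Gaussian design with variance 1/m) that A*A acts approximately as the identity on low-rank matrices, so a unit step size is reasonable; the helper names and the value of T0 are ours.

```python
import numpy as np

def top_r(A, r):
    """Rank-r approximation P_r(A) of (5) via the top-r SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def projected_gd_warmstart(As, y, r, T0=10):
    """Sketch of the warm start in Remark 4: a few projected gradient steps
    X <- P_r(X - A^*(A(X) - y)) in the matrix space before running ScaledGD."""
    m, n1, n2 = As.shape
    X = np.zeros((n1, n2))
    for _ in range(T0):
        residual = np.tensordot(As, X, axes=([1, 2], [0, 1])) - y
        X = top_r(X - np.tensordot(residual, As, axes=1), r)
    return X
```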


2.3 Robust PCA

Assume that we have observed the data matrix

\[
Y = X_\star + S_\star,
\]

which is a superposition of a rank-r matrix X⋆, modeling the clean data, and a sparse matrix S⋆, modeling the corruption or outliers. The goal of robust PCA Candès et al. (2011); Chandrasekaran et al. (2011) is to separate the two matrices X⋆ and S⋆ from their mixture Y. This problem finds numerous applications in video surveillance, image processing, and so on.

Following Chandrasekaran et al. (2011); Netrapalli et al. (2014); Yi et al. (2016), we consider a deterministic sparsity model for S⋆, in which S⋆ contains at most an α-fraction of nonzero entries per row and column for some α ∈ [0, 1), i.e. S⋆ ∈ S_α, where we denote

\[
\mathcal{S}_\alpha := \left\{ S \in \mathbb{R}^{n_1 \times n_2} : \| S_{i,\cdot} \|_0 \le \alpha n_2 \text{ for all } i, \text{ and } \| S_{\cdot,j} \|_0 \le \alpha n_1 \text{ for all } j \right\}. \tag{16}
\]

Algorithm. Writing X ∈ R^{n1×n2} in the factored form X = LR^⊤, we consider the following optimization problem:

\[
\operatorname*{minimize}_{F \in \mathbb{R}^{(n_1+n_2)\times r},\, S \in \mathcal{S}_\alpha} \; \mathcal{L}(F, S) = \frac{1}{2} \left\| LR^\top + S - Y \right\|_{\mathrm{F}}^2. \tag{17}
\]

It is thus natural to alternately update F = [L^⊤, R^⊤]^⊤ and S, where F is updated via the proposed ScaledGD algorithm, and S is updated by hard thresholding, which trims the small entries of the residual matrix Y − LR^⊤. More specifically, for some truncation level 0 ≤ α ≤ 1, we define the sparsification operator that only keeps the α fraction of largest entries in each row and column:

\[
(\mathcal{T}_\alpha[A])_{i,j} =
\begin{cases}
A_{i,j}, & \text{if } |A|_{i,j} \ge |A|_{i,(\alpha n_2)} \text{ and } |A|_{i,j} \ge |A|_{(\alpha n_1),j}, \\
0, & \text{otherwise},
\end{cases} \tag{18}
\]

where |A|_{i,(k)} (resp. |A|_{(k),j}) denotes the k-th largest entry in magnitude in the i-th row (resp. j-th column).
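A direct implementation of (18) is sketched below (function name ours); we take ⌊αn⌋ as the integer count, and ties at the threshold may keep marginally more than an α fraction.

```python
import numpy as np

def sparsify_T_alpha(A, alpha):
    """Sketch of the sparsification operator (18): keep A[i, j] only if it is
    simultaneously among the top-(alpha*n2) entries of row i and the
    top-(alpha*n1) entries of column j, in magnitude."""
    n1, n2 = A.shape
    k_row = int(np.floor(alpha * n2))
    k_col = int(np.floor(alpha * n1))
    if k_row == 0 or k_col == 0:
        return np.zeros_like(A)
    absA = np.abs(A)
    # k-th largest magnitude in each row / column (1-indexed k).
    row_thresh = -np.partition(-absA, k_row - 1, axis=1)[:, k_row - 1]
    col_thresh = -np.partition(-absA, k_col - 1, axis=0)[k_col - 1, :]
    keep = (absA >= row_thresh[:, None]) & (absA >= col_thresh[None, :])
    return np.where(keep, A, 0.0)
```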

The ScaledGD algorithm with spectral initialization for solving robust PCA is formally stated in Algorithm 2. Note that, compared with Yi et al. (2016), we require neither a balancing term ‖L^⊤L − R^⊤R‖_F² in the loss function (17), nor a projection of the low-rank factors onto the ℓ_{2,∞} ball in each iteration.

Algorithm 2 ScaledGD for robust PCA with spectral initialization

Spectral initialization: Let U_0 Σ_0 V_0^⊤ be the top-r SVD of Y − T_α[Y], and set

\[
L_0 = U_0 \Sigma_0^{1/2}, \quad \text{and} \quad R_0 = V_0 \Sigma_0^{1/2}. \tag{19}
\]

Scaled gradient updates: for t = 0, 1, 2, . . . , T − 1 do

\[
\begin{aligned}
S_t &= \mathcal{T}_{2\alpha}\big[Y - L_t R_t^\top\big], \\
L_{t+1} &= L_t - \eta\,\big(L_t R_t^\top + S_t - Y\big) R_t (R_t^\top R_t)^{-1}, \\
R_{t+1} &= R_t - \eta\,\big(L_t R_t^\top + S_t - Y\big)^\top L_t (L_t^\top L_t)^{-1}.
\end{aligned} \tag{20}
\]


Theoretical guarantee. Before stating our main result for robust PCA, we introduce the incoherence condition, which is known to be crucial for reliable estimation of the low-rank matrix X⋆ in robust PCA Chen (2015).

Definition 5 (Incoherence) A rank-r matrix X⋆ ∈ R^{n1×n2} with compact SVD X⋆ = U⋆ Σ⋆ V⋆^⊤ is said to be μ-incoherent if

\[
\| U_\star \|_{2,\infty} \le \sqrt{\frac{\mu}{n_1}}\, \| U_\star \|_{\mathrm{F}} = \sqrt{\frac{\mu r}{n_1}}, \quad \text{and} \quad \| V_\star \|_{2,\infty} \le \sqrt{\frac{\mu}{n_2}}\, \| V_\star \|_{\mathrm{F}} = \sqrt{\frac{\mu r}{n_2}}.
\]

The following theorem establishes that ScaledGD converges linearly at a constant rate as long as the fraction α of corruptions is sufficiently small.

Theorem 6 Suppose that X⋆ is μ-incoherent and that the corruption fraction α obeys α ≤ c/(μ r^{3/2} κ) for some sufficiently small constant c > 0. If the step size obeys 0.1 ≤ η ≤ 2/3, then for all t ≥ 0, the iterates of ScaledGD in Algorithm 2 satisfy

\[
\operatorname{dist}(F_t, F_\star) \le (1 - 0.6\eta)^t\, 0.02\,\sigma_r(X_\star), \quad \text{and} \quad \left\| L_t R_t^\top - X_\star \right\|_{\mathrm{F}} \le (1 - 0.6\eta)^t\, 0.03\,\sigma_r(X_\star).
\]

Theorem 6 establishes that the distance dist(F_t, F⋆) contracts linearly at a constant rate, as long as the fraction of corruptions satisfies α ≲ 1/(μ r^{3/2} κ). To reach ε-accuracy, i.e. ‖L_t R_t^⊤ − X⋆‖_F ≤ ε σ_r(X⋆), ScaledGD takes at most T = O(log(1/ε)) iterations, which is independent of κ. In comparison, the AltProj algorithm³ with spectral initialization converges in O(log(1/ε)) iterations as long as α ≲ 1/(μr) Netrapalli et al. (2014), but its per-iteration cost is much higher both in terms of computation and memory, as it requires computing a low-rank SVD of the full matrix. On the other hand, projected gradient descent with spectral initialization in Yi et al. (2016) converges in O(κ log(1/ε)) iterations as long as α ≲ 1/(μ r^{3/2} κ^{3/2} ∨ μ r κ²). Therefore, ScaledGD converges at a much faster rate than GD while requiring a significantly lower per-iteration cost than AltProj. In addition, our theory suggests that ScaledGD maintains the incoherence and balancedness of the low-rank factors without imposing explicit regularizations, which is not captured in previous analyses Yi et al. (2016).

2.4 Matrix completion

Assume that we have observed a subset Ω of entries of X⋆, given as P_Ω(X⋆), where P_Ω : R^{n1×n2} ↦ R^{n1×n2} is the projection such that

\[
(\mathcal{P}_\Omega(X))_{i,j} =
\begin{cases}
X_{i,j}, & \text{if } (i,j) \in \Omega, \\
0, & \text{otherwise}.
\end{cases} \tag{21}
\]

Here Ω is generated according to the Bernoulli model, in the sense that each (i, j) belongs to Ω independently with probability p. The goal of matrix completion is to recover the matrix X⋆ from its partial observation P_Ω(X⋆). This problem has many applications in recommendation systems, signal processing, sensor network localization, and so on Candès and Recht (2009).

Algorithm. Again, writing X ∈ R^{n1×n2} in the factored form X = LR^⊤, we consider the following optimization problem:

\[
\operatorname*{minimize}_{F \in \mathbb{R}^{(n_1+n_2)\times r}} \; \mathcal{L}(F) := \frac{1}{2p} \left\| \mathcal{P}_\Omega\big(LR^\top - X_\star\big) \right\|_{\mathrm{F}}^2. \tag{22}
\]

3. AltProj employs a multi-stage strategy to remove the dependence on κ in α, which we do not consider here. The same strategy might also improve the dependence on κ for ScaledGD, which we leave for future work.


Similarly to robust PCA, the underlying low-rank matrix X⋆ needs to be incoherent (cf. Definition 5) to avoid ill-posedness. One typical strategy to ensure the incoherence condition is to perform a projection after the gradient update, projecting the iterates to maintain small ℓ_{2,∞} norms of the factor matrices. However, the standard projection operator Chen and Wainwright (2015) is not covariant with respect to invertible transforms and, consequently, needs to be modified when using scaled gradient updates. To that end, we introduce the following new projection operator: for every F̃ = [L̃^⊤, R̃^⊤]^⊤ ∈ R^{(n1+n2)×r},

\[
\mathcal{P}_B(\tilde{F}) = \mathop{\mathrm{argmin}}_{F \in \mathbb{R}^{(n_1+n_2)\times r}} \; \left\| (L - \tilde{L})(\tilde{R}^\top \tilde{R})^{1/2} \right\|_{\mathrm{F}}^2 + \left\| (R - \tilde{R})(\tilde{L}^\top \tilde{L})^{1/2} \right\|_{\mathrm{F}}^2
\quad \text{s.t.} \quad \sqrt{n_1} \left\| L (\tilde{R}^\top \tilde{R})^{1/2} \right\|_{2,\infty} \vee \sqrt{n_2} \left\| R (\tilde{L}^\top \tilde{L})^{1/2} \right\|_{2,\infty} \le B, \tag{23}
\]

which finds a factored matrix that is closest to F̃ and stays incoherent in a weighted sense. Luckily, the above scaled projection admits a simple closed-form solution, as stated below.

Proposition 7 The solution to (23) is given by

\[
\mathcal{P}_B(\tilde{F}) := \begin{bmatrix} L \\ R \end{bmatrix}, \quad \text{where} \quad
L_{i,\cdot} := \left( 1 \wedge \frac{B}{\sqrt{n_1}\, \| \tilde{L}_{i,\cdot} \tilde{R}^\top \|_2} \right) \tilde{L}_{i,\cdot}, \;\; 1 \le i \le n_1, \qquad
R_{j,\cdot} := \left( 1 \wedge \frac{B}{\sqrt{n_2}\, \| \tilde{R}_{j,\cdot} \tilde{L}^\top \|_2} \right) \tilde{R}_{j,\cdot}, \;\; 1 \le j \le n_2. \tag{24}
\]

Proof See Appendix E.1.1.
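The closed form (24) translates into a few lines of code. The sketch below (ours) computes the row norms through the r × r Gram matrices so that the full n1 × n2 product is never formed, and adds a tiny constant to guard against division by zero.

```python
import numpy as np

def project_B(L, R, B):
    """Closed-form scaled projection (24): shrink each row of L (resp. R) so
    that sqrt(n1)*||L_i. R^T||_2 (resp. sqrt(n2)*||R_j. L^T||_2) is at most B."""
    n1, n2 = L.shape[0], R.shape[0]
    G_R, G_L = R.T @ R, L.T @ L                    # r x r Gram matrices
    # ||L_i. R^T||_2 = sqrt(L_i. (R^T R) L_i.^T); avoids forming L @ R.T.
    row_norms_L = np.sqrt(np.einsum("ir,rs,is->i", L, G_R, L))
    row_norms_R = np.sqrt(np.einsum("jr,rs,js->j", R, G_L, R))
    scale_L = np.minimum(1.0, B / (np.sqrt(n1) * row_norms_L + 1e-32))
    scale_R = np.minimum(1.0, B / (np.sqrt(n2) * row_norms_R + 1e-32))
    return scale_L[:, None] * L, scale_R[:, None] * R
```

In Algorithm 3 below, this projection would be applied right after each scaled gradient update, with B = C_B √(μr) σ_1(X⋆).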

With the new projection operator in place, we propose the scaled projected gradient descent (ScaledPGD) method with spectral initialization for solving matrix completion, formally stated in Algorithm 3.

Algorithm 3 ScaledPGD for matrix completion with spectral initialization

Spectral initialization: Let U_0 Σ_0 V_0^⊤ be the top-r SVD of (1/p) P_Ω(X⋆), and set

\[
\begin{bmatrix} L_0 \\ R_0 \end{bmatrix} = \mathcal{P}_B\!\left( \begin{bmatrix} U_0 \Sigma_0^{1/2} \\ V_0 \Sigma_0^{1/2} \end{bmatrix} \right). \tag{25}
\]

Scaled projected gradient updates: for t = 0, 1, 2, . . . , T − 1 do

\[
\begin{bmatrix} L_{t+1} \\ R_{t+1} \end{bmatrix} = \mathcal{P}_B\!\left( \begin{bmatrix}
L_t - \frac{\eta}{p}\, \mathcal{P}_\Omega\big(L_t R_t^\top - X_\star\big) R_t (R_t^\top R_t)^{-1} \\[2pt]
R_t - \frac{\eta}{p}\, \mathcal{P}_\Omega\big(L_t R_t^\top - X_\star\big)^\top L_t (L_t^\top L_t)^{-1}
\end{bmatrix} \right). \tag{26}
\]

Theoretical guarantee. Consider a random observation model, where each index (i, j) belongs to the index set Ω independently with probability 0 < p ≤ 1. The following theorem establishes that ScaledPGD converges linearly at a constant rate as long as the number of observations is sufficiently large.

Theorem 8 Suppose that X⋆ is μ-incoherent and that p satisfies p ≥ C(μκ² ∨ log(n1 ∨ n2)) μr²κ² / (n1 ∧ n2) for some sufficiently large constant C. Set the projection radius as B = C_B √(μr) σ_1(X⋆) for some constant C_B ≥ 1.02. If the step size obeys 0 < η ≤ 2/3, then with probability at least 1 − c_1(n1 ∨ n2)^{−c_2}, for all t ≥ 0, the iterates of ScaledPGD in (26) satisfy

\[
\operatorname{dist}(F_t, F_\star) \le (1 - 0.6\eta)^t\, 0.02\,\sigma_r(X_\star), \quad \text{and} \quad \left\| L_t R_t^\top - X_\star \right\|_{\mathrm{F}} \le (1 - 0.6\eta)^t\, 0.03\,\sigma_r(X_\star).
\]

Here c_1, c_2 > 0 are two universal constants.


Theorem 8 establishes that the distance dist(F_t, F⋆) contracts linearly at a constant rate, as long as the probability of observation satisfies p ≳ (μκ² ∨ log(n1 ∨ n2)) μr²κ² / (n1 ∧ n2). To reach ε-accuracy, i.e. ‖L_t R_t^⊤ − X⋆‖_F ≤ ε σ_r(X⋆), ScaledPGD takes at most T = O(log(1/ε)) iterations, which is independent of κ. In comparison, projected gradient descent Zheng and Lafferty (2016) with spectral initialization converges in O(κ log(1/ε)) iterations as long as p ≳ (μ ∨ log(n1 ∨ n2)) μr²κ² / (n1 ∧ n2). Therefore, ScaledPGD achieves much faster convergence than its unscaled counterpart, at the expense of a higher sample complexity. We believe this higher sample complexity is an artifact of our proof techniques, as numerically we do not observe a degradation in terms of sample complexity.

2.5 Optimizing general loss functions

Last but not least, we generalize our analysis of ScaledGD to minimize a general loss function of the form (2), where the update rule of ScaledGD is given by

\[
\begin{aligned}
L_{t+1} &= L_t - \eta\, \nabla f(L_t R_t^\top)\, R_t (R_t^\top R_t)^{-1}, \\
R_{t+1} &= R_t - \eta\, \nabla f(L_t R_t^\top)^\top L_t (L_t^\top L_t)^{-1}.
\end{aligned} \tag{27}
\]

Two important properties of the loss function f : R^{n1×n2} ↦ R play a key role in the analysis.

Definition 9 (Restricted smoothness) A differentiable function f : R^{n1×n2} ↦ R is said to be rank-r restricted L-smooth for some L > 0 if

\[
f(X_2) \le f(X_1) + \langle \nabla f(X_1), X_2 - X_1 \rangle + \frac{L}{2} \| X_2 - X_1 \|_{\mathrm{F}}^2,
\]

for any X_1, X_2 ∈ R^{n1×n2} with rank at most r.

Definition 10 (Restricted strong convexity) A differentiable function f : R^{n1×n2} ↦ R is said to be rank-r restricted μ-strongly convex for some μ ≥ 0 if

\[
f(X_2) \ge f(X_1) + \langle \nabla f(X_1), X_2 - X_1 \rangle + \frac{\mu}{2} \| X_2 - X_1 \|_{\mathrm{F}}^2,
\]

for any X_1, X_2 ∈ R^{n1×n2} with rank at most r. When μ = 0, we simply say f(·) is rank-r restricted convex.

Further, when μ > 0, define the condition number of the loss function f(·) over rank-r matrices as

\[
\kappa_f := L / \mu. \tag{28}
\]

Encouragingly, many problems can be viewed as a special case of optimizing this general loss (27), including but not limited to:

• low-rank matrix factorization, where the loss function f(X) = (1/2)‖X − X⋆‖_F² in (29) satisfies κ_f = 1;

• low-rank matrix sensing, where the loss function f(X) = (1/2)‖A(X − X⋆)‖_2² in (13) satisfies κ_f ≈ 1 when A(·) obeys the rank-r RIP with a sufficiently small RIP constant;

• quadratic sampling, where the loss function f(X) = (1/2) Σ_{i=1}^m |⟨a_i a_i^⊤, X − X⋆⟩|² satisfies restricted strong convexity and smoothness when the a_i's are i.i.d. Gaussian vectors and m is sufficiently large Sanghavi et al. (2017); Li et al. (2021);

• exponential-family PCA, where the loss function is f(X) = −Σ_{i,j} log p(Y_{i,j} | X_{i,j}), with p(Y_{i,j} | X_{i,j}) the probability density function of Y_{i,j} conditional on X_{i,j}, following an exponential-family distribution such as the Bernoulli or Poisson distribution. The resulting loss function satisfies restricted strong convexity and smoothness with a condition number κ_f > 1 depending on the specific distribution Gunasekar et al. (2014); Lafond (2015).


Indeed, the treatment of a general loss function brings the condition number of f(·) under the spotlight, since in our earlier case studies κ_f ≈ 1. Our purpose is thus to understand the interplay of the two types of condition numbers in the convergence of first-order methods. For simplicity, we assume that f(·) is minimized at the ground-truth rank-r matrix X⋆.⁴ The following theorem establishes that, as long as it is properly initialized, ScaledGD converges linearly at a constant rate.

Theorem 11 Suppose that f(·) is rank-2r restricted L-smooth and μ-strongly convex, of which X⋆ is a minimizer, and that the initialization F_0 satisfies dist(F_0, F⋆) ≤ 0.1 σ_r(X⋆)/√κ_f. If the step size obeys 0 < η ≤ 0.4/L, then for all t ≥ 0, the iterates of ScaledGD in (27) satisfy

\[
\operatorname{dist}(F_t, F_\star) \le (1 - 0.7\eta\mu)^t\, \frac{0.1\,\sigma_r(X_\star)}{\sqrt{\kappa_f}}, \quad \text{and} \quad \left\| L_t R_t^\top - X_\star \right\|_{\mathrm{F}} \le (1 - 0.7\eta\mu)^t\, \frac{0.15\,\sigma_r(X_\star)}{\sqrt{\kappa_f}}.
\]

Theorem 11 establishes that the distance dist(F_t, F⋆) contracts linearly at a constant rate, as long as the initialization F_0 is sufficiently close to F⋆. To reach ε-accuracy, i.e. ‖L_t R_t^⊤ − X⋆‖_F ≤ ε σ_r(X⋆), ScaledGD takes at most T = O(κ_f log(1/ε)) iterations, which depends only on the condition number κ_f of f(·) and is independent of the condition number κ of the matrix X⋆. In contrast, prior theory for vanilla gradient descent Park et al. (2018); Bhojanapalli et al. (2016a) requires O(κ_f κ log(1/ε)) iterations, which is worse than our rate by a factor of κ.

3. Proof Sketch

In this section, we sketch the proof of the main theorems, highlighting the role of the scaled distance metric (cf. (10)) in these analyses.

3.1 A warm-up analysis: matrix factorization

Let us consider the problem of factorizing a matrix X⋆ into two low-rank factors:

\[
\operatorname*{minimize}_{F \in \mathbb{R}^{(n_1+n_2)\times r}} \; \mathcal{L}(F) = \frac{1}{2} \left\| LR^\top - X_\star \right\|_{\mathrm{F}}^2. \tag{29}
\]

For this toy problem, the update rule of ScaledGD is given by

\[
\begin{aligned}
L_{t+1} &= L_t - \eta\,\big(L_t R_t^\top - X_\star\big) R_t (R_t^\top R_t)^{-1}, \\
R_{t+1} &= R_t - \eta\,\big(L_t R_t^\top - X_\star\big)^\top L_t (L_t^\top L_t)^{-1}.
\end{aligned} \tag{30}
\]

To shed light on why ScaledGD is robust to ill-conditioning, it is worthwhile to think of ScaledGD as a quasi-Newton algorithm: the following proposition (proven in Appendix B.1) reveals that ScaledGD is equivalent to approximating the Hessian of the loss function in (29) by keeping only its diagonal blocks.

Proposition 12 For the matrix factorization problem (29), ScaledGD is equivalent to the following update rule:

\[
\operatorname{vec}(F_{t+1}) = \operatorname{vec}(F_t) - \eta
\begin{bmatrix}
\nabla^2_{L,L} \mathcal{L}(F_t) & 0 \\
0 & \nabla^2_{R,R} \mathcal{L}(F_t)
\end{bmatrix}^{-1}
\operatorname{vec}(\nabla_F \mathcal{L}(F_t)).
\]

Here, ∇²_{L,L} L(F_t) (resp. ∇²_{R,R} L(F_t)) denotes the second-order derivative w.r.t. L (resp. R) at F_t.

The following theorem, whose proof can be found in Appendix B.2, formally establishes that as long as ScaledGD is initialized close to the ground truth, dist(F_t, F⋆) contracts at a constant linear rate for the matrix factorization problem.

4. In practice, due to the presence of statistical noise, the minimizer of f(·) might be only approximately low-rank, to which our analysis can be extended in a straightforward fashion.


Theorem 13 Suppose that the initialization F_0 satisfies dist(F_0, F⋆) ≤ 0.1 σ_r(X⋆). If the step size obeys 0 < η ≤ 2/3, then for all t ≥ 0, the iterates of the ScaledGD method in (30) satisfy

\[
\operatorname{dist}(F_t, F_\star) \le (1 - 0.7\eta)^t\, 0.1\,\sigma_r(X_\star), \quad \text{and} \quad \left\| L_t R_t^\top - X_\star \right\|_{\mathrm{F}} \le (1 - 0.7\eta)^t\, 0.15\,\sigma_r(X_\star).
\]

Compared to the contraction rate (1 − 1/κ) of gradient descent for matrix factorization Ma et al. (2021); Chi et al. (2019), Theorem 13 demonstrates that the preconditioners indeed yield better search directions in the local neighborhood of the ground truth, and hence a faster convergence rate.
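The contrast is easy to reproduce numerically. The sketch below (our code, with illustrative sizes and a small perturbation of the ground-truth factors standing in for an imperfect initialization) runs ScaledGD (30) and vanilla GD with step η/σ_1(X⋆) on the factorization loss (29) for an ill-conditioned X⋆. With κ = 50, ScaledGD should reach machine precision within a few hundred iterations, while vanilla GD converges far more slowly, consistent with Figure 2.

```python
import numpy as np

def make_ground_truth(n, r, kappa, rng):
    """Random n x n rank-r matrix with singular values spread linearly from
    1 to 1/kappa, mirroring the experimental setup in Section 4."""
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return (U * np.linspace(1.0, 1.0 / kappa, r)) @ V.T

def factorize(X_star, r, eta=0.5, T=300, scaled=True, seed=1):
    """Run ScaledGD (30) or vanilla GD on the loss (29), starting from a
    slightly perturbed ground-truth factorization; return the relative error."""
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(X_star, full_matrices=False)
    L = U[:, :r] * np.sqrt(s[:r])
    R = Vt[:r].T * np.sqrt(s[:r])
    # Small perturbation standing in for an imperfect initialization.
    L = L + 0.03 * s[r - 1] * rng.standard_normal(L.shape) / np.sqrt(L.shape[0])
    R = R + 0.03 * s[r - 1] * rng.standard_normal(R.shape) / np.sqrt(R.shape[0])
    for _ in range(T):
        E = L @ R.T - X_star
        if scaled:   # ScaledGD (30): precondition by the r x r Gram matrices
            L, R = (L - eta * np.linalg.solve(R.T @ R, (E @ R).T).T,
                    R - eta * np.linalg.solve(L.T @ L, (E.T @ L).T).T)
        else:        # vanilla GD with step eta / sigma_1(X_star), cf. (32)
            L, R = L - eta / s[0] * (E @ R), R - eta / s[0] * (E.T @ L)
    return np.linalg.norm(L @ R.T - X_star) / np.linalg.norm(X_star)

X_star = make_ground_truth(100, 5, kappa=50, rng=np.random.default_rng(0))
print("ScaledGD  :", factorize(X_star, 5, scaled=True))
print("vanilla GD:", factorize(X_star, 5, scaled=False))
```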

3.2 Proof outline for matrix sensing

It can be seen that the update rule (15) of ScaledGD in Algorithm 1 closely mimics (30) when A(·) satisfies the RIP. Therefore, leveraging the RIP of A(·) and Theorem 13, we can establish the following local convergence guarantee of Algorithm 1, which has a weaker requirement on δ_{2r} than the main theorem (cf. Theorem 3).

Lemma 14 Suppose that A(·) obeys the 2r-RIP with δ_{2r} ≤ 0.02. If the t-th iterate satisfies dist(F_t, F⋆) ≤ 0.1 σ_r(X⋆), then ‖L_t R_t^⊤ − X⋆‖_F ≤ 1.5 dist(F_t, F⋆). In addition, if the step size obeys 0 < η ≤ 2/3, then the (t + 1)-th iterate F_{t+1} of the ScaledGD method in (15) of Algorithm 1 satisfies

\[
\operatorname{dist}(F_{t+1}, F_\star) \le (1 - 0.6\eta)\, \operatorname{dist}(F_t, F_\star).
\]

It then boils down to finding a good initialization, for which we have the following lemma on the quality of the spectral initialization.

Lemma 15 Suppose that A(·) obeys the 2r-RIP with a constant δ_{2r}. Then the spectral initialization in (14) for low-rank matrix sensing satisfies

\[
\operatorname{dist}(F_0, F_\star) \le 5\,\delta_{2r}\sqrt{r}\,\kappa\,\sigma_r(X_\star).
\]

Therefore, as long as δ_{2r} is small enough, say δ_{2r} ≤ 0.02/(√r κ) as specified in Theorem 3, the initial distance satisfies dist(F_0, F⋆) ≤ 0.1 σ_r(X⋆), allowing us to invoke Lemma 14 recursively. The proof of Theorem 3 is then complete. The proofs of Lemmas 14 and 15 can be found in Appendix C.

3.3 Proof outline for robust PCA

As before, we begin with the following local convergence guarantee of Algorithm 2, which has a weaker requirement on α than the main theorem (cf. Theorem 6). The difference from low-rank matrix sensing is that local convergence for robust PCA requires a further incoherence condition on the iterates (cf. (31)), where we recall from (11) that Q_t is the optimal alignment matrix between F_t and F⋆.

Lemma 16 Suppose that X⋆ is μ-incoherent and α ≤ 10⁻⁴/(μr). If the t-th iterate satisfies dist(F_t, F⋆) ≤ 0.02 σ_r(X⋆) and the incoherence condition

\[
\sqrt{n_1} \left\| (L_t Q_t - L_\star)\Sigma_\star^{1/2} \right\|_{2,\infty} \vee \sqrt{n_2} \left\| (R_t Q_t^{-\top} - R_\star)\Sigma_\star^{1/2} \right\|_{2,\infty} \le \sqrt{\mu r}\,\sigma_r(X_\star), \tag{31}
\]

then ‖L_t R_t^⊤ − X⋆‖_F ≤ 1.5 dist(F_t, F⋆). In addition, if the step size obeys 0.1 ≤ η ≤ 2/3, then the (t+1)-th iterate F_{t+1} of the ScaledGD method in (20) of Algorithm 2 satisfies

\[
\operatorname{dist}(F_{t+1}, F_\star) \le (1 - 0.6\eta)\, \operatorname{dist}(F_t, F_\star),
\]

and the incoherence condition

\[
\sqrt{n_1} \left\| (L_{t+1} Q_{t+1} - L_\star)\Sigma_\star^{1/2} \right\|_{2,\infty} \vee \sqrt{n_2} \left\| (R_{t+1} Q_{t+1}^{-\top} - R_\star)\Sigma_\star^{1/2} \right\|_{2,\infty} \le \sqrt{\mu r}\,\sigma_r(X_\star).
\]


As long as the initialization is close to the ground truth and satisfies the incoherence condition, Lemma 16 ensures that the iterates of ScaledGD remain incoherent and converge linearly. This allows us to remove the unnecessary projection step in Yi et al. (2016), whose main objective is to ensure the incoherence of the iterates.

We are left with checking the initial conditions. The following lemma ensures that the spectral initialization in (19) is close to the ground truth as long as α is sufficiently small.

Lemma 17 Suppose that X⋆ is μ-incoherent. Then the spectral initialization (19) for robust PCA satisfies

\[
\operatorname{dist}(F_0, F_\star) \le 20\,\alpha\,\mu r^{3/2}\kappa\,\sigma_r(X_\star).
\]

As a result, setting α ≤ 10⁻³/(μ r^{3/2} κ), the spectral initialization satisfies dist(F_0, F⋆) ≤ 0.02 σ_r(X⋆). In addition, we need to make sure that the spectral initialization satisfies the incoherence condition, which is provided in the following lemma.

Lemma 18 Suppose that X⋆ is μ-incoherent, α ≤ 0.1/(μrκ), and dist(F_0, F⋆) ≤ 0.02 σ_r(X⋆). Then the spectral initialization (19) satisfies the incoherence condition

\[
\sqrt{n_1} \left\| (L_0 Q_0 - L_\star)\Sigma_\star^{1/2} \right\|_{2,\infty} \vee \sqrt{n_2} \left\| (R_0 Q_0^{-\top} - R_\star)\Sigma_\star^{1/2} \right\|_{2,\infty} \le \sqrt{\mu r}\,\sigma_r(X_\star).
\]

Combining Lemmas 16-18 finishes the proof of Theorem 6. The proofs of the three supporting lemmas can be found in Appendix D.

3.4 Proof outline for matrix completion

A key property of the new projection operator. We start with the following lemma, which entails a key property of the scaled projection (24): the scaled projection satisfies both non-expansiveness and incoherence under the scaled metric.

Lemma 19 Suppose that X⋆ is μ-incoherent, and dist(F, F⋆) ≤ ε σ_r(X⋆) for some ε < 1. Set B ≥ (1 + ε)√(μr) σ_1(X⋆); then P_B(F) satisfies the non-expansiveness

\[
\operatorname{dist}(\mathcal{P}_B(F), F_\star) \le \operatorname{dist}(F, F_\star),
\]

and the incoherence condition

\[
\sqrt{n_1}\,\| L R^\top \|_{2,\infty} \vee \sqrt{n_2}\,\| R L^\top \|_{2,\infty} \le B.
\]

It is worth noting that the incoherence condition adopts a slightly different form than that of robust PCA, which is more convenient for matrix completion. The next lemma guarantees the fast local convergence of Algorithm 3 as long as the sample complexity is large enough and the parameter B is set properly.

Lemma 20 Suppose that X⋆ is μ-incoherent, and p ≥ C(μrκ⁴ ∨ log(n1 ∨ n2)) μr/(n1 ∧ n2) for some sufficiently large constant C. Set the projection radius as B = C_B √(μr) σ_1(X⋆) for some constant C_B ≥ 1.02. Under an event E which happens with overwhelming probability (i.e. at least 1 − c_1(n1 ∨ n2)^{−c_2}), if the t-th iterate satisfies dist(F_t, F⋆) ≤ 0.02 σ_r(X⋆) and the incoherence condition

\[
\sqrt{n_1}\,\| L_t R_t^\top \|_{2,\infty} \vee \sqrt{n_2}\,\| R_t L_t^\top \|_{2,\infty} \le B,
\]

then ‖L_t R_t^⊤ − X⋆‖_F ≤ 1.5 dist(F_t, F⋆). In addition, if the step size obeys 0 < η ≤ 2/3, then the (t+1)-th iterate F_{t+1} of the ScaledPGD method in (26) of Algorithm 3 satisfies

\[
\operatorname{dist}(F_{t+1}, F_\star) \le (1 - 0.6\eta)\, \operatorname{dist}(F_t, F_\star),
\]


and the incoherence condition

\[
\sqrt{n_1}\,\| L_{t+1} R_{t+1}^\top \|_{2,\infty} \vee \sqrt{n_2}\,\| R_{t+1} L_{t+1}^\top \|_{2,\infty} \le B.
\]

As long as we can find an initialization that is close to the ground truth and satisfies the incoherence condition, Lemma 20 ensures that the iterates of ScaledPGD remain incoherent and converge linearly. The following lemma shows that such an initialization can be obtained via the spectral method.

Lemma 21 Suppose that X⋆ is μ-incoherent. Then, with overwhelming probability, the spectral initialization before projection,

\[
\tilde{F}_0 := \begin{bmatrix} U_0 \Sigma_0^{1/2} \\ V_0 \Sigma_0^{1/2} \end{bmatrix}
\]

in (25), satisfies

\[
\operatorname{dist}(\tilde{F}_0, F_\star) \le C_0 \left( \frac{\mu r \log(n_1 \vee n_2)}{p\sqrt{n_1 n_2}} + \sqrt{\frac{\mu r \log(n_1 \vee n_2)}{p\,(n_1 \wedge n_2)}} \right) 5\sqrt{r}\,\kappa\,\sigma_r(X_\star).
\]

Therefore, as long as p ≥ C μr²κ² log(n1 ∨ n2)/(n1 ∧ n2) for some sufficiently large constant C, the initial distance satisfies dist(F̃_0, F⋆) ≤ 0.02 σ_r(X⋆). One can then invoke Lemma 19 to see that F_0 = P_B(F̃_0) meets the requirements of Lemma 20, due to the non-expansiveness and incoherence properties of the projection operator. The proofs of the supporting lemmas can be found in Appendix E.

4. Numerical Experiments

In this section, we provide numerical experiments to corroborate our theoretical findings, with the code available at

https://github.com/Titan-Tong/ScaledGD.

The simulations are performed in Matlab with a 3.6 GHz Intel Xeon Gold 6244 CPU.

4.1 Comparison with vanilla GD

To begin, we compare the iteration complexity of ScaledGD with vanilla gradient descent (GD). The update rule of vanilla GD for solving (2) is given by

\[
\begin{aligned}
L_{t+1} &= L_t - \eta_{\mathrm{GD}}\, \nabla_L \mathcal{L}(L_t, R_t), \\
R_{t+1} &= R_t - \eta_{\mathrm{GD}}\, \nabla_R \mathcal{L}(L_t, R_t),
\end{aligned} \tag{32}
\]

where η_GD = η/σ_1(X⋆) stands for the step size of gradient descent. This choice is often recommended by the theory of vanilla GD Tu et al. (2016); Yi et al. (2016); Ma et al. (2019), and the scaling by σ_1(X⋆) is needed for its convergence. For ease of comparison, we fix η = 0.5 for both ScaledGD and vanilla GD (see Figure 4 for justification). Both algorithms start from the same spectral initialization. To avoid notational clutter, we work with square asymmetric matrices with n1 = n2 = n. We consider four low-rank matrix estimation tasks:

• Low-rank matrix sensing. The problem formulation is detailed in Section 2.2. Here, we collect m = 5nr measurements of the form y_k = ⟨A_k, X⋆⟩ + w_k, in which the measurement matrices A_k are generated with i.i.d. Gaussian entries with zero mean and variance 1/m, and w_k ~ N(0, σ_w²) are i.i.d. Gaussian noises.

• Robust PCA. The problem formulation is stated in Section 2.3. We generate the corruption as a sparse matrix S⋆ ∈ S_α with α = 0.1. More specifically, we generate a matrix with standard Gaussian entries and pass it through T_α[·] to obtain S⋆. The observation is Y = X⋆ + S⋆ + W, where W_{i,j} ~ N(0, σ_w²) are i.i.d. Gaussian noises.


[Figure 2 about here: four panels of relative error vs. iteration count. (a) Matrix sensing, n = 200, r = 10, m = 5nr; (b) Robust PCA, n = 1000, r = 10, α = 0.1; (c) Matrix completion, n = 1000, r = 10, p = 0.2; (d) Hankel matrix completion, n = 1000, r = 10, p = 0.2.]

Figure 2: The relative errors of ScaledGD and vanilla GD with respect to the iteration count under different condition numbers κ = 1, 5, 10, 20 for (a) matrix sensing, (b) robust PCA, (c) matrix completion, and (d) Hankel matrix completion.

• Matrix completion. The problem formulation is stated in Section 2.4. We assume random Bernoulli observations, where each entry of X⋆ is observed independently with probability p = 0.2. The observation is Y = P_Ω(X⋆ + W), where W_{i,j} ~ N(0, σ_w²) are i.i.d. Gaussian noises. Moreover, we perform the scaled gradient updates without projections.

• Hankel matrix completion. Briefly speaking, a Hankel matrix shares the same value along each skew-diagonal, and we aim at recovering a low-rank Hankel matrix from observing a few skew-diagonals Chen and Chi (2014); Cai et al. (2018). We assume random Bernoulli observations, where each skew-diagonal of X⋆ is observed independently with probability p = 0.2. The loss function is

\[
\mathcal{L}(L, R) = \frac{1}{2p} \left\| \mathcal{H}_\Omega(LR^\top - Y) \right\|_{\mathrm{F}}^2 + \frac{1}{2} \left\| (\mathcal{I} - \mathcal{H})(LR^\top) \right\|_{\mathrm{F}}^2, \tag{33}
\]

where I(·) denotes the identity operator, and the Hankel projection is defined as H(X) := Σ_{k=1}^{2n−1} ⟨H_k, X⟩ H_k, which maps X to its closest Hankel matrix. Here, the Hankel basis matrix H_k is the n × n matrix whose entries on the k-th skew-diagonal equal 1/√ω_k and all other entries equal 0, where ω_k is the length of the k-th skew-diagonal.


[Figure 3 about here: four panels of relative error vs. iteration count. (a) Matrix sensing, n = 200, r = 10, m = 5nr; (b) Robust PCA, n = 1000, r = 10, α = 0.1; (c) Matrix completion, n = 1000, r = 10, p = 0.2; (d) Hankel matrix completion, n = 1000, r = 10, p = 0.2.]

Figure 3: The relative errors of ScaledGD and vanilla GD with respect to the iteration count under the condition number κ = 10 and signal-to-noise ratios SNR = 40, 60, 80 dB for (a) matrix sensing, (b) robust PCA, (c) matrix completion, and (d) Hankel matrix completion.

entries as 0, where ωk is the length of the k-th skew diagonal. Note that X is a Hankel matrixif and only if (I − H)(X) = 0. The Hankel projection on the observation index set Ω is definedas HΩ(X) :=

∑k∈Ω〈Hk,X〉Hk. The observation is Y = HΩ(X? + W ), where W is a Hankel

matrix whose entries along each skew-diagonal are i.i.d. Gaussian noises N (0, σ2w).

For the first three problems, we generate the ground truth matrix X? ∈ Rn×n in the followingway. We first generate an n× r matrix with i.i.d. random signs, and take its r left singular vectorsas U?, and similarly for V?. The singular values are set to be linearly distributed from 1 to 1/κ.The ground truth is then defined as X? = U?Σ?V

>? which has the specified condition number κ

and rank r. For Hankel matrix completion, we generate X? as an n× n Hankel matrix with entriesgiven as

(X?)i,j =

r∑`=1

σ`ne2πı(i+j−2)f` , i, j = 1, . . . , n,

18

Page 19: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

where f`, ` = 1, . . . , r are randomly chosen from 1/n, 2/n, . . . , 1, and σ` are linearly distributed from1 to 1/κ. The Vandermonde decomposition lemma tells that X? has rank r and singular values σ`,` = 1, . . . , r.

We first illustrate the convergence performance under noise-free observations, i.e. σw = 0. Weplot the relative reconstruction error ‖Xt −X?‖F/‖X?‖F with respect to the iteration count t inFigure 2 for the four problems under different condition numbers κ = 1, 5, 10, 20. For all thesemodels, we can see that ScaledGD has a convergence rate independent of κ, with all curves almostoverlay on each other. Under good conditioning κ = 1, ScaledGD converges at the same rate asvanilla GD; under ill conditioning, i.e. when κ is large, ScaledGD converges much faster than vanillaGD and leads to significant computational savings.

We next move to demonstrate that ScaledGD is robust to small additive noises. Denote the signal-to-noise ratio as SNR := 10 log10

‖X?‖2Fn2σ2

win dB. We plot the reconstruction error ‖Xt−X?‖F/‖X?‖F

with respect to the iteration count t in Figure 3 under the condition number κ = 10 and variousSNR = 40, 60, 80dB. We can see that ScaledGD and vanilla GD achieve the same statistical er-ror eventually, but ScaledGD converges much faster. In addition, the convergence speeds are notinfluenced by the noise levels.

Careful readers might wonder how sensitivity our comparisons are with respect to the choice ofstep sizes. To address this, we illustrate the convergence speeds of both ScaledGD and vanilla GDunder different step sizes η for matrix completion (under the same setting as Figure 2 (c)), wheresimilar plots can be obtained for other problems as well. We run both algorithms for at most 80iterations, and terminate if the relative error exceeds 102 (which happens if the step size is too largeand the algorithm diverges). Figure 4 plots the relative error with respect to the step size η for bothalgorithms, where we can see that ScaledGD outperforms vanilla GD over a large range of step sizes,even under optimized values for performance. Hence, our choice of η = 0.5 in previous experimentsrenders a typical comparison between ScaledGD and vanilla GD.

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2

10-15

10-10

10-5

100

Re

lative

err

or

Figure 4: The relative errors of ScaledGD and vanilla GD after 80 iterations with respect to differentstep sizes η from 0.1 to 1.2, for matrix completion with n = 1000, r = 10, p = 0.2.

4.2 Run time comparisons

We now compare the run time of ScaledGD with vanilla GD and alternating minimization (AltMin)Jain et al. (2013). Specifically, for matrix sensing, alternating minimization (AltMinSense) updatesthe factors alternatively as

Lt+1 = argminL

∥∥A(LR>t )− y∥∥2

2,

19

Page 20: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

0 200 400 600 800 1000

Iteration count

10-14

10-12

10-10

10-8

10-6

10-4

10-2

100

Re

lative

err

or

0 50 100 150 200 250 300 350 400

Run time (seconds)

10-14

10-12

10-10

10-8

10-6

10-4

10-2

100

Re

lative

err

or

(a) iteration count with r = 10 (b) run time with r = 10

0 200 400 600 800 1000

Iteration count

10-14

10-12

10-10

10-8

10-6

10-4

10-2

100

Re

lative

err

or

0 100 200 300 400 500 600 700 800

Run time (seconds)

10-14

10-12

10-10

10-8

10-6

10-4

10-2

100

Re

lative

err

or

(c) iteration count with r = 20 (d) run time with r = 20

Figure 5: The relative errors of ScaledGD, vanilla GD and AltMin with respect to the iterationcount and run time (in seconds) under different condition numbers κ = 1, 5, 20 for matrixsensing with n = 200, and m = 5nr. (a, b): r = 10; (c, d): r = 20.

Rt+1 = argminR

∥∥A(Lt+1R>)− y

∥∥2

2,

which corresponds to solving two least-squares problems. For matrix completion, the update rule ofalternating minimization proceeds as

Lt+1 = argminL

∥∥PΩ(LR>t − Y )∥∥2

2,

Rt+1 = argminR

∥∥PΩ(Lt+1R> − Y )

∥∥2

2,

which can be implemented more efficiently since each row of L (resp. R) can be updated indepen-dently via solving a much smaller least-squares problem due to the decomposable structure of theobjective function. It is worth noting that, to the best of our knowledge, this most natural variantof alternating minimization for matrix completion still eludes from a provable performance guar-antee, nonetheless, we choose it to compare against due to its popularity and excellent empiricalperformance.

Figure 5 plots the relative errors of ScaledGD, vanilla GD and alternating minimization (AltMin)with respect to the iteration count and run time (in seconds) under different condition numbersκ = 1, 5, 20; and similarly, Figure 6 plots the corresponding results for matrix completion. It

20

Page 21: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

0 200 400 600 800 1000

Iteration count

10-14

10-12

10-10

10-8

10-6

10-4

10-2

100

Re

lative

err

or

0 0.5 1 1.5 2 2.5 3

Run time (seconds)

10-14

10-12

10-10

10-8

10-6

10-4

10-2

100

Re

lative

err

or

(a) iteration count with r = 10 (b) run time with r = 10

0 200 400 600 800 1000

Iteration count

10-14

10-12

10-10

10-8

10-6

10-4

10-2

100

Re

lative

err

or

0 2 4 6 8 10 12

Run time (seconds)

10-14

10-12

10-10

10-8

10-6

10-4

10-2

100

Re

lative

err

or

(c) iteration count with r = 50 (d) run time with r = 50

Figure 6: The relative errors of ScaledGD, vanilla GD and AltMin with respect to the iterationcount and run time (in seconds) under different condition numbers κ = 1, 5, 20 for matrixcompletion with n = 1000, and p = 0.2. (a, b): r = 10; (c, d): r = 50.

can be seen that, both ScaledGD and AltMin admit a convergence rate that is independent ofthe condition number, where the per-iteration complexity of AltMin is much higher than that ofScaledGD. As expected, the run time of ScaledGD only adds a minimal overhead to vanilla GDwhile being much more robust to ill-conditioning. Noteworthily, AltMin takes much more time andbecomes significantly slower than ScaledGD when the rank r is larger. Nonetheless, we emphasizethat since the run time is impacted by many factors in terms of problem parameters as well asimplementation details, our purpose is to demonstrate the competitive performance of ScaledGDover alternatives, rather than claiming it as the state-of-the-art.

5. Conclusions

This paper proposes scaled gradient descent (ScaledGD) for factored low-rank matrix estimation,which maintains the low per-iteration computational complexity of vanilla gradient descent, butoffers significant speed-up in terms of the convergence rate with respect to the condition numberκ of the low-rank matrix. In particular, we rigorously establish that for low-rank matrix sensing,robust PCA, and matrix completion, to reach ε-accuracy, ScaledGD only takes O(log(1/ε)) iterationswithout the dependency on the condition number when initialized via the spectral method, under

21

Page 22: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

standard assumptions. The key to our analysis is the introduction of a new distance metric thattakes into account the preconditioning and unbalancedness of the low-rank factors, and we havedeveloped new tools to analyze the trajectory of ScaledGD under this new metric. This work opensup many venues for future research, as we discuss below.

• Improved analysis. In this paper, we have focused on establishing the fast local convergence rate.It is interesting to study if the theory developed herein can be further strengthened in terms ofsample complexity and the size of basin of attraction. For matrix completion, it will be interestingto see if a similar guarantee continues to hold in the absence of the projection, which will generalizerecent works Ma et al. (2019); Chen et al. (2020a) that successfully removed these projections forvanilla gradient descent.

• Other low-rank recovery problems. Besides the problems studied herein, there are many otherapplications involving the recovery of an ill-conditioned low-rank matrix, such as robust PCAwith missing data, quadratic sampling, and so on. It is of interest to establish fast convergencerates of ScaledGD that are independent of the condition number for these problems as well. Inaddition, it is worthwhile to explore if a similar preconditioning trick can be useful to problemsbeyond low-rank matrix estimation. One recent attempt is to generalize ScaledGD for low-ranktensor estimation Tong et al. (2021b).

• Acceleration schemes? As it is evident from our analysis of the general loss case, ScaledGD maystill converge slowly when the loss function is ill-conditioned over low-rank matrices, i.e. κf islarge. In this case, it might be of interest to combine techniques such as momentum Kyrillidisand Cevher (2012) from the optimization literature to further accelerate the convergence. In ourcompanion paper Tong et al. (2021a), we have extended ScaledGD to nonsmooth formulations,which possess better curvatures than their smooth counterparts for certain problems.

Acknowledgments

The work of T. Tong and Y. Chi is supported in part by ONR under the grants N00014-18-1-2142and N00014-19-1-2404, by ARO under the grant W911NF-18-1-0303, and by NSF under the grantsCAREER ECCS-1818571, CCF-1806154 and CCF-1901199.

Appendix A. Technical Lemmas

This section gathers several useful lemmas that will be used in the appendix. Throughout all lemmas,we use X? to denote the ground truth low-rank matrix, with its compact SVD as X? = U?Σ?V

>? ,

and the stacked factor matrix is defined as F? =

[L?R?

]=

[U?Σ

1/2?

V?Σ1/2?

].

A.1 New distance metric

We begin with the investigation of the new distance metric (10), where the matrix Q that attainsthe infimum, if exists, is called the optimal alignment matrix between F and F?; see (11). Noticethat (10) involves a minimization problem over an open set (the set of invertible matrices). Hencethe minimizer, i.e. the optimal alignment matrix between F and F? is not guaranteed to be attained.Fortunately, a simple sufficient condition guarantees the existence of the minimizer; see the lemmabelow.

22

Page 23: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

Lemma 22 Fix any factor matrix F =

[LR

]∈ R(n1+n2)×r. Suppose that

dist(F ,F?) =

√inf

Q∈GL(r)

∥∥∥(LQ−L?) Σ1/2?

∥∥∥2

F+∥∥∥(RQ−> −R?) Σ

1/2?

∥∥∥2

F< σr(X?), (34)

then the minimizer of the above minimization problem is attained at some Q ∈ GL(r), i.e. theoptimal alignment matrix Q between F and F? exists.

Proof In view of the condition (34) and the definition of infimum, one knows that there must exista matrix Q ∈ GL(r) such that√∥∥∥(LQ−L?

1/2?

∥∥∥2

F+∥∥∥(RQ−> −R?

1/2?

∥∥∥2

F≤ εσr(X?),

for some ε obeying 0 < ε < 1. It further implies that∥∥∥(LQ−L?)Σ−1/2?

∥∥∥op∨∥∥∥(RQ−> −R?

)Σ−1/2?

∥∥∥op≤ ε.

Invoke Weyl’s inequality |σr(A) − σr(B)| ≤ ‖A −B‖op, and use that σr(L?Σ−1/2? ) = σr(U?) = 1

to obtain

σr(LQΣ−1/2? ) ≥ σr(L?Σ−1/2

? )−∥∥∥(LQ−L?

)Σ−1/2?

∥∥∥op≥ 1− ε. (35)

In addition, it is straightforward to verify that

infQ∈GL(r)

∥∥∥(LQ−L?) Σ1/2?

∥∥∥2

F+∥∥∥(RQ−> −R?

1/2?

∥∥∥2

F(36)

= infH∈GL(r)

∥∥∥(LQH −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(RQ−>H−> −R?

1/2?

∥∥∥2

F. (37)

Indeed, if the minimizer of the second optimization problem (cf. (37)) is attained at some H, thenQH must be the minimizer of the first problem (36). Therefore, from now on, we focus on provingthat the minimizer of the second problem (37) is attained at some H. In view of (36) and (37), onehas

infH∈GL(r)

∥∥∥(LQH −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(RQ−>H−> −R?

1/2?

∥∥∥2

F

≤∥∥∥(LQ−L?

1/2?

∥∥∥2

F+∥∥∥(RQ−> −R?

1/2?

∥∥∥2

F,

Clearly, for any QH to yield a smaller distance than Q, H must obey√∥∥∥(LQH −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(RQ−>H−> −R?

1/2?

∥∥∥2

F≤ εσr(X?).

It further implies that∥∥∥(LQH −L?)Σ−1/2?

∥∥∥op∨∥∥∥(RQ−>H−> −R?

)Σ−1/2?

∥∥∥op≤ ε.

Invoke Weyl’s inequality |σ1(A) − σ1(B)| ≤ ‖A −B‖op, and use that σ1(L?Σ−1/2? ) = σ1(U?) = 1

to obtain

σ1(LQHΣ−1/2? ) ≤ σ1(L?Σ

−1/2? ) +

∥∥∥(LQH −L?)Σ−1/2?

∥∥∥op≤ 1 + ε. (38)

23

Page 24: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

Combine (35) and (38), and use the relation σr(A)σ1(B) ≤ σ1(AB) to obtain

σr(LQΣ−1/2? )σ1(Σ

1/2? HΣ

−1/2? ) ≤ σ1(LQHΣ

−1/2? ) ≤ 1 + ε

1− εσr(LQΣ

−1/2? ).

As a result, one has σ1(Σ1/2? HΣ

−1/2? ) ≤ 1+ε

1−ε .

Similarly, one can show that σ1(Σ1/2? H−>Σ

−1/2? ) ≤ 1+ε

1−ε , equivalently, σr(Σ1/2? HΣ

−1/2? ) ≥ 1−ε

1+ε .Combining the above two arguments reveals that the minimization problem (37) is equivalent to theconstrained problem:

minimizeH∈GL(r)

∥∥∥(LQH −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(RQ−>H−> −R?

1/2?

∥∥∥2

F

s.t.1− ε1 + ε

≤ σr(Σ1/2? HΣ

−1/2? ) ≤ σ1(Σ

1/2? HΣ

−1/2? ) ≤ 1 + ε

1− ε.

Notice that this is a continuous optimization problem over a compact set. Apply the Weierstrassextreme value theorem to finish the proof.

With the existence of the optimal alignment matrix in place, the following lemma provides thefirst-order necessary condition for the minimizer.

Lemma 23 For any factor matrix F =

[LR

]∈ R(n1+n2)×r, suppose that the optimal alignment

matrix

Q = argminQ∈GL(r)

∥∥∥(LQ−L?)Σ1/2?

∥∥∥2

F+∥∥∥(RQ−> −R?)Σ

1/2?

∥∥∥2

F

between F and F? exists, then Q obeys

(LQ)>(LQ−L?)Σ? = Σ?(RQ−> −R?)>RQ−>. (39)

Proof Expand the squares in the definition of Q to obtain

Q = argminQ∈GL(r)

tr((LQ−L?)

>(LQ−L?)Σ?

)+ tr

((RQ−> −R?)

>(RQ−> −R?)Σ?

).

Clearly, the first order necessary condition (i.e. the gradient is zero) yields

2L>(LQ−L?)Σ? − 2Q−>Σ?(RQ−> −R?)>RQ−> = 0,

which implies the optimal alignment criterion (39).

Last but not least, we connect the newly proposed distance to the usual Frobenius norm inLemma 24, the proof of which is a slight modification to (Tu et al., 2016, Lemma 5.4) and (Ge et al.,2017, Lemma 41).

Lemma 24 For any factor matrix F =

[LR

]∈ R(n1+n2)×r, the distance between F and F? satisfies

dist(F ,F?) ≤(√

2 + 1)1/2

‖LR> −X?‖F.

24

Page 25: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

Proof Suppose that X := LR> has compact SVD as X = UΣV >. Without loss of generality, we

can assume that F =

[UΣ1/2

V Σ1/2

], since any factorization of LR> yields the same distance. Introduce

two auxiliary matrices F :=

[UΣ1/2

−V Σ1/2

]and F? :=

[U?Σ

1/2?

−V?Σ1/2?

]. Apply the dilation trick to obtain

2

[0 X

X> 0

]= FF> − F F>, 2

[0 X?

X>? 0

]= F?F

>? − F?F

>? .

As a result, the squared Frobenius norm of X −X? is given by

8‖X −X?‖2F =∥∥FF> − F F> − F?F

>? + F?F

>?

∥∥2

F

=∥∥FF> − F?F

>?

∥∥2

F+∥∥F F> − F?F

>?

∥∥2

F− 2 tr

((FF> − F?F

>? )(F F> − F?F

>? ))

= 2∥∥FF> − F?F

>?

∥∥2

F+ 2‖F>F?‖2F + 2‖F>? F ‖2F

≥ 2∥∥FF> − F?F

>?

∥∥2

F,

where we use the facts that∥∥FF> − F?F

>?

∥∥2

F=∥∥F F> − F?F

>?

∥∥2

Fand F>F = F>? F? = 0.

Let O := sgn(F>F?)5 be the optimal orthonormal alignment matrix between F and F?. Denote

∆ := FO − F?. Follow the same argument as (Tu et al., 2016, Lemma 5.14) and (Ge et al., 2017,Lemma 41) to obtain

4‖X −X?‖2F ≥∥∥F?∆> + ∆F>? + ∆∆>

∥∥2

F

= tr(2F>? F?∆

>∆ + (∆>∆)2 + 2(F>? ∆)2 + 4F>? ∆∆>∆)

= tr(

2F>? F?∆>∆ + (∆>∆ +

√2F>? ∆)2 + (4− 2

√2)F>? ∆∆>∆

)= tr

(2(√

2− 1)F>? F?∆>∆ + (∆>∆ +

√2F>? ∆)2 + (4− 2

√2)F>? FO∆>∆

)≥ tr

(4(√

2− 1)Σ?∆>∆

)= 4(√

2− 1)∥∥∥(FO − F?)Σ

1/2?

∥∥∥2

F,

where the last inequality follows from the facts that F>? F? = 2Σ? and that F>? FO is positivesemi-definite. Therefore we obtain∥∥∥(FO − F?)Σ

1/2?

∥∥∥F≤(√

2 + 1)1/2

‖X −X?‖F.

This in conjunction with dist(F ,F?) ≤ ‖(FO − F?)Σ1/2? ‖F yields the claimed result.

A.2 Matrix perturbation bounds

Lemma 25 For any L ∈ Rn1×r,R ∈ Rn2×r, denote ∆L := L − L? and ∆R := R −R?. Supposethat ‖∆LΣ

−1/2? ‖op ∨ ‖∆RΣ

−1/2? ‖op < 1, then one has∥∥∥L(L>L)−1Σ

1/2?

∥∥∥op≤ 1

1− ‖∆LΣ−1/2? ‖op

; (40a)∥∥∥R(R>R)−1Σ1/2?

∥∥∥op≤ 1

1− ‖∆RΣ−1/2? ‖op

; (40b)

5. Let ASB> be the SVD of F>F?, then the matrix sign is sgn(F>F?) := AB>.

25

Page 26: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

∥∥∥L(L>L)−1Σ1/2? −U?

∥∥∥op≤√

2‖∆LΣ−1/2? ‖op

1− ‖∆LΣ−1/2? ‖op

; (40c)

∥∥∥R(R>R)−1Σ1/2? − V?

∥∥∥op≤√

2‖∆RΣ−1/2? ‖op

1− ‖∆RΣ−1/2? ‖op

. (40d)

Proof We only prove claims (40a) and (40c) on the factor L, while the claims on the factor Rfollow from a similar argument. We start to prove (40a). Notice that∥∥∥L(L>L)−1Σ

1/2?

∥∥∥op

=1

σr(LΣ−1/2? )

.

In addition, invoke Weyl’s inequality to obtain

σr(LΣ−1/2? ) ≥ σr(L?Σ−1/2

? )− ‖∆LΣ−1/2? ‖op = 1− ‖∆LΣ

−1/2? ‖op,

where we have used the fact that U? = L?Σ−1/2? satisfies σr(U?) = 1. Combine the preceding two

relations to prove (40a).We proceed to prove (40c). Combine L>? U? = Σ

1/2? and (In1 − L(L>L)−1L>)L = 0 to obtain

the decomposition

L(L>L)−1Σ1/2? −U? = −L(L>L)−1∆>LU? + (In1

−L(L>L)−1L>)∆LΣ−1/2? .

The fact that L(L>L)−1∆>LU? and (In1−L(L>L)−1L>)∆LΣ

−1/2? are orthogonal implies∥∥∥L(L>L)−1Σ

1/2? −U?

∥∥∥2

op≤∥∥L(L>L)−1∆>LU?

∥∥2

op+∥∥∥(In1 −L(L>L)−1L>)∆LΣ

−1/2?

∥∥∥2

op

≤∥∥∥L(L>L)−1Σ

1/2?

∥∥∥2

op‖∆LΣ

−1/2? ‖2op +

∥∥In1−L(L>L)−1L>

∥∥2

op‖∆LΣ

−1/2? ‖2op

≤‖∆LΣ

−1/2? ‖2op

(1− ‖∆LΣ−1/2? ‖op)2

+ ‖∆LΣ−1/2? ‖2op

≤2‖∆LΣ

−1/2? ‖2op

(1− ‖∆LΣ−1/2? ‖op)2

,

where we have used (40a) and the fact that ‖In1 −L(L>L)−1L>‖op ≤ 1 in the third line.

Lemma 26 For any L ∈ Rn1×r,R ∈ Rn2×r, denote ∆L := L − L? and ∆R := R −R?, then onehas

‖LR> −X?‖F ≤ ‖∆LR>? ‖F + ‖L?∆>R‖F + ‖∆L∆>R‖F

≤(

1 +1

2(‖∆LΣ

−1/2? ‖op ∨ ‖∆RΣ

−1/2? ‖op)

)(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

).

Proof In light of the decomposition LR> − X? = ∆LR>? + L?∆

>R + ∆L∆>R and the triangle

inequality, one has

‖LR> −X?‖F ≤ ‖∆LR>? ‖F + ‖L?∆>R‖F + ‖∆L∆>R‖F

= ‖∆LΣ1/2? ‖F + ‖∆RΣ

1/2? ‖F + ‖∆L∆>R‖F,

26

Page 27: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

where we have used the facts that

‖∆LR>? ‖F = ‖∆LΣ

1/2? V >? ‖F = ‖∆LΣ

1/2? ‖F, and ‖L?∆>R‖F = ‖U?Σ

1/2? ∆>R‖F = ‖∆RΣ

1/2? ‖F.

This together with the simple upper bound

‖∆L∆>R‖F =1

2‖∆LΣ

1/2? (∆RΣ

−1/2? )>‖F +

1

2‖∆LΣ

−1/2? (∆RΣ

1/2? )>‖F

≤ 1

2‖∆LΣ

1/2? ‖F‖∆RΣ

−1/2? ‖op +

1

2‖∆LΣ

−1/2? ‖op‖∆RΣ

1/2? ‖F

≤ 1

2(‖∆LΣ

−1/2? ‖op ∨ ‖∆RΣ

−1/2? ‖op)

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)finishes the proof.

Lemma 27 For any L ∈ Rn1×r,R ∈ Rn2×r and any invertible matrices Q, Q ∈ GL(r), supposethat ‖(LQ−L?)Σ

−1/2? ‖op ∨ ‖(RQ−> −R?)Σ

−1/2? ‖op < 1, then one has

∥∥∥Σ1/2? Q−1QΣ

1/2? −Σ?

∥∥∥op≤ ‖R(Q−> −Q−>)Σ

1/2? ‖op

1− ‖(RQ−> −R?)Σ−1/2? ‖op

;

∥∥∥Σ1/2? Q>Q−>Σ

1/2? −Σ?

∥∥∥op≤ ‖L(Q−Q)Σ

1/2? ‖op

1− ‖(LQ−L?)Σ−1/2? ‖op

.

Proof Insert R>R(R>R)−1, and use the relation ‖AB‖op ≤ ‖A‖op‖B‖op to obtain∥∥∥Σ1/2? Q−1QΣ

1/2? −Σ?

∥∥∥op

=∥∥∥Σ1/2

? (Q−1 −Q−1)R>R(R>R)−1QΣ1/2?

∥∥∥op

≤∥∥∥R(Q−> −Q−>)Σ

1/2?

∥∥∥op

∥∥∥R(R>R)−1QΣ1/2?

∥∥∥op

=∥∥∥R(Q−> −Q−>)Σ

1/2?

∥∥∥op

∥∥∥RQ−>((RQ−>)>RQ−>)−1Σ1/2?

∥∥∥op

≤ ‖R(Q−> −Q−>)Σ1/2? ‖op

1− ‖(RQ−> −R?)Σ−1/2? ‖op

,

where the last line uses Lemma 25.Similarly, insert L>L(L>L)−1, and use the relation ‖AB‖op ≤ ‖A‖op‖B‖op to obtain∥∥∥Σ1/2

? Q>Q−>Σ1/2? −Σ?

∥∥∥op

=∥∥∥Σ1/2

? (Q> −Q>)L>L(L>L)−1Q−>Σ1/2?

∥∥∥op

≤∥∥∥L(Q−Q)Σ

1/2?

∥∥∥op

∥∥∥L(L>L)−1Q−>Σ1/2?

∥∥∥op

=∥∥∥L(Q−Q)Σ

1/2?

∥∥∥op

∥∥∥LQ((LQ)>LQ)−1Σ1/2?

∥∥∥op

≤ ‖L(Q−Q)Σ1/2? ‖op

1− ‖(LQ−L?)Σ−1/2? ‖op

,

where the last line uses Lemma 25.

27

Page 28: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

A.3 Partial Frobenius norm

We introduce the partial Frobenius norm

‖X‖F,r :=

√√√√ r∑i=1

σ2i (X) = ‖Pr(X)‖F (41)

as the `2 norm of the vector composed of the top-r singular values of the matrix X, or equivalentlyas the Frobenius norm of the rank-r approximation Pr(X) defined in (5). It is straightforwardto verify that ‖ · ‖F,r is a norm; see also Mazeika (2016). The following lemma provides severalequivalent and useful characterizations of this partial Frobenius norm.

Lemma 28 For any X ∈ Rn1×n2 , one has

‖X‖F,r = maxV ∈Rn2×r:V >V =Ir

‖XV ‖F (42a)

= maxX∈Rn1×n2 :‖X‖F≤1,rank(X)≤r

|〈X, X〉| (42b)

= maxR∈Rn2×r:‖R‖op≤1

‖XR‖F. (42c)

Proof The first representation (42a) follows immediately from the extremal partial trace identity;see (Mazeika, 2016, Proposition 4.4), by noticing the following relation

r∑i=1

σ2i (X) = max

V⊆Rn2 :dim(V)=rtr(X>X | V

)= max

V ∈Rn2×r:V >V =Ir

‖XV ‖2F.

Here the partial trace over a vector space V is defined as

tr(X>X | V) :=

r∑i=1

v>i X>Xvi,

where vi1≤i≤r is any orthonormal basis of V. The partial trace is invariant to the choice oforthonormal basis and therefore well-defined.

To prove the second representation (42b), for any X ∈ Rn1×n2 obeying rank(X) ≤ r and‖X‖F ≤ 1, denoting X = UΣV > as its compact SVD, one has

|〈X, X〉| = |〈X, UΣV >〉| = |〈XV , UΣ〉| ≤ ‖XV ‖F‖UΣ‖F ≤ ‖X‖F,r,

where the last inequality follows from (42a). In addition, the maximum in (42b) is attained atX = Pr(X)/‖Pr(X)‖F.

To prove the third representation (42c), for any R ∈ Rn2×r obeying ‖R‖op ≤ 1, combine thevariational representation of the Frobenius norm and (42b) to obtain

‖XR‖F = maxL∈Rn1×n2 :‖L‖F≤1

|〈XR, L〉|

= maxL∈Rn1×n2 :‖L‖F≤1

|〈X, LR>〉| ≤ ‖X‖F,r,

where the last inequality follows from (42b). In addition, the maximum in (42c) is attained atR = V , where V denotes the top-r right singular vectors of X.

28

Page 29: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

Remark 29 For self-completeness, we also provide a detailed proof of the first representation (42a).This proof is inductive on r. When r = 1, we have

σ1(X) = ‖Xv1‖2 = maxv∈Rn2 :‖v‖2=1

‖Xv‖2,

where v1 denotes the top right singular vector of X. Assume that the statement holds for ‖ · ‖F,r−1.Now consider ‖ · ‖F,r. For any V ∈ Rn2×r such that V >V = Ir, we can first pick v2, . . . , vr as aset of orthonormal vectors in the column space of V that are orthogonal to v1, and then pick v1 viathe Gram-Schmidt process, so that viri=1 provides an orthonormal basis of the column space of V .Further, by the orthogonality of V , there exists an orthonormal matrix O such that

V = [v1, . . . , vr]O.

Combining this formula with the induction hypothesis yields

‖XV ‖2F = ‖X[v1, . . . , vr]‖2F= ‖Xv1‖22 + ‖X[v2, . . . , vr]‖2F= ‖Xv1‖22 + ‖(X − P1(X))[v2, . . . , vr]‖2F≤ σ2

1(X) + ‖X − P1(X)‖2F,r−1

=

r∑i=1

σ2i (X) = ‖X‖2F,r,

where the first line holds since O is orthonormal, the third line holds since P1(X)[v2, . . . , vr] = 0,the fourth line follows from the induction hypothesis, and the last line follows from the definition(41). In addition, the maximum in (42a) is attained at V = V , where V denotes the top-r rightsingular vectors of X. This finishes the proof.

Recall that Pr(X) denotes the best rank-r approximation of X under the Frobenius norm. Itturns out that Pr(X) is also the best rank-r approximation of X under the partial Frobenius norm‖ · ‖F,r. This claim is formally stated below; see also (Mazeika, 2016, Theorem 4.21).

Lemma 30 Fix any X ∈ Rn1×n2 and recall the definition of Pr(X) in (5). One has

Pr(X) = argminX∈Rn1×n2 :rank(X)≤r

‖X − X‖F,r.

Proof For any X of rank at most r, invoke Weyl’s inequality to obtain σr+i(X) ≤ σi(X − X) +

σr+1(X) = σi(X − X), for i = 1, . . . , r. Thus one has

‖X − Pr(X)‖2F,r =

r∑i=1

σ2r+i(X) ≤

r∑i=1

σ2i (X − X) = ‖X − X‖2F,r.

The proof is finished by observing that the rank of Pr(X) is at most r.

Appendix B. Proof for Low-Rank Matrix Factorization

B.1 Proof of Proposition 12

The gradients of L(F ) in (29) with respect to L and R are given as

∇LL(F ) = (LR> −X?)R, ∇RL(F ) = (LR> −X?)>L,

29

Page 30: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

which can be used to compute the Hessian with respect to L and R. Writing for the vectorizedvariables, the Hessians are given as

∇2L,LL(F ) = (R>R)⊗ In1

, ∇2R,RL(F ) = (L>L)⊗ In2

.

Viewed in the vectorized form, the ScaledGD update in (3) can be rewritten as

vec(Lt+1) = vec(Lt)− η((R>t Rt)−1 ⊗ In1

) vec((LtR>t −X?)Rt)

= vec(Lt)− η(∇2L,LL(Ft))

−1 vec(∇LL(Ft)),

vec(Rt+1) = vec(Rt)− η((L>t Lt)−1 ⊗ In2) vec((LtR

>t −X?)

>Lt)

= vec(Rt)− η(∇2R,RL(Ft))

−1 vec(∇RL(Ft)).

B.2 Proof of Theorem 13

The proof is inductive in nature. More specifically, we intend to show that for all t ≥ 0,

1. dist(Ft,F?) ≤ (1− 0.7η)t dist(F0,F?) ≤ 0.1(1− 0.7η)tσr(X?), and

2. the optimal alignment matrix Qt between Ft and F? exists.

For the base case, i.e. t = 0, the first induction hypothesis trivially holds, while the second alsoholds true in view of Lemma 22 and the assumption that dist(F0,F?) ≤ 0.1σr(X?). We thereforeconcentrate on the induction step. Suppose that the t-th iterate Ft obeys the aforementionedinduction hypotheses. Our goal is to show that Ft+1 continues to satisfy those.

For notational convenience, denote L := LtQt, R := RtQ−>t , ∆L := L − L?, ∆R := R −R?,

and ε := 0.1. By the definition of dist(Ft+1,F?), one has

dist2(Ft+1,F?) ≤∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F, (43)

where we recall that Qt is the optimal alignment matrix between Ft and F?. Utilize the ScaledGDupdate rule (30) and the decomposition LR> −X? = ∆LR

> + L?∆>R to obtain

(Lt+1Qt −L?)Σ1/2? =

(L− η(LR> −X?)R(R>R)−1 −L?

1/2?

=(∆L − η(∆LR

> + L?∆>R)R(R>R)−1

1/2?

= (1− η)∆LΣ1/2? − ηL?∆>RR(R>R)−1Σ

1/2? .

As a result, one can expand the first square in (43) as∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F= (1− η)2 tr

(∆LΣ?∆

>L

)− 2η(1− η) tr

(L?∆

>RR(R>R)−1Σ?∆

>L

)︸ ︷︷ ︸M1

+ η2∥∥∥L?∆>RR(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸M2

. (44)

The first term tr(∆LΣ?∆>L ) is closely related to dist(Ft,F?), and hence our focus will be on relating

M1 and M2 to dist(Ft,F?). We start with the term M1. Since L and R are aligned with L? andR?, Lemma 23 tells that Σ?∆

>LL = R>∆RΣ?. This together with L? = L −∆L allows us to

rewrite M1 as

M1 = tr(R(R>R)−1Σ?∆

>LL?∆

>R

)30

Page 31: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

= tr(R(R>R)−1Σ?∆

>LL∆>R

)− tr

(R(R>R)−1Σ?∆

>L∆L∆>R

)= tr

(R(R>R)−1R>∆RΣ?∆

>R

)− tr

(R(R>R)−1Σ?∆

>L∆L∆>R

).

Moving on to M2, we can utilize the fact L>? L? = Σ? and the decomposition Σ? = R>R− (R>R−Σ?) to obtain

M2 = tr(R(R>R)−1Σ?(R

>R)−1R>∆RΣ?∆>R

)= tr

(R(R>R)−1R>∆RΣ?∆

>R

)− tr

(R(R>R)−1(R>R−Σ?)(R

>R)−1R>∆RΣ?∆>R

).

Putting M1 and M2 back to (44) yields∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F= (1− η)2 tr

(∆LΣ?∆

>L

)− η(2− 3η) tr

(R(R>R)−1R>∆RΣ?∆

>R

)︸ ︷︷ ︸F1

+ 2η(1− η) tr(R(R>R)−1Σ?∆

>L∆L∆>R

)︸ ︷︷ ︸F2

− η2 tr(R(R>R)−1(R>R−Σ?)(R

>R)−1R>∆RΣ?∆>R

)︸ ︷︷ ︸F3

.

In what follows, we will control the three terms F1,F2 and F3 separately.

1. Notice that F1 is the inner product of two positive semi-definite matrices R(R>R)−1R> and∆RΣ?∆

>R. Consequently we have F1 ≥ 0.

2. To control F2, we need certain control on ‖∆LΣ−1/2? ‖op and ‖∆RΣ

−1/2? ‖op. The first induction

hypothesis

dist(Ft,F?) =

√‖∆LΣ

−1/2? Σ?‖2F + ‖∆RΣ

−1/2? Σ?‖2F ≤ εσr(X?)

together with the relation ‖AB‖F ≥ ‖A‖Fσr(B) tells that√‖∆LΣ

−1/2? ‖2F + ‖∆RΣ

−1/2? ‖2F σr(X?) ≤ εσr(X?).

In light of the relation ‖A‖op ≤ ‖A‖F, this further implies

‖∆LΣ−1/2? ‖op ∨ ‖∆RΣ

−1/2? ‖op ≤ ε. (45)

Invoke Lemma 25 to see ∥∥∥R(R>R)−1Σ1/2?

∥∥∥op≤ 1

1− ε.

With these consequences, one can bound |F2| by

|F2| =∣∣∣ tr(Σ

−1/2? ∆>RR(R>R)−1Σ?∆

>L∆LΣ

1/2?

) ∣∣∣≤∥∥∥Σ−1/2

? ∆>RR(R>R)−1Σ1/2?

∥∥∥op

tr(Σ

1/2? ∆>L∆LΣ

1/2?

)≤ ‖∆RΣ

−1/2? ‖op

∥∥∥R(R>R)−1Σ1/2?

∥∥∥op

tr(∆LΣ?∆

>L

)≤ ε

1− εtr(∆LΣ?∆

>L

).

31

Page 32: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

3. Similarly, one can bound |F3| by

|F3| ≤∥∥R(R>R)−1(R>R−Σ?)(R

>R)−1R>∥∥op

tr(∆RΣ?∆

>R

)≤∥∥∥R(R>R)−1Σ

1/2?

∥∥∥2

op

∥∥∥Σ−1/2? (R>R−Σ?)Σ

−1/2?

∥∥∥op

tr(∆RΣ?∆

>R

)≤ 1

(1− ε)2

∥∥∥Σ−1/2? (R>R−Σ?)Σ

−1/2?

∥∥∥op

tr(∆RΣ?∆

>R

).

Further notice that∥∥∥Σ−1/2? (R>R−Σ?)Σ

−1/2?

∥∥∥op

=∥∥∥Σ−1/2

? (R>? ∆R + ∆>RR? + ∆>R∆R)Σ−1/2?

∥∥∥op

≤ 2‖∆RΣ−1/2? ‖op + ‖∆RΣ

−1/2? ‖2op

≤ 2ε+ ε2.

Take the preceding two bounds together to arrive at

|F3| ≤2ε+ ε2

(1− ε)2tr(∆RΣ?∆

>R

).

Combining the bounds for F1,F2,F3, one has∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F=∥∥∥(1− η)∆LΣ

1/2? − ηL?∆>RR(R>R)−1Σ

1/2?

∥∥∥2

F

≤(

(1− η)2 +2ε

1− εη(1− η)

)tr(∆LΣ?∆

>L

)+

2ε+ ε2

(1− ε)2η2 tr

(∆RΣ?∆

>R

). (46)

A similarly bound holds for the second square ‖(Rt+1Qt −R?)Σ1/2? ‖2F in (43). Therefore we obtain∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F≤ ρ2(η; ε) dist2(Ft,F?),

where we identify

dist2(Ft,F?) = tr(∆LΣ?∆>L ) + tr(∆RΣ?∆

>R) (47)

and the contraction rate ρ2(η; ε) is given by

ρ2(η; ε) := (1− η)2 +2ε

1− εη(1− η) +

2ε+ ε2

(1− ε)2η2.

With ε = 0.1 and 0 < η ≤ 2/3, one has ρ(η; ε) ≤ 1− 0.7η. Thus we conclude that

dist(Ft+1,F?) ≤√∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F

≤ (1− 0.7η) dist(Ft,F?)

≤ (1− 0.7η)t+1 dist(F0,F?) ≤ (1− 0.7η)t+10.1σr(X?).

This proves the first induction hypothesis. The existence of the optimal alignment matrix Qt+1

between Ft+1 and F? is assured by Lemma 22, which finishes the proof for the second hypothesis.So far, we have demonstrated the first conclusion in the theorem. The second conclusion is an

easy consequence of Lemma 26 as∥∥LtR>t −X?

∥∥F≤(

1 +ε

2

)(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)≤(

1 +ε

2

)√2 dist(Ft,F?)

≤ 1.5 dist(Ft,F?).

(48)

32

Page 33: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

Here, the second line follows from the elementary inequality a+ b ≤√

2(a2 + b2) and the expressionof dist(Ft,F?) in (47). The proof is now completed.

Appendix C. Proof for Low-Rank Matrix Sensing

We start by recording a useful lemma.

Lemma 31 (Candès and Plan (2011)) Suppose that A(·) obeys the 2r-RIP with a constant δ2r.Then for any X1,X2 ∈ Rn1×n2 of rank at most r, one has

|〈A(X1),A(X2)〉 − 〈X1,X2〉| ≤ δ2r‖X1‖F‖X2‖F,

which can be stated equivalently as∣∣tr ((A∗A− I)(X1)X>2)∣∣ ≤ δ2r‖X1‖F‖X2‖F. (49)

As a simple corollary, one has that for any matrix R ∈ Rn2×r:

‖(A∗A− I)(X1)R‖F ≤ δ2r‖X1‖F‖R‖op. (50)

This is due to the fact that

‖(A∗A− I)(X1)R‖F = maxL:‖L‖F≤1

tr(

(A∗A− I)(X1)RL>)

≤ maxL:‖L‖F≤1

δ2r‖X1‖F‖LR>‖F

≤ δ2r‖X1‖F‖R‖op.

Here, the first line follows from the variational representation of the Frobenius norm, the second linefollows from (49), and the last line follows from the relation ‖AB‖F ≤ ‖A‖F‖B‖op.

C.1 Proof of Lemma 14

The proof mostly mirrors that in Section B.2. First, in view of the condition dist(Ft,F?) ≤ 0.1σr(X?)and Lemma 22, one knows that Qt, the optimal alignment matrix between Ft and F? exists. There-fore, for notational convenience, denote L := LtQt, R := RtQ

−>t , ∆L := L− L?, ∆R := R −R?,

and ε := 0.1. Similar to the derivation in (45), we have

‖∆LΣ−1/2? ‖op ∨ ‖∆RΣ

−1/2? ‖op ≤ ε. (51)

The conclusion ‖LtR>t −X?‖F ≤ 1.5 dist(Ft,F?) is a simple consequence of Lemma 26; see (48) fora detailed argument. From now on, we focus on proving the distance contraction.

With these notations in place, we have by the definition of dist(Ft+1,F?) that

dist2(Ft+1,F?) ≤∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F. (52)

Apply the update rule (15) and the decomposition LR> −X? = ∆LR> + L?∆

>R to obtain

(Lt+1Qt −L?)Σ1/2? =

(L− ηA∗A(LR> −X?)R(R>R)−1 −L?

1/2?

=(∆L − η(LR> −X?)R(R>R)−1 − η(A∗A− I)(LR> −X?)R(R>R)−1

1/2?

= (1− η)∆LΣ1/2? − ηL?∆>RR(R>R)−1Σ

1/2? − η(A∗A− I)(LR> −X?)R(R>R)−1Σ

1/2? .

33

Page 34: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

This allows us to expand the first square in (52) as∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F=∥∥∥(1− η)∆LΣ

1/2? − ηL?∆>RR(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸S1

− 2η(1− η) tr((A∗A− I)(LR> −X?)R(R>R)−1Σ?∆

>L

)︸ ︷︷ ︸S2

+ 2η2 tr((A∗A− I)(LR> −X?)R(R>R)−1Σ?(R

>R)−1R>∆RL>?

)︸ ︷︷ ︸S3

+ η2∥∥∥(A∗A− I)(LR> −X?)R(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸S4

.

In what follows, we shall control the four terms separately, of which S1 is the main term, and S2,S3

and S4 are perturbation terms.

1. Notice that the main term S1 has already been controlled in (46) under the condition (51). Itobeys

S1 ≤(

(1− η)2 +2ε

1− εη(1− η)

)‖∆LΣ

1/2? ‖2F +

2ε+ ε2

(1− ε)2η2‖∆RΣ

1/2? ‖2F.

2. For the second term S2, decompose LR>−X? = ∆LR>? +L?∆

>R+∆L∆>R and apply the triangle

inequality to obtain

|S2| =∣∣∣ tr ((A∗A− I)(∆LR

>? + L?∆

>R + ∆L∆>R)R(R>R)−1Σ?∆

>L

) ∣∣∣≤∣∣∣ tr ((A∗A− I)(∆LR

>? )R(R>R)−1Σ?∆

>L

) ∣∣∣+∣∣∣ tr ((A∗A− I)(L?∆

>R)R(R>R)−1Σ?∆

>L

) ∣∣∣+∣∣∣ tr ((A∗A− I)(∆L∆>R)R(R>R)−1Σ?∆

>L

) ∣∣∣.Invoke Lemma 31 to further obtain

|S2| ≤ δ2r(‖∆LR

>? ‖F + ‖L?∆>R‖F + ‖∆L∆>R‖F

) ∥∥R(R>R)−1Σ?∆>L

∥∥F

≤ δ2r(‖∆LR

>? ‖F + ‖L?∆>R‖F + ‖∆L∆>R‖F

) ∥∥∥R(R>R)−1Σ1/2?

∥∥∥op‖∆LΣ

1/2? ‖F,

where the second line follows from the relation ‖AB‖F ≤ ‖A‖op‖B‖F. Take the condition (51)and Lemmas 25 and 26 together to obtain∥∥∥R(R>R)−1Σ

1/2?

∥∥∥op≤ 1

1− ε;

‖∆LR>? ‖F + ‖L?∆>R‖F + ‖∆L∆>R‖F ≤ (1 +

ε

2)(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

).

These consequences further imply that

|S2| ≤δ2r(2 + ε)

2(1− ε)

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖∆LΣ

1/2? ‖F

=δ2r(2 + ε)

2(1− ε)

(‖∆LΣ

1/2? ‖2F + ‖∆LΣ

1/2? ‖F‖∆RΣ

1/2? ‖F

).

34

Page 35: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

For the term ‖∆LΣ1/2? ‖F‖∆RΣ

1/2? ‖F, we can apply the elementary inequality 2ab ≤ a2 + b2 to see

‖∆LΣ1/2? ‖F‖∆RΣ

1/2? ‖F ≤

1

2‖∆LΣ

1/2? ‖2F +

1

2‖∆RΣ

1/2? ‖2F.

The preceding two bounds taken collectively yield

|S2| ≤δ2r(2 + ε)

2 (1− ε)

(3

2‖∆LΣ

1/2? ‖2F +

1

2‖∆LΣ

1/2? ‖2F

).

3. The third term S3 can be similarly bounded as

|S3| ≤ δ2r(‖∆LR

>? ‖F + ‖L?∆>R‖F + ‖∆L∆>R‖F

) ∥∥R(R>R)−1Σ?(R>R)−1R>∆RL

>?

∥∥F

≤ δ2r(‖∆LR

>? ‖F + ‖L?∆>R‖F + ‖∆L∆>R‖F

) ∥∥∥R(R>R)−1Σ1/2?

∥∥∥2

op‖∆RL

>? ‖F

≤ δ2r(2 + ε)

2(1− ε)2

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖∆RΣ

1/2? ‖F

≤ δ2r(2 + ε)

2(1− ε)2

(1

2‖∆LΣ

1/2? ‖2F +

3

2‖∆RΣ

1/2? ‖2F

).

4. We are then left with the last term S4, for which we have√S4 =

∥∥∥(A∗A− I)(LR> −X?)R(R>R)−1Σ1/2?

∥∥∥F

≤∥∥∥(A∗A− I)(∆LR

>? )R(R>R)−1Σ

1/2?

∥∥∥F

+∥∥∥(A∗A− I)(L?∆

>R)R(R>R)−1Σ

1/2?

∥∥∥F

+∥∥∥(A∗A− I)(∆L∆>R)R(R>R)−1Σ

1/2?

∥∥∥F,

where once again we use the decomposition LR>−X? = ∆LR>? +L?∆

>R + ∆L∆>R. Use (50) to

see that √S4 ≤ δ2r

(‖∆LR

>? ‖F + ‖L?∆>R‖F + ‖∆L∆>R‖F

) ∥∥∥R(R>R)−1Σ1/2?

∥∥∥op.

Repeating the same argument in bounding S2 yields√S4 ≤

δ2r (2 + ε)

2 (1− ε)

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

).

We can then take the squares of both sides and use (a+ b)2 ≤ 2a2 + 2b2 to reach

S4 ≤δ22r(2 + ε)2

2(1− ε)2

(‖∆LΣ

1/2? ‖2F + ‖∆RΣ

1/2? ‖2F

).

Taking the bounds for S1,S2,S3,S4 collectively yields∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F≤(

(1− η)2 +2ε

1− εη(1− η)

)‖∆LΣ

1/2? ‖2F +

2ε+ ε2

(1− ε)2η2‖∆RΣ

1/2? ‖2F

+δ2r(2 + ε)

1− εη(1− η)

(3

2‖∆LΣ

1/2? ‖2F +

1

2‖∆RΣ

1/2? ‖2F

)+δ2r(2 + ε)

(1− ε)2η2

(1

2‖∆LΣ

1/2? ‖2F +

3

2‖∆RΣ

1/2? ‖2F

)

35

Page 36: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

+δ22r(2 + ε)2

2(1− ε)2η2(‖∆LΣ

1/2? ‖2F + ‖∆RΣ

1/2? ‖2F

).

Similarly, we can expand the second square in (52) and obtain a similar bound. Combine both toobtain ∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F≤ ρ2(η; ε, δ2r) dist2(Ft,F?),

where the contraction rate is given by

ρ2(η; ε, δ2r) := (1− η)2 +2ε+ δ2r(4 + 2ε)

1− εη(1− η) +

2ε+ ε2 + δ2r(4 + 2ε) + δ22r(2 + ε)2

(1− ε)2η2.

With ε = 0.1, δ2r ≤ 0.02, and 0 < η ≤ 2/3, one has ρ(η; ε, δ2r) ≤ 1− 0.6η. Thus we conclude that

dist(Ft+1,F?) ≤√∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F

≤ (1− 0.6η) dist(Ft,F?).

C.2 Proof of Lemma 15

With the knowledge of partial Frobenius norm ‖ · ‖F,r, we are ready to establish the claimed result.Invoke Lemma 24 to relate dist(F0,F?) to ‖L0R

>0 −X?‖F, and use that L0R

>0 −X? has rank at

most 2r to obtain

dist(F0,F?) ≤√√

2 + 1∥∥L0R

>0 −X?

∥∥F≤√

2(√

2 + 1)∥∥L0R

>0 −X?

∥∥F,r.

Note that L0R>0 is the best rank-r approximation of A∗A(X?), and apply the triangle inequality

combined with Lemma 30 to obtain∥∥L0R>0 −X?

∥∥F,r≤∥∥A∗A(X?)−L0R

>0

∥∥F,r

+ ‖A∗A(X?)−X?‖F,r≤ 2 ‖(A∗A− I)(X?)‖F,r ≤ 2δ2r‖X?‖F.

Here, the last inequality follows from combining Lemma 28 and (50) as

‖(A∗A− I)(X?)‖F,r = maxR∈Rn2×r:‖R‖op≤1

∥∥∥(A∗A− I)(X?)R∥∥∥F≤ δ2r‖X?‖F.

As a result, one has

dist(F0,F?) ≤ 2

√2(√

2 + 1)δ2r‖X?‖F ≤ 5δ2r√rκσr(X?).

Appendix D. Proof for Robust PCA

We first establish a useful property regarding the truncation operator T2α[·].

Lemma 32 Given S? ∈ Sα and S = T2α[X? + S? −LR>], one has

‖S − S?‖∞ ≤ 2‖LR> −X?‖∞. (53)

In addition, for any low-rank matrix M = LMR>M ∈ Rn1×n2 with LM ∈ Rn1×r,RM ∈ Rn2×r, onehas

|〈S − S?,M〉| ≤√

3αν(‖(L−L?)Σ

1/2? ‖F + ‖(R−R?)Σ

1/2? ‖F

)‖M‖F

+ 2√α (√n1‖LM‖2,∞‖RM‖F ∧

√n2‖LM‖F‖RM‖2,∞) ‖LR> −X?‖F,

(54)

36

Page 37: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

where ν obeys

ν ≥√n1

2

(‖LΣ

−1/2? ‖2,∞ + ‖L?Σ−1/2

? ‖2,∞)∨√n2

2

(‖RΣ

−1/2? ‖2,∞ + ‖R?Σ

−1/2? ‖2,∞

).

Proof Denote ∆L := L − L?, ∆R := R −R?, and ∆X := LR> −X?. Let Ω,Ω? be the supportof S and S?, respectively. As a result, S − S? is supported on Ω ∪ Ω?.

We start with proving the first claim, i.e. (53). For (i, j) ∈ Ω, by the definition of T2α[·], we have(S−S?)i,j = (−∆X)i,j . For (i, j) ∈ Ω? \Ω, one necessarily has Si,j = 0 and therefore (S−S?)i,j =(−S?)i,j . Again by the definition of the operator T2α[·], we know |S?−∆X |i,j is either smaller than|S? −∆X |i,(2αn2) or |S? −∆X |(2αn1),j . Furthermore, we know that S? contains at most α-fractionnonzero entries per row and column. Consequently, one has |S?−∆X |i,j ≤ |∆X |i,(αn2)∨|∆X |(αn1),j .Combining the two cases above, we conclude that

|S − S?|i,j ≤

|∆X |i,j , (i, j) ∈ Ω

|∆X |i,j +(|∆X |i,(αn2) ∨ |∆X |(αn1),j

), (i, j) ∈ Ω? \ Ω

. (55)

This immediately implies the `∞ norm bound (53).Next, we prove the second claim (54). Recall that S−S? is supported on Ω∪Ω?. We then have

|〈S − S?,M〉| ≤∑

(i,j)∈Ω

|S − S?|i,j |M |i,j +∑

(i,j)∈Ω?\Ω

|S − S?|i,j |M |i,j

≤∑

(i,j)∈Ω∪Ω?

|∆X |i,j |M |i,j +∑

(i,j)∈Ω?\Ω

(|∆X |i,(αn2) + |∆X |(αn1),j

)|M |i,j ,

where the second line follows from (55). Let β > 0 be some positive number, whose value will bedetermined later. Use 2ab ≤ β−1a2 + βb2 to further obtain

|〈S − S?,M〉| ≤∑

(i,j)∈Ω∪Ω?

|∆X |i,j |M |i,j︸ ︷︷ ︸A1

+1

∑(i,j)∈Ω?\Ω

(|∆X |2i,(αn2) + |∆X |2(αn1),j

)︸ ︷︷ ︸

A2

+β∑

(i,j)∈Ω?\Ω

|M |2i,j︸ ︷︷ ︸A3

.

In regard to the three terms A1,A2 and A3, we have the following claims, whose proofs are deferredto the end.

Claim 1 The first term A1 satisfies

A1 ≤√

3αν(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖M‖F.

Claim 2 The second term A2 satisfies

A2 ≤ 2‖∆X‖2F.

Claim 3 The third term A3 satisfies

A3 ≤ α(n1‖LM‖22,∞‖RM‖2F ∧ n2‖LM‖2F‖RM‖22,∞

).

Combine the pieces to reach

|〈S − S?,M〉| ≤√

3αν(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖M‖F

+‖∆X‖2Fβ

+ βα(n1‖LM‖22,∞‖RM‖2F ∧ n2‖LM‖2F‖RM‖22,∞

).

37

Page 38: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

One can then choose β optimally to yield

|〈S − S?,M〉| ≤√

3αν(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖M‖F

+ 2√α (√n1‖LM‖2,∞‖RM‖F ∧

√n2‖LM‖F‖RM‖2,∞) ‖∆X‖F.

This finishes the proof.

Proof [Proof of Claim 1] Use the decomposition ∆X = ∆LR>+L?∆

>R = ∆LR

>? +L∆>R to obtain

|∆X |i,j ≤ ‖(∆LΣ1/2? )i,·‖2‖RΣ

−1/2? ‖2,∞ + ‖L?Σ−1/2

? ‖2,∞‖(∆RΣ1/2? )j,·‖2, and

|∆X |i,j ≤ ‖(∆LΣ1/2? )i,·‖2‖R?Σ

−1/2? ‖2,∞ + ‖LΣ

−1/2? ‖2,∞‖(∆RΣ

1/2? )j,·‖2.

Take the average to yield

|∆X |i,j ≤ν√n2‖(∆LΣ

1/2? )i,·‖2 +

ν√n1‖(∆RΣ

1/2? )j,·‖2,

where we have used the assumption on ν. With this upper bound on |∆X |i,j in place, we can furthercontrol A1 as

A1 ≤∑

(i,j)∈Ω∪Ω?

ν√n2‖(∆LΣ

1/2? )i,·‖2|M |i,j +

∑(i,j)∈Ω∪Ω?

ν√n1‖(∆RΣ

1/2? )j,·‖2|M |i,j

√ ∑(i,j)∈Ω∪Ω?

‖(∆LΣ1/2? )i,·‖22/n2 +

√ ∑(i,j)∈Ω∪Ω?

‖(∆RΣ1/2? )j,·‖22/n1

ν‖M‖F.

Regarding the first term, one has

∑(i,j)∈Ω∪Ω?

‖(∆LΣ1/2? )i,·‖22 =

n1∑i=1

∑j:(i,j)∈Ω∪Ω?

‖(∆LΣ1/2? )i,·‖22

≤ 3αn2

n1∑i=1

‖(∆LΣ1/2? )i,·‖22

= 3αn2‖∆LΣ1/2? ‖2F,

where the second line follows from the fact that Ω ∪ Ω? contains at most 3αn2 non-zero entries ineach row. Similarly, we can show that∑

(i,j)∈Ω∪Ω?

‖(∆RΣ1/2? )j,·‖22 ≤ 3αn1‖∆RΣ

1/2? ‖2F.

In all, we arrive at

A1 ≤√

3αν(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖M‖F,

which is the desired claim.

Proof [Proof of Claim 2] Recall that (∆X)i,(αn2) denotes the (αn2)-th largest entry in the i-th rowof ∆X . One necessarily has

αn2|∆X |2i,(αn2) ≤ ‖(∆X)i,·‖22.

38

Page 39: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

As a result, we obtain ∑(i,j)∈Ω?\Ω

|∆X |2i,(αn2) ≤∑

(i,j)∈Ω?

|∆X |2i,(αn2)

≤n1∑i=1

∑j:(i,j)∈Ω?

‖(∆X)i,·‖22αn2

≤n1∑i=1

‖(∆X)i,·‖22 = ‖∆X‖2F,

where the last line follows from the fact that Ω? contains at most αn2 nonzero entries in each row.Similarly one can show that ∑

(i,j)∈Ω?\Ω

|∆X |2(αn1),j ≤ ‖∆X‖2F.

Combining the above two bounds with the definition of A2 completes the proof.

Proof [Proof of Claim 3] By definition, M = LMR>M , and hence one has

A3 =∑

(i,j)∈Ω?\Ω

|(LM )i,·(RM )>j,·|2 ≤∑

(i,j)∈Ω?

|(LM )i,·(RM )>j,·|2.

We can further upper bound A3 as

A3 ≤∑

(i,j)∈Ω?

‖(LM )i,·‖22‖(RM )j,·‖22

≤n1∑i=1

∑j:(i,j)∈Ω?

‖(LM )i,·‖22‖RM‖22,∞

≤n1∑i=1

αn2‖(LM )i,·‖22‖RM‖22,∞ = αn2‖LM‖2F‖RM‖22,∞,

where the last line follows from the fact that Ω? contains at most αn2 non-zero entries in each row.Similarly, one can obtain

A3 ≤ αn1‖LM‖22,∞‖RM‖2F,

which completes the proof.

D.1 Proof of Lemma 16

We begin with introducing several useful notations and facts. In view of the condition dist(Ft,F?) ≤0.02σr(X?) and Lemma 22, one knows that Qt, the optimal alignment matrix between Ft and F?exists. Therefore, for notational convenience, denote L := LtQt, R := RtQ

−>t , ∆L := L − L?,

∆R := R −R?, S := St = T2α[X? + S? − LR>], and ε := 0.02. Similar to the derivation in (45),we have

‖∆LΣ−1/2? ‖op ∨ ‖∆RΣ

−1/2? ‖op ≤ ε. (56)

39

Page 40: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

Moreover, the incoherence condition√n1‖∆LΣ

1/2? ‖2,∞ ∨

√n2‖∆RΣ

1/2? ‖2,∞ ≤

√µrσr(X?) (57)

implies√n1‖∆LΣ

−1/2? ‖2,∞ ∨

√n2‖∆RΣ

−1/2? ‖2,∞ ≤

õr, (58)

which combined with the triangle inequality further implies√n1‖LΣ

−1/2? ‖2,∞ ∨

√n2‖RΣ

−1/2? ‖2,∞ ≤ 2

õr. (59)

The conclusion ‖LtR>t −X?‖F ≤ 1.5 dist(Ft,F?) is a simple consequence of Lemma 26; see (48) fora detailed argument. In what follows, we shall prove the distance contraction and the incoherencecondition separately.

D.1.1 Distance contraction

By the definition of dist2(Ft+1,F?), one has

dist2(Ft+1,F?) ≤∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F. (60)

From now on, we focus on controlling the first square ‖(Lt+1Qt−L?)Σ1/2? ‖2F. In view of the update

rule (20), one has

(Lt+1Qt −L?)Σ1/2? =

(L− η(LR> + S −X? − S?)R(R>R)−1 −L?

1/2?

=(∆L − η(LR> −X?)R(R>R)−1 − η(S − S?)R(R>R)−1

1/2?

= (1− η)∆LΣ1/2? − ηL?∆>RR(R>R)−1Σ

1/2? − η(S − S?)R(R>R)−1Σ

1/2? .(61)

Here, we use the notation introduced above and the decomposition LR> −X? = ∆LR> + L?∆

>R.

Take the squared Frobenius norm of both sides of (61) to obtain∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F=∥∥∥(1− η)∆LΣ

1/2? − ηL?∆>RR(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸R1

− 2η(1− η) tr((S − S?)R(R>R)−1Σ?∆

>L

)︸ ︷︷ ︸R2

+ 2η2 tr((S − S?)R(R>R)−1Σ?(R

>R)−1R>∆RL>?

)︸ ︷︷ ︸R3

+ η2∥∥∥(S − S?)R(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸R4

.

In the sequel, we shall bound the four terms separately, of which R1 is the main term, and R2,R3

and R4 are perturbation terms.

1. Notice that the main term R1 has already been controlled in (46) under the condition (56). Itobeys

R1 ≤(

(1− η)2 +2ε

1− εη(1− η)

)‖∆LΣ

1/2? ‖2F +

2ε+ ε2

(1− ε)2η2‖∆RΣ

1/2? ‖2F.

40

Page 41: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

2. For the second term R2, set M := ∆LΣ?(R>R)−1R> with LM := ∆LΣ?(R

>R)−1Σ1/2? , RM :=

RΣ−1/2? , and then invoke Lemma 32 with ν := 3

õr/2 to see

|R2| ≤3

2

√3αµr

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)∥∥∆LΣ?(R>R)−1R>

∥∥F

+ 2√αn2

∥∥∥∆LΣ?(R>R)−1Σ

1/2?

∥∥∥F‖RΣ

−1/2? ‖2,∞‖LR> −X?‖F

≤ 3

2

√3αµr

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖∆LΣ

1/2? ‖F

∥∥∥R(R>R)−1Σ1/2?

∥∥∥op

+ 2√αn2‖∆LΣ

1/2? ‖F

∥∥∥Σ1/2? (R>R)−1Σ

1/2?

∥∥∥op‖RΣ

−1/2? ‖2,∞‖LR> −X?‖F.

Take the condition (56) and Lemmas 25 and 26 together to obtain∥∥∥R(R>R)−1Σ1/2?

∥∥∥op≤ 1

1− ε;∥∥∥Σ1/2

? (R>R)−1Σ1/2?

∥∥∥op

=∥∥∥R(R>R)−1Σ

1/2?

∥∥∥2

op≤ 1

(1− ε)2;

‖LR> −X?‖F ≤ (1 +ε

2)(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

).

(62)

These consequences combined with the condition (59) yield

|R2| ≤3√

3αµr

2(1− ε)

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖∆LΣ

1/2? ‖F

+4√αµr

(1− ε)2‖∆LΣ

1/2? ‖F(1 +

ε

2)(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)≤ √αµr

3√

3 + 4(2+ε)1−ε

2(1− ε)

(‖∆LΣ

1/2? ‖2F + ‖∆LΣ

1/2? ‖F‖∆RΣ

1/2? ‖F

)≤ √αµr

3√

3 + 4(2+ε)1−ε

2(1− ε)

(3

2‖∆LΣ

1/2? ‖2F +

1

2‖∆RΣ

1/2? ‖2F

),

where the last inequality holds since 2ab ≤ a2 + b2.

3. The third term R3 can be controlled similarly. Set M := L?∆>RR(R>R)−1Σ?(R

>R)−1R> withLM := L?Σ

−1/2? and RM := R(R>R)−1Σ?(R

>R)−1R>∆RΣ1/2? , and invoke Lemma 32 with

ν := 3√µr/2 to arrive at

|R3| ≤3

2

√3αµr

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)∥∥L?∆>RR(R>R)−1Σ?(R>R)−1R>

∥∥F

+ 2√αn1‖L?Σ−1/2

? ‖2,∞∥∥∥R(R>R)−1Σ?(R

>R)−1R>∆RΣ1/2?

∥∥∥F‖LR> −X?‖F

≤ 3

2

√3αµr

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖∆RΣ

1/2? ‖F

∥∥∥R(R>R)−1Σ1/2?

∥∥∥2

op

+ 2√αn1‖L?Σ−1/2

? ‖2,∞∥∥∥R(R>R)−1Σ

1/2?

∥∥∥2

op‖∆RΣ

1/2? ‖F‖LR> −X?‖F.

Use the consequences (62) again to obtain

|R3| ≤3√

3αµr

2(1− ε)2

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)‖∆RΣ

1/2? ‖F

41

Page 42: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

+2√αµr

(1− ε)2‖∆RΣ

1/2? ‖F(1 +

ε

2)(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)≤ √αµr3

√3 + 2(2 + ε)

2(1− ε)2

(‖∆LΣ

1/2? ‖F‖∆RΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖2F

)≤ √αµr3

√3 + 2(2 + ε)

2(1− ε)2

(1

2‖∆LΣ

1/2? ‖2F +

3

2‖∆RΣ

1/2? ‖2F

).

4. For the last term R4, utilize the variational representation of the Frobenius norm to see√R4 = tr

((S − S?)R(R>R)−1Σ

1/2? L>

)for some L ∈ Rn1×r obeying ‖L‖F = 1. Setting M := LΣ

1/2? (R>R)−1R> = LMR>M with

LM := LΣ1/2? (R>R)−1Σ

1/2? and RM := RΣ

−1/2? , we are ready to apply Lemma 32 again with

ν := 3√µr/2 to see√

R4 ≤3

2

√3αµr

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)∥∥∥LΣ1/2? (R>R)−1R>

∥∥∥F

+ 2√αn2

∥∥∥LΣ1/2? (R>R)−1Σ

1/2?

∥∥∥F‖RΣ

−1/2? ‖2,∞‖LR> −X?‖F

≤ 3

2

√3αµr

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)∥∥∥R(R>R)−1Σ1/2?

∥∥∥op

+ 2√αn2

∥∥∥Σ1/2? (R>R)−1Σ

1/2?

∥∥∥op‖RΣ

−1/2? ‖2,∞‖LR> −X?‖F.

This combined with the consequences (62) and condition (59) yields

√R4 ≤

√αµr

3√

3 + 4(2+ε)1−ε

2(1− ε)

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

).

Take the square, and use the elementary inequality (a+ b)2 ≤ 2a2 + 2b2 to reach

R4 ≤ αµr(3√

3 + 4(2+ε)1−ε )2

2(1− ε)2

(‖∆LΣ

1/2? ‖2F + ‖∆RΣ

1/2? ‖2F

).

Taking collectively the bounds for R1,R2,R3 and R4 yields the control of ‖(Lt+1Qt − L?)Σ1/2? ‖2F

as ∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F≤(

(1− η)2 +2ε

1− εη(1− η)

)‖∆LΣ

1/2? ‖2F +

2ε+ ε2

(1− ε)2η2‖∆RΣ

1/2? ‖2F

+√αµr

3√

3 + 4(2+ε)1−ε

1− εη(1− η)

(3

2‖∆LΣ

1/2? ‖2F +

1

2‖∆RΣ

1/2? ‖2F

)+√αµr

3√

3 + 2(2 + ε)

(1− ε)2η2

(1

2‖∆LΣ

1/2? ‖2F +

3

2‖∆RΣ

1/2? ‖2F

)+ αµr

(3√

3 + 4(2+ε)1−ε )2

2(1− ε)2η2(‖∆LΣ

1/2? ‖2F + ‖∆RΣ

1/2? ‖2F

).

Similarly, we can obtain the control of ‖(Rt+1Q−>t − R?)Σ

1/2? ‖2F. Combine them together and

identify dist2(Ft,F?) = ‖∆LΣ1/2? ‖2F + ‖∆RΣ

1/2? ‖2F to reach∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F≤ ρ2(η; ε, αµr) dist2(Ft,F?),

42

Page 43: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

where the contraction rate ρ2(η; ε, αµr) is given by

ρ2(η; ε, αµr) := (1− η)2 +2ε+

√αµr(6

√3 + 8(2+ε)

1−ε )

1− εη(1− η)

+2ε+ ε2 +

√αµr(6

√3 + 4(2 + ε)) + αµr(3

√3 + 4(2+ε)

1−ε )2

(1− ε)2η2.

With ε = 0.02, αµr ≤ 10−4, and 0 < η ≤ 2/3, one has ρ(η; ε, αµr) ≤ 1 − 0.6η. Thus we concludethat

dist(Ft+1,F?) ≤√∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F

≤ (1− 0.6η) dist(Ft,F?). (63)

D.1.2 Incoherence condition

We start by controlling the term ‖(Lt+1Qt −L?)Σ1/2? ‖2,∞. We know from (61) that

(Lt+1Qt −L?)Σ1/2? = (1− η)∆LΣ

1/2? − ηL?∆>RR(R>R)−1Σ

1/2? − η(S − S?)R(R>R)−1Σ

1/2? .

Apply the triangle inequality to obtain∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2,∞≤ (1− η)‖∆LΣ

1/2? ‖2,∞ + η

∥∥∥L?∆>RR(R>R)−1Σ1/2?

∥∥∥2,∞︸ ︷︷ ︸

T1

+ η∥∥∥(S − S?)R(R>R)−1Σ

1/2?

∥∥∥2,∞︸ ︷︷ ︸

T2

.

The first term ‖∆LΣ1/2? ‖2,∞ follows from the incoherence condition (57) as

‖∆LΣ1/2? ‖2,∞ ≤

õr

n1σr(X?).

In the sequel, we shall bound the terms T1 and T2.

1. For the term T1, use the relation ‖AB‖2,∞ ≤ ‖A‖2,∞‖B‖op, and combine the condition (56) withthe consequences (62) to obtain

T1 ≤ ‖L?Σ−1/2? ‖2,∞

∥∥∥Σ1/2? ∆>RR(R>R)−1Σ

1/2?

∥∥∥op

≤ ‖L?Σ−1/2? ‖2,∞‖∆RΣ

1/2? ‖op

∥∥∥R(R>R)−1Σ1/2?

∥∥∥op

≤ ε

1− ε

õr

n1σr(X?),

2. For the term T2, use the relation ‖AB‖2,∞ ≤ ‖A‖2,∞‖B‖op to obtain

T2 ≤ ‖S − S?‖2,∞∥∥∥R(R>R)−1Σ

1/2?

∥∥∥op.

We know from Lemma 32 that S − S? has at most 3αn2 non-zero entries in each row, and‖S − S?‖∞ ≤ 2‖LR> −X?‖∞. Upper bound the `2,∞ norm by the `∞ norm as

‖S − S?‖2,∞ ≤√

3αn2‖S − S?‖∞ ≤ 2√

3αn2‖LR> −X?‖∞.

43

Page 44: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

Split LR> −X? = ∆LR> + L?∆

>R, and take the conditions (57) and (59) to obtain

‖LR> −X?‖∞ ≤ ‖∆LR>‖∞ + ‖L?∆>R‖∞

≤ ‖∆LΣ1/2? ‖2,∞‖RΣ

−1/2? ‖2,∞ + ‖L?Σ−1/2

? ‖2,∞‖∆RΣ1/2? ‖2,∞

≤√µr

n1σr(X?)2

õr

n2+

õr

n1

õr

n2σr(X?)

=3µr√n1n2

σr(X?).

This combined with the consequences (62) yields

T2 ≤6√

3αµr

1− ε

õr

n1σr(X?).

Taking collectively the bounds for T1,T2 yields the control∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2,∞≤(

1− η +ε+ 6

√3αµr

1− εη

)õr

n1σr(X?). (64)

The last step is to switch the alignment matrix from Qt to Qt+1. (63) together with Lemma 22demonstrates the existence of Qt+1. Apply the triangle inequality to obtain∥∥∥(Lt+1Qt+1 −L?)Σ

1/2?

∥∥∥2,∞≤∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2,∞

+∥∥∥Lt+1(Qt+1 −Qt)Σ

1/2?

∥∥∥2,∞

≤∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2,∞

+ ‖Lt+1QtΣ−1/2? ‖2,∞

∥∥∥Σ1/2? Q−1

t Qt+1Σ1/2? −Σ?

∥∥∥op.

We deduct from (64) that

‖Lt+1QtΣ−1/2? ‖2,∞ ≤ ‖L?Σ−1/2

? ‖2,∞ +∥∥∥(Lt+1Qt −L?)Σ

−1/2?

∥∥∥2,∞≤(

2− η +ε+ 6

√3αµr

1− εη

)õr

n1.

Regarding the alignment matrix term, invoke Lemma 27 to obtain

∥∥∥Σ1/2? Q−1

t Qt+1Σ1/2? −Σ?

∥∥∥op≤‖(Rt+1(Q−>t −Q−>t+1)Σ

1/2? ‖op

1− ‖(Rt+1Q−>t+1 −R?)Σ

−1/2? ‖op

≤‖(Rt+1Q

−>t −R?)Σ

1/2? ‖op + ‖(Rt+1Q

−>t+1 −R?)Σ

1/2? ‖op

1− ‖(Rt+1Q−>t+1 −R?)Σ

−1/2? ‖op

≤ 2ε

1− εσr(X?),

where we deduct from (63) that the distances using either Qt or Qt+1 are bounded by

‖(Rt+1Q−>t −R?)Σ

1/2? ‖op ≤ εσr(X?);

‖(Rt+1Q−>t+1 −R?)Σ

1/2? ‖op ≤ εσr(X?);

‖(Rt+1Q−>t+1 −R?)Σ

−1/2? ‖op ≤ ε.

Combine all pieces to reach∥∥∥(Lt+1Qt+1 −L?)Σ1/2?

∥∥∥2,∞≤(

1 + ε

1− ε

(1− η +

ε+ 6√

3αµr

1− εη

)+

1− ε

)õr

n1σr(X?).

44

Page 45: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

With ε = 0.02, αµr ≤ 10−4, and 0.1 ≤ η ≤ 2/3, we get the desired incoherence condition∥∥∥(Lt+1Qt+1 −L?)Σ1/2?

∥∥∥2,∞≤√µr

n1σr(X?).

Similarly, we can prove the other part∥∥∥(Rt+1Q−>t+1 −R?)Σ

1/2?

∥∥∥2,∞≤√µr

n2σr(X?).

D.2 Proof of Lemma 17

We first record two lemmas from Yi et al. (2016), which are useful for studying the properties of theinitialization.

Lemma 33 ((Yi et al., 2016, Section 6.1)) Given S? ∈ Sα, one has ‖S? − Tα[X? + S?]‖∞ ≤2‖X?‖∞.

Lemma 34 ((Yi et al., 2016, Lemma 1)) For any matrix M ∈ Sα, one has ‖M‖op ≤ α√n1n2‖M‖∞.

With these two lemmas in place, we are ready to establish the claimed result. Invoke Lemma 24to obtain

dist(F0,F?) ≤√√

2 + 1∥∥L0R

>0 −X?

∥∥F≤√

(√

2 + 1)2r∥∥L0R

>0 −X?

∥∥op,

where the last relation uses the fact that L0R>0 −X? has rank at most 2r. We can further apply

the triangle inequality to see∥∥L0R>0 −X?

∥∥op≤∥∥Y − Tα[Y ]−L0R

>0

∥∥op

+ ‖Y − Tα[Y ]−X?‖op≤ 2 ‖Y − Tα[Y ]−X?‖op = 2 ‖S? − Tα[X? + S?]‖op .

Here the second inequality hinges on the fact that L0R>0 is the best rank-r approximation of Y −

Tα[Y ], and the last identity arises from Y = X?+S?. Follow the same argument as (Yi et al., 2016,Section 6.1), combining Lemmas 33 and 34 to reach

‖S? − Tα[X? + S?]‖op ≤ 2α√n1n2 ‖S? − Tα[X? + S?]‖∞

≤ 4α√n1n2‖X?‖∞ ≤ 4αµrκσr(X?),

where the last inequality follows from the incoherence assumption

‖X?‖∞ ≤ ‖U?‖2,∞‖Σ?‖op‖V?‖2,∞ ≤µr√n1n2

κσr(X?). (65)

Take the above inequalities together to arrive at

dist(F0,F?) ≤ 8

√2(√

2 + 1)αµr3/2κσr(X?) ≤ 20αµr3/2κσr(X?).

D.3 Proof of Lemma 18

In view of the condition dist(F0,F?) ≤ 0.02σr(X?) and Lemma 22, one knows that Q0, the optimalalignment matrix between F0 and F? exists. Therefore, for notational convenience, denote L :=L0Q0, R := R0Q

−>0 , ∆L := L−L?, ∆R := R−R?, and ε := 0.02. Our objective is then translated

to demonstrate√n1‖∆LΣ

1/2? ‖2,∞ ∨

√n2‖∆RΣ

1/2? ‖2,∞ ≤

√µrσr(X?).

45

Page 46: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

From now on, we focus on bounding ‖∆LΣ1/2? ‖2,∞. Since U0Σ0V

>0 is the top-r SVD of Y −Tα[Y ],

and recall that Y = X? + S?, we have the relation

(X? + S? − Tα[X? + S?])V0 = U0Σ0,

which further implies the following decomposition of ∆LΣ1/2? .

Claim 4 One has

∆LΣ1/2? = (S? − Tα[X? + S?])R(R>R)−1Σ

1/2? −L?∆

>RR(R>R)−1Σ

1/2? .

Combining Claim 4 with the triangle inequality yields

‖∆LΣ1/2? ‖2,∞ ≤

∥∥∥L?∆>RR(R>R)−1Σ1/2?

∥∥∥2,∞︸ ︷︷ ︸

I1

+∥∥∥(S? − Tα[X? + S?])R(R>R)−1Σ

1/2?

∥∥∥2,∞︸ ︷︷ ︸

I2

.

In what follows, we shall control I1 and I2 in turn.

1. For the term I1, use the relation ‖AB‖2,∞ ≤ ‖A‖2,∞‖B‖op to obtain

I1 ≤ ‖L?Σ−1/2? ‖2,∞‖∆RΣ

1/2? ‖op

∥∥∥R(R>R)−1Σ1/2?

∥∥∥op.

The incoherence assumption tells ‖L?Σ−1/2? ‖2,∞ = ‖U?‖2,∞ ≤

õr/n1. In addition, the assump-

tion dist(F0,F?) ≤ εσr(X?) entails the bound ‖∆RΣ1/2? ‖op ≤ εσr(X?). Finally, repeating the

argument for obtaining (56) yields ‖∆RΣ−1/2? ‖op ≤ ε, which together with Lemma 25 reveals∥∥∥R(R>R)−1Σ

1/2?

∥∥∥op≤ 1

1− ε.

In all, we arrive at

I1 ≤ε

1− ε

õr

n1σr(X?).

2. Proceeding to the term I2, use the relations ‖AB‖2,∞ ≤ ‖A‖1,∞‖B‖2,∞ and ‖AB‖2,∞ ≤‖A‖2,∞‖B‖op to obtain

I2 ≤ ‖S? − Tα[X? + S?]‖1,∞∥∥∥R(R>R)−1Σ

1/2?

∥∥∥2,∞

≤ ‖S? − Tα[X? + S?]‖1,∞ ‖RΣ−1/2? ‖2,∞

∥∥∥Σ1/2? (R>R)−1Σ

1/2?

∥∥∥op.

Regarding S? − Tα[X? + S?], Lemma 33 tells that S? − Tα[X? + S?] has at most 2αn2 non-zeroentries in each row, and ‖S? − Tα[X? + S?]‖∞ ≤ 2‖X?‖∞. Consequently, we can upper boundthe `1,∞ norm by the `∞ norm as

‖S? − Tα[X? + S?]‖1,∞ ≤ 2αn2 ‖S? − Tα[X? + S?]‖∞≤ 4αn2‖X?‖∞

≤ 4αn2µr√n1n2

κσr(X?).

46

Page 47: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

Here the last inequality follows from the incoherence assumption (65). For the term ‖RΣ−1/2? ‖2,∞,

one can apply the triangle inequality to see

‖RΣ−1/2? ‖2,∞ ≤ ‖R?Σ

−1/2? ‖2,∞ + ‖∆RΣ

−1/2? ‖2,∞ ≤

õr

n2+‖∆RΣ

1/2? ‖2,∞

σr(X?).

Last but not least, repeat the argument for (62) to obtain∥∥∥Σ1/2? (R>R)−1Σ

1/2?

∥∥∥op

=∥∥∥R(R>R)−1Σ

1/2?

∥∥∥2

op≤ 1

(1− ε)2.

Taking together the above bounds yields

I2 ≤4αµrκ

(1− ε)2

õr

n1σr(X?) +

4αµrκ

(1− ε)2

√n2

n1‖∆RΣ

1/2? ‖2,∞.

Combine the bounds on I1 and I2 to reach

√n1‖∆LΣ

1/2? ‖2,∞ ≤

1− ε+

4αµrκ

(1− ε)2

)√µrσr(X?) +

4αµrκ

(1− ε)2

√n2‖∆RΣ

1/2? ‖2,∞.

Similarly, we have

√n2‖∆RΣ

1/2? ‖2,∞ ≤

1− ε+

4αµrκ

(1− ε)2

)√µrσr(X?) +

4αµrκ

(1− ε)2

√n1‖∆LΣ

1/2? ‖2,∞.

Taking the maximum and solving for√n1‖∆LΣ

1/2? ‖2,∞ ∨

√n2‖∆LΣ

1/2? ‖2,∞ yield the relation

√n1‖∆LΣ

1/2? ‖2,∞ ∨

√n2‖∆LΣ

1/2? ‖2,∞ ≤

ε(1− ε) + 4αµrκ

(1− ε)2 − 4αµrκ

√µrσr(X?).

With ε = 0.02 and αµrκ ≤ 0.1, we get the desired conclusion

√n1‖∆LΣ

1/2? ‖2,∞ ∨

√n2‖∆LΣ

1/2? ‖2,∞ ≤

√µrσr(X?).

Proof [Proof of Claim 4] Identify U0 (resp. V0) with L0Σ−1/20 (resp. R0Σ

−1/20 ) to yield

(X? + S? − Tα[X? + S?])R0Σ−10 = L0,

which is equivalent to (X? + S? − Tα[X? + S?])R0(R>0 R0)−1 = L0 since Σ0 = R>0 R0. Multiplyboth sides by Q0Σ

1/2? to obtain

(X? + S? − Tα[X? + S?])R(R>R)−1Σ1/2? = LΣ

1/2? ,

where we recall that L = L0Q0 and R = R0Q−>0 . In the end, subtract X?R(R>R)−1Σ

1/2? from

both sides to reach

(S? − Tα[X? + S?])R(R>R)−1Σ1/2? = LΣ

1/2? −L?R

>? R(R>R)−1Σ

1/2?

= (L−L?)Σ1/2? + L?(R−R?)

>R(R>R)−1Σ1/2?

= ∆LΣ1/2? + L?∆

>RR(R>R)−1Σ

1/2? .

This finishes the proof.

47

Page 48: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

Appendix E. Proof for Matrix Completion

E.1 New projection operator

E.1.1 Proof of Proposition 7

First, notice that the optimization of L and R in (23) can be decomposed and done in parallel,hence we focus on the optimization of L below:

L = argminL∈Rn1×r

∥∥∥(L− L)(R>R)1/2∥∥∥2

Fs.t.

√n1

∥∥∥L(R>R)1/2∥∥∥

2,∞≤ B.

By a change of variables as G := L(R>R)1/2 and G := L(R>R)1/2, we rewrite the above problemequivalently as

G = argminG∈Rn1×r

‖G− G‖2F s.t.√n1 ‖G‖2,∞ ≤ B,

whose solution is given as Chen and Wainwright (2015)

Gi,· =

(1 ∧ B√n1‖Gi,·‖2

)Gi,·, 1 ≤ i ≤ n1.

By applying again the change of variable L = G(R>R)−1/2 and L = G(R>R)−1/2, we obtain theclaimed solution.

E.1.2 Proof of Lemma 19

We begin with proving the non-expansiveness property. Denote the optimal alignment matrix be-tween F and F? as Q, whose existence is guaranteed by Lemma 22. Denoting PB(F ) = [L>,R>]>,by the definition of dist(PB(F ),F?), we know that

dist2(PB(F ),F?) ≤n1∑i=1

∥∥∥Li,·QΣ1/2? − (L?Σ

1/2? )i,·

∥∥∥2

2+

n2∑j=1

∥∥∥Rj,·Q−>Σ

1/2? − (R?Σ

1/2? )j,·

∥∥∥2

2. (66)

Recall that the condition dist(F ,F?) ≤ εσr(X?) implies∥∥∥(LQ−L?)Σ−1/2?

∥∥∥op∨∥∥∥(RQ−> −R?)Σ

−1/2?

∥∥∥op≤ ε,

which, together with R?Σ−1/2? = V?, further implies that∥∥∥Li,·R>∥∥∥

2≤∥∥∥Li,·QΣ

1/2?

∥∥∥2

∥∥∥RQ−>Σ−1/2?

∥∥∥op

≤∥∥∥Li,·QΣ

1/2?

∥∥∥2

(‖V?‖op +

∥∥∥(RQ−> −R?)Σ−1/2?

∥∥∥op

)≤ (1 + ε)

∥∥∥Li,·QΣ1/2?

∥∥∥2.

In addition, the µ-incoherence of X? yields

√n1

∥∥∥(L?Σ1/2? )i,·

∥∥∥2≤√n1‖U?‖2,∞‖Σ?‖op ≤

√µrσ1(X?) ≤

B

1 + ε,

where the last inequality follows from the choice of B. Take the above two relations collectively toreach

B√n1‖Li,·R>‖2

∥∥∥(L?Σ1/2? )i,·

∥∥∥2∥∥∥Li,·QΣ

1/2?

∥∥∥2

.

48

Page 49: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

We claim that performing the following projection yields a contraction on each row; see also (Zhengand Lafferty, 2016, Lemma 11).

Claim 5 For vectors u,u? ∈ Rn and λ ≥ ‖u?‖2/‖u‖2, it holds that

‖(1 ∧ λ)u− u?‖2 ≤ ‖u− u?‖2.

Apply Claim 5 with u := Li,·QΣ1/2? , u? := (L?Σ

1/2? )i,·, and λ := B/(

√n1‖Li,·R>‖2) to obtain

∥∥∥Li,·QΣ1/2? − (L?Σ

1/2? )i,·

∥∥∥2

2=

∥∥∥∥∥(

1 ∧ B√n1‖Li,·R>‖2

)Li,·QΣ

1/2? − (L?Σ

1/2? )i,·

∥∥∥∥∥2

2

≤∥∥∥Li,·QΣ

1/2? − (L?Σ

1/2? )i,·

∥∥∥2

2.

Following a similar argument for R, and plugging them back to (66), we conclude that

dist2(PB(F ),F?) ≤n1∑i=1

∥∥∥Li,·QΣ1/2? − (L?Σ

1/2? )i,·

∥∥∥2

2+

n2∑j=1

∥∥∥Rj,·Q−>Σ

1/2? − (R?Σ

1/2? )j,·

∥∥∥2

2= dist2(F ,F?).

We move on to the incoherence condition. For any 1 ≤ i ≤ n1, one has

‖Li,·R>‖22 =

n2∑j=1

〈Li,·,Rj,·〉2 =

n2∑j=1

(1 ∧ B√n1‖Li,·R>‖2

)2

〈Li,·, Rj,·〉2(

1 ∧ B√n2‖Rj,·L>‖2

)2

(i)

(1 ∧ B√n1‖Li,·R>‖2

)2 n2∑j=1

〈Li,·, Rj,·〉2 =

(1 ∧ B√n1‖Li,·R>‖2

)2

‖Li,·R>‖22

(ii)

≤ B2

n1.

where (i) follows from 1 ∧ B√n2‖Rj,·L>‖2

≤ 1, and (ii) follows from 1 ∧ B√n1‖Li,·R>‖2

≤ B√n1‖Li,·R>‖2

.

Similarly, one has ‖Rj,·L>‖22 ≤ B2/n2. Combining these two bounds completes the proof.

Proof [Proof of Claim 5] When λ > 1, the claim holds as an identity. Otherwise λ ≤ 1. Denoteh(λ) := ‖λu−u?‖22. Calculate its derivative to conclude that h(λ) is monotonically increasing whenλ ≥ λ? := 〈u,u?〉/‖u‖22. Note that λ ≥ ‖u?‖2/‖u‖2 ≥ λ?, thus h(λ) ≤ h(1), i.e. the claim holds.

E.2 Proof of Lemma 20

We first record two useful lemmas regarding the projector PΩ(·).

Lemma 35 ((Zheng and Lafferty, 2016, Lemma 10)) Suppose that X? is µ-incoherent, andp & µr log(n1 ∨ n2)/(n1 ∧ n2). With overwhelming probability, one has∣∣⟨(p−1PΩ − I)(L?R

>A + LAR

>? ),L?R

>B + LBR

>?

⟩∣∣≤ C1

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖L?R>A + LAR

>? ‖F‖L?R>B + LBR

>? ‖F,

simultaneously for all LA,LB ∈ Rn1×r and RA,RB ∈ Rn2×r, where C1 > 0 is some universalconstant.

49

Page 50: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

Lemma 36 ((Chen and Li, 2019, Lemma 8),(Chen et al., 2020a, Lemma 12)) Suppose thatp & log(n1 ∨ n2)/(n1 ∧ n2). With overwhelming probability, one has∣∣⟨(p−1PΩ − I)(LAR

>A),LBR

>B

⟩∣∣≤ C2

√n1 ∨ n2

p(‖LA‖F‖LB‖2,∞ ∧ ‖LA‖2,∞‖LB‖F) (‖RA‖F‖RB‖2,∞ ∧ ‖RA‖2,∞‖RB‖F) ,

simultaneously for all LA,LB ∈ Rn1×r and RA,RB ∈ Rn2×r, where C2 > 0 is some universalconstant.

In view of the above two lemmas, define the event E as the intersection of the events that thebounds in Lemmas 35 and 36 hold, which happens with overwhelming probability. The rest of theproof is then performed under the event that E holds.

By the condition dist(Ft,F?) ≤ 0.02σr(X?) and Lemma 22, one knows that Qt, the optimalalignment matrix between Ft and F? exists. Therefore, for notational convenience, we denote L :=LtQt, R := RtQ

−>t , ∆L := L−L?, ∆R := R−R?, and ε := 0.02. In addition, denote Ft+1 as the

update before projection as

Ft+1 :=

[Lt+1

Rt+1

]=

[Lt − ηp−1PΩ(LtR

>t −X?)Rt(R

>t Rt)

−1

Rt − ηp−1PΩ(LtR>t −X?)

>Lt(L>t Lt)

−1

],

and therefore Ft+1 = PB(Ft+1). Note that in view of Lemma 19, it suffices to prove the followingrelation

dist(Ft+1,F?) ≤ (1− 0.6η) dist(Ft,F?). (67)

The conclusion ‖LtR>t −X?‖F ≤ 1.5 dist(Ft,F?) is a simple consequence of Lemma 26; see (48) fora detailed argument. In what follows, we concentrate on proving (67).

To begin with, we list a few easy consequences under the assumed conditions.

Claim 6 Under conditions dist(Ft,F?) ≤ εσr(X?) and√n1‖LR>‖2,∞∨

√n2‖RL>‖2,∞ ≤ CB

√µrσ1(X?),

one has

‖∆LΣ−1/2? ‖op ∨ ‖∆RΣ

−1/2? ‖op ≤ ε; (68a)∥∥∥R(R>R)−1Σ

1/2?

∥∥∥op≤ 1

1− ε; (68b)∥∥∥Σ1/2

? (R>R)−1Σ1/2?

∥∥∥op≤ 1

(1− ε)2; (68c)

√n1‖LΣ

1/2? ‖2,∞ ∨

√n2‖RΣ

1/2? ‖2,∞ ≤

CB1− ε

√µrσ1(X?); (68d)

√n1‖LΣ

−1/2? ‖2,∞ ∨

√n2‖RΣ

−1/2? ‖2,∞ ≤

CBκ

1− ε√µr; (68e)

√n1‖∆LΣ

1/2? ‖2,∞ ∨

√n2‖∆RΣ

1/2? ‖2,∞ ≤

(1 +

CB1− ε

)√µrσ1(X?). (68f)

Now we are ready to embark on the proof of (67). By the definition of dist(Ft+1,F?), one has

dist2(Ft+1,F?) ≤∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F, (69)

where we recall that Qt is the optimal alignment matrix between Ft and F?. Plug in the updaterule (26) and the decomposition LR> −X? = ∆LR

> + L?∆>R to obtain

(Lt+1Qt −L?)Σ1/2? =

(L− ηp−1PΩ(LR> −X?)R(R>R)−1 −L?

1/2?

50

Page 51: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

= ∆LΣ1/2? − η(LR> −X?)R(R>R)−1Σ

1/2? − η(p−1PΩ − I)(LR> −X?)R(R>R)−1Σ

1/2?

= (1− η)∆LΣ1/2? − ηL?∆>RR(R>R)−1Σ

1/2? − η(p−1PΩ − I)(LR> −X?)R(R>R)−1Σ

1/2? .

This allows us to expand the first square in (69) as∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F=∥∥∥(1− η)∆LΣ

1/2? − ηL?∆>RR(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸P1

− 2η(1− η) tr((p−1PΩ − I)(LR> −X?)R(R>R)−1Σ?∆

>L

)︸ ︷︷ ︸P2

+ 2η2 tr((p−1PΩ − I)(LR> −X?)R(R>R)−1Σ?(R

>R)−1R>∆RL>?

)︸ ︷︷ ︸P3

+ η2∥∥∥(p−1PΩ − I)(LR> −X?)R(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸P4

.

In the sequel, we shall control the four terms separately, of which P1 is the main term, and P2,P3

and P4 are perturbation terms.

1. Notice that the main term P1 has already been controlled in (46) under the condition (68a). Itobeys

P1 ≤(

(1− η)2 +2ε

1− εη(1− η)

)‖∆LΣ

1/2? ‖2F +

2ε+ ε2

(1− ε)2η2‖∆RΣ

1/2? ‖2F.

2. For the second term P2, decompose LR>−X? = ∆LR>? +L∆>R and apply the triangle inequality

to obtain

|P2| =∣∣∣ tr ((p−1PΩ − I)(∆LR

>? + L∆>R)R(R>R)−1Σ?∆

>L

) ∣∣∣≤∣∣∣ tr ((p−1PΩ − I)(∆LR

>? )R?(R

>R)−1Σ?∆>L

) ∣∣∣︸ ︷︷ ︸P2,1

+∣∣∣ tr ((p−1PΩ − I)(∆LR

>? )∆R(R>R)−1Σ?∆

>L

) ∣∣∣︸ ︷︷ ︸P2,2

+∣∣∣ tr ((p−1PΩ − I)(L∆>R)R(R>R)−1Σ?∆

>L

) ∣∣∣︸ ︷︷ ︸P2,3

.

For the first term P2,1, under the event E , we can invoke Lemma 35 to obtain

P2,1 ≤ C1

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖∆LR

>? ‖F

∥∥∆LΣ?(R>R)−1R>?

∥∥F

≤ C1

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖∆LΣ

1/2? ‖2F

∥∥∥Σ1/2? (R>R)−1Σ

1/2?

∥∥∥op,

where the second line follows from the relation ‖AB‖F ≤ ‖A‖op‖B‖F. Use the condition (68c) toobtain

P2,1 ≤C1

(1− ε)2

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖∆LΣ

1/2? ‖2F.

51

Page 52: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

Regarding the remaining termsP2,2 andP2,3, our main hammer is Lemma 36. Invoking Lemma 36under the event E withLA := ∆LΣ

1/2? ,RA := R?Σ

−1/2? , LB := ∆LΣ

1/2? , andRB := ∆R(R>R)−1Σ

1/2? ,

we arrive at

P2,2 ≤ C2

√n1 ∨ n2

p‖∆LΣ

1/2? ‖2,∞‖∆LΣ

1/2? ‖F‖R?Σ

−1/2? ‖2,∞

∥∥∥∆R(R>R)−1Σ1/2?

∥∥∥F

≤ C2

√n1 ∨ n2

p‖∆LΣ

1/2? ‖2,∞‖∆LΣ

1/2? ‖F‖R?Σ

−1/2? ‖2,∞‖∆RΣ

−1/2? ‖F

∥∥∥Σ1/2? (R>R)−1Σ

1/2?

∥∥∥op.

Similarly, with the help of Lemma 36, one has

P2,3 ≤ C2

√n1 ∨ n2

p‖LΣ

−1/2? ‖2,∞‖∆LΣ

1/2? ‖F‖∆RΣ

1/2? ‖F‖RΣ

−1/2? ‖2,∞

∥∥∥Σ1/2? (R>R)−1Σ

1/2?

∥∥∥op.

Utilizing the consequences in Claim 6, we arrive at

P2,2 ≤C2κ

(1− ε)2

(1 +

CB1− ε

)µr√

p(n1 ∧ n2)‖∆LΣ

1/2? ‖F‖∆RΣ

1/2? ‖F;

P2,3 ≤C2C

2Bκ

2

(1− ε)4

µr√p(n1 ∧ n2)

‖∆LΣ1/2? ‖F‖∆RΣ

1/2? ‖F.

We then combine the bounds for P2,1,P2,2 and P2,3 to see

P2 ≤C1

(1− ε)2

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖∆LΣ

1/2? ‖2F

+C2κ

(1− ε)2

(1 +

CB1− ε

+C2Bκ

(1− ε)2

)µr√

p(n1 ∧ n2)‖∆LΣ

1/2? ‖F‖∆RΣ

1/2? ‖F

= δ1‖∆LΣ1/2? ‖2F + δ2‖∆LΣ

1/2? ‖F‖∆RΣ

1/2? ‖F

≤ (δ1 +δ22

)‖∆LΣ1/2? ‖2F +

δ22‖∆RΣ

1/2? ‖2F,

where we denote

δ1 :=C1

(1− ε)2

√µr log(n1 ∨ n2)

p(n1 ∧ n2), and δ2 :=

C2κ

(1− ε)2

(1 +

CB1− ε

+C2Bκ

(1− ε)2

)µr√

p(n1 ∧ n2).

3. Following a similar argument for controlling P2 (i.e. repeatedly using Lemmas 35 and 36), we canobtain the following bounds for P3 and P4, whose proof are deferred to the end of this section.

Claim 7 Under the event E, one has

P3 ≤δ22‖∆LΣ

1/2? ‖2F + (δ1 +

δ22

)‖∆RΣ1/2? ‖2F;

P4 ≤ δ1(δ1 + δ2)‖∆LΣ1/2? ‖2F + δ2(δ1 + δ2)‖∆RΣ

1/2? ‖2F.

Taking the bounds for P1,P2,P3 and P4 collectively yields∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F≤(

(1− η)2 +2ε

1− εη(1− η)

)‖∆LΣ

1/2? ‖2F +

2ε+ ε2

(1− ε)2η2‖∆RΣ

1/2? ‖2F

+ η(1− η)(

(2δ1 + δ2)‖∆LΣ1/2? ‖2F + δ2‖∆RΣ

1/2? ‖2F

)52

Page 53: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

+ η2(δ2‖∆LΣ

1/2? ‖2F + (2δ1 + δ2)‖∆RΣ

1/2? ‖2F

)+ η2

(δ1(δ1 + δ2)‖∆LΣ

1/2? ‖2F + δ2(δ1 + δ2)‖∆RΣ

1/2? ‖2F

).

A similar upper bound holds for the second square in (69). As a result, we reach the conclusion that∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F≤ ρ2(η; ε, δ1, δ2) dist2(Ft,F?),

where the contraction rate ρ2(η; ε, δ1, δ2) is given by

ρ2(η; ε, δ1, δ2) := (1− η)2 +

(2ε

1− ε+ 2(δ1 + δ2)

)η(1− η) +

(2ε+ ε2

(1− ε)2+ 2(δ1 + δ2) + (δ1 + δ2)2

)η2.

As long as p ≥ C(µrκ4 ∨ log(n1 ∨ n2))µr/(n1 ∧ n2) for some sufficiently large constant C, one hasδ1 +δ2 ≤ 0.1 under the setting ε = 0.02. When 0 < η ≤ 2/3, one further has ρ(η; ε, δ1, δ2) ≤ 1−0.6η.Thus we conclude that

dist(Ft+1,F?) ≤√∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F

≤ (1− 0.6η) dist(Ft,F?),

which is exactly the upper bound we are after; see (67). This finishes the proof.Proof [Proof of Claim 6] First, repeating the derivation for (45) obtains (68a). Second, take thecondition (68a) and Lemma 25 together to obtain (68b) and (68c). Third, take the incoherencecondition

√n1‖LR>‖2,∞ ∨

√n2‖RL>‖2,∞ ≤ CB

√µrσ1(X?) together with the relations

‖LR>‖2,∞ ≥ σr(RΣ−1/2? )‖LΣ

1/2? ‖2,∞

≥(σr(R?Σ

−1/2? )− ‖∆RΣ

−1/2? ‖op

)‖LΣ

1/2? ‖2,∞

≥ (1− ε)‖LΣ1/2? ‖2,∞;

‖RL>‖2,∞ ≥ σr(LΣ−1/2? )‖RΣ

1/2? ‖2,∞

≥(σr(L?Σ

−1/2? )− ‖∆LΣ

−1/2? ‖op

)‖RΣ

1/2? ‖2,∞

≥ (1− ε)‖RΣ1/2? ‖2,∞

to obtain (68d) and (68e). Finally, apply the triangle inequality together with incoherence assump-tion to obtain (68f).

Proof [Proof of Claim 7] We start with the term P3, for which we have

|P3| ≤∣∣∣ tr ((p−1PΩ − I)(L?∆

>R)R(R>R)−1Σ?(R

>R)−1R>∆RL>?

) ∣∣∣︸ ︷︷ ︸P3,1

+∣∣∣ tr ((p−1PΩ − I)(∆LR

>)R(R>R)−1Σ?(R>R)−1R>∆RL

>?

) ∣∣∣︸ ︷︷ ︸P3,2

.

Invoke Lemma 35 to bound P3,1 as

P3,1 ≤ C1

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖L?∆>R‖F

∥∥L?∆>RR(R>R)−1Σ?(R>R)−1R>

∥∥F

53

Page 54: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

≤ C1

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖∆RΣ

1/2? ‖2F

∥∥∥R(R>R)−1Σ1/2?

∥∥∥2

op.

The condition (68b) allows us to obtain a simplified bound

P3,1 ≤C1

(1− ε)2

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖∆RΣ

1/2? ‖2F.

In regard to P3,2, we apply Lemma 36 with LA := ∆LΣ1/2? , RA := RΣ

−1/2? , LB := L?Σ

−1/2? , and

RB := R(R>R)−1Σ?(R>R)−1R>∆RΣ

1/2? to see

P3,2 ≤ C2

√n1 ∨ n2

p‖∆LΣ

1/2? ‖F‖L?Σ−1/2

? ‖2,∞‖RΣ−1/2? ‖2,∞

∥∥∥R(R>R)−1Σ?(R>R)−1R>∆RΣ

1/2?

∥∥∥F

≤ C2

√n1 ∨ n2

p‖∆LΣ

1/2? ‖F‖L?Σ−1/2

? ‖2,∞‖RΣ−1/2? ‖2,∞

∥∥∥R(R>R)−1Σ1/2?

∥∥∥2

op‖∆RΣ

1/2? ‖F.

Again, use the consequences in Claim 6 to reach

P3,2 ≤ C2

√n1 ∨ n2

p‖∆LΣ

1/2? ‖F

õr

n1

CBκ

1− ε

õr

n2

1

(1− ε)2‖∆RΣ

1/2? ‖F

=C2CBκ

(1− ε)3

µr√p(n1 ∧ n2)

‖∆LΣ1/2? ‖F‖∆RΣ

1/2? ‖F.

Combine the bounds of P3,1 and P3,2 to reach

P3 ≤C1

(1− ε)2

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖∆RΣ

1/2? ‖2F

+C2CBκ

(1− ε)3

µr√p(n1 ∧ n2)

‖∆LΣ1/2? ‖F‖∆RΣ

1/2? ‖F

≤ δ1‖∆RΣ1/2? ‖2F + δ2‖∆LΣ

1/2? ‖F‖∆RΣ

1/2? ‖F

≤ δ22‖∆LΣ

1/2? ‖2F + (δ1 +

δ22

)‖∆RΣ1/2? ‖2F.

Moving on to the term P4, we have√P4 =

∥∥∥(p−1PΩ − I)(LR> −X?)R(R>R)−1Σ1/2?

∥∥∥F

≤∣∣∣ tr((p−1PΩ − I)(∆LR

>? )R?(R

>R)−1Σ1/2? L>

) ∣∣∣︸ ︷︷ ︸P4,1

+∣∣∣ tr((p−1PΩ − I)(∆LR

>? )∆R(R>R)−1Σ

1/2? L>

) ∣∣∣︸ ︷︷ ︸P4,2

+∣∣∣ tr((p−1PΩ − I)(L∆>R)R(R>R)−1Σ

1/2? L>

) ∣∣∣︸ ︷︷ ︸P4,3

,

where we have used the variational representation of the Frobenius norm for some L ∈ Rn1×r obeying‖L‖F = 1. Note that the decomposition of

√P4 is extremely similar to that of P2. Therefore we

can follow a similar argument (i.e. applying Lemmas 35 and 36) to control these terms as

P4,1 ≤C1

(1− ε)2

√µr log(n1 ∨ n2)

p(n1 ∧ n2)‖∆LΣ

1/2? ‖F;

54

Page 55: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

P4,2 ≤C2κ

(1− ε)2

(1 +

CB1− ε

)µr√

p(n1 ∧ n2)‖∆RΣ

1/2? ‖F;

P4,3 ≤C2C

2Bκ

2

(1− ε)4

µr√p(n1 ∧ n2)

‖∆RΣ1/2? ‖F.

For conciseness, we omit the details for bounding each term. Combine them to reach√P4 ≤ δ1‖∆LΣ

1/2? ‖F + δ2‖∆RΣ

1/2? ‖F.

Finally take the square on both sides and use 2ab ≤ a2 + b2 to obtain the upper bound

P4 ≤ δ1(δ1 + δ2)‖∆LΣ1/2? ‖2F + δ2(δ1 + δ2)‖∆RΣ

1/2? ‖2F.

E.3 Proof of Lemma 21

We start by recording a useful lemma below.

Lemma 37 ((Chen, 2015, Lemma 2), (Chen et al., 2020a, Lemma 4)) For any fixed X ∈Rn1×n2 , with overwhelming probability, one has

∥∥(p−1PΩ − I)(X)∥∥op≤ C0

log(n1 ∨ n2)

p‖X‖∞ + C0

√log(n1 ∨ n2)

p(‖X‖2,∞ ∨ ‖X>‖2,∞),

where C0 > 0 is some universal constant that does not depend on X.

In view of Lemma 24, one has

dist(F0,F?) ≤√√

2 + 1∥∥U0Σ0V

>0 −X?

∥∥F≤√

(√

2 + 1)2r∥∥U0Σ0V

>0 −X?

∥∥op, (70)

where the last relation uses the fact that U0Σ0V>

0 −X? has rank at most 2r. Applying the triangleinequality, we obtain∥∥U0Σ0V

>0 −X?

∥∥op≤∥∥p−1PΩ(X?)−U0Σ0V

>0

∥∥op

+∥∥p−1PΩ(X?)−X?

∥∥op

≤ 2∥∥(p−1PΩ − I)(X?)

∥∥op. (71)

Here the second inequality hinges on the fact that U0Σ0V>

0 is the best rank-r approximation top−1PΩ(X?), i.e. ∥∥p−1PΩ(X?)−U0Σ0V

>0

∥∥op≤∥∥p−1PΩ(X?)−X?

∥∥op.

Combining (70) and (71) yields

dist(F0,F?) ≤ 2

√(√

2 + 1)2r∥∥(p−1PΩ − I)(X?)

∥∥op≤ 5√r∥∥(p−1PΩ − I)(X?)

∥∥op.

It then boils down to controlling∥∥p−1PΩ(X?)−X?

∥∥op, which is readily supplied by Lemma 37 as

∥∥(p−1PΩ − I)(X?)∥∥op≤ C0

log(n1 ∨ n2)

p‖X?‖∞ + C0

√log(n1 ∨ n2)

p(‖X?‖2,∞ ∨ ‖X>? ‖2,∞),

55

Page 56: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

which holds with overwhelming probability. The proof is finished by plugging the following boundsfrom incoherence assumption of X?:

‖X?‖∞ ≤ ‖U?‖2,∞‖Σ?‖op‖V?‖2,∞ ≤µr√n1n2

κσr(X?);

‖X?‖2,∞ ≤ ‖U?‖2,∞‖Σ?‖op‖V?‖op ≤√µr

n1κσr(X?);

‖X>? ‖2,∞ ≤ ‖U?‖op‖Σ?‖op‖V?‖2,∞ ≤√µr

n2κσr(X?).

Appendix F. Proof for General Loss Functions

We first present a useful property of restricted smooth and convex functions.

Lemma 38 Suppose that f : Rn1×n2 7→ R is rank-2r restricted L-smooth and rank-2r restrictedconvex. Then for any X1,X2 ∈ Rn1×n2 of rank at most r, one has

〈∇f(X1)−∇f(X2),X1 −X2〉 ≥1

L‖∇f(X1)−∇f(X2)‖2F,r.

Proof Since f(·) is rank-2r restricted L-smooth and convex, it holds for any X ∈ Rn1×n2 withrank at most 2r that

f(X1) + 〈∇f(X1), X −X1〉 ≤ f(X) ≤ f(X2) + 〈∇f(X2), X −X2〉+L

2‖X −X2‖2F.

Reorganize the terms to yield

f(X1) + 〈∇f(X1),X2 −X1〉 ≤ f(X2) + 〈∇f(X2)−∇f(X1), X −X2〉+L

2‖X −X2‖2F.

Take X = X2 − 1LPr(∇f(X2)−∇f(X1)), whose rank is at most 2r, to see

f(X1) + 〈∇f(X1),X2 −X1〉+1

2L‖∇f(X2)−∇f(X1)‖2F,r ≤ f(X2).

We can further switch the roles of X1 and X2 to obtain

f(X2) + 〈∇f(X2),X1 −X2〉+1

2L‖∇f(X2)−∇f(X1)‖2F,r ≤ f(X1).

Adding the above two inequalities yields the desired bound.

F.1 Proof of Theorem 11

Suppose that the t-th iterate Ft obeys the condition dist(Ft,F?) ≤ 0.1σr(X?)/√κf . In view of

Lemma 22, one knows that Qt, the optimal alignment matrix between Ft and F? exists. Therefore,for notational convenience, denote L := LtQt, R := RtQ

−>t , ∆L := L − L?, ∆R := R −R?, and

ε := 0.1/√κf . Similar to the derivation in (45), we have

‖∆LΣ−1/2? ‖op ∨ ‖∆RΣ

−1/2? ‖op ≤ ε. (72)

The conclusion ‖LtR>t −X?‖F ≤ 1.5 dist(Ft,F?) is a simple consequence of Lemma 26; see (48) fora detailed argument. From now on, we focus on proving the distance contraction.

56

Page 57: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

By the definition of dist(Ft+1,F?), one has

dist2(Ft+1,F?) ≤∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F. (73)

Introduce an auxiliary function

fµ(X) = f(X)− µ

2‖X −X?‖2F,

which is rank-2r restricted (L − µ)-smooth and rank-2r restricted convex. Using the ScaledGDupdate rule (27) and the decomposition LR> −X? = ∆LR

> + L?∆>R, we obtain

(Lt+1Qt −L?)Σ1/2? =

(L− η∇f(LR>)R(R>R)−1 −L?

1/2?

=(L− ηµ(LR> −X?)R(R>R)−1 − η∇fµ(LR>)R(R>R)−1 −L?

1/2?

= (1− ηµ)∆LΣ1/2? − ηµL?∆>RR(R>R)−1Σ

1/2? − η∇fµ(LR>)R(R>R)−1Σ

1/2? .

As a result, one can expand the first square in (73) as∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F=∥∥∥(1− ηµ)∆LΣ

1/2? − ηµL?∆>RR(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸G1

− 2η(1− ηµ)

⟨∇fµ(LR>),∆LΣ?(R

>R)−1R> −∆LR>? −

1

2∆L∆>R

⟩︸ ︷︷ ︸

G2

− 2η(1− ηµ)

⟨∇fµ(LR>),∆LR

>? +

1

2∆L∆>R

⟩+ 2η2µ

⟨∇fµ(LR>),L?∆

>RR(R>R)−1Σ?(R

>R)−1R>⟩︸ ︷︷ ︸

G3

+ η2∥∥∥∇fµ(LR>)R(R>R)−1Σ

1/2?

∥∥∥2

F︸ ︷︷ ︸G4

.

In the sequel, we shall bound the four terms separately.

1. Notice that the main term G1 has already been controlled in (46) under the condition (72). Itobeys

G1 ≤(

(1− ηµ)2 +2ε

1− εηµ(1− ηµ)

)‖∆LΣ

1/2? ‖2F +

2ε+ ε2

(1− ε)2η2µ2‖∆RΣ

1/2? ‖2F,

as long as ηµ ≤ 2/3.

2. For the second term G2, note that ∆LΣ?(R>R)−1R> −∆LR

>? − 1

2∆L∆>R has rank at most r.Hence we can invoke Lemma 28 to obtain

|G2| ≤ ‖∇fµ(LR>)‖F,r∥∥∥∥∆LΣ?(R

>R)−1R> −∆LR>? −

1

2∆L∆>R

∥∥∥∥F

≤ ‖∇fµ(LR>)‖F,r‖∆LΣ1/2? ‖F

(∥∥∥R(R>R)−1Σ1/2? − V?

∥∥∥op

+1

2‖∆RΣ

−1/2? ‖op

),

57

Page 58: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

where the second line uses R? = V?Σ1/2? . Take the condition (72) and Lemma 25 together to

obtain ∥∥∥R(R>R)−1Σ1/2?

∥∥∥op≤ 1

1− ε;∥∥∥R(R>R)−1Σ

1/2? − V?

∥∥∥op≤√

1− ε.

These consequences further imply that

|G2| ≤ (

√2ε

1− ε+ε

2)‖∇fµ(LR>)‖F,r‖∆LΣ

1/2? ‖F.

3. As above, the third term G3 can be similarly bounded as

|G3| ≤ ‖∇fµ(LR>)‖F,r∥∥L?∆>RR(R>R)−1Σ?(R

>R)−1R>∥∥F

≤ ‖∇fµ(LR>)‖F,r‖∆RΣ1/2? ‖F

∥∥∥R(R>R)−1Σ1/2?

∥∥∥2

op

≤ 1

(1− ε)2‖∇fµ(LR>)‖F,r‖∆RΣ

1/2? ‖F.

4. For the last term G4, invoke Lemma 28 to obtain

G4 ≤ ‖∇fµ(LR>)‖2F,r∥∥∥R(R>R)−1Σ

1/2?

∥∥∥2

op≤ 1

(1− ε)2‖∇fµ(LR>)‖2F,r.

Taking collectively the bounds for G1,G2,G3 and G4 yields∥∥∥(Lt+1Qt −L?)Σ1/2?

∥∥∥2

F≤(

(1− ηµ)2 +2ε

1− εηµ(1− ηµ)

)‖∆LΣ

1/2? ‖2F +

2ε+ ε2

(1− ε)2η2µ2‖∆RΣ

1/2? ‖2F

+ 2η(

√2ε

1− ε+ε

2)(1− ηµ)‖∇fµ(LR>)‖F,r‖∆LΣ

1/2? ‖F

− 2η(1− ηµ)

⟨∇fµ(LR>),∆LR

>? +

1

2∆L∆>R

⟩+

2η2µ

(1− ε)2‖∇fµ(LR>)‖F,r‖∆RΣ

1/2? ‖F +

η2

(1− ε)2‖∇fµ(LR>)‖2F,r.

Similarly, we can obtain the control of ‖(Rt+1Q−>t −R?)Σ

1/2? ‖2F. Combine them together to reach∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F

≤(

(1− ηµ)2 +2ε

1− εηµ(1− ηµ) +

2ε+ ε2

(1− ε)2η2µ2

)(‖∆LΣ

1/2? ‖2F + ‖∆RΣ

1/2? ‖2F

)+ 2η

((

√2ε

1− ε+ε

2)(1− ηµ) +

ηµ

(1− ε)2

)‖∇fµ(LR>)‖F,r

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)− 2η(1− ηµ)

⟨∇fµ(LR>),∆LR

>? + L?∆

>R + ∆L∆>R

⟩+

2η2

(1− ε)2‖∇fµ(LR>)‖2F,r

≤(

(1− ηµ)2 +2ε

1− εηµ(1− ηµ) +

2ε+ ε2

(1− ε)2η2µ2

)(‖∆LΣ

1/2? ‖2F + ‖∆RΣ

1/2? ‖2F

)

58

Page 59: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

+ 2η

((

√2ε

1− ε+ε

2)(1− ηµ) +

ηµ

(1− ε)2

)︸ ︷︷ ︸

C1

‖∇fµ(LR>)‖F,r(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)

− 2η

(1− ηµL− µ

− η

(1− ε)2

)︸ ︷︷ ︸

C2

‖∇fµ(LR>)‖2F,r,

where the last line follows from Lemma 38 (notice that ∇fµ(X?) = 0) as

〈∇fµ(LR>),∆LR>? + L?∆

>R + ∆L∆>R〉 = 〈∇fµ(LR>),LR> −X?〉 ≥

1

L− µ‖∇fµ(LR>)‖2F,r.

Notice that C2 > 0 as long as η ≤ (1− ε)2/L. Maximizing the quadratic function of ‖∇fµ(LR>)‖F,ryields

C1‖∇fµ(LR>)‖F,r(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)− C2‖∇fµ(LR>)‖2F,r ≤

C21

4C2

(‖∆LΣ

1/2? ‖F + ‖∆RΣ

1/2? ‖F

)2

≤ C21

2C2

(‖∆LΣ

1/2? ‖2F + ‖∆RΣ

1/2? ‖2F

),

where the last inequality holds since (a + b)2 ≤ 2(a2 + b2). Identify dist2(Ft,F?) = ‖∆LΣ1/2? ‖2F +

‖∆RΣ1/2? ‖2F to obtain∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F≤ ρ2(η; ε, µ, L) dist2(Ft,F?),

where the contraction rate is given by

ρ2(η; ε, µ, L) := (1− ηµ)2 +2ε

1− εηµ(1− ηµ) +

2ε+ ε2

(1− ε)2η2µ2 +

((√

2ε1−ε + ε

2 )(1− ηµ) + ηµ(1−ε)2

)2

1− ηµ− η(L−µ)(1−ε)2

η(L− µ).

With ε = 0.1/√κf and 0 < η ≤ 0.4/L, one has ρ(η; ε, µ, L) ≤ 1− 0.7ηµ. Thus we conclude that

dist(Ft+1,F?) ≤√∥∥∥(Lt+1Qt −L?)Σ

1/2?

∥∥∥2

F+∥∥∥(Rt+1Q

−>t −R?)Σ

1/2?

∥∥∥2

F

≤ (1− 0.7ηµ) dist(Ft,F?),

which is the desired claim.

Remark 39 We provide numerical details for the contraction rate. For simplicity, we shall proveρ(η; ε, µ, L) ≤ 1 − 0.7ηµ under a stricter condition ε = 0.02/

√κf . The stronger result under the

condition ε = 0.1/√κf can be verified through a subtler analysis.

With ε = 0.02/√κf and 0 < η ≤ 0.4/L, one can bound the terms in ρ2(η; ε, µ, L) as

(1− ηµ)2 +2ε

1− εηµ(1− ηµ) +

2ε+ ε2

(1− ε)2η2µ2 ≤ 1− 1.959ηµ+ 1.002η2µ2; (74)(

(√

2ε1−ε + ε

2 )(1− ηµ) + ηµ(1−ε)2

)2

1− ηµ− η(L−µ)(1−ε)2

η(L− µ) ≤0.0016κf

+ 0.078ηµ+ 1.005η2µ2

1− 1.042ηLηL

≤0.0016η L

κf+ 0.4× (0.078ηµ+ 1.005η2µ2)

1− 0.4× 1.042

59

Page 60: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

≤ 0.057ηµ+ 0.69η2µ2, (75)

where the last line uses the definition (28) of κf . Putting (74) and (75) together further implies

ρ2(η; ε, µ, L) ≤ 1− 1.9ηµ+ 1.7η2µ2 ≤ (1− 0.7ηµ)2,

as long as 0 < ηµ ≤ 0.4.

References

Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning fromexamples without local minima. Neural networks, 2(1):53–58, 1989.

Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for fastersemi-definite optimization. In Conference on Learning Theory, pages 530–582. PMLR, 2016a.

Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for lowrank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881,2016b.

Jian-Feng Cai, Tianming Wang, and Ke Wei. Spectral compressed sensing via projected gradientdescent. SIAM Journal on Optimization, 28(3):2625–2653, 2018.

Emmanuel Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow:Theory and algorithms. Information Theory, IEEE Transactions on, 61(4):1985–2007, 2015.

Emmanuel J Candès and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from aminimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.

Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization.Foundations of Computational Mathematics, 9(6):717–772, 2009.

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis?Journal of the ACM, 58(3):11:1–11:37, 2011.

Venkat Chandrasekaran, Sujay Sanghavi, Pablo Parrilo, and Alan Willsky. Rank-sparsity incoher-ence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo Díaz, Lijun Ding, and Dmitriy Drusvy-atskiy. Low-rank matrix recovery with composite optimization: good conditioning and rapidconvergence. Foundations of Computational Mathematics, pages 1–89, 2021.

Ji Chen and Xiaodong Li. Model-free nonconvex matrix completion: Local minima analysis andapplications in memory-efficient kernel PCA. Journal of Machine Learning Research, 20(142):1–39, 2019.

Ji Chen, Dekai Liu, and Xiaodong Li. Nonconvex rectangular matrix completion via gradient descentwithout `2,∞ regularization. IEEE Transactions on Information Theory, 66(9):5806–5841, 2020a.

Yudong Chen. Incoherence-optimal matrix completion. IEEE Transactions on Information Theory,61(5):2909–2923, 2015.

Yudong Chen and Yuejie Chi. Harnessing structures in big data via guaranteed low-rank matrixestimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEESignal Processing Magazine, 35(4):14 – 31, 2018.

60

Page 61: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent:General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.

Yuxin Chen and Yuejie Chi. Robust spectral compressed sensing via structured matrix completion.IEEE Transactions on Information Theory, 60(10):6576–6601, 2014.

Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Noisy matrix completion: Under-standing statistical guarantees for convex relaxation via nonconvex optimization. SIAM Journalon Optimization, 30(4):3098–3121, 2020b.

Yuxin Chen, Jianqing Fan, Cong Ma, and Yuling Yan. Bridging convex and nonconvex optimizationin robust PCA: Noise, outliers, and missing data. arXiv preprint arXiv:2001.05484, 2020c.

Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factoriza-tion: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019.

Damek Davis, Dmitriy Drusvyatskiy, and Courtney Paquette. The nonsmooth landscape of phaseretrieval. arXiv preprint arXiv:1711.03247, 2017.

Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneousmodels: Layers are automatically balanced. In Advances in Neural Information Processing Sys-tems, pages 384–395, 2018.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points-online stochasticgradient for tensor decomposition. In Conference on Learning Theory (COLT), pages 797–842,2015.

Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. InAdvances in Neural Information Processing Systems, pages 2973–2981, 2016.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: Aunified geometric analysis. In International Conference on Machine Learning, pages 1233–1242,2017.

Suriya Gunasekar, Pradeep Ravikumar, and Joydeep Ghosh. Exponential family matrix completionunder structural constraints. In International Conference on Machine Learning, pages 1917–1925,2014.

Moritz Hardt and Mary Wootters. Fast matrix completion without the condition number. InProceedings of The 27th Conference on Learning Theory, pages 638–678, 2014.

Prateek Jain and Purushottam Kar. Non-convex optimization for machine learning. Foundationsand Trends R© in Machine Learning, 10(3-4):142–336, 2017.

Prateek Jain, Raghu Meka, and Inderjit S Dhillon. Guaranteed rank minimization via singular valueprojection. In Advances in Neural Information Processing Systems, pages 937–945, 2010.

Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using al-ternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory ofcomputing, pages 665–674. ACM, 2013.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escapesaddle points efficiently. In International Conference on Machine Learning, pages 1724–1732,2017.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances in neural informationprocessing systems, pages 586–594, 2016.

61

Page 62: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Tong, Ma, Chi

Anastasios Kyrillidis and Volkan Cevher. Matrix ALPS: Accelerated low rank and sparse matrixreconstruction. In 2012 IEEE Statistical Signal Processing Workshop (SSP), pages 185–188. IEEE,2012.

Jean Lafond. Low rank matrix completion with exponential family noise. In Conference on LearningTheory, pages 1224–1243, 2015.

Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blinddeconvolution via nonconvex optimization. Applied and computational harmonic analysis, 47(3):893–934, 2019.

Y. Li, C. Ma, Y. Chen, and Y. Chi. Nonconvex matrix factorization from rank-one measurements.IEEE Transactions on Information Theory, 67(3):1928–1950, 2021.

Yuetian Luo, Wen Huang, Xudong Li, and Anru R Zhang. Recursive importance sketchingfor rank constrained least squares: Algorithms and high-order convergence. arXiv preprintarXiv:2011.08360, 2020.

Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvexstatistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion,and blind deconvolution. Foundations of Computational Mathematics, pages 1–182, 2019.

Cong Ma, Yuanxin Li, and Yuejie Chi. Beyond Procrustes: Balancing-free gradient descent forasymmetric low-rank matrix sensing. IEEE Transactions on Signal Processing, 69:867–877, 2021.

Mantas Mazeika. The singular value decomposition and low rank approximation. Technical report,University of Chicago, 2016.

Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses.The Annals of Statistics, 46(6A):2747–2774, 2018.

Bamdev Mishra and Rodolphe Sepulchre. Riemannian preconditioning. SIAM Journal on Optimiza-tion, 26(1):635–660, 2016.

Bamdev Mishra, K Adithya Apuroop, and Rodolphe Sepulchre. A Riemannian geometry for low-rankmatrix completion. arXiv preprint arXiv:1211.1550, 2012.

Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global perfor-mance. Mathematical Programming, 108(1):177–205, 2006.

Praneeth Netrapalli, UN Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain.Non-convex robust PCA. In Advances in Neural Information Processing Systems, pages 1107–1115, 2014.

Dohyung Park, Anastasios Kyrillidis, Constantine Carmanis, and Sujay Sanghavi. Non-square ma-trix sensing without spurious local minima via the Burer-Monteiro approach. In Artificial Intel-ligence and Statistics, pages 65–74, 2017.

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM Journal onImaging Sciences, 11(4):2165–2204, 2018.

Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linearmatrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.

Sujay Sanghavi, Rachel Ward, and Chris DWhite. The local convexity of solving systems of quadraticequations. Results in Mathematics, 71(3-4):569–608, 2017.

62

Page 63: Accelerating Ill-Conditioned Low-Rank Matrix Estimation ...

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent

Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery using nonconvex optimization. InProceedings of the 32nd International Conference on Machine Learning, pages 2351–2360, 2015.

Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations ofComputational Mathematics, 18(5):1131–1198, 2018.

Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEETransactions on Information Theory, 62(11):6535–6579, 2016.

Jared Tanner and Ke Wei. Low rank matrix completion by alternating steepest descent methods.Applied and Computational Harmonic Analysis, 40(2):417–429, 2016.

Tian Tong, Cong Ma, and Yuejie Chi. Low-rank matrix recovery with scaled subgradient meth-ods: Fast and robust convergence without the condition number. IEEE Transactions on SignalProcessing, 2021a.

Tian Tong, Cong Ma, Ashley Prater-Bennette, Erin Tripp, and Yuejie Chi. Scaling and scalability:Provable nonconvex low-rank tensor estimation from incomplete measurements. arXiv preprintarXiv:2104.14526, 2021b.

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-ranksolutions of linear matrix equations via Procrustes flow. In International Conference MachineLearning, pages 964–973, 2016.

Ke Wei, Jian-Feng Cai, Tony F Chan, and Shingyu Leung. Guarantees of Riemannian optimizationfor low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 37(3):1198–1222, 2016.

Xinyang Yi, Dohyung Park, Yudong Chen, and Constantine Caramanis. Fast algorithms for robustPCA via gradient descent. In Advances in neural information processing systems, pages 4152–4160,2016.

Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization andsemidefinite programming from random linear measurements. In Advances in Neural InformationProcessing Systems, pages 109–117, 2015.

Qinqing Zheng and John Lafferty. Convergence analysis for rectangular matrix completion usingBurer-Monteiro factorization and gradient descent. arXiv preprint arXiv:1605.07051, 2016.

Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B Wakin. Global optimality in low-rank matrixoptimization. IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.

63