Smoothed Analysis of Tensor Decompositions


Aditya Bhaskara∗ Moses Charikar† Ankur Moitra‡ Aravindan Vijayaraghavan§

Abstract

Low rank decomposition of tensors is a powerful tool for learning generative models. The uniqueness of decomposition gives tensors a significant advantage over matrices. However, tensors pose significant algorithmic challenges, and tensor analogs of much of the matrix algebra toolkit are unlikely to exist because of hardness results. Efficient decomposition in the overcomplete case (where rank exceeds dimension) is particularly challenging. We introduce a smoothed analysis model for studying these questions and develop an efficient algorithm for tensor decomposition in the highly overcomplete case (rank polynomial in the dimension). In this setting, we show that our algorithm is robust to inverse polynomial error – a crucial property for applications in learning, since we are only allowed a polynomial number of samples. While algorithms are known for exact tensor decomposition in some overcomplete settings, our main contribution is in analyzing their stability in the framework of smoothed analysis.

Our main technical contribution is to show that tensor products of perturbed vectors are linearly independent in a robust sense (i.e. the associated matrix has singular values that are at least an inverse polynomial). This key result paves the way for applying tensor methods to learning problems in the smoothed setting. In particular, we use it to obtain results for learning multi-view models and mixtures of axis-aligned Gaussians where there are many more “components” than dimensions. The assumption here is that the model is not adversarially chosen, formalized by a perturbation of model parameters. We believe this is an appealing way to analyze realistic instances of learning problems, since this framework allows us to overcome many of the usual limitations of using tensor methods.

∗Google Research NYC. Email: [email protected]. Work done while the author was at EPFL, Switzerland.
†Princeton University. Email: [email protected]. Supported by NSF awards CCF 0832797, AF 1218687 and CCF 1302518.
‡Massachusetts Institute of Technology, Department of Mathematics and CSAIL. Email: [email protected]. Part of this work was done while the author was a postdoc at the Institute for Advanced Study and was supported in part by NSF grant No. DMS-0835373 and by an NSF Computing and Innovation Fellowship.
§Carnegie Mellon University. Email: [email protected]. Supported by the Simons Postdoctoral Fellowship.


1 Introduction

1.1 Background

Tensor decompositions play a central role in modern statistics (see e.g. [27]). To illustrate their usefulness, suppose we are given a matrix M = ∑_{i=1}^R a_i ⊗ b_i. When can we uniquely recover the factors {a_i}_i and {b_i}_i of this decomposition given access to M? In fact, this decomposition is almost never unique (unless we require that the factors {a_i}_i and {b_i}_i are orthonormal, or that M has rank one). But given a tensor T = ∑_{i=1}^R a_i ⊗ b_i ⊗ c_i, there are general conditions under which {a_i}_i, {b_i}_i and {c_i}_i are uniquely determined (up to scaling) given T; perhaps the most famous such condition is due to Kruskal [24], which we review in the next section.

Tensor methods are commonly used to establish that the parameters of a generative model can be identified given third (or higher) order moments. In contrast, given just second-order moments (e.g. M) we can only hope to recover the factors up to a rotation. This is called the rotation problem and has been an important issue in statistics since the pioneering work of psychologist Charles Spearman (1904) [31]. Tensors offer a path around this obstacle precisely because their decompositions are often unique, and consequently have found applications in phylogenetic reconstruction [11], [29], hidden Markov models [29], mixture models [20], topic modeling [5], community detection [3], etc.

However most tensor problems are hard: computing the rank [17], the best rank one approximation [18] and the spectral norm [18] are all NP-hard. Also many of the familiar properties of matrices do not generalize to tensors. For example, subtracting the best rank one approximation to a tensor can actually increase its rank [34], and there are rank three tensors that can be approximated arbitrarily well by a sequence of rank two tensors. One of the few algorithmic results for tensors is an algorithm for computing tensor decompositions in a restricted case. Let A, B and C be matrices whose columns are {a_i}_i, {b_i}_i and {c_i}_i respectively.

Theorem 1.1. [25], [11] If rank(A) = rank(B) = R and no pair of columns in C are multiples of each other, then there is a polynomial time algorithm to compute the minimum rank tensor decomposition of T. Moreover the rank one terms in this decomposition are unique (among all decompositions with the same rank).

If T is an n × n × n tensor, then R can be at most n in order for the conditions of the theorem to be met. This basic algorithm has been used to design efficient algorithms for phylogenetic reconstruction [11], [29], topic modeling [5], community detection [3] and learning hidden Markov models and mixtures of spherical Gaussians [20]. However algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and our goal is to develop stable algorithms that work for R = poly(n). Recently Goyal et al [16] gave a robustness analysis for this decomposition, and we give an alternative proof in Appendix A.

In fact, this basic tensor decomposition can be bootstrapped to work even when R is larger than n (if we also increase the order of the tensor). The key parameter that dictates when one can efficiently find a tensor decomposition (or more generally, when it is unique) is the Kruskal rank:

Definition 1.2. The Kruskal rank (or Krank) of a matrix A is the largest k for which every set of k columns are linearly independent. Also the τ-robust k-rank is denoted by Krank_τ(A), and is the largest k for which every n × k sub-matrix A|_S of A has σ_k(A|_S) ≥ 1/τ.
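To make Definition 1.2 concrete, here is a small brute-force numpy sketch (ours, not from the paper) that computes Krank_τ directly from the definition by checking the smallest singular value of every column subset; it is exponential in the number of columns and only meant for tiny examples.

```python
# Brute-force computation of the tau-robust Kruskal rank (Definition 1.2).
# Exponential in the number of columns; illustrative only.
import itertools
import numpy as np

def robust_kruskal_rank(A, tau):
    """Largest k such that every n x k column submatrix A|S has sigma_k(A|S) >= 1/tau."""
    n, R = A.shape
    krank = 0
    for k in range(1, min(n, R) + 1):
        ok = all(
            np.linalg.svd(A[:, list(S)], compute_uv=False)[-1] >= 1.0 / tau
            for S in itertools.combinations(range(R), k)
        )
        if not ok:
            break
        krank = k
    return krank

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 7))
    A /= np.linalg.norm(A, axis=0)               # unit-norm columns
    print(robust_kruskal_rank(A, tau=100.0))     # at most min(n, R) = 5
```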

How can we push the above theorem beyond R = n? We can instead work with an order ℓ tensor. To be concrete, set ℓ = 5 and suppose T is an n × n × ... × n tensor. We can “flatten” T to get an order three tensor

T = ∑_{i=1}^R (A^{(1)}_i ⊗ A^{(2)}_i) ⊗ (A^{(3)}_i ⊗ A^{(4)}_i) ⊗ A^{(5)}_i,

where each parenthesized group plays the role of one factor of the order three tensor.

Hence we get an order three tensor T of size n^2 × n^2 × n. Alternatively we can define this “flattening” using the following operation:

Definition 1.3. The Khatri-Rao product of U and V, which are of size m × r and n × r respectively, is the mn × r matrix U ⊙ V whose ith column is u_i ⊗ v_i.

Our new order three tensor T can be written as:

T = ∑_{i=1}^R (A^{(1)} ⊙ A^{(2)})_i ⊗ (A^{(3)} ⊙ A^{(4)})_i ⊗ A^{(5)}_i

The factors are the columns of A^{(1)} ⊙ A^{(2)}, the columns of A^{(3)} ⊙ A^{(4)} and the columns of A^{(5)}. The crucial point is that the Kruskal rank of the columns of A^{(1)} ⊙ A^{(2)} is in fact at least the sum of the Kruskal ranks of the columns of A^{(1)} and A^{(2)} (and similarly for A^{(3)} ⊙ A^{(4)}) [1], [9], but this is tight in the worst case. Consequently this “flattening” operation allows us to use the above algorithm up to R = 2n; since the rank (R) is larger than the largest dimension (n), this is called the overcomplete case.
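The following numpy sketch (ours, with arbitrary small dimensions) implements the Khatri-Rao product of Definition 1.3 and checks the flattening identity above: reshaping ∑_i A^{(1)}_i ⊗ ··· ⊗ A^{(5)}_i into an n^2 × n^2 × n tensor gives exactly ∑_i (A^{(1)} ⊙ A^{(2)})_i ⊗ (A^{(3)} ⊙ A^{(4)})_i ⊗ A^{(5)}_i.

```python
# Sketch: the Khatri-Rao product and the "flattening" of an order-5 tensor.
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: the i-th column of U (.) V is u_i tensor v_i."""
    m, r = U.shape
    n, r2 = V.shape
    assert r == r2
    return np.einsum('ir,jr->ijr', U, V).reshape(m * n, r)

n, R = 4, 6
rng = np.random.default_rng(1)
A = [rng.standard_normal((n, R)) for _ in range(5)]

# Order-5 tensor T = sum_i A1_i tensor A2_i tensor A3_i tensor A4_i tensor A5_i.
T5 = np.einsum('ar,br,cr,dr,er->abcde', *A)

# Flatten modes (1,2) and (3,4) into single modes of dimension n^2.
T3 = T5.reshape(n * n, n * n, n)

# The same order-3 tensor built directly from the Khatri-Rao factors.
B1, B2 = khatri_rao(A[0], A[1]), khatri_rao(A[2], A[3])
T3_kr = np.einsum('ar,br,cr->abc', B1, B2, A[4])

print(np.allclose(T3, T3_kr))   # True
```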

Our main technical result is that in a natural smoothed analysis model, the Kruskal rank robustly multiplies, and this allows us to give algorithms for computing a tensor decomposition even in the highly overcomplete case, for any R = poly(n) (provided that the order of the tensor is large – but still a constant). Moreover our algorithms have immediate applications in learning mixtures of Gaussians and multi-view mixture models.

1.2 Our Results

We introduce the following framework for studying tensor decomposition problems:

• An adversary chooses a tensor T = ∑_{i=1}^R A^{(1)}_i ⊗ A^{(2)}_i ⊗ ... ⊗ A^{(ℓ)}_i.

• Each vector a^{(j)}_i is ρ-perturbed to yield ã^{(j)}_i.¹

• We are given T̃ = ∑_{i=1}^R Ã^{(1)}_i ⊗ Ã^{(2)}_i ⊗ ... ⊗ Ã^{(ℓ)}_i (possibly with noise).

Our goal is to recover the factors {Ã^{(1)}_i}_i, {Ã^{(2)}_i}_i, ..., {Ã^{(ℓ)}_i}_i (up to rescaling). This model is directly inspired by smoothed analysis, which was introduced by Spielman and Teng [32], [33] as a framework in which to understand why certain algorithms perform well on realistic inputs.

In applications in learning, tensors are used to encode low-order moments of the distribution. In particular, each factor in the decomposition represents a “component”. The intuition is that if these “components” are not chosen in a worst-case configuration, then we can obtain vastly improved learning algorithms in various settings. For example, as a direct consequence of our main result, we will give new algorithms for learning mixtures of spherical Gaussians again in the framework of smoothed analysis (without any additional separation conditions). There are no known polynomial time algorithms to learn such mixtures if the number of components (k) is larger than the dimension (n). But if their means are perturbed, we give a polynomial time algorithm for any k = poly(n) by virtue of our tensor decomposition algorithm.

¹An (independent) random Gaussian with zero mean and variance ρ^2/n in each coordinate is added to a^{(j)}_i to obtain ã^{(j)}_i. We note that we make the Gaussian assumption for convenience, but our analysis seems to apply to more general perturbations.
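As an illustration (our own sketch, not from the paper), the smoothed model above can be simulated as follows, using footnote 1's convention of adding independent N(0, ρ^2/n) noise to every coordinate of every factor; the dimensions and noise level below are arbitrary.

```python
# Sketch of the smoothed analysis model: adversarial factors, coordinate-wise
# Gaussian perturbation, and the observed (noisy) tensor built from them.
import numpy as np

def rho_perturb(A, rho, rng):
    """Add independent N(0, rho^2/n) noise to each coordinate (footnote 1)."""
    n = A.shape[0]
    return A + rng.normal(scale=rho / np.sqrt(n), size=A.shape)

n, R, ell, rho = 10, 20, 3, 0.1            # overcomplete: R > n
rng = np.random.default_rng(2)

# Adversarially chosen factors (here just arbitrary unit-norm columns).
A = [rng.standard_normal((n, R)) for _ in range(ell)]
A = [M / np.linalg.norm(M, axis=0) for M in A]

A_tilde = [rho_perturb(M, rho, rng) for M in A]

# Observed tensor: sum_i A~(1)_i tensor A~(2)_i tensor A~(3)_i, plus a small error E.
T = np.einsum('ar,br,cr->abc', *A_tilde)
E = 1e-6 * rng.standard_normal(T.shape)
T_observed = T + E
```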


Our main technical result is the following:

Theorem 1.4. Let R ≤ n^ℓ/2 for some constant ℓ ∈ ℕ. Let A^{(1)}, A^{(2)}, ..., A^{(ℓ)} be n × R matrices with columns of unit norm, and let Ã^{(1)}, Ã^{(2)}, ..., Ã^{(ℓ)} ∈ ℝ^{n×R} be their respective ρ-perturbations. Then for τ = (n/ρ)^{3^ℓ}, the Khatri-Rao product satisfies

Krank_τ(Ã^{(1)} ⊙ Ã^{(2)} ⊙ ... ⊙ Ã^{(ℓ)}) = R   w.p. at least 1 − exp(−Cn^{1/3^ℓ})   (1)

In general the Kruskal rank adds [1, 9], but in the framework of smoothed analysis it robustly multiplies. What is crucial here is that we have a quantitative lower bound (via τ) on how far these vectors are from being linearly dependent. In almost all of the applications of tensor methods, we are not given T exactly but rather with some noise. This error could arise, for example, because we are using a finite number of samples to estimate the moments of a distribution. It is the condition number of Ã^{(1)} ⊙ Ã^{(2)} ⊙ ... ⊙ Ã^{(ℓ)} that will control whether various tensor decomposition algorithms work in the presence of noise.

Another crucial property our method achieves is exponentially small failure probability for any constant ℓ, for our polynomial bound on τ. In particular for ℓ = 2, we show (in Theorem 3.1) that for ρ-perturbations of two n × n^2/2 matrices U and V, we have Krank_τ(Ũ ⊙ Ṽ) = n^2/2 for τ = n^{O(1)}/ρ^2 (i.e. the smallest singular value is at least ρ^2/n^{O(1)}), with probability 1 − exp(−√n). We remark that it is fairly straightforward to obtain the above statement (for ℓ = 2) for failure probability δ, with τ = (n/δ)^{O(1)} (see Remark 3.7 for more on the latter); however, this is not desirable since the running time depends polynomially on the inverse of the minimum singular value, i.e. on τ (and hence on 1/δ).

We obtain the following main theorem from the above result and from analyzing the stability of the algorithm of Leurgans et al [25] (see Theorem 2.3):

Theorem 1.5. Let R ≤ n^{⌊(ℓ−1)/2⌋}/2 for some constant ℓ ∈ ℕ. Suppose we are given T + E where T and E are order-ℓ tensors, T has rank R and is obtained from the above smoothed analysis model. Moreover suppose the entries of E are at most ε(ρ/n)^{3^ℓ} where ε < 1. Then there is an algorithm to recover the rank one terms ⊗_{j=1}^{ℓ} ã^{(j)}_i up to an additive ε error. The algorithm runs in time n^{C·3^ℓ} and succeeds with probability at least 1 − exp(−Cn^{1/3^ℓ}).

As we discussed, tensor methods have had numerous applications in learning. However algorithms that make use of tensor decompositions have traditionally been limited to the full-rank case, and hence can only handle cases when the number of “components” is at most the dimension. However by using our main theorem above, we can get new algorithms for some of these problems that work even if there are many more “components” than dimensions.

Multi-view Models (Section 4)

In this setting, each sample is composed of ℓ views x^{(1)}, x^{(2)}, ..., x^{(ℓ)} which are conditionally independent given which component i ∈ [R] the sample is generated from. Hence such a model is specified by R mixing weights w_i and R discrete distributions μ_i^{(1)}, ..., μ_i^{(j)}, ..., μ_i^{(ℓ)}, one for each view. Such models are very expressive and are used as a common abstraction for a number of inference problems. Anandkumar et al [2] gave algorithms in the full rank setting. However, in many practical settings like speech recognition and image classification, the dimension of the feature space is typically much smaller than the number of components. If we suppose that the distributions that make up the multi-view model are ρ-perturbed (analogously to the tensor setting), then we can give the first known algorithms for the overcomplete setting. Suppose that the means (μ_i^{(j)}) are ρ-perturbed to obtain μ̃_i^{(j)}. Then:

Theorem 1.6. There is an algorithm to learn the parameters w_i and μ̃_i^{(j)} of an ℓ-view multi-view model with R ≤ n^{⌊(ℓ−1)/2⌋}/2 components up to an accuracy ε. The running time and sample complexity are at most poly_ℓ(n, 1/ε, 1/ρ), and the algorithm succeeds with probability at least 1 − exp(−Cn^{1/3^ℓ}) for some constant C > 0.

Mixtures of Axis-Aligned Gaussians (Section 5)

Here we are given samples from a distribution F = ∑_{i=1}^{k} w_i F_i(μ_i, Σ_i) where F_i(μ_i, Σ_i) is a Gaussian with mean μ_i and covariance Σ_i, and each Σ_i is diagonal. These mixtures are ubiquitous throughout machine learning. Feldman et al [14] gave an algorithm for PAC-learning mixtures of axis-aligned Gaussians, however the running time is exponential in k, the number of components. Hsu and Kakade [20] gave a polynomial time algorithm for learning mixtures of spherical Gaussians provided that their means are full rank (hence k ≤ n). Again, we turn to the framework of smoothed analysis and suppose that the means are ρ-perturbed. In this framework, we can give a polynomial time algorithm for learning mixtures of axis-aligned Gaussians for any k = poly(n). Suppose that the means of a mixture of axis-aligned Gaussians have been ρ-perturbed to obtain μ̃_i. Then:

Theorem 1.7. There is an algorithm to learn the parameters w_i, μ̃_i and Σ_i of a mixture of k ≤ n^{⌊(ℓ−1)/2⌋}/(2ℓ) axis-aligned Gaussians up to an accuracy ε. The running time and sample complexity are at most poly_ℓ(n, 1/ε, 1/ρ), and the algorithm succeeds with probability at least 1 − exp(−Cn^{1/3^ℓ}) for some constant C > 0.

We believe that our new algorithms for overcomplete tensor decomposition will have further applications in learning. Additionally, this framework of studying distribution learning when the parameters of the distribution we would like to learn are not chosen adversarially seems quite appealing.

Remark 1.8. Recall, our main technical result is that the Kruskal rank robustly multiplies. In fact, it is easy to see that for a generic set of vectors it multiplies [1]. This observation, in conjunction with the algorithm of Leurgans et al [25], yields an algorithm for tensor decomposition in the overcomplete case. Another approach to overcomplete tensor decomposition was given by [13], which works up to R ≤ n^{⌊ℓ/2⌋}. However these algorithms assume that we know T exactly, and are not known to be stable when we are given T with noise. The main issue is that these algorithms are based on solving a linear system which is full rank if the factors of T are generic, but what controls whether or not these linear systems can handle noise is their condition number.

Alternatively, algorithms for overcomplete tensor decomposition that assume we know T exactly would not have any applications in learning, because we would need to take too many samples to have a good enough estimate of T (i.e. the low-order moments of the distribution).

In recent work, Goyal et al [16] also made use of robust algorithms for overcomplete tensor decomposition, and their main application is underdetermined independent component analysis (ICA). The condition that they need to impose on the tensor holds generically (like ours, see e.g. Corollary 2.4), and they show in a smoothed analysis model that this condition holds with inverse polynomial failure probability. However here our focus was on showing a lower bound for the condition number of M_ℓ that does not depend (polynomially) on the failure probability. We focus on the failure probability being small (in particular, exponentially small), because in smoothed analysis the perturbation is “one-shot” and if it does not result in an easy instance, you cannot ask for a new one!

1.3 Our Approach

Here we give some intuition for how we prove our main technical theorem, at least in the ℓ = 2 case. Recall, we are given two matrices U^{(1)} and U^{(2)} whose R columns are ρ-perturbed to obtain Ũ^{(1)} and Ũ^{(2)} respectively. Our goal is to prove that if R ≤ n^2/2 then the matrix Ũ^{(1)} ⊙ Ũ^{(2)} has smallest singular value at least poly(1/n, ρ) with high probability. In fact, it will be easier to work with what we call the leave-one-out distance (see Definition 3.4) as a surrogate for the smallest singular value (see Lemma 3.5). Alternatively, if we let x and y be the first columns of Ũ^{(1)} and Ũ^{(2)} respectively, and we set

U = span(Ũ^{(1)}_i ⊗ Ũ^{(2)}_i : 2 ≤ i ≤ R)

then we would like to prove that with high probability x ⊗ y has a non-negligible projection on the orthogonal complement of U. This is the core of our approach. Set V to be the orthogonal complement of U. In fact, we prove that for any subspace V of dimension at least n^2/2, with high probability x ⊗ y has a non-negligible projection onto V.

How can we reason about the projection of x ⊗ y onto an arbitrary (but large dimensional) subspace? If V were (say) the set of all low-rank matrices, then this would be straightforward. But what complicates this is that we are looking at the projection of a rank one matrix onto a large dimensional subspace of matrices, and these two spaces can be structured quite differently. A natural approach is to construct matrices M_1, M_2, ..., M_p ∈ V so that with high probability at least one quadratic form x^T M_i y is non-negligible. Suppose the following condition were met (in which case we would be done): there is a large set S of indices so that each vector x^T M_i, i ∈ S, has a large projection onto the orthogonal complement of span(x^T M_j : j ∈ S, j ≠ i). In fact, if such a set S exists with high probability then this would yield our main technical theorem in the ℓ = 2 case. Our main step is in constructing a family of matrices M_1, M_2, ..., M_p that help us show that S is large. We call this a (θ, δ)-orthogonal system (see Definition 3.13). The intuition behind this definition is that if we reveal a column in one of the M_i's that has a significant orthogonal component to all of the columns that we have revealed so far, this is in effect a fresh source of randomness that can help us add another index to the set S. See Section 3 for a more complete description of our approach in the ℓ = 2 case. The approach for ℓ > 2 relies on the same basic strategy but requires a more delicate induction argument. See Section 3.4.

2 Prior Algorithms

Here we review the algorithm of Leurgans et al [25]. It has been discovered many times in different settings. It is sometimes referred to as “simultaneous diagonalization” or as Chang's lemma [11].

Suppose we are given a third-order tensor T = ∑_{i=1}^{R} u_i ⊗ v_i ⊗ w_i which is n × m × p. Let U, V and W be matrices whose columns are the u_i, v_i and w_i respectively. Suppose further that (1) rank(U) = rank(V) = R and (2) k-rank(W) ≥ 2. Then we can efficiently recover the factors of T.

We present the algorithm Decompose and its analysis assuming n = m = R. Any instance with rank(U) = rank(V) = R can be reduced to this case as follows: find the span of the vectors u_{j,k}, where u_{j,k} is the n-dimensional vector whose ith entry is T_{ijk}. This span must be precisely the span of the columns of U.² Thus we can pick some orthonormal basis for this span, and write T as an R × m × p tensor. We can perform this operation again (along the second mode) to move to an R × R × p tensor.
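Under our reading of the reduction just described, it can be written concretely as the following numpy sketch (illustrative only): project the first two modes onto orthonormal bases for the spans of the corresponding slice vectors.

```python
# Sketch of the preprocessing: reduce an n x m x p rank-R tensor (with
# rank(U) = rank(V) = R) to an R x R x p tensor by changing bases along
# the first two modes.
import numpy as np

def reduce_to_RxRxp(T, R):
    n, m, p = T.shape
    # Mode-1 unfolding: its columns are the vectors u_{j,k} with (u_{j,k})_i = T_{ijk}.
    Q1 = np.linalg.svd(T.reshape(n, m * p), full_matrices=False)[0][:, :R]
    T = np.einsum('ia,ajk->ijk', Q1.T, T)                   # now R x m x p
    # Repeat along the second mode.
    Q2 = np.linalg.svd(T.transpose(1, 0, 2).reshape(m, R * p), full_matrices=False)[0][:, :R]
    T = np.einsum('ja,iak->ijk', Q2.T, T)                   # now R x R x p
    return T, Q1, Q2

rng = np.random.default_rng(4)
n, m, p, R = 8, 7, 6, 5
U, V, W = rng.standard_normal((n, R)), rng.standard_normal((m, R)), rng.standard_normal((p, R))
T = np.einsum('ir,jr,kr->ijk', U, V, W)
T_small, Q1, Q2 = reduce_to_RxRxp(T, R)
print(T_small.shape)    # (5, 5, 6)
```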

Theorem 2.1. [25], [11] Given a tensor T, there exists an algorithm that runs in polynomial time and recovers the (unique) factors of T provided that (1) rank(U) = rank(V) = R and (2) k-rank(W) ≥ 2.

Proof: The algorithm is to pre-process as above (i.e., obtain m = n = R), and then run Decompose, stated below. Let us thus analyze Decompose with m, n being R.

We can write T_a = U D_a V^T where D_a = diag(a^T w_1, a^T w_2, ..., a^T w_n), and similarly T_b = U D_b V^T where D_b = diag(b^T w_1, b^T w_2, ..., b^T w_n). Moreover we can write T_a(T_b)^{−1} = U D_a D_b^{−1} U^{−1} and (T_b)^{−1}(T_a) = V D_b^{−1} D_a V^{−1}. So we conclude U and V diagonalize T_a(T_b)^{−1} and (T_b)^{−1}T_a respectively. Note that almost surely the diagonal entries of D_a D_b^{−1} are distinct (Claim A.4). Hence the eigendecompositions of T_a(T_b)^{−1} and (T_b)^{−1}(T_a) are unique, and we can pair up columns in U and columns in V based on their eigenvalues (we pair up u and v if their eigenvalues are equal). We can then solve a linear system to find the remaining factors (columns in W), and since this is a valid decomposition, we can conclude that these are also the true factors of T by appealing to Kruskal's uniqueness theorem [24].

In fact, this algorithm is also stable, as Goyal et al [16] also recently showed. It is intuitive that if U and V are well-conditioned and each pair of columns in W is well-conditioned, then this algorithm can tolerate some inverse polynomial amount of noise. For completeness, we give a robustness analysis of Decompose in Appendix A.

Condition 2.2.

1. The condition numbers κ(U), κ(V) ≤ κ,

2. The column vectors of W are not close to parallel: for all i ≠ j, ‖w_i/‖w_i‖ − w_j/‖w_j‖‖_2 ≥ δ,

3. The decompositions are bounded: for all i, ‖u_i‖_2, ‖v_i‖_2, ‖w_i‖_2 ≤ C.

Theorem 2.3. Suppose we are given a tensor T + E ∈ ℝ^{m×n×p} with the entries of E being bounded by ε · poly(1/κ, 1/n, 1/δ), and moreover T has a decomposition T = ∑_{i=1}^{R} u_i ⊗ v_i ⊗ w_i that satisfies Condition 2.2. Then there exists an efficient algorithm that returns each rank one term in the decomposition of T (up to renaming), within an additive error of ε.

As before, the algorithm is to preprocess so as to obtain m = n = R, and then run Decompose. The preprocessing step is slightly different because of the presence of error – instead of considering the span of the u_{j,k} as above, we need to look at the span of the top R singular vectors of the matrix whose columns are the u_{j,k}. If ‖E‖_F is small enough (in terms of κ, δ, n), the span of these top singular vectors suffices to obtain an approximation to the vectors u_i (see Appendix A).

Note that the algorithm is limited by the condition that rank(U) = rank(V) = R, since this requires that R ≤ min(m, n). But as we have seen before, by “flattening” a higher order tensor, we can handle overcomplete tensors. The following is an immediate corollary of Theorem 2.3:

²It is easy to see that the span is contained in the span of the columns of U. To see equality, we observe that if the span were only (R − 1)-dimensional, then projecting each of the u_i onto the span would give a different decomposition, and this contradicts Kruskal's uniqueness theorem, which holds in this case.


Algorithm 1 Decompose, Input: T ∈ ℝ^{R×R×R}

1. Let T_a = T(·, ·, a), T_b = T(·, ·, b) where a, b are uniformly random unit vectors in ℝ^R

2. Set U to be the eigenvectors of T_a(T_b)^{−1}

3. Set V to be the eigenvectors of (T_b)^{−1}T_a

4. Solve the linear system T = ∑_{i=1}^{R} u_i ⊗ v_i ⊗ w_i for the vectors w_i

5. Output U, V, W
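Below is a minimal numpy sketch of Decompose in the exact R × R × R case, following the eigendecomposition argument in the proof of Theorem 2.1 (under our reading, the columns of V are recovered as the left eigenvectors of (T_b)^{−1}T_a, i.e. eigenvectors of its transpose). It makes no attempt at the robust version analyzed in Appendix A.

```python
# Sketch of Decompose (simultaneous diagonalization) for an exact rank-R
# tensor T of size R x R x R.
import numpy as np

def decompose(T, seed=0):
    R = T.shape[0]
    rng = np.random.default_rng(seed)
    a, b = rng.standard_normal(R), rng.standard_normal(R)
    Ta, Tb = T @ a, T @ b                               # slices T(., ., a) and T(., ., b)
    # Columns of U: eigenvectors of Ta Tb^{-1}.  Columns of V: left eigenvectors of
    # Tb^{-1} Ta, i.e. eigenvectors of its transpose.  Pair the columns by matching
    # the (almost surely distinct) eigenvalues.
    evU, U = np.linalg.eig(Ta @ np.linalg.inv(Tb))
    evV, V = np.linalg.eig((np.linalg.inv(Tb) @ Ta).T)
    U = U[:, np.argsort(evU.real)].real
    V = V[:, np.argsort(evV.real)].real
    # Solve the linear system T = sum_i u_i tensor v_i tensor w_i for the w_i.
    KR = np.einsum('ir,jr->ijr', U, V).reshape(R * R, R)    # Khatri-Rao of U and V
    W = np.linalg.lstsq(KR, T.reshape(R * R, R), rcond=None)[0].T
    return U, V, W

rng = np.random.default_rng(5)
R = 6
U0, V0, W0 = (rng.standard_normal((R, R)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', U0, V0, W0)
U, V, W = decompose(T)
print(np.allclose(np.einsum('ir,jr,kr->ijk', U, V, W), T))   # True: rank-one terms recovered
```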

Corollary 2.4. Suppose we are given an order-ℓ tensor T + E ∈ ℝ^{n^ℓ} with the entries of E being bounded by ε · poly_ℓ(1/κ, 1/n, 1/δ), and matrices U^{(1)}, U^{(2)}, ..., U^{(ℓ)} ∈ ℝ^{n×R} whose columns give a rank-R decomposition T = ∑_{i=1}^{R} u^{(1)}_i ⊗ u^{(2)}_i ⊗ ··· ⊗ u^{(ℓ)}_i. If Condition 2.2 is satisfied by

U = U^{(1)} ⊙ U^{(2)} ⊙ ... ⊙ U^{(⌊(ℓ−1)/2⌋)},   V = U^{(⌊(ℓ−1)/2⌋+1)} ⊙ ... ⊙ U^{(2⌊(ℓ−1)/2⌋)},   and
W = U^{(ℓ)} if ℓ is odd, and W = U^{(ℓ−1)} ⊙ U^{(ℓ)} otherwise,

then there exists an efficient algorithm that computes each rank one term in this decomposition up to an additive error of ε.

Note that Corollary 2.4 does not require the decomposition to be symmetric. Further, any tri-partition of the ℓ modes that satisfies Condition 2.2 would have sufficed. To understand how large a rank we can handle, the key question is: when does the Kruskal rank (or rank) of the ℓ-wise Khatri-Rao product become R?

The following lemma is well-known (see [9] for a robust analogue) and is known to be tight in the worst case. This allows us to handle a rank of R ≈ ℓn/2.

Lemma 2.5. Krank(U ⊙ V) ≥ min(Krank(U) + Krank(V) − 1, R).

But for a generic set of vectors U and V, a much stronger statement is true [1]: Krank(U ⊙ V) ≥ min(Krank(U) × Krank(V), R). Hence given a generic order ℓ tensor T with R ≤ n^{⌊(ℓ−1)/2⌋}, “flattening” it to order three and appealing to Theorem 2.1 finds the factors uniquely. The algorithm of [13] follows a similar but more involved approach, and works for R ≤ n^{⌊ℓ/2⌋}.

However in learning applications we are not given T exactly but rather an approximation to it. Our goal is to show that the Kruskal rank typically robustly multiplies, so that these types of tensor algorithms not only work in the exact case, but are also stable when we are given T with some noise. In the next section, we show that in the smoothed analysis model, the robust Kruskal rank multiplies on taking Khatri-Rao products. This then establishes our main result, Theorem 1.5, assuming Theorem 3.3, which we prove in the next section.

Proof of Theorem 1.5: As in Corollary 2.4, let U = Ũ^{(1)} ⊙ ... ⊙ Ũ^{(⌊(ℓ−1)/2⌋)}, V = Ũ^{(⌊(ℓ−1)/2⌋+1)} ⊙ ... ⊙ Ũ^{(ℓ−1)} and W = Ũ^{(ℓ)}. Theorem 3.3 shows that with probability 1 − exp(−n^{1/3^{O(ℓ)}}) over the random ρ-perturbations, κ_R(U), κ_R(V) ≤ (n/ρ)^{3^ℓ}. Further, the columns of W are δ = ρ/n far from parallel with high probability. Hence, Corollary 2.4 implies Theorem 1.5.

3 The Khatri-Rao Product Robustly Multiplies

In the exact case, it is enough to show that the Kruskal rank almost surely multiplies, and this yields algorithms for overcomplete tensor decomposition if we are given T exactly (see Remark 1.8).

But if we want to prove that these algorithms are stable, we need to establish that even the robust Kruskal rank (possibly with a different threshold τ) also multiplies. This ends up being a very natural question in random matrix theory, albeit the Khatri-Rao product of two perturbed vectors in ℝ^n is far from a perturbed vector in ℝ^{n^2}.

Formally, suppose we have two matrices U and V with columns u_1, u_2, ..., u_R and v_1, v_2, ..., v_R in ℝ^n. Let Ũ, Ṽ be ρ-perturbations of U, V, i.e. for each i ∈ [R], we perturb u_i with an (independent) random Gaussian perturbation of norm ρ to obtain ũ_i (and similarly for v_i). Then we show the following:

Theorem 3.1. Suppose U, V are n × R matrices and let Ũ, Ṽ be ρ-perturbations of U, V respectively. Then for any constant δ ∈ (0, 1), R ≤ δn^2 and τ = n^{O(1)}/ρ^2, the Khatri-Rao product satisfies Krank_τ(Ũ ⊙ Ṽ) = R with probability at least 1 − exp(−√n).

Remark 3.2. The natural generalization where the vectors u_i and v_i are in different dimensional spaces also holds. We omit the details here.

In general, a similar result holds for ℓ-wise Khatri-Rao products, which allows us to handle rank as large as δn^{⌊(ℓ−1)/2⌋} for ℓ = O(1). Note that this does not follow by repeatedly applying the above theorem (say applying the theorem to Ũ ⊙ Ṽ and then taking the product with W̃), because perturbing the entries of (U ⊙ V) is not the same as Ũ ⊙ Ṽ. In particular, we have only ℓ·nR “truly” random bits, which are the perturbations of the columns of the base matrices. The overall structure of the proof is the same, but we need additional ideas followed by a delicate induction.

Theorem 3.3. For any δ ∈ (0, 1), let R = δn^ℓ for some constant ℓ ∈ ℕ. Let U^{(1)}, U^{(2)}, ..., U^{(ℓ)} be n × R matrices with unit column norm, and let Ũ^{(1)}, Ũ^{(2)}, ..., Ũ^{(ℓ)} ∈ ℝ^{n×R} be their respective ρ-perturbations. Then for τ = (n/ρ)^{3^ℓ}, the Khatri-Rao product satisfies

Krank_τ(Ũ^{(1)} ⊙ Ũ^{(2)} ⊙ ... ⊙ Ũ^{(ℓ)}) = R   w.p. at least 1 − exp(−δn^{1/3^ℓ})   (2)

Let A denote the n^ℓ × R matrix Ũ^{(1)} ⊙ Ũ^{(2)} ⊙ ... ⊙ Ũ^{(ℓ)} for convenience. The theorem states that the smallest singular value of A is lower bounded by 1/τ.

How can we lower bound the smallest singular value of A? We define a quantity which can be used as a proxy for the least singular value and is simpler to analyze.

Definition 3.4. For any matrix A with columns A_1, A_2, ..., A_R, the leave-one-out distance is

ℓ(A) = min_i dist(A_i, span{A_j}_{j≠i}).

The leave-one-out distance is a good proxy for the least singular value, if we are not particular about losing multiplicative factors that are polynomial in the size of the matrix.

Lemma 3.5. For any matrix A with columns A_1, A_2, ..., A_R, we have ℓ(A)/√R ≤ σ_min(A) ≤ ℓ(A).
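The following numpy sketch (ours) computes the leave-one-out distance directly from Definition 3.4 and checks the sandwich of Lemma 3.5 on a random matrix.

```python
# Sketch: leave-one-out distance (Definition 3.4) and the bound
# ell(A)/sqrt(R) <= sigma_min(A) <= ell(A) of Lemma 3.5.
import numpy as np

def leave_one_out_distance(A):
    R = A.shape[1]
    dists = []
    for i in range(R):
        rest = np.delete(A, i, axis=1)
        coef, *_ = np.linalg.lstsq(rest, A[:, i], rcond=None)   # projection onto span of the rest
        dists.append(np.linalg.norm(A[:, i] - rest @ coef))     # norm of the residual
    return min(dists)

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 8))
ell = leave_one_out_distance(A)
smin = np.linalg.svd(A, compute_uv=False)[-1]
print(ell / np.sqrt(A.shape[1]) <= smin <= ell)   # True
```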

We will show that each of the vectors A_i = ũ^{(1)}_i ⊗ ũ^{(2)}_i ⊗ ··· ⊗ ũ^{(ℓ)}_i has a reasonable projection (at least n^{ℓ/2}/τ) on the space orthogonal to the span of the rest of the vectors span(A_j : j ∈ [R] − {i}), with high probability. We do not have a good handle on the space spanned by the rest of the R − 1 vectors, so we will prove a more general statement in Theorem 3.6: we will prove that a perturbed vector x̃^{(1)} ⊗ ··· ⊗ x̃^{(ℓ)} has a reasonable projection onto any (fixed) subspace V w.h.p., as long as dim(V) is Ω(n^ℓ). To say that a vector w has a reasonable projection onto V, we just need to exhibit a set of vectors in V such that one of them has a large inner product with w. This will imply the required bound on the singular value of A as follows:

1. Fix an i ∈ [R] and apply Theorem 3.6 with x^{(t)} = ũ^{(t)}_i for all t ∈ [ℓ], and V being the space orthogonal to the rest of the vectors A_j.

2. Apply a union bound over all the R choices for i.

We now state the main technical theorem about projections of perturbed product vectors onto arbitrary subspaces of large dimension.

Theorem 3.6. For any constant δ ∈ (0, 1), given any subspace V of dimension δ·n^ℓ in ℝ^{n^ℓ}, there exist tensors T_1, T_2, ..., T_r in V of unit norm (‖·‖_F = 1), such that for random ρ-perturbations x̃^{(1)}, x̃^{(2)}, ..., x̃^{(ℓ)} ∈ ℝ^n of any vectors x^{(1)}, x^{(2)}, ..., x^{(ℓ)} ∈ ℝ^n, we have

Pr[ ∃ j ∈ [r] s.t. ‖T_j(x̃^{(1)}, x̃^{(2)}, ..., x̃^{(ℓ)})‖ ≥ ρ^ℓ (1/n)^{3^ℓ} ] ≥ 1 − exp(−δn^{1/(2ℓ)^ℓ})   (3)

Remark 3.7. Since the squared length of the projection is a degree 2ℓ polynomial of the (Gaussian) variables x̃_i, we can apply standard anti-concentration results (Carbery-Wright, for instance) to conclude that the smallest singular value (in Theorem 3.6) is at least an inverse polynomial, with failure probability at most an inverse polynomial. This approach can only give a singular value lower bound of poly_ℓ(p/n) for a failure probability of p, which is not desirable since the running time depends on the smallest singular value.

Remark 3.8. For meaningful guarantees, we will think of δ as a small constant or n^{−o(1)} (note the dependence of the error probability on δ in eq. (3)). For instance, as we will see in Section 3.4, we cannot hope for exponentially small failure probability when V ⊆ ℝ^{n^2} has dimension n.

The following restatement of Theorem 3.6 gives a sufficient condition on the singular values of a matrix M of size r × n^ℓ which yields a strong anti-concentration property for the values attained by tensor products of perturbed vectors. This alternate view of Theorem 3.6 will be crucial in the inductive proof for higher ℓ-wise products in Section 3.4.

Theorem 3.9 (Restatement of Theorem 3.6). Given any constant δ_ℓ ∈ (0, 1) and any matrix M of size r × n^ℓ such that σ_{δ_ℓ n^ℓ}(M) ≥ η, for random ρ-perturbations x̃^{(1)}, x̃^{(2)}, ..., x̃^{(ℓ)} ∈ ℝ^n of any vectors x^{(1)}, x^{(2)}, ..., x^{(ℓ)} ∈ ℝ^n, we have

Pr[ ‖M(x̃^{(1)}, x̃^{(2)}, ..., x̃^{(ℓ)})‖ ≥ ηρ^ℓ (1/n)^{3^{O(ℓ)}} ] ≥ 1 − exp(−δ_ℓ n^{1/3^ℓ})   (4)

Remark 3.10. Theorem 3.6 follows from the above theorem by choosing an orthonormal basis for V as the rows of M. The other direction follows by choosing V to be the span of the top δ_ℓ n^ℓ right singular vectors of M.

Remark 3.11. Before proceeding, we remark that both forms of Theorem 3.6 could be of independent interest. For instance, it follows from the above (by a small trick involving partitioning the coordinates) that a vector x̃^{⊗ℓ} has a non-negligible projection onto any cn^ℓ dimensional subspace of ℝ^{n^ℓ} with probability 1 − exp(−f_ℓ(n)). For a vector x ∈ ℝ^{n^ℓ} whose entries are all independent Gaussians, such a claim follows easily, with probability roughly 1 − exp(−n^ℓ). The key difference for us is that x̃^{⊗ℓ} has essentially just n bits of randomness, so many of the entries are highly correlated. So the theorem says that even such a correlated perturbation has enough mass in any large enough subspace, with high enough probability. A natural conjecture is that the probability bound can be improved to 1 − exp(−Ω(n)), but it is beyond the reach of our methods.

3.1 Khatri-Rao Product of Two Matrices

We first show Theorem 3.9 for the case ℓ = 2. This illustrates the main ideas underlying the general proof.

Proposition 3.12. Let 0 < δ < 1 and let M be a δn^2 × n^2 matrix with σ_{δn^2}(M) ≥ τ. Then for random ρ-perturbations x̃, ỹ of any two x, y ∈ ℝ^n, we have

Pr[ ‖M(x̃ ⊗ ỹ)‖ ≥ τρ/n^{O(1)} ] ≥ 1 − exp(−√(δn)).   (5)

The high level outline is now the following. Let U denote the span of the top δn^2 singular vectors of M. We show that for r = Ω(√n), there exist n × n matrices M_1, M_2, ..., M_r whose columns satisfy certain orthogonality properties we define, and additionally vec(M_i) ∈ U for all i ∈ [r]. We use the orthogonality properties to show that (x̃ ⊗ ỹ) has a ρ/poly(n) dot-product with at least one of the M_i with probability ≥ 1 − exp(−r).

The θ-orthogonality property. In order to motivate this, let us consider some matrix M_i ∈ ℝ^{n×n} and consider M_i(x ⊗ y). This is precisely y^T M_i x. Now suppose we have r matrices M_1, M_2, ..., M_r, and we consider the sum ∑_i (y^T M_i x)^2. This is also equal to ‖Q(y)x‖^2, where Q(y) is an r × n matrix whose (i, j)th entry is ⟨y, (M_i)_j⟩ (here (M_i)_j refers to the jth column of M_i).

Now consider some matrices M_i, and suppose we knew that Q(y) has Ω(r) singular values of magnitude ≥ 1/n^2. Then a ρ-perturbed vector x̃ has at least ρ/n of its norm in the space spanned by the corresponding right singular vectors, with probability ≥ 1 − exp(−r) (Fact 3.26). Thus we get

Pr[‖Q(y)x̃‖ ≥ ρ/n^3] ≥ 1 − exp(−r).

So the key is to prove that the matrix Q(y) has a large number of “non-negligible” singular values with high probability (over the perturbation in y). For this, let us examine the entries of Q(y). For a moment suppose that y is a Gaussian random vector ∼ N(0, ρ^2 I) (instead of a perturbation). Then the (i, j)th entry of Q(y) is precisely ⟨y, (M_i)_j⟩, which is distributed like a one dimensional Gaussian of variance ρ^2‖(M_i)_j‖^2. If the entries for different i, j were independent, standard results from random matrix theory would imply that Q(y) has many non-negligible singular values.

However, this could be far from the truth. Consider, for instance, two vectors (M_i)_j and (M_{i′})_{j′} that are parallel. Then their dot products with y are highly correlated. However we note that as long as (M_{i′})_{j′} has a reasonable component orthogonal to (M_i)_j, the distributions of the (i, j)th and (i′, j′)th entries are “somewhat” independent. We will prove that we can roughly achieve such a situation. This motivates the following definition.

Definition 3.13 (Ordered θ-orthogonality). A sequence of vectors v_1, v_2, ..., v_n has the ordered θ-orthogonality property if for all 1 ≤ i ≤ n, v_i has a component of length ≥ θ orthogonal to span{v_1, v_2, ..., v_{i−1}}.
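As a small illustration of Definition 3.13 (ours, not from the paper), the check below measures, for each vector in a sequence, the norm of its residual after projecting onto the span of its predecessors.

```python
# Sketch: checking the ordered theta-orthogonality property of a sequence of vectors.
import numpy as np

def has_ordered_theta_orthogonality(vectors, theta):
    basis = np.zeros((vectors[0].shape[0], 0))
    for v in vectors:
        if basis.shape[1] == 0:
            residual = v
        else:
            coef, *_ = np.linalg.lstsq(basis, v, rcond=None)
            residual = v - basis @ coef       # component orthogonal to the predecessors
        if np.linalg.norm(residual) < theta:
            return False
        basis = np.column_stack([basis, v])
    return True

rng = np.random.default_rng(10)
vs = [rng.standard_normal(8) for _ in range(5)]
print(has_ordered_theta_orthogonality(vs, theta=0.1))            # random vectors: typically True
print(has_ordered_theta_orthogonality(vs + [vs[0]], theta=0.1))  # a repeated vector fails
```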

Now we define a similar notion for a sequence of matrices M_1, M_2, ..., M_r, which says that a large enough subset of columns should have a certain θ-orthogonality property. More formally,

Definition 3.14 (Ordered (θ, δ)-orthogonal system). A set of n × m matrices M_1, M_2, ..., M_r forms an ordered (θ, δ)-orthogonal system if there exists a permutation π on [m] such that the first δm columns satisfy the following property: for i ≤ δm and every j ∈ [r], the π(i)th column of M_j has a projection of length ≥ θ orthogonal to the span of all the vectors given by the columns π(1), π(2), ..., π(i − 1), π(i) of all the matrices M_1, M_2, ..., M_r other than itself (i.e. the π(i)th column of M_j).

The following lemma shows the use of an ordered (θ, δ)-orthogonal system: a matrix Q(y) constructed as above starting with these M_i has many non-negligible singular values with high probability.

Lemma 3.15 (Ordered θ-orthogonality and perturbed combinations). Let M_1, M_2, ..., M_r be a set of n × m matrices of bounded norm (‖·‖_F ≤ 1) that are (θ, δ) orthogonal for some parameters θ, δ, and suppose r ≤ δm. Let x̃ be a ρ-perturbation of x ∈ ℝ^n. Then the r × m matrix Q(x̃), whose jth row is x̃^T M_j, satisfies

Pr_{x̃}[ σ_{r/2}(Q(x̃)) ≥ ρθ/n^4 ] ≥ 1 − exp(−r).

We defer the proof of this lemma to Section 3.3. Our focus will now be on constructing such a (θ, δ) orthogonal system of matrices, given a subspace V of ℝ^{n^2} of dimension Ω(n^2). The following lemma achieves this.

Lemma 3.16. Let V be a δ·nm dimensional subspace of ℝ^{nm}, and suppose r, θ, δ′ satisfy δ′ ≤ δ/2, r·δ′m < δn/2 and θ = 1/(nm^{3/2}). Then there exist r matrices M_1, M_2, ..., M_r of dimension n × m with the following properties:

1. vec(M_i) ∈ V for all i ∈ [r].

2. M_1, M_2, ..., M_r form an ordered (θ, δ′) orthogonal system.

In particular, when m ≤ √n, they form an ordered (θ, δ/2) orthogonal system.

We remark that while δ is often a constant in our applications, δ′ does not have to be. We will use this in the proof that follows, in which we use the above two lemmas regarding the construction and use of an ordered (θ, δ)-orthogonal system to prove Proposition 3.12.

Proof of Proposition 3.12: The proof follows by combining Lemma 3.16 and Lemma 3.15 in a fairly straightforward way. Let U be the span of the top δn^2 singular vectors of M. Thus U is a δn^2 dimensional subspace of ℝ^{n^2}. The proof has three steps:

1. We use Lemma 3.16 with m = n, δ′ = δ/n^{1/2}, θ = 1/n^{5/2} to obtain r = n^{1/2}/2 matrices M_1, M_2, ..., M_r ∈ ℝ^{n×n} having the (θ, δ′)-orthogonality property.

2. Now, applying Lemma 3.15, we have that the matrix Q(x̃), defined as before (given by linear combinations along x̃), has σ_{r/2}(Q(x̃)) ≥ ρθ/n^4 w.p. 1 − exp(−√n).

3. Applying Fact 3.26 along with a simple averaging argument, we have that for one of the terms M_i, |M_i(x̃ ⊗ ỹ)| ≥ ρθ/n^6 with probability ≥ 1 − exp(−r/2), as required.

Please refer to Appendix B.2 for the complete details.

The proof for higher order tensors will proceed along similar lines. However we require an additional pre-processing step and a careful inductive statement (Theorem 3.25), whose proof invokes Lemmas 3.16 and 3.15. The issues and details with higher order products are covered in Section 3.4. The following two sections are devoted to proving the two lemmas, i.e. Lemma 3.16 and Lemma 3.15. These will be key to the general case (ℓ > 2) as well.

3.2 Constructing the (θ, δ)-Orthogonal System (Proof of Lemma 3.16)

Recollect that V is a subspace of ℝ^{n·m} of dimension δnm in Lemma 3.16. We will also treat a vector M ∈ V as a matrix of size n × m, with its co-ordinates indexed by [n] × [m].

We want to construct many matrices M_1, M_2, ..., M_r ∈ ℝ^{n×m} such that a reasonable fraction of the m columns satisfy the θ-orthogonality property. Intuitively, such columns would have Ω(n) independent directions in ℝ^n, as choices for the r matrices M_1, M_2, ..., M_r. Hence, we need to identify columns i ∈ [m] such that the projection of V onto these n co-ordinates (in column i) spans a large dimension, in a robust sense. This notion is formalized by defining the robust dimension of column projections, as follows.

Definition 3.17 (Robust dimension of projections). For a subspace V of ℝ^{n·m}, we define its robust dimension dim^τ_i(V) to be

dim^τ_i(V) = max{ d : ∃ orthonormal v_1, v_2, ..., v_d ∈ ℝ^n and M_1, M_2, ..., M_d ∈ V with ∀t ∈ [d], ‖M_t‖ ≤ τ and v_t = M_t(i) }.

This definition ensures that we do not take into account those spurious directions in ℝ^n that are covered to an insignificant extent by projecting (unit) vectors in V to the ith column. Now, we would like to use the large dimension of V (dim = δnm) to conclude that there are many column projections having robust dimension around δn.

Lemma 3.18. For any subspace V of ℝ^{p_1·p_2} and any τ ≥ √p_2, we have

∑_{i∈[p_2]} dim^τ_i(V) ≥ dim(V).   (6)

Remark 3.19. This lemma will also be used in the first step of the proof of Theorem 3.6 to identify a good block of co-ordinates which span a large projection of a given subspace V.

The above lemma is easy to prove if the dimension of the column projections used is the usual dimension of a vector space. However, with robust dimension, to carefully avoid spurious or insignificant directions, we identify the robust dimension with the number of large singular values of a certain matrix.

Proof: Let d = dim(V). Let B be a (p_1 p_2) × d matrix, with the d columns comprising an orthonormal basis for V. Clearly σ_d(B) = 1. Now, we split the matrix B into p_2 blocks of size p_1 × d each. For i ∈ [p_2], let B_i ∈ ℝ^{p_1×d} be the projection of B onto the rows given by [p_1] × {i}. Let d_i = max t such that σ_t(B_i) ≥ 1/√p_2.

We will first show that ∑_i d_i ≥ d. Then we will show that dim^τ_i(V) ≥ d_i to complete our proof.

Suppose for contradiction that ∑_{i∈[p_2]} d_i < d. Let S_i be the (d − d_i)-dimensional subspace of ℝ^d spanned by the last (d − d_i) right singular vectors of B_i. Hence,

for unit vectors α ∈ S_i ⊆ ℝ^d,   ‖B_i α‖ < 1/√p_2.

Since d − ∑_{i∈[p_2]} d_i > 0, there exists at least one unit vector α ∈ ⋂_i S_i. Picking this unit vector α ∈ ℝ^d, we have ‖Bα‖^2 = ∑_{i∈[p_2]} ‖B_i α‖^2 < p_2 · (1/p_2) = 1, which contradicts σ_d(B) ≥ 1.

To establish the second part, consider the d_i top left singular vectors of the matrix B_i (in ℝ^{p_1}). These d_i vectors can be expressed as small combinations (‖·‖_2 ≤ √p_2) of the columns of B_i using Lemma B.1. The corresponding d_i small combinations of the columns of the whole matrix B give vectors in ℝ^{p_1 p_2} which have length at most √p_2, as required (since the columns of B are orthonormal).

We will construct the matrices M_1, M_2, ..., M_r ∈ ℝ^{n×m} in multiple stages. In each stage, we will focus on one column i ∈ [m]: we fix this column for all the matrices M_1, M_2, ..., M_r, so that this column satisfies the ordered θ-orthogonality property w.r.t. previously chosen columns, and then leave this column unchanged in the rest of the stages.

In each stage t of this construction we will be looking at subspaces of V which are obtained by zeroing out all the columns in J ⊆ [m] (i.e. all the co-ordinates [n] × J) that we have fixed so far.

Definition 3.20 (Subspace Projections). For J ⊆ [m], let V*_J ⊆ ℝ^{n·(m−|J|)} represent the subspace obtained by projecting onto the co-ordinates [n] × ([m] − J) the subspace of V having zeros on all the co-ordinates [n] × J:

V*_J = { M′ ∈ ℝ^{n·(m−|J|)} : ∃ M ∈ V s.t. columns M(i) = M′(i) for i ∈ [m] − J, and 0 otherwise }.

The extension Ext*_J(M′) for M′ ∈ V*_J is the vector M ∈ V obtained by padding M′ with zeros in the coordinates [n] × J (columns given by J).

The following lemma shows that their dimension remains large as long as |J | is not too large:

Lemma 3.21. For any J ⊆ [m] and any subspace V of ℝ^{n·m} of dimension δ·nm, the subspace having zeros in the co-ordinates [n] × J has dim(V*_J) ≥ n(δm − |J|).

Proof of Lemma 3.21: Consider a constraint matrix C of size (1 − δ)nm × nm which describes V. V*_J is described by the constraint matrix of size (1 − δ)nm × n(m − |J|) obtained by removing the columns of C corresponding to [n] × J. Hence we get a subspace of dimension at least n(m − |J|) − (1 − δ)nm.

We now describe the construction more formally.

The Iterative Construction of ordered θ-orthogonal matrices.

Initially set J0 = ∅ and Mj = 0 for all j ∈ [r], τ =√m and s = δm/2.

For t = 1 . . . s,

1. Pick i ∈ [m] − J_{t−1} such that dim^τ_i(V*_{J_{t−1}}) ≥ δn/2. If no such i exists, report FAIL.

2. Choose Z_1, Z_2, ..., Z_r ∈ V*_{J_{t−1}} of length at most √(mn) such that the ith columns Z_1(i), Z_2(i), ..., Z_r(i) ∈ ℝ^n are orthonormal, and also orthogonal to the columns {M_j(i′)}_{i′∈J_{t−1}, j∈[r]}. If this is not possible, report FAIL.

3. Set, for all j ∈ [r], the new M_j ← M_j + Ext*_{J_{t−1}}(Z_j), where Ext*_{J_{t−1}}(Z_j) is the matrix padded with zeros in the columns corresponding to J_{t−1}. Set J_t ← J_{t−1} ∪ {i}.

Let J = J_s for convenience. We first show that the above process for constructing M_1, M_2, ..., M_r completes successfully without reporting FAIL.

Claim 3.22. For r, s such that s ≤ δm/2 and r · s ≤ δn/3, the above process does not FAIL.


Proof: In each stage, we add one column index to J. Hence, |J_t| ≤ s at all times t ∈ [s]. We first show that Step 1 of each iteration does not FAIL. From Lemma 3.21, we have dim(V*_{J_t}) ≥ δnm/2. Let W = V*_{J_t}. Now, applying Lemma 3.18 to W, we see that there exists i ∈ [m] − J_t such that dim^τ_i(W) ≥ δn/2, as required. Hence, Step 1 does not fail.

dim^τ_i(W) ≥ δn/2 shows that there exist Z′_1, Z′_2, ..., Z′_{δn/2} with lengths at most √m such that their ith columns {Z′_t(i)}_{t ≤ δn/2} are orthonormal. However, we additionally need the ith columns to be orthogonal to the columns {M_j(i′)}_{j∈[r], i′∈J_{t−1}}. Fortunately, the number of such orthogonality constraints is at most r|J_{t−1}| ≤ δn/3. Hence, we can pick the r < δn/6 orthonormal ith columns {Z_j(i)}_{j∈[r]} and their respective extensions Z_j by taking linear combinations of the Z′_t. Since the linear combinations result again in unit vectors in the ith column, the length of each Z_j is at most √(mn), as required. Hence, Step 2 does not FAIL as well.

Completing the proof of Lemma 3.16. We now show that, since the process completes, M_1, M_2, ..., M_r have the required ordered (θ, δ′)-orthogonality property for δ′ = s/m. We first check that M_1, M_2, ..., M_r belong to V. This is true because in each stage Ext*_J(Z_j) ∈ V, and hence M_j ∈ V for j ∈ [r]. Further, since we run for s stages, and each of the Z_j is bounded in length by √(mn), we have ‖M_j‖_F ≤ s√(mn) ≤ √(nm^3). Our final matrices M_j will be scaled to ‖·‖_F = 1. The s columns that satisfy the ordered θ-orthogonality property are those of J, in the order they were chosen (we set this order to be π, and select an arbitrary order for the rest).

Suppose the column i_t ∈ [m] was chosen at stage t. The key invariant of the process is that once a column i_t is chosen at stage t, the i_t-th column remains unchanged for each M_j in all subsequent stages (t + 1 onwards). By the construction, Z_j(i_t) ∈ ℝ^n is orthogonal to {M_j(i)}_{i∈J_{t−1}}. Since Z_j(i_t) has unit length and M_j is of bounded length, we have the ordered θ-orthogonality property as required, for θ = 1/√(nm^3). This concludes the proof.

3.3 (θ, δ)-Orthogonality and ρ-Perturbed Combinations (Proof of Lemma 3.15)

Let M_1, M_2, ..., M_r be a (θ, δ)-orthogonal set of matrices (of dimensions n × m). Without loss of generality, suppose that the permutation π in the definition of orthogonality is the identity, and let I be the first δm columns.

Now let us consider a ρ-perturbed vector x̃, and consider the matrix Q(x̃) defined in the statement – it has dimensions r × m, and its (i, j)th entry is ⟨x̃, (M_i)_j⟩, which is distributed as a translated Gaussian. Now, for any column i ∈ I, every entry of the ith column of Q(x̃) has a (ρ·θ) ‘component’ independent of the entries in the previous columns, and of the entries above it. This implies (by anti-concentration and θ-orthogonality) that for a unit Gaussian vector g we have

Pr[(g^T Q(x̃)_i)^2 < θ^2/4n] < 1/2n.   (7)

Furthermore, the above inequality holds, even conditioned on the first (i− 1) columns of Q(x).

Lemma 3.23. Let Q(x̃) be defined as above, and fix some i ∈ I. Then for g ∼ N(0, 1)^r, we have

Pr[ (g^T Q(x̃)_i)^2 < θ^2ρ^2/(4n^2) | Q(x̃)_1, ..., Q(x̃)_{i−1} ] < 1/2n,

for any given Q(x̃)_1, Q(x̃)_2, ..., Q(x̃)_{i−1}.

Proof: Let g = (g_1, g_2, ..., g_r). Then we have

g^T Q(x̃)_i = g_1(x̃^T(M_1)_i) + g_2(x̃^T(M_2)_i) + ··· + g_r(x̃^T(M_r)_i)
            = ⟨x̃, g_1(M_1)_i + g_2(M_2)_i + ··· + g_r(M_r)_i⟩

Let us denote the latter vector by v_i for now, so we are interested in ⟨x̃, v_i⟩. We show that v_i has a non-negligible component orthogonal to the span of v_1, v_2, ..., v_{i−1}. Let Π be the matrix which projects orthogonal to the span of (M_s)_{i′} for all i′ < i. Thus any vector Πu is also orthogonal to the span of v_{i′} for i′ < i.

Now by hypothesis, every vector Π(M_s)_i has length ≥ θ. Thus the vector Π(∑_s g_s(M_s)_i) = Πv_i has length ≥ θ/2 with probability ≥ 1 − exp(−r) (Lemma B.2).

Thus if we consider the distribution of ⟨x̃, v_i⟩ = ⟨x, v_i⟩ + ⟨e, v_i⟩, it is a one-dimensional Gaussian with mean ⟨x, v_i⟩ and variance ρ^2. From basic anti-concentration properties of a Gaussian (the mass in any ρ · (variance)^{1/2} interval is at most ρ), the conclusion follows.

We can now do this for all i ∈ I, and conclude that the probability that the event in Eq. (7) occurs for every i ∈ I is at most 1/(2n)^{|I|}.

Now what does this imply about the singular values of Q(x̃)? Suppose it has fewer than r/2 (which is < |I|) non-negligible singular values; then a Gaussian random vector g, with probability at least n^{−r}, has a negligible component along all the corresponding singular vectors, and thus the length of g^T Q(x̃) is negligible with at least this probability!

Lemma 3.24. Let M be a t × t matrix with spectral norm ≤ 1. Suppose M has at most r singular values of magnitude > τ. Then for g ∼ N(0, 1)^t, we have

Pr[ ‖Mg‖_2^2 < 4tτ^2 + t/n^{2c} ] ≥ 1/n^{cr} − 1/2^t.

Proof: Let u_1, u_2, ..., u_r be the singular vectors corresponding to singular values > τ. Consider the event that g has a projection of length < 1/n^c onto each of u_1, u_2, ..., u_r. This has probability ≥ 1/n^{cr}, by anti-concentration properties of the Gaussian (and because N(0, 1)^t is rotationally invariant). For any such g, we have

‖Mg‖_2^2 ≤ ∑_{i=1}^{r} ⟨g, u_i⟩^2 + τ^2‖g‖_2^2 ≤ r/n^{2c} + τ^2‖g‖_2^2.

This contradicts the earlier anti-concentration bound, and so we conclude that the matrix has at least r/2 non-negligible singular values, as required.

3.4 Higher Order Products

We have a subspace V ⊆ ℝ^{n^ℓ} of dimension δn^ℓ. The proof for higher order products proceeds by induction on the order ℓ of the product. Recall from Remark 3.8 that Proposition 3.12 and Theorem 3.3 do not get good guarantees for small values of δ, like 1/n. In fact, we cannot hope to get such exponentially small failure probability in that case, since all the n degrees of freedom in V may be constrained to the first n co-ordinates of ℝ^{n^2} (all the independence is in just one mode). Here, it is easy to see that the best we can hope for is an inverse-polynomial failure probability. Hence, to get exponentially small failure probability, we will always need V to have a large dimension compared to the dimension of the host space in our inductive statements.

To carry out the induction, we will try to reduce this to a statement about order (ℓ − 1) products, by taking linear combinations (given by x̃^{(1)} ∈ ℝ^n) along one of the modes. Loosely speaking, Lemma 3.15 serves this function of “order reduction”; however, it needs a set of r matrices in ℝ^{n×m} (flattened along all the other modes) which are ordered (θ, δ) orthogonal.

Let us consider the case when ℓ = 3, to illustrate some of the issues that arise. We can use Lemma 3.16 to come up with r matrices in ℝ^{n×n^2} that are ordered (θ, δ) orthogonal. These columns intuitively correspond to independent directions or degrees of freedom, that we can hope to get substantial projections on. However, since these are vectors in ℝ^n, the number of “flattened columns” cannot be comparable to n^2 (in fact, δm ≪ n) – hence, our induction hypothesis for ℓ = 2 will give no guarantees (due to Remark 3.8).

To handle this issue, we will first restrict our attention to a smaller block of co-ordinates of size n_1 × n_2 × n_3 (with n_1 n_2 n_3 ≪ n^3) that has reasonable size in all three modes (n_1, n_2, n_3 = n^{Ω(1)}). Additionally, we want V's projection onto this n_1 × n_2 × n_3 block to span a large subspace of (robust) dimension at least δn_1 n_2 n_3 (using Lemma 3.18).

Moreover, choosing the main inductive statement also needs to be done carefully. We need some property for choosing enough candidate “independent” directions T_1, T_2, ..., T_r ∈ ℝ^{n^ℓ} (projected on the chosen block), such that our process of “order reduction” (by first finding a θ-orthogonal system and then combining along x̃^{(1)}) maintains this property for order ℓ − 1. This is where the alternate interpretation in Theorem 3.9 in terms of singular values helps: it suggests the exact property that we need! We ensure that the flattened vectors vec(T_1), vec(T_2), ..., vec(T_r) (projected onto the n_1 × n_2 × n_3 block), stacked as rows, form a matrix with many large singular values.

We now state the main inductive claim. The claim assumes a block of co-ordinates of reasonable size in each mode that spans many directions in V, and then establishes the anti-concentration bound inductively.

Theorem 3.25 (Main Inductive Claim). Let T_1, T_2, ..., T_r ∈ ℝ^{n^ℓ} be r tensors with bounded norm (‖·‖_F ≤ 1) and let I_1, I_2, ..., I_ℓ ⊆ [n] be sets of indices of sizes n_1, n_2, ..., n_ℓ. Let T be the r × n^ℓ matrix with rows vec(T_1), vec(T_2), ..., vec(T_r). Suppose

• ∀j ∈ [r], P_j is T_j restricted to the block I_1 × ··· × I_ℓ, and the matrix P ∈ ℝ^{r×(n_1·n_2···n_ℓ)} has jth row vec(P_j),

• r ≥ δ_ℓ n_1 n_2 ··· n_ℓ and ∀t ∈ [ℓ − 1], n_t ≥ (n_{t+1} n_{t+2} ··· n_ℓ)^2,

• σ_r(P) ≥ η.

Then for random ρ-perturbations x̃^{(1)}, x̃^{(2)}, ..., x̃^{(ℓ)} of any x^{(1)}, x^{(2)}, ..., x^{(ℓ)} ∈ ℝ^n, we have

Pr_{x̃^{(1)},...,x̃^{(ℓ)}}[ ‖T(x̃^{(1)} ⊗ ··· ⊗ x̃^{(ℓ)})‖ ≥ ρ^ℓ (1/n_1)^{3^ℓ} ] ≥ 1 − exp(−δ_ℓ n_ℓ)

Before we give a proof of the main inductive claim, we first present a standard fact that relates the singular values of a matrix to anti-concentration properties of randomly perturbed vectors. This will also establish the base case of our main inductive claim.

Fact 3.26. Let M be a matrix of size m × n with σr(M) ≥ η. Then for any unit vector u ∈ Rnand an random ρ-perturbation x of it, we have

‖Mx‖2 ≥ ηρ/n2 w.p 1− n−Ω(r)

Proof of Theorem 3.25: The proof proceeds by induction. The base case (` = 1) is handled byFact 3.26. Let us assume the theorem for (`− 1)-wise products. The inductive proof will have twomain steps:

16

Page 18: Smoothed Analysis of Tensor Decompositions

1. Suppose we flatten the tensors Pjj∈[r] along all but the first mode, and imagine them

as matrices of size n1 × (n2n3 . . . n`). We can use Lemma 3.16 to construct ordered (θ, δ′)orthogonal system w.r.t vectors in Rn1 (columns correspond to [m] = [n2 . . . n`]).

2. When we take combinations along x(1) as T(x(1), ·, ·, . . . , ·

), these tensors will now satisfy

the condition required for (` − 1)-order products in the inductive hypothesis, because ofLemma 3.15.

Unrolling this induction allows us to take combinations along x(1), x(2), . . . as required, until we areleft with the base case. For notational convenience, let y = x(1), δ` = δ, r` = r and N = n1n2 . . . n`.

To carry out the first step, we think of Pjj∈[r] as matrices of size n1 × (n2n3 . . . n`). We then

apply Lemma 3.16 with n = n1, m = Nn1

= n2n3 . . . n` ≤√n1 ; hence there exists r′` = n2 . . . n`

matrices Qqq∈[r′`]with ‖·‖F ≤ 1 which are ordered (θ, δ′`)-orthogonal for δ′` = δ`/3. Further, since

Qq are in the row-span of P , there exists matrix of coefficients α = (α(q, j))q∈[r′`],j∈[r`]such that

∀q ∈ [r′`], Qq =∑j∈[r`]

α(q, j)Pj (8)

‖α(q)‖22 =∑j∈[r`]

α(q, j)2 ≤ 1/η (since σr(P ) ≥ η and ‖Qq‖F ≤ 1) (9)

Further, Qq is the projection of∑

j∈[r`]αq,jTj onto co-ordinates I1 × I2 · · · × I`. Suppose we define

a new set of matricesWqq∈[r′`]in Rn×( N

n1)

by flattening the following into a matrix with n rows:

Wq =

∑j∈[r]

αq,jTj

[n]×(I2×···×I`)

.

In other words, Qq is obtained by projecting Wq on to the n1 rows given by I1. Note that Wqq∈[r′`]

is also ordered (θ′`, δ′`) orthogonal for θ′` = θη.

To carry out the second part, we apply Lemma 3.15 with Wq and infer that the r′` × (N/n1)matrix W (y) with qth row being yTWq has σr`−1

(W (y)) ≥ η′` = θ2ρ2/n41 with probability 1 −

exp(−Ω(r′`)), where r`−1 = r′`/2.We will like to apply the inductive hypothesis for (`− 1) with P being W (y); however W (y) doesnot have full (robust) row rank. Hence we will consider the top r`−1 right singular vectors of W (y)to construct an r`−1 tensors of order `, whose projections to the block I2 × · · · × I`, lead to awell-conditioned r`−1 × (n2n3 . . . n`) matrix for which our inductive hypothesis holds.

Let the top r`−1 right singular vectors of W (y) be Z1, Z2, . . . Zr`−1. Hence, from Lemma B.1,

we have a coefficient β of size r`−1 × r` such that

∀j′ ∈ [r`−1] Zj′ =∑q∈[r′`]

βj′,qWq (y) and ‖β(j′)‖2 ≤ 1/η′`.

Now let us try to represent these new vectors in terms of the original row-vectors of P , to constructthe required tensor of order (`− 1) . Consider the r`−1 × r` matrix Λ = βα. Clearly,

rownorm(Λ) ≤ rownorm(β) · ‖α‖F ≤√r′` · rownorm(β) · rownorm(α) ≤

r′`η`η′`

.

17

Page 19: Smoothed Analysis of Tensor Decompositions

Define ∀j′ ∈ [r`−1], an order ` tensor T ′j′ =∑

j∈[r] λj′,jTj ; from the previous equation, ‖T ′j′‖F ≤r′`/(η`η

′`) . We need to get a normalized order (`− 1) tensor: so, we consider Tj′ = T ′j′/‖T ′j′(y)‖F ,

and T be the r`−1 × (n`) matrix with j′th row being Tj′ . Hence,

σr`−1

(T (y, ·, ·, . . . , ·)

)≥

η3`

r′`n31

.

We also have r`−1 ≥ 12 · n2n3 . . . n`. By the inductive hypothesis

‖T(y, x(2), . . . , x(`)

)‖ ≥ η′ ≡ ρ`−1

(η3`

n41n2

)3`−1

w.p 1− exp (−Ω(n`)) (10)

Hence, for one of the j′ ∈ [r`−1],∣∣∣Tj′ (x(1), x(2), . . . x(`)

)∣∣∣ ≥ η′/√r`−1. Finally, since Tj′ is given

by a small combination of the Tjj∈[r], we have from Cauchy-Schwartz

‖T(x(1), x(1), . . . , x(1)

)‖ ≥ η′ ·

η3√r2`n

41

.

The main required theorem now follows by just showing the exists of the n1 × n2 × · · · × n`block that satisfies the theorem conditions. This follows from Lemma 3.18.

Proof of Theorem 3.6: First we set n1, n2, n` by the recurrence ∀t ∈ [`], nt = 2(nt+1·nt+2 . . . n`)2

and n1 = O(n). It is easy to see that this is possible for n` = n1/3` . Now, we partition the set ofco-ordinates [n]` into blocks of size n1×n2× . . . n`. Let p1 = n1 ·n2 . . . n` and p2 = n`/p1. ApplyingLemma 3.18 we see that there exists indices I1, I2, . . . I` of sizes n1, n2, . . . , n` respectively such thatprojectionW = V|I1×I2×···×I` on this block of co-ordinates has dimension dimτ

I (W) ≥ n1n2 . . . n`/4.Let r = n1n2 . . . n`. Now we construct P ′ with the rows of P ′ being an orthonormal basis for W,and let T ′ be the corresponding vectors in V. Note that ∀j ∈ [r], ‖T ′j‖ ≤ n`. Let P be there-scaling of the matrix so that for the jth row(j ∈ [r]), Pj = P ′j/‖T ′j‖ and Tj = T ′j/‖T ′j‖. Hence

σr(P ) ≥ 1/n`. Applying Theorem 3.25 with this choice of P, T , we get the required result.

4 Learning Multi-view Mixture Models

We now see how Theorem 1.5 immediately gives efficient learning algorithms for broad class ofdiscrete mixture models called multi-view models in the over-complete setting. In a multi-viewmixture model, for each sample we are given a few different observations or views x(1), x(2), . . . , x(`)

that are conditionally independent given which component i ∈ [R] the sample is from. Typically,the R components in the mixture are discrete distributions. Multi-view models are very expressive,and capture many well-studied models like Topic Models [2], Hidden Markov Models (HMMs)[29, 1, 2], and random graph mixtures [1]. They are also sometimes referred to as finite mixturesof finite measure products[1] or mixture-learning with multiple snapshots [30].

In this section, we will assume that each of the components in the mixture is a discrete distri-bution with support of size n. We first introduce some notation, along the lines of [2].

18

Page 20: Smoothed Analysis of Tensor Decompositions

Parameters and the model: Let the `-view mixture model be parameterized by a set of ` vec-tors in Rn for each mixture component,

µi

(1), µi(2), . . . , µi

(`)i∈[R]

, and mixing weights wii∈[R] ,

that add up to 1. Each of these parameter vectors are normalized : in this work, we will assumethat ‖µi(j)‖1 = 1 for all i ∈ [R], j ∈ [`]. Finally, for notational convenience we think of the param-eters are represented by n×R matrices (one per view) M (1),M (2), . . . ,M (`), with M (j) formed byconcatenating the vectors µi

(j) (1 ≤ i ≤ R).

Samples from the multi-view model with ` views are generated as follows:

1. The mixture component i (i ∈ [R]) is first picked with probability wi

2. The views x(1), . . . , x(j), . . . , x(`) are indicator vectors in n-dimensions, that are drawn accord-ing to the distribution µi

(1), . . . , µi(j), . . . , µi

(`).

The state-of-the-art algorithms for learning multi-view mixture models have guarantees thatmirror those for mixtures of gaussians. In the worst case, the best known algorithms for thisproblem are from a recent work Rabani et al [30], who give an algorithm that has complexityRO(R2) + poly(n,R). In fact they also show a sample complexity lower-bound of exp(Ω(R)) forlearning multi-view models in one dimension (n = 1). Polynomial time algorithms were given byAnandkumar et al. [2] in a restricted setting called the non-singular or non-degenerate setting.When each of these matrices

M (j)

j∈[`]

to have rank R in a robust sense i.e. σR(M (j)) ≥ 1/τ for

all j ∈ [`], their algorithm runs in just poly(R,n, τ, 1/ε)) time to learn the parameters up to errorε. However, their algorithm fails even when R = n+ 1.

However, in many practical settings like speech recognition and image classification, the dimen-sion of the feature space is typically much smaller than the number of components or clusters i.e.n R. To the best of our knowledge, there was no efficient algorithm for learning multi-viewmixture models in such over-complete settings. We now show how Theorem 1.5 gives a polynomialtime algorithm to learn multi-view mixture models in a smoothed sense, even in the over-completesetting R n.

Theorem 4.1. Let (wi, µi(1), . . . , µi

(`)) be a mixture of R = O(n`/2−1) multi-view models with` views, and suppose the means

(µi

(j))i∈[R],j∈[`]

are perturbed independently by gaussian noise of

magnitude ρ. Then there is a polynomial time algorithm to learn the weights wi, the perturbed

parameter vectorsµ

(j)i

j∈[`],i∈[R]

up to an accuracy ε when given samples from this distribution.

The running time and sample complexity is poly`(n, 1/ρ, 1/ε).

The conditional independence property is very useful in obtaining a higher order tensor, interms of the hidden parameter vectors that we need to recover. This allows us to use our resultson tensor decompositions from previous sections.

Lemma 4.2 ([1]). In the notation established above for multi-view models, ∀` ∈ N the `th momenttensor

Mom` = E[x(1) ⊗ . . . x(j) ⊗ . . . x(`)

]=∑r∈[R]

wrµ(1)r ⊗ µ(2)

r · · · ⊗ µ(j)r ⊗ · · · ⊗ µ(`)

r . (11)

Our algorithm to learn multi-view models consists of three steps:

1. Obtain a good empirical estimate T of the order ` tensor Mom` from N = poly`(n,R, 1/ρ, 1/ε)samples (given by Lemma C.3)

T =1

N

∑t=1

xt(1) ⊗ xt(2) ⊗ · · · ⊗ xt(`).

19

Page 21: Smoothed Analysis of Tensor Decompositions

2. Apply Theorem 1.5 to T and recover the parameters µ(j)i upto scaling.

3. Normalize the parameter vectors µi(j) to having `1 norm of 1, and hence figure out the weights

wi for i ∈ [R].

Proof of Theorem 4.1: The proof follows from a direct application of Theorem 1.5. Hence,we just sketch the details. We first obtain a good empirical estimate of Mom` that is given inequation (11) using Lemma C.3. Applying Theorem 1.5 to T , we recover each rank-1 term in thedecomposition wiµi

(1)⊗µi(2)⊗· · ·⊗µi(`) up to error ε in frobenius norm (‖·‖F ). However, we knowthat each of the parameter vectors are of unit `1 norm. Hence, by scaling all the parameter vectorsto unit `1 norm, we obtain all the parameters up to the required accuracy.

5 Learning Mixtures of Axis-Aligned Gaussians

Let F be a mixture of k = poly(n) axis-aligned Gaussians in n dimensions, and suppose furtherthat the means of the components are perturbed by Gaussian noise of magnitude ρ. We restrict toGaussian noise not because our results change, but for notational convenience.

Parameters: The mixture is described by a set of k mixing weights wi, means µi and covariancematrices Σi. Since the mixture is axis-aligned, each covariance Σi is diagonal and we will denotethe jth diagonal of Σi as σ2

ij . Our main result in this section is the following:

Theorem 5.1. Let (wi, µi,Σi) be a mixture of k = nb`−12c/(2`) axis-aligned Gaussians and suppose

µii∈[k] are the ρ-perturbations of µii∈[k] (that have polynomially bounded length). Then thereis a polynomial time algorithm to learn the parameters (wi, µi,Σi)i∈[k] up to an accuracy ε whengiven samples from this mixture. The running time and sample complexity is poly`(

nρε).

Next we outline the main steps in our learning algorithm:

1. We first pick an appropriate `, and estimate M` :=∑

iwiµ⊗`i .3

2. We run our decomposition algorithm for overcomplete tensors on M` to recover µi, wi.

3. We then set up a system of linear equations and solve for σ2ij .

We defer a precise description of the second and third steps to the next subsections (in particular,we need to describe how we obtain M` from the moments of F and we need to describe the linearsystem that we will use to solve for σ2

ij).

5.1 Step 2: Recovering the Means and Mixing Weights

Our first goal in this subsection is to construct the tensorM` defined above from random samples.In fact, if we are given many samples we can estimate a related tensor (and our error will be aninverse polynomial in the number of samples we take). Unlike the multi-view mixture model, wedo not have ` independent views in this case. Let us consider the tensor E[x⊗`]:

E[x⊗`] =∑i

wi(µi + ηi)⊗`.

3We do not estimate the entire tensor, but only a relevant “block”, as we will see.

20

Page 22: Smoothed Analysis of Tensor Decompositions

Here we have used ηi to denote a Gaussian random variable whose mean is zero and whose covarianceis Σi. Now the first term in the expansion is the one we are interested in, so it would be nice if wecould “zero out” the other terms. Our observation here is that if we restrict to ` distinct indices(j1, j2, . . . , j`), then this coordinate will only have contribution from the means. To see this, notethat the term of interest is ∑

i

[wi∏t=1

(µi(jt) + ηi(jt))]

Since the Gaussians are axis aligned, the ηi(jt) terms are independent for different t, and each is arandom variable of zero expectation. Thus the term in the summation is precisely

∑iwi

∏`t=1 µi(jt).

Our idea to estimate the means is now the following: we partition the indices [n] into ` roughlyequal parts S1, S2, . . . , S`, and estimate a tensor of dimension |S1| × |S2| × · · · × |S`|.

Definition 5.2 (Co-ordinate partitions). Let S1, S2, . . . , S` be a partition of [n] into ` pieces of

equal size (roughly). Let µ(t)i denotes the vector µi restricted to the coordinates St, and for a

sample x, let x(t) denote its restriction to the coordinates St.

Now, we can estimate the order ` tensor E[x(1) ⊗ x(2) · · · ⊗ x(`)] to any inverse polynomialaccuracy using polynomial samples (see Lemma C.3 or [20] for details), where

E[x(1) ⊗ x(2) · · · ⊗ x(`)] =∑i

wi(µ

(1)i ⊗ µ

(2)i ⊗ · · · ⊗ µ

(`)i

).

Now applying the main tensor decomposition theorem (Theorem 1.5) to this order ` tensor, we

obtain a set of vectors ν(1)i , ν

(2)i , . . . , ν

(t)i such that

ν(t)i = citµ

(t)i , and for all t, ci1ci2 · · · ci` = 1/wi.

Now we show how to recover the means µi and weights wi.

Claim 5.3. The algorithm recovers the perturbed means µii∈[R] and weights wi up to any accuracyε in time poly`(n, 1/ε)

So far, we have portions of the mean vectors, each scaled differently (upto some ε/poly`(n)accuracy. We need to estimate the scalars ci1, ci2, . . . , ci` up to a scaling (we need another trick tothen find wi). To do this, the idea is to take a different partition of the indices S′1, S

′2, . . . , S

′`, and

‘match’ the coordinates to find the µi. In general, this is tricky since some portions of the vectormay be zero, but this is another place where the perturbation in µi turns out to be very useful(alternately, we can also apply a random basis change, and a more careful analysis to doing this’match’).

Claim 5.4. Let µ be any d dimensional vector. Then a coordinate-wise σ-perturbation of µ haslength ≥ dσ2/10 w.p. ≥ 1− exp(−d).

The proof is by a basic anti-concentration along with the observation that coordinates areindependently perturbed and hence the failure probability multiplies.

Let us now define the partition S′t. Suppose we divide S1 and S2 into two roughly equalparts each, and call the parts A1, B1 and A2, B2 (respectively). Now consider a partition withS′1 = A1 ∪ A2 and S′2 = B1 ∪ B2, and S′t = St for t > 2. Consider the solution ν ′i we obtainusing the decomposition algorithm, and look at the vectors ν1, ν2, ν

′1, ν′2. For the sake of exposition,

suppose we did not have any error in computing the decomposition. We can scale ν ′1 such that

21

Page 23: Smoothed Analysis of Tensor Decompositions

the sub-vector corresponding to A1 is precisely equal to that in ν1. Now look at the remainingsub-vector of ν1, and suppose it is γ times the “A2 portion” of ν2. Then we must have γ = c2/c1.

To see this formally, let us fix some i and write v11 and v12 to denote the sub-vectors of µ(1)i

restricted to coordinates in A1 and B1 respectively. Write v21 and v22 to represent sub-vectors of

µ(2)i restricted to A2 and B2 respectively. Then ν1 is c1v11⊕c1v12 (where ⊕ denotes concatenation).

So also ν2 is c2v21 ⊕ c2v22. Now we scaled ν ′1 such that the A1 portion agrees with ν1, thus wemade ν ′1 equal to c1v11 ⊕ c1v21. Thus by the way γ is defined, we have c1γ = c2, which is what weclaimed.

We can now compute the entire vector µi up to scaling, since we know c1/c2, c1/c3, and so on.Thus it remains to find the mixture weights wi. Note that these are all non-negative. Now fromthe decomposition, note that for each i, we can find the quantity

C` := wi‖µi‖`.

The trick now is to note that by repeating the entire process above with ` replaced by ` + 1, theconditions of the decomposition theorem still hold, and hence we compute

C`+1 := wi‖µi‖`+1.

Thus taking the ratio C`+1/C` we obtain ‖µi‖. This can be done for each i, and thus using C`, we

obtain wi. This completes the analysis assuming we can obtain µ(t)i without any error. Please see

lemma C.4 for details on how to recover the weights wi in the presence of errors. This establishesthe above claim about recovering the means and weights.

5.2 Step 3: Recovering the Variances

Now that we know the values of wi and all the means µi, we show how to recover the variances.This can be done in many ways, and we will outline one which ends up solving a linear system ofequations. Recall that for each Gaussian, the covariance matrix is diagonal (denoted Σi, with jthentry equal to σ2

ij).

Let us show how to recover σ2i1 for 1 ≤ i ≤ R. The same procedure can be applied to the other

dimensions to recover σ2ij for all j. Let us divide the set of indices 2, 3, . . . , n into ` (nearly equal)

sets S1, S2, . . . , S`. Now consider the expression

N1 = E[x(1)2(x|S1⊗ x|S2

⊗ · · · ⊗ x|S`)].

This can be evaluated as before. Write µ(t)i to denote the portion of µi restricted to St, and similarly

η(t)i to denote the portion of the noise vector ηi. This gives

N1 =∑i

wi(µi(1)2 + σ2i1)(µ

(1)i ⊗ µ

(2)i ⊗ · · · ⊗ µ

(`)i ).

Now recall that we know the vectors µi and hence each of the tensors µ(1)i ⊗µ

(2)i ⊗· · ·⊗µ

(`)i . Further,

since our µi are the perturbed means, our theorem (Theorem 3.3) about the condition number of

Khatri-Rao products implies that the matrix (call it M) whose columns are the flattened∏t µ

(t)i

for different i, is well conditioned, i.e., has σR(·) ≥ 1/poly`(n/ρ). This implies that a system oflinear equations Mz = z′ can be solved to recover z up to a 1/poly`(n/ρ) accuracy (assuming weknow z′ up to a similar accuracy).

Now using this with z′ being the flattened N1 allows us to recover the values of wi(µi(1) + σ2i1)

for 1 ≤ i ≤ R. From this, since we know the values of wi and µi(1) for each i, we can recoverthe values σ2

i1 for all i. As mentioned before, we can repeat this process for other dimensions andrecover σ2

ij for all i, j.

22

Page 24: Smoothed Analysis of Tensor Decompositions

6 Acknowledgements

We thank Ryan O’Donnell for suggesting that we extend our techniques for learning mixtures ofspherical Gaussians to the more general problem of learning axis-aligned Gaussians.

23

Page 25: Smoothed Analysis of Tensor Decompositions

References

[1] E. Allman, C. Matias and J. Rhodes. Identifiability of Parameters in Latent Structure Modelswith many Observed Variables. Annals of Statistics, pages 3099–3132, 2009. 1.1, 1.2, 1.8, 2,4, 4.2

[2] A. Anandkumar, D. Hsu and S. Kakade. A method of moments for mixture models and hiddenMarkov models. In COLT 2012. 1.2, 4, 4

[3] A. Anandkumar, R. Ge, D. Hsu and S. Kakade. A Tensor Spectral Approach to LearningMixed Membership Community Models. In COLT 2013. 1.1, 1.1

[4] A. Anandkumar, R. Ge, D. Hsu, S. Kakade and M. Telgarsky. Tensor Decompositions forLearning Latent Variable Models. arxiv:1210.7559, 2012.

[5] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, Y. Liu. A Spectral Algorithm for LatentDirichlet Allocation. In NIPS, pages 926–934, 2012. 1.1, 1.1

[6] S. Arora, R. Ge, A. Moitra and S. Sachdeva. Provable ICA with Unknown Gaussian Noise,and Implications for Gaussian Mixtures and Autoencoders. In NIPS, pages 2384–2392, 2012.

[7] M. Belkin, L. Rademacher and J. Voss. Bling Signal Separation in the Presence of GaussianNoise. In COLT 2013.

[8] M. Belkin and K. Sinha. Polynomial Learning of Distribution Families. In FOCS, pages103–112, 2010.

[9] A. Bhaskara, M. Charikar and A. Vijayaraghavan. Uniqueness of Tensor Decompositions withApplications to Polynomial Identifiability. arxiv:1304.8087, 2013. 1.1, 1.2, 2, A

[10] J. Chang. Full Reconstruction of Markov Models on Evolutionary Trees: Identifiability andConsistency. Mathematical Biosciences, pages 51–73, 1996.

[11] P. Comon. Independent Component Analysis: A New Concept? Signal Processing, pages287–314, 1994. 1.1, 1.1, 2, 2.1

[12] S. Dasgupta. Learning Mixtures of Gaussians. In FOCS, pages 634–644, 1999.

[13] L. De Lathauwer, J Castaing and J. Cardoso. Fourth-order Cumulant-based Blind Identifica-tion of Underdetermined Mixtures. IEEE Trans. on Signal Processing, 55(6):2965–2973, 2007.1.8, 2

[14] J. Feldman, R. A. Servedio, and R. O’Donnell. PAC Learning Axis-aligned Mixtures of Gaus-sians with No Separation Assumption. In COLT, pages 20–34, 2006. 1.2

[15] A. Frieze, M. Jerrum, R. Kannan. Learning Linear Transformations. In FOCS, pages 359–368,1996.

[16] N. Goyal, S. Vempala and Y. Xiao. Fourier PCA. arxiv:1306.5825, 2013. 1.1, 1.2, 2, A

[17] J. Hastad. Tensor Rank is NP -Complete. Journal of Algorithms, pages 644–654, 1990. 1.1

[18] C. Hillar and L-H. Lim. Most Tensor Problems are NP -Hard. arxiv:0911.1393v4, 2013. 1.1

24

Page 26: Smoothed Analysis of Tensor Decompositions

[19] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1990.

[20] D. Hsu and S. Kakade. Learning Mixtures of Spherical Gaussians: Moment Methods andSpectral Decompositions. In ITCS, pages 11–20, 2013. 1.1, 1.1, 1.2, 5.1

[21] A. Hyvarinen, J. Karhunen and E. Oja. Independent Component Analysis. Wiley Interscience,2001.

[22] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently Learning Mixtures of Two Gaussians. InSTOC, pages 553-562, 2010.

[23] A. T. Kalai, A. Samorodnitsky and S-H Teng. Learning and Smoothed Analysis. In FOCS,pages 395–404, 2009.

[24] J. Kruskal. Three-way Arrays: Rank and Uniqueness of Trilinear Decompositions. LinearAlgebra and Applications, 18:95–138, 1977. 1.1, 2

[25] S. Leurgans, R. Ross and R. Abel. A Decomposition for Three-way Arrays. SIAM Journal onMatrix Analysis and Applications, 14(4):1064–1083, 1993. 1.1, 1.2, 1.8, 2, 2.1

[26] B. Lindsay. Mixture Models: Theory, Geometry and Applications. Institute for MathematicalStatistics, 1995.

[27] P. McCullagh. Tensor Methods in Statistics. Chapman and Hall/CRC, 1987. 1.1

[28] A. Moitra and G. Valiant. Setting the Polynomial Learnability of Mixtures of Gaussians. InFOCS, pages 93–102, 2010.

[29] E. Mossel and S. Roch. Learning Nonsingular Phylogenies and Hidden Markov Models. InSTOC, pages 366–375, 2005. 1.1, 1.1, 4

[30] Y. Rabani, L. Schulman and C. Swamy. Learning mixtures of arbitrary distributions over largediscrete domains. . In ITCS 2014. 4, 4

[31] C. Spearman. General Intelligence. American Journal of Psychology, pages 201–293, 1904. 1.1

[32] D.A. Spielman, S.H. Teng. Smoothed Analysis of Algorithms: Why the Simplex Algorithmusually takes Polynomial Time. Journal of the ACM, pages 385–463, 2004. 1.2

[33] D.A. Spielman, S.H. Teng. Smoothed Analysis: An Attempt to Explain the Behavior ofAlgorithms in Practice. Communications of the ACM, pages 76-84, 2009. 1.2

[34] A. Stegeman and P. Comon. Subtracting a Best Rank-1 Approximation may Increase TensorRank. Linear Algebra and Its Applications, pages 1276–1300, 2010. 1.1

[35] H. Teicher. Identifiability of Mixtures. Annals of Mathematical Statistics, pages 244–248,1961.

[36] S. Vempala, Y. Xiao. Structure from Local Optima: Learning Subspace Juntas via HigherOrder PCA. Arxiv:abs/1108.3329, 2011.

[37] P. Wedin. Perturbation Bounds in Connection with Singular Value Decompositions. BIT,12:99–111, 1972.

25

Page 27: Smoothed Analysis of Tensor Decompositions

A Stability of the recovery algorithm

In this section we prove Theorem 2.3, which shows that the algorithm from section 2 is actuallyrobust to errors, under Condition 2.2. This consists of two parts: first proving that the preprocessingstep indeed allows us to recover ui (approximately), and second, that Decompose is robust to noise.

Stability of the preprocessing

Suppose we are given T + E, where T =∑

i ui ⊗ vi ⊗ wi, and E is a tensor each of whose entries

is < ε · poly(1/κ, 1/n, 1/δ). Let uj,k be vectors in <m defined as before, and let U be the m × npmatrix whose columns are uj,k (for different j, k). Let u′i be the projection of ui onto the span of

the top R singular vectors of U . By Claim 4.4 in [9], we have ‖T −∑

i u′i ⊗ vi ⊗ wi‖F < 2‖E‖F ,

and thus from the robust version of Kruskal’s uniqueness theorem [9], we must have that u′i and uiare ε · poly(1/κ, 1/n, 1/δ) close. Repeating the above along the second mode allows us to move toan R×R× p tensor.

Stability of Decompose

Next, we establish that Decompose is stable (in what follows, we have m = n = R). Intuitively,Decompose is stable provided that the matrices U and V are well-conditioned and the eigenvaluesof the matrices that we need to diagonalize are separated.

The main step in Decompose is an eigendecomposition, so first we will establish perturbationbounds. The standard perturbation bounds are known as sin θ theorems following Davis-Kahan andWedin. However these bounds hold most generally for the singular value decomposition of an arbi-trary (not necessarily symmetric) matrix. We require perturbation bounds for eigen-decompositionsof general matrices. There are known bounds due to Eisenstat and Ipsen, however the notion ofseparation required there is difficult to work with and for our purposes it is easier to prove a directbound in our setting.

Suppose M = UDU−1 and M = M(I + E) + F and M and M are n × n matrices. In order

to relate the eigendecompositions of M and M respectively, we will first need to establish that theeigenvalues of M are all distinct. We thank Santosh Vempala for pointing out an error in an earlierversion. We incorrectly used the Bauer-Fike Theorem to show that M is diagonalizable, but thistheorem only shows that each eigenvalue of M is close to some eigenvalue of M , but does not showthat there is a one-to-one mapping. Fortunately there is a fix for this that works under the sameconditions (but again see [16] for an earlier, alternative proof that uses a “homotopy argument”).

Definition A.1. Let sep(D) = mini 6=j |Di,i −Dj,j |.

Our first goal is to prove that M is diagonalizable, and we will do this by establishing that itseigenvalues are distinct if the error matrices E and F are not too large. Consider

U−1(M(I + E) + F )U = D +R

where R = U−1(ME + F )U . We can bound each entry in R by κ(U)(‖ME‖2 + ‖F‖2). Henceif E and F are not too large, the eigenvalues of D + R are close to the eigenvalues of D usingGershgorin’s disk theorem, and the eigenvalues of D+R are the same as the eigenvalues of M sincethese matrices are similar. So we conclude:

Lemma A.2. If κ(U)(‖ME‖2 + ‖F‖2) < sep(D)/(2n) then the eigenvalues of M are distinct andit is diagonalizable.

26

Page 28: Smoothed Analysis of Tensor Decompositions

Next we prove that the eigenvectors of M are also close to those of M (this step will rely on

M being diagonalizable). This technique is standard in numerical analysis, but it will be more

convenient for us to work with relative perturbations (i.e. M = M(I + E) + F ) so we include theproof of such a bound for completeness

Consider a right eigenvector ui of M with eigenvalue λi. We will assume that the conditions ofthe above corollary are met, so that there is a unique eigenvector ui of M with eigenvalue λi whichit is paired with. Then since the eigenvectors uii of M are full rank, we can write ui =

∑j cjuj .

Then

Mui = λiui∑j

cjλjuj + (ME + F )ui = λiui∑j

cj(λj − λi)uj = −(ME + F )ui

Now we can left multiply by the jth row of U−1; call this vector wTj . Since U−1U = I, we have

that wTj ui = 1i=j . Hence

cj(λj − λi) = −wTj (ME + F )ui

So we conclude:

‖ui − ui‖22 = 2dist(ui, span(ui))2 ≤ 2

∑j 6=i

((wTj (ME + F )ui)

|λj − λi|

)2≤ 8

∑j 6=i

‖U−1(ME + F )ui‖22sep(D)2

where we have used the condition that κ(U)(‖ME‖2 + ‖F‖2) < sep(D)/2 to lower bound the

denominator. Furthermore: ‖U−1MEui‖2 = ‖DU−1Eui‖2 ≤ σmax(E)λmax(D)σmin(U) since ui is a unit

vector.

Theorem A.3. If κ(U)(‖ME‖2 + ‖F‖2) < sep(D)/2, then

‖ui − ui‖2 ≤ 3σmax(E)λmax(D) + σmax(F )

σmin(U)sep(D)

Now we are ready to analyze the stability of Decompose: Let T =∑n

i=1 ui ⊗ vi ⊗ wi be ann× n× p tensor that satisfies Condition 2.2. In our settings of interest we are not given T exactlybut rather a good approximation to it, and here let us model this noise as an additive error E thatis itself an n× n× p tensor.

Claim A.4. With high probability, sep(DaD−1b ), sep(DbD

−1a ) ≥ δ√

p .

Proof: Fix some i, j. The (i, i)th entry of DaD−1b is precisely 〈wi,a〉

〈wi,b〉 . Note that Pr[|〈wi, b〉| > n‖wi‖]is exp(−n), thus the denominators are all at least 1/(Cn) in magnitude with probability 1−exp(−n).

Now given b for which this happens, we have 〈wi,a〉〈wi,b〉−

〈wj ,a〉〈wj ,b〉 = ci〈wi, a〉−cj〈wj , a〉 where ci, cj have

magnitude > 1/(Cn). Because wi has at least a δ component orthogonal to wj , anti-concentrationof Gaussians implies that the difference above is at least δ/C2n6 with probability at least 1− 1/n4.Thus we can take a union bound over all pairs.

We will make crucial use of the following matrix identity:

27

Page 29: Smoothed Analysis of Tensor Decompositions

(A+ Z)−1 = A−1 −A−1Z(I +A−1Z)−1A−1

Let Na = Ta + Ea and Nb = Tb + Eb. Then using the above identity we have:

Na(Nb)−1 = Ta(Tb)

−1(I + F ) +G

where F = −Eb(I + (Tb)−1Eb)

−1(Tb)−1 and G = Ea(Tb)

−1

Claim A.5. σmax(F ) ≤ σmax(Eb)σmin(Tb)−σmax(Eb) and σmax(G) ≤ σmax(Ea)

σmin(Tb)

Proof: Using Weyl’s Inequality we have

σmax(F ) ≤ σmax(Eb)

1− σmax(Eb)σmin(Tb)

× 1

σmin(Tb)=

σmax(Eb)

σmin(Tb)− σmax(Eb)

as desired. The second bound is obvious.

We can now use Theorem A.3 to bound the error in recovering the factors U and V by settinge.g. M = Ta(Tb)

−1. Additionally, the following claim establishes that the linear system used tosolve for W is well-conditioned and hence we can also bound the error in recovering W .

Claim A.6. κ(U V ) ≤ min(σmax(U),σmax(V ))max(σmin(U),σmin(V )) ≤ min(κ(U), κ(V ))

These bounds establish what we qualitatively asserted: Decompose is stable provided that the ma-trices U and V are well-conditioned and the eigenvalues of the matrices that we need to diagonalizeare separated.

B K-rank of the Khatri-Rao product

B.1 Leave-One-Out Distance

Recall: we defined the leave-one-out distance in Section 3. Here we establish that is indeed equiva-lent to the smallest singular value, up to polynomial factors. In our main proof, this quantity willbe much easer to work with since it allows us to translate questions about a set of vectors beingwell-conditioned to reasoning about projection of each vector onto the orthogonal complement ofthe others.

Proof of Lemma 3.5: Using the variational characterization for singular values: σmin(A) =minu,‖u‖2=1 ‖Au‖2. Then let i = argmax|ui|. Clearly |ui| ≥ 1/

√m since ‖u‖2 = 1. Then ‖Ai +∑

j 6=iAjujui‖2 = σmin(A)

ui. Hence

`(A) ≤ dist(Ai, spanAjj 6=i) ≤σmin(A)

ui≤ σmin(A)

√m

Conversely, let i = argminidist(Ai, spanAjj 6=i). Then there are coefficients (with ui = 1) suchthat

‖Aiui +∑j 6=i

Ajuj‖2 = `(A).

Clearly ‖u‖2 ≥ 1 since ui = 1. And we conclude that

`(A) = ‖Aiui +∑j 6=i

Ajuj‖2 ≥‖Aiui +

∑j 6=iAjuj‖2

‖u‖2≥ σmin(A).

28

Page 30: Smoothed Analysis of Tensor Decompositions

B.2 Proof of Proposition 3.12

We now give the complete details of the proof of Proposition 3.12, that shows how the Kruskalrank multiplies in the smoothed setting for two-wise products. The proof follows by just combiningLemma 3.16 and Lemma 3.15.

Let U be the span of the top δn2 singular values of M . Thus U is a δn2 dimensional subspaceof Rn2

. Using Lemma 3.16 with:

r =n1/2

2, m = n, δ′ =

δ

n1/2,

we obtain n × n matrices M1,M2, . . . ,Mr having the (θ, δ′)-orthogonality property. Note that in

this setting, δ′m = n1/2

2 .Thus by applying Lemma 3.15, we have that the matrix Q(x), defined as before, satisfies

Prx

[σr/2 (Q(x)) ≥ ρθ

n4

]≥ 1− exp(−r). (12)

Now let us consider ∑s

(yTMsx)2 = ‖yTQ(x)‖2.

Since Q(x) has many non-negligible singular values (Eq.(12)), we have (by Fact 3.26 for details)that an ρ-perturbed vector has a non-negligible norm when multiplied by Q. More precisely,Pr[‖yTQ(x)‖ ≥ ρθ/n4] ≥ 1−exp(−r/2). Thus for one of the terms Ms, we have |Ms(x⊗y)| ≥ ρθ/n5

with probability ≥ 1− exp(−r/2).Now this almost completes the proof, but recall that our aim is to argue about M(x⊗ y), where

M is the given matrix. vec(Ms) is a vector in the span of the top δn2 (right) singular vectors of M ,and σδn2 ≥ τ , thus we can write Ms as a combination of the rows of M , with each weight in thecombination being ≤ n/τ (Lemma B.1). This implies that for at least one row M (j) of the matrixM , we must have

‖M (j)(x⊗ y‖ ≥ θρτ

n6=

ρτ

nO(1).

(Otherwise we have a contradiction). This completes the proof.

Before we give the complete proofs of the two main lemmas regarding ordered (θ, δ) orthogonalsystems (Lemma 3.16 and Lemma 3.15), we start with a simple lemma about top singular vectorsof matrices, which is very useful to obtain linear combinations of small length.

Lemma B.1 (Expressing top singular vectors as small combinations of columns). Suppose we havea m×n matrix M with σt(M) ≥ η, and let v1, v2, . . . vt ∈ Rm be the top t left-singular vectors of M .Then these top t singular vector can be expressed using small linear combinations of the columnsM(i)i∈[n] i.e.

∀k ∈ [t], ∃ αk,ii∈[n] such that vk =∑i∈[n]

αk,iM(i)

and∑i

α2k,i ≤ 1/η2

29

Page 31: Smoothed Analysis of Tensor Decompositions

Proof: Let ` correspond to the number of non-zero singular values of M . Using the SVD, thereexists matrices V ∈ Rm×`, U ∈ Rn×` with orthonormal columns (both unitary matrices), anda diagonal matrix Σ ∈ R`×` such that M = V ΣUT . Since the n × ` matrix V = M(UΣ−1),the t columns of V corresponding to the top t singular values (σt(M) ≥ η) correspond to linearcombinations which are small i.e. ∀k ∈ [t], ‖αk‖ ≤ 1/η.

B.3 Constructing the (θ, δ)-Orthogonal System (Proof of Lemma 3.16)

Let V be a subspace of Rn·m, with its co-ordinates indexed by [n] × [m]. Further,remember thatthe vectors in Rn·m are also treated as matrices of size n×m.

We now give the complete proof of lemma 3.18 that shows that the average robust dimensionof column projections is large if the dimension of V is large .

Proof of Lemma 3.18: Let d = dim(V). Let B be a p1p2×d matrix composed of a orthonormalbasis (of d vectors) for V i.e. the jth column of B is the jth basis vector (j ∈ [d]) of V. Clearlyσd(B) = 1.For i ∈ [p2], let Bi be the p1 × d matrix obtained by projecting the columns of B on justthe rows given by [p1] × i. Hence, B is obtained by just concatenating the columns as BT =[B1

T ‖B2T ‖ . . . ‖BpT

]. Finally, let di = max t such that σt(Bi) ≥ 1√

p2.

We will first show that∑

i di ≥ d. Then we will show that dimτi (V) ≥ di to complete our proof.

Suppose for contradiction that∑

i∈[p2] di < d. Let Si be the (d − d1)-dimensional subspace of Rdspanned by the last (d− d1) right singular vectors of Bi. Hence,

for unit vectors α ∈ Si ⊆ Rd, ‖Biα‖ <1√p2.

Since, d−∑

i∈[p2] di > 0, there exists at least one unit vector α ∈⋂i S⊥i . Picking this unit vector

α ∈ Rd, we have ‖Bα‖22 =∑

i∈[p2]‖Biα‖22 < p2 · ( 1√p2

)2 < 1. This contradicts σd(B) ≥ 1

To establish the second part, consider some Bi (i ∈ [p2]). We pick di orthonormal vectors ∈ Rp1corresponding to the top di left-singular vectors of Bi. By using Lemma B.1, we know that eachof these j ∈ [di] vectors can be expressed as a small combination ~αj of the columns of Bi s.t.‖ ~αj‖ ≤

√p2. Further, if we associate with each of these j ∈ [di] vectors, the vector wj ∈ R(p1p2)

given by the same combination ~αj of the columns of B, we see that ‖wj‖ ≤√p2 since the columns

of the matrix B are orthonormal.

B.4 Implications of Ordered (θ, δ)-Orthogonality: Details of Proof of Lemma 3.15

Here we show some auxiliary lemmas that are used in the Proof of Lemma B.4.

Claim B.2. Suppose v1, v2, . . . , vm are a set of vectors in <n of length ≤ 1, having the θ-orthogonalproperty. Then we have

(a) For g ∼ N (0, 1)n, we have∑

i〈vi, g〉2 ≥ θ2/2 with probability ≥ 1− exp(−Ω(m)),

(b) For g ∼ N (0, 1)m, we have ‖∑

i givi‖2 ≥ θ2/2 with probability ≥ 1− exp(−Ω(m)).

Furthermore, part (a) holds even if g is drawn from u+g′, for any fixed vector u and g′ ∼ N (0, 1)n.

Proof: First note that we must have m ≤ n, because otherwise v1, v2, . . . , vm cannot have theθ-orthogonal property for θ > 0. For any j ∈ [m], we claim that

Pr[(〈vj , g〉2 < θ2/2) | v1, v2, . . . , vj−1] < 1/2. (13)

30

Page 32: Smoothed Analysis of Tensor Decompositions

To see this, write vj = v′j+v⊥j , where v⊥j is orthogonal to the span of v1, v2, . . . , vj−1. Since j ∈ I,

we have ‖v⊥j ‖ ≥ θ. Now given the vectors v1, v2, . . . , vj−1, the value 〈v′j , g〉 is fixed, but 〈v⊥j , g〉 is

distributed as a Gaussian with variance θ2 (since g is a Gaussian of unit variance in each direction).Thus from a standard anti-concentration property for the one-dimensional Gaussian, 〈vj , g〉

cannot have a mass > 1/2 in any θ2 length interval, in particular, it cannot lie in [−θ2/2, θ2/2] withprobability > 1/2. This proves Eq. (13). Now since this is true for any conditioning v1, v2, . . . , vj−1

and for all j, it follows (see Lemma B.3 for a formal justification) that

Pr[〈vj , g〉2 < θ2/2 for all j] <1

2m< exp(−m/2).

This completes the proof of the claim, part (a). Note that even if we had g replaced by u+g through-out, the anti-concentration property still holds (we have a shifted one-dimensional Gaussian), thusthe proof goes through verbatim.

Let us now prove part (b). First note that if we denote by M the n×m matrix whose columnsare the vi, then part (a) deals with the distribution of gTMMT g, where g ∼ N (0, 1)n. Part (b) dealswith the distribution of gTMTMg, where g ∼ N (0, 1)m. But since the eigenvalues of MMT andMTM are precisely the same, due to the rotational invariance of Gaussians, these two quantitiesare distributed exactly the same way. This completes the proof.

Lemma B.3. Suppose we have random variables X1, X2, . . . , Xr and an event f(·) which is definedto occur if its argument lies in a certain interval (e.g. f(X) occurs iff 0 < X < 1). Further, supposewe have Pr[f(X1)] ≤ p, and Pr[f(Xi)|X1, X2, . . . , Xi−1] ≤ p for all X1, X2, . . . , Xi−1. Then

Pr[f(X1) ∧ f(X2) ∧ · · · ∧ f(Xr)] ≤ pr.

C Applications to Mixture Models

C.1 Sampling Error Estimates for Multi-view Models

In this section, we show error estimates for `-order tensors obtained by looking at the `th momentof the multi-view model.

Lemma C.1 (Error estimates for Multiview mixture model). For every ` ∈ N, suppose we havea multi-view model, with parameters wrr∈[R] and M (j)j∈[`], the n dimensional sample vectors

x(j) have ‖x(j)‖∞ ≤ 1. Then, for every ε > 0, there exists N = O(ε−2√` log n) such that

if N samples x(1)(j)j∈[`], x(2)(j)j∈[`], . . . , x(N)(j)j∈[`] are generated, then with high probability

‖Ex(1) ⊗ x(2) ⊗ . . . x(`) − 1

N

∑t∈[N ]

x(t)(1) ⊗ x(t)(2) ⊗ x(t)(`)

‖∞ < ε (14)

Proof: We first bound the ‖ · ‖∞ norm of the difference of tensors i.e. we show that

∀i1, i2, . . . , i` ∈ [n]`,

∣∣∣∣∣∣E∏j∈[`]

x(j)ij− 1

N

∑t∈[N ]

∏j∈[`]

x(t)(j)ij

∣∣∣∣∣∣ < ε/n`/2.

Consider a fixed entry (i1, i2, . . . , i`) of the tensor.

31

Page 33: Smoothed Analysis of Tensor Decompositions

Each sample t ∈ [N ] corresponds to an independent random variable with a bound of 1. Hence,we have a sum of N bounded random variables. By Bernstein bounds, probability for (14) to not

occur exp

(−(εn−`/2)

2N2

2N

)= exp

(−ε2N/

(2n`))

. We have n` events to union bound over. Hence

N = O(ε−2n`√` log n) suffices. Note that similar bounds hold when the x(j) ∈ Rn are generated

from a multivariate gaussian.

C.2 Error Analysis for Multi-view Models

Lemma C.2. Suppose ‖u⊗ v − u′ ⊗ v′‖F < δ, and Lmin ≤ ‖u‖, ‖v‖, ‖u′‖, ‖v′‖ ≤ Lmax,

with δ <minL2

min,1(2 maxLmax,1) . If u = α1u

′+β1u⊥ and v = α2v′+β2v⊥, where u⊥ and v⊥ are unit vectors

orthogonal to u′, v′ respectively, then we have

|1− α1α2| < δ/L2min and β1 <

√δ, β2 <

√δ.

Proof: We are given that u = α1u′ + β1u⊥ and v = α2v

′ + β2v⊥. Now, since the tensored vectorsare close

‖u⊗ v − u′ ⊗ v′‖2F < δ2

‖(1− α1α2)u′ ⊗ v′ + β1α2u⊥ ⊗ v′ + β2α1u′ ⊗ v⊥ + β1β2u⊥ ⊗ v⊥‖2F < δ2

L4min(1− α1α2)2 + β2

1α22L

2min + β2

2α21L

2min + β2

1β22 < δ2 (15)

This implies that |1− α1α2| < δ/L2min as required.

Now, let us assume β1 >√δ. This at once implies that β2 <

√δ. Also

L2min ≤ ‖v‖2 = α2

2‖v′‖2 + β22

L2min − δ ≤ α2

2L2max

Hence, α2 ≥Lmin

2Lmax

Now, using (15), we see that β1 <√δ.

C.3 Sampling Error Estimates for Gaussians

Lemma C.3 (Error estimates for Gaussians). Suppose x is generated from a mixture of R-gaussianswith means µrr∈[R] and covariance Σi that is diagonal , with the means satisfying ‖µr‖ ≤ B. Letσ = maxi σmax(Σi)For every ε > 0, ` ∈ N, there exists N = Ω(poly(1

ε )), σ2, n,R) such that if x(1), x(2), . . . , x(N) ∈ Rnwere the N samples, then

∀i1, i2, . . . , i` ∈ [n]`,

∣∣∣∣∣∣E∏j∈[`]

xij −1

N

∑t∈[N ]

∏j∈[`]

x(t)ij

∣∣∣∣∣∣ < ε. (16)

In other words,

‖Ex⊗` − 1

N

( ∑t∈[N ]

(x(t))⊗`)‖∞ < ε

32

Page 34: Smoothed Analysis of Tensor Decompositions

Proof: Fix an element (i1, i2, . . . , i`) of the `-order tensor. Each point t ∈ [N ] corresponds

to an i.i.d random variable Zt = x(t)i1x

(t)i2. . . x

(t)` . We are interested in the deviation of the sum

S = 1N

∑t∈[N ] Z

t. Each of the i.i.d rvs has value Z = xi1xi2 . . . x`. Since the gaussians are axis-

aligned and each mean is bounded by B, |Z| < (B + tσ)` with probability O(exp(−t2/2)

). Hence,

by using standard sub-gaussian tail inequalities, we get

Pr |S −E z| > ε < exp

(− ε2N

(M + σ` log n)`

)Hence, to union bound over all n` events N = O

(ε−2(` log nM)`

)suffices.

C.4 Recovering Weights in Gaussian Mixtures

We now show how we can approximate upto a small error the weight wi of a gaussian components

in a mixture of gaussians, when we have good approximations to wiµ⊗`i and wiµ

⊗(`−1)i .

Lemma C.4 (Recovering Weights). For every δ′ > 0, w > 0, Lmin > 0, ` ∈ N, ∃δ = Ω(δ1w1/(`−1)

`2Lmin

)such that, if µ ∈ Rn be a vector with length ‖µ‖ ≥ Lmin, and suppose

‖v − w1/`µ‖ < δ and ‖u− w1/(`−1)µ‖ < δ.

Then, ∣∣∣∣∣(|〈u, v〉|‖u‖

)`(`−1)

− w

∣∣∣∣∣ < δ′ (17)

Proof: From (C.4) and triangle inequality, we see that

‖w−1/`v − w−1/(`−1)u‖ ≤ δ(w−1/(`) + w−1/(`−1)) = δ1.

Let α1 = w−1/(`−1) and α2 = w−1/`. Suppose v = βu+εu⊥ where u⊥ is a unit vector perpendicularto u. Hence β = 〈v, u〉/‖u‖.

‖α1v − α2u‖2 = ‖(βα1 − α2)u+ α1εu⊥‖ < δ21

(βα1 − α2)2‖u‖2 + α21ε

2 ≤ δ21∣∣∣∣β − α2

α1

∣∣∣∣ < δ1

Lmin

Now, substituting the values for α1, α2, we see that∣∣∣β − w 1(`−1)

− 1`

∣∣∣ < δ1

Lmin.

∣∣∣β − w1/(`(`−1))∣∣∣ < δ

w1/(`−1)Lmin∣∣∣β`(`−1) − w∣∣∣ ≤ δ′ when δ δ′w1/(`−1)

`2Lmin

33