FACTOR ANALYSIS FOR HIGH-DIMENSIONAL DATA A …owen/students/JingshuWangThesis.pdf · his...

FACTOR ANALYSIS FOR HIGH-DIMENSIONAL DATA

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF STATISTICS

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Jingshu Wang

July 2016

c© Copyright by Jingshu Wang 2016

All Rights Reserved

ii

I certify that I have read this dissertation and that, in my opinion, it is fully adequate

in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Art B. Owen) Principal Adviser



(Wing Hong Wong)



(Guenther Walther)

Approved for the Stanford University Committee on Graduate Studies

iii

Preface

This dissertation is an original intellectual product of the author, Jingshu Wang, supervised by Dr.

Art B. Owen.

A version of Chapter 3 has been published in Statistical Science (Art B. Owen and Jingshu

Wang, Volume 31, No. 1(2016), 119-139). I was the lead investigator, responsible for all major

areas of concept formation, data analysis, as well as manuscript composition. Art B. Owen is the

supervisory author on this project and was involved throughout the project in concept formation

and manuscript composition.

The work in Chapter 4 is unpublished and I was the lead investigator, responsible for all major

areas of concept formation, data analysis and mathematical proofs, as well as manuscript composi-

tion. Art B. Owen is the supervisory author on this project and was involved throughout the project

in concept formation and manuscript composition.

The project in Chapter 5 is a joint work with Qingyuan Zhao, Trevor Hastie and Art B. Owen.

Qingyuan Zhao and I have equal contributions to the work, and I am responsible for all major areas

of modeling and mathematical proofs, as well as the majority of manuscript composition. Trevor

Hastie and Art B. Owen are the supervisory authors on this project and were involved throughout

the project in concept formation and manuscript composition.

iv

Acknowledgments

First and for most, I would like to thank my advisor, Art B. Owen. I would like to express my

deepest gratitude for his full support, expert guidance and encouragement throughout my study

and research. He guided me how to conduct independent research and keep learning new things. He

is an impressive person with great passion and curiosity towards statistics and research. Without

his incredible patience and timely wisdom, my thesis work would not have gone so smoothly. In

addition, I express my appreciation to Dr. Guenther Walther and Dr. Wing Hong Wong for having

served my reading committee. Their thoughtful questions and suggestions have inspired me a lot

during the research. I would also like to thank Dr. Chiara Sabatti and Dr. Hua Tang for being my

oral defense committee. I have had very useful conversation on various research topics with Hua

and have gain great knowledge via attending Chiara’s group meeting on a regular basis.

I am very grateful to Dr. Persi Diaconis, Dr. David Siegmund, Dr. Iain Johnstone and Dr.

Emmanuel Candes for helping me with my coursework during my first year at Stanford. And I

would like to thank Dr. Lan Wu and Dr. Yuan Yao for being my advisors during my undergraduate

years at Peking University in China.

Thanks to my fellow graduate students in the Statistics Department at Stanford University.

Special thanks to my numerous friends who helped and accompanied me throughout this academic

exploration.

Finally I would like to thank my parents and my boyfriend for their unconditional love and

support. They have given my great encouragement to go through the uneasy days in writing this

dissertation.

This work was supported by the US National Science Foundation under grant DMS-1521145.

v

Contents

Preface iv

Acknowledgments v

1 Introduction 1

1.1 Forms of Factor Analysis Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Model assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2.1 Random factor score matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2.2 Non-random factor score matrix . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Model Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3.1 Identification for random factor score model . . . . . . . . . . . . . . . . . . . 5

1.3.2 Identification for non-random factor score model . . . . . . . . . . . . . . . . 7

2 Background 8

2.1 The maximum likelihood method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1.1 Random factor scores model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1.2 Non-random factor score model . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1.3 Estimating the factor scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1.4 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Principal component analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.1 Use of PCA in factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.2 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Estimating the number of factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3.1 Classical methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3.2 Methods for large matrices and strong factors . . . . . . . . . . . . . . . . . . 21

2.3.3 Methods for large matrices and weak factors . . . . . . . . . . . . . . . . . . 22

2.4 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

vi

3 Bi-cross-validation for factor analysis 25

3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Estimating X given the rank k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3 Bi-cross-validatory choice of r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.3.1 Bi-cross-validation to estimate r?ESA . . . . . . . . . . . . . . . . . . . . . . . 29

3.3.2 Choosing the size of the holdout Y00 . . . . . . . . . . . . . . . . . . . . . . . 31

3.4 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4.1 Factor categories and test cases . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4.2 Empirical properties of ESA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.4.3 Empirical properties of BCV . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.5 Real data example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4 An optimization-shrinkage hybrid method for factor analysis 48

4.1 A joint convex optimization algorithm POT . . . . . . . . . . . . . . . . . . . . . . . 48

4.1.1 the objective function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.1.2 Connection with singular value soft-thresholding . . . . . . . . . . . . . . . . 49

4.1.3 Connection with square-root lasso . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2 Some heuristics of the method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2.1 The theoretical scale of λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2.2 The bias in using the nuclear penalty . . . . . . . . . . . . . . . . . . . . . . 51

4.3 A hybrid method: POT-S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.4 Wold-style cross-validatory choice of λ . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.5 Computation: an ADMM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.5.1 The ADMM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.5.2 Techniques to reduce computational cost . . . . . . . . . . . . . . . . . . . . . 58

4.6 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.6.1 Compare the oracle performances . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.6.2 Assess the accuracy in finding λ??Opt . . . . . . . . . . . . . . . . . . . . . . . . 61

5 Confounder adjustment with factor analysis 68

5.1 The model and the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.1.1 A statistical model for confounding factors . . . . . . . . . . . . . . . . . . . 69

5.1.2 Model identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.1.3 The two-step algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.2 Statistical inference for β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.1 The negative control scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.2 The sparsity scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3 Extension to multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

vii

5.4 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.4.1 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6 Conclusions 88

A Proof 91

viii

List of Tables

3.1 Six factor strength scenarios considered in our simulations. . . . . . . . . . . . . . . 35

3.2 ESA using six measurements. For each of Var(σ2i ) = 0, 1 and 10, the average for every

measurement is the average over 10 × 6 × 100 = 6000 simulations, and the standard

deviation is the standard deviation of these 6000 simulations. . . . . . . . . . . . . . 36

3.3 Comparison of ESA results for various (N,n) pairs and number of strong factors in

the scenarios with Var[σ2i

]= 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.4 Worst case REE values for each method of choosing k for white noise and two het-

eroscedastic noise settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.5 Comparison of REE and r for rank selection methods with various (N,n) pairs, and

scenarios. For each different scenario, the factors’ strengths are listed as the number of

“strong/useful/harmful/undetectable” factors. For each (N,n) pair, the first column

is the REE and the second column is k. Both values are averages over 100 simulations.

Var[σ2i

]= 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.6 Like Table 3.5, but for larger γ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.1 Assess the oracle error in estimatingX using four measurements. For each of Var[σ2i

]=

0, 1 and 10, the average for every measurement is the average over 10×6×100 = 6000

simulations, and the standard deviation is the standard deviation of these 6000 sim-

ulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.2 Assess the error in estimating Σ when the oracle estimate of X is achieved. For each

of Var[σ2i

]= 0, 1 and 10, the average for every measurement is the average over

10×6×100 = 6000 simulations, and the standard deviation is the standard deviation

of these 6000 simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.3 Four measurements comparing the oracle error in estimating X under various (N,n)

pairs and factor strength scenario with Var(σ2i ) = 1. Type-1 to Type-6 correspond to

the six scenarios in Table 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

ix

4.4 Four measurements comparing the error in estimating Σ when the oracle error of X

is achieved under various (N,n) pairs and factor strength scenario with Var(σ2i ) = 1.

Type-1 to Type-6 correspond to the six scenarios in Table 3.1. . . . . . . . . . . . . 63

4.5 Comparison of REE and the rank of X with various (N,n) pairs and scenarios. For

each scenario, the factors’ strengths are listed as the number of “strong/useful/harmful/undetectable”

factors. For each (N,n) pair, the first column is the REE and the second column the

rank the estimated matrix. Both values are averages over 100 simulations. Var[σ2i

]= 1. 65

4.6 Like Table 4.5, but for larger aspect ratios γ . . . . . . . . . . . . . . . . . . . . . . . 66

4.7 Comparison of REEΣ for various (N,n) pairs and scenarios. For each scenario, the

factors’ strengths are listed as the number of “strong/useful/harmful/undetectable”

factors. The values are averages over 100 simulations. Var[σ2i

]= 1. . . . . . . . . . 67

x

List of Figures

3.1 REE survival plots: the proportion of samples with REE exceeding the number on

the horizontal axis. Figure 3.1a-3.1c are for REE calculating using the method ESA.

Figure 3.1a shows all 6000 samples. Figure 3.1b shows only the 3000 simulations

of larger matrices of each aspect ratio. Figure 3.1c shows only the 3000 simulations

of smaller matrices. For comparison, Figure 3.1d is the REE plot for all samples

calculating REE using the method SVD. . . . . . . . . . . . . . . . . . . . . . . . . 40

3.2 The distribution of r for each factor strength case when the matrix size is 5000× 100.

The y axis is r. Each image depicts 100 simulations with counts plotted in grey scale

(larger equals darker). For different scenarios, the factor strengths are listed as the

number of “strong/useful/harmful/undetectable” factors in the title of each subplot.

The true k is always r = 8. The “Oracle” method corresponds to r∗ESA. . . . . . . . 41

3.3 BCV prediction error for the meteorite. The BCV partitions have been repeated 200

times. The solid red line is the average over all held-out blocks, with the cross marking

the minimum BCV error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.4 Distribution patterns of the estimated factors. The first column has the four factors

found by ESA. The second column has the top five factors found by applying SVD on

the unscaled data. The third column has the top five factors found by applying SVD

on scaled data in which each element has been standardized. The values are plotted

in grey scale, and a darker color indicates a higher value. . . . . . . . . . . . . . . . . 45

3.5 Plots of the first two factors and the location clusters. The three plots of column (a)

are the scatter plots of pixels for the first two factors found by the three methods:

ESA, SVD on the original data and SVD on normalized data. The coloring shows

a k-means clustering result for 5 clusters. Column (b) has the five clustered regions

based on the first two factors of ESA. Column (c) has the five clustered regions based

on the first two factors of SVD on the original data after centering. The same color

represents the same cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

xi

4.1 REE survival plots for estimating X: the proportion of samples with REE exceeding

the number on the horizontal axis. Figure 4.1a shows all 6000 samples. Figure 4.1b

shows only the 3000 simulations of larger matrices of each aspect ratio. Figure 4.1c

shows only the 3000 simulations of smaller matrices. . . . . . . . . . . . . . . . . . . 64

5.1 Compare the performance of nine different approaches (from left to right): naive regression

ignoring the confounders (Naive), IRW-SVA, negative control with finite sample correction

(NC) in eq. (5.17), negative control with asymptotic oracle variance (NC-ASY) in eq. (5.18),

RUV-4, robust regression (LEAPP(RR)), robust regression with calibration (LEAPP(RR-

MAD)), LEAPP, oracle regression which observes the confounders (Oracle). The error bars

are one standard deviation over 100 repeated simulations. The three dashed horizontal lines

from bottom to top are the nominal significance level, FDR level and oracle power, respectively. 86

xii

Chapter 1

Introduction

Factor analysis is a statistical method to explain a large number of interrelated variables in terms

of a potentially low number of unobserved variables. In another point of view, it approximates

a matrix shaped data set by a low rank matrix via an explicit probabilistic linear model. Factor

analysis reduces the complexity and reveals the underlining structure of the data set.

Factor analysis is over a century old. In psychology, the factor model dates back at least to

Spearman (1904), who is sometimes credited with the invention of factor analysis. The technique

is later also applied to social science, economics, finance and marketing, signal processing, bioinfor-

matics and etc. The latent factors discovered by factor analysis make the observed variables more

understandable. Typically, factor analysis is classified into two types. One is confirmatory factor

analysis which has pre-determined constraints on factor loadings (for example, the loading of the

observed factor V1 on latent factor F1 is 0). The other is explanatory factor analysis which does not

have such constraints.

More recently, factor analysis also becomes a widely used dimension reduction tool for analyzing

large matrices and high-dimensional data. Factor analysis shares a lot of similarities with low-rank

matrix approximation which has applications in fields such signal processing, collaborative filtering

and personalized learning. Compared with principal component analysis (PCA) or the singular

value decomposition (SVD), factor analysis assumes heteroscedastic noise for each variable, which is

a more reasonable assumption than constant noise variance in many applications. The challenge is

that in those data sets, the dimensionality is often comparable or even larger than the sample size,

thus new methodology and theoretical analysis for solving the model need to be established.

A problem with factor analysis is that it is surprisingly difficult to choose the number of factors.

Even in traditional factor analysis problems which have a small number of variables but a relatively

large sample size, there is no widely agreed best performing methods (see for example Peres-Neto

et al. (2005)). Classical methods such as hypothesis testing based on likelihood ratios (Lawley, 1956)

or methods based on information theoretic criteria (Wax and Kailath, 1985) assume homoscedastic

1

CHAPTER 1. INTRODUCTION 2

noise and thus not fit the heteroscedastic noise assumption for factor analysis directly. In addition,

since these classical methods assume an asymptotic regime with a growing number of observations

and fixed number of variables, they do not perform well on large matrices where both dimensions are

large in modern applications. Modern methods that are developed assuming both the dimensions

are large include modified information criteria method in the econometrics community assuming

strong factors and random matrix theory based methods assuming weak factors and homoscedastic

noise.

The rest of this chapter includes a description of the mathematical model and assumptions of

factor analysis, and a discussion of model identifiability.

1.1 Forms of Factor Analysis Model

Let N denote the number of variables and n denote the sample size. Then the observation yij for

the ith variable and the jth sample is assumed to have the following decomposition:

yij =

r∑k=1

likfkj + σieij (1.1)

where E = (eij)N×n is the noise matrix, Fk = (fk1, fk2, · · · , fkn)T denotes the kth latent variable

and L = (lik)N×r is called the factor loading matrix. Denote each observed variable as Yi =

(yi1, yi2, · · · , yin)T and the noise associated as Ei = (ei1, ei2, · · · , ein)T , then the vector form of

(1.1) is

Yi =

r∑k=1

likFk + σiEi (1.2)

This exactly shows that all the observed variables can be explained by linear combinations of r

common factors. Usually, r is much smaller than N , thus estimating the latent common factors

makes the data more interpretable. Let Y = (yij)N×n be the data matrix and F = (fkj)r×n be the

factor score matrix, then the matrix form of (1.1) is

Y = LF + Σ1/2E. (1.3)

This has the interpretation that the data matrix can be expressed as a low-rank signal matrix

X = LF plus noise. Thus, factor analysis model can be used when a low-rank approximation of the

data matrix is desired.


1.2 Model assumptions

As discussed in Anderson and Rubin (1956), the factor score matrix F can be treated to be either

random or non-random.

1.2.1 Random factor score matrix

Usually, if we think that the columns of Y are randomly and independently drawn from the popu-

lation, we may prefer to assume that F is random to reduce the number of parameters to estimate.

Typically the following assumptions are made:

Assumption 1. For a random factor score model

(a) The factor scores F and the noise E are random, while the factor loading matrix L is non-

random.

(b) F and E are independent: F |= E.

(c) For each latent variable k, fkj , j = 1, 2, · · · , n are i.i.d. with E(fkj) = µk. Also, we assume

Cov(F·j) = ΛF where ΛF ∈ Rr×r is some positive-semidefinite matrix.

(d) For each variable i, the noises ei1, ei2, · · · , ein are i.i.d, with E(eij) = ai and Var(eij) = 1.

Also, we assume Σ = diag(σ21 , σ

22 , · · · , σ2

N ).

Let αi =∑rk=1 likµk + ai and , then

E(yij) = αi, Cov(Y·j) = ΣY = LΛFLT + Σ. (1.4)

An equivalent way to write out (1.1) is

yij = αi +

r∑k=1

likfkj + σieij (1.5)

with the assumptions E(fkj) = 0 and E(eij) = 0. This form is more common to be found in classical

factor analysis literature.

It is also often assumed that the entries of both F and E follow Gaussian distributions. The

advantage of Gaussian assumptions is that then only the first and second moments of the data

matter in estimation and inference. Then from (1.4), only αi and LΛFLT + Σ are identifiable. We

will discuss in more detail about identification of the components in LΛFLT + Σ in Section 1.3.

Sometimes, it’s more reasonable to assume that the factor scores of the individuals (columns of

F ) are also correlated. For example, the individuals can be time series or spatial points. This is a

common assumption when factor analysis is applied to economics or spatial analysis (Forni et al.,

2000; Wang and Wall, 2003).


1.2.2 Non-random factor score matrix

We may tend to assume non-random F when the distributional assumption of F is too complicated

or an estimation of the low rank matrix X = LF is easier than estimating the factors themselves.

For example, the low-rank constraint on X can be relaxed to a nuclear norm constraint, which

enables good optimization algorithms for solving the model (Chapter 4). Another situation is that

the samples are assumed to have an unknown clustering structure and within the clusters samples are

correlated. In this scenario, a random factor score assumption will make the model too complicated

to be solved, thus a non-random factor score would be preferable.

For a non-random F model, both the signal matrix X and the noise covariance matrix Σ are

parameters. Compared with the random F assumptions, the model now has model parameters to

estimate. However, when r min(N,n), there will be still enough data to compensate for the extra

degrees of freedom.

Assumption 2. For a non-random factor score model

(a) The noises E are random, while both the factor loading L and factor score F are non-random.

(b) For each variable i, the noises ei1, ei2, · · · , ein are i.i.d, with E(eij) = ai and Var(eij) = 1.

Also, we assume Σ = diag(σ21 , σ

22 , · · · , σ2

N ).

As the random factor score model, the non-random factor score model can also be rewritten as

Y = α1Tn + LF + Σ1/2E (1.6)

with the additional constraint that F · 1n = 0.

Non-random and random factor score assumptions are closely related to each other. A random

factor score model becomes a non-random factor score model when we make inference conditional

on F . On the other hand, a non-random factor score model turns into a random factor score model

by adding a prior on F (similar to a random effect model in linear regression). We shall see that

in general, the asymptotic results for N,n→∞ would be very similar for random and non-random

factor score models.

For both models, there can be extra constraints imposed on the factors (either the factor loadings

or factor scores) based on the application problems. A typical example is confirmatory factor anal-

ysis. Typically, in confirmatory factor analysis, it is assumed that the loadings on specific entries

are zero, reflecting the structure of the relationships between observed and unobserved factors based

on researchers’ knowledge about the problem (Hoyle, 2000; Anderson and Gerbing, 1988). Another

popular constraint is assuming sparsity on factor loadings and/or factor scores, which is a common

constraint for improving interpretability and estimability of the factors for analyzing large matrices

(Shen and Huang, 2008; Carvalho et al., 2012). Another constraint is assuming non-negativity. For

example in educational research, the observed variables can be scores on questions and the latent


factors are the underling concept. The factor scores are then interpreted as understanding on certain

concept which are more interpretable if non-negativity is assumed (Martens, 1979; Smaragdis and

Brown, 2003).

1.3 Model Identification

Model identification is generally a hard problem for factor analysis and has been discussed for a long

time. Here we list several classical results to discuss the problem. In this section we assume that for

any of the random variables, its parameters can be identified if and only they can be identified via

the first two moments of the variable.

First, when the number of factors r is unknown, there is an identification problem for r. If r is

unknown, we can always set r = N and Σ = 0 and the model becomes a trivially correct model. To

avoid this, we define r as the minimum integer that the factor model (either under the assumption

of random or non-random factor scores) exists. Since r = N also provides a correct model, this

definition itself automatically guarantees the uniqueness of r.

We assume normality for all the random variables, and discuss identification of the model pa-

rameters for both random and non-random factor score models.

1.3.1 Identification for random factor score model

For random factor score models, more constraints are needed for the identification of each elements

of the covariance matrix LΛFLT + Σ. First, we show a sufficient condition discussed in Anderson

and Rubin (1956) for identification of Φ = LΛFLT and Σ given r.

Theorem 1.3.1. Under Assumption 1, a sufficient condition for identification of Σ and Φ = LΛFLT

is that if any row of Φ is deleted, there remain two disjoint subsets of rows of Φ of rank r.

When Σ can be uniquely defined, it is obvious to see that L and ΛF are still unidentifiable.

Actually, given any invertable r×r matrix U , replacing L with L = LU and ΛF with ΛF = U−1ΛFUT

will keep ΣY unchanged. One common constraint to make L identifiable up to rotation is to assume

ΛF = Ir. This is assuming that the latent factors are uncorrelated (under the Gaussian independent)

to each other and are normalized. Some further restrictions can be added to eliminate the rotation

uncertainty. For example, common assumptions are assuming either LTL or LTΣ−1L is diagonal

with distinct entries, thus L can be uniquely identified via the eigenvalues and eigenvectors of Φ (if

diagonality of LTL is assumed) or Σ−1/2ΦΣ−1/2 (if diagonality of LTΣ−1L is assumed). Usually,

both the orthogonality and diagonality constraints mentioned above may not represent the properties

of the actual factors, but just for mathematical convenience.

Assumption 3. ΛF = Ir and either LTL or LTΣ−1L is diagonal with distinct diagonal entries.


Let’s now discuss the identification of L and ΣF from Φ under sparsity assumption. This is

equivalent to unique determination of U up to scaling and row/column permutation of the identity

matrix for L = LU . We state a simplified and generalized version of the classical result in Reiersøl

(1950). We define the s-sparse family of L (we require s ≥ r):

L(s) =L ∈ RN×r : L satisfies conditions 1 and 2

.

The conditions I and II are stated as follows:

(I) L is of rank r and each column of L contains at least s zeros.

(II) For each column m, let Lm be the matrix consisting of all rows of L which have a zero in the

mth column. For any m = 1, 2, · · · , r, Lm is of rank r − 1.

The above two conditions requires all the factors to be sparse, though besides sparsity it should be

quite full rank. Then a necessary and sufficient condition of L being identifiable in L is:

Theorem 1.3.2. Under Assumption 1, the normality assumption and the identification conditions

in Theorem 1.3.1, a necessary and sufficient condition for L in L(s) being identifiable up to scaling

and row/column permutation is that if a sub-matrix L? ∈ Rs×r of L is of the rank of r − 1, then it

must be the sub-matrix of Lm for some m = 1, 2, · · · , r.

Remark. The original theorem in Reiersøl (1950) (also stated in Anderson and Rubin (1956))

stated a different result compared with Theorem 1.3.2. In Reiersøl (1950), we have s = r and a

narrower parameter space L?(r) is assumed with two further restrictions on Lm: (III) the rank of

Lm with one row deleted is still r − 1 and (IV) the rank of Lm with one of other rows of L added

becomes r. As a consequence, the necessary and sufficient condition changes to that L does not

contain any other submatrices satisfying (II)-(IV). Theorem 1.3.2 defines a larger parameter space

L(s) for s = r which is more meaningful for practical usage. Also, Theorem 1.3.2 generalize the

original result to any sparsity level s. An increase of s weakens the identification condition.

Proof. As discussed, we only need to show that for L = LU , if L, L ∈ L, the condition in the

theorem is a necessary and sufficient condition for U having exactly one non-zero in each row and

each column.

Sufficiency: Since L has rank r, U must be full rank and L = LU−1. For any given m ∈ 1, 2, · · · , r,as the rank of Lm is r − 1, there must exist an s × r sub-matrix L? of Lm that is of rank r − 1,

then L?U ∈ Rs×r also has rank r − 1. Since L?U is a sub-matrix of L, then given the condition,

one column, say jm of L?U must all be zero. Let L?·(m) be the sub-matrix of L? excluding the mth

column. Since L?·(m) ∈ Rs×(r−1) is of rank r− 1, then the entries of jmth column except for the mth

row of U must all be zero.

It’s easy to show that if m1 6= m2, then jm1 6= jm2 , thus U has exactly one non-zero in each row

and each column. The sufficiency of the condition is proved.


Necessity: If the condition in the theorem is not satisfied, then there exists a sub-matrix L? ∈ Rs×r

of L that has rank r − 1 but none of its column is all zero. Thus, ∃v ∈ Rr that has at least

two non-zero entries and L?v = 0. W.L.O.G, assume that the first entry of v is non-zero. Let

V−1 = (0 Ir−1)T ∈ Rr×(r−1) and U = (v V−1) ∈ Rr×r. Then U has rank r and it’s easy to check

that LU ∈ L. Thus, L is not identifiable. The necessity of the condition is proved.

1.3.2 Identification for non-random factor score model

For non-random factor score models, we need constraints to identify the signal matrix X and noise

covariance Σ given r first, and then constraints to identify F and L in X = LF .

To identify X and Σ of model Y = X + Σ1/2E, we need to guarantee that if Y = X ′ + Σ′1/2E′

with X ′ having rank r, Σ′ diagonal E′ a random matrix with i.i.d. standard Gaussian entries, then

X ′ = X and Σ′ = Σ. First, if r = N , then the model is trivially unidentifiable. Thus, we need

to have r < N . We give a necessary condition for identifying X and Σ. We find it hard to give a

sufficient condition.

Theorem 1.3.3. Assume r < N . Under Assumption 2 and a known r, a necessary condition for

identifying X and Σ is that if any row of X is removed, the remaining matrix is still of rank r.

Proof. Suppose that there exists one row j that if the jth row is removed, the remaining matrix

X(j)· still has rank k < r. Let the remaining matrix of L after removing the jth row be L(j)·,

then L(j)· ∈ R(p−1)×r also has rank k, thus it is degenerate. Thus, there exists a non-zero vector

v ∈ Rr that L(j)·v = 0. Since L is full rank, Lv 6= 0. Thus Lv has only one non-zero entry. Let

X ′ = X +LvET1 where E1 ∈ Rn is a random vector with i.i.d. standard Gaussian entries which are

also independent from E. Then X ′ is still of rank r, and Σ′ = Σ− LvvTLT is still diagonal. Thus

X and Σ are not identifiable. This proves that the condition is necessary.

After identification of X and Σ, similar to the random factor score cases, we can impose more

restrictions for identification of the decomposition X = LF . The restrictions are similar to Theorem

1.3.1 and Theorem 1.3.2. For the rotation restriction, we can refer to the five scenarios listed in Bai

and Li (2012a). For the sparsity restriction, either a sparsity assumption on L or F will be sufficient

for identification.

Assumption 4. MF = FFT /n = Ir, either LTL or LTΣ−1L is diagonal with distinct diagonal

entries.

Though we have discussed the sparse factor assumptions, we will focus on estimating unre-

stricted factor analysis model in Chapters 2 to 5. The identification condition of sparse factors

in Theorem 1.3.2 is closely related to the model identification of confounder adjustment models

(Corollary 5.1.1) that we will discuss in Chapter 5.

Chapter 2

Background

In this chapter, we will discuss some previous methods. We will review using the maximum likelihood

method (MLE) and principal component analysis (PCA) in estimation of the factor loadings/signal

matrix and the noise variances given r. We discuss them for both the random and non-random factor

score models. Also, we summarize their properties under both the classical and high-dimensional

data asymptotic regions assuming that the number of factors r is fixed and known. Finally, we will

review previous methods in estimating r, which can be a much harder problem than estimating the

factors loading parameters themselves.

2.1 The maximum likelihood method

2.1.1 Random factor scores model

Under Assumption 1 and normality of the random variables, the log-likelihood of the data matrix

Y in (1.5) can be written as

L(Y ;α,L,ΣF ,Σ) =−Nn log(2π)− n

2log det |LΛFL

T + Σ|

− 1

2tr[(Y T − 1nα

T )(LΛFLT + Σ)−1(Y − α1n

T )] (2.1)

where 1 represents a vector of 1s of length n. It’s easy to see immediately that the MLE of α gives

the sample mean

αi =1

n

n∑j=1

yij

Let S = 1n (Y − α1T )(Y − α1T )T be the sample covariance, then (2.1) can be rewritten as

L(Y ;L,ΣF ,Σ) = −Nn log(2π)− n

2log det |LΛFL

T + Σ| − n

2tr[(LΛFL

T + Σ)−1S]

(2.2)

8

CHAPTER 2. BACKGROUND 9

Finding the global optimal solution maximizing (2.2) given r is generally a hard problem. For a

special case Σ = σ2IN , there is an explicit solution for L which are exactly the principal components.

The result is proved in Anderson and Rubin (1956) using the estimation equations derived by Lawley

(1940).

Theorem 2.1.1. Assume that Σ = σ2IN , Assumption 3 and in particular LTL is diagonal. Let the

eigenvalue decomposition of S be S = PΛPT where P ∈ RN×N is orthogonal and Λ = diag(λ, λ2, · · · , λN )

with λ ≥ λ2 ≥ · · · ≥ λN . Let Pr ∈ RN×r be the first r columns of P and Λr = diag(λ, λ2, · · · , λr).Then the solutions maximizing (2.2) are

L = Pr(Λr − σ2Ir)1/2, σ2 =

1

N − r

N∑k=r+1

λk

For a general diagonal Σ, it’s hard to find a global maximum solution. One popular method is

to use the EM algorithm proposed by Rubin and Thayer (1982). Assuming ΛF = Ir, then the joint

log-likelihood of (Y, F ) is

L(Y, F ;L,Σ) =−Nn log(2π)− n

2log det |Σ|

− 1

2tr[(Y − α1Tn − LF )TΣ−1(Y − α1Tn − LF )

]− rn log(2π)− 1

2tr(FTF ).

(2.3)

For the E-step, we have

F·j |Y·j ∼ N(LT (LLT + Σ)−1(Y·j − α), Ir − LT (LLT + Σ)−1L

).

For the M-step, we have

∂

∂LEF·j |Y·j ;L(k),Σ(k),α [L(Y, F ; Λ,Σ)]

=∂

∂LEF·j |Y·j ;L(k),Σ(k),α

[tr[(Y − α1Tn )TΣ−1LF

]− 1

2tr(FTLTΣ−1LF )

]=EF·j |Y·j ;L(k),Σ(k),α

[Σ−1(Y − α1Tn )FT − Σ−1LFFT

]= 0

which results in

L(k+1) =(Y − α1Tn )EF·j |Y·j ;L(k),Σ(k),α

[FT] (

EF·j |Y·j ;L(k),Σ(k),α

[FFT

] )−1

Similarly, by taking derivative of the diagonal entries of Σ, we can get a formula for Σ(k+1), which


is the diagonal of the matrix

1

nEF·j |Y·j ;L(k),Σ(k),α

[(Y − α1Tn − LF )(Y − α1Tn − LF )T

]with L = L(k+1).

Notice that the joint distribution of (Y, F ) is not invariant to the rotation of L, and the solution

path L(k) may be oscillating though the marginal likelihood keeps increasing. We can stop when

the marginal likelihood or Σ converges and rotate the final estimate L to satisfy the identification

constraint.

2.1.2 Non-random factor score model

When assuming F non-random and Gaussian noise, the log-likelihood of (1.6) becomes

L(Y ;L,F,Σ) = −Nn log(2π)− n

2log det |Σ| − 1

2tr[(Y − α1Tn − LF )TΣ−1(Y − α1Tn − LF )

](2.4)

However, the above likelihood is ill-posed since it can achieve∞ by setting σi → 0 for some i and

the ith row of LF the same as the Yi·−αi. There are previously two approaches to solve this problem.

One is proposed in Anderson and Rubin (1956), by using the likelihood of S = 1n (Y −α1T )(Y −α1T )T

which under the non-random factor score assumption follows a non-central Wishart distribution

with covariance Σ and mean matrix XXT /n = LFFTLT /n. The other is called “quasi-likelihood”

(QMLE) in Bai and Li (2012a) (also in Amemiya et al. (1987)) which is essentially using the likelihood

for random score model:

L(Y ;X,Σ) =−Nn log(2π)− n

2log det |XXT /n+ Σ|

− 1

2tr[(Y T − 1nα

T )(XXT /n+ Σ)−1(Y − α1nT )] (2.5)

where X = LF . Under Assumption 4 then XXT /n = LLT . One can as well using the EM algorithm

to solve (2.5) though F is actually non-random.

In Anderson and Rubin (1956), it is shown that the difference between the estimates L and Σ

from MLE of the non-central Wishart distribution and those from maximizing the quasi-likelihood

is uniformly o(1/√n) when N is fixed and n→∞ under Assumption 4.

2.1.3 Estimating the factor scores

We assume that either ΣF = Ir for the random factor score model or MF = Ir for the non-random

factor score model. For the random factor score model, the factor scores are random variables and


we have the posterior mean

E[F·j |Y·j ] = LT (LLT + Σ)−1(Y·j − α)

= (LTΣ−1L+ Ir)−1LTΣ−1(Y·j − α)

using the formula that (LLT + Σ)−1 = Σ−1 − Σ−1L(LTΣ−1L + Ir)−1LTΣ−1. Thus F should be

estimated via an estimate of the posterior mean

F = (LT Σ−1L+ Ir)−1LT Σ−1(Y − α1Tn ) (2.6)

where L, F and α are the MLE estimates from (2.1).

For the non-random factor score model, one can also use the GLS estimate of F given Σ and L:

F = (LT Σ−1L)−1LT Σ−1(Y − α1Tn ) (2.7)

The estimate (2.6) can also be used for non-random factor score model which can be explained as

the ridge regression estimator. From the property of ridge regression we know that when r n,

there would be no noticeable difference between (2.6) and (2.7).

2.1.4 Asymptotic properties

First, we discuss the asymptotic properties of MLE under the classical asymptotic regime that N is

fixed and n→∞ for both the random and non-random factor score models. These results are again

due to Anderson and Rubin (1956).

Theorem 2.1.2. Under either Assumption 1 or Assumption 2, and assume identification conditions

of L and Σ are satisfied including Assumption 3 and Assumption 4. Let L and Σ be the MLE

estimates from (2.1). If S → LLT + Σ and√n(S − LLT − Σ) has a limiting normal distribution,

we have L→ L and Σ→ Σ and√n(L− L) and

√n(Σ− Σ) have limiting normal distributions.

Notice that for the convergence in Theorem 2.1.2 to be meaningful, we explicitly assumes that

the asymptotic regime is N fixed and n → ∞. Also, though the likelihood is derived by assuming

normality of the random factors, the convergence requirements that S → LLT + Σ and√n(S −

LLT −Σ) has a limiting normal distribution do not need normality. For random factor score model,

we only need the distributions of F and E to satisfy central limit theorem (the second moments are

finite). For non-random factor score models,

S − LLT − Σ =LFETΣ1/2 + Σ1/2EFTLT

n− Σ1/2

(EET

n− IN

)Σ1/2

thus we only require the distribution of E satisfy central limit theorem.


In Amemiya et al. (1987), they derived the limiting covariance matrix of√n(L−L) and

√n(Σ−Σ)

for both random and non-random factor score model under normality, which have very complicated

forms. However, both works consider only the classical asymptotic region which only works for small

N and large n problems.

Next, we consider the asymptotic regime for high-dimensional data where both N and n are

large. In Bai and Li (2012a), the authors derived consistency and asymptotic normality of the MLE

estimates under the assumption of strong factors and the asymptotic regime that N → ∞ and

n → ∞. For strong factors we require that LTL/N → ML where ML ∈ Rr×r is positive definite.

Since they allow N →∞, they have the following additional assumptions on the noise variances and

factor loadings besides Assumption 1 and Assumption 2:

Assumption 5. (a) The fourth moments of the noise entries are finite: E(e4ij) ≤ C4 for some

C <∞ for all i = 1, · · · , N and j = 1, · · · , n

(b) The noise variances are bounded: C−2 ≤ σ2i ≤ C2 for all i = 1, 2, · · · , N

(c) The factors loadings are large enough: when N →∞, LTΣ−1L/N → QL where QL ∈ Rr×r is a

positive definite matrix

(d) The entries of the factor loadings are bounded: ‖Li·‖2 ≤ C for i = 1, 2, · · · , N

(e) The MLE (or QMLE for the non-random factor score model) estimates of the noise variances

are bounded: C−2 ≤ σ2i ≤ C2 for i = 1, 2, · · · , N

Then they have the following consistency and asymptotic normality results:

Theorem 2.1.3. Assume Assumption 1( or Assumption 2), Assumption 3 (or Assumption 4),

Assumption 5 and in particular LTΣ−1L is diagonal. Let L and Σ be the MLE estimates from (2.1).

When n,N →∞, then for each variable i,

Li· − Li· = Op(n−1/2), σ2

i − σ2i = Op(n

−1/2)

where the convergence is in probability and√n(Li· − Li·) and

√n(σ2

i − σ2i ) have a limiting normal

distribution for any given i. For the non-random factor score model,

√n(Li· − Li·)

d→ N(0, σ2i Ir),

√n(σ2

i − σ2i )

d→ N(0, (2 + κi)σ4i )

where κi is the excess kurtosis of eij (for Gaussian noise κi = 0).

For the random factor score model, the limiting covariance of√n(σ2

i − σ2i ) stays the same while

that for√n(Li·−Li·) has a much complicated form which can be found in Section F of the appendix

in Bai and Li (2012a).


The consistency in Theorem 2.1.3 holds for each i but may not hold uniformly for all i. Uniform

consistency for all i can be derived by assuming that the random variables have exponential tails

and imposing an extra condition on the relationship between n and N approaching the limit.

Theorem 2.1.4. Under the assumptions of Theorem 2.1.3 and assume that eij are sub-Gaussian

random variables, if (logN)2/n→ 0 as n,N →∞, then

maxj≤N‖L2

i· − L2i·‖2 = Op(

√logN/n), max

j≤N|σ2i − σ2

i | = Op(√

logN/n) (2.8)

For the non-random factor model,

maxi=1,2,··· ,N

∥∥∥∥∥∥Li· − Li· − 1

n

n∑j=1

σieijFT·j

∥∥∥∥∥∥2

= op(n− 1

2 ). (2.9)

Proof. The proof is a modification of the proof of Bai and Li (2012). It is very technical, please see

Appendix A.

In Bai and Li (2012a) the authors also have shown the consistency of the estimated factor scores

using either (2.6) or (2.7) which has the same limiting distribution.

Theorem 2.1.5. Under the assumptions of Theorem 2.1.3 and assume that N/n2 → 0 when n,N →∞, then for the non-random factor score model,

√N(F·j − F·j)

d→ N(0, Q−1L )

Here QL = limN→∞ LTΣ−1L is defined in Assumption 5. For the random factor score model,

when N/n→ γ > 0 the limiting distribution of√N(F·j − F·j) also is asymptotically Gaussian with

a complicated limiting covariance matrix, and the result is stated in Section F of Bai and Li (2012)

appendix.

We should mention that the results in Bai and Li (2012a) are not very satisfactory, though very

impressive. All the results are based on Assumption 5(e), which is not guaranteed to always hold

for MLE (or QMLE) estimates. However, this is currently the best results we can find. In Bai and

Li (2016), the same authors generalized the results to approximate factor models, which allow weak

correlations among the noises (and between the noise and factors for random factor score model).

2.2 Principal component analysis (PCA)

PCA is a very common technique for dimension reduction of the data. It is closely related to factor

analysis, and often is used as a solution (or at least an initial solution) for factor analysis in both

classical and high-dimensional data analysis. In this section, we discuss the use of PCA (and its


equivalent form SVD) in factor analysis and the asymptotic properties of PCA under the factor

analysis assumptions. We consider three different asymptotic regions: the classical fixed N and

n → ∞ assumption; the n,N → ∞ and strong factor (limN→∞ LTL/N is some positive definite

matrix) assumption in econometrics and the n,N → ∞ and weak factor (limN→∞ LTL to some

finite positive definite matrix) assumption in random matrix theory. We still assume that r is given

as a constant in any of the asymptotic regions.

2.2.1 Use of PCA in factor analysis

Basically, PCA tries to find linear combinations of the observed variables to maximize the sample

variance. Let S = (Y − α1T )(Y − α1T )T /n as defined before. Let the eigenvalue decomposition of

S be S = PΛPT where Λ = diag(λ, λ2, · · · , λN ) is a diagonal matrix where λ ≥ λ2 ≥ · · · ≥ λN ≥ 0

and P is an orthogonal N × N matrix. Then the eigenvectors P·1, · · · , P·N are called loadings,

and the rows of PT (Y − α1T ) are called principal components (PCs). For more details of PCA, one

can refer to Jolliffe (1986).

Also, one can derive the PCs and loadings from the singular value decomposition (SVD) of

Y − α1T . Let Y − α1T =√nUDV T where U ∈ RN×min(N,n), V ∈ Rn×min(N,n) and D =

diag(d1, d2, · · · , · · · dmin(N,n)) with UT U = V T V = Ir and d1 ≥ d2 ≥ · · · ≥ dmin(N,n). It’s then

clear that the first m loadings are columns of U and the first m principal components are columns

of√nV D. Notice that one has the identity that d2

k = λk for k = 1, 2, · · · ,min(N,n).

To use PCA in factor analysis, we essentially use the loadings of PCA to estimate the linear

space of factor loadings and the PCs to estimate the linear space of factor scores. Let Pr ∈ RN×r be

the first r columns of P (which is the same as Ur, the first r columns of U) and Λr = diag(λ, · · · , λr)(Dr = diag(d1, · · · , dr)). Under the identification condition that either ΣF = Ir for the random

factor score model or MF = Ir for the non-random factor score model, and LTL is diagonal with

decreasing diagonal entries, we will have

Lpc = PrΛ1/2r = UrDr, F

pc =√nV Tr = LT (Y − α1T ) (2.10)

To estimate the noise variances, it’s common to use

Σpc = diag

(1

n(Y − α1T − LF )(Y − α1T − LF )T

)(2.11)

2.2.2 Asymptotic properties

As promised, we discuss three different asymptotic regimes.


N fixed and n→∞

First, let’s consider the asymptotic region where N is fixed and n→∞. As we have discussed, the

sample covariance S → LLT + Σ in this asymptotic region. Thus, P is a consistent estimator of the

eigenvectors of LLT + Σ. If the diagonal entries of Σ are distinct, then the linear space spanned by

P will be different from the linear space spanned by L, indicating that Lpc would not be a consistent

estimator of L, thus F pc and Σpc would also be inconsistent.

Notice that Lpc in (2.10) is very similar to the MLE estimator when Σ = σ2IN in Theorem 2.1.1,

but without adjustment on Λr by subtracting σ2Ir. Theorem 2.1.2 shows that under the classical

asymptotic region, MLE estimators are consistent. Thus, even in the simplest case where Σ = σ2Ir,

Lpc is inconsistent for estimating L since Lpc − L = σ2Pr is not approaching 0 and L is consistent.

N,n→∞ with strong factors

It turns out that PCA estimates becomes consistent when N,n→∞ and the factors’ strength keeps

growing with N . Bai (2003) and Bai and Ng (2013) derived consistency and asymptotic normality

results for the PCA estimates under the same assumptions as for the MLE in Theorem 2.1.3.

Theorem 2.2.1. Assume all the assumptions in Theorem 2.1.3 except that now LTL instead of

LTΣ−1L is diagonal with LTL/N → ΣL. Then when n,N → ∞, we have Lpci· → Li· with rate

min(√n,N) for each i = 1, 2, · · · , N . Also, F pc

·j → F·j with rate min(√N,n) for each j = 1, · · · , n.

Besides, under the non-random factor score model, we have the following asymptotic normality:

1. If√n/N → 0 when N,n→∞, then for each i

√n(Lpc

i· − Li·)d→ N

(0, σ2

i Ir)

2. If√N/n→ 0 when N,n→∞ then for each j

√N(F pc

·j − F·j)d→ N

(0, (ΣL)−1QΣL

)where Q = limN→∞ LTΣL/N ∈ Rr×r.

For the limiting covariance under the random factor score model which has much more compli-

cated forms, one can refer to Bai (2003). Bai (2003) and Bai and Ng (2013) actually have the results

for more general assumptions which allow for the noise covariance Σ to change with n. This makes

the result more applicable to some econometrics applications.

Comparing the results of PCA in Theorem 2.1.3 and MLE in Theorem 2.2.1, the asymptotic

efficiency in estimating L is the same, while F pc is less efficient than the MLE estimate F . The

difference is the same as running a GLS instead of OLS for solving F given the true L.


In Fan et al. (2013), the authors strengthened the consistency result of Theorem 2.2.1 to uniform

consistency.

Assumption 6. There exists mt and bt for t = 1, 2 such that for any s > 0, i ≤ N and j ≤ n,

P [|εij | > s] ≤ exp (1− (s/b1)m1)

Also, under the non-random factor score model, assume that the entries of F are bounded |Fkj | ≤ Cfor some constant C or under the random factor score model assume that

P [|Fkj | > s] ≤ exp (1− (s/b2)m2)

for any k ≤ r.

Theorem 2.2.2. Under the assumptions of Theorem 2.2.1 and Assumption 6 with n = o(N2) and

logN = o(nγ) where (6γ)−1 = 3m−11 + 1.5m−1

2 + 1, we have

maxi≤N‖Lpc

i· − Li·‖ = Op

(√1

N+

√logN

n

), and

maxj≤n‖F pc·j − F·j‖ = Op

(√1

n+

√n1/2

N

).

The original authors have shown that both Theorem 2.2.1 and Theorem 2.2.2 also hold for a

much weaker assumptions than Assumption 1 and Assumption 2. They hold also for approximate

factor models, where the noise can be weakly correlated.

N,n→∞, N/n→ γ > 0 with weak factors

For high-dimensional data, in many cases there are weak factors where it’s inappropriate to assume

that ‖L·k‖22 is O(N). For example, if factor k has sparse loadings, say ‖L·k‖0 = O(1), then ‖L·k‖22 =

O(1) when ‖L·k‖∞ = O(1) (Witten et al., 2009). It’s very common to have sparse factors for high-

dimensional data. For example, the observed variables may be just locally correlated and have a

block structure. The weak factors become really hard to estimate accurately though. One reason

is that though the true factors are sparse, the singular vectors of the signal matrix X may not be

sparse. There have been research on random matrix theory showing that even for the homoscedastic

noise model Σ = σ2IN , PCA is not consistent. There is also a detection threshold that if some

eigenvalue of ΛTΛ is below the threshold, there is no hope to estimate any its information from

spectral analysis.

There has been rich literature in Random Matrix Theory (RMT) in understanding the asymptotic

properties for estimating the weak factors by PCA (or SVD), especially when Σ = σ2IN . Under


the identification condition that LTL is diagonal, let L = UD where U ∈ RN×r, UTU = Ir and

D = diag(d1, · · · , dr). We impose the following assumptions for weak factors:

Assumption 7. 1. For each k = 1, 2, · · · , r, when n,N → ∞, dka.s.→ ρk for some constant

ρk <∞. For simplicity, also assume that ρ1 > ρ2 > · · · > ρr > 0.

2. The noise entries have finite fourth moment: E[e4ij

]< ∞ for i = 1, 2, · · · , N and j =

1, 2, · · · , n

Then people have shown that the estimates D and U are inconsistent estimates of D and U , thus

the PCA estimates Lpc and F pc are inconsistent for L and F . There are many reference literature.

For instance, the random factor model result can be found in Paul (2007); Yao et al. (2015); Nadler

(2008). The non-random factor model result can be found in Perry (2009); Benaych-Georges and

Raj Rao (2012); Onatski (2012).

Theorem 2.2.3. Assume either Assumption 1 or Assumption 2, Assumption 7 and Σ = σ2IN .

When n,N →∞ with N/n→ γ > 0, we have for k, k1, k2 = 1, 2, · · · , r:

1.

dka.s.→ ρk =

√(

ρk + 1ρk

)(ρk + γ

ρk

)σ if ρk > γ1/4σ(

1 +√γ)σ if ρk ≤ γ1/4σ

2.

UT·k1U·k2

a.s.→

θk =

√ρ4k−γ

ρ4k+βρ2

kif k1 = k2 = k, ρk > γ1/4σ

0 otherwise

3. For the estimated factor scores,

F pck1·F

Tk2·

n

a.s.→

θk =

√ρ4k−γ

ρ4k+βρ2

kif k1 = k2 = k, ρk > γ1/4σ

0 otherwise

If we further assume that N/n = γ + o(n−1/2

)and dk − ρk = o

(n−1/2

)as n,N → ∞, then if

ρk > γ1/4, then√n(dk − ρk

),√n(UT·kU·k − θk

)and

√N(F pck1·F

Tk2·/n− θk

)all have a limiting

Gaussian distribution with mean 0.

The limiting variances for√n(dk − ρk

),√n(UT·kU·k − θk

)and√N(F pck1·F

Tk2·/n− θk

)are dif-

ferent for random and non-random factor score models, and one can find the specific forms in the

above references.

For the heteroscedastic noise factor analysis model where Σ = diag(σ21 , · · · , σ2

N ) has arbitrary

diagonals, the problem is much more complicated, as the distribution noise matrix Σ1/2E is not


invariant under the orthogonal transformation of the rows. One result we can show is that we can

combine the results in Onatski (2012) and Benaych-Georges and Raj Rao (2012) to get limits of dk,

U·k and F pck· under a random factor score model which also has uniformly distributed factor loading

entries for each factor. Let the cumulative distribution function (CDF) of the empirical distribution

of the noise variances be GN (x) = 1N

∑Ni=1 1σ2

i≤x. Then we need the following assumption:

Assumption 8. When n,N →∞, GN (x)→ G0(x) for all x ∈ R where the limiting cumulative dis-

tribution function (CDF) G0(x) has bounded support [a0, b0] and the corresponding density function

g0(x) satisfies minx∈(a0,b0) g(x) > 0. Also maxi≤N σ2i → b0 and mini≤N σ

2i → a0.

Based on the results of Onatski (2010, 2012), we know that under Assumption 8, the empirical

distribution of the eigenvalues of 1nΣ1/2EETΣ1/2 for N/n → γ ≤ 1 converge to some limiting

distribution with CDF G(x). If γ ≤ 1, the support of G(x) is bounded [a0, b0], and the largest

and smallest eigenvalues of 1nΣ1/2EETΣ1/2 converge to b0 and a0 respectively. If γ > 1, G(x) =

1γ G(x) +

(1− 1

γ

)δ0 where the support of G(x) is bounded [a0, b0], and the largest and smallest

non-zero eigenvalues of 1nΣ1/2EETΣ1/2 converge to b0 and a0 respectively. To use Benaych-Georges

and Raj Rao (2012), we assume that the factor loading matrix is also random and impose an extra

condition on L under the random factor score model:

Assumption 9. For factor score matrix L = UD,√N · U ∈ RN×r has i.i.d. entries with mean 0

and variance 1 and D = diag(d1, · · · , dr) is a deterministic diagonal matrix.

Based on results of Benaych-Georges and Raj Rao (2012), let the D-transform of G(x) be defined

as

DG(z) =

[∫z

z2 − xdG(x)

]×[γ

∫z

z2 − tdG(x) +

1− γz

]for z > a.

For f a function and x ∈ R, denote f (x+) = limz↓x f(z). Also, in the theorem below, D−1G (·)

denotes its function inverse on [a,∞). We then have results analogous to Theorem 2.2.3.

Theorem 2.2.4. Under Assumption 1, Assumption 8 and Assumption 9, when n,N → ∞ and

N/n→ γ, we have

1. For k = 1, 2, · · · , r

dka.s.→ ρk =

D−1G

(1/ρ2

k

)if ρk >

(DG

(a+

0

))−1/2

a0 otherwise

2. For k1 = 1, 2, · · · k0 and k2 = 1, 2, · · · , r where k0 =∑rk=1 1

ρk>(DG(a+0 ))−1/2 ,

UT·k1U·k2

a.s.→

√−2φG(ρk)ρ2kD′G(ρk)

if k1 = k2 = k

0 otherwise


3. For the estimated factor scores and k1 = 1, 2, · · · k0 and k2 = 1, 2, · · · , r where k0 =∑rk=1 1

ρk>(DG(a+0 ))−1/2 ,

F pck1·F

Tk2·

n

a.s.→

√−2φG(ρk)

ρ2kD′G(ρk)

if k1 = k2 = k

0 otherwise

where G = γG+ (1− γ)δ0 when γ ≤ 1 and ρk is defined in Theorem 2.2.3.

Because of the inconsistency of PCA even assuming Σ = σ2IN , there have been several improve-

ment of the PCA estimates to reduce the estimation error. One direction is to shrink d1, d2, · · · , drtowards 0 while keeping the estimated eigenvectors (singular vectors) unchanged (Shabalin and No-

bel, 2013). Based on the result of Gavish and Donoho (2014), define the PCA optimal shrinkage

estimator as

Lsk = Urη(Dr

), F sk = F pc (2.12)

where η(Dr) = diag(η(d1), · · · , η(dr)

). The shrinkage function η(·) is defined as

η(d) =

σ2

d

√(d2

σ2 − γ − 1)2 − 4γ if d ≥ (1 +

√γ)σ

0 Otherwise

and is the optimal function that minimizes the asymptotic estimation error limn,N→∞,N/n→γ ‖X −LskF sk‖2F . In practice when σ is unknown, Gavish and Donoho (2014) proposed a consistent esti-

mator σ which is based on the median of the singular values of Y . Raj Rao (2014) considered this

optimal shrinkage of sample singular values for a general noise variance Σ, including the heterosced-

asitc noise factor analysis model once it satisfies the assumptions in Theorem 2.2.4.

2.3 Estimating the number of factors

Now we review the methods in estimating r under the three different regimes that we have discussed

in the previous section.

2.3.1 Classical methods

For the classical problem where N is relatively small compared with the sample size n, estimating

the number of factors r is a very hard problem. Many methods have been proposed for estimating

the number of principal components, but very few methods work specifically for the factor analysis

model, which has additive heteroscedastic noise that is not present in PCA. One method is based

on likelihood ratio tests (Lawley, 1956; Bartlett, 1950; Anderson and Rubin, 1956). For a given r,

define the null hypothesis H0r: Φ = LLT + ΣF for some L ∈ RN×r and diagonal matrix ΣF . The


calculation in Anderson and Rubin (1956) has shown that the likelihood ratio test statistic using

(2.1) is

Ur = n[log det Σ + log det

(Ir + LT Σ−1L

)− log detS

]and under the asymptotic regime that N is fixed and n → ∞, Ur follows a chi-square distribution

with N(N − 1)/2 + r(r − 1)/2 − rN degrees of freedom. To estimate r, one can sequentially test

for H00, H01, · · · and stop at r if H0r is not rejected. However, this sequential testing method does

not have any theoretical guarantees and has been shown to perform poorly in practice (Tucker and

Lewis, 1973; Velicer et al., 2000), for example it is sensitive to the normality assumption and tend

to underestimate r when n is large.

Based on empirical evaluations, some researchers (Velicer et al., 2000; Buja and Eyuboglu, 1992;

Velicer, 1976) suggest that even if one believes the factor analysis model, one can first assume

Σ = σ2IN to determine the number of principal components r and then estimate the factors based

on (1.3) given r. For estimating the number of principal components, popular methods include scree

test (Cattell, 1966; Cattell and Vogelmann, 1977), Kaiser’s rule (Kaiser, 1960), parallel analysis

(PA) (Horn, 1965; Buja and Eyuboglu, 1992), the minimum average partial test of Velicer (1976)

and information criteria based methods such as minimum description length (MDL) (or Bayesian

Information Criteria (BIC)) and Akaike Information Criteria (AIC) (Wax and Kailath, 1985; Fishler

et al., 2002). To use these methods effectively for factor analysis, one essentially applies various rules

on the sample correlation matrix S. For example, the two simplest rules are the Scree test which

plots the eigenvalues of S in an decreasing order and r is determined by identifying an “elbow” of

the eigenvalue curve, and Kaiser’s rule which estimates r as the number of eigenvalues of S that

exceed 1.

Among all these methods, there is a large amount of evidence Zwick and Velicer (1986); Hubbard

and Allen (1987); Velicer et al. (2000); Peres-Neto et al. (2005) showing that PA is one of the most

accurate of the above classical methods for determining the number of factors. Parallel analysis

compares the observed eigenvalues of the correlation matrix to those obtained in a Monte Carlo

simulation. The first factor is retained if and only if its associated eigenvalue is larger than the 95’th

percentile of simulated first eigenvalues. For k ≥ 2, the k’th factor is retained when the first k − 1

factors were retained and the observed k’th eigenvalue is larger than the 95’th percentile of simulated

k’th factors. The permutation version of PA was introduced by Buja and Eyuboglu (1992). There

the eigenvalues are simulated by applying independent uniform random permutations to each of the

variables stored in Y . The earlier method of Horn (1965) resamples from a Gaussian distribution.

Parallel analysis has been used recently in bioinformatics (Leek and Storey, 2008b; Sun et al., 2012).

Though there exist no theoretical results to guarantee the accuracy of PA, it performs very well in

practice.


2.3.2 Methods for large matrices and strong factors

This collection of methods is designed for an asymptotic regime where both n,N → ∞ while r is

fixed. Again, for strong factors, it is assumed that LTL/N → ΣL. Under such asymptotic regime

and the strong factor assumption, it is theoretically easy to find a consistent estimator of r since the

first r eigenvalues of LLT + Σ will explode as n,N →∞.

Some of the most popular methods to estimate the number of factors under the above scenario

are based on the information criteria developed by Bai and Ng (2002). Define

V (k) =1

Nn

∥∥∥Y − Lpck F

pck

∥∥∥2

F(2.13)

where Lpck and F pc

k are defined in (2.10) with the number of factors given as k. Let K ≥ r be

some fixed known constant, then Bai and Ng have proposed a series of information criteria and have

shown the following results:

Theorem 2.3.1. Under the assumptions of Theorem 2.2.1, if g(N,n)→ 0 and min(√N,√n)g(N,n)→

∞ as N,n→∞, then r defined by

r = argmin0≤k≤K V (k) + kg(N,n)

or

r = argmin0≤k≤K log V (k) + kg(N,n)

where V (k) is defined in (2.13) is a consistent estimator: limN,n→∞ P [r = r] = 1.

They then proposed 6 specific forms for the criteria and one that performs among the best based

on their empirical evaluations is

rIC1 = argmin0≤k≤K log V (k) + k

(N + n

Nn

)log

(Nn

N + n

)(2.14)

The bound K is not specified and depends on the researcher’s prior knowledge on the problem. Bai

and Ng’s criteria are known to be unrobust in practice, thus Alessi et al. (2010) proposed a modified

version:

r?IC1 = argmin0≤k≤K log V (k) + ck

(N + n

Nn

)log

(Nn

N + n

)(2.15)

where c is tuning parameter that is determined adaptively based on the data. For determing c,

the authors used a stability principle which chooses c that yields a stable r?IC1 using randomly sub-

sampled rows and columns. This improvement may be hard to execute in practice as a proper range

of c need to be given since a large enough c can stably estimate r as 0 while a small enough c can

stably estimate a large r.

Onatski (2010) developed an estimator based on the difference of two adjacent eigenvalues (ED)


of the sample covariance matrix. The estimator he proposed is

rED = maxk ≤ K : d2

k − d2k+1 ≥ δ

(2.16)

where δ > 0 is some fixed number. Denote the ordered noise variances as σ2(1) ≤ σ2

(2) ≤ · · · ≤ σ2(N).

Roughly speaking, the estimator is based on the result that if for any i, σ2(i+1)− σ

2(i) → 0 as N →∞,

then d2k − d2

k+1 → 0 for any k > r. An advantage of his estimator is that the consistency of rED can

allow for a much weaker strength of the factors:

limn,N→∞,N/n→γ>0

P [rED] = 1

for any fixed δ > 0 as long as the smallest eigenvalue of LLT explodes (tends to infinity). To optimize

the performance of rED, he also gave an iterative procedure to adaptively determine δ from the data.

Another simple criterion is proposed in Ahn and Horenstein (2013). They proposed two estima-

tors for determining the number of factors by simply maximizing the ratio of two adjacent eigenvalues

of the sample covariance matrix. The same idea can be also found in Lam and Yao (2012); Lan and

Du (2014). One specific form is:

rED = argmax0≤k≤Kd2k

d2k+1

(2.17)

with d20 =

∑min(n,N)k=1 d2

k/ log min(n,N).

Besides the above criteria, there are more methods to estimate the number of factors (Forni

et al., 2000; Amengual and Watson, 2007; Hallin and Liska, 2007) for dynamic factor models. As

we have mentioned, such dependency models are beyond the scope of this paper.

Remark. To use rIC1, rED and rER, we need to determine the upper bound K for r. There is no

theoretical result to guide choosing K. For practical usage, Onatski (2010) suggested trying several

different K to see how r changes. Ahn and Horenstein (2013) suggested using

K = min

|i ≥ 1 : d2i ≥

min(n,N)∑k=1

d2k/min(n,N)|, 0.1 min(n,N)

.

2.3.3 Methods for large matrices and weak factors

In contrast to strong factors, for weak factors the asymptotic LLT → ΣL instead of LLT /N → ΣL

where ΣL is a positive definite matrix is more appropriate. Based on our discussion in previous

sections, even for the homoscedastic noise Σ = σ2IN , neither the PCA nor MLE estimates of the

factor loadings and scores are consistent. Moreover, Theorem 2.2.3 shows that there is a phase

transition phenomenon in the limit: if ρk < γ1/4σ, then the spectral analysis from the samples


would not contain any information of the kth principal component of the signal matrix. In other

words, under the identification condition LTL is diagonal, for n,N large enough, if there is some

dk < γ1/4σ, then there would be little chance for detection of the kth factor using PCA or MLE

(Kritchman and Nadler, 2009). For the general factor model where there is heteroscedastic noise,

there would also exist a phase transition phenomenon. Thus, it would be impossible and also very

likely useless to estimate the true number of factors.

For Σ = σ2IN , we define r =∣∣dk : dk > γ1/4σ

∣∣ as the number of detectable factors. One goal

is to estimate this number of detectable factors. Raj Rao and Edelman (2008) used an AIC type

information criterion method based on RMT. The criterion is based on the distribution of

tk = (N − k) ·∑Ni=k+1 d

4k(∑N

i=k+1 d2k

)2

which is asymptotically Gaussian when n,N → ∞ and N/n → γ > 0 for k = 0, 1, · · · ,min(n,N).

The estimator they proposed is

rRE = argmin0≤k≤min(N,n)

[1

4γ2n

(N [tk − (1 + γn)]− γn)2

+ 2(k + 1)

](2.18)

In Raj Rao and Edelman (2008), the authors conjectured that rRE is a consistent estimator for r,

however, Kritchman and Nadler (2009) proved that the conjecture is not true and rRE tends to

underestimate r. Instead, Nadler (2010) proposed another modification of AIC which estimates r

more accurately:

rAIC′ = argmin0≤k≤min(N,n)

[−L(Y ; L, Ir, σ

2IN ) + 2k (2N + 1− k)]

(2.19)

where L(·) is defined in (2.2) with L and σ2 derived in Theorem 2.1.1. For estimating r, Kritchman

and Nadler (2008) also developed a consistent estimator based on a sequence of hypothesis tests

which are connected with Roy’s classical largest root test Roy (1953). It has the form:

rRMT = argmin1≤k≤min(N,n)

d2k < σ2(k) (µn,N−k + s(α)ξn,N−k)

− 1 (2.20)

which is derived based on sequential tests H0k: at most k− 1 signals versus H1k: at least k signals.

α is some significance level and values of µn,N−k, s(α) and ξn,N−k are derived using RMT. σ2(k) is

estimated unbiased via an iterative algorithm.

Instead of estimating r, one would prefer estimating the number of useful factors: r? = argmink E[‖Xk −X‖F

]where X = LF is the signal matrix and Xk is a rank k estimate of X. r? best fits the purpose if one

want an accurate estimation of the signal matrix. For Σ = σ2IN and Xpck = Lpc

k Fpck , Perry (2009)

has shown that under the assumptions of Theorem 2.2.3, when n,N → ∞ and N/n → γ > 0 we


have

r? →∣∣ρk : ρ2

k ≥ µ?F∣∣

where

µ?F = σ2

1 + γ

2+

√(1 + γ

2

)2

+ 3γ

The reason that r? ≤ r is that for some factors are too weak to be estimated accurately, though they

are detectable. Thus we prefer to ignore them to increase the accuracy of our estimation. Perry

(Perry, 2009) proposed a bi-cross-validation (BCV) method to estimate r? that we will discuss in

more detail in next chapter.

2.4 Comments

Theorems 2.1.3, 2.2.1 and 2.2.3 work for both random and non-random factor models. Thus without

additional assumptions, assuming the random and non-random factor score model are equivalent.

With additional assumptions, then assuming one of the two models can be more convenient. For

instance, if the latent variables have a time series structure then it’s more convenient to assume

random factor scores. On the other hand if the factor scores are assumed to be sparse or non-

negative (especially for specific samples or at specific locations), then assuming non-random factor

scores can be more natural.

For high-dimensional data and large matrices where both n and N are large, we believe that

for many real data there are both weak and strong factors. The strong factors are factors that are

uniformly influential on all variables while weaker factors may only have large effects on a subset of

the variables. However, there is not much theoretical work considering presence of both strong and

weak factors.

Apparently, for the general factor model assumption that the noise is heteroscedastic, there

is barely no estimators for r (either r or r?) and not many methods available in estimating the

factors and signal matrix designed for large matrices with presence of weak factors. In the next two

chapters, we propose two methods in estimating the signal matrix X = LF and the noise variance

Σ without knowing r. One method is based on maximum likelihood and BCV, the other is based

on combining optimizing a convex penalized loss function and the optimal shrinkage proposed by

Gavish and Donoho (2014).

Chapter 3

Bi-cross-validation for factor

analysis

3.1 Problem Formulation

Our data matrix is Y ∈ RN×n with a row for each variable and a column for each sample. In the

bioinformatics problems we have worked on, it is usual to have N > n or even N n, but this is

not assumed. In a factor model, Y can be decomposed into a low rank signal matrix plus noise:

Y = X + Σ12E = LF + Σ

12E, (3.1)

where the low rank signal matrix X ∈ RN×n is a product of factors L ∈ RN×r and F ∈ Rr×n, both of

rank r. The noise matrix E ∈ RN×n has independent and identically distributed (IID) entries with

mean 0 and variance 1. Each variable has its own noise variance given by Σ = diag(σ21 , σ

22 , · · · , σ2

N ).

The signal matrix X is a signal that we wish to recover despite the heteroscedastic noise.

The factor model is usually applied when we anticipate that r min(n,N). Then identifying

those factors suggests possible data interpretations to guide further study. When the factors cor-

respond to real world quantities there is no reason why they must be few in number and then we

should not insist on finding them all in our data as some factors maybe too small to estimate. We

should instead seek the relatively important ones, which are the factors that are strong enough to

contribute most to the signals and be accurately estimated.

We focus on the non-random factor score model as we treat the signal matrix X as the parameter.

As we have discussed previously, a random factor score model can be treated as a non-random factor

25

CHAPTER 3. BI-CROSS-VALIDATION FOR FACTOR ANALYSIS 26

score model by conditioning on F . our goal is to recover X, seeking to minimize

ErrX

(X)

= E[∥∥∥X −X∥∥∥2

F

](3.2)

This criterion was used for factor models in Onatski (2015) and for truncated SVDs and nonnegative

matrix factorizations in Owen and Perry (2009). After recovering X, we can estimate the factor

loadings and scores using corresponding identification conditions.

Definition 3.1.1 (Oracle rank and estimate). Let M be a method that for each integer k ≥ 0 gives

a rank k estimate XM (k) of X using Y from model (3.1). The oracle rank for M is

r?M = argmink

∥∥∥XM (k)−X∥∥∥2

F, (3.3)

and the corresponding oracle estimate of X is

XMopt = XM

(r∗M). (3.4)

If all the factors are strong enough, then for a good method M , we anticipate that r?M should

equal the true number of factors r. With weak enough factors we will have r?M < r.

Our algorithm has two steps. First we devise a method M that can effectively estimate X given

the oracle rank r?M . Then with such a method in hand, we need a means to estimate r?M . Section 3.2

describes our early stopping alternation (ESA) algorithm for finding X(k) for each k, which has

the best performance compared with other methods given their own oracle ranks. Then Section 3.3

describes our BCV for estimating k?ESA for the ESA algorithm.

3.2 Estimating X given the rank k

Here we consider how to estimate X using exactly k factors. This will be the inner loop for an

algorithm that tries various k. The goal in this section is to find a method that has good performance

with its oracle rank. We start with the likelihood function

L(Y ;X,Σ) = −Nn2

log(2π)− n

2log det Σ + tr

[−1

2Σ−1(Y −X)(Y −X)T

]. (3.5)

which is similar to (2.4). If Σ were known it would be straightforward to estimate X using an SVD,

but Σ is unknown. Given an estimate of X it is straightforward to optimize the likelihood over

Σ. Thus, if we want to maximize (3.5), it is very natural to design an alternating algorithm that

iteratively estimates X given Σ and then estimates Σ given X. Specifically, define the truncated

SVD of a matrix Y as

Y (k) =√nU(k)D(k)V (k)T (3.6)


where D(k) is the diagonal matrix of the k largest singular values of Y/√n, and U(k) and V (k) are

the matrices of the corresponding singular vectors. The iterative algorithm starts from an initial

estimate Σ(0) using the sample variance:

Σ(0) = diag((Y − 1

nY 1n×n

)(Y − 1

nY 1n×n

)T ). (3.7)

Given an estimate Σ, the rank k estimate X is the truncated SVD of the reweighted matrix Y =

Σ−12Y :

X = Σ12 Y (k). (3.8)

Given an estimate X, the new variance estimate Σ contains the mean squares of the residuals:

Σ =1

ndiag

[(Y − X

)(Y − X

)T ]. (3.9)

Both of the above two steps can increase logL(X,Σ) but not decrease it. However, as we have

discussed in Chapter 2, the likelihood (3.5) is ill-posed, thus this alternating algorithm can’t work

directly. Here we propose an even simpler early-stopping algorithm.

The main challenge for using (3.5) is to prevent any σi from approaching 0. One solution is to

instead optimize the quasi-likelihood (2.5) by EM algorithm. The other solution is to regularize Σ

to prevent σi → 0. One could model the σi as IID from some prior distribution. However, such a

distribution must also avoid putting too much mass near zero. We believe that this transfers the

singularity avoidance problem to the choice of hyperparameters in the σ distribution and does not

really solve it. We have also found in trying it that even when σi are really drawn from our prior,

the algorithm still converged towards some zero estimates.

A related approach is to employ a penalized likelihood

Lreg

(Y ;λ, X, Σ

)= −n log det Σ + tr

[Σ−1(Y − X)(Y − X)T

]+ λP

(Σ), (3.10)

where P penalizes small components σi. This approach has two challenges. It is hard to select a

penalty P that is strong enough to ensure boundedness of the likelihood, without introducing too

much bias. Additionally, it requires a choice of λ. Tuning λ by cross-validation within our bi-cross-

validation algorithm is unattractive. Also there is a risk that cross-validation might choose λ = 0

allowing one or more σi → 0.

We do not claim that the regularization methods cannot in the future be made to work. However,

we propose a much simpler approach that works surprisingly well. Our approach is to employ early

stopping. We start at (3.7) and iterate the pair (3.8) and (3.9) some number m of times and then

stop.

To choose m, we investigated 180 test cases based on the six factor designs in Table 3.1, three

dispersion levels for the σ2i , five aspect ratios γ and 2 data sizes. The details are in the Appendix.


The finding is that taking m = 3 works almost as well as if we used whichever m gave the smallest

error for each given data set.

More specifically, define the oracle estimating error using early stopping at m steps as

ErrX(m) = mink‖Xm(k)−X‖2F (3.11)

where Xm(k) is the estimate of X using m iterations and rank k. We judge each number m of steps,

by the best k that might be used with it.

For early stopping alternation (ESA), we define the oracle stopping number of steps on a data

set as

mOpt = argminm ErrX(m) = argminm mink‖Xm(k)−X‖2F . (3.12)

We have found thatm = 3 is very nearly optimal in almost all cases. We find that ErrX(3)/ErrX(mOpt)

is on average less than 1.01, with a standard deviation of 0.01 (see Appendix). Using m = 3 steps

with the best k is nearly as good as using the best possible combination of m and k. We have tested

early stopping on other data sizes, factor strengths and noise distributions, and find that m = 3 is

a robust choice. Early stopping is also much faster than iterating until a convergence criterion has

been met.

In Section 3.4.2, we compare ESA to other methods for estimating X, including PCA (SVD),

PCA after normalization of each row (variable) of the data and the quasi maximum-likelihood

method (QMLE). For the heteroscedastic noise cases and given the oracle rank of each method,

ESA performs better than PCA or PCA after data normalization in most cases. Surprisingly, it also

performs better than QMLE on average and especially when the aspect ratio N/n is not too small.

Comparing ESA with an oracle SVD method that knows the noise variance, we find that they have

comparable performance.

Given the above findings, we turn our attention to estimating the oracle rank r?ESA for ESA in

Section 3.3.

Remark. Early-stopping of iterative algorithms is a well-known regularization strategy for inverse

problems and training machine learning models like neural networks and boosting Yao et al. (2007);

Zhang and Yu (2005); Hastie et al. (2009); Caruana et al. (2001). An equivalence between early-

stopping and adding a penalty term has been demonstrated in some settings Fleming (1990); Rosset

et al. (2004).

Remark. PCA after normalization of each variable is a common standardization step to keep all

the variables in the same scale before factorization. It is equivalent to ESA starting from (3.7) with

m = 1. However, for the factor analysis model, Σ(0) is a bad estimate of Σ if the scaling of the

noise and signal is different for each variable. Using m > 1 iterations can be interpreted as using

an estimated signal matrix to improve the estimation of Σ, so ESA with m = 3 can be understood

as applying truncated SVD on a more properly reweighted data than one gets with m = 1.


3.3 Bi-cross-validatory choice of r

Here we describe how BCV works in the heteroscedastic noise setting. Then we give our choice for

the shape and size of the held-out submatrix using theory from Perry (2009).

3.3.1 Bi-cross-validation to estimate r?ESA

We want k to minimize the squared estimating error (3.3) in XESA. We adapt the BCV technique of

Owen and Perry (2009) to this setting of unequal variances. We randomly select n0 columns and N0

rows as the held-out block and partition the data matrix Y (by permuting the rows and columns)

into four folds,

Y =

(Y00 Y01

Y10 Y11

)where Y00 is the selected N0 × n0 held-out block, and the other three blocks Y01, Y10 and Y11 are

held-in. Correspondingly, we partition X and Σ as

X =

(X00 X01

X10 X11

), and Σ =

(Σ0 0

0 Σ1

).

The idea is to use the three held-in blocks to estimate X00 for each candidate rank k and then select

the best k based on the BCV estimated prediction error.

We rewrite the model (3.1) in terms of the four blocks:

(Y00 Y01

Y10 Y11

)=

(X00 X01

X10 X11

)+

(Σ0 0

0 Σ1

) 12(E00 E01

E10 E11

)

=

(L0R0 L0R1

L1R0 L1R1

)+

(Σ

120 E00 Σ

120 E01

Σ121 E10 Σ

121 E11

)

where L =

(L0

L1

)and R =

(R0 R1

)are decompositions of the factors.

The held-in block

Y11 = X11 + Σ121 E11 = L1R1 + Σ

121 E11

has the low-rank plus noise form, so we can use ESA to get estimates X11(k) and Σ1 for a given

rank k. Next for k < rank(Y11) we choose rank k matrices L1 and R1 with

X11(k) = L1R1. (3.13)

Then we can estimate L0 by solving N0 linear regression models Y T01 = RT1 LT0 + ET01Σ

1/20 , and

estimate R0 by solving n0 weighted linear regression models Y10 = L1R0 + Σ1/21 E10. These least


square solutions are

R0 = (LT1 Σ−11 L1)−1LT1 Σ−1

1 Y10, and L0 = Y01RT1 (R1R

T1 )−1

which do not depend on the unknown Σ0. We get a rank k estimate of X00 as

X00(k) = L0R0. (3.14)

Though the decomposition (3.13) is not unique, the estimate X00(k) is unique. To prove it we

need a reverse order theorem for Moore-Penrose inverses. For a matrix Z ∈ Rn×d, the Moore-Penrose

pseudo-inverse of Z is denoted Z+.

Theorem 3.3.1. Suppose that X = LR, where L ∈ Rm×r and R ∈ Rr×n both have rank r. Then

X+ = R+L+ = RT (RRT )−1(LTL)−1LT .

Proof. This is MacDuffee’s theorem. There is a proof in Owen and Perry (2009).

Proposition 3.3.1. The estimate X00(k) from (3.14) does not depend on the decomposition of

X11(k) in (3.13) and has the form

X00(k) = Y01

(Σ− 1

21 X11(k)

)+Σ− 1

21 Y10. (3.15)

Proof. Let X11(k) = L1R1 be any decomposition satisfying (3.13). Then

X00 = L0R0

= Y01RT1

(R1R

T1

)−1(LT1 Σ−1

1 L1

)−1LT1 Σ−1

1 Y10

= Y01

(Σ− 1

21 L1R1

)+Σ− 1

21 Y10 = Y01

(Σ− 1

21 X11(k)

)+Σ− 1

21 Y10.

The third equality follows from Theorem 3.3.1.

Next, we define the cross-validation prediction average squared error for block Y00 as

PEk(Y00) =1

n0N0

∥∥Y00 − X00(k)∥∥2

F.

Notice that as the partition is random, we have:

E[PEk(Y00)

]= E

[1

n0N0ErrX00

(X00(k)

)]+

1

N

N∑i=1

σ2i ,

where ErrX(X) is the loss defined at (3.11). The expectation for the left side of the equation is over

the noise and the random partition, for a fixed signal matrix. And the expectation for the right side

of the equation is just over the random partition.


The above random partitioning step is repeated independently S times, yielding the average

BCV mean squared prediction error for Y ,

PE(k) =1

S

S∑s=1

PEk(Y(s)00 )

where Y(s)00 is the held-out data for the sth repeat of the partition. The BCV estimate of k is then

r∗ = argmink PE(k). (3.16)

For using the method in practice, we investigate integer values of k from 0 to some maximum.

We cannot take k as large as min(n1, N1) where n1 = n − n0 and N1 = N − N0, for then we will

surely get σi = 0 even with early stopping. We impose an additional constraint on k to keep the

diagonal of Σ1 away from zero. If for some k we observe that

1

N1

N1∑i=1

log10

(|σ(k)i,1 |)< −6 + log10

(maxi|σ(k)i,1 |)

(3.17)

where Σ1(k) = diag(σ

(k)1,1 , σ

(k)2,1 , · · · , σ

(k)N1,1

), then we do not consider any larger values of k. The

condition (3.17) means that the geometric mean of the variance estimates is below 10−6 times the

largest one.

Remark. Owen and Perry (2009) mentioned that BCV can miss large but very sparse components

in the SVD in a white noise model, and they suggested rotating the data matrix as a remedy. As

the held-out rows and columns are selected at random, sparsity in factor loadings or scores can

greatly increase the variance in prediction error across partitions. In our problem where the noise is

heteroscedastic, we can deal with the sparsity in factor scores by replace Y with Y O where O ∈ Rn×n

is a given dense orthogonal matrix.

3.3.2 Choosing the size of the holdout Y00

We define the true prediction error for ESA as:

PE(k) =1

nN

∥∥X − XESA(k)∥∥2

F+

1

N

∑i

σ2i

and then the oracle rank is r∗ESA = argmink PE(k).

Ideally, we would like PE(k) to be a good estimate of PE(k). Actually, for the purpose of BCV,

it suffices to have r∗ (defined in (3.16)) be a good estimate of r∗ESA. Because of the inconsistency of

PCA for large matrices with presence of weak factors, the size of the holdout Y00 need to be carefully

chosen.


When it is known that Σ = σ2I, we can use the truncated SVD to estimate X and for BCV the

estimate of X00 simplifies to

X00(k) = Y01

(Y11(k)

)+Y10, (3.18)

where Y11(k) is the truncated SVD in (3.6). Perry (2009) proved that r∗ and r∗ESA track each other

asymptotically if the relative size of the held-out matrix Y00 satisfies the following theorem.

Theorem 3.3.2. Under either Assumption 1 or Assumption 2 and Σ = σ2IN , if k0 is fixed and

N/n→ γ ∈ (0,∞) as n→∞, then r?ESA and argmink E[PEk(Y00)

]converge to the same value if

√ρ =

√2√

γ +√γ + 3

(3.19)

holds, where

γ =

(γ1/2 + γ−1/2

2

)2

, and ρ =n− n0

n· N −N0

N.

Here ρ is the fraction of entries from Y in the held-in block Y11. The larger γ is, the smaller ρ

will be, thus ρ reaches its maximum when Y is square with γ = 1. For example, when γ = 1, then

ρ ≈ 22%. In contrast, if γ = 50 or 0.02, ρ then drops to only 3.5%.

Theorem 3.3.2 compares the best k for E[PE(k)

]to the best k for the true error. If the number

of repeats S →∞ and n→∞, then we also have r? → argmink E[PEk(Y00)

]thus r? − r?ESA → 0.

In our simulations, we use (3.19) to determine the size of Y00. The logic is that for a consistent

estimate of Σ, the limits of the singular values and vectors of Σ−1/2Y is the same as for the white noise

model. Thus Theorem 3.3.2 also works for Σ−1/2Y . Further, to determine n0 and N0 individually,

we make Y11 as square as possible as long as n0 ≥ 1 and N0 ≥ 1. For instance, with γ = 1 as

ρ ≈ 22%, we hold out roughly half the rows and columns of the data.

3.4 Simulation results

The empirical properties of ESA and BCV for factor analysis provide the main evidence to show

that our methods performance is better than other methods in practice. We will give a detailed

description of how we generate the data and the simulation results.

3.4.1 Factor categories and test cases

When we simulate the factor model for our tests, we will generate it as

Y = Σ1/2(Σ−1/2X + E) = Σ1/2(√nUDV T + E). (3.20)


The matrix Σ−1/2X =√nUDV T has the same low rank that X does. Here UDV T is an SVD and

we generate the matrices U and V from appropriate distributions. The reason that we rewrite (3.1)

as (3.20) is that the normalization in (3.20) allows us to make direct use of RMT in choosing D.

For our simulations, the matrix V is uniformly distributed in the space of orthogonal matrices, but

U has a non-uniform distribution to avoid making rows with large mean squared U -values coincide

with rows having large σi. Such a coincidence could make the problem artificially easy.

For the factor strength in our simulations, we want to include both strong and weak factors and

see how different methods perform. Under the asymptotics that n,N →∞, we may place each factor

into a category depending on the size of d2i (D defined either via (3.1) or (3.20)). The categories are:

1. Undetectable: d2i is below the detection threshold, thus the factor is asymptotically unde-

tectable by SVD based methods.

2. Harmful: d2i is above the detection threshold but below the estimation threshold at which their

inclusion in the model improves accuracy.

3. Useful: d2i is above the estimation threshold but is O(1). It contributes an N × n matrix to Y

with sum of squares O(n), while the expected sum of squared errors is nNσ2.

4. Strong: d2i grows proportionally to N . The factor sum of squares is then proportional to the

noise level.

The above classification shows a general limiting phenomenon for matrix factorization of high-

dimensional and large matrices. The specific value for the detection and estimation threshold de-

pends on the specific method in estimating X that is being used. For our method, which tries to

whiten the noise first and then use PCA on the reweighted data ΣY , we can choose the detection

and estimation threshold using those derived for a white noise model by RMT.

Here is a full description of the data generating mechanism:

Generating the noise

Recall that the noise matrix is Σ1/2E. The steps are as follows.

1. E = (eij)N×n: here eijiid∼N (0, 1).

2. Σ = diag(σ21 , . . . , σ

2N ): σ2

iiid∼ InvGamma(α, β). Therefore E

[σ2i

]= β/α− 1 and Var

[σ2i

]=

β2/(α− 1)2(α− 2). Parameters α and β are chosen so that E[σ2i

]= 1. We consider two

heteroscedastic noise cases: Var[σ2i

]= 1 and Var

[σ2i

]= 10. We also include a homoscedastic

case with all σ2i = 1.


Generating the signal

The signal matrix is X =√nΣ1/2UDV T , where Σ is the same matrix used to generate the noise.

Entries in D specify the strength of signals of the reweighted matrix Σ−1/2X. Based on Perry (2009),

we use the detection threshold µF =√γ and the estimation threshold

µ∗F =1 + γ

2+

√(1 + γ

2

)2

+ 3γ.

These are the two thresholds for PCA in the homoscedastic noise case with σ = 1.

We explore different combinations of factors from the four factor categories defined above. Specif-

ically, we include the 6 scenarios from Table 3.1. All of these cases have eight nonzero factors of

which one is undetectable. We anticipate that the number of harmful factors is an important vari-

able, and so it generally increases with scenario number, ranging from 1 to 6. The remaining factors

are split between strong and merely useful. By including several scenarios with equal numbers of

harmful factors, we can vary the ratio of strong to useful factors at high and low numbers of harmful

factors.

For the d2i values we take, the strong factors takes values at 1.5N , 2.5N , 3.5N , · · · . The useful

factors takes values at 1.5µ?F , 2.5µ?F , 3.5µ?F , · · · . The harmful factors takes values at equally spaced

interior points of the interval [µF , µ?F ] and the undetectable factors takes values at equally spaced

interior points of the interval [0, µF ].

For U and V , first V is sampled uniformly from the Stiefel manifold Vk(Rn). See Appendix A.1.1

in Perry (2009) for a suitable algorithm. Then an intermediate matrix U∗ is sampled uniformly from

the Stiefel manifold Vk(RN ). Using the previously generated V and Σ we solve

Σ−1/2U∗DV T = UDV T

for U . Now U is nonuniformly distributed on on the Stiefel manifold in such a way that rows of U

with large L2 norm are not necesarily those with large σ2i .

Data dimensions

We consider 5 different N/n ratios: 0.02, 0.2, 1, 5, 50 and for each ratio consider a small matrix size

and a larger matrix size, thus there are in total 10 (N,n) pairs. The specific sample sizes appear at

the top of Table 3.3. In total there are 6× 3× 5× 2 = 180 scenarios. Each was simulated 100 times,

for a total of 18,000 simulated data sets.


Scenario

1 2 3 4 5 6

# Undetectable 1 1 1 1 1 1# Harmful 1 1 1 3 3 6# Useful 6 4 3 1 3 1# Strong 0 2 3 3 1 0

Table 3.1: Six factor strength scenarios considered in our simulations.

3.4.2 Empirical properties of ESA

We use simulations to study the effectiveness and accuracy of ESA, and to empirically determine m,

the number of iteration steps before early stopping. In these simulations we know the true signal X

and so we can measure the errors. As mentioned in Section 3.2, we compare ESA to other methods

for estimating X, including PCA (SVD), PCA after normalization of each row (variable) of the data

and the quasi maximum-likelihood method (QMLE). For an estimation method M , we denote

ErrX(M) = Err(XM

Opt

)= min

kErr

(XM (k)

).

We use the six measurements below to study the effectiveness of ESA with m = 3:

1. ErrX(m = 3)/

ErrX(m = mOpt):

this compares m = 3 to the optimal m defined in (3.12).

2. ErrX(m = 3)/

ErrX(m = 1):

this measures the advantage of ESA beyond PCA after data standardization.

3. ErrX(m = 3)/

ErrX(m = 50):

this measures the advantage of stopping early, using m = 50 as proxy for iteration to conver-

gence.

4. ErrX(m = 3)/

ErrX(SVD):

this compares ESA to applying SVD (PCA) directly to the data.

5. ErrX(m = 3)/

ErrX(QMLE):

this compares ESA to the quasi maximum likelihood method, which is solved using the EM

algorithm with PCA estimates as starting values.

6. ErrX(m = 3)/

ErrX(oSVD):

this compares ESA to the truncated SVD that an oracle which knew Σ could use on Σ−1/2Y .

It measures the relative inaccuracy in X arising from the inaccuracy of Σ.


MeasurementsWhite Noise Heteroscedastic Noise

Var(σ2i ) = 0 Var(σ2

i ) = 1 Var(σ2i ) = 10

ErrX(m=3)ErrX(m=mOpt)

1.01± 0.01 1.00± 0.01 1.00± 0.01

ErrX(m=3)ErrX(m=1) 0.93± 0.09 0.90± 0.11 0.89± 0.12

ErrX(m=3)ErrX(m=50) 0.87± 0.21 0.87± 0.21 0.87± 0.21

ErrX(m=3)ErrX(SVD) 1.03± 0.06 0.81± 0.20 0.75± 0.22

ErrX(m=3)ErrX(QMLE) 1.02± 0.05 0.95± 0.15 0.91± 0.19

ErrX(m=3)ErrX(oSVD) 1.03± 0.06 1.03± 0.07 1.03± 0.08

Table 3.2: ESA using six measurements. For each of Var(σ2i ) = 0, 1 and 10, the average for every

measurement is the average over 10× 6× 100 = 6000 simulations, and the standard deviation is thestandard deviation of these 6000 simulations.

Table 3.2 summarizes the mean and standard deviation of each measurement over 6000 simu-

lations each, for Var[σ2i

]= 0, 1 and 10. Row 1 shows that ESA stopping at m = 3 steps was

almost identical to stopping at the unknown optimal m in terms of the oracle estimating error, as

the mean is nearly 1 and the standard deviation is negligible. Row 2 indicates that taking m = 3

steps brought an improvement compared with PCA (SVD on standardized data). Row 3 shows

that taking m = 3 brought an improvement compared to using m = 50, our proxy for iterating to

convergence to the local minimum of loss. The latter is highly variable. Row 4 shows that trun-

cated SVD is better than ESA when the noise is homoscedastic. But even a noise level as small

as Var[σ2i

]= E

[σ2i

]= 1 reverses the preference sharply. Row 5 shows that ESA beats QMLE on

average for the heteroscedastic noise case, though the latter has theoretical guarantee for the strong

factor scenario. Row 6 shows that an oracle which knew Σ and used it to reduce the data to the

homoscedastic case would gain only 3% over ESA.

Table 3.3 gives the average value of each measurement over 100 replications for all of the simula-

tions with mild heteroscedasticity (Var[σ2i

]= 1). “Type-1” to “Type-6” correspond to the six cases

of factor strengths listed in Table 3.1. The first panel confirms that m = 3 is broadly effective. The

second panel shows that the problem of PCA is more severe at large sample sizes. The third panel

shows in contrast that the disadvantage to m = 50 iterations is more severe at the smaller sample

sizes. The fourth panel shows similar to the second panel that SVD causes greatest losses at large

sample sizes. The fifth panel shows that ESA has great advantage over QMLE when the variable

size is large, even at a low aspect ratio γ.

It worth mentioning that Table 3.3 shows that heteroscedasticity seems to be less of a problem

when the aspect ratio is higher for all the methods. When there are only strong factors, Theorem


FactorScenario

γ = 0.02 γ = 0.2 γ = 1 γ = 5 γ = 50

(20, 1000) (100, 5000) (20, 100) (200, 1000) (50, 50) (500, 500) (100, 20) (1000, 200) (1000, 20) (5000, 100)

ErrX(m = 3)/

ErrX(m = mOpt)

Type-1 1.011 1.000 1.011 1.000 1.004 1.000 1.003 1.000 1.000 1.000Type-2 1.013 1.002 1.012 1.001 1.006 1.000 1.004 1.000 1.000 1.000Type-3 1.016 1.006 1.014 1.005 1.010 1.000 1.002 1.000 1.000 1.000Type-4 1.002 1.002 1.009 1.001 1.008 1.000 1.006 1.000 1.000 1.000Type-5 1.008 1.001 1.011 1.001 1.007 1.000 1.006 1.000 1.000 1.000Type-6 1.007 1.000 1.011 1.000 1.006 1.000 1.003 1.000 1.001 1.000

ErrX(m = 3)/

ErrX(m = 1)


ErrX(m = 3)/

ErrX(m = 50)


ErrX(m = 3)/

ErrX(SVD)


ErrX(m = 3)/

ErrX(QMLE)


ErrX(m = 3)/

ErrX(oSVD)


Table 3.3: Comparison of ESA results for various (N,n) pairs and number of strong factors in thescenarios with Var

[σ2i

]= 1.

2.2.1 has shown that the PCA estimates become consistent when N →∞, and noise heteroscedastity

is not a problem. In the presence of weak factors, the estimation threshold for weak factors increases

with aspect ratio, thus the factor strengths of the retained factors increase proportionally to the


aspect ratio and the noise perturbation becomes relatively small, making the heteroscedastity of the

noise less of a problem.

3.4.3 Empirical properties of BCV

The noise heteroscedasticy of the data we generated falls into three different groups: white noise

with Var[σ2i

]= 0, mild heteroscedasticity with Var

[σ2i

]= 1 and strong heteroscedasticity with

Var[σ2i

]= 10. In this section we begin by summarizing the mild heteroscedastic case: Var

[σ2i

]= 1.

The other cases are similar and we give some results for them later.

To measure the loss in estimating X due to using an estimate r instead of the optimal choice

r∗ESA we use a relative estimation error (REE) given by

REE(r) =‖X(r)−X‖2F‖X(r∗ESA)−X‖2F

− 1. (3.21)

REE is zero if r is the best possible rank for the specific data matrix shown, that is, if r is the same

rank an oracle would choose.

We compare the r? chosen by BCV described in Section 3.3 with 5 other methods, the PA using

permutation described in Section 2.3.1, the IC1 method defined in (2.14), the Eigenvalue difference

method ED defined in (2.16), the eigenvalue ratio method ER define in (2.17) and the AIC type

method RE defined in (2.18).

Recall that Of these methods, ER and IC1 are designed for models with strong factors only. ED

does not require strong factors to work. RE has theoretical guarantees for estimating the number

of detectable weak factors in the white noise model. Finally, PA was designed and tested under the

small N and large n scenarios. We want to compare the finite sized dataset performance of these

methods in settings with both strong and weak factors. In applications one cannot be sure that only

the desired factor strengths are present.

We also include in the comparison the use of the true number of factors as well as the oracle’s

number of factors r∗ESA defined in (3.3). Methods that choose a value closer to k∗ESA, should attain

a small error using ESA.

Figure 3.1 shows for different methods, the proportion of simulations with REE above certain

values for the mild heteroscedastic case Var[σ2i

]= 1. Figure 3.1a shows that BCV is overall best at

recovering the signal matrix X. Figure 3.1b shows that BCV becomes far better than alternatives

when we just compare the larger sample sizes from each aspect ratio. Figure 3.1c shows that at

smaller sample sizes RE is competitive with BCV. The large data case is more important given the

recent emphasis on large data problems.

Our goal is to find the best r for ESA, but the methods ED, ER. IC1 and RE are designed

assuming that the SVD will be used to estimate the factors. To study them in the setting they

were designed for, we include Figure 3.1d, which calculates REE using SVD to estimate X(k) and


Var[σ2i

]PA ED ER IC1 RE BCV

0 1.99 1.41 49.61 1.13 0.12 0.291 2.89 2.42 25.02 3.11 2.45 0.37

10 3.66 2.28 15.62 4.46 2.10 0.62

Table 3.4: Worst case REE values for each method of choosing k for white noise and two het-eroscedastic noise settings.

compares with the oracle rank of SVD. For Figure 3.1d, the BCV method also uses the SVD instead

of ESA. Though the results in Table 3.2 (Appendix) suggest that SVD is in general not recommended

for heteroscedastic noise data, if one does use SVD, BCV is still the best method for choosing r to

recover X.

The proportion of simulations with REE = 0 (matching the oracle’s rank) for BCV was 51.6%,

75.1%, 28.1% and 47.0% in the four scenarios in Figure 3.1. BCV’s percentage was always highest

among the six methods we used. The fraction of REE = 0 sharply increases with sample size and is

somewhat better for ESA than for SVD.

Table 3.4 briefly summarizes the REE values for all three noise variance cases. It shows the worst

case REE over all the 10 matrix sizes and 6 factor strength scenarios. As the variance of σ2i rises

it becomes more difficult to attain a small REE. BCV has substantially smaller worst case REE for

heterscedastic noise than all other methods, but is slightly worse than RE for the white noise case.

This is not surprising as NE is designed for the white noise model.

To better understand the differences among the methods, we compare them directly in estimating

the number of factors with the oracle. As an example, Figure 3.2 plots the distribution of r for all

methods and all 6 cases, on 5000×100 data matrices with Var[σ2i

]= 1. The results are summarized

in more detail in Tables 3.5 and 3.6. In Figure 3.2, BCV closely tracks the oracle. For other methods,

ED performs the best in estimating the oracle rank, though it is more variable and less accurate

than BCV. ER is the most conservative method, trying to estimate at most the number of strong

factors. IC1 also tries to estimate the number of strong factors, but is less conservative than ER. RE

estimates some number between the number of strong factors and the number of useful (including

strong) factors. PA has trouble identifying the useful weak factors when strong factors are present,

and also has trouble rejecting the detectable but not useful factors in the hard case with no strong

factor. This is due the fact that PA is using the sample correlation matrix which has a fixed sum of

eigenvalues, thus the magnitude of the each eigenvalue is influenced by every other one.

Tables 3.5 and 3.6 provide more details of the simulation results for this mildly heteroscedastic

case Var[σ2i

]= 1. We can see that some methods behave very differently for different sized datasets.

For example, IC1 is very non-robust and sharply over-estimates the number of factors for small

datasets, ED will tend to estimate only the number of strong factors when the aspect ratio γ is

small. Overall, BCV has the most robust and accurate performance in estimating r∗ESA of the


0 1 2 3 4 5

0%

20%

40%

60%

80%

100% PAEDERIC1NEBCV

(a) All datasets, ESA

0 1 2 3 4 5

0%

20%

40%

60%

80%

100% PAEDERIC1NEBCV

(b) Large datasets only, ESA

0 1 2 3 4 5

0%

20%

40%

60%

80%

100% PAEDERIC1NEBCV

(c) Small datasets only, ESA

0 1 2 3 4 5

0%

20%

40%

60%

80%

100% PAEDERIC1NEBCV

(d) All datasets, SVD

Figure 3.1: REE survival plots: the proportion of samples with REE exceeding the number on thehorizontal axis. Figure 3.1a-3.1c are for REE calculating using the method ESA. Figure 3.1a showsall 6000 samples. Figure 3.1b shows only the 3000 simulations of larger matrices of each aspect ratio.Figure 3.1c shows only the 3000 simulations of smaller matrices. For comparison, Figure 3.1d is theREE plot for all samples calculating REE using the method SVD.

methods we investigated.

3.5 Real data example

We investigate a real data example to show how our method works in practice. The observed matrix

Y is 15 × 8192, where each row is a chemical element and each column represents a position on a


True PA ED ER IC1 NE BCVOracle

0

2

4

6

8

10

Type−1: 0/6/1/1


0

2

4

6

8

10

Type−2: 2/4/1/1


0

2

4

6

8

10

Type−3: 3/3/1/1


0

2

4

6

8

10

Type−4: 3/1/3/1


0

2

4

6

8

10

Type−5: 1/3/3/1


0

2

4

6

8

10

Type−6: 0/1/6/1

Figure 3.2: The distribution of r for each factor strength case when the matrix size is 5000 ×100. The y axis is r. Each image depicts 100 simulations with counts plotted in grey scale(larger equals darker). For different scenarios, the factor strengths are listed as the number of“strong/useful/harmful/undetectable” factors in the title of each subplot. The true k is alwaysr = 8. The “Oracle” method corresponds to r∗ESA.


FactorType

γ = 0.02 γ = 0.2 γ = 1

Method (20, 1000) (100, 5000) (20, 100) (200, 1000) (50, 50) (500, 500)

Type-10/6/1/1

PA 0.04 5.5 0.07 7.0 0.12 4.9 0.10 6.9 0.05 5.4 0.13 7.0ED 1.93 1.7 2.29 1.3 2.27 1.3 2.40 1.0 2.42 1.2 2.40 0.6ER 2.19 0.9 2.80 0.1 1.68 1.8 2.92 0.1 1.35 2.5 2.72 0.0IC1 2.30 16.0 0.69 3.3 1.44 16.0 0.61 3.5 0.10 5.6 0.69 3.1RE 0.23 6.3 1.82 1.3 0.16 5.0 2.45 0.6 0.08 5.4 2.36 0.5BCV 0.16 5.9 0.03 5.8 0.33 4.5 0.01 5.9 0.12 5.0 0.00 6.0Oracle – 6.0 – 6.0 – 5.9 – 6.0 – 6.0 – 6.0

Type-22/4/1/1


Type-33/3/1/1


Type-43/1/3/1


Type-51/3/3/1


Type-60/1/6/1


Table 3.5: Comparison of REE and r for rank selection methods with various (N,n) pairs,and scenarios. For each different scenario, the factors’ strengths are listed as the number of“strong/useful/harmful/undetectable” factors. For each (N,n) pair, the first column is the REE

and the second column is k. Both values are averages over 100 simulations. Var[σ2i

]= 1.


FactorType

γ = 5 γ = 50

Method (100, 20) (1000, 200) (1000, 20) (5000, 100)

Type-10/6/1/1

PA 0.05 5.0 0.11 6.9 0.01 5.7 0.10 7.0ED 1.89 1.2 1.57 1.6 0.43 4.7 0.10 6.1ER 2.23 0.3 2.18 0.0 1.69 0.0 1.68 0.0IC1 1.23 16.0 0.86 2.2 0.04 5.0 1.10 1.1RE 0.14 4.9 1.17 1.7 0.20 4.2 0.14 3.9BCV 0.37 4.1 0.00 6.0 0.10 4.9 0.01 5.8Oracle – 5.9 – 6.0 – 5.8 – 5.9

Type-22/4/1/1


Type-33/3/1/1


Type-43/1/3/1


Type-51/3/3/1


Type-60/1/6/1


Table 3.6: Like Table 3.5, but for larger γ.

64 × 128 map of a meteorite. We thank Ray Browning for providing this data. Similar data are

discussed in Paque et al. (1990). Each entry in Y is the amount of a chemical element at a grid

point. The task is to analyze the distribution patterns of the chemical elements on that meteorite,

helping us to further understand the composition.


200

250

300

350

400

450

500

rank k

BC

V P

redi

ctio

n E

rror

0 1 2 3 4 5 6 7 8 9

Figure 3.3: BCV prediction error for the meteorite. The BCV partitions have been repeated 200times. The solid red line is the average over all held-out blocks, with the cross marking the minimumBCV error.

A factor structure seems reasonable for the elements as various compounds are distributed over

the map. The amounts of some elements such as Iron and Calcium are on a much larger scale than

some other elements like Sodium and Potassium, and so it is necessary to assume a heteoroscedastic

noise model as (3.1). We center the data for each element before applying our method.

BCV choose r = 4 factors, while PA chooses r = 3. Figure 3.3 plots the BCV error for each rank,

showing that among the selected factors, the first two factors can be considered as strong factors,

which are much more influential than the last two. The first column of Figure 3.4 plots the four

factors ESA has found at their positions. They represents four clearly different patterns.

As a comparison, we also apply a straight SVD on the centered data with and without standard-

ization to analyze the hidden structure. The second and third columns of Figure 3.4 shows the first

five factors of the locations that SVD finds for the original and scaled data respectively. If we do not

scale the data, then the factor (F5) showing the concentration of Sulfur on some specific locations

strangely comes after the factor (F4) which has no apparent pattern; F5 would have been neglected

in a model of three or four factors as BCV or PA suggest. The figure shows that ESA can estimate

the weak factors more accurately compared with SVD.

Paque et al.Paque et al. (1990) investigate this sort of data by clustering the pixels based on the


1:128

ESA_F1

1:128

ESA_F2

1:128

ESA_F3

1:128

ESA_F4

1:128

1:64

SVD_F1

1:128

1:64

SVD_F2

1:128

1:64

SVD_F3

1:128

1:64

SVD_F4

1:64

SVD_F5

1:128

1:64

scale_F1

1:128

1:64

scale_F2

1:128

1:64

scale_F3

1:128

1:64

scale_F4

1:64

scale_F5

Figure 3.4: Distribution patterns of the estimated factors. The first column has the four factorsfound by ESA. The second column has the top five factors found by applying SVD on the unscaleddata. The third column has the top five factors found by applying SVD on scaled data in which eachelement has been standardized. The values are plotted in grey scale, and a darker color indicates ahigher value.


values of the first two factors of a factor analysis. We apply such a clustering in Figure 3.5. The plot

shows that ESA can estimate the factor scores of the strong factors more accurately. Column (a)

shows the resulting clusters. The factors found by ESA clearly divide the locations into five clusters,

while the factors found by an SVD on the original data blur the boundary between clusters 1 and 5.

An SVD on normalized data (third plot in column (a)) blurs together three of the clusters. Columns

(b) and (c) of Figure 3.5 show the quality of clustering using k-means based on the first two plots

of Column (a). Clusters, especially C1 and C5, have much clearer boundaries and are less noisy if

we are using ESA factors than using SVD factors. A k-means clustering depends on the starting

points. For the ESA data the clustering was stable. For SVD the smallest group C3 was sometimes

merged into one of the other clusters; we chose a clustering for SVD that preserved C3.

In this data the ESA based factor analysis found factors that, visually at least, seem better. They

have better spatial coherence, and they provide better clusters than the SVD approaches do. For

data of this type it would be reasonable to use spatial coherence of the latent variables to improve the

fitted model. Here we have used spatial coherence as an informal confirmation that BCV is making

a reasonable choice, which we could not do if we had exploited spatial coherence in estimating our

factors.


−0.01 0.01 0.03

−0.

020.

000.

020.

04

F1

F2

ESA

−0.02 0.00 0.01 0.02 0.03

−0.

020.

000.

020.

04

F1

F2

SVD

−0.02 0.00 0.02 0.04

−0.

15−

0.05

0.00

F1

F2

scale

(a)

1:128

1:64

ESA_C1

1:128

1:64

ESA_C2

1:128

1:64

ESA_C3

1:128

1:64

ESA_C4

1:128

1:64

ESA_C5

(b)

1:128

1:64

SVD_C1

1:128

1:64

SVD_C2

1:128

1:64

SVD_C3

1:128

1:64

SVD_C4

1:128

1:64

SVD_C5

(c)

Figure 3.5: Plots of the first two factors and the location clusters. The three plots of column (a)are the scatter plots of pixels for the first two factors found by the three methods: ESA, SVD onthe original data and SVD on normalized data. The coloring shows a k-means clustering result for5 clusters. Column (b) has the five clustered regions based on the first two factors of ESA. Column(c) has the five clustered regions based on the first two factors of SVD on the original data aftercentering. The same color represents the same cluster.

Chapter 4

An optimization-shrinkage hybrid

method for factor analysis

4.1 A joint convex optimization algorithm POT

4.1.1 the objective function

As we have discussed in Chapter 3, maximizing the log-likelihood function (3.5) directly would not

work as the global optimization solution can arbitrarily have σi → 0. In this chapter, we switch to

considering an alternative objective function which is not ill-posed and allows us to jointly estimate

X and Σ. The

Let Y = (yij)N×n, X = (xij)N×n and still consider the model (3.1). Define the objective function

as

Lλ(X,Σ;Y ) = L0(X,Σ;Y ) + λ ‖X‖? = n

N∑i=1

σi +

N∑i=1

n∑j=1

(yij − xij)2

σi+ 2√nλ‖X‖? (4.1)

We estimate X and Σ by (Xλ, Σλ

)= argminX,Σ Lλ(X,Σ) (4.2)

The loss L0(X,Z;Y ) is based on an idea proposed by Huber (2011) to jointly estimate σ and β

in an regression model Yi = XTi β+σ2Ei. Huber estimated β and σ2 by minimizing nσ+

∑ni=1(Yi−

XTi β)2/σ which is jointly convex in (β, σ) and yields the same estimates of β and σ as MLE. Such

a huber technique is also called perspective transformation in convex optimization (Owen, 2007).

L0(X,Z;Y ) is also jointly convex in (X,σ1, · · · , σN ). More importantly, it is not ill-posed since

L0(X,Z;Y ) is bounded below by 0. To get a low-rank matrix estimation X, we impose a nuclear

norm penalty on X. The nuclear norm penalty has been widely used in low-rank matrix recovery

48

CHAPTER 4. AN OPTIMIZATION-SHRINKAGE HYBRIDMETHOD FOR FACTORANALYSIS49

and completion (Recht et al., 2010) as ‖X‖? is convex in X. The nuclear penalty is a relaxation

of a rank constraint on X. A larger value of the tuning parameter λ results in a lower rank of

Xλ. We name the joint optimization algorithm with objective function (4.1) as POT (Perspective

transformation Optimization with Trace norm penalty).

4.1.2 Connection with singular value soft-thresholding

When Σ = IN , then

Lλ(X, IN ;Y ) = nN + ‖Y −X‖2F + 2√nλ ‖X‖? (4.3)

and Xλ has explicit forms (Parikh and Boyd, 2014, Chap. 6.7.3). As before, denote the SVD of Y as

Y =√nUDV T where D = diag(d1, · · · , dmin(N,n)). Define Dλ = diag

((d1 − λ)+, · · · , (dmin(N,n) − λ)+

).

Then minimizing the objective function (4.3) gives

Xλ =√nUDλV

T (4.4)

Dλ is soft-thresholding of the singular values of Y . In other words, the solution Xλ keeps the sample

singular vectors but applies soft-thresholding to the sample singular values.

4.1.3 Connection with square-root lasso

By taking the derivative of (4.1) with respect to σi, we can plug in

σi =

√√√√ 1

n

n∑j=1

(yij − xij)2 =1√n‖Yi· −Xi·‖2

to (4.1) and and change the objective function to

Lλ(X;Y ) =

N∑i=1

‖Yi· −Xi·‖2 + λ‖X‖? (4.5)

Equation (4.5) is closely related to the square-root lasso method proposed in Belloni et al. (2011)

for linear regression. Consider the multiple regression problem Y = BZ+Σ1/2E where B ∈ RN×p is

the coefficient parameter matrix and Z ∈ Rp×n is the matrix of known covariates. Consider the case

where p is very large and we want a sparse estimate of B. The square-root lasso method estimates

each Bi· separately by minimizing the objective function

Li(Bi·;Yi·) = ‖Yi· −Bi·Z‖2 + λi ‖Bi·‖1

As discussed in Belloni et al. (2011), the main advantage of square-root lasso is that it’s “pivotal”

in that the scale of the tuning parameter λi is irrelevant to the noise variance σi. Thus, we can set


λi ≡ λ for some λ and rewrite the square-root lasso objective function for the multiple regression as

Lλ(B;Y ) =

N∑i=1

Li(Bi·;Yi·) =

N∑i=1

‖Yi· −Bi·Z‖2 + λ ‖B‖1

which has very similar form as (4.5).

4.2 Some heuristics of the method

In this section, we want to provide some understanding of the solutions Xλ, and the choice of λ.

These results, though lacking rigorous mathematical justifications, can guide for using the method

in practice.

Negahban et al. (2012) has provided a general theory for the error rate of the estimates θλn ∈argminθ L(θ;Zn) + λnR(θ) where θ is the vector of parameters, Zn is the observed data and R(θ)

is some penalty function which is supposed to be a norm. They require L(θ;Zn) to be a convex

and differentiable function in θ. However, in our problem, L0(X;Y ) =∑i

√∑j(xij − yij)2 is not

differentiable in X, thus unfortunately we can not apply their theory directly.

4.2.1 The theoretical scale of λ

We define the optimal λ minimizing the estimation error of X as

λOpt = argminλ

∥∥∥Xλ −X∥∥∥

2(4.6)

How would the scale of λOpt change with the dimension?

To avoid confusion, we denote the true value of X as X? in this sub-section. For the es-

timates θλn ∈ argminθ L(θ;Zn) + λnR(θ) discussed in Negahban et al. (2012), they require

λn ≥ cR? (∇L(θ?;Zn)) with high probability to guarantee an upper bound controlling the er-

ror rate ‖θλn − θ?‖2. Here, θ? is the true value of θ, R?(·) is the dual norm of R(·) defined as

R?(v) = supR(u)≤1 uT v and c > 1 is some constant. Also, once λn ≥ R? (∇L(θ?;Zn)), the upper

bound is monotonically increasing with λn.

If we use their result in our problem, then we would need λ > ‖∇L0(X?;Y )‖op since the dual

norm of the nuclear norm is operator norm ‖ · ‖op (the largest singular value of the matrix) and


L0(X;Y ) is differentiable at X? with probability 1 if P [eij = 0] = 0. In other words,

λ >

∥∥∥∥∥∥∥ yij − x?ij√∑n

j′=1(x?ij′ − yij′)2

N×n

∥∥∥∥∥∥∥op

=

∥∥∥∥∥∥∥ εij√∑n

j′=1(εij′)2/n

N×n

∥∥∥∥∥∥∥op

/√n

=

(1√n

+ o

(1√n

))|E‖op

Under the asymptotics that n,N → ∞ and N/n → γ > 0, we have ‖E‖op/√n → 1 +

√γ, thus we

can set the theoretical value of λ as λtheo = 1 +√γ, which is the detection threshold in RMT for

σ = 1.

Another heuristic to derive λtheo is that we would always want the true parameter value be our

solution. Especially when X? = 0, we would like also the estimated X = 0, which is equivalent to

0 ∈ ∇L0(0; Σ1/2E) + λ∂‖X‖? |X=0

This results in λ > ‖∇L(0; Σ1/2E)‖∞ which gives the same λtheo

In practice, we find from our simulations that the actual λOpt follows the trend of λtheo well as

the size of the data matrix changes, though it’s likely to be smaller. We develop a cross-validation

technique to find the actual λOpt from a sequence of candidate λ around λtheo. This will be discussed

in detail in Section 4.4.

4.2.2 The bias in using the nuclear penalty

In our simulations, we find that even XλOpthas a rank that can be much higher than the true rank

of X. Also, when there exist strong factors in the data, XλOpt can be a worse estimator compared

even with the PC estimator Xpc = LpcF pc. This phenomenon stays even for the white noise model

when Σ = IN and XλOptis estimated from (4.3). The phenomenon is due to the bias in estimating

a low-rank matrix introduced by the nuclear penalty. The RMT provides tools to understand (4.3)

under Σ = IN .

As discussed in Section 4.1.2, (4.3) has a closed form solution (4.4) which is keeping the singular

vectors but applying soft-thresholding to the singular values of Y . Under the assumptions of Theorem

2.2.3 and the asympotics N,n→∞ and N/n→ γ, then based on the calculations in Shabalin and


Nobel (2013); Gavish and Donoho (2014), if λ ≥ λtheo = 1 +√γ we have

1

n

∥∥∥Xλ −X∥∥∥2

F

a.s.→r∑

k=1

ρ2k +

r∑k=1

(ρk − λ)

2+ − 2 (ρk − λ)+ ρkθkθk

(4.7)

where ρk, ρk, θk and θk are defined in Theorem 2.2.3. We can show

Proposition 4.2.1. Define

L∞(λ) =

r∑k=1

ρ2k +

r∑k=1

(ρk − λ)

2+ − 2 (ρk − λ)+ ρkθkθk

Then under the assumptions of Theorem 2.2.3, L∞(λ) is a increasing function of λ when λ ≥ 1+

√γ.

Moreover, limλ↓(1+√γ)∇λL∞(λ) > 0.

Proof. Denote ρr+1 = 1 +√γ. As L∞(λ) is continuous, to show that it is increasing in λ we only

need to show that L∞(λ) is increasing in [ρk+1, ρk) for any k = 1, 2, · · · , r. Given a K, the function

L∞(λ) is quadratic in [ρK+1, ρK):

L∞(λ) =

r∑k=1

ρ2k +

K∑k=1

(ρk − λ)

2 − 2 (ρk − λ) ρkθkθk

Then, h

∇λL∞(λ) = −2

K∑k=1

(ρk − λ) +

K∑k=1

ρkθkθk

= 2K

λ− ∑Kk=1

(ρk − ρkθkθk

)K

.

It’s enough if we can show that ρk − ρkθkθk is a strictly decreasing function of ρk when ρk > γ1/4

and ρk − ρkθkθk = 1 +√γ when ρk ≤ γ1/4. The reason is that then we can have ∇λL∞(λ) > 0

inside [ρK+1, ρK) for any K = 1, 2, · · · , r.By plugging in the expression of ρk, θk and θk in Theorem 2.2.3 with σ = 1, when ρk > γ1/4 we

get

ρk − ρkθkθk =

√(ρ2k + 1)(ρ2

k + γ)

ρk− 1

ρk

ρ4k − γ√

(ρ2k + 1)(ρ2

k + γ)

Taking derivative respect to ρk, we get

∇ρkρk − ρkθkθk

=−(1 + γ)ρ6

k − 6γρ4k − 3γ(1 + γ)ρ2

k − 2γ2

ρ2k(ρ2

k + 1)2(ρ2k + γ)2

< 0


Thus ρk − ρkθkθk is a strictly decreasing function of ρk when ρk > γ1/4. When ρk ≤ γ1/4, we have

θk = θk = 0 and ρk = 1 +√γ, thus ρk − ρkθkθk = 1 +

√γ when ρk ≤ γ1/4.

Proposition 4.2.1 indicates that asymptotically λOpt/√n < 1 +

√γ, in other words the optimal

soft-thresholding will include even more than the true number of factors to minimize the estimation

error of X. Also from the above proof we can see that the larger the factors strength (ρ1, · · · , ρr)are, the smaller λ is which means that the rank XλOpt increases when there are stronger factors,

making XλOptless accurate. From our simulations, we find that the solution of (4.1) has the same

problem.

To overcome the bias of soft-thresholding, Shabalin and Nobel (2013) proposed the optimal

shrinker (2.12), which is the optimal shrinkage of the eigenvalues that minimizes the asymptotic

estimation error. Assume Σ = σ2IN , the optimal shrinkage estimator has the form

Xsk =√nUη

(D)V T (4.8)

where η(D) = diag(η(d1), · · · , η(dmin(N,n))

). The shrinkage function η(·) is defined as

η(d) =

σ2

d

√(d2

σ2 − γ − 1)2 − 4γ if d ≥ (1 +

√γ)σ

0 Otherwise(4.9)

Comparing (4.9) with soft-thresholding, the optimal shrinkage has the property that it shrinks larger

eigenvalues less but shrinks smaller eigenvalues more. This provides another point of view why soft-

thresholding works badly when there are strong factors: it shrinks the larger sample singular values

too much while they are actually close to the true values. The optimal shrinkage can be more

accurate than soft-thresholding, but it’s hard to generalize it to fit the heteroscedastic noise factor

analysis model. Unfortunately, there is no convex penalty function to replace the nuclear norm ‖X‖?that has the optimal shrinkage as the solution.

4.3 A hybrid method: POT-S

Based on the discussion of Section 4.2.2, we propose a hybrid method (POT-S) that combine the

POT minimizing (4.1) with the optimal shrinkage (4.8).

Once we know Σ, then we can apply to the whitened data matrix Σ−1/2Y with the optimal

shrinkage (4.8) given σ2 = 1. However, we don’t know Σ, thus one hybrid approach is to first

estimate Σ using Σλ which is estimated from minimizing (4.1), and then apply optimal shrinkage to

Σ−1/2λ Y . More specifically, for a given λ, let Yλ = Σ

−1/2λ Y and

X?λ =√n · Σ1/2

λ UYλη(DYλ

)V TYλ, Σ?λ = Σλ. (4.10)


Here, for a matrix Z ∈ RN×n, we use UZ , VZ and DZ to denote it’s left and right singular vectors

and the singular values with Z =√nUZDZV

TZ .

However, one drawback of X?λ is that it only depends on the estimate Σλ. Another choice is

to have our estimate depend on both Σλ and Xλ. As the main problem of Xλ is that it shrinks

the large singular values too much while inadequately shrinking the small singular values, we can

replace the singular values of Xλ with the singular values of X?λ:

X??λ =

√n · UXλDX?λ

V TXλ, Σ??λ = Σλ. (4.11)

We also define the optimal λ for X?λ and X??

λ respectively as

λ?Opt = argminλ

∥∥∥X?λ −X

∥∥∥2

F,

λ??Opt = argminλ

∥∥∥X??λ −X

∥∥∥2

F.

From our simulations (reference to the empirical results), we find that both X??λ??Opt

and X??λ??Opt

are

more accurate than X?λ?Opt

and X?λ?Opt

. Thus, given λ, we propose the POT-S method using X??λ and

X??λ as the final estimates.

When applying POT-S to a dataset, we need to determine λ. The goal is to find λ that is as

close as possible to the unknown λ??Opt. We will use a Wold-style cross-validation approach which is

discussed in the next section.

4.4 Wold-style cross-validatory choice of λ

We use a cross-validation technique to determine the optimal λ.

There are two types of cross-validation for unsupervised learning. One is bi-cross-validation

(BCV) discussed in Chapter 3 which randomly holds out blocks of the data matrix. The other one

is Wold-style cross-validation which randomly holds out entries of the data matrix. Both techniques

are effective for selecting tuning parameters based on prediction performance if used properly. In

Chapter 3, we used BCV because of the theory proposed by Perry (2009) on choosing the size of

the holdout matrix. For POT and POT-S, we use the Wold-style cross-validation mainly because

of three reasons: 1) for BCV, we find empirically that the estimation of λ is sensitive to the size of

the holdout matrix while finding the theory for choosing the size for POT/POT-S; 2) for Wold-style

cross-validation, the convex optimization step in POT/POT-S can easily handle missing entries; 3)

estimation of λ is not very sensitive to the fraction of hold-out entries once the fraction is small in

Wold-style cross-validation.

The Wold-style cross-validation (Wold, 1978) for a matrix Y start with uniformly and randomly

selecting some entries from Y as held-out data and the rest as held-in data. We apply POT-S to the


held-in data, and calculate the prediction error of the held-out entries. Then we repeat this random

entry selection step for several times and choose λ that minimizes the average prediction error.

Specifically, for one random selection, define an index matrix M = (mij)N×n ∈ 0, 1N×n.

Iij = 1 if the entry is held-in and 0 if held-out. Then we estimate X and Σ from only the held-in

data, treating the held-out entries as missing. Let ni =∑nj=1mij for i = 1, 2, · · · , N . For the joint

convex optimization step, the objective function can be changed from (4.5) to

Lλ(X;Y,M) =

n∑i=1

√nin

∑j

mij(xij − yij)2 + λ‖X‖? (4.12)

Define the joint convex optimization estimates as

Xλ,M = argminX Lλ(X;Y,M), and σ2i,λ,M =

∑jmij(xij,λ,M − yij)2

ni

For the optimal shrinkage step, we need a full data matrix Y , thus the strategy is that we first

fill in the held-out entries based on the held-in entries. A direct approach is to replace the missing

held-out entries by the entries in Xλ,M in the corresponding positions. However, for an entry

yij = xij + σieij , xij,λ,M would not be a good approximation of yij since it approximates xij but

doesn’t include the noise term σieij . Thus, another approach is to also estimate σieij using bootstrap.

The estimate σieij is estimated by random sampling from yijt − Xijt,λ,M ; t = 1, 2, · · · , ni where

yijt are non-missing entries in the ith row. The held-out entry yij is then fulfilled by xij,λ,M + σieij .

Denote the fulfilled matrix by Yλ,M , then the POT-S estimate X??λ,M is given by applying the

optimal shrinkage step of POT-S to Yλ,M with Σλ,M . Then, the prediction error for the held-out

entries are

PEλ(Y,M) =

∑i,j,mij=0

(yij − x??ij,λ,M

)2

∑i,j 1mij=0

The above random entry selection step is repeated independently for S times, yielding the average

Wold-style cross-validation mean squared prediction error for Y :

PE(λ) =1

S

S∑s=1

PEλ(Y,M (s)),

where M (s) is the index matrix for the sth repeat of random entry selection. Finally, the cross-

validation estimate of λ is

λ?? = argminλ∈[λ,λ] PE(λ) (4.13)

We use a grid search to find λ?? within a range [λ, λ]. We find that setting λ = 0.5λtheo and

λ = 1.3λtheo where λtheo = 1 +√γ gives a wide enough range.


We also find empirically (refer to the simulation table) that the bootstrap-CV approach is better

than simply filling in the held-out entries by entries of Xλ,M . Thus, the POT-S method adopts

the bootstrap-CV approach for cross-validation. For the fraction of entries to be held-out, we find

empirically that holding 10% of entries out makes λ?? a good estimate of λ??Opt.

4.5 Computation: an ADMM algorithm

In this section, we describe the algorithm minimizing the objective function Lλ(X;Y, I) in (4.12).

Optimizing the original objective (4.5) is a special case with ni ≡ n. We denote the Hadamard

product of two matrices A = (aij)m×n and B = (bij)m×n as A B = (aijbij)m×n. M = (mij)N×n

is the indicator matrix of held-in entries.

4.5.1 The ADMM algorithm

For notation convenience, denote

f(X) =

n∑i=1

√nin

∑j

mij(xij − yij)2

g(X) = ‖X‖?

since both f and g are not smooth, we use ADMM to solve the problem (Boyd et al., 2011).

For a given α, the ADMM algorithm for this problem is

Xk+1 = proxαf (Zk − Uk),

Zk+1 = proxαλg(Xk+1 + Uk), and

Uk+1 = Uk +Xk+1 − Zk+1

(4.14)

where proxh(·) is the proximal operator for a function h(·). The proximal operator is defined as

proxh(v) = argminu h(u) +1

2‖u− v‖22.

Remember that for a matrix Z, its SVD is denoted as Z =√nUZDZV

TZ . Also, for a diagonal matrix

D = diag(d1, d2, · · · , dm), we denote Dλ = diag ((d1 − λ)+, · · · , (dm − λ)+) as the soft-thresholding.


Fact 1. Both proxαf (·) and proxαλg(·) have close forms:

proxαf (W ) =

(Y + diag

(1−

α ·√ni/n

‖(Wi· − Yi·) Mi·‖2

)+

(W − Y )

)M (4.15)

+W (1N1Tn −M), and

proxαλg(W ) =√nUWDW,αλV

TW . (4.16)

Proof. For proxαf ,

proxαf (W ) = argminX f(X) +1

2α‖X −W‖2F

= argminX

n∑i=1

√nin

∑j

mij(xij − yij)2 +1

2α‖Xi· −Wi·‖22

proxαf (W )i = argminXi·

√nin

∑j,Iij=1

(xij − yij)2 +1

2α‖Xi· −Wi·‖22

= Yi· + argminXi·

√nin

∑j

mij x2ij +

1

2α‖Xi· + Yi· −Wi·‖22

=

(Yi· +

(1−

α ·√ni/n

‖(Wi· − Yi·) Mi·‖2

)+

(Wi· − Yi·)

)Mi· +Wi· (1Tn −Mi·)

Thus, (4.15) holds.

For proxαλg,

proxαλg(W ) = argminX ‖X‖? +1

2αλ‖X −W‖2F

=√nUWDW,αλV

TW

The above fact shows that at each step of the ADMM iteration, X is first shrinked towards Y

and then is shrinked towards 0. The size of λ decides which direction of the shrinkage dominates.

We can have a low-rank estimated X when λ is large enough. When λ is too small, some of the σ2i

will be estimated as 0.

We adopt the stopping rule used in Boyd et al. (2011). The ADMM algorithm is terminated

when both the primal and dual residuals are small. Here, the primal and dual residuals are

Rk = Xk − Zk, and Sk =1

α(Zk+1 − Zk).


It is shown that

f(Xk) + λg(Zk)− p? ≤ 1

α‖Uk‖F ‖Rk‖F + ‖Xk −X‖F ‖Sk‖F

Let εabs > 0 be an absolute tolerance and εrel be the relative error per entry. We stop if both

‖Rk‖F ≤√nNεabs + εrel max‖Xk‖F , ‖Zk‖F

‖Sk‖F ≤√nNεabs + ρεrel‖Uk‖F

4.5.2 Techniques to reduce computational cost

The above algorithm actually is very expensive when used for the POT-S method. In this section,

we discuss three techniques we use to reduce the computational cost: varying the penalty parameter,

an approximate SVD and warm starts.

Varying step size α

The above ADMM actually converges very slowly when we want to achieve the desired accuracy.

One modification to reduce the iterations towards convergence is to change the step size α on every

iteration Boyd et al. (2011). We use this simple scheme

αk+1 =

αk/2 if ‖Rk‖F > 10‖Sk‖F

2αk if ‖Sk‖F > 10‖Rk‖F

αk otherwise

The reason behind it is that if α is smaller than it penalize more on the primal residual Rk, when

α is larger it reduces the dual residual.

Acceleration by avoiding full SVD

The bottleneck of the computation in each iteration is the SVD step required for computing proxαλg(W ).

A full SVD to compute all the singular values and vectors of W will be very time-consuming. How-

ever, if W is known to be low-rank, then there is no need for full SVD. Also, computing the singular

value and vector pairs only when di(W ) > αλ is adequate for the soft-thresholding purpose. Both

the above two reasons suggest a partial SVD. Computing partial SVD for these soft-eigenvalue-

thresholding iterations is also widely suggested in the matrix completion literature Cai et al. (2010).

Partial SVD is to only compute the first K singular values and vectors. As our code is written

in R, we use the “svd” package which provides the PROPACK (Larsen, 1998) and nu-TRLAN

(Yamazaki et al., 2010) implementations of partial SVD. From our experience, we find that nu-

TRLAN is slightly faster than PROPACK and less likely to yield an error message. Also, if K is not


small enough (K > 0.2 min(N,n)), there is no acceleration by computing partial SVD using either

PROPACK or nu-TRLAN, which is suggested in Wen et al. (2012) and also our simulations. In that

situation we will then switch back to the full SVD.

To compute the partial SVD of a matrix W , we need to decide the rank K for W . As suggested

in Cai et al. (2010), we can use information from previous iterations. Here is how this is done. We

initialize with a low rank Z (either using Z = 0 or Z = X computed from a larger λ). Then, as

iterations goes on, the rank of Zk tends to increase slowly. We guess an upper bound of rank(Zk+1)

as rk+1 = rank(Zk) + 5. After computing Zk+1 using rk+1, if rank(Zk+1) < rk+1 then our guess

succeeded. If not, we will recompute Zk+1 using the full SVD.

Warm start

In the cross-validation step, we need to select the best λ from the range [λ, λ] via grid search. Thus,

we indeed need to find a solution path of minimizing the objective function Lλ(X;Y, I) in (4.12) for

λ < λ2 · · · < λM .

To compute the solution path, we start from the largest λ, which gives a very low-rank estimate

Xλ,I . To compute the solution for λm+1, we use the final values of X, Z and U defined in (4.14)

when solving for λm as the starting values of X, Z and U in the optimization for λm+1. This is

called “warm start” in the optimization literature (Boyd et al., 2011). Also, since a small λ results

in a higher rank X which needs a lot time to compute, we want to avoid useless cross-validation for

small λ. In the cross-validation step, we start from the largest λ and stop when min(σi) = 0 for

some i ∈ 1, 2, · · · , N.

4.6 Simulation results

For our simulations, we use the same data generating scheme as described in section 3.4.1. As the

properties of ESA-BCV have been compared thoroughly with other currently existing methods and

ESA-BCV show an advantage, here we mainly compare POT-S with ESA-BCV.

4.6.1 Compare the oracle performances

Here we compare the oracle performances of five methods, the ESA method in Chapter 3 (ESA),

the quasi-maximum likelihood method (QMLE), the method using only joint convex optimization

solution Xλ in (4.2) (POT), the hybrid method using X?λ (POT-S-0) and the hybrid method using

X??λ (POT-S). The oracle estimation error of a method M is denoted as ErrX(M). For ESA and

QMLE, it’s defined as

ErrX(ESA) = Err(XESA(kESA

Opt ))

= mink

Err(XESA(k)

)


ErrX(QMLE) = Err(XQMLE(kQMLE

Opt ))

= mink

Err(XQMLE(k)

)where XESA(k) and XQMLE(k) are the estimates given the rank k. For the three methods based on

the joint convex optimization, the oracle estimation errors of X are denoted as

ErrX(POT) = Err(XλOpt

), ErrX(POT-S-0) = Err

(X?λ?Opt

), and

ErrX(POT-S) = Err(X??λ??Opt

)Table 4.1 compares the oracle error in estimating X of POT-S with four other approaches. It

is clear that POT-S has the smallest oracle error. The detailed result for each factor strength

scenario and matrix size combination is shown in Table 4.3. Table 4.3 shows that when there are

many strong factors but few weak factors (Scenarios 2, 3 and 4), POT which uses only the joint

convex optimization solution, performs the worst compared with other methods. Because of the

optimal shrinkage step, POT-S and POT-S-0 perform better than ESA which basically apply hard-

thresholding on the singular values and POT which is based on singular value soft-thresholding.

Comparing the performance of POT-S and POT-S-0, we see that X??λ which depends on both Xλ

and Σλ has better oracle error than X?λ which only relies on Σλ. This should convince the readers

that in POT-S we should adopt X??λ as the estimator instead of X?

λ.


Var[σ2i

]= 0 Var

[σ2i

]= 1 Var

[σ2i

]= 10

ErrX(POT-S)ErrX(ESA) 0.80± 0.07 0.80± 0.09 0.81± 0.12

ErrX(POT-S)ErrX(QMLE) 0.81± 0.07 0.76± 0.14 0.73± 0.17

ErrX(POT-S)ErrX(POT) 0.77± 0.14 0.79± 0.14 0.80± 0.14

ErrX(POT-S)ErrX(POT-S-0) 0.83± 0.17 0.86± 0.17 0.87± 0.18

Table 4.1: Assess the oracle error in estimating X using four measurements. For each of Var[σ2i

]= 0,

1 and 10, the average for every measurement is the average over 10 × 6 × 100 = 6000 simulations,and the standard deviation is the standard deviation of these 6000 simulations.

POT-S also can estimate Σ more accurately. For an estimator Σ, define the estimation error of

Σ as

R(

Σ,Σ)

=

n∑i=1

| log(σ2i )− log(σ2

i )|

Then we compare the error in estimating Σ when the oracle error in estimating X is achieved. In


other words, define

ErrΣ(ESA) = R(

ΣESA(kESAOpt ),Σ

)ErrΣ(QMLE) = R

(ΣQMLE(kQMLE

Opt ),Σ)

ErrΣ(POT) = R(

ΣλOpt ,Σ), ErrΣ(POT-S-0) = R

(Σλ?Opt

,Σ)

ErrΣ(POT-S) = R(

Σλ??Opt,Σ)

Then the comparison among methods is summarized in Table 4.2. A more detailed result is in

Table 4.3. We do not compare the oracle error in estimating Σ as what we did in estimating X

because of two reasons. One is that the oracle error in estimating X and Σ usually can’t be achieved

at the same tuning parameter, and the other is that achieving the oracle error in estimating Σ

without knowing the true Σ is hard to realize.


Var[σ2i

]= 0 Var

[σ2i

]= 1 Var

[σ2i

]= 10

ErrΣ(POT-S)ErrΣ(ESA) 0.79± 0.24 0.79± 0.23 0.80± 0.23

ErrΣ(POT-S)ErrΣ(QMLE) 1.00± 0.22 0.82± 0.30 0.73± 0.34

ErrΣ(POT-S)ErrΣ(POT) 0.69± 0.22 0.69± 0.22 0.69± 0.22

ErrΣ(POT-S)ErrΣ(POT-S-0) 1.02± 0.08 1.01± 0.08 1.00± 0.08

Table 4.2: Assess the error in estimating Σ when the oracle estimate of X is achieved. For each ofVar

[σ2i

]= 0, 1 and 10, the average for every measurement is the average over 10× 6× 100 = 6000

simulations, and the standard deviation is the standard deviation of these 6000 simulations.

The above tables show a very promising result. The oracle error of POT-S is smaller than other

existing methods. The next step is to assess the efficiency of our cross-validation technique to find

λ??Opt for POT-S adaptively from the data.

4.6.2 Assess the accuracy in finding λ??Opt

Here we empirically test how accurate the Wold-style cross-validation is in estimating λ??Opt. We

compare the approach that fills in the held-out entries by only the corresponding entries of the

held-in estimate Xλ,I (Wold-CV) and the approach that combines cross-validation and bootstrap in

Section 4.4 (CV-Boot). As a reference comparison, we also include in the ESA-BCV method.

The oracle estimate refers to X??λ??Opt

using POT-S. Similar to Section 3.4.3, we calculate the

excessive error using Wold-CV or CV-Boot compared with the oracle. For an estimate λ using


# ofdominatingfactors

γ = 0.02 γ = 0.2 γ = 1 γ = 5 γ = 50

(20, 1000) (100, 5000) (20, 100) (200, 1000) (50, 50) (500, 500) (100, 20) (1000, 200) (1000, 20) (5000, 100)

ErrX(POT-S)/

ErrX(ESA)


ErrX(POT-S)/

ErrX(QMLE)


ErrX(POT-S)/

ErrX(POT)


ErrX(POT-S)/

ErrX(POT-S-0)


Table 4.3: Four measurements comparing the oracle error in estimating X under various (N,n) pairsand factor strength scenario with Var(σ2

i ) = 1. Type-1 to Type-6 correspond to the six scenarios inTable 3.1

either Wold-CV or CV-Boot, define

REE(λ) =

∥∥∥X??λ−X

∥∥∥2

F∥∥∥X??λ??Opt−X

∥∥∥2

F

− 1.

Correspondingly, we redefine the REE of ESA-BCV as

REE(ESA-BCV) =

∥∥∥XESA(kBCV)−X∥∥∥2

F∥∥∥X??λ??Opt−X

∥∥∥2

F

− 1.

Notice that REE(ESA-BCV) defined here can be much higher than REE(kBCV) defined in (3.21)

where the ESA-BCV estimator is compared with the oracle of ESA. REE(ESA-BCV) can be much


# ofdominatingfactors

γ = 0.02 γ = 0.2 γ = 1 γ = 5 γ = 50

(20, 1000) (100, 5000) (20, 100) (200, 1000) (50, 50) (500, 500) (100, 20) (1000, 200) (1000, 20) (5000, 100)

ErrX(POT-S)/

ErrX(ESA)


ErrX(POT-S)/

ErrX(QMLE)


ErrX(POT-S)/

ErrX(POT)


ErrX(POT-S)/

ErrX(POT-S-0)


Table 4.4: Four measurements comparing the error in estimating Σ when the oracle error of X isachieved under various (N,n) pairs and factor strength scenario with Var(σ2

i ) = 1. Type-1 to Type-6correspond to the six scenarios in Table 3.1.

larger due to the gap between the oracles of ESA and POT-S.

Figure 4.1 shows the survival curves of REE in estimating X for the three methods we compare.

Basically, all the three methods perform better for larger matrices and CV-Boot then has almost

a perfect recovery of of λ??Opt. For smaller matrices, there is barely any significant improvement of

CV-Boot over Wold-CV on average mainly because that the bootstrap estimate of the noise prefers

a larger n to have better accuracy.

Table 4.5 and Table 4.6 give more details of the simulation results. First, CV-Boot is constantly

more accurate in Wold-CV in most of the cases. Also, the rank of the oracle estimate tracks well

the theoretical threshold that there are 7 detectable factors.

Finally, we compare these methods in estimating Σ with Σλ??Opt. Define

REEΣ(λ) =R(Σλ,Σ)

R(Σλ??Opt,Σ)− 1.


0.0 0.2 0.4 0.6 0.8 1.0

0%

20%

40%

60%

80%

100% ESA−BCVWold−CVCV−Boot

(a) All datasets

0.0 0.2 0.4 0.6 0.8 1.0

0%

20%

40%

60%

80%


(b) Large datasets only

0.0 0.2 0.4 0.6 0.8 1.0

0%

20%

40%

60%

80%


(c) Small datasets only

Figure 4.1: REE survival plots for estimating X: the proportion of samples with REE exceeding thenumber on the horizontal axis. Figure 4.1a shows all 6000 samples. Figure 4.1b shows only the 3000simulations of larger matrices of each aspect ratio. Figure 4.1c shows only the 3000 simulations ofsmaller matrices.

and redefine REEΣ(ESA-BCV) accordingly. In Table 4.7 we see that Boot-CV still performs better

than Wold-CV in most of the cases.


FactorType

γ = 0.02 γ = 0.2 γ = 1

Method (20, 1000) (100, 5000) (20, 100) (200, 1000) (50, 50) (500, 500)

Type-10/6/1/1

ESA-BCV 0.409 5.9 0.226 5.8 0.527 4.5 0.208 5.9 0.370 5.0 0.262 6.0Wold-CV 0.013 7.0 0.006 11.7 0.082 6.2 0.017 9.4 0.021 7.4 0.014 8.5CV-Boot 0.008 7.1 0.004 9.6 0.092 6.2 0.001 7.2 0.016 7.1 0.001 7.1Oracle – 7.1 – 7.1 – 6.7 – 7.2 – 7.1 – 7.0

Type-22/4/1/1


Type-33/3/1/1


Type-43/1/3/1


Type-51/3/3/1


Type-60/1/6/1


Table 4.5: Comparison of REE and the rank of X with various (N,n) pairs and scenarios. For eachscenario, the factors’ strengths are listed as the number of “strong/useful/harmful/undetectable”factors. For each (N,n) pair, the first column is the REE and the second column the rank theestimated matrix. Both values are averages over 100 simulations. Var

[σ2i

]= 1.


FactorType

γ = 5 γ = 50

Method (100, 20) (1000, 200) (1000, 20) (5000, 100)

Type-10/6/1/1

ESA-BCV 0.794 4.1 0.382 6.0 0.629 4.9 0.526 5.8Wold-CV 0.027 7.6 0.012 8.0 0.009 7.0 0.008 7.6CV-Boot 0.023 7.2 0.001 7.0 0.009 7.0 0.002 7.0Oracle – 7.0 – 7.0 – 7.0 – 7.0

Type-22/4/1/1


Type-33/3/1/1


Type-43/1/3/1


Type-51/3/3/1


Type-60/1/6/1


Table 4.6: Like Table 4.5, but for larger aspect ratios γ


γ = 0.02 γ = 0.2 γ = 1 γ = 5 γ = 50

Method (20, 1000) (100, 5000) (20, 100) (200, 1000) (50, 50) (500, 500) (100, 20) (1000, 200) (1000, 20) (5000, 100)

Type-1 0/6/1/1

ESA-BCV 1.120 0.267 0.189 5.320 0.250 0.046 0.172 0.026 1.604 0.027Wold-CV 0.037 0.147 0.158 0.052 0.017 0.270 0.642 0.090 1.477 -0.023CV-Boot 0.015 0.100 0.052 0.012 0.017 0.001 -0.018 -0.010 0.838 -0.038

Type-2 2/4/1/1

ESA-BCV 0.729 0.159 0.118 7.182 0.183 -0.022 0.095 0.017 0.964 0.059Wold-CV -0.058 0.102 0.152 0.006 0.017 0.385 0.620 0.211 0.560 0.195CV-Boot -0.096 0.058 0.089 0.001 0.017 -0.036 -0.034 -0.003 0.384 0.005

Type-3 3/3/1/1

ESA-BCV 1.046 0.087 0.079 5.325 0.079 -0.050 0.161 -0.016 1.425 0.040Wold-CV -0.090 0.101 0.137 0.002 0.007 0.444 0.584 0.260 0.205 0.201CV-Boot -0.071 0.085 0.091 0.001 0.007 -0.048 -0.051 -0.021 0.155 -0.000

Type-4 3/1/3/1

ESA-BCV 0.448 -0.105 -0.059 1.368 -0.065 -0.061 -0.015 -0.035 0.840 -0.017Wold-CV -0.007 0.075 0.043 0.005 0.006 0.410 0.556 0.216 0.645 0.207CV-Boot -0.041 0.034 -0.000 0.004 0.006 -0.018 -0.013 -0.009 0.164 -0.000

Type-5 1/3/3/1

ESA-BCV 0.689 0.117 0.103 2.198 0.082 -0.012 0.026 -0.006 0.737 0.013Wold-CV 0.049 0.077 0.075 0.085 0.019 0.290 0.635 0.159 1.527 0.119CV-Boot 0.123 0.056 0.024 0.027 0.019 0.002 0.002 0.003 0.193 -0.003

Type-6 0/1/6/1

ESA-BCV 0.670 0.183 0.327 1.689 0.122 0.022 0.212 0.022 0.922 0.030Wold-CV 0.067 0.055 0.004 -0.199 0.030 0.221 0.487 0.117 1.813 0.059CV-Boot 0.043 0.046 0.000 -0.019 0.016 -0.030 -0.009 -0.010 -0.005 -0.004

Table 4.7: Comparison of REEΣ for various (N,n) pairs and scenarios. For each scenario, the factors’strengths are listed as the number of “strong/useful/harmful/undetectable” factors. The values areaverages over 100 simulations. Var

[σ2i

]= 1.

Chapter 5

Confounder adjustment with factor

analysis

In this chapter, we discuss a multiple regression model with bias corrected by factor analysis.

The motivation of the problem is to correct for the biases and correlation of individual test

statistics in multiple hypotheses testing. In many scientific studies, for example microarray analysis,

tens of thousands of tests are typically performed simultaneously. A typical model is that each of

the individual test statistics is via linear regression, regressing the response variable on the variable

of interest, with other known covariates. However, there can be unknown factors that affect the

response variables in many of the individual hypotheses, inducing correlation among the individual

test statistics. Moreover, those latent factors can also be correlated with the variable of interest, then

the test statistics are not only correlated but are also confounded. We use the phrase “confounding”

to emphasize that these latent factors can significantly bias the individual p-values. Simultaneous

inference such as false discovery rate (FDR) control requires independent and correct individual

p-values. Many confounder adjustment methods have already been proposed for multiple testing

over the last decade Gagnon-Bartsch et al. (2013); Leek and Storey (2008b); Price et al. (2006); Sun

et al. (2012). Our goal is to unify these methods in the same framework and study their statistical

properties based on theoretical results in factor analysis.

In microarray data analysis, common sources of confounding factors include unknown technical

bias Gagnon-Bartsch et al. (2013), environmental changes Fare et al. (2003); Gasch et al. (2000)

and surgical manipulation Lin et al. (2006). See Lazar et al. (2013) for a survey. In many studies,

especially for observational clinical research and human expression data, the latent factors, either

genetic or technical, are confounded with primary variables of interest due to the observational

nature of the studies and heterogeneity of samples Ransohoff (2005); Rhodes and Chinnaiyan (2005).

Similar confounding problems also occur in other high-dimensional datasets such as brain imaging

68

CHAPTER 5. CONFOUNDER ADJUSTMENT WITH FACTOR ANALYSIS 69

Schwartzman et al. (2008) and metabonomics Craig et al. (2006).

Notation. Subscripts of matrices are used to indicate row(s) whenever possible. For example, if

C is a set of indices, then XC is the corresponding rows of a matrix X. A random matrix E ∈ Rn×p

is said to follow a matrix normal distribution with mean M ∈ Rn×p, row covariance U ∈ Rn×n

and column covariance V ∈ Rp×p, abbreviated as E ∼ MN (M,U, V ), if the vectorization of E by

column follows the multivariate normal distribution vec(E) ∼ N(vec(M), V ⊗ U). When U = In,

this means the rows of E are i.i.d. N(0, V ). We use the usual notation in asymptotic statistics that a

random variable is Op(1) if it is bounded in probability, and op(1) if it converges to 0 in probability.

Bold symbols Op(1) or op(1) mean each entry of the vector is Op(1) or op(1).

5.1 The model and the algorithm

5.1.1 A statistical model for confounding factors

We consider a single primary variable of interest and no other known control variables in this section.

It is common to add intercepts and known effects (such as lab and batch effects) in the regression

model. This extension to multiple linear regression does not change the main theoretical results in

this paper and is discussed later in Section 5.3.

For simplicity, all the variables in this section are assumed to have mean 0 marginally. Our model

is built on the already widely used linear model in the existing literature and we rewrite it here:

YN×n = βN×1 Z1×n + LN×r Fr×n + Σ1/2EN×n (5.1a)

where Y is observed data matrix of response, Z is the variable of interest and F are the latent factor

variables (or confounders). Each row represents a variable and each column represents a sample.

We assume the random factor score model for the dependence of F and the primary variable Z. We

assume a linear relationship as in

F = αZ +W, (5.1b)

and in addition some distributional assumptions on Z, W and the noise matrix E

Zji.i.d.∼ mean 0, variance 1, j = 1, . . . , n, (5.1c)

W ∼MN (0, Ir, In), W |= Z, (5.1d)

E ∼MN (0, IN , In), E |= (Z,F ). (5.1e)

The parameters in the model eq. (5.1) are β ∈ RN×1 the primary effects we are most interested in,

L ∈ Rp×r the factor loadings, α ∈ RN×1 the association of the primary variable with the confounding

factors, and Σ ∈ RN×N the noise covariance matrix. We assume Σ is diagonal Σ = diag(σ21 , . . . , σ

2N ),


so the noise for different outcome variables is independent.

In (5.1c), Zi is not required to be Gaussian or even continuous. For example, a binary or categor-

ical variable after normalization also meets this assumption. The parameter vector α measures how

severely the data are confounded. For a more intuitive interpretation, consider an oracle procedure of

estimating β when the confounders F in eq. (5.1a) are observed. The best linear unbiased estimator

in this case is the ordinary least squares (βOLSi , LOLS

i ), whose variance is σ2i Var [Zj , Fj ]

−1/n. Using

(5.1b) and (5.1d), it is easy to show that Var(βOLSi ) = (1 + ‖α‖22)σ2

i /n and Cov(βOLSi1

, βOLSi2

) = 0 for

i1 6= i2. In summary,

Var(βOLS) =1

n(1 + ‖α‖22)Σ. (5.2)

Notice that in the unconfounded linear model in which F = 0, the variance of the OLS estimator

of β is Σ/n. Therefore, 1 + ‖α‖22 represents the relative loss of efficiency when we add observed

variables F to the regression which are correlated with Z. In Section 5.2, we show that the oracle

efficiency (5.2) can be asymptotically achieved even when F is unobserved.

5.1.2 Model identification

Following Sun et al. (2012), we introduce a transformation of the data to make both the identification

issues clearer. Consider the Householder rotation matrix Q ∈ Rn×n such that ZQ = ‖Z‖2eT1 =

(‖Z‖2, 0, 0, . . . , 0). Right-multiplying Y by Q, we get Y = Y Q = β‖Z‖2eT1 + LF + Σ1/2E, where

F = FQ = (αZ +W )Q = ‖X‖2e1αT + W , (5.3)

and W = WQd= W , E = EQ

d= E. As a consequence, the first and the rest of the columns of Y are

Y1 = ‖Z‖2β + LF1 + Σ1/2E1 ∼ N (‖X‖2(β + Lα)T , LLT + Σ), (5.4)

Y−1 = LF−1 + Σ1/2E−1 ∼MN (0, In−1, LLT + Σ). (5.5)

Here Y1 is a length N vector, Y−1 is a N × (n− 1) matrix, and the distributions are now conditional

on Z.

Equation (5.5) is just factor analysis model, and the identification of L and Σ has been discussed

thoroughly in Section 1.3. Notice that for our purpose of estimating and testing for β, we only need

to identify the column space of L, as we have a free parameter α that can change to accommodate

any rotation of L. Based on (5.4), after Σ and the column space of L are identified, the task now is

to identify β given β + Lα.

Notice that the parameters β and α cannot be identified from (5.4) given L and Σ because

they have in total N + r parameters while Y1 is a length N vector. If we write PL and PL⊥ as

the projection onto the column space and orthogonal space of L so that β = PLβ + PL⊥β, it is

impossible to identify PLβ from (5.4).


This suggests that we should further restrict the parameter space. We will reduce the degrees

of freedom by restricting at least r entries of β to equal 0. We consider two sufficient identification

conditions of β representing the negative-control scenario discussed in Gagnon-Bartsch et al. (2013)

and the sparsity scenario discussed in Sun et al. (2012).

Under the negative control scenario, for a known subset of the variables C ∈ 1, 2, · · · , N there

are no primary effects, namely βC = 0. It can be immediately seen that a necessary and sufficient

condition to identify β from β + Lα given the column space of L and Σ is that LC has rank r.

Under the sparsity scenario, β is known to be sparse, but none of the locations of the zero entries

are known. For a vector β, define Iβ as the index set of the zero entries of β. Then define the

s-sparse (s ≥ r) space of β as:

B(s) =β ∈ RN : |Iβ | ≥ s, rank(LIβ ) = r

Let B = (β L) ∈ RN×(r+1). Then we have the following result, which is basically a corollary of

Theorem 1.3.2.

Corollary 5.1.1. For the linear model (5.1), assume that Σ and the column space of L can be

identified. Then a necessary and sufficient condition for β to be identified in B(s) (s ≥ r) is that for

any index subset S of size s, if LS is of rank r then βS = 0.

Proof. For sufficiency, we only need to show that if β + Lα = β + Lα, then β = β.

Let δ = α − α, then β = β + Lδ. As β ∈ B(s), then (β + Lδ)Iβ

= 0. As |Iβ | ≥ s, using the

condition we have Iβ ⊆ Iβ . As a result, we have LIβδ = 0. Thus δ = 0 as LI

βhas rank r. Thus

sufficiency is proved.

For necessity, assume that there is subset S which is not a subset of Iβ but BS has rank r. Then

there exists a non-zero δ that βS = LSδ. Then we can set β = β − Lδ. It’s easy to check that

β ∈ B(s) and β 6= β.

5.1.3 The two-step algorithm

Given (5.4) and (5.5), it’s straightforward to estimate the model in two steps

Step 1: Estimate Σ and L in factor analysis

The equation (5.5) is a random score factor analysis model. In this section we will use MLE to

estimate Σ and L mainly for the purpose of theoretical analysis assuming only strong factors. In

practice, one would prefer conditioning on the factor scores Z and using either ESA-BCV proposed

in Chapter 3 or POT-S proposed in Chapter 4, which may have better performance than MLE.

As discussed in Section 2.1, the MLE is to estimate Σ and L by maximizing (2.2)


n

2log det |LLT + Σ| − n

2tr[(LLT + Σ)−1S

]where S is the sample covariance matrix of Y−1 with known 0 mean:

S =Y−1Y

T−1

n− 1.

Step 2: Estimate α and β using linear regression

We consider estimating β in both the negative control scenario and sparsity scenario.

For the negative control scenario, if we know a set C such that βC = 0, then Y1 can be corre-

spondingly separated into two parts:

YC,1‖Z‖2

= LC·

(α+

W1

‖Z‖2

)+

Σ1/2C‖Z‖2

EC,1, and

Y−C,1‖Z‖2

= β−C + L−C

(α+

W1

‖Z‖2

)+

Σ1/2−C‖Z‖2

E−C,1.

(5.6)

We consider the following negative control (NC) estimator via generalized least squares:

αNC =1

‖Z‖2(LTC Σ−1

C LC)−1LTC Σ−1

C Y TC,1, and (5.7)

βNC =Y−C,1‖Z‖2

− L−CαNC. (5.8)

This estimator matches the RUV-4 estimator of Gagnon-Bartsch et al. (2013) except that it uses

maximum likelihood estimates of Σ and L instead of using PCA, and generalized linear squares

instead of ordinary linear squares regression.

For the sparsity scenario where the zero indices in β are unknown but sparse, the estimation of

α and β from Y1/‖Z‖2 = β+LF1 + Σ1/2E1/‖Z‖2 can be cast as a robust regression by viewing Y T1

as observations and L(0) as design matrix. The nonzero entries in β correspond to outliers in this

linear regression.

More specifically, given a robust loss function ρ, we consider the following estimator:

αRR = arg min

N∑i=1

ρ

(Yi1/‖Z‖2 − Li·α

σi

), and (5.9)

βRR = Y1/‖Z‖2 − LαRR. (5.10)


For a broad class of loss functions ρ, estimating α by (5.9) is equivalent to

(αRR, β) = arg minα,β

N∑i=1

1

σ2i

(Y1i/‖Z‖2 − βi − Li·α)2 + Pλ(β), (5.11)

where Pλ(β) is a penalty to promote sparsity of β (She and Owen, 2011). However βRR is not

identical to β, which is a sparse vector that does not have an asymptotic normal distribution. The

LEAPP algorithm (Sun et al., 2012) uses the form (5.11). Replacing it by the robust regression

(5.9) and (5.10) allows us to derive significance tests of H0i : βi = 0.

5.2 Statistical inference for β

In this section, we derive asymptotic distributions of βNC and βRR when n,N → ∞ and propose

asymptotic valid tests for the individual hypotheses: H0i : βi = 0.

The results are based on Theorem 2.1.3 and Theorem 2.1.4 for the asymptotics of Σ and L in

the first step. As the asymptotic distributions for MLE (or QMLE) have the simplest form for the

non-random factor score model, We first convert the random score factor model (5.5) to the non-

random factor score model. We introduce the matrix R ∈ Rr×r such that RRT = (n− 1)−1F−1FT−1

and RTLTΣ−1LR is diagonal. Define F(0)−1 = R−1Z−1, L(0) = LR, then L(0)F

(0)−1 = LF−1 and L(0)

and F(0)−1 satisfy Assumption 4.

Based on Theorem 2.1.3 and Theorem 2.1.4, we have the following results for L and Σ:

Corollary 5.2.1. Under the assumptions of Theorem 2.1.4 for the random factor score model, when

n,N →∞ then for any fixed index set S with finite cardinality,

√n(LS − L(0)

S )d→MN (0,ΣS , Ir) (5.12)

where ΣS is the noise covariance matrix of the variables in S. Also

maxj≤N‖L2

i· − L2i·‖2 = Op(

√logN/n), max

j≤N|σ2i − σ2

i | = Op(√

logN/n) (5.13)

maxj≤N‖Li·‖2 = Op(1) (5.14)

maxi=1,2,··· ,N

∥∥∥∥∥∥Li· − L(0)i· −

1

n− 1

n∑j=2

σieijF(0)T·j

∥∥∥∥∥∥2

= op(n− 1

2 ). (5.15)

Proof. First, we show that conditional on F−1, W.L.O.G., we can also assume L(0) satisfies Assump-

tion 5. In Bai and Li (2012a, Lemma A.1), they proved that for a matrix Q ∈ Rr×r, if QQT = Ir

and QTV Q = D where both V and D are diagonal matrices and V has distinct diagonal entries,


then Q must be a diagonal matrix with entries either 1 or −1. Since RRT = Ir + Op(n−1/2) and

LTΣ−1L is diagonal, we thus have the conclusion that we can find R satisfying R = Ir +Op(n−1/2).

In other words, ‖R− Ir‖max = OP (n−1/2). Since L is bounded by C, we also have ‖L(0) −L‖max =

‖L(R − Ir)‖max = Op(n−1/2). Thus, though L(0) is not bounded, the entries are asymptotically

uniformly bounded with rate Op(n−1/2), thus the unbounded part doesn’t make a difference in the

proof of consistency and asymptotic normality of L.

Then the results in Corollary 5.2.1 are a straight forward consequences of Theorem 2.1.4. Notice

that we have just shown in the previous paragraph that maxi=1,··· ,N ‖L(0)2i· − L2

i·‖2 = Op(n−1/2).

Also (5.14) is a direct consequence of (A.12).

Next, we discuss the asymptotic properties of β and α under both the negative-control and

sparsity scenario.

First, notice that the parameters α and β only appear in (5.4), so their inference can be completely

separated from the inference of L and Σ. In fact, under the Gaussian assumption and conditional

on Z, Y1 |= Y−1 since E1 |= E−1, so the two steps use mutually independent information. This in turn

greatly simplifies the theoretical analysis.

For the rest of the section, we first consider the estimation of β for fixed W1, R and Z, and then

show the asymptotic distribution of β indeed does not depend on W1, R or Z. Thus all the results

also hold unconditionally. This conditional step can simplify our analysis and we will see that it

does not affect asymptotic efficiency.

To use the results in Corollary 5.2.1, we replace L by L(0) and rewrite (5.4) as

Y1/‖Z‖2 = β + L(0)α(0) + E1/‖Z‖2 (5.16)

where L(0) = LR and α(0) = R−1(α+W1/‖Z‖2). Notice that the random R only depends on Y−1 and

thus is independent of Y1. Also, in the proof we have already shown that ‖R− Ir‖max = Op(n−1/2)

and also ‖Z‖2/√n→ 1 by the law of large numbers, it’s thus easy to check that ‖α(0)−α‖2 = op(1).

5.2.1 The negative control scenario

Let’s first analyze the negative control scenario βC = 0 for a known index set C. The number of

negative controls |C| may grow as N → ∞. We impose an additional assumption on the latent

factors of the negative controls.

Assumption 10. limN→∞ |C|−1LTCΣ−1C LC exists and is positive definite.

Then we can show the consistency and asymptotic distribution of βNC from (5.8).

Theorem 5.2.1. Under the assumptions of Corollary 5.2.1 and Assumption 10, if n,N →∞ then


for any fixed index set S with finite cardinality and S ∩ C = ∅, we have

√n(βNCS − βS)

d→ N (0, (1 + ‖α‖22)(ΣS + ∆S)) (5.17)

where ∆S = LS(LTCΣ−1C LC)

−1LTS .

If in addition |C| → ∞,

√n(βNCS − βS)

d→ N (0, (1 + ‖α‖22)ΣS). (5.18)

Proof. First, we can proceed to prove our theorem by showing the conclusion holds for any fixed u

and fixed sequences Z(n)∞n=1 and R(n,N)∞n=1,N=1 such that ‖Z(n)‖2/√n→ 1 and R(n,p) → Ir as

n,N →∞. For brevity we will write Z and R instead of Z(n) and R(n,N) for the rest of this proof.

Plugging (5.6) in the estimator (5.7) and (5.8), we obtain

√n(βNC−C − β−C) =

√n

‖Z‖2(E−C,1 − L−C(LTC Σ−1

C LC)−1LTC Σ−1

C EC,1)

+√n · (L(0)

−C − L−C)α(0)

+√n · L−C(LTC Σ−1

C LC)−1LTC Σ−1

C (LC − L(0)C )α(0).

As n,N →∞,√n/‖Z‖2

a.s.→ 1. Also, using (5.13), both Σ and Γ have entrywise uniform convergence

in probability to Σ and Γ. Thus, using Assumption 10, we get( 1

|C|LTC Σ−1

C LC

)−1

=( 1

|C|LTCΣ−1

C LC

)−1

+ op(1),

1

|C|LTC Σ−1

C EC,1 =1

|C|LTCΣ−1

C EC,1 + op(1), and

1

|C|LTC Σ−1

C(√n(LC − L(0)

C ))

=1

|C|LTCΣ−1

C(√n(LC − L(0)

C ))

+ op(1)

(5.19)

which imply that

√n(βNCS − βS) =ES,1 − LS(LTCΣ−1

C LC)−1LTCΣ−1

C EC,1

+√n · (L(0)

S − LS)α(0)

+√n · LS(LTCΣ−1

C LC)−1LTCΣ−1

C (LC − L(0)C )α(0) + op(1).

(5.20)

Note that E1 |= L, EC,1 |= ES,1, and√n(LS−L(0)

S )d→ N (0,ΣS⊗Ir), the four main terms on the right

hand side of (5.20) are (asymptotically) uncorrelated, so we only need to work out their individual

variances. Since E1 ∼ N (0,Σ), we have ES,1 ∼ N (0,ΣS) and LS(LTCΣ−1C LC)

−1LTCΣ−1C EC,1 ∼


N (0,∆S). Similarly,√n · (L(0)

S − LS)α(0) d→ N (0, ‖α‖2ΣS), and

√n · LS(LTCΣ−1

C LC)−1LTCΣ−1

C (LC − L(0)C )α(0) d→ N(0, ‖α‖2∆S).

If in addition |C| → ∞, then the minimum eigenvalue of LTCΣ−1C LC →∞ by Assumption 10, then

the maximum entry of ∆S goes to 0. Thus, (5.18) holds.

The asymptotic variance in (5.18) is the same as the variance of the oracle least squares in (5.2).

Comparable oracle efficiency statements can be found in the econometrics literature (Bai and Ng,

2006; Wang et al., 2015). This is also the variance used implicitly in RUV-4 as it treats the estimated

Z as given when deriving test statistics for β. When the number of negative controls is not too large,

say |C| = 30, the correction term ∆S is nontrivial and gives more accurate estimate of the variance

of βNC. See Section 5.4.1 for more simulation results.

5.2.2 The sparsity scenario

We then analyze the sparsity scenario where we know β is sparse but the zero indices are unknown.

To guarantee the performance of robust regression estimators βRR, we assume a smooth loss ρ

for the theoretical analysis:

Assumption 11. The penalty ρ : R → [0,∞) with ρ(0) = 0. The function ρ(x) is non-increasing

when x ≤ 0 and is non-decreasing when x > 0. The derivative ψ = ρ′ exists and |ψ| ≤ D for some

D <∞. Furthermore, ρ is strongly convex in a neighborhood of 0.

A sufficient condition for the local strong convexity is that ψ′ > 0 exists in a neighborhood of 0.

The next theorem establishes the consistency of βRR.

Theorem 5.2.2. Under the assumptions of Corollary 5.2.1 and Assumption 11, if n,N → ∞ and

‖β‖1/N → 0, then αRR p→ α. As a consequence, for any i, βRRi

p→ βi.

Proof. We abbreviate αRR as α in this proof. To avoid confusion, we use α for the true value of the

parameter and α to represent a vector in Rr.Because α(0) → α, we prove this theorem by showing that for any ε > 0, P

[‖α− α(0)‖0 ≥ ε

]→ 0.

We break down our proof to two key results: First, we show α and α(0) are close in the following

sense

ϕ(α(0) − α)∆=

1

N

N∑i=1

ρ

(Li·(α

(0) − α)

σi

)= op(1), (5.21)

and second, we show that for sufficiently small ε > 0, there exists τ > 0 such that as n,N →∞

P[

inf‖α‖2≥ε

ϕ(α) > τ

]→ 1. (5.22)


Based on these two results and the observation that

‖α(0) − α‖2 < ε ⊇ϕ(α(0) − α) < τ

⋂inf‖α‖2≥ε

ϕ(α) > τ

,

we conclude that P[‖α− α(0)‖2 ≥ ε

]→ 0.

Let’s start with (5.21). Denote lp(α) = N−1∑Ni=1 ρ

(Yi1/‖Z‖2 − Li·α/σi

). By (5.9), we have

αRR = arg min lp(α), so lp(α) ≤ lp(α(0)). We examine the difference between lp(α) and ϕ(α(0) − α)

for any α, starting from

lp(α) =1

N

N∑i=1

ρ

(Yi1/‖Z‖2 − Li·α

σi

)=

1

N

N∑i=1

ρ

(βi + L

(0)i· α

(0) + Ei1/‖Z‖2 − Li·ασi

).

Because ρ has bounded derivative, |ρ(x) − ρ(y)| ≤ D|x − y| for any x, y ∈ R. As we assume

‖β‖1/N → 0. This together with 1/‖Z‖2 → 0 implies that

lp(α) =1

N

N∑i=1

ρ

(L

(0)i· α

(0) − Li·ασi

)+ op(1).

Next, ∣∣∣∣∣L(0)i· α

(0) − Li·ασi

− Li·(α(0) − α)

σi

∣∣∣∣∣ =

∣∣∣∣∣L(0)i· − Li·α(0)

σi

∣∣∣∣∣ p→ 0.

Therefore, by the same argument as before,

lp(α) =1

N

N∑i=1

ρ

(Li·(α

(0) − α)

σi

)+ op(1) = ϕ(α(0) − α) + op(1). (5.23)

Also, ϕ(0) = 0 because ρ(0) = 0. Therefore lp(α) ≤ lp(α(0)) = op(1). Notice that the op(1) term in

(5.23) does not depend on α, hence ϕ(α(0) − α) = lp(α) + op(1) = op(1).

Next we prove (5.22). Since ρ(x) is non-decreasing when x ≥ 0,

inf‖α‖2≥ε

ϕ(α) = inf‖α‖2≥ε

1

N

N∑i=1

ρ

(Li·α

σi

)≥ inf‖α‖2=ε

1

N

N∑i=1

ρ

(Li·α

σi

).

Using Corollary 5.2.1 (5.14), it’s easy to see that there exists some constantD? that P[maxi ‖Li·‖2 ≤ D?

]→

1. Thus when maxi ‖Li·‖2 ≤ D? holds, there is sufficiently small ε > 0, the α on the right hand side

is within the neighborhood where ρ is strongly convex in Assumption 11, so for some κ > 0

inf‖α‖2=ε

1

N

N∑i=1

ρ

(Li·α

σi

)≥ inf‖α‖2=ε

κ · 1

N

N∑i=1

(Li·α

σi

)2

= κε2 · λmin

(LT Σ−1L

).


By the uniform consistency of L and Σ, we conclude (5.22) is true for τ = κε2λmin(LTΣ−1L)/2,

where λmin(LTΣ−1L) > 0 as in the assumption of Corollary 5.2.1.

To derive the asymptotic distribution, we consider the estimating equation corresponding to

(5.9). By taking the derivative of (5.9), αRR satisfies

ΨN,L,Σ(αRR) =1

N

N∑i=1

ψ

(Yi1/‖Z‖2 − Li·αRR

σi

)Li·/σi = 0. (5.24)

The next assumption is used to control the higher order term in a Taylor expansion of Ψ.

Assumption 12. The first two derivatives of ψ exist and both |ψ′(x)| ≤ C and |ψ′′(x)| ≤ C hold

at all x for some C <∞.

Examples of loss functions ρ that satisfy Assumption 11 and Assumption 12 include smoothed

Huber loss and Tukey’s bisquare.

The next theorem gives the asymptotic distribution of βRR when the nonzero entries of β are

sparse enough. The asymptotic variance of βRR is, again, the oracle variance in (5.2).

Theorem 5.2.3. Under Corollary 5.2.1, Assumption 11 and Assumption 12, if n,N →∞, then

√n(βRRS − βS)

d→ N (0, (1 + ‖α‖22)ΣS)

for any fixed index set S with finite cardinality.

If n/N → 0, then a sufficient condition for ‖β‖1√n/N → 0 in Theorem 5.2.3 is ‖β‖1 = O(

√N).

If instead n/N → γ > 0, then ‖β‖1 = o(√N) suffices.

Proof. Because αRR is consistent, we can approximate the left hand side of (5.24) by its second

order Taylor expansion (we abbreviate ΨN,L,Σ to ΨN if it causes no confusion):

0 = ΨN (α(0)) +∇ΨN (α(0)) · (αRR − α(0)) + rN

where rN is the higher order term and Assumption 12 implies rN = op(‖αRR − α(0)‖2). Therefore

αRR = α(0) −[∇ΨN (α(0)) + op(1)

]−1ΨN (α(0)) and

√n(βRR − β) =

√n

‖Z‖2E1 +

√n(L(0) − L)αRR + L

[∇ΨN (α(0)) + op(1)

]−1√nΨN (α(0)) (5.25)

Because of the consistency of αRR and independence between L and (Z, E1), using Slusky’s

theorem we get (√n/‖Z‖2)ES,1 +

√n(L

(0)S − LS)αRR d→ N (0, (1 + ‖α‖2)ΣS) by Corollary 5.2.1.

Therefore the proof of Theorem 5.2.3 is completed once we can show the largest eigenvalue of[∇ΨN (α(0))

]−1is Op(1) and

√nΨN (α(0))

p→ 0.


We first show that√nΨN (α(0))

p→ 0. By using the representation of L in (5.15), we have

ΨN (α(0)) =1

N

N∑i=1

ψ( Yi1/‖Z‖2 − Li1α(0)

σi

)Li·/σi

=1

N

N∑i=1

ψ(βi + Ei1/‖Z‖2 − 1

n−1 Ei,−1Z(0)T−1 α(0) + εi

σi + δi

)Li·/σi

where maxi |δi| = op(1) and maxi |εi| = op(n−1/2) from Corollary 5.2.1. Because ‖β‖1

√n/N → 0

and ψ′ is bounded,

Ψp(α(0)) =

1

N

N∑i=1

ψ

(Ei1/‖Z‖2 − 1

n−1 Ei,−1Z(0)T−1 α(0) + εi

σi + δi

)Li·/σi + op(n

−1/2)

Let gi be the expression inside ψ in the last equation omitting εi and δi. Conditionally on Z(0)−1 ,

the variables gi,where i = 1, . . . , N are independent and identically distributed with E(gi) = 0 and

gi = Op(n−1/2). Thus, using Assumption 12 and boundedness of σi, and rearranging the variables,∥∥∥∥∥ 1

N

N∑i=1

[ψ(gi +

εi − δigiσi

)− ψ(gi)

]Li·/σi

∥∥∥∥∥2

≤C2 ·

∥∥∥∥∥ 1

N

N∑i=1

(|εi||Li·|+ |gi||δiLi·|

)/σi

∥∥∥∥∥2

= op(n−1/2)

We can further use the facts that ψ(gi) = ψ′(0)gi + op(n−1/2) = Op(n

−1/2) and ψ(gi)− ψ′(0)gi

are i.i.d., and get:

∥∥∥ΨN (α(0))∥∥∥

2=

∥∥∥∥∥ 1

N

N∑i=1

ψ

(gi +

εi − δigiσi

)Li·/σi

∥∥∥∥∥2

+ op(n−1/2)

=

∥∥∥∥∥ 1

N

N∑i=1

ψ (gi) Li·/σi

∥∥∥∥∥2

+ op(n−1/2) =

∥∥∥∥∥ 1

N

N∑i=1

ψ(gi)Li·σi

∥∥∥∥∥2

+ op(n−1/2)

=

∥∥∥∥∥ 1

N

N∑i=1

ψ′(0)Li·σigi

∥∥∥∥∥2

+ op(n−1/2) = op(n

− 12 ).

The last equality holds as the variable inside the norm has mean 0 and standard deviation in the

order of n−1/2N−1/2.

Finally, we show that the largest eigenvalue of the matrix[∇ΨN (α(0))

]−1is bounded in prob-

ability. Because of the assumption of strong factors in Corollary 5.2.1 that limN→∞1NL

TΣ−1L is

positive definite, we use Assumption 12 and the uniform convergence of Σ and Γ and use a similar


argument to get

[∇ΨN (α(0))

]−1

=

[1

N

N∑i=1

ψ′(gi +

εi − δigiσi

)LTi· Li·/σ

2i + op(1)

]−1

=

[1

N

N∑i=1

ψ′ (0)LTi·Li·/σ2i + op(1)

]−1

p→[ψ′(0) lim

N→∞

1

NLTΣ−1L

]−1

.

Notice that ψ′(0) = ρ′′(0) > 0 as ρ is strongly convex in a neighborhood of 0. This means that all

the eigenvalues of[∇Ψp(α

(0))]−1

converge to finite constants.

Based on Theorems 5.2.1 and 5.2.3, we can construct p-values that are asymptotically valid

and independent. Consider the asymptotic test for H0i : βi = 0, i = 1, . . . , N resulting from the

asymptotic distributions of βj derived in Theorems 5.2.1 and 5.2.3.

ti =‖Z‖2βi

σi√

1 + ‖α‖2, i = 1, . . . , N (5.26)

The null hypothesis H0i is rejected at level-α if |ti| > zα/2 = Φ−1(1 − α/2) as usual, where Φ

is the cumulative distribution function of the standard normal. Note that here we slightly abuse

the notation α to represent the significance level and this should not be confused with the model

parameter α. In practice, when |C| is small, we replace√

1 + ‖α‖2 with the inflation corrected

variance in Theorem 5.2.1 when constructing the test statistics.

Remark. We find a calibration technique in Sun et al. (2012) very useful to improve the type I

error and FDR control for finite sample size. Because the asymptotic variance used in eq. (5.26) is

the variance of an oracle OLS estimator, when the sample size is not sufficiently large, the variance

of βRR should be slightly larger than this oracle variance. To correct for this inflation, one can

use median absolute deviation (MAD) with customary scaling to match the standard deviation for a

Gaussian distribution to estimate the empirical standard error of tj , j = 1, . . . , p and divide tj by the

estimated standard error. The performance of this empirical calibration is studied in the simulations

in Section 5.4.1.

5.3 Extension to multiple regression

In (5.1) we assume that there is only one primary variable X and all the random variables Z, Y and

F have mean 0. In practice, there may be several predictors, or we may want to include an intercept

term in the regression model. Here we develop a multiple regression extension to the original model

(5.1).


Suppose we observe in total d = d0 +d1 random predictors that can be separated into two groups:

1. Z0: d0 × n nuisance covariates that we would like to include in the regression model, and

2. Z1: d1 × n primary variables whose effects we want to study.

For example, the intercept term can be included in Z0 as a 1×n vector of 1 (i.e. a random variable

with mean 1 and variance 0).

Leek and Storey (2008a) consider the case d0 = 0 and d1 ≥ 1 for SVA and Sun et al. (2012)

consider the case d0 ≥ 0 and d1 = 1 for LEAPP. Here we study the confounder adjusted multiple

regression in full generality, for any d0 ≥ 0 and d1 ≥ 1. Our model is

Y = B0Z0 + B1Z1 + LF + Σ1/2E, (5.27a)(Z0j

Z1j

)are i.i.d. with E

(Z0j

Z1j

)(Z0j

Z1j

)T = ΣZ , (5.27b)

F | (Z0, Z1) ∼MN (A0Z0 + A1Z1, Ir, In), and (5.27c)

E |= (Z0, Z1, F ), E ∼MN (0, IN , In). (5.27d)

The model does not specify means for X0i and X1i; we do not need them. The parameters in this

model are, for i = 0 or 1, Bi ∈ Rdi×N , L ∈ RN×r, ΣZ ∈ Rd×d, and Ai ∈ Rr×di . The parameters A

and B are the matrix versions of α and β in model (5.1). Additionally, we assume ΣZ is invertible.

Clarifying our purpose, we are primarily interested in estimating and testing for the significance of

B1.

For the multiple regression model (5.27), we again consider the rotation matrix Q that is given

by the QR decomposition(ZT0 ZT1

)= QU where Q ∈ Rn×n is an orthogonal matrix and U is an

upper triangular matrix of size n× d. Therefore we have(Z0

Z1

)Q = UT =

(U00 0 0

U10 U11 0

)

where U00 is a d0 × d0 lower triangular matrix and U11 is a d1 × d1 lower triangular matrix. Now

let the rotated Y be

Y = Y Q =(Y0 Y1 Y−1

)(5.28)

where Y0 is N × d0, Y1 is N × d1 and Y−1 is N × (n− d), then we can partition the model into three


parts: conditional on both Z0 and Z1 (hence U),

Y0 = B0U00 + B1U01 + LF0 + Σ1/2E0, (5.29)

Y1 = B1U11 + LF1 + Σ1/2E1 ∼MN ((B1 + LA1)U11, LLT + Σ, Id1) (5.30)

Y−1 = LF−1 + Σ1/2E−1 ∼MN (0, LLT + Σ, In−d) (5.31)

where F = FQ and E = EQd= E. Equation (5.29) corresponds to the nuisance parameters B0 and

is discarded according to the ancillary principle. Equation (5.30) is the multivariate extension to

(5.4) that is used to estimate B1 and (5.31) plays the same role as (5.5) to estimate L and Σ.

We consider the asymptotics when n,N → ∞ and d, r are fixed and known. Since d is fixed,

the estimation of L is not different from the simple regression case and the results of QMLE in

Corollary 5.2.1 still holds under the same assumptions.

Let

Σ−1Z = Ω =

(Ω00 Ω01

Ω10 Ω11

).

In the proof of Theorems 5.2.1 and 5.2.3, we consider a fixed sequence of Z such that ‖Z‖2/√n→ 1.

Similarly, we have the following lemma in the multiple regression scenario:

Lemma 5.3.1. As n→∞, U11UT11/n

a.s.→ Ω−111 .

Proof. First, notice that by the strong law of large numbers 1n

(Z0

Z1

)(ZT0 ZT1

)a.s.→ ΣZ . Using the

QR decomposition(ZT0 ZT1

)= QU and writing UT =

(V 0

)and V =

(U00 0

U10 U11

), it’s clear

that V V T /na.s.→ ΣZ . Since ΣZ is nonsingular, then V , U00 and U11 are full rank square matrices

with probability 1. Also using the block matrix inversion formula, we have

V −1 =

(U−1

00 0

−U−111 U10U

−100 U−1

11

).

Therefore the right bottom block of nV −TV −1 is nU−T11 U−111 and converges to Ω11 almost surely.

Thus the statement in the lemma holds.

Similar to (5.4), we can rewrite (5.30) as

Y1U−111 = B1 + L(A1 + W1U

−111 ) + Σ1/2E1U

−111

where W1 ∼ MN (0, IN , Id1) is independent from E1. As in Section 5.2, we derive statistical prop-

erties of the estimate of B1 for a fixed sequence of Z, W1 and F , which also hold unconditionally.

For simplicity, we assume that the negative controls are a known set of variables C with B1,C = 0.


We can then estimate each column of A1 by applying the negative control (NC) or robust regression

(RR) we discussed in Section 5.1.3 to the corresponding row of Y1U−111 , and then estimate B1 by

B1 = Y1U−111 − LA1.

Notice that E1U−111 ∼MN

(0, U−T11 U−1

11 ,Σ). Thus the “samples” in the robust regression, which are

actually the N variables in the original problem are still independent within each column. Though

the estimates of each column of A1 may be correlated, we will show that the correlation won’t affect

inference on B1. As a result, we still get asymptotic results similar to Theorem 5.2.3 for the multiple

regression model (5.27):

Theorem 5.3.1. Under the assumptions of Corollary 5.2.1 and Assumptions 10 to 12, if n,N →∞,

then for any fixed index set S with finite cardinality |S|,

√n(BNC

1,S − B1,S)d→MN (0|S|×d1

,ΣS + ∆S ,Ω11 + AT1 A1), and (5.32)

√n(BRR

1,S − B1,S)d→ MN(0|S|×d1

,ΣS ,Ω11 + AT1 A1) (5.33)

where ∆S is defined in Theorem 5.2.1.

Proof. First, for the known zero indices scenario, ANC1 has the following formula, which is similar to

(5.7):

ANC1 = (LTC Σ−1

C LC)−1LTC Σ−1

C Y1,CU−111 (5.34)

which implies a similar formula as (5.20):

√n(BNC

1,S − B1,S) =√nE1,SU

−111 −

√n · LS(LTCΣ−1

C LC)−1LTCΣ−1

C E1,CU−111

+√n · (L(0)

S − LS)A(0)1

+√n · LS(LTCΣ−1

C LC)−1LTCΣ−1

C (LC − L(0)C )A

(0)1 + op(1),

(5.35)

where A(0)1 = R−1(A1 + W1U

−111 ). Following the proof of Theorem 5.2.1 by using Lemma 5.3.1, we

get (5.32).

For the sparsity scenario, Lemma 5.3.1 guarantees the consistency of each column of ARR1 by

using Theorem 5.2.2. Then the Taylor expansion used in the proof of Theorem 5.2.3 still works at

each column of A(0)1 . Similar to (5.25), we get

√n(BRR

1 − B1) =√nE1U

−111 +

√n(L(0) − L)ARR

1

+ L(g1 g2 · · · gd1

) (5.36)

where gi =[∇Ψp(A

(0)1,i )]−1

(√nΨp

(A

(0)1,i ) + op(1)

). Following the proof of Theorem 5.2.3, we get


each gi = op(1). Thus

√n(BRR

1 − B1) =√nE1U

−111 +

√n · (L(0) − L)ARR

1 + op(1)

and (5.33) holds.

As for the asymptotic efficiency of this estimator, we again compare it to the oracle OLS estimator

of B1 which observes confounding variables Z in (5.27). In the multiple regression model, we claim

that BRR1 still reaches the oracle asymptotic efficiency. In fact, let B =

(B0 B1 L

). The oracle

OLS estimator of B, BOLS, is unbiased and its vectorization has variance Σ⊗ V −1/n where

V =

(ΣZ ΣZAT

AΣZ Ir + AΣZAT

), for A =

(A0 A1

).

By the block-wise matrix inversion formula, the top left d×d block of V −1 is Σ−1Z +ATA. The variance

of BOLS1 only depends on the bottom right d1 × d1 sub-block of this d × d block, which is simply

Ω11 + AT1 A1. Therefore BOLS

1 is unbiased and its vectorization has variance Σ ⊗ (Ω11 + AT1 A1)/n,

matching the asymptotic variance of BRR1 in Theorem 5.3.1.

5.4 Numerical experiments

5.4.1 Simulation results

In this section we use numerical simulations to verify the theoretical asymptotic results and further

study the finite sample properties of our estimators and tests statistics.

The simulation data are generated from the single primary variable model (5.1). More specifically,

Zi is a centered binary variable (Zi + 1)/2i.i.d.∼ Bernoulli(0.5), and Y·j , F·j are generated according

to (5.1).

For the parameters in the model, the noise variances are generated by σ2i

i.i.d.∼ InvGamma(3, 2), i =

1, . . . , N , and so E(σ2i ) = Var(σ2

i ) = 1. We set each αk = ‖α‖2/√r equally for k = 1, 2, · · · , r where

‖α‖22 is set to 1, so the variance of Xi explained by the confounding factors is R2 = 50%. The pri-

mary effect β has independent components βi taking the values 3√

1 + ‖α‖22 and 0 with probability

π = 0.05 and 1 − π = 0.95, respectively, so the nonzero effects are sparse and have effect size 3.

This implies that the oracle estimator has power approximately P(N(3, 1) > z0.025) = 0.85 to detect

the signals at a significance level of 0.05. We set the number of latent factors r to be either 2 or

10. For the latent factor loading matrix L, we take L = UD where U is a N × r orthogonal ma-

trices sampled uniformly from the Stiefel manifold Vr(RN ), the set of all N × r orthogonal matrix.

As we assume strong factors, we set the latent factor strength D =√N · diag(d1, · · · , dr) where

dk = 3−2(k−1)/(r−1) thus d1 to dr are distributed evenly inside the interval [3, 1]. As the number


of factors r can be consistently easily estimated for this strong factor setting, we assume that the

number r of factors is known to all of the algorithms in this simulation.

We set N = 5000, n = 100 or 500 to mimic the data size of many genetic studies. For the

negative control scenario, we choose |C| = 30 negative controls at random from the zero positions

of β. We expect that negative control methods would perform better with a larger value of |C| and

worse with a smaller value. The choice |C| = 30 is around the size of the spike-in controls in many

microarray experiments (Gagnon-Bartsch and Speed, 2012). For the loss function in our sparsity

scenario, we use Tukey’s bisquare which is optimized via IRLS with an ordinary least-square fit as

the starting values of the coefficients. Finally, each of the four combinations of n and r is randomly

repeated 100 times.

We compare the performance of nine different approaches. There are two baseline methods:

the “naive” method estimates β by a linear regression of Y on just the observed primary variable

Z and calculates p-values using the classical t-tests, while the “oracle” method regresses Y on

both Z and the confounding variables F as described in ??. There are three methods in the

RUV-4/negative controls family: the RUV-4 method (Gagnon-Bartsch et al., 2013), our “NC”

method which computes test statistics using βNC and its variance estimate (1 + ‖α‖22)(Σ + ∆),

and our “NC-ASY” method which uses the same βNC but estimates its variance by (1 + ‖α‖22)Σ.

We compare four methods in the SVA/LEAPP/sparsity family: these are “IRW-SVA” (Leek and

Storey, 2008b), “LEAPP” (Sun et al., 2012), the “LEAPP(RR)” method which is our RR estimator

using M-estimation at the robustness stage and computes the test-statistics using (5.26), and the

“LEAPP(RR-MAD)” method which uses the median absolute deviation (MAD) of the test statistics

in (5.26) to calibrate them. (see Section 5.2)

To measure the performance of these methods, we report the type I error, power, false discovery

proportion (FDP) and precision of hypotheses with the smallest 100 p-values in the 100 simulations.

For both the type I error and power, we set the significance level to be 0.05. For FDP, we use

Benjamini-Hochberg procedure with FDR controlled at 0.2. These metrics are plotted in Figure 5.1

under different settings of n and r.

First, from Figure 5.1, we see that the oracle method has exactly the same type I error and FDP

as specified, while the naive method and SVA fail drastically. SVA performs performs better than

the naive method in terms of the precision of the smallest 100 p-values, but is still much worse than

other methods. Next, for the negative control scenario, as we only have |C| = 30 negative controls,

ignoring the inflated variance term ∆S in Theorem 5.2.1 will lead to overdispersed test statistics,

and that’s why the type I error and FDP of both NC-ASY and RUV-4 are much larger than the

nominal level. By contrast, the NC method correctly controls type I error and FDP by considering

the variance inflation, though as expected it loses some power compared with the oracle. For the

sparsity scenario, the “LEAPP(RR)” method performs as the asymptotic theory predicted when

n = 500, while when n = 100 the p-values seem a bit too small. This is not surprising because


r = 2 r = 10

0.00

0.25

0.50

0.75

1.00

0.00

0.25

0.50

0.75

1.00

n = 100

n = 500

NaiveIRW−SVA

NCNC−ASY

RUV−4

LEAPP(RR)

LEAPP(RR−MAD)

LEAPPOracle

NaiveIRW−SVA

NCNC−ASY

RUV−4

LEAPP(RR)

LEAPP(RR−MAD)

LEAPPOracle

Type I error Power FDP Top 100

Figure 5.1: Compare the performance of nine different approaches (from left to right): naive regres-sion ignoring the confounders (Naive), IRW-SVA, negative control with finite sample correction (NC) ineq. (5.17), negative control with asymptotic oracle variance (NC-ASY) in eq. (5.18), RUV-4, robust re-gression (LEAPP(RR)), robust regression with calibration (LEAPP(RR-MAD)), LEAPP, oracle regressionwhich observes the confounders (Oracle). The error bars are one standard deviation over 100 repeated sim-ulations. The three dashed horizontal lines from bottom to top are the nominal significance level, FDR leveland oracle power, respectively.


the asymptotic oracle variance in Theorem 5.2.3 can be optimistic when the sample size is not

sufficiently large. On the other hand, the methods which use empirical calibration for the variance

of test statistics, namely the original LEAPP and “LEAPP(RR-MAD)”, control both FDP and type

I error for data of small sample size in our simulations. The price for the finite sample calibration

is that it tends to be slightly conservative, resulting in a loss of power to some extent.

In conclusion, the simulation results are consistent with our theoretical guarantees when N is as

large as 5000 and n is as large as 500. When n is small, the variance of the test statistics will be

larger than the asymptotic variance for the sparsity scenario and we can use empirical calibrations

(such as MAD) to adjust for the difference.

Chapter 6

Conclusions

Factor analysis is a powerful dimension reduction tool with explicit statistical model and assump-

tions. The main difference between factor analysis and the more popular PCA technique is the

consideration of heteroscedastic noise in factor analysis. There is no reason to believe that all the

collected variables have the same noise level. More over, even the raw data may have homoscedastic

noise, heteroscedasticity can arise from data transformation in the preprocessing steps (Woodward

et al., 1998). In other words, the factor analysis model has more flexible and reasonable assump-

tions compared with the white noise model in many data problems. However, as the diagonal noise

variances matrix is also unknown, a factor analysis model is harder to solve since there are more

parameters to estimate.

For high-dimensional data, especially when both the variable and sample dimensions are large,

noise heteroscedasticity is not a serious problem when there are only strong factors (defined in

Chapter 2). Strong factors are easy to estimate for high-dimensional data as more information are

collected when an increasing number of variables are observed. PCA still give consistent estimates

of the factor loadings, scores and the noise variance. However, there is no theoretical results of PCA

for weak factors and heteroscedastic noise.

The presence of weak factors complicates solving the factor model of a high-dimensional data

matrix. As discussed in Chapter 2, researchers in the econometrics field consider approximate factor

models where the data can be decomposed as linear combinations of a few strong factors plus weakly

correlated noise. In other words, in the econometrics literature, weak factors are treated as noise and

are responsible for the weak correlations in the noise. On the other hand in Random Matrix Theory,

weak factors are treated as signals and the goal is to estimate them. Throughout this article, we are

treating weak factors as signals.

In Chapter 3 and Chapter 4 we developed two approaches for estimating both the signal matrix

(factor loadings × factor scores) and noise variance. Chapter 3 proposes an iterating algorithm

ESA and a bi-cross-validation technique to estimate the number of factors. ESA can be considered

88

CHAPTER 6. CONCLUSIONS 89

as a heteroscedastic noise version of PCA/SVD and bi-cross-validation randomly select a block of

the matrix as held-out data. In Chapter 4, we proposed an alternative approach called POT-S

starting with a joint convex optimization using perspective transformation and nuclear penalty. At

the final stage, an optimal shrinkage on the singular values is applied to correct for the bias of the

solutions of the optimization. The tuning parameter is selected using Wold-style cross-validation

which randomly selects entries of the matrix as held-out data.

Empirically, the factors retained using ESA-BCV are always fewer than the factors retained using

POT-S. One explanation is that the shrinkage of the estimated barely detectable factors in POT-S

due to the nuclear penalty makes them more “useful” than in ESA-BCV in terms of estimating the

signal matrix. In practice, ESA-BCV is a better tool in giving a more interpretative estimation of

the number of factors while POT-S is superior in reducing the error of the estimated signal matrix

and noise variance. Besides, the cross-validation error plot in ESA-BCV helps analyzing the strength

of each factor.

ESA-BCV and POT-S are two algorithms to estimate a factor analysis model for high-dimensional

data empirically, while there are still many questions need to be answered for theoretical analysis

of the model. One future work is to develop random matrix theory for the factor model with both

strong and weak factors along with the heteroscedastic noise. Another future work is to give upper

bounds for the estimation errors of the signal matrix and noise in the convex optimization algorithm

POT.

Neither ESA-BCV nor POT-S have made use of sparsity of the factors. It has been shown that

the sparsity assumption can greatly improve estimating the weak factors. However, the scenario

becomes complicated when there are both strong and weak factors. A reasonable assumption is that

sparsity of the loadings of a factor is correlated with the factor’s strength. A strong factor usually

has dense loadings while the loadings of a weak factor are likely to be sparse. Adding a penalty which

encourages sparsity will decrease the accuracy in estimating strong factors but improves estimating

the weak factors. It’s an interesting topic to design an adaptive penalization term based on the

strength and initial estimates of factors.

Confounding factor adjustment in multiple regression is an important application of high-dimensional

factor analysis. In Chapter 5 we analyzed a two-step algorithm for the linear regression model with

Gaussian noise. If there are only strong latent factors, it is shown that we can get asymptotically

valid p-values for the individual primary effects with good power even when the factors are con-

founding with the primary variables. The conditions for the result are that the primary effects are

either sparse enough or contain negative controls. When there are also weak factors, empirically we

find that the ranking of the p-values are still meaningful while the p-values themselves can be biased

if the weak factors are confounding.

To broaden the use of the confounding factor adjustment model, a future research direction is

to extend the model to confounder adjustment of multiple generalized linear regression. The model

CHAPTER 6. CONCLUSIONS 90

then can be applied to nonGaussian response matrix such as binary data or counts, which appear

often in applications such as analyzing SNPs and DNA/RNA sequencing data.

Appendix A

Proof

This appendix shows the proof of of Theorem 2.1.4.

We need the following two lemmas before we prove the results. The first lemma shows that the

product of two independent sub-Gaussian random variables is sub-exponential.

Lemma A.0.1. If the random variables Z1 and Z2 both are sub-Gaussian random variables, and

Z1 and Z2 are independent, then their product Z1Z2 is a sub-exponential random variable.

Proof. W.L.O.G. assume that E [Z1] = E [Z2] = 0, thus E [Z1Z2] = 0. As Z1 and Z2 are sub-

Gaussian random variables, using the results in Rivasplata (2012), there exists a1 > 0 and b2 > 0

that

E[ea1X

21

]≤ 2, E

[etX2

]≤ eb

22t

2/2

Then for ∀ |λ| ≤√

2a1/b22, we have

E[eλX1X2

]≤ E

[eλ

2b22X21/2]≤ E

[ea1X

21

]≤ 2

Thus, there exists some b > 0 that

E[eλX1X2

]≤ ev

2λ2/2

for all |λ| ≤ 1/b and v2 ≥ E[X1X

22

]. Thus X1X2 is a sub-exponential variable.

Here is a restatement of Theorem 2.1.4:

Theorem. 2.1.4. Under the assumptions of Theorem 2.1.3 and assume that eij are sub-Gaussian

random variables, if (logN)2/n→ 0 as n,N →∞, then

maxj≤N‖L2

i· − L2i·‖2 = Op(

√logN/n), max

j≤N|σ2i − σ2

i | = Op(√

logN/n) (2.8)

91

APPENDIX A. PROOF 92

For the non-random factor model,

maxi=1,2,··· ,N

∥∥∥∥∥∥Li· − Li· − 1

n

n∑j=1

σieijFT·j

∥∥∥∥∥∥2

= op(n− 1

2 ). (2.9)

We prove uniform convergence of the estimated factors and noise variances by intensively using

some of the technical results in Bai and Li (2012a) and also modify internal parts of their proof.

Before reading the following proof, we recommend that the reader first read the original proof in

Bai and Li (2012a,b). To help the readers to follow, the variables N , T , Λ (or Λ?) and f (or f?) in

Bai and Li (2012a) correspond to N , n, L and F in our notation. The identification condition in

Theorem 2.1.4 for the non-random factor score model corresponds to the IC3 identification condition

in Bai and Li (2012a). Define

H = (LT Σ−1L)−1, and HN = NH.

The lemma below integrates Equation (A.14) of Bai and Li (2012a), Equation (B.9) and the state-

ments Lemma C.1 in Bai and Li (2012b).

Lemma A.0.2. Under the assumptions of Theorem 2.1.4, we have for any i = 1, 2, · · · , N :

LTi· − LTi· =(L− L

)TΣ−1LHLTi· − HLT Σ−1

(L− L

)(L− L

)TΣ−1LHLTi·

− HLT Σ−1L

(1

nFET

)Σ1/2Σ−1LHLTi· − HLT Σ−1Σ1/2

(1

nEFT

)LT Σ−1LHLTi·

− H

(N∑i1=1

N∑i2=1

σi1σi2σ2i1σ2i2

(1

n

(Ei1·E

Ti2· − E

[Ei1·E

Ti2·]))

LTi1·Li2·

)HLTi·

+ H

N∑i1=1

σ2i1− σ2

i1

σ4i1

LTi1·Li1·HLTi·

+ HLT Σ−1Σ1/2

(1

nEFT

)LTi· + HLT Σ−1L

(1

nFETi·

)σi

+ H

(N∑i1=1

σi1σiσ2i1

(1

n

(Ei1·E

Ti· − E

[Ei1·E

Ti·]))

LTi1·

)− H σ2

i − σ2i

σ2i

LTi·

(A.1)


σ2i − σ2

i =1

n

n∑j=1

(e2ij − σ2

i )− (Li· − Li·)(Li· − Li·)T

+ Li·HLT Σ−1(L− L)(L− L)TΣ−1LHLTi· + 2Li·HL

T Σ−1L

(1

nFET

)Σ1/2LHLTi·

+ Li·H

(N∑i1=1

N∑i2=1

σi1σi2σ2i1σ2i2

(1

n

(Ei1·E

Ti2· − E

[Ei1·E

Ti2·]))

LTi1·Li2·

)HLTi·

− Li·HN∑i1=1

σ2i1− σ2

i1

σ4i1

LTi1·Li1·HLTi·

− 2Li·HLT Σ−1Σ1/2

(1

nEFT

)LTi· + 2Li·H

σ2i − σ2

i

σ2i

LTi·

− 2Li·H

(N∑i1=1

σi1σiσ2i1

(1

n

(Ei1·E

Ti· − E

[Ei1·E

Ti·]))

LTi1·

)

+ 2Li·HLT Σ−1(L− L)

(1

nFETi·

)σi

(A.2)

Also, we have the following approximations:

‖HLT Σ−1(L− L)‖F = Op(n−1) +Op(n

−1/2N−1/2) (A.3)

∥∥∥∥∥H(

N∑i1=1

σi1σiσ2i1

(1

n

(Ei1·E

Ti· − E

[Ei1·E

Ti·]))

LTi1·

)∥∥∥∥∥F

= Op(N−1/2n−1/2) +Op(n

−1) (A.4)

∥∥∥∥∥H(

N∑i1=1

N∑i2=1

σi1σi2σ2i1σ2i2

(1

n

(Ei1·E

Ti2· − E

[Ei1·E

Ti2·]))

LTi1·Li2·

)H

∥∥∥∥∥F

= Op(N−1n−1/2) +Op(n

−1)

(A.5)

1

n

∥∥HLT Σ−1Σ1/2EFT∥∥F

= Op(n−1/2N−1/2) +Op(n

−1) (A.6)

∥∥∥∥∥HN∑i1=1

σ2i1− σ2

i1

σ4i1

LTi1·Li1·H

∥∥∥∥∥F

= Op(N−1n−1/2) (A.7)

Proof. Comparing the results with Equation (A.14) of Bai and Li (2012a), Equation (B.9) and the

statements Lemma C.1 in Bai and Li (2012b), we only need to prove (A.3) and (A.6).


To show (A.6), one just needs to apply HN = Op(1) (Bai and Li, 2012a, Corollary A.1), and

the identification condition MF = Ir to simplify Lemma C.1(e) of Bai and Li (2012b) using Central

Limit Theorem.

To prove (A.3), notice that under our conditions (or the IC3 condition of Bai and Li (2012a)),

the left hand side of Equation (A.13) in Bai and Li (2012a) is actually 0 as both the terms Mff

and M?ff in their notation are exactly Ir. Also, HLT Σ−1L = Ir + op(1) from Bai and Li (2012a,

Corollary A.1). Thus, (A.3) holds by applying (A.5), (A.6) and (A.7) to Bai and Li (2012b) Equation

(A.13).

We are now ready to prove Theorem 2.1.4.

First, notice that we only need to prove for the non-random factor score model. For the random

factor score model, we can condition on the factor scores and make a rotation of the factor loadings

and scores to satisfy the identifiability condition that LTΣ−1L is diagonal. The rotation matrix has

size r × r thus would not affect the uniform convergence rate.

Based on equation (F.1) in Bai and Li (2012b), we have

√n(Li· − Li·) =

1√n

n∑j=1

σieijFT·j + op(1). (A.8)

Now we prove (2.8). Let Li· − Li· = b1i + b2i + · · · + b10,i where bkj represents the kth term in

the right hand side of (A.1). Also, let σ2i − σ2

i = a1i + a2i + · · ·+ a10,i where aki represents the kth

term in the right hand side of (A.2).

By applying (A.3), (A.4),(A.5),(A.6), and (A.7) and boundedness of L, we can immediately

get maxj |bkj | = op(n−1/2) for k 6= 8, 10 and maxj |akj | = op(n

−1/2) for k 6= 1, 2, 8, 9, 10. Using

independence of the noise, boundedness of σi and the exponential-decay tail assumption, we can

find that maxj |b8j | = Op(√

logN/n) and maxj |akj | = Op(√

logN/n) for k = 1, 10 by simply using

central limit theorem.

Next, we show the following facts under the assumption that logN/√n → 0: for each s =

1, 2, · · · , r,

maxi=1,2,··· ,N

1

nN

∣∣∣ N∑i1=1

Lis

n∑j=1

[ei1jeij − E [ei1jeij ]]∣∣∣ = op(n

−1/2), and (A.9)

maxi=1,2,··· ,N

1

n2N

N∑i1=1

( n∑j=1

[ei1jeij − E [ei1jeij ]])2

= op(n−1/2). (A.10)

To prove (A.9), we only need to show maxi1nN

∣∣∑i1 6=i

∑nj=1 Lisei1jeij

∣∣ = op(n−1/2) as the remaining

term is op(n−1/2) because of the independence. This approximation is proven by the union bound


and boundedness of L: for ∀ε > 0

limn,N→∞

P

√n maxi=1,2,··· ,N

1

nN

∣∣∑i1 6=i

n∑j=1

Lisei1jeij∣∣ > ε

≤ limn,N→∞

2N · P

C√nN

∑i1 6=1

n∑j=1

ei1je1j > ε

= limn,N→∞

2N · P

1√n

n∑j=1

e1j

( 1√N − 1

∑i1 6=1

ei1j)>

ε

C

N√N − 1

≤ limn,N→∞

2N · E

( 1√n

n∑j=1

e1j

( 1√N − 1

∑i1 6=1

ei1j))4

/( εC

N√N − 1

)4

= 0

To see why the last inequality holds, (N −1)−1/2∑i1 6=1 ei1j ∼ N (0, 1) is independent from e1j , thus

the fourth moment of n−1/2∑nj=1 e1j

((N − 1)−1/2

∑i1 6=1 ei1j

)is bounded which enables us to use

the Markov inequality. To prove (A.10), we start with the same union bound as for (A.9),

limn,N→∞

P

√n maxi=1,2,··· ,N

1

n2N

∑i1 6=i

( n∑j=1

ei1jeij)2> ε

≤ limn,N→∞

N · P

√nn2N

N∑i1=2

( n∑j=1

ei1je1j

)2> ε

≤ limn,N→∞

2N2 · P

1

n

n∑j=1

e2je1j >√εn−1/4

≤ limn,N→∞

2N2 exp(−√nε2) = 0

where ε2 is some positive constant. The last inequality holds as by using Lemma A.0.1. We can

use the Bernstein inequality for sub-Gaussian variables to bound the tail probability. The last limit

holds as we assume logN/√n→ 0.

Equation (A.9) directly implies that

maxi=1,··· ,N

∣∣∣H( N∑i1=1

LTi·1

n

n∑j=1

[ei1jeij − E(ei1jeij)])∣∣∣ = op(n

−1/2)

as H = Op(N−1). Using (A.10) and N−1

∑i ‖Li·−Li·‖22 = Op(n

−1) from Theorem 2.1.3, we get by


using the Cauchy-Schwartz inequality:

maxi=1,··· ,N

∣∣∣H( N∑i1=1

(Li· − Li·)T1

n

n∑j=1


−1).

By writing Li· = Li· − Li· + Li· and using boundedness of both σi and σi,

maxi=1,··· ,N

∣∣∣H( N∑i1=1

σi1σiσ2i1

LTi·1

n

n∑j=1


−1/2) (A.11)

which indicates that maxj |a9j | = op(n−1/2).

To bound the remaining terms, we use the fact that

maxi=1,··· ,N

‖Li·‖2 = Op(1). (A.12)

To see this, first notice that because of boundedness of σi and σi and the fact that H = Op(N−1),

we have maxj |b10,j | = Op(N−1 maxi |Li·|). Combining the previous results on (A.1), we have

maxi |Li· − Li·| = Op(√

logN/n) + op(maxi |Li·|) which indicates that maxi |Li·| = Op(1). Thus,

maxi |a8i| = op(maxi |σ2i −σ2

i |) is negligible and maxi |Li·−Li·| = Op(√

logN/n)+op(maxi |σ2i −σ2

i |).The latter conclusion also indicates that maxi |a2i| = Op(

√logN/n) + op(maxi |σ2

i − σ2i |). As a

consequence, the second claim in (2.8) holds.

Finally, to prove (2.9), we actually have already shown that maxi |Li· − Li· − b8i| = op(n−1/2).

Then,

maxi=1,2,··· ,N

∣∣∣Li· − Li· − 1

n

n∑j=1

σieijFT·j

∣∣∣≤ maxi=1,2,··· ,N

∣∣∣Li· − Li· − b8i∣∣∣+ maxi=1,2,··· ,N

∣∣∣b8i − 1

n

n∑j=1

σieijFT·j

∣∣∣≤op(n−1/2) + ‖HLT Σ−1(L− L)‖F max

i=1,2,··· ,N

∣∣∣ 1n

n∑j=1

σieijFT·j

∣∣∣ = op(n−1/2).

Thus, (2.9) holds.

Bibliography

S. C. Ahn and A. R. Horenstein. Eigenvalue ratio test for the number of factors. Econometrica, 81

(3):1203–1227, 2013.

L. Alessi, M. Barigozzi, and M. Capasso. Improved penalization for determining the number of

factors in approximate factor models. Statistics & Probability Letters, 80(23):1806–1813, 2010.

Y. Amemiya, W. A. Fuller, and S. G. Pantula. The asymptotic distributions of some estimators for

a factor analysis model. Journal of Multivariate Analysis, 22(1):51–64, 1987.

D. Amengual and M. W. Watson. Consistent estimation of the number of dynamic factors in a large

n and t panel. Journal of Business & Economic Statistics, 25(1):91–96, 2007.

J. C. Anderson and D. W. Gerbing. Structural equation modeling in practice: A review and recom-

mended two-step approach. Psychological bulletin, 103(3):411, 1988.

T. W. Anderson and H. Rubin. Statistical inference in factor analysis. In Proceedings of the third

Berkeley symposium on mathematical statistics and probability, volume 5, 1956.

J. Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171, 2003.

J. Bai and K. Li. Statistical analysis of factor models of high dimension. The Annals of Statistics,

40(1):436–465, 2012a.

J. Bai and K. Li. Supplement to “statistical analysis of factor models of high dimension.”. 2012b.

J. Bai and K. Li. Maximum likelihood estimation and inference for approximate factor models of

high dimension. The Review of Economics and Statistics, 98:298–309, 2016.

J. Bai and S. Ng. Determining the number of factors in approximate factor models. Econometrica,

70(1):191–221, 2002.

J. Bai and S. Ng. Confidence intervals for diffusion index forecasts and inference for factor-augmented

regressions. Econometrica, 74(4):1133–1150, 2006.

97

BIBLIOGRAPHY 98

J. Bai and S. Ng. Principal components estimation and identification of static factors. Journal of

Econometrics, 176(1):18–29, 2013.

M. S. Bartlett. Tests of significance in factor analysis. British Journal of Statistical Psychology, 3

(2):77–85, 1950.

A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via

conic programming. Biometrika, 98(4):791–806, 2011.

F. Benaych-Georges and N. Raj Rao. The singular values and vectors of low rank perturbations of

large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135, 2012.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical

learning via the alternating direction method of multipliers. Foundations and Trends R© in Machine

Learning, 3(1):1–122, 2011.

A. Buja and N. Eyuboglu. Remarks on parallel analysis. Multivariate behavioral research, 27(4):

509–540, 1992.

J.-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion.

SIAM Journal on Optimization, 20(4):1956–1982, 2010.

R. Caruana, S. Lawrence, and L. Giles. Overfitting in neural nets: Backpropagation, conjugate gra-

dient, and early stopping. In Advances in Neural Information Processing Systems 13: Proceedings

of the 2000 Conference, volume 13, pages 402–408. MIT Press, 2001.

C. M. Carvalho, J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. High-dimensional sparse

factor modeling: applications in gene expression genomics. Journal of the American Statistical

Association, 2012.

R. B. Cattell. The scree test for the number of factors. Multivariate behavioral research, 1(2):

245–276, 1966.

R. B. Cattell and S. Vogelmann. A comprehensive trial of the scree and KG criteria for determining

the number of factors. Multivariate Behavioral Research, 12(3):289–325, 1977.

A. Craig, O. Cloarec, E. Holmes, J. K. Nicholson, and J. C. Lindon. Scaling and normalization effects

in nmr spectroscopic metabonomic data sets. Analytical Chemistry, 78(7):2262–2267, 2006.

J. Fan, Y. Liao, and M. Mincheva. Large covariance estimation by thresholding principal orthogonal

complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):

603–680, 2013.

BIBLIOGRAPHY 99

T. L. Fare, E. M. Coffey, H. Dai, Y. D. He, D. A. Kessler, K. A. Kilian, J. E. Koch, E. LeProust,

M. J. Marton, M. R. Meyer, et al. Effects of atmospheric ozone on microarray data quality.

Analytical chemistry, 75(17):4672–4675, 2003.

E. Fishler, M. Grosmann, and H. Messer. Detection of signals by information theoretic criteria:

general asymptotic performance analysis. IEEE Transactions on Signal Processing, 50(5):1027–

1036, 2002.

H. E. Fleming. Equivalence of regularization and truncated iteration in the solution of ill-posed

image reconstruction problems. Linear Algebra and its applications, 130:133–150, 1990.

M. Forni, M. Hallin, M. Lippi, and L. Reichlin. The generalized dynamic-factor model: Identification

and estimation. Review of Economics and statistics, 82(4):540–554, 2000.

J. Gagnon-Bartsch, L. Jacob, and T. Speed. Removing unwanted variation from high dimensional

data with negative controls. Technical report, Technical Report 820, Department of Statistics,

University of California, Berkeley, 2013.

J. A. Gagnon-Bartsch and T. P. Speed. Using control genes to correct for unwanted variation in

microarray data. Biostatistics, 13(3):539–552, 2012.

A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein,

and P. O. Brown. Genomic expression programs in the response of yeast cells to environmental

changes. Molecular biology of the cell, 11(12):4241–4257, 2000.

M. Gavish and D. L. Donoho. Optimal shrinkage of singular values. arXiv preprint arXiv:1405.7511,

2014.

M. Hallin and R. Liska. The generalized dynamic factor model: determining the number of factors.

Journal of the American Statistical Association, 102(478):603–617, 2007.

T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learning. Springer, 2009.

J. L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):

179–185, 1965.

R. H. Hoyle. Confirmatory factor analysis. Handbook of applied multivariate statistics and mathe-

matical modeling, pages 465–497, 2000.

R. Hubbard and S. J. Allen. An empirical comparison of alternative methods for principal component

extraction. Journal of Business Research, 15(2):173–190, 1987.

P. J. Huber. Robust statistics. Springer, 2011.

I. Jolliffe. Principal component analysis. New York: Springer-Verlag, 1986.

BIBLIOGRAPHY 100

H. F. Kaiser. The application of electronic computers to factor analysis. Educational and psycho-

logical measurement, 20(1):141–151, 1960.

S. Kritchman and B. Nadler. Determining the number of components in a factor model from limited

noisy data. Chemometrics and Intelligent Laboratory Systems, 94(1):19–32, 2008.

S. Kritchman and B. Nadler. Non-parametric detection of the number of signals: Hypothesis testing

and random matrix theory. Signal Processing, IEEE Transactions on, 57(10):3930–3941, 2009.

C. Lam and Q. Yao. Factor modeling for high-dimensional time series: inference for the number of

factors. The Annals of Statistics, 40(2):694–726, 2012.

W. Lan and L. Du. A factor-adjusted multiple testing procedure with application to mutual fund

selection. arXiv:1407.5515, 2014.

R. M. Larsen. Lanczos bidiagonalization with partial reorthogonalization. DAIMI Report Series, 27

(537), 1998.

D. N. Lawley. Vi. the estimation of factor loadings by the method of maximum likelihood. Proceedings

of the Royal Society of Edinburgh, 60(01):64–82, 1940.

D. N. Lawley. Tests of significance for the latent roots of covariance and correlation matrices.

Biometrika, 43(1/2):128–136, 1956.

C. Lazar, S. Meganck, J. Taminau, D. Steenhoff, A. Coletta, C. Molter, D. Y. Weiss-Solıs, R. Duque,

H. Bersini, and A. Nowe. Batch effect removal methods for microarray gene expression data

integration: a survey. Briefings in bioinformatics, 14(4):469–490, 2013.

J. T. Leek and J. D. Storey. A general framework for multiple testing dependence. Proceedings of

the National Academy of Sciences, 105(48):18718–18723, 2008a.

J. T. Leek and J. D. Storey. A general framework for multiple testing dependence. Proceedings of

the National Academy of Sciences, 105(48):18718–18723, 2008b.

D. W. Lin, I. M. Coleman, S. Hawley, C. Y. Huang, R. Dumpit, D. Gifford, P. Kezele, H. Hung,

B. S. Knudsen, A. R. Kristal, et al. Influence of surgical manipulation on prostate gene expression:

implications for molecular correlates of treatment effects and disease prognosis. Journal of clinical

oncology, 24(23):3763–3770, 2006.

H. Martens. Factor analysis of chemical mixtures: Non-negative factor solutions for spectra of

cereal amino acids11presented at the international conference on computers and optimization in

analytical chemistry, amsterdam, april 1978. Analytica Chimica Acta, 112(4):423–442, 1979.

B. Nadler. Finite sample approximation results for principal component analysis: A matrix pertur-

bation approach. The Annals of Statistics, 36(6):2791–2817, 2008.

BIBLIOGRAPHY 101

B. Nadler. Nonparametric detection of signals by information theoretic criteria: performance analysis

and an improved estimator. Signal Processing, IEEE Transactions on, 58(5):2746–2756, 2010.

S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-

dimensional analysis of m-estimators with decomposable regularizers. pages 538–557, 2012.

A. Onatski. Determining the number of factors from empirical distribution of eigenvalues. The

Review of Economics and Statistics, 92(4):1004–1016, 2010.

A. Onatski. Asymptotics of the principal components estimator of large factor models with weakly

influential factors. Journal of Econometrics, 168(2):244–258, 2012.

A. Onatski. Asymptotic analysis of the squared estimation error in misspecified factor models.

Journal of Econometrics, 186(2):388–406, 2015.

A. B. Owen. A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443:59–72,

2007.

A. B. Owen and P. O. Perry. Bi-cross-validation of the SVD and the nonnegative matrix factorization.

The Annals of Applied Statistics, 3(2):564–594, 06 2009.

J. M. Paque, R. Browning, P. L. King, and P. Pianetta. Quantitative information from x-ray images

of geological materials. Proceedings of the XIIth International Congress for Electron Microscopy,

2:244–247, 1990.

N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):

127–239, 2014.

D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model.

Statistica Sinica, 17(4):1617–1642, 2007.

P. R. Peres-Neto, D. A. Jackson, and K. M. Somers. How many principal components? Stopping

rules for determining the number of non-trivial axes revisited. Computational Statistics & Data

Analysis, 49(4):974–997, 2005.

P. O. Perry. Cross-validation for unsupervised learning. arXiv preprint arXiv:0909.3052, 2009.

A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal

components analysis corrects for stratification in genome-wide association studies. Nature genetics,

38(8):904–909, 2006.

N. Raj Rao. Optshrink: An algorithm for improved low-rank signal matrix denoising by optimal,

data-driven singular value shrinkage. IEEE Transactions on Information Theory, 60(5):3002–3018,

2014.

BIBLIOGRAPHY 102

N. Raj Rao and A. Edelman. Sample eigenvalue based detection of high-dimensional signals in

white noise using relatively few samples. IEEE Transactions on Signal Processing, 56(7):2625–

2638, 2008.

D. F. Ransohoff. Bias as a threat to the validity of cancer molecular-marker research. Nature Reviews

Cancer, 5(2):142–149, 2005.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations

via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.

O. Reiersøl. On the identifiability of parameters in thurstone’s multiple factor analysis. Psychome-

trika, 15(2):121–149, 1950.

D. R. Rhodes and A. M. Chinnaiyan. Integrative analysis of the cancer transcriptome. Nature

genetics, 37:S31–S37, 2005.

O. Rivasplata. Subgaussian random variables: an expository note. Internet publication, PDF, 2012.

S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier.

The Journal of Machine Learning Research, 5:941–973, 2004.

S. N. Roy. On a heuristic method of test construction and its use in multivariate analysis. The

Annals of Mathematical Statistics, 24(2):220–238, 1953.

A. Schwartzman, R. F. Dougherty, and J. E. Taylor. False discovery rate analysis of brain diffusion

direction maps. The Annals of Applied Statistics, 2(1):153–175, 2008.

A. A. Shabalin and A. B. Nobel. Reconstruction of a low-rank matrix in the presence of Gaussian

noise. Journal of Multivariate Analysis, 118:67–76, 2013.

Y. She and A. B. Owen. Outlier detection using nonconvex penalized regression. Journal of the

American Statistical Association, 106(494):626–639, 2011.

H. Shen and J. Z. Huang. Sparse principal component analysis via regularized low rank matrix

approximation. Journal of multivariate analysis, 99(6):1015–1034, 2008.

P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription.

In Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on., pages 177–

180. IEEE, 2003.

C. Spearman. “ general intelligence,” objectively determined and measured. The American Journal

of Psychology, 15(2):201–292, 1904.

BIBLIOGRAPHY 103

Y. Sun, N. R. Zhang, and A. B. Owen. Multiple hypothesis testing adjusted for latent variables,

with an application to the agemap gene expression data. The Annals of Applied Statistics, 6(4):

1664–1688, 2012.

L. R. Tucker and C. Lewis. A reliability coefficient for maximum likelihood factor analysis. Psy-

chometrika, 38(1):1–10, 1973.

W. F. Velicer. Determining the number of components from the matrix of partial correlations.

Psychometrika, 41(3):321–327, 1976.

W. F. Velicer, C. A. Eaton, and J. L. Fava. Construct explication through factor or component

analysis: A review and evaluation of alternative procedures for determining the number of factors

or components. In Problems and solutions in human assessment, pages 41–71. Springer, 2000.

F. Wang and M. M. Wall. Generalized common spatial factor model. Biostatistics, 4(4):569–582,

2003.

S. Wang, G. Cui, and K. Li. Factor-augmented regression models with structural change. Economics

Letters, 130:124–127, 2015.

M. Wax and T. Kailath. Detection of signals by information theoretic criteria. IEEE Transactions

on Acoustics, Speech and Signal Processing, 33(2):387–392, 1985.

Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion by

a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4):

333–361, 2012.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to

sparse principal components and canonical correlation analysis. Biostatistics, page kxp008, 2009.

S. Wold. Cross-validatory estimation of the number of components in factor and principal compo-

nents models. Technometrics, 20(4):397–405, 1978.

A. M. Woodward, B. K. Alsberg, and D. B. Kell. The effect of heteroscedastic noise on the chemo-

metric modelling of frequency domain data. Chemometrics and intelligent laboratory systems, 40

(1):101–107, 1998.

I. Yamazaki, Z. Bai, H. Simon, L.-W. Wang, and K. Wu. Adaptive projection subspace dimension

for the thick-restart lanczos method. ACM Transactions on Mathematical Software (TOMS), 37

(3):27, 2010.

J. Yao, Z. Bai, and S. Zheng. Large Sample Covariance Matrices and High-Dimensional Data

Analysis. Number 39. Cambridge University Press, 2015.

BIBLIOGRAPHY 104

Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive

Approximation, 26(2):289–315, 2007.

T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. Annals of

Statistics, 33(4):1538–1579, 2005.

W. R. Zwick and W. F. Velicer. Comparison of five rules for determining the number of components

to retain. Psychological bulletin, 99(3):432–442, 1986.

FACTOR ANALYSIS FOR HIGH-DIMENSIONAL DATA A …owen/students/JingshuWangThesis.pdf · his...

Documents

Transcript of FACTOR ANALYSIS FOR HIGH-DIMENSIONAL DATA A …owen/students/JingshuWangThesis.pdf · his...