FACTOR ANALYSIS FOR HIGH-DIMENSIONAL DATA
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Jingshu Wang
July 2016
© Copyright by Jingshu Wang 2016
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Art B. Owen) Principal Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Wing Hong Wong)
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Guenther Walther)
Approved for the Stanford University Committee on Graduate Studies
Preface
This dissertation is an original intellectual product of the author, Jingshu Wang, supervised by Dr.
Art B. Owen.
A version of Chapter 3 has been published in Statistical Science (Art B. Owen and Jingshu
Wang, Volume 31, No. 1(2016), 119-139). I was the lead investigator, responsible for all major
areas of concept formation, data analysis, as well as manuscript composition. Art B. Owen is the
supervisory author on this project and was involved throughout the project in concept formation
and manuscript composition.
The work in Chapter 4 is unpublished and I was the lead investigator, responsible for all major
areas of concept formation, data analysis and mathematical proofs, as well as manuscript composition.
Art B. Owen is the supervisory author on this project and was involved throughout the project
in concept formation and manuscript composition.
The project in Chapter 5 is a joint work with Qingyuan Zhao, Trevor Hastie and Art B. Owen.
Qingyuan Zhao and I contributed equally to the work; I was responsible for all major areas
of modeling and mathematical proofs, as well as the majority of manuscript composition. Trevor
Hastie and Art B. Owen are the supervisory authors on this project and were involved throughout
the project in concept formation and manuscript composition.
Acknowledgments
First and foremost, I would like to thank my advisor, Art B. Owen. I would like to express my
deepest gratitude for his full support, expert guidance and encouragement throughout my study
and research. He taught me how to conduct independent research and to keep learning new things. He
is an impressive person with great passion and curiosity toward statistics and research. Without
his incredible patience and timely wisdom, my thesis work would not have gone so smoothly. In
addition, I express my appreciation to Dr. Guenther Walther and Dr. Wing Hong Wong for serving
on my reading committee. Their thoughtful questions and suggestions inspired me greatly
during the research. I would also like to thank Dr. Chiara Sabatti and Dr. Hua Tang for serving on my
oral defense committee. I have had very useful conversations on various research topics with Hua
and have gained great knowledge by attending Chiara’s group meetings on a regular basis.
I am very grateful to Dr. Persi Diaconis, Dr. David Siegmund, Dr. Iain Johnstone and Dr.
Emmanuel Candes for helping me with my coursework during my first year at Stanford. I
would also like to thank Dr. Lan Wu and Dr. Yuan Yao for being my advisors during my undergraduate
years at Peking University in China.
Thanks to my fellow graduate students in the Statistics Department at Stanford University.
Special thanks to my numerous friends who helped and accompanied me throughout this academic
exploration.
Finally, I would like to thank my parents and my boyfriend for their unconditional love and
support. They have given me great encouragement through the difficult days of writing this
dissertation.
This work was supported by the US National Science Foundation under grant DMS-1521145.
Contents
Preface iv
Acknowledgments v
1 Introduction 1
1.1 Forms of Factor Analysis Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Model assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Random factor score matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Non-random factor score matrix . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Model Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Identification for random factor score model . . . . . . . . . . . . . . . . . . . 5
1.3.2 Identification for non-random factor score model . . . . . . . . . . . . . . . . 7
2 Background 8
2.1 The maximum likelihood method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Random factor scores model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Non-random factor score model . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Estimating the factor scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Principal component analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Use of PCA in factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Estimating the number of factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Classical methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Methods for large matrices and strong factors . . . . . . . . . . . . . . . . . . 21
2.3.3 Methods for large matrices and weak factors . . . . . . . . . . . . . . . . . . 22
2.4 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Bi-cross-validation for factor analysis 25
3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Estimating X given the rank k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Bi-cross-validatory choice of r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Bi-cross-validation to estimate r∗ESA . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Choosing the size of the holdout Y00 . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Factor categories and test cases . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.2 Empirical properties of ESA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.3 Empirical properties of BCV . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Real data example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 An optimization-shrinkage hybrid method for factor analysis 48
4.1 A joint convex optimization algorithm POT . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.1 The objective function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.2 Connection with singular value soft-thresholding . . . . . . . . . . . . . . . . 49
4.1.3 Connection with square-root lasso . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Some heuristics of the method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 The theoretical scale of λ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 The bias in using the nuclear penalty . . . . . . . . . . . . . . . . . . . . . . 51
4.3 A hybrid method: POT-S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Wold-style cross-validatory choice of λ . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.5 Computation: an ADMM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.1 The ADMM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.2 Techniques to reduce computational cost . . . . . . . . . . . . . . . . . . . . . 58
4.6 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6.1 Compare the oracle performances . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6.2 Assess the accuracy in finding λ∗Opt . . . . . . . . . . . . . . . . . . . . . . 61
5 Confounder adjustment with factor analysis 68
5.1 The model and the algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.1 A statistical model for confounding factors . . . . . . . . . . . . . . . . . . . 69
5.1.2 Model identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.3 The two-step algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Statistical inference for β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 The negative control scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.2 The sparsity scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Extension to multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4.1 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6 Conclusions 88
A Proof 91
List of Tables
3.1 Six factor strength scenarios considered in our simulations. . . . . . . . . . . . . . . 35
3.2 ESA using six measurements. For each of Var(σ_i^2) = 0, 1 and 10, the average for every
measurement is the average over 10 × 6 × 100 = 6000 simulations, and the standard
deviation is the standard deviation of these 6000 simulations. . . . . . . . . . . . . . 36
3.3 Comparison of ESA results for various (N,n) pairs and number of strong factors in
the scenarios with Var[σ_i^2] = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Worst case REE values for each method of choosing k for white noise and two
heteroscedastic noise settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Comparison of REE and r for rank selection methods with various (N,n) pairs, and
scenarios. For each different scenario, the factors’ strengths are listed as the number of
“strong/useful/harmful/undetectable” factors. For each (N,n) pair, the first column
is the REE and the second column is k. Both values are averages over 100 simulations.
Var[σ_i^2] = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Like Table 3.5, but for larger γ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Assess the oracle error in estimating X using four measurements. For each of
Var[σ_i^2] = 0, 1 and 10, the average for every measurement is the average over
10 × 6 × 100 = 6000 simulations, and the standard deviation is the standard deviation
of these 6000 simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Assess the error in estimating Σ when the oracle estimate of X is achieved. For each
of Var[σ_i^2] = 0, 1 and 10, the average for every measurement is the average over
10 × 6 × 100 = 6000 simulations, and the standard deviation is the standard deviation
of these 6000 simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Four measurements comparing the oracle error in estimating X under various (N,n)
pairs and factor strength scenarios with Var(σ_i^2) = 1. Type-1 to Type-6 correspond to
the six scenarios in Table 3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Four measurements comparing the error in estimating Σ when the oracle error of X
is achieved under various (N,n) pairs and factor strength scenarios with Var(σ_i^2) = 1.
Type-1 to Type-6 correspond to the six scenarios in Table 3.1. . . . . . . . . . . . . 63
4.5 Comparison of REE and the rank of X with various (N,n) pairs and scenarios. For
each scenario, the factors’ strengths are listed as the number of “strong/useful/harmful/undetectable”
factors. For each (N,n) pair, the first column is the REE and the second column is the
rank of the estimated matrix. Both values are averages over 100 simulations. Var[σ_i^2] = 1. 65
4.6 Like Table 4.5, but for larger aspect ratios γ . . . . . . . . . . . . . . . . . . . . . . . 66
4.7 Comparison of REE_Σ for various (N,n) pairs and scenarios. For each scenario, the
factors’ strengths are listed as the number of “strong/useful/harmful/undetectable”
factors. The values are averages over 100 simulations. Var[σ_i^2] = 1. . . . . . . . . . 67
List of Figures
3.1 REE survival plots: the proportion of samples with REE exceeding the number on
the horizontal axis. Figures 3.1a–3.1c are for REE calculated using the method ESA.
Figure 3.1a shows all 6000 samples. Figure 3.1b shows only the 3000 simulations
of larger matrices of each aspect ratio. Figure 3.1c shows only the 3000 simulations
of smaller matrices. For comparison, Figure 3.1d is the REE plot for all samples,
with REE calculated using the method SVD. . . . . . . . . . . . . . . . . . . . . . . 40
3.2 The distribution of r for each factor strength case when the matrix size is 5000 × 100.
The y axis is r. Each image depicts 100 simulations with counts plotted in grey scale
(larger equals darker). For different scenarios, the factor strengths are listed as the
number of “strong/useful/harmful/undetectable” factors in the title of each subplot.
The true number of factors is always r = 8. The “Oracle” method corresponds to r∗ESA. . . . 41
3.3 BCV prediction error for the meteorite. The BCV partitions have been repeated 200
times. The solid red line is the average over all held-out blocks, with the cross marking
the minimum BCV error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Distribution patterns of the estimated factors. The first column has the four factors
found by ESA. The second column has the top five factors found by applying SVD on
the unscaled data. The third column has the top five factors found by applying SVD
on scaled data in which each element has been standardized. The values are plotted
in grey scale, and a darker color indicates a higher value. . . . . . . . . . . . . . . . . 45
3.5 Plots of the first two factors and the location clusters. The three plots of column (a)
are the scatter plots of pixels for the first two factors found by the three methods:
ESA, SVD on the original data and SVD on normalized data. The coloring shows
a k-means clustering result for 5 clusters. Column (b) has the five clustered regions
based on the first two factors of ESA. Column (c) has the five clustered regions based
on the first two factors of SVD on the original data after centering. The same color
represents the same cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1 REE survival plots for estimating X: the proportion of samples with REE exceeding
the number on the horizontal axis. Figure 4.1a shows all 6000 samples. Figure 4.1b
shows only the 3000 simulations of larger matrices of each aspect ratio. Figure 4.1c
shows only the 3000 simulations of smaller matrices. . . . . . . . . . . . . . . . . . . 64
5.1 Comparison of the performance of nine approaches (from left to right): naive regression
ignoring the confounders (Naive), IRW-SVA, negative control with finite-sample correction
(NC) in eq. (5.17), negative control with asymptotic oracle variance (NC-ASY) in eq. (5.18),
RUV-4, robust regression (LEAPP(RR)), robust regression with calibration (LEAPP(RR-MAD)),
LEAPP, and oracle regression which observes the confounders (Oracle). The error bars
are one standard deviation over 100 repeated simulations. The three dashed horizontal lines
from bottom to top are the nominal significance level, the FDR level and the oracle power, respectively. 86
Chapter 1
Introduction
Factor analysis is a statistical method that explains a large number of interrelated variables in terms
of a potentially small number of unobserved variables. From another point of view, it approximates
a matrix-shaped data set by a low-rank matrix via an explicit probabilistic linear model. Factor
analysis reduces the complexity of the data set and reveals its underlying structure.
Factor analysis is over a century old. In psychology, the factor model dates back at least to
Spearman (1904), who is sometimes credited with the invention of factor analysis. The technique
was later also applied to social science, economics, finance and marketing, signal processing,
bioinformatics, and other fields. The latent factors discovered by factor analysis make the observed
variables more understandable. Typically, factor analysis is classified into two types. One is
confirmatory factor analysis, which places pre-determined constraints on the factor loadings (for
example, the loading of the observed variable V1 on latent factor F1 is 0). The other is exploratory
factor analysis, which does not have such constraints.
More recently, factor analysis has also become a widely used dimension reduction tool for analyzing
large matrices and high-dimensional data. Factor analysis shares many similarities with low-rank
matrix approximation, which has applications in fields such as signal processing, collaborative filtering
and personalized learning. Compared with principal component analysis (PCA) or the singular
value decomposition (SVD), factor analysis assumes heteroscedastic noise for each variable, which is
a more reasonable assumption than constant noise variance in many applications. The challenge is
that in those data sets, the dimensionality is often comparable to or even larger than the sample size,
so new methodology and theoretical analysis for fitting the model need to be established.
A problem with factor analysis is that it is surprisingly difficult to choose the number of factors.
Even in traditional factor analysis problems, which have a small number of variables but a relatively
large sample size, there is no widely agreed-upon best-performing method (see for example Peres-Neto
et al. (2005)). Classical methods such as hypothesis testing based on likelihood ratios (Lawley, 1956)
or methods based on information-theoretic criteria (Wax and Kailath, 1985) assume homoscedastic
noise, and thus do not fit the heteroscedastic noise assumption of factor analysis directly. In addition,
since these classical methods assume an asymptotic regime with a growing number of observations
and a fixed number of variables, they do not perform well on the large matrices of modern applications,
where both dimensions are large. Modern methods developed assuming both dimensions
are large include modified information criterion methods from the econometrics community, which
assume strong factors, and random-matrix-theory-based methods, which assume weak factors and
homoscedastic noise.
The rest of this chapter includes a description of the mathematical model and assumptions of
factor analysis, and a discussion of model identifiability.
1.1 Forms of Factor Analysis Model
Let N denote the number of variables and n denote the sample size. Then the observation y_{ij} for
the ith variable and the jth sample is assumed to have the following decomposition:

y_{ij} = ∑_{k=1}^{r} l_{ik} f_{kj} + σ_i e_{ij},   (1.1)

where E = (e_{ij})_{N×n} is the noise matrix, F_k = (f_{k1}, f_{k2}, · · · , f_{kn})^T denotes the kth latent variable,
and L = (l_{ik})_{N×r} is called the factor loading matrix. Denote each observed variable as
Y_i = (y_{i1}, y_{i2}, · · · , y_{in})^T and the associated noise as E_i = (e_{i1}, e_{i2}, · · · , e_{in})^T; then the vector form of
(1.1) is

Y_i = ∑_{k=1}^{r} l_{ik} F_k + σ_i E_i.   (1.2)

This shows that all the observed variables can be explained by linear combinations of r
common factors. Usually, r is much smaller than N, so estimating the latent common factors
makes the data more interpretable. Let Y = (y_{ij})_{N×n} be the data matrix and F = (f_{kj})_{r×n} be the
factor score matrix; then the matrix form of (1.1) is

Y = L F + Σ^{1/2} E.   (1.3)

This has the interpretation that the data matrix can be expressed as a low-rank signal matrix
X = LF plus noise. Thus, the factor analysis model can be used whenever a low-rank approximation of the
data matrix is desired.
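As a concrete illustration, the matrix form (1.3) is easy to simulate. A minimal numpy sketch, where the dimensions N, n, r and the noise scales are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: N variables, n samples, r factors.
N, n, r = 50, 200, 3

L = rng.normal(size=(N, r))            # factor loading matrix, N x r
F = rng.normal(size=(r, n))            # factor score matrix,  r x n
sigma = rng.uniform(0.5, 2.0, size=N)  # heteroscedastic noise scales
E = rng.normal(size=(N, n))            # i.i.d. standard Gaussian noise

# Model (1.3): data = low-rank signal X = LF plus heteroscedastic noise.
X = L @ F
Y = X + sigma[:, None] * E             # Sigma^{1/2} E with Sigma diagonal

print(np.linalg.matrix_rank(X))        # prints 3: the signal has rank r
```

Note that each row of the noise matrix has its own scale σ_i, which is exactly the heteroscedasticity that distinguishes factor analysis from plain PCA/SVD on Y.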
1.2 Model assumptions
As discussed in Anderson and Rubin (1956), the factor score matrix F can be treated to be either
random or non-random.
1.2.1 Random factor score matrix
Usually, if we think that the columns of Y are randomly and independently drawn from the
population, we may prefer to assume that F is random to reduce the number of parameters to estimate.
Typically the following assumptions are made:
Assumption 1. For a random factor score model
(a) The factor scores F and the noise E are random, while the factor loading matrix L is non-random.
(b) F and E are independent: F ⊥⊥ E.
(c) For each latent variable k, the scores f_{kj}, j = 1, 2, · · · , n, are i.i.d. with E(f_{kj}) = μ_k. Also, we assume
Cov(F_{·j}) = Λ_F, where Λ_F ∈ R^{r×r} is some positive-semidefinite matrix.
(d) For each variable i, the noises e_{i1}, e_{i2}, · · · , e_{in} are i.i.d. with E(e_{ij}) = a_i and Var(e_{ij}) = 1.
Also, we assume Σ = diag(σ_1^2, σ_2^2, · · · , σ_N^2).
Let α_i = ∑_{k=1}^{r} l_{ik} μ_k + a_i; then

E(y_{ij}) = α_i,   Cov(Y_{·j}) = Σ_Y = L Λ_F L^T + Σ.   (1.4)

An equivalent way to write (1.1) is

y_{ij} = α_i + ∑_{k=1}^{r} l_{ik} f_{kj} + σ_i e_{ij}   (1.5)

with the assumptions E(f_{kj}) = 0 and E(e_{ij}) = 0. This form is more commonly found in the classical
factor analysis literature.
It is also often assumed that the entries of both F and E follow Gaussian distributions. The
advantage of the Gaussian assumption is that then only the first and second moments of the data
matter in estimation and inference. From (1.4), only α_i and L Λ_F L^T + Σ are then identifiable. We
will discuss the identification of the components of L Λ_F L^T + Σ in more detail in Section 1.3.
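The covariance identity in (1.4) can be checked by Monte Carlo. The following is a minimal sketch, assuming Gaussian factors and noise; the parameters (Λ_F, the noise variances, and the large sample size) are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, r, n = 5, 2, 200_000

L = rng.normal(size=(N, r))
Lambda_F = np.array([[1.0, 0.3], [0.3, 1.0]])   # Cov of factor scores
sigma2 = np.array([0.5, 1.0, 1.5, 2.0, 2.5])    # noise variances

# Theoretical covariance from (1.4): Sigma_Y = L Lambda_F L^T + Sigma.
Sigma_Y = L @ Lambda_F @ L.T + np.diag(sigma2)

# Draw n columns Y_j = L F_j + Sigma^{1/2} E_j, compare sample covariance.
F = rng.multivariate_normal(np.zeros(r), Lambda_F, size=n).T      # r x n
E = rng.normal(size=(N, n))
Y = L @ F + np.sqrt(sigma2)[:, None] * E

Sigma_hat = np.cov(Y)   # sample covariance over the n columns
print(np.max(np.abs(Sigma_hat - Sigma_Y)))   # small for large n
```

The entrywise discrepancy shrinks at the usual O(n^{-1/2}) Monte Carlo rate, so the sample covariance recovers L Λ_F L^T + Σ but carries no information beyond it, which is why only that sum is identifiable.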
Sometimes, it’s more reasonable to assume that the factor scores of the individuals (columns of
F ) are also correlated. For example, the individuals can be time series or spatial points. This is a
common assumption when factor analysis is applied to economics or spatial analysis (Forni et al.,
2000; Wang and Wall, 2003).
1.2.2 Non-random factor score matrix
We may prefer to assume non-random F when the distributional assumptions on F would be too
complicated, or when estimating the low-rank matrix X = LF is easier than estimating the factors
themselves. For example, the low-rank constraint on X can be relaxed to a nuclear norm constraint,
which enables good optimization algorithms for solving the model (Chapter 4). Another situation is
when the samples are assumed to have an unknown clustering structure and samples within clusters
are correlated. In this scenario, a random factor score assumption would make the model too
complicated to solve, so non-random factor scores are preferable.
For a non-random F model, both the signal matrix X and the noise covariance matrix Σ are
parameters. Compared with the random F assumptions, the model now has more parameters to
estimate. However, when r ≪ min(N, n), there will still be enough data to compensate for the extra
degrees of freedom.
Assumption 2. For a non-random factor score model
(a) The noises E are random, while both the factor loading L and factor score F are non-random.
(b) For each variable i, the noises e_{i1}, e_{i2}, · · · , e_{in} are i.i.d. with E(e_{ij}) = a_i and Var(e_{ij}) = 1.
Also, we assume Σ = diag(σ_1^2, σ_2^2, · · · , σ_N^2).
N ).
As with the random factor score model, the non-random factor score model can also be rewritten as

Y = α 1_n^T + L F + Σ^{1/2} E   (1.6)

with the additional constraint that F · 1_n = 0.
Non-random and random factor score assumptions are closely related to each other. A random
factor score model becomes a non-random factor score model when we make inference conditional
on F . On the other hand, a non-random factor score model turns into a random factor score model
by adding a prior on F (similar to a random effects model in linear regression). We shall see that,
in general, the asymptotic results as N, n → ∞ are very similar for random and non-random
factor score models.
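The centering constraint F · 1_n = 0 in (1.6) is not restrictive: any factor score matrix can be centered, with the removed row means absorbed into the intercept α. A small numpy sketch with arbitrary illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, r = 20, 100, 2

L = rng.normal(size=(N, r))
F = rng.normal(size=(r, n))
alpha = rng.normal(size=N)

# Center the factor scores so that F @ 1_n = 0, as required by (1.6);
# the removed row means are absorbed into the intercept alpha.
F_mean = F.mean(axis=1, keepdims=True)
F_centered = F - F_mean
alpha_adj = alpha + (L @ F_mean).ravel()

ones = np.ones(n)
# The constraint holds exactly after centering ...
print(np.allclose(F_centered @ ones, 0.0))    # True
# ... and the mean-plus-signal part of (1.6) is unchanged.
lhs = np.outer(alpha, ones) + L @ F
rhs = np.outer(alpha_adj, ones) + L @ F_centered
print(np.allclose(lhs, rhs))                  # True
```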
For both models, extra constraints can be imposed on the factors (either the factor loadings
or the factor scores) depending on the application. A typical example is confirmatory factor analysis,
in which it is assumed that the loadings on specific entries are zero, reflecting the structure of the
relationships between observed variables and unobserved factors based on the researchers’ knowledge
of the problem (Hoyle, 2000; Anderson and Gerbing, 1988). Another popular constraint is sparsity
of the factor loadings and/or factor scores, a common device for improving the interpretability and
estimability of the factors when analyzing large matrices (Shen and Huang, 2008; Carvalho et al., 2012).
Yet another constraint is non-negativity. For example, in educational research the observed variables
can be scores on questions and the latent factors the underlying concepts. The factor scores are then
interpreted as understanding of certain concepts, which is more interpretable if non-negativity is
assumed (Martens, 1979; Smaragdis and Brown, 2003).
1.3 Model Identification
Model identification is generally a hard problem for factor analysis and has been discussed for a long
time. Here we list several classical results. In this section we assume that the parameters of any of
the random variables can be identified if and only if they can be identified from the first two moments
of the variable.
First, when the number of factors r is unknown, there is an identification problem for r itself: we
can always set r = N and Σ = 0 to obtain a trivially correct model. To avoid this, we define r as the
minimum integer for which the factor model (under either the random or the non-random factor
score assumption) exists. Since r = N always provides a correct model, the minimum exists, and this
definition automatically guarantees the uniqueness of r.
We assume normality for all the random variables, and discuss identification of the model
parameters for both random and non-random factor score models.
1.3.1 Identification for random factor score model
For random factor score models, more constraints are needed for the identification of each element
of the covariance matrix L Λ_F L^T + Σ. First, we show a sufficient condition discussed in Anderson
and Rubin (1956) for the identification of Φ = L Λ_F L^T and Σ given r.
Theorem 1.3.1. Under Assumption 1, a sufficient condition for identification of Σ and Φ = L Λ_F L^T
is that if any row of Φ is deleted, there remain two disjoint subsets of the rows of Φ of rank r.
When Σ can be uniquely defined, it is easy to see that L and Λ_F are still unidentifiable.
Indeed, given any invertible r × r matrix U, replacing L with L̃ = LU and Λ_F with Λ̃_F = U^{-1} Λ_F U^{-T}
leaves Σ_Y unchanged. One common constraint that makes L identifiable up to rotation is to assume
Λ_F = I_r. This amounts to assuming that the latent factors are uncorrelated with each other
(independent, under Gaussianity) and normalized. Further restrictions can be added to eliminate the
rotational uncertainty. For example, common assumptions are that either L^T L or L^T Σ^{-1} L is diagonal
with distinct entries, so that L can be uniquely identified via the eigenvalues and eigenvectors of Φ (if
diagonality of L^T L is assumed) or of Σ^{-1/2} Φ Σ^{-1/2} (if diagonality of L^T Σ^{-1} L is assumed). Usually,
the orthogonality and diagonality constraints mentioned above may not represent properties
of the actual factors; they are imposed for mathematical convenience.
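The rotational unidentifiability described above is easy to verify numerically: any invertible U leaves Σ_Y unchanged as long as Λ_F transforms contragrediently. A minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
N, r = 10, 3

L = rng.normal(size=(N, r))
Lambda_F = np.eye(r)                     # normalized, uncorrelated factors
sigma2 = rng.uniform(0.5, 2.0, size=N)
Sigma_Y = L @ Lambda_F @ L.T + np.diag(sigma2)

# Replace L by L~ = L U and Lambda_F by U^{-1} Lambda_F U^{-T}:
U = rng.normal(size=(r, r))              # invertible with probability 1
L_tilde = L @ U
Uinv = np.linalg.inv(U)
Lambda_tilde = Uinv @ Lambda_F @ Uinv.T
Sigma_Y_tilde = L_tilde @ Lambda_tilde @ L_tilde.T + np.diag(sigma2)

print(np.allclose(Sigma_Y, Sigma_Y_tilde))   # True: L is not identifiable
```

Since the Gaussian model only sees Σ_Y, every such U produces an observationally equivalent loading matrix, which is exactly why the normalization Λ_F = I_r plus a diagonality constraint is imposed.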
Assumption 3. Λ_F = I_r and either L^T L or L^T Σ^{-1} L is diagonal with distinct diagonal entries.
Let us now discuss the identification of L from Φ under a sparsity assumption. This is
equivalent to the unique determination of U in L̃ = LU, up to scaling and row/column permutation,
as the identity matrix. We state a simplified and generalized version of the classical result in Reiersøl
(1950). We define the s-sparse family of L (we require s ≥ r):

L(s) = { L ∈ R^{N×r} : L satisfies conditions (I) and (II) }.
Conditions (I) and (II) are stated as follows:
(I) L has rank r and each column of L contains at least s zeros.
(II) For each column m, let L_m be the matrix consisting of all rows of L which have a zero in the
mth column. For every m = 1, 2, · · · , r, L_m has rank r − 1.
These two conditions require all the factors to be sparse while, apart from the sparsity, L remains
of full rank. A necessary and sufficient condition for L in L(s) to be identifiable is then:
Theorem 1.3.2. Under Assumption 1, the normality assumption and the identification conditions
in Theorem 1.3.1, a necessary and sufficient condition for L in L(s) to be identifiable up to scaling
and row/column permutation is that if a sub-matrix L⋆ ∈ R^{s×r} of L has rank r − 1, then it
must be a sub-matrix of L_m for some m = 1, 2, · · · , r.
Remark. The original theorem in Reiersøl (1950) (also stated in Anderson and Rubin (1956))
stated a different result from Theorem 1.3.2. In Reiersøl (1950), s = r, and a
narrower parameter space L⋆(r) is assumed, with two further restrictions on L_m: (III) the rank of
L_m with any one row deleted is still r − 1, and (IV) the rank of L_m with any one of the other rows of L
added becomes r. As a consequence, the necessary and sufficient condition becomes that L does not
contain any other sub-matrices satisfying (II)–(IV). Theorem 1.3.2 defines a larger parameter space
L(s) for s = r, which is more meaningful for practical use. Also, Theorem 1.3.2 generalizes the
original result to any sparsity level s. Increasing s weakens the identification condition.
Proof. As discussed, we only need to show that for L̃ = LU with L, L̃ ∈ L(s), the condition in the
theorem is necessary and sufficient for U to have exactly one non-zero entry in each row and
each column.
Sufficiency: Since L̃ has rank r, U must be full rank and L = L̃U^{-1}. For any given m ∈ {1, 2, · · · , r},
as the rank of L_m is r − 1, there must exist an s × r sub-matrix L⋆ of L_m that has rank r − 1;
then L⋆U ∈ R^{s×r} also has rank r − 1. Since L⋆U is a sub-matrix of L̃, the condition implies that
one column of L⋆U, say the j_m th, must be all zeros. Let L⋆_{·(m)} be the sub-matrix of L⋆ excluding the mth
column. Since L⋆_{·(m)} ∈ R^{s×(r−1)} has rank r − 1, the entries of the j_m th column of U, except for the mth
row, must all be zero.
It is easy to show that if m_1 ≠ m_2, then j_{m_1} ≠ j_{m_2}; thus U has exactly one non-zero entry in each row
and each column. The sufficiency of the condition is proved.
Necessity: If the condition in the theorem is not satisfied, then there exists a sub-matrix L⋆ ∈ R^{s×r}
of L that has rank r − 1 but none of whose columns is all zeros. Thus there exists v ∈ R^r with at least
two non-zero entries such that L⋆v = 0. Without loss of generality, assume that the first entry of v is non-zero. Let
V_{−1} = (0 I_{r−1})^T ∈ R^{r×(r−1)} and U = (v V_{−1}) ∈ R^{r×r}. Then U has rank r, and it is easy to check
that LU ∈ L(s). Thus L is not identifiable. The necessity of the condition is proved.
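Conditions (I) and (II) are straightforward to check numerically via rank computations. A sketch of a hypothetical checker (the function name and the 6 × 2 example matrix are illustrative, not from the text):

```python
import numpy as np

def satisfies_sparse_conditions(L, s):
    """Check conditions (I)-(II) for membership in the s-sparse family L(s).

    (I)  L has rank r and each column of L has at least s zeros.
    (II) for each column m, the rows of L with a zero in column m form a
         matrix L_m of rank r - 1.
    """
    N, r = L.shape
    if np.linalg.matrix_rank(L) != r:
        return False
    for m in range(r):
        zero_rows = np.isclose(L[:, m], 0.0)
        if zero_rows.sum() < s:
            return False                     # column m is not s-sparse
        L_m = L[zero_rows, :]
        if np.linalg.matrix_rank(L_m) != r - 1:
            return False
    return True

# An illustrative 6 x 2 loading matrix whose columns each have 2 zeros.
L = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 1.0],
              [0.0, 3.0],
              [1.0, 1.0],
              [2.0, 5.0]])
print(satisfies_sparse_conditions(L, s=2))   # True
```

For this L, each L_m is a 2 × 2 matrix of rank 1 = r − 1, so L lies in L(2); a dense matrix such as all-ones fails condition (I) immediately.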
1.3.2 Identification for non-random factor score model
For non-random factor score models, we need constraints to identify the signal matrix X and noise
covariance Σ given r first, and then constraints to identify F and L in X = LF .
To identify X and Σ in the model Y = X + Σ^{1/2}E, we need to guarantee that if Y = X′ + Σ′^{1/2}E′
with X′ of rank r, Σ′ diagonal, and E′ a random matrix with i.i.d. standard Gaussian entries, then
X′ = X and Σ′ = Σ. First, if r = N, then the model is trivially unidentifiable, so we need
r < N. We give a necessary condition for identifying X and Σ; we find it hard to give a
sufficient condition.
Theorem 1.3.3. Assume r < N . Under Assumption 2 and a known r, a necessary condition for
identifying X and Σ is that if any row of X is removed, the remaining matrix is still of rank r.
Proof. Suppose that there exists a row j such that when the j-th row is removed, the remaining matrix X_{(j)·} has rank k < r. Let L_{(j)·} be the matrix L after removing the j-th row; then L_{(j)·} ∈ R^{(N−1)×r} also has rank k, so it is degenerate. Thus there exists a non-zero vector v ∈ R^r with L_{(j)·}v = 0. Since L is full rank, Lv ≠ 0, and Lv has only one non-zero entry (the j-th). Let X′ = X + LvE₁ᵀ, where E₁ ∈ R^n is a random vector with i.i.d. standard Gaussian entries, independent of E. Then X′ is still of rank r, and Σ′ = Σ − LvvᵀLᵀ is still diagonal. Thus X and Σ are not identifiable. This proves that the condition is necessary.
After identification of X and Σ, similar to the random factor score cases, we can impose more
restrictions for identification of the decomposition X = LF . The restrictions are similar to Theorem
1.3.1 and Theorem 1.3.2. For the rotation restriction, we can refer to the five scenarios listed in Bai
and Li (2012a). For the sparsity restriction, either a sparsity assumption on L or F will be sufficient
for identification.
Assumption 4. M_F = FFᵀ/n = I_r, and either LᵀL or LᵀΣ⁻¹L is diagonal with distinct diagonal entries.
Though we have discussed the sparse factor assumptions, we will focus on estimating the unrestricted factor analysis model in Chapters 2 to 5. The identification condition of sparse factors
in Theorem 1.3.2 is closely related to the model identification of confounder adjustment models
(Corollary 5.1.1) that we will discuss in Chapter 5.
Chapter 2
Background
In this chapter, we discuss some previous methods. We review the use of maximum likelihood estimation (MLE) and principal component analysis (PCA) for estimating the factor loadings/signal matrix and the noise variances given r, for both the random and non-random factor score models. We also summarize their properties under both the classical and high-dimensional asymptotic regimes, assuming that the number of factors r is fixed and known. Finally, we review previous methods for estimating r, which can be a much harder problem than estimating the factor loading parameters themselves.
2.1 The maximum likelihood method
2.1.1 Random factor score model
Under Assumption 1 and normality of the random variables, the log-likelihood of the data matrix Y in (1.5) can be written as

𝓛(Y; α, L, Σ_F, Σ) = −(Nn/2) log(2π) − (n/2) log det(LΣ_F Lᵀ + Σ) − (1/2) tr[(Y − α1_nᵀ)ᵀ(LΣ_F Lᵀ + Σ)⁻¹(Y − α1_nᵀ)]   (2.1)
where 1_n represents a vector of 1s of length n. It is easy to see immediately that the MLE of α is the sample mean

α̂_i = (1/n) Σ_{j=1}^n y_{ij}.
Let S = (1/n)(Y − α̂1ᵀ)(Y − α̂1ᵀ)ᵀ be the sample covariance; then (2.1) can be rewritten as

𝓛(Y; L, Σ_F, Σ) = −(Nn/2) log(2π) − (n/2) log det(LΣ_F Lᵀ + Σ) − (n/2) tr[(LΣ_F Lᵀ + Σ)⁻¹S].   (2.2)
Finding the global solution maximizing (2.2) for a given r is generally a hard problem. For the special case Σ = σ²I_N, there is an explicit solution for L, given exactly by the principal components. The result is proved in Anderson and Rubin (1956) using the estimating equations derived by Lawley (1940).
Theorem 2.1.1. Assume that Σ = σ²I_N and Assumption 3, in particular that LᵀL is diagonal. Let the eigenvalue decomposition of S be S = PΛPᵀ, where P ∈ R^{N×N} is orthogonal and Λ = diag(λ₁, λ₂, · · · , λ_N) with λ₁ ≥ λ₂ ≥ · · · ≥ λ_N. Let P_r ∈ R^{N×r} be the first r columns of P and Λ_r = diag(λ₁, λ₂, · · · , λ_r). Then the solutions maximizing (2.2) are

L̂ = P_r(Λ_r − σ̂²I_r)^{1/2},  σ̂² = (1/(N − r)) Σ_{k=r+1}^N λ_k.
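To make Theorem 2.1.1 concrete, here is a small numerical sketch (our own illustration, not code from the thesis): under homoscedastic noise, σ̂² is the average of the discarded eigenvalues and the loadings come from the top eigenvectors of S. All variable names and simulated values are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, r, sigma2 = 10, 50000, 2, 0.5

# Loadings with disjoint supports, so that LᵀL is diagonal (Assumption 3).
L = np.zeros((N, r))
L[:5, 0], L[5:, 1] = 2.0, 1.5
Y = L @ rng.standard_normal((r, n)) + np.sqrt(sigma2) * rng.standard_normal((N, n))

# Sample covariance S = (Y - alpha 1ᵀ)(Y - alpha 1ᵀ)ᵀ / n.
Yc = Y - Y.mean(axis=1, keepdims=True)
S = Yc @ Yc.T / n

# Eigendecomposition with eigenvalues sorted in decreasing order.
lam, P = np.linalg.eigh(S)
lam, P = lam[::-1], P[:, ::-1]

# MLE under Sigma = sigma^2 I_N (Theorem 2.1.1).
sigma2_hat = lam[r:].mean()                       # (1/(N-r)) sum of lambda_k, k > r
L_hat = P[:, :r] * np.sqrt(lam[:r] - sigma2_hat)  # P_r (Lambda_r - sigma2_hat I_r)^{1/2}
```

With n = 50000 draws, σ̂² lands close to the true 0.5 and L̂L̂ᵀ recovers LLᵀ up to the usual sign ambiguity of eigenvectors.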
For a general diagonal Σ, it is hard to find a global maximum. One popular method is the EM algorithm proposed by Rubin and Thayer (1982). Assuming Σ_F = I_r, the joint log-likelihood of (Y, F) is

𝓛(Y, F; L, Σ) = −(Nn/2) log(2π) − (n/2) log det Σ − (1/2) tr[(Y − α1_nᵀ − LF)ᵀΣ⁻¹(Y − α1_nᵀ − LF)] − (rn/2) log(2π) − (1/2) tr(FᵀF).   (2.3)
For the E-step, we have

F_{·j} | Y_{·j} ∼ N(Lᵀ(LLᵀ + Σ)⁻¹(Y_{·j} − α), I_r − Lᵀ(LLᵀ + Σ)⁻¹L).
For the M-step, setting the derivative with respect to L to zero,

∂/∂L E_{F|Y; L^{(k)}, Σ^{(k)}, α}[𝓛(Y, F; L, Σ)] = ∂/∂L E_{F|Y; L^{(k)}, Σ^{(k)}, α}[tr((Y − α1_nᵀ)ᵀΣ⁻¹LF) − (1/2) tr(FᵀLᵀΣ⁻¹LF)] = E_{F|Y; L^{(k)}, Σ^{(k)}, α}[Σ⁻¹(Y − α1_nᵀ)Fᵀ − Σ⁻¹LFFᵀ] = 0,

which results in

L^{(k+1)} = (Y − α1_nᵀ) E_{F|Y; L^{(k)}, Σ^{(k)}, α}[Fᵀ] (E_{F|Y; L^{(k)}, Σ^{(k)}, α}[FFᵀ])⁻¹,

where E_{F|Y; L^{(k)}, Σ^{(k)}, α} denotes expectation under the conditional distribution of F given Y at the current parameter values.
Similarly, by taking derivatives with respect to the diagonal entries of Σ, we obtain Σ^{(k+1)} as the diagonal of the matrix

(1/n) E_{F|Y; L^{(k)}, Σ^{(k)}, α}[(Y − α1_nᵀ − LF)(Y − α1_nᵀ − LF)ᵀ]

with L = L^{(k+1)}.
Notice that the joint distribution of (Y, F) is not invariant to rotations of L, and the solution path L^{(k)} may oscillate even though the marginal likelihood keeps increasing. We can stop when the marginal likelihood or Σ̂ converges, and rotate the final estimate L̂ to satisfy the identification constraint.
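The E- and M-steps above can be sketched in a few lines. This is a minimal illustration of the Rubin–Thayer-style EM iteration under Σ_F = I_r, with our own variable names; it is not an implementation from the thesis.

```python
import numpy as np

def factor_em(Y, r, n_iter=500):
    # EM for Y_{.j} ~ N(alpha, L Lᵀ + Sigma), Sigma diagonal, Sigma_F = I_r (a sketch).
    N, n = Y.shape
    alpha = Y.mean(axis=1, keepdims=True)
    Yc = Y - alpha
    S_diag = (Yc ** 2).mean(axis=1)                       # diagonal of S
    L = np.linalg.svd(Yc, full_matrices=False)[0][:, :r]  # crude starting value
    Psi = S_diag.copy()                                   # current noise variances
    for _ in range(n_iter):
        # E-step: F_{.j} | Y_{.j} ~ N(G Lᵀ Sigma⁻¹ Yc_{.j}, G), G = (Lᵀ Sigma⁻¹ L + I_r)⁻¹.
        G = np.linalg.inv(L.T @ (L / Psi[:, None]) + np.eye(r))
        EF = G @ (L.T / Psi) @ Yc                         # E[F | Y], an r x n matrix
        SFF = n * G + EF @ EF.T                           # sum_j E[F_{.j} F_{.j}ᵀ | Y]
        # M-step: the closed-form updates derived above.
        L = (Yc @ EF.T) @ np.linalg.inv(SFF)
        Psi = S_diag - np.einsum('ik,ki->i', L, EF @ Yc.T) / n
    return alpha, L, Psi

rng = np.random.default_rng(1)
N, n, r = 8, 20000, 2
L0 = rng.standard_normal((N, r))
Psi0 = rng.uniform(0.5, 1.5, size=N)
Y = L0 @ rng.standard_normal((r, n)) + np.sqrt(Psi0)[:, None] * rng.standard_normal((N, n))
_, L_hat, Psi_hat = factor_em(Y, r)

# L is identified only up to rotation, so compare the implied covariances.
cov_err = np.abs(L_hat @ L_hat.T + np.diag(Psi_hat)
                 - (L0 @ L0.T + np.diag(Psi0))).max()
```

The marginal likelihood increases at every step even though L^{(k)} itself may wander between rotations, which is why the check compares L̂L̂ᵀ + Σ̂ rather than L̂.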
2.1.2 Non-random factor score model
When F is non-random and the noise is Gaussian, the log-likelihood of (1.6) becomes

𝓛(Y; L, F, Σ) = −(Nn/2) log(2π) − (n/2) log det Σ − (1/2) tr[(Y − α1_nᵀ − LF)ᵀΣ⁻¹(Y − α1_nᵀ − LF)].   (2.4)
However, this likelihood is ill-posed, since it can be driven to ∞ by setting σ_i → 0 for some i while making the i-th row of LF equal to Y_{i·} − α_i. There are two previous approaches to this problem. One, proposed in Anderson and Rubin (1956), uses the likelihood of S = (1/n)(Y − α̂1ᵀ)(Y − α̂1ᵀ)ᵀ, which under the non-random factor score assumption follows a non-central Wishart distribution with covariance Σ and mean matrix XXᵀ/n = LFFᵀLᵀ/n. The other is the “quasi-likelihood” (QMLE) of Bai and Li (2012a) (also in Amemiya et al. (1987)), which is essentially the likelihood of the random score model:
𝓛(Y; X, Σ) = −(Nn/2) log(2π) − (n/2) log det(XXᵀ/n + Σ) − (1/2) tr[(Y − α1_nᵀ)ᵀ(XXᵀ/n + Σ)⁻¹(Y − α1_nᵀ)]   (2.5)
where X = LF . Under Assumption 4 then XXT /n = LLT . One can as well using the EM algorithm
to solve (2.5) though F is actually non-random.
In Anderson and Rubin (1956), it is shown that the difference between the estimates L and Σ
from MLE of the non-central Wishart distribution and those from maximizing the quasi-likelihood
is uniformly o(1/√n) when N is fixed and n→∞ under Assumption 4.
2.1.3 Estimating the factor scores
We assume that either ΣF = Ir for the random factor score model or MF = Ir for the non-random
factor score model. For the random factor score model, the factor scores are random variables and
we have the posterior mean

E[F_{·j} | Y_{·j}] = Lᵀ(LLᵀ + Σ)⁻¹(Y_{·j} − α) = (LᵀΣ⁻¹L + I_r)⁻¹LᵀΣ⁻¹(Y_{·j} − α),

using the identity (LLᵀ + Σ)⁻¹ = Σ⁻¹ − Σ⁻¹L(LᵀΣ⁻¹L + I_r)⁻¹LᵀΣ⁻¹. Thus F should be estimated via a plug-in estimate of the posterior mean

F̂ = (L̂ᵀΣ̂⁻¹L̂ + I_r)⁻¹L̂ᵀΣ̂⁻¹(Y − α̂1_nᵀ)   (2.6)

where L̂, Σ̂ and α̂ are the MLE estimates from (2.1).
For the non-random factor score model, one can also use the GLS estimate of F given Σ̂ and L̂:

F̂ = (L̂ᵀΣ̂⁻¹L̂)⁻¹L̂ᵀΣ̂⁻¹(Y − α̂1_nᵀ).   (2.7)

The estimate (2.6) can also be used for the non-random factor score model, where it can be interpreted as a ridge regression estimator. From the properties of ridge regression, we know that when r ≪ n there is no noticeable difference between (2.6) and (2.7).
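A quick numerical comparison of (2.6) and (2.7) (our own sketch; the matrices are simulated, and the dimensions are assumptions of the example): with many variables, L̂ᵀΣ̂⁻¹L̂ dominates the ridge term I_r, so the two estimates essentially coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, r = 200, 5, 3
L = rng.standard_normal((N, r))
sig2 = rng.uniform(0.5, 2.0, size=N)                  # diagonal of Sigma
Yc = (L @ rng.standard_normal((r, n))
      + np.sqrt(sig2)[:, None] * rng.standard_normal((N, n)))

SiL = L / sig2[:, None]                               # Sigma^{-1} L
A = L.T @ SiL                                         # Lᵀ Sigma^{-1} L
F_ridge = np.linalg.solve(A + np.eye(r), SiL.T @ Yc)  # posterior mean (2.6)
F_gls = np.linalg.solve(A, SiL.T @ Yc)                # GLS estimate (2.7)

rel_diff = np.abs(F_ridge - F_gls).max() / np.abs(F_gls).max()
```

Here the eigenvalues of LᵀΣ⁻¹L grow like N, so the relative difference between the two score estimates is of order 1/N.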
2.1.4 Asymptotic properties
First, we discuss the asymptotic properties of the MLE under the classical regime, where N is fixed and n → ∞, for both the random and non-random factor score models. These results are again due to Anderson and Rubin (1956).
Theorem 2.1.2. Assume either Assumption 1 or Assumption 2, and that the identification conditions for L and Σ are satisfied, including Assumption 3 and Assumption 4. Let L̂ and Σ̂ be the MLE estimates from (2.1). If S → LLᵀ + Σ and √n(S − LLᵀ − Σ) has a limiting normal distribution, then L̂ → L and Σ̂ → Σ, and √n(L̂ − L) and √n(Σ̂ − Σ) have limiting normal distributions.
Notice that for the convergence in Theorem 2.1.2 to be meaningful, we explicitly assume that the asymptotic regime is N fixed and n → ∞. Also, though the likelihood is derived by assuming normality of the random factors, the requirements that S → LLᵀ + Σ and that √n(S − LLᵀ − Σ) has a limiting normal distribution do not need normality. For the random factor score model, we only need the distributions of F and E to satisfy a central limit theorem (finite second moments). For the non-random factor score model,

S − LLᵀ − Σ = (LFEᵀΣ^{1/2} + Σ^{1/2}EFᵀLᵀ)/n + Σ^{1/2}(EEᵀ/n − I_N)Σ^{1/2},

so we only require the distribution of E to satisfy a central limit theorem.
Amemiya et al. (1987) derived the limiting covariance matrices of √n(L̂ − L) and √n(Σ̂ − Σ) for both the random and non-random factor score models under normality; these have very complicated forms. However, both works consider only the classical asymptotic regime, which is appropriate only for small N, large n problems.
Next, we consider the asymptotic regime for high-dimensional data, where both N and n are large. In Bai and Li (2012a), the authors derived consistency and asymptotic normality of the MLE estimates under the assumption of strong factors in the regime N → ∞ and n → ∞. For strong factors we require LᵀL/N → M_L, where M_L ∈ R^{r×r} is positive definite. Since N → ∞ is allowed, they make the following additional assumptions on the noise variances and factor loadings besides Assumption 1 and Assumption 2:
Assumption 5. (a) The fourth moments of the noise entries are finite: E(e_{ij}⁴) ≤ C⁴ for some C < ∞, for all i = 1, · · · , N and j = 1, · · · , n.
(b) The noise variances are bounded: C⁻² ≤ σ_i² ≤ C² for all i = 1, 2, · · · , N.
(c) The factor loadings are large enough: as N → ∞, LᵀΣ⁻¹L/N → Q_L, where Q_L ∈ R^{r×r} is a positive definite matrix.
(d) The entries of the factor loadings are bounded: ‖L_{i·}‖₂ ≤ C for i = 1, 2, · · · , N.
(e) The MLE (or QMLE for the non-random factor score model) estimates of the noise variances are bounded: C⁻² ≤ σ̂_i² ≤ C² for i = 1, 2, · · · , N.
Then they have the following consistency and asymptotic normality results:
Theorem 2.1.3. Assume Assumption 1 (or Assumption 2), Assumption 3 (or Assumption 4) and Assumption 5, and in particular that LᵀΣ⁻¹L is diagonal. Let L̂ and Σ̂ be the MLE estimates from (2.1). When n, N → ∞, for each variable i,

L̂_{i·} − L_{i·} = O_p(n^{−1/2}),  σ̂_i² − σ_i² = O_p(n^{−1/2}),

where the convergence is in probability, and √n(L̂_{i·} − L_{i·}) and √n(σ̂_i² − σ_i²) have limiting normal distributions for any given i. For the non-random factor score model,

√n(L̂_{i·} − L_{i·}) →d N(0, σ_i²I_r),  √n(σ̂_i² − σ_i²) →d N(0, (2 + κ_i)σ_i⁴),

where κ_i is the excess kurtosis of e_{ij} (for Gaussian noise κ_i = 0).
where κi is the excess kurtosis of eij (for Gaussian noise κi = 0).
For the random factor score model, the limiting covariance of √n(σ̂_i² − σ_i²) stays the same, while that of √n(L̂_{i·} − L_{i·}) has a much more complicated form, which can be found in Section F of the appendix of Bai and Li (2012a).
The consistency in Theorem 2.1.3 holds for each i but may not hold uniformly over all i. Uniform consistency over all i can be derived by assuming that the random variables have exponential tails and imposing an extra condition on the relative growth of n and N.
Theorem 2.1.4. Under the assumptions of Theorem 2.1.3, assume further that the e_{ij} are sub-Gaussian random variables. If (log N)²/n → 0 as n, N → ∞, then

max_{i≤N} ‖L̂_{i·} − L_{i·}‖₂ = O_p(√(log N/n)),  max_{i≤N} |σ̂_i² − σ_i²| = O_p(√(log N/n)).   (2.8)

For the non-random factor score model,

max_{i=1,2,··· ,N} ‖L̂_{i·} − L_{i·} − (1/n) Σ_{j=1}^n σ_i e_{ij} F_{·j}ᵀ‖₂ = o_p(n^{−1/2}).   (2.9)
Proof. The proof is a modification of that of Bai and Li (2012a). It is very technical; see Appendix A.
In Bai and Li (2012a) the authors also showed consistency of the estimated factor scores, using either (2.6) or (2.7); the two estimates have the same limiting distribution.
Theorem 2.1.5. Under the assumptions of Theorem 2.1.3, and assuming N/n² → 0 as n, N → ∞, for the non-random factor score model,

√N(F̂_{·j} − F_{·j}) →d N(0, Q_L⁻¹).

Here Q_L = lim_{N→∞} LᵀΣ⁻¹L/N is defined in Assumption 5. For the random factor score model, when N/n → γ > 0, √N(F̂_{·j} − F_{·j}) is also asymptotically Gaussian, with a complicated limiting covariance matrix; the result is stated in Section F of the appendix of Bai and Li (2012a).
We should mention that the results in Bai and Li (2012a), though very impressive, are not fully satisfactory. All of them rest on Assumption 5(e), which is not guaranteed to hold for the MLE (or QMLE) estimates. However, these are currently the best results available. In Bai and Li (2016), the same authors generalized the results to approximate factor models, which allow weak correlations among the noise entries (and, for the random factor score model, between the noise and the factors).
2.2 Principal component analysis (PCA)
PCA is a very common technique for dimension reduction. It is closely related to factor analysis, and is often used as a solution (or at least an initial solution) for factor analysis in both classical and high-dimensional data analysis. In this section, we discuss the use of PCA (and its equivalent form, the SVD) in factor analysis and the asymptotic properties of PCA under the factor analysis assumptions. We consider three different asymptotic regimes: the classical regime with N fixed and n → ∞; the regime n, N → ∞ with strong factors (lim_{N→∞} LᵀL/N is a positive definite matrix), common in econometrics; and the regime n, N → ∞ with weak factors (lim_{N→∞} LᵀL is a finite positive definite matrix), studied in random matrix theory. We still assume that r is a known constant in each of these regimes.
2.2.1 Use of PCA in factor analysis
PCA finds linear combinations of the observed variables that maximize the sample variance. Let S = (Y − α̂1ᵀ)(Y − α̂1ᵀ)ᵀ/n as defined before, with eigenvalue decomposition S = PΛPᵀ, where Λ = diag(λ₁, λ₂, · · · , λ_N) is diagonal with λ₁ ≥ λ₂ ≥ · · · ≥ λ_N ≥ 0 and P is an orthogonal N × N matrix. The eigenvectors P_{·1}, · · · , P_{·N} are called loadings, and the rows of Pᵀ(Y − α̂1ᵀ) are called principal components (PCs). For more details on PCA, one can refer to Jolliffe (1986).
One can also derive the PCs and loadings from the singular value decomposition (SVD) of Y − α̂1ᵀ. Let Y − α̂1ᵀ = √n ÛD̂V̂ᵀ, where Û ∈ R^{N×min(N,n)}, V̂ ∈ R^{n×min(N,n)} and D̂ = diag(d̂₁, d̂₂, · · · , d̂_{min(N,n)}), with ÛᵀÛ = V̂ᵀV̂ = I_{min(N,n)} and d̂₁ ≥ d̂₂ ≥ · · · ≥ d̂_{min(N,n)}. The first m loadings are then the first m columns of Û, and the first m principal components are the first m columns of √n V̂D̂. Notice the identity d̂_k² = λ_k for k = 1, 2, · · · , min(N, n).
To use PCA in factor analysis, we essentially use the loadings of PCA to estimate the linear space of the factor loadings and the PCs to estimate the linear space of the factor scores. Let P_r ∈ R^{N×r} be the first r columns of P (which is the same as Û_r, the first r columns of Û), Λ_r = diag(λ₁, · · · , λ_r) and D̂_r = diag(d̂₁, · · · , d̂_r). Under the identification condition that either Σ_F = I_r for the random factor score model or M_F = I_r for the non-random factor score model, and that LᵀL is diagonal with decreasing diagonal entries, we have

L̂^pc = P_rΛ_r^{1/2} = Û_rD̂_r,  F̂^pc = √n V̂_rᵀ = (L̂^pcᵀL̂^pc)⁻¹L̂^pcᵀ(Y − α̂1ᵀ).   (2.10)

To estimate the noise variances, it is common to use

Σ̂^pc = diag((1/n)(Y − α̂1ᵀ − L̂^pcF̂^pc)(Y − α̂1ᵀ − L̂^pcF̂^pc)ᵀ).   (2.11)
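The SVD route to (2.10)–(2.11) can be checked numerically. A small sketch with invented dimensions (our own code); it computes the PCA estimates and verifies the identity d̂_k² = λ_k.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, r = 12, 4000, 2
L = np.zeros((N, r))
L[:6, 0], L[6:, 1] = 1.2, -0.8
Y = L @ rng.standard_normal((r, n)) + 0.3 * rng.standard_normal((N, n))

alpha = Y.mean(axis=1, keepdims=True)
Yc = Y - alpha
U, d, Vt = np.linalg.svd(Yc / np.sqrt(n), full_matrices=False)  # Yc = sqrt(n) U D Vᵀ

L_pc = U[:, :r] * d[:r]                        # L_pc = U_r D_r, as in (2.10)
F_pc = np.sqrt(n) * Vt[:r]                     # F_pc = sqrt(n) V_rᵀ
resid = Yc - L_pc @ F_pc
Sigma_pc = np.diag((resid ** 2).mean(axis=1))  # Sigma_pc from (2.11)

# d_k^2 equals the k-th eigenvalue lambda_k of the sample covariance S.
lam = np.sort(np.linalg.eigvalsh(Yc @ Yc.T / n))[::-1]
```

With a true noise standard deviation of 0.3, the diagonal of Σ̂^pc lands near 0.09, slightly below it because the rank-r fit absorbs part of the noise.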
2.2.2 Asymptotic properties
As promised, we discuss three different asymptotic regimes.
N fixed and n→∞
First, let us consider the regime where N is fixed and n → ∞. As we have discussed, the sample covariance S → LLᵀ + Σ in this regime, so P is a consistent estimator of the eigenvectors of LLᵀ + Σ. If the diagonal entries of Σ are not all equal, the space spanned by the first r columns of P will in general differ from the column space of L, so L̂^pc is not a consistent estimator of L, and F̂^pc and Σ̂^pc are also inconsistent.
Notice that L̂^pc in (2.10) is very similar to the MLE when Σ = σ²I_N in Theorem 2.1.1, but without the adjustment of Λ_r by subtracting σ̂²I_r. Theorem 2.1.2 shows that under the classical regime the MLE is consistent. Thus, even in the simplest case Σ = σ²I_N, L̂^pc is inconsistent for L: the limit of Λ_r retains the extra σ² that the MLE removes, so L̂^pc − L does not approach 0, while L̂ is consistent.
N,n→∞ with strong factors
It turns out that the PCA estimates become consistent when N, n → ∞ and the factors' strength grows with N. Bai (2003) and Bai and Ng (2013) derived consistency and asymptotic normality results for the PCA estimates under the same assumptions as for the MLE in Theorem 2.1.3.
Theorem 2.2.1. Assume all the assumptions of Theorem 2.1.3, except that now LᵀL instead of LᵀΣ⁻¹L is diagonal, with LᵀL/N → Σ_L. Then when n, N → ∞, L̂^pc_{i·} → L_{i·} at rate min(√n, N) for each i = 1, 2, · · · , N, and F̂^pc_{·j} → F_{·j} at rate min(√N, n) for each j = 1, · · · , n. Moreover, under the non-random factor score model, we have the following asymptotic normality:
1. If √n/N → 0 as N, n → ∞, then for each i,

√n(L̂^pc_{i·} − L_{i·}) →d N(0, σ_i²I_r).

2. If √N/n → 0 as N, n → ∞, then for each j,

√N(F̂^pc_{·j} − F_{·j}) →d N(0, Σ_L⁻¹QΣ_L⁻¹),

where Q = lim_{N→∞} LᵀΣL/N ∈ R^{r×r}.
The limiting covariances under the random factor score model have much more complicated forms; one can refer to Bai (2003). Bai (2003) and Bai and Ng (2013) actually prove these results under more general assumptions that allow the noise covariance Σ to change with n, which makes them more applicable to some econometrics applications.
Comparing the PCA results in Theorem 2.2.1 with the MLE results in Theorem 2.1.3, the asymptotic efficiency in estimating L is the same, while F̂^pc is less efficient than the MLE estimate F̂. The difference is exactly that between running OLS (PCA) instead of GLS (MLE) for solving F given the true L.
In Fan et al. (2013), the authors strengthened the consistency result of Theorem 2.2.1 to uniform consistency.
Assumption 6. There exist m_t and b_t for t = 1, 2 such that for any s > 0, i ≤ N and j ≤ n,

P[|e_{ij}| > s] ≤ exp(1 − (s/b₁)^{m₁}).

Also, under the non-random factor score model, assume that the entries of F are bounded, |F_{kj}| ≤ C for some constant C; under the random factor score model, assume that

P[|F_{kj}| > s] ≤ exp(1 − (s/b₂)^{m₂})

for any k ≤ r.
Theorem 2.2.2. Under the assumptions of Theorem 2.2.1 and Assumption 6, with n = o(N²) and log N = o(n^γ) where (6γ)⁻¹ = 3m₁⁻¹ + 1.5m₂⁻¹ + 1, we have

max_{i≤N} ‖L̂^pc_{i·} − L_{i·}‖ = O_p(√(1/N) + √(log N/n)), and

max_{j≤n} ‖F̂^pc_{·j} − F_{·j}‖ = O_p(√(1/n) + √(n^{1/2}/N)).
The original authors showed that Theorem 2.2.1 and Theorem 2.2.2 in fact hold under much weaker assumptions than Assumption 1 and Assumption 2: they also hold for approximate factor models, where the noise can be weakly correlated.
N,n→∞, N/n→ γ > 0 with weak factors
For high-dimensional data, there are in many cases weak factors for which it is inappropriate to assume that ‖L_{·k}‖₂² is O(N). For example, if factor k has sparse loadings, say ‖L_{·k}‖₀ = O(1), then ‖L_{·k}‖₂² = O(1) when ‖L_{·k}‖_∞ = O(1) (Witten et al., 2009). Sparse factors are very common in high-dimensional data; for example, the observed variables may be only locally correlated, with a block structure. Weak factors, however, are genuinely hard to estimate accurately. One reason is that even though the true factors are sparse, the singular vectors of the signal matrix X need not be sparse. Results in random matrix theory show that even for the homoscedastic noise model Σ = σ²I_N, PCA is not consistent. There is also a detection threshold: if some eigenvalue of LᵀL is below the threshold, there is no hope of recovering any of its information from spectral analysis.
There is a rich literature in random matrix theory (RMT) on the asymptotic behaviour of PCA (or SVD) estimates of weak factors, especially when Σ = σ²I_N. Under the identification condition that LᵀL is diagonal, let L = UD, where U ∈ R^{N×r}, UᵀU = I_r and D = diag(d₁, · · · , d_r). We impose the following assumptions for weak factors:
Assumption 7. 1. For each k = 1, 2, · · · , r, as n, N → ∞, d_k →a.s. ρ_k for some constant ρ_k < ∞. For simplicity, also assume that ρ₁ > ρ₂ > · · · > ρ_r > 0.
2. The noise entries have finite fourth moments: E[e_{ij}⁴] < ∞ for i = 1, 2, · · · , N and j = 1, 2, · · · , n.
It has then been shown that D̂ and Û are inconsistent estimates of D and U, so the PCA estimates L̂^pc and F̂^pc are inconsistent for L and F. There are many references: for the random factor score model, see Paul (2007); Yao et al. (2015); Nadler (2008); for the non-random factor score model, see Perry (2009); Benaych-Georges and Raj Rao (2012); Onatski (2012).
Theorem 2.2.3. Assume either Assumption 1 or Assumption 2, Assumption 7 and Σ = σ²I_N. When n, N → ∞ with N/n → γ > 0, we have, for k, k₁, k₂ = 1, 2, · · · , r:
1. For the sample singular values,

d̂_k →a.s. ρ̄_k, where ρ̄_k = σ√((ρ_k + 1/ρ_k)(ρ_k + γ/ρ_k)) if ρ_k > γ^{1/4}σ, and ρ̄_k = (1 + √γ)σ if ρ_k ≤ γ^{1/4}σ.

2. For the estimated loadings,

Û_{·k₁}ᵀU_{·k₂} →a.s. θ_k = √((ρ_k⁴ − γ)/(ρ_k⁴ + βρ_k²)) if k₁ = k₂ = k and ρ_k > γ^{1/4}σ, and Û_{·k₁}ᵀU_{·k₂} →a.s. 0 otherwise.

3. For the estimated factor scores,

F̂^pc_{k₁·}F_{k₂·}ᵀ/n →a.s. θ_k if k₁ = k₂ = k and ρ_k > γ^{1/4}σ, and 0 otherwise.

If we further assume that N/n = γ + o(n^{−1/2}) and d_k − ρ_k = o(n^{−1/2}) as n, N → ∞, then whenever ρ_k > γ^{1/4}σ, the quantities √n(d̂_k − ρ̄_k), √n(Û_{·k}ᵀU_{·k} − θ_k) and √N(F̂^pc_{k·}F_{k·}ᵀ/n − θ_k) all have limiting Gaussian distributions with mean 0.
The limiting variances of √n(d̂_k − ρ̄_k), √n(Û_{·k}ᵀU_{·k} − θ_k) and √N(F̂^pc_{k·}F_{k·}ᵀ/n − θ_k) differ between the random and non-random factor score models; their specific forms can be found in the above references.
For the heteroscedastic-noise factor analysis model, where Σ = diag(σ₁², · · · , σ_N²) has arbitrary diagonal entries, the problem is much more complicated, since the distribution of the noise matrix Σ^{1/2}E is no longer invariant under orthogonal transformations of the rows. One result we can obtain combines Onatski (2012) and Benaych-Georges and Raj Rao (2012) to give the limits of d̂_k, Û_{·k} and F̂^pc_{k·} under a random factor score model whose factor loading entries are also random (see Assumption 9 below). Let the cumulative distribution function (CDF) of the empirical distribution of the noise variances be

G_N(x) = (1/N) Σ_{i=1}^N 1{σ_i² ≤ x}.

Then we need the following assumption:
Assumption 8. When n, N → ∞, G_N(x) → G₀(x) for all x ∈ R, where the limiting CDF G₀(x) has bounded support [a₀, b₀] and the corresponding density g₀(x) satisfies min_{x∈(a₀,b₀)} g₀(x) > 0. Also, max_{i≤N} σ_i² → b₀ and min_{i≤N} σ_i² → a₀.
Based on the results of Onatski (2010, 2012), we know that under Assumption 8 the empirical distribution of the eigenvalues of (1/n)Σ^{1/2}EEᵀΣ^{1/2} converges to a limiting distribution with CDF G(x). If γ ≤ 1, G(x) has bounded support [a, b], and the largest and smallest eigenvalues of (1/n)Σ^{1/2}EEᵀΣ^{1/2} converge to b and a respectively. If γ > 1, then G(x) = (1/γ)G̃(x) + (1 − 1/γ)δ₀, where G̃(x) has bounded support [a, b] and the largest and smallest non-zero eigenvalues of (1/n)Σ^{1/2}EEᵀΣ^{1/2} converge to b and a respectively. To use Benaych-Georges and Raj Rao (2012), we assume that the factor loading matrix is also random and impose an extra condition on L under the random factor score model:
Assumption 9. For the factor loading matrix L = UD, √N · U ∈ R^{N×r} has i.i.d. entries with mean 0 and variance 1, and D = diag(d₁, · · · , d_r) is a deterministic diagonal matrix.
Based on the results of Benaych-Georges and Raj Rao (2012), define the D-transform of G as

D_G(z) = φ_G(z)[γφ_G(z) + (1 − γ)/z], where φ_G(z) = ∫ z/(z² − x) dG(x), for z > √b.

For a function f and x ∈ R, denote f(x⁺) = lim_{z↓x} f(z). In the theorem below, D_G⁻¹(·) denotes the functional inverse of D_G on (√b, ∞). We then have results analogous to Theorem 2.2.3.
Theorem 2.2.4. Under Assumption 1, Assumption 8 and Assumption 9, when n, N → ∞ and N/n → γ, and with b denoting the supremum of the support of G, we have:
1. For k = 1, 2, · · · , r,

d̂_k →a.s. ρ̄_k = D_G⁻¹(1/ρ_k²) if ρ_k > (D_G(√b⁺))^{−1/2}, and ρ̄_k = √b otherwise.

2. For k₁ = 1, 2, · · · , k₀ and k₂ = 1, 2, · · · , r, where k₀ = Σ_{k=1}^r 1{ρ_k > (D_G(√b⁺))^{−1/2}},

Û_{·k₁}ᵀU_{·k₂} →a.s. √(−2φ_G(ρ̄_k)/(ρ_k²D_G′(ρ̄_k))) if k₁ = k₂ = k, and 0 otherwise.

3. For the estimated factor scores, with k₁ and k₂ as in part 2,

F̂^pc_{k₁·}F_{k₂·}ᵀ/n →a.s. √(−2φ_Ḡ(ρ̄_k)/(ρ_k²D_Ḡ′(ρ̄_k))) if k₁ = k₂ = k, and 0 otherwise,

where Ḡ = γG + (1 − γ)δ₀ when γ ≤ 1 and ρ̄_k is as in part 1.
Because of the inconsistency of PCA even when Σ = σ²I_N, several improvements of the PCA estimates have been proposed to reduce the estimation error. One direction is to shrink d̂₁, d̂₂, · · · , d̂_r towards 0 while keeping the estimated eigenvectors (singular vectors) unchanged (Shabalin and Nobel, 2013). Based on the result of Gavish and Donoho (2014), define the PCA optimal shrinkage estimator as

L̂^sh = Û_rη(D̂_r),  F̂^sh = F̂^pc,   (2.12)

where η(D̂_r) = diag(η(d̂₁), · · · , η(d̂_r)). The shrinkage function η(·) is defined as

η(d) = (σ²/d)√((d²/σ² − γ − 1)² − 4γ) if d ≥ (1 + √γ)σ, and η(d) = 0 otherwise,

and is the function minimizing the asymptotic estimation error lim_{n,N→∞, N/n→γ} ‖X − L̂^shF̂^sh‖_F². In practice, when σ is unknown, Gavish and Donoho (2014) proposed a consistent estimator σ̂ based on the median of the singular values of Y. Raj Rao (2014) considered this optimal shrinkage of the sample singular values for a general noise covariance Σ, including the heteroscedastic factor analysis model, provided it satisfies the assumptions of Theorem 2.2.4.
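The shrinkage function η can be written directly. A small self-contained sketch (function name and test values are ours):

```python
import numpy as np

def eta(d, sigma, gamma):
    # Optimal singular-value shrinker of Gavish and Donoho (2014) for white
    # noise: eta(d) = (sigma^2/d) sqrt((d^2/sigma^2 - gamma - 1)^2 - 4 gamma)
    # above the bulk edge (1 + sqrt(gamma)) sigma, and 0 below it.
    d = np.asarray(d, dtype=float)
    out = np.zeros_like(d)
    keep = d >= (1.0 + np.sqrt(gamma)) * sigma
    x = d[keep]
    rad = (x ** 2 / sigma ** 2 - gamma - 1.0) ** 2 - 4.0 * gamma
    out[keep] = sigma ** 2 / x * np.sqrt(np.maximum(rad, 0.0))  # guard rounding
    return out

# Singular values below the bulk edge (here 1 + sqrt(0.5) ≈ 1.707) shrink to 0.
shrunk = eta(np.array([3.0, 1.2]), sigma=1.0, gamma=0.5)
```

Applying η to D̂_r while keeping Û_r and F̂^pc unchanged gives the estimator (2.12).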
2.3 Estimating the number of factors
Now we review methods for estimating r under the three different regimes discussed in the previous section.
2.3.1 Classical methods
For the classical problem, where N is relatively small compared with the sample size n, estimating the number of factors r is very hard. Many methods have been proposed for estimating the number of principal components, but very few work specifically for the factor analysis model, which has additive heteroscedastic noise not present in PCA. One method is based on likelihood ratio tests (Lawley, 1956; Bartlett, 1950; Anderson and Rubin, 1956). For a given r, define the null hypothesis H_{0r}: Φ = LLᵀ + Σ for some L ∈ R^{N×r} and diagonal matrix Σ. The calculation in Anderson and Rubin (1956) shows that the likelihood ratio test statistic using (2.1) is

U_r = n[log det Σ̂ + log det(I_r + L̂ᵀΣ̂⁻¹L̂) − log det S],

and under the asymptotic regime with N fixed and n → ∞, U_r follows a chi-square distribution with N(N − 1)/2 + r(r − 1)/2 − rN degrees of freedom. To estimate r, one can sequentially test H₀₀, H₀₁, · · · and stop at r̂ if H_{0r̂} is not rejected. However, this sequential testing method has no theoretical guarantees and has been shown to perform poorly in practice (Tucker and Lewis, 1973; Velicer et al., 2000); for example, it is sensitive to the normality assumption and tends to underestimate r when n is large.
Based on empirical evaluations, some researchers (Velicer et al., 2000; Buja and Eyuboglu, 1992; Velicer, 1976) suggest that even if one believes the factor analysis model, one can first assume Σ = σ²I_N to determine the number of principal components r and then estimate the factors based on (1.3) given r. For estimating the number of principal components, popular methods include the scree test (Cattell, 1966; Cattell and Vogelmann, 1977), Kaiser's rule (Kaiser, 1960), parallel analysis (PA) (Horn, 1965; Buja and Eyuboglu, 1992), the minimum average partial test of Velicer (1976), and information-criterion-based methods such as minimum description length (MDL) (or the Bayesian Information Criterion (BIC)) and the Akaike Information Criterion (AIC) (Wax and Kailath, 1985; Fishler et al., 2002). To use these methods effectively for factor analysis, one essentially applies the various rules to the sample correlation matrix S. For example, the two simplest rules are the scree test, which plots the eigenvalues of S in decreasing order and determines r by identifying an “elbow” of the eigenvalue curve, and Kaiser's rule, which estimates r as the number of eigenvalues of S that exceed 1.
Among all these methods, there is a large amount of evidence (Zwick and Velicer, 1986; Hubbard and Allen, 1987; Velicer et al., 2000; Peres-Neto et al., 2005) that PA is one of the most accurate of the above classical methods for determining the number of factors. Parallel analysis compares the observed eigenvalues of the correlation matrix to those obtained in a Monte Carlo simulation. The first factor is retained if and only if its associated eigenvalue is larger than the 95th percentile of the simulated first eigenvalues. For k ≥ 2, the k-th factor is retained when the first k − 1 factors were retained and the observed k-th eigenvalue is larger than the 95th percentile of the simulated k-th eigenvalues. The permutation version of PA was introduced by Buja and Eyuboglu (1992): there the eigenvalues are simulated by applying independent uniform random permutations to each of the variables stored in Y. The earlier method of Horn (1965) resamples from a Gaussian distribution. Parallel analysis has been used recently in bioinformatics (Leek and Storey, 2008b; Sun et al., 2012). Though there are no theoretical results guaranteeing the accuracy of PA, it performs very well in practice.
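The permutation version of PA described above is straightforward to code. This is a covariance-scale sketch with our own names and simulated data (implementations often work with the correlation matrix instead):

```python
import numpy as np

def parallel_analysis(Y, n_perm=99, q=0.95, seed=None):
    # Permutation PA (Buja and Eyuboglu, 1992): permute each variable (row)
    # independently to destroy the factor structure; keep factor k if the k-th
    # observed eigenvalue beats the q-quantile of the permuted k-th eigenvalues
    # and all earlier factors were kept.
    rng = np.random.default_rng(seed)
    N, n = Y.shape
    Yc = Y - Y.mean(axis=1, keepdims=True)
    obs = np.sort(np.linalg.eigvalsh(Yc @ Yc.T / n))[::-1]
    null = np.empty((n_perm, N))
    for b in range(n_perm):
        Yp = np.vstack([rng.permutation(row) for row in Yc])
        null[b] = np.sort(np.linalg.eigvalsh(Yp @ Yp.T / n))[::-1]
    thresh = np.quantile(null, q, axis=0)
    k = 0
    while k < N and obs[k] > thresh[k]:
        k += 1
    return k

rng = np.random.default_rng(4)
N, n = 30, 500
L = np.zeros((N, 3))
L[:10, 0] = L[10:20, 1] = L[20:, 2] = 2.0   # three strong block factors
Y = L @ rng.standard_normal((3, n)) + rng.standard_normal((N, n))
r_hat = parallel_analysis(Y, n_perm=50, seed=0)
```

With three strong block factors, the observed top three eigenvalues are far above the permutation threshold while the fourth falls below it, so PA recovers r = 3.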
2.3.2 Methods for large matrices and strong factors
This collection of methods is designed for the asymptotic regime where both n, N → ∞ while r stays fixed. Again, for strong factors it is assumed that LᵀL/N → Σ_L. Under this regime and the strong factor assumption, it is theoretically easy to find a consistent estimator of r, since the first r eigenvalues of LLᵀ + Σ explode as n, N → ∞.
Some of the most popular methods for estimating the number of factors in this setting are based on the information criteria developed by Bai and Ng (2002). Define

V(k) = (1/(Nn)) ‖Y − L̂^pc_k F̂^pc_k‖_F²   (2.13)

where L̂^pc_k and F̂^pc_k are defined in (2.10) with the number of factors set to k. Let K ≥ r be some fixed known constant. Bai and Ng proposed a series of information criteria and showed the following result:
Theorem 2.3.1. Under the assumptions of Theorem 2.2.1, if g(N, n) → 0 and min(N, n)·g(N, n) → ∞ as N, n → ∞, then r̂ defined by

r̂ = argmin_{0≤k≤K} V(k) + k·g(N, n), or r̂ = argmin_{0≤k≤K} log V(k) + k·g(N, n),

where V(k) is defined in (2.13), is a consistent estimator: lim_{N,n→∞} P[r̂ = r] = 1.
They then proposed six specific forms of the criteria; one that performs among the best in their empirical evaluations is

r̂_{IC1} = argmin_{0≤k≤K} log V(k) + k((N + n)/(Nn)) log(Nn/(N + n)).   (2.14)
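Criterion (2.14) takes only a few lines once one notices that V(k) is the sum of the discarded squared singular values. A sketch with our own code and simulated data:

```python
import numpy as np

def ic1(Y, K):
    # Bai-Ng IC1 criterion (2.14): argmin over k of log V(k) + k * penalty.
    N, n = Y.shape
    Yc = Y - Y.mean(axis=1, keepdims=True)
    d2 = np.linalg.svd(Yc / np.sqrt(n), compute_uv=False) ** 2
    # V(k) = ||Yc - L_k F_k||_F^2 / (Nn) is the residual of the rank-k SVD,
    # i.e. the discarded squared singular values (divided by N on this scale).
    V = np.array([d2[k:].sum() / N for k in range(K + 1)])
    pen = (N + n) / (N * n) * np.log(N * n / (N + n))
    return int(np.argmin(np.log(V) + np.arange(K + 1) * pen))

rng = np.random.default_rng(5)
N, n, r_true = 50, 200, 3
Y = (rng.standard_normal((N, r_true)) @ rng.standard_normal((r_true, n))
     + rng.standard_normal((N, n)))
r_hat = ic1(Y, K=8)
```

With strong factors of this kind, the drop in log V(k) at each k ≤ 3 far exceeds the penalty, while the drop at k = 4 does not, so the criterion selects 3.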
The bound K is not specified and depends on the researcher's prior knowledge of the problem. Bai and Ng's criteria are known not to be robust in practice, so Alessi et al. (2010) proposed a modified version:

r̂*_{IC1} = argmin_{0≤k≤K} log V(k) + ck((N + n)/(Nn)) log(Nn/(N + n)),   (2.15)

where c is a tuning parameter determined adaptively from the data. To determine c, the authors used a stability principle, choosing the c that yields a stable r̂*_{IC1} across randomly sub-sampled rows and columns. This improvement can be hard to execute in practice, as a proper range for c must be given: a large enough c stably estimates r as 0, while a small enough c stably estimates a large r.
Onatski (2010) developed an estimator based on the difference of two adjacent eigenvalues (ED) of the sample covariance matrix. The estimator he proposed is

r̂_{ED} = max{k ≤ K : d̂_k² − d̂_{k+1}² ≥ δ}   (2.16)

where δ > 0 is some fixed number. Denote the ordered noise variances by σ_{(1)}² ≤ σ_{(2)}² ≤ · · · ≤ σ_{(N)}². Roughly speaking, the estimator is based on the result that if σ_{(i+1)}² − σ_{(i)}² → 0 for any i as N → ∞, then d̂_k² − d̂_{k+1}² → 0 for any k > r. An advantage of this estimator is that its consistency,

lim_{n,N→∞, N/n→γ>0} P[r̂_{ED} = r] = 1

for any fixed δ > 0, holds under a much weaker factor strength requirement: it suffices that the smallest eigenvalue of LᵀL tends to infinity. To optimize the performance of r̂_{ED}, he also gave an iterative procedure that determines δ adaptively from the data.
Another simple criterion was proposed in Ahn and Horenstein (2013). They proposed two estimators that determine the number of factors by simply maximizing the ratio of two adjacent eigenvalues of the sample covariance matrix. The same idea can also be found in Lam and Yao (2012); Lan and Du (2014). One specific form is

r̂_ER = argmax_{0≤k≤K} d_k²/d_{k+1}²,  (2.17)

with d_0² = Σ_{k=1}^{min(n,N)} d_k² / log min(n,N).
Besides the above criteria, there are further methods for estimating the number of factors in dynamic factor models (Forni et al., 2000; Amengual and Watson, 2007; Hallin and Liska, 2007). As we have mentioned, such dependence models are beyond the scope of this thesis.
Remark. To use r̂_IC1, r̂_ED and r̂_ER, we need to determine the upper bound K for r. There is no theoretical result to guide the choice of K. For practical usage, Onatski (2010) suggested trying several different K to see how r̂ changes. Ahn and Horenstein (2013) suggested using

K = min{ |{i ≥ 1 : d_i² ≥ Σ_{k=1}^{min(n,N)} d_k² / min(n,N)}|, 0.1·min(n,N) }.
2.3.3 Methods for large matrices and weak factors
In contrast to strong factors, for weak factors the asymptotic regime LLᵀ → Σ_L, instead of LLᵀ/N → Σ_L, where Σ_L is a positive definite matrix, is more appropriate. Based on our discussion in previous sections, even for homoscedastic noise Σ = σ²I_N, neither the PCA nor the MLE estimates of the factor loadings and scores are consistent. Moreover, Theorem 2.2.3 shows that there is a phase transition phenomenon in the limit: if ρ_k < γ^{1/4}σ, then the spectral analysis of the samples
would not contain any information about the kth principal component of the signal matrix. In other words, under the identification condition that LᵀL is diagonal, for n, N large enough, if some d_k < γ^{1/4}σ, then there would be little chance of detecting the kth factor using PCA or MLE (Kritchman and Nadler, 2009). For the general factor model with heteroscedastic noise, a phase transition phenomenon also exists. Thus, it would be impossible, and very likely also useless, to estimate the true number of factors.

For Σ = σ²I_N, we define r = |{k : d_k > γ^{1/4}σ}| as the number of detectable factors. One goal
is to estimate this number of detectable factors. Raj Rao and Edelman (2008) used an AIC type
information criterion method based on RMT. The criterion is based on the distribution of
t_k = (N − k) · ( Σ_{i=k+1}^{N} d_i⁴ ) / ( Σ_{i=k+1}^{N} d_i² )²,

which is asymptotically Gaussian when n, N → ∞ and N/n → γ > 0, for k = 0, 1, ..., min(n,N). The estimator they proposed is

r̂_RE = argmin_{0≤k≤min(N,n)} [ (1/(4γ_n²))·(N[t_k − (1 + γ_n)] − γ_n)² + 2(k + 1) ].  (2.18)
In Raj Rao and Edelman (2008), the authors conjectured that r̂_RE is a consistent estimator of r; however, Kritchman and Nadler (2009) proved that the conjecture is not true and that r̂_RE tends to underestimate r. Instead, Nadler (2010) proposed another modification of AIC which estimates r more accurately:

r̂_AIC′ = argmin_{0≤k≤min(N,n)} [ −L(Y; L̂, I_r, σ̂²I_N) + 2k(2N + 1 − k) ],  (2.19)
where L(·) is defined in (2.2) with L̂ and σ̂² derived in Theorem 2.1.1. For estimating r, Kritchman and Nadler (2008) also developed a consistent estimator based on a sequence of hypothesis tests connected with Roy's classical largest root test (Roy, 1953). It has the form

r̂_RMT = argmin_{1≤k≤min(N,n)} { d_k² < σ̂²(k)·(µ_{n,N−k} + s(α)·ξ_{n,N−k}) } − 1,  (2.20)

which is derived from the sequential tests H_{0k}: at most k − 1 signals versus H_{1k}: at least k signals. Here α is a significance level, and the values of µ_{n,N−k}, s(α) and ξ_{n,N−k} are derived using RMT. σ̂²(k) is estimated without bias via an iterative algorithm.
Instead of estimating r, one may prefer to estimate the number of useful factors: r* = argmin_k E[‖X̂_k − X‖_F], where X = LF is the signal matrix and X̂_k is a rank-k estimate of X. r* best fits the purpose if one wants an accurate estimate of the signal matrix. For Σ = σ²I_N and X̂_k^pc = L̂_k^pc F̂_k^pc, Perry (2009) has shown that under the assumptions of Theorem 2.2.3, when n, N → ∞ and N/n → γ > 0 we have

r* → |{k : ρ_k² ≥ µ*_F}|

where

µ*_F = σ²[ (1 + γ)/2 + √( ((1 + γ)/2)² + 3γ ) ].

The reason that r* ≤ r is that some factors are too weak to be estimated accurately, even though they are detectable. Thus we prefer to ignore them to increase the accuracy of our estimate. Perry (2009) proposed a bi-cross-validation (BCV) method to estimate r* that we will discuss in more detail in the next chapter.
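The limit above is easy to evaluate numerically. Below is a small sketch (function names are ours) that computes Perry's estimation threshold µ*_F and counts the factors that would remain useful:

```python
import numpy as np

def estimation_threshold(gamma, sigma2=1.0):
    """Perry's estimation threshold mu*_F for aspect ratio gamma = N/n."""
    return sigma2 * ((1 + gamma) / 2 + np.sqrt(((1 + gamma) / 2) ** 2 + 3 * gamma))

def n_useful(rho2, gamma, sigma2=1.0):
    """The limit of r* above: the number of factors with rho_k^2 >= mu*_F."""
    return int((np.asarray(rho2) >= estimation_threshold(gamma, sigma2)).sum())
```

For example, with γ = 1 and σ² = 1 the threshold is µ*_F = 3, so factors with ρ_k² of 5 and 3 are useful while one at 1 is not.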
2.4 Comments

Theorems 2.1.3, 2.2.1 and 2.2.3 hold for both random and non-random factor models. Thus, without additional assumptions, assuming the random or the non-random factor score model is equivalent. With additional assumptions, one of the two models can be more convenient to assume. For instance, if the latent variables have a time series structure, then it is more convenient to assume random factor scores. On the other hand, if the factor scores are assumed to be sparse or non-negative (especially for specific samples or at specific locations), then assuming non-random factor scores can be more natural.
For high-dimensional data and large matrices, where both n and N are large, we believe that many real data sets contain both weak and strong factors. The strong factors are uniformly influential on all variables, while weaker factors may only have large effects on a subset of the variables. However, there is not much theoretical work considering the presence of both strong and weak factors.
Apparently, for the general factor model with heteroscedastic noise, there are hardly any estimators of the number of factors (either r or r*), and few methods for estimating the factors and signal matrix are designed for large matrices in the presence of weak factors. In the next two chapters, we propose two methods for estimating the signal matrix X = LF and the noise variance Σ without knowing r. One method is based on maximum likelihood and BCV; the other combines optimization of a convex penalized loss function with the optimal shrinkage proposed by Gavish and Donoho (2014).
Chapter 3

Bi-cross-validation for factor analysis

3.1 Problem Formulation
Our data matrix is Y ∈ R^{N×n} with a row for each variable and a column for each sample. In the bioinformatics problems we have worked on, it is usual to have N > n or even N ≫ n, but this is not assumed. In a factor model, Y can be decomposed into a low rank signal matrix plus noise:

Y = X + Σ^{1/2}E = LF + Σ^{1/2}E,  (3.1)

where the low rank signal matrix X ∈ R^{N×n} is a product of factors L ∈ R^{N×r} and F ∈ R^{r×n}, both of rank r. The noise matrix E ∈ R^{N×n} has independent and identically distributed (IID) entries with mean 0 and variance 1. Each variable has its own noise variance, given by Σ = diag(σ₁², σ₂², ..., σ_N²).
The signal matrix X is a signal that we wish to recover despite the heteroscedastic noise.
The factor model is usually applied when we anticipate that r ≪ min(n,N). Then identifying those factors suggests possible data interpretations to guide further study. When the factors correspond to real world quantities there is no reason why they must be few in number, and then we should not insist on finding them all in our data, as some factors may be too weak to estimate. We should instead seek the relatively important ones: the factors that are strong enough to contribute most to the signal and to be accurately estimated.
We focus on the non-random factor score model, as we treat the signal matrix X as the parameter. As we have discussed previously, a random factor score model can be treated as a non-random factor score model by conditioning on F. Our goal is to recover X, seeking to minimize

Err_X(X̂) = E[‖X̂ − X‖²_F].  (3.2)
This criterion was used for factor models in Onatski (2015) and for truncated SVDs and nonnegative
matrix factorizations in Owen and Perry (2009). After recovering X, we can estimate the factor
loadings and scores using corresponding identification conditions.
Definition 3.1.1 (Oracle rank and estimate). Let M be a method that for each integer k ≥ 0 gives a rank k estimate X̂_M(k) of X using Y from model (3.1). The oracle rank for M is

r*_M = argmin_k ‖X̂_M(k) − X‖²_F,  (3.3)

and the corresponding oracle estimate of X is

X̂_opt^M = X̂_M(r*_M).  (3.4)

If all the factors are strong enough, then for a good method M, we anticipate that r*_M should equal the true number of factors r. With weak enough factors we will have r*_M < r.
Our algorithm has two steps. First we devise a method M that can effectively estimate X given the oracle rank r*_M. Then, with such a method in hand, we need a means to estimate r*_M. Section 3.2 describes our early stopping alternation (ESA) algorithm for finding X̂(k) for each k, which has the best performance compared with other methods given their own oracle ranks. Then Section 3.3 describes our BCV approach for estimating r*_ESA for the ESA algorithm.
3.2 Estimating X given the rank k
Here we consider how to estimate X using exactly k factors. This will be the inner loop for an
algorithm that tries various k. The goal in this section is to find a method that has good performance
with its oracle rank. We start with the likelihood function
L(Y; X, Σ) = −(Nn/2)·log(2π) − (n/2)·log det Σ + tr[−(1/2)·Σ^{-1}(Y − X)(Y − X)ᵀ],  (3.5)
which is similar to (2.4). If Σ were known it would be straightforward to estimate X using an SVD,
but Σ is unknown. Given an estimate of X it is straightforward to optimize the likelihood over
Σ. Thus, if we want to maximize (3.5), it is very natural to design an alternating algorithm that
iteratively estimates X given Σ and then estimates Σ given X. Specifically, define the truncated
SVD of a matrix Y as

Y(k) = √n·U(k)D(k)V(k)ᵀ,  (3.6)

where D(k) is the diagonal matrix of the k largest singular values of Y/√n, and U(k) and V(k) are the matrices of the corresponding singular vectors. The iterative algorithm starts from an initial estimate Σ̂(0) given by the sample variances:

Σ̂(0) = diag((Y − (1/n)·Y·1_{n×n})(Y − (1/n)·Y·1_{n×n})ᵀ).  (3.7)
Given an estimate Σ̂, the rank k estimate X̂ is based on the truncated SVD of the reweighted matrix Ỹ = Σ̂^{-1/2}Y:

X̂ = Σ̂^{1/2}·Ỹ(k).  (3.8)

Given an estimate X̂, the new variance estimate Σ̂ contains the mean squares of the residuals:

Σ̂ = (1/n)·diag[(Y − X̂)(Y − X̂)ᵀ].  (3.9)

Each of the above two steps can increase log L(X̂, Σ̂) but not decrease it. However, as we have discussed in Chapter 2, the likelihood (3.5) is ill-posed, so this alternating algorithm cannot work directly. Here we propose an even simpler early-stopping algorithm.
The main challenge in using (3.5) is to prevent any σ̂_i from approaching 0. One solution is to instead optimize the quasi-likelihood (2.5) by the EM algorithm. The other solution is to regularize Σ̂ to prevent σ̂_i → 0. One could model the σ_i as IID draws from some prior distribution. However, such a distribution must also avoid putting too much mass near zero. We believe that this transfers the singularity avoidance problem to the choice of hyperparameters in the σ distribution and does not really solve it. We have also found in trying it that even when the σ_i are really drawn from our prior, the algorithm still converged towards some zero estimates.
A related approach is to employ a penalized likelihood

L_reg(Y; λ, X̂, Σ̂) = −n·log det Σ̂ + tr[Σ̂^{-1}(Y − X̂)(Y − X̂)ᵀ] + λ·P(Σ̂),  (3.10)
where P penalizes small components σi. This approach has two challenges. It is hard to select a
penalty P that is strong enough to ensure boundedness of the likelihood, without introducing too
much bias. Additionally, it requires a choice of λ. Tuning λ by cross-validation within our bi-cross-
validation algorithm is unattractive. Also there is a risk that cross-validation might choose λ = 0
allowing one or more σi → 0.
We do not claim that the regularization methods cannot in the future be made to work. However,
we propose a much simpler approach that works surprisingly well. Our approach is to employ early
stopping. We start at (3.7) and iterate the pair (3.8) and (3.9) some number m of times and then
stop.
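The iteration (3.7)-(3.9) with early stopping can be sketched as follows. This is our own minimal rendering: the function name is an assumption, the initial estimate is taken as the rowwise sample variance of the centered data, and no safeguards against degenerate variances are included.

```python
import numpy as np

def esa(Y, k, m=3):
    """Early stopping alternation (ESA): iterate (3.8) and (3.9) starting
    from the sample variances (3.7), stopping after m steps (a sketch)."""
    N, n = Y.shape
    Yc = Y - Y.mean(axis=1, keepdims=True)
    sigma2 = (Yc ** 2).mean(axis=1)                 # (3.7): rowwise sample variance
    for _ in range(m):
        s = np.sqrt(sigma2)[:, None]
        U, d, Vt = np.linalg.svd(Y / s, full_matrices=False)
        X = s * (U[:, :k] * d[:k]) @ Vt[:k]         # (3.8): rank-k SVD, unweighted back
        sigma2 = ((Y - X) ** 2).mean(axis=1)        # (3.9): residual mean squares
    return X, sigma2
```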
To choose m, we investigated 180 test cases based on the six factor designs in Table 3.1, three dispersion levels for the σ_i², five aspect ratios γ and two data sizes. The details are in the Appendix. The finding is that taking m = 3 works almost as well as if we used whichever m gave the smallest error for each given data set.

More specifically, define the oracle estimating error using early stopping at m steps as

Err_X(m) = min_k ‖X̂_m(k) − X‖²_F,  (3.11)

where X̂_m(k) is the estimate of X using m iterations and rank k. We judge each number m of steps by the best k that might be used with it.
For early stopping alternation (ESA), we define the oracle stopping number of steps on a data set as

m_Opt = argmin_m Err_X(m) = argmin_m min_k ‖X̂_m(k) − X‖²_F.  (3.12)

We have found that m = 3 is very nearly optimal in almost all cases. We find that Err_X(3)/Err_X(m_Opt) is on average less than 1.01, with a standard deviation of 0.01 (see Appendix). Using m = 3 steps with the best k is nearly as good as using the best possible combination of m and k. We have tested early stopping on other data sizes, factor strengths and noise distributions, and find that m = 3 is a robust choice. Early stopping is also much faster than iterating until a convergence criterion has been met.
In Section 3.4.2, we compare ESA to other methods for estimating X, including PCA (SVD),
PCA after normalization of each row (variable) of the data and the quasi maximum-likelihood
method (QMLE). For the heteroscedastic noise cases and given the oracle rank of each method,
ESA performs better than PCA or PCA after data normalization in most cases. Surprisingly, it also
performs better than QMLE on average and especially when the aspect ratio N/n is not too small.
Comparing ESA with an oracle SVD method that knows the noise variance, we find that they have
comparable performance.
Given the above findings, we turn our attention to estimating the oracle rank r?ESA for ESA in
Section 3.3.
Remark. Early stopping of iterative algorithms is a well-known regularization strategy for inverse problems and for training machine learning models like neural networks and boosting (Yao et al., 2007; Zhang and Yu, 2005; Hastie et al., 2009; Caruana et al., 2001). An equivalence between early stopping and adding a penalty term has been demonstrated in some settings (Fleming, 1990; Rosset et al., 2004).
Remark. PCA after normalization of each variable is a common standardization step to put all the variables on the same scale before factorization. It is equivalent to ESA starting from (3.7) with m = 1. However, for the factor analysis model, Σ̂(0) is a bad estimate of Σ if the relative scaling of the noise and signal differs across variables. Using m > 1 iterations can be interpreted as using an estimated signal matrix to improve the estimation of Σ, so ESA with m = 3 can be understood as applying the truncated SVD to more properly reweighted data than one gets with m = 1.
3.3 Bi-cross-validatory choice of r
Here we describe how BCV works in the heteroscedastic noise setting. Then we give our choice for
the shape and size of the held-out submatrix using theory from Perry (2009).
3.3.1 Bi-cross-validation to estimate r*_ESA
We want k to minimize the squared estimating error (3.3) of X̂_ESA. We adapt the BCV technique of Owen and Perry (2009) to this setting of unequal variances. We randomly select n0 columns and N0 rows as the held-out block and partition the data matrix Y (by permuting the rows and columns) into four folds,

Y = (Y00 Y01)
    (Y10 Y11)

where Y00 is the selected N0 × n0 held-out block, and the other three blocks Y01, Y10 and Y11 are held-in. Correspondingly, we partition X and Σ as

X = (X00 X01)    and    Σ = (Σ0  0 )
    (X10 X11)                (0   Σ1).
The idea is to use the three held-in blocks to estimate X00 for each candidate rank k, and then select the best k based on the BCV estimated prediction error.

We rewrite the model (3.1) in terms of the four blocks:

(Y00 Y01)   (X00 X01)   (Σ0  0 )^{1/2} (E00 E01)
(Y10 Y11) = (X10 X11) + (0   Σ1)       (E10 E11)

            (L0R0 L0R1)   (Σ0^{1/2}E00  Σ0^{1/2}E01)
          = (L1R0 L1R1) + (Σ1^{1/2}E10  Σ1^{1/2}E11)

where L = (L0; L1) (stacked rows) and R = (R0 R1) are the corresponding decompositions of the factors.
The held-in block

Y11 = X11 + Σ1^{1/2}E11 = L1R1 + Σ1^{1/2}E11

has the low-rank plus noise form, so we can use ESA to get estimates X̂11(k) and Σ̂1 for a given rank k. Next, for k < rank(Y11), we choose rank k matrices L̂1 and R̂1 with

X̂11(k) = L̂1R̂1.  (3.13)

Then we can estimate L0 by solving N0 linear regression models Y01ᵀ = R1ᵀL0ᵀ + E01ᵀΣ0^{1/2}, and estimate R0 by solving n0 weighted linear regression models Y10 = L1R0 + Σ1^{1/2}E10. These least squares solutions are

R̂0 = (L̂1ᵀΣ̂1^{-1}L̂1)^{-1}·L̂1ᵀΣ̂1^{-1}·Y10,  and  L̂0 = Y01·R̂1ᵀ(R̂1R̂1ᵀ)^{-1},

which do not depend on the unknown Σ0. We get a rank k estimate of X00 as

X̂00(k) = L̂0R̂0.  (3.14)
Though the decomposition (3.13) is not unique, the estimate X̂00(k) is unique. To prove this we need a reverse order theorem for Moore-Penrose inverses. For a matrix Z ∈ R^{n×d}, the Moore-Penrose pseudo-inverse of Z is denoted Z⁺.

Theorem 3.3.1. Suppose that X = LR, where L ∈ R^{m×r} and R ∈ R^{r×n} both have rank r. Then

X⁺ = R⁺L⁺ = Rᵀ(RRᵀ)^{-1}(LᵀL)^{-1}Lᵀ.

Proof. This is MacDuffee's theorem. There is a proof in Owen and Perry (2009).
Proposition 3.3.1. The estimate X̂00(k) from (3.14) does not depend on the decomposition of X̂11(k) in (3.13) and has the form

X̂00(k) = Y01·(Σ̂1^{-1/2}X̂11(k))⁺·Σ̂1^{-1/2}·Y10.  (3.15)

Proof. Let X̂11(k) = L̂1R̂1 be any decomposition satisfying (3.13). Then

X̂00 = L̂0R̂0
     = Y01·R̂1ᵀ(R̂1R̂1ᵀ)^{-1}·(L̂1ᵀΣ̂1^{-1}L̂1)^{-1}·L̂1ᵀΣ̂1^{-1}·Y10
     = Y01·(Σ̂1^{-1/2}L̂1R̂1)⁺·Σ̂1^{-1/2}·Y10 = Y01·(Σ̂1^{-1/2}X̂11(k))⁺·Σ̂1^{-1/2}·Y10.

The third equality follows from Theorem 3.3.1.
Next, we define the cross-validation prediction average squared error for block Y00 as

PE_k(Y00) = (1/(n0N0))·‖Y00 − X̂00(k)‖²_F.

Notice that, as the partition is random, we have

E[PE_k(Y00)] = E[(1/(n0N0))·Err_{X00}(X̂00(k))] + (1/N)·Σ_{i=1}^{N} σ_i²,

where Err_X(X̂) is the loss defined at (3.2). The expectation on the left side of the equation is over the noise and the random partition, for a fixed signal matrix, while the expectation on the right side is just over the random partition.
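The held-out prediction (3.15) and the error PE_k(Y00) can be sketched as below. This is our own rendering: the function name is an assumption, and the inputs X11_k and sigma2_1 would come from running ESA on the held-in block Y11.

```python
import numpy as np

def bcv_pe(Y00, Y01, Y10, X11_k, sigma2_1):
    """Held-out prediction error PE_k(Y00) via (3.15), a sketch.

    X11_k: rank-k estimate of X11; sigma2_1: estimated noise variances
    for the held-in rows (the diagonal of Sigma_1)."""
    s = 1.0 / np.sqrt(np.asarray(sigma2_1))[:, None]   # Sigma_1^{-1/2} as row scaling
    X00 = Y01 @ np.linalg.pinv(s * X11_k) @ (s * Y10)  # (3.15)
    N0, n0 = Y00.shape
    return ((Y00 - X00) ** 2).sum() / (N0 * n0)
```

In the noiseless exact-rank case, MacDuffee's theorem implies the held-out block is recovered exactly, so the prediction error is zero up to rounding.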
The above random partitioning step is repeated independently S times, yielding the average BCV mean squared prediction error for Y,

PE(k) = (1/S)·Σ_{s=1}^{S} PE_k(Y00^{(s)}),

where Y00^{(s)} is the held-out data for the sth repeat of the partition. The BCV estimate of the rank is then

r̂* = argmin_k PE(k).  (3.16)
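The whole selection loop (3.16) can be sketched in the white-noise special case, where the held-out estimate simplifies to (3.18) and the truncated SVD stands in for ESA. The function name and default settings are our own assumptions.

```python
import numpy as np

def bcv_rank(Y, n0, N0, kmax, S=20, seed=0):
    """BCV rank selection (3.16), sketched for Sigma = sigma^2 I using (3.18)."""
    rng = np.random.default_rng(seed)
    N, n = Y.shape
    pe = np.zeros(kmax + 1)
    for _ in range(S):
        r = rng.permutation(N); c = rng.permutation(n)
        Y00 = Y[np.ix_(r[:N0], c[:n0])]; Y01 = Y[np.ix_(r[:N0], c[n0:])]
        Y10 = Y[np.ix_(r[N0:], c[:n0])]; Y11 = Y[np.ix_(r[N0:], c[n0:])]
        U, d, Vt = np.linalg.svd(Y11, full_matrices=False)
        for k in range(kmax + 1):
            # Y01 (Y11(k))^+ Y10, with (Y11(k))^+ = V_k D_k^{-1} U_k^T
            X00 = Y01 @ (Vt[:k].T / d[:k]) @ (U[:, :k].T @ Y10)
            pe[k] += ((Y00 - X00) ** 2).mean()
    return int(np.argmin(pe))
```

Holding out half the rows and columns for a square Y matches the sizing advice in Section 3.3.2.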
For using the method in practice, we investigate integer values of k from 0 to some maximum. We cannot take k as large as min(n1, N1), where n1 = n − n0 and N1 = N − N0, for then we will surely get some σ̂_i = 0 even with early stopping. We impose an additional constraint on k to keep the diagonal of Σ̂1 away from zero. If for some k we observe that

(1/N1)·Σ_{i=1}^{N1} log10(|σ̂_{i,1}^{(k)}|) < −6 + log10(max_i |σ̂_{i,1}^{(k)}|),  (3.17)

where Σ̂1(k) = diag(σ̂_{1,1}^{(k)}, σ̂_{2,1}^{(k)}, ..., σ̂_{N1,1}^{(k)}), then we do not consider any larger values of k. The condition (3.17) means that the geometric mean of the variance estimates is below 10^{-6} times the largest one.
Remark. Owen and Perry (2009) mentioned that BCV can miss large but very sparse components in the SVD in a white noise model, and they suggested rotating the data matrix as a remedy. As the held-out rows and columns are selected at random, sparsity in factor loadings or scores can greatly increase the variance in prediction error across partitions. In our problem, where the noise is heteroscedastic, we can deal with sparsity in the factor scores by replacing Y with YO, where O ∈ R^{n×n} is a given dense orthogonal matrix.
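A dense orthogonal O for this remedy can be generated, for example, by a QR decomposition of a Gaussian matrix; this is a standard construction, sketched below with our own function name.

```python
import numpy as np

def dense_orthogonal(n, seed=0):
    """A dense random orthogonal matrix O, via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))   # sign fix makes the draw Haar-distributed
```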
3.3.2 Choosing the size of the holdout Y00

We define the true prediction error for ESA as

PE(k) = (1/(nN))·‖X − X̂_ESA(k)‖²_F + (1/N)·Σ_i σ_i²,

and then the oracle rank is r*_ESA = argmin_k PE(k).

Ideally, we would like the BCV estimate PE(k) of Section 3.3.1 to be a good estimate of the true PE(k). Actually, for the purpose of BCV, it suffices to have r̂* (defined in (3.16)) be a good estimate of r*_ESA. Because of the inconsistency of PCA for large matrices in the presence of weak factors, the size of the holdout Y00 needs to be carefully chosen.
When it is known that Σ = σ²I, we can use the truncated SVD to estimate X, and for BCV the estimate of X00 simplifies to

X̂00(k) = Y01·(Y11(k))⁺·Y10,  (3.18)

where Y11(k) is the truncated SVD in (3.6). Perry (2009) proved that r̂* and r*_ESA track each other asymptotically if the relative size of the held-out matrix Y00 satisfies the following theorem.

Theorem 3.3.2. Under either Assumption 1 or Assumption 2 and Σ = σ²I_N, if k0 is fixed and N/n → γ ∈ (0,∞) as n → ∞, then r*_ESA and argmin_k E[PE_k(Y00)] converge to the same value if

√ρ = √2 / (√γ̄ + √(γ̄ + 3))  (3.19)

holds, where

γ̄ = ((γ^{1/2} + γ^{-1/2})/2)²,  and  ρ = ((n − n0)/n)·((N − N0)/N).

Here ρ is the fraction of entries of Y in the held-in block Y11. The larger γ̄ is, the smaller ρ will be; thus ρ reaches its maximum when Y is square, with γ = 1. For example, when γ = 1, then ρ ≈ 22%. In contrast, if γ = 50 or 0.02, ρ drops to only 3.5%.
Theorem 3.3.2 compares the best k for E[PE_k(Y00)] to the best k for the true error. If the number of repeats S → ∞ and n → ∞, then we also have r̂* → argmin_k E[PE_k(Y00)], thus r̂* − r*_ESA → 0.

In our simulations, we use (3.19) to determine the size of Y00. The logic is that for a consistent estimate of Σ, the limits of the singular values and vectors of Σ̂^{-1/2}Y are the same as for the white noise model. Thus Theorem 3.3.2 also works for Σ̂^{-1/2}Y. Further, to determine n0 and N0 individually, we make Y11 as square as possible as long as n0 ≥ 1 and N0 ≥ 1. For instance, with γ = 1, since ρ ≈ 22% we hold out roughly half the rows and columns of the data.
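Solving (3.19) for the held-in fraction ρ is a one-liner, sketched below (the function name is ours); the numbers reproduce the ≈ 22% and 3.5% figures quoted above.

```python
import numpy as np

def heldin_fraction(gamma):
    """Held-in fraction rho solving (3.19), for aspect ratio gamma = N/n."""
    gbar = ((gamma ** 0.5 + gamma ** -0.5) / 2) ** 2   # symmetric in gamma <-> 1/gamma
    return 2.0 / (np.sqrt(gbar) + np.sqrt(gbar + 3)) ** 2
```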
3.4 Simulation results

The empirical properties of ESA and BCV for factor analysis provide the main evidence that our methods perform better than other methods in practice. We give a detailed description of how we generate the data and then present the simulation results.
3.4.1 Factor categories and test cases
When we simulate the factor model for our tests, we generate it as

Y = Σ^{1/2}(Σ^{-1/2}X + E) = Σ^{1/2}(√n·UDVᵀ + E).  (3.20)

The matrix Σ^{-1/2}X = √n·UDVᵀ has the same low rank as X. Here UDVᵀ is an SVD, and we generate the matrices U and V from appropriate distributions. The reason that we rewrite (3.1) as (3.20) is that the normalization in (3.20) allows us to make direct use of RMT in choosing D. For our simulations, the matrix V is uniformly distributed in the space of orthogonal matrices, but U has a non-uniform distribution, to avoid making rows with large mean squared U-values coincide with rows having large σ_i. Such a coincidence could make the problem artificially easy.
For the factor strength in our simulations, we want to include both strong and weak factors and see how different methods perform. Under the asymptotics where n, N → ∞, we may place each factor into a category depending on the size of d_i² (with D defined via either (3.1) or (3.20)). The categories are:

1. Undetectable: d_i² is below the detection threshold, so the factor is asymptotically undetectable by SVD based methods.

2. Harmful: d_i² is above the detection threshold but below the estimation threshold at which its inclusion in the model improves accuracy.

3. Useful: d_i² is above the estimation threshold but is O(1). It contributes an N × n matrix to Y with sum of squares O(n), while the expected sum of squared errors is nNσ².

4. Strong: d_i² grows proportionally to N. The factor sum of squares is then proportional to the noise level.

The above classification reflects a general limiting phenomenon for matrix factorization of high-dimensional, large matrices. The specific values of the detection and estimation thresholds depend on the specific method used to estimate X. For our method, which first whitens the noise and then uses PCA on the reweighted data Σ̂^{-1/2}Y, we choose the detection and estimation thresholds to be those derived for a white noise model by RMT.
Here is a full description of the data generating mechanism.

Generating the noise

Recall that the noise matrix is Σ^{1/2}E. The steps are as follows.

1. E = (e_ij)_{N×n}: here e_ij ~iid N(0, 1).

2. Σ = diag(σ₁², ..., σ_N²): σ_i² ~iid InvGamma(α, β). Therefore E[σ_i²] = β/(α − 1) and Var[σ_i²] = β²/((α − 1)²(α − 2)). Parameters α and β are chosen so that E[σ_i²] = 1. We consider two heteroscedastic noise cases: Var[σ_i²] = 1 and Var[σ_i²] = 10. We also include a homoscedastic case with all σ_i² = 1.
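The noise generation above can be sketched as follows; the function name is ours, and the inverse-gamma draw uses the fact that β divided by a Gamma(α, scale 1) variate is InvGamma(α, β).

```python
import numpy as np

def simulate_noise(N, n, var_sigma2=1.0, seed=0):
    """Draw Sigma^{1/2} E with sigma_i^2 ~ InvGamma(alpha, beta), where
    alpha, beta are set so E[sigma_i^2] = 1 and Var[sigma_i^2] = var_sigma2."""
    rng = np.random.default_rng(seed)
    if var_sigma2 == 0:
        sigma2 = np.ones(N)                    # homoscedastic case
    else:
        alpha = 2.0 + 1.0 / var_sigma2         # solves Var = 1/(alpha - 2)
        beta = alpha - 1.0                     # solves E = beta/(alpha - 1) = 1
        sigma2 = beta / rng.gamma(alpha, 1.0, size=N)   # InvGamma(alpha, beta)
    E = rng.standard_normal((N, n))
    return np.sqrt(sigma2)[:, None] * E, sigma2
```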
Generating the signal

The signal matrix is X = √n·Σ^{1/2}UDVᵀ, where Σ is the same matrix used to generate the noise. The entries of D specify the signal strengths of the reweighted matrix Σ^{-1/2}X. Based on Perry (2009), we use the detection threshold µ_F = √γ and the estimation threshold

µ*_F = (1 + γ)/2 + √( ((1 + γ)/2)² + 3γ ).
These are the two thresholds for PCA in the homoscedastic noise case with σ = 1.
We explore different combinations of factors from the four factor categories defined above. Specif-
ically, we include the 6 scenarios from Table 3.1. All of these cases have eight nonzero factors of
which one is undetectable. We anticipate that the number of harmful factors is an important vari-
able, and so it generally increases with scenario number, ranging from 1 to 6. The remaining factors
are split between strong and merely useful. By including several scenarios with equal numbers of
harmful factors, we can vary the ratio of strong to useful factors at high and low numbers of harmful
factors.
For the d_i² values, the strong factors take values 1.5N, 2.5N, 3.5N, ···. The useful factors take values 1.5µ*_F, 2.5µ*_F, 3.5µ*_F, ···. The harmful factors take values at equally spaced interior points of the interval [µ_F, µ*_F], and the undetectable factors take values at equally spaced interior points of the interval [0, µ_F].
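The assignment of d_i² values just described can be sketched as below, for σ = 1 (the function name and argument order are our own assumptions).

```python
import numpy as np

def d2_values(n_strong, n_useful, n_harmful, n_undet, N, gamma):
    """d_i^2 values per the four factor categories above, for sigma = 1."""
    mu_F = np.sqrt(gamma)                                   # detection threshold
    mu_star = (1 + gamma) / 2 + np.sqrt(((1 + gamma) / 2) ** 2 + 3 * gamma)
    strong = [(1.5 + j) * N for j in range(n_strong)]       # 1.5N, 2.5N, ...
    useful = [(1.5 + j) * mu_star for j in range(n_useful)] # 1.5 mu*, 2.5 mu*, ...
    harmful = list(np.linspace(mu_F, mu_star, n_harmful + 2)[1:-1])
    undet = list(np.linspace(0.0, mu_F, n_undet + 2)[1:-1])
    return strong + useful + harmful + undet
```

For example, Scenario 1 of Table 3.1 with γ = 1 gives eight values: six useful, one harmful (at 2, the midpoint of [1, 3]) and one undetectable (at 0.5).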
For U and V, first V is sampled uniformly from the Stiefel manifold V_k(R^n). See Appendix A.1.1 in Perry (2009) for a suitable algorithm. Then an intermediate matrix U* is sampled uniformly from the Stiefel manifold V_k(R^N). Using the previously generated V and Σ, we solve

Σ^{-1/2}U*DVᵀ = UDVᵀ

for U. Now U is nonuniformly distributed on the Stiefel manifold, in such a way that rows of U with large L2 norm are not necessarily those with large σ_i².
Data dimensions
We consider five different N/n ratios: 0.02, 0.2, 1, 5 and 50, and for each ratio we consider a small matrix size and a larger matrix size, so there are in total 10 (N, n) pairs. The specific sample sizes appear at the top of Table 3.3. In total there are 6 × 3 × 5 × 2 = 180 scenarios. Each was simulated 100 times, for a total of 18,000 simulated data sets.
                 Scenario
               1  2  3  4  5  6
# Undetectable 1  1  1  1  1  1
# Harmful      1  1  1  3  3  6
# Useful       6  4  3  1  3  1
# Strong       0  2  3  3  1  0

Table 3.1: Six factor strength scenarios considered in our simulations.
3.4.2 Empirical properties of ESA
We use simulations to study the effectiveness and accuracy of ESA, and to empirically determine m,
the number of iteration steps before early stopping. In these simulations we know the true signal X
and so we can measure the errors. As mentioned in Section 3.2, we compare ESA to other methods
for estimating X, including PCA (SVD), PCA after normalization of each row (variable) of the data
and the quasi maximum-likelihood method (QMLE). For an estimation method M , we denote
ErrX(M) = Err(XM
Opt
)= min
kErr
(XM (k)
).
We use the six measurements below to study the effectiveness of ESA with m = 3:

1. Err_X(m = 3)/Err_X(m = m_Opt): this compares m = 3 to the optimal m defined in (3.12).

2. Err_X(m = 3)/Err_X(m = 1): this measures the advantage of ESA beyond PCA after data standardization.

3. Err_X(m = 3)/Err_X(m = 50): this measures the advantage of stopping early, using m = 50 as a proxy for iterating to convergence.

4. Err_X(m = 3)/Err_X(SVD): this compares ESA to applying the SVD (PCA) directly to the data.

5. Err_X(m = 3)/Err_X(QMLE): this compares ESA to the quasi maximum likelihood method, which is solved using the EM algorithm with PCA estimates as starting values.

6. Err_X(m = 3)/Err_X(oSVD): this compares ESA to the truncated SVD that an oracle which knew Σ could use on Σ^{-1/2}Y. It measures the relative inaccuracy in X̂ arising from the inaccuracy of Σ̂.
                             White Noise      Heteroscedastic Noise
Measurement                  Var(σ_i²) = 0    Var(σ_i²) = 1    Var(σ_i²) = 10
Err_X(m=3)/Err_X(m=m_Opt)    1.01 ± 0.01      1.00 ± 0.01      1.00 ± 0.01
Err_X(m=3)/Err_X(m=1)        0.93 ± 0.09      0.90 ± 0.11      0.89 ± 0.12
Err_X(m=3)/Err_X(m=50)       0.87 ± 0.21      0.87 ± 0.21      0.87 ± 0.21
Err_X(m=3)/Err_X(SVD)        1.03 ± 0.06      0.81 ± 0.20      0.75 ± 0.22
Err_X(m=3)/Err_X(QMLE)       1.02 ± 0.05      0.95 ± 0.15      0.91 ± 0.19
Err_X(m=3)/Err_X(oSVD)       1.03 ± 0.06      1.03 ± 0.07      1.03 ± 0.08

Table 3.2: ESA performance on the six measurements. For each of Var(σ_i²) = 0, 1 and 10, the average for every measurement is taken over 10 × 6 × 100 = 6000 simulations, and the standard deviation is the standard deviation over these 6000 simulations.
Table 3.2 summarizes the mean and standard deviation of each measurement over 6000 simulations each, for Var[σ_i²] = 0, 1 and 10. Row 1 shows that ESA stopping at m = 3 steps was almost identical to stopping at the unknown optimal m in terms of the oracle estimating error, as the mean is nearly 1 and the standard deviation is negligible. Row 2 indicates that taking m = 3 steps brought an improvement compared with PCA (SVD on standardized data). Row 3 shows that taking m = 3 brought an improvement compared to using m = 50, our proxy for iterating to convergence to a local minimum of the loss; the latter is highly variable. Row 4 shows that the truncated SVD is better than ESA when the noise is homoscedastic, but even a heteroscedasticity level as small as Var[σ_i²] = E[σ_i²] = 1 reverses the preference sharply. Row 5 shows that ESA beats QMLE on average in the heteroscedastic noise cases, though the latter has a theoretical guarantee in the strong factor scenario. Row 6 shows that an oracle which knew Σ and used it to reduce the data to the homoscedastic case would gain only 3% over ESA.
Table 3.3 gives the average value of each measurement over 100 replications for all of the simulations with mild heteroscedasticity (Var[σ_i²] = 1). "Type-1" to "Type-6" correspond to the six factor strength scenarios listed in Table 3.1. The first panel confirms that m = 3 is broadly effective. The second panel shows that the problem with PCA is more severe at large sample sizes. The third panel shows, in contrast, that the disadvantage of m = 50 iterations is more severe at the smaller sample sizes. The fourth panel shows, similarly to the second, that the SVD incurs the greatest losses at large sample sizes. The fifth panel shows that ESA has a great advantage over QMLE when the number of variables is large, even at a low aspect ratio γ.

It is worth mentioning that Table 3.3 shows that heteroscedasticity seems to be less of a problem at higher aspect ratios for all the methods. When there are only strong factors, Theorem
Factor       γ = 0.02               γ = 0.2               γ = 1               γ = 5                 γ = 50
Scenario   (20,1000) (100,5000)   (20,100) (200,1000)   (50,50) (500,500)  (100,20) (1000,200)  (1000,20) (5000,100)

ErrX(m = 3)/ErrX(m = mOpt)
Type-1      1.011     1.000        1.011    1.000        1.004   1.000      1.003    1.000        1.000     1.000
Type-2      1.013     1.002        1.012    1.001        1.006   1.000      1.004    1.000        1.000     1.000
Type-3      1.016     1.006        1.014    1.005        1.010   1.000      1.002    1.000        1.000     1.000
Type-4      1.002     1.002        1.009    1.001        1.008   1.000      1.006    1.000        1.000     1.000
Type-5      1.008     1.001        1.011    1.001        1.007   1.000      1.006    1.000        1.000     1.000
Type-6      1.007     1.000        1.011    1.000        1.006   1.000      1.003    1.000        1.001     1.000

ErrX(m = 3)/ErrX(m = 1)
Type-1      0.900     0.936        0.913    0.957        0.924   0.977      0.967    0.987        0.995     0.998
Type-2      0.819     0.626        0.844    0.680        0.833   0.785      0.942    0.909        0.990     0.987
Type-3      0.827     0.613        0.840    0.616        0.801   0.739      0.925    0.887        0.987     0.984
Type-4      0.781     0.723        0.837    0.751        0.864   0.833      0.947    0.926        0.990     0.990
Type-5      0.854     0.789        0.904    0.834        0.911   0.899      0.962    0.956        0.993     0.994
Type-6      0.987     0.993        0.997    0.996        0.997   0.998      0.999    0.999        0.999     1.000

ErrX(m = 3)/ErrX(m = 50)
Type-1      0.441     0.802        0.473    0.985        0.759   1.000      0.590    1.000        1.000     1.000
Type-2      0.472     0.839        0.486    0.984        0.765   1.000      0.605    1.000        1.000     1.000
Type-3      0.501     0.918        0.463    0.994        0.751   1.000      0.626    1.000        1.000     1.000
Type-4      0.560     0.975        0.541    0.989        0.899   1.000      0.854    1.000        1.000     1.000
Type-5      0.604     0.907        0.671    0.992        0.821   1.000      0.842    1.000        1.000     1.000
Type-6      0.947     0.982        0.981    0.999        0.988   1.000      0.997    1.000        1.000     1.000

ErrX(m = 3)/ErrX(SVD)
Type-1      0.638     0.348        0.740    0.366        0.722   0.466      0.882    0.727        0.977     0.966
Type-2      0.785     0.450        0.829    0.451        0.749   0.525      0.898    0.754        0.980     0.972
Type-3      0.870     0.611        0.896    0.548        0.772   0.599      0.903    0.791        0.983     0.976
Type-4      0.872     0.810        0.923    0.809        0.893   0.872      0.960    0.942        0.991     0.990
Type-5      0.704     0.542        0.798    0.552        0.770   0.605      0.888    0.779        0.978     0.972
Type-6      0.935     0.906        0.972    0.925        0.971   0.943      0.985    0.966        0.993     0.991

ErrX(m = 3)/ErrX(QMLE)
Type-1      0.915     0.633        0.966    0.677        0.985   0.858      0.997    0.988        1.000     1.000
Type-2      1.104     0.672        1.058    0.725        1.000   0.863      0.999    0.989        1.000     1.000
Type-3      1.199     0.826        1.129    0.766        1.008   0.878      0.997    0.990        1.000     1.000
Type-4      1.035     0.991        1.033    0.954        1.005   0.973      1.002    0.997        1.000     1.000
Type-5      0.966     0.661        0.996    0.744        0.989   0.885      0.998    0.991        1.000     1.000
Type-6      0.971     0.912        0.993    0.942        0.999   0.974      0.999    0.999        1.000     1.000

ErrX(m = 3)/ErrX(oSVD)
Type-1      1.029     0.994        1.064    0.998        1.036   1.001      1.026    1.001        1.003     1.000
Type-2      1.220     1.014        1.156    0.999        1.040   1.001      1.027    1.001        1.002     1.000
Type-3      1.298     1.150        1.223    1.020        1.053   1.001      1.026    1.001        1.002     1.000
Type-4      1.087     1.067        1.095    1.013        1.036   1.002      1.021    1.001        1.002     1.000
Type-5      1.075     0.998        1.087    1.000        1.029   1.002      1.027    1.001        1.003     1.000
Type-6      1.011     1.000        1.023    1.002        1.016   1.002      1.006    1.001        1.002     1.000

Table 3.3: Comparison of ESA results for various (N, n) pairs and number of strong factors in the scenarios with Var[σ_i^2] = 1.
2.2.1 has shown that the PCA estimates become consistent as N → ∞, and noise heteroscedasticity
is not a problem. In the presence of weak factors, the estimation threshold for weak factors increases
with the aspect ratio, so the factor strengths of the retained factors increase proportionally to the
aspect ratio and the noise perturbation becomes relatively small, making the heteroscedasticity of the
noise less of a problem.
3.4.3 Empirical properties of BCV
The noise heteroscedasticity of the data we generated falls into three groups: white noise
with Var[σ_i^2] = 0, mild heteroscedasticity with Var[σ_i^2] = 1, and strong heteroscedasticity with
Var[σ_i^2] = 10. In this section we begin by summarizing the mild heteroscedastic case, Var[σ_i^2] = 1.
The other cases are similar and we give some results for them later.
To measure the loss in estimating X due to using an estimate r̂ instead of the optimal choice
r*_ESA, we use a relative estimation error (REE) given by

    REE(r̂) = ‖X̂(r̂) − X‖_F^2 / ‖X̂(r*_ESA) − X‖_F^2 − 1.    (3.21)

REE is zero if r̂ is the best possible rank for the specific data matrix shown, that is, if r̂ is the same
rank an oracle would choose.
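As a small numerical illustration of (3.21), the sketch below computes REE for a truncated-SVD estimator of a noisy low-rank matrix. The truncated SVD and the simulated data here are stand-ins for the ESA fits used in the chapter, not the chapter's actual estimator.

```python
import numpy as np

def svd_estimate(Y, r):
    """Rank-r truncated SVD approximation of Y (a stand-in for X-hat(r))."""
    U, d, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * d[:r]) @ Vt[:r]

def ree(Y, X, r, r_oracle):
    """Relative estimation error (3.21) of rank r versus the oracle rank."""
    err = np.linalg.norm(svd_estimate(Y, r) - X, 'fro') ** 2
    err_star = np.linalg.norm(svd_estimate(Y, r_oracle) - X, 'fro') ** 2
    return err / err_star - 1.0

rng = np.random.default_rng(0)
N, n, k = 50, 200, 3
X = rng.standard_normal((N, k)) @ rng.standard_normal((k, n))  # true signal
Y = X + 0.1 * rng.standard_normal((N, n))                      # noisy data
print(ree(Y, X, r=k, r_oracle=k))   # exactly 0 when r equals the oracle rank
```

With very small noise the oracle rank coincides with the true rank k, so under-ranked fits (r &lt; k) produce REE well above zero.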
We compare the r̂ chosen by BCV as described in Section 3.3 with five other methods: the PA method
using permutation described in Section 2.3.1, the IC1 method defined in (2.14), the eigenvalue difference
method ED defined in (2.16), the eigenvalue ratio method ER defined in (2.17), and the AIC-type
method RE defined in (2.18).
Of these methods, ER and IC1 are designed for models with strong factors only. ED
does not require strong factors to work. RE has theoretical guarantees for estimating the number
of detectable weak factors in the white noise model. Finally, PA was designed and tested under the
small N and large n scenarios. We want to compare the finite sized dataset performance of these
methods in settings with both strong and weak factors. In applications one cannot be sure that only
the desired factor strengths are present.
We also include in the comparison the use of the true number of factors as well as the oracle's
number of factors r*_ESA defined in (3.3). Methods that choose a value closer to r*_ESA should attain
a smaller error using ESA.
Figure 3.1 shows, for the different methods, the proportion of simulations with REE above certain
values for the mild heteroscedastic case Var[σ_i^2] = 1. Figure 3.1a shows that BCV is overall best at
recovering the signal matrix X. Figure 3.1b shows that BCV becomes far better than alternatives
when we just compare the larger sample sizes from each aspect ratio. Figure 3.1c shows that at
smaller sample sizes RE is competitive with BCV. The large data case is more important given the
recent emphasis on large data problems.
Our goal is to find the best r for ESA, but the methods ED, ER, IC1 and RE are designed
assuming that the SVD will be used to estimate the factors. To study them in the setting they
were designed for, we include Figure 3.1d, which calculates REE using SVD to estimate X(k) and
Var[σ_i^2]   PA     ED     ER      IC1    RE     BCV
0            1.99   1.41   49.61   1.13   0.12   0.29
1            2.89   2.42   25.02   3.11   2.45   0.37
10           3.66   2.28   15.62   4.46   2.10   0.62

Table 3.4: Worst case REE values for each method of choosing r̂ for white noise and two heteroscedastic noise settings.
compares with the oracle rank of SVD. For Figure 3.1d, the BCV method also uses the SVD instead
of ESA. Though the results in Table 3.2 (Appendix) suggest that SVD is in general not recommended
for heteroscedastic noise data, if one does use SVD, BCV is still the best method for choosing r to
recover X.
The proportion of simulations with REE = 0 (matching the oracle’s rank) for BCV was 51.6%,
75.1%, 28.1% and 47.0% in the four scenarios in Figure 3.1. BCV’s percentage was always highest
among the six methods we used. The fraction of REE = 0 sharply increases with sample size and is
somewhat better for ESA than for SVD.
Table 3.4 briefly summarizes the REE values for all three noise variance cases. It shows the worst
case REE over all 10 matrix sizes and 6 factor strength scenarios. As the variance of σ_i^2 rises
it becomes more difficult to attain a small REE. BCV has substantially smaller worst case REE for
heteroscedastic noise than all other methods, but is slightly worse than RE for the white noise case.
This is not surprising as RE is designed for the white noise model.
To better understand the differences among the methods, we compare them directly with the oracle
in estimating the number of factors. As an example, Figure 3.2 plots the distribution of r̂ for all
methods and all 6 cases, on 5000 × 100 data matrices with Var[σ_i^2] = 1. The results are summarized
in more detail in Tables 3.5 and 3.6. In Figure 3.2, BCV closely tracks the oracle. Among the other methods,
ED performs the best in estimating the oracle rank, though it is more variable and less accurate
than BCV. ER is the most conservative method, trying to estimate at most the number of strong
factors. IC1 also tries to estimate the number of strong factors, but is less conservative than ER. RE
estimates some number between the number of strong factors and the number of useful (including
strong) factors. PA has trouble identifying the useful weak factors when strong factors are present,
and also has trouble rejecting the detectable but not useful factors in the hard case with no strong
factor. This is due to the fact that PA uses the sample correlation matrix, which has a fixed sum of
eigenvalues, so the magnitude of each eigenvalue is influenced by every other one.
Tables 3.5 and 3.6 provide more details of the simulation results for this mildly heteroscedastic
case, Var[σ_i^2] = 1. We can see that some methods behave very differently for different sized datasets.
For example, IC1 is very non-robust and sharply over-estimates the number of factors for small
datasets, while ED tends to estimate only the number of strong factors when the aspect ratio γ is
small. Overall, BCV has the most robust and accurate performance in estimating r*_ESA of the
[Figure 3.1 comprises four survival plots: (a) All datasets, ESA; (b) Large datasets only, ESA; (c) Small datasets only, ESA; (d) All datasets, SVD. Each panel has curves for PA, ED, ER, IC1, NE and BCV.]

Figure 3.1: REE survival plots: the proportion of samples with REE exceeding the number on the horizontal axis. Figures 3.1a-3.1c are for REE calculated using the method ESA. Figure 3.1a shows all 6000 samples. Figure 3.1b shows only the 3000 simulations of larger matrices of each aspect ratio. Figure 3.1c shows only the 3000 simulations of smaller matrices. For comparison, Figure 3.1d is the REE plot for all samples calculating REE using the method SVD.
methods we investigated.
3.5 Real data example
We investigate a real data example to show how our method works in practice. The observed matrix
Y is 15 × 8192, where each row is a chemical element and each column represents a position on a
[Figure 3.2 comprises six panels, one per factor strength scenario: Type-1: 0/6/1/1, Type-2: 2/4/1/1, Type-3: 3/3/1/1, Type-4: 3/1/3/1, Type-5: 1/3/3/1 and Type-6: 0/1/6/1. Each panel shows the distribution of r̂ (from 0 to 10) for True, PA, ED, ER, IC1, NE, BCV and Oracle.]

Figure 3.2: The distribution of r̂ for each factor strength case when the matrix size is 5000 × 100. The y axis is r̂. Each image depicts 100 simulations with counts plotted in grey scale (larger equals darker). For the different scenarios, the factor strengths are listed as the number of "strong/useful/harmful/undetectable" factors in the title of each subplot. The true number of factors is always r = 8. The "Oracle" method corresponds to r*_ESA.
            γ = 0.02               γ = 0.2                γ = 1
Method    (20,1000)  (100,5000)  (20,100)   (200,1000)  (50,50)    (500,500)

Type-1: 0/6/1/1
PA        0.04  5.5  0.07  7.0   0.12  4.9  0.10  6.9   0.05  5.4  0.13  7.0
ED        1.93  1.7  2.29  1.3   2.27  1.3  2.40  1.0   2.42  1.2  2.40  0.6
ER        2.19  0.9  2.80  0.1   1.68  1.8  2.92  0.1   1.35  2.5  2.72  0.0
IC1       2.30 16.0  0.69  3.3   1.44 16.0  0.61  3.5   0.10  5.6  0.69  3.1
RE        0.23  6.3  1.82  1.3   0.16  5.0  2.45  0.6   0.08  5.4  2.36  0.5
BCV       0.16  5.9  0.03  5.8   0.33  4.5  0.01  5.9   0.12  5.0  0.00  6.0
Oracle     –    6.0   –    6.0    –    5.9   –    6.0    –    6.0   –    6.0

Type-2: 2/4/1/1
PA        0.27  3.7  0.15  4.6   0.55  3.4  0.34  4.0   0.69  3.2  0.31  3.9
ED        0.61  3.5  1.03  2.9   0.95  3.0  1.18  2.5   1.00  3.0  1.03  2.6
ER        1.52  1.8  1.21  2.0   1.64  1.9  1.33  2.0   1.34  2.0  1.23  2.0
IC1       1.87 16.0  0.58  3.6   1.34 16.0  0.57  3.7   0.09  5.8  0.66  3.2
RE        0.42  6.6  0.87  2.7   0.12  5.3  1.13  2.4   0.10  5.6  1.11  2.2
BCV       0.26  5.4  0.12  5.7   0.24  4.5  0.00  5.9   0.19  4.7  0.00  6.0
Oracle     –    5.1   –    5.8    –    5.5   –    6.0    –    5.9   –    6.0

Type-3: 3/3/1/1
PA        0.35  3.2  0.46  3.1   0.62  3.1  0.72  3.0   0.76  3.0  0.69  3.0
ED        0.30  4.0  0.55  4.0   0.46  3.8  0.54  3.5   0.56  3.7  0.56  3.5
ER        4.15  1.8 16.18  2.2   3.40  1.9 13.62  2.6   0.78  3.0  0.69  3.0
IC1       1.70 16.0  0.41  4.2   1.23 16.0  0.41  4.1   0.11  5.9  0.52  3.5
RE        0.41  6.8  0.41  3.7   0.14  5.5  0.56  3.4   0.10  5.6  0.60  3.2
BCV       0.17  5.1  0.26  5.3   0.26  4.5  0.08  5.8   0.21  4.6  0.01  5.9
Oracle     –    5.0   –    4.8    –    5.5   –    5.8    –    5.9   –    6.0

Type-4: 3/1/3/1
PA        0.01  3.0  0.02  3.0   0.03  3.0  0.07  3.0   0.05  3.0  0.06  3.0
ED        0.11  3.3  0.81  4.4   0.08  3.3  0.29  3.9   0.07  3.3  0.08  3.8
ER        5.10  1.8 19.24  2.2   3.50  1.9 16.79  2.5   3.33  2.3  0.50  3.0
IC1       2.62 16.0  0.66  4.1   1.60 16.0  0.33  4.1   0.10  3.7  0.06  3.5
RE        0.63  5.7  0.54  3.8   0.13  3.7  0.14  3.6   0.09  3.9  0.05  3.3
BCV       0.02  3.1  0.19  3.5   0.03  3.3  0.05  3.7   0.05  3.1  0.01  3.9
Oracle     –    3.2   –    3.2    –    3.5   –    3.9    –    3.8   –    4.0

Type-5: 1/3/3/1
PA        0.02  3.4  0.01  4.3   0.08  3.0  0.01  3.8   0.10  2.9  0.02  3.7
ED        0.40  2.0  0.58  1.9   0.54  1.6  0.56  1.6   0.57  1.6  0.45  2.0
ER        0.69  1.0  0.78  1.0   0.70  1.0  0.79  1.0   0.71  1.0  0.72  1.0
IC1       2.63 16.0  0.41  2.1   1.53 16.0  0.45  2.0   0.10  3.3  0.55  1.5
RE        0.40  5.3  0.48  1.9   0.13  3.2  0.59  1.5   0.08  3.5  0.62  1.2
BCV       0.12  3.1  0.04  3.9   0.27  2.4  0.01  3.9   0.16  2.8  0.00  4.0
Oracle     –    3.7   –    4.0    –    4.0   –    4.0    –    4.0   –    4.0

Type-6: 0/1/6/1
PA        0.45  5.6  0.68  7.3   0.22  4.0  2.00 10.4   0.34  4.5  2.89 12.8
ED        0.07  0.8  0.11  1.8   0.06  0.7  0.12  1.4   0.06  0.4  0.09  1.1
ER        0.07  0.1  0.09  0.1   0.03  0.2  0.08  0.1   0.05  0.1  0.06  0.1
IC1       3.11 13.6  0.06  1.1   1.74 16.0  0.07  1.0   0.05  0.5  0.06  0.5
RE        0.21  3.2  0.06  1.0   0.05  0.8  0.06  0.7   0.06  0.9  0.05  0.3
BCV       0.06  0.2  0.04  1.0   0.03  0.1  0.02  0.8   0.03  0.0  0.00  1.0
Oracle     –    1.0   –    1.0    –    0.8   –    1.0    –    0.8   –    1.0

Table 3.5: Comparison of REE and r̂ for rank selection methods with various (N, n) pairs and scenarios. For each scenario, the factor strengths are listed as the number of "strong/useful/harmful/undetectable" factors. For each (N, n) pair, the first column is the REE and the second column is r̂. Both values are averages over 100 simulations. Var[σ_i^2] = 1.
            γ = 5                  γ = 50
Method    (100,20)   (1000,200)  (1000,20)  (5000,100)

Type-1: 0/6/1/1
PA        0.05  5.0  0.11  6.9   0.01  5.7  0.10  7.0
ED        1.89  1.2  1.57  1.6   0.43  4.7  0.10  6.1
ER        2.23  0.3  2.18  0.0   1.69  0.0  1.68  0.0
IC1       1.23 16.0  0.86  2.2   0.04  5.0  1.10  1.1
RE        0.14  4.9  1.17  1.7   0.20  4.2  0.14  3.9
BCV       0.37  4.1  0.00  6.0   0.10  4.9  0.01  5.8
Oracle     –    5.9   –    6.0    –    5.8   –    5.9

Type-2: 2/4/1/1
PA        0.68  2.8  0.23  3.9   0.32  3.1  0.12  4.0
ED        0.83  2.9  0.65  3.2   0.17  5.2  0.06  6.0
ER        1.05  2.0  0.94  2.0   0.95  1.9  0.68  2.0
IC1       1.24 16.0  0.86  2.2   0.05  5.0  0.68  2.0
RE        0.07  5.2  0.77  2.4   0.08  4.5  0.13  4.0
BCV       0.31  4.2  0.00  6.0   0.09  4.9  0.01  5.8
Oracle     –    5.9   –    6.0    –    5.7   –    5.9

Type-3: 3/3/1/1
PA        0.59  3.0  0.51  3.0   0.35  3.0  0.35  3.0
ED        0.48  3.6  0.36  3.9   0.11  5.5  0.06  6.2
ER        3.51  1.9 22.02  2.1   3.33  2.0 15.40  2.0
IC1       1.27 16.0  0.48  3.1   0.04  5.0  0.35  3.0
RE        0.09  5.2  0.47  3.1   0.05  4.7  0.14  3.9
BCV       0.25  4.5  0.01  5.8   0.09  4.6  0.01  5.8
Oracle     –    5.9   –    6.0    –    5.8   –    5.9

Type-4: 3/1/3/1
PA        0.03  3.0  0.03  3.0   0.01  3.0  0.01  3.0
ED        0.05  3.2  0.05  3.6   0.01  3.3  0.03  4.0
ER        3.36  1.8 25.02  2.1   3.67  2.0 18.55  2.0
IC1       1.53 16.0  0.03  3.1   0.01  3.0  0.01  3.0
RE        0.04  3.4  0.03  3.2   0.01  3.0  0.01  3.0
BCV       0.03  3.2  0.01  3.8   0.01  3.2  0.01  3.7
Oracle     –    3.8   –    4.0    –    3.6   –    3.8

Type-5: 1/3/3/1
PA        0.11  2.7  0.01  3.6   0.01  3.1  0.00  4.0
ED        0.42  1.8  0.32  2.1   0.31  1.9  0.12  3.7
ER        0.57  1.0  0.57  1.0   0.43  1.0  0.42  1.0
IC1       1.45 16.0  0.54  1.1   0.34  1.3  0.42  1.0
RE        0.12  2.8  0.53  1.1   0.08  2.5  0.15  2.0
BCV       0.22  2.4  0.01  3.9   0.12  2.6  0.02  3.8
Oracle     –    3.9   –    4.0    –    3.7   –    3.8

Type-6: 0/1/6/1
PA        0.29  3.4  2.27 10.5   0.77  5.4  1.24  7.1
ED        0.03  0.2  0.04  0.6   0.02  0.5  0.03  0.9
ER        0.02  0.0  0.04  0.0   0.01  0.0  0.01  0.0
IC1       1.00  7.4  0.03  0.1   0.01  0.0  0.01  0.0
RE        0.03  0.2  0.03  0.2   0.01  0.0  0.01  0.0
BCV       0.02  0.1  0.01  0.8   0.01  0.1  0.02  0.7
Oracle     –    0.5   –    0.9    –    0.6   –    0.8

Table 3.6: Like Table 3.5, but for larger γ.
64 × 128 map of a meteorite. We thank Ray Browning for providing this data. Similar data are
discussed in Paque et al. (1990). Each entry in Y is the amount of a chemical element at a grid
point. The task is to analyze the distribution patterns of the chemical elements on the meteorite,
helping us to further understand its composition.
[Figure 3.3: BCV prediction error (vertical axis, roughly 200 to 500) plotted against the rank k = 0, 1, …, 9.]

Figure 3.3: BCV prediction error for the meteorite. The BCV partitions have been repeated 200 times. The solid red line is the average over all held-out blocks, with the cross marking the minimum BCV error.
A factor structure seems reasonable for the elements, as various compounds are distributed over
the map. The amounts of some elements such as Iron and Calcium are on a much larger scale than
some other elements like Sodium and Potassium, so it is necessary to assume a heteroscedastic
noise model as in (3.1). We center the data for each element before applying our method.
BCV chooses r̂ = 4 factors, while PA chooses r̂ = 3. Figure 3.3 plots the BCV error for each rank,
showing that among the selected factors, the first two can be considered strong factors,
which are much more influential than the last two. The first column of Figure 3.4 plots the four
factors ESA has found at their positions. They represent four clearly different patterns.
As a comparison, we also apply a straight SVD to the centered data, with and without standardization,
to analyze the hidden structure. The second and third columns of Figure 3.4 show the first
five location factors that SVD finds for the original and scaled data respectively. If we do not
scale the data, then the factor (F5) showing the concentration of Sulfur at some specific locations
strangely comes after the factor (F4), which has no apparent pattern; F5 would have been neglected
in a model of three or four factors as BCV or PA suggest. The figure shows that ESA can estimate
the weak factors more accurately than SVD.
Paque et al. (1990) investigate this sort of data by clustering the pixels based on the
[Figure 3.4 shows, on the 64 × 128 map, the factor images ESA_F1 to ESA_F4 (first column), SVD_F1 to SVD_F5 (second column) and scale_F1 to scale_F5 (third column).]

Figure 3.4: Distribution patterns of the estimated factors. The first column has the four factors found by ESA. The second column has the top five factors found by applying SVD to the unscaled data. The third column has the top five factors found by applying SVD to scaled data in which each element has been standardized. The values are plotted in grey scale, and a darker color indicates a higher value.
values of the first two factors of a factor analysis. We apply such a clustering in Figure 3.5. The plot
shows that ESA can estimate the factor scores of the strong factors more accurately. Column (a)
shows the resulting clusters. The factors found by ESA clearly divide the locations into five clusters,
while the factors found by an SVD on the original data blur the boundary between clusters 1 and 5.
An SVD on normalized data (third plot in column (a)) blurs together three of the clusters. Columns
(b) and (c) of Figure 3.5 show the quality of clustering using k-means based on the first two plots
of Column (a). Clusters, especially C1 and C5, have much clearer boundaries and are less noisy if
we are using ESA factors than using SVD factors. A k-means clustering depends on the starting
points. For the ESA data the clustering was stable. For SVD the smallest group C3 was sometimes
merged into one of the other clusters; we chose a clustering for SVD that preserved C3.
In this data the ESA based factor analysis found factors that, visually at least, seem better. They
have better spatial coherence, and they provide better clusters than the SVD approaches do. For
data of this type it would be reasonable to use spatial coherence of the latent variables to improve the
fitted model. Here we have used spatial coherence as an informal confirmation that BCV is making
a reasonable choice, which we could not do if we had exploited spatial coherence in estimating our
factors.
[Figure 3.5, column (a): scatter plots of the first two factors (F1 versus F2) for ESA, SVD and scaled SVD; columns (b) and (c): cluster maps ESA_C1 to ESA_C5 and SVD_C1 to SVD_C5 on the 64 × 128 grid.]

Figure 3.5: Plots of the first two factors and the location clusters. The three plots of column (a) are the scatter plots of pixels for the first two factors found by the three methods: ESA, SVD on the original data and SVD on normalized data. The coloring shows a k-means clustering result for 5 clusters. Column (b) has the five clustered regions based on the first two factors of ESA. Column (c) has the five clustered regions based on the first two factors of SVD on the original data after centering. The same color represents the same cluster.
Chapter 4
An optimization-shrinkage hybrid
method for factor analysis
4.1 A joint convex optimization algorithm POT
4.1.1 The objective function
As we have discussed in Chapter 3, maximizing the log-likelihood function (3.5) directly would not
work, as the global optimization solution can have σ_i → 0 arbitrarily. In this chapter, we switch to
an alternative objective function which is not ill-posed and allows us to jointly estimate
X and Σ.
Let Y = (y_ij)_{N×n}, X = (x_ij)_{N×n} and still consider the model (3.1). Define the objective function as

    L_λ(X, Σ; Y) = L_0(X, Σ; Y) + 2√n λ ‖X‖_*
                 = n Σ_{i=1}^N σ_i + Σ_{i=1}^N Σ_{j=1}^n (y_ij − x_ij)^2 / σ_i + 2√n λ ‖X‖_*    (4.1)

We estimate X and Σ by

    (X̂_λ, Σ̂_λ) = argmin_{X,Σ} L_λ(X, Σ; Y)    (4.2)
The loss L_0(X, Σ; Y) is based on an idea proposed by Huber (2011) to jointly estimate σ and β
in a regression model Y_i = X_i^T β + σ ε_i. Huber estimated β and σ by minimizing
nσ + Σ_{i=1}^n (Y_i − X_i^T β)^2 / σ, which is jointly convex in (β, σ) and yields the same estimates of β and σ as the MLE. Such
a Huber technique is also called a perspective transformation in convex optimization (Owen, 2007).
L_0(X, Σ; Y) is also jointly convex in (X, σ_1, …, σ_N). More importantly, it is not ill-posed since
L_0(X, Σ; Y) is bounded below by 0. To get a low-rank matrix estimate X̂, we impose a nuclear
norm penalty on X. The nuclear norm penalty has been widely used in low-rank matrix recovery
CHAPTER 4. AN OPTIMIZATION-SHRINKAGE HYBRID METHOD FOR FACTOR ANALYSIS 49
and completion (Recht et al., 2010), as ‖X‖_* is convex in X. The nuclear penalty is a convex relaxation
of a rank constraint on X. A larger value of the tuning parameter λ results in a lower rank of
X̂_λ. We name the joint optimization algorithm with objective function (4.1) POT (Perspective
transformation Optimization with Trace norm penalty).
4.1.2 Connection with singular value soft-thresholding
When Σ = I_N, then

    L_λ(X, I_N; Y) = nN + ‖Y − X‖_F^2 + 2√n λ ‖X‖_*    (4.3)

and X̂_λ has an explicit form (Parikh and Boyd, 2014, Chap. 6.7.3). As before, denote the SVD of Y as
Y = √n U D V^T where D = diag(d_1, …, d_{min(N,n)}). Define D_λ = diag((d_1 − λ)_+, …, (d_{min(N,n)} − λ)_+).
Then minimizing the objective function (4.3) gives

    X̂_λ = √n U D_λ V^T    (4.4)

D_λ is a soft-thresholding of the singular values of Y. In other words, the solution X̂_λ keeps the sample
singular vectors but applies soft-thresholding to the sample singular values.
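The closed form (4.4) is straightforward to implement; the sketch below computes the soft-thresholding solution for the homoscedastic case, using the √n scaling convention above so that λ acts on the singular values of Y/√n.

```python
import numpy as np

def svt(Y, lam):
    """Singular value soft-thresholding solution (4.4) of the objective (4.3).

    Y is N x n; the SVD is written Y = sqrt(n) U D V^T, so the d_i below are
    singular values of Y / sqrt(n), and lam thresholds on that scale.
    """
    n = Y.shape[1]
    U, d, Vt = np.linalg.svd(Y / np.sqrt(n), full_matrices=False)
    d_lam = np.maximum(d - lam, 0.0)          # (d_i - lambda)_+
    return np.sqrt(n) * (U * d_lam) @ Vt

rng = np.random.default_rng(1)
Y = rng.standard_normal((30, 60))
X_hat = svt(Y, lam=0.5)
# With lam = 0 the data matrix is recovered exactly; a huge lam gives zero.
assert np.allclose(svt(Y, 0.0), Y)
assert np.allclose(svt(Y, 1e6), 0.0)
```

Larger λ zeroes more singular values, so the rank of the solution decreases as λ grows, matching the role of λ described above.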
4.1.3 Connection with square-root lasso
By taking the derivative of (4.1) with respect to σ_i, we can plug

    σ̂_i = sqrt( (1/n) Σ_{j=1}^n (y_ij − x_ij)^2 ) = (1/√n) ‖Y_i· − X_i·‖_2

into (4.1) and, after dividing by the constant 2√n, the objective function becomes

    L_λ(X; Y) = Σ_{i=1}^N ‖Y_i· − X_i·‖_2 + λ ‖X‖_*    (4.5)
Equation (4.5) is closely related to the square-root lasso method proposed in Belloni et al. (2011)
for linear regression. Consider the multiple regression problem Y = BZ + Σ^{1/2} E where B ∈ R^{N×p} is
the coefficient parameter matrix and Z ∈ R^{p×n} is the matrix of known covariates. Consider the case
where p is very large and we want a sparse estimate of B. The square-root lasso method estimates
each B_i· separately by minimizing the objective function

    L_i(B_i·; Y_i·) = ‖Y_i· − B_i· Z‖_2 + λ_i ‖B_i·‖_1

As discussed in Belloni et al. (2011), the main advantage of the square-root lasso is that it is "pivotal":
the scale of the tuning parameter λ_i does not depend on the noise level σ_i. Thus, we can set
λ_i ≡ λ for some λ and rewrite the square-root lasso objective function for the multiple regression as

    L_λ(B; Y) = Σ_{i=1}^N L_i(B_i·; Y_i·) = Σ_{i=1}^N ‖Y_i· − B_i· Z‖_2 + λ ‖B‖_1

which has a very similar form to (4.5).
4.2 Some heuristics of the method
In this section, we provide some understanding of the solutions X̂_λ and of the choice of λ.
These results, though lacking rigorous mathematical justification, can guide the use of the method
in practice.
Negahban et al. (2012) provide a general theory for the error rate of estimates
θ̂_{λ_n} ∈ argmin_θ L(θ; Z_n) + λ_n R(θ), where θ is the vector of parameters, Z_n is the observed data and R(θ)
is a penalty function which is assumed to be a norm. They require L(θ; Z_n) to be a convex
and differentiable function of θ. However, in our problem, L_0(X; Y) = Σ_i sqrt( Σ_j (x_ij − y_ij)^2 ) is not
differentiable in X, so unfortunately we cannot apply their theory directly.
4.2.1 The theoretical scale of λ
We define the optimal λ, minimizing the estimation error of X, as

    λ_Opt = argmin_λ ‖X̂_λ − X‖_F^2    (4.6)

How would the scale of λ_Opt change with the dimension?
To avoid confusion, we denote the true value of X as X* in this sub-section. For the estimates
θ̂_{λ_n} ∈ argmin_θ L(θ; Z_n) + λ_n R(θ) discussed in Negahban et al. (2012), they require
λ_n ≥ c R*(∇L(θ*; Z_n)) with high probability to guarantee an upper bound controlling the error
rate ‖θ̂_{λ_n} − θ*‖_2. Here, θ* is the true value of θ, R*(·) is the dual norm of R(·) defined as
R*(v) = sup_{R(u)≤1} u^T v, and c > 1 is some constant. Also, once λ_n ≥ R*(∇L(θ*; Z_n)), the upper
bound is monotonically increasing in λ_n.
If we use their result in our problem, then we would need λ > ‖∇L_0(X*; Y)‖_op, since the dual
norm of the nuclear norm is the operator norm ‖·‖_op (the largest singular value of the matrix) and
L_0(X; Y) is differentiable at X* with probability 1 if P[e_ij = 0] = 0. In other words,

    λ > ‖ [ (y_ij − x*_ij) / sqrt( Σ_{j'=1}^n (x*_{ij'} − y_{ij'})^2 ) ]_{N×n} ‖_op
      = ‖ [ ε_ij / sqrt( Σ_{j'=1}^n ε_{ij'}^2 / n ) ]_{N×n} ‖_op / √n
      = (1/√n + o(1/√n)) ‖E‖_op

Under the asymptotics that n, N → ∞ and N/n → γ > 0, we have ‖E‖_op/√n → 1 + √γ, thus we
can set the theoretical value of λ as λ_theo = 1 + √γ, which is the detection threshold in RMT for
σ = 1.
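The limit ‖E‖_op/√n → 1 + √γ is easy to check by simulation; the sketch below compares the scaled operator norm of an i.i.d. Gaussian noise matrix with the bulk edge 1 + √γ. The matrix size and aspect ratio below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
gamma = 0.5                       # aspect ratio N / n
N = int(gamma * n)

E = rng.standard_normal((N, n))   # i.i.d. noise with sigma = 1
op_scaled = np.linalg.norm(E, 2) / np.sqrt(n)   # largest singular value / sqrt(n)
edge = 1 + np.sqrt(gamma)                        # theoretical bulk edge

print(op_scaled, edge)            # the two numbers should be close for large n
```

For n in the thousands the two values typically agree to a couple of decimal places, with fluctuations shrinking as n grows.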
Another heuristic to derive λ_theo is that we would always want the true parameter value to be a
solution. In particular, when X* = 0, we would like the estimate to be X̂ = 0 as well, which is equivalent to

    0 ∈ ∇L_0(0; Σ^{1/2} E) + λ ∂‖X‖_* |_{X=0}

This requires λ ≥ ‖∇L_0(0; Σ^{1/2} E)‖_op, which gives the same λ_theo.
In practice, we find from our simulations that the actual λ_Opt follows the trend of λ_theo well as
the size of the data matrix changes, though it is likely to be smaller. We develop a cross-validation
technique to find the actual λ_Opt from a sequence of candidate λ around λ_theo. This is discussed
in detail in Section 4.4.
4.2.2 The bias in using the nuclear penalty
In our simulations, we find that X̂_{λ_Opt} can have a rank much higher than the true rank
of X. Also, when there are strong factors in the data, X̂_{λ_Opt} can be a worse estimator than even
the PC estimator X̂_pc = L̂_pc F̂_pc. This phenomenon persists even in the white noise model
where Σ = I_N and X̂_{λ_Opt} is estimated from (4.3). The phenomenon is due to the bias in estimating
a low-rank matrix introduced by the nuclear penalty. RMT provides tools to understand (4.3)
under Σ = I_N.
As discussed in Section 4.1.2, (4.3) has the closed form solution (4.4), which keeps the singular
vectors but applies soft-thresholding to the singular values of Y. Under the assumptions of Theorem
2.2.3 and the asymptotics N, n → ∞ and N/n → γ, then based on the calculations in Shabalin and
Nobel (2013) and Gavish and Donoho (2014), if λ ≥ λ_theo = 1 + √γ we have

    (1/n) ‖X̂_λ − X‖_F^2  →a.s.  Σ_{k=1}^r ρ_k^2 + Σ_{k=1}^r [ (ρ̃_k − λ)_+^2 − 2 (ρ̃_k − λ)_+ ρ_k θ_k θ̃_k ]    (4.7)

where ρ_k, ρ̃_k, θ_k and θ̃_k are defined in Theorem 2.2.3. We can show
Proposition 4.2.1. Define

    L_∞(λ) = Σ_{k=1}^r ρ_k^2 + Σ_{k=1}^r [ (ρ̃_k − λ)_+^2 − 2 (ρ̃_k − λ)_+ ρ_k θ_k θ̃_k ]

Then under the assumptions of Theorem 2.2.3, L_∞(λ) is an increasing function of λ when λ ≥ 1 + √γ.
Moreover, lim_{λ↓(1+√γ)} ∇_λ L_∞(λ) > 0.

Proof. Denote ρ̃_{r+1} = 1 + √γ. As L_∞(λ) is continuous, to show that it is increasing in λ we only
need to show that L_∞(λ) is increasing on [ρ̃_{K+1}, ρ̃_K) for each K = 1, 2, …, r. Given K, the function
L_∞(λ) is quadratic on [ρ̃_{K+1}, ρ̃_K):

    L_∞(λ) = Σ_{k=1}^r ρ_k^2 + Σ_{k=1}^K [ (ρ̃_k − λ)^2 − 2 (ρ̃_k − λ) ρ_k θ_k θ̃_k ]

Then,

    ∇_λ L_∞(λ) = −2 Σ_{k=1}^K (ρ̃_k − λ) + 2 Σ_{k=1}^K ρ_k θ_k θ̃_k
               = 2K [ λ − (1/K) Σ_{k=1}^K (ρ̃_k − ρ_k θ_k θ̃_k) ].

It is enough to show that ρ̃_k − ρ_k θ_k θ̃_k is a strictly decreasing function of ρ_k when ρ_k > γ^{1/4},
and that ρ̃_k − ρ_k θ_k θ̃_k = 1 + √γ when ρ_k ≤ γ^{1/4}: then ρ̃_k − ρ_k θ_k θ̃_k ≤ 1 + √γ ≤ λ for every k,
so ∇_λ L_∞(λ) > 0 inside [ρ̃_{K+1}, ρ̃_K) for any K = 1, 2, …, r.
By plugging in the expressions for ρ̃_k, θ_k and θ̃_k from Theorem 2.2.3 with σ = 1, when ρ_k > γ^{1/4} we get

    ρ̃_k − ρ_k θ_k θ̃_k = sqrt( (ρ_k^2 + 1)(ρ_k^2 + γ) ) / ρ_k − (1/ρ_k) (ρ_k^4 − γ) / sqrt( (ρ_k^2 + 1)(ρ_k^2 + γ) )

Taking the derivative with respect to ρ_k, we get

    ∇_{ρ_k} [ ρ̃_k − ρ_k θ_k θ̃_k ] = −[ (1 + γ) ρ_k^6 + 6γ ρ_k^4 + 3γ(1 + γ) ρ_k^2 + 2γ^2 ] / [ ρ_k^2 (ρ_k^2 + 1)^{3/2} (ρ_k^2 + γ)^{3/2} ] < 0

Thus ρ̃_k − ρ_k θ_k θ̃_k is a strictly decreasing function of ρ_k when ρ_k > γ^{1/4}. When ρ_k ≤ γ^{1/4}, we have
θ_k = θ̃_k = 0 and ρ̃_k = 1 + √γ, so ρ̃_k − ρ_k θ_k θ̃_k = 1 + √γ.
Proposition 4.2.1 indicates that asymptotically λ_Opt < 1 + √γ; in other words, the optimal
soft-thresholding will include even more than the true number of factors to minimize the estimation
error of X. Also, from the above proof we can see that the larger the factor strengths (ρ_1, …, ρ_r)
are, the smaller the optimal λ is, which means that the rank of X̂_{λ_Opt} increases when there are stronger factors,
making X̂_{λ_Opt} less accurate. From our simulations, we find that the solution of (4.1) has the same
problem.
To overcome the bias of soft-thresholding, Shabalin and Nobel (2013) proposed the optimal
shrinker (2.12), the shrinkage of the singular values that minimizes the asymptotic
estimation error. Assuming Σ = σ^2 I_N, the optimal shrinkage estimator has the form

    X̂_sk = √n U η(D) V^T    (4.8)

where η(D) = diag(η(d_1), …, η(d_{min(N,n)})). The shrinkage function η(·) is defined as

    η(d) = (σ^2 / d) sqrt( (d^2/σ^2 − γ − 1)^2 − 4γ )   if d ≥ (1 + √γ) σ,
           0                                            otherwise.    (4.9)

Comparing (4.9) with soft-thresholding, the optimal shrinkage has the property that it shrinks larger
singular values less but shrinks smaller singular values more. This provides another point of view on why soft-
thresholding works badly when there are strong factors: it shrinks the larger sample singular values
too much even though they are actually close to the true values. The optimal shrinkage can be more
accurate than soft-thresholding, but it is hard to generalize it to the heteroscedastic noise factor
analysis model. Unfortunately, there is no convex penalty function to replace the nuclear norm ‖X‖_*
that has the optimal shrinkage as its solution.
4.3 A hybrid method: POT-S
Based on the discussion of Section 4.2.2, we propose a hybrid method (POT-S) that combines the
POT method minimizing (4.1) with the optimal shrinkage (4.8).
If we knew Σ, we could apply the optimal shrinkage (4.8) with σ^2 = 1 to the whitened data matrix
Σ^{−1/2} Y. However, we do not know Σ, so one hybrid approach is to first
estimate Σ by Σ̂_λ from minimizing (4.1), and then apply the optimal shrinkage to
Σ̂_λ^{−1/2} Y. More specifically, for a given λ, let Y_λ = Σ̂_λ^{−1/2} Y and

    X̂*_λ = √n · Σ̂_λ^{1/2} U_{Y_λ} η(D_{Y_λ}) V^T_{Y_λ},    Σ̂*_λ = Σ̂_λ.    (4.10)
Here, for a matrix Z ∈ R^{N×n}, we use U_Z, V_Z and D_Z to denote its left and right singular vectors
and its singular values, with Z = √n U_Z D_Z V_Z^T.
However, one drawback of X̂*_λ is that it depends only on the estimate Σ̂_λ. Another choice is
to have our estimate depend on both Σ̂_λ and X̂_λ. As the main problem of X̂_λ is that it shrinks
the large singular values too much while inadequately shrinking the small singular values, we can
replace the singular values of X̂_λ with the singular values of X̂*_λ:

    X̂**_λ = √n · U_{X̂_λ} D_{X̂*_λ} V^T_{X̂_λ},    Σ̂**_λ = Σ̂_λ.    (4.11)
We also define the optimal λ for X*_λ and X**_λ respectively as

λ*_Opt = argmin_λ ‖X*_λ − X‖²_F,   λ**_Opt = argmin_λ ‖X**_λ − X‖²_F.

From our simulations (see Section 4.6), we find that both X**_{λ**_Opt} and Σ**_{λ**_Opt} are more accurate than X*_{λ*_Opt} and Σ*_{λ*_Opt}. Thus, given λ, we propose the POT-S method with X**_λ and Σ**_λ as the final estimates.
When applying POT-S to a dataset, we need to determine λ. The goal is to find a λ as close as possible to the unknown λ**_Opt. We use a Wold-style cross-validation approach, discussed in the next section.
4.4 Wold-style cross-validatory choice of λ
We use a cross-validation technique to determine the optimal λ.
There are two types of cross-validation for unsupervised learning. One is bi-cross-validation (BCV), discussed in Chapter 3, which randomly holds out blocks of the data matrix. The other is Wold-style cross-validation, which randomly holds out entries of the data matrix. Both techniques are effective for selecting tuning parameters based on prediction performance if used properly. In Chapter 3, we used BCV because of the theory proposed by Perry (2009) on choosing the size of the holdout matrix. For POT and POT-S, we use Wold-style cross-validation for three main reasons: 1) for BCV, we find empirically that the estimate of λ is sensitive to the size of the holdout matrix, and we lack a corresponding theory for choosing that size for POT/POT-S; 2) in Wold-style cross-validation, the convex optimization step of POT/POT-S can easily handle missing entries; 3) in Wold-style cross-validation, the estimate of λ is not very sensitive to the fraction of held-out entries, provided that fraction is small.
Wold-style cross-validation (Wold, 1978) for a matrix Y starts by uniformly and randomly selecting some entries of Y to hold out, with the rest held in. We apply POT-S to the held-in data and calculate the prediction error on the held-out entries. We then repeat this random entry selection several times and choose the λ that minimizes the average prediction error.
Specifically, for one random selection, define an index matrix M = (m_ij) ∈ {0,1}^{N×n}, with m_ij = 1 if entry (i, j) is held in and m_ij = 0 if it is held out. We then estimate X and Σ from only the held-in data, treating the held-out entries as missing. Let n_i = Σ_{j=1}^n m_ij for i = 1, 2, ..., N. For the joint convex optimization step, the objective function (4.5) becomes

L_λ(X; Y, M) = Σ_{i=1}^N √( (n_i/n) Σ_j m_ij (x_ij − y_ij)² ) + λ‖X‖∗.          (4.12)
Define the joint convex optimization estimates as

X_{λ,M} = argmin_X L_λ(X; Y, M),   and   σ²_{i,λ,M} = (1/n_i) Σ_j m_ij (x_{ij,λ,M} − y_ij)².
For the optimal shrinkage step, we need a full data matrix Y, so the strategy is to first fill in the held-out entries based on the held-in entries. A direct approach is to replace each missing held-out entry by the corresponding entry of X_{λ,M}. However, for an entry y_ij = x_ij + σ_i e_ij, the value x_{ij,λ,M} is not a good approximation of y_ij: it approximates x_ij but does not include the noise term σ_i e_ij. Another approach is therefore to also estimate σ_i e_ij using the bootstrap. The noise term σ_i e_ij is estimated by random sampling from the held-in residuals {y_{ij_t} − x_{ij_t,λ,M} : t = 1, 2, ..., n_i}, where the y_{ij_t} are the non-missing entries in the ith row. The held-out entry y_ij is then filled in by x_{ij,λ,M} + σ_i e_ij.
Denote the filled-in matrix by Y_{λ,M}. The POT-S estimate X**_{λ,M} is then given by applying the optimal shrinkage step of POT-S to Y_{λ,M} with Σ_{λ,M}. The prediction error on the held-out entries is

PE_λ(Y, M) = ( Σ_{i,j: m_ij=0} (y_ij − x**_{ij,λ,M})² ) / ( Σ_{i,j} 1{m_ij = 0} ).
The random entry selection step above is repeated independently S times, yielding the average Wold-style cross-validation mean squared prediction error for Y:

PE(λ) = (1/S) Σ_{s=1}^S PE_λ(Y, M^(s)),

where M^(s) is the index matrix for the sth repetition of random entry selection. Finally, the cross-validation estimate of λ is

λ** = argmin_{λ ∈ [λ_min, λ_max]} PE(λ).          (4.13)

We use a grid search to find λ** within the range [λ_min, λ_max]. We find that setting λ_min = 0.5 λ_theo and λ_max = 1.3 λ_theo, where λ_theo = 1 + √γ, gives a wide enough range.
CHAPTER 4. AN OPTIMIZATION-SHRINKAGE HYBRIDMETHOD FOR FACTORANALYSIS56
We also find empirically (see Tables 4.5 and 4.6) that the bootstrap-CV approach is better than simply filling in the held-out entries with entries of X_{λ,M}. Thus, the POT-S method adopts the bootstrap-CV approach for cross-validation. For the fraction of entries to hold out, we find empirically that holding out 10% of the entries makes λ** a good estimate of λ**_Opt.
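The whole selection loop can be sketched as follows, with `fit` and `shrink` as hypothetical placeholders for the POT optimization and optimal-shrinkage steps (both names are illustrative, not from the thesis code):

```python
import numpy as np

def wold_cv(Y, lambdas, fit, shrink, S=5, frac=0.10, rng=None):
    """Wold-style CV with bootstrap fill-in (Section 4.4).  `fit(Y, M, lam)`
    must return (X_hat, sigma_hat) from the held-in entries; `shrink(Y_fill,
    sigma_hat)` applies an optimal-shrinkage step.  Returns the lambda from
    the grid with the smallest average held-out prediction error."""
    rng = np.random.default_rng(rng)
    N, n = Y.shape
    pe = np.zeros(len(lambdas))
    for _ in range(S):
        M = (rng.random((N, n)) > frac).astype(float)   # 1 = held in
        for k, lam in enumerate(lambdas):
            X_hat, sigma_hat = fit(Y, M, lam)
            # bootstrap fill-in: held-out entries get X_hat plus a residual
            # resampled from the held-in residuals of the same row
            Y_fill = Y.copy()
            for i in range(N):
                held_in = M[i] == 1
                resid = (Y - X_hat)[i, held_in]
                out = ~held_in
                Y_fill[i, out] = X_hat[i, out] + rng.choice(resid, size=out.sum())
            X_ss = shrink(Y_fill, sigma_hat)
            held_out = M == 0
            pe[k] += np.mean((Y[held_out] - X_ss[held_out]) ** 2)
    return lambdas[int(np.argmin(pe / S))]
```

With frac = 0.10 this holds out 10% of the entries per repetition, matching the recommendation above.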
4.5 Computation: an ADMM algorithm
In this section, we describe the algorithm minimizing the objective function L_λ(X; Y, M) in (4.12). Optimizing the original objective (4.5) is the special case with n_i ≡ n. We denote the Hadamard product of two matrices A = (a_ij)_{m×n} and B = (b_ij)_{m×n} by A ∘ B = (a_ij b_ij)_{m×n}, and M = (m_ij)_{N×n} is the indicator matrix of held-in entries.
4.5.1 The ADMM algorithm
For notational convenience, write

f(X) = Σ_{i=1}^N √( (n_i/n) Σ_j m_ij (x_ij − y_ij)² ),   g(X) = ‖X‖∗.

Since neither f nor g is smooth, we use the ADMM algorithm to solve the problem (Boyd et al., 2011).
For a given α, the ADMM iteration for this problem is

X^{k+1} = prox_{αf}(Z^k − U^k),
Z^{k+1} = prox_{αλg}(X^{k+1} + U^k),   and
U^{k+1} = U^k + X^{k+1} − Z^{k+1},          (4.14)

where prox_h(·) is the proximal operator of a function h(·), defined as

prox_h(v) = argmin_u { h(u) + (1/2)‖u − v‖²₂ }.
Recall that for a matrix Z, its SVD is written Z = √n U_Z D_Z V_Z^T. Also, for a diagonal matrix D = diag(d_1, d_2, ..., d_m), we write D_λ = diag((d_1 − λ)₊, ..., (d_m − λ)₊) for its soft-thresholding.
Fact 1. Both prox_{αf}(·) and prox_{αλg}(·) have closed forms:

prox_{αf}(W) = ( Y + diag( (1 − α√(n_i/n) / ‖(W_i· − Y_i·) ∘ M_i·‖₂)₊ ) (W − Y) ) ∘ M
               + W ∘ (1_N 1_n^T − M),          (4.15)

where the diagonal matrix collects the row-wise factors for i = 1, ..., N, and

prox_{αλg}(W) = √n U_W D_{W,αλ} V_W^T.          (4.16)
Proof. For proxαf ,
proxαf (W ) = argminX f(X) +1
2α‖X −W‖2F
= argminX
n∑i=1
√nin
∑j
mij(xij − yij)2 +1
2α‖Xi· −Wi·‖22
proxαf (W )i = argminXi·
√nin
∑j,Iij=1
(xij − yij)2 +1
2α‖Xi· −Wi·‖22
= Yi· + argminXi·
√nin
∑j
mij x2ij +
1
2α‖Xi· + Yi· −Wi·‖22
=
(Yi· +
(1−
α ·√ni/n
‖(Wi· − Yi·) Mi·‖2
)+
(Wi· − Yi·)
)Mi· +Wi· (1Tn −Mi·)
Thus, (4.15) holds.
For proxαλg,
proxαλg(W ) = argminX ‖X‖? +1
2αλ‖X −W‖2F
=√nUWDW,αλV
TW
The above fact shows that at each step of the ADMM iteration, X is first shrunk towards Y and then shrunk towards 0. The size of λ decides which direction of shrinkage dominates. We obtain a low-rank estimate X when λ is large enough; when λ is too small, some of the σ²_i will be estimated as 0.
We adopt the stopping rule used in Boyd et al. (2011): the ADMM algorithm is terminated when both the primal and dual residuals are small. Here, the primal and dual residuals are

R^k = X^k − Z^k,   and   S^k = (1/α)(Z^{k+1} − Z^k).

It can be shown that

f(X^k) + λ g(Z^k) − p* ≤ (1/α)‖U^k‖_F ‖R^k‖_F + ‖X^k − X‖_F ‖S^k‖_F,

where p* is the optimal objective value. Let ε_abs > 0 be an absolute tolerance and ε_rel a relative per-entry tolerance. We stop when both

‖R^k‖_F ≤ √(nN) ε_abs + ε_rel max{‖X^k‖_F, ‖Z^k‖_F},   and
‖S^k‖_F ≤ √(nN) ε_abs + ε_rel ‖U^k‖_F / α.
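As an illustrative sketch (not the thesis's R implementation), iteration (4.14) with the closed-form proximal operators of Fact 1 can be written compactly in numpy; here the √n scaling of the thesis's SVD convention is absorbed into the standard SVD, and a simple fixed tolerance replaces the combined absolute/relative rule:

```python
import numpy as np

def prox_f(W, Y, M, alpha):
    """Row-wise prox of f (4.15): group soft-thresholding of W - Y on the
    held-in entries; held-out entries of W pass through unchanged."""
    n = Y.shape[1]
    ni = M.sum(axis=1)
    R = (W - Y) * M
    norms = np.linalg.norm(R, axis=1)
    scale = np.maximum(1.0 - alpha * np.sqrt(ni / n) / np.maximum(norms, 1e-12), 0.0)
    return (Y + scale[:, None] * R) * M + W * (1.0 - M)

def prox_g(W, tau):
    """Prox of the nuclear norm (4.16): singular-value soft-thresholding."""
    U, d, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(d - tau, 0.0)) @ Vt

def pot_admm(Y, M, lam, alpha=1.0, iters=200, tol=1e-6):
    """ADMM iteration (4.14) for the POT objective (4.12)."""
    X = np.zeros_like(Y); Z = np.zeros_like(Y); U = np.zeros_like(Y)
    for _ in range(iters):
        X = prox_f(Z - U, Y, M, alpha)
        Z_new = prox_g(X + U, alpha * lam)
        R, S = X - Z_new, (Z_new - Z) / alpha   # primal and dual residuals
        U += X - Z_new
        Z = Z_new
        if np.linalg.norm(R) < tol and np.linalg.norm(S) < tol:
            break
    return X
```

With λ = 0 the nuclear-norm prox is the identity and the iteration converges to X = Y, which is a useful sanity check of the two proximal operators.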
4.5.2 Techniques to reduce computational cost
The above algorithm is quite expensive when used inside the POT-S method. In this section, we discuss three techniques we use to reduce the computational cost: varying the penalty parameter, an approximate SVD, and warm starts.
Varying step size α
The above ADMM converges slowly when we want to achieve the desired accuracy. One modification that reduces the number of iterations to convergence is to change the step size α at every iteration (Boyd et al., 2011). We use the simple scheme

α^{k+1} = α^k / 2   if ‖R^k‖_F > 10 ‖S^k‖_F,
α^{k+1} = 2 α^k     if ‖S^k‖_F > 10 ‖R^k‖_F,
α^{k+1} = α^k       otherwise.

The rationale is that a smaller α penalizes the primal residual R^k more heavily, while a larger α reduces the dual residual.
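This residual-balancing rule can be stated compactly (a direct transcription of the scheme above, with the factor 2 and the ratio 10 as in the text):

```python
def update_alpha(alpha, r_norm, s_norm, mu=10.0, factor=2.0):
    """Residual-balancing step-size rule: shrink alpha when the primal
    residual dominates, grow it when the dual residual dominates."""
    if r_norm > mu * s_norm:
        return alpha / factor
    if s_norm > mu * r_norm:
        return alpha * factor
    return alpha
```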
Acceleration by avoiding full SVD
The computational bottleneck in each iteration is the SVD required to compute prox_{αλg}(W). A full SVD computing all singular values and vectors of W would be very time-consuming. However, if W is known to be low-rank, there is no need for a full SVD. Moreover, computing only the singular value and vector pairs with d_i(W) > αλ is adequate for the soft-thresholding step. Both reasons suggest a partial SVD. Computing a partial SVD for these soft-thresholding iterations is also widely suggested in the matrix completion literature (Cai et al., 2010).
A partial SVD computes only the first K singular values and vectors. As our code is written in R, we use the "svd" package, which provides the PROPACK (Larsen, 1998) and nu-TRLAN (Yamazaki et al., 2010) implementations of partial SVD. In our experience, nu-TRLAN is slightly faster than PROPACK and less likely to yield an error message. Also, if K is not small enough (K > 0.2 min(N, n)), there is no acceleration from a partial SVD with either PROPACK or nu-TRLAN, as suggested by Wen et al. (2012) and confirmed by our simulations. In that situation we switch back to the full SVD.
To compute a partial SVD of a matrix W, we need to guess its rank K. As suggested in Cai et al. (2010), we can use information from previous iterations. Here is how this is done. We initialize with a low-rank Z (either Z = 0 or the Z computed for a larger λ). As the iterations proceed, the rank of Z^k tends to increase slowly. We guess an upper bound on rank(Z^{k+1}) as r^{k+1} = rank(Z^k) + 5. After computing Z^{k+1} using r^{k+1} components, if rank(Z^{k+1}) < r^{k+1} then our guess succeeded. If not, we recompute Z^{k+1} using the full SVD.
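A sketch of this guess-and-fallback logic, using scipy's `svds` as a stand-in for the PROPACK/nu-TRLAN routines of the R "svd" package (an assumption made for the sketch; the thesis code is in R):

```python
import numpy as np
from scipy.sparse.linalg import svds

def soft_threshold_svd(W, tau, rank_guess):
    """Singular-value soft-thresholding at level tau using a partial SVD of
    rank rank_guess, falling back to a full SVD when the partial SVD is not
    economical or the guess turns out to be too small (Section 4.5.2)."""
    k = max(1, min(rank_guess, min(W.shape) - 1))
    if k > 0.2 * min(W.shape):               # partial SVD no longer pays off
        U, d, Vt = np.linalg.svd(W, full_matrices=False)
    else:
        U, d, Vt = svds(W, k=k)
        order = np.argsort(d)[::-1]          # sort singular values descending
        U, d, Vt = U[:, order], d[order], Vt[order]
        if d[-1] > tau:                      # guess too small: some values above
            U, d, Vt = np.linalg.svd(W, full_matrices=False)  # tau were missed
    d = np.maximum(d - tau, 0.0)
    return (U * d) @ Vt, int((d > 0).sum())
```

The returned rank can be carried to the next iteration as the basis for the next guess r^{k+1}.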
Warm start
In the cross-validation step, we select the best λ from the range [λ_min, λ_max] via grid search. Thus, we need a solution path minimizing the objective function L_λ(X; Y, M) in (4.12) over a decreasing grid λ_1 > λ_2 > · · · > λ_M.

To compute the solution path, we start from the largest value λ_1, which gives a very low-rank estimate X_{λ_1,M}. To compute the solution for λ_{m+1}, we use the final values of X, Z and U defined in (4.14) from the solve at λ_m as the starting values for the optimization at λ_{m+1}. This is called a "warm start" in the optimization literature (Boyd et al., 2011). Also, since a small λ results in a higher-rank X that takes much longer to compute, we want to avoid useless cross-validation at small λ. In the cross-validation step, we therefore start from the largest λ and stop once σ_i = 0 for some i ∈ {1, 2, ..., N}.
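A sketch of the warm-started path with early stopping; `solver` is a hypothetical callback standing in for the ADMM solve, threading its internal state (X, Z, U) from one λ to the next:

```python
import numpy as np

def solution_path(Y, M, lambdas, solver, min_sigma=1e-8):
    """Solve over a decreasing grid of lambda values, warm-starting each solve
    from the previous one and stopping early once some noise-variance estimate
    hits zero (Section 4.5.2).  `solver(Y, M, lam, init)` is a placeholder
    returning (X_hat, sigma2_hat, state); `state` carries the warm start."""
    path, state = {}, None
    for lam in sorted(lambdas, reverse=True):   # largest lambda first
        X_hat, sigma2, state = solver(Y, M, lam, state)
        path[lam] = X_hat
        if np.min(sigma2) <= min_sigma:         # lambda too small: stop here
            break
    return path
```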
4.6 Simulation results
For our simulations, we use the same data-generating scheme as described in Section 3.4.1. As the properties of ESA-BCV have been compared thoroughly with other existing methods, and ESA-BCV showed an advantage, here we mainly compare POT-S with ESA-BCV.
4.6.1 Comparing the oracle performances
Here we compare the oracle performances of five methods: the ESA method of Chapter 3 (ESA), the quasi-maximum likelihood method (QMLE), the method using only the joint convex optimization solution X_λ in (4.2) (POT), the hybrid method using X*_λ (POT-S-0), and the hybrid method using X**_λ (POT-S). The oracle estimation error of a method M is denoted Err_X(M). For ESA and QMLE, it is defined as

Err_X(ESA) = Err(X_ESA(k^ESA_Opt)) = min_k Err(X_ESA(k)),
Err_X(QMLE) = Err(X_QMLE(k^QMLE_Opt)) = min_k Err(X_QMLE(k)),

where X_ESA(k) and X_QMLE(k) are the estimates for a given rank k. For the three methods based on the joint convex optimization, the oracle estimation errors of X are denoted

Err_X(POT) = Err(X_{λ_Opt}),   Err_X(POT-S-0) = Err(X*_{λ*_Opt}),   and   Err_X(POT-S) = Err(X**_{λ**_Opt}).
Table 4.1 compares the oracle error in estimating X for POT-S and the four other approaches. It is clear that POT-S has the smallest oracle error. The detailed result for each factor-strength scenario and matrix-size combination is shown in Table 4.3. Table 4.3 shows that when there are many strong factors but few weak factors (Scenarios 2, 3 and 4), POT, which uses only the joint convex optimization solution, performs worst. Because of the optimal shrinkage step, POT-S and POT-S-0 perform better than ESA, which essentially applies hard-thresholding to the singular values, and POT, which is based on singular-value soft-thresholding. Comparing POT-S and POT-S-0, we see that X**_λ, which depends on both X_λ and Σ_λ, has better oracle error than X*_λ, which relies only on Σ_λ. This supports adopting X**_λ rather than X*_λ as the estimator in POT-S.
                                  White noise       Heteroscedastic noise
Measurement                       Var[σ²_i] = 0     Var[σ²_i] = 1     Var[σ²_i] = 10

Err_X(POT-S)/Err_X(ESA)           0.80 ± 0.07       0.80 ± 0.09       0.81 ± 0.12
Err_X(POT-S)/Err_X(QMLE)          0.81 ± 0.07       0.76 ± 0.14       0.73 ± 0.17
Err_X(POT-S)/Err_X(POT)           0.77 ± 0.14       0.79 ± 0.14       0.80 ± 0.14
Err_X(POT-S)/Err_X(POT-S-0)       0.83 ± 0.17       0.86 ± 0.17       0.87 ± 0.18

Table 4.1: Oracle error in estimating X assessed by four measurements. For each of Var[σ²_i] = 0, 1 and 10, the average of each measurement is taken over 10 × 6 × 100 = 6000 simulations, and the standard deviation is that of the same 6000 simulations.
POT-S can also estimate Σ more accurately. For an estimator Σ̂, define the estimation error of Σ as

R(Σ̂, Σ) = Σ_{i=1}^N | log(σ̂²_i) − log(σ²_i) |.
Then we compare the error in estimating Σ when the oracle error in estimating X is achieved. In other words, define

Err_Σ(ESA) = R(Σ_ESA(k^ESA_Opt), Σ),   Err_Σ(QMLE) = R(Σ_QMLE(k^QMLE_Opt), Σ),
Err_Σ(POT) = R(Σ_{λ_Opt}, Σ),   Err_Σ(POT-S-0) = R(Σ_{λ*_Opt}, Σ),   Err_Σ(POT-S) = R(Σ_{λ**_Opt}, Σ).
The comparison among methods is summarized in Table 4.2, with more detailed results in Table 4.4. We do not compare the oracle error in estimating Σ as we did for X, for two reasons. One is that the oracle errors in estimating X and Σ usually cannot be achieved at the same tuning parameter; the other is that achieving the oracle error in estimating Σ without knowing the true Σ is hard to realize.
                                  White noise       Heteroscedastic noise
Measurement                       Var[σ²_i] = 0     Var[σ²_i] = 1     Var[σ²_i] = 10

Err_Σ(POT-S)/Err_Σ(ESA)           0.79 ± 0.24       0.79 ± 0.23       0.80 ± 0.23
Err_Σ(POT-S)/Err_Σ(QMLE)          1.00 ± 0.22       0.82 ± 0.30       0.73 ± 0.34
Err_Σ(POT-S)/Err_Σ(POT)           0.69 ± 0.22       0.69 ± 0.22       0.69 ± 0.22
Err_Σ(POT-S)/Err_Σ(POT-S-0)       1.02 ± 0.08       1.01 ± 0.08       1.00 ± 0.08

Table 4.2: Error in estimating Σ when the oracle estimate of X is achieved. For each of Var[σ²_i] = 0, 1 and 10, the average of each measurement is taken over 10 × 6 × 100 = 6000 simulations, and the standard deviation is that of the same 6000 simulations.
The above tables show a very promising result: the oracle error of POT-S is smaller than that of the other existing methods. The next step is to assess how well our cross-validation technique finds λ**_Opt for POT-S adaptively from the data.
4.6.2 Assessing the accuracy in finding λ**_Opt
Here we empirically test how accurately the Wold-style cross-validation estimates λ**_Opt. We compare the approach that fills in the held-out entries with only the corresponding entries of the held-in estimate X_{λ,M} (Wold-CV) and the approach that combines cross-validation with the bootstrap of Section 4.4 (CV-Boot). As a reference, we also include the ESA-BCV method.

The oracle estimate refers to X**_{λ**_Opt} using POT-S. Similar to Section 3.4.3, we calculate the excess error of Wold-CV or CV-Boot relative to the oracle. For an estimate λ̂ from
Factor        γ = 0.02                γ = 0.2               γ = 1                 γ = 5                  γ = 50
type          (20,1000) (100,5000)    (20,100) (200,1000)   (50,50) (500,500)     (100,20) (1000,200)    (1000,20) (5000,100)

Err_X(POT-S)/Err_X(ESA)
  Type-1      0.829     0.834         0.882    0.833        0.832   0.791         0.777    0.722         0.675     0.658
  Type-2      0.752     0.909         0.833    0.912        0.860   0.871         0.829    0.808         0.749     0.759
  Type-3      0.750     0.858         0.823    0.941        0.885   0.920         0.870    0.859         0.803     0.817
  Type-4      0.863     0.920         0.861    0.953        0.881   0.939         0.876    0.902         0.849     0.872
  Type-5      0.740     0.816         0.830    0.822        0.800   0.803         0.762    0.751         0.689     0.696
  Type-6      0.675     0.664         0.716    0.723        0.748   0.737         0.738    0.702         0.644     0.628

Err_X(POT-S)/Err_X(QMLE)
  Type-1      0.756     0.527         0.851    0.565        0.820   0.680         0.774    0.713         0.675     0.658
  Type-2      0.821     0.607         0.880    0.660        0.859   0.751         0.828    0.799         0.749     0.759
  Type-3      0.885     0.701         0.921    0.717        0.891   0.807         0.868    0.851         0.803     0.817
  Type-4      0.892     0.911         0.888    0.909        0.885   0.913         0.878    0.900         0.849     0.872
  Type-5      0.713     0.539         0.827    0.611        0.792   0.711         0.761    0.744         0.689     0.696
  Type-6      0.655     0.605         0.711    0.681        0.748   0.717         0.737    0.701         0.644     0.628

Err_X(POT-S)/Err_X(POT)
  Type-1      0.773     0.702         0.912    0.685        0.865   0.676         0.915    0.721         0.819     0.783
  Type-2      0.735     0.637         0.870    0.618        0.823   0.609         0.877    0.655         0.763     0.707
  Type-3      0.717     0.611         0.844    0.590        0.803   0.580         0.854    0.625         0.728     0.673
  Type-4      0.719     0.601         0.870    0.603        0.828   0.598         0.868    0.631         0.728     0.663
  Type-5      0.795     0.688         1.018    0.698        0.915   0.694         0.950    0.725         0.825     0.760
  Type-6      0.977     0.847         1.102    0.900        1.112   0.909         1.126    0.915         0.974     0.903

Err_X(POT-S)/Err_X(POT-S-0)
  Type-1      1.012     1.071         0.932    1.042        0.901   0.991         0.797    0.932         0.810     0.891
  Type-2      0.947     1.109         0.799    1.039        0.750   0.923         0.619    0.811         0.611     0.742
  Type-3      0.873     1.116         0.695    1.017        0.659   0.862         0.500    0.723         0.494     0.647
  Type-4      0.905     1.120         0.731    1.014        0.666   0.856         0.517    0.713         0.487     0.628
  Type-5      1.003     1.081         0.964    1.028        0.877   0.961         0.777    0.883         0.748     0.825
  Type-6      1.041     1.018         0.977    0.993        0.970   0.970         0.942    0.947         0.914     0.918

Table 4.3: Four measurements comparing the oracle error in estimating X under various (N, n) pairs and factor-strength scenarios with Var(σ²_i) = 1. Type-1 to Type-6 correspond to the six scenarios in Table 3.1.
either Wold-CV or CV-Boot, define

REE(λ̂) = ‖X**_{λ̂} − X‖²_F / ‖X**_{λ**_Opt} − X‖²_F − 1.
Correspondingly, we redefine the REE of ESA-BCV as

REE(ESA-BCV) = ‖X_ESA(k_BCV) − X‖²_F / ‖X**_{λ**_Opt} − X‖²_F − 1.
Notice that REE(ESA-BCV) defined here can be much higher than REE(k_BCV) defined in (3.21), where the ESA-BCV estimator is compared with the oracle of ESA. REE(ESA-BCV) can be much larger due to the gap between the oracles of ESA and POT-S.

Factor        γ = 0.02                γ = 0.2               γ = 1                 γ = 5                  γ = 50
type          (20,1000) (100,5000)    (20,100) (200,1000)   (50,50) (500,500)     (100,20) (1000,200)    (1000,20) (5000,100)

Err_Σ(POT-S)/Err_Σ(ESA)
  Type-1      0.232     0.398         0.428    0.792        0.724   0.962         0.733    0.988         0.756     0.978
  Type-2      0.329     0.451         0.512    0.809        0.746   0.962         0.764    0.966         0.783     0.936
  Type-3      0.338     0.552         0.521    0.831        0.757   0.966         0.784    0.972         0.794     0.944
  Type-4      0.560     0.747         0.754    0.944        0.960   1.011         1.024    1.013         1.020     1.002
  Type-5      0.382     0.653         0.634    0.925        0.886   0.999         0.945    1.001         0.952     0.989
  Type-6      0.455     0.618         0.602    0.864        0.808   0.983         0.867    0.986         0.939     0.982

Err_Σ(POT-S)/Err_Σ(QMLE)
  Type-1      0.830     0.434         0.886    0.681        0.969   0.938         0.796    1.007         0.764     0.981
  Type-2      0.990     0.457         0.955    0.746        0.985   0.944         0.834    0.987         0.792     0.940
  Type-3      1.117     0.525         1.015    0.765        1.003   0.963         0.846    0.992         0.801     0.948
  Type-4      0.722     0.723         0.921    0.897        1.039   1.000         1.064    1.019         1.025     1.004
  Type-5      0.565     0.374         0.859    0.687        0.980   0.929         0.976    1.004         0.958     0.990
  Type-6      0.309     0.070         0.487    0.246        0.642   0.567         0.665    0.911         0.780     0.843

Err_Σ(POT-S)/Err_Σ(POT)
  Type-1      0.622     0.460         0.564    0.653        0.593   0.842         0.568    0.913         0.910     0.922
  Type-2      0.615     0.134         0.586    0.572        0.550   0.827         0.571    0.796         0.910     0.879
  Type-3      0.593     0.136         0.594    0.486        0.518   0.722         0.578    0.796         0.917     0.881
  Type-4      0.559     0.141         0.547    0.595        0.558   0.825         0.549    0.798         0.908     0.878
  Type-5      0.587     0.453         0.565    0.649        0.650   0.902         0.637    0.882         0.897     0.874
  Type-6      0.610     0.410         0.665    0.961        0.850   1.004         0.814    0.997         0.875     0.966

Err_Σ(POT-S)/Err_Σ(POT-S-0)
  Type-1      1.023     1.009         1.014    0.999        1.011   0.978         1.005    0.988         0.994     0.988
  Type-2      1.003     1.000         1.021    1.014        1.006   1.005         1.004    1.001         1.000     1.000
  Type-3      1.006     1.000         1.015    1.013        1.007   1.007         1.005    1.003         1.002     1.002
  Type-4      1.021     1.000         0.999    1.012        1.009   1.006         1.012    1.004         1.006     1.003
  Type-5      1.016     1.001         1.086    1.000        1.017   1.001         1.008    1.000         1.001     0.999
  Type-6      1.262     0.995         0.997    0.988        0.997   1.014         1.013    0.996         1.003     0.961

Table 4.4: Four measurements comparing the error in estimating Σ when the oracle error of X is achieved, under various (N, n) pairs and factor-strength scenarios with Var(σ²_i) = 1. Type-1 to Type-6 correspond to the six scenarios in Table 3.1.
Figure 4.1 shows the survival curves of REE in estimating X for the three methods we compare. Basically, all three methods perform better for larger matrices, and CV-Boot then recovers λ**_Opt almost perfectly. For smaller matrices, there is barely any significant improvement of CV-Boot over Wold-CV on average, mainly because the bootstrap estimate of the noise needs a larger n for good accuracy.
Table 4.5 and Table 4.6 give more details of the simulation results. First, CV-Boot is consistently more accurate than Wold-CV in most of the cases. Also, the rank of the oracle estimate tracks well the theoretical threshold that there are 7 detectable factors.
Finally, we compare these methods in estimating Σ against Σ_{λ**_Opt}. Define

REE_Σ(λ̂) = R(Σ_{λ̂}, Σ) / R(Σ_{λ**_Opt}, Σ) − 1,
[Figure 4.1 appears here: three panels of REE survival curves for ESA-BCV, Wold-CV and CV-Boot; (a) all datasets, (b) large datasets only, (c) small datasets only.]

Figure 4.1: REE survival plots for estimating X: the proportion of samples with REE exceeding the number on the horizontal axis. Figure 4.1a shows all 6000 samples. Figure 4.1b shows only the 3000 simulations of larger matrices of each aspect ratio. Figure 4.1c shows only the 3000 simulations of smaller matrices.
and redefine REE_Σ(ESA-BCV) accordingly. In Table 4.7 we see that CV-Boot still performs better than Wold-CV in most of the cases.
Factor               γ = 0.02                  γ = 0.2                   γ = 1
type       Method    (20,1000)    (100,5000)   (20,100)    (200,1000)    (50,50)     (500,500)

Type-1     ESA-BCV   0.409  5.9   0.226  5.8   0.527  4.5  0.208  5.9    0.370  5.0  0.262  6.0
0/6/1/1    Wold-CV   0.013  7.0   0.006 11.7   0.082  6.2  0.017  9.4    0.021  7.4  0.014  8.5
           CV-Boot   0.008  7.1   0.004  9.6   0.092  6.2  0.001  7.2    0.016  7.1  0.001  7.1
           Oracle    –      7.1   –      7.1   –      6.7  –      7.2    –      7.1  –      7.0

Type-2     ESA-BCV   0.722  5.4   0.236  5.7   0.557  4.5  0.104  5.9    0.418  4.7  0.151  6.0
2/4/1/1    Wold-CV   0.001  7.1   0.001 11.9   0.133  6.3  0.014  9.6    0.017  7.3  0.020  9.5
           CV-Boot   0.000  7.1   0.001 11.1   0.041  6.3  0.001  7.4    0.016  7.2  0.001  7.2
           Oracle    –      7.1   –     10.0   –      7.2  –      7.5    –      7.1  –      7.3

Type-3     ESA-BCV   0.641  5.1   0.488  5.3   0.640  4.5  0.163  5.8    0.406  4.6  0.097  5.9
3/3/1/1    Wold-CV   0.000  7.2   0.000 12.1   0.017  6.7  0.014  9.7    0.017  7.5  0.022 10.1
           CV-Boot   0.000  7.2   0.000 11.9   0.037  6.8  0.001  7.4    0.015  7.3  0.002  7.2
           Oracle    –      7.2   –     11.3   –      7.2  –      7.7    –      7.2  –      7.5

Type-4     ESA-BCV   0.236  3.1   0.304  3.5   0.261  3.3  0.110  3.7    0.231  3.1  0.074  3.9
3/1/3/1    Wold-CV   0.040  6.9   0.123  9.9   0.317  5.9  0.012  9.2    0.019  6.5  0.022  9.4
           CV-Boot   0.002  6.9   0.035 10.4   0.117  5.7  0.000  7.3    0.018  6.3  0.002  7.0
           Oracle    –      7.0   –     11.0   –      6.0  –      7.3    –      6.6  –      7.1

Type-5     ESA-BCV   0.535  3.1   0.270  3.9   0.571  2.4  0.225  3.9    0.483  2.8  0.247  4.0
1/3/3/1    Wold-CV   0.015  6.8   0.005 11.8   0.271  4.8  0.019  9.0    0.026  6.3  0.020  8.3
           CV-Boot   0.010  6.9   0.001  7.8   0.155  4.7  0.002  7.1    0.026  6.1  0.002  6.9
           Oracle    –      6.9   –      7.3   –      5.2  –      7.1    –      6.5  –      7.0

Type-6     ESA-BCV   0.570  0.2   0.567  1.0   0.468  0.1  0.410  0.8    0.401  0.0  0.363  1.0
0/1/6/1    Wold-CV   0.055  6.3   0.025 11.4   0.091  4.0  0.031  8.0    0.060  4.2  0.040  7.7
           CV-Boot   0.022  5.7   0.000  7.0   0.087  3.8  0.003  6.3    0.071  3.8  0.006  6.0
           Oracle    –      5.6   –      7.0   –      5.3  –      6.3    –      5.4  –      5.9

Table 4.5: Comparison of REE and the rank of X with various (N, n) pairs and scenarios. For each scenario, the factors' strengths are listed as the number of "strong/useful/harmful/undetectable" factors. For each (N, n) pair, the first column is the REE and the second column is the rank of the estimated matrix. Both values are averages over 100 simulations. Var[σ²_i] = 1.
Factor               γ = 5                     γ = 50
type       Method    (100,20)    (1000,200)    (1000,20)   (5000,100)

Type-1     ESA-BCV   0.794  4.1  0.382  6.0    0.629  4.9  0.526  5.8
0/6/1/1    Wold-CV   0.027  7.6  0.012  8.0    0.009  7.0  0.008  7.6
           CV-Boot   0.023  7.2  0.001  7.0    0.009  7.0  0.002  7.0
           Oracle    –      7.0  –      7.0    –      7.0  –      7.0

Type-2     ESA-BCV   0.624  4.2  0.239  6.0    0.454  4.9  0.329  5.8
2/4/1/1    Wold-CV   0.015  7.3  0.018  9.6    0.003  7.0  0.037 15.4
           CV-Boot   0.011  7.2  0.001  7.2    0.003  7.0  0.000  7.4
           Oracle    –      7.1  –      7.1    –      7.0  –      7.1

Type-3     ESA-BCV   0.477  4.5  0.173  5.8    0.362  4.6  0.233  5.8
3/3/1/1    Wold-CV   0.018  7.5  0.023 10.5    0.001  7.0  0.036 16.4
           CV-Boot   0.015  7.4  0.001  7.2    0.001  7.0  0.000  7.6
           Oracle    –      7.1  –      7.3    –      7.0  –      7.5

Type-4     ESA-BCV   0.218  3.2  0.118  3.8    0.194  3.2  0.157  3.7
3/1/3/1    Wold-CV   0.024  6.7  0.019  9.5    0.001  7.0  0.043 16.3
           CV-Boot   0.024  6.2  0.001  7.0    0.001  7.0  0.000  7.5
           Oracle    –      6.5  –      7.1    –      7.0  –      7.5

Type-5     ESA-BCV   0.639  2.4  0.338  3.9    0.620  2.6  0.462  3.8
1/3/3/1    Wold-CV   0.029  6.3  0.022  8.5    0.005  7.0  0.034 11.7
           CV-Boot   0.027  6.3  0.002  6.9    0.005  7.0  0.001  7.0
           Oracle    –      6.3  –      7.0    –      6.9  –      7.0

Type-6     ESA-BCV   0.399  0.1  0.434  0.8    0.568  0.1  0.622  0.7
0/1/6/1    Wold-CV   0.094  3.9  0.035  7.8    0.010  6.8  0.039  9.1
           CV-Boot   0.113  3.3  0.005  6.0    0.010  6.4  0.003  6.8
           Oracle    –      5.0  –      6.1    –      6.3  –      6.9

Table 4.6: Like Table 4.5, but for larger aspect ratios γ.
                     γ = 0.02              γ = 0.2              γ = 1                γ = 5                 γ = 50
           Method    (20,1000) (100,5000)  (20,100) (200,1000)  (50,50)  (500,500)   (100,20)  (1000,200)  (1000,20) (5000,100)

Type-1     ESA-BCV   1.120     0.267       0.189    5.320       0.250    0.046       0.172     0.026       1.604     0.027
0/6/1/1    Wold-CV   0.037     0.147       0.158    0.052       0.017    0.270       0.642     0.090       1.477     -0.023
           CV-Boot   0.015     0.100       0.052    0.012       0.017    0.001       -0.018    -0.010      0.838     -0.038

Type-2     ESA-BCV   0.729     0.159       0.118    7.182       0.183    -0.022      0.095     0.017       0.964     0.059
2/4/1/1    Wold-CV   -0.058    0.102       0.152    0.006       0.017    0.385       0.620     0.211       0.560     0.195
           CV-Boot   -0.096    0.058       0.089    0.001       0.017    -0.036      -0.034    -0.003      0.384     0.005

Type-3     ESA-BCV   1.046     0.087       0.079    5.325       0.079    -0.050      0.161     -0.016      1.425     0.040
3/3/1/1    Wold-CV   -0.090    0.101       0.137    0.002       0.007    0.444       0.584     0.260       0.205     0.201
           CV-Boot   -0.071    0.085       0.091    0.001       0.007    -0.048      -0.051    -0.021      0.155     -0.000

Type-4     ESA-BCV   0.448     -0.105      -0.059   1.368       -0.065   -0.061      -0.015    -0.035      0.840     -0.017
3/1/3/1    Wold-CV   -0.007    0.075       0.043    0.005       0.006    0.410       0.556     0.216       0.645     0.207
           CV-Boot   -0.041    0.034       -0.000   0.004       0.006    -0.018      -0.013    -0.009      0.164     -0.000

Type-5     ESA-BCV   0.689     0.117       0.103    2.198       0.082    -0.012      0.026     -0.006      0.737     0.013
1/3/3/1    Wold-CV   0.049     0.077       0.075    0.085       0.019    0.290       0.635     0.159       1.527     0.119
           CV-Boot   0.123     0.056       0.024    0.027       0.019    0.002       0.002     0.003       0.193     -0.003

Type-6     ESA-BCV   0.670     0.183       0.327    1.689       0.122    0.022       0.212     0.022       0.922     0.030
0/1/6/1    Wold-CV   0.067     0.055       0.004    -0.199      0.030    0.221       0.487     0.117       1.813     0.059
           CV-Boot   0.043     0.046       0.000    -0.019      0.016    -0.030      -0.009    -0.010      -0.005    -0.004

Table 4.7: Comparison of REE_Σ for various (N, n) pairs and scenarios. For each scenario, the factors' strengths are listed as the number of "strong/useful/harmful/undetectable" factors. The values are averages over 100 simulations. Var[σ²_i] = 1.
Chapter 5
Confounder adjustment with factor
analysis
In this chapter, we discuss a multiple regression model with bias corrected by factor analysis.
The motivation of the problem is to correct for the biases and correlations of individual test statistics in multiple hypothesis testing. In many scientific studies, for example microarray analysis, tens of thousands of tests are typically performed simultaneously. A typical model is that each individual test statistic is obtained via linear regression, regressing the response variable on the variable of interest together with other known covariates. However, there can be unknown factors that affect the response variables in many of the individual hypotheses, inducing correlation among the individual test statistics. Moreover, those latent factors can also be correlated with the variable of interest, in which case the test statistics are not only correlated but also confounded. We use the phrase "confounding" to emphasize that these latent factors can significantly bias the individual p-values. Simultaneous inference such as false discovery rate (FDR) control requires independent and correct individual p-values. Many confounder adjustment methods have been proposed for multiple testing over the last decade (Gagnon-Bartsch et al., 2013; Leek and Storey, 2008b; Price et al., 2006; Sun et al., 2012). Our goal is to unify these methods in a single framework and study their statistical properties based on theoretical results in factor analysis.
In microarray data analysis, common sources of confounding factors include unknown technical bias (Gagnon-Bartsch et al., 2013), environmental changes (Fare et al., 2003; Gasch et al., 2000) and surgical manipulation (Lin et al., 2006). See Lazar et al. (2013) for a survey. In many studies, especially observational clinical research and human expression data, the latent factors, whether genetic or technical, are confounded with the primary variables of interest due to the observational nature of the studies and the heterogeneity of samples (Ransohoff, 2005; Rhodes and Chinnaiyan, 2005). Similar confounding problems also occur in other high-dimensional datasets such as brain imaging (Schwartzman et al., 2008) and metabonomics (Craig et al., 2006).
Notation. Subscripts of matrices are used to indicate row(s) whenever possible. For example, if
C is a set of indices, then XC is the corresponding rows of a matrix X. A random matrix E ∈ Rn×p
is said to follow a matrix normal distribution with mean M ∈ Rn×p, row covariance U ∈ Rn×n
and column covariance V ∈ Rp×p, abbreviated as E ∼ MN (M,U, V ), if the vectorization of E by
column follows the multivariate normal distribution vec(E) ∼ N(vec(M), V ⊗ U). When U = In,
this means the rows of E are i.i.d. N(0, V ). We use the usual notation in asymptotic statistics that a
random variable is Op(1) if it is bounded in probability, and op(1) if it converges to 0 in probability.
Bold symbols Op(1) or op(1) mean each entry of the vector is Op(1) or op(1).
5.1 The model and the algorithm
5.1.1 A statistical model for confounding factors
We consider a single primary variable of interest and no other known control variables in this section. It is common to add intercepts and known effects (such as lab and batch effects) to the regression model. This extension to multiple linear regression does not change the main theoretical results in this chapter and is discussed later in Section 5.3.

For simplicity, all the variables in this section are assumed to have marginal mean 0. Our model is built on the linear model already widely used in the existing literature, which we rewrite here:
Y_{N×n} = β_{N×1} Z_{1×n} + L_{N×r} F_{r×n} + Σ^{1/2} E_{N×n},          (5.1a)

where Y is the observed data matrix of responses, Z is the variable of interest, and F holds the latent factor variables (or confounders). Each row represents a variable and each column represents a sample. We assume the random factor score model for the dependence between F and the primary variable Z, with a linear relationship as in

F = α Z + W,          (5.1b)

and in addition some distributional assumptions on Z, W and the noise matrix E:

Z_j i.i.d. with mean 0, variance 1,  j = 1, ..., n,          (5.1c)
W ∼ MN(0, I_r, I_n),  W ⟂ Z,          (5.1d)
E ∼ MN(0, I_N, I_n),  E ⟂ (Z, F).          (5.1e)
The parameters in model (5.1) are β ∈ R^{N×1}, the primary effects we are most interested in; L ∈ R^{N×r}, the factor loadings; α ∈ R^{r×1}, the association of the primary variable with the confounding factors; and Σ ∈ R^{N×N}, the noise covariance matrix. We assume Σ is diagonal, Σ = diag(σ²_1, ..., σ²_N), so the noise for different outcome variables is independent.
In (5.1c), Zi is not required to be Gaussian or even continuous. For example, a binary or categor-
ical variable after normalization also meets this assumption. The parameter vector α measures how
severely the data are confounded. For a more intuitive interpretation, consider an oracle procedure of
estimating β when the confounders F in eq. (5.1a) are observed. The best linear unbiased estimator
in this case is the ordinary least squares (β̂_i^OLS, L̂_i^OLS), whose variance is σ_i^2 Var[(Z_j, F_j)]^{−1}/n. Using
(5.1b) and (5.1d), it is easy to show that Var(β̂_i^OLS) = (1 + ‖α‖_2^2) σ_i^2/n and Cov(β̂_{i_1}^OLS, β̂_{i_2}^OLS) = 0 for
i_1 ≠ i_2. In summary,

Var(β̂^OLS) = (1/n) (1 + ‖α‖_2^2) Σ.   (5.2)
Notice that in the unconfounded linear model in which F = 0, the variance of the OLS estimator
of β is Σ/n. Therefore, 1 + ‖α‖_2^2 represents the relative loss of efficiency when we add observed
variables F, which are correlated with Z, to the regression. In Section 5.2, we show that the oracle
efficiency (5.2) can be asymptotically achieved even when F is unobserved.
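The oracle variance formula (5.2) is easy to verify by Monte Carlo. The sketch below simulates model (5.1) with invented sizes and parameter values, and compares the empirical variance of the oracle OLS estimator with (1 + ‖α‖_2^2)Σ/n:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, r = 50, 200, 2                      # illustrative sizes (assumptions)
beta = rng.normal(size=N)
L = rng.normal(size=(N, r))
alpha = np.full(r, 1.0 / np.sqrt(r))      # so that ||alpha||_2^2 = 1
sigma2 = rng.uniform(0.5, 1.5, size=N)

def draw():
    Z = rng.choice([-1.0, 1.0], size=n)               # mean 0, variance 1
    F = np.outer(alpha, Z) + rng.normal(size=(r, n))  # eq. (5.1b)
    E = rng.normal(size=(N, n))
    Y = np.outer(beta, Z) + L @ F + np.sqrt(sigma2)[:, None] * E  # eq. (5.1a)
    return Y, Z, F

# Oracle OLS: regress each row of Y on (Z, F), which are observed here.
reps = []
for _ in range(2000):
    Y, Z, F = draw()
    X = np.vstack([Z[None, :], F]).T                  # n x (1 + r) design
    coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)    # (1 + r) x N
    reps.append(coef[0])                              # the oracle beta-hats
emp_var = np.var(np.array(reps), axis=0)
theory = (1 + np.sum(alpha**2)) * sigma2 / n          # diagonal of eq. (5.2)

# Empirical and theoretical variances agree to within Monte Carlo error.
assert abs(np.mean(emp_var / theory) - 1) < 0.05
```

With ‖α‖_2^2 = 1 the oracle variance is exactly twice the unconfounded variance Σ/n, illustrating the efficiency-loss interpretation above.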
5.1.2 Model identification
Following Sun et al. (2012), we introduce a transformation of the data to make the identification
issues clearer. Consider the Householder rotation matrix Q ∈ R^{n×n} such that ZQ = ‖Z‖_2 e_1^T =
(‖Z‖_2, 0, 0, . . . , 0). Right-multiplying Y by Q, we get Ỹ = YQ = β ‖Z‖_2 e_1^T + L F̃ + Σ^{1/2} Ẽ, where

F̃ = FQ = (αZ + W)Q = ‖Z‖_2 α e_1^T + W̃,   (5.3)

and W̃ = WQ equals W in distribution, and Ẽ = EQ equals E in distribution. As a consequence, the first column and the remaining columns of Ỹ are

Ỹ_1 = ‖Z‖_2 β + L F̃_1 + Σ^{1/2} Ẽ_1 ∼ N(‖Z‖_2 (β + Lα), LL^T + Σ),   (5.4)

Ỹ_{−1} = L F̃_{−1} + Σ^{1/2} Ẽ_{−1} ∼ MN(0, LL^T + Σ, I_{n−1}).   (5.5)
Here Ỹ_1 is a length-N vector, Ỹ_{−1} is an N × (n − 1) matrix, and the distributions are now conditional
on Z.
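The Householder matrix used in this construction can be written down explicitly. A minimal numerical sketch (the dimension and the value of Z are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
Z = rng.choice([-1.0, 1.0], size=n)

# Householder reflection Q = I - 2 v v^T / (v^T v) with v = Z - ||Z||_2 e_1
# maps the row vector Z to (||Z||_2, 0, ..., 0).
v = Z.copy()
v[0] -= np.linalg.norm(Z)
Q = np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)

assert np.allclose(Z @ Q, np.linalg.norm(Z) * np.eye(n)[0])  # Z Q = ||Z|| e_1^T
assert np.allclose(Q @ Q.T, np.eye(n))                       # Q is orthogonal
```

Because Q is orthogonal, right-multiplying the Gaussian matrices W and E by Q leaves their distributions unchanged, which is exactly what (5.3)–(5.5) rely on.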
Equation (5.5) is just a factor analysis model, and the identification of L and Σ has been discussed
thoroughly in Section 1.3. Notice that for our purpose of estimating and testing for β, we only need
to identify the column space of L, as we have a free parameter α that can change to accommodate
any rotation of L. Based on (5.4), after Σ and the column space of L are identified, the task now is
to identify β given β + Lα.
Notice that the parameters β and α cannot be identified from (5.4) given L and Σ, because
they have in total N + r parameters while Ỹ_1 is a length-N vector. If we write P_L and P_{L⊥} for
the projections onto the column space of L and its orthogonal complement, so that β = P_L β + P_{L⊥} β, it is
impossible to identify P_L β from (5.4).
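This non-identifiability is concrete: shifting β along the column space of L and absorbing the shift into α leaves β + Lα unchanged. A small numerical illustration with invented values:

```python
import numpy as np

rng = np.random.default_rng(9)
N, r = 6, 1
L = rng.standard_normal((N, r))
alpha = np.array([0.5])
beta = np.array([2.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # sparse: five zero entries

# Shift beta along col(L) and compensate in alpha: the mean is unchanged.
delta = np.array([1.0])
beta_alt = beta - (L @ delta)
alpha_alt = alpha + delta
assert np.allclose(beta + L @ alpha, beta_alt + L @ alpha_alt)

# But beta_alt is dense; a sparsity restriction like B(s) below rules it out.
assert np.count_nonzero(beta_alt) == N
```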
This suggests that we should further restrict the parameter space. We will reduce the degrees
of freedom by restricting at least r entries of β to equal 0. We consider two sufficient identification
conditions of β representing the negative-control scenario discussed in Gagnon-Bartsch et al. (2013)
and the sparsity scenario discussed in Sun et al. (2012).
Under the negative control scenario, for a known subset of the variables C ⊆ {1, 2, . . . , N} there
are no primary effects, namely β_C = 0. It can be immediately seen that a necessary and sufficient
condition to identify β from β + Lα, given the column space of L and Σ, is that L_C has rank r.
Under the sparsity scenario, β is known to be sparse, but none of the locations of the zero entries
are known. For a vector β, define I_β as the index set of the zero entries of β. Then define the
s-sparse (s ≥ r) space of β as

B(s) = {β ∈ R^N : |I_β| ≥ s, rank(L_{I_β}) = r}.

Let B = (β L) ∈ R^{N×(r+1)}. Then we have the following result, which is basically a corollary of
Theorem 1.3.2.

Corollary 5.1.1. For the linear model (5.1), assume that Σ and the column space of L can be
identified. Then a necessary and sufficient condition for β to be identified in B(s) (s ≥ r) is that for
any index subset S of size s, if B_S has rank r then β_S = 0.
Proof. For sufficiency, we only need to show that if β̃ + Lα̃ = β + Lα, then β̃ = β.
Let δ = α − α̃; then β̃ = β + Lδ. As β̃ ∈ B(s), we have (β + Lδ)_{I_β̃} = 0. As |I_β̃| ≥ s, using the
condition we get I_β̃ ⊆ I_β. As a result, L_{I_β̃} δ = 0, and thus δ = 0 since L_{I_β̃} has rank r. This
proves sufficiency.

For necessity, assume that there is a subset S of size s which is not a subset of I_β but B_S has rank r. Then
there exists a nonzero δ such that β_S = L_S δ. Then we can set β̃ = β − Lδ. It is easy to check that
β̃ ∈ B(s) and β̃ ≠ β.
5.1.3 The two-step algorithm
Given (5.4) and (5.5), it is straightforward to estimate the model in two steps.
Step 1: Estimate Σ and L in factor analysis
Equation (5.5) is a random score factor analysis model. In this section we will use the MLE to
estimate Σ and L, mainly for the purpose of theoretical analysis, assuming only strong factors. In
practice, one would prefer conditioning on the factor scores and using either ESA-BCV proposed
in Chapter 3 or POT-S proposed in Chapter 4, which may have better performance than the MLE.
As discussed in Section 2.1, the MLE estimates Σ and L by maximizing (2.2):

−(n/2) log det(LL^T + Σ) − (n/2) tr[(LL^T + Σ)^{−1} S],

where S is the sample covariance matrix of Ỹ_{−1} with known 0 mean:

S = Ỹ_{−1} Ỹ_{−1}^T / (n − 1).
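For intuition, the maximizer of this likelihood can be computed with a standard EM iteration for the diagonal-noise Gaussian factor model. The sketch below is a generic EM implementation, not the Chapter 2 estimator, and all sizes are invented:

```python
import numpy as np

def fa_mle(Y, r, n_iter=500):
    """EM sketch for the Gaussian factor model: columns of Y ~ N(0, L L^T + Psi),
    with Psi diagonal. Illustrative, not the estimator analyzed in the text."""
    N, n = Y.shape
    S = Y @ Y.T / n
    L = np.linalg.svd(Y, full_matrices=False)[0][:, :r]   # crude initialization
    psi = np.diag(S).copy()
    for _ in range(n_iter):
        G = np.linalg.inv(np.eye(r) + (L / psi[:, None]).T @ L)  # posterior cov of factors
        M = G @ (L / psi[:, None]).T @ Y                          # E[F | Y], r x n
        L = (Y @ M.T) @ np.linalg.inv(n * G + M @ M.T)            # M-step for loadings
        psi = np.diag(S - L @ (M @ Y.T) / n).copy()               # M-step for noise
    return L, psi

rng = np.random.default_rng(2)
N, n, r = 10, 20000, 2
L_true = 2.0 * rng.standard_normal((N, r))
sigma2 = rng.uniform(0.5, 1.5, size=N)
Y = L_true @ rng.standard_normal((r, n)) \
    + np.sqrt(sigma2)[:, None] * rng.standard_normal((N, n))

L_hat, psi_hat = fa_mle(Y, r)
# Only L L^T + Sigma is identified, so compare fitted and true covariances.
fit = L_hat @ L_hat.T + np.diag(psi_hat)
true = L_true @ L_true.T + np.diag(sigma2)
rel_err = np.abs(fit - true) / np.sqrt(np.outer(np.diag(true), np.diag(true)))
assert np.max(rel_err) < 0.15
```

As the text notes, only LL^T + Σ (and hence the column space of L) is identified at this step; any rotation of L̂ is absorbed by the free parameter α in step 2.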
Step 2: Estimate α and β using linear regression
We consider estimating β in both the negative control scenario and sparsity scenario.
For the negative control scenario, if we know a set C such that β_C = 0, then Ỹ_1 can be
correspondingly separated into two parts:

Ỹ_{C,1}/‖Z‖_2 = L_C (α + W̃_1/‖Z‖_2) + (Σ_C^{1/2}/‖Z‖_2) Ẽ_{C,1}, and

Ỹ_{−C,1}/‖Z‖_2 = β_{−C} + L_{−C} (α + W̃_1/‖Z‖_2) + (Σ_{−C}^{1/2}/‖Z‖_2) Ẽ_{−C,1}.   (5.6)
We consider the following negative control (NC) estimator via generalized least squares:

α̂^NC = (1/‖Z‖_2) (L̂_C^T Σ̂_C^{−1} L̂_C)^{−1} L̂_C^T Σ̂_C^{−1} Ỹ_{C,1}, and   (5.7)

β̂^NC = Ỹ_{−C,1}/‖Z‖_2 − L̂_{−C} α̂^NC.   (5.8)
This estimator matches the RUV-4 estimator of Gagnon-Bartsch et al. (2013), except that it uses
maximum likelihood estimates of Σ and L instead of PCA, and generalized least squares
instead of ordinary least squares regression.
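Putting (5.4), (5.7) and (5.8) together gives a short numerical sketch. For simplicity the true L and Σ stand in for their step-1 estimates, and all sizes and values are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, r, nC = 200, 500, 2, 30
C = np.arange(nC)                          # negative-control set: beta_C = 0
beta = np.zeros(N)
beta[nC:] = rng.choice([0.0, 0.5], size=N - nC, p=[0.9, 0.1])
L = rng.standard_normal((N, r))
sigma2 = rng.uniform(0.5, 1.5, size=N)
alpha = np.array([0.7, 0.7])

Z = rng.choice([-1.0, 1.0], size=n)
F = np.outer(alpha, Z) + rng.standard_normal((r, n))
Y = np.outer(beta, Z) + L @ F + np.sqrt(sigma2)[:, None] * rng.standard_normal((N, n))

# Rotate so that the primary variable sits in the first column (Section 5.1.2).
v = Z - np.linalg.norm(Z) * np.eye(n)[0]
Q = np.eye(n) - 2.0 * np.outer(v, v) / (v @ v)
Y1 = (Y @ Q)[:, 0]                         # eq. (5.4)

# eqs. (5.7)-(5.8): GLS on the negative controls, then subtract L alpha-hat.
A = L[C].T / sigma2[C]                     # L_C^T Sigma_C^{-1}
alpha_nc = np.linalg.solve(A @ L[C], A @ Y1[C]) / np.linalg.norm(Z)
beta_nc = Y1 / np.linalg.norm(Z) - L @ alpha_nc

assert np.sqrt(np.mean((beta_nc - beta) ** 2)) < 0.2
```

In practice L and Σ would be replaced by the step-1 estimates L̂ and Σ̂.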
For the sparsity scenario, where the zero indices in β are unknown but β is sparse, the estimation of
α and β from Ỹ_1/‖Z‖_2 = β + L F̃_1/‖Z‖_2 + Σ^{1/2} Ẽ_1/‖Z‖_2 can be cast as a robust regression, viewing Ỹ_1
as the observations and L̂ as the design matrix. The nonzero entries in β correspond to outliers in this
linear regression.
More specifically, given a robust loss function ρ, we consider the following estimator:

α̂^RR = argmin_α ∑_{i=1}^N ρ((Ỹ_{i1}/‖Z‖_2 − L̂_{i·} α) / σ̂_i), and   (5.9)

β̂^RR = Ỹ_1/‖Z‖_2 − L̂ α̂^RR.   (5.10)
For a broad class of loss functions ρ, estimating α by (5.9) is equivalent to

(α̂^RR, β̃) = argmin_{α,β} ∑_{i=1}^N (1/σ̂_i^2) (Ỹ_{i1}/‖Z‖_2 − β_i − L̂_{i·} α)^2 + P_λ(β),   (5.11)

where P_λ(β) is a penalty to promote sparsity of β (She and Owen, 2011). However, β̂^RR is not
identical to the β̃ from (5.11), which is a sparse vector that does not have an asymptotic normal distribution. The
LEAPP algorithm (Sun et al., 2012) uses the form (5.11). Replacing it by the robust regression
(5.9) and (5.10) allows us to derive significance tests of H_{0i}: β_i = 0.
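The robust regression (5.9)–(5.10) can be computed by iteratively reweighted least squares (IRLS). The sketch below uses a Huber loss for simplicity (the experiments in Section 5.4 use Tukey's bisquare instead); all sizes and values are invented:

```python
import numpy as np

def rr_alpha(y, X, scale, delta=1.345, n_iter=100):
    """IRLS sketch for eq. (5.9) with a Huber loss (an illustrative substitute
    for the bisquare loss used in the chapter's experiments)."""
    a = np.linalg.lstsq(X / scale[:, None], y / scale, rcond=None)[0]  # OLS start
    for _ in range(n_iter):
        res = (y - X @ a) / scale
        w = np.minimum(1.0, delta / np.maximum(np.abs(res), 1e-12))   # Huber weights
        Xw = X * (w / scale**2)[:, None]
        a = np.linalg.solve(X.T @ Xw, Xw.T @ y)                       # weighted LS
    return a

rng = np.random.default_rng(4)
N, r, n = 500, 2, 400
L = rng.standard_normal((N, r))
sigma = rng.uniform(0.7, 1.2, size=N)
alpha = np.array([1.0, -0.5])
beta = np.zeros(N)
beta[:25] = 1.0                               # 5% nonzero effects = "outliers"

# y plays the role of Y_1 / ||Z||_2; the noise scale is sigma_i / sqrt(n).
y = beta + L @ alpha + (sigma / np.sqrt(n)) * rng.standard_normal(N)
alpha_rr = rr_alpha(y, L, sigma / np.sqrt(n))
beta_rr = y - L @ alpha_rr                     # eq. (5.10)

assert np.linalg.norm(alpha_rr - alpha) < 0.1
assert np.max(np.abs(beta_rr[:25] - 1.0)) < 0.3
```

The bounded ψ downweights the rows with nonzero β_i, so α̂ is driven by the (majority) null rows, and β̂^RR recovers the outlying effects.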
5.2 Statistical inference for β

In this section, we derive asymptotic distributions of β̂^NC and β̂^RR when n, N → ∞ and propose
asymptotically valid tests for the individual hypotheses H_{0i}: β_i = 0.

The results are based on Theorem 2.1.3 and Theorem 2.1.4 for the asymptotics of Σ̂ and L̂ in
the first step. As the asymptotic distributions for the MLE (or QMLE) have the simplest form for the
non-random factor score model, we first convert the random score factor model (5.5) to the non-
random factor score model. We introduce the matrix R ∈ R^{r×r} such that RR^T = (n − 1)^{−1} F̃_{−1} F̃_{−1}^T
and R^T L^T Σ^{−1} L R is diagonal. Define F̃^{(0)}_{−1} = R^{−1} F̃_{−1} and L^{(0)} = LR; then L^{(0)} F̃^{(0)}_{−1} = L F̃_{−1}, and L^{(0)}
and F̃^{(0)}_{−1} satisfy Assumption 4.
Based on Theorem 2.1.3 and Theorem 2.1.4, we have the following results for L̂ and Σ̂:

Corollary 5.2.1. Under the assumptions of Theorem 2.1.4 for the random factor score model, when
n, N → ∞, then for any fixed index set S with finite cardinality,

√n (L̂_S − L^{(0)}_S) →_d MN(0, Σ_S, I_r),   (5.12)

where Σ_S is the noise covariance matrix of the variables in S. Also,

max_{i≤N} ‖L̂_{i·} − L_{i·}‖_2 = O_p(√(log N / n)),  max_{i≤N} |σ̂_i^2 − σ_i^2| = O_p(√(log N / n)),   (5.13)

max_{i≤N} ‖L̂_{i·}‖_2 = O_p(1),   (5.14)

max_{i=1,2,...,N} ‖ L̂_{i·} − L^{(0)}_{i·} − (1/(n−1)) ∑_{j=2}^n σ_i ẽ_{ij} F̃^{(0)T}_{·j} ‖_2 = o_p(n^{−1/2}).   (5.15)
Proof. First, we show that conditional on F̃_{−1}, without loss of generality, we can also assume L^{(0)} satisfies
Assumption 5. Bai and Li (2012a, Lemma A.1) proved that for a matrix Q ∈ R^{r×r}, if QQ^T = I_r
and Q^T V Q = D, where both V and D are diagonal matrices and V has distinct diagonal entries,
then Q must be a diagonal matrix with entries either 1 or −1. Since RR^T = I_r + O_p(n^{−1/2}) and
L^T Σ^{−1} L is diagonal, we conclude that we can find R satisfying R = I_r + O_p(n^{−1/2}).
In other words, ‖R − I_r‖_max = O_p(n^{−1/2}). Since L is bounded by C, we also have ‖L^{(0)} − L‖_max =
‖L(R − I_r)‖_max = O_p(n^{−1/2}). Thus, though L^{(0)} is not bounded, its entries are asymptotically
uniformly bounded with rate O_p(n^{−1/2}), so the unbounded part does not make a difference in the
proof of consistency and asymptotic normality of L̂.

The results in Corollary 5.2.1 are then straightforward consequences of Theorem 2.1.4. Notice
that we have just shown in the previous paragraph that max_{i=1,...,N} ‖L^{(0)}_{i·} − L_{i·}‖_2 = O_p(n^{−1/2}).
Also, (5.14) is a direct consequence of (A.12).
Next, we discuss the asymptotic properties of β̂ and α̂ under both the negative-control and
sparsity scenarios.
First, notice that the parameters α and β only appear in (5.4), so their inference can be completely
separated from the inference of L and Σ. In fact, under the Gaussian assumption and conditional
on Z, Ỹ_1 ⊥⊥ Ỹ_{−1} since Ẽ_1 ⊥⊥ Ẽ_{−1}, so the two steps use mutually independent information. This in turn
greatly simplifies the theoretical analysis.
For the rest of the section, we first consider the estimation of β for fixed W̃_1, R and Z, and then
show that the asymptotic distribution of β̂ does not depend on W̃_1, R or Z. Thus all the results
also hold unconditionally. This conditioning step simplifies our analysis, and we will see that it
does not affect asymptotic efficiency.

To use the results in Corollary 5.2.1, we replace L by L^{(0)} and rewrite (5.4) as

Ỹ_1/‖Z‖_2 = β + L^{(0)} α^{(0)} + Σ^{1/2} Ẽ_1/‖Z‖_2,   (5.16)

where L^{(0)} = LR and α^{(0)} = R^{−1}(α + W̃_1/‖Z‖_2). Notice that the random R only depends on F̃_{−1} and
thus is independent of Ỹ_1. Also, in the proof above we have shown that ‖R − I_r‖_max = O_p(n^{−1/2}),
and ‖Z‖_2/√n → 1 by the law of large numbers; it is thus easy to check that ‖α^{(0)} − α‖_2 = o_p(1).
5.2.1 The negative control scenario
Let’s first analyze the negative control scenario βC = 0 for a known index set C. The number of
negative controls |C| may grow as N → ∞. We impose an additional assumption on the latent
factors of the negative controls.
Assumption 10. lim_{N→∞} |C|^{−1} L_C^T Σ_C^{−1} L_C exists and is positive definite.
Then we can show the consistency and asymptotic distribution of βNC from (5.8).
Theorem 5.2.1. Under the assumptions of Corollary 5.2.1 and Assumption 10, if n, N → ∞, then
for any fixed index set S with finite cardinality and S ∩ C = ∅, we have

√n (β̂^NC_S − β_S) →_d N(0, (1 + ‖α‖_2^2)(Σ_S + ∆_S)),   (5.17)

where ∆_S = L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_S^T.
If in addition |C| → ∞, then

√n (β̂^NC_S − β_S) →_d N(0, (1 + ‖α‖_2^2) Σ_S).   (5.18)
Proof. We can prove the theorem by showing that the conclusion holds for any fixed
sequences {Z^{(n)}}_{n=1}^∞ and {R^{(n,N)}}_{n,N=1}^∞ such that ‖Z^{(n)}‖_2/√n → 1 and R^{(n,N)} → I_r as
n, N → ∞. For brevity we will write Z and R instead of Z^{(n)} and R^{(n,N)} for the rest of this proof.
Plugging (5.6) into the estimators (5.7) and (5.8), we obtain

√n (β̂^NC_{−C} − β_{−C}) = (√n/‖Z‖_2) (Σ_{−C}^{1/2} Ẽ_{−C,1} − L̂_{−C} (L̂_C^T Σ̂_C^{−1} L̂_C)^{−1} L̂_C^T Σ̂_C^{−1} Σ_C^{1/2} Ẽ_{C,1})
    + √n (L^{(0)}_{−C} − L̂_{−C}) α^{(0)}
    + √n L̂_{−C} (L̂_C^T Σ̂_C^{−1} L̂_C)^{−1} L̂_C^T Σ̂_C^{−1} (L̂_C − L^{(0)}_C) α^{(0)}.
As n, N → ∞, √n/‖Z‖_2 →_a.s. 1. Also, by (5.13), Σ̂ and L̂ converge entrywise, uniformly in
probability, to Σ and L. Thus, using Assumption 10, we get

((1/|C|) L̂_C^T Σ̂_C^{−1} L̂_C)^{−1} = ((1/|C|) L_C^T Σ_C^{−1} L_C)^{−1} + o_p(1),

(1/|C|) L̂_C^T Σ̂_C^{−1} Σ_C^{1/2} Ẽ_{C,1} = (1/|C|) L_C^T Σ_C^{−1} Σ_C^{1/2} Ẽ_{C,1} + o_p(1), and

(1/|C|) L̂_C^T Σ̂_C^{−1} (√n (L̂_C − L^{(0)}_C)) = (1/|C|) L_C^T Σ_C^{−1} (√n (L̂_C − L^{(0)}_C)) + o_p(1),   (5.19)
which imply that

√n (β̂^NC_S − β_S) = Σ_S^{1/2} Ẽ_{S,1} − L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_C^T Σ_C^{−1} Σ_C^{1/2} Ẽ_{C,1}
    + √n (L^{(0)}_S − L̂_S) α^{(0)}
    + √n L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_C^T Σ_C^{−1} (L̂_C − L^{(0)}_C) α^{(0)} + o_p(1).   (5.20)
Note that Ẽ_1 ⊥⊥ L̂, Ẽ_{C,1} ⊥⊥ Ẽ_{S,1}, and √n vec(L̂_S − L^{(0)}_S) →_d N(0, Σ_S ⊗ I_r), so the four main terms on the right
hand side of (5.20) are (asymptotically) uncorrelated, and we only need to work out their individual
variances. Since Σ^{1/2} Ẽ_1 ∼ N(0, Σ), we have Σ_S^{1/2} Ẽ_{S,1} ∼ N(0, Σ_S) and
L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_C^T Σ_C^{−1} Σ_C^{1/2} Ẽ_{C,1} ∼ N(0, ∆_S). Similarly, √n (L^{(0)}_S − L̂_S) α^{(0)} →_d N(0, ‖α‖_2^2 Σ_S), and

√n L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_C^T Σ_C^{−1} (L̂_C − L^{(0)}_C) α^{(0)} →_d N(0, ‖α‖_2^2 ∆_S).

If in addition |C| → ∞, then the minimum eigenvalue of L_C^T Σ_C^{−1} L_C tends to ∞ by Assumption 10, so
the maximum entry of ∆_S goes to 0. Thus, (5.18) holds.
The asymptotic variance in (5.18) is the same as the variance of the oracle least squares in (5.2).
Comparable oracle efficiency statements can be found in the econometrics literature (Bai and Ng,
2006; Wang et al., 2015). This is also the variance used implicitly in RUV-4 as it treats the estimated
Z as given when deriving test statistics for β. When the number of negative controls is not too large,
say |C| = 30, the correction term ∆_S is nontrivial and gives a more accurate estimate of the variance
of β̂^NC. See Section 5.4.1 for more simulation results.
5.2.2 The sparsity scenario
We then analyze the sparsity scenario where we know β is sparse but the zero indices are unknown.
To guarantee the performance of the robust regression estimator β̂^RR, we assume a smooth loss ρ
for the theoretical analysis:

Assumption 11. The loss function ρ : R → [0, ∞) satisfies ρ(0) = 0. The function ρ(x) is non-increasing
when x ≤ 0 and is non-decreasing when x > 0. The derivative ψ = ρ′ exists and |ψ| ≤ D for some
D < ∞. Furthermore, ρ is strongly convex in a neighborhood of 0.
A sufficient condition for the local strong convexity is that ψ′ > 0 exists in a neighborhood of 0.
The next theorem establishes the consistency of β̂^RR.

Theorem 5.2.2. Under the assumptions of Corollary 5.2.1 and Assumption 11, if n, N → ∞ and
‖β‖_1/N → 0, then α̂^RR →_p α. As a consequence, for any i, β̂^RR_i →_p β_i.
Proof. We abbreviate α̂^RR as α̂ in this proof. To avoid confusion, we use α for the true value of the
parameter and α̃ for a generic vector in R^r. Define ϕ(α̃) = (1/N) ∑_{i=1}^N ρ(L̂_{i·} α̃ / σ̂_i).
Because α^{(0)} →_p α, we prove this theorem by showing that for any ε > 0, P[‖α̂ − α^{(0)}‖_2 ≥ ε] → 0.

We break our proof into two key results. First, we show that α̂ and α^{(0)} are close in the following
sense:

ϕ(α^{(0)} − α̂) = (1/N) ∑_{i=1}^N ρ(L̂_{i·} (α^{(0)} − α̂) / σ̂_i) = o_p(1),   (5.21)

and second, we show that for sufficiently small ε > 0, there exists τ > 0 such that as n, N → ∞,

P[ inf_{‖α̃‖_2 ≥ ε} ϕ(α̃) > τ ] → 1.   (5.22)

Based on these two results and the observation that

{‖α^{(0)} − α̂‖_2 < ε} ⊇ {ϕ(α^{(0)} − α̂) < τ} ∩ { inf_{‖α̃‖_2 ≥ ε} ϕ(α̃) > τ },

we conclude that P[‖α̂ − α^{(0)}‖_2 ≥ ε] → 0.
Let us start with (5.21). Denote l_p(α̃) = N^{−1} ∑_{i=1}^N ρ((Ỹ_{i1}/‖Z‖_2 − L̂_{i·} α̃)/σ̂_i). By (5.9), we have
α̂ = argmin l_p(α̃), so l_p(α̂) ≤ l_p(α^{(0)}). We examine the difference between l_p(α̃) and ϕ(α^{(0)} − α̃)
for any α̃, starting from

l_p(α̃) = (1/N) ∑_{i=1}^N ρ((Ỹ_{i1}/‖Z‖_2 − L̂_{i·} α̃)/σ̂_i)
       = (1/N) ∑_{i=1}^N ρ((β_i + L^{(0)}_{i·} α^{(0)} + σ_i Ẽ_{i1}/‖Z‖_2 − L̂_{i·} α̃)/σ̂_i).

Because ρ has a bounded derivative, |ρ(x) − ρ(y)| ≤ D|x − y| for any x, y ∈ R, and we assume
‖β‖_1/N → 0. This together with 1/‖Z‖_2 → 0 implies that

l_p(α̃) = (1/N) ∑_{i=1}^N ρ((L^{(0)}_{i·} α^{(0)} − L̂_{i·} α̃)/σ̂_i) + o_p(1).

Next,

| (L^{(0)}_{i·} α^{(0)} − L̂_{i·} α̃)/σ̂_i − L̂_{i·} (α^{(0)} − α̃)/σ̂_i | = | (L^{(0)}_{i·} − L̂_{i·}) α^{(0)} / σ̂_i | →_p 0.

Therefore, by the same argument as before,

l_p(α̃) = (1/N) ∑_{i=1}^N ρ(L̂_{i·} (α^{(0)} − α̃)/σ̂_i) + o_p(1) = ϕ(α^{(0)} − α̃) + o_p(1).   (5.23)

Also, ϕ(0) = 0 because ρ(0) = 0. Therefore l_p(α̂) ≤ l_p(α^{(0)}) = o_p(1). Notice that the o_p(1) term in
(5.23) does not depend on α̃; hence ϕ(α^{(0)} − α̂) = l_p(α̂) + o_p(1) = o_p(1).
Next we prove (5.22). Since ρ(x) is non-decreasing in |x|,

inf_{‖α̃‖_2 ≥ ε} ϕ(α̃) = inf_{‖α̃‖_2 ≥ ε} (1/N) ∑_{i=1}^N ρ(L̂_{i·} α̃ / σ̂_i) ≥ inf_{‖α̃‖_2 = ε} (1/N) ∑_{i=1}^N ρ(L̂_{i·} α̃ / σ̂_i).

Using (5.14) of Corollary 5.2.1, it is easy to see that there exists some constant D⋆ such that
P[max_i ‖L̂_{i·}‖_2 ≤ D⋆] → 1. Thus, when max_i ‖L̂_{i·}‖_2 ≤ D⋆ holds and ε > 0 is sufficiently small, the α̃
on the right hand side stays within the neighborhood where ρ is strongly convex in Assumption 11, so for some κ > 0,

inf_{‖α̃‖_2 = ε} (1/N) ∑_{i=1}^N ρ(L̂_{i·} α̃ / σ̂_i) ≥ inf_{‖α̃‖_2 = ε} κ (1/N) ∑_{i=1}^N (L̂_{i·} α̃ / σ̂_i)^2 = κ ε^2 λ_min(N^{−1} L̂^T Σ̂^{−1} L̂).

By the uniform consistency of L̂ and Σ̂, we conclude that (5.22) holds for τ = κ ε^2 λ_min(N^{−1} L^T Σ^{−1} L)/2,
where λ_min(N^{−1} L^T Σ^{−1} L) is bounded away from 0 by the strong factor assumption of Corollary 5.2.1.
To derive the asymptotic distribution, we consider the estimating equation corresponding to
(5.9). Setting the derivative of (5.9) to zero, α̂^RR satisfies

Ψ_{N,L̂,Σ̂}(α̂^RR) = (1/N) ∑_{i=1}^N ψ((Ỹ_{i1}/‖Z‖_2 − L̂_{i·} α̂^RR)/σ̂_i) L̂_{i·}/σ̂_i = 0.   (5.24)

The next assumption is used to control the higher order terms in a Taylor expansion of Ψ.

Assumption 12. The first two derivatives of ψ exist, and both |ψ′(x)| ≤ C and |ψ′′(x)| ≤ C hold
for all x, for some C < ∞.

Examples of loss functions ρ that satisfy Assumption 11 and Assumption 12 include the smoothed
Huber loss and Tukey's bisquare.
The next theorem gives the asymptotic distribution of β̂^RR when the nonzero entries of β are
sparse enough. The asymptotic variance of β̂^RR is, again, the oracle variance in (5.2).

Theorem 5.2.3. Under the assumptions of Corollary 5.2.1, Assumption 11 and Assumption 12, if n, N → ∞
and ‖β‖_1 √n/N → 0, then

√n (β̂^RR_S − β_S) →_d N(0, (1 + ‖α‖_2^2) Σ_S)

for any fixed index set S with finite cardinality.
If n/N → 0, then a sufficient condition for ‖β‖_1 √n/N → 0 in Theorem 5.2.3 is ‖β‖_1 = O(√N).
If instead n/N → γ > 0, then ‖β‖_1 = o(√N) suffices.
Proof. Because α̂^RR is consistent, we can approximate the left hand side of (5.24) by its second
order Taylor expansion (we abbreviate Ψ_{N,L̂,Σ̂} to Ψ_N when it causes no confusion):

0 = Ψ_N(α^{(0)}) + ∇Ψ_N(α^{(0)}) (α̂^RR − α^{(0)}) + r_N,

where r_N is the higher order term, and Assumption 12 implies r_N = o_p(‖α̂^RR − α^{(0)}‖_2). Therefore
α̂^RR = α^{(0)} − [∇Ψ_N(α^{(0)}) + o_p(1)]^{−1} Ψ_N(α^{(0)}) and

√n (β̂^RR − β) = (√n/‖Z‖_2) Σ^{1/2} Ẽ_1 + √n (L^{(0)} − L̂) α̂^RR + L̂ [∇Ψ_N(α^{(0)}) + o_p(1)]^{−1} √n Ψ_N(α^{(0)}).   (5.25)

Because of the consistency of α̂^RR and the independence between L̂ and (Z, Ẽ_1), using Slutsky's
theorem and Corollary 5.2.1 we get (√n/‖Z‖_2) Σ_S^{1/2} Ẽ_{S,1} + √n (L^{(0)}_S − L̂_S) α̂^RR →_d N(0, (1 + ‖α‖_2^2) Σ_S).
Therefore the proof of Theorem 5.2.3 is complete once we show that the largest eigenvalue of
[∇Ψ_N(α^{(0)})]^{−1} is O_p(1) and that √n Ψ_N(α^{(0)}) →_p 0.
We first show that √n Ψ_N(α^{(0)}) →_p 0. Using the representation of L̂ in (5.15), we have

Ψ_N(α^{(0)}) = (1/N) ∑_{i=1}^N ψ((Ỹ_{i1}/‖Z‖_2 − L̂_{i·} α^{(0)})/σ̂_i) L̂_{i·}/σ̂_i
           = (1/N) ∑_{i=1}^N ψ((β_i + σ_i Ẽ_{i1}/‖Z‖_2 − (σ_i/(n−1)) Ẽ_{i,−1} F̃^{(0)T}_{−1} α^{(0)} + ε_i)/(σ_i + δ_i)) L̂_{i·}/σ̂_i,

where max_i |δ_i| = o_p(1) and max_i |ε_i| = o_p(n^{−1/2}) from Corollary 5.2.1. Because ‖β‖_1 √n/N → 0
and ψ′ is bounded,

Ψ_N(α^{(0)}) = (1/N) ∑_{i=1}^N ψ((σ_i Ẽ_{i1}/‖Z‖_2 − (σ_i/(n−1)) Ẽ_{i,−1} F̃^{(0)T}_{−1} α^{(0)} + ε_i)/(σ_i + δ_i)) L̂_{i·}/σ̂_i + o_p(n^{−1/2}).

Let g_i = Ẽ_{i1}/‖Z‖_2 − (1/(n−1)) Ẽ_{i,−1} F̃^{(0)T}_{−1} α^{(0)} be the expression inside ψ in the last equation,
omitting ε_i and δ_i. Conditionally on F̃^{(0)}_{−1}, the variables g_i, i = 1, . . . , N, are independent and
identically distributed with E(g_i) = 0 and g_i = O_p(n^{−1/2}). Thus, using Assumption 12 and the
boundedness of σ_i, and rearranging the terms,

‖ (1/N) ∑_{i=1}^N [ψ(g_i + (ε_i − δ_i g_i)/σ_i) − ψ(g_i)] L̂_{i·}/σ̂_i ‖_2
    ≤ C^2 ‖ (1/N) ∑_{i=1}^N (|ε_i| |L̂_{i·}| + |g_i| |δ_i L̂_{i·}|)/σ̂_i ‖_2 = o_p(n^{−1/2}).

We can further use the facts that ψ(g_i) = ψ′(0) g_i + o_p(n^{−1/2}) = O_p(n^{−1/2}) and that the ψ(g_i) − ψ′(0) g_i
are i.i.d., and get

‖Ψ_N(α^{(0)})‖_2 = ‖ (1/N) ∑_{i=1}^N ψ(g_i + (ε_i − δ_i g_i)/σ_i) L̂_{i·}/σ̂_i ‖_2 + o_p(n^{−1/2})
             = ‖ (1/N) ∑_{i=1}^N ψ(g_i) L̂_{i·}/σ̂_i ‖_2 + o_p(n^{−1/2})
             = ‖ (1/N) ∑_{i=1}^N ψ′(0) (L̂_{i·}/σ̂_i) g_i ‖_2 + o_p(n^{−1/2}) = o_p(n^{−1/2}).

The last equality holds because the variable inside the norm has mean 0 and standard deviation of
order n^{−1/2} N^{−1/2}.
Finally, we show that the largest eigenvalue of the matrix [∇Ψ_N(α^{(0)})]^{−1} is bounded in probability.
Because of the strong factor assumption in Corollary 5.2.1, namely that lim_{N→∞} (1/N) L^T Σ^{−1} L is
positive definite, we use Assumption 12, the uniform convergence of Σ̂ and L̂, and a similar
argument to get

[∇Ψ_N(α^{(0)})]^{−1} = [ (1/N) ∑_{i=1}^N ψ′(g_i + (ε_i − δ_i g_i)/σ_i) L̂_{i·}^T L̂_{i·}/σ̂_i^2 + o_p(1) ]^{−1}
    = [ (1/N) ∑_{i=1}^N ψ′(0) L_{i·}^T L_{i·}/σ_i^2 + o_p(1) ]^{−1}
    →_p [ ψ′(0) lim_{N→∞} (1/N) L^T Σ^{−1} L ]^{−1}.

Notice that ψ′(0) = ρ′′(0) > 0, as ρ is strongly convex in a neighborhood of 0. This means that all
the eigenvalues of [∇Ψ_N(α^{(0)})]^{−1} converge to finite constants.
Based on Theorems 5.2.1 and 5.2.3, we can construct p-values that are asymptotically valid
and independent. Consider the asymptotic test for H_{0i}: β_i = 0, i = 1, . . . , N, resulting from the
asymptotic distributions of β̂_i derived in Theorems 5.2.1 and 5.2.3:

t_i = ‖Z‖_2 β̂_i / (σ̂_i √(1 + ‖α̂‖_2^2)), i = 1, . . . , N.   (5.26)

The null hypothesis H_{0i} is rejected at level α if |t_i| > z_{α/2} = Φ^{−1}(1 − α/2) as usual, where Φ
is the cumulative distribution function of the standard normal. Note that here we slightly abuse
the notation α to represent the significance level; this should not be confused with the model
parameter α. In practice, when |C| is small, we replace σ̂_i^2 (1 + ‖α̂‖_2^2) with the inflation-corrected
variance in Theorem 5.2.1 when constructing the test statistics.
Remark. We find a calibration technique in Sun et al. (2012) very useful for improving the type I
error and FDR control at finite sample sizes. Because the asymptotic variance used in eq. (5.26) is
the variance of an oracle OLS estimator, when the sample size is not sufficiently large, the variance
of β̂^RR will be slightly larger than this oracle variance. To correct for this inflation, one can
use the median absolute deviation (MAD), with the customary scaling that matches the standard deviation of a
Gaussian distribution, to estimate the empirical standard error of the t_i, i = 1, . . . , N, and divide each t_i by the
estimated standard error. The performance of this empirical calibration is studied in the simulations
in Section 5.4.1.
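The MAD calibration amounts to one line of code. A toy illustration with made-up numbers, where the finite-sample test statistics are overdispersed by 20% and the true signals are sparse:

```python
import numpy as np

rng = np.random.default_rng(5)
t = 1.2 * rng.standard_normal(5000)   # null statistics, overdispersed by 20%
t[:100] += 6.0                        # 2% true signals

# MAD with the customary Gaussian scaling 1.4826 = 1 / Phi^{-1}(0.75)
# estimates the null standard deviation robustly, because signals are sparse.
s = 1.4826 * np.median(np.abs(t - np.median(t)))
t_cal = t / s

assert abs(np.std(t_cal[100:]) - 1.0) < 0.1
```

Using the MAD rather than the sample standard deviation is what makes the calibration insensitive to the (sparse) nonzero effects.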
5.3 Extension to multiple regression
In (5.1) we assume that there is only one primary variable Z and all the random variables Z, Y and
F have mean 0. In practice, there may be several predictors, or we may want to include an intercept
term in the regression model. Here we develop a multiple regression extension to the original model
(5.1).
Suppose we observe in total d = d_0 + d_1 random predictors that can be separated into two groups:

1. Z_0: a d_0 × n matrix of nuisance covariates that we would like to include in the regression model, and

2. Z_1: a d_1 × n matrix of primary variables whose effects we want to study.

For example, the intercept term can be included in Z_0 as a 1 × n vector of ones (i.e. a random variable
with mean 1 and variance 0).
Leek and Storey (2008a) consider the case d0 = 0 and d1 ≥ 1 for SVA and Sun et al. (2012)
consider the case d0 ≥ 0 and d1 = 1 for LEAPP. Here we study the confounder adjusted multiple
regression in full generality, for any d0 ≥ 0 and d1 ≥ 1. Our model is
Y = B_0 Z_0 + B_1 Z_1 + L F + Σ^{1/2} E,   (5.27a)

the stacked columns Z_{·j} = (Z_{0j}^T, Z_{1j}^T)^T are i.i.d. with E[Z_{·j} Z_{·j}^T] = Σ_Z,   (5.27b)

F | (Z_0, Z_1) ∼ MN(A_0 Z_0 + A_1 Z_1, I_r, I_n), and   (5.27c)

E ⊥⊥ (Z_0, Z_1, F), E ∼ MN(0, I_N, I_n).   (5.27d)

The model does not specify means for Z_{0j} and Z_{1j}; we do not need them. The parameters in this
model are, for i = 0 or 1, B_i ∈ R^{N×d_i}, L ∈ R^{N×r}, Σ_Z ∈ R^{d×d}, and A_i ∈ R^{r×d_i}. The parameters A
and B are the matrix versions of α and β in model (5.1). Additionally, we assume Σ_Z is invertible.
Clarifying our purpose, we are primarily interested in estimating and testing for the significance of
B1.
For the multiple regression model (5.27), we again consider the rotation matrix Q given
by the QR decomposition (Z_0^T Z_1^T) = QU, where Q ∈ R^{n×n} is an orthogonal matrix and U is an
upper triangular matrix of size n × d. Therefore we have

(Z_0; Z_1) Q = U^T = ( U_00  0     0
                       U_10  U_11  0 ),

where U_00 is a d_0 × d_0 lower triangular matrix and U_11 is a d_1 × d_1 lower triangular matrix. Now
let the rotated Y be

Ỹ = Y Q = (Ỹ_0  Ỹ_1  Ỹ_{−1}),   (5.28)

where Ỹ_0 is N × d_0, Ỹ_1 is N × d_1 and Ỹ_{−1} is N × (n − d). Then we can partition the model into three
parts: conditional on both Z_0 and Z_1 (hence U),
Ỹ_0 = B_0 U_00 + B_1 U_10 + L F̃_0 + Σ^{1/2} Ẽ_0,   (5.29)

Ỹ_1 = B_1 U_11 + L F̃_1 + Σ^{1/2} Ẽ_1 ∼ MN((B_1 + L A_1) U_11, LL^T + Σ, I_{d_1}),   (5.30)

Ỹ_{−1} = L F̃_{−1} + Σ^{1/2} Ẽ_{−1} ∼ MN(0, LL^T + Σ, I_{n−d}),   (5.31)

where F̃ = FQ and Ẽ = EQ equals E in distribution. Equation (5.29) involves the nuisance parameters B_0 and
is discarded according to the ancillarity principle. Equation (5.30) is the multivariate extension of
(5.4) that is used to estimate B_1, and (5.31) plays the same role as (5.5) to estimate L and Σ.
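The rotation (5.28) is an ordinary QR decomposition, and the key property is that the trailing n − d columns of Ỹ are free of Z_0 and Z_1. A small numerical check with invented sizes (and A_0 = A_1 = 0 for simplicity):

```python
import numpy as np

rng = np.random.default_rng(7)
N, n, d0, d1, r = 40, 300, 1, 2, 2
d = d0 + d1
Z = rng.standard_normal((d, n))
Z[0] = 1.0                                   # intercept as a nuisance covariate
B = rng.standard_normal((N, d))
L = rng.standard_normal((N, r))
F = rng.standard_normal((r, n))              # A0 = A1 = 0 for simplicity
Y = B @ Z + L @ F + 0.5 * rng.standard_normal((N, n))

Q, U = np.linalg.qr(Z.T, mode='complete')    # Z^T = Q U, U upper triangular
Yt = Y @ Q                                   # eq. (5.28)

# Z Q = U^T is zero beyond its first d columns, so Yt[:, d:] (= Y_{-1})
# involves neither B0 nor B1 and can be used to estimate L and Sigma.
assert np.allclose((Z @ Q)[:, d:], 0.0, atol=1e-8)
```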
We consider the asymptotics when n, N → ∞ and d, r are fixed and known. Since d is fixed,
the estimation of L is no different from the simple regression case, and the QMLE results in
Corollary 5.2.1 still hold under the same assumptions.
Let

Σ_Z^{−1} = Ω = ( Ω_00  Ω_01
                 Ω_10  Ω_11 ).
In the proofs of Theorems 5.2.1 and 5.2.3, we considered a fixed sequence of Z such that ‖Z‖_2/√n → 1.
Similarly, we have the following lemma in the multiple regression scenario:

Lemma 5.3.1. As n → ∞, U_11 U_11^T / n →_a.s. Ω_11^{−1}.
Proof. First, notice that by the strong law of large numbers, (1/n) (Z_0; Z_1) (Z_0^T Z_1^T) →_a.s. Σ_Z. Using the
QR decomposition (Z_0^T Z_1^T) = QU and writing U^T = (V 0) with

V = ( U_00  0
      U_10  U_11 ),

it is clear that V V^T / n →_a.s. Σ_Z. Since Σ_Z is nonsingular, V, U_00 and U_11 are full rank square matrices
with probability 1. Also, using the block matrix inversion formula, we have

V^{−1} = ( U_00^{−1}                      0
           −U_11^{−1} U_10 U_00^{−1}      U_11^{−1} ).

Therefore the bottom right block of n V^{−T} V^{−1} is n U_11^{−T} U_11^{−1}, which converges to Ω_11 almost surely.
Thus the statement in the lemma holds.
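Lemma 5.3.1 can be checked numerically with an invented Σ_Z:

```python
import numpy as np

rng = np.random.default_rng(6)
d0, d1, n = 2, 1, 100000
d = d0 + d1
Sigma_Z = np.array([[1.0, 0.3, 0.2],
                    [0.3, 1.0, 0.1],
                    [0.2, 0.1, 1.0]])
Zs = rng.multivariate_normal(np.zeros(d), Sigma_Z, size=n).T   # d x n

R = np.linalg.qr(Zs.T, mode='r')     # Z^T = Q R with R upper triangular, d x d
V = R.T                              # lower triangular; V V^T / n -> Sigma_Z
U11 = V[d0:, d0:]                    # bottom-right d1 x d1 block

Omega11 = np.linalg.inv(Sigma_Z)[d0:, d0:]
assert np.allclose(U11 @ U11.T / n, np.linalg.inv(Omega11), rtol=0.05)
```

Here Ω_11^{-1} is the conditional covariance of Z_1 given Z_0, which is why rotating out the nuisance covariates costs exactly the efficiency of that conditioning.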
Similar to (5.4), we can rewrite (5.30) as

Ỹ_1 U_11^{−1} = B_1 + L (A_1 + W̃_1 U_11^{−1}) + Σ^{1/2} Ẽ_1 U_11^{−1},

where W̃_1 ∼ MN(0, I_r, I_{d_1}) is independent of Ẽ_1. As in Section 5.2, we derive the statistical
properties of the estimate of B_1 for a fixed sequence of Z, W̃_1 and F, which also hold unconditionally.
For simplicity, we assume that the negative controls are a known set of variables C with B_{1,C} = 0.
We can then estimate each column of A_1 by applying the negative control (NC) or robust regression
(RR) estimators discussed in Section 5.1.3 to the corresponding column of Ỹ_1 U_11^{−1}, and then estimate B_1 by

B̂_1 = Ỹ_1 U_11^{−1} − L̂ Â_1.

Notice that Σ^{1/2} Ẽ_1 U_11^{−1} ∼ MN(0, Σ, U_11^{−T} U_11^{−1}). Thus the "samples" in the robust regression, which are
actually the N variables in the original problem, are still independent within each column. Though
the estimates of the columns of A_1 may be correlated, we will show that the correlation does not affect
inference on B_1. As a result, we still get asymptotic results similar to Theorem 5.2.3 for the multiple
regression model (5.27):
Theorem 5.3.1. Under the assumptions of Corollary 5.2.1 and Assumptions 10 to 12, if n, N → ∞,
then for any fixed index set S with finite cardinality |S|,

√n (B̂^NC_{1,S} − B_{1,S}) →_d MN(0_{|S|×d_1}, Σ_S + ∆_S, Ω_11 + A_1^T A_1), and   (5.32)

√n (B̂^RR_{1,S} − B_{1,S}) →_d MN(0_{|S|×d_1}, Σ_S, Ω_11 + A_1^T A_1),   (5.33)
where ∆S is defined in Theorem 5.2.1.
Proof. First, for the known zero indices scenario, Â^NC_1 has the following formula, similar to
(5.7):

Â^NC_1 = (L̂_C^T Σ̂_C^{−1} L̂_C)^{−1} L̂_C^T Σ̂_C^{−1} Ỹ_{1,C} U_11^{−1},   (5.34)

which implies a formula similar to (5.20):

√n (B̂^NC_{1,S} − B_{1,S}) = √n Σ_S^{1/2} Ẽ_{1,S} U_11^{−1} − √n L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_C^T Σ_C^{−1} Σ_C^{1/2} Ẽ_{1,C} U_11^{−1}
    + √n (L^{(0)}_S − L̂_S) A^{(0)}_1
    + √n L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_C^T Σ_C^{−1} (L̂_C − L^{(0)}_C) A^{(0)}_1 + o_p(1),   (5.35)

where A^{(0)}_1 = R^{−1} (A_1 + W̃_1 U_11^{−1}). Following the proof of Theorem 5.2.1 and using Lemma 5.3.1, we
get (5.32).
For the sparsity scenario, Lemma 5.3.1 guarantees the consistency of each column of Â^RR_1 by
Theorem 5.2.2. The Taylor expansion used in the proof of Theorem 5.2.3 then still works for
each column of A^{(0)}_1. Similar to (5.25), we get

√n (B̂^RR_1 − B_1) = √n Σ^{1/2} Ẽ_1 U_11^{−1} + √n (L^{(0)} − L̂) Â^RR_1 + L̂ (g_1 g_2 · · · g_{d_1}),   (5.36)

where g_i = [∇Ψ_N(A^{(0)}_{1,i})]^{−1} (√n Ψ_N(A^{(0)}_{1,i}) + o_p(1)). Following the proof of Theorem 5.2.3, we get
each g_i = o_p(1). Thus

√n (B̂^RR_1 − B_1) = √n Σ^{1/2} Ẽ_1 U_11^{−1} + √n (L^{(0)} − L̂) Â^RR_1 + o_p(1),

and (5.33) holds.
As for the asymptotic efficiency of this estimator, we again compare it to the oracle OLS estimator
of B_1 which observes the confounding factors F in (5.27). In the multiple regression model, we claim
that B̂^RR_1 still reaches the oracle asymptotic efficiency. In fact, let B = (B_0 B_1 L). The oracle
OLS estimator of B, B̂^OLS, is unbiased and its vectorization has variance Σ ⊗ V^{−1}/n, where

V = ( Σ_Z       Σ_Z A^T
      A Σ_Z     I_r + A Σ_Z A^T ),  for A = (A_0 A_1).

By the block-wise matrix inversion formula, the top left d × d block of V^{−1} is Σ_Z^{−1} + A^T A. The variance
of B̂^OLS_1 only depends on the bottom right d_1 × d_1 sub-block of this d × d block, which is simply
Ω_11 + A_1^T A_1. Therefore B̂^OLS_1 is unbiased and its vectorization has variance Σ ⊗ (Ω_11 + A_1^T A_1)/n,
matching the asymptotic variance of B̂^RR_1 in Theorem 5.3.1.
5.4 Numerical experiments

5.4.1 Simulation results

In this section we use numerical simulations to verify the theoretical asymptotic results and to further
study the finite sample properties of our estimators and test statistics.

The simulation data are generated from the single primary variable model (5.1). More specifically,
Z_j is a centered binary variable with (Z_j + 1)/2 ∼ i.i.d. Bernoulli(0.5), and Y and F are generated according
to (5.1).
For the parameters in the model, the noise variances are generated by σ_i^2 ∼ i.i.d. InvGamma(3, 2),
i = 1, . . . , N, so that E(σ_i^2) = Var(σ_i^2) = 1. We set each α_k = ‖α‖_2/√r equally for k = 1, 2, . . . , r, where
‖α‖_2^2 is set to 1, so the proportion of the variance of Z explained by the confounding factors is R^2 = 50%. The
primary effect β has independent components β_i taking the values 3 √((1 + ‖α‖_2^2)/n) and 0 with probability
π = 0.05 and 1 − π = 0.95, respectively, so the nonzero effects are sparse and have effect size 3 on the
oracle standard-error scale. This implies that the oracle estimator has power approximately
P(N(3, 1) > z_{0.025}) ≈ 0.85 to detect the signals at a significance level of 0.05. We set the number of latent
factors r to be either 2 or 10. For the latent factor loading matrix L, we take L = UD, where U is an N × r
orthogonal matrix sampled uniformly from the Stiefel manifold V_r(R^N), the set of all N × r matrices with
orthonormal columns. As we assume strong factors, we set the latent factor strength D = √N · diag(d_1, . . . , d_r), where
d_k = 3 − 2(k − 1)/(r − 1), so that d_1, . . . , d_r are evenly spaced in the interval [3, 1]. As the number
of factors r can easily be consistently estimated in this strong factor setting, we assume that the
number r of factors is known to all of the algorithms in this simulation.
We set N = 5000 and n = 100 or 500 to mimic the data size of many genetic studies. For the
negative control scenario, we choose |C| = 30 negative controls at random from the zero positions
of β. We expect that negative control methods would perform better with a larger value of |C| and
worse with a smaller value. The choice |C| = 30 is around the size of the spike-in controls in many
microarray experiments (Gagnon-Bartsch and Speed, 2012). For the loss function in our sparsity
scenario, we use Tukey's bisquare, which is optimized via IRLS with an ordinary least squares fit as
the starting value of the coefficients. Finally, each of the four combinations of n and r is randomly
repeated 100 times.
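The parameter settings above can be generated in a few lines. This is a sketch of the setup, not the authors' simulation code:

```python
import numpy as np

rng = np.random.default_rng(8)
N, n, r = 5000, 100, 2

# sigma_i^2 ~ InvGamma(3, 2): reciprocal of a Gamma(shape=3, rate=2) draw.
sigma2 = 1.0 / rng.gamma(shape=3.0, scale=0.5, size=N)

alpha = np.full(r, 1.0 / np.sqrt(r))          # ||alpha||_2^2 = 1

# 5% nonzero effects of size 3 on the oracle standard-error scale.
beta = np.where(rng.random(N) < 0.05, 3.0 * np.sqrt(2.0 / n), 0.0)

# L = U D: U uniform on the Stiefel manifold via QR of a Gaussian matrix.
U, _ = np.linalg.qr(rng.standard_normal((N, r)))
d_ks = 3.0 - 2.0 * np.arange(r) / max(r - 1, 1)   # evenly spaced in [3, 1]
L = U * (np.sqrt(N) * d_ks)

assert np.allclose(L.T @ L / N, np.diag(d_ks**2), atol=1e-8)
assert abs(sigma2.mean() - 1.0) < 0.1             # E sigma_i^2 = 1
```

The scaling by √N is what makes the factors "strong": the columns of L have squared norms proportional to N.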
We compare the performance of nine different approaches. There are two baseline methods:
the “naive” method estimates β by a linear regression of Y on just the observed primary variable
Z and calculates p-values using the classical t-tests, while the “oracle” method regresses Y on
both Z and the confounding variables F as described in ??. There are three methods in the
RUV-4/negative controls family: the RUV-4 method (Gagnon-Bartsch et al., 2013), our “NC”
method which computes test statistics using βNC and its variance estimate (1 + ‖α‖22)(Σ + ∆),
and our “NC-ASY” method which uses the same βNC but estimates its variance by (1 + ‖α‖22)Σ.
We compare four methods in the SVA/LEAPP/sparsity family: these are “IRW-SVA” (Leek and
Storey, 2008b), “LEAPP” (Sun et al., 2012), the “LEAPP(RR)” method which is our RR estimator
using M-estimation at the robustness stage and computes the test-statistics using (5.26), and the
“LEAPP(RR-MAD)” method which uses the median absolute deviation (MAD) of the test statistics
in (5.26) to calibrate them (see Section 5.2).
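The robustness stage just described (Tukey's bisquare loss fit by IRLS, started from OLS) can be sketched generically. This is our own minimal implementation, not the LEAPP code; the tuning constant c = 4.685 is the usual default for 95% Gaussian efficiency, and the synthetic data are purely illustrative:

```python
import numpy as np

def tukey_bisquare_irls(X, y, c=4.685, n_iter=50):
    """Robust regression under Tukey's bisquare loss via IRLS, with the
    ordinary least-squares coefficients as starting values (a generic
    sketch of the robustness stage, not the thesis implementation)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS start
    for _ in range(n_iter):
        resid = y - X @ beta
        # robust scale: MAD rescaled for consistency at the normal
        s = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = resid / (c * s)
        w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)  # bisquare weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

# Tiny demonstration on data with gross outliers (names are ours).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.1 * rng.standard_normal(200)
y[:20] += 10.0                                         # contaminate 10%
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_robust = tukey_bisquare_irls(X, y)
```

The bisquare weights vanish for residuals beyond c·s, so gross outliers are dropped entirely from the weighted least-squares refit.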
To measure the performance of these methods, we report the type I error, power, false discovery
proportion (FDP) and precision of hypotheses with the smallest 100 p-values in the 100 simulations.
For both the type I error and power, we set the significance level to be 0.05. For FDP, we use
the Benjamini–Hochberg procedure with the FDR controlled at 0.2. These metrics are plotted in Figure 5.1
under different settings of n and r.
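Concretely, the four metrics can be computed from the p-values and the set of true nulls as in the following sketch (helper names are ours; the BH step follows the standard step-up rule):

```python
import numpy as np

def bh_reject(pvals, fdr=0.2):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values,
    where k is the largest index with p_(k) <= k * fdr / m."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresh = fdr * np.arange(1, m + 1) / m
    below = pvals[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

def metrics(pvals, null_mask, alpha=0.05, fdr=0.2, top_k=100):
    """Type I error and power at level alpha, FDP under BH, and the
    precision of the hypotheses with the smallest top_k p-values."""
    type1 = np.mean(pvals[null_mask] <= alpha)
    power = np.mean(pvals[~null_mask] <= alpha)
    rej = bh_reject(pvals, fdr)
    fdp = null_mask[rej].mean() if rej.any() else 0.0
    top = np.argsort(pvals)[:top_k]
    precision = np.mean(~null_mask[top])
    return type1, power, fdp, precision

# Tiny example: two strong signals among four hypotheses.
pvals = np.array([0.001, 0.002, 0.5, 0.9])
null_mask = np.array([False, False, True, True])
type1, power, fdp, precision = metrics(pvals, null_mask)
```
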
First, from Figure 5.1, we see that the oracle method has exactly the same type I error and FDP
as specified, while the naive method and SVA fail drastically. SVA performs better than
the naive method in terms of the precision of the smallest 100 p-values, but is still much worse than
other methods. Next, for the negative control scenario, as we only have |C| = 30 negative controls,
ignoring the inflated variance term ∆S in Theorem 5.2.1 will lead to overdispersed test statistics,
which is why the type I error and FDP of both NC-ASY and RUV-4 are much larger than the
nominal level. By contrast, the NC method correctly controls type I error and FDP by considering
the variance inflation, though as expected it loses some power compared with the oracle. For the
sparsity scenario, the “LEAPP(RR)” method performs as the asymptotic theory predicted when
n = 500, while when n = 100 the p-values seem a bit too small. This is not surprising because
[Figure 5.1 appears here: panels of Type I error, Power, FDP and Top 100 precision, for r = 2, 10 (columns) and n = 100, 500 (rows).]
Figure 5.1: Comparison of the performance of nine different approaches (from left to right): naive regression ignoring the confounders (Naive), IRW-SVA, negative control with finite sample correction (NC) in eq. (5.17), negative control with asymptotic oracle variance (NC-ASY) in eq. (5.18), RUV-4, robust regression (LEAPP(RR)), robust regression with calibration (LEAPP(RR-MAD)), LEAPP, and oracle regression which observes the confounders (Oracle). The error bars are one standard deviation over 100 repeated simulations. The three dashed horizontal lines, from bottom to top, are the nominal significance level, the FDR level and the oracle power, respectively.
the asymptotic oracle variance in Theorem 5.2.3 can be optimistic when the sample size is not
sufficiently large. On the other hand, the methods which use empirical calibration for the variance
of test statistics, namely the original LEAPP and “LEAPP(RR-MAD)”, control both FDP and type
I error for data of small sample size in our simulations. The price for the finite sample calibration
is that it tends to be slightly conservative, resulting in a loss of power to some extent.
In conclusion, the simulation results are consistent with our theoretical guarantees when N is as
large as 5000 and n is as large as 500. When n is small, the variance of the test statistics will be
larger than the asymptotic variance for the sparsity scenario and we can use empirical calibrations
(such as MAD) to adjust for the difference.
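The empirical calibration mentioned above can be sketched as follows: recenter the test statistics by their median and rescale by the normal-consistent MAD before converting to p-values. This is a generic sketch of MAD calibration, not the exact LEAPP code:

```python
import math
import numpy as np

def mad_calibrate(t_stats):
    """Recenter test statistics by their median and rescale by the MAD
    (divided by 0.6745 so it is consistent for the normal scale), then
    return two-sided p-values under N(0, 1)."""
    t_stats = np.asarray(t_stats, dtype=float)
    med = np.median(t_stats)
    scale = np.median(np.abs(t_stats - med)) / 0.6745
    z = (t_stats - med) / scale
    # two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = np.array([math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z])
    return z, p

# Overdispersed, shifted statistics: calibration restores the null scale.
rng = np.random.default_rng(0)
t = 2.0 * rng.standard_normal(1001) + 0.3
z, p = mad_calibrate(t)
```

Because the median and MAD are resistant to a minority of true signals, the bulk of null statistics is mapped back to roughly standard normal scale, at the cost of slight conservatism.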
Chapter 6
Conclusions
Factor analysis is a powerful dimension reduction tool with an explicit statistical model and assump-
tions. The main difference between factor analysis and the more popular PCA technique is that
factor analysis accounts for heteroscedastic noise. There is no reason to believe that all the
collected variables have the same noise level. Moreover, even when the raw data have homoscedastic
noise, heteroscedasticity can arise from data transformations in the preprocessing steps (Woodward
et al., 1998). In other words, the factor analysis model has more flexible and reasonable assump-
tions than the white noise model in many data problems. However, as the diagonal noise
variance matrix is also unknown, a factor analysis model is harder to fit, since there are more
parameters to estimate.
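A tiny numerical illustration of this point (our own toy example, not from the thesis): noise that is homoscedastic on the raw scale becomes heteroscedastic after a log transform, a common preprocessing step:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 10.0, 100.0])                    # three variables at different levels
raw = mu + 0.5 * rng.standard_normal((100_000, 3))   # homoscedastic raw noise (sd 0.5)
logged = np.log(np.abs(raw))                         # a common preprocessing transform
raw_sd = raw.std(axis=0)                             # roughly 0.5 for every variable
log_sd = logged.std(axis=0)                          # now differs sharply across variables
```
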
For high-dimensional data, especially when both the variable and sample dimensions are large,
noise heteroscedasticity is not a serious problem when there are only strong factors (defined in
Chapter 2). Strong factors are easy to estimate for high-dimensional data, as more information is
collected when an increasing number of variables is observed. PCA still gives consistent estimates
of the factor loadings, scores and the noise variances. However, there are no theoretical results for PCA
with weak factors and heteroscedastic noise.
The presence of weak factors complicates solving the factor model of a high-dimensional data
matrix. As discussed in Chapter 2, researchers in the econometrics field consider approximate factor
models where the data can be decomposed as linear combinations of a few strong factors plus weakly
correlated noise. In other words, in the econometrics literature, weak factors are treated as noise and
are responsible for the weak correlations in the noise. On the other hand in Random Matrix Theory,
weak factors are treated as signals and the goal is to estimate them. Throughout this article, we are
treating weak factors as signals.
In Chapter 3 and Chapter 4 we developed two approaches for estimating both the signal matrix
(factor loadings × factor scores) and noise variance. Chapter 3 proposes an iterating algorithm
ESA and a bi-cross-validation technique to estimate the number of factors. ESA can be considered
CHAPTER 6. CONCLUSIONS 89
as a heteroscedastic-noise version of PCA/SVD, and bi-cross-validation randomly selects a block of
the matrix as held-out data. In Chapter 4, we proposed an alternative approach called POT-S
starting with a joint convex optimization using perspective transformation and nuclear penalty. At
the final stage, an optimal shrinkage on the singular values is applied to correct for the bias of the
solutions of the optimization. The tuning parameter is selected using Wold-style cross-validation
which randomly selects entries of the matrix as held-out data.
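The two held-out patterns can be contrasted with boolean masks: BCV holds out a block of rows × columns, while Wold-style cross-validation holds out scattered entries. The sizes below are illustrative, not the thesis defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 50, 20

# Bi-cross-validation: hold out a random rows x columns block.
rows = rng.choice(N, size=10, replace=False)
cols = rng.choice(n, size=4, replace=False)
bcv_mask = np.zeros((N, n), dtype=bool)
bcv_mask[np.ix_(rows, cols)] = True       # 10 x 4 held-out block

# Wold-style cross-validation: hold out scattered random entries.
wold_mask = rng.random((N, n)) < 0.1      # each entry held out w.p. 0.1
```
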
Empirically, ESA-BCV always retains fewer factors than POT-S. One explanation is that the
nuclear penalty in POT-S shrinks the barely detectable estimated factors, which makes retaining
them more “useful” for estimating the signal matrix than it would be in ESA-BCV. In practice,
ESA-BCV is the better tool for giving an interpretable estimate of the number of factors, while
POT-S is superior at reducing the error of the estimated signal matrix and noise variances. Besides,
the cross-validation error plot from ESA-BCV helps in analyzing the strength of each factor.
ESA-BCV and POT-S are two algorithms for empirically estimating a factor analysis model for
high-dimensional data, but many questions about the theoretical analysis of the model remain to
be answered. One direction for future work is to develop random matrix theory for the factor model
with both strong and weak factors along with heteroscedastic noise. Another is to give upper
bounds for the estimation errors of the signal matrix and noise in the convex optimization algorithm
POT.
Neither ESA-BCV nor POT-S has made use of sparsity of the factors. It has been shown that
a sparsity assumption can greatly improve estimation of the weak factors. However, the scenario
becomes complicated when there are both strong and weak factors. A reasonable assumption is that
the sparsity of a factor’s loadings is correlated with the factor’s strength: a strong factor usually
has dense loadings, while the loadings of a weak factor are likely to be sparse. Adding a penalty which
encourages sparsity will decrease the accuracy in estimating the strong factors but improve estimation
of the weak factors. It is an interesting topic to design an adaptive penalization term based on the
strength and initial estimates of the factors.
Confounding factor adjustment in multiple regression is an important application of high-dimensional
factor analysis. In Chapter 5 we analyzed a two-step algorithm for the linear regression model with
Gaussian noise. If there are only strong latent factors, we show that we can obtain asymptotically
valid p-values for the individual primary effects, with good power, even when the factors are con-
founded with the primary variables. The conditions for this result are that the primary effects are
either sparse enough or contain negative controls. When there are also weak factors, we find
empirically that the ranking of the p-values is still meaningful, while the p-values themselves can
be biased if the weak factors are confounded with the primary variables.
To broaden the use of the confounding factor adjustment model, a future research direction is
to extend the model to confounder adjustment of multiple generalized linear regression. The model
can then be applied to non-Gaussian response matrices, such as binary data or counts, which appear
often in applications such as the analysis of SNPs and DNA/RNA sequencing data.
Appendix A
Proof
This appendix gives the proof of Theorem 2.1.4.
We need the following two lemmas before we prove the results. The first lemma shows that the
product of two independent sub-Gaussian random variables is sub-exponential.
Lemma A.0.1. If Z1 and Z2 are independent sub-Gaussian random variables, then their product
Z1Z2 is a sub-exponential random variable.
Proof. W.l.o.g. assume that $\mathbb{E}[Z_1] = \mathbb{E}[Z_2] = 0$, thus $\mathbb{E}[Z_1 Z_2] = 0$. As $Z_1$ and $Z_2$ are sub-Gaussian
random variables, by the results in Rivasplata (2012), there exist $a_1 > 0$ and $b_2 > 0$ such that
$$\mathbb{E}\bigl[e^{a_1 Z_1^2}\bigr] \le 2, \qquad \mathbb{E}\bigl[e^{t Z_2}\bigr] \le e^{b_2^2 t^2/2} \text{ for all } t.$$
Then, conditioning on $Z_1$ and applying the second bound with $t = \lambda Z_1$, for all $|\lambda| \le \sqrt{2a_1}/b_2$ we have
$$\mathbb{E}\bigl[e^{\lambda Z_1 Z_2}\bigr] \le \mathbb{E}\bigl[e^{\lambda^2 b_2^2 Z_1^2/2}\bigr] \le \mathbb{E}\bigl[e^{a_1 Z_1^2}\bigr] \le 2.$$
Thus, there exists some $b > 0$ such that
$$\mathbb{E}\bigl[e^{\lambda Z_1 Z_2}\bigr] \le e^{v^2 \lambda^2/2}$$
for all $|\lambda| \le 1/b$ and some $v^2 \ge \mathbb{E}[Z_1^2 Z_2^2]$. Thus $Z_1 Z_2$ is a sub-exponential variable.
Here is a restatement of Theorem 2.1.4:
Theorem 2.1.4. Under the assumptions of Theorem 2.1.3, and assuming that the $e_{ij}$ are sub-Gaussian
random variables, if $(\log N)^2/n \to 0$ as $n, N \to \infty$, then
$$\max_{i \le N} \|\hat{L}_{i\cdot} - L_{i\cdot}\|_2 = O_p\bigl(\sqrt{\log N/n}\bigr), \qquad \max_{i \le N} |\hat{\sigma}_i^2 - \sigma_i^2| = O_p\bigl(\sqrt{\log N/n}\bigr). \tag{2.8}$$
APPENDIX A. PROOF 92
For the non-random factor model,
$$\max_{i=1,2,\ldots,N} \Bigl\| \hat{L}_{i\cdot} - L_{i\cdot} - \frac{1}{n}\sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T \Bigr\|_2 = o_p(n^{-1/2}). \tag{2.9}$$
We prove uniform convergence of the estimated factor loadings and noise variances by making intensive
use of some of the technical results in Bai and Li (2012a), and by modifying internal parts of their proof.
Before reading the following proof, we recommend that the reader first read the original proofs in
Bai and Li (2012a,b). To help readers follow, the variables $N$, $T$, $\Lambda$ (or $\Lambda^\star$) and $f$ (or $f^\star$) in
Bai and Li (2012a) correspond to $N$, $n$, $L$ and $F$ in our notation. The identification condition in
Theorem 2.1.4 for the non-random factor score model corresponds to the identification condition IC3
in Bai and Li (2012a). Define
$$H = (\hat{L}^T \hat{\Sigma}^{-1} \hat{L})^{-1}, \quad \text{and} \quad H_N = N H.$$
The lemma below integrates Equation (A.14) of Bai and Li (2012a), Equation (B.9) and the state-
ments of Lemma C.1 in Bai and Li (2012b).
Lemma A.0.2. Under the assumptions of Theorem 2.1.4, we have for any $i = 1, 2, \ldots, N$:
$$\begin{aligned}
\hat{L}_{i\cdot}^T - L_{i\cdot}^T
={}& (\hat{L} - L)^T \hat{\Sigma}^{-1} \hat{L} H L_{i\cdot}^T
 - H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L)(\hat{L} - L)^T \hat{\Sigma}^{-1} \hat{L} H L_{i\cdot}^T \\
&- H \hat{L}^T \hat{\Sigma}^{-1} L \Bigl(\frac{1}{n} F E^T\Bigr) \Sigma^{1/2} \hat{\Sigma}^{-1} \hat{L} H L_{i\cdot}^T
 - H \hat{L}^T \hat{\Sigma}^{-1} \Sigma^{1/2} \Bigl(\frac{1}{n} E F^T\Bigr) L^T \hat{\Sigma}^{-1} \hat{L} H L_{i\cdot}^T \\
&- H \Biggl( \sum_{i_1=1}^N \sum_{i_2=1}^N \frac{\sigma_{i_1}\sigma_{i_2}}{\hat{\sigma}_{i_1}^2 \hat{\sigma}_{i_2}^2} \Bigl(\frac{1}{n}\bigl(E_{i_1\cdot}E_{i_2\cdot}^T - \mathbb{E}[E_{i_1\cdot}E_{i_2\cdot}^T]\bigr)\Bigr) L_{i_1\cdot}^T L_{i_2\cdot} \Biggr) H L_{i\cdot}^T \\
&+ H \sum_{i_1=1}^N \frac{\hat{\sigma}_{i_1}^2 - \sigma_{i_1}^2}{\hat{\sigma}_{i_1}^4} L_{i_1\cdot}^T L_{i_1\cdot} H L_{i\cdot}^T
 + H \hat{L}^T \hat{\Sigma}^{-1} \Sigma^{1/2} \Bigl(\frac{1}{n} E F^T\Bigr) L_{i\cdot}^T
 + H \hat{L}^T \hat{\Sigma}^{-1} L \Bigl(\frac{1}{n} F E_{i\cdot}^T\Bigr) \sigma_i \\
&+ H \Biggl( \sum_{i_1=1}^N \frac{\sigma_{i_1}\sigma_i}{\hat{\sigma}_{i_1}^2} \Bigl(\frac{1}{n}\bigl(E_{i_1\cdot}E_{i\cdot}^T - \mathbb{E}[E_{i_1\cdot}E_{i\cdot}^T]\bigr)\Bigr) L_{i_1\cdot}^T \Biggr)
 - H \frac{\hat{\sigma}_i^2 - \sigma_i^2}{\hat{\sigma}_i^2} L_{i\cdot}^T
\end{aligned} \tag{A.1}$$
$$\begin{aligned}
\hat{\sigma}_i^2 - \sigma_i^2
={}& \frac{1}{n}\sum_{j=1}^n (e_{ij}^2 - \sigma_i^2) - (\hat{L}_{i\cdot} - L_{i\cdot})(\hat{L}_{i\cdot} - L_{i\cdot})^T \\
&+ L_{i\cdot} H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L)(\hat{L} - L)^T \hat{\Sigma}^{-1} \hat{L} H L_{i\cdot}^T
 + 2 L_{i\cdot} H \hat{L}^T \hat{\Sigma}^{-1} L \Bigl(\frac{1}{n} F E^T\Bigr) \Sigma^{1/2} \hat{\Sigma}^{-1} \hat{L} H L_{i\cdot}^T \\
&+ L_{i\cdot} H \Biggl( \sum_{i_1=1}^N \sum_{i_2=1}^N \frac{\sigma_{i_1}\sigma_{i_2}}{\hat{\sigma}_{i_1}^2 \hat{\sigma}_{i_2}^2} \Bigl(\frac{1}{n}\bigl(E_{i_1\cdot}E_{i_2\cdot}^T - \mathbb{E}[E_{i_1\cdot}E_{i_2\cdot}^T]\bigr)\Bigr) L_{i_1\cdot}^T L_{i_2\cdot} \Biggr) H L_{i\cdot}^T \\
&- L_{i\cdot} H \sum_{i_1=1}^N \frac{\hat{\sigma}_{i_1}^2 - \sigma_{i_1}^2}{\hat{\sigma}_{i_1}^4} L_{i_1\cdot}^T L_{i_1\cdot} H L_{i\cdot}^T
 - 2 L_{i\cdot} H \hat{L}^T \hat{\Sigma}^{-1} \Sigma^{1/2} \Bigl(\frac{1}{n} E F^T\Bigr) L_{i\cdot}^T
 + 2 L_{i\cdot} H \frac{\hat{\sigma}_i^2 - \sigma_i^2}{\hat{\sigma}_i^2} L_{i\cdot}^T \\
&- 2 L_{i\cdot} H \Biggl( \sum_{i_1=1}^N \frac{\sigma_{i_1}\sigma_i}{\hat{\sigma}_{i_1}^2} \Bigl(\frac{1}{n}\bigl(E_{i_1\cdot}E_{i\cdot}^T - \mathbb{E}[E_{i_1\cdot}E_{i\cdot}^T]\bigr)\Bigr) L_{i_1\cdot}^T \Biggr)
 + 2 L_{i\cdot} H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L) \Bigl(\frac{1}{n} F E_{i\cdot}^T\Bigr) \sigma_i
\end{aligned} \tag{A.2}$$
Also, we have the following approximations:
$$\bigl\| H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L) \bigr\|_F = O_p(n^{-1}) + O_p(n^{-1/2} N^{-1/2}) \tag{A.3}$$
$$\Biggl\| H \Biggl( \sum_{i_1=1}^N \frac{\sigma_{i_1}\sigma_i}{\hat{\sigma}_{i_1}^2} \Bigl(\frac{1}{n}\bigl(E_{i_1\cdot}E_{i\cdot}^T - \mathbb{E}[E_{i_1\cdot}E_{i\cdot}^T]\bigr)\Bigr) L_{i_1\cdot}^T \Biggr) \Biggr\|_F = O_p(N^{-1/2} n^{-1/2}) + O_p(n^{-1}) \tag{A.4}$$
$$\Biggl\| H \Biggl( \sum_{i_1=1}^N \sum_{i_2=1}^N \frac{\sigma_{i_1}\sigma_{i_2}}{\hat{\sigma}_{i_1}^2 \hat{\sigma}_{i_2}^2} \Bigl(\frac{1}{n}\bigl(E_{i_1\cdot}E_{i_2\cdot}^T - \mathbb{E}[E_{i_1\cdot}E_{i_2\cdot}^T]\bigr)\Bigr) L_{i_1\cdot}^T L_{i_2\cdot} \Biggr) H \Biggr\|_F = O_p(N^{-1} n^{-1/2}) + O_p(n^{-1}) \tag{A.5}$$
$$\frac{1}{n}\bigl\| H \hat{L}^T \hat{\Sigma}^{-1} \Sigma^{1/2} E F^T \bigr\|_F = O_p(n^{-1/2} N^{-1/2}) + O_p(n^{-1}) \tag{A.6}$$
$$\Biggl\| H \sum_{i_1=1}^N \frac{\hat{\sigma}_{i_1}^2 - \sigma_{i_1}^2}{\hat{\sigma}_{i_1}^4} L_{i_1\cdot}^T L_{i_1\cdot} H \Biggr\|_F = O_p(N^{-1} n^{-1/2}) \tag{A.7}$$
Proof. Comparing these results with Equation (A.14) of Bai and Li (2012a), Equation (B.9) and the
statements of Lemma C.1 in Bai and Li (2012b), we only need to prove (A.3) and (A.6).
To show (A.6), one just needs to apply $H_N = O_p(1)$ (Bai and Li, 2012a, Corollary A.1) and
the identification condition $M_F = I_r$ to simplify Lemma C.1(e) of Bai and Li (2012b) using the Central
Limit Theorem.
To prove (A.3), notice that under our conditions (or the IC3 condition of Bai and Li (2012a)),
the left-hand side of Equation (A.13) in Bai and Li (2012a) is actually 0, as both the terms $M_{ff}$
and $M^\star_{ff}$ in their notation are exactly $I_r$. Also, $H \hat{L}^T \hat{\Sigma}^{-1} L = I_r + o_p(1)$ from Bai and Li (2012a,
Corollary A.1). Thus, (A.3) holds by applying (A.5), (A.6) and (A.7) to Equation (A.13) of Bai and
Li (2012b).
We are now ready to prove Theorem 2.1.4.
First, notice that we only need to prove the result for the non-random factor score model. For the random
factor score model, we can condition on the factor scores and rotate the factor loadings
and scores to satisfy the identifiability condition that $L^T \Sigma^{-1} L$ is diagonal. The rotation matrix has
size $r \times r$ and thus does not affect the uniform convergence rate.
Based on Equation (F.1) in Bai and Li (2012b), we have
$$\sqrt{n}\,(\hat{L}_{i\cdot} - L_{i\cdot}) = \frac{1}{\sqrt{n}} \sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T + o_p(1). \tag{A.8}$$
Now we prove (2.8). Let $\hat{L}_{i\cdot} - L_{i\cdot} = b_{1i} + b_{2i} + \cdots + b_{10,i}$, where $b_{ki}$ represents the $k$th term on
the right-hand side of (A.1). Also, let $\hat{\sigma}_i^2 - \sigma_i^2 = a_{1i} + a_{2i} + \cdots + a_{10,i}$, where $a_{ki}$ represents the $k$th
term on the right-hand side of (A.2).
By applying (A.3), (A.4), (A.5), (A.6) and (A.7) and the boundedness of $L$, we immediately
get $\max_i |b_{ki}| = o_p(n^{-1/2})$ for $k \ne 8, 10$ and $\max_i |a_{ki}| = o_p(n^{-1/2})$ for $k \ne 1, 2, 8, 9, 10$. Using
independence of the noise, boundedness of $\sigma_i$ and the exponential-decay tail assumption, we
find that $\max_i |b_{8i}| = O_p(\sqrt{\log N/n})$ and $\max_i |a_{ki}| = O_p(\sqrt{\log N/n})$ for $k = 1, 10$ by simply using
the central limit theorem.
Next, we show the following facts under the assumption that $\log N/\sqrt{n} \to 0$: for each $s = 1, 2, \ldots, r$,
$$\max_{i=1,2,\ldots,N} \frac{1}{nN} \Bigl| \sum_{i_1=1}^N L_{i_1 s} \sum_{j=1}^n \bigl[ e_{i_1 j} e_{ij} - \mathbb{E}[e_{i_1 j} e_{ij}] \bigr] \Bigr| = o_p(n^{-1/2}), \quad \text{and} \tag{A.9}$$
$$\max_{i=1,2,\ldots,N} \frac{1}{n^2 N} \sum_{i_1=1}^N \Bigl( \sum_{j=1}^n \bigl[ e_{i_1 j} e_{ij} - \mathbb{E}[e_{i_1 j} e_{ij}] \bigr] \Bigr)^2 = o_p(n^{-1/2}). \tag{A.10}$$
To prove (A.9), we only need to show $\max_i \frac{1}{nN}\bigl|\sum_{i_1 \ne i}\sum_{j=1}^n L_{i_1 s}\, e_{i_1 j} e_{ij}\bigr| = o_p(n^{-1/2})$, as the remaining
term is $o_p(n^{-1/2})$ because of the independence. This approximation is proven by the union bound
and the boundedness of $L$: for any $\varepsilon > 0$,
$$\begin{aligned}
\lim_{n,N\to\infty} & P\Biggl( \sqrt{n} \max_{i=1,2,\ldots,N} \frac{1}{nN} \Bigl| \sum_{i_1 \ne i} \sum_{j=1}^n L_{i_1 s}\, e_{i_1 j} e_{ij} \Bigr| > \varepsilon \Biggr) \\
&\le \lim_{n,N\to\infty} 2N \cdot P\Biggl( \frac{C}{\sqrt{n}\,N} \sum_{i_1 \ne 1} \sum_{j=1}^n e_{i_1 j} e_{1j} > \varepsilon \Biggr) \\
&= \lim_{n,N\to\infty} 2N \cdot P\Biggl( \frac{1}{\sqrt{n}} \sum_{j=1}^n e_{1j} \Bigl( \frac{1}{\sqrt{N-1}} \sum_{i_1 \ne 1} e_{i_1 j} \Bigr) > \frac{\varepsilon}{C}\,\frac{N}{\sqrt{N-1}} \Biggr) \\
&\le \lim_{n,N\to\infty} 2N \cdot \mathbb{E}\Biggl[ \Biggl( \frac{1}{\sqrt{n}} \sum_{j=1}^n e_{1j} \Bigl( \frac{1}{\sqrt{N-1}} \sum_{i_1 \ne 1} e_{i_1 j} \Bigr) \Biggr)^4 \Biggr] \Bigg/ \Biggl( \frac{\varepsilon}{C}\,\frac{N}{\sqrt{N-1}} \Biggr)^4 = 0.
\end{aligned}$$
To see why the last inequality holds: $(N-1)^{-1/2}\sum_{i_1 \ne 1} e_{i_1 j} \sim \mathcal{N}(0,1)$ is independent of $e_{1j}$, thus
the fourth moment of $\frac{1}{\sqrt{n}}\sum_{j=1}^n e_{1j}\bigl((N-1)^{-1/2}\sum_{i_1 \ne 1} e_{i_1 j}\bigr)$ is bounded, which enables us to use
the Markov inequality. To prove (A.10), we start with the same union bound as for (A.9):
$$\begin{aligned}
\lim_{n,N\to\infty} & P\Biggl( \sqrt{n} \max_{i=1,2,\ldots,N} \frac{1}{n^2 N} \sum_{i_1 \ne i} \Bigl( \sum_{j=1}^n e_{i_1 j} e_{ij} \Bigr)^2 > \varepsilon \Biggr) \\
&\le \lim_{n,N\to\infty} N \cdot P\Biggl( \frac{\sqrt{n}}{n^2 N} \sum_{i_1=2}^N \Bigl( \sum_{j=1}^n e_{i_1 j} e_{1j} \Bigr)^2 > \varepsilon \Biggr) \\
&\le \lim_{n,N\to\infty} 2N^2 \cdot P\Biggl( \frac{1}{n} \sum_{j=1}^n e_{2j} e_{1j} > \sqrt{\varepsilon}\, n^{-1/4} \Biggr)
\le \lim_{n,N\to\infty} 2N^2 \exp(-\sqrt{n}\,\varepsilon_2) = 0,
\end{aligned}$$
where $\varepsilon_2$ is some positive constant. The last inequality holds by Lemma A.0.1: the products $e_{2j} e_{1j}$
are sub-exponential, so we can use the Bernstein inequality for sub-exponential variables to bound
the tail probability. The last limit holds as we assume $\log N/\sqrt{n} \to 0$.
Equation (A.9) directly implies that
$$\max_{i=1,\ldots,N} \Bigl| H \Bigl( \sum_{i_1=1}^N L_{i_1\cdot}^T \frac{1}{n} \sum_{j=1}^n \bigl[ e_{i_1 j} e_{ij} - \mathbb{E}(e_{i_1 j} e_{ij}) \bigr] \Bigr) \Bigr| = o_p(n^{-1/2}),$$
as $H = O_p(N^{-1})$. Using (A.10) and $N^{-1}\sum_i \|\hat{L}_{i\cdot} - L_{i\cdot}\|_2^2 = O_p(n^{-1})$ from Theorem 2.1.3, we get by
the Cauchy–Schwarz inequality:
$$\max_{i=1,\ldots,N} \Bigl| H \Bigl( \sum_{i_1=1}^N (\hat{L}_{i_1\cdot} - L_{i_1\cdot})^T \frac{1}{n} \sum_{j=1}^n \bigl[ e_{i_1 j} e_{ij} - \mathbb{E}(e_{i_1 j} e_{ij}) \bigr] \Bigr) \Bigr| = o_p(n^{-1}).$$
By writing $\hat{L}_{i\cdot} = \hat{L}_{i\cdot} - L_{i\cdot} + L_{i\cdot}$ and using the boundedness of both $\sigma_i$ and $\hat{\sigma}_i$,
$$\max_{i=1,\ldots,N} \Bigl| H \Bigl( \sum_{i_1=1}^N \frac{\sigma_{i_1}\sigma_i}{\hat{\sigma}_{i_1}^2} \hat{L}_{i_1\cdot}^T \frac{1}{n} \sum_{j=1}^n \bigl[ e_{i_1 j} e_{ij} - \mathbb{E}(e_{i_1 j} e_{ij}) \bigr] \Bigr) \Bigr| = o_p(n^{-1/2}), \tag{A.11}$$
which indicates that $\max_i |a_{9i}| = o_p(n^{-1/2})$.
To bound the remaining terms, we use the fact that
$$\max_{i=1,\ldots,N} \|\hat{L}_{i\cdot}\|_2 = O_p(1). \tag{A.12}$$
To see this, first notice that, because of the boundedness of $\sigma_i$ and $\hat{\sigma}_i$ and the fact that $H = O_p(N^{-1})$,
we have $\max_i |b_{10,i}| = O_p(N^{-1}\max_i \|\hat{L}_{i\cdot}\|)$. Combining the previous results on (A.1), we have
$\max_i \|\hat{L}_{i\cdot} - L_{i\cdot}\| = O_p(\sqrt{\log N/n}) + o_p(\max_i \|\hat{L}_{i\cdot}\|)$, which indicates that $\max_i \|\hat{L}_{i\cdot}\| = O_p(1)$. Thus,
$\max_i |a_{8i}| = o_p(\max_i |\hat{\sigma}_i^2 - \sigma_i^2|)$ is negligible and $\max_i \|\hat{L}_{i\cdot} - L_{i\cdot}\| = O_p(\sqrt{\log N/n}) + o_p(\max_i |\hat{\sigma}_i^2 - \sigma_i^2|)$.
The latter conclusion also indicates that $\max_i |a_{2i}| = O_p(\sqrt{\log N/n}) + o_p(\max_i |\hat{\sigma}_i^2 - \sigma_i^2|)$. As a
consequence, the second claim in (2.8) holds.
Finally, to prove (2.9), we have actually already shown that $\max_i \|\hat{L}_{i\cdot} - L_{i\cdot} - b_{8i}\| = o_p(n^{-1/2})$.
Then,
$$\begin{aligned}
\max_{i=1,2,\ldots,N} \Bigl\| \hat{L}_{i\cdot} - L_{i\cdot} - \frac{1}{n}\sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T \Bigr\|
&\le \max_{i=1,2,\ldots,N} \bigl\| \hat{L}_{i\cdot} - L_{i\cdot} - b_{8i} \bigr\| + \max_{i=1,2,\ldots,N} \Bigl\| b_{8i} - \frac{1}{n}\sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T \Bigr\| \\
&\le o_p(n^{-1/2}) + \bigl\| H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L) \bigr\|_F \max_{i=1,2,\ldots,N} \Bigl\| \frac{1}{n}\sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T \Bigr\| = o_p(n^{-1/2}).
\end{aligned}$$
Thus, (2.9) holds.
Bibliography
S. C. Ahn and A. R. Horenstein. Eigenvalue ratio test for the number of factors. Econometrica, 81
(3):1203–1227, 2013.
L. Alessi, M. Barigozzi, and M. Capasso. Improved penalization for determining the number of
factors in approximate factor models. Statistics & Probability Letters, 80(23):1806–1813, 2010.
Y. Amemiya, W. A. Fuller, and S. G. Pantula. The asymptotic distributions of some estimators for
a factor analysis model. Journal of Multivariate Analysis, 22(1):51–64, 1987.
D. Amengual and M. W. Watson. Consistent estimation of the number of dynamic factors in a large
n and t panel. Journal of Business & Economic Statistics, 25(1):91–96, 2007.
J. C. Anderson and D. W. Gerbing. Structural equation modeling in practice: A review and recom-
mended two-step approach. Psychological bulletin, 103(3):411, 1988.
T. W. Anderson and H. Rubin. Statistical inference in factor analysis. In Proceedings of the third
Berkeley symposium on mathematical statistics and probability, volume 5, 1956.
J. Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171, 2003.
J. Bai and K. Li. Statistical analysis of factor models of high dimension. The Annals of Statistics,
40(1):436–465, 2012a.
J. Bai and K. Li. Supplement to “Statistical analysis of factor models of high dimension.” 2012b.
J. Bai and K. Li. Maximum likelihood estimation and inference for approximate factor models of
high dimension. The Review of Economics and Statistics, 98:298–309, 2016.
J. Bai and S. Ng. Determining the number of factors in approximate factor models. Econometrica,
70(1):191–221, 2002.
J. Bai and S. Ng. Confidence intervals for diffusion index forecasts and inference for factor-augmented
regressions. Econometrica, 74(4):1133–1150, 2006.
BIBLIOGRAPHY 98
J. Bai and S. Ng. Principal components estimation and identification of static factors. Journal of
Econometrics, 176(1):18–29, 2013.
M. S. Bartlett. Tests of significance in factor analysis. British Journal of Statistical Psychology, 3
(2):77–85, 1950.
A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via
conic programming. Biometrika, 98(4):791–806, 2011.
F. Benaych-Georges and N. Raj Rao. The singular values and vectors of low rank perturbations of
large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135, 2012.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical
learning via the alternating direction method of multipliers. Foundations and Trends in Machine
Learning, 3(1):1–122, 2011.
A. Buja and N. Eyuboglu. Remarks on parallel analysis. Multivariate behavioral research, 27(4):
509–540, 1992.
J.-F. Cai, E. J. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion.
SIAM Journal on Optimization, 20(4):1956–1982, 2010.
R. Caruana, S. Lawrence, and L. Giles. Overfitting in neural nets: Backpropagation, conjugate gra-
dient, and early stopping. In Advances in Neural Information Processing Systems 13: Proceedings
of the 2000 Conference, volume 13, pages 402–408. MIT Press, 2001.
C. M. Carvalho, J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. High-dimensional sparse
factor modeling: applications in gene expression genomics. Journal of the American Statistical
Association, 2012.
R. B. Cattell. The scree test for the number of factors. Multivariate behavioral research, 1(2):
245–276, 1966.
R. B. Cattell and S. Vogelmann. A comprehensive trial of the scree and KG criteria for determining
the number of factors. Multivariate Behavioral Research, 12(3):289–325, 1977.
A. Craig, O. Cloarec, E. Holmes, J. K. Nicholson, and J. C. Lindon. Scaling and normalization effects
in NMR spectroscopic metabonomic data sets. Analytical Chemistry, 78(7):2262–2267, 2006.
J. Fan, Y. Liao, and M. Mincheva. Large covariance estimation by thresholding principal orthogonal
complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):
603–680, 2013.
T. L. Fare, E. M. Coffey, H. Dai, Y. D. He, D. A. Kessler, K. A. Kilian, J. E. Koch, E. LeProust,
M. J. Marton, M. R. Meyer, et al. Effects of atmospheric ozone on microarray data quality.
Analytical chemistry, 75(17):4672–4675, 2003.
E. Fishler, M. Grosmann, and H. Messer. Detection of signals by information theoretic criteria:
general asymptotic performance analysis. IEEE Transactions on Signal Processing, 50(5):1027–
1036, 2002.
H. E. Fleming. Equivalence of regularization and truncated iteration in the solution of ill-posed
image reconstruction problems. Linear Algebra and its applications, 130:133–150, 1990.
M. Forni, M. Hallin, M. Lippi, and L. Reichlin. The generalized dynamic-factor model: Identification
and estimation. Review of Economics and statistics, 82(4):540–554, 2000.
J. Gagnon-Bartsch, L. Jacob, and T. Speed. Removing unwanted variation from high dimensional
data with negative controls. Technical Report 820, Department of Statistics,
University of California, Berkeley, 2013.
J. A. Gagnon-Bartsch and T. P. Speed. Using control genes to correct for unwanted variation in
microarray data. Biostatistics, 13(3):539–552, 2012.
A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein,
and P. O. Brown. Genomic expression programs in the response of yeast cells to environmental
changes. Molecular biology of the cell, 11(12):4241–4257, 2000.
M. Gavish and D. L. Donoho. Optimal shrinkage of singular values. arXiv preprint arXiv:1405.7511,
2014.
M. Hallin and R. Liska. The generalized dynamic factor model: determining the number of factors.
Journal of the American Statistical Association, 102(478):603–617, 2007.
T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learning. Springer, 2009.
J. L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):
179–185, 1965.
R. H. Hoyle. Confirmatory factor analysis. Handbook of applied multivariate statistics and mathe-
matical modeling, pages 465–497, 2000.
R. Hubbard and S. J. Allen. An empirical comparison of alternative methods for principal component
extraction. Journal of Business Research, 15(2):173–190, 1987.
P. J. Huber. Robust statistics. Springer, 2011.
I. Jolliffe. Principal component analysis. New York: Springer-Verlag, 1986.
H. F. Kaiser. The application of electronic computers to factor analysis. Educational and psycho-
logical measurement, 20(1):141–151, 1960.
S. Kritchman and B. Nadler. Determining the number of components in a factor model from limited
noisy data. Chemometrics and Intelligent Laboratory Systems, 94(1):19–32, 2008.
S. Kritchman and B. Nadler. Non-parametric detection of the number of signals: Hypothesis testing
and random matrix theory. Signal Processing, IEEE Transactions on, 57(10):3930–3941, 2009.
C. Lam and Q. Yao. Factor modeling for high-dimensional time series: inference for the number of
factors. The Annals of Statistics, 40(2):694–726, 2012.
W. Lan and L. Du. A factor-adjusted multiple testing procedure with application to mutual fund
selection. arXiv:1407.5515, 2014.
R. M. Larsen. Lanczos bidiagonalization with partial reorthogonalization. DAIMI Report Series, 27
(537), 1998.
D. N. Lawley. VI.—The estimation of factor loadings by the method of maximum likelihood. Proceedings
of the Royal Society of Edinburgh, 60(1):64–82, 1940.
D. N. Lawley. Tests of significance for the latent roots of covariance and correlation matrices.
Biometrika, 43(1/2):128–136, 1956.
C. Lazar, S. Meganck, J. Taminau, D. Steenhoff, A. Coletta, C. Molter, D. Y. Weiss-Solís, R. Duque,
H. Bersini, and A. Nowe. Batch effect removal methods for microarray gene expression data
integration: a survey. Briefings in bioinformatics, 14(4):469–490, 2013.
J. T. Leek and J. D. Storey. A general framework for multiple testing dependence. Proceedings of
the National Academy of Sciences, 105(48):18718–18723, 2008a.
J. T. Leek and J. D. Storey. A general framework for multiple testing dependence. Proceedings of
the National Academy of Sciences, 105(48):18718–18723, 2008b.
D. W. Lin, I. M. Coleman, S. Hawley, C. Y. Huang, R. Dumpit, D. Gifford, P. Kezele, H. Hung,
B. S. Knudsen, A. R. Kristal, et al. Influence of surgical manipulation on prostate gene expression:
implications for molecular correlates of treatment effects and disease prognosis. Journal of clinical
oncology, 24(23):3763–3770, 2006.
H. Martens. Factor analysis of chemical mixtures: Non-negative factor solutions for spectra of
cereal amino acids. Analytica Chimica Acta, 112(4):423–442, 1979.
B. Nadler. Finite sample approximation results for principal component analysis: A matrix pertur-
bation approach. The Annals of Statistics, 36(6):2791–2817, 2008.
B. Nadler. Nonparametric detection of signals by information theoretic criteria: performance analysis
and an improved estimator. Signal Processing, IEEE Transactions on, 58(5):2746–2756, 2010.
S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-
dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
A. Onatski. Determining the number of factors from empirical distribution of eigenvalues. The
Review of Economics and Statistics, 92(4):1004–1016, 2010.
A. Onatski. Asymptotics of the principal components estimator of large factor models with weakly
influential factors. Journal of Econometrics, 168(2):244–258, 2012.
A. Onatski. Asymptotic analysis of the squared estimation error in misspecified factor models.
Journal of Econometrics, 186(2):388–406, 2015.
A. B. Owen. A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443:59–72,
2007.
A. B. Owen and P. O. Perry. Bi-cross-validation of the SVD and the nonnegative matrix factorization.
The Annals of Applied Statistics, 3(2):564–594, 2009.
J. M. Paque, R. Browning, P. L. King, and P. Pianetta. Quantitative information from x-ray images
of geological materials. Proceedings of the XIIth International Congress for Electron Microscopy,
2:244–247, 1990.
N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):
127–239, 2014.
D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model.
Statistica Sinica, 17(4):1617–1642, 2007.
P. R. Peres-Neto, D. A. Jackson, and K. M. Somers. How many principal components? Stopping
rules for determining the number of non-trivial axes revisited. Computational Statistics & Data
Analysis, 49(4):974–997, 2005.
P. O. Perry. Cross-validation for unsupervised learning. arXiv preprint arXiv:0909.3052, 2009.
A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal
components analysis corrects for stratification in genome-wide association studies. Nature genetics,
38(8):904–909, 2006.
N. Raj Rao. Optshrink: An algorithm for improved low-rank signal matrix denoising by optimal,
data-driven singular value shrinkage. IEEE Transactions on Information Theory, 60(5):3002–3018,
2014.
N. Raj Rao and A. Edelman. Sample eigenvalue based detection of high-dimensional signals in
white noise using relatively few samples. IEEE Transactions on Signal Processing, 56(7):2625–
2638, 2008.
D. F. Ransohoff. Bias as a threat to the validity of cancer molecular-marker research. Nature Reviews
Cancer, 5(2):142–149, 2005.
B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations
via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
O. Reiersøl. On the identifiability of parameters in Thurstone’s multiple factor analysis. Psychome-
trika, 15(2):121–149, 1950.
D. R. Rhodes and A. M. Chinnaiyan. Integrative analysis of the cancer transcriptome. Nature
genetics, 37:S31–S37, 2005.
O. Rivasplata. Subgaussian random variables: an expository note. Internet publication, PDF, 2012.
S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier.
The Journal of Machine Learning Research, 5:941–973, 2004.
S. N. Roy. On a heuristic method of test construction and its use in multivariate analysis. The
Annals of Mathematical Statistics, 24(2):220–238, 1953.
A. Schwartzman, R. F. Dougherty, and J. E. Taylor. False discovery rate analysis of brain diffusion
direction maps. The Annals of Applied Statistics, 2(1):153–175, 2008.
A. A. Shabalin and A. B. Nobel. Reconstruction of a low-rank matrix in the presence of Gaussian
noise. Journal of Multivariate Analysis, 118:67–76, 2013.
Y. She and A. B. Owen. Outlier detection using nonconvex penalized regression. Journal of the
American Statistical Association, 106(494):626–639, 2011.
H. Shen and J. Z. Huang. Sparse principal component analysis via regularized low rank matrix
approximation. Journal of multivariate analysis, 99(6):1015–1034, 2008.
P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription.
In Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on., pages 177–
180. IEEE, 2003.
C. Spearman. “General intelligence,” objectively determined and measured. The American Journal
of Psychology, 15(2):201–292, 1904.
Y. Sun, N. R. Zhang, and A. B. Owen. Multiple hypothesis testing adjusted for latent variables,
with an application to the agemap gene expression data. The Annals of Applied Statistics, 6(4):
1664–1688, 2012.
L. R. Tucker and C. Lewis. A reliability coefficient for maximum likelihood factor analysis. Psy-
chometrika, 38(1):1–10, 1973.
W. F. Velicer. Determining the number of components from the matrix of partial correlations.
Psychometrika, 41(3):321–327, 1976.
W. F. Velicer, C. A. Eaton, and J. L. Fava. Construct explication through factor or component
analysis: A review and evaluation of alternative procedures for determining the number of factors
or components. In Problems and solutions in human assessment, pages 41–71. Springer, 2000.
F. Wang and M. M. Wall. Generalized common spatial factor model. Biostatistics, 4(4):569–582,
2003.
S. Wang, G. Cui, and K. Li. Factor-augmented regression models with structural change. Economics
Letters, 130:124–127, 2015.
M. Wax and T. Kailath. Detection of signals by information theoretic criteria. IEEE Transactions
on Acoustics, Speech and Signal Processing, 33(2):387–392, 1985.
Z. Wen, W. Yin, and Y. Zhang. Solving a low-rank factorization model for matrix completion by
a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4):
333–361, 2012.
D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to
sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
S. Wold. Cross-validatory estimation of the number of components in factor and principal compo-
nents models. Technometrics, 20(4):397–405, 1978.
A. M. Woodward, B. K. Alsberg, and D. B. Kell. The effect of heteroscedastic noise on the chemo-
metric modelling of frequency domain data. Chemometrics and intelligent laboratory systems, 40
(1):101–107, 1998.
I. Yamazaki, Z. Bai, H. Simon, L.-W. Wang, and K. Wu. Adaptive projection subspace dimension
for the thick-restart lanczos method. ACM Transactions on Mathematical Software (TOMS), 37
(3):27, 2010.
J. Yao, Z. Bai, and S. Zheng. Large Sample Covariance Matrices and High-Dimensional Data
Analysis. Number 39. Cambridge University Press, 2015.
Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive
Approximation, 26(2):289–315, 2007.
T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. Annals of
Statistics, 33(4):1538–1579, 2005.
W. R. Zwick and W. F. Velicer. Comparison of five rules for determining the number of components
to retain. Psychological bulletin, 99(3):432–442, 1986.