Creative Components Iowa State University Capstones, Theses and Dissertations
Spring 2019
Matrix-free Likelihood Method for Exploratory Factor Analysis with High-dimensional Gaussian Data
Fan Dai
Follow this and additional works at: https://lib.dr.iastate.edu/creativecomponents
Part of the Multivariate Analysis Commons
Recommended Citation: Dai, Fan, "Matrix-free Likelihood Method for Exploratory Factor Analysis with High-dimensional Gaussian Data" (2019). Creative Components. 452. https://lib.dr.iastate.edu/creativecomponents/452
This Creative Component is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Creative Components by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].
Matrix-free Likelihood Method for Exploratory Factor Analysis with High-dimensional Gaussian Data
Fan Dai
Iowa State University, Ames, Iowa
Abstract
This paper proposes a novel profile likelihood method for estimating the covariance parameters in exploratory factor analysis (EFA) with high-dimensional Gaussian data. By implementing a Lanczos algorithm and a limited-memory quasi-Newton method, we develop a matrix-free algorithm (HDFA) that performs a partial singular value decomposition (partial SVD) of the data matrix, where the number of observations n is typically smaller than the dimension p, and that requires only a limited amount of memory during likelihood maximization. We perform simulation studies with both randomly generated models and data-driven models. The results indicate that HDFA substantially outperforms the EM algorithm in computation time in all cases. Furthermore, our algorithm is applied to fit factor models to an fMRI dataset with suicide attempters, suicidal non-attempters and a control group.
Keywords: Profile likelihood, Partial SVD, Lanczos algorithm, L-BFGS-B, fMRI, Suicidal ideation data
Contents
1 Introduction
2 Methodology
2.1 Background and preliminaries
2.1.1 Factor model for Gaussian data
2.1.2 Model identification
2.1.3 EM algorithms
2.1.4 Convergence issues
2.2 Profile likelihood for maximum likelihood estimation
2.2.1 Profile log-likelihood function
2.2.2 Partial singular value decomposition with Lanczos algorithm
2.2.3 Likelihood maximization with limited-memory quasi-Newton method
3 Performance illustrations
3.1 Data generation
3.2 Algorithm settings
3.3 Result evaluation
3.3.1 Computation time
3.3.2 Estimation accuracy
4 Application to fMRI dataset
4.1 Suicidal ideation dataset
4.2 Exploratory factor analysis
5 Discussion
1 Introduction
Factor analysis is a popular multivariate statistical technique that reduces data dimensionality by characterizing the dependence among variables using a small number of latent factors [1]. To that end, suppose we have n independent and identically distributed random vectors $y_1, \dots, y_n$ from a p-dimensional multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. In the rest of the paper we assume $\Sigma = LL^\top + D$, where $L$ is a $p \times q$ matrix of rank $q$ that describes the amount of variance shared among the p coordinates and $D$ is a diagonal matrix with positive diagonal entries representing the unique variance specific to each coordinate and associated with the random error term. The low-dimensional setting (i.e., $p \ll n$) was first proposed by Lawley [10], and maximum likelihood estimation of such low-dimensional factor models has been well studied in [9, 11, 2] and elsewhere. Although efficient implementations are available when $p \ll n$ (e.g., factanal and fa in R and factoran in Matlab), these algorithms require a full eigendecomposition of a positive definite covariance or correlation matrix and hence do not scale well with the dimension p.
However, in exploratory factor analysis for high-dimensional data, such as analyzing the transcription factor activity profiles of gene regulatory networks using massive gene expression datasets [19], constructing a Markowitz minimum variance portfolio from stock return data with a high-dimensional factor AR model [18], or identifying the contributions of potential characteristics associated with cast posts in a root-filled tooth [12], the number of variables p typically exceeds the sample size n by a huge margin, which poses tremendous difficulties for parameter estimation.
Another way to obtain the MLEs in factor analysis is via EM-type algorithms, such as Rubin and Thayer's EM algorithm [2], Liu and Rubin's ECME algorithm [13], and Zhao, Yu and Jiang's CM algorithm [24]. In an EM implementation, the latent factors are treated as missing data [4] and updated in the expectation step, which requires only a low-dimensional matrix inversion, and the iterative updates for L and D are available in closed form. Although EM algorithms are applicable when p > n, they suffer from slow convergence and may converge only to a local maximum, especially in high-dimensional cases.
To handle the problems discussed above, we propose a profile likelihood for the case $p \gg n$ and develop a matrix-free computational framework (denoted HDFA) for maximizing the profile likelihood function. The key ingredients of our computational framework are (1) a Lanczos algorithm that allows us to compute the few dominant singular values and vectors of the high-dimensional matrices required to compute the profile log-likelihood and its gradient, and (2) a limited-memory quasi-Newton optimization algorithm that avoids storing the large p × p Hessian of the profile log-likelihood function.
The remainder of this paper is organized as follows. Section 2 first describes the factor model for Gaussian data and the EM algorithm, and then proposes the profile likelihood and the HDFA algorithm. Section 3 presents simulation studies that compare and evaluate the performance of HDFA and the EM algorithm. Section 4 illustrates an exploratory factor analysis of an fMRI suicidal ideation dataset using the HDFA algorithm. Section 5 concludes the paper with a discussion of possible extensions and future work.
2 Methodology
2.1 Background and preliminaries
In this paper, the data matrix $Y = [y_1^\top, \cdots, y_n^\top]^\top$ consists of independent and normally distributed samples which share similar patterns through a set of latent random factor scores $Z = [z_1^\top, \cdots, z_n^\top]^\top$. The Gaussian factor model and the EM algorithm for parameter estimation are presented in the following sections.
2.1.1 Factor model for Gaussian data
Suppose $y_1, \cdots, y_n$ are i.i.d. observed samples from a p-dimensional normal distribution. Then
\[
y_i = \mu + L z_i + \varepsilon_i, \qquad z_i \sim N_q(0, I_{q \times q}), \qquad \varepsilon_i \sim N_p(0, D),
\tag{1}
\]
where $L$ is a $p \times q$ factor loading matrix with $q < \min(p, n)$ and $(p - q)^2 > p + q$, $D$ is a $p \times p$ diagonal matrix, and $z_i$ consists of q unobserved orthogonal factors.
Based on model (1), $y_i \mid z_i \sim N_p(\mu + L z_i, D)$ and $y_i \sim N_p(\mu, \Sigma)$ with $\Sigma = LL^\top + D$.
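As an illustration of model (1), the following NumPy sketch draws samples from the factor model and compares the sample covariance with $\Sigma = LL^\top + D$. The function name `simulate_factor_data`, the seed and the small dimensions are assumptions chosen for illustration; they are not taken from the simulations reported later.

```python
import numpy as np

def simulate_factor_data(n, mu, L, d, rng):
    """Draw n samples y_i = mu + L z_i + eps_i from model (1).

    d holds the diagonal of D (the unique variances).
    """
    p, q = L.shape
    Z = rng.standard_normal((n, q))                # z_i ~ N_q(0, I)
    E = rng.standard_normal((n, p)) * np.sqrt(d)   # eps_i ~ N_p(0, D)
    return mu + Z @ L.T + E                        # rows are y_i'

rng = np.random.default_rng(0)
p, q, n = 6, 2, 50_000                             # illustrative sizes only
L = rng.standard_normal((p, q))
d = rng.uniform(0.2, 0.8, size=p)
mu = rng.standard_normal(p)

Y = simulate_factor_data(n, mu, L, d, rng)
Sigma = L @ L.T + np.diag(d)                       # implied covariance
S = np.cov(Y, rowvar=False, bias=True)             # sample covariance (1/n)
rel_err = np.max(np.abs(S - Sigma)) / np.max(np.abs(Sigma))
```

With a large n the relative error `rel_err` should be small, which is a quick sanity check that the marginal covariance implied by (1) is indeed $LL^\top + D$.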
2.1.2 Model identification
Since $LQ$ and $L$ produce the same likelihood for any orthogonal matrix $Q$, the factor model requires additional conditions to be identifiable [15]. Commonly used techniques include choosing the columns of $Q$ to be the eigenvectors of the signal matrix $L^\top D^{-1} L$, so that after rotation $L^\top D^{-1} L$ is diagonal with decreasing diagonal entries, or fixing entries of the loading matrix to 0 based on prior information. More conditions can be found in the literature [1, 16].
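The rotation argument can be checked numerically. The sketch below (the helper name `rotate_to_identified` and the test dimensions are illustrative assumptions) rotates an arbitrary loading matrix by the eigenvectors of $L^\top D^{-1} L$, so that the rotated signal matrix is diagonal with decreasing entries while $LL^\top$ is unchanged.

```python
import numpy as np

def rotate_to_identified(L, d):
    """Return L Q, where Q holds the eigenvectors of L' D^{-1} L.

    d is the diagonal of D. After rotation, (LQ)' D^{-1} (LQ) is diagonal
    with decreasing entries.
    """
    Gamma = L.T @ (L / d[:, None])       # signal matrix L' D^{-1} L (q x q)
    evals, Q = np.linalg.eigh(Gamma)     # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]      # reorder to decreasing
    return L @ Q[:, order]

rng = np.random.default_rng(1)
L = rng.standard_normal((8, 3))
d = rng.uniform(0.2, 0.8, size=8)

L_star = rotate_to_identified(L, d)
Gamma_star = L_star.T @ (L_star / d[:, None])
```

Because $Q$ is orthogonal, `L_star @ L_star.T` equals `L @ L.T`, so the covariance and hence the likelihood are unaffected by the rotation.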
2.1.3 EM algorithms
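The unobserved factors $z_1, \cdots, z_n$ can be treated as missing data, and the complete-data log-likelihood function is then
\[
\begin{aligned}
\ell_C(\mu, L, D) &= c - \frac{n}{2}\log|D| - \frac{1}{2}\sum_{i=1}^{n} (y_i - \mu - L z_i)^\top D^{-1} (y_i - \mu - L z_i) \\
&= c - \frac{n}{2}\log|D| - \frac{1}{2}\mathrm{Tr}\Big\{ D^{-1}\sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)^\top - 2 D^{-1} L \sum_{i=1}^{n} z_i (y_i - \mu)^\top + L^\top D^{-1} L \sum_{i=1}^{n} z_i z_i^\top \Big\},
\end{aligned}
\tag{2}
\]
where c represents the constant term with respect to $\mu$, $L$ and $D$.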
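E-Step computations
The expected complete-data log-likelihood is given by
\[
\begin{aligned}
E[\ell_C(\mu, L, D) \mid y_1, \dots, y_n] = c - \frac{n}{2}\log|D| - \frac{1}{2}\mathrm{Tr}\Big\{ & D^{-1}\sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)^\top - 2 D^{-1} L \sum_{i=1}^{n} E[z_i \mid y_i](y_i - \mu)^\top \\
& + L^\top D^{-1} L \sum_{i=1}^{n} E[z_i z_i^\top \mid y_i] \Big\}.
\end{aligned}
\tag{3}
\]
By multivariate normal theory, $z_i \mid y_i \sim N_q\big(L^\top \Sigma^{-1}(y_i - \mu),\ (I_{q \times q} + L^\top D^{-1} L)^{-1}\big)$. So,
\[
E[z_i \mid y_i] = L^\top \Sigma^{-1}(y_i - \mu),
\tag{4}
\]
\[
\begin{aligned}
E[z_i z_i^\top \mid y_i] &= \mathrm{Var}[z_i \mid y_i] + E[z_i \mid y_i]\,E[z_i^\top \mid y_i] \\
&= (I_{q \times q} + L^\top D^{-1} L)^{-1} + L^\top \Sigma^{-1}(y_i - \mu)(y_i - \mu)^\top \Sigma^{-1} L.
\end{aligned}
\tag{5}
\]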
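M-Step computations
From the marginal distribution of $y_i$, the global MLE of $\mu$ is $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} y_i$. Hence, given the current estimates $L_t$ and $D_t$, the updates $L_{t+1}$ and $D_{t+1}$ are obtained by maximizing
\[
\begin{aligned}
E[\ell_C(L, D) \mid y_1, \dots, y_n, \hat\mu, L_t, D_t] = c - \frac{n}{2}\log|D| - \frac{n}{2}\mathrm{Tr}\Big\{ & D^{-1} S - D^{-1} L\,\frac{2}{n}\sum_{i=1}^{n} E[z_i \mid y_i, L_t, D_t](y_i - \hat\mu)^\top \\
& + L^\top D^{-1} L\,\frac{1}{n}\sum_{i=1}^{n} E[z_i z_i^\top \mid y_i, L_t, D_t] \Big\},
\end{aligned}
\tag{6}
\]
where $S = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat\mu)(y_i - \hat\mu)^\top$ is the sample covariance matrix.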
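Then, conditional on $\hat\mu$, $L_t$ and $D_t$, the update of $L$ is given by
\[
\begin{aligned}
\hat L_{t+1} &= \Big(\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat\mu)\,E[z_i^\top \mid y_i, L_t, D_t]\Big)\Big(\frac{1}{n}\sum_{i=1}^{n} E[z_i z_i^\top \mid y_i, L_t, D_t]\Big)^{-1} \\
&= S \Sigma_t^{-1} L_t \big((I_{q \times q} + L_t^\top D_t^{-1} L_t)^{-1} + L_t^\top \Sigma_t^{-1} S \Sigma_t^{-1} L_t\big)^{-1},
\end{aligned}
\tag{7}
\]
where $\Sigma_t = L_t L_t^\top + D_t$. By the Woodbury matrix identity, $\Sigma_t^{-1} L_t = D_t^{-1} L_t (I_{q \times q} + L_t^\top D_t^{-1} L_t)^{-1}$, so equation (7) can be simplified as
\[
\begin{aligned}
\hat L_{t+1} &= S D_t^{-1} L_t (I_{q \times q} + L_t^\top D_t^{-1} L_t)^{-1}\big((I_{q \times q} + L_t^\top D_t^{-1} L_t)^{-1} + L_t^\top \Sigma_t^{-1} S D_t^{-1} L_t (I_{q \times q} + L_t^\top D_t^{-1} L_t)^{-1}\big)^{-1} \\
&= S D_t^{-1} L_t (I_{q \times q} + L_t^\top \Sigma_t^{-1} S D_t^{-1} L_t)^{-1}.
\end{aligned}
\tag{8}
\]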
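Next, conditional on $\hat\mu$, $L_t$, $D_t$ and $\hat L_{t+1}$, the update of $D$ is given by
\[
\hat D_{t+1} = \mathrm{diag}\Big\{ S - \frac{2}{n}\sum_{i=1}^{n} (y_i - \hat\mu)\,E[z_i^\top \mid y_i, L_t, D_t]\,\hat L_{t+1}^\top + \hat L_{t+1}\Big(\frac{1}{n}\sum_{i=1}^{n} E[z_i z_i^\top \mid y_i, L_t, D_t]\Big)\hat L_{t+1}^\top \Big\}.
\tag{9}
\]
Substituting equation (7) into the last term,
\[
\begin{aligned}
\hat D_{t+1} &= \mathrm{diag}\Big\{ S - \frac{2}{n}\sum_{i=1}^{n} (y_i - \hat\mu)\,E[z_i^\top \mid y_i, L_t, D_t]\,\hat L_{t+1}^\top + \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat\mu)\,E[z_i^\top \mid y_i, L_t, D_t]\,\hat L_{t+1}^\top \Big\} \\
&= \mathrm{diag}\big\{ S - S \Sigma_t^{-1} L_t \hat L_{t+1}^\top \big\},
\end{aligned}
\tag{10}
\]
where $\mathrm{diag}\{A\}$ refers to the diagonal elements of the matrix $A$.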
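A single EM iteration can be sketched in NumPy as follows. This is an illustrative reading of the E- and M-step formulas, using $\frac{1}{n}\sum_i (y_i - \hat\mu)E[z_i^\top \mid y_i] = S\Sigma_t^{-1}L_t$ and the Woodbury identity; the names `em_step` and `loglik`, the toy data and the starting values are assumptions, not the implementation used in the simulations.

```python
import numpy as np

def em_step(S, L, d):
    """One EM update of (L, d); d is the diagonal of D, S the sample covariance."""
    q = L.shape[1]
    DinvL = L / d[:, None]                        # D^{-1} L
    G = np.linalg.inv(np.eye(q) + L.T @ DinvL)    # (I + L' D^{-1} L)^{-1}, q x q
    SigInvL = DinvL @ G                           # Sigma^{-1} L (Woodbury)
    A = S @ SigInvL                               # (1/n) sum (y_i - mu) E[z_i' | y_i]
    B = G + SigInvL.T @ A                         # (1/n) sum E[z_i z_i' | y_i]
    L_new = np.linalg.solve(B.T, A.T).T           # A B^{-1}
    d_new = np.diag(S) - np.sum(A * L_new, axis=1)  # diag{S - A L_new'}
    return L_new, d_new

def loglik(S, L, d, n):
    """Marginal Gaussian log-likelihood up to an additive constant."""
    Sigma = L @ L.T + np.diag(d)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(Sigma, S)))

rng = np.random.default_rng(2)
n, p, q = 50, 20, 2                               # toy sizes for illustration
Y = rng.standard_normal((n, q)) @ rng.standard_normal((q, p)) \
    + rng.standard_normal((n, p))
Yc = Y - Y.mean(axis=0)
S = Yc.T @ Yc / n

L, d = rng.standard_normal((p, q)), np.ones(p)    # arbitrary starting values
history = [loglik(S, L, d, n)]
for _ in range(25):
    L, d = em_step(S, L, d)
    history.append(loglik(S, L, d, n))
```

Only a q × q system is solved per iteration, and the recorded log-likelihood values in `history` should be non-decreasing, as expected for EM.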
2.1.4 Convergence issues
The EM algorithm involves inexpensive calculations: it requires only matrix-vector products and the inverse of a low-dimensional q × q matrix $\sum_{i=1}^{n} E[z_i z_i^\top \mid y_i]$, which is attractive for high-dimensional problems. However, EM-type methods are known for slow convergence that depends on the initial parameter values [4]. Possible remedies under specific model assumptions are discussed in the literature, such as the squared iterative method (SQUAREM), which applies an extrapolation scheme to achieve superlinear convergence [21].
2.2 Profile likelihood for maximum likelihood estimation
In this section, we present a new profile likelihood for maximum likelihood estimation and develop our matrix-free high-dimensional factor analysis algorithm (HDFA). The identifiability condition used here is that the signal matrix $L^\top D^{-1} L$ is diagonal with decreasing entries.
2.2.1 Profile log-likelihood function
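Given the MLE of $\mu$, the marginal log-likelihood function is
\[
\ell_n(\mu, L, D) = c - \frac{n}{2}\log|LL^\top + D| - \frac{n}{2}\mathrm{Tr}\{(LL^\top + D)^{-1} S\},
\tag{11}
\]
where, as before, $S = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat\mu)(y_i - \hat\mu)^\top$ and $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} y_i$.
Then the MLEs of $L$ and $D$ solve
\[
L(I_{q \times q} + L^\top D^{-1} L) = S D^{-1} L, \qquad D = \mathrm{diag}\{S - LL^\top\}.
\tag{12}
\]
From $L(I_{q \times q} + L^\top D^{-1} L) = S D^{-1} L$, we have
\[
D^{-1/2} L \big(I_{q \times q} + (D^{-1/2} L)^\top D^{-1/2} L\big) = D^{-1/2} S D^{-1/2} D^{-1/2} L.
\]
Let the spectral decomposition of $D^{-1/2} S D^{-1/2}$ be $V \Lambda V^\top$. Then the above formula implies that
\[
D^{-1/2} L (D^{-1/2} L)^\top = D^{-1/2} S D^{-1/2} - I_{p \times p} = V(\Lambda - I_{p \times p})V^\top.
\tag{13}
\]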
Equations (12) and (13) give Lawley's algorithm [10], which recursively updates $L$ and $D$ until convergence. However, for high-dimensional datasets, manipulating the p × p matrix $S$ and computing the eigendecomposition of $D^{-1/2} S D^{-1/2}$ become computationally expensive and are problematic in singular cases. To handle these problems, we propose a new profile log-likelihood as follows.
Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ denote the eigenvalues of $D^{-1/2} S D^{-1/2}$, which form the decreasing diagonal elements of $\Lambda$, and write
\[
\Lambda = \begin{pmatrix} \Lambda_q & 0 \\ 0 & \Lambda_m \end{pmatrix}.
\]
Then $\Lambda_q$ contains the largest q eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_q$, and the corresponding q eigenvectors form the columns of the matrix $V_q$, so that $V = [V_q, V_m]$. Equation (13) shows that
\[
L = D^{1/2} V_q [(\Lambda_q - I)_+]^{1/2}.
\tag{14}
\]
Given the n × p data matrix $Y$, since $S = \frac{1}{n}(Y - 1\hat\mu^\top)^\top (Y - 1\hat\mu^\top)$, the square roots of $\lambda_1, \cdots, \lambda_q$ are the largest q singular values of $\frac{1}{\sqrt{n}}(Y - 1\hat\mu^\top) D^{-1/2}$, and the columns of $V_q$ are the corresponding q right-singular vectors. Hence, conditional on $D$, the likelihood is maximized at $\hat L = D^{1/2} V_q \Delta$, where $\Delta$ is a diagonal matrix with diagonal elements $[\max(\lambda_i - 1, 0)]^{1/2}$, $i = 1, \cdots, q$.
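From the construction of $V_q$ and $V_m$, we have the following results:
\[
V_q^\top V_q = I_{q \times q}, \quad V_m^\top V_m = I_{(p-q) \times (p-q)}, \quad V_q V_q^\top + V_m V_m^\top = I_{p \times p}, \quad V_q^\top V_m = 0,
\]
and hence
\[
(V_q \Lambda_q V_q^\top + V_m V_m^\top)(V_q \Lambda_q^{-1} V_q^\top + V_m V_m^\top) = I_{p \times p}.
\]
Now let $A = V_q \Delta^2 V_q^\top$, and suppose $\lambda_j > 1$ for $j = 1, \cdots, q$, so that $\Delta^2 + I_{q \times q} = \Lambda_q$. Since $V = [V_q, V_m]$ is orthogonal, $A + I_{p \times p} = V_q(\Delta^2 + I_{q \times q})V_q^\top + V_m V_m^\top$, and therefore
\[
|A + I_{p \times p}| = |\Delta^2 + I_{q \times q}| = \prod_{j=1}^{q} \lambda_j,
\]
\[
\begin{aligned}
(A + I_{p \times p})^{-1} &= (V_q \Delta^2 V_q^\top + V_q V_q^\top + V_m V_m^\top)^{-1} \\
&= (V_q(\Delta^2 + I)V_q^\top + V_m V_m^\top)^{-1} \\
&= (V_q \Lambda_q V_q^\top + V_m V_m^\top)^{-1} \\
&= V_q \Lambda_q^{-1} V_q^\top + V_m V_m^\top.
\end{aligned}
\tag{15}
\]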
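Based on the above results, the profile log-likelihood is given by
\[
\begin{aligned}
\ell_p(D) &= c - \frac{n}{2}\log|\hat L \hat L^\top + D| - \frac{n}{2}\mathrm{Tr}\{(\hat L \hat L^\top + D)^{-1} S\} \\
&= c - \frac{n}{2}\Big\{ \log|D^{1/2}(V_q \Delta^2 V_q^\top + I_{p \times p})D^{1/2}| + \mathrm{Tr}\{(D^{1/2}(V_q \Delta^2 V_q^\top + I_{p \times p})D^{1/2})^{-1} S\} \Big\} \\
&= c - \frac{n}{2}\Big\{ \log|D| + \log|V_q \Delta^2 V_q^\top + I_{p \times p}| + \mathrm{Tr}\{(V_q \Lambda_q^{-1} V_q^\top + V_m V_m^\top) D^{-1/2} S D^{-1/2}\} \Big\} \\
&= c - \frac{n}{2}\Big\{ \log|D| + \sum_{j=1}^{q} \log\lambda_j + \mathrm{Tr}\{\Lambda_q^{-1} V_q^\top V \Lambda V^\top V_q\} + \mathrm{Tr}\{V_m^\top V \Lambda V^\top V_m\} \Big\} \\
&= c - \frac{n}{2}\Big\{ \log|D| + \sum_{j=1}^{q} \log\lambda_j + \mathrm{Tr}\{\Lambda_q^{-1}\Lambda_q\} + \mathrm{Tr}\{\Lambda_m\} \Big\} \\
&= c - \frac{n}{2}\Big\{ \log|D| + \sum_{j=1}^{q} \log\lambda_j + q + \mathrm{Tr}\{D^{-1} S\} - \sum_{j=1}^{q} \lambda_j \Big\}.
\end{aligned}
\tag{16}
\]
For a standardized data matrix $Y$ and the sample correlation matrix $S$, equation (16) can be further simplified as
\[
\ell_p(D) = c - \frac{n}{2}\Big\{ \log|D| + \sum_{j=1}^{q} \log\lambda_j + q + \mathrm{Tr}\{D^{-1}\} - \sum_{j=1}^{q} \lambda_j \Big\}.
\tag{17}
\]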
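The closed form (16) can be checked against a direct evaluation of the Gaussian log-likelihood at $\hat L = D^{1/2} V_q \Delta$. The NumPy sketch below is illustrative only: it uses a dense SVD in place of the Lanczos partial SVD, drops the constant c, and assumes the top q eigenvalues exceed 1. The function names, the toy data and the choice $D = 0.5\,I$ are assumptions.

```python
import numpy as np

def profile_loglik(Y, d, q):
    """Profile log-likelihood (16) up to the constant c, and the maximizer L_hat.

    d is the diagonal of D. Assumes lambda_1, ..., lambda_q > 1.
    """
    n, p = Y.shape
    W = (Y - Y.mean(axis=0)) / np.sqrt(n * d)    # (1/sqrt(n)) (Y - 1 mu') D^{-1/2}
    s, Vt = np.linalg.svd(W, full_matrices=False)[1:]
    lam = s[:q] ** 2                             # top q eigenvalues of D^{-1/2} S D^{-1/2}
    trace_DinvS = np.sum(W ** 2)                 # Tr{D^{-1} S} = ||W||_F^2
    value = -0.5 * n * (np.log(d).sum() + np.log(lam).sum()
                        + q + trace_DinvS - lam.sum())
    delta = np.sqrt(np.maximum(lam - 1.0, 0.0))
    L_hat = np.sqrt(d)[:, None] * Vt[:q].T * delta   # D^{1/2} V_q Delta
    return value, L_hat

def direct_loglik(Y, L, d):
    """Direct evaluation of (11) up to the constant c; forms p x p matrices."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / n
    Sigma = L @ L.T + np.diag(d)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * (logdet + np.trace(np.linalg.solve(Sigma, S)))

rng = np.random.default_rng(3)
n, p, q = 30, 50, 3                              # toy example with p > n
Y = rng.standard_normal((n, q)) @ rng.standard_normal((q, p)) \
    + 0.7 * rng.standard_normal((n, p))
d = np.full(p, 0.5)

lp, L_hat = profile_loglik(Y, d, q)
ld = direct_loglik(Y, L_hat, d)
```

The two values `lp` and `ld` should agree to numerical precision, while `profile_loglik` never forms the p × p matrix $S$.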
2.2.2 Partial singular value decomposition with Lanczos algorithm
The partial SVD of $\frac{1}{\sqrt{n}}(Y - 1\hat\mu^\top) D^{-1/2}$ is carried out by the Lanczos method, which involves only matrix-vector multiplications and is preferable for large matrices with sparse structure. Specifically, suppose $A$ is an n × p real matrix; the Lanczos algorithm then bidiagonalizes $A$ in the following steps [6]:
Algorithm 1 Lanczos algorithm for the partial SVD of a matrix A
1: Generate $q_1$ as a random vector with $\|q_1\| = 1$;
2: Compute $v_1 = A q_1$ and $p_1 = v_1 / \|v_1\|$;
3: for i = 1 : k − 1
     $u_i = A^\top p_i - \|v_i\| q_i$; $q_{i+1} = u_i / \|u_i\|$;
     $v_{i+1} = A q_{i+1} - \|u_i\| p_i$; $p_{i+1} = v_{i+1} / \|v_{i+1}\|$;
Let $P$ and $Q$ be the matrices with columns $p_1, \cdots, p_k$ and $q_1, \cdots, q_k$ respectively; then $B = P^\top A Q$ is a bidiagonal matrix. We take $k = \max(2q + 1, 20)$ to obtain an accurate approximation of the largest q eigenvalues.
Because of the round-off error associated with the Lanczos vectors $q_i$ and $p_i$, reorthogonalization is required, and several implementations are given in the literature [17, 20, 23]. Furthermore, there are modified Lanczos algorithms with higher numerical stability, such as the restarted Lanczos bidiagonalization methods [3].
Given the computed Lanczos vectors, the largest singular values of $A$ are approximated by those of $B$, which can be obtained as the square roots of the eigenvalues of $B^\top B$, a symmetric tridiagonal matrix whose eigenvalues can be calculated efficiently via the Sturm sequence algorithm [7].
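The bidiagonalization steps can be sketched in NumPy as below. This is an illustrative implementation of Golub-Kahan-Lanczos bidiagonalization with full reorthogonalization, a simpler swapped-in choice than the partial or restarted schemes cited above; the function names and the synthetic test matrix are assumptions, and a dense SVD of the small k × k matrix $B$ replaces the Sturm sequence step.

```python
import numpy as np

def lanczos_bidiag(A, k, rng):
    """Run k steps of Lanczos bidiagonalization; returns P, B, Q with B ~ P' A Q."""
    n, p = A.shape
    P, Q = np.zeros((n, k)), np.zeros((p, k))
    alpha, beta = np.zeros(k), np.zeros(k - 1)
    q0 = rng.standard_normal(p)
    Q[:, 0] = q0 / np.linalg.norm(q0)                # random unit start vector
    v = A @ Q[:, 0]
    alpha[0] = np.linalg.norm(v)
    P[:, 0] = v / alpha[0]
    for i in range(k - 1):
        u = A.T @ P[:, i] - alpha[i] * Q[:, i]
        u -= Q[:, :i + 1] @ (Q[:, :i + 1].T @ u)     # reorthogonalize against Q
        beta[i] = np.linalg.norm(u)
        Q[:, i + 1] = u / beta[i]
        v = A @ Q[:, i + 1] - beta[i] * P[:, i]
        v -= P[:, :i + 1] @ (P[:, :i + 1].T @ v)     # reorthogonalize against P
        alpha[i + 1] = np.linalg.norm(v)
        P[:, i + 1] = v / alpha[i + 1]
    B = np.diag(alpha) + np.diag(beta, 1)            # upper bidiagonal, k x k
    return P, B, Q

def top_singular_values(A, q, seed=0):
    """Approximate the q largest singular values of A from the small matrix B."""
    k = min(max(2 * q + 1, 20), min(A.shape))
    _, B, _ = lanczos_bidiag(A, k, np.random.default_rng(seed))
    return np.linalg.svd(B, compute_uv=False)[:q]

# Synthetic matrix with three dominant singular values plus small noise
rng = np.random.default_rng(4)
n, p, q = 40, 200, 3
U, _ = np.linalg.qr(rng.standard_normal((n, q)))
V, _ = np.linalg.qr(rng.standard_normal((p, q)))
A = U @ np.diag([10.0, 8.0, 6.0]) @ V.T + 0.01 * rng.standard_normal((n, p))

approx = top_singular_values(A, q)
exact = np.linalg.svd(A, compute_uv=False)[:q]
```

Only products with $A$ and $A^\top$ are needed, so $A$ could equally be supplied as a pair of matrix-vector routines; the sketch does not guard against breakdown ($\|u_i\| = 0$ or $\|v_i\| = 0$), which a production implementation would have to handle.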
2.2.3 Likelihood maximization with limited-memory quasi-Newton method
To maximize the profile log-likelihood function (16) with respect to $D$, we use the limited-memory Broyden-Fletcher-Goldfarb-Shanno quasi-Newton method with bound constraints (L-BFGS-B) [14, 25]. From equation (16), the gradient has the analytical form $g = -\frac{n}{2}\,\mathrm{diag}\{\hat L \hat L^\top + D - S\}$, with the sample correlation matrix $S$ and the scaled parameter $D$, whose diagonal elements $d_i$ are restricted to $[0, 1]$, $i = 1, \cdots, p$.
BFGS-type algorithms search for the optimum via a quasi-Newton method that iteratively builds an approximation of the inverse Hessian matrix $H^{-1}$; they are well suited to objective functions whose Hessian is unavailable in analytic form or difficult to compute. Furthermore, the BFGS approximation provides a recurrence relation for $H^{-1}$ that allows matrix-free computation of the product of the inverse Hessian and the gradient without forming $H^{-1}$ explicitly.
However, the BFGS update maintains the full approximated inverse Hessian at each iteration, so the memory requirements become a challenge in high-dimensional optimization problems. The L-BFGS-type algorithms, on the other hand, are a truncated variant of BFGS that stores only the last m updates of the position differences and gradient differences. These are used to compute the approximation of $H_k^{-1}$ at the k-th iteration through the BFGS recursive formula, and the new descent direction is then $-H_k^{-1} g_k$, where $g_k$ is the gradient at the k-th iteration.
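The matrix-free product $-H_k^{-1} g_k$ is usually computed with the standard L-BFGS two-loop recursion, sketched below for a minimization problem (for HDFA this would be the negative profile log-likelihood). The function name `lbfgs_direction` and the quadratic test problem are illustrative assumptions; bound constraints, as handled by L-BFGS-B, are not included.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: return -H^{-1} g from stored (s, y) pairs.

    s_list holds position differences and y_list gradient differences,
    oldest first. No p x p matrix is ever formed.
    """
    r = np.array(g, dtype=float)
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ r)
        alphas.append(a)
        r -= a * y
    # Scale by an initial inverse-Hessian guess gamma * I
    if s_list:
        r *= (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    # Second loop: oldest pair to newest
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return -r

# Quadratic test problem: gradient differences satisfy y = H s
rng = np.random.default_rng(5)
M = rng.standard_normal((6, 6))
H_true = M @ M.T + 6.0 * np.eye(6)               # symmetric positive definite
s_list = [rng.standard_normal(6) for _ in range(3)]
y_list = [H_true @ s for s in s_list]

g = rng.standard_normal(6)
direction = lbfgs_direction(g, s_list, y_list)
secant_check = lbfgs_direction(y_list[-1], s_list, y_list)
```

The implied inverse-Hessian approximation satisfies the secant condition for the most recent pair, so `secant_check` equals `-s_list[-1]`, and `direction` is a descent direction for the gradient `g`. In practice one would typically rely on an existing implementation, for example `optim` with `method = "L-BFGS-B"` in R or `scipy.optimize.minimize` with `method="L-BFGS-B"` in Python, rather than hand-coding the recursion.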
3 Performance illustrations
3.1 Data generation
The performance of the EM and HDFA algorithms is compared using 100 simulation runs for high-dimensional datasets of three sizes: p = 1000 with n = 100, p = 3375 with n = 225, and p = 8000 with n = 400, each generated from two factor models with q = 3 and q = 5. In each simulation, the diagonal entries of the model parameter D are generated from the uniform distribution U(0.2, 0.8), and the elements of L and µ are generated from the standard normal distribution N(0, 1). In addition, we perform data-driven simulations for p = 24547 with n = 160 and n = 180 using the 2-factor models, and for p = 24547 with n = 340 using the 4-factor model, fitted from the fMRI dataset in Section 4.
3.2 Algorithm settings
The set-up of the two algorithms is as follows:
• Data standardization: The data matrix is scaled for convenience, and parameters are estimated for the correlation matrix.
• Parameter initialization: For the EM algorithm, $L$ is initialized with the first q principal components of the data, and the initial values of $D$ come from the absolute differences between the sample correlation matrix and $LL^\top$. For HDFA, only $D$ needs to be initialized.
• Convergence criterion: In HDFA, convergence of the optimization process using the L-BFGS-B method is evaluated via the gradient of the profile log-likelihood. Hence, to assess both algorithms on the same criterion, the convergence of the EM algorithm is determined not only by the relative change in log-likelihood but also by the sum of the absolute values of the gradient elements. Specifically, the EM algorithm is stopped if the relative change in log-likelihood is less than $10^{-6}$ and the sum of the absolute gradient values with respect to $D$ is less than or equal to the final value obtained from HDFA, or if the number of iterations reaches the pre-determined maximum of 5000.
• Optimal factor number determination: For each dataset, we fit models with $1, 2, \cdots, 2q$ factors respectively, and the optimal number of factors is estimated through the extended Bayes Information Criterion (eBIC) [5]: $\mathrm{eBIC} = -2\ell_p + pq\log n$.
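The factor-number selection rule in the last item can be sketched as follows. The function names and, in particular, the log-likelihood values are hypothetical and serve only to show the mechanics of the criterion.

```python
import math

def ebic(profile_loglik, p, q, n):
    """eBIC = -2 * l_p + p * q * log(n) for a fitted q-factor model."""
    return -2.0 * profile_loglik + p * q * math.log(n)

def select_num_factors(logliks, p, n):
    """Pick the number of factors with the smallest eBIC.

    logliks maps each candidate q to the maximized profile log-likelihood.
    """
    scores = {q: ebic(ll, p, q, n) for q, ll in logliks.items()}
    return min(scores, key=scores.get), scores

# Hypothetical maximized log-likelihoods for q = 1, ..., 4 (illustration only)
logliks = {1: -5200.0, 2: -4300.0, 3: -4250.0, 4: -4240.0}
best_q, scores = select_num_factors(logliks, p=100, n=50)
```

In this made-up example the gain in log-likelihood beyond two factors is too small to offset the penalty $pq\log n$, so `best_q` is 2.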
All of our simulations are carried out using R running on a workstation with Intel E5-2640 v3
CPU clocked at 2.60 GHz and cache size: 20MB.
3.3 Result evaluation
In each simulation, the required CPU time, the parameter estimates, the observed log-likelihood values at convergence and the eBIC values are collected. The fitted model with the optimal q factors is evaluated by the relative Frobenius distance for the correlation matrix $R$ and the signal matrix $\Gamma = L^\top D^{-1} L$. For the correlation matrix this is computed as $d_{\hat R} = \|\hat R - R\|_F / \|R\|_F$, using the estimate $\hat R = \hat L \hat L^\top + \hat D$ and the true parameter $R = LL^\top + D$, and analogously for $\Gamma$. We also check the differences in the final log-likelihood values obtained by HDFA and EM.
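For concreteness, the relative Frobenius distance can be computed as in the short sketch below; the function name `relative_frobenius` and the 2 × 2 matrices are illustrative assumptions.

```python
import numpy as np

def relative_frobenius(estimate, truth):
    """Relative Frobenius distance ||estimate - truth||_F / ||truth||_F."""
    return np.linalg.norm(estimate - truth, "fro") / np.linalg.norm(truth, "fro")

R = np.eye(2)                                    # toy "true" correlation matrix
R_hat = np.array([[1.0, 0.1], [0.1, 1.0]])       # toy estimate
dist = relative_frobenius(R_hat, R)              # sqrt(0.02) / sqrt(2) = 0.1
```

The same function applies unchanged to the signal matrix, with `estimate` and `truth` replaced by the estimated and true $\Gamma$.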
3.3.1 Computation time
Figure 1 presents the speedup in elapsed CPU time, expressed as the ratio of the computation time of EM to that of HDFA. HDFA outperforms EM for all fitted factor models in both types of simulation. Specifically, for the three dataset sizes (p = 1000 with n = 100, p = 3375 with n = 225, and p = 8000 with n = 400), EM takes roughly 10 to 70 times as long as HDFA, especially when fitting the correct factor models. For the data-driven cases, the reduction in computational burden from using HDFA is even more pronounced.
[Figure 1, panel (a): speedup (y-axis, 1 to 70) against fitted number of factors (x-axis, 1 to 10) for the simulation conditions n = 100, p = 1000; n = 225, p = 3375; n = 400, p = 8000, each with q = 3 and q = 5. Panel (b): speedup (y-axis, 1 to 120) against fitted number of factors (x-axis, 1 to 8) for the simulation conditions n = 160, p = 24547, q = 2; n = 180, p = 24547, q = 2; n = 340, p = 24547, q = 4.]
Figure 1: Speedup in CPU time of HDFA vs EM for (a) datasets with randomly simulated model parameters and (b) datasets with data-driven model parameters
3.3.2 Estimation accuracy
The final log-likelihood values are almost identical for HDFA and EM, and eBIC selects the correct number of factors in 100 percent of the simulation cases. Furthermore, under the best fitted models, the relative Frobenius distances for the correlation matrix and the signal matrix produced by HDFA and by EM are very close to each other, as indicated by Figure 2, where the two algorithms show no obvious difference in their relative distances from the true parameters.
[Figure 2, panels (a) and (b): relative Frobenius distances for R (y-axis, 0.05 to 0.15) and for Γ (y-axis, 0.1 to 0.4) under the simulation conditions n = 100, p = 1000; n = 225, p = 3375; n = 400, p = 8000, by number of factors (q = 3, q = 5) and algorithm (EM, HDFA). Panels (c) and (d): relative Frobenius distances for R (y-axis, 0.8 to 1.0) and for Γ (y-axis, 0.0 to 0.4) under the simulation conditions n = 180, p = 24547, q = 2; n = 160, p = 24547, q = 2; n = 340, p = 24547, q = 4, by algorithm (EM, HDFA).]
Figure 2: Relative Frobenius distances of the correlation matrix for (a) datasets with randomly simulated model parameters and (c) datasets with data-driven model parameters, and relative Frobenius distances of the signal matrix for (b) datasets with randomly simulated model parameters and (d) datasets with data-driven model parameters
4 Application to fMRI dataset
In this section, we apply exploratory factor analysis to a functional magnetic resonance imaging (fMRI) dataset collected for a human behavior study by Just et al. [8], which focuses on identifying potential biological indicators of suicide risk by classifying groups of people with suicidal ideation through their neural representations of emotion concepts.
4.1 Suicidal ideation dataset
The dataset consists of 1020 normally distributed observations derived from fMRI-based neurological signatures of dimension 50 × 61 × 23, which appear as clusters of voxels located near the ventral attention network (VAN) in the right hemisphere of the human brain. The VAN is one of two sensory orienting systems that reorient attention towards notable stimuli and is closely related to involuntary actions [22].
Data were collected from 34 participants who randomly received 30 stimulus words related to death, negative affect and positive affect (10 words of each type) and were asked to think about the concepts and properties associated with those words. The participants form 3 groups: the 9 suicide attempters (with a history of suicide attempts) and 8 suicidal non-attempters were drawn from a clinically verified population with current suicidal ideation, and the remaining 17 control participants came from a population with no personal or family history of suicide attempt or psychiatric disorder [8].
4.2 Exploratory factor analysis
To compare the 3 groups of participants, we focus on their neural representations of the concepts
associated with both death and negativeness. The fitted factor models from our HDFA algorithm
identify 2 factors for both the suicidal attempters and non-attempters, and 4 factors for the control
group.
Table 1 presents the proportion of total variance σ2 accounted for by each factor and the percentage of absolute loading values larger than 0.4 in each factor. Two observations follow. First, the proportion of sufficiently large absolute loading values in the first factor is higher for the suicidal non-attempters and the control participants than for the suicide attempters. Second, the loading values associated with the second factor are closer to 0 for the control group than for the two suicidal groups.
Group           Factor 1          Factor 2          Factor 3          Factor 4
                % σ2    % >0.4    % σ2    % >0.4    % σ2    % >0.4    % σ2    % >0.4
Attempter       0.036   0.038     0.029   0.022     -       -         -       -
Non-attempter   0.046   0.060     0.031   0.020     -       -         -       -
Control         0.041   0.066     0.024   0.011     0.021   0.016     0.019   0.015
Table 1: The percent of total variance (% σ2) accounted for and the percent of absolute loading values that are bigger than 0.4 (% >0.4) by each factor for the suicidal attempters, non-attempters and the control group.
[Figure 3: eight brain-map panels (a) to (h), with a color scale for loading values from −0.6 to 0.6.]
Figure 3: Visualization on the right hemisphere of human brains with loading values of (a) the first factor for suicidal attempter group, (b) the second factor for suicidal attempter group, (c) the first factor for suicidal non-attempter group, (d) the second factor for suicidal non-attempter group, (e) the first factor for control group, (f) the second factor for control group, (g) the third factor for control group, (h) the fourth factor for control group
5 Discussion
In this paper, we propose a new maximum likelihood method for factor analysis that achieves both sufficient accuracy in parameter estimation and fast convergence via matrix-free computations. Our HDFA algorithm is based on a novel profile log-likelihood function together with a Lanczos method for the partial SVD and a limited-memory quasi-Newton method for numerical optimization. HDFA resolves the computational issues associated with the traditionally used Lawley's algorithm and is capable of dealing with high-dimensional problems with $p \gg n$. Moreover, it exhibits superior performance in terms of convergence speed compared to the EM algorithm.
Possible extensions of HDFA include fitting the factor model for more general distributions,
such as the multivariate projected normal describing directional data on high-dimensional spheres,
which is currently under development, and moreover, for mixtures of factor analyzers applied to
classification problems and clustering analysis. On the other hand, understanding and illustrating
the loading matrices in high-dimensional exploratory factor analysis still remains challenging as
shown in our application results, indicating that it would be worth exploring simpler and more
interpretable forms of the correlation structure for high-dimensional data.
References
[1] Anderson, T. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics. Wiley, 2003.
[2] Rubin, D. B., and Thayer, D. T. EM algorithms for ML factor analysis. Psychometrika 47 (02 1982), 69–76.
[3] Baglama, J., and Reichel, L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM Journal on Scientific Computing 27, 1 (2005), 19–42.
[4] Dempster, A., Laird, N., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39 (01 1977), 1–38.
[5] Gao, X. C., and Song, P. X.-K. Composite likelihood Bayesian information criteria for model selection in high-dimensional data.
[6] Golub, G., and Kahan, W. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2, 2 (1965), 205–224.
[7] Wilkinson, J. H. The calculation of the eigenvectors of codiagonal matrices. Computer Journal 1 (02 1958).
[8] Just, M. A., Pan, L. A., Cherkassky, V. L., McMakin, D. L., Cha, C. B., Nock, M. K., and Brent, D. P. Machine learning of neural representations of suicide and emotion concepts identifies suicidal youth. Nature Human Behaviour 1 (2017), 911–919.
[9] Jöreskog, K. G. Some contributions to maximum likelihood factor analysis. ETS Research Bulletin Series (1966).
[10] Lawley, D. The estimation of factor loadings by the method of maximum likelihood. Proceedings of the Royal Society of Edinburgh 60 (01 2014), 64–82.
[11] Lawley, D. N., and Maxwell, A. E. Factor analysis as a statistical method, 2 ed. New York: American Elsevier Pub. Co, 1971.
[12] Lin, C.-L., Chang, S.-H., Chang, W.-J., and Kuo, Y.-C. Factorial analysis of variables influencing mechanical characteristics of a single tooth implant placed in the maxilla using finite element analysis and the statistics-based Taguchi method. European Journal of Oral Sciences 115 (11 2007), 408–16.
[13] Liu, C., and Rubin, D. Maximum likelihood estimation of factor analysis using the ECME algorithm with complete and incomplete data. Statist. Sin. 8 (08 2002).
[14] Liu, D. C., and Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 45 (1989), 503–528.
[15] Mardia, K. V. Statistics of directional data. Journal of the Royal Statistical Society Series B (Methodological) 37 (1975), 349–371.
[16] Mardia, K. V., Kent, J. T., and Bibby, J. M. Multivariate analysis. New York: Academic Press, 1979.
[17] Larsen, R. M. Lanczos bidiagonalization with partial reorthogonalization. DAIMI Report Series 27 (07 1999).
[18] Ng, C. T., Yau, C. Y., and Chan, N. H. Likelihood inferences for high-dimensional factor analysis of time series with applications in finance. Journal of Computational and Graphical Statistics 24 (11 2014).
[19] Pournara, I., and Wernisch, L. Factor analysis for gene regulatory networks and transcription factor activity profiles. BMC Bioinformatics 8 (2006), 61–61.
[20] Simon, H. D., and Zha, H. Low-rank matrix approximation using the Lanczos bidiagonalization process with applications. SIAM J. Scientific Computing 21 (2000), 2257–2274.
[21] Varadhan, R., and Roland, C. Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scandinavian Journal of Statistics 35 (06 2008), 335–353.
[22] Vossel, S., Geng, J. J., and Fink, G. R. Dorsal and ventral attention systems: distinct neural circuits but collaborative roles. The Neuroscientist 20 (2014), 150–159.
[23] Wu, K., and Simon, H. D. Thick-restart Lanczos method for large symmetric eigenvalue problems. SIAM J. Matrix Analysis Applications 22 (2000), 602–616.
[24] Zhao, J.-H., Yu, P., and Jiang, Q. ML estimation for factor analysis: EM or non-EM? Stat Comput 18 (2008).
[25] Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23 (1994), 550–560.