Matrix--free Likelihood Method for Exploratory Factor ...

Creative Components Iowa State University Capstones, Theses and Dissertations

Spring 2019

Matrix--free Likelihood Method for Exploratory Factor Analysis Matrix--free Likelihood Method for Exploratory Factor Analysis

with High-dimensional Gaussian Data with High-dimensional Gaussian Data

Fan Dai

Follow this and additional works at: https://lib.dr.iastate.edu/creativecomponents

Part of the Multivariate Analysis Commons

Recommended Citation Recommended Citation Dai, Fan, "Matrix--free Likelihood Method for Exploratory Factor Analysis with High-dimensional Gaussian Data" (2019). Creative Components. 452. https://lib.dr.iastate.edu/creativecomponents/452

This Creative Component is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Creative Components by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

http://lib.dr.iastate.edu/

http://lib.dr.iastate.edu/

https://lib.dr.iastate.edu/creativecomponents

https://lib.dr.iastate.edu/theses

https://lib.dr.iastate.edu/theses

https://lib.dr.iastate.edu/creativecomponents?utm_source=lib.dr.iastate.edu%2Fcreativecomponents%2F452&utm_medium=PDF&utm_campaign=PDFCoverPages

http://network.bepress.com/hgg/discipline/824?utm_source=lib.dr.iastate.edu%2Fcreativecomponents%2F452&utm_medium=PDF&utm_campaign=PDFCoverPages

https://lib.dr.iastate.edu/creativecomponents/452?utm_source=lib.dr.iastate.edu%2Fcreativecomponents%2F452&utm_medium=PDF&utm_campaign=PDFCoverPages

mailto:[email protected]

Matrix–free Likelihood Method for Exploratory FactorAnalysis with High-dimensional Gaussian Data

Fan DaiIowa State University, Ames, Iowa

Abstract

This paper proposes a novel profile likelihood method for estimating the covariance pa-rameters in exploratory factor analysis (EFA) with high-dimensional Gaussian data. Byimplementing a Lanczos algorithm and a limited-memory quasi-Newton method, we developa matrix free algorithm (HDFA) which does partial singular value decomposition (partialSVD) for data matrix where number of observations n is typically less than the dimension pand it only requires limited amount of memory during likelihood maximization. We performsimulation study with both the randomly generated models and the data-driven models.Results indicate that HDFA substantially outperforms the EM algorithm in all cases. Fur-thermore, Our algorithm is applied to fit factor models for a fMRI dataset with suicidalattempters, suicidal nonattempters and a control group.

Keywords: Profile likelihood, Partial SVD, Lanczos algorithm, L-BFGS-B, fMRI, Suicidal ideationdata

1

Contents

1 Introduction 3

2 Methodology 42.1 Background and preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.1 Factor model for Gaussian data . . . . . . . . . . . . . . . . . . . . . . . . 42.1.2 Model identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.3 EM algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.4 Convergence issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Profile likelihood for maximum likelihood estimation . . . . . . . . . . . . . . . . 72.2.1 Profile log-likelihood function . . . . . . . . . . . . . . . . . . . . . . . . . 72.2.2 Partial singular value decomposition with Lanczos algorithm . . . . . . . . 102.2.3 Likelihood maximization with limited-memory quasi-Newton method . . . 10

3 Performance illustrations 113.1 Data generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2 Algorithm settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Result evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.3.1 Computation time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.3.2 Estimation accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Application to fMRI dataset 144.1 Suicidal ideation dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.2 Exploratory factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5 Discussion 16

2

1 Introduction

Factor analysis is a popular multivariate statistical technique which reduces the data dimension-

ality by characterizing the dependence using a small number of latent factors [1]. To that end

suppose we have n independently and identically distributed random vectors y1, . . . ,yn from a

p−dimensional multivariate normal distribution with mean vector µ and a covariance matrix Σ.

In the rest of the paper we assume Σ = LL> + D, where L is a p × q matrix with rank q that

describes the amount of variance shared among the p coordinates and D is a diagonal matrix

with positive diagonal entries representing the unique variance part specific to each coordinate

and associated with the random error term. The low dimensional set up (i.e. when p << n) was

first proposed by Lawley [10] and the maximum likelihood estimation of such low dimensional

factor models has been well-studied in [9, 11, 2] and others. Although efficient implementations

of algorithms are available when p << n, (e.g. factanal and fa in R and factoran in Matlab) these

algorithms require full eigen decomposition of a positive definite covariance or correlation matrix,

and hence do not scale well with the dimension p.

However, in the exploratory factor analysis for high-dimensional data, such as analyzing the

transcription factor activity profiles of gene regulatory networks using massive gene expression

datasets [19], constructing a Markowitz minimum variance portfolio from the stock return data by

a high dimensional factor AR model [18] or identifying the contributions of potential characteristics

associated with cast posts in a root filled tooth [12], the number of variables p typically exceed the

sample size n by a huge margin and possesses tremendous difficulties in parameter estimations.

Another way to obtain the MLEs in factor analysis is via EM-type algorithms, such as Rubin

and Thayer’s EM algorithm [2], Liu and Rubin’s ECME type algorithm [13] and Zhao, Yu and

Jiang’s CM algorithm [24], etc. In an EM implementation, latent factors are treated as missing

data [4] updated in the expectation step which requires low-dimensional matrix inversion, and

iterative updates for L and D are available in closed forms. Although the EM algorithms are

applicable when p > n, they suffer from slow and possibly local convergence problems especially

in high-dimensional cases.

In order to handle problems discussed above, we propose a profile likelihood when p >> n

and develop matrix-free computational framework (denoted as HDFA) for maximizing the profile

likelihood function. The key ingredients in our computational framework are (1) a Lanczos al-

3

gorithm that allows us to compute few dominant singular values and vectors of high-dimensional

matrices required to compute the profile log-likelihood and its gradient, and (2) a limited memory

quasi-Newton optimization algorithm that avoids the storage of the large p × p Hessian of the

profile log-likelihood function.

The remainder of this paper is organized as follows. Section 2 first describes the factor model

for Gaussian data and the EM algorithm, and then proposes the profile likelihood and HDFA

algorithm. Section 3 presents simulation studies to compare and evaluate the performance of the

HDFA and the EM algorithm. Section 4 illustrates an exploratory factor analysis on a fMRI sui-

cidal ideation dataset using the HDFA algorithm. Section 5 concludes the paper with a discussion

on possible extensions and future works.

2 Methodology

2.1 Background and preliminaries

In this paper, the data matrix Y = [y>1 , · · · ,y>n ]> consists of independent and normally dis-

tributed samples which share similar patterns through a set of latent random factor scores Z =

[z>1 , · · · , z>n ]>. The Gaussian factor model and EM algorithm for parameter estimation are pre-

sented in the following sections.

2.1.1 Factor model for Gaussian data

Suppose y1, · · · ,yn are i.i.d observed samples from a p-dimensional normal distribution, then, yi = µ+ Lzi + εi

zi ∼ Nq(0, Iq×q), εi ∼ Np(0,D)(1)

where L is a p × q factor loading matrix with q < min(p, n) and (p − q)2 > p + q. D is a p × p

diagonal matrix. zi consists of q unobserved orthogonal factors.

Based on model (1), yi|zi ∼ Np(µ+ Lzi,D) and yi ∼ Np(µ,Σ) with Σ = LL> + D.

4

2.1.2 Model identification

Since LQ and L produce the same likelihood given any orthogonal matrix Q, the factor model

requires specific conditions to be identifiable [15]. Commonly used techniques include constraining

columns in Q to be the eigenvectors of L>D−1L, the signal matrix, such that L>D−1L is diagonal

with decreasing diagonal entries or assigning 0’s to the loading matrix based on prior information.

More conditions can be found in the literature [1] [16].

2.1.3 EM algorithms

The unobserved factors z1, · · · , zn can be treated as missing data and the complete data log-

likelihood function is then derived as,

`C(µ,L,D) = c − n

2log |D| − 1

2

n∑i=1

{(yi − µ− Lzi)>(yi − µ− Lzi)}

= c − n

2log |D| − 1

2Tr{D−1

n∑i=1

(yi − µ)(yi − µ)> − 2D−1Ln∑

i=1

zi(yi − µ)>

+ L>D−1Ln∑

i=1

ziz>i }

(2)

Where c represents the constant term w.r.t µ,L and D.

E-Step computations

The expected complete log-likelihood is given by,

E[`C(µ,L,D|y1 ∼ yn] = c − n

2log |D| − 1

2Tr{D−1

n∑i=1

(yi − µ)(yi − µ)>

− 2D−1Ln∑

i=1

E[zi|yi](yi − µ)>

+ L>D−1Ln∑

i=1

E[ziz>i |yi]}

(3)

By multivariate normal theory, zi|yi ∼ Nq(L>Σ−1(yi − µ), (Iq×q + L>D−1L)−1). So,

E[zi|yi] = L>Σ−1(yi − µ) (4)

5

E[ziz>i |yi] =Var[zi|yi] + E[zi|yi]E[z>i |yi]

=(Iq×q + L>D−1L)−1 + L>Σ−1(yi − µ)(yi − µ)>Σ−1L(5)

M-Step computations

From the marginal distribution of yi, the global MLE of µ is µ̂ =1

n

n∑i=1

yi. Hence, given

current estimates Lt and Dt , the MLEs of Lt+1 and Dt+1 are obtained from maximizing equation

(6),

E[`C(L,D|y1 ∼ yn, µ̂,Lt,Dt] = c − n

2log |D| − n

2Tr{D−1S

−D−1L2

n

n∑i=1

E[zi|yi,Lt,Dt](yi − µ̂)>

+ L>D−1L1

n

n∑i=1

E[ziz>i |yi,Lt,Dt]}

(6)

Where S =1

n

n∑i=1

(yi − µ̂)(yi − µ̂)> is the sample covariance matrix.

Then, conditional on µ̂, Lt and Dt, the MLE of Lt+1 is given by,

L̂t+1 =(1

n

n∑i=1

(yi − µ̂)E[z>i |yi,Lt,Dt])(1

n

n∑i=1

E[ziz>i |yi,Lt,Dt])

−1

=SΣ−1t Lt((Iq×q + L>t D−1t Lt)−1 + L>t Σ−1t SΣ−1t Lt)

−1

(7)

Where Σt = LtL>t + Dt. By Woodbury matrix identity, Σ−1t Lt = D−1t Lt(Iq×q + L>t D−1t Lt)

−1,

then equation (7) can be simplified as,

L̂t+1 =SD−1t Lt(Iq×q + L>t D−1t Lt)−1((Iq×q + L>t D−1t Lt)

−1 + L>t Σ−1t SD−1t Lt(Iq×q + L>t D−1t Lt)−1)−1

=SD−1t Lt(Iq×q + L>t Σ−1t SD−1t Lt)−1

(8)

Next, conditional on µ̂ Lt, Dt and L̂t+1, the MLE of Dt+1 is given by,

D̂t+1 =diag{

S− 2

n

n∑i=1

(yi − µ̂)E[z>i |yi,Lt,Dt]L̂>t+1 + L̂t+1

1

n

n∑i=1

E[ziz>i |yi,Lt,Dt])

−1L̂>t+1

}(9)

6

Substitute equation (7),

D̂t+1 =diag{

S− 2

n

n∑i=1

(yi − µ̂)E[z>i |yi,Lt,Dt]L̂>t+1 +

1

n

n∑i=1

(yi − µ̂)E[z>i |yi,Lt,Dt]L̂>t+1

}=diag

{S− 2SΣ−1t LtL̂

>t+1

} (10)

Where diag{A} refers to the diagonal elements of matrix A.

2.1.4 Convergence issues

The EM algorithm involves inexpensive calculations and requires only matrix-vector products

and an inverse of a low-dimensional q × q matrix∑n

i=1 E[ziz>i |yi] which is preferable in handling

high-dimensional problems. However, the EM-type methods are known for slow convergence that

depends on initial parameter values [4]. Possible solutions under specific model assumptions are

discussed in the literature, such as the squared iterative method (SQUAREM) which applies an

extrapolation schemes to achieve superlinear convergence [21].

2.2 Profile likelihood for maximum likelihood estimation

In this section, we illustrate a new profile likelihood for maximum likelihood estimation and develop

our matrix-free high dimensional factor analysis algorithm (HDFA). The identifiability condition

utilized here is that the signal matrix L>D−1L is diagonal with decreasing entries.

2.2.1 Profile log-likelihood function

Given the MLE of µ, the marginal log-likelihood function is,

`n(µ,L,D) = c − n

2log |LL> + D| − n

2Tr{(LL> + D)−1S} (11)

Where, as before, S =1

n

n∑i=1

(yi − µ̂)(yi − µ̂)> and µ̂ =1

n

n∑i=1

yi.

7

Then, the MLEs of L and D are solved from, L(Iq×q + L>D−1L) = SD−1L

D = diag{S− LL>}(12)

From L(Iq×q + L>D−1L) = SD−1L, we have,

D−1/2L(Iq×q + (D−1/2L)>D−1/2L) = D−1/2SD−1/2D−1/2L

Let the spectral decomposition of D−1/2SD−1/2 = VΛV>. Then the above formula implies

that

D−1/2L(D−1/2L)> = D−1/2SD−1/2 − Ip×p = V(Λ− Ip×p)V> (13)

Equation (12) and (13) give Lawley’s algorithm [10] that recursively updates L and D until

convergences. However, for high-dimensional datasets, manipulating the p × p matrix S and the

eigen decomposition of D−1/2SD−1/2 become computationally expensive and are problematic for

singular cases. In order to handle such problems, we propose a new profile log-likelihood as follows.

Let λ1 ≥ λ2 ≥, · · · ,≥ λp denote the eigenvalues of D−1/2SD−1/2 and form the decreasing

diagonal elements in Λ with Λ =

Λq 0

0 Λm

. Then, Λq contains the largest q eigenvalues λ1 ≥

λ2 ≥, · · · ,≥ λq and the corresponding q eigenvectors form the columns in matrix Vq so that

V = [Vq,Vm]. Equation (13) shows that,

L = D1/2Vq[(Λq − I)+]1/2 (14)

Given the n × p data matrix Y, since S =1

n(Y − 1µ̂>)(Y − 1µ̂>)>, the square roots of

λ1, · · · , λq are the largest q singular values of1√n

(Y−1µ̂>)D−1/2 and columns in Vq are then the

corresponding q right-singular vectors. Hence, conditional on D, L is maximized at L̂ = D1/2Vq∆,

where ∆ is a diagonal matrix with diagonal elements max[(λi − 1, 0)]1/2, i = 1, · · · , q.

From the construction of Vq and Vm, we have the following results,

V>q Vq = Iq×q, V>mVm = I(p−q)×(p−q), VqV>q + VmV>m = Ip×p, V>q Vm = 0

8

And hence,

(VqΛqV>q + VmV>m)(VqΛ

−1q V>q + VmV>m) = Ip×p

now let A = Vq∆2V>q , then AA = Vq∆

4V>q and,

|A + Ip×p| =|(A + Ip×p)A|/|A|

=|Vq(∆4 + ∆2)V>q |/|Vq∆

2V>q |

=|∆2 + Iq×q| =q∏

j=1

λj

(A + Ip×p)−1 =(Vq∆

2V>q + VqV>q + VmV>m)−1

=(Vq(∆2 + I)V>q + VmV>m)−1

=(VqΛqV>q + VmV>m)−1

=VqΛ−1q V>q + VmV>m

(15)

Based on above results, the profile log-likelihood is given by,

`p(D) = c − n

2log |L̂L̂> + D| − n

2Tr{(L̂L̂> + D)−1S}

= c − n

2

{log |D1/2(Vq∆

2V>q + Ip×p)D1/2|+ Tr{(D1/2(Vq∆

2V>q + Ip×p)D1/2)−1S}

}= c − n

2

{log |D|+ log |Vq∆

2V>q + Ip×p|+ Tr{(VqΛ−1q V>q + VmV>m)D−1/2SD−1/2}

}= c − n

2

{log |D|+

q∑j=1

log λj + Tr{(Λ−1q V>q VΛV>Vq}+ Tr{V>mVΛV>Vm}}

= c − n

2

{log |D|+

q∑j=1

log λj + Tr{Λ−1q Λq}+ Tr{Λm}}

= c − n

2

{log |D|+

q∑j=1

log λj + q + Tr{D−1S} −q∑

j=1

λj

}(16)

For standardized data matrix Y and the sample correlation matrix S, equation (16) can be further

simplified as,

`p(D) = c − n

2

{log |D|+

q∑j=1

log λj + q + Tr{D−1} −q∑

j=1

λj

}(17)

9

2.2.2 Partial singular value decomposition with Lanczos algorithm

The partial SVD of1√n

(Y − 1µ̂>)D−1/2 is carried out by Lanczos method which only involves

matrix vector multiplications and is preferable for large matrices with sparse structures. Specifi-

cally, suppose A is n×p real matrix, the Lanczos algorithm provides easy steps for bidiagonalizing

A as follows [6],

Algorithm 1 Lanczos Algorithm for PSVD of Matrix A

1: Generate q1 as a random vector with ||q1|| = 1;2: Compute v1 = Aq1 and p1 = v1/||v1||;3: for i = 1 : k

u>i = p>i A− ||vi||qi; qi = ui/||ui||;for j = 2 : kvj = Aqj − ||uj||pj−1; pj = vj/||vj||;

Let P and Q be matrices with columns p1, · · · ,pk and q1, · · · ,qk respectively, then B =

P>AQ gives an bidiagonal matrix. k = max(2q+ 1, 20) to have an accurate approximation of the

largest q eigenvalues.

Due to the round-off error associated with the Lanczos vectors qi and pi, reorthogonalization

of them are required and several implementations are given in the literature [17],[20] and [23].

Furthermore, there are modified Lanczos algorithms with higher numerical stability, such as the

restarted Lanczos bidiagonalization methods [3].

Given those computed lanczos vectors, the largest k singular values in A is the same as those

in B which can be obtained from the square roots of eigenvalues in B>B, a tridiagonal hermitian

matrix where the eigenvalues can be calculated effectively via the Sturm sequence algorithm [7].

2.2.3 Likelihood maximization with limited-memory quasi-Newton method

To maximize the profile log-likelihood function (16) with respect to D, we implement the Limited-

memory Broyden-Fletcher-Goldfarb-Shanno quasi-newton method (L-BFGS-B) [14][25]. From

equation (16), the gradient is in an analytical form of g = −n2

diag{L̂L̂>+D−S} with the sample

correlation matrix S and scaled parameter D of which the diagonal elements are restricted to

[0, 1], i = 1, · · · , p.

10

The BFGS-type algorithms searches the optimization solution via quasi-Newton method which

iteratively produces an approximation of the inverse Hessian matrix H−1 and is well suited for

objective functions where the analytic form of the Hessian is not available or difficult to compute.

Furthermore, the BFGS approximation provides a recurrence relation for H−1 that allows matrix-

free computations for the product of the inverse Hessian and the gradient without calculating the

exact form of matrix H−1.

However, the BFGS update maintains the approximated inverse Hessian information for each

iteration, with which the memory requirements become a challenge in high-dimensional optimiza-

tion problems. The L-BFGS type algorithms, on the other hand, are designed as a truncated

variation of BFGS that only store the last few m updates of the position differences and gradi-

ent differences in order to compute the approximation of H−1k at the k-th iteration based on the

BFGS recursive formula and then produce the new descent direction from −H−1k gk where gk is

the gradient at the k-th iteration.

3 Performance illustrations

3.1 Data generation

Performance of the EM and HDFA algorithms are compared using 100 simulation running for

high-dimensional datasets of three sizes: p = 1000 with n = 100, p = 3375 with n = 225, and

p = 8000 with n = 400 which are evaluated at two factor models with q = 3 and q = 5. In

each simulation, diagonal entries in model parameter D are generated from uniform distribution

U(0.2, 0.8), elements in L and µ are generated from standard normal distribution N (0, 1). In

addition, we also performed a data-driven simulation for p = 24547, n = 160, 180 with the 2-

factor models and p = 24547, n = 340 with the 4-factor models fitted from the fMRI dataset in

Section 4.

3.2 Algorithm settings

The set-up of the two algorithms are as follows:

• Data standardization: The scaled data matrix are implemented for convenience and param-

11

eters are estimated for the correlation matrix.

• Parameter initialization: For EM algorithm, L is started as the first q principle components

for the data and the initial values of D come from the absolute differences between the

sample correlation matrix and LL>. For HDFA, only initialization of D is required.

• Convergence criterion: In HDFA, convergence of the optimization process using the L-BFGS-

B method is evaluated via the gradient of the profile log-likelihood. Hence, in order to be

assessed on the same criterion, the convergence of the EM algorithm is determined not only

by the relative change in log-likelihood but also by the summed absolute values of gradient

elements. Specifically, the EM algorithm is stopped if the relative log-likelihood change is

less than 10−6 and the sum of the absolute gradient values of D is less or equal to the

final value obtained from HDFA , or the number of iterations reaches the pre-determined

maximum of 5000.

• Optimal factor number determination: For each dataset, we fit models with 1, 2, · · · , 2q

factors respectively and the optimal number of factors is estimated through the extended

Bayes Information Criterion (eBIC) [5]: eBIC = −2`p + pq log n.

All of our simulations are carried out using R running on a workstation with Intel E5-2640 v3

CPU clocked at 2.60 GHz and cache size: 20MB.

3.3 Result evaluation

In each simulation, the required CPU time, parameter estimates and the observed log-likelihood

values at convergence along with the eBIC values are collected. The fitted model with the optimal

q factors is evaluated by the relative Frobenius distance for the correlation matrix R and the signal

matrix Γ = L>D−1L, which are computed as dR̂ =||R̂−R||F||R||F

using estimates R̂ = L̂L̂> + D̂

and true parameters R = LL>+D. We also check the differences in the final log-likelihood values

obtained by HDFA and EM.

3.3.1 Computation time

Figure 1 presents the speedup in the CPU elapsed time in terms of the ratio of computation time

for EM over that for HDFA. It can be seen that HDFA outperforms EM over all fitted factor models

12

for both the two types of simulations. Specifically, for the three kinds of datasets with dimensions

p = 1000 with n = 100, p = 3375 with n = 225 and p = 8000 with n = 400, computations in

EM cost 10 ∼ 70 times as much as that in HDFA, especially when fitting with the correct factor

models. And for data-driven cases, the advantage of using HDFA is more obvious in reducing the

computation burden.

1

10

20

30

40

50

60

70

1 2 3 4 5 6 7 8 9 10Fitted number of factors

Spe

edup

Simulation conditionsn = 100, p = 1000n = 225, p = 3375n = 400, p = 8000

Number of factorsq = 3q = 5

(a)

1

20

40

60

80

100

120

1 2 3 4 5 6 7 8Fitted number of factors

Spe

edup

Simulation conditionsn = 160, p = 24547, q = 2n = 180, p = 24547, q = 2n = 340, p = 24547, q = 4

(b)

Figure 1: Speedup in CPU time of HDFA vs EM for (a) datasets with randomly simulated modelparameters and (b) datasets with data-driven model parameters

3.3.2 Estimation accuracy

The final log-likelihood values turn out to be almost the same for HDFA and EM and eBIC picks

the right number of factors in 100 percent accuracy over all the simulation cases. Furthermore,

under the best fitted models, the relative Frobenius distances for correlation matrix and the signal

13

matrix produced by HDFA and by EM are very closed to each other, which are indicated by Figure

2 where there is no obvious relative differences between the estimators and the true parameters.

0.05

0.10

0.15

n=100, p=1000 n=225, p=3375 n=400, p=8000Simulation conditions

Rel

ativ

e F

robe

nius

dis

tanc

es fo

r R Number of factors q = 3

q = 5 Algorithms EMHDFA

(a)

0.1

0.2

0.3

0.4

n=100, p=1000 n=225, p=3375 n=400, p=8000Simulation conditions

Rel

ativ

e F

robe

nius

dis

tanc

es fo

r Γ Number of factors q = 3

q = 5 Algorithms EMHDFA

(b)

0.8

0.9

1.0

n=180, p=24547, q=2 n=160, p=24547, q=2 n=340, p=24547, q=4Simulation cases

Rel

ativ

e F

robe

nius

dis

tanc

es fo

r R Algorithms EM

HDFA

(c)

0.0

0.1

0.2

0.3

0.4

n=180, p=24547, q=2 n=160, p=24547, q=2 n=340, p=24547, q=4Simulation conditions

Rel

ativ

e F

robe

nius

dis

tanc

es fo

r Γ Algorithms EM

HDFA

(d)

Figure 2: Relative Frobenius distances of correlation matrix for (a) datasets with randomly simu-lated model parameters and (c) datasets with data-driven model parameters, and Relative Frobe-nius distances of signal matrix for (b) datasets with randomly simulated model parameters and(d) datasets with data-driven model parameters

4 Application to fMRI dataset

In this section, we apply exploratory factor analysis to a functional magnetic resonance imaging

(fMRI) dataset collected for a human behavior study done by Just et al.[8] that focuses on iden-

tifying potential biological indicators of suicide risk by classifying the suicidal ideation groups of

people through their neural representations of emotion concepts.

14

4.1 Suicidal ideation dataset

The dataset consists of 1020 normally distributed observations derived from fMRI-based neurolog-

ical signatures in the dimensions of 50× 61× 23, which are shown as clusters of voxels distributed

over locations near the ventral attention network (VAN) on the right hemisphere of human brain,

where the VAN is one of two sensory orienting systems that reorient attention towards notable

stimuli and is closely related to involuntary actions [22].

. Data were collected from 34 participants who randomly received 30 stimulus word related

to death, negative effect and positive effect (10 words in each type) and were asked to provide

thoughts on the concepts and properties associated with those words. The participants are formed

of 3 groups where the 9 suicidal attempters (with a history of suicide attempts) and 8 suicidal

non-attempters were obtained from a clinically verified population with current suicidal ideation

and the rest 17 control participants came from the population with no personal or family history

of suicide attempt or psychiatric disorder [8].

4.2 Exploratory factor analysis

To compare the 3 groups of participants, we focus on their neural representations of the concepts

associated with both death and negativeness. The fitted factor models from our HDFA algorithm

identify 2 factors for both the suicidal attempters and non-attempters, and 4 factors for the control

group.

Table 1 presents the proportion of total variance σ2 accounted for by each factor and the

percent of absolute loading values that are bigger than 0.4 in each factor. It can be concluded

that first, the proportion of sufficient large absolute loading values in the first factor for suicidal

non-attempters and participants in control group are higher than that for the suicidal attempters.

Second, the loading values associated with the second factor are more close to 0 for control group

compared with the two suicidal groups.

15

GroupFactor 1 Factor 2 Factor 3 Factor 4

% σ2 % >0.4 % σ2 % >0.4 % σ2 % >0.4 % σ2 % >0.4

Attempter 0.036 0.038 0.029 0.022 - - - -Non-attempter 0.046 0.060 0.031 0.020 - - - -Control 0.041 0.066 0.024 0.011 0.021 0.016 0.019 0.015

Table 1: The percent of total variance (% σ2) accounted for and the percent of absolute loadingvalues that are bigger than 0.4 (% >0.4) by each factor for the suicidal attempters, non-attemptersand the control group.

(a) (b) (c) (d)

(e) (f) (g) (h)

−0

.6−

0.4

−0

.20

.00

.20

.40

.6

Figure 3: Visualization on the right semipshere of human brains with loading values of (a) thefirst factor for suicidal attempter group, (b) the second factor for suicidal attempter group, (c)the first factor for suicidal non-attempter group, (d) the second factor for suicidal non-attemptergroup, (e) the first factor for control group, (f) the second factor for control group, (g) the thirdfactor for control group, (h) the fourth factor for control group

5 Discussion

In this paper, we propose a new maximum likelihood method for factor analysis that achieves both

a sufficient accuracy in parameter estimations and fast convergence via matrix-free computations.

Our HDFA algorithm is based on a novel profiled log-likelihood function as well as a Lanczos

method for partial SVD and a limited-memory quasi-Newton method for numerical optimization.

HDFA fixes the computation issues associated with the traditionally used Lawley’s algorithm and

16

is capable of dealing with high-dimensional problems with p >> n. Moreover, it exhibits superior

performance in terms of convergence speed compared to the EM algorithm.

Possible extensions of HDFA include fitting the factor model for more general distributions,

such as the multivariate projected normal describing directional data on high-dimensional spheres,

which is currently under development, and moreover, for mixtures of factor analyzers applied to

classification problems and clustering analysis. On the other hand, understanding and illustrating

the loading matrices in high-dimensional exploratory factor analysis still remains challenging as

shown in our application results, indicating that it would be worth exploring simpler and more

interpretable forms of the correlation structure for high-dimensional data.

17

References

[1] Anderson, T. An Introduction to Multivariate Statistical Analysis. Wiley Series in Proba-

bility and Statistics. Wiley, 2003.

[2] B. Rubin, D., and T. Thayer, D. Em algorithms for ml factor analysis. Psychometrika

47 (02 1982), 69–76.

[3] Baglama, J., and Reichel, L. Augmented implicitly restarted lanczos bidiagonalization

methods. SIAM Journal on Scientific Computing 27, 1 (2005), 19–42.

[4] Dempster, A., Laird, N., and B. Rubin, D. Maximum likelihood from incomplete data

via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39

(01 1977), 1–38.

[5] Gao, X. C., and Song, P. X.-K. Composite likelihood bayesian information criteria for

model selection in high-dimensional data.

[6] Golub, G., and Kahan, W. Calculating the singular values and pseudo-inverse of a

matrix. Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical

Analysis 2, 2 (1965), 205–224.

[7] H. Wilkinson, J. The calculation of the eigenvectors of codiagonal matrices. Computer

Journal 1 (02 1958).

[8] Just, M. A., Pan, L. A., Cherkassky, V. L., McMakin, D. L., Cha, C. B., Nock,

M. K., and Brent, D. P. Machine learning of neural representations of suicide and emotion

concepts identifies suicidal youth. Nature Human Behaviour 1 (2017), 911–919.

[9] Jreskog, K. G. Some contributions to maximum likelihood factor analysis*. ETS Research

Bulletin Series (1966).

[10] Lawley, D. The estimation of factor loadings by the method of maximum likelihood.

Proceedings of the Royal Society of Edinburgh 60 (01 2014), 64–82.

[11] Lawley, D. N., and Maxwell, A. E. Factor analysis as a statistical method, 2 ed. New

York : American Elsevier Pub. Co, 1971.

18

[12] Lin, C.-L., Chang, S.-H., Chang, W.-J., and Kuo, Y.-C. Factorial analysis of variables

influencing mechanical characteristics of a single tooth implant placed in the maxilla using

finite element analysis and the statistics-based taguchi method. European journal of oral

sciences 115 (11 2007), 408–16.

[13] Liu, C., and Rubin, D. Maximum likelihood estimation of factor analysis using the ecme

algorithm with complete and incomplete data. Statist. Sin. 8 (08 2002).

[14] Liu, D. C., and Nocedal, J. On the limited memory bfgs method for large scale opti-

mization. Math. Program. 45 (1989), 503–528.

[15] Mardia, K. V. Statistics of directional data. Journal of the royal statistical society series

b-methodological 37 (1975), 349–371.

[16] Mardia, K. V., Kent, J. T., and Bibby, J. M. Multivariate analysis. New York:

Academic Press, 1979.

[17] Munk Larsen, R. Lanczos bidiagonalization with partial reorthogonalization. DAIMI

Report Series 27 (07 1999).

[18] Ng, C. T., Yau, C. Y., and Hang Chan, N. Likelihood inferences for high-dimensional

factor analysis of time series with applications in finance. Journal of Computational and

Graphical Statistics 24 (11 2014).

[19] Pournara, I., and Wernisch, L. Factor analysis for gene regulatory networks and

transcription factor activity profiles. BMC Bioinformatics 8 (2006), 61–61.

[20] Simon, H. D., and Zha, H. Low-rank matrix approximation using the lanczos bidiagonal-

ization process with applications. SIAM J. Scientific Computing 21 (2000), 2257–2274.

[21] Varadhan, R., and ROLAND, C. Simple and globally convergent methods for acceler-

ating the convergence of any em algorithm. Scandinavian Journal of Statistics 35 (06 2008),

335–353.

[22] Vossel, S., Geng, J. J., and Fink, G. R. Dorsal and ventral attention systems: distinct

neural circuits but collaborative roles. The Neuroscientist 20 (2014), 150–159.

19

[23] Wu, K., and Simon, H. D. Thick-restart lanczos method for large symmetric eigenvalue

problems. SIAM J. Matrix Analysis Applications 22 (2000), 602–616.

[24] Zhao, JH., Y. P., and Jiang, Q. Ml estimation for factor analysis: Em or non-em? Stat

Comput 18 (2008).

[25] Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. Algorithm 778: L-bfgs-b: Fortran

subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw. 23

(1994), 550–560.

20

Matrix--free Likelihood Method for Exploratory Factor ...

Documents

Transcript of Matrix--free Likelihood Method for Exploratory Factor ...