Sparse latent factor regression models for genome-wide and epigenome-wide
association studies
Basile Jumentier1,3, Kevin Caye1, Barbara Heude2, Johanna Lepeule3,*, Olivier François1,*
Authors’ affiliations:
1 Université Grenoble-Alpes, Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, 38000 Grenoble, France.
2 Université de Paris, CRESS, Inserm, INRAE, F-75004 Paris, France.
3 Université Grenoble-Alpes, Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institute for Advanced Biosciences, INSERM U 1209, CNRS UMR 5309, 38000 Grenoble, France.
* Corresponding authors: [email protected], [email protected]
bioRxiv preprint doi: https://doi.org/10.1101/2020.02.07.938381; this version posted February 7, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).
Abstract
Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to remove variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. In this study, we introduced least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. Computer simulations provided evidence that sparse latent factor regression models achieve higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator (LASSO) and a Bayesian sparse linear mixed model (BSLMM). Additional simulations based on real data showed that sparse latent factor regression models were more robust to departure from the generative model than non-sparse approaches, such as surrogate variable analysis (SVA) and other methods. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while avoiding multiple testing problems. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
1 Introduction
Association studies represent one of the most powerful tools to identify genomic variation correlating with disease states, exposure levels or phenotypes. Those studies are divided into several categories according to the nature of the genomic markers evaluated. For example, genome-wide association studies (GWAS) focus on single-nucleotide polymorphisms in different individuals to estimate disease allele effects (Balding, 2006), while epigenome-wide association studies (EWAS) measure epigenetic marks, such as DNA methylation levels, to derive associations between epigenetic variation and exposure levels or to assess effects on phenotypic traits (Rakyan et al., 2011). Despite their success in identifying the genetic architecture of phenotypic traits or genomic targets of exposure, association studies are plagued with the problem of confounding, which arises when unobserved variables correlate with the variable of interest and with the genomic markers simultaneously (Wang et al., 2017).
Historical approaches to the confounding issue remove hidden confounders by considering corrections for inflation (Devlin and Roeder, 1999) and empirical null-hypothesis testing methods (Efron, 2004). Alternative approaches evaluate hidden confounders by using linear combinations of observed variables, often called factors. In GWAS, a frequently used factor approach consists of computing the largest principal components of the genotype matrix, and including them as covariates in linear regression models (Price et al., 2006). The variable of interest may, however, be collinear with the largest principal components, and removing their effects can result in loss of statistical power. To increase power, methods based on latent factor regression models have been proposed (Leek and Storey, 2007; Carvalho et al., 2008). Latent factor regression models employ deconvolution methods in which unobserved variables, including batch effects, individual ancestry or tissue cell-type composition, are integrated in the regression model by using latent factors. In these models, effect sizes and latent factors are estimated jointly. The latent factor regression framework encompasses several methods, which include surrogate variable analysis (SVA, Leek and Storey (2007)), latent factor mixed models (LFMM, Frichot et al. (2013)), residual principal component analysis (Kalaitzis and Lawrence, 2012), and confounder adjusted testing and estimation (CATE, Wang et al. (2017)). Each method has specific merits relative to some category of association study, and the performances of the methods have been extensively debated in recent surveys (for example, see Kaushal et al. (2017)).
A property of many latent factor regression models is to use regularization parameters inducing constraints on effect size estimates. Among those methods, sparse regression models suppose that a relatively small proportion of all genomic variables correlate with the variable of interest or affect the phenotype, and evaluate associations while avoiding multiple testing problems (Tibshirani, 1996; Hoggart et al., 2008; Wu et al., 2009). Sparse regression models have been coupled with linear mixed models to combine the benefits of both for polygenic trait studies in the Bayesian sparse linear mixed model (BSLMM) (Zhou et al., 2012, 2013). Sparse regression models can include confounding factors, but these factors are usually estimated separately from effect sizes. In this study, we introduce least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. We estimate effect sizes based on regularized least-squares methods with L1 and nuclear norm penalties. Thus our method allows identifying non-null effect sizes without the use of multiple statistical tests. We refer to our models as sparse latent factor mixed models or sparse LFMM. We present estimation algorithms for sparse LFMM and theoretical results in the next section. Then we compare the performances of sparse LFMM with other sparse regression models (LASSO, BSLMM), and with non-sparse regression models (SVA, CATE, LFMM). To illustrate our approach, we used sparse LFMM to perform a GWAS of flowering time for the plant Arabidopsis thaliana and to perform an epigenome-wide association study (EWAS) of smoking status in pregnant women.
2 Latent factor regression models
2.1 Models
Latent factor regression models evaluate associations between the elements of a response matrix, $\mathbf{Y}$, and variables of interest, called primary variables, $\mathbf{X}$, measured for $n$ individuals. The response matrix records $p$ markers, which can represent any type of omic data (genotypes, DNA methylation, etc.), collected for the individuals. The $\mathbf{X}$ matrix can also incorporate nuisance variables such as observed confounders (age, sex, etc.), and its dimension is $n \times d$, where $d$ represents the total number of primary and nuisance variables. Latent factor regression models are regression models combining fixed and latent effects as follows
$$\mathbf{Y} = \mathbf{X}\mathbf{B}^T + \mathbf{W} + \mathbf{E}. \tag{1}$$
Fixed effect sizes are recorded in the $\mathbf{B}$ matrix, which has dimension $p \times d$. The $\mathbf{E}$ matrix represents residual errors, and has the same dimension as the response matrix. The matrix $\mathbf{W}$ is a latent matrix of rank $K$, defined by $K$ latent factors (Leek and Storey, 2007; Frichot et al., 2013; Wang et al., 2017). The value of $K$ is unknown, and it is generally determined by model choice or cross-validation procedures. The $K$ latent factors, $\mathbf{U}$, are defined from the singular value decomposition of the latent matrix
$$\mathbf{W} = \mathbf{U}\mathbf{V}^T,$$
where $\mathbf{V}$ is a $p \times K$ matrix of loadings, so that $\mathbf{V}^T$ has dimension $K \times p$ (Eckart and Young, 1936). The matrices $\mathbf{U}$ and $\mathbf{V}$ are unique up to a change of sign.
Naive statistical estimates for the $\mathbf{B}$ and $\mathbf{W}$ matrices in equation (1) could be obtained through the minimization of a classical least-squares loss function
$$\mathcal{L}(\mathbf{B}, \mathbf{W}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2, \tag{2}$$
where $\|\cdot\|_F$ is the Frobenius matrix norm. A minimum value of the loss function is attained when $\mathbf{W}$ is computed as the rank $K$ singular value decomposition of $\mathbf{Y}$. In this case, the $\mathbf{B}$ matrix can be obtained from the estimates of a linear regression of the residual matrix $(\mathbf{Y} - \mathbf{W})$ on $\mathbf{X}$. To motivate the introduction of regularization terms in the loss function, we remark that the interpretation of latent factors obtained from this solution as confounder estimates may be incorrect, because it fails to include any information on the primary variable, $\mathbf{X}$. Assuming that latent factors are computed only from the response matrix contradicts the definition of confounding variables (Wang et al., 2017). In addition, this definition is problematic, because it does not lead to a unique minimum of the loss function. To see it, consider any matrix $\mathbf{P}$ with dimensions $d \times K$ and check that
$$\|\mathbf{Y} - (\mathbf{U} + \mathbf{X}\mathbf{P})\mathbf{V}^T - \mathbf{X}(\mathbf{B}^T - \mathbf{P}\mathbf{V}^T)\|_F^2 = \|\mathbf{Y} - \mathbf{U}\mathbf{V}^T - \mathbf{X}\mathbf{B}^T\|_F^2.$$
As a consequence, $\mathbf{B}$ and $(\mathbf{B} - \mathbf{V}\mathbf{P}^T)$ correspond to valid minima, and there is an infinite space of possible solutions. To conclude, the loss function needs to be modified in order to ensure that $\mathbf{W}$ depends on both $\mathbf{Y}$ and $\mathbf{X}$, and to enable the computation of well-defined solutions.
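The non-uniqueness argument above can be checked numerically. The following is a minimal sketch, not part of the original study: the matrix sizes, the random seed, and the helper name `ls_loss` are illustrative assumptions, and the shift matrix P is taken to have dimension d × K so that XP is conformable with the factor matrix U.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d, K = 50, 200, 1, 3            # illustrative sizes (assumptions)

X = rng.normal(size=(n, d))           # primary variable(s), n x d
U = rng.normal(size=(n, K))           # latent factors, n x K
V = rng.normal(size=(p, K))           # loadings, p x K
B = rng.normal(size=(p, d))           # effect sizes, p x d
Y = X @ B.T + U @ V.T + 0.1 * rng.normal(size=(n, p))

def ls_loss(Y, X, B, W):
    """Unpenalized least-squares loss ||Y - W - X B^T||_F^2."""
    return np.linalg.norm(Y - W - X @ B.T, "fro") ** 2

P = rng.normal(size=(d, K))           # arbitrary d x K matrix
U_alt = U + X @ P                     # factors shifted along X
B_alt = B - V @ P.T                   # compensating change in effect sizes

loss_ref = ls_loss(Y, X, B, U @ V.T)
loss_alt = ls_loss(Y, X, B_alt, U_alt @ V.T)
print(loss_ref, loss_alt)             # the two losses agree up to rounding
```

Both parameter sets give the same fit although the effect sizes differ, which is why the unpenalized loss cannot identify B on its own.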
2.2 Sparse estimation algorithms
L1-regularized least-squares problem. To solve the problems outlined in the above section, a sparse regularization approach is considered. This approach introduces penalties based on the L1 norm of the regression coefficients and on the nuclear norm of the latent matrix
$$\mathcal{L}_{\mathrm{sparse}}(\mathbf{W}, \mathbf{B}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2 + \mu\|\mathbf{B}\|_1 + \gamma\|\mathbf{W}\|_*, \quad \mu, \gamma > 0, \tag{3}$$
where $\|\mathbf{B}\|_1$ denotes the L1 norm of $\mathbf{B}$, $\mu$ is an L1 regularization parameter, $\mathbf{W}$ is the latent matrix, $\|\mathbf{W}\|_*$ denotes its nuclear norm, and $\gamma$ is a regularization parameter for the nuclear norm. The L1 penalty induces sparsity on the fixed effects (Tibshirani, 1996), and corresponds to the prior information that not all response variables may be associated with the primary variables. More specifically, the prior implies that a restricted number of rows of the effect size matrix $\mathbf{B}$ are non-zero. The second regularization term is based on the nuclear norm, and it is introduced to penalize large numbers of latent factors. With these penalty terms, $\mathcal{L}_{\mathrm{sparse}}(\mathbf{W}, \mathbf{B})$ is a convex function, and convex minimization algorithms can be applied to obtain estimates of $\mathbf{B}$ and $\mathbf{W}$ (Mishra et al., 2013).
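To make the three terms of the penalized loss concrete, here is a short sketch that evaluates it with NumPy. The function name `sparse_loss` is a hypothetical helper introduced for illustration; it is not taken from the authors' software.

```python
import numpy as np

def sparse_loss(Y, X, B, W, mu, gamma):
    """Evaluate ||Y - W - X B^T||_F^2 + mu * ||B||_1 + gamma * ||W||_*.

    Y: n x p responses, X: n x d variables, B: p x d effects, W: n x p latent matrix.
    """
    fit = np.linalg.norm(Y - W - X @ B.T, "fro") ** 2   # squared Frobenius norm
    l1 = np.abs(B).sum()                                # entrywise L1 norm of B
    nuclear = np.linalg.norm(W, "nuc")                  # sum of singular values of W
    return fit + mu * l1 + gamma * nuclear
```

The nuclear norm is the sum of the singular values of W, so penalizing it shrinks small singular values toward zero and acts as a convex surrogate for a rank constraint.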
Sparse latent factor mixed model algorithm. To simplify the description of the estimation algorithm, let us assume that the explanatory variables, $\mathbf{X}$, are scaled so that $\mathbf{X}^T\mathbf{X} = \mathbf{Id}_d$. Note that our program implementation is more general, and does not make this restrictive assumption. Here, it is introduced to explain the sparse LFMM algorithm with simplified notations. We developed a block-coordinate descent method for minimizing the convex loss function $\mathcal{L}_{\mathrm{sparse}}(\mathbf{W}, \mathbf{B})$ with respect to $\mathbf{B}$ and $\mathbf{W}$. The algorithm is initialized from the null matrix $\hat{\mathbf{W}}_0 = 0$, and iterates the following steps.
1. Find $\hat{\mathbf{B}}_t$, a minimum of the penalized loss function
$$\mathcal{L}^{(1)}_{\mathrm{sparse}}(\mathbf{B}) = \|(\mathbf{Y} - \hat{\mathbf{W}}_{t-1}) - \mathbf{X}\mathbf{B}^T\|_F^2 + \mu\|\mathbf{B}\|_1. \tag{4}$$
2. Find $\hat{\mathbf{W}}_t$, a minimum of the penalized loss function
$$\mathcal{L}^{(2)}_{\mathrm{sparse}}(\mathbf{W}) = \|(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_t^T) - \mathbf{W}\|_F^2 + \gamma\|\mathbf{W}\|_*. \tag{5}$$
The algorithm cycles through the two steps until a convergence criterion is met or the allocated computing resource is depleted. Each minimization step has a well-defined and unique solution. To see it, note that Step 1 corresponds to an L1-regularized regression of the residual matrix $(\mathbf{Y} - \hat{\mathbf{W}}_{t-1})$ on the explanatory variables. To compute the regression coefficients, we used the Friedman block-coordinate descent method (Friedman et al., 2007). According to Tibshirani (1996), we obtained
$$\hat{\mathbf{B}}_t = \mathrm{sign}(\bar{\mathbf{B}}_t)\,(|\bar{\mathbf{B}}_t| - \mu)_+, \tag{6}$$
where $s_+ = \max(0, s)$, $\mathrm{sign}(s)$ is the sign of $s$, both operations are applied entrywise, and $\bar{\mathbf{B}}_t$ is the linear regression estimate, $\bar{\mathbf{B}}_t^T = \mathbf{X}^T(\mathbf{Y} - \hat{\mathbf{W}}_{t-1})$. Step 2 consists of finding a low rank approximation of the residual matrix $\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_t^T$ (Cai et al., 2008). This approximation starts with a singular value decomposition (SVD) of the residual matrix, $\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_t^T = \mathbf{M}\mathbf{S}\mathbf{N}^T$, with $\mathbf{M}$ a unitary matrix of dimension $n \times n$, $\mathbf{N}$ a unitary matrix of dimension $p \times p$, and $\mathbf{S}$ the matrix of singular values $(s_j)_{j=1,\ldots,n}$. Then, we obtain
$$\hat{\mathbf{W}}_t = \mathbf{M}\bar{\mathbf{S}}\mathbf{N}^T, \tag{7}$$
where $\bar{\mathbf{S}}$ is the diagonal matrix with diagonal terms $\bar{s}_j = (s_j - \gamma)_+$, $j = 1, \ldots, n$.
Building on results from Tseng (2001), the following statement holds.
Theorem 1. Let $\mu > 0$ and $\gamma > 0$. Then the block-coordinate descent algorithm cycling through Step 1 and Step 2 converges to estimates of $\mathbf{W}$ and $\mathbf{B}$ defining a global minimum of the penalized loss function $\mathcal{L}_{\mathrm{sparse}}(\mathbf{W}, \mathbf{B})$.
Note that the algorithmic complexities of Step 1 and Step 2 are bounded by a term of order $O(pn + K(p + n))$. The computing time of sparse LFMM estimates is generally longer than for the CATE algorithm (Wang et al., 2017) or the ridge LFMM algorithm detailed below (Caye et al., 2019). Sparse LFMM needs to perform SVD and projections several times until convergence, while CATE and ridge LFMM require a single iteration.
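The two-step iteration described in this section can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names, the stopping rule, and the default iteration count are hypothetical, and X is assumed to have orthonormal columns. The thresholds are applied at levels mu and gamma as written in the update formulas; with those levels each step is the exact block minimizer when the squared-error term carries a factor 1/2, which amounts to a rescaling of the regularization parameters.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise soft-thresholding: sign(a) * max(|a| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svd_threshold(A, tau):
    """Singular value thresholding: shrink each singular value of A by tau."""
    M, s, Nt = np.linalg.svd(A, full_matrices=False)
    return (M * np.maximum(s - tau, 0.0)) @ Nt

def sparse_lfmm_sketch(Y, X, mu, gamma, n_iter=100, tol=1e-6):
    """Block-coordinate descent sketch; assumes X^T X = identity."""
    n, p = Y.shape
    W = np.zeros((n, p))                    # initialization from the null matrix
    B = np.zeros((p, X.shape[1]))
    for _ in range(n_iter):
        B_bar = (Y - W).T @ X               # least-squares estimate, p x d
        B_new = soft_threshold(B_bar, mu)   # Step 1: sparse effect sizes
        W_new = svd_threshold(Y - X @ B_new.T, gamma)  # Step 2: low-rank latent matrix
        change = np.linalg.norm(B_new - B) + np.linalg.norm(W_new - W)
        B, W = B_new, W_new
        if change < tol:                    # hypothetical stopping rule
            break
    return B, W
```

A general design matrix would require an inner lasso solver in Step 1 instead of a single soft-thresholding pass, which is what the more general program implementation mentioned in the text would have to handle.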
2.3 Ridge regression algorithms
Caye et al. (2019) considered a related approach, referred to as ridge LFMM, where the statistical estimates of the parameter matrices $\mathbf{B}$ and $\mathbf{W}$ are computed after minimizing the loss function with L2 norm regularization defined as follows
$$\mathcal{L}_{\mathrm{ridge}}(\mathbf{B}, \mathbf{W}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2 + \lambda\|\mathbf{B}\|_2^2, \quad \lambda > 0, \tag{8}$$
where $\|\cdot\|_F$ is the Frobenius norm, $\|\cdot\|_2$ is the L2 norm, and $\lambda$ is a regularization parameter. The minimization algorithm starts with an SVD of the explanatory matrix, $\mathbf{X} = \mathbf{Q}\boldsymbol{\Sigma}\mathbf{R}^T$, where $\mathbf{Q}$ is an $n \times n$ unitary matrix, $\mathbf{R}$ is a $d \times d$ unitary matrix and $\boldsymbol{\Sigma}$ is an $n \times d$ matrix containing the singular values of $\mathbf{X}$, denoted by $(\sigma_j)_{j=1,\ldots,d}$. The ridge estimates are computed as follows
$$\hat{\mathbf{W}} = \mathbf{Q}\mathbf{D}_\lambda^{-1}\,\mathrm{svd}_K(\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}), \tag{9}$$
$$\hat{\mathbf{B}}^T = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{Id}_d)^{-1}\mathbf{X}^T(\mathbf{Y} - \hat{\mathbf{W}}), \tag{10}$$
where $\mathrm{svd}_K(\mathbf{A})$ is the SVD of rank $K$ of $\mathbf{A}$, $\mathbf{Id}_d$ is the $d \times d$ identity matrix, and $\mathbf{D}_\lambda$ is the $n \times n$ diagonal matrix with coefficients defined as
$$d_\lambda = \left(\sqrt{\frac{\lambda}{\lambda + \sigma_1^2}}, \ldots, \sqrt{\frac{\lambda}{\lambda + \sigma_d^2}}, 1, \ldots, 1\right).$$
For $\lambda > 0$, the solution of the regularized least-squares problem is unique (Caye et al., 2019), and the corresponding matrices are called the ridge estimates. For completeness, we provide a short proof for this result, stated in Caye et al. (2019), in the appendix. Using random projections to compute low rank approximations, the complexity of the ridge LFMM estimation algorithm is of order $O(n^2p + np\log K)$ (Halko et al., 2011). For studies in which the number of samples, $n$, is much smaller than the number of response variables, $p$, computing times of ridge estimates are therefore shorter than those of sparse LFMM.
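The closed-form ridge estimates can be sketched directly from the formulas above. This is a hedged illustration: the function name `ridge_lfmm_sketch` is hypothetical, the code assumes n ≥ d, and it uses a full deterministic SVD rather than the randomized low-rank approximation mentioned in the text, so it does not reproduce the stated complexity.

```python
import numpy as np

def ridge_lfmm_sketch(Y, X, K, lam):
    """Closed-form ridge LFMM estimates (sketch, assumes n >= d and lam > 0)."""
    n, d = X.shape
    Q, sigma, _ = np.linalg.svd(X, full_matrices=True)    # Q is n x n
    d_lam = np.ones(n)                                    # diagonal of D_lambda
    d_lam[:sigma.size] = np.sqrt(lam / (lam + sigma ** 2))
    A = d_lam[:, None] * (Q.T @ Y)                        # D_lambda Q^T Y
    M, s, Nt = np.linalg.svd(A, full_matrices=False)
    A_K = (M[:, :K] * s[:K]) @ Nt[:K, :]                  # best rank-K approximation
    W = Q @ (A_K / d_lam[:, None])                        # Q D_lambda^{-1} svd_K(...)
    # Ridge regression of the residual matrix on X; B has dimension p x d
    B = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ (Y - W)).T
    return B, W
```

Because no iteration is involved, the cost is dominated by the two SVDs, which is consistent with the remark that ridge estimates are cheaper to compute than sparse ones when n is much smaller than p.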
3 Results
Generative model experiments. In a first series of experiments, we compared sparse LFMM with LASSO and three non-sparse approaches (ridge LFMM, CATE, SVA). The data were simulated from the generative model defined in equation (1), and the performance of each algorithm was measured in four scenarios showing higher or lower effect sizes and confounding intensities (Figure 1). For all experiments, we computed statistical errors (RMSE) for the effect size estimates of each method (Figure 1). To provide a reference value for the RMSE, we measured the error made when all effect sizes were estimated as being null (“Zero” value or null-model error). The null-model error was equal to 0.069 in low effect size scenarios and equal to 0.135 in high effect size scenarios. A powerful method was expected to reach error levels lower than the null-model error. The RMSEs of sparse LFMM ranged from 0.055 to 0.092, less than those of the null-model. The RMSEs of LASSO were close to the ones of sparse LFMM in the low effect size scenarios. In contrast, non-sparse methods led to RMSEs higher than the null-model error, ranging between 0.13 and 0.26 for ridge LFMM and CATE, and rising up to 0.50 for SVA. For the effect sizes associated with causal markers, non-sparse methods reached lower RMSE values than those of sparse methods, ranging between 0.12 and 0.26 for ridge LFMM and CATE, and between 0.60 and 1.03 for sparse LFMM (Figure S1). Regarding precision and F-score, which is the harmonic mean of power and precision, the performances of all methods were higher in scenarios with higher effect size and lower confounding intensity. Sparse LFMM performed similarly to or worse than the LASSO when the size of the causal effects was small (Figure 2AB), but it reached higher F-scores for larger effect sizes (Figure 2CD). In those simulations, sparse LFMM obtained lower F-scores than ridge LFMM and CATE. The difference was substantial when the sizes of the causal effects were small (F ≈ 0.51 versus F ≈ 0.76, Figure 2AB), but the differences were small for the larger effect sizes (F ≈ 0.75, Figure 2CD). In all scenarios, sparse LFMM obtained better scores than SVA. In summary, sparse LFMM was associated with the smallest overall statistical error, but the estimates of effect size were biased more severely with this method than with non-sparse methods. Sparse LFMM was generally preferable to LASSO and SVA. Once non-null effect sizes are identified by sparse LFMM, a consensus strategy would use ridge LFMM or CATE for evaluating the effect sizes of the candidate markers.
Figure 1. Root Mean Square Error (RMSE) as a function of the effect size of causal markers and confounding intensity. Two sparse methods (sparse LFMM, LASSO) and three non-sparse methods (ridge LFMM, CATE and SVA) were compared. The “Zero” value corresponds to an RMSE obtained with all effect sizes set to zero (null-model error). Generative model simulation parameters: (A) Lower effect sizes and confounding intensities. (B) Lower effect sizes and higher confounding intensities. (C) Higher effect sizes and lower confounding intensities. (D) Higher effect sizes and confounding intensities.
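The error and detection metrics used in these comparisons can be computed as in the sketch below. It assumes that a marker counts as a hit when its estimated effect size is non-zero, which matches sparse estimators; how hits were called for the non-sparse methods is not specified here, so that rule is an assumption, as is the helper name `detection_scores`.

```python
import numpy as np

def detection_scores(b_hat, b_true):
    """Return (precision, recall, F-score, RMSE) for one vector of effect sizes."""
    called = b_hat != 0                  # markers reported as hits (assumed rule)
    causal = b_true != 0                 # markers with a true non-null effect
    tp = np.sum(called & causal)         # true positives
    precision = tp / called.sum() if called.any() else 0.0
    recall = tp / causal.sum() if causal.any() else 0.0   # power
    if precision + recall > 0:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    rmse = np.sqrt(np.mean((b_hat - b_true) ** 2))
    return precision, recall, f_score, rmse

# Toy example with five markers, two of them causal
b_true = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
b_hat = np.array([0.5, 0.3, 0.0, 0.0, 0.0])
precision, recall, f_score, rmse = detection_scores(b_hat, b_true)
```

The null-model reference error mentioned in the text corresponds to calling the same function with `b_hat` set to a vector of zeros.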
Empirical simulation experiments. In a second series of experiments, we used realistic simulations to compare sparse LFMM to other sparse and non-sparse methods. Simulations were based on 162 ecotypes of the model plant Arabidopsis thaliana using 53,859 SNP genotypes in chromosome 5. The simulations considered lower and higher effect sizes and gene by environment (G × E) interaction levels. Those simulations departed from generative model simulations, and they were introduced to evaluate the robustness of effect size estimates in each approach. In lower G × E interaction scenarios, sparse LFMM obtained the highest scores (F in (0.57,0.60), precision in (0.81,0.82), Figure 3AC) compared to BSLMM (F in (0.36,0.44)), and to non-sparse methods (F ranging between 0.25 and 0.28). In higher G × E interaction scenarios, all methods obtained very low performances for the low effect size scenario, but sparse LFMM was among the methods with the highest F-score and precision. When the effect size was higher, sparse LFMM reached higher performances (F ≈ 0.28 and accuracy ≈ 0.33) than the other methods (Figure 3D). In those realistic simulations, sparse LFMM demonstrated greater robustness to departure from the generative model assumptions than the other sparse methods (BSLMM, LASSO), and also compared favorably with non-sparse methods (ridge LFMM, CATE, SVA).
Figure 2. F-score and precision as a function of effect size of the causal markers and confounding intensity. Two sparse methods (sparse LFMM, LASSO) and three non-sparse methods (ridge LFMM, CATE and SVA) were compared. F-score is the harmonic mean of precision and recall. Generative model simulation parameters: (A) Lower effect sizes and confounding intensities. (B) Lower effect sizes and higher confounding intensities. (C) Higher effect sizes and lower confounding intensities. (D) Higher effect sizes and confounding intensities.
Figure 3. Empirical simulation data (F-score and precision). F-score and precision as a function of the effect size of the causal markers and of the strength of the interaction between genotype and environment (G × E). Three sparse methods (sparse LFMM, BSLMM and LASSO) and three non-sparse methods (ridge LFMM, CATE and SVA) were compared. F-score is the harmonic mean of precision and recall. Simulation parameters: (A) Lower effect sizes and lower G × E. (B) Lower effect sizes and higher G × E. (C) Higher effect sizes and lower G × E. (D) Higher effect sizes and higher G × E.
Runtimes and number of factors. Next, we evaluated runtimes for sparse LFMM, and compared those runtimes with BSLMM and ridge LFMM (Figure S2). Whatever the number of individuals or markers, ridge LFMM was the fastest method, and sparse LFMM was the slowest method. Higher computation times for sparse LFMM were not surprising because the method iterates many cycles before convergence, whereas ridge LFMM is an exact approach. It took around 2,000 seconds for sparse LFMM to complete runs with n = 1,000 individuals and p = 100,000 markers. With default values for MCMC parameters, BSLMM runtimes were of the same order as those of sparse LFMM. To assess the choice of K by cross-validation, we varied the number of latent factors between 3 and 10, and compared the values estimated by cross-validation with the true values. In 73% of simulations, the number of latent factors was correctly estimated, and in the remaining 17% of simulations, the true value of K was overestimated by one unit (Figure S3).
GWAS of flowering time in A. thaliana. To illustrate the use of latent factor models in a context where confounding is difficult to control for, we performed a GWAS of flowering time using p = 53,859 SNPs genotyped in chromosome 5 for n = 162 European accessions of the model plant A. thaliana. The sparse methods (sparse LFMM, LASSO, BSLMM) differed in their estimate of the number of null effect sizes (Figure 4ABC, Figure S4). The LASSO approach estimated 99.85% null effect sizes while the proportions were equal to 99.24% and 98.18% for BSLMM and sparse LFMM respectively. The LASSO was the most conservative approach, and sparse LFMM the most liberal one. Sparse LFMM shared 3.9% of hits with LASSO, and 5.5% with BSLMM (Figure S4). Less than 1% of all hits were common to the three approaches. The (non-null) effect sizes for hits varied on distinct scales, with LASSO exhibiting the strongest biases. All sparse methods detected the same top hit at around 4 Mb, corresponding to a SNP located within the FLC gene, consistent with the results of Atwell et al. (2010). The second hit in Atwell et al. (2010), located in the gene DOG1, was also identified by sparse LFMM. BSLMM had more difficulties in identifying previously discovered genes. Given the high correlation (greater than 94%) between effect sizes obtained with non-sparse methods, we grouped their results by averaging their estimates. Non-sparse methods exhibited effect sizes in a range of values closer to sparse LFMM than to LASSO and BSLMM, but higher statistical errors were observed for those approaches (Figure 4D). Overall, we found a significant correlation between the non-null effect sizes estimated by sparse LFMM and the corresponding effect sizes found by non-sparse methods (ρ = 0.8065, $P < 10^{-16}$). In addition, sparse LFMM and the non-sparse methods found new hits around 13.9 Mb and 6.5 Mb of chromosome 5, corresponding to the SAP and ACL5 genes respectively.
EWAS of exposure to smoking during pregnancy. To evaluate association270
between smoking during pregnancy and placental DNA methylation, we performed271
an EWAS considering tobacco consumption as a primary variable. To this objective,272
we considered beta-normalized methylation levels at p = 425, 878 probed CpG sites273
for n = 668 women (Heude et al., 2016; Rousseaux et al., 2019). The placentas were274
collected at delivery from women included in the EDEN mother-child cohort. Using275
sparse LFMM, the proportion of null effect sizes was equal to 99.698%, for a total276
number of 1,287 hits (Figure S5). To characterize the targeted CpGs, we evaluated277
Figure 4. GWAS of flowering time in A. thaliana (chromosome 5). A) Effect size estimates for LASSO. B) Effect size estimates for sparse LFMM. C) Effect size estimates for sparse BSLMM. D) Average effect size estimates for non-sparse methods (ridge LFMM, CATE and SVA). Grey bars represent Arabidopsis SNPs associated with the FT16 phenotype in (Atwell et al., 2010), and correspond to the FLC and DOG1 genes.
whether there was an enrichment of enhancer and promoter regions in candidate
regions compared to the methylome (Figure S6 and Figure S7). For the 1,287 CpGs
with non-null effect sizes, 25.48% were found in enhancer regions, compared to 22.73%
for the whole methylome, and 6.83% were found in promoter regions, compared to
19.94% for the whole methylome. We compared the CpGs having the highest effect
sizes in each method (Figure S8). Sparse LFMM shared 45.3% of hits with non-sparse
models (represented by ridge LFMM), and 2.8% of hits with LASSO (Table
S1). Among the 51 top hits shared by sparse LFMM and ridge LFMM, 25 were found
in the body of a gene, 11 were not associated with a gene, 20 were in enhancer regions
and 2 in promoter regions. Note that in this analysis, we averaged the effect sizes of
non-sparse methods because their correlation was greater than 99%. The results of
sparse LFMM agreed with the results of non-sparse methods better than with those
of LASSO. The Pearson correlation between the non-null effect sizes estimated by
sparse LFMM and the corresponding effect sizes estimated by non-sparse methods was
equal to ρ = 80.38% (P < 10^-16), whereas the Pearson correlation between non-null
effect sizes of sparse LFMM and LASSO was equal to ρ = 61.86% (P < 10^-16).
To focus on a specific chromosome, we detailed the outputs of all approaches for
chromosome 3, which contained the epigenome-wide top hit for sparse LFMM and
for non-sparse methods (cg27402634, located in an enhancer, Figure 5). This CpG
was also detected with LASSO (Figure S9). The sparse LFMM hits shared three
additional CpGs with non-sparse methods: cg09627057, cg18557837 and cg12662091.
Overall, sparse LFMM detected 61 CpGs with non-null effect sizes: 43 were located
in genes, 22 in enhancer regions and 6 in promoter regions.
Figure 5. DNA methylation EWAS of smoking status in pregnant women (chromosome 3). A) Estimated effect size for sparse LFMM. The effect size at cg27402634 is equal to β = −0.117 (out of range). B) Estimated effect size for non-sparse methods (ridge LFMM, CATE and SVA). The effect size at cg27402634 is equal to β = −0.141 (out of range). CpGs with the highest effects are circled (genes in blue color). Red dots represent CpGs located in enhancer regions. Green dots represent CpGs located in promoter regions (Illumina annotations).
4 Discussion
We introduced sparse latent factor regression methods for the joint estimation of
effect sizes and latent factors in genomic and epigenomic association studies. In
generative and in empirical simulations, sparse LFMM obtained higher F-score and
precision than previously introduced sparse methods, BSLMM and LASSO. Compared
to three non-sparse methods (ridge LFMM, CATE and SVA), statistical errors
of effect size estimates were reduced. In simulations based on a real data set, sparse
LFMM reached the highest precision and F-score, showing that the method was more
robust to departure from model assumptions than the other methods. For the causal
markers, the effect sizes estimated by sparse LFMM and the corresponding effect sizes
estimated by non-sparse methods were strongly correlated. Effect size estimates had
a lower bias in non-sparse methods compared to sparse methods. These results suggest
combining sparse LFMM with a non-sparse method in the following way. In
a first stage, sparse LFMM can be used to estimate the support of causal markers
(non-null effect sizes). Then ridge LFMM and CATE can be used to estimate the
effect sizes of the selected markers.
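The two-stage combination suggested above can be illustrated with a short sketch. It assumes that both methods have already been run and that their effect size estimates are available as lists indexed by marker; the function name is hypothetical and is not part of the lfmm package.

```python
def two_stage_estimates(sparse_effects, nonsparse_effects):
    """Stage 1: take the support from the sparse fit.
    Stage 2: report the non-sparse effect sizes on that support only."""
    support = [j for j, b in enumerate(sparse_effects) if b != 0.0]
    return {j: nonsparse_effects[j] for j in support}
```

For example, `two_stage_estimates([0.0, 0.3, 0.0, -0.1], [0.05, 0.5, -0.02, -0.2])` keeps markers 1 and 3 and reports the non-sparse values 0.5 and -0.2 for them.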
In a GWAS of flowering time using 53,859 SNPs in the fifth chromosome of 162
European accessions of the plant A. thaliana, sparse LFMM identified the FLC and
DOG1 genes to be associated with the FT16 phenotype. The two genes were previously
reported as being associated with this phenotype in (Atwell et al., 2010).
The FLC gene plays a central role in flowering induced by vernalization (Sheldon et
al., 2000), and DOG1 is involved in the control of dormancy and seed germination
(Nishimura et al., 2018). The second hit of sparse LFMM corresponded to SNPs
linked to SAP, which is a transcriptional regulator involved in the specification of
floral identity (Byzova et al., 1999). This association was also significant for non-sparse
methods. In addition, the new method detected SNPs located in the ACL5
gene, which plays a role in internodal growth and organ size (Hanzawa et al., 1997).
In summary, sparse methods facilitated the selection of non-null effect sizes. The
results for sparse LFMM were not only consistent with previous discoveries, but they
also identified new candidate genes with interesting functional annotations.
Next, we applied sparse LFMM in an EWAS of placental DNA methylation
for women exposed to smoking during pregnancy, which is considered an important
risk factor for child health (Lumley et al., 2009). The CpG with the highest
effect size in sparse LFMM and non-sparse methods (cg27402634) is located
in an enhancer region, close to the LEKR1 gene which was associated with birth
weight in a GWAS from the Early Growth Genetics (EGG) consortium
(http://egg-consortium.org/birth-weight.html). This association was detected as a
top hit in an independent study of placental methylation and smoking (Morales et
al., 2019). Rousseaux et al. (2019) also detected the association with cg27402634 in
an EWAS based on a slightly different study population, and with other measures
of the level of tobacco consumption. Morales et al. (2019) carried out a Sobel analysis
of mediation between smoking and birth weight, and found the test significant for
cg27402634. In the list of 51 CpGs with high effect sizes, several additional statistical
associations between placental methylation and maternal smoking have been
reported in previous studies, including cg21992501 in the gene TTC27 (Cardenas et
al., 2019), cg25585967 and cg17823829, respectively in the TRIO and KDM5B genes
(Morales et al., 2019; Everson et al., 2019). Turning to the rest of the methylome,
we found additional associations that may adversely affect mother-child health. To
better characterize the CpGs in those associations, we evaluated whether there was
an enrichment in enhancer and in promoter regions. We found there was an enrichment
of enhancer regions and a depletion of promoter regions for CpGs with
non-null effects, consistent with the findings of Rousseaux et al. (2019). Overall,
our new method allowed us to confirm some previously discovered associations, and
also detected new associations including genes for which methylation changes have
detrimental effects on the health of the child.
Conclusion. Removing variation due to unobserved confounding factors is extremely
difficult in any type of association study. Assuming that a small proportion of
all markers correlate with the exposure or phenotype, we addressed the confounding
issue by using sparse latent factor regression models, and provided mathematical
guarantees that the proposed algorithms yield global solutions of least-squares
estimation problems. Sparsity
constraints in our algorithm allowed the selection of markers without any need for
statistical testing. The application of our method to real data sets highlighted new
associations with relevant biological meaning. The methods are reproducible and are
implemented in the R package lfmm.
Materials and Methods
Cross-validation method. Choosing regularization parameters of sparse LFMM
or ridge LFMM and the number of latent factors can be done by using cross-validation
methods. The cross-validation approach partitions the data into a training set and a
test set. The training set is used to fit model parameters, and prediction errors are
measured on the test set. In our approach, the response and explanatory variables
are partitioned according to their rows (individuals). We denote by I the subset of
individual labels on which prediction errors are computed. Estimates of effect sizes,
B̂−I, and loading values, V̂−I, are computed on the training set. Next, we partition
the set of columns of the response matrix, and denote by J the subset of columns on
which prediction errors are computed. A factor matrix, Û−J, is estimated from the
complementary subset as follows
$$\hat{U}_{-J} = \left( Y[I,-J] - X[I,\,]\,\hat{B}^{T}_{-I}[-J,\,] \right) \hat{V}_{-I}[-J,\,]. \qquad (11)$$
In these notations, the brackets indicate which subsets of rows and columns are
selected. A prediction error is then computed as follows
$$\mathrm{Error} = \left\| Y[I,J] - \hat{U}_{-J}\,\hat{V}^{T}_{-I}[J,\,] - X[I,\,]\,\hat{B}^{T}_{-I}[J,\,] \right\|_{F}. \qquad (12)$$
Regularization parameters and the number of factors leading to the lowest prediction
error were retained in data analysis.
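Equations (11) and (12) can be sketched in plain Python as below. This is an illustrative reading of the notation, not the lfmm implementation: matrices are lists of rows, B is assumed to be stored as a p × d matrix and V as a p × K matrix (so that the bracket selects rows of B and V before transposition), and all helper names are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def take(A, rows, cols=None):
    """Select a subset of rows and, optionally, of columns."""
    return [[A[i][j] for j in (cols if cols is not None else range(len(A[i])))]
            for i in rows]

def prediction_error(Y, X, B, V, I, J):
    """Frobenius prediction error on test rows I and held-out columns J.

    B (p x d) and V (p x K) are assumed to have been fitted without rows I.
    """
    held_out = set(J)
    not_J = [j for j in range(len(B)) if j not in held_out]
    X_I = take(X, I)
    # Equation (11): factor scores for test rows from the columns not in J.
    residual = subtract(take(Y, I, not_J), matmul(X_I, transpose(take(B, not_J))))
    U = matmul(residual, take(V, not_J))
    # Equation (12): error on the held-out columns J.
    latent_part = matmul(U, transpose(take(V, J)))
    fixed_part = matmul(X_I, transpose(take(B, J)))
    E = subtract(subtract(take(Y, I, J), latent_part), fixed_part)
    return math.sqrt(sum(e * e for row in E for e in row))
```

In a parameter search, this error would be computed for each candidate setting and the setting with the lowest value retained.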
Heuristics for regularization parameters and number of factors. Additional
heuristics were used to determine the number of latent factors and the regularization
parameter of the nuclear norm of the latent matrix. In order to choose the number of
latent factors, K, we considered the matrix $D_\lambda$, defined for the ridge algorithm, and
the unitary matrix Q, obtained from an SVD of X. The number of latent factors,
K, can be estimated by using a spectral analysis of the matrix $D_0 Q^{T} Y$. In our
experiments, we used the “elbow” method based on the scree plot of eigenvalues of
the matrix $D_0 Q^{T} Y$. Values for K were confirmed by prediction errors computed by
cross-validation. The L1-regularization parameter, µ, was determined by inspection
of the proportion of non-zero effect sizes in the B matrix, which was estimated by
cross-validation. Having set the proportion of non-null effect sizes, µ was computed
by using the regularization path approach proposed by Friedman et al. (2010) as
follows. The regularization path algorithm was initialized with the smallest value of
µ such that
$$\hat{B}_{1} = \mathrm{sign}(\bar{B}_{1})\,(\bar{B}_{1} - \mu)_{+} = 0, \qquad (13)$$
where B̂1 resulted from Step 1 in the sparse LFMM algorithm, and B̄1 is the linear
regression estimate. Then, we built a sequence of µ values that decreased from the
inferred value of the parameter µmax to µmin = εµmax. We then measured the
number of non-null elements in B̂t, and stopped when the target proportion was
reached. The nuclear norm parameter (γ) determines the rank of the latent matrix
W. We used a heuristic approach to evaluate γ from the number of latent factors
K. Based on the singular values (λ1, . . . , λn) of the response matrix Y, we set
$$\gamma = \frac{\lambda_{K} + \lambda_{K+1}}{2}. \qquad (14)$$
With this value of γ, sparse LFMM always converged to a latent matrix estimate
having rank K in our experiments.
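The two heuristics above can be sketched as follows. Several points are assumptions made for illustration only: the soft-thresholding operator is written in its standard form with an absolute value, the grid of µ values is log-spaced as in glmnet, the values of `eps` and `n_values` are arbitrary, and the thresholding is applied to a fixed vector of regression estimates, whereas the actual algorithm re-estimates B̂t at each value of µ. Function names are hypothetical.

```python
import math

def soft_threshold(b, mu):
    """Standard soft-thresholding: sign(b) * max(|b| - mu, 0)."""
    magnitude = abs(b) - mu
    return math.copysign(magnitude, b) if magnitude > 0 else 0.0

def mu_path(mu_max, eps=0.001, n_values=100):
    """Decreasing, log-spaced grid from mu_max down to eps * mu_max."""
    ratio = eps ** (1.0 / (n_values - 1))
    return [mu_max * ratio ** k for k in range(n_values)]

def select_mu(b_bar, target_proportion, eps=0.001, n_values=100):
    """Walk down the path until enough thresholded effects are non-null."""
    mu_max = max(abs(b) for b in b_bar)  # smallest mu that zeroes every effect
    path = mu_path(mu_max, eps, n_values)
    for mu in path:
        non_null = sum(1 for b in b_bar if soft_threshold(b, mu) != 0.0)
        if non_null / len(b_bar) >= target_proportion:
            return mu
    return path[-1]  # target not reached: fall back to the smallest mu

def nuclear_norm_parameter(singular_values, K):
    """Equation (14): midpoint between the K-th and (K+1)-th singular values."""
    s = sorted(singular_values, reverse=True)
    return (s[K - 1] + s[K]) / 2.0
```

Placing γ halfway between the K-th and (K+1)-th singular values is consistent with the reported behaviour that the estimated latent matrix has rank K.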
Estimation algorithms. Sparse LFMM was compared to two other sparse methods.
As a baseline, we used Least Absolute Shrinkage and Selection Operator (LASSO)
regression models (Tibshirani, 1996; Friedman et al., 2010). LASSO regression models
did not include any correction for confounding, and strong biases were expected in
effect size estimates. The LASSO models were implemented in the R package glmnet,
and the regularization parameter was selected by using a 5-fold cross-validation approach
(Zeng et al., 2017). We also used Bayesian Sparse Linear Mixed Models
(BSLMM) implemented in the GEMMA software (Zhou et al., 2013). BSLMM is
a hybrid method that combines sparse regression models with linear mixed models.
BSLMM uses a Markov chain Monte Carlo (MCMC) method to estimate effect
sizes. The MCMC burn-in period and sampling sizes were set to 10,000 (Zeng et
al., 2017). To determine the proportion of non-zero effect sizes, two parameters were
tuned (pmin and pmax). Those parameters correspond to the logarithms of the minimum
and maximum expected proportions of non-zero effect sizes. We also compared
sparse LFMM to three non-sparse algorithms, all based on the generative model
defined in equation (1). First, we implemented Surrogate Variable Analysis (SVA,
Leek and Storey (2007)). SVA was introduced to overcome the problems caused by
heterogeneity in gene expression studies. The algorithm starts by estimating the
loading values of a principal component analysis for the residuals of the regression
of the response matrix Y on X. In a second step, SVA determines a subset of response
variables exhibiting low correlation with X, and uses this subset of variables
to estimate the latent factors. SVA was implemented in the R package sva. Next,
we implemented the Confounder Adjusted Testing and Estimation (CATE) method
(Wang et al., 2017). CATE uses a linear transformation of the response matrix such
that the first axis of this transformation is collinear to X and the other axes are
orthogonal to X. CATE was used without negative controls, and it was implemented
in the R package cate. Finally, we used the ridge version of LFMM implemented
in the R package lfmm (Caye et al., 2019).
Generative model simulations. We defined the confounding intensity as the
percentage of variance of the primary variable X explained by the latent factors U.
Following Caye et al. (2019), we performed simulations of a primary variable, X,
with d = 1, and K = 6 independent latent factors, U, for two values of confounding
intensity, R^2 = 0.1 (lower) and R^2 = 0.5 (higher). The joint distribution of (X, U)
was a multivariate Gaussian distribution. Having defined primary variables and latent
factors, we used the generative model defined in equation (1) to simulate a response
matrix, Y. To create sparse models, only a small proportion of effect sizes, around
0.8%, were allowed to be different from zero. Non-null effect sizes were sampled
according to a Gaussian distribution, N(B, 0.2), where B could take two values,
B = 0.75 (lower value) and B = 1.5 (higher value). Residual errors and loadings,
V, were sampled according to a standard Gaussian distribution. The dimensions
of the response matrix were set to n = 400 individuals and p = 10,000 variables.
Two hundred simulations were performed for each combination of parameters (800
simulations).
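A simulation of this kind can be sketched with the Python standard library as below. The sketch rests on several assumptions that the text does not specify: the response is taken to be Y = U Vᵀ + X Bᵀ + E, X is built with unit variance so that a share R^2 of it is explained equally by the K factors, and the second parameter of N(B, 0.2) is read as a standard deviation. The function name and its defaults are hypothetical.

```python
import math
import random

def simulate(n=400, p=10000, K=6, r2=0.1, effect_mean=0.75, effect_sd=0.2,
             prop_causal=0.008, seed=0):
    """Simulate (X, U, V, B, Y) from a sparse latent factor regression model."""
    rng = random.Random(seed)
    # K independent standard Gaussian latent factors per individual.
    U = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(n)]
    # Primary variable with unit variance; a share r2 is explained by U.
    a = math.sqrt(r2 / K)
    c = math.sqrt(1.0 - r2)
    X = [a * sum(u) + c * rng.gauss(0.0, 1.0) for u in U]
    # Sparse effect sizes: only a small proportion of markers are causal.
    n_causal = round(prop_causal * p)
    causal = set(rng.sample(range(p), n_causal))
    B = [rng.gauss(effect_mean, effect_sd) if j in causal else 0.0
         for j in range(p)]
    # Standard Gaussian loadings and residual errors.
    V = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(p)]
    Y = [[sum(u * v for u, v in zip(U[i], V[j])) + X[i] * B[j] + rng.gauss(0.0, 1.0)
          for j in range(p)]
         for i in range(n)]
    return X, U, V, B, Y
```

With the default dimensions the pure-Python loops take a while; smaller values of `n` and `p` are enough to check the logic.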
Empirical simulations. We used the R package naturalgwas to simulate associations
of phenotypes based on a matrix of sampled genotypes (François and
Caye, 2018). With this program, phenotypic simulations incorporate realistic features
such as geographic population genetic structure and gene-by-environment interactions
where environmental variables are derived from a bioclimatic database.
When estimating effect sizes, population genetic structure and gene-by-environment
interactions are considered to be the main sources of confounding. Phenotypes were
simulated for n = 162 accessions of the model plant Arabidopsis thaliana, using
publicly available Single Nucleotide Polymorphisms (SNPs) genotyped from the fifth
chromosome (Atwell et al., 2010). The response matrix contained p = 53,859 SNPs,
with minor allele frequency greater than 5%. The number of confounding factors was
set to K = 6, and
the phenotypes were generated from a combination of five causal SNPs with identical
effect sizes. Two values of effect size were implemented, B = 6 (lower effect size)
and B = 9 (higher effect size). Additionally, two values of gene-by-environment
interaction were implemented, G×E = 0.1 (lower G×E) and G×E = 0.9 (higher
G×E). For each parameter combination, two hundred simulations were performed.
Evaluation metrics. In the simulation study, all methods were used with their
default parameters, and the number of latent factors was set to K = 6 in all latent
factor models. To evaluate the capabilities of methods to identify true positives,
we used precision, which corresponds to the proportion of true positives in a list of
positive markers, the recall, which is the number of true positives divided by the
number of causal markers, and the F-score, which is the harmonic mean of precision
and recall. To compute precision and F-score in generative model experiments, a
list of 100 markers with the largest absolute estimated effect sizes was considered
for each data set and method. In empirical simulations, the measures were modified
to account for linkage disequilibrium (LD) in the data. Candidate markers within a
window of size 10 kb around a causal marker were considered to be true discoveries
(LD-r2 < 0.2, François and Caye (2018)). In addition to the F-score, we used the root
mean squared error (RMSE) to evaluate the statistical errors of effect size estimates.
We also used simulations from the generative model to assess the capability of the
cross-validation algorithm to estimate the number of latent factors in sparse LFMM.
In program runs, the number of latent factors varied between K = 3 and K = 10
and the value estimated by the cross-validation algorithm was compared with the
true value (K = 6).
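The metrics used in the generative model experiments can be written down in a few lines. The sketch below follows the definitions given above (a candidate list made of the markers with the largest absolute effect sizes, then precision, recall, F-score and RMSE); the function names are hypothetical, and the LD-window adjustment used in the empirical simulations is not included.

```python
import math

def top_markers(effects, k=100):
    """Indices of the k markers with the largest absolute effect sizes."""
    order = sorted(range(len(effects)), key=lambda j: abs(effects[j]), reverse=True)
    return order[:k]

def precision_recall_f(candidates, causal):
    """Precision, recall and F-score of a candidate list against causal markers."""
    causal = set(causal)
    true_positives = sum(1 for j in candidates if j in causal)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(causal) if causal else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # The F-score is the harmonic mean of precision and recall.
    return precision, recall, 2.0 * precision * recall / (precision + recall)

def rmse(estimates, truth):
    """Root mean squared error between estimated and true effect sizes."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))
```

For instance, with effects `[0.1, -2.0, 0.5, 1.5, 0.0]`, causal markers `{1, 2}` and `k=2`, the candidate list is `[1, 3]`, which contains one true positive, so precision, recall and F-score are all equal to 0.5.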
GWAS of plant phenotype. Sparse LFMM and a set of other methods were
used to perform association studies for two distinct types of genomic data including
genotypic and epigenetic markers. For Arabidopsis thaliana, we considered n = 162
European accessions and p = 53,859 SNPs from the fifth chromosome of the plant
genome to investigate associations with the flowering time phenotype FT16-TO:0000344
(Atwell et al., 2010). FT16 corresponds to the number of days required
for an individual plant to reach the flowering stage. In the sparse LFMM algorithm,
the percentage of non-null effect size was set to 0.01. The parameters pmin and pmax
defining sparsity in the BSLMM algorithm were fixed to pmin = −5 and pmax = −4
respectively. These values correspond to the logarithm of expected proportions of
non-null effect sizes in BSLMM. For all factor methods, the number of latent factors
was determined by cross-validation and set to K = 10.
EWAS of exposure to tobacco consumption. Our second application to real
data concerned an EWAS based on the EDEN mother-child cohort (Heude et al.,
2016). Beta-normalized methylation levels at p = 425,878 probed CpG sites were
measured for n = 668 women. We tested the association between smoking status
(219 current smokers and 449 non-current smokers) and DNA methylation
(mDNA) levels in the mother’s placenta. Detailed information on the study
population and protocols for placental DNA methylation assessment processing can
be found in (Abraham et al., 2018; Rousseaux et al., 2019). The proportion of null
effect sizes in sparse LFMM was equal to 0.999. For latent factor models, the number
of latent factors was estimated by cross-validation, and was equal to K = 7.
Acknowledgements. This article was developed in the framework of the Grenoble
Alpes Data Institute, supported by the French National Research Agency under
the Investissements d’Avenir program (ANR-15-IDEX-02). It received support from
LabEx PERSYVAL Lab, ANR-11-LABX-0025-01, and from the French National Research
Agency (Agence Nationale pour la Recherche) ETAPE, ANR-18-CE36-0005.
We thank the participants of the EDEN cohort. We thank the midwife research assistants
for data collection, the psychologists and the data entry operators. We also
thank the EDEN mother-child cohort study group which includes I Annesi-Maesano,
JY Bernard, J Botton, M-A Charles, P Dargent-Molina, B de Lauzon-Guillain,
P Ducimetière, M de Agostini, B Foliguet, A Forhan, X Fritel, A Germa, V Goua,
R Hankard, B Heude, M Kaminski, B Larroque, N Lelong, J Lepeule, G Magnin,
L Marchand, C Nabet, F Pierre, R Slama, MJ Saurel-Cubizolles, M Schweitzer, O
Thiebaugeorges.
Program Availability. All codes are publicly available. Sparse LFMM was implemented
in the R package lfmm available from Github (https://bcm-uga.github.io/lfmm/)
and submitted to the Comprehensive R Archive Network (https://cran.r-project.org/).
Data Availability. The Arabidopsis thaliana data are publicly available from the
1,001 genomes database (https://1001genomes.org/). The EDEN individual-level
data have restricted access owing to ethical and legal conditions in France. They are
available upon request from the EDEN steering committee at [email protected]
and through collaborations with the principal investigators of EDEN.
Funding. The EDEN study was supported by Foundation for medical research
(FRM), National Agency for Research (ANR), National Institute for Research in
Public health (IRESP: TGIR cohorte santé 2008 program), French Ministry of Health
(DGS), French Ministry of Research, INSERM Bone and Joint Diseases National Research
(PRO-A), and Human Nutrition National Research Programs, Paris-Sud University,
Nestlé, French National Institute for Population Health Surveillance (InVS),
French National Institute for Health Education (INPES), the European Union FP7
programmes (FP7/2007-2013, HELIX, ESCAPE, ENRIECO, Medall projects), Diabetes
National Research Program (through a collaboration with the French Association
of Diabetic Patients (AFD)), French Agency for Environmental Health Safety
(now ANSES), Mutuelle Générale de l’Education Nationale (MGEN), a complementary
health insurance, French national agency for food security, French-speaking association
for the study of diabetes and metabolism (ALFEDIAM).
References
Abraham, E., Rousseaux, S., Agier, L., Giorgis-Allemand, L., Tost, J., Galineau,
J., Hulin, A., Siroux, V., Vaiman, D., Charles, M.-A., Heude, B., Forhan, A.,
Schwartz, J., Chuffart, F., Bourova-Flin, E., Khochbin, S., Slama, R., and Lepeule,
J. (2018). Pregnancy exposure to atmospheric pollution and meteorological
conditions and placental DNA methylation. Environ. Int., 118, 334-347.
Akama, T.O., Misra, A.K., Hindsgaul, O., and Fukuda, M.N. (2002). Enzymatic
synthesis in vitro of the disulfated disaccharide unit of corneal keratan sulfate. J.
Biol. Chem., 277, 42505-42513.
Atwell, S., Huang, Y.S., Vilhjálmsson, B.J., Willems, G., Horton, M., Li, Y., Meng,
D., Platt, A., Tarone, A.M., Hu, T.T., Jiang, R., Muliyati, N.W., Zhang, X., Amer,
M.A., Baxter, I., Brachi, B., Chory, J., Dean, C., Debieu, M., de Meaux, J., Ecker,
J.R., Faure, N., Kniskern, J.M., Jones, J.D.G., Michael, T., Nemri, A., Roux, F.,
Salt, D.E., Tang, C., Todesco, M., Traw, M.B., Weigel, D., Marjoram, P., Borevitz,
J.O., Bergelson, J., and Nordborg, M. (2010). Genome-wide association study of
107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 465, 627-631.
Balding, D.J. (2006) A tutorial on statistical methods for population association
studies. Nat. Rev. Genet., 7, 781-791.
Bertsekas, D. P. (1999) Nonlinear Programming. Belmont: Athena Scientific.
Byzova, M.V., Franken, J., Aarts, M.G.M., de Almeida-Engler, J., Engler, G., Mariani,
C., Van Lookeren Campagne, M.M., Angenent, G.C. (1999). Arabidopsis
STERILE APETALA, a multifunctional gene regulating inflorescence, flower, and
ovule development. Genes Dev., 13, 1002-1014.
Cai, J-F., Candès, E.J. and Shen, Z. (2010) A singular value thresholding algorithm
for matrix completion. SIAM J. Optim., 20, 1956-1982.
Carvalho, C. M. et al. (2008) High-dimensional sparse factor modeling: applications
in gene expression genomics. J. Am. Stat. Assoc., 103, 1438-1456.
Cardenas, A., Lutz, S.M., Everson, T.M., Perron, P., Bouchard, L., Hivert, M.-F.
(2019). Mediation by placental DNA methylation of the association of prenatal
maternal smoking and birth weight. Am. J. Epidemiol., 188, 1878-1886.
Caye, K., Jumentier, B., Lepeule, J., François, O. (2019) LFMM 2: Fast and accurate
inference of gene-environment associations in genome-wide studies. Mol. Biol.
Evol., 36, 852-860.
Devlin, B. and Roeder K. (1999) Genomic control for association studies. Biometrics,
55, 997-1004.
Eckart, C. and Young, G. (1936) The approximation of one matrix by another of
lower rank. Psychometrika, 1, 211-218.
Efron, B. (2004) Large-scale simultaneous hypothesis testing: The choice of a null
hypothesis. J. Am. Stat. Assoc., 99, 96-104.
Everson, T.M., Vives-Usano, M., Seyve, E., Cardenas, A., Lacasaña, M., Craig, J.M.,
Lesseur, C., Baker, E.R., Fernandez-Jimenez, N., Heude, B., Perron, P., Gonzalez-Alzaga,
B., Halliday, J., Deyssenroth, M.A., Karagas, M.R., Iñiguez, C., Bouchard,
L., Carmona-Saez, P., Loke, Y.J., Hao, K., Belmonte, T., Charles, M.A., Martorell-Marugan,
J., Muggli, E., Chen, J., Fernandez, M.F., Tost, J., Gomez-Martin, A.,
London, S.J., Sunyer, J., Marsit, C.J., Lepeule, J., Hivert, M.-F., Bustamante,
M. (2019). Placental DNA methylation signatures of maternal smoking during
pregnancy and potential impacts on fetal growth. BioRxiv, 663567.
François, O., Caye, K. (2018) Naturalgwas: An R package for evaluating genome-wide
association methods with empirical data. Mol. Ecol. Resour., 18(4), 789-797.
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R. (2007) Pathwise coordinate
optimization. Ann. Appl. Stat., 1, 302-332.
Friedman, J., Hastie, T., Tibshirani, R. (2010) Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33.
Frichot, E., Schoville, S. D., Bouchard, G. and François, O. (2013) Testing for associations between loci and environmental gradients using latent factor mixed models. Mol. Biol. Evol., 30, 1687-1699.
Frichot, E. and François, O. (2015) LEA: an R package for landscape and ecological association studies. Methods Ecol. Evol., 6, 925-929.
Gautier, M. (2015) Genome-wide scan for adaptive divergence and association with population-specific covariates. Genetics, 201, 1555-1579.
Halko, N., Martinsson, P. G. and Tropp, J. A. (2011) Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53, 217-288.
Hanzawa, Y., Takahashi, T., and Komeda, Y. (1997) ACL5: an Arabidopsis gene required for internodal elongation after flowering. Plant J., 12, 863-874.
Hastie, T., Tibshirani, R., and Friedman, J. (2009) The Elements of Statistical Learning. Springer Series in Statistics, Springer, NY, USA.
Heude, B., Forhan, A., Slama, R., Douhaud, L., Bedel, S., Saurel-Cubizolles, M.-J., Hankard, R., Thiebaugeorges, O., De Agostini, M., Annesi-Maesano, I., Kaminski, M., and Charles, M.-A. (2016) Cohort Profile: The EDEN mother-child cohort on the prenatal and early postnatal determinants of child health and development. Int. J. Epidemiol., 45, 353-363.
Hoggart, C.J., Whittaker, J.C., Iorio, M.D., and Balding, D.J. (2008) Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet., 4, e1000130.
Houseman, E.A., Kile, M.L., Christiani, D.C., Ince, T.A., Kelsey, K.T., and Marsit, C.J. (2016) Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics, 17.
Jaffe, A. E. and Irizarry, R. A. (2014) Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol., 15, R3.
Kalaitzis, A.A., and Lawrence, N.D. (2012) Residual component analysis: Generalising PCA for more flexible inference in linear-Gaussian models. Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 1, 209-216.
Kaushal, A. et al. (2017) Comparison of different cell type correction methods for genome-scale epigenetics studies. BMC Bioinformatics, 18, 216.
Leek, J. T., and Storey, J. D. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet., 3, e161.
Lumley, J., Chamberlain, C., Dowswell, T., Oliver, S., Oakley, L., and Watson, L. (2009) Interventions for promoting smoking cessation during pregnancy. Cochrane Database of Systematic Reviews 2009, Issue 3. Art. No.: CD001055.
Mishra, B., Meyer, G., Bach, F., and Sepulchre, R. (2013) Low-rank optimization with trace norm penalty. SIAM J. Optim., 23, 2124-2149.
Morales, E., Vilahur, N., Salas, L.A., Motta, V., Fernandez, M.F., Murcia, M., Llop, S., Tardon, A., Fernandez-Tardon, G., Santa-Marina, L., Gallastegui, M., Bollati, V., Estivill, X., Olea, N., Sunyer, J., Bustamante, M. (2016) Genome-wide DNA methylation study in human placenta identifies novel loci associated with maternal smoking during pregnancy. Int. J. Epidemiol., 45, 1644-1655.
Nishimura, N., Tsuchiya, W., Moresco, J.J., Hayashi, Y., Satoh, K., Kaiwa, N., Irisa, T., Kinoshita, T., Schroeder, J.I., Yates, J.R., Hirayama, T., Yamazaki, T. (2018) Control of seed dormancy and germination by DOG1-AHG1 PP2C phosphatase complex via binding to heme. Nat. Commun., 9.
Price, A.L. et al. (2006) Principal component analysis corrects for stratification in genome-wide association studies. Nat. Genet., 38, 904-909.
Rakyan, V. K., Down, T. A., Balding, D. J. and Beck, S. (2011) Epigenome-wide association studies for common human diseases. Nat. Rev. Genet., 12, 529-541.
Rellstab, C., Gugerli, F., Eckert, A. J., Hancock, A. M., and Holderegger, R. (2015) A practical guide to environmental association analysis in landscape genomics. Mol. Ecol., 24, 4348-4370.
Rousseaux, S., Seyve, E., Chuffart, F., Bourova-Flin, E., Benmerad, M., et al. (2019) Maternal exposure to cigarette smoking induces immediate and durable changes in placental DNA methylation affecting enhancer and imprinting control regions. BioRxiv, 852186.
Sayin, N., Kara, N., Pekel, G., and Altinkaynak, H. (2014) Effects of chronic smoking on central corneal thickness, endothelial cell, and dry eye parameters. Cutan. Ocul. Toxicol., 33, 201-205.
Sheldon, C.C., Rouse, D.T., Finnegan, E.J., Peacock, W.J., Dennis, E.S. (2000) The molecular basis of vernalization: The central role of Flowering Locus C (FLC). Plant Biol., 97, 6.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B, 58, 267-288.
The BIOS Consortium, van Iterson, M., van Zwet, E.W., Heijmans (2017) Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. Genome Biol., 18, 19.
Tseng, P. (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theor. Appl., 109, 475-494.
Wang, J., Zhao, Q., Hastie, T., Owen, A.B. (2017) Confounder adjustment in multiple hypothesis testing. Ann. Statist., 45, 1863-1894.
Witten, D. M., Tibshirani, R., and Hastie, T. (2009) A penalized matrix decomposition with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515-534.
Wu, T.T., Chen, Y.F., Hastie, T., Sobel, E., and Lange, K. (2009) Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25, 714-721.
Yu, J., Pressoir, G., Briggs, W.H., Bi, I.V., Yamasaki, M., et al. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics, 38, 203-208.
Zeng, P., Zhou, X., Huang, S. (2017) Prediction of gene expression with cis-SNPs using mixed models and regularization methods. BMC Genomics, 18.
Zhou, X., Stephens, M. (2012) Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44, 821.
Zhou, X., Carbonetto, P., and Stephens, M. (2013) Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics, 9(2), e1003264.
Zou, H., Hastie, T., and Tibshirani, R. (2006) Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265-286.
Appendix: Proofs of theorems
This section provides mathematical proofs for the theorems stated in section 2.
Theorem 1. Let $\mu > 0$ and $\gamma > 0$. Then the block-coordinate descent algorithm cycling through Step 1 and Step 2 converges to estimates of $\mathbf{W}$ and $\mathbf{B}$ defining a global minimum of the penalized loss function $L_{\rm sparse}(\mathbf{W},\mathbf{B})$.
Proof. The proof arguments are based on a result of Tseng (2001). Consider the Cartesian product of closed convex sets $A = A_1 \times A_2 \times \cdots \times A_m$, and let $f(z)$ be a continuous convex function defined on $A$ and such that
$$f(z_1, \dots, z_m) = g(z_1, \dots, z_m) + \sum_{i=1}^{m} f_i(z_i),$$
where $g(z)$ is a differentiable convex function, and for each $i = 1, \dots, m$, $f_i(z_i)$ is a continuous convex function. Let $(z^t)$ be the sequence of values defined by the following block-coordinate descent algorithm
$$z_i^{t+1} \in \arg\min_{\zeta \in A_i} f(z_1^t, \dots, z_{i-1}^t, \zeta, z_{i+1}^t, \dots, z_m^t), \quad i = 1, \dots, m. \tag{15}$$
Then a limit point of the sequence $(z^t)$ defines a global minimum of the function $f(z)$.
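The theorem’s proof is a consequence of the convexity of the penalized loss function $L_{\rm sparse}(\mathbf{W},\mathbf{B})$, and the fact that we can write
$$L_{\rm sparse}(\mathbf{B},\mathbf{W}) = g(\mathbf{B},\mathbf{W})/2 + f_1(\mathbf{B}) + f_2(\mathbf{W}),$$
where $g(\mathbf{B},\mathbf{W}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2$ is a differentiable convex function, and $f_1(\mathbf{B}) = \mu\|\mathbf{B}\|_1$, $f_2(\mathbf{W}) = \gamma\|\mathbf{W}\|_*$ are continuous convex functions. Tseng’s result can be applied with the function $f(\mathbf{B},\mathbf{W}) = L_{\rm sparse}(\mathbf{B},\mathbf{W})$ to conclude the proof (see also Bertsekas, 1999).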
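The alternating scheme covered by Theorem 1 can be illustrated numerically. The Python code below is a minimal sketch, not the authors' implementation. It assumes that the penalties are $\mu\|\mathbf{B}\|_1$ and $\gamma\|\mathbf{W}\|_*$, that $\mathbf{X}$ has orthonormal columns (so that the block update for $\mathbf{B}$ reduces to entrywise soft-thresholding), and that the block update for $\mathbf{W}$ is singular value soft-thresholding in the sense of Cai et al. (2010). The two updates are stand-ins for Step 1 and Step 2; all function names and the simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Entrywise soft-thresholding: sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def sparse_loss(Y, X, B, W, mu, gamma):
    """Assumed loss: 0.5*||Y - W - X B^T||_F^2 + mu*||B||_1 + gamma*||W||_*."""
    resid = Y - W - X @ B.T
    nuclear = np.linalg.svd(W, compute_uv=False).sum()
    return 0.5 * np.sum(resid ** 2) + mu * np.abs(B).sum() + gamma * nuclear

def block_coordinate_descent(Y, X, mu, gamma, n_iter=30):
    """Alternate exact minimization over B and W (X^T X = I is assumed)."""
    n, p = Y.shape
    B = np.zeros((p, X.shape[1]))
    W = np.zeros((n, p))
    losses = [sparse_loss(Y, X, B, W, mu, gamma)]
    for _ in range(n_iter):
        # B block: least-squares coefficients, soft-thresholded at mu.
        B = soft_threshold((Y - W).T @ X, mu)
        # W block: soft-threshold the singular values of Y - X B^T at gamma.
        U, s, Vt = np.linalg.svd(Y - X @ B.T, full_matrices=False)
        W = (U * soft_threshold(s, gamma)) @ Vt
        losses.append(sparse_loss(Y, X, B, W, mu, gamma))
    return B, W, losses

# Hypothetical toy data: n = 30 individuals, p = 40 markers, d = 1 variable.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((30, 1)))  # one orthonormal column
Y = rng.standard_normal((30, 40))
B_hat, W_hat, losses = block_coordinate_descent(Y, X, mu=0.5, gamma=2.0)
```

Because each block update is an exact minimizer of the loss with the other block held fixed, the recorded values in `losses` should never increase from one iteration to the next.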
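Theorem 2. Let $\lambda > 0$ and assume $\sigma_i^2 > 0$ for all $i = 1, \dots, d$. The estimates $\hat{\mathbf{W}}$ and $\hat{\mathbf{B}}$ computed as follows
$$\hat{\mathbf{W}} = \mathbf{Q}\mathbf{D}_\lambda^{-1}\,\mathrm{svd}_K(\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}) \tag{16}$$
$$\hat{\mathbf{B}}^T = (\mathbf{X}^T\mathbf{X} + \lambda\,\mathrm{Id}_d)^{-1}\mathbf{X}^T(\mathbf{Y} - \hat{\mathbf{W}}), \tag{17}$$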
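where $\mathrm{svd}_K(\mathbf{A})$ is the rank $K$ SVD of the matrix $\mathbf{A}$, $\mathrm{Id}_d$ is the $d \times d$ identity matrix, and $\mathbf{D}_\lambda$ is the $n \times n$ diagonal matrix with coefficients defined as
$$d_\lambda = \left(\sqrt{\frac{\lambda}{\lambda + \sigma_1^2}}, \dots, \sqrt{\frac{\lambda}{\lambda + \sigma_d^2}}, 1, \dots, 1\right),$$
define a global minimum of the penalized loss function $L_{\rm ridge}(\mathbf{B},\mathbf{W})$.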
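Proof. Given $\mathbf{W}$, a global minimum for $L_{\rm ridge}(\mathbf{B},\mathbf{W})$ is obtained with the ridge estimates for a linear regression of the response matrix $\mathbf{Y} - \mathbf{W}$ on $\mathbf{X}$:
$$\hat{\mathbf{B}}^T = (\mathbf{X}^T\mathbf{X} + \lambda\,\mathrm{Id}_d)^{-1}\mathbf{X}^T(\mathbf{Y} - \mathbf{W}). \tag{18}$$
Thus, the problem amounts to minimizing the function $L(\mathbf{W}) = L_{\rm ridge}(\hat{\mathbf{B}},\mathbf{W})$ with respect to $\mathbf{W}$. By definition of the $\mathbf{D}_\lambda$ and $\mathbf{Q}$ matrices, the loss function can be rewritten as
$$L(\mathbf{W}) = \left\|\mathbf{D}_\lambda\mathbf{Q}^T(\mathbf{Y} - \mathbf{W})\right\|_F^2. \tag{19}$$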
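Identity (19) can be checked directly. The derivation below is a sketch under two assumptions that are not restated in this appendix: the ridge loss is taken to be $L_{\rm ridge}(\mathbf{B},\mathbf{W}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2 + \lambda\|\mathbf{B}\|_F^2$ (up to a constant factor that does not affect the minimizer), and $\mathbf{Q}$ is the $n \times n$ orthogonal matrix of left singular vectors $\mathbf{q}_1, \dots, \mathbf{q}_n$ of $\mathbf{X}$, with singular values $\sigma_1, \dots, \sigma_d$. Writing $\mathbf{E} = \mathbf{Y} - \mathbf{W}$ and substituting (18),

$$
\begin{aligned}
\mathbf{E} - \mathbf{X}\hat{\mathbf{B}}^T &= \mathbf{Q}\,\mathrm{diag}\!\left(\tfrac{\lambda}{\lambda+\sigma_1^2}, \dots, \tfrac{\lambda}{\lambda+\sigma_d^2}, 1, \dots, 1\right)\mathbf{Q}^T\mathbf{E},\\
\lambda\|\hat{\mathbf{B}}\|_F^2 &= \sum_{i=1}^{d}\frac{\lambda\sigma_i^2}{(\lambda+\sigma_i^2)^2}\,\|\mathbf{q}_i^T\mathbf{E}\|^2,\\
L(\mathbf{W}) &= \sum_{i=1}^{d}\frac{\lambda^2+\lambda\sigma_i^2}{(\lambda+\sigma_i^2)^2}\,\|\mathbf{q}_i^T\mathbf{E}\|^2 + \sum_{i=d+1}^{n}\|\mathbf{q}_i^T\mathbf{E}\|^2
= \sum_{i=1}^{d}\frac{\lambda}{\lambda+\sigma_i^2}\,\|\mathbf{q}_i^T\mathbf{E}\|^2 + \sum_{i=d+1}^{n}\|\mathbf{q}_i^T\mathbf{E}\|^2
= \left\|\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{E}\right\|_F^2 .
\end{aligned}
$$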
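Since $\lambda > 0$, the matrix $\mathbf{D}_\lambda\mathbf{Q}^T$ is invertible, so $\mathbf{W}$ has rank $K$ if and only if $\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{W}$ has rank $K$. Minimizing the above loss function is therefore equivalent to finding the best approximation of rank $K$ for the matrix $\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}$. According to Eckart and Young (1936), this approximation is given by the rank $K$ singular value decomposition of $\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}$. Finally, we obtain that
$$\hat{\mathbf{W}} = \mathbf{Q}\mathbf{D}_\lambda^{-1}\,\mathrm{svd}_K(\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}) \tag{20}$$
defines the unique global minimum of the $L(\mathbf{W})$ function.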
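The closed-form estimates of Theorem 2 can also be checked on simulated data. The Python code below is a minimal sketch under stated assumptions rather than the authors' implementation: the ridge loss is taken to be $\|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2 + \lambda\|\mathbf{B}\|_F^2$, $\mathbf{Q}$ is obtained from a full SVD of $\mathbf{X}$ with $n > d$, and the function names and toy dimensions are hypothetical. It evaluates both sides of identity (19) at an arbitrary $\mathbf{W}$ and compares the loss at $\hat{\mathbf{W}}$ with the loss at a random competitor of rank $K$.

```python
import numpy as np

def rank_k_svd(A, K):
    """Best rank-K approximation of A (Eckart and Young, 1936)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :K] * s[:K]) @ Vt[:K]

def ridge_B(R, X, lam):
    """Ridge coefficients (p x d) for the regression of R on X, as in (18)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ R).T

def ridge_loss(Y, X, B, W, lam):
    """Assumed loss: ||Y - W - X B^T||_F^2 + lam * ||B||_F^2."""
    return np.sum((Y - W - X @ B.T) ** 2) + lam * np.sum(B ** 2)

def ridge_estimates(Y, X, lam, K):
    """Closed-form estimates (16)-(17), assuming n > d."""
    n, d = X.shape
    Q, sigma, _ = np.linalg.svd(X, full_matrices=True)  # Q is n x n
    d_lam = np.ones(n)
    d_lam[:d] = np.sqrt(lam / (lam + sigma ** 2))
    # Truncate D_lam Q^T Y to rank K, then map back with Q D_lam^{-1}.
    W_hat = Q @ (rank_k_svd(d_lam[:, None] * (Q.T @ Y), K) / d_lam[:, None])
    B_hat = ridge_B(Y - W_hat, X, lam)
    return W_hat, B_hat, Q, d_lam

# Hypothetical toy data.
rng = np.random.default_rng(1)
n, p, d, K, lam = 20, 50, 2, 3, 0.1
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, p))
W_hat, B_hat, Q, d_lam = ridge_estimates(Y, X, lam, K)

# Both sides of identity (19) at an arbitrary matrix W0.
W0 = rng.standard_normal((n, p))
lhs = ridge_loss(Y, X, ridge_B(Y - W0, X, lam), W0, lam)
rhs = np.sum((d_lam[:, None] * (Q.T @ (Y - W0))) ** 2)

# Loss at W_hat versus loss at a random rank-K competitor.
W_alt = rng.standard_normal((n, K)) @ rng.standard_normal((K, p))
loss_hat = ridge_loss(Y, X, B_hat, W_hat, lam)
loss_alt = ridge_loss(Y, X, ridge_B(Y - W_alt, X, lam), W_alt, lam)
```

Under these assumptions, `lhs` and `rhs` should agree up to rounding error, and `loss_hat` should not exceed `loss_alt`, whichever competitor of rank $K$ is drawn.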
Supplementary materials
Figure S1. Generative model simulations (RMSE for causal markers only). Root Mean Square Error (RMSE) of causal effect sizes as a function of the effect size of the causal markers and of the confounding intensity. Two sparse methods (sparse LFMM, LASSO) and three non-sparse methods (ridge LFMM, CATE and SVA) were compared. Simulation parameters: (A) Lower effect sizes and confounding intensities. (B) Lower effect sizes and higher confounding intensities. (C) Higher effect sizes and lower confounding intensities. (D) Higher effect sizes and confounding intensities.
Figure S2. Comparison of the runtimes of three methods. Runtimes as a function of the number of markers (p) and the number of individuals (n). (A) p = 1000. (B) p = 10,000. (C) p = 100,000.
Figure S3. Estimation of the number of latent factors (K) in the generative simulations. Difference between the true K of the simulations and the K estimated by our cross-validation algorithm.
Figure S4. GWAS of a flowering trait with sparse LFMM, LASSO and BSLMM. Venn diagram of SNPs associated with the FT16 phenotype in each approach. The hits correspond to SNPs having non-null effect size estimates.
Figure S5. DNA methylation EWAS of smoking status in pregnant women (all chromosomes). (A) Estimated reverse effect size for LASSO. (B) Estimated effect size for sparse LFMM. (C) Estimated effect size for non-sparse methods (ridge LFMM, CATE and SVA).
Figure S6. EWAS of smoking status in women. Over-representation of enhancer regions in sparse LFMM candidate regions compared to the methylome. Blue bars correspond to the fraction of enhancer regions in each chromosome. Red bars correspond to the fraction of enhancer regions detected by sparse LFMM. The horizontal blue line represents the average number of enhancer regions per chromosome for the methylome. The red line represents the average number for sparse LFMM.
Figure S7