
Sparse latent factor regression models for genome-wide and epigenome-wide association studies

Basile Jumentier1,3, Kevin Caye1, Barbara Heude2, Johanna Lepeule3,*, Olivier François1,*

Authors’ affiliations:

1 Université Grenoble-Alpes, Centre National de la Recherche Scientifique, Grenoble INP, TIMC-IMAG CNRS UMR 5525, 38000 Grenoble, France.

2 Université de Paris, CRESS, Inserm, INRAE, F-75004 Paris, France.

3 Université Grenoble-Alpes, Centre National de la Recherche Scientifique, Institut National de la Santé et de la Recherche Médicale, Institute for Advanced Biosciences, INSERM U 1209, CNRS UMR 5309, 38000 Grenoble, France.

* Corresponding authors: [email protected], [email protected]

bioRxiv preprint doi: https://doi.org/10.1101/2020.02.07.938381; this version posted February 7, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license (http://creativecommons.org/licenses/by-nc/4.0/).

Abstract

Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to remove variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. In this study, we introduced least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. Computer simulations provided evidence that sparse latent factor regression models achieve higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator (LASSO) and a Bayesian sparse linear mixed model (BSLMM). Additional simulations based on real data showed that sparse latent factor regression models were more robust to departure from the generative model than non-sparse approaches, such as surrogate variable analysis (SVA) and other methods. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while avoiding multiple testing problems. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.


1 Introduction

Association studies represent one of the most powerful tools to identify genomic variation correlating with disease states, exposure levels or phenotypes. Those studies are divided into several categories according to the nature of the genomic markers evaluated. For example, genome-wide association studies (GWAS) focus on single-nucleotide polymorphisms in different individuals to estimate disease allele effects (Balding, 2006), while epigenome-wide association studies (EWAS) measure epigenetic marks, such as DNA methylation levels, to derive associations between epigenetic variation and exposure levels or to assess effects on phenotypic traits (Rakyan et al., 2011). Despite their success in identifying the genetic architecture of phenotypic traits or genomic targets of exposure, association studies are plagued with the problem of confounding, which arises when unobserved variables correlate with the variable of interest and with the genomic markers simultaneously (Wang et al., 2017).

Historical approaches to the confounding issue remove hidden confounders by considering corrections for inflation (Devlin and Roeder, 1999) and empirical null-hypothesis testing methods (Efron, 2004). Alternative approaches evaluate hidden confounders by using linear combinations of observed variables, often called factors. In GWAS, a frequently used factor approach consists of computing the largest principal components of the genotype matrix and including them as covariates in linear regression models (Price et al., 2006). The variable of interest may, however, be collinear with the largest principal components, and removing their effects can result in loss of statistical power. To increase power, methods based on latent factor regression models have been proposed (Leek and Storey, 2007; Carvalho et al., 2008). Latent factor regression models employ deconvolution methods in which unobserved variables, including batch effects, individual ancestry or tissue cell-type composition, are integrated in the regression model by using latent factors. In these models, effect sizes and latent factors are estimated jointly. The latent factor regression framework encompasses several methods, which include surrogate variable analysis (SVA, Leek and Storey (2007)), latent factor mixed models (LFMM, Frichot et al. (2013)), residual principal component analysis (Kalaitzis and Lawrence, 2012), and confounder adjusted testing and estimation (CATE, Wang et al. (2017)). Each method has specific merits relative to some category of association study, and the performances of the methods have been extensively debated in recent surveys (for example, see Kaushal et al. (2017)).

A property of many latent factor regression models is to use regularization parameters inducing constraints on effect size estimates. Among those methods, sparse regression models suppose that a relatively small proportion of all genomic variables correlate with the variable of interest or affect the phenotype, and evaluate associations while avoiding multiple testing problems (Tibshirani, 1996; Hoggart et al., 2008; Wu et al., 2009). Sparse regression models have been coupled with linear mixed models to combine the benefits of both for polygenic trait studies in the Bayesian sparse linear mixed model (BSLMM) (Zhou et al., 2012, 2013). Sparse regression models can include confounding factors, but these are usually estimated separately from effect sizes. In this study, we introduce least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. We estimate effect sizes based on regularized least-squares methods with L1 and nuclear norm penalties. Thus our method allows identifying non-null effect sizes without the use of multiple statistical tests. We refer to our models as sparse latent factor mixed models or sparse LFMM. We present estimation algorithms for sparse LFMM and theoretical results in the next section. Then we compare the performances of sparse LFMM with other sparse regression models (LASSO, BSLMM), and with non-sparse regression models (SVA, CATE, LFMM). To illustrate our approach, we used sparse LFMM to perform a GWAS of flowering time for the plant Arabidopsis thaliana and to perform an epigenome-wide association study (EWAS) of smoking status in pregnant women.

2 Latent factor regression models

2.1 Models

Latent factor regression models evaluate associations between the elements of a response matrix, Y, and variables of interest, called primary variables, X, measured for n individuals. The response matrix records p markers, which can represent any type of omic data (genotypes, DNA methylation, etc.), collected for the individuals. The X matrix can also incorporate nuisance variables such as observed confounders (age, sex, etc.), and its dimension is n × d, where d represents the total number of primary and nuisance variables. Latent factor regression models are regression models combining fixed and latent effects as follows

$$\mathbf{Y} = \mathbf{X}\mathbf{B}^T + \mathbf{W} + \mathbf{E}. \qquad (1)$$

Fixed effect sizes are recorded in the B matrix, which has dimension p × d. The E matrix represents residual errors, and has the same dimension as the response matrix. The matrix W is a latent matrix of rank K, defined by K latent factors (Leek and Storey, 2007; Frichot et al., 2013; Wang et al., 2017). The value of K is unknown, and it is generally determined by model choice or cross-validation procedures. The K latent factors, U, are defined from the singular value decomposition of the latent matrix

$$\mathbf{W} = \mathbf{U}\mathbf{V}^T,$$

where U is an n × K matrix of factors and V is a p × K matrix of loadings (Eckart and Young, 1936). The matrices U and V are unique up to a change of sign.
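As an illustration of the model in equation (1), the following minimal sketch draws a data set with a sparse effect size matrix and a rank-K latent matrix. The function name, the default dimensions, the Gaussian distributions and the `rho` parameter (which makes the first latent factor correlate with the first primary variable) are assumptions made for this example; they are not the simulation settings used in the study.

```python
import numpy as np

def simulate_latent_factor_data(n=100, p=500, d=1, K=3, n_causal=20,
                                rho=0.5, sigma=1.0, seed=0):
    """Draw (Y, X, B, W) from Y = X B^T + W + E with W = U V^T of rank K.

    All distributional choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))            # primary (and nuisance) variables
    U = rng.normal(size=(n, K))            # latent factors, n x K
    U[:, 0] += rho * X[:, 0]               # confounding: factor 1 correlates with X
    V = rng.normal(size=(p, K))            # loadings, p x K
    B = np.zeros((p, d))                   # sparse effect sizes, p x d
    causal = rng.choice(p, size=n_causal, replace=False)
    B[causal, :] = rng.normal(size=(n_causal, d))
    E = rng.normal(scale=sigma, size=(n, p))
    W = U @ V.T                            # latent matrix, n x p
    Y = X @ B.T + W + E
    return Y, X, B, W

Y, X, B, W = simulate_latent_factor_data()
```

The returned `B` has non-zero rows only at the markers drawn as causal, and `W` has rank K by construction.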

Naive statistical estimates for the B and W matrices in equation (1) could be obtained through the minimization of a classical least-squares loss function

$$L(\mathbf{B}, \mathbf{W}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2 , \qquad (2)$$

where $\|\cdot\|_F$ is the Frobenius matrix norm. A minimum value of the loss function is attained when W is computed as the rank K singular value decomposition of Y. In this case, the B matrix can be obtained from the estimates of a linear regression of the residual matrix (Y − W) on X. To motivate the introduction of regularization terms in the loss function, we remark that the interpretation of latent factors obtained from this solution as confounder estimates may be incorrect, because it fails to include any information on the primary variable, X. Assuming that latent factors are computed only from the response matrix contradicts the definition of confounding variables (Wang et al., 2017). In addition, the definition is problematic, because it does not lead to a unique minimum of the loss function. To see this, consider any matrix P with dimensions d × K and check that

$$\|\mathbf{Y} - (\mathbf{U} + \mathbf{X}\mathbf{P})\mathbf{V}^T - \mathbf{X}(\mathbf{B}^T - \mathbf{P}\mathbf{V}^T)\|_F^2 = \|\mathbf{Y} - \mathbf{U}\mathbf{V}^T - \mathbf{X}\mathbf{B}^T\|_F^2 .$$

As a consequence, $\mathbf{B}$ and $(\mathbf{B} - \mathbf{V}\mathbf{P}^T)$ correspond to valid minima, and there is an infinite space of possible solutions. To conclude, the loss function needs to be modified in order to warrant dependency of W on both Y and X, and to enable the computation of well-defined solutions.
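The non-uniqueness argument can be checked numerically. The sketch below, which uses arbitrary random matrices and hypothetical variable names, shifts the factors by XP and the effect sizes by −VP^T for a d × K matrix P, and verifies that the unpenalized least-squares loss is unchanged even though the effect sizes differ.

```python
import numpy as np

def ls_loss(Y, X, B, W):
    """Unpenalized least-squares loss ||Y - W - X B^T||_F^2."""
    return np.linalg.norm(Y - W - X @ B.T, "fro") ** 2

rng = np.random.default_rng(0)
n, p, d, K = 20, 30, 2, 3
Y = rng.normal(size=(n, p))
X = rng.normal(size=(n, d))
B = rng.normal(size=(p, d))
U = rng.normal(size=(n, K))
V = rng.normal(size=(p, K))
P = rng.normal(size=(d, K))       # arbitrary d x K matrix

loss_original = ls_loss(Y, X, B, U @ V.T)
# Shift the factors by X P and compensate in the effect sizes.
U_shifted = U + X @ P
B_shifted = B - V @ P.T
loss_shifted = ls_loss(Y, X, B_shifted, U_shifted @ V.T)
```

Both losses agree up to floating-point error, while `B` and `B_shifted` are different matrices, which is the identifiability problem that the penalties are meant to remove.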

2.2 Sparse estimation algorithms

L1-regularized least-squares problem. To solve the problems outlined in the above section, a sparse regularization approach is considered. This approach introduces penalties based on the L1 norm of the regression coefficients and on the nuclear norm of the latent matrix

$$L_{\mathrm{sparse}}(\mathbf{W}, \mathbf{B}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2 + \mu\|\mathbf{B}\|_1 + \gamma\|\mathbf{W}\|_* , \quad \mu, \gamma > 0, \qquad (3)$$

where $\|\mathbf{B}\|_1$ denotes the L1 norm of B, µ is an L1 regularization parameter, W is the latent matrix, $\|\mathbf{W}\|_*$ denotes its nuclear norm, and γ is a regularization parameter for the nuclear norm. The L1 penalty induces sparsity on the fixed effects (Tibshirani, 1996), and corresponds to the prior information that not all response variables may be associated with the primary variables. More specifically, the prior implies that a restricted number of rows of the effect size matrix B are non-zero. The second regularization term is based on the nuclear norm, and it is introduced to penalize large numbers of latent factors. With these penalty terms, $L_{\mathrm{sparse}}(\mathbf{W}, \mathbf{B})$ is a convex function, and convex minimization algorithms can be applied to obtain estimates of B and W (Mishra et al., 2013).
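For concreteness, the penalized loss of equation (3) can be evaluated as in the following sketch. The function name `sparse_loss` is hypothetical; the nuclear norm is computed with NumPy's `np.linalg.norm(W, "nuc")`, which sums the singular values of W.

```python
import numpy as np

def sparse_loss(Y, X, B, W, mu, gamma):
    """Evaluate ||Y - W - X B^T||_F^2 + mu * ||B||_1 + gamma * ||W||_*."""
    fit = np.linalg.norm(Y - W - X @ B.T, "fro") ** 2
    l1_penalty = np.abs(B).sum()                 # entrywise L1 norm of B
    nuclear_penalty = np.linalg.norm(W, "nuc")   # sum of singular values of W
    return fit + mu * l1_penalty + gamma * nuclear_penalty

# Tiny worked example with n = p = d = 2.
Y = np.zeros((2, 2))
X = np.eye(2)
B = np.array([[1.0, -2.0], [0.0, 0.0]])
W = np.diag([3.0, 4.0])
value = sparse_loss(Y, X, B, W, mu=2.0, gamma=1.0)
```

In this toy case the squared Frobenius term is 36, the L1 norm of B is 3 and the nuclear norm of W is 7, so the loss equals 36 + 2 × 3 + 1 × 7 = 49.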

Sparse latent factor mixed model algorithm. To simplify the description of the estimation algorithm, let us assume that the explanatory variables, X, are scaled so that $\mathbf{X}^T\mathbf{X} = \mathbf{I}_d$, where $\mathbf{I}_d$ is the d × d identity matrix. Note that our program implementation is more general, and does not make this restrictive assumption. Here, it is introduced to explain the sparse LFMM algorithm with simplified notations. We developed a block-coordinate descent method for minimizing the convex loss function $L_{\mathrm{sparse}}(\mathbf{W}, \mathbf{B})$ with respect to B and W. The algorithm is initialized from the null matrix $\hat{\mathbf{W}}_0 = 0$, and iterates the following steps.

1. Find $\hat{\mathbf{B}}_t$, a minimum of the penalized loss function

$$L^{(1)}_{\mathrm{sparse}}(\mathbf{B}) = \|(\mathbf{Y} - \hat{\mathbf{W}}_{t-1}) - \mathbf{X}\mathbf{B}^T\|_F^2 + \mu\|\mathbf{B}\|_1 . \qquad (4)$$

2. Find $\hat{\mathbf{W}}_t$, a minimum of the penalized loss function

$$L^{(2)}_{\mathrm{sparse}}(\mathbf{W}) = \|(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_t^T) - \mathbf{W}\|_F^2 + \gamma\|\mathbf{W}\|_* . \qquad (5)$$

The algorithm cycles through the two steps until a convergence criterion is met or the allocated computing resource is depleted. Each minimization step has a well-defined and unique solution. To see this, note that Step 1 corresponds to an L1-regularized regression of the residual matrix $(\mathbf{Y} - \hat{\mathbf{W}}_{t-1})$ on the explanatory variables. To compute the regression coefficients, we used the Friedman block-coordinate descent method (Friedman et al., 2007). According to Tibshirani (1996), we obtained

$$\hat{\mathbf{B}}_t = \mathrm{sign}(\bar{\mathbf{B}}_t)\,(|\bar{\mathbf{B}}_t| - \mu)_+ , \qquad (6)$$

where $s_+ = \max(0, s)$, $\mathrm{sign}(s)$ is the sign of $s$, both operations are applied entrywise, and $\bar{\mathbf{B}}_t$ is the linear regression estimate, $\bar{\mathbf{B}}_t^T = \mathbf{X}^T(\mathbf{Y} - \hat{\mathbf{W}}_{t-1})$. Step 2 consists of finding a low rank approximation of the residual matrix $\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_t^T$ (Cai et al., 2008). This approximation starts with a singular value decomposition (SVD) of the residual matrix, $\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}_t^T = \mathbf{M}\mathbf{S}\mathbf{N}^T$, with M a unitary matrix of dimension n × n, N a unitary matrix of dimension p × p, and S the matrix of singular values $(s_j)_{j=1,\ldots,n}$. Then, we obtain

$$\hat{\mathbf{W}}_t = \mathbf{M}\bar{\mathbf{S}}\mathbf{N}^T , \qquad (7)$$

where $\bar{\mathbf{S}}$ is the diagonal matrix with diagonal terms $\bar{s}_j = (s_j - \gamma)_+$, $j = 1, \ldots, n$. Building on results from Tseng (2001), the following statement holds.

Theorem 1. Let µ > 0 and γ > 0. Then the block-coordinate descent algorithm cycling through Step 1 and Step 2 converges to estimates of W and B defining a global minimum of the penalized loss function $L_{\mathrm{sparse}}(\mathbf{W}, \mathbf{B})$.

Note that the algorithmic complexities of Step 1 and Step 2 are bounded by a term of order O(pn + K(p + n)). The computing time of sparse LFMM estimates is generally longer than for the CATE algorithm (Wang et al., 2017) or the ridge LFMM algorithm detailed below (Caye et al., 2019). Sparse LFMM needs to perform SVD and projections several times until convergence, while CATE and ridge LFMM require a single iteration.
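The two-step iteration can be sketched as follows under the simplifying assumption $\mathbf{X}^T\mathbf{X} = \mathbf{I}_d$. This is a minimal illustration, not the authors' implementation: the function names, the stopping rule and the use of a full (non-randomized) SVD are assumptions. The thresholds µ and γ are applied as written in equations (6) and (7); with these thresholds, each update is the exact minimizer of its block when the squared Frobenius term carries a factor 1/2 (without that factor, the exact thresholds would be µ/2 and γ/2).

```python
import numpy as np

def soft_threshold(A, t):
    """Entrywise soft-thresholding: sign(a) * max(|a| - t, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def singular_value_threshold(R, gamma):
    """Shrink the singular values of R by gamma and rebuild the matrix."""
    M, s, Nt = np.linalg.svd(R, full_matrices=False)
    return (M * np.maximum(s - gamma, 0.0)) @ Nt

def sparse_lfmm(Y, X, mu, gamma, n_iter=200, tol=1e-8):
    """Block-coordinate descent sketch, assuming X^T X = I_d."""
    n, p = Y.shape
    d = X.shape[1]
    W = np.zeros((n, p))                      # initialization W_0 = 0
    B = np.zeros((p, d))
    for _ in range(n_iter):
        B_bar = (Y - W).T @ X                 # least-squares estimate, p x d
        B_new = soft_threshold(B_bar, mu)     # Step 1, equation (6)
        W_new = singular_value_threshold(Y - X @ B_new.T, gamma)  # Step 2, eq. (7)
        change = np.linalg.norm(B_new - B) + np.linalg.norm(W_new - W)
        B, W = B_new, W_new
        if change < tol:                      # hypothetical convergence criterion
            break
    return B, W
```

A usage example would orthonormalize the explanatory variables first, for instance with `X, _ = np.linalg.qr(X_raw)`, before calling `sparse_lfmm(Y, X, mu, gamma)`. Very large values of µ or γ return a null B or a null W respectively.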

2.3 Ridge regression algorithms

Caye et al. (2019) considered a related approach, referred to as ridge LFMM, where the statistical estimates of the parameter matrices B and W are computed after minimizing the loss function with L2 norm regularization defined as follows

$$L_{\mathrm{ridge}}(\mathbf{B}, \mathbf{W}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2 + \lambda\|\mathbf{B}\|_2^2 , \quad \lambda > 0, \qquad (8)$$

where $\|\cdot\|_F$ is the Frobenius norm, $\|\cdot\|_2$ is the L2 norm, and λ is a regularization parameter. The minimization algorithm starts with an SVD of the explanatory matrix, $\mathbf{X} = \mathbf{Q}\boldsymbol{\Sigma}\mathbf{R}^T$, where Q is an n × n unitary matrix, R is a d × d unitary matrix and Σ is an n × d matrix containing the singular values of X, denoted by $(\sigma_j)_{j=1,\ldots,d}$. The ridge estimates are computed as follows

$$\hat{\mathbf{W}} = \mathbf{Q}\mathbf{D}_\lambda^{-1}\,\mathrm{svd}_K(\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}) \qquad (9)$$

$$\hat{\mathbf{B}}^T = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_d)^{-1}\mathbf{X}^T(\mathbf{Y} - \hat{\mathbf{W}}), \qquad (10)$$

where $\mathrm{svd}_K(\mathbf{A})$ is the SVD of rank K of A, $\mathbf{I}_d$ is the d × d identity matrix, and $\mathbf{D}_\lambda$ is the n × n diagonal matrix with coefficients defined as

$$d_\lambda = \left(\sqrt{\frac{\lambda}{\lambda + \sigma_1^2}}, \ldots, \sqrt{\frac{\lambda}{\lambda + \sigma_d^2}}, 1, \ldots, 1\right).$$

For λ > 0, the solution of the regularized least-squares problem is unique (Caye et al., 2019), and the corresponding matrices are called the ridge estimates. For completeness, we provide a short proof for this result, stated in Caye et al. (2019), in the appendix. Using random projections to compute low rank approximations, the complexity of the ridge LFMM estimation algorithm is of order $O(n^2 p + np\log K)$ (Halko et al., 2011). For studies in which the number of samples, n, is much smaller than the number of response variables, p, computing times of ridge estimates are therefore shorter than those of sparse LFMM.
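Equations (9) and (10) translate into a few lines of linear algebra. The sketch below is an illustration only: the function name `ridge_lfmm` is hypothetical, it assumes n > d, and it uses a full SVD where the text mentions random projections, so it does not reach the stated complexity.

```python
import numpy as np

def ridge_lfmm(Y, X, K, lam):
    """Ridge estimates of (B, W) following equations (9) and (10); assumes n > d."""
    n, d = X.shape
    # SVD of the explanatory matrix: Q is n x n, sigma holds the d singular values.
    Q, sigma, _ = np.linalg.svd(X, full_matrices=True)
    d_lam = np.ones(n)
    d_lam[:d] = np.sqrt(lam / (lam + sigma ** 2))   # diagonal of D_lambda
    A = d_lam[:, None] * (Q.T @ Y)                  # D_lambda Q^T Y
    # Rank-K truncated SVD of A (full SVD here instead of random projections).
    M, s, Nt = np.linalg.svd(A, full_matrices=False)
    A_K = (M[:, :K] * s[:K]) @ Nt[:K]
    W = Q @ (A_K / d_lam[:, None])                  # Q D_lambda^{-1} svd_K(...)
    # Ridge regression of the residual matrix on X, equation (10).
    B = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ (Y - W)).T
    return B, W
```

The function returns B with dimension p × d and a latent matrix W of rank at most K, computed in a single pass without iteration.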

3 Results

Generative model experiments. In a first series of experiments, we compared sparse LFMM with LASSO and three non-sparse approaches (ridge LFMM, CATE, SVA). The data were simulated from the generative model defined in equation (1), and the performance of each algorithm was measured in four scenarios showing higher or lower effect sizes and confounding intensities (Figure 1). For all experiments, we computed statistical errors (RMSE) for the effect size estimates of each method (Figure 1). To provide a reference value for the RMSE, we measured the error made when all effect sizes were estimated as being null (“Zero” value or null-model error). The null-model error was equal to 0.069 in low effect size scenarios and equal to 0.135 in high effect size scenarios. A powerful method was expected to reach error levels lower than the null-model error. The RMSEs of sparse LFMM ranged from 0.055 to 0.092, less than those of the null-model. The RMSEs of LASSO were close to the ones of sparse LFMM in the low effect size scenarios. In contrast, non-sparse methods led to RMSEs higher than the null-model error, ranging between 0.13 and 0.26 for ridge LFMM and CATE, and rising up to 0.50 for SVA. For the effect sizes associated with causal markers, non-sparse methods reached lower RMSE values than those of sparse methods, ranging between 0.12 and 0.26 for ridge LFMM and CATE, and between 0.60 and 1.03 for sparse LFMM (Figure S1). Regarding precision and F-score, which is the harmonic mean of power and precision, the performances of all methods were higher in scenarios with higher effect size and lower confounding intensity. Sparse LFMM performed similarly to or worse than the LASSO when the size of the causal effects was small (Figure 2AB), but it reached higher F-scores for larger effect sizes (Figure 2CD). In those simulations, sparse LFMM obtained lower F-scores than ridge LFMM and CATE. The difference was substantial when the sizes of the causal effects were small (F ≈ 0.51 versus F ≈ 0.76, Figure 2AB), but the differences were small for the larger effect sizes (F ≈ 0.75, Figure 2CD). In all scenarios, sparse LFMM obtained better scores than SVA. In summary, sparse LFMM was associated with the smallest overall statistical error, but the estimates of effect size were biased more severely with this method than with non-sparse methods. Sparse LFMM was generally preferable to LASSO and SVA. Once non-null effect sizes are identified by sparse LFMM, a consensus strategy would use ridge LFMM or CATE for evaluating the effect sizes of the candidate markers.

Figure 1. Root Mean Square Error (RMSE) as a function of the effect size of causal markers and confounding intensity. Two sparse methods (sparse LFMM, LASSO) and three non-sparse methods (ridge LFMM, CATE and SVA) were compared. The “Zero” value corresponds to an RMSE obtained with all effect sizes set to zero (null-model error). Generative model simulation parameters: (A) Lower effect sizes and confounding intensities. (B) Lower effect sizes and higher confounding intensities. (C) Higher effect sizes and lower confounding intensities. (D) Higher effect sizes and confounding intensities.
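The evaluation criteria used in the generative model experiments (RMSE, null-model error, precision, power and F-score) can be computed as in the following sketch. The function names are hypothetical, and the rule that a marker is "called" when its estimated row of B is non-zero is an assumption that suits sparse estimates; non-sparse methods would need a separate selection rule.

```python
import numpy as np

def rmse(B_hat, B_true):
    """Root mean square error between estimated and true effect sizes."""
    return float(np.sqrt(np.mean((B_hat - B_true) ** 2)))

def support_scores(B_hat, B_true):
    """Precision, recall (power) and F-score of the set of non-null rows."""
    called = np.any(B_hat != 0, axis=1)      # markers with non-null estimates
    causal = np.any(B_true != 0, axis=1)     # truly causal markers
    true_pos = np.sum(called & causal)
    precision = true_pos / max(called.sum(), 1)
    recall = true_pos / max(causal.sum(), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return float(precision), float(recall), float(f_score)

B_true = np.array([[1.0], [2.0], [0.0], [0.0]])
B_hat = np.array([[0.5], [0.0], [0.3], [0.0]])
null_model_error = rmse(np.zeros_like(B_true), B_true)   # "Zero" reference value
scores = support_scores(B_hat, B_true)
```

In this toy example one of the two called markers is causal and one of the two causal markers is called, so precision, recall and F-score are all equal to 0.5.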

Empirical simulation experiments. In a second series of experiments, we used realistic simulations to compare sparse LFMM to other sparse and non-sparse methods. Simulations were based on 162 ecotypes of the model plant Arabidopsis thaliana using 53,859 SNP genotypes in chromosome 5. The simulations considered lower and higher effect sizes and gene by environment (G × E) interaction levels. Those simulations departed from generative model simulations, and they were introduced to evaluate the robustness of effect size estimates in each approach. In lower G × E interaction scenarios, sparse LFMM obtained the highest scores (F in (0.57, 0.60), precision in (0.81, 0.82), Figure 3AC) compared to BSLMM (F in (0.36, 0.44)), and to non-sparse methods (F ranging between 0.25 and 0.28). In higher G × E interaction scenarios, all methods obtained very low performances for the low effect size scenario, but sparse LFMM obtained among the highest F-scores and precisions. When the effect size was higher, sparse LFMM reached higher performances (F ≈ 0.28 and accuracy ≈ 0.33) than the other methods (Figure 3D). In those realistic simulations, sparse LFMM demonstrated greater robustness to departure from the generative model assumptions than the other sparse methods (BSLMM, LASSO), and also compared favorably with non-sparse methods (ridge LFMM, CATE, SVA).

Figure 2. F-score and precision as a function of effect size of the causal markers and confounding intensity. Two sparse methods (sparse LFMM, LASSO) and three non-sparse methods (ridge LFMM, CATE and SVA) were compared. F-score is the harmonic mean of precision and recall. Generative model simulation parameters: (A) Lower effect sizes and confounding intensities. (B) Lower effect sizes and higher confounding intensities. (C) Higher effect sizes and lower confounding intensities. (D) Higher effect sizes and confounding intensities.

Figure 3. Empirical simulation data (F-score and precision). F-score and precision as a function of the effect size of the causal markers and of the strength of the interaction between genotype and environment (G × E). Three sparse methods (sparse LFMM, BSLMM and LASSO) and three non-sparse methods (ridge LFMM, CATE and SVA) were compared. F-score is the harmonic mean of precision and recall. Simulation parameters: (A) Lower effect sizes and lower G × E. (B) Lower effect sizes and higher G × E. (C) Higher effect sizes and lower G × E. (D) Higher effect sizes and higher G × E.




Runtimes and number of factors. Next, we evaluated runtimes for sparse LFMM, and compared those runtimes with BSLMM and ridge LFMM (Figure S2). Whatever the number of individuals or markers, ridge LFMM was the fastest method, and sparse LFMM was the slowest method. Higher computation times for sparse LFMM were not surprising because the method iterates many cycles before convergence, whereas ridge LFMM is an exact approach. It took around 2,000 seconds for sparse LFMM to complete runs with n = 1,000 individuals and p = 100,000 markers. With default values for MCMC parameters, BSLMM runtimes were of the same order as those of sparse LFMM. To assess the choice of K by cross-validation, we varied the number of latent factors between 3 and 10, and compared the values estimated by cross-validation with the true values. In 73% of simulations, the number of latent factors was correctly estimated, and in the remaining 17% of simulations, the true value of K was overestimated by one unit (Figure S3).

GWAS of flowering time in A. thaliana. To illustrate the use of latent factor models in a context where confounding is difficult to control for, we performed a GWAS of flowering time using p = 53,859 SNPs genotyped in chromosome 5 for n = 162 European accessions of the model plant A. thaliana. The sparse methods (sparse LFMM, LASSO, BSLMM) differed in their estimate of the number of null effect sizes (Figure 4ABC, Figure S4). The LASSO approach estimated 99.85% null effect sizes while the proportions were equal to 99.24% and 98.18% for BSLMM and sparse LFMM respectively. The LASSO was the most conservative approach, and sparse LFMM the most liberal one. Sparse LFMM shared 3.9% of hits with LASSO, and 5.5% with BSLMM (Figure S4). Less than 1% of all hits were common to the three approaches. The (non-null) effect sizes for hits varied on distinct scales, with LASSO exhibiting the strongest biases. All sparse methods detected the same top hit at around 4 Mb, corresponding to a SNP located within the FLC gene, consistent with the results of Atwell et al. (2010). The second hit in Atwell et al. (2010), located in the gene DOG1, was also identified by sparse LFMM. BSLMM had more difficulties in identifying previously discovered genes. Given the high correlation (greater than 94%) between effect sizes obtained with non-sparse methods, we grouped their results by averaging their estimates. Non-sparse methods exhibited effect sizes in a range of values closer to sparse LFMM than to LASSO and BSLMM, but higher statistical errors were observed for those approaches (Figure 4D). Overall, we found a significant correlation between the non-null effect sizes estimated by sparse LFMM and the corresponding effect sizes found by non-sparse methods (ρ = 0.8065, $P < 10^{-16}$). In addition, sparse LFMM and the non-sparse methods found new hits around 13.9 Mb and 6.5 Mb of chr 5, corresponding to the SAP and ACL5 genes respectively.

EWAS of exposure to smoking during pregnancy. To evaluate association between smoking during pregnancy and placental DNA methylation, we performed an EWAS considering tobacco consumption as a primary variable. To this end, we considered beta-normalized methylation levels at p = 425,878 probed CpG sites for n = 668 women (Heude et al., 2016; Rousseaux et al., 2019). The placentas were collected at delivery from women included in the EDEN mother-child cohort. Using sparse LFMM, the proportion of null effect sizes was equal to 99.698%, for a total number of 1,287 hits (Figure S5). To characterize the targeted CpGs, we evaluated whether there was an enrichment of enhancer and promoter regions in candidate regions compared to the methylome (Figure S6 and Figure S7). For the 1,287 CpGs with non-null effect sizes, 25.48% were found in enhancer regions, compared to 22.73% for the whole methylome, and 6.83% were found in promoter regions, compared to 19.94% for the whole methylome. We compared the CpGs having the highest effect sizes in each method (Figure S8). Sparse LFMM shared 45.3% of hits with non-sparse models (represented by ridge LFMM), and 2.8% of hits with LASSO (Table S1). Among the 51 top hits shared by sparse LFMM and ridge LFMM, 25 were found in the body of a gene, 11 were not associated with a gene, 20 were in enhancer regions and 2 in promoter regions. Note that in this analysis, we averaged the effect sizes of non-sparse methods because their correlation was greater than 99%. The results of sparse LFMM agreed with the results of non-sparse methods better than with those of LASSO. The Pearson correlation between the non-null effect sizes estimated by sparse LFMM and the corresponding effect sizes estimated by non-sparse methods was equal to ρ = 80.38% ($P < 10^{-16}$), whereas the Pearson correlation between non-null effect sizes of sparse LFMM and LASSO was equal to ρ = 61.86% ($P < 10^{-16}$).

Figure 4. GWAS of flowering time in A. thaliana (chromosome 5). A) Effect size estimates for LASSO. B) Effect size estimates for sparse LFMM. C) Effect size estimates for BSLMM. D) Average effect size estimates for non-sparse methods (ridge LFMM, CATE and SVA). Grey bars represent Arabidopsis SNPs associated with the FT16 phenotype in Atwell et al. (2010), and correspond to the FLC and DOG1 genes.

To focus on a specific chromosome, we detailed the outputs of all approaches for chromosome 3, which contained the epigenome-wide top hit for sparse LFMM and for non-sparse methods (cg27402634, located on an enhancer, Figure 5). This CpG was also detected with LASSO (Figure S9). The sparse LFMM hits shared three additional CpGs with non-sparse methods: cg09627057, cg18557837 and cg12662091. Overall, sparse LFMM detected 61 CpGs with non-null effect sizes: 43 were located in genes, 22 in enhancer regions and 6 in promoter regions.

Figure 5. DNA methylation EWAS of smoking status in pregnant women (chromosome 3). A) Estimated effect size for sparse LFMM. The effect size at cg27402634 is equal to β = −0.117 (out of range). B) Estimated effect size for non-sparse methods (ridge LFMM, CATE and SVA). The effect size at cg27402634 is equal to β = −0.141 (out of range). CpGs with the highest effects are circled (genes in blue color). Red dots represent CpGs located in enhancer regions. Green dots represent CpGs located in promoter regions (Illumina annotations).




4 Discussion

We introduced sparse latent factor regression methods for the joint estimation of effect sizes and latent factors in genomic and epigenomic association studies. In generative and in empirical simulations, sparse LFMM obtained higher F-score and precision than previously introduced sparse methods, BSLMM and LASSO. Compared to three non-sparse methods (ridge LFMM, CATE and SVA), statistical errors of effect size estimates were reduced. In simulations based on a real data set, sparse LFMM reached the highest precision and F-score, showing that the method was more robust to departure from model assumptions than the other methods. For the causal markers, the effect sizes estimated by sparse LFMM and the corresponding effect sizes estimated by non-sparse methods were strongly correlated. Effect size estimates had a lower bias in non-sparse methods compared to sparse methods. These results suggest combining sparse LFMM with a non-sparse method in the following way. At a first stage, sparse LFMM can be used to estimate the support of causal markers (non-null effect sizes). Then ridge LFMM and CATE can be used to estimate the effect sizes of the selected markers.
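The two-stage strategy described above can be sketched as follows, assuming that effect size matrices from a sparse fit and from a non-sparse fit are already available. The function name `two_stage_effect_sizes` and the toy values are hypothetical; the sketch only shows how the support from the first stage is combined with the estimates from the second.

```python
import numpy as np

def two_stage_effect_sizes(B_sparse, B_nonsparse):
    """Keep non-sparse effect size estimates on the support selected by the sparse fit."""
    support = np.any(B_sparse != 0, axis=1)   # stage 1: markers with non-null effects
    B_combined = np.zeros_like(B_nonsparse)
    B_combined[support] = B_nonsparse[support]  # stage 2: less biased estimates
    return B_combined, support

# Toy example with three markers and one primary variable.
B_sparse = np.array([[0.0], [0.2], [0.0]])
B_nonsparse = np.array([[0.1], [0.5], [-0.3]])
B_combined, support = two_stage_effect_sizes(B_sparse, B_nonsparse)
```

Only the second marker is selected in this example, and its combined effect size is taken from the non-sparse estimate (0.5) rather than from the shrunken sparse estimate (0.2).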

In a GWAS of flowering time using 53,859 SNPs in the fifth chromosome of 162 European accessions of the plant A. thaliana, sparse LFMM identified the FLC and DOG1 genes to be associated with the FT16 phenotype. The two genes were previously reported as being associated with this phenotype in Atwell et al. (2010). The FLC gene plays a central role in flowering induced by vernalization (Sheldon et al., 2000), and DOG1 is involved in the control of dormancy and seed germination (Nishimura et al., 2018). The second hit of sparse LFMM corresponded to SNPs linked to SAP, which is a transcriptional regulator involved in the specification of floral identity (Byzova et al., 1999). This association was also significant for non-sparse methods. In addition, the new method detected SNPs located in the ACL5 gene, which plays a role in internodal growth and organ size (Hanzawa et al., 1997). In summary, sparse methods facilitated the selection of non-null effect sizes. The results for sparse LFMM were not only consistent with previous discoveries, but they also identified new candidate genes with interesting functional annotations.

Next, we applied sparse LFMM in an EWAS of placental DNA methylation for women exposed to smoking during pregnancy, which is considered an important risk factor for child health (Lumley et al., 2009). The CpG with the highest effect size in sparse LFMM and non-sparse methods (cg27402634) is located in an enhancer region, close to the LEKR1 gene, which was associated with birth weight in a GWAS from the Early Growth Genetics (EGG) consortium (http://egg-consortium.org/birth-weight.html). This association was detected as a top hit in an independent study of placental methylation and smoking (Morales et al., 2019). Rousseaux et al. (2019) also detected the association with cg27402634 in an EWAS based on a slightly different study population, and with other measures of the level of tobacco consumption. Morales et al. (2019) carried out a Sobel analysis of mediation between smoking and birth weight, and found the test significant for cg27402634. In the list of 51 CpGs with high effect sizes, several additional statistical associations between placental methylation and maternal smoking have been reported in previous studies, including cg21992501 in the gene TTC27 (Cardenas et al., 2019), and cg25585967 and cg17823829, respectively in the TRIO and KDM5B genes (Morales et al., 2019; Everson et al., 2019). Turning to the rest of the methylome, we found additional associations that may adversely affect mother-child health. To better characterize the CpGs in those associations, we evaluated whether there was an enrichment in enhancer and in promoter regions. We found there was an enrichment of enhancer regions and a depletion of promoter regions for CpGs with non-null effects, consistent with the findings of Rousseaux et al. (2019). Overall, our new method allowed us to confirm some previously discovered associations, and also detected new associations including genes for which methylation changes have detrimental effects on the health of the child.

Conclusion. Removing variation due to unobserved confounding factors is extremely difficult in any type of association study. Assuming that a small proportion of all markers correlate with the exposure or phenotype, we addressed the confounding issue by using sparse latent factor regression models, with mathematical guarantees that the estimation algorithm reaches global solutions of the penalized least-squares problems. Sparsity constraints in our algorithm allowed the selection of markers without any need for statistical testing. The application of our method to real data sets highlighted new associations with relevant biological meaning. The methods are reproducible and are implemented in the R package lfmm.

Materials and Methods

Cross-validation method. Choosing regularization parameters of sparse LFMM or ridge LFMM and the number of latent factors can be done by using cross-validation methods. The cross-validation approach partitions the data into a training set and a test set. The training set is used to fit model parameters, and prediction errors are measured on the test set. In our approach, the response and explanatory variables are partitioned according to their rows (individuals). We denote by $I$ the subset of individual labels on which prediction errors are computed. Estimates of effect sizes, $\hat{\mathbf{B}}_{-I}$, and loading values, $\hat{\mathbf{V}}_{-I}$, are computed on the training set. Next, we partition the set of columns of the response matrix, and denote by $J$ the subset of columns on which prediction errors are computed. A factor matrix, $\hat{\mathbf{U}}_{-J}$, is estimated from the complementary subset as follows

$$\hat{\mathbf{U}}_{-J} = \left(\mathbf{Y}[I,-J] - \mathbf{X}[I,\,]\,\hat{\mathbf{B}}^{T}_{-I}[-J,\,]\right)\hat{\mathbf{V}}_{-I}[-J,\,]. \tag{11}$$

In these notations, the brackets indicate which subsets of rows and columns are selected. A prediction error is then computed as follows

$$\mathrm{Error} = \left\| \mathbf{Y}[I,J] - \hat{\mathbf{U}}_{-J}\hat{\mathbf{V}}^{T}_{-I}[J,\,] - \mathbf{X}[I,\,]\,\hat{\mathbf{B}}^{T}_{-I}[J,\,] \right\|_F. \tag{12}$$

Regularization parameters and the number of factors leading to the lowest prediction error were retained in data analysis.
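As an illustration, equations (11) and (12) can be sketched in a few lines of plain Python. This is a minimal sketch, not the code of the lfmm package: the function names are hypothetical, matrices are stored as lists of rows, and the estimates of B and V are assumed to have been fitted beforehand on the training individuals.

```python
from math import sqrt

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def select(a, rows, cols=None):
    """Return the sub-matrix a[rows, cols]; cols=None keeps every column."""
    return [[a[i][j] for j in (cols if cols is not None else range(len(a[i])))]
            for i in rows]

def subtract(a, b):
    """Entry-wise difference of two matrices of the same shape."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def cv_prediction_error(Y, X, B_hat, V_hat, I, J):
    """Prediction error of equations (11)-(12).

    Y: n x p response, X: n x d primary variables.
    B_hat (p x d) and V_hat (p x K) are assumed to be fitted on the
    training rows (those not in I). I: test rows, J: test columns.
    """
    p = len(Y[0])
    J_set = set(J)
    not_J = [j for j in range(p) if j not in J_set]
    X_I = select(X, I)
    # Equation (11): factor scores of the test individuals, using columns outside J.
    resid = subtract(select(Y, I, not_J),
                     matmul(X_I, transpose(select(B_hat, not_J))))
    U_hat = matmul(resid, select(V_hat, not_J))
    # Equation (12): Frobenius norm of the prediction residual on the block (I, J).
    latent_part = matmul(U_hat, transpose(select(V_hat, J)))
    fixed_part = matmul(X_I, transpose(select(B_hat, J)))
    diff = subtract(subtract(select(Y, I, J), latent_part), fixed_part)
    return sqrt(sum(v * v for row in diff for v in row))
```

In practice this error would be averaged over several choices of I and J, and the parameter setting with the lowest average retained.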

Heuristics for regularization parameters and number of factors. Additional heuristics were used to determine the number of latent factors and the regularization parameter of the nuclear norm of the latent matrix. In order to choose the number of latent factors, K, we considered the matrix $\mathbf{D}_\lambda$, defined for the ridge algorithm, and the unitary matrix $\mathbf{Q}$, obtained from an SVD of X. The number of latent factors, K, can be estimated by using a spectral analysis of the matrix $\mathbf{D}_0\mathbf{Q}^T\mathbf{Y}$. In our experiments, we used the “elbow” method based on the scree plot of eigenvalues of the matrix $\mathbf{D}_0\mathbf{Q}^T\mathbf{Y}$. Values for K were confirmed by prediction errors computed by cross-validation. The L1-regularization parameter, µ, was determined by inspection of the proportion of non-zero effect sizes in the B matrix, which was estimated by cross-validation. Having set the proportion of non-null effect sizes, µ was computed by using the regularization path approach proposed by Friedman et al. (2010) as follows. The regularization path algorithm was initialized with the smallest value of µ such that

$$\hat{\mathbf{B}}^{1} = \mathrm{sign}(\bar{\mathbf{B}}^{1})(\bar{\mathbf{B}}^{1} - \mu)_{+} = 0, \tag{13}$$

where $\hat{\mathbf{B}}^{1}$ resulted from Step 1 in the sparse LFMM algorithm, and $\bar{\mathbf{B}}^{1}$ is the linear regression estimate. Then, we built a sequence of µ values that decreased from the inferred value of the parameter $\mu_{\max}$ to $\mu_{\min} = \epsilon\,\mu_{\max}$. Along this sequence, we measured the number of non-null elements in $\hat{\mathbf{B}}^{t}$, and stopped when the target proportion was reached. The nuclear norm parameter (γ) determines the rank of the latent matrix W. We used a heuristic approach to evaluate γ from the number of latent factors K. Based on the singular values $(\lambda_1, \dots, \lambda_n)$ of the response matrix Y, we set

$$\gamma = \frac{\lambda_K + \lambda_{K+1}}{2}. \tag{14}$$

With this value of γ, sparse LFMM always converged to a latent matrix estimate having rank K in our experiments.
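The two heuristics for µ and γ can be illustrated with a short Python sketch. Several points are assumptions made for illustration only: the function names, the default values of `eps` and `n_steps`, and the geometric spacing of the grid (the convention used by glmnet) are hypothetical. The sketch also thresholds the linear regression estimate directly with the standard soft-thresholding operator, whereas the procedure described above counts non-null elements in the sparse LFMM estimate at each value of µ.

```python
def soft_threshold(b, mu):
    """Standard soft-thresholding operator: sign(b) * max(|b| - mu, 0)."""
    if b > mu:
        return b - mu
    if b < -mu:
        return b + mu
    return 0.0

def mu_for_target_proportion(b_bar, target, eps=1e-3, n_steps=100):
    """Walk a decreasing grid of mu values and stop at the target sparsity.

    b_bar: linear regression estimates of the effect sizes (flat list).
    target: desired proportion of non-null effect sizes.
    eps, n_steps: hypothetical grid settings, with mu_min = eps * mu_max.
    """
    mu_max = max(abs(b) for b in b_bar)   # smallest mu that zeroes every effect
    ratio = eps ** (1.0 / (n_steps - 1))  # geometric (log-scale) spacing
    mu = mu_max
    for _ in range(n_steps):
        nonzero = sum(1 for b in b_bar if soft_threshold(b, mu) != 0.0)
        if nonzero / len(b_bar) >= target:
            return mu
        mu *= ratio
    return mu_max * eps  # target proportion not reached on the grid

def gamma_from_singular_values(singular_values, K):
    """Equation (14): midpoint between the K-th and (K+1)-th singular values."""
    s = sorted(singular_values, reverse=True)
    return (s[K - 1] + s[K]) / 2.0
```

Placing γ halfway between the K-th and (K+1)-th singular values is consistent with a singular value thresholding step that keeps the first K components and discards the others.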

Estimation algorithms. Sparse LFMM was compared to two other sparse methods. As a baseline, we used Least Absolute Shrinkage and Selection Operator (LASSO) regression models (Tibshirani, 1996; Friedman et al., 2010). LASSO regression models did not include any correction for confounding, and strong biases were expected in effect size estimates. The LASSO models were fitted with the R package glmnet, and the regularization parameter was selected by using a 5-fold cross-validation approach (Zeng et al., 2017). We also used Bayesian Sparse Linear Mixed Models (BSLMM) implemented in the GEMMA software (Zhou et al., 2013). BSLMM is a hybrid method that combines sparse regression models with linear mixed models. BSLMM uses a Markov chain Monte Carlo (MCMC) method to estimate effect sizes. The MCMC burn-in period and sampling sizes were set to 10,000 (Zeng et al., 2017). To determine the proportion of non-zero effect sizes, two parameters were tuned (pmin and pmax). Those parameters correspond to the logarithm of the minimum and maximum expected proportions of non-zero effect sizes.

We also compared sparse LFMM to three non-sparse algorithms, all based on the generative model defined in equation (1). First, we used Surrogate Variable Analysis (SVA; Leek and Storey, 2007). SVA was introduced to overcome the problems caused by heterogeneity in gene expression studies. The algorithm starts by estimating the loading values of a principal component analysis for the residuals of the regression of the response matrix Y on X. In a second step, SVA determines a subset of response variables exhibiting low correlation with X, and uses this subset of variables to estimate the latent factors. SVA was run with the R package sva. Next, we used the Confounder Adjusted Testing and Estimation (CATE) method (Wang et al., 2017). CATE uses a linear transformation of the response matrix such that the first axis of this transformation is collinear with X and the other axes are orthogonal to X. CATE was used without negative controls, and it was run with the R package cate. Finally, we used the ridge version of LFMM implemented in the R package lfmm (Caye et al., 2019).

Generative model simulations. We defined the confounding intensity as the percentage of variance of the primary variable X explained by the latent factors U. Following Caye et al. (2019), we performed simulations of a primary variable, X, with d = 1, and K = 6 independent latent factors, U, for two values of confounding intensity, $R^2 = 0.1$ (lower) and $R^2 = 0.5$ (higher). The joint distribution of (X, U) was a multivariate Gaussian distribution. Having defined primary variables and latent factors, we used the generative model defined in equation (1) to simulate a response matrix, Y. To create sparse models, only a small proportion of effect sizes, around 0.8%, were allowed to be different from zero. Non-null effect sizes were sampled according to a Gaussian distribution, N(B, 0.2), where B could take two values, B = 0.75 (lower value) and B = 1.5 (higher value). Residual errors and loadings, V, were sampled according to a standard Gaussian distribution. The dimensions of the response matrix were set to n = 400 individuals and p = 10,000 variables. Two hundred simulations were performed for each combination of parameters (800 simulations).
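A toy version of this simulation design can be written in plain Python. The sketch below is illustrative only and rests on several assumptions that the text does not specify: the response is generated as Y = U V^T + X B^T + E, the primary variable is built as a weighted sum of the factors plus independent noise so that a fraction `r2` of its variance comes from U, and the second parameter of N(B, 0.2) is read as a standard deviation. The function name and the small default dimensions are hypothetical, chosen so that the example runs quickly.

```python
import random

def simulate(n=50, p=200, K=6, r2=0.5, effect_mean=1.5, effect_sd=0.2,
             prop_causal=0.008, seed=0):
    """Simulate (Y, X, U, B) from Y = U V^T + X B^T + E with d = 1."""
    rng = random.Random(seed)
    # Independent standard Gaussian latent factors (n x K).
    U = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(n)]
    # Primary variable with unit variance; a fraction r2 of it comes from U.
    w = (r2 / K) ** 0.5
    X = [w * sum(u) + (1.0 - r2) ** 0.5 * rng.gauss(0.0, 1.0) for u in U]
    # Sparse effect sizes: a few non-null entries drawn around effect_mean.
    n_causal = max(1, round(prop_causal * p))
    causal = set(rng.sample(range(p), n_causal))
    B = [rng.gauss(effect_mean, effect_sd) if j in causal else 0.0
         for j in range(p)]
    # Standard Gaussian loadings (p x K) and residual errors.
    V = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(p)]
    Y = [[sum(U[i][k] * V[j][k] for k in range(K)) + X[i] * B[j]
          + rng.gauss(0.0, 1.0)
          for j in range(p)] for i in range(n)]
    return Y, X, U, B
```

Setting n = 400 and p = 10,000 would match the dimensions reported above, although a pure Python loop over four million entries is slow and an array library would be preferable for the full design.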

Empirical simulations. We used the R package naturalgwas to simulate associations of phenotypes based on a matrix of sampled genotypes (François and Caye, 2018). With this program, phenotypic simulations incorporate realistic features such as geographic population genetic structure and gene-by-environment interactions where environmental variables are derived from a bioclimatic database. When estimating effect sizes, population genetic structure and gene-by-environment interactions are considered to be the main sources of confounding. Phenotypes were simulated for n = 162 accessions of the model plant Arabidopsis thaliana, using publicly available Single Nucleotide Polymorphisms (SNPs) genotyped from the fifth chromosome (Atwell et al., 2010). The response matrix contained p = 53,859 SNPs with minor allele frequency greater than 5%. The number of confounding factors was set to K = 6, and the phenotypes were generated from a combination of five causal SNPs with identical effect sizes. Two values of effect size were implemented, B = 6 (lower effect size) and B = 9 (higher effect size). Additionally, two values of gene-by-environment interaction were implemented, G×E = 0.1 (lower G×E) and G×E = 0.9 (higher G×E). For each parameter combination, two hundred simulations were performed.

Evaluation metrics. In the simulation study, all methods were used with their default parameters, and the number of latent factors was set to K = 6 in all latent factor models. To evaluate the ability of the methods to identify true positives, we used precision, which corresponds to the proportion of true positives in a list of positive markers, the recall, which is the number of true positives divided by the number of causal markers, and the F-score, which is the harmonic mean of precision and recall. To compute precision and F-score in generative model experiments, a list of 100 markers with the largest absolute estimated effect sizes was considered for each data set and method. In empirical simulations, the measures were modified to account for linkage disequilibrium (LD) in the data. Candidate markers within a window of size 10 kb around a causal marker were considered to be true discoveries (LD $r^2 < 0.2$; François and Caye, 2018). In addition to the F-score, we used the root mean squared error (RMSE) to evaluate the statistical errors of effect size estimates. We also used simulations from the generative model to assess the capability of the cross-validation algorithm to estimate the number of latent factors in sparse LFMM. In program runs, the number of latent factors varied between K = 3 and K = 10, and the value estimated by the cross-validation algorithm was compared with the true value (K = 6).
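The metrics defined above are simple to compute once a candidate list is available. The following Python sketch shows one possible implementation for the generative model experiments; the function names are hypothetical, and the LD-window adjustment used in empirical simulations is not included.

```python
from math import sqrt

def top_markers(effect_sizes, size=100):
    """Indices of the `size` markers with the largest absolute effect size."""
    order = sorted(range(len(effect_sizes)),
                   key=lambda j: abs(effect_sizes[j]), reverse=True)
    return order[:size]

def precision_recall_f(candidates, causal):
    """Precision, recall and F-score of a candidate list against causal markers."""
    causal = set(causal)
    true_pos = sum(1 for j in candidates if j in causal)
    precision = true_pos / len(candidates) if candidates else 0.0
    recall = true_pos / len(causal) if causal else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # The F-score is the harmonic mean of precision and recall.
    return precision, recall, 2.0 * precision * recall / (precision + recall)

def rmse(estimated, truth):
    """Root mean squared error between estimated and true effect sizes."""
    return sqrt(sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth))
```

For instance, with five markers and estimated effect sizes `[0.1, -2.0, 0.0, 1.5, 0.3]`, `top_markers(..., size=2)` selects markers 1 and 3. If the causal markers are 1 and 4, precision, recall and F-score are all equal to 0.5.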

GWAS of plant phenotype. Sparse LFMM and a set of other methods were used to perform association studies for two distinct types of genomic data including genotypic and epigenetic markers. For Arabidopsis thaliana, we considered n = 162 European accessions and p = 53,859 SNPs from the fifth chromosome of the plant genome to investigate associations with the flowering time phenotype FT16-TO:0000344 (Atwell et al., 2010). FT16 corresponds to the number of days required for an individual plant to reach the flowering stage. In the sparse LFMM algorithm, the percentage of non-null effect sizes was set to 0.01. The parameters pmin and pmax defining sparsity in the BSLMM algorithm were fixed to pmin = −5 and pmax = −4 respectively. These values correspond to the logarithm of expected proportions of non-null effect sizes in BSLMM. For all factor methods, the number of latent factors was determined by cross-validation and set to K = 10.

EWAS of exposure to tobacco consumption. Our second application to real data concerned an EWAS based on the EDEN mother-child cohort (Heude et al., 2016). Beta-normalized methylation levels at p = 425,878 probed CpG sites were measured for n = 668 women. We tested the association between smoking status (219 current smokers and 449 non-current smokers) and DNA methylation (mDNA) levels in the mother’s placenta. Detailed information on the study population and protocols for placental DNA methylation assessment and processing can be found in Abraham et al. (2018) and Rousseaux et al. (2019). The proportion of null effect sizes in sparse LFMM was equal to 0.999. For latent factor models, the number of latent factors was estimated by cross-validation, and was equal to K = 7.




Acknowledgements. This article was developed in the framework of the Grenoble Alpes Data Institute, supported by the French National Research Agency under the Investissements d’Avenir program (ANR-15-IDEX-02). It received support from LabEx PERSYVAL Lab, ANR-11-LABX-0025-01, and from the French National Research Agency (Agence Nationale pour la Recherche) ETAPE, ANR-18-CE36-0005. We thank the participants of the EDEN cohort. We thank the midwife research assistants for data collection, the psychologists and the data entry operators. We also thank the EDEN mother-child cohort study group which includes I Annesi-Maesano, JY Bernard, J Botton, M-A Charles, P Dargent-Molina, B de Lauzon-Guillain, P Ducimetière, M de Agostini, B Foliguet, A Forhan, X Fritel, A Germa, V Goua, R Hankard, B Heude, M Kaminski, B Larroque, N Lelong, J Lepeule, G Magnin, L Marchand, C Nabet, F Pierre, R Slama, MJ Saurel-Cubizolles, M Schweitzer, O Thiebaugeorges.

Program Availability. All code is publicly available. Sparse LFMM was implemented in the R package lfmm, available from GitHub (https://bcm-uga.github.io/lfmm/) and submitted to the Comprehensive R Archive Network (https://cran.r-project.org/).

Data Availability. The Arabidopsis thaliana data are publicly available from the 1,001 genomes database (https://1001genomes.org/). The EDEN individual-level data have restricted access owing to ethical and legal conditions in France. They are available upon request from the EDEN steering committee at [email protected] and through collaborations with the principal investigators of EDEN.

Funding. The EDEN study was supported by Foundation for medical research (FRM), National Agency for Research (ANR), National Institute for Research in Public health (IRESP: TGIR cohorte santé 2008 program), French Ministry of Health (DGS), French Ministry of Research, INSERM Bone and Joint Diseases National Research (PRO-A), and Human Nutrition National Research Programs, Paris-Sud University, Nestlé, French National Institute for Population Health Surveillance (InVS), French National Institute for Health Education (INPES), the European Union FP7 programmes (FP7/2007-2013, HELIX, ESCAPE, ENRIECO, Medall projects), Diabetes National Research Program (through a collaboration with the French Association of Diabetic Patients (AFD)), French Agency for Environmental Health Safety (now ANSES), Mutuelle Générale de l’Education Nationale (MGEN), a complementary health insurance, French national agency for food security, French-speaking association for the study of diabetes and metabolism (ALFEDIAM).

References

Abraham, E., Rousseaux, S., Agier, L., Giorgis-Allemand, L., Tost, J., Galineau, J., Hulin, A., Siroux, V., Vaiman, D., Charles, M.-A., Heude, B., Forhan, A., Schwartz, J., Chuffart, F., Bourova-Flin, E., Khochbin, S., Slama, R., and Lepeule, J. (2018). Pregnancy exposure to atmospheric pollution and meteorological conditions and placental DNA methylation. Environ. Int., 118, 334-347.

Akama, T.O., Misra, A.K., Hindsgaul, O., and Fukuda, M.N. (2002). Enzymatic synthesis in vitro of the disulfated disaccharide unit of corneal keratan sulfate. J. Biol. Chem., 277, 42505-42513.




Atwell, S., Huang, Y.S., Vilhjálmsson, B.J., Willems, G., Horton, M., Li, Y., Meng, D., Platt, A., Tarone, A.M., Hu, T.T., Jiang, R., Muliyati, N.W., Zhang, X., Amer, M.A., Baxter, I., Brachi, B., Chory, J., Dean, C., Debieu, M., de Meaux, J., Ecker, J.R., Faure, N., Kniskern, J.M., Jones, J.D.G., Michael, T., Nemri, A., Roux, F., Salt, D.E., Tang, C., Todesco, M., Traw, M.B., Weigel, D., Marjoram, P., Borevitz, J.O., Bergelson, J., and Nordborg, M. (2010). Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature, 465, 627-631.

Balding, D.J. (2006). A tutorial on statistical methods for population association studies. Nat. Rev. Genet., 7, 781-791.

Bertsekas, D.P. (1999). Nonlinear Programming. Belmont: Athena Scientific.

Byzova, M.V., Franken, J., Aarts, M.G.M., de Almeida-Engler, J., Engler, G., Mariani, C., Van Lookeren Campagne, M.M., and Angenent, G.C. (1999). Arabidopsis STERILE APETALA, a multifunctional gene regulating inflorescence, flower, and ovule development. Genes Dev., 13, 1002-1014.

Cai, J.-F., Candès, E.J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20, 1956-1982.

Carvalho, C.M. et al. (2008). High-dimensional sparse factor modeling: applications in gene expression genomics. J. Am. Stat. Assoc., 103, 1438-1456.

Cardenas, A., Lutz, S.M., Everson, T.M., Perron, P., Bouchard, L., and Hivert, M.-F. (2019). Mediation by placental DNA methylation of the association of prenatal maternal smoking and birth weight. Am. J. Epidemiol., 188, 1878-1886.




Caye, K., Jumentier, B., Lepeule, J., and François, O. (2019). LFMM 2: Fast and accurate inference of gene-environment associations in genome-wide studies. Mol. Biol. Evol., 36, 852-860.

Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics, 55, 997-1004.

Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 211-218.

Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Am. Stat. Assoc., 99, 96-104.

Everson, T.M., Vives-Usano, M., Seyve, E., Cardenas, A., Lacasaña, M., Craig, J.M., Lesseur, C., Baker, E.R., Fernandez-Jimenez, N., Heude, B., Perron, P., Gonzalez-Alzaga, B., Halliday, J., Deyssenroth, M.A., Karagas, M.R., Iñiguez, C., Bouchard, L., Carmona-Saez, P., Loke, Y.J., Hao, K., Belmonte, T., Charles, M.A., Martorell-Marugan, J., Muggli, E., Chen, J., Fernandez, M.F., Tost, J., Gomez-Martin, A., London, S.J., Sunyer, J., Marsit, C.J., Lepeule, J., Hivert, M.-F., and Bustamante, M. (2019). Placental DNA methylation signatures of maternal smoking during pregnancy and potential impacts on fetal growth. BioRxiv, 663567.

François, O. and Caye, K. (2018). Naturalgwas: An R package for evaluating genome-wide association methods with empirical data. Mol. Ecol. Resour., 18(4), 789-797.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat., 1, 302-332.




Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33.

Frichot, E., Schoville, S.D., Bouchard, G., and François, O. (2013). Testing for associations between loci and environmental gradients using latent factor mixed models. Mol. Biol. Evol., 30, 1687-1699.

Frichot, E. and François, O. (2015). LEA: an R package for landscape and ecological association studies. Methods Ecol. Evol., 6, 925-929.

Gautier, M. (2015). Genome-wide scan for adaptive divergence and association with population-specific covariates. Genetics, 201, 1555-1579.

Halko, N., Martinsson, P.G., and Tropp, J.A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53, 217-288.

Hanzawa, Y., Takahashi, T., and Komeda, Y. (1997). ACL5: an Arabidopsis gene required for internodal elongation after flowering. Plant J., 12, 863-874.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics, Springer, NY, USA.

Heude, B., Forhan, A., Slama, R., Douhaud, L., Bedel, S., Saurel-Cubizolles, M.-J., Hankard, R., Thiebaugeorges, O., De Agostini, M., Annesi-Maesano, I., Kaminski, M., and Charles, M.-A. (2016). Cohort Profile: The EDEN mother-child cohort on the prenatal and early postnatal determinants of child health and development. Int. J. Epidemiol., 45, 353-363.




Hoggart, C.J., Whittaker, J.C., Iorio, M.D., and Balding, D.J. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet., 4, e1000130.

Houseman, E.A., Kile, M.L., Christiani, D.C., Ince, T.A., Kelsey, K.T., and Marsit, C.J. (2016). Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics, 17.

Jaffe, A.E. and Irizarry, R.A. (2014). Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol., 15, R3.

Kalaitzis, A.A. and Lawrence, N.D. (2012). Residual component analysis: Generalising PCA for more flexible inference in linear-Gaussian models. Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 1, 209-216.

Kaushal, A. et al. (2017). Comparison of different cell type correction methods for genome-scale epigenetics studies. BMC Bioinformatics, 18, 216.

Leek, J.T. and Storey, J.D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet., 3, e161.

Lumley, J., Chamberlain, C., Dowswell, T., Oliver, S., Oakley, L., and Watson, L. (2009). Interventions for promoting smoking cessation during pregnancy. Cochrane Database of Systematic Reviews 2009, Issue 3. Art. No.: CD001055.

Mishra, B., Meyer, G., Bach, F., and Sepulchre, R. (2013). Low-rank optimization with trace norm penalty. SIAM J. Optim., 23, 2124-2149.




Morales, E., Vilahur, N., Salas, L.A., Motta, V., Fernandez, M.F., Murcia, M., Llop, S., Tardon, A., Fernandez-Tardon, G., Santa-Marina, L., Gallastegui, M., Bollati, V., Estivill, X., Olea, N., Sunyer, J., and Bustamante, M. (2016). Genome-wide DNA methylation study in human placenta identifies novel loci associated with maternal smoking during pregnancy. Int. J. Epidemiol., 45, 1644-1655.

Nishimura, N., Tsuchiya, W., Moresco, J.J., Hayashi, Y., Satoh, K., Kaiwa, N., Irisa, T., Kinoshita, T., Schroeder, J.I., Yates, J.R., Hirayama, T., and Yamazaki, T. (2018). Control of seed dormancy and germination by DOG1-AHG1 PP2C phosphatase complex via binding to heme. Nat. Commun., 9.

Price, A.L. et al. (2006). Principal component analysis corrects for stratification in genome-wide association studies. Nat. Genet., 38, 904-909.

Rakyan, V.K., Down, T.A., Balding, D.J., and Beck, S. (2011). Epigenome-wide association studies for common human diseases. Nat. Rev. Genet., 12, 529-541.

Rellstab, C., Gugerli, F., Eckert, A.J., Hancock, A.M., and Holderegger, R. (2015). A practical guide to environmental association analysis in landscape genomics. Mol. Ecol., 24, 4348-4370.

Rousseaux, S., Seyve, E., Chuffart, F., Bourova-Flin, E., Benmerad, M., et al. (2019). Maternal exposure to cigarette smoking induces immediate and durable changes in placental DNA methylation affecting enhancer and imprinting control regions. BioRxiv, 852186.

Sayin, N., Kara, N., Pekel, G., and Altinkaynak, H. (2014). Effects of chronic smoking on central corneal thickness, endothelial cell, and dry eye parameters. Cutan. Ocul. Toxicol., 33, 201-205.

Sheldon, C.C., Rouse, D.T., Finnegan, E.J., Peacock, W.J., and Dennis, E.S. (2000). The molecular basis of vernalization: The central role of Flowering Locus C (FLC). Plant Biol., 97, 6.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B, 58, 267-288.

The BIOS Consortium, van Iterson, M., van Zwet, E.W., and Heijmans (2017). Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. Genome Biol., 18, 19.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theor. Appl., 109, 475-494.

Wang, J., Zhao, Q., Hastie, T., and Owen, A.B. (2017). Confounder adjustment in multiple hypothesis testing. Ann. Statist., 45, 1863-1894.

Witten, D.M., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515-534.

Wu, T.T., Chen, Y.F., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25, 714-721.




Yu, J., Pressoir, G., Briggs, W.H., Bi, I.V., Yamasaki, M., et al. (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet., 38, 203-208.

Zeng, P., Zhou, X., and Huang, S. (2017). Prediction of gene expression with cis-SNPs using mixed models and regularization methods. BMC Genomics, 18.

Zhou, X. and Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nat. Genet., 44, 821.

Zhou, X., Carbonetto, P., and Stephens, M. (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet., 9(2), e1003264.

Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Stat., 15, 265-286.

Appendix: Proofs of theorems

This section provides mathematical proofs for the theorems stated in section 2.

Theorem 1. Let µ > 0 and γ > 0. Then the block-coordinate descent algorithm cycling through Step 1 and Step 2 converges to estimates of W and B defining a global minimum of the penalized loss function $L_{\rm sparse}(\mathbf{W},\mathbf{B})$.

Proof. The proof arguments are based on a result of Tseng (2001). Consider the Cartesian product of closed convex sets $A = A_1 \times A_2 \times \cdots \times A_m$, and let $f(z)$ be a continuous convex function defined on $A$ and such that

$$f(z_1, \dots, z_m) = g(z_1, \dots, z_m) + \sum_{i=1}^{m} f_i(z_i),$$

where $g(z)$ is a differentiable convex function, and for each $i = 1, \dots, m$, $f_i(z_i)$ is a continuous convex function. Let $(z^t)$ be the sequence of values defined by the following block-coordinate descent algorithm

$$z_i^{t+1} \in \arg\min_{\zeta \in A_i} f(z_1^t, \dots, z_{i-1}^t, \zeta, z_{i+1}^t, \dots, z_m^t), \quad i = 1, \dots, m. \tag{15}$$

Then a limit point of the sequence $(z^t)$ defines a global minimum of the function $f(z)$. The theorem’s proof is a consequence of the convexity of the penalized loss function $L_{\rm sparse}(\mathbf{W},\mathbf{B})$, and the fact that we can write

$$L_{\rm sparse}(\mathbf{B},\mathbf{W}) = g(\mathbf{B},\mathbf{W})/2 + f_1(\mathbf{B}) + f_2(\mathbf{W})$$

where $g(\mathbf{B},\mathbf{W}) = \|\mathbf{Y} - \mathbf{W} - \mathbf{X}\mathbf{B}^T\|_F^2$ is a differentiable convex function, and $f_1(\mathbf{B}) = \|\mathbf{B}\|_1^2$, $f_2(\mathbf{W}) = \|\mathbf{W}\|_*^2$ are continuous convex functions. Tseng’s result can be applied with the function $f(\mathbf{B},\mathbf{W}) = L_{\rm sparse}(\mathbf{B},\mathbf{W})$ to conclude the proof (see also Bertsekas, 1999).

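Theorem 2. Let λ > 0 and assume $\sigma_i^2 > 0$ for all $i = 1, \dots, d$. The estimates $\hat{\mathbf{W}}$ and $\hat{\mathbf{B}}$ computed as follows

$$\hat{\mathbf{W}} = \mathbf{Q}\mathbf{D}_\lambda^{-1}\,\mathrm{svd}_K(\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}), \tag{16}$$

$$\hat{\mathbf{B}}^T = (\mathbf{X}^T\mathbf{X} + \lambda\,\mathrm{Id}_d)^{-1}\mathbf{X}^T(\mathbf{Y} - \hat{\mathbf{W}}), \tag{17}$$

where $\mathrm{svd}_K(\mathbf{A})$ is the rank K SVD of the matrix A, $\mathrm{Id}_d$ is the d × d identity matrix, and $\mathbf{D}_\lambda$ is the n × n diagonal matrix with coefficients defined as

$$d_\lambda = \left(\sqrt{\frac{\lambda}{\lambda + \sigma_1^2}}, \dots, \sqrt{\frac{\lambda}{\lambda + \sigma_d^2}}, 1, \dots, 1\right),$$

define a global minimum of the penalized loss function $L_{\rm ridge}(\mathbf{B},\mathbf{W})$.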

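Proof. Given W, a global minimum for $L_{\rm ridge}(\mathbf{B},\mathbf{W})$ is obtained with the ridge estimates for a linear regression of the response matrix Y − W on X,

$$\hat{\mathbf{B}}^T = (\mathbf{X}^T\mathbf{X} + \lambda\,\mathrm{Id}_d)^{-1}\mathbf{X}^T(\mathbf{Y} - \mathbf{W}). \tag{18}$$

Thus, the problem amounts to minimizing the function $L(\mathbf{W}) = L_{\rm ridge}(\hat{\mathbf{B}},\mathbf{W})$ with respect to W. By definition of the $\mathbf{D}_\lambda$ and $\mathbf{Q}$ matrices, the loss function can be rewritten as

$$L(\mathbf{W}) = \left\|\mathbf{D}_\lambda\mathbf{Q}^T(\mathbf{Y} - \mathbf{W})\right\|_F^2. \tag{19}$$

Minimizing the above loss function is equivalent to finding the best approximation of rank K for the matrix $\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}$. According to Eckart and Young (1936), this approximation is given by the rank K singular value decomposition of $\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}$. Finally, we obtain that

$$\hat{\mathbf{W}} = \mathbf{Q}\mathbf{D}_\lambda^{-1}\,\mathrm{svd}_K(\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y})$$

defines the unique global minimum of the L(W) function.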
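The step leading to equation (19) can be checked with a short calculation. The sketch below relies on two assumptions that are not restated in this appendix: the ridge loss is taken, up to a constant factor, to be the sum of the squared Frobenius norm of the residual and λ times the squared Frobenius norm of B, and X is written as a full SVD with an orthogonal factor R (a symbol introduced here for illustration) and singular values $\sigma_1, \dots, \sigma_d$.

```latex
% Notation (assumed): X = Q \Sigma R^T, M = Y - W, and \tilde{m}_i is the i-th row of Q^T M.
\begin{align*}
\mathbf{M} - \mathbf{X}\hat{\mathbf{B}}^T
  &= \mathbf{Q}\,\mathrm{diag}\!\left(\tfrac{\lambda}{\lambda+\sigma_1^2},\dots,
     \tfrac{\lambda}{\lambda+\sigma_d^2},1,\dots,1\right)\mathbf{Q}^T\mathbf{M}, \\
\|\mathbf{M} - \mathbf{X}\hat{\mathbf{B}}^T\|_F^2 + \lambda\|\hat{\mathbf{B}}\|_F^2
  &= \sum_{i=1}^{d}\left[\frac{\lambda^2}{(\lambda+\sigma_i^2)^2}
     + \frac{\lambda\,\sigma_i^2}{(\lambda+\sigma_i^2)^2}\right]\|\tilde{m}_i\|^2
     + \sum_{i=d+1}^{n}\|\tilde{m}_i\|^2 \\
  &= \sum_{i=1}^{d}\frac{\lambda}{\lambda+\sigma_i^2}\,\|\tilde{m}_i\|^2
     + \sum_{i=d+1}^{n}\|\tilde{m}_i\|^2
   = \left\|\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{M}\right\|_F^2 .
\end{align*}
```

The weights $\lambda/(\lambda+\sigma_i^2)$ are the squares of the diagonal coefficients of $\mathbf{D}_\lambda$, which is why those coefficients carry a square root. Since λ > 0, $\mathbf{D}_\lambda$ is invertible, so the rank K approximation of $\mathbf{D}_\lambda\mathbf{Q}^T\mathbf{Y}$ can be mapped back to an estimate of W.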





Supplementary materials




Figure S1. Generative model simulations (RMSE for causal markers only). Root Mean Square Error (RMSE) of causal effect sizes as a function of the effect size of the causal markers and of the confounding intensity. Two sparse methods (sparse LFMM, LASSO) and three non-sparse methods (ridge LFMM, CATE and SVA) were compared. Simulation parameters: (A) Lower effect sizes and confounding intensities. (B) Lower effect sizes and higher confounding intensities. (C) Higher effect sizes and lower confounding intensities. (D) Higher effect sizes and confounding intensities.




Figure S2. Comparison of the runtimes of three methods. Runtimes as a function of the number of markers (p) and the number of individuals (n). (A) p = 1000. (B) p = 10,000. (C) p = 100,000.




Figure S3. Estimation of the number of latent factors (K) in the generative simulations. Difference between the true K of the simulations and the K estimated by our cross-validation algorithm.




Figure S4. GWAS of a flowering trait with sparse LFMM, LASSO and BSLMM. Venn diagram of SNPs associated with the FT16 phenotype in each approach. The hits correspond to SNPs having non-null effect size estimates.




Figure S5. DNA methylation EWAS of smoking status in pregnant women (all chromosomes). (A) Estimated reverse effect size for LASSO. (B) Estimated effect size for sparse LFMM. (C) Estimated effect size for non-sparse methods (ridge LFMM, CATE and SVA).




Figure S6. EWAS of smoking status in women. Over-representation of enhancer regions in sparse LFMM candidate regions compared to the methylome. Blue bars correspond to the fraction of enhancer regions in each chromosome. Red bars correspond to the fraction of enhancer regions detected by sparse LFMM. The horizontal blue line represents the average number of enhancer regions per chromosome for the methylome. The red line represents the average number for sparse LFMM.




Figure S7