
Japanese Journal of Biometrics Vol. 39, No. 2, 55–84 (2018)

Original Article

A New Association Analysis Method for Gut Microbial Compositional Data Using Ensemble Learning

Tasuku Okui*1, Yutaka Matsuyama*1 and Shigeyuki Nakaji*2

*1 Department of Biostatistics, Graduate School of Health and Nursing, the University of Tokyo

*2 Department of Social Medicine, Graduate School of Medicine, the University of Hirosaki

e-mail: [email protected]

Received March 2018. Revised August 2018. Accepted November 2018.

Nowadays, many methods that employ the 16S ribosomal RNA gene (16S rRNA sequencing data) have been proposed for the analysis of gut microbial compositional data. 16S rRNA sequencing data is statistically multivariate count data. When multivariate data analysis methods are used for association analysis with a disease, 16S rRNA sequencing data is generally normalized before analysis models are fitted, because the total sequence read counts differ among subjects. However, proper methods for normalization have not yet been discussed or proposed. Rarefying is one such normalization method, which equalizes the total counts of subjects by subsampling a fixed number of counts from each subject. We expected that combining rarefying with ensemble learning could improve performance. We therefore proposed an association analysis method that combines rarefying with ensemble learning and evaluated its performance in a simulation experiment using several multivariate data analysis methods. The proposed method showed superior performance compared with the other analysis methods with regard to the ability to identify response-associated variables and the ability to classify a response variable. We also used each evaluated method to analyze the gut microbial data of Japanese people, and then compared these results.

Key words: Microbiome 16S rRNA sequencing data, Normalization, Multivariate data analysis, Ensemble learning.

1. Introduction

Metagenome analysis, which uses next-generation sequencing, is a method currently used

for surveying the ecological environment of the human gut microbiome (Kim 2013). Among metagenome analysis methods, 16S rRNA gene analysis sequences only the 16S rRNA region of the whole genome (Oulas 2015); it is used to identify microbial species based on their DNA. The microbial composition and diversity of species of each subject can be known

by 16S rRNA gene analysis.

With regard to the association analysis of a disease using 16S rRNA sequencing data, there

are two major approaches: univariate data analysis, that is, differential abundance analysis (Paulson 2013), and multivariate data analysis, which analyzes all microbial species at once. When the objective of the analysis is building a model that classifies a disease or identifying microbial species that classify a disease, multivariate data analysis methods are used (Schubert 2014, Tap 2017, Wu 2013, Mach 2015, Mahana 2016, Labus 2017, Lee 2014).

Statistically, 16S rRNA sequencing data is multivariate count data that indicates how many counts are detected for each species in each subject. The counts of a species represent the sequence read counts assigned to that microbial species. They are not absolute microbial counts,

and we should treat each species’ counts as relative values within each subject. Because the total

counts of each subject (which are called coverage or library size in general) are different among

subjects, each species’ counts should be normalized in order to compare the amount among sub-

jects. Various normalization methods (Weiss 2017) are proposed for next-generation sequencing

data, especially for RNA-sequence data (Rapaport 2013, Robinson 2010), and some methods

are proposed for 16S rRNA sequencing data. Proportion data, which is obtained by dividing each species' counts by the total counts of the subject, is considered a natural normalization method. However, statistically, it is compositional data (Aitchison 1982), which means that the sum of all variables' values for each subject is one. In compositional data, if the values of all variables but one are known, the value of the remaining variable is determined uniquely, and a pseudo-correlation occurs between the variables (Tsilimigras 2016, Mandel 2015, Gregory 2016). To deal with this problem, the conventional approaches to compositional data, namely log-ratio transformation and the method which divides each variable by one reference variable, are often used (Shankar 2015, Cao 2016). However, if a variable has a zero value, these methods cannot be applied, and each zero value needs to be imputed with a small value. As a result, the analysis result is affected by the imputed small values, because 16S rRNA sequencing data contains a large number of zero values (McMurdie 2014, Tsilimigras 2016).

As another normalization method, rarefying (McMurdie 2014), which subsamples a certain amount of counts from the 16S rRNA sequencing data of each subject and thereby equalizes the total counts, is frequently used. With this method, the data remain count data, and count data analysis methods, which take into account the over-dispersed property of each microbial variable, can be used. Moreover, unlike log-ratio transformation, zero values need not be imputed. When rarefying is applied, the total count of each subject is constant, and pseudo-correlation would not occur, because the counts are sampled randomly from the raw count data. However, the shortcoming of rarefying is that it does not use the available data fully, and some researchers are of the opinion that rarefying is inadmissible in abundance analysis (McMurdie


2014).

Although the proper normalization method depends on the data and the statistical model to be used, the performance of normalization methods has recently been evaluated only for differential abundance analysis, a univariate analysis in which a model is fitted to each microbial variable separately (Weiss 2017). On the other hand, when multivariate data analysis methods are used, the statistical models are diverse and proper normalization methods have not yet been considered. Therefore, association analyses are performed with normalization methods chosen rather arbitrarily; the normalization methods are not standardized. For 16S rRNA sequencing data at a microbial hierarchy with relatively many variables, such as family or genus, the compositional-data effect caused by using proportion data is uncertain, and even a performance evaluation comparing rarefying with proportion has not yet been carried out.

Within the normalization methods, the performance of rarefying might be improved by

combining it with ensemble learning. Ensemble learning is a machine learning method which

learns multiple models from one data set and combines each result to yield one final model

(Yang 2010). Learning multiple models from one data means learning multiple kinds of models

from the whole available data or learning multiple models by resampling from the data. Using

this method generally improves the predictive performance of a model. A series of methods have

been proposed, of which bagging and boosting are the representative methods (Buhlmann 2007,

Hastie 2009). Bagging bootstraps multiple samples from data, creates models from each sample,

and combines results. This method has already been proposed as an association analysis method

for 16S rRNA sequencing data (Tap 2017). When rarefying is used in association analysis, multivariate data analysis methods and ensemble learning are fitted after rarefying is applied to the data, so not all of the available data are used. In order to use all the available data, rarefying must be applied to the data multiple times and multiple models must be built. Then,

if we consider rarefying and building a model as one process of ensemble learning, repeating

this process multiple times results in a kind of ensemble learning. In other words, by including

rarefying and building a model into the process of ensemble learning, data which are not used

in the model fitting of a certain rarefied sample are used in those of another rarefied sample; the

demerit of rarefying might thus be resolved. Additionally, by combining rarefying and ensemble

learning, each rarefied sample’s model might become highly heterogeneous. Ensemble learning

elevates predictive performance by averaging multiple models, each of which has high variance,

and by lowering the variance of the averaged model. The more heterogeneous the models are, the smaller the variance of the averaged model becomes (Hastie 2009). As a result, the predictive

performance of a model and the identification ability of response-associated microbiomes might

be improved by combining rarefying with ensemble learning. Also, by using this method, the

problem of compositional data is avoided without omitting some available data. We then focus


on normalization methods for association analysis when a multivariate data analysis method is

used. Specifically, we have proposed a new association analysis method, which combines rarefying

with ensemble learning, and compared this method to other methods by simulation and real data

analysis.

2. Method

In this section, we first explain the multivariate data analysis methods used for evaluating

the proposed method. We then explain the proposed method and the models that were examined.

2.1 Multivariate data analysis methods

In 16S rRNA sequence data analysis, methods for univariate data analysis (differential abun-

dance analysis) are relatively limited, and methods of multivariate data analysis differ from study

to study. In this study, we examined 5 methods: Random forest (RF), Lasso, Ridge regression

(Ridge), Elastic Net (EN), and Sparse partial least squares discriminant analysis (SPLSDA).

These models were selected because they are used in 16S rRNA sequence data analysis (Yat-

sunenko 2012, Schubert 2014, Tap 2017, Wu 2013, Mach 2015, Mahana 2016, Labus 2017, Lee

2014, Chua 2017, Chen 2012, Knights 2011, Naseribafrouei 2014, Halfvarson 2017), and more

broadly in omics data analyses like metabolome data analysis or eQTL analysis (Statnikov 2013,

Determan 2015, Lu 2017, Michaelson 2010, Cho 2010, Acharjee 2013, Jiang 2014). Statistically,

these methods can be used even if the number of variables exceeds the number of subjects, and

we can evaluate each variable's contribution to a response variable easily. Although the Ridge regression model, unlike Lasso (Rush 2016, Lin 2014, Garcia 2014) and EN (Shankar 2015, Knights 2011), is not commonly used in 16S rRNA sequence data analysis, these three methods are mutually related, and we used all of them in order to evaluate their compatibility with the proposed method. Furthermore, with regard to SPLSDA, PLSDA is also used (Naseribafrouei 2014, Wu 2013), but we chose SPLSDA because it can be used even for small-sample data and has a high ability for variable selection.

2.1.1 Random Forests (RF)

RF is a machine learning method which combines classification and regression trees (CART)

and an ensemble learning method called bagging (Breiman 2001). It is one of the most used

analysis methods in 16S rRNA sequence data analysis. CART is a supervised machine learning

method which divides data sequentially and builds a predictive model based on the conditions

of explanatory variables’ values (Shimokawa 2013, Breiman 1984). Results of analysis by CART

are not influenced by monotone transformation of explanatory variables and are robust against

outliers. On the other hand, the CART model is a piecewise constant model, which assigns a constant as the predictive value of a response variable based on the conditions of the explanatory variables' values. Therefore, its predictive accuracy is low by itself, and combination with ensemble learning is effective (Hastie 2009). CART is combined with bagging (bootstrap aggregating) for use as


an ensemble analysis method in RF. Bagging is a method that generates multiple bootstrap

samples, builds a model from each sample, and combines their results (Breiman 1996). By using

bagging, one can improve predictive accuracy, but the improvement of predictive performance is not always large because the bootstrap samples are highly correlated. To cope with this problem, RF elevates the heterogeneity of each sample's model by reducing the number of variables used in each split of CART (Breiman 2001). RF evaluates each explanatory variable's contribution to a response variable by variable importance. Several calculation methods of variable importance have been proposed; in the R package randomForest (Breiman 2015), the variable importance of an explanatory variable is calculated from the improvement of the predictive error obtained by including that explanatory variable in model fitting. Additionally, in RF, a response variable is classified by majority vote over the classification results of the bootstrap samples.
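As a minimal sketch of how such an RF analysis can be run in R (the randomForest package is the one used later in Section 2.4; counts, counts_test and y are hypothetical placeholders for a subjects-by-taxa matrix and a binary factor):

library(randomForest)

# counts:      training matrix of taxa counts (subjects x taxa), hypothetical object
# counts_test: test matrix with the same columns, hypothetical object
# y:           binary response coded as a factor with levels "0" and "1" (assumption)
set.seed(1)
fit_rf <- randomForest(x = counts, y = y)          # default tuning parameters
imp    <- importance(fit_rf)                       # variable importance of each taxon
prob   <- predict(fit_rf, newdata = counts_test,   # predictive probability of class "1",
                  type = "prob")[, "1"]            # used later as the classification measure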

2.1.2 Lasso, Ridge regression and Elastic Net (EN)

Lasso, Ridge regression and Elastic Net (EN) are regression models, each of which uses regularization when estimating the coefficients. By using Lasso and EN, the coefficient estimates

of explanatory variables which are not associated with a response variable become zero, and

estimation can be done even when the number of variables exceeds the number of subjects. Ridge

regression does not make coefficient estimates of explanatory variables which are not associated

with a response variable exactly zero, but shrinks their values toward zero (Hastie 2009).

Denote the number of subjects as N , number of variables as p, each value of response

variable of i-th subject as yi, standardized explanatory variables vector of i-th subject as xi,

β0 as intercept and β as coefficient vector, respectively. EN minimizes the following function

(Friedman 2010):

\min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^T \beta \right)^2 + \lambda \sum_{j=1}^{p} \left[ \frac{1}{2}(1-\alpha)\beta_j^2 + \alpha |\beta_j| \right]

λ and α are tuning parameters determined often by cross-validation or BIC. When α = 1,

EN becomes Lasso (Tibshirani 1996), and when α = 0, EN becomes Ridge regression (Hoerl

1970). Lasso performs variable selection, but selects only some of a set of mutually correlated variables. If we use Ridge regression, the coefficient estimates of correlated variables become close to each other, but Ridge regression has no variable selection feature. EN includes both the Lasso and Ridge regression models as special cases.

Lasso, Ridge regression and EN can be extended to generalized linear models, and the model for a binary response variable is estimated by iteratively reweighted least squares (IRLS) (Friedman 2010). Describing the current coefficient estimates in the computation process as (\beta_0, \beta), the following function is optimized iteratively in the EN model:

\min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} \; \frac{1}{2N} \sum_{i=1}^{N} w_i \left( z_i - \beta_0 - x_i^T \beta \right)^2 + \lambda \sum_{j=1}^{p} \left[ \frac{1}{2}(1-\alpha)\beta_j^2 + \alpha |\beta_j| \right],

z_i = \beta_0 + x_i^T \beta + \frac{y_i - p(x_i)}{p(x_i)(1 - p(x_i))}, \quad w_i = p(x_i)(1 - p(x_i)), \quad p(x_i) = \frac{1}{1 + e^{-(\beta_0 + x_i^T \beta)}}

In this study, λ was determined by cross-validation, and when we used EN, α was fixed to 0.5 in

order not only to select variables but also to take into account the correlations between variables.
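As a sketch of how these penalized logistic models can be fitted with the glmnet package used in this study (lambda chosen by cross-validation, alpha = 0.5 for EN, 1 for Lasso, 0 for Ridge regression; the objects x, x_test and y are hypothetical):

library(glmnet)

# x, x_test: normalized explanatory matrices (subjects x taxa); y: binary 0/1 response
cv_en <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)   # cross-validated lambda
beta  <- coef(cv_en, s = "lambda.min")                       # coefficient estimates
eta   <- predict(cv_en, newx = x_test,                       # linear predictor used as
                 s = "lambda.min", type = "link")            # the classification measure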

2.1.3 Sparse partial least squares discriminant analysis (SPLSDA)

SPLSDA is an analysis method extended from partial least squares regression (PLS) (Chung 2010, Cao 2011). PLS is designed to cope with multi-collinearity, and the SPLS model uses a regularization like EN, which makes the interpretation of the coefficient estimates easy. Additionally, SPLSDA is an extension of SPLS for a categorical response variable. In this study, we used the SPLSDA model assuming an analysis whose response variable is binary.

2.1.3.1 PLS

PLS is similar to Ridge regression in that it was developed to cope with multi-collinearity (Mevik 2016). PLS replaces the explanatory variables with latent variables and builds an association model with a response variable using the latent variables.

We set the response variable vector as y \in \mathbb{R}^{n \times 1} and the standardized explanatory variable matrix as X \in \mathbb{R}^{n \times p}. PLS postulates latent variables T \in \mathbb{R}^{n \times K}, which are related to y and X. Using T, set the linear equations y = TQ^T + F and X = TP^T + E, where F \in \mathbb{R}^{n \times 1} and E \in \mathbb{R}^{n \times p}. Additionally, define the component coefficients matrix W by T = XW, where W \in \mathbb{R}^{p \times K}. PLS sequentially estimates the k-th component coefficients w_k and the regression coefficients \beta in the following steps.

(1) Set y_1 = y, X_1 = X as the initial values.

(2) Solve w_k = \mathrm{argmax}_{w_k} \{ \mathrm{corr}^2(y_1, X_1 w_k) \, \mathrm{var}(X w_k) \} by singular value decomposition, where w_k^T w_k = 1 and w_k^T \Sigma_{XX} w_j = 0 \ (j = 1, \ldots, k-1).

(3) Calculate t = X_1 w_k.

(4) Calculate the loading scalar and vector as q = y_1^T t and p = X_1^T t, where t \in \mathbb{R}^{n \times 1}.

(5) Update y_1 = y_1 - qt, X_1 = X_1 - t p^T.

(6) Repeat (2) to (5) over each component k (1, \ldots, K).

(7) Writing the regression model as y = X\beta, calculate \beta using T and W:

\beta = W (P^T W)^{-1} (T^T T)^{-1} T^T y
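The following is a minimal R sketch, written for illustration only, of a single PLS component following steps (2) to (5) for a centered response y and a standardized matrix X; it uses the usual NIPALS scaling of the loadings by t^T t, which the final formula for beta absorbs through (T^T T)^{-1}:

pls_one_component <- function(X, y) {
  w   <- crossprod(X, y)                              # direction with maximal covariance with y
  w   <- w / sqrt(sum(w^2))                           # normalize so that w'w = 1
  t_k <- X %*% w                                      # latent score t = X w
  q   <- as.numeric(crossprod(y, t_k)) / sum(t_k^2)   # response loading
  p_k <- crossprod(X, t_k) / sum(t_k^2)               # X loadings
  beta <- as.numeric(q / crossprod(p_k, w)) * w       # regression coefficients for K = 1
  list(beta = beta,
       y_deflated = y - t_k * q,                      # deflate y for the next component
       X_deflated = X - t_k %*% t(p_k))               # deflate X for the next component
}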

2.1.3.2 SPLSDA

Sparse partial least squares regression (SPLS) (Chun 2010) is an extension of PLS and esti-

mates β while selecting variables that are strongly associated with a response variable. SPLSDA

is an expanded model of SPLS for categorical response variable (Chung 2010). SPLSDA and

SPLS are the same when the response variable is a univariate binary variable. Variable selection

is done by making a part of component coefficients wk zero using a regularization term. As

computational convenience, a regularization term is not directly imposed onto wk, but onto the


surrogate coefficients vector c.

We describe the estimation method of SPLS (Chun 2010). Like PLS, each column of the component coefficients W \in \mathbb{R}^{p \times K} is estimated sequentially by the following steps.

(1) Set y as the initial value of y_1.

(2) Solve \min_{w_k, c} \{ -\kappa w_k^T M w_k + (1-\kappa)(c - w_k)^T M (c - w_k) + \lambda_1 |c|_1 + \lambda_2 |c|_2^2 \}, where w_k^T w_k = 1 and M = X^T y_1 y_1^T X. This function is minimized when c = w_k. When \lambda_2 = \infty, c results in the following values:

c = \left( |Z| - \eta \max_{1 \le j \le p} |Z_j| \right)_{+} \mathrm{sgn}(Z), \quad Z = \frac{X^T y_1}{\| X^T y_1 \|}, \quad j = 1, \ldots, p

Then, w_k is estimated, and the variables with c \ne 0 are selected.

(3) Let A be the set of variables that were selected in any of components 1, \ldots, (k-1). Revise A \leftarrow A + (c \ne 0), and fit PLS to the response variable using only the variables included in A.

(4) Based on the estimated \beta^{PLS}, revise y_1 as y_1 = y_1 - X \beta^{PLS}.

(5) Repeat Steps (2) to (4) over each component k (1, \ldots, K).

Using the variables which have been selected up to the last component, the PLS model is fitted to the response variable y, and \beta is finally estimated. In this study, the number of components K was fixed to 1 because Rohart (2017) pointed out that the number of response categories minus 1 is sufficient for the number of components; this was checked by a small examination in advance. The regularization parameter \lambda_1 was decided by 10-fold cross validation, varying \lambda_1 from 0.1 to 0.9.
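As an illustration of the thresholded direction vector of Step (2) (the case \lambda_2 = \infty), a minimal R sketch is given below; eta in (0, 1) is the sparsity parameter tuned by cross-validation, and X and y1 are hypothetical objects:

spls_direction <- function(X, y1, eta) {
  Z   <- crossprod(X, y1)
  Z   <- Z / sqrt(sum(Z^2))                # Z = X'y1 / ||X'y1||
  thr <- eta * max(abs(Z))                 # threshold relative to the largest |Z_j|
  sign(Z) * pmax(abs(Z) - thr, 0)          # soft thresholding; zero entries drop variables
}
# Variables with a nonzero entry, e.g. which(spls_direction(X, y, eta = 0.5) != 0),
# enter the active set A in Step (3).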

2.2 Proposed method

We explain the method of combining rarefying with ensemble learning. Rarefying is one

of the most commonly used normalization methods for 16S rRNA sequence data analysis (Yat-

sunenko 2012, Tap 2017, Weiss 2017), and is performed by the following steps (McMurdie 2014):

(1) Determine the cut-off value of the total sequence read counts of subjects.

(2) Exclude subjects whose total counts are smaller than the cut-off value.

(3) With regard to the rest of the subjects, subsample counts equivalent to the cut-off value

from each subject in order to equalize the total counts of the subjects.

In this study, we used a simple approach for ensemble learning:

(1) Apply the rarefying process repeatedly to the same data and generate multiple subsamples.

(2) Build a classification model from each subsample.

(3) Integrate these models into one final model.

We call this method sagging (subsample aggregating), named after bagging (bootstrap aggregating). In this study, we conducted the rarefying process based on the minimum total count over all subjects and did not exclude any subjects, because the data we analyzed contained no subjects whose total counts were extremely small.
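A minimal R sketch of rarefying and of the sagging procedure is given below, assuming a subjects-by-taxa count matrix counts and a binary response y (hypothetical objects); the model fitted to each rarefied sample is the EN logistic model as one example, and the averaged coefficients serve as the measure of degree of association:

library(glmnet)

rarefy_once <- function(counts, depth) {
  t(apply(counts, 1, function(x) {
    reads <- rep(seq_along(x), times = x)        # expand the count vector into single reads
    sub   <- sample(reads, size = depth)         # subsample `depth` reads without replacement
    tabulate(sub, nbins = length(x))             # back to one count per taxon
  }))
}

sagging_en <- function(counts, y, B = 50) {
  depth <- min(rowSums(counts))                  # cut-off = minimum total count, as in this study
  coefs <- replicate(B, {
    sub_counts <- rarefy_once(counts, depth)     # one rarefied sample
    fit <- cv.glmnet(sub_counts, y, family = "binomial", alpha = 0.5)
    as.numeric(coef(fit, s = "lambda.min"))[-1]  # drop the intercept
  })
  rowMeans(coefs)                                # averaged coefficients across the B samples
}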


2.3 Models compared in this study

In this study, we evaluated the performance of the method combining rarefying with sagging

using the multivariate data analysis methods introduced above. For the evaluation, in each multivariate

analysis method, we performed 4 patterns of analysis: analysis that uses proportion as the

normalization method, analysis that uses rarefying as the normalization method, analysis that

combines rarefying with sagging, and analysis that uses proportion as the normalization method with bagging as ensemble learning. Because ensemble learning generally elevates the predictive

performance, we added the analysis that uses proportion as the normalization method with

bagging. In this method, we generated multiple bootstrap samples and used proportion as the

normalization method for each sample. From each sample, we built a predictive model, and

combined the models into a final model.

We thus evaluated 19 models depending upon the combination of normalization methods,

multivariate data analysis methods, and ensemble learning methods. In Table 1, we present these 19 models.

Table 1. Models compared in this study

Also, as the evaluation methods, we indicate the methods of calculating the

measure of each explanatory variable’s degree of association with a response variable and the

methods of evaluating classification performance of a model. As for RF, we did not evaluate the

method that uses proportion as the normalization method with bagging, because RF already uses

bagging in the model learning process. In Lasso, Ridge regression, EN and SPLSDA, we adopted the regression coefficients, or their mean when ensemble learning was used, as the measure of degree of association with a response variable. Lasso, EN and SPLSDA are generally evaluated by the accuracy of the selection of response-associated variables. However, because we used methods combining ensemble learning and had the objective of evaluating the ability to identify variables that classify a response variable, we looked at the coefficient estimates.

As the evaluation method of classification performance, we used the linear predictors, or their mean, when we used Lasso, Ridge regression, SPLSDA and EN. RF cannot output linear predictors like ordinary regression models; thus, we used the predictive probability of the response variable. With regard to the sampling times for ensemble learning, we used 50 for all ensemble learning models for computational feasibility. The effect of the sampling times of the proposed method was also examined in the simulation experiment.
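As a sketch of the classification-performance side, the linear predictors of the ensemble members can be averaged and the AUC computed with the pROC package used in this study; fits is a hypothetical list of cv.glmnet fits from the rarefied samples, and x_test and y_test are hypothetical test-data objects normalized in the same way:

library(pROC)

lin_pred <- sapply(fits, function(fit)
  as.numeric(predict(fit, newx = x_test, s = "lambda.min", type = "link")))
score    <- rowMeans(lin_pred)              # mean linear predictor per test subject
auc_test <- auc(roc(y_test, score))         # AUC used as the measure of classification performance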

2.4 Analysis software used

All analyses were done with R 3.3.4 (R Core Team 2017). RF was performed with randomForest (Breiman 2015), using the default values of the tuning parameters. Lasso, Ridge regression and EN were performed with glmnet (Friedman 2017), and SPLSDA with the package spls (Chung 2015). SPLSDA is usually performed with mixOmics (Cao 2017), but with mixOmics we have to specify in advance the number of explanatory variables used for the association analysis and cannot select variables based on predictive performance. Therefore, we used the package spls. When the response variable is binary, SPLSDA is the same as the SPLS model with respect to the computational algorithm (Chung 2015). We used gplots for obtaining heatmaps (Warnes 2016). For the computation of AUC values, pROC (Robin 2017) was used. Programs for rarefying and ensemble learning were written by the authors.

3. Simulation experiment

3.1 Simulation setting

We evaluated the performance of each model by a simulation experiment. We evaluated the

ability to identify variables which are associated with a response variable and the classification

performance for the 19 models presented in Table 1.

With regard to the setting of simulation data, we set the number of variables as 100 and 300,

supposing the family and genus as the microbial hierarchy of data, respectively. To evaluate the

performance in a setting where the number of subjects is relatively small, which is the case in many 16S rRNA sequencing data analyses, we set the number of subjects as 100.


We set 2 patterns for the variability of the total sequence read counts of each subject, because the variability of the counts is related to the performance of the normalization methods. If the counts of all subjects were equal, the data would not need to be normalized. In one half of all patterns, the variability of the total read counts between subjects was set to be large, and in the other half, the variability was set to be relatively small. In each pattern, we generated the total counts from a log-normal distribution, where (µ, σ) are (9.5, 0.5) and (9.5, 0.2), respectively:

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)

In each setting, the mean of total counts and its standard deviation is about (15140,8068), and

(13630,2754), respectively.

We generated the microbial count data in the following way to make the data zero-inflated like actual 16S rRNA sequencing data. We generated an (n × p) matrix from a multivariate normal distribution. Each element of the covariance matrix of the multivariate normal distribution (p × p) was generated from a uniform distribution, U(0, 0.1) or U(0, 0.8), in two ways. In the case of U(0, 0.1), the correlations among explanatory variables were relatively small; they were large in the case of U(0, 0.8). Each value of the mean vector of the multivariate normal distribution (p × 1) was generated from the normal distribution N(4, 2). The elements of the generated matrix were exponential-transformed, and then each element was changed into the proportion scale; that is, denoting column j of row i as x_{ij}, x_{ij} was transformed into x_{ij} = x_{ij} / \sum_{j=1}^{p} x_{ij}. Based on the proportion data, the count data of each explanatory variable were generated from a multinomial distribution. Denoting the total counts of subject i as N_i, the count vector n_i of subject i was generated from the following distribution:

n_i \sim \mathrm{Multinomial}(N_i, x_i),

where n_i = (n_{i1}, \ldots, n_{ip}) and x_i = (x_{i1}, \ldots, x_{ip}). Each value y_i of the binary response variable was generated based on the proportion data:

\Pr(y_i = 1) = \mathrm{expit}(x_i \beta + \varepsilon)

As for β, we randomly picked 30 variables to be associated with the response variable, and the coefficients of these variables were generated from the normal distribution N(1, 1). The coefficients of

the rest of the variables were set to zero. We set 2 patterns with respect to the effect of other factors on the response variable, that is, no effect in one pattern and a large effect in the other. Accordingly, ε was generated from the normal distribution N(0, 1) in half of all the patterns, as a random error, and from the normal distribution N(−2, 3) in the other half, as a systematic error. The mean value of the coefficients of the explanatory variables that were related to the response variable was set to 1, and in the case of ε from the distribution N(−2, 3), the effect of the systematic error was larger than x_i β on average, so the identification of variables was relatively difficult. We generated these data for model building and for classification performance assessment separately in each pattern. We then trained models on the model building data and evaluated

their classification performance on the other data by calculating the testing AUC. The coefficients of the classification performance assessment data were set to be the same as those of the model building data.

Table 2. Simulation setting of each pattern

We conducted the first experiment for 16 patterns in total. The setting of each pattern is presented in Table 2. As the evaluation method for each model, we calculated AUC values for the coincidence between the measure of the degree of association and the true presence or absence of association with the response variable. Additionally, Spearman's rank correlation coefficient between the true coefficients and the measure of the degree of association was calculated. As for classification performance, we calculated AUC values for the coincidence between the measure of performance and the response variable. Also, in the classification performance evaluation, we calculated the AUC values of the true model in addition to the 19 models, for reference. The number of simulation runs was set to 300.
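A condensed R sketch of the data-generating mechanism of one pattern (p = 100, small correlations, random error) is given below; the symmetrization of the covariance matrix and the setting of its diagonal to one are our own assumptions added to obtain a valid covariance matrix, and MASS::mvrnorm is used for the multivariate normal draws:

library(MASS)

set.seed(1)
n <- 100; p <- 100; n_assoc <- 30

N_i   <- round(rlnorm(n, meanlog = 9.5, sdlog = 0.5))   # total read counts per subject
Sigma <- matrix(runif(p * p, 0, 0.1), p, p)             # covariance elements from U(0, 0.1)
Sigma <- (Sigma + t(Sigma)) / 2; diag(Sigma) <- 1       # assumption: make a valid covariance matrix
mu    <- rnorm(p, mean = 4, sd = 2)                     # mean vector from N(4, 2)

latent <- mvrnorm(n, mu = mu, Sigma = Sigma)            # multivariate normal draws
prop   <- exp(latent)
prop   <- prop / rowSums(prop)                          # exponentiate and rescale to proportions

counts <- t(sapply(seq_len(n), function(i)
  rmultinom(1, size = N_i[i], prob = prop[i, ])))       # multinomial counts per subject

beta <- numeric(p)
beta[sample(p, n_assoc)] <- rnorm(n_assoc, 1, 1)        # 30 response-associated variables
eps  <- rnorm(n, 0, 1)                                  # random-error pattern
pr   <- plogis(as.vector(prop %*% beta) + eps)          # expit of the linear predictor
y    <- rbinom(n, 1, pr)                                # binary response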

As the second simulation experiment, we examined the appropriate number of sampling times for the proposed method by varying the sampling times from 10 to 50 in steps of 10.

3.2 Simulation results

Table 3 shows the result of the AUC values for the coincidence between the measure of the

degree of association and the true presence or absence of association with a response variable,

and Table 4 shows the standard deviations of the AUC values. Among the models that used RF

(models 1, 2, 3), the AUC values of model 1, which used proportion data, were the highest in all patterns. However, the differences between models 1, 2 and 3 were small in all patterns, and the values were relatively


Table 3. AUC values for the measure of degree of association


Table 4. Standard deviation of AUC values for the measure of degree of association


stable across all patterns. As for Lasso (models 4,5,6,7), the AUC values of the model that used

proportion (model 4) were higher than those of the model that used rarefying (model 5), across

all patterns. This result was also true for Ridge regression and EN. Also, in many cases, the

AUC values of the model that combined proportion with bagging (model 6) were higher than those of model 4 when the number of variables was 300, but lower when the number of variables was 100. The AUC values of the model that used the proposed method (model 7) were the highest in

almost all patterns and the values were relatively stable across all patterns. As for Ridge (models

8,9,10,11), the AUC values of the model that combined proportion with bagging (model 10) were

lower than those of the model that used proportion (model 8). The AUC values of model 8 were

the highest in all patterns, and the AUC values of the model that used the proposed method

(model 11) were slightly lower than those of model 8.

As for EN (models 12, 13, 14, 15), the results were almost the same as those of Lasso, but the AUC

values of the model that used the proposed method (model 15) were higher than those of Lasso

(model 7) across all patterns. As for SPLSDA (models 16,17,18,19), the AUC values of the model

that used rarefying (model 17) were higher than those for the model that used proportion (model

16) across all patterns. Additionally, the AUC values of the model that combined proportion

with bagging (model 18) were always higher than those of model 16. The AUC values of the

model that used the proposed method (model 19) were the highest in almost all patterns, and

the values were relatively stable across all patterns. Throughout the experiment, in Lasso, Ridge

regression, EN and SPLSDA, the AUC values improved by using the proposed method across

all patterns. On the other hand, this was not the case in RF. If we compare models across

different multivariate data analysis methods, the AUC values of Lasso, Ridge regression, EN

and SPLSDA exceeded those of RF when the proposed method was used. As for the standard

deviations, performance improvement by the proposed method was not clearly apparent.

Table 5 shows the result of Spearman's rank correlation coefficients between the true coef-

ficients and the measure of degree of association, and Table 6 shows the result of the standard

deviations. As for RF, model 3 had the highest correlation coefficients across all patterns. As for

Lasso, Ridge regression, EN and SPLSDA, the correlation coefficients of the models using ensem-

ble learning were higher than those of the models without ensemble learning. As for Lasso, Ridge

regression and EN, the correlation coefficients of the models that used the proposed method were

higher than those of the models that used proportion with bagging across almost all patterns, and

the standard deviations of the correlation coefficients for the proposed method were the lowest

across many patterns. As for SPLSDA, in many patterns, the correlation coefficient values of the

model that used the proposed method were higher than those of the model that used proportion

with bagging when the variance of total counts was large, and lower when the variance of total

counts was small.


Table 5. Correlation coefficient for the measure of degree of association


Table 6. Standard deviation of correlation coefficient for the measure of degree of association


Table 7. AUC values for classification performance


Table 7 shows the result of the AUC values for coincidence between the measure of predictive

performance and a response variable, and Table 8 shows the standard deviations. As for RF,

there were no large differences between the models, but the AUC values of model 3 were the

highest across all models. As for Lasso, the AUC values of model 7 were the highest across

many patterns, and the degree of improvement of the AUC values was large when the number

of variables was 300. As for Ridge regression, the AUC values of model 8 and those of model 11

were almost equivalent and the highest across all patterns. As for EN, the AUC values of model

14 dropped noticeably compared with model 12 when the variance of the total counts was large or the number of variables was 100. The AUC values of model 15 were the highest across all

patterns. As for SPLSDA, the AUC values of model 18 were always higher than those of model

16 when the variance of total counts was small, but this was not the case when the variability

of the total counts was large. The AUC values of model 19 were the highest across all patterns.

Throughout the evaluation of classification performance, the AUC values improved by using the

proposed method in all analysis methods. Also, the standard deviations of the models that used

the proposed method were the lowest in many patterns except for Lasso, where the standard

deviations also improved in model 6 as well as model 7. However, the AUC values and the

standard deviations of the proposed method were inferior to those of the true model.

The second simulation experiment, which evaluated the effect of the sampling times of the proposed method, was conducted in the settings of patterns 1 and 2 of Table 2 as representatives of all patterns. As the models to evaluate, models 7, 11, 15 and 19 of Table 1 were used. We did not use the RF model because no performance improvement by the proposed method was confirmed in the first experiment.

Table 9 shows the result of the second experiment. As for the AUC values and the correlation coefficients for the measure of the degree of association, the values improved with increasing sampling times when the number of variables was 300, but the degree of improvement was relatively small in the Ridge model. On the other hand, the AUC values for classification performance did not improve with increasing sampling times. When the number of variables was 100, the two AUC values, the correlation coefficients and their standard deviations did not improve with increasing sampling times.


Table 8. Standard deviation of AUC values for classification performance


Table 9. The effect of the sampling times of the proposed method


4. Real data analysis

We fitted the evaluated models to Japanese gut microbial data. The data was collected at a

medical checkup, which was conducted in the Iwaki district of Hirosaki city in Aomori prefecture

every year. This medical checkup is held as part of a study for the prevention of dementia and lifestyle-related diseases, whose objective is to develop preventive methods for multi-factor diseases and promote the health of local residents. The target of the medical checkup is people older than 20 years. Gut microbial data began to be collected in 2015. In this study, we

used microbial 16S rRNA sequencing data, which was collected in 2015, and BMI. In 2015, 1148

people participated in the medical checkup, and the number of people whose microbial data and

BMI were collected was 1074.

As for the microbiome 16S rRNA sequencing data, DNA was extracted from feces samples, and amplicon analysis of the V3–V4 regions was done on a MiSeq by Techno Suruga Labo. The sequenced reads were grouped into Operational Taxonomic Units (OTUs), and the OTUs were classified into microbial species at the medical institute of the University of Tokyo. The classification was done at each microbial hierarchy, and in this study, we used family and genus as the microbial hierarchies for data analysis. As the response variable, we used BMI, setting 25 as the threshold value to convert it into a binary variable. We applied each model to the 16S rRNA sequencing data and compared the results. We also used age as an explanatory variable for model building. Table 10

shows the attributes of the subjects and the 16S rRNA sequencing data. The number of families

observed was 110 and the number of genera was 304. The mean of the observed number of families

for each subject was 26.2 and 51.1 for the number of genera, and the number of microbiomes

each subject had was relatively limited.

Table 10. Attributes of subjects and 16S rRNA sequencing data

Fig. 1. Families associated with BMI

Fig. 2. Genera associated with BMI

Figures 1 and 2 show heatmaps indicating the rank of each microbiome with respect to the measure of degree of association in each model. We used family as the microbial hierarchy in Figure 1, and Figure 2 shows the result for genera. For visibility, the heatmaps show only microbiomes whose mean rank is below 200 across all models in Figure 1, and below 100 across all models in Figure 2. Each cell of a heatmap shows the rank of a microbiome within each model, and a color is assigned based on ranks 1–100, such that the darker the color, the more strongly the microbiome is associated with the response variable. If the measure of


degree of association of a microbiome was 0, the microbiome’s rank was set to the lowest value.

In Figures 1 and 2, the RF models (models 1, 2, 3) showed similar results. On the other hand, the results of Lasso, Ridge regression, EN and SPLSDA differed between models in both figures. As for Lasso, model 4 scarcely identified microbiomes, whereas model 6 identified many microbiomes. As for Ridge regression, all models identified many microbiomes, and the difference between models 9 and 11 was small. As for EN and SPLSDA, the models without ensemble learning (models 12, 13, 16, 17) scarcely identified microbiomes, and the models that used proportion with bagging (models 14, 18) identified many microbiomes.

5. Discussion

In this study, we focused on normalization methods when multivariate data analysis meth-

ods are used. We then proposed a new association analysis method combining rarefying with

ensemble learning and evaluated its performance in simulation experiments. Although the ef-

ficacy of the ensemble methods was already suggested (Shankar 2015), the significance of this

study is the proposal of a method that incorporates rarefying into the ensemble learning process.

As a result of the simulation experiment, in Lasso, EN and SPLSDA, the proposed method showed higher performance than the other methods in terms of the AUC values for the measure of degree of association, the correlation coefficient for the measure of degree of association, and the AUC values for classification performance. As for Ridge regression, the proposed method and the

model that used proportion were almost equivalent in terms of the AUC values for the measure

of degree of association and the AUC values for classification performance. However, the pro-

posed method was superior in terms of the correlation coefficient for the measure of the degree

of association. In contrast to the other methods, the proposed method did not show superior

performance in RF. The cause of this result is unclear, but one explanation is that RF, like Ridge regression, is not a variable selection method, so there is no possibility that the measure of degree of association of a response-associated variable accidentally becomes zero. The effect of ensemble learning might therefore be larger in variable selection methods than in other regression analysis methods. Another possible cause is that RF already uses ensemble learning, and the bias caused by rarefying did not become zero even when sagging was used. This problem might be mitigated by increasing the sampling times, but a large improvement would not be achieved because RF already uses bagging. Furthermore, RF might not be influenced by the pseudo

correlation caused by proportion, unlike ordinary regression models, because the results were

relatively stable across patterns. Results of the simulation experiment indicate that when RF is

used, the normalization method does not markedly influence the result. Lasso, Ridge regression,

EN and SPLSDA are ordinary regression models, each of which learns the model without en-

semble learning in itself, and the correlation coefficient values rose by using ensemble learning in

these models. Additionally, in the real data analysis, microbiomes were rarely identified by the models without ensemble learning (models 4, 12, 13, 16, 17); ensemble learning is thought to be


important for evaluating the effect of variables. However, taking into account the result of AUC

values for the measure of the degree of association, the coefficient estimates of variables which

were not related to a response variable were overestimated by pseudo-correlation in models that

combined proportion with bagging. Results of the real data analysis show that models combining

proportion with bagging (models 6,10,14,18) identified many microbiomes, but the measure of

the degree of association of many microbiomes might be overestimated.

As for classification performance, in Lasso, Ridge regression, SPLSDA and EN, when the

variability of the total counts was large and the number of variables was 100, AUC values dropped

by using the models combining proportion with bagging, compared with the models that used

proportion. When the variability of the total counts was large, the effect of pseudo-correlation caused by proportion data increased, and the generalizability of the coefficient estimates of the models that used proportion worsened. As a result, the models using proportion with bagging overestimated

the coefficient estimates more than the models that used proportion only, evidenced by the result

of the AUC value and correlation coefficient for the measure of association. Though the AUC

values for the measure of the degree of association and the correlation coefficients for the measure

of the degree of association remained low across all models, the cause for it is considered to be

the simulation setting wherein variables whose values were almost all zero became the associated

variables for a response variable in many cases. As a result, the difference between the proposed

method and the true model was large in terms of the AUC values for classification performance.

From these results, in Lasso, Ridge regression, SPLSDA and EN, one can identify variables

associated with a response variable more accurately by using the proposed method. On the other

hand, RF’s performance was worse than models with the proposed method (models 7,11,15,19)

across all patterns. RF is one of the most commonly used analysis methods for 16S rRNA se-

quencing data; it is often used with rarefying. But from this experiment, it was suggested that

other regression models with the proposed method would identify response-associated explana-

tory variables more accurately. As for proportion data, the performance of the models with

proportion was superior to the models with rarefying when ensemble learning was not used in

many cases. Proportion data is generally avoided because it is compositional data, but it was

suggested that as the number of variables becomes larger, the problem of proportion data becomes smaller. We did not examine cases of 1,000 or more variables because of the computational burden; proportion data might pose no problems if data at a microbial hierarchy such as species were

used. As for the sampling times of the proposed method, 10 times were sufficient when the

number of variables was 100. On the other hand, when the number of variables was 300, the

performance improved with increasing sampling times, especially in terms of the correlation coefficient for the measure of degree of association. We can identify variables associated with a response variable more accurately by increasing the sampling times when the number of variables is large, but the difference between 10 and 50 sampling times was relatively small. It depends on the


data, but, if the computational burden is large, about 10 times might be sufficient.

In identifying microbiomes associated with a response variable, the shortcoming of ensemble

learning is that the interpretation of results from the analysis becomes difficult. When we use

the proposed method, there are cases where we cannot judge whether a microbiome is associated

with the response variable based on the rank of the coefficient estimates. If we want to select variables, one proposal, sketched after this paragraph, is to adopt variables in order of the rank of the coefficient estimates up to the point where the classification error on other data is minimized. Other association methods

such as differential abundance analysis or ordinary variable selection models can select variables

easily, but their selection is not always accurate and they cannot evaluate the effect of variables

associated with the response variable in the way ensemble-learning methods can. If the association with the response variable is relatively limited, as in the real data analysis in this study, the number of identified variables does not increase considerably even when the proposed method is used in Lasso, EN and SPLSDA (models 7, 15, 19). On the other hand, if we use Ridge regression with the proposed

method (model 11), many variables are identified (Figure 1,2) because the coefficient estimates

are not forced to become zero in Ridge regression. Then, Ridge regression is inferior to the other

methods in terms of ease of interpretation of analysis results. If we want to build a classification

model with a response variable rather than identify variables associated with a response variable,

the proposed method is useful for the four analysis methods including Ridge regression.
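The selection rule proposed above can be sketched as follows (our own illustration; coef_mean is the averaged coefficient vector from the proposed method, x_valid and y_valid are other data, and classifying by the sign of the linear predictor while ignoring the intercept is a simplifying assumption):

select_by_rank <- function(coef_mean, x_valid, y_valid) {
  ord <- order(abs(coef_mean), decreasing = TRUE)      # rank variables by |averaged coefficient|
  err <- sapply(seq_along(ord), function(k) {
    b <- coef_mean
    b[-ord[seq_len(k)]] <- 0                           # keep only the top-k ranked variables
    pred <- as.numeric(x_valid %*% b > 0)              # classify by the sign of the linear predictor
    mean(pred != y_valid)                              # classification error on the other data
  })
  ord[seq_len(which.min(err))]                         # adopt variables up to the error minimum
}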

A limitation of this study is that we examined only 5 major association analysis methods;

other methods should also be examined. Other methods, for example, support vector machine

and nearest shrunken centroids (Knights 2011), are also used, but if a method does not include ensemble learning in its learning process, its ability to identify response-associated microbiomes would improve by using the proposed method. The degree of improvement when using the proposed method depends on the analysis method. Another limitation was that we did not

examine the ordinary approaches for compositional data, that is, log-ratio transformation and

the method which uses one variable as reference. These methods are used as the normalization

methods for Lasso, EN and SPLSDA. As shown in Table 10, the variety of microbiomes that one subject has is limited, and there are a lot of zero values in the 16S rRNA sequencing data. In the real data analysis, we could not use log-ratio transformation even after imputing the zero values with small values, because the computation involved multiplying many tiny values and estimation could not be done. Omitting such low-abundance microbiomes from the data analysis in advance may solve this problem, but 16S rRNA sequencing data are collections of such low-abundance microbiomes, and omitting them may make the data analysis meaningless. By using the proposed

method, one can identify response-associated microbiomes without omitting microbiomes. The

ordinary approaches for compositional data were not developed for 16S rRNA sequencing data,

and if we use data with a large variety of microbiomes, like genus or species, these methods

cannot be used properly.


As mentioned in the introduction, although proportion data are considered to be a problem,

simulation experiments for multivariate data analysis methods have not yet been tried, and

methods like rarefying or log-ratio transformation are used arbitrarily. As the simulation results

suggest, the performance of the analysis methods varies depending on the normalization method. The simulation patterns we examined were limited, in that parameters such as the number of subjects, the total read counts and their variability are more diverse in reality. As for the total read counts, the values vary between a few thousand and several hundred million. Broader simulation experiments for normalization methods should therefore be performed in the future.

6. Conclusion

We proposed a new association analysis method that combines rarefying with ensemble learning

and examined its performance. As a result, it was suggested that if we use ordinary regression

models, we can identify microbiomes associated with a response variable and classify the response

variable more accurately using the proposed method.

Acknowledgements

This research is partially supported by the Center of Innovation Program from Japan Science

and Technology Agency, JST. Additionally, we would like to thank the two referees for their

thorough review of the manuscript and appropriate comments.

REFERENCES

Aitchison J (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society, 44: 139–177.

Acharjee A, Finkers R, Visser RGF, Maliepaard C (2013). Comparison of regularized regression

methods for omics data. Metabolomics; 3: 3.

Breiman L, Friedman JH, Olshen RA, Stone CJ (1984). Classification and Regression Trees.

Chapman & Hall (Wadsworth, Inc.): New York.

Breiman L (1996). Bagging predictors. Machine learning , 24: 123–140.

Breiman L (2001). Random forests. Machine Learning , 45: 5–32.

Breiman L, Cutler A, Liaw A, Wiener M (2015). package ‘randomForest’ Breiman and Cut-

ler’s Random Forests for classification and regression. https://cran.r-project.org/web/

packages/randomForest/randomForest.pdf.

Buhlmann P, Hothorn T (2007). Boosting algorithms: regularization, prediction and model fit-

ting. Statistical science, 22, 477–505.

Cao KA Le, Costello ME, Lakis VA, Bartolo F, Chua XY (2016). MixMC: A multivariate sta-

tistical framework to gain insight into microbial communities. PloS One, 11: 8.


Cao KA Le, Rohart F, Gonzalez I, Dejean S, Gautier B, Bartolo F (2017). Package 'mixOmics': omics data integration project. https://cran.r-project.org/web/packages/mixOmics/mixOmics.pdf.

Cao KA Le, Boitard S, Besse P (2011). Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics, 12: 253.

Chen W, Liu F, Ling Z, Tong X, Xiang C (2012). Human intestinal lumen and mucosa-associated microbiota in patients with colorectal cancer. PLoS One, 7.

Cho S, Kim K, Kim YJ, Lee JK, Cho YS, Lee JY, Han BG, Kim H, Ott J, Park T (2010). Joint identification of multiple genetic variants via elastic-net variable selection in a genome-wide association analysis. Annals of Human Genetics, 74: 416–428.

Chua LL, Rajasuriar R, Azanan MS, Abdullah NK, Tang MS, Lee SC, Woo YL, Lim YAL, Ariffin H, Loke P (2017). Reduced microbial diversity in adult survivors of childhood acute lymphoblastic leukemia and microbial associations with increased immune activation. Microbiome, 5: 35.

Chun H, Keles S (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society, 72: 2–25.

Chung D, Keles S (2010). Sparse partial least squares classification for high dimensional data. Statistical Applications in Genetics and Molecular Biology, 9: 1.

Chung D, Chun H, Keles S (2015). Package 'spls': sparse partial least squares (SPLS) regression and classification. https://cran.r-project.org/web/packages/spls/spls.pdf.

Determan Jr. CE (2015). Optimal algorithm for metabolomics classification and feature selection varies by dataset. International Journal of Biology, 7: 1.

Friedman J, Hastie T, Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33: 1.

Friedman J, Hastie T, Simon N, Qian J, Tibshirani R (2017). Package 'glmnet': lasso and elastic-net regularized generalized linear models. https://cran.r-project.org/web/packages/glmnet/glmnet.pdf.

Garcia TP, Muller S, Carroll RJ, Walzem RL (2014). Identification of important regressor groups, subgroups and individuals via regularization methods: application to gut microbiome data. Bioinformatics, 30: 831–837.

Gregory B, Reid G (2016). Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. NRC Research Press, 62: 692–703.

Halfvarson J, Brislawn CJ, Lamendella R, Vazquez-Baeza Y, Walters WA, Bramer LM, D'Amato M, Bonfiglio F, McDonald D, Gonzalez A, McClure EE, Dunklebarger MF, Knight R, Jansson JK (2017). Dynamics of the human gut microbiome in inflammatory bowel disease. Nature Microbiology, 2.

Hastie T, Tibshirani R, Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. Springer: New York.

Hoerl AE, Kennard RW (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12: 55–67.

Jacobs JP, Goudarzi M, Singh N, Tong M, Mchardy IH, Ruegger P, Asadourian M, Moon BH, Ayson A, Borneman J, McGovern DP, Fornace AJ Jr, Braun J, Dubinsky M (2016). A disease-associated microbial and metabolomics state in relatives of pediatric inflammatory bowel disease patients. Cellular and Molecular Gastroenterology and Hepatology, 2: 750–766.

Jiang M, Wang C, Zhang Y, Feng Y, Wang Y, Zhu Y (2014). Sparse partial least squares discriminant analysis for differential geographical origins of Salvia miltiorrhiza by 1H-NMR-based metabolomics. Phytochemical Analysis, 25: 50–58.

Kim M, Lee KH, Yoon SW, Kim BS, Chun J, Yi H (2013). Analytical tools and databases for metagenomics in the next-generation sequencing era. Genomics & Informatics, 11: 102–113.

Knights D, Costello EK, Knight R (2011). Supervised classification of human microbiota. FEMS Microbiology Reviews, 35: 342–359.

Labus JS, Hollister EB, Jacobs J, Kirbach K, Oezguen N, Gupta A, Acosta J, Luna RA, Aagaard K, Versalovic J, Savidge T, Hsiao E, Tillisch K, Mayer EA (2017). Differences in gut microbial composition correlate with regional brain volumes in irritable bowel syndrome. Microbiome, 5: 49.

Lee SC, Tang MS, Lim YAL, Choy SH, Kurtz ZD, Cox LM, Gundra UM, Cho I, Bonneau R, Blaser MJ, Chua KH, Loke PNG (2014). Helminth colonization is associated with increased diversity of the gut microbiota. PLoS Neglected Tropical Diseases, 8: 5.

Lin W, Shi P, Feng R, Li H (2014). Variable selection in regression with compositional covariates. Biometrika, 101: 785–797.

Lu D, Weljie A, de Leon AR, McConnell Y, Bathe OF, Kopciuk K (2017). Performance of variable selection methods using stability-based selection. BMC Research Notes, 10: 143.

Mach N, Berri M, Estelle J, Levene F, Lemonnier G, Denis C, Leplat JJ, Chevaleyre C, Billon Y, Dore J, Rogel-Gaillard C, Lepage P (2015). Early-life establishment of the swine gut microbiome and impact on host phenotypes. Environmental Microbiology Reports, 7: 554–569.

Mahana D, Trent CM, Kurtz ZD, Bokulich NA, Battaglia T, Chung J, Muller CL, Li H, Bonneau RA, Blaser MJ (2016). Antibiotic perturbation of the murine gut microbiome enhances the adiposity, insulin resistance, and liver disease associated with high-fat diet. Genome Medicine, 8: 48.

Mandal S, Treuren WV, White RA, Eggesbo M, Knight R, Peddada SD (2015). Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease, 26: 1.

McMurdie PJ, Holmes S (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10: 4.

Mevik BH, Wehrens R (2016). Introduction to the pls package. https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf.

Michaelson JJ, Alberts R, Schughart K, Beyer A (2010). Data-driven assessment of eQTL mapping methods. BMC Bioinformatics, 11: 502.

Naseribafrouei A, Hestad K, Avershina E, Sekelja M, Linlokken A, Wilson R, Rudi K (2014). Correlation between the human fecal microbiota and depression. Neurogastroenterology and Motility, 26: 1155–1162.

Oulas A, Pavloudi C, Polymenakou P, Pavlopoulas GA, Papanikolaou N, Kotoulas G et al. (2015). Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies. Bioinformatics and Biology Insights, 9: 75–88.

Paulson JN, Stine OC, Bravo HC, Pop M (2013). Differential abundance analysis for microbial marker-gene surveys. Nature Methods, 10.

Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D (2013). Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biology, 14.

R Core Team (2017). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Muller M, Siegert S (2017). Package 'pROC': display and analyze ROC curves. https://cran.r-project.org/web/packages/pROC/pROC.pdf.

Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11.

Rohart F, Gautier B, Singh A, Cao KL (2017). mixOmics: an R package for 'omics feature selection and multiple data integration. bioRxiv preprint, https://biorxiv.org/content/biorxiv/early/2017/08/108597.full.pdf.

Rush ST, Lee CH, Mio W, Kim PT (2016). The phylogenetic lasso and the microbiome. https://arxiv.org/pdf/1607.08877.pdf.


Schubert AM, Rogers MAM, Ring C, Mogle J, Petrosino JP, Young VB, Aronoff DM, Schloss PD (2014). Microbiome data distinguish patients with Clostridium difficile infection and non-C. difficile-associated diarrhea from healthy controls. mBio, 5.

Shankar J, Szpalowski S, Solis NV, Mounaud S, Liu H, Losada L, Nierman WC, Filler SG (2015). A systematic evaluation of high-dimensional, ensemble-based regression for exploring large model spaces in microbiome analyses. BMC Bioinformatics, 16: 31.

Shimokawa T, Sugimoto T, Goto M (2013). Tree Structure Approach: Learning Data Science by R 9. Kyoritsu Shuppan.

Statnikov A, Henaff M (2013). A comprehensive evaluation of multicategory classification methods for microbiome data. Microbiome, 1: 11.

Tap J, Derrien M, Tornblom H, Brazeilles R, Cools-Portier S, Dore J, Storsrud S, Neve B-L, Ohman L, Simren M (2017). Identification of an intestinal microbiota signature associated with severity of irritable bowel syndrome. Gastroenterology, 152: 111–123.

Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58: 267–288.

Tsilimigras MCB, Fodor AA (2016). Compositional data analysis of the microbiome: fundamentals, tools, and challenges. Annals of Epidemiology, 26: 330–335.

Warnes GR, Bolker B, Bonebakker L, Gentleman R, Liaw WHA, Lumley T, Maechler M, Magnusson A, Moeller S, Schwartz M, Venables B (2016). Package 'gplots': various R programming tools for plotting data. https://cran.r-project.org/web/packages/gplots/gplots.pdf.

Weiss S, Xu ZZ, Peddada S, Amir A, Bittinger K, Gonzalez A, Lozupone C et al. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome, 5: 27.

Wu N, Yang X, Zhang R, Li J, Xiao X, Hu Y, Chen Y, Yang F et al. (2013). Dysbiosis signature of fecal microbiota in colorectal cancer patients. Microbial Ecology, 66: 462–470.

Yang P, Yang YH, Zhou BB, Zomaya AY (2010). A review of ensemble methods in bioinformatics. Current Bioinformatics, 5: 296–308.

Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M et al. (2012). Human gut microbiome viewed across age and geography. Nature, 486: 222–227.
