Simultaneous conﬁdence intervals for comparing...

Simultaneous confidence intervals forcomparing biodiversity indices estimated from

metagenomic trials

Ralph Scherer

26. September 2013

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 1 / 18

Introduction Next generation sequencing Experiments

Next generation sequencing Experiments

Background

I Human body: More bacterial cells inside (1014) than our own cells(1013)

I A fact is: The key to understand the human condition lies inunderstanding the human genome

I But this may be insufficient→Sequencing the genomes of our own microbes is necessary too

I Both together can give more information than each alone

I Metagenomics: Obtain genomic information directly frommicrobial communities in their natural habitats

I See ”A primer on metagenomics” [Wooley et al., 2010]

Introduction Example: Human gut microbiome trial

Example: Human gut microbiome trial

I Yatsunenko et al. [2012] studied gut microbiomes of 531 individuals

I The cohort were healthy children and adults from the Amazonas ofVenezuela, rural Malawi and US metropolitan areas

I The main interest was to find out if there are differences betweenage categories or between geographical areas

I The data were pre-processed with qiime software

I After the quality steps 1,093,740,274 Illumina reads remained

I These resulted after the otu-picking script and taxonomicassignment in an OTU table with 11905 different taxa andcorresponding counts for the 531 individuals

I Mean Count per replicate is 1,935,000. But: There is one replicatewith a row sum of 1→ deleted in the following analysis

Introduction Example: Human gut microbiome trial

Comparison of diversity

I There are several ways to identify possible differences betweenage groups or geographical areas

I One solution may be the comparison of the diversity (here: Degreeof variation of bacterial species within human gut) betweendefined groups

I This can be done using α-diversity measures like Shannon orSimpson index

I Due to the multiple sample design (three geographical areas),simultaneous confidence intervals or multiplicity adjusted p-valuesfor the differences between the diversity measures are needed

Introduction Human gut microbiome trial I

Human gut microbiome trial

S.obs S.chao1 S.ACE shannon simpson

= 3 yr

B 3 −

= 3 yr

B 3 −

= 3 yr

B 3 −

= 3 yr

B 3 −

= 3 yr

B 3 −

AGE_CUTPOINTS

COUNTRY

GAZ:Malawi

GAZ:United States of America

GAZ:Venezuela

Figure : Different α-diversity measures separated by age and geography

Introduction α-diversity measures and related issues

α-diversity measures and related issues

Unequal variances

I The Simpson index ϕ(D)i = ∑

Ss=1 π2

as well as the Shannon index ϕ(H)i =−∑

Ss=1 πis log(πis) depend on the

probability vectors π̂i = π̂i1, ..., π̂iS ,

I π̂i represents the estimated probability of occurring for everyspecies s, s = 1, ...,S in sample i, i = 1, ...,k

I The corresponding variance estimators V̂ar(ϕ̂(D)) and V̂ar(ϕ̂(H))mainly depend on the probabilities π̂i and number of species ni

I According to Rogers and Hsu [2001], one can not assume equalvariances across the samples

Introduction α-diversity measures and related issues

α-diversity measures and related issues

Over-dispersion

I Species counts usually show over-dispersion

I Over-dispersion occurs, if the observed variance exceeds thenominal variance of the postulated distribution

I Typically, species counts exhibit a high variation across replicatesand a high number of zero counts

I This indicates an over-dispersed distributionI Idea: Nonparametric bootstrap methods

I Only based on observed dataI Take the over-dispersion into account

Simultaneous confidence intervals Asymptotic SCIs with variance estimators considering heterogeneity

Asymptotic SCIs (AM)

I Rogers and Hsu [2001] and Fritsch and Hsu [1999] constructed SCIsfor the Shannon and Simpson index considering heterogeneousvariances

I Tukey-type SCIs for the Simpson index are constructed in thefollowing way

ϕ̂(D)i − ϕ̂

(D)i ′ ±q2,1−α;M,R

√V̂ar(ϕ̂

(D)i ) + V̂ar(ϕ̂

(D)i ′ ) (1)

with q2,1−α;M,R being a two-sided quantile from an M-variatenormal distribution with correlation matrix R.

I When estimating the simultaneous confidence intervals for theShannon index, ϕ̂(D) is replaced with ϕ̂(H) and V̂ar(ϕ̂(D)) withV̂ar(ϕ̂(H))

Disadvantages of the asymptotic SCIs

I Rogers and Hsu [2001] and Fritsch and Hsu [1999] constructedintervals under the assumption of multinomial distributed countswithout replicates

I The probability vector πi is the same for every replicate j, j = 1, ..., r

I If the data has replicates, the counts may be summed up for everyspecies inside every sample and the indices can then becalculated on the resulting vectors

I This may lead to an underestimation of the variance

I Over-dispersion is not considered adequately

Two ways to calculate the diversity index

(a) Diversity estimation with an ANOVA model, treatment i

Replicate j Speciess = 1

... Speciess = S

Index Param. of interest

1 yi11 ... yi1S θ̂i12 yi21 ... yi2S θ̂i23 yi31 ... yi3S θ̂i3r yir1 ... yirS θ̂ir

ANOVA model estimator θ̄i

(b) Diversity estimation on summend up counts, treatment i

Replicate j Speciess = 1

... Speciess = S

Param. of interest

1 yi11 ... yi1S2 yi21 ... yi2S3 yi31 ... yi3Sr yir1 ... yirS

∑rj=1 yi�1 ... yi�S θ̂i�

Asymptotic gaussian SCIs based on an ANOVAmodel (AG)

I In case of replicated counts, θ̄i may estimated from an ANOVAmodel according to method method (a)

I With θ̄i and the residuals ε̂ij = θ̂ij − θ̄i , the well-known Tukey-typeintervals [Tukey, 1953; Hothorn et al., 2008] can be constructed

θ̄i − θ̄i ′ ± t2,1−α;M,R,df =∑ ri−k σ̂

√1ri

+1ri ′

with variance

σ̂2 = (

∑i=1

∑j=1

)(ε̂ij − ε̄2i )/(

∑i=1

ri −k)) (3)

and t2,1−α;M,R,df =∑ ri−k being a two-sided quantile from an M-variatet−distribution with correlation matrix R.

tmax SCIs based on an ANOVA model (WY)

I Following method (a) compute the parameter of interest θ̂ij , i.e.Simpson’s ϕ measure, for every replication j, j = 1, ..., r , separately.

I Bootstrap the estimated indices directly according to Westfall andYoung [1993]

1 Fit a linear model to the estimated indices θ̂ij resulting in θ̂i2 Bootstrap the residuals ε̂ij unstratified3 For every bootstrap step b, b = 1, ...,B build the test statistic

t∗ii ′ =ε̄∗i − ε̄∗i ′√

((σ̂2i ε̂ )∗/ni + (σ̂2

i′ ε̂ )∗/ni′). (4)

4 q1−α is the 1−α empirical quantile of the B values max(t∗ii ′).5 The resulting simultaneous confidence intervals are constructed in the

following way

θ̄i − θ̄i ′ ±q1−α

√(σ̂2

i /ni + σ̂2i′/ni′), (5)

where σ̂2i is the residual mean square for the ith treatment in the

ANOVA model

tmax SCIs based on summed up counts (TS)

1 Bootstrap the original data set in a row, stratified by the k levels oftreatments.

2 Estimate the group wise index of interest θ̂ ∗i� according to method(b) for every bootstrap sample.

3 In every bootstrap sample, calculate the test statistic

t∗ii ′ =(θ̂ ∗i� − θ̂ ∗i ′�)− (θ̂i�− θ̂i ′�)√

((σ̂2θ̂i�

)∗+ (σ̂2θ̂i′�

)∗)(6)

with the variance estimators based on multinomial assumptions4 q1−α is the 1−α empirical quantile of the B values max(t∗ii ′).5 The resulting simultaneous confidence intervals are then

θ̂i�− θ̂i ′�±q1−α

√(σ̂2

θ̂i�+ σ̂2

θ̂i′�), (7)

rank-perc SCIs based on summed up counts (PE)

I Bootstrap the original data set in a row, stratified by the k levels oftreatments.

I Estimate the group wise index of interest θ̂ ∗i� according to method(b) for each bootstrap sample.

I Build differences of interest δm for all bootstrap samplesI Construct SCIs according to Besag et al. [1995]

1 Rank the differences seperately2 Compute and store maximum of ranks for each bootstrap sample3 Compute the 1−α quantile t∗ of the maximum ranks4 Finally, the confidence limits are constructed for each elementary

parameter δm by taking[δ

[B+1−t∗]m ;δ

[t∗]m

], i.e. the B + 1− t∗th and t∗th

value from the ordered sample of the joint empirical distributionobtained for δm.

Results Simulation results

Simulation results

Figure : Simulation results for theShannon index

Figure : Simulation results for theSimpson index

Results Analysed example data set

Analysed example data set

SCIs for Tukey type

Difference of Shannon's index

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

USA − Malawi

Venezuela − USA

USA − Malawi

Venezuela − USA

USA − Malawi

Venezuela − USA

USA − Malawi

−0.5 0.0 0.5 1.0

WYTSPEAGAM

Figure : Example data results for theShannon index

SCIs for Tukey type

Difference of Simpson's index

Venezuela − USA

USA − Malawi

Venezuela − USA

USA − Malawi

Venezuela − USA

USA − Malawi

Venezuela − USA

USA − Malawi

Venezuela − USA

USA − Malawi

−0.04 −0.02 0.00 0.02 0.04 0.06 0.08

WYTSPEAGAM

Figure : Example data results for theSimpson index

Software implementation

I The publication corresponding to today’s talk is Scherer andSchaarschmidt [2013]

I All methods except for the asymptotic methods based on thelinear model are implemented in the R-package simboot

I The asymptotic method is implemented in the R-packagemultcomp

I The bioconductor package phyloseq was used to import theotu-table from qiime

I simboot is on github for bug reporting:https://github.com/shearer/simboot

I A github homepage http://shearer.github.io/simboot/ with a tutorialfor sequence data is under development

Software implementation

Literature I

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995). Bayesian Computation andStochastic-Systems. Statistical Science, 10(1):3–41.

Fritsch, K. S. and Hsu, J. C. (1999). Multiple comparison of entropies with application to dinosaurbiodiversity. Biometrics, 55(4):1300–1305.

Hothorn, T., Bretz, F., and Westfall, P. (2008). Simultaneous inference in general parametric models.Biometrical Journal, 50(3):346–63.

Rogers, J. A. and Hsu, J. C. (2001). Multiple comparisons of biodiversity. BIOMETRICAL JOURNAL,43(5):617–625.

Scherer, R. and Schaarschmidt, F. (2013). Simultaneous confidence intervals for comparing biodiversityindices estimated from overdispersed count data. Biometrical . . . , 55:246–263.

Tukey, J. W. (1953). The problem of multiple comparisons.

Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing. John Wiley & Sons.

Wooley, J. C., Godzik, A., and Friedberg, I. (2010). A primer on metagenomics. PLoS computationalbiology, 6(2):e1000667.

Yatsunenko, T., Rey, F. E., Manary, M. M. J., Trehan, I., Dominguez-Bello, M. G., Contreras, M., Magris, M.,Hidalgo, G., Baldassano, R. N., Anokhin, A. P., Heath, A. C., Warner, B., Reeder, J., Kuczynski, J.,Caporaso, J. G., Lozupone, C. a., Lauber, C., Clemente, J. C., Knights, D., Knight, R., and Gordon,J. I. (2012). Human gut microbiome viewed across age and geography. Nature, 486(Ivic):222–7.

Simultaneous conﬁdence intervals for comparing...

Documents

Transcript of Simultaneous conﬁdence intervals for comparing...

Lecture 3: Confidence intervalskeshet/MCB2012/SlidesDodo/DataFitLect3.pdf · Bootstrap confidence intervals: Computational technique based on resampling the errors. Asymptotic confidence

LOCAL ASYMPTOTICS FOR POLYNOMIAL SPLINE ...jianhua/paper/Huang03reg.pdfPOLYNOMIAL SPLINE REGRESSION 1601 conﬁdence intervals. They also provide theoretical insights about the properties

Package ‘PHEindicatormethods’...Package ‘PHEindicatormethods’ June 25, 2020 Type Package Version 1.3.2 Title Common Public Health Statistics and their Conﬁdence Intervals

[ST] Survival Analysis - Survey Design · 2016. 2. 16. · survival analysis— Introduction to survival analysis 3 Obtaining summary statistics, conﬁdence intervals, tables, etc.

Estimating Instrument Performance: with Confidence …...1 Introduction 1 1.1 Estimating Probability of Detection or Identiﬁcation 1 1.2 Conﬁdence Intervals and Choosing Among

Replication and p · PDF fileReplication and p Intervals p Values Predict the Future Only Vaguely, but Conﬁdence Intervals Do Much Better Geoff Cumming School of Psychological Science,

Conﬁdence Bounds & Intervals for Parameters Relating to the ...faculty.washington.edu/fscholz/DATAFILES/Confidence...2 Binomial Distribution: Upper and Lower Bounds for p Suppose

Conﬁdence Intervals for ARMA-GARCH Value-at-Risk · PDF fileConﬁdence Intervals for ARMA-GARCH Value-at-Risk Laura Spierdijk May 31, 2013 ... This study proposes a residual subsample

WindandWaveExtremesovertheWorldOceansfrom VeryLargeEnsembles · for both wind speed and signiﬁcant wave height. Conﬁdence intervals are much tighter due to the large size of the

...Severity: 100 Confidence: 100 Severity: 100 Confidence: 100 Severity: 75 Confidence: 100 Severity: 75 Confidence: 100 Severity: 60 Confidence: 100 Severity: 75 ...

Cost curves: An improved method for visualizing …...curves, such as showing confidence intervals on a classifier’s performance, and visualizing the statistical significance

Chapter 5 Conﬁdence Intervals and Hypothesis Testingidiom.ucsd.edu/~rlevy/pmsl_textbook/chapters/pmsl_5.pdf · Conﬁdence Intervals and Hypothesis Testing ... • Choose interval

Conﬁdence Intervals for Forecast Veriﬁcation

11. Conﬁdence Intervals for Flood Return Level Estimates ... · Conﬁdence Intervals for Flood Return Level Estimates assuming Long-Range Dependence ... We exemplify this approach

Rate Optimal Estimation and Conﬁdence Intervals for High ...

Conﬁdence intervals in stationary autocorrelated time series · Conﬁdence intervals in stationary autocorrelated time series Halkos, George and Kevork, Ilias University of Thessaly,

Simultaneous confidence bands for the distribution ...lijianyang.com/mypapers/strSCBAISM.pdfweighted sum of transformed Brownian bridges. Moreover, simultaneous conﬁdence bands (SCBs)

Asymptotic confidence intervals for Poisson regression€¦ · Asymptotic confidence intervals for Poisson regression ∗ Michael Kohler Fachrichtung 6.1-Mathematik, Universität

Bootstrap Conﬁdence Intervals · Bootstrap Conﬁdence Intervals ST552 Lecture 13 Charlotte Wickham 2019-02-11 1. Motivation The inferences we’ve covered so far relied on our

Entropy evaluation based on confidence intervals of frequency ......Entropy evaluation based on conﬁdence intervals of frequency estimates been proposed (Dubois et al.,1993). This