Simultaneous confidence intervals for comparing...

18
Simultaneous confidence intervals for comparing biodiversity indices estimated from metagenomic trials Ralph Scherer 26. September 2013 R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 1 / 18

Transcript of Simultaneous confidence intervals for comparing...

Page 1: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Simultaneous confidence intervals forcomparing biodiversity indices estimated from

metagenomic trials

Ralph Scherer

26. September 2013

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 1 / 18

Page 2: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Introduction Next generation sequencing Experiments

Next generation sequencing Experiments

Background

I Human body: More bacterial cells inside (1014) than our own cells(1013)

I A fact is: The key to understand the human condition lies inunderstanding the human genome

I But this may be insufficient→Sequencing the genomes of our own microbes is necessary too

I Both together can give more information than each alone

I Metagenomics: Obtain genomic information directly frommicrobial communities in their natural habitats

I See ”A primer on metagenomics” [Wooley et al., 2010]

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 2 / 18

Page 3: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Introduction Example: Human gut microbiome trial

Example: Human gut microbiome trial

I Yatsunenko et al. [2012] studied gut microbiomes of 531 individuals

I The cohort were healthy children and adults from the Amazonas ofVenezuela, rural Malawi and US metropolitan areas

I The main interest was to find out if there are differences betweenage categories or between geographical areas

I The data were pre-processed with qiime software

I After the quality steps 1,093,740,274 Illumina reads remained

I These resulted after the otu-picking script and taxonomicassignment in an OTU table with 11905 different taxa andcorresponding counts for the 531 individuals

I Mean Count per replicate is 1,935,000. But: There is one replicatewith a row sum of 1→ deleted in the following analysis

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 3 / 18

Page 4: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Introduction Example: Human gut microbiome trial

Comparison of diversity

I There are several ways to identify possible differences betweenage groups or geographical areas

I One solution may be the comparison of the diversity (here: Degreeof variation of bacterial species within human gut) betweendefined groups

I This can be done using α-diversity measures like Shannon orSimpson index

I Due to the multiple sample design (three geographical areas),simultaneous confidence intervals or multiplicity adjusted p-valuesfor the differences between the diversity measures are needed

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 4 / 18

Page 5: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Introduction Human gut microbiome trial I

Human gut microbiome trial

S.obs S.chao1 S.ACE shannon simpson

0

1000

2000

3000

0

1000

2000

3000

4000

5000

1000

2000

3000

4000

5000

0

1

2

3

4

5

0.00

0.25

0.50

0.75

1.00

A <

= 3 yr

B 3 −

17 yr

C A

dults

NA

A <

= 3 yr

B 3 −

17 yr

C A

dults

NA

A <

= 3 yr

B 3 −

17 yr

C A

dults

NA

A <

= 3 yr

B 3 −

17 yr

C A

dults

NA

A <

= 3 yr

B 3 −

17 yr

C A

dults

NA

AGE_CUTPOINTS

Alp

ha D

iver

sity

Mea

sure

COUNTRY

GAZ:Malawi

GAZ:United States of America

GAZ:Venezuela

Figure : Different α-diversity measures separated by age and geography

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 5 / 18

Page 6: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Introduction α-diversity measures and related issues

α-diversity measures and related issues

Unequal variances

I The Simpson index ϕ(D)i = ∑

Ss=1 π2

is ,

as well as the Shannon index ϕ(H)i =−∑

Ss=1 πis log(πis) depend on the

probability vectors π̂i = π̂i1, ..., π̂iS ,

I π̂i represents the estimated probability of occurring for everyspecies s, s = 1, ...,S in sample i, i = 1, ...,k

I The corresponding variance estimators V̂ar(ϕ̂(D)) and V̂ar(ϕ̂(H))mainly depend on the probabilities π̂i and number of species ni

I According to Rogers and Hsu [2001], one can not assume equalvariances across the samples

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 6 / 18

Page 7: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Introduction α-diversity measures and related issues

α-diversity measures and related issues

Over-dispersion

I Species counts usually show over-dispersion

I Over-dispersion occurs, if the observed variance exceeds thenominal variance of the postulated distribution

I Typically, species counts exhibit a high variation across replicatesand a high number of zero counts

I This indicates an over-dispersed distributionI Idea: Nonparametric bootstrap methods

I Only based on observed dataI Take the over-dispersion into account

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 7 / 18

Page 8: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Simultaneous confidence intervals Asymptotic SCIs with variance estimators considering heterogeneity

Asymptotic SCIs (AM)

I Rogers and Hsu [2001] and Fritsch and Hsu [1999] constructed SCIsfor the Shannon and Simpson index considering heterogeneousvariances

I Tukey-type SCIs for the Simpson index are constructed in thefollowing way

ϕ̂(D)i − ϕ̂

(D)i ′ ±q2,1−α;M,R

√V̂ar(ϕ̂

(D)i ) + V̂ar(ϕ̂

(D)i ′ ) (1)

with q2,1−α;M,R being a two-sided quantile from an M-variatenormal distribution with correlation matrix R.

I When estimating the simultaneous confidence intervals for theShannon index, ϕ̂(D) is replaced with ϕ̂(H) and V̂ar(ϕ̂(D)) withV̂ar(ϕ̂(H))

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 8 / 18

Page 9: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Simultaneous confidence intervals Asymptotic SCIs with variance estimators considering heterogeneity

Disadvantages of the asymptotic SCIs

I Rogers and Hsu [2001] and Fritsch and Hsu [1999] constructedintervals under the assumption of multinomial distributed countswithout replicates

I The probability vector πi is the same for every replicate j, j = 1, ..., r

I If the data has replicates, the counts may be summed up for everyspecies inside every sample and the indices can then becalculated on the resulting vectors

I This may lead to an underestimation of the variance

I Over-dispersion is not considered adequately

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 9 / 18

Page 10: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Simultaneous confidence intervals Asymptotic SCIs with variance estimators considering heterogeneity

Two ways to calculate the diversity index

(a) Diversity estimation with an ANOVA model, treatment i

Replicate j Speciess = 1

... Speciess = S

Index Param. of interest

1 yi11 ... yi1S θ̂i12 yi21 ... yi2S θ̂i23 yi31 ... yi3S θ̂i3r yir1 ... yirS θ̂ir

ANOVA model estimator θ̄i

(b) Diversity estimation on summend up counts, treatment i

Replicate j Speciess = 1

... Speciess = S

Param. of interest

1 yi11 ... yi1S2 yi21 ... yi2S3 yi31 ... yi3Sr yir1 ... yirS

∑rj=1 yi�1 ... yi�S θ̂i�

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 10 / 18

Page 11: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Simultaneous confidence intervals Asymptotic SCIs with variance estimators considering heterogeneity

Asymptotic gaussian SCIs based on an ANOVAmodel (AG)

I In case of replicated counts, θ̄i may estimated from an ANOVAmodel according to method method (a)

I With θ̄i and the residuals ε̂ij = θ̂ij − θ̄i , the well-known Tukey-typeintervals [Tukey, 1953; Hothorn et al., 2008] can be constructed

θ̄i − θ̄i ′ ± t2,1−α;M,R,df =∑ ri−k σ̂

√1ri

+1ri ′

(2)

with variance

σ̂2 = (

k

∑i=1

ri

∑j=1

)(ε̂ij − ε̄2i )/(

k

∑i=1

ri −k)) (3)

and t2,1−α;M,R,df =∑ ri−k being a two-sided quantile from an M-variatet−distribution with correlation matrix R.

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 11 / 18

Page 12: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Simultaneous confidence intervals Asymptotic SCIs with variance estimators considering heterogeneity

tmax SCIs based on an ANOVA model (WY)

I Following method (a) compute the parameter of interest θ̂ij , i.e.Simpson’s ϕ measure, for every replication j, j = 1, ..., r , separately.

I Bootstrap the estimated indices directly according to Westfall andYoung [1993]

1 Fit a linear model to the estimated indices θ̂ij resulting in θ̂i2 Bootstrap the residuals ε̂ij unstratified3 For every bootstrap step b, b = 1, ...,B build the test statistic

t∗ii ′ =ε̄∗i − ε̄∗i ′√

((σ̂2i ε̂ )∗/ni + (σ̂2

i′ ε̂ )∗/ni′). (4)

4 q1−α is the 1−α empirical quantile of the B values max(t∗ii ′).5 The resulting simultaneous confidence intervals are constructed in the

following way

θ̄i − θ̄i ′ ±q1−α

√(σ̂2

i /ni + σ̂2i′/ni′), (5)

where σ̂2i is the residual mean square for the ith treatment in the

ANOVA model

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 12 / 18

Page 13: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Simultaneous confidence intervals Asymptotic SCIs with variance estimators considering heterogeneity

tmax SCIs based on summed up counts (TS)

1 Bootstrap the original data set in a row, stratified by the k levels oftreatments.

2 Estimate the group wise index of interest θ̂ ∗i� according to method(b) for every bootstrap sample.

3 In every bootstrap sample, calculate the test statistic

t∗ii ′ =(θ̂ ∗i� − θ̂ ∗i ′�)− (θ̂i�− θ̂i ′�)√

((σ̂2θ̂i�

)∗+ (σ̂2θ̂i′�

)∗)(6)

with the variance estimators based on multinomial assumptions4 q1−α is the 1−α empirical quantile of the B values max(t∗ii ′).5 The resulting simultaneous confidence intervals are then

θ̂i�− θ̂i ′�±q1−α

√(σ̂2

θ̂i�+ σ̂2

θ̂i′�), (7)

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 13 / 18

Page 14: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Simultaneous confidence intervals Asymptotic SCIs with variance estimators considering heterogeneity

rank-perc SCIs based on summed up counts (PE)

I Bootstrap the original data set in a row, stratified by the k levels oftreatments.

I Estimate the group wise index of interest θ̂ ∗i� according to method(b) for each bootstrap sample.

I Build differences of interest δm for all bootstrap samplesI Construct SCIs according to Besag et al. [1995]

1 Rank the differences seperately2 Compute and store maximum of ranks for each bootstrap sample3 Compute the 1−α quantile t∗ of the maximum ranks4 Finally, the confidence limits are constructed for each elementary

parameter δm by taking[δ

[B+1−t∗]m ;δ

[t∗]m

], i.e. the B + 1− t∗th and t∗th

value from the ordered sample of the joint empirical distributionobtained for δm.

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 14 / 18

Page 15: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Results Simulation results

Simulation results

Figure : Simulation results for theShannon index

Figure : Simulation results for theSimpson index

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 15 / 18

Page 16: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Results Analysed example data set

Analysed example data set

SCIs for Tukey type

Difference of Shannon's index

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

Venezuela − Malawi

USA − Malawi

−0.5 0.0 0.5 1.0

WYTSPEAGAM

Figure : Example data results for theShannon index

SCIs for Tukey type

Difference of Simpson's index

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

Venezuela − Malawi

USA − Malawi

Venezuela − USA

Venezuela − Malawi

USA − Malawi

−0.04 −0.02 0.00 0.02 0.04 0.06 0.08

WYTSPEAGAM

Figure : Example data results for theSimpson index

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 16 / 18

Page 17: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Software implementation

Software implementation

I The publication corresponding to today’s talk is Scherer andSchaarschmidt [2013]

I All methods except for the asymptotic methods based on thelinear model are implemented in the R-package simboot

I The asymptotic method is implemented in the R-packagemultcomp

I The bioconductor package phyloseq was used to import theotu-table from qiime

I simboot is on github for bug reporting:https://github.com/shearer/simboot

I A github homepage http://shearer.github.io/simboot/ with a tutorialfor sequence data is under development

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 17 / 18

Page 18: Simultaneous confidence intervals for comparing ...shearer.github.io/simboot/Presentation20130919.pdf · IEstimate the group wise index of interest q^ i according tomethod (b)for

Software implementation

Literature I

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995). Bayesian Computation andStochastic-Systems. Statistical Science, 10(1):3–41.

Fritsch, K. S. and Hsu, J. C. (1999). Multiple comparison of entropies with application to dinosaurbiodiversity. Biometrics, 55(4):1300–1305.

Hothorn, T., Bretz, F., and Westfall, P. (2008). Simultaneous inference in general parametric models.Biometrical Journal, 50(3):346–63.

Rogers, J. A. and Hsu, J. C. (2001). Multiple comparisons of biodiversity. BIOMETRICAL JOURNAL,43(5):617–625.

Scherer, R. and Schaarschmidt, F. (2013). Simultaneous confidence intervals for comparing biodiversityindices estimated from overdispersed count data. Biometrical . . . , 55:246–263.

Tukey, J. W. (1953). The problem of multiple comparisons.

Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing. John Wiley & Sons.

Wooley, J. C., Godzik, A., and Friedberg, I. (2010). A primer on metagenomics. PLoS computationalbiology, 6(2):e1000667.

Yatsunenko, T., Rey, F. E., Manary, M. M. J., Trehan, I., Dominguez-Bello, M. G., Contreras, M., Magris, M.,Hidalgo, G., Baldassano, R. N., Anokhin, A. P., Heath, A. C., Warner, B., Reeder, J., Kuczynski, J.,Caporaso, J. G., Lozupone, C. a., Lauber, C., Clemente, J. C., Knights, D., Knight, R., and Gordon,J. I. (2012). Human gut microbiome viewed across age and geography. Nature, 486(Ivic):222–7.

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 18 / 18