
Simultaneous confidence intervals for comparing biodiversity indices estimated from metagenomic trials

Ralph Scherer

26. September 2013

R. Scherer (Inst. of Biometry, MHH) ICSI 2013, Hannover 26. September 2013 1 / 18


Next generation sequencing Experiments

Background

- Human body: more bacterial cells inside ($10^{14}$) than our own cells ($10^{13}$)

- The key to understanding the human condition lies in understanding the human genome

- But this may be insufficient → sequencing the genomes of our own microbes is necessary too

- Both together can give more information than each alone

- Metagenomics: obtain genomic information directly from microbial communities in their natural habitats

- See "A primer on metagenomics" [Wooley et al., 2010]


Example: Human gut microbiome trial

- Yatsunenko et al. [2012] studied the gut microbiomes of 531 individuals

- The cohort consisted of healthy children and adults from the Amazonas of Venezuela, rural Malawi and US metropolitan areas

- The main interest was to find out whether there are differences between age categories or between geographical areas

- The data were pre-processed with the qiime software

- After the quality steps, 1,093,740,274 Illumina reads remained

- After the otu-picking script and taxonomic assignment, these resulted in an OTU table with 11905 different taxa and corresponding counts for the 531 individuals

- The mean count per replicate is 1,935,000. But: there is one replicate with a row sum of 1 → deleted in the following analysis


Comparison of diversity

- There are several ways to identify possible differences between age groups or geographical areas

- One solution may be the comparison of the diversity (here: degree of variation of bacterial species within the human gut) between defined groups

- This can be done using α-diversity measures like the Shannon or Simpson index

- Due to the multiple sample design (three geographical areas), simultaneous confidence intervals or multiplicity-adjusted p-values for the differences between the diversity measures are needed


Human gut microbiome trial

Figure: Different α-diversity measures separated by age and geography. Panels: S.obs, S.chao1, S.ACE, shannon, simpson; x-axis: AGE_CUTPOINTS (A <= 3 yr, B 3 − 17 yr, C Adults, NA); y-axis: Alpha Diversity Measure; legend: COUNTRY (GAZ:Malawi, GAZ:United States of America, GAZ:Venezuela)


α-diversity measures and related issues

Unequal variances

- The Simpson index $\varphi_i^{(D)} = \sum_{s=1}^{S} \pi_{is}^2$ as well as the Shannon index $\varphi_i^{(H)} = -\sum_{s=1}^{S} \pi_{is} \log(\pi_{is})$ depend on the probability vectors $\hat{\pi}_i = (\hat{\pi}_{i1}, \ldots, \hat{\pi}_{iS})$

- $\hat{\pi}_{is}$ represents the estimated probability of occurrence for every species $s$, $s = 1, \ldots, S$, in sample $i$, $i = 1, \ldots, k$

- The corresponding variance estimators $\widehat{\mathrm{Var}}(\hat{\varphi}^{(D)})$ and $\widehat{\mathrm{Var}}(\hat{\varphi}^{(H)})$ mainly depend on the probabilities $\hat{\pi}_i$ and number of species $n_i$

- According to Rogers and Hsu [2001], one cannot assume equal variances across the samples
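As an illustration of the two index definitions above, the following minimal Python sketch computes plug-in estimates from a vector of species counts. The function names and the toy count vectors are hypothetical and not taken from the talk or from simboot; the convention 0 · log(0) = 0 is an assumption.

```python
import numpy as np

def proportions(counts):
    """Estimated occurrence probabilities pi_hat from a vector of species counts."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def simpson_index(counts):
    """Simpson index: sum over species of pi_hat squared."""
    p = proportions(counts)
    return float(np.sum(p ** 2))

def shannon_index(counts):
    """Shannon index: -sum of pi_hat * log(pi_hat); zero counts are skipped."""
    p = proportions(counts)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

even = [25, 25, 25, 25]    # four equally abundant species (toy data)
skewed = [97, 1, 1, 1]     # one dominant species (toy data)
print(simpson_index(even), shannon_index(even))
print(simpson_index(skewed), shannon_index(skewed))
```

With these definitions a more even community has a smaller Simpson index and a larger Shannon index than a community dominated by one species.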


α-diversity measures and related issues

Over-dispersion

- Species counts usually show over-dispersion

- Over-dispersion occurs if the observed variance exceeds the nominal variance of the postulated distribution

- Typically, species counts exhibit a high variation across replicates and a high number of zero counts

- This indicates an over-dispersed distribution

- Idea: Nonparametric bootstrap methods
  - Only based on observed data
  - Take the over-dispersion into account
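The definition of over-dispersion above can be made concrete with a small check: compare the observed variance of one species' counts across replicates with the binomial variance implied by a multinomial model. This is only a sketch; the Dirichlet-multinomial simulation, the function name `dispersion_ratio` and all parameter values are assumptions chosen for illustration, not part of the talk.

```python
import numpy as np

def dispersion_ratio(counts, species=0):
    """Observed / nominal variance of one species' counts across replicates.

    counts: (replicates x species) array. The nominal variance is the
    binomial variance n * p * (1 - p) implied by a multinomial model with
    one common probability vector for all replicates.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)                          # total count per replicate
    p = counts[:, species].sum() / n.sum()          # pooled proportion
    observed = counts[:, species].var(ddof=1)
    nominal = np.mean(n * p * (1 - p))
    return observed / nominal

rng = np.random.default_rng(1)
pi = np.array([0.5, 0.3, 0.2])

# Pure multinomial replicates: ratio should be close to 1.
plain = rng.multinomial(1000, pi, size=200)

# Hypothetical over-dispersed replicates: the probability vector itself
# varies between replicates (Dirichlet-multinomial).
over = np.array([rng.multinomial(1000, rng.dirichlet(10 * pi)) for _ in range(200)])

ratio_plain = dispersion_ratio(plain)
ratio_over = dispersion_ratio(over)
print(ratio_plain, ratio_over)
```

A ratio well above 1 indicates more variation across replicates than the multinomial model allows for.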


Asymptotic SCIs (AM)

- Rogers and Hsu [2001] and Fritsch and Hsu [1999] constructed SCIs for the Shannon and Simpson index considering heterogeneous variances

- Tukey-type SCIs for the Simpson index are constructed in the following way

$$\hat{\varphi}_i^{(D)} - \hat{\varphi}_{i'}^{(D)} \pm q_{2,1-\alpha;M,R} \sqrt{\widehat{\mathrm{Var}}(\hat{\varphi}_i^{(D)}) + \widehat{\mathrm{Var}}(\hat{\varphi}_{i'}^{(D)})} \qquad (1)$$

with $q_{2,1-\alpha;M,R}$ being a two-sided quantile from an $M$-variate normal distribution with correlation matrix $R$.

- When estimating the simultaneous confidence intervals for the Shannon index, $\hat{\varphi}^{(D)}$ is replaced with $\hat{\varphi}^{(H)}$ and $\widehat{\mathrm{Var}}(\hat{\varphi}^{(D)})$ with $\widehat{\mathrm{Var}}(\hat{\varphi}^{(H)})$
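Equation (1) can be sketched in Python as follows. Two parts are assumptions and not taken from the talk: the variance is the standard delta-method approximation for the Simpson index under multinomial sampling, $4/n \,[\sum_s \pi_s^3 - (\sum_s \pi_s^2)^2]$, which may differ from the exact estimator used by Rogers and Hsu [2001], and the multivariate normal quantile is approximated by Monte Carlo simulation rather than computed numerically. All function names and the count vectors are hypothetical.

```python
import itertools
import numpy as np

def simpson_with_variance(counts):
    """Simpson index and its delta-method variance under a multinomial model."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n
    d = np.sum(p ** 2)
    var = 4.0 / n * (np.sum(p ** 3) - d ** 2)
    return d, var

def max_abs_z_quantile(variances, alpha=0.05, draws=20000, seed=0):
    """Monte Carlo two-sided 1-alpha quantile of max |Z_ii'| over all pairs.

    Z_ii' = (X_i - X_i') / sqrt(v_i + v_i') with independent X_i ~ N(0, v_i),
    which induces the correlation matrix R of the pairwise contrasts.
    """
    rng = np.random.default_rng(seed)
    v = np.asarray(variances, dtype=float)
    x = rng.normal(0.0, np.sqrt(v), size=(draws, v.size))
    pairs = list(itertools.combinations(range(v.size), 2))
    z = np.column_stack(
        [(x[:, i] - x[:, j]) / np.sqrt(v[i] + v[j]) for i, j in pairs]
    )
    return float(np.quantile(np.abs(z).max(axis=1), 1 - alpha)), pairs

def asymptotic_scis(count_vectors, alpha=0.05):
    """Tukey-type intervals of the form (1) for all pairwise differences."""
    est, var = zip(*(simpson_with_variance(c) for c in count_vectors))
    q, pairs = max_abs_z_quantile(var, alpha)
    intervals = {}
    for i, j in pairs:
        diff = est[i] - est[j]
        half = q * np.sqrt(var[i] + var[j])
        intervals[(i, j)] = (diff - half, diff + half)
    return intervals, q

# Toy count vectors for three samples (hypothetical data).
samples = [[120, 60, 15, 5], [80, 70, 30, 20], [60, 55, 45, 40]]
intervals, q = asymptotic_scis(samples)
for pair, (lower, upper) in intervals.items():
    print(pair, round(lower, 4), round(upper, 4))
```

The simulated quantile lies between the unadjusted normal quantile and the Bonferroni bound, which is what a Tukey-type adjustment for three comparisons should give.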


Disadvantages of the asymptotic SCIs

- Rogers and Hsu [2001] and Fritsch and Hsu [1999] constructed intervals under the assumption of multinomially distributed counts without replicates

- That is, the probability vector $\pi_i$ is assumed to be the same for every replicate $j$, $j = 1, \ldots, r$

- If the data have replicates, the counts may be summed up for every species inside every sample and the indices can then be calculated on the resulting vectors

- This may lead to an underestimation of the variance

- Over-dispersion is not considered adequately


Two ways to calculate the diversity index

(a) Diversity estimation with an ANOVA model, treatment i

Replicate j    Species s = 1    ...    Species s = S    Index (param. of interest)
1              y_i11            ...    y_i1S            θ̂_i1
2              y_i21            ...    y_i2S            θ̂_i2
3              y_i31            ...    y_i3S            θ̂_i3
r              y_ir1            ...    y_irS            θ̂_ir

ANOVA model estimator: θ̄_i

(b) Diversity estimation on summed up counts, treatment i

Replicate j            Species s = 1    ...    Species s = S    Param. of interest
1                      y_i11            ...    y_i1S
2                      y_i21            ...    y_i2S
3                      y_i31            ...    y_i3S
r                      y_ir1            ...    y_irS
Sum over j = 1..r      y_i·1            ...    y_i·S            θ̂_i·
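The two schemes (a) and (b) can be sketched in a few lines of Python. The Shannon index is used as the parameter of interest here; the function names and the toy replicate table are hypothetical.

```python
import numpy as np

def shannon(counts):
    """Plug-in Shannon index of one count vector (zero counts are skipped)."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def index_method_a(replicates, index=shannon):
    """Method (a): one index per replicate (row), then the group mean."""
    per_replicate = np.array([index(row) for row in np.asarray(replicates)])
    return per_replicate, float(per_replicate.mean())

def index_method_b(replicates, index=shannon):
    """Method (b): sum the counts over replicates per species, then one index."""
    return index(np.asarray(replicates).sum(axis=0))

# Toy treatment with r = 3 replicates and S = 3 species (hypothetical data).
# Every replicate is dominated by a different species.
treatment = [[8, 2, 0],
             [0, 8, 2],
             [2, 0, 8]]

theta_ij, theta_bar = index_method_a(treatment)
theta_dot = index_method_b(treatment)
print(theta_ij, theta_bar, theta_dot)
```

In this toy table every replicate has the same low diversity, while the summed counts are perfectly even, so method (b) reports a clearly higher Shannon index than method (a). The two methods estimate different quantities when replicates differ in composition.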


Asymptotic Gaussian SCIs based on an ANOVA model (AG)

- In case of replicated counts, $\bar{\theta}_i$ may be estimated from an ANOVA model according to method (a)

- With $\bar{\theta}_i$ and the residuals $\hat{\varepsilon}_{ij} = \hat{\theta}_{ij} - \bar{\theta}_i$, the well-known Tukey-type intervals [Tukey, 1953; Hothorn et al., 2008] can be constructed

$$\bar{\theta}_i - \bar{\theta}_{i'} \pm t_{2,1-\alpha;M,R,df=\sum r_i - k} \, \hat{\sigma} \sqrt{\frac{1}{r_i} + \frac{1}{r_{i'}}} \qquad (2)$$

with variance

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{k} \sum_{j=1}^{r_i} (\hat{\varepsilon}_{ij} - \bar{\varepsilon}_i)^2}{\sum_{i=1}^{k} r_i - k} \qquad (3)$$

and $t_{2,1-\alpha;M,R,df=\sum r_i - k}$ being a two-sided quantile from an $M$-variate $t$-distribution with correlation matrix $R$.
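A minimal Python sketch of equations (2) and (3) is given below. It takes the per-replicate indices $\hat{\theta}_{ij}$ as input. The multivariate $t$ quantile is approximated by Monte Carlo simulation, which is an assumption made to keep the example self-contained; the talk states that this method is implemented in the R package multcomp. The function name and the simulated input values are hypothetical.

```python
import itertools
import numpy as np

def ag_intervals(groups, alpha=0.05, draws=20000, seed=0):
    """Tukey-type SCIs from per-replicate indices, following (2) and (3).

    groups: list of 1-D arrays, one per treatment, holding theta_hat_ij.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    r = np.array([g.size for g in groups])
    means = np.array([g.mean() for g in groups])
    df = int(r.sum() - k)
    # Pooled residual variance, equation (3); the residual means are zero.
    sigma2 = sum(((g - g.mean()) ** 2).sum() for g in groups) / df
    pairs = list(itertools.combinations(range(k), 2))

    # Simulate the null distribution of max |t_ii'| over all pairs.
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, 1.0 / np.sqrt(r), size=(draws, k))   # group-mean noise
    s = np.sqrt(rng.chisquare(df, size=draws) / df)           # pooled sd noise
    t = np.column_stack([
        (z[:, i] - z[:, j]) / (s * np.sqrt(1 / r[i] + 1 / r[j]))
        for i, j in pairs
    ])
    q = float(np.quantile(np.abs(t).max(axis=1), 1 - alpha))

    intervals = {}
    for i, j in pairs:
        diff = means[i] - means[j]
        half = q * np.sqrt(sigma2) * np.sqrt(1 / r[i] + 1 / r[j])
        intervals[(i, j)] = (diff - half, diff + half)
    return intervals, q

# Hypothetical per-replicate Shannon indices for k = 3 treatments, r_i = 5.
data_rng = np.random.default_rng(42)
theta = [data_rng.normal(mu, 0.3, size=5) for mu in (3.0, 3.4, 3.5)]
intervals, q = ag_intervals(theta)
for pair, (lower, upper) in intervals.items():
    print(pair, round(lower, 3), round(upper, 3))
```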


tmax SCIs based on an ANOVA model (WY)

- Following method (a), compute the parameter of interest $\hat{\theta}_{ij}$, e.g. Simpson's $\varphi$ measure, for every replication $j$, $j = 1, \ldots, r$, separately.

- Bootstrap the estimated indices directly according to Westfall and Young [1993]

  1. Fit a linear model to the estimated indices $\hat{\theta}_{ij}$, resulting in $\hat{\theta}_i$
  2. Bootstrap the residuals $\hat{\varepsilon}_{ij}$ unstratified
  3. For every bootstrap step $b$, $b = 1, \ldots, B$, build the test statistic

$$t^*_{ii'} = \frac{\bar{\varepsilon}^*_i - \bar{\varepsilon}^*_{i'}}{\sqrt{(\hat{\sigma}^2_{i,\hat{\varepsilon}})^*/n_i + (\hat{\sigma}^2_{i',\hat{\varepsilon}})^*/n_{i'}}} \qquad (4)$$

  4. $q_{1-\alpha}$ is the $1-\alpha$ empirical quantile of the $B$ values $\max(t^*_{ii'})$.
  5. The resulting simultaneous confidence intervals are constructed in the following way

$$\bar{\theta}_i - \bar{\theta}_{i'} \pm q_{1-\alpha} \sqrt{\hat{\sigma}^2_i/n_i + \hat{\sigma}^2_{i'}/n_{i'}} \qquad (5)$$

where $\hat{\sigma}^2_i$ is the residual mean square for the $i$th treatment in the ANOVA model
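Steps 1 to 5 can be sketched in Python as follows. This is not the simboot implementation: the function name, the number of bootstrap steps and the simulated input are hypothetical, and the maximum is taken over the absolute values $|t^*_{ii'}|$, which is an assumption made here to obtain two-sided intervals.

```python
import itertools
import numpy as np

def wy_intervals(groups, alpha=0.05, B=2000, seed=0):
    """tmax bootstrap SCIs from per-replicate indices, following (4) and (5).

    groups: list of 1-D arrays, one per treatment, holding theta_hat_ij.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    n = np.array([g.size for g in groups])
    means = np.array([g.mean() for g in groups])        # step 1: fitted means
    var = np.array([g.var(ddof=1) for g in groups])     # residual mean squares
    resid = np.concatenate([g - g.mean() for g in groups])
    bounds = np.cumsum(n)[:-1]
    pairs = list(itertools.combinations(range(len(groups)), 2))

    rng = np.random.default_rng(seed)
    tmax = np.empty(B)
    for b in range(B):
        # Step 2: resample the pooled residuals (unstratified) and
        # reassign them to groups of the original sizes.
        star = np.split(rng.choice(resid, size=resid.size, replace=True), bounds)
        m = np.array([e.mean() for e in star])
        v = np.array([e.var(ddof=1) for e in star])
        # Step 3: test statistic (4) for every pair, maximum over pairs.
        tmax[b] = max(
            abs(m[i] - m[j]) / np.sqrt(v[i] / n[i] + v[j] / n[j])
            for i, j in pairs
        )
    q = float(np.quantile(tmax, 1 - alpha))             # step 4

    intervals = {}
    for i, j in pairs:                                  # step 5, equation (5)
        diff = means[i] - means[j]
        half = q * np.sqrt(var[i] / n[i] + var[j] / n[j])
        intervals[(i, j)] = (diff - half, diff + half)
    return intervals, q

# Hypothetical per-replicate indices with unequal spread across treatments.
data_rng = np.random.default_rng(7)
theta = [data_rng.normal(mu, sd, size=8)
         for mu, sd in ((0.30, 0.02), (0.35, 0.05), (0.45, 0.08))]
intervals, q = wy_intervals(theta)
for pair, (lower, upper) in intervals.items():
    print(pair, round(lower, 3), round(upper, 3))
```

Because the standard error in (5) uses group-wise variances, the interval widths differ between comparisons when the treatments have unequal spread.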


tmax SCIs based on summed up counts (TS)

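1. Bootstrap the rows of the original data set, stratified by the $k$ levels of treatments.

2. Estimate the group-wise index of interest $\hat{\theta}^*_{i\cdot}$ according to method (b) for every bootstrap sample.

3. In every bootstrap sample, calculate the test statistic

$$t^*_{ii'} = \frac{(\hat{\theta}^*_{i\cdot} - \hat{\theta}^*_{i'\cdot}) - (\hat{\theta}_{i\cdot} - \hat{\theta}_{i'\cdot})}{\sqrt{(\hat{\sigma}^2_{\hat{\theta}_{i\cdot}})^* + (\hat{\sigma}^2_{\hat{\theta}_{i'\cdot}})^*}} \qquad (6)$$

with the variance estimators based on multinomial assumptions

4. $q_{1-\alpha}$ is the $1-\alpha$ empirical quantile of the $B$ values $\max(t^*_{ii'})$.

5. The resulting simultaneous confidence intervals are then

$$\hat{\theta}_{i\cdot} - \hat{\theta}_{i'\cdot} \pm q_{1-\alpha} \sqrt{\hat{\sigma}^2_{\hat{\theta}_{i\cdot}} + \hat{\sigma}^2_{\hat{\theta}_{i'\cdot}}} \qquad (7)$$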
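The five steps can be sketched in Python for the Simpson index. As before, this is an illustration under assumptions rather than the simboot code: the multinomial variance is the delta-method approximation $4/n \,[\sum_s \pi_s^3 - (\sum_s \pi_s^2)^2]$, the maximum is taken over $|t^*_{ii'}|$ for two-sided intervals, and the function names and the simulated Dirichlet-multinomial data are hypothetical.

```python
import itertools
import numpy as np

def simpson_with_variance(counts):
    """Simpson index and delta-method multinomial variance of a count vector."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n
    d = np.sum(p ** 2)
    return d, 4.0 / n * (np.sum(p ** 3) - d ** 2)

def ts_intervals(tables, alpha=0.05, B=1000, seed=0):
    """tmax bootstrap SCIs on summed counts, following (6) and (7).

    tables: list of (replicates x species) count arrays, one per treatment.
    """
    tables = [np.asarray(t, dtype=float) for t in tables]
    est, var = map(np.array, zip(*(simpson_with_variance(t.sum(axis=0))
                                   for t in tables)))
    pairs = list(itertools.combinations(range(len(tables)), 2))

    rng = np.random.default_rng(seed)
    tmax = np.empty(B)
    for b in range(B):
        # Step 1: resample replicate rows within each treatment (stratified).
        boot = [t[rng.integers(0, t.shape[0], size=t.shape[0])] for t in tables]
        # Step 2: index and variance on the summed counts, method (b).
        e, v = map(np.array, zip(*(simpson_with_variance(x.sum(axis=0))
                                   for x in boot)))
        # Step 3: centred test statistic (6), maximum over all pairs.
        tmax[b] = max(
            abs((e[i] - e[j]) - (est[i] - est[j])) / np.sqrt(v[i] + v[j])
            for i, j in pairs
        )
    q = float(np.quantile(tmax, 1 - alpha))              # step 4

    intervals = {}
    for i, j in pairs:                                   # step 5, equation (7)
        diff = est[i] - est[j]
        half = q * np.sqrt(var[i] + var[j])
        intervals[(i, j)] = (diff - half, diff + half)
    return intervals, q

def simulate(rng, pi, replicates=10, depth=1000, concentration=10.0):
    """Hypothetical over-dispersed (Dirichlet-multinomial) replicate counts."""
    alpha_vec = concentration * np.asarray(pi)
    return np.array([rng.multinomial(depth, rng.dirichlet(alpha_vec))
                     for _ in range(replicates)])

data_rng = np.random.default_rng(1)
tables = [simulate(data_rng, pi) for pi in
          ([0.5, 0.3, 0.2], [0.4, 0.3, 0.3], [0.25, 0.25, 0.5])]
intervals, q = ts_intervals(tables)
print(round(q, 2))
```

With over-dispersed replicates the bootstrap quantile $q_{1-\alpha}$ becomes much larger than a normal quantile, which widens the intervals to compensate for the multinomial variance estimator being too small.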


rank-perc SCIs based on summed up counts (PE)

- Bootstrap the rows of the original data set, stratified by the $k$ levels of treatments.

- Estimate the group-wise index of interest $\hat{\theta}^*_{i\cdot}$ according to method (b) for each bootstrap sample.

- Build differences of interest $\delta_m$ for all bootstrap samples

- Construct SCIs according to Besag et al. [1995]

  1. Rank the differences separately
  2. Compute and store the maximum of ranks for each bootstrap sample
  3. Compute the $1-\alpha$ quantile $t^*$ of the maximum ranks
  4. Finally, the confidence limits are constructed for each elementary parameter $\delta_m$ by taking $[\delta_m^{[B+1-t^*]}; \delta_m^{[t^*]}]$, i.e. the $(B+1-t^*)$th and $t^*$th value from the ordered sample of the joint empirical distribution obtained for $\delta_m$.
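The rank-based construction can be sketched in Python as below. For two-sided intervals the sketch uses, for every bootstrap sample, the maximum over all differences of max(rank, B + 1 − rank); this reading of "maximum of ranks" is an assumption based on the description in Besag et al. [1995]. Function names and the simulated data are hypothetical.

```python
import itertools
import numpy as np

def simpson(counts):
    """Plug-in Simpson index of one count vector."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return float(np.sum(p ** 2))

def pe_intervals(tables, alpha=0.05, B=1000, seed=0):
    """Rank-based simultaneous percentile intervals on summed counts.

    tables: list of (replicates x species) count arrays, one per treatment.
    """
    tables = [np.asarray(t, dtype=float) for t in tables]
    pairs = list(itertools.combinations(range(len(tables)), 2))
    rng = np.random.default_rng(seed)

    # Bootstrap rows within treatments, then build all differences delta_m.
    delta = np.empty((B, len(pairs)))
    for b in range(B):
        est = [simpson(t[rng.integers(0, len(t), size=len(t))].sum(axis=0))
               for t in tables]
        delta[b] = [est[i] - est[j] for i, j in pairs]

    # Step 1: ranks 1..B, separately for each difference (column).
    ranks = delta.argsort(axis=0).argsort(axis=0) + 1
    # Step 2: most extreme (two-sided) rank per bootstrap sample.
    extreme = np.maximum(ranks, B + 1 - ranks).max(axis=1)
    # Step 3: 1-alpha quantile t* of these maxima, as an order statistic.
    t_star = int(np.sort(extreme)[int(np.ceil((1 - alpha) * B)) - 1])
    # Step 4: limits are the (B+1-t*)th and t*th ordered values per column.
    ordered = np.sort(delta, axis=0)
    lower, upper = ordered[B - t_star], ordered[t_star - 1]
    return {pair: (lower[m], upper[m]) for m, pair in enumerate(pairs)}, t_star

def simulate(rng, pi, replicates=10, depth=1000, concentration=10.0):
    """Hypothetical over-dispersed (Dirichlet-multinomial) replicate counts."""
    alpha_vec = concentration * np.asarray(pi)
    return np.array([rng.multinomial(depth, rng.dirichlet(alpha_vec))
                     for _ in range(replicates)])

data_rng = np.random.default_rng(1)
tables = [simulate(data_rng, pi) for pi in
          ([0.5, 0.3, 0.2], [0.4, 0.3, 0.3], [0.25, 0.25, 0.5])]
intervals, t_star = pe_intervals(tables)
for pair, (lo, hi) in intervals.items():
    print(pair, round(lo, 4), round(hi, 4))
```

Since $t^*$ is at least as large as the rank used for a single unadjusted percentile interval, the simultaneous limits are never narrower than the marginal ones.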


Simulation results

Figure: Simulation results for the Shannon index

Figure: Simulation results for the Simpson index


Analysed example data set

Figure: Example data results for the Shannon index. Tukey-type SCIs for the difference of Shannon's index (Venezuela − USA, Venezuela − Malawi, USA − Malawi), one panel per method (WY, TS, PE, AG, AM); x-axis from −0.5 to 1.0

Figure: Example data results for the Simpson index. Tukey-type SCIs for the difference of Simpson's index (Venezuela − USA, Venezuela − Malawi, USA − Malawi), one panel per method (WY, TS, PE, AG, AM); x-axis from −0.04 to 0.08


Software implementation

- The publication corresponding to today's talk is Scherer and Schaarschmidt [2013]

- All methods except for the asymptotic methods based on the linear model are implemented in the R package simboot

- The asymptotic method based on the linear model is implemented in the R package multcomp

- The Bioconductor package phyloseq was used to import the OTU table from qiime

- simboot is on GitHub for bug reporting: https://github.com/shearer/simboot

- A GitHub homepage http://shearer.github.io/simboot/ with a tutorial for sequence data is under development


Literature I

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995). Bayesian Computation and Stochastic Systems. Statistical Science, 10(1):3–41.

Fritsch, K. S. and Hsu, J. C. (1999). Multiple comparison of entropies with application to dinosaur biodiversity. Biometrics, 55(4):1300–1305.

Hothorn, T., Bretz, F., and Westfall, P. (2008). Simultaneous inference in general parametric models. Biometrical Journal, 50(3):346–63.

Rogers, J. A. and Hsu, J. C. (2001). Multiple comparisons of biodiversity. Biometrical Journal, 43(5):617–625.

Scherer, R. and Schaarschmidt, F. (2013). Simultaneous confidence intervals for comparing biodiversity indices estimated from overdispersed count data. Biometrical . . . , 55:246–263.

Tukey, J. W. (1953). The problem of multiple comparisons.

Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing. John Wiley & Sons.

Wooley, J. C., Godzik, A., and Friedberg, I. (2010). A primer on metagenomics. PLoS Computational Biology, 6(2):e1000667.

Yatsunenko, T., Rey, F. E., Manary, M. M. J., Trehan, I., Dominguez-Bello, M. G., Contreras, M., Magris, M., Hidalgo, G., Baldassano, R. N., Anokhin, A. P., Heath, A. C., Warner, B., Reeder, J., Kuczynski, J., Caporaso, J. G., Lozupone, C. A., Lauber, C., Clemente, J. C., Knights, D., Knight, R., and Gordon, J. I. (2012). Human gut microbiome viewed across age and geography. Nature, 486(Ivic):222–7.
