Bacterial community comparisons by taxonomy-supervised ... · Bacterial community comparisons by...

6
Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering Woo Jun Sul a , James R. Cole a , Ederson da C. Jesus a,b , Qiong Wang a , Ryan J. Farris a , Jordan A. Fish a , and James M. Tiedje a,c,1 a Center for Microbial Ecology, Michigan State University, East Lansing, MI 48824; b Embrapa Agrobiologia, Seropédica, 23890-000, Rio de Janeiro, Brazil; and c World Class University Center for Green Metagenomics, Yonsei University, Seoul 120-749, Republic of Korea Contributed by James M. Tiedje, July 28, 2011 (sent for review March 8, 2011) High-throughput sequencing of 16S rRNA genes has increased our understanding of microbial community structure, but now even higher-throughput methods to the Illumina scale allow the crea- tion of much larger datasets with more samples and orders-of- magnitude more sequences that swamp current analytic methods. We developed a method capable of handling these larger datasets on the basis of assignment of sequences into an existing taxon- omy using a supervised learning approach (taxonomy-supervised analysis). We compared this method with a commonly used clus- tering approach based on sequence similarity (taxonomy-unsuper- vised analysis). We sampled 211 different bacterial communities from various habitats and obtained w1.3 million 16S rRNA sequen- ces spanning the V4 hypervariable region by pyrosequencing. Both methodologies gave similar ecological conclusions in that β-diver- sity measures calculated by using these two types of matrices were signicantly correlated to each other, as were the ordination congurations and hierarchical clustering dendrograms. In addition, our taxonomy-supervised analyses were also highly correlated with phylogenetic methods, such as UniFrac. The taxonomy-super- vised analysis has the advantages that it is not limited by the exhaustive computation required for the alignment and clustering necessary for the taxonomy-unsupervised analysis, is more toler- ant of sequencing errors, and allows comparisons when sequences are from different regions of the 16S rRNA gene. With the tre- mendous expansion in 16S rRNA data acquisition underway, the taxonomy-supervised approach offers the potential to provide more rapid and extensive community comparisons across habitats and samples. taxonomy bin | operational taxonomic unit T he increasing abundance of 16S rRNA gene sequences stim- ulated by reduced sequencing costs and greatly expanded parallel capacities is providing a more encompassing view of mi- crobial communities (1). Although the short read lengths pro- vided by the current technologies make it more challenging to assign sequences to bacterial taxonomy, the depth and replication provided are powerful advantages (24). Information on bacterial community structure can be compiled in a matrix where different communities are represented as rows and speciesas columns, i.e., a community-by-species matrix. When describing bacterial community relationships based on 16S rRNA gene sequences, each sequence is allocated to a species, usually termed an operational taxonomic unit (OTU), by align- ment-based clustering at a specied nucleotide distance, often at a 97% identity. This community-by-OTU matrix, which is based exclusively on nucleotide distances among 16S rRNA gene sequences, has bacterial communities as rows with OTU as col- umns. This community-by-OTU matrix can be used to measure dissimilarities between bacterial communities (β-diversity) either by presence/absence or abundance data. These dissimilarities combined in a distance matrix can be used for bacterial commu- nity comparisons by ordination and clustering methods. This process, termed taxonomy-unsupervised analysis,originates from the distribution of 16S rRNA gene sequences into OTUs. When applying taxonomy-unsupervised analysis to very large numbers of sequences (>10 6 ) produced by the new sequencing technologies, much larger computational capacities are required to analyze the data (5). The alignment and clustering of se- quences that require calculation of pairwise nucleotide distances is one bottleneck. Even though taxonomy-unsupervised analysis is advantageous in that it includes sequences that are not yet assignable to bacterial taxonomy, the current computational limitations make pursuing comparisons among bacterial com- munities difcult. We investigated an alternative analysis, i.e., taxonomy-su- pervised analysis,which consists of allocating sequences into taxonomy-supervised taxonomy binson the basis of the existing bacterial taxonomy, which is rooted in polyphasic taxonomy (6) and hence also reects physiological, morphological, and genetic information. Currently, several ribosomal RNA databases [i.e., Ribosomal Database Project, RDP (7), Greengenes (8), SILVA (9), and GAST (10)] are dedicated to sequence deposition and provide 16S rRNA gene classication tools. Here we compared taxonomy-unsupervised (OTU) and tax- onomy-supervised analyses for two β-diversity measures, ordi- nation congurations, and hierarchical clustering dendrograms, using w1.3 million (M) V4-region 16S rRNA gene sequences obtained by pyrosequencing of 211 bacterial communities. We show that the taxonomy-supervised analysis gives similar eco- logical conclusions as taxonomy-unsupervised analysis, is more advantageous in avoiding computational limitations, can be used for comparisons with different sequenced regions of the rRNA gene, and has greater tolerance to sequencing errors. Results Obtaining V4-16S rRNA Gene Sequences by Barcoded Pyrosequencing. Regions in the 16S rRNA gene suitable for pyrosequencing were identied, which exhibited: (i ) an appropriate amplicon length for pyrosequencing reads, (ii ) high coverage by bacterial universal primers, (iii ) high resolution and accuracy for bacterial classi- cation and identication, and (iv) a low frequency of insertions and deletions to simplify sequence alignment. Especially, the re- gion used in this study (corresponding to Escherichia coli 16S rRNA gene positions 563802) spanning the hypervariable V4 region, with an average length of 207 bp between primer sequen- ces, is one of the regions with the high classication accuracy using Author contributions: W.J.S. and J.M.T. designed research; W.J.S. and E.d.C.J. performed research; W.J.S., J.R.C., Q.W., R.J.F., and J.A.F. analyzed data; and W.J.S., J.R.C., E.d.C.J., and J.M.T. wrote the paper. The authors declare no conict of interest. 1 To whom correspondence should be addressed: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1111435108/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1111435108 PNAS | August 30, 2011 | vol. 108 | no. 35 | 14637e14642 MICROBIOLOGY Downloaded by guest on June 15, 2020

Transcript of Bacterial community comparisons by taxonomy-supervised ... · Bacterial community comparisons by...

Page 1: Bacterial community comparisons by taxonomy-supervised ... · Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering Woo

Bacterial community comparisons bytaxonomy-supervised analysis independentof sequence alignment and clusteringWoo Jun Sula, James R. Colea, Ederson da C. Jesusa,b, Qiong Wanga, Ryan J. Farrisa, Jordan A. Fisha,and James M. Tiedjea,c,1

aCenter for Microbial Ecology, Michigan State University, East Lansing, MI 48824; bEmbrapa Agrobiologia, Seropédica, 23890-000, Rio de Janeiro, Brazil;and cWorld Class University Center for Green Metagenomics, Yonsei University, Seoul 120-749, Republic of Korea

Contributed by James M. Tiedje, July 28, 2011 (sent for review March 8, 2011)

High-throughput sequencing of 16S rRNA genes has increased ourunderstanding of microbial community structure, but now evenhigher-throughput methods to the Illumina scale allow the crea-tion of much larger datasets with more samples and orders-of-magnitude more sequences that swamp current analytic methods.We developed a method capable of handling these larger datasetson the basis of assignment of sequences into an existing taxon-omy using a supervised learning approach (taxonomy-supervisedanalysis). We compared this method with a commonly used clus-tering approach based on sequence similarity (taxonomy-unsuper-vised analysis). We sampled 211 different bacterial communitiesfrom various habitats and obtainedw1.3 million 16S rRNA sequen-ces spanning the V4 hypervariable region by pyrosequencing. Bothmethodologies gave similar ecological conclusions in that β-diver-sity measures calculated by using these two types of matriceswere significantly correlated to each other, as were the ordinationconfigurations and hierarchical clustering dendrograms. In addition,our taxonomy-supervised analyses were also highly correlatedwith phylogenetic methods, such as UniFrac. The taxonomy-super-vised analysis has the advantages that it is not limited by theexhaustive computation required for the alignment and clusteringnecessary for the taxonomy-unsupervised analysis, is more toler-ant of sequencing errors, and allows comparisons when sequencesare from different regions of the 16S rRNA gene. With the tre-mendous expansion in 16S rRNA data acquisition underway, thetaxonomy-supervised approach offers the potential to providemore rapid and extensive community comparisons across habitatsand samples.

taxonomy bin | operational taxonomic unit

The increasing abundance of 16S rRNA gene sequences stim-ulated by reduced sequencing costs and greatly expanded

parallel capacities is providing a more encompassing view of mi-crobial communities (1). Although the short read lengths pro-vided by the current technologies make it more challenging toassign sequences to bacterial taxonomy, the depth and replicationprovided are powerful advantages (2–4).Information on bacterial community structure can be compiled

in a matrix where different communities are represented as rowsand “species” as columns, i.e., a community-by-species matrix.When describing bacterial community relationships based on 16SrRNA gene sequences, each sequence is allocated to a species,usually termed an operational taxonomic unit (OTU), by align-ment-based clustering at a specified nucleotide distance, oftenat a 97% identity. This community-by-OTU matrix, which isbased exclusively on nucleotide distances among 16S rRNA genesequences, has bacterial communities as rows with OTU as col-umns. This community-by-OTU matrix can be used to measuredissimilarities between bacterial communities (β-diversity) eitherby presence/absence or abundance data. These dissimilaritiescombined in a distance matrix can be used for bacterial commu-nity comparisons by ordination and clustering methods. This

process, termed “taxonomy-unsupervised analysis,” originatesfrom the distribution of 16S rRNA gene sequences into OTUs.When applying taxonomy-unsupervised analysis to very large

numbers of sequences (>106) produced by the new sequencingtechnologies, much larger computational capacities are requiredto analyze the data (5). The alignment and clustering of se-quences that require calculation of pairwise nucleotide distancesis one bottleneck. Even though taxonomy-unsupervised analysisis advantageous in that it includes sequences that are not yetassignable to bacterial taxonomy, the current computationallimitations make pursuing comparisons among bacterial com-munities difficult.We investigated an alternative analysis, i.e., “taxonomy-su-

pervised analysis,” which consists of allocating sequences intotaxonomy-supervised “taxonomy bins” on the basis of the existingbacterial taxonomy, which is rooted in polyphasic taxonomy (6)and hence also reflects physiological, morphological, and geneticinformation. Currently, several ribosomal RNA databases [i.e.,Ribosomal Database Project, RDP (7), Greengenes (8), SILVA(9), and GAST (10)] are dedicated to sequence deposition andprovide 16S rRNA gene classification tools.Here we compared taxonomy-unsupervised (OTU) and tax-

onomy-supervised analyses for two β-diversity measures, ordi-nation configurations, and hierarchical clustering dendrograms,using w1.3 million (M) V4-region 16S rRNA gene sequencesobtained by pyrosequencing of 211 bacterial communities. Weshow that the taxonomy-supervised analysis gives similar eco-logical conclusions as taxonomy-unsupervised analysis, is moreadvantageous in avoiding computational limitations, can be usedfor comparisons with different sequenced regions of the rRNAgene, and has greater tolerance to sequencing errors.

ResultsObtaining V4-16S rRNA Gene Sequences by Barcoded Pyrosequencing.Regions in the 16S rRNA gene suitable for pyrosequencing wereidentified, which exhibited: (i) an appropriate amplicon lengthfor pyrosequencing reads, (ii) high coverage by bacterial universalprimers, (iii) high resolution and accuracy for bacterial classifi-cation and identification, and (iv) a low frequency of insertionsand deletions to simplify sequence alignment. Especially, the re-gion used in this study (corresponding to Escherichia coli 16SrRNA gene positions 563–802) spanning the hypervariable V4region, with an average length of 207 bp between primer sequen-ces, is one of the regions with the high classification accuracy using

Author contributions: W.J.S. and J.M.T. designed research; W.J.S. and E.d.C.J. performedresearch; W.J.S., J.R.C., Q.W., R.J.F., and J.A.F. analyzed data; and W.J.S., J.R.C., E.d.C.J.,and J.M.T. wrote the paper.

The authors declare no conflict of interest.1To whom correspondence should be addressed: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1111435108/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1111435108 PNAS | August 30, 2011 | vol. 108 | no. 35 | 14637e14642

MICRO

BIOLO

GY

Dow

nloa

ded

by g

uest

on

June

15,

202

0

Page 2: Bacterial community comparisons by taxonomy-supervised ... · Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering Woo

the RDP’s naïve Bayesian rRNA classifier (11). Classification ac-curacy declines with length, but is accurate at higher taxonomiclevels even for 50-bp fragments. Its applicability for pyrosequenc-ing was further supported by in silico Unifrac analysis (12). Theuniversality of the primers was determined by internal alignmentof perfect matches against 16S rRNA gene sequences in theRDP (94.6% coverage) and from themetagenomic database of theSorcerer II Global Ocean Sampling Expedition (94.7% coverage)(13). Specifically, the primers designed in this study targeted anoverwhelming majority of known 16S rRNA gene sequencesthroughout all phyla, while providing deep taxonomic classificationuseful for community comparisons (Fig. S1).The w1.3 M 16S rRNA gene sequences we used were from

211 bacterial communities from a variety of studies and habitatsthat reflected a random collection of pyrosequencing projectdata. Each bacterial assemblage was assigned into 11 habitatgroups by a priori classification (a classification made beforeexperimentation) (Table 1), using two-level habitat definitionsfrom the Genomic Standards Consortium’s “Habitat-Lite” ver-sion 0.4 ontology (14). Detailed sample descriptions are pro-vided in Table S1.

Allocation of 1.3 M 16S rRNA Gene Sequences to Taxonomy Bins andto OTUs at 97% Identity. For taxonomy-supervised analysis, each16S rRNA gene query sequence was assigned to a set of taxon-omy bins, 1,400 genera, and 492 artificial “unclassified” taxa,using the RDP classifier (11). When the classifier cutoffs were setat 80, 50, and 0% threshold (the latter forced all sequences intogenus bins), 48, 64, and 100% of the sequences were classified tothe genus level (Fig. 1), producing 903, 1,170, and 1,259 bins

for the respective thresholds. The maximum distance amongsequences within each bin increased when the RDP classifierthreshold was set lower (Fig. S2). For taxonomy-unsupervisedanalysis, all sequences clustered into 112,233 OTUs at 97% 16SrRNA gene sequence identity.A total of 22,154 pairwise β-diversity calculations using Chao’s

adjusted Sørensen (15) and Jaccard (16) indices among 211bacterial communities were performed with both community-by-OTU and community-by-taxonomy bin matrices. The β-diversitydistance matrix based on OTUs was highly correlated with thosebased on taxonomy bins at three different thresholds as con-firmed by Mantel tests (Table 2 and Fig. 2; SI Results and TableS2 for 90% OTUs) (17). Ordinations from β-diversity matricesbased both on OTU and taxonomy bins were also correlated toeach other when configurations two axes of nonmetric multidi-mensional scaling (NMDS) were compared by Procrustes anal-ysis (correlation r > 0.51, P value <0.001) (Figs. 3 and 4 andTable S3). The dendrograms produced by hierarchical clustering(unweighted pair-group method using arithmetric averages,UPGMA) of the matrices on the basis of OTU and taxonomybins were highly correlated to each other as measured bycophenetic correlation (Table 3). Also, comparisons betweena phylogenetic distance matrix created using UniFrac (18) andthe β-diversity distance matrix based on taxonomy bins showedthese matrices were highly correlated using a Mantel test(Table 2).

Comparison of Communities with Artificial Sequencing Errors. Todetermine tolerance to sequencing errors, artificial sequencingerrors were added to copies of the parent library, consistingof V4 regions extracted from a library of 2,483 near full-length16S rRNA Sanger-based sequences (19). The resulting altered

Table 1. The eleven habitat groups as defined followingHabitat-Lite criteria before sampling (sequencing)

Habitat groupNumbers ofsamples Habitat-Lite description

G01 116 Terrestrial*, soil†

G02 8 Extreme*, soil†

G03 10 Terrestrial*, extreme*, soil†

G04 16 Organism associated*G05 6 Freshwater*, waste water†

G06 7 Freshwater*, sediment†

G07 2 Fossil*, organism associated*G08 10 Marine*, sediment†

G09 14 Cultured*, soil†, or sediment†

G10 20 Extreme*, freshwater*, sediment†

G11 2 Extreme*, microbial mat†

Detailed habitat descriptions are provided in Table S1.*Top level terms in Habitat-Lite version 0.3 (14).†Second level terms.

0

20

40

60

80

100

Domain Phylum Class Order Family Genus

Cla

ssifi

ed S

eque

nces

up

to

taxo

nom

y Le

vel (

%)

Taxonomy units

at 0% Confidence Threshold

at 50% Confidence Threshold

at 80% Confidence Threshold

Fig. 1. Qualified sequence classification percentages at different confi-dence thresholds determined by the RDP classifier for the indicated taxo-nomic levels.

Table 2. Comparison of two distance matrices by Mantel tests

r statistic

Similarity index usedTaxonomy bins- (at 80%)based distance matrix

Taxonomy bins- (at 50%)based distance matrix

Taxonomy bins- (at 0%)based distance matrix

97% OTU-baseddistance matrix

Chao’s adjustedSørensen 0.779*** 0.803*** 0.817***Jaccard 0.863*** 0.860*** 0.787***

UniFrac distancematrix

Chao’s adjustedSørensen 0.737*** 0.732*** 0.737***Jaccard 0.866*** 0.873*** 0.873***

The Mantel statistic was based on Spearman’s rank correlation. The correlation r reflects the relationship between two matrices. ***P < 0.001.The permutations (n = 999) assessed for the significance.

14638 | www.pnas.org/cgi/doi/10.1073/pnas.1111435108 Sul et al.

Dow

nloa

ded

by g

uest

on

June

15,

202

0

Page 3: Bacterial community comparisons by taxonomy-supervised ... · Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering Woo

libraries were compared with the unaltered parent library usingChao’s adjusted Sørensen and Jaccard indices after either un-supervised OTUs clustering or taxonomy binning with the RDPclassifier. In comparisons using unsupervised OTU clustering,both β-diversity indices dropped to near zero (similar commu-nities have β-diversity indices close to 1) by around 3% added“sequencing errors,” whereas supervised taxonomic binningshowed only a small decrease with up to 10% of added se-quencing errors (Fig. 5).

DiscussionThere are major advantages in the taxonomy-supervised analysis:(i) it is capable of comparing the data coming from differentregions in the 16S rRNA gene, (ii) it avoids the computationallyintensive alignment and clustering, and (iii) it is easy to combinedata sets classified separately. Depending on the length of the16S rRNA gene region sequenced and the resolution of thebacterial taxonomy classification, the taxonomy-supervised anal-ysis can compare the bacterial communities of 16S rRNA genesequences spanning other hypervariable regions and bacterialcommunities with previously deposited sequences. For example,the RDP classifier returned similar classification results for V3,V4, and V6 regions compared with full-length queries at the ge-nus level (Table 4). Therefore, one can obtain comparable dataregardless of the sequenced region. When doing such compar-isons, one must be aware that primer sets for different regionscan sample somewhat different populations.Another advantage of the taxonomy-supervised analysis is that,

due to the fixed number of taxonomy bins, it is simple to addand delete bacterial communities from a preformulated bacterialcommunities comparison. Using taxonomy-unsupervised analysis,the addition and deletion of bacterial communities affects thecommunity-by-OTUmatrix because the number and compositionof OTUs are affected by realignment and reclustering. Further-more, sequence allocation into taxonomy bins is computationallyfaster than into taxonomy-unsupervised OTUs as the time forthe former method increases linearly with the number of se-quences, whereas the time for the latter increases quadratically,thus requiring significantly longer processing times. For instance,taxonomy-supervised analysis took 1 h for generating community-by-taxonomy bins using the RDP classifier with 1.3 M sequences(480,000 unique sequences), whereas taxonomy-unsupervisedanalysis took w10 h for sequence alignment using Infernal and144 h for complete-linkage clustering on an Intel XEONMacPro.Also, the taxonomy-unsupervised analysis is very sensitive tosequencing errors (20–22) (Fig. 5). With taxonomy-supervised

1

22155

1 22155

Ran

k of

-d

iver

sity

dis

tanc

e fro

m 9

7% O

TUs

Rank of -diversity distance from taxonomy-bins

Fig. 2. Scatter plot of β-diversity distance orders (the highest β-diversity isconsidered as rank 1) calculated using community-by-taxonomy bins matrix(x axis) RDP’s classifier at 0% threshold and community-by-OTU matrix(y axis). In this plot, β-diversity distance is 1 − Chao’s adjusted Sørensensimilarity index.

A B

C D

Stress = 24.31

Stress = 24.31 Stress = 23.271

Stress = 23.51

Fig. 3. Comparison of NMDS plots based on abundance-based distance (1 − Chao’s adjusted Sørensen similarity index). The habitat groups defined by usingthe ontology of Habitat-Lite are indicated by the different color and shape points (key below). Community-by-taxonomy bins at (A) 80%, (B) 50%, (C) 0%, and(D) community-by-OTU.

Sul et al. PNAS | August 30, 2011 | vol. 108 | no. 35 | 14639

MICRO

BIOLO

GY

Dow

nloa

ded

by g

uest

on

June

15,

202

0

Page 4: Bacterial community comparisons by taxonomy-supervised ... · Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering Woo

analysis, it is statistically very unlikely that random sequencingerrors will cause a sequence from one organism to “mutate” toresemble the sequence from an unrelated organism.We focused on defining the differences between using taxon-

omy-supervised and taxonomy-unsupervised analyses when com-paring bacterial communities. Both β-diversity distance matrices,the configuration of ordinations, and the comparisons of hierar-chical clustering results confirmed that the two analyses are sig-nificantly correlated such that similar conclusions would be drawn(i.e., Fig. S3, distance-based redundancy analysis (db-RDA) withboth matrices showed the similar correlations with bacterial com-munities and environmental variable). The resolution, however, ismore limited with the taxonomy-supervised analysis due to thecoarser average distance among taxa. The median of maximumdistance within taxonomy bins was 10.8, 15, and 31.9% at 80, 50,and 0% RDP classifier thresholds, respectively (SI Results andFig. S2). For example, there was a decreased resolution of habitatgroup G01 (terrestrial and soil) to other groups in taxonomy-su-pervised (Fig. 4A) compared with taxonomy-unsupervised anal-ysis (Fig. 4D), especially in occurrence-based NMDS plots. This isdue to the more limited number of taxonomy bins in the phylumAcidobacteria (26 genera and four unclassified taxa), Verrucomi-crobia (10 genera and eight unclassified taxa), and Gemmatimo-nadetes (2 genera and five unclassified taxa). These bins have

a relatively large number of sequences in habitat group G01(soils) due to the low number of isolated bacteria or describedclusters. In contrast, the communities in habitat group G04 (or-ganism associated; specifically animal feces) were mostly com-posed of well-characterized groups and exhibited better separationfrom other groups with the taxonomy-supervised analysis ratherthan with the taxonomy-unsupervised analysis (Figs. 3 and 4).When a more complete bacterial taxonomy is available for thesephyla and the unclassified taxa, the bacterial communities com-parison should exhibit a higher resolution, more accurately reflectbacterial community composition, and, if the polyphasic founda-tion of taxonomy has validity, better reflect the physiologies.As revolutionary sequencing technologies continue to emerge,

generating tremendous numbers of 16S rRNA and other markergene sequences, the limited flexibility of current clustering toolsand the computational requirements will be major bottlenecks.For example, two recent studies have produced millions of 16Samplicon sequences of 150–200 bases in length using Illuminapaired-end technology (23, 24). In related work, we have used thetechnique presented here to analyze 9.7 million paired-end Illu-mina sequence reads from five soil samples in less than 2 h. Thetaxonomy-based method has the potential to overcome theselimitations as a fast and simple bacterial assemblage comparison

A B

C D

Stress = 13.25

Stress = 14.80 Stress = 20.91

Stress = 14.59

Fig. 4. Comparison of NMDS plots based on occurrence-based distance (1 − Jaccard similarity index). The habitat groups defined by using the ontologyof Habitat-Lite are indicated by the different color and shape points (key below). Community-by-taxonomy bins at (A) 80%, (B) 50%, (C) 0%, and (D)community-by-OTU.

Table 3. Cophenetic correlations of unweighted pair-group method using arithmetric averages (UPGMA) clustering dendrograms

Cophenetic matrix of UPGMA clustering with

Similarity index usedTaxonomy bins- (at 80%)based distance matrix

Taxonomy bins- (at 50%)based distance matrix

Taxonomy bins- (at 0%)based distance matrix

Cophenetic matrix of UPGMAclustering with 97% OTU-baseddistance matrix

Chao’s adjustedSørensen 0.748 0.709 0.694Jaccard 0.816 0.839 0.907

Cophenetic correlation was computed using Spearman’s rank from two matrices of cophenetic distance, which is the distance where two samples becamethe same group in the clustering dendrogram. Higher cophenetic correlation means the UPGMA clustering dendrograms are more similar.

14640 | www.pnas.org/cgi/doi/10.1073/pnas.1111435108 Sul et al.

Dow

nloa

ded

by g

uest

on

June

15,

202

0

Page 5: Bacterial community comparisons by taxonomy-supervised ... · Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering Woo

method, but its value would be improved if microbiologists ad-vance the taxonomy for the poorly characterized groups.

Materials and MethodsSample Preparation. DNA of 211 bacterial communities was obtained viavarious extraction methods. Amplicon preparations for pyrosequencing andsequencing 16S rRNA gene by Genome Sequencer FLX system (454 Life Sci-ences) were performed. The detailed method is described in SI Materialsand Methods.

Binning Sequences into Species. For the taxonomy-unsupervised analysis, thecommunity-by-OTU matrix was generated as follows: briefly, all sequenceswere aligned by secondary structure using Infernal (25), clustered by com-plete-linkage clustering, and then allocated into OTUs of 97% nucleotideidentity through RDP’s pyrosequencing pipeline (26). For the taxonomy-supervised analysis, all sequences were allocated into taxonomy bins ofgenus and artificial unclassified taxa provided by the RDP classifier at 80,50, and 0% confidence thresholds (11). The RDP classifier was trained usingthe Taxonomic Outline of the Bacteria and Archaea (TOBA), release 7.8 (27)augmented with unofficial taxa to cover regions of diversity not covered inthe formal bacterial taxonomy. Each of the lowest taxonomy units, i.e.,genera and unclassified taxa were considered as taxonomy bins. The re-liability of classification of each sequence was estimated by bootstrapping,and sequences that could not be assigned, as they were below a bootstrapconfidence threshold, were allocated to an artificial unclassified taxon. Inthe same manner, the community-by-taxonomy bin matrix was generatedwith communities as rows and with taxonomy bins as columns.

β-Diversity Measures and Other Statistical Analyses. β-Diversity measures ofthe 211 communities were calculated on the basis of pairwise Chao’s ad-justed Sørensen similarity index (quantitative measures corrected for unseenspecies) (15) and Jaccard similarity index (presence/absence measures) (16)using EstimateS (http://viceroy.eeb.uconn.edu/EstimateS). Each of the twodistance matrices of bacterial communities (1 − Chao’s adjusted Sorensensimilarity index and 1 − Jaccard similarity index) from the community-by-OTU and community-by-taxonomy bin matrices were compared by Manteltests (17) on the basis of Spearman’s rank correlation ρ. Bacterial commu-nities were displayed by two axes of NMDS to represent the greatest vari-ability. The configuration of the NMDS plots was compared by Procrustesanalysis (28), a statistical shape analysis that compares the distribution ofpoints’ shapes with all 211 points in NMDS dimensions. UPGMA hierarchicalclustering was performed, and clustering dendrograms were compared bythe cophenetic correlation. For measuring UniFrac distance (18), the ap-proximately maximum-likelihood phylogenetic trees were generated usingFastTree (29).

To test bacterial classification variability of the different 16S rRNA genehypervariable regions, four sets of near full-length (>1,200 bp) 16S rRNAgene sequence collections were used: one each of human gut, soil, bovinerumen, and ocean (30–34). All sequences were aligned and V3, V4, and V6hypervariable regions extracted on the basis of the reference positions ofthe Escherichia coli 16S rRNA gene. Classification results for the full-lengthsequences were compared with results for the corresponding V3, V4, and V6hypervariable regions.

To measure the effects of sequencing error on β-diversity measures, theregion corresponding to the V4 amplicon region was extracted from a li-brary of 2,483 near full-length 16S rRNA Sanger-based sequences (19) toproduce a V4 region high-quality library. Artificial sequencing errors wereadded to copies of this library by randomly modifying various percentages ofbases in each sequence. The original and modified libraries were assigned tosupervised taxonomic bins with the RDP classifier as described above. Un-supervised OTU bins were calculated individually for each modified library incombination with the original library as described above.

NMDS, Mantel test, Procrustes analysis, UPGMA clustering, and copheneticcorrelation were performed using the R statistical program (R DevelopmentCore Team) running the vegan package (Vegan: Community Ecology Packageversion 1.8-6). Except where otherwise indicated, processing software waswritten in Java (API v1.5.0) and executed on the Macintosh (OS 10.4) or Linux(2.4.23) operating systems running Java virtual machines from Apple orSun, respectively.

ACKNOWLEDGMENTS. We thank Stella Asuming-Brempong, Mary BethLeigh, David Emerson, Chris Blackwood, Erick Cardenas, Ryan Penton, BlazStres, Stephan Gantner, Claudia Etchebehere, Thad Stanton, DeboraRodrigues, Aviaja Hansen, Mathew Marshall, Alexandre Soares Rosado,and Dan Fisher whose projects provided the 211 DNA samples. This studyis funded by grants from the Department of Energy, the Department ofAgriculture, and the National Institute of Environmental Health Sciences,with samples coming from grants from these agencies plus the NationalScience Foundation, the National Aeronautics and Space Administration,and several international sources, including the World Class Universityprogram through the National Research Foundation of Korea, funded bythe Ministry of Education, Science, and Technology (R33-10076).

1. Tringe SG, Hugenholtz P (2008) A renaissance for the pioneering 16S rRNA gene. Curr

Opin Microbiol 11:442e446.2. Sogin ML, et al. (2006) Microbial diversity in the deep sea and the underexplored

“rare biosphere.” Proc Natl Acad Sci USA 103:12115e12120.

3. Huber JA, et al. (2007) Microbial population structures in the deep marine biosphere.

Science 318:97e100.4. Roesch LF, et al. (2007) Pyrosequencing enumerates and contrasts soil microbial

diversity. ISME J 1:283e290.

0

0.2

0.4

0.6

0.8

1

0 2 4 6 8

Sim

ilarit

y

Simulated 'Sequencing errors' Distance

Jaccard with unsupervised OTU

Chao's Sorensen with unsupervised OTU

Jaccard with taxonomy-bin

Chao's Sorensen with taxonomy-bin

Fig. 5. Simulation of “sequencing errors.” The similarity indices measuredthe differences between the original parent library and the altered library bysimulated sequencing errors distance, mean of the nucleotide substitutionrates (%) in all query sequences done in a randomized manner.

Table 4. Bacterial classification accuracy by the RDP classifier ofpartial sequences spanning hypervariable V3, V4, and V6 regionsof 16S rRNA genes

V3 V4 V6

Bootstrap cutoff, % 0 50 80 0 50 80 0 50 80Human gut

% classified 100 93.9 79.4 100 92 88.4 100 66.2 45.8% matching 89.3 91.6 99.9 94.5 96.5 97.7 74.9 97.6 99.6

Soil% classified 100 73.6 50.6 100 76.4 56.3 100 34.4 15.6% matching 71.9 86.8 92.9 82.7 92.7 97.3 43.8 75.6 83.1

Marine% classified 100 61.4 48.3 100 77.4 53.4 100 52.3 41.7% matching 66.0 94.1 99.9 81.3 96.9 98.9 49.3 88.0 93.7

Bovine rumen% classified 100 76.3 62.3 100 79.4 66.7 99.1 51.4 43.9% matching 76.8 93.4 99.4 88.6 96.7 100 55.6 88.5 89.5

RDP classifier used training set no. 6 from March 2010. V3 and V6 regionscorrespond to those amplified by primers from Dethlefsen et al. (30); V3 re-gion range, E. coli position 358–514 and length 144 ± 11 bp; V6 region range,E. coli position 986–1045 and length 60 ± 4 bp. V4 region corresponds to theamplification product of primers from this paper spanning E. coli position578–784 and length 207 ± 1 bp. Human gut data were obtained fromDethlefsen et al. (30). Analysis was updated from Claesson et al. (31). Soildata were obtained from Elshahed et al. (32). Marine data were obtainedfrom Walsh et al. (33). Bovine rumen data were obtained from Brulc et al.(34). % classified, fraction of sequences classified to genus level; %matching,matching of each region’s classifications with full-length classification.

Sul et al. PNAS | August 30, 2011 | vol. 108 | no. 35 | 14641

MICRO

BIOLO

GY

Dow

nloa

ded

by g

uest

on

June

15,

202

0

Page 6: Bacterial community comparisons by taxonomy-supervised ... · Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering Woo

5. Hamady M, Knight R (2009) Microbial community profiling for human microbiome

projects: Tools, techniques, and challenges. Genome Res 19:1141e1152.6. Colwell RR (1970) Polyphasic taxonomy of the genus vibrio: Numerical taxonomy of

Vibrio cholerae, Vibrio parahaemolyticus, and related Vibrio species. J Bacteriol 104:

410e433.7. Cole JR, et al. (2007) The ribosomal database project (RDP-II): Introducing myRDP

space and quality controlled public data. Nucleic Acids Res 35(Database issue):

D169eD172.8. DeSantis TZ, et al. (2006) Greengenes, a chimera-checked 16S rRNA gene database

and workbench compatible with ARB. Appl Environ Microbiol 72:5069e5072.9. Pruesse E, et al. (2007) SILVA: A comprehensive online resource for quality checked

and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35:

7188e7196.10. Huse SM, et al. (2008) Exploring microbial diversity and taxonomy using SSU rRNA

hypervariable tag sequencing. PLoS Genet 4:e1000255.11. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid

assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ

Microbiol 73:5261e5267.12. Liu Z, Lozupone C, Hamady M, Bushman FD, Knight R (2007) Short pyrosequencing

reads suffice for accurate microbial community analysis. Nucleic Acids Res 35:e120.13. Rusch DB, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: Northwest

Atlantic through eastern tropical Pacific. PLoS Biol 5:e77.14. Hirschman L, et al.; Novo Project (2008) Habitat-Lite: A GSC case study based on free

text terms for environmental metadata. OMICS 12:129e136.15. Chao A, Chazdon RL, Colwell RK, Shen TJ (2006) Abundance-based similarity indices

and their estimation when there are unseen species in samples. Biometrics 62:

361e371.16. Jaccard P (1901) Comparative study of the floral distribution in a portion of the Alps

and Jura (Translated from French). Bull Soc Vaud Sci Nat 37:547e579.17. Mantel N (1967) The detection of disease clustering and a generalized regression

approach. Cancer Res 27:209e220.18. Lozupone C, Knight R (2005) UniFrac: A new phylogenetic method for comparing

microbial communities. Appl Environ Microbiol 71:8228e8235.19. La Duc MT, et al. (2009) Comprehensive census of bacteria in clean rooms by using

DNA microarray and cloning methods. Appl Environ Microbiol 75:6559e6567.

20. Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010) Wrinkles in the rarebiosphere: Pyrosequencing errors can lead to artificial inflation of diversity estimates.Environ Microbiol 12:118e123.

21. Reeder J, Knight R (2009) The ‘rare biosphere’: A reality check. Nat Methods 6:636e637.

22. Quince C, et al. (2009) Accurate determination of microbial diversity from 454 py-rosequencing data. Nat Methods 6:639e641.

23. Caporaso JG, et al. (2011) Global patterns of 16S rRNA diversity at a depth of millionsof sequences per sample. Proc Natl Acad Sci USA 108(Suppl 1):4516e4522.

24. Bartram AK, Lynch MD, Stearns JC, Moreno-Hagelsieb G, Neufeld JD (2011) Genera-tion of multimillion-sequence 16S rRNA gene libraries from complex microbial com-munities by assembling paired-end illumina reads. Appl Environ Microbiol 77:3846e3852.

25. Nawrocki EP, Kolbe DL, Eddy SR (2009) Infernal 1.0: Inference of RNA alignments.Bioinformatics 25:1335e1337.

26. Cole JR, et al. (2009) The Ribosomal Database Project: Improved alignments and newtools for rRNA analysis. Nucleic Acids Res 37(Database issue):D141eD145.

27. Garrity GM, et al. (2007) Taxonomic Outline of the Bacteria and Archaea (Name-sforLife, East Lansing, MI), Release 7.7, 10.1601/TOBA7.7.

28. Peres-Neto PR, Jackson DA (2001) How well do multivariate data sets match? Theadvantages of a Procrustean superimposition approach over the Mantel test. Oeco-logia 129:169e178.

29. Price MN, Dehal PS, Arkin AP (2009) FastTree: Computing large minimum evolutiontrees with profiles instead of a distance matrix. Mol Biol Evol 26:1641e1650.

30. Dethlefsen L, Huse S, Sogin ML, Relman DA (2008) The pervasive effects of an anti-biotic on the human gut microbiota, as revealed by deep 16S rRNA sequencing. PLoSBiol 6:e280.

31. Claesson MJ, et al. (2009) Comparative analysis of pyrosequencing and a phylogeneticmicroarray for exploring microbial community structures in the human distal in-testine. PLoS ONE 4:e6669.

32. Elshahed MS, et al. (2008) Novelty and uniqueness patterns of rare members of thesoil biosphere. Appl Environ Microbiol 74:5422e5428.

33. Walsh DA, et al. (2009) Metagenome of a versatile chemolithoautotroph from ex-panding oceanic dead zones. Science 326:578e582.

34. Brulc JM, et al. (2009) Gene-centric metagenomics of the fiber-adherent bovine ru-men microbiome reveals forage specific glycoside hydrolases. Proc Natl Acad Sci USA106:1948e1953.

14642 | www.pnas.org/cgi/doi/10.1073/pnas.1111435108 Sul et al.

Dow

nloa

ded

by g

uest

on

June

15,

202

0