Elucidating factors behind pair wise distances discrepancies between short and near full-length...



Elucidating factors behind pairwise distance discrepancies between short and near full-length sequences.

We hypothesized that, because the 16S rRNA molecule is made of sites with varying levels of evolutionary conservation, the proportion of these sites in a specific amplicon would affect the pairwise distance values obtained from the dataset. To test this, we used the classification put forward in the reviews of Baker et al. (2) and Van de Peer et al. (10), which assigns every base pair in the 16S rRNA gene of E. coli to one of three classes: conserved (C), variable (V), or highly variable (HV). We determined the percentage of C, V, and HV base pairs in each of the pyrosequencing fragments and compared it to that of the near full-length fragment. We then used multiple regression and tested all possible combinations of percentages and ratios of C, V, and HV bases. The best model equation obtained was:

y (slope) = (30.5 x C/total) + (11.5 x HV/V) - (27.9 x HV/total) - (8.5 x C/V) + (5.25 x HV/C) - (0.001 x length) - 4.79
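The regression equation can be evaluated directly on the base compositions listed in Table 1. The following is a minimal sketch, not the authors' Excel workflow. It assumes that C/total and HV/total are fractions of all bases in the fragment (percentage divided by 100) and that fragment length is the end position minus the start position plus one; under that assumption V6 is 99 bp and V7+V8 is 361 bp, which matches the 99-361 bp range given in the abstract. The name predicted_slope is hypothetical.

```python
# Position ranges and percentages of V, HV, and C bases, copied from Table 1.
# Tuple layout: (start, end, percent V, percent HV, percent C)
REGIONS = {
    "V1+V2": (27, 355, 47, 18, 35),
    "V3": (338, 548, 44, 14, 42),
    "V4": (530, 826, 57, 5, 38),
    "V5+V6": (805, 1065, 49, 10, 41),
    "V6": (967, 1065, 45, 19, 36),
    "V6+V7": (967, 1238, 44, 9, 47),
    "V7": (1046, 1238, 40, 3, 57),
    "V7+V8": (1046, 1406, 43, 5, 52),
}


def predicted_slope(start, end, v, hv, c):
    """Evaluate the poster's best regression model for one fragment."""
    total = v + hv + c          # 100 when the inputs are percentages
    length = end - start + 1    # assumption: inclusive position range, in bp
    return (
        30.5 * c / total
        + 11.5 * hv / v
        - 27.9 * hv / total
        - 8.5 * c / v
        + 5.25 * hv / c
        - 0.001 * length
        - 4.79
    )


if __name__ == "__main__":
    for name, row in REGIONS.items():
        print(f"{name:6s} predicted slope = {predicted_slope(*row):.2f}")
```

The printed values can be set beside the observed slopes in the last row of Table 2 to see how closely the model tracks each region.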

A Comparative Study of Species Richness Estimates Obtained Using Near Complete Fragments and Simulated Pyrosequencing-Generated Fragments in 16S rRNA Gene-Based Environmental Surveys

N. H. Youssef¹, C. S. Sheik², L. R. Krumholz², F. Z. Najar², B. A. Roe², M. S. Elshahed¹; ¹Oklahoma State Univ., Stillwater, OK, ²Univ. of Oklahoma, Norman, OK.

N-088

Abstract

It is not yet clear how the number of operational taxonomic units (OTUs), and hence species richness estimates, determined using pyrosequencing-generated fragments correlate with those assigned using near full-length 16S rRNA gene fragments. We constructed a 16S rRNA clone library from an undisturbed tall grass prairie soil (1132 clones), and used it to compare species richness estimates using 8 pyrosequencing-candidate fragments (99-361 bp in length) to the near full-length fragment. While fragments encompassing the V1+V2, and V6 regions overestimated species richness, those encompassing V3, V7, and V7+V8 regions underestimated species richness, and those encompassing the V4, V5+V6, and V6+V7 provided estimates comparable to the near full-length fragment. Similar results were obtained when analyzing three other datasets. Regression analysis indicated base variability within an examined fragment could potentially explain those differences.

Introduction

Typical culture-independent 16S rRNA gene surveys of highly diverse ecosystems allow for the identification of only the abundant members of the communities (1). The estimates obtained are highly dependent on sample size. The large number of 16S rRNA gene sequences produced with pyrosequencing (7) allows access to rare members of the community (4), as well as a relatively more accurate estimation of species richness. However, it is unclear how pairwise distances, and hence operational taxonomic unit (OTU) assignments and species richness estimates, computed using various shorter fragments will correlate with those computed using the near-complete 16S rRNA gene. Here, we constructed, sequenced, and analyzed a 16S rRNA library of 1132 clones, and compared the OTU numbers and species richness values obtained using the full-length dataset with those obtained using fragments simulating pyrosequencing output. We show that the choice of the pyrosequenced fragment could affect the number of OTUs and the species richness estimates, with some fragments underestimating and others overestimating species richness when compared to longer, near-complete 16S rRNA gene fragments. Further, we established a regression analysis that explains the nature of the observed discrepancy using the proportion of the hypervariable, variable, and conserved bases within a fragment.

Comparing species richness estimates in short and long fragments at various taxonomic cutoffs in Soil Okla-A clone library.

All three species richness estimation methods as well as slopes of scatter plots were in general agreement with each other, as well as with results obtained from OTU assignments in describing the relationship between long and short fragments.

Table 3. Parametric species richness estimates obtained using the near full-length sequences and each of the 8 short simulated regions studied at 5 different taxonomic cutoffs for Soil-Okla-A clone library

Cutoff  V1+V2      V3         V4         V5+V6      V6         V6+V7      V7         V7+V8      NFL
3       4589 178   2273 91    3428 125   2759 125   3179 121   2688 108   1209 57    1959 87    2819 98
6       2360 94    1397 68    2223 100   1539 68    2302 95    1684 79    547 32     706 38     1790 80
8       1616 70    686 34     1034 48    1129 53    1488 62    1023 50    343 22     383 20     1036 51
10      1275 59    627 37     792 40     904 46     1407 63    865 48     207 13     288 17     912 52
15      737 41     190 11     305 17     324 17     845 41     292 19     91 7       123 10     347 24

Table 1. Variable sites encompassed, and base composition for the short simulated regions studied and the near full-length fragment

Percentage of V (variable), HV (highly variable), and C (conserved) bases. NFL: Near full-length sequences

                                  Percentage of bases
Regions      Variable regions     V     HV    C
27-355       V1+V2                47    18    35
338-548      V3                   44    14    42
530-826      V4                   57    5     38
805-1065     V5+V6                49    10    41
967-1065     V6                   45    19    36
967-1238     V6+V7                44    9     47
1046-1238    V7                   40    3     57
1046-1406    V7+V8                43    5     52
             NFL                  51    10    39

Comparison of number of OTUs obtained using near complete and shorter fragments

The number of OTUs obtained using short simulated fragments ranged from 0.44 to 2.10 times the values obtained using the near full-length 16S. Fragments encompassing regions V1+V2 and V6 overestimated the number of OTUs at all taxonomic cutoffs. Fragments encompassing the V3, V7, and V7+V8 regions underestimated OTU numbers. Fragments encompassing V4, V5+V6, and V6+V7 gave, in general, OTU numbers comparable to the full sequence, as further evidenced by slope values of 0.97, 1, and 0.98, respectively (Table 2).

Materials and methods

Site. Undisturbed tall grass prairie soil in central Oklahoma.

DNA extraction. FastDNA spin kit for soil.

PCR and cloning. Primers 8f-1492r. TOPO-TA cloning kit.

Chimera. Bellerophon (version 3) function on Greengenes.

Alignments. ClustalX program, Greengenes NAST aligner.

Clipping of shorter fragments. Jalview (3).

Distance matrix, OTU assignments. PAUP. DOTUR. Scatter plot slopes.

Species richness estimates. Chao, and ACE estimators. Six parametric distributions (http://www.stat.cornell.edu/~bunge/).

Other environments. Another soil ecosystem (5), digestive tract of Zebrafish (8), and ocean floor microbial community (9).

Regression analysis. Multiple regression using MS Excel.
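The methods name the Chao estimator without giving its formula. As an illustration only, the sketch below computes the classic Chao1 estimate from a list of OTU sizes (number of clones per OTU), falling back to the bias-corrected form when no doubletons are present. Which variant the authors' software applied is not stated on the poster, so the choice here is an assumption, and the function name chao1 and the example counts are hypothetical.

```python
from collections import Counter


def chao1(otu_sizes):
    """Chao1 richness estimate from per-OTU clone counts.

    Classic form:        S_obs + F1^2 / (2 * F2)
    With no doubletons:  S_obs + F1 * (F1 - 1) / 2   (bias-corrected form)
    where F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    sizes = [n for n in otu_sizes if n > 0]   # ignore empty OTUs
    s_obs = len(sizes)                        # observed number of OTUs
    freq = Counter(sizes)
    f1, f2 = freq[1], freq[2]
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2


if __name__ == "__main__":
    # Hypothetical library: 8 OTUs, 4 singletons, 2 doubletons.
    print(chao1([5, 3, 2, 2, 1, 1, 1, 1]))
```

Because the estimate depends only on the singleton and doubleton counts, any fragment that splits or merges rare OTUs differently from the near full-length sequence will shift the estimate, which is the effect the poster quantifies.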

Comparing OTUs, species richness estimates and slopes of scatter plots in short and long fragments in libraries derived from other ecosystems.

Trends in OTU numbers and scatter plot slopes for a Trembling Aspen soil (1152 clones), the digestive tract of Zebrafish (612 clones), and microbial communities inhabiting the ocean crust in the east Pacific ridge (902 clones) were strikingly similar to those observed with the Soil Okla-A clone library (Table 4). Species richness estimates for these three environments mirrored the same trends (data not shown).
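One simple way to put a number on "strikingly similar" is to ask, for each region, how far the slopes of the other three libraries stray from the KFS row of Table 4 (whose values match the slope row of Table 2). The sketch below does this with the Table 4 values. The maximum absolute deviation is a metric chosen here for illustration, not one used on the poster, and max_deviation_from is a hypothetical helper.

```python
# Slopes copied from Table 4, in the column order of REGIONS.
REGIONS = ["V1+V2", "V3", "V4", "V5+V6", "V6", "V6+V7", "V7", "V7+V8"]
SLOPES = {
    "Trembling Aspen soil": [1.23, 0.87, 0.97, 1.05, 1.8, 1.07, 0.68, 0.73],
    "Zebrafish gut": [1.27, 0.72, 0.94, 1.02, 1.35, 0.96, 0.56, 0.65],
    "Basalt Oceanic Floor": [1.24, 0.86, 0.96, 1.02, 1.74, 0.94, 0.56, 0.66],
    "KFS": [1.2, 0.88, 0.97, 1, 1.67, 0.98, 0.6, 0.65],
}


def max_deviation_from(reference="KFS"):
    """Largest absolute slope difference from the reference library, per region."""
    ref = SLOPES[reference]
    deviations = {}
    for i, region in enumerate(REGIONS):
        others = [row[i] for name, row in SLOPES.items() if name != reference]
        deviations[region] = max(abs(value - ref[i]) for value in others)
    return deviations


if __name__ == "__main__":
    for region, dev in max_deviation_from().items():
        print(f"{region:6s} max |slope - KFS slope| = {dev:.2f}")
```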

Conclusions

• Regions V1+V2 and V6 overestimate diversity, regions V3, V7, and V7+V8 underestimate diversity, and regions V4, V5+V6, and V6+V7 give estimates comparable to near full-length fragments.

• This pattern held true for the various environments tested.

• The bias in species richness estimates could readily be explained by base variability.

• While previous studies suggested using region V4 for phylogenetic studies (6, 11), our evaluation of species richness suggests that the V4, V5+V6, and V6+V7 regions provide estimates closest to longer fragments. Collectively, the V4-encompassing region appears to be the best choice for both phylogenetic assignment and species richness estimation.

• Based on this study, we recommend the use of fragments (V4, V5+V6, V6+V7) for pyrosequencing studies concerned with species-richness determination in microbial communities.

References 

1. Axelrood, P. E., et al. 2002. Can. J. Microbiol. 48:655-674.
2. Baker, G. C., et al. 2003. J. Microbiol. Methods 55:541-555.
3. Clamp, M., et al. 2004. Bioinformatics 20:426-427.
4. Huber, J. A., et al. 2007. Science 318:97-100.
5. Lesaulnier, C., et al. 2008. Environ. Microbiol. 10:926-941.
6. Liu, Z., et al. 2007. Nucleic Acids Res. 35:120-130.
7. Margulies, M., et al. 2005. Nature 437:376-380.
8. Rawls, J. F., et al. 2006. Cell 127:423-433.
9. Santelli, C. M., et al. 2008. Nature 453:653-656.
10. Van de Peer, Y., et al. 1996. Nucleic Acids Res. 24:3381-3391.
11. Wang, Q., et al. 2007. Appl. Environ. Microbiol. 73:5261-5267.

Table 2. Number of OTUs and ratios of species richness estimates obtained using the near full-length sequences and each of the 8 short simulated regions studied at 5 different taxonomic cutoffs for Soil-Okla-A clone library.

Cutoff   V1+V2        V3           V4           V5+V6        V6           V6+V7        V7           V7+V8        NFL
         OTU  Ratio   OTU  Ratio   OTU  Ratio   OTU  Ratio   OTU  Ratio   OTU  Ratio   OTU  Ratio   OTU  Ratio   OTU
3        652  1.3     495  0.68    619  1.09    570  0.92    584  1       514  0.8     340  0.4     412  0.54    639
6        499  1.48    331  0.76    414  1.19    393  0.96    479  1.33    375  0.9     194  0.37    241  0.45    397
8        407  1.54    261  0.74    327  1.11    328  1.19    418  1.54    300  0.96    149  0.38    187  0.47    306
10       345  1.52    203  0.76    267  1.12    267  1.12    370  1.69    241  0.92    112  0.31    143  0.43    242
15       233  2       122  0.69    158  1.07    166  1.07    283  2.41    142  0.93    61   0.35    73   0.43    135
Slope    1.2          0.88         0.97         1            1.67         0.98         0.6          0.65         NA

Table 4. Slopes obtained for 3 different clone libraries derived from soil, zebrafish gut, and ocean floor as compared to KFS.

Environment            V1+V2  V3    V4    V5+V6  V6    V6+V7  V7    V7+V8
Trembling Aspen soil   1.23   0.87  0.97  1.05   1.8   1.07   0.68  0.73
Zebrafish gut          1.27   0.72  0.94  1.02   1.35  0.96   0.56  0.65
Basalt Oceanic Floor   1.24   0.86  0.96  1.02   1.74  0.94   0.56  0.66
KFS                    1.2    0.88  0.97  1      1.67  0.98   0.6   0.65