Michigan State University, 567 Wilson Road, Rm. 2225A...
RDP is committed to making our resources broadly available. In addition to web-based services, RDP now distributes many of its processing and analysis tools as stand-alone, open-source versions through http://github.com/rdpstaff and http://sourceforge.net/projects/rdp-classifier/
RDPTools package: a comprehensive tool suite comprising the most commonly used tools from the RDP, RDPipeline, and RDP FunGene websites, plus user-requested command-line tools, such as:
! GFClassify (fast and accurate classification of amplicon sequences from closely homologous gene families),
! KmerFilter (kmer analysis),
! PrimerMatch (a Primer/Probe Match tool),
! ReadSeq (a Java-based common sequence file format reader and sequence file manipulator),
! Xander (a gene-targeted metagenomic assembly tool), and more.
Amplicon analysis pipeline for Fungal LSU and Prokaryote SSU sequences
! Account-based access for managing job submission, monitoring status, and downloading results.
! Defined Community Analysis Tool calculates the observed error rate.
! Chimera Check tool powered by UCHIME.
! Optimized paired-end read assembler; handles complex overlap layouts such as reads past the 5’ end of the primer.
! Accepts compressed file upload.
! Infernal 1.1 aligners with larger processing capacity.
! Improved clustering tools.
! Input validators redesigned to use a ‘Fail-fast’ system to identify common errors and alert user before processing starts.
RDPipeline for Fungal Amplicon Sequencing
[Figure: three Venn diagrams comparing the WIU, CSIRO, and UNITE reference sets. Panels: Shared sequences with matching genus label; Shared genera; Shared sequences.]
! Readily integrated as modules in customized workflow for individual tasks.
! More processing/analysis options.
! Supported with detailed instructions and tutorials.
All the methods used are implemented in Java as part of the RDP Classifier package, which is publicly available in RDP’s repositories on SourceForge and GitHub (some new methods will be available soon).
Leave-one-sequence-out testing: in each iteration, one sequence from the training set was chosen as the test sequence and removed from the training set. The assignment produced by the Classifier for that sequence was compared with its original taxonomy label to measure the accuracy of the Classifier.
Leave-one-taxon-out testing: similar to leave-one-sequence-out testing, except that for each test sequence, the lowest taxon to which that sequence is assigned (either a species or a genus node) was removed from the training set. This tests how likely the Classifier is to assign a sequence to the correct genus or higher taxon when its species or genus is not present in the training set.
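The two leave-one-out procedures can be sketched as below. This is a minimal illustration only: the `nearest` function is a stand-in that labels a query with the training sequence sharing the most 8-mers, not the RDP Classifier itself, and the `Record` type and toy sequences in `RECORDS` are hypothetical.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    genus: str
    species: str
    seq: str


def kmers(seq: str, k: int = 8) -> set:
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nearest(train: List[Record], query: str) -> Record:
    """Stand-in classifier (assumption): training record sharing the most 8-mers."""
    q = kmers(query)
    return max(train, key=lambda r: len(q & kmers(r.seq)))


def leave_one_sequence_out(records: List[Record]) -> float:
    """Remove one sequence at a time; score a hit if its species label is recovered."""
    hits = 0
    for i, test in enumerate(records):
        train = records[:i] + records[i + 1:]
        if nearest(train, test.seq).species == test.species:
            hits += 1
    return hits / len(records)


def leave_one_taxon_out(records: List[Record]) -> float:
    """Remove the test sequence's whole species; score a hit if its genus is recovered."""
    hits = tested = 0
    for test in records:
        train = [r for r in records if r.species != test.species]
        if not train:
            continue  # nothing left to train on
        tested += 1
        if nearest(train, test.seq).genus == test.genus:
            hits += 1
    return hits / tested if tested else 0.0


# Hypothetical toy data: two genera, three species.
RECORDS = [
    Record("Ga", "Ga a", "ACGTACGTAAGGCCTT"),
    Record("Ga", "Ga a", "ACGTACGTAAGGCCTA"),
    Record("Ga", "Ga b", "ACGTACGTAAGGCGGA"),
    Record("Gb", "Gb c", "TTTTGGGGCCCCAAAA"),
    Record("Gb", "Gb c", "TTTTGGGGCCCCAAAT"),
]

if __name__ == "__main__":
    print(leave_one_sequence_out(RECORDS))
    print(leave_one_taxon_out(RECORDS))
```

In the toy data, genus Gb has a single species, so removing that species leaves no Gb sequences to match; this is the situation the leave-one-taxon-out test is designed to expose.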
Sab score: the percentage of shared 8-mers between two sequences, calculated as in RDP SeqMatch except that 8-mers were used here.
Taxa Similarity: for each pair of sequences from a set, we calculated the Sab score and added the score to the lowest common ancestor taxon of the two sequences. This is intended to measure how close the sequences are within or between taxa. The Sab scores for each rank were used to generate box-and-whisker plots and cumulative plots. The three “rmdup” sets were used for these tests.
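A minimal sketch of the Sab score and the taxa similarity bookkeeping follows. It assumes that the shared 8-mer count is normalized by the smaller of the two unique 8-mer sets (our reading of the SeqMatch definition, not stated on this poster), and that each lineage is given as a tuple of taxon names from root to lowest rank; the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=8):
    """Set of unique overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sab(a, b, k=8):
    """Fraction of shared unique k-mers (assumed normalization: smaller k-mer set)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0  # a sequence shorter than k has no k-mers
    return len(ka & kb) / min(len(ka), len(kb))


def lowest_common_ancestor(lineage_a, lineage_b):
    """Last taxon on the common prefix of two root-to-leaf lineages, or None."""
    lca = None
    for x, y in zip(lineage_a, lineage_b):
        if x != y:
            break
        lca = x
    return lca


def taxa_similarity(records, k=8):
    """records: list of (lineage, sequence). Returns {taxon: [Sab scores]}."""
    scores = defaultdict(list)
    for (lin_a, seq_a), (lin_b, seq_b) in combinations(records, 2):
        lca = lowest_common_ancestor(lin_a, lin_b)
        if lca is not None:
            scores[lca].append(sab(seq_a, seq_b, k))
    return dict(scores)
```

Scores collected under a species name are intra-species comparisons, while scores under a genus name come from pairs in different species of that genus. The sketch assumes taxon names are unique across ranks.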
Methods for Comparing Fungal ITS Reference Sets
! Browse sequence collections and choose subsets for further analysis
! Build phylogenetic trees
! Test primers and probes for coverage
! Download aligned sequences
Additional tools are specialized to process protein coding gene amplicon data.
! Newly enhanced FrameBot: frameshift-corrected protein and nucleotide sequences and nearest matches to a reference set; 10-100x increase in throughput without indexing.
! Updated mcClust: memory-efficient hierarchical clustering tool.
! New command-line tools available for workflow processing (visit SourceForge and GitHub).
! USEARCH: newly incorporated for chimera detection.
! Enhanced Defined Community Analysis: evaluates error patterns and rates on reads amplified from a mix of known organisms.
Classification Performance Between Sets. Percentage of sequence classifications matching the label in the test set when the Classifier is trained on a different training set.
Intra Taxon Similarity by fraction of matching 8-mer (Sab score)
! CSIRO showed the highest average similarity within species (intra-species).
! The intra-species similarity for UNITE was much lower than for CSIRO, and almost identical to the similarity between species in the same genus. This may be due to the “species hypothesis” clustering method used.
! WIU has the highest intra-genus similarity, but without species delineation, this includes intra-species comparisons. Other comparisons only include inter-taxon comparisons at lower ranks.
1Center for Microbial Ecology; 2Department of Microbiology and Molecular Genetics Michigan State University, 567 Wilson Road, Rm. 2225A, East Lansing, Michigan 48824
Contact: [email protected]
Qiong Wang1, Benli Chai1, James M. Tiedje1,2 and James R. Cole1
Poster 1006, 9 June 2014 (Mon 8 PM) Systematics, Abstract ID: 131 rRNA, phylogeny, classification, molecular phylogenetics
http://rdp.cme.msu.edu http://fungene.cme.msu.edu
References
Cole J.R., Wang Q., Fish J.A., Chai B., McGarrell D.M., Sun Y., Brown C.T., Porras-Alfaro A., Kuske C.R., and Tiedje J.M. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucl. Acids Res. 42 (Database issue): D633-D642 (2014).
Fish J.A., Chai B., Wang Q., Sun Y., Brown C.T., Tiedje J.M., and Cole J.R. FunGene: the Functional Gene Pipeline and Repository. Front. Microbiol. 4:291 (2013).
Koljalg U., Nilsson R.H., Abarenkov K., Tedersoo L., Taylor A., Bahram M., Bates S.T., Bruns T.D., Bengtsson-Palme J., Callaghan T.M., et al. Towards a unified paradigm for sequence-based identification of fungi. Molecular Ecology 22: 5271–5277 (2013).
Liu K.L., Porras-Alfaro A., Kuske C.R., Eichorst S.A., and Xie G. Accurate, rapid taxonomic classification of fungal large-subunit rRNA genes. Appl. Environ. Microbiol. 78:1523-1533 (2012).
Porras-Alfaro A., Liu K-L., Kuske C.R. and Xie G. From Genus to Phylum: Large-Subunit and Internal Transcribed Spacer rRNA Operon Regions Show Similar Classification Accuracies Influenced by Database Composition. Appl. Environ. Microbiol. 80: 829-840 (2014).
Wang Q., Quensen III J.F., Fish J.A., Lee T.-K., Sun Y., Tiedje J.M., and Cole J.R. Ecological patterns of nifH genes in four terrestrial climatic zones explored with targeted metagenomics using FrameBot, a new informatics tool. mBio 4:e00592-13 (2013).
RDP’s Mission Includes User Support
Contact the RDP by email at [email protected]. RDP staff can also be reached by phone at 1-517-432-4998 during normal business hours (EST) or by fax at +1-517-353-8957 (Attn: RDP). Please include details of your question or problem.
Acknowledgements
RDP is supported by the Office of Science (BER), U.S. Dept. of Energy Grant DE-FG02-99ER62848 and NIEHS Superfund Grant 5P42 ES004911.
Comparison of Three Fungal ITS Reference Sets
Taxonomic composition
Rank          WIU     CSIRO   UNITE
domain        4       1       2
phylum        11      8       9
class         36      39      48
order         118     127     184
family        328     357     626
genus         1142    1601    3061
species       NA      9073    22330
All seqs      8966    24447   41824
Unique seqs   6889    17870   41183
Sequence coverage (%)
                  WIU     CSIRO   UNITE
(Near) Complete   94.6    95.2    93
Incomplete ITS1   2.7     2.4     2
Incomplete ITS2   2.2     2       4.3
Incomplete Both   0.4     0.3     0.6
Classification accuracy at each major taxon rank from leave-one-out testing. RDP’s Classifier was trained on each of the three “rmdup” fungal ITS sets. The current RDP 16S rRNA training set was included for comparison. No bootstrap cutoff was applied in the accuracy calculation.
RDP Classification Performance: within set – Large differences between reference sets
Classification Performance (Shared Sequences Only)
Training \ Test   WIU     CSIRO   UNITE
WIU               -       91.2    93.5
CSIRO             92.3    -       97.1
UNITE             92.3    97.5    -
Classification Performance (Sequences from Shared Genera)
Training \ Test   WIU     CSIRO   UNITE
WIU               -       77.7    79.7
CSIRO             73.6    -       87.2
UNITE             66.2    79.5    -
Fungal LSU taxonomic composition in training set 11
! Browsers: browse and select fungal LSU sequences from the taxonomic hierarchy or publications, with powerful search and selection features.
! ProbeMatch: fast search for primer/probe sites in the fungal 28S database.
! SeqMatch: finds the nearest neighbor in the LSU dataset; more accurate than BLAST.
! Workflow tutorials and a community-driven Q&A user wiki.
! RDP Aligner: NEW fungal 28S aligner; updated bacterial and archaeal 16S aligner.
Taxa similarity (measured by Sab score). The 1st quartile, median, and 3rd quartile are shown as the bottom, middle, and top of each box; the 2nd and 98th percentiles are used for the whiskers.
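The five values behind each box can be computed as in the following sketch. The poster does not state which percentile interpolation was used, so the `method="inclusive"` choice and the function name `box_plot_summary` are assumptions made for illustration.

```python
from statistics import quantiles


def box_plot_summary(scores):
    """Box values (quartiles) and whisker values (2nd and 98th percentiles).

    Needs at least two scores; the interpolation method is an assumption.
    """
    data = sorted(scores)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    pct = quantiles(data, n=100, method="inclusive")  # 99 cut points
    return {
        "whisker_low": pct[1],    # 2nd percentile
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": pct[97],  # 98th percentile
    }
```

Using the 2nd and 98th percentiles rather than the minimum and maximum keeps a handful of extreme pairwise scores from stretching the whiskers.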
[Figure: classification accuracy (y-axis, 30% to 100%) by rank (domain, phylum, class, order, family, genus, species) for WIU, CSIRO, UNITE, and 16S. Panels: Leave-one-taxon*-out accuracy (* UNITE & CSIRO = species; WIU & 16S = genus); Leave-one-seq-out accuracy.]
[Figure: x-axis, LSU Base Position.]
ITS Reference Sets
WIU ref set: This is a published, hand-curated set. The sequences and taxonomy construction of this fungal ITS set were described in detail in Porras-Alfaro et al. (2014). It contains lineages only to the genus level.
CSIRO ref set: An early version from an active curatorial effort kindly provided by Paul Greenfield and Vinita Deshpande of the Australian Commonwealth Scientific and Industrial Research Organization (personal communication). It contains lineages to the species level.
UNITE ref set: Version No. 6 of the general release downloaded from the UNITE repository (http://unite.ut.ee/repository.php). The “species hypothesis” taxa were constructed using a two-tier clustering process, which first clusters sequences to the subgenus/genus level and then to the finer species level (Koljalg et al., 2013). Of these sequences, 25% have “informal” taxon names such as “__unidentified” in the lineage provided by UNITE. We removed the 1% of the sequences with “informal” phylum-level taxon names, but kept the informal taxa at lower ranks.
“rmdup sets”: For each of the reference sets described above, we constructed a unique set by removing any sequence that is identical to, or is a substring of, another sequence in the same training set. These datasets are referred to as “rmdup” sets. Removing duplicates avoids inflated results in classification performance testing.
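The duplicate-removal rule can be sketched as follows. This is an illustrative quadratic-time version, not the RDP implementation; the function name `remove_duplicates`, the case normalization, and the tie-break by identifier among identical sequences are assumptions.

```python
def remove_duplicates(seqs):
    """Drop sequences identical to, or a substring of, another sequence.

    seqs: {sequence id: sequence string}. Longest sequences are visited
    first, so any sequence that contains a later one has already been
    considered; among identical sequences the smallest id is kept.
    """
    kept = {}
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        s = seq.upper()  # assumption: comparison ignores case
        if not any(s in other for other in kept.values()):
            kept[seq_id] = s
    return kept
```

Comparing only against sequences already kept is sufficient: if a sequence is contained in one that was itself dropped, it is also contained in the kept sequence that caused that drop.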
Important Environmental Fungal Genes:
! Carbon Cycling
! Disease Resistance
And More to Come!
! New: 95,365 aligned and annotated fungal LSU sequences.
! Aligner uses a new Infernal secondary-structure aware alignment model trained on 183 LSU sequences from finished genomes.
! Updated RDP Fungal 28S Classifier (Liu et al., 2012) with a new training set (11,442 sequences), providing increased coverage of the Glomeromycota, Chytridiomycota, and other basal lineages, and better performance in separating fungal from non-fungal eukaryotes.
! New: most RDP tools now work with fungal LSU data:
Sequences from Shared Genera (%)
Genera \ Sequences   WIU     CSIRO   UNITE
WIU                  -       93.4    97.6
CSIRO                76.7    -       99.1
UNITE                53.8    64.6    -
cbh1 - Cellobiohydrolase
lcc - Laccase
exc1 - Exochitinase
mnp - Manganese peroxidase
lip - Lignin peroxidase
vp1 - Versatile peroxidase
glx - Glyoxal oxidase
RDP’s Database & Tools for Fungal Analysis (NEW Rel. 11)
RDP FunGene for Fungal Gene Analysis
RDP Open Source
Rank      Count
Domain    1
Phylum    7
Class     40
Order     139
Family    451
Genus     1789
Few LSU sequences cover the entire alignment
Shared genera contain most sequences.
Few sequences are shared.
Shared sequences cross-classify in matching genera, but unshared sequences do not. This may indicate that shared genera are “circumscribed” differently in the three training sets.