Michigan State University, 567 Wilson Road, Rm. 2225A...



Poster PDF: rdp.cme.msu.edu/download/posters/MSA2014_RDP.pdf

RDP is committed to making our resources broadly available. In addition to web-based services, RDP now distributes many of its processing and analysis tools as stand-alone, open-source versions through http://github.com/rdpstaff and http://sourceforge.net/projects/rdp-classifier/

RDPTools package: a comprehensive tool suite comprising the most commonly used tools from the RDP, RDPipeline, and RDP FunGene websites, plus user-requested command-line tools, such as:

! GFClassify (fast and accurate classification of amplicon sequences from closely homologous gene families),

! KmerFilter (kmer analysis),

! PrimerMatch (a Primer/Probe Match tool),

! ReadSeq (a Java-based common sequence file format reader and sequence file manipulator),

! Xander (a gene-targeted metagenomic assembly tool), and more.

Amplicon analysis pipeline for Fungal LSU and Prokaryote SSU sequences

!  Account-based access for managing job submission, monitoring status, and downloading results.

! Defined Community Analysis Tool calculates the observed error rate.

!  Chimera Check tool powered by UCHIME.

! Optimized paired-end read assembler; handles complex overlap layouts such as reads past the 5’ end of the primer.

!  Accepts compressed file upload.

!  Infernal 1.1 aligners with larger processing capacity.

!  Improved clustering tools.

!  Input validators redesigned to use a ‘Fail-fast’ system to identify common errors and alert user before processing starts.
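As an illustration of the fail-fast idea in the last bullet, a validator can stop at the first problem it finds and report it before any processing is queued. The Python sketch below is hypothetical: the function name `validate_fasta` and the specific checks (data before the first header, empty records, duplicate IDs, non-IUPAC characters) are assumptions for illustration, not RDPipeline's actual validators.

```python
# Hypothetical fail-fast FASTA check; the rules below are illustrative assumptions.
VALID_BASES = set("ACGTURYSWKMBDHVN-.")  # IUPAC nucleotide codes plus gap characters


def validate_fasta(lines):
    """Raise ValueError at the first problem found; return the record count if clean."""
    seen_ids = set()
    count = 0
    have_header = False
    residues = 0  # residues seen so far in the current record
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if have_header and residues == 0:
                raise ValueError(f"line {lineno}: previous record has no sequence")
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {lineno}: header has no sequence ID")
            if fields[0] in seen_ids:
                raise ValueError(f"line {lineno}: duplicate sequence ID {fields[0]!r}")
            seen_ids.add(fields[0])
            have_header = True
            residues = 0
            count += 1
        else:
            if not have_header:
                raise ValueError(f"line {lineno}: sequence data before first header")
            bad = set(line.upper()) - VALID_BASES
            if bad:
                raise ValueError(f"line {lineno}: invalid characters {sorted(bad)}")
            residues += len(line)
    if have_header and residues == 0:
        raise ValueError("last record has no sequence")
    if count == 0:
        raise ValueError("no FASTA records found")
    return count
```

Because the function raises on the first error, a caller can reject an upload immediately instead of discovering the problem partway through a long job.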

RDPipeline for Fungal Amplicon Sequencing

   

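[Figure: three Venn diagrams comparing the WIU, CSIRO, and UNITE reference sets. Panels, with region counts in extraction order: "Shared sequences with matching genus label" (1459, 3500, 267, 1247, 5993, 19433, 36598); "Shared genera" (273, 682, 770, 15, 84, 40, 1336); "Shared sequences" (1568, 3589, 281, 1332, 5785, 19245, 36386).]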

! Readily integrated as modules in customized workflow for individual tasks.

! More processing/analysis options.

!  Supported with detailed instructions and tutorials.

All methods used are implemented in Java as part of the RDP Classifier package, which is publicly available from RDP’s repositories on SourceForge and GitHub (some new methods will be available soon).

Leave-one-sequence-out testing: in each iteration, one sequence from the training set was chosen as the test sequence and removed from the training set. The assignment produced by the Classifier for that sequence was compared to its original taxonomy label to measure the accuracy of the Classifier.
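The leave-one-sequence-out loop can be sketched in Python as follows. This is an illustration only: `classify` is a hypothetical stand-in (nearest neighbor by shared 8-mers), not the RDP Classifier, and the `(sequence, label)` input layout is an assumption.

```python
def kmers(seq, k=8):
    """Distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(query, training):
    """Stand-in classifier: label of the training sequence sharing the most k-mers."""
    q = kmers(query)
    best = max(training, key=lambda item: len(q & kmers(item[0])))
    return best[1]


def leave_one_sequence_out(labeled):
    """labeled: list of (sequence, label). Returns the fraction classified correctly."""
    correct = 0
    for i, (seq, label) in enumerate(labeled):
        # Hold out the test sequence; train on everything else.
        training = labeled[:i] + labeled[i + 1:]
        if classify(seq, training) == label:
            correct += 1
    return correct / len(labeled)
```

A sequence that is the only member of its taxon can never be classified correctly under this scheme, because its label disappears from the training set when it is held out.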

Leave-one-taxon-out testing: similar to leave-one-sequence-out testing, except that for each test sequence, the lowest taxon to which that sequence was assigned (either the species or the genus node) was removed from the training set. This tests how likely the Classifier is to assign a sequence to the correct genus or higher taxon when its species or genus is not present in the training set.
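A minimal Python sketch of leave-one-taxon-out testing, under stated assumptions: the lowest taxon is taken to be the genus, accuracy is scored at the family rank, and the nearest-neighbor rule by shared 8-mers is a hypothetical stand-in for the RDP Classifier.

```python
def kmers(seq, k=8):
    """Distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def leave_one_taxon_out(labeled):
    """labeled: list of (sequence, family, genus).

    For each test sequence, every sequence from its genus is withheld from
    training; the prediction is scored at the family rank.
    """
    correct = tested = 0
    for seq, family, genus in labeled:
        training = [row for row in labeled if row[2] != genus]
        if not training:
            continue  # nothing left to train on
        tested += 1
        q = kmers(seq)
        # Stand-in classifier: lineage of the training sequence sharing the most k-mers.
        nearest = max(training, key=lambda row: len(q & kmers(row[0])))
        if nearest[1] == family:
            correct += 1
    return correct / tested if tested else 0.0
```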

Sab score: the percentage of 8-mers shared between two sequences, calculated as in RDP SeqMatch except that 8-mers were used here.
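One way to compute such a score is sketched below in Python. The normalization is an assumption: shared distinct 8-mers are divided by the number of distinct 8-mers in whichever sequence has fewer, which is our reading of the SeqMatch definition. The function names are illustrative.

```python
def kmers(seq, k=8):
    """Distinct k-mers of a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sab_score(a, b, k=8):
    """Shared distinct k-mers divided by the smaller distinct k-mer count (assumed)."""
    ka, kb = kmers(a, k), kmers(b, k)
    smaller = min(len(ka), len(kb))
    if smaller == 0:
        return 0.0  # a sequence shorter than k has no k-mers
    return len(ka & kb) / smaller
```

Multiply by 100 to express the result as a percentage.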

Taxa Similarity: for each pair of sequences from a set, we calculated the Sab score and added the score to the lowest common ancestor taxon of the two sequences. This is intended to measure how similar sequences are within or between taxa. The Sab scores for each rank were used to generate box-and-whisker plots and cumulative plots. The three “rmdup” sets were used for these tests.
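The pairwise bookkeeping can be sketched in Python as below. Assumptions: each record is a `(sequence, lineage)` pair with the lineage given as a tuple ordered from domain to species, scores are pooled by the rank of the lowest common ancestor, and the Sab normalization (smaller distinct k-mer set) is our reading of SeqMatch. All names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sab_score(a, b, k=8):
    ka, kb = kmers(a, k), kmers(b, k)
    smaller = min(len(ka), len(kb))
    return len(ka & kb) / smaller if smaller else 0.0


def lca_rank(lin_a, lin_b):
    """Deepest rank at which two lineages still agree, or None if they share nothing."""
    rank = None
    for name, a, b in zip(RANKS, lin_a, lin_b):
        if a != b:
            break
        rank = name
    return rank


def taxa_similarity(records, k=8):
    """Pool pairwise Sab scores by the rank of each pair's lowest common ancestor."""
    scores = defaultdict(list)
    for (seq_a, lin_a), (seq_b, lin_b) in combinations(records, 2):
        rank = lca_rank(lin_a, lin_b)
        if rank is not None:
            scores[rank].append(sab_score(seq_a, seq_b, k))
    return dict(scores)
```

Scores pooled under "species" are intra-species comparisons; scores pooled under "genus" compare different species within one genus, and so on up the hierarchy. The number of pairs grows quadratically with set size.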

Methods for Comparing Fungal ITS Reference Sets

!  Browse sequence collections and choose subsets for further analysis

!  Build phylogenetic trees

!  Test primers and probes for coverage

! Download aligned sequences

Additional tools are specialized to process protein coding gene amplicon data.

Newly enhanced FrameBot: frameshift-corrected protein and nucleotide sequences, nearest matches to a reference set; 10-100x increase in throughput without indexing

Updated mcClust: memory-efficient hierarchical clustering tool

New Command-Line Tools available for workflow process (visit SourceForge and GitHub)

USEARCH newly incorporated for chimera detection

Enhanced Defined Community Analysis: evaluate error patterns and rates on reads amplified from a mix of known organisms

Classification Performance Between Sets. Percentage of sequence classifications matching the label in the test set when the Classifier is trained on a different training set.

Intra-Taxon Similarity by Fraction of Matching 8-mers (Sab score)

!  CSIRO showed the highest average similarity within species (intra-species).

!  The intra-species similarity for UNITE was much lower than for CSIRO, and almost identical to the similarity between species in the same genus. This may be due to the “species hypothesis” clustering method used.

! WIU has the highest intra-genus similarity, but without species delineation, this includes intra-species comparisons. Other comparisons only include inter-taxon comparisons at lower ranks.

(1) Center for Microbial Ecology; (2) Department of Microbiology and Molecular Genetics, Michigan State University, 567 Wilson Road, Rm. 2225A, East Lansing, Michigan 48824

Contact: [email protected]

Qiong Wang (1), Benli Chai (1), James M. Tiedje (1,2), and James R. Cole (1)

Poster 1006, 9 June 2014 (Mon 8 PM) Systematics, Abstract ID: 131 rRNA, phylogeny, classification, molecular phylogenetics

http://rdp.cme.msu.edu http://fungene.cme.msu.edu

Cole J.R., Wang Q., Fish J.A., Chai B., McGarrell D.M., Sun Y., Brown C.T., Porras-Alfaro A., Kuske C.R., and Tiedje J.M. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucl. Acids Res. 42 (Database issue): D633-D642 (2014).

Fish J.A., Chai B., Wang Q., Sun Y., Brown C.T., Tiedje J.M., and Cole J.R. FunGene: the Functional Gene Pipeline and Repository. Front. Microbiol. 4:291 (2013).

Koljalg U., Nilsson R.H., Abarenkov K., Tedersoo L., Taylor A., Bahram M., Bates S.T., Bruns T.D., Bengtsson-Palme J., Callaghan T.M., et al. Towards a unified paradigm for sequence-based identification of fungi. Molecular Ecology 22: 5271–5277 (2013).

Liu K.L., Porras-Alfaro A., Kuske C.R., Eichorst S.A., and Xie G. Accurate, rapid taxonomic classification of fungal large-subunit rRNA genes. Appl. Environ. Microbiol. 78:1523-1533 (2012).

Porras-Alfaro A., Liu K-L., Kuske C.R. and Xie G. From Genus to Phylum: Large-Subunit and Internal Transcribed Spacer rRNA Operon Regions Show Similar Classification Accuracies Influenced by Database Composition. Appl. Environ. Microbiol. 80: 829-840 (2014).

Wang Q., Quensen III J.F., Fish J.A., Lee T.-K., Sun Y., Tiedje J.M., and Cole J.R. Ecological patterns of nifH genes in four terrestrial climatic zones explored with targeted metagenomics using FrameBot, a new informatics tool. mBio 4:e00592-13 (2013).

Contact the RDP by email at [email protected]. RDP staff can also be reached by phone at 1-517-432-4998 during normal business hours (EST), or via fax at +1-517-353-8957 (Attn: RDP). Please include details of your question or problem.

RDP’S Mission Includes User Support

RDP is supported by the Office of Science (BER), U.S. Dept. of Energy Grant DE-FG02-99ER62848 and NIEHS Superfund Grant 5P42 ES004911

Acknowledgements

References

Comparison of Three Fungal ITS Reference Sets

Taxonomic composition

Rank          WIU    CSIRO    UNITE
domain          4        1        2
phylum         11        8        9
class          36       39       48
order         118      127      184
family        328      357      626
genus        1142     1601     3061
species        NA     9073    22330
All seqs     8966    24447    41824
Unique seqs  6889    17870    41183

Sequence coverage (%)

                   WIU    CSIRO    UNITE
(Near) Complete   94.6     95.2       93
Incomplete ITS1    2.7      2.4        2
Incomplete ITS2    2.2        2      4.3
Incomplete Both    0.4      0.3      0.6

Classification accuracy at each major taxon rank from leave-one-out testing. RDP’s Classifier was trained on each of the three “rmdup” fungal ITS sets. The current RDP 16S rRNA training set was included for comparison. No bootstrap cutoff was applied in the accuracy calculation.

RDP Classification Performance: within set – Large differences between reference sets

Classification Performance (Shared Sequences Only)

Training \ Test    WIU    CSIRO    UNITE
WIU                  -     91.2     93.5
CSIRO             92.3        -     97.1
UNITE             92.3     97.5        -

Classification Performance (Sequences from Shared Genera)

Training \ Test    WIU    CSIRO    UNITE
WIU                  -     77.7     79.7
CSIRO             73.6        -     87.2
UNITE             66.2     79.5        -

Fungal LSU taxonomic composition in training set 11

Browsers: browse and select fungal LSU sequences from the taxonomic hierarchy or publications; powerful search and selection features

ProbeMatch: fast search for primer/probe sites from the fungal 28S database

SeqMatch: finds nearest neighbors in the LSU dataset; more accurate than BLAST

Workflow Tutorials and a Community-driven Q&A User Wiki

RDP Aligner: NEW Fungal 28S Aligner; Updated Bacterial and Archaeal 16S Aligner

Taxa similarity (measured by Sab score). The 1st quartile, median, and 3rd quartile are shown as the bottom, middle, and top of each box; the 2nd and 98th percentiles are used for the whiskers.
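The five values drawn for each box can be computed as in the Python sketch below. The interpolation rule is an assumption (the poster does not state one); here the standard library's "inclusive" method is used, and the function name `box_stats` is illustrative.

```python
from statistics import quantiles


def box_stats(scores):
    """Box and whisker positions for one rank's pooled Sab scores."""
    # n=100 yields 99 cut points; index p - 1 holds the p-th percentile.
    pct = quantiles(scores, n=100, method="inclusive")
    return {
        "whisker_low": pct[1],    # 2nd percentile
        "q1": pct[24],            # 1st quartile
        "median": pct[49],
        "q3": pct[74],            # 3rd quartile
        "whisker_high": pct[97],  # 98th percentile
    }
```

Using the 2nd and 98th percentiles rather than the minimum and maximum keeps a handful of extreme pairs from stretching the whiskers.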

[Figure: classification accuracy by rank. Y-axis: accuracy, 30% to 100%. X-axis: rank (domain, phylum, class, order, family, genus, species). Series: WIU, CSIRO, UNITE, 16S. Panel titles: "Leave-one-taxon*-out accuracy (* UNITE & CSIRO = species; WIU & 16S = genus)" and "Leave-one-seq-out accuracy".]

LSU Base Position

ITS Reference Sets

WIU ref set: This is a published hand-curated set. The sequences and taxonomy construction of this fungal ITS set were described in detail in Porras-Alfaro et al. (2014). It contains lineages only to the genus level.

CSIRO ref set: An early version from an active curatorial effort kindly provided by Paul Greenfield and Vinita Deshpande of the Australian Commonwealth Scientific and Industrial Research Organization (personal communication). It contains lineages to the species level.  

UNITE ref set: Version No. 6 of the general release downloaded from the UNITE repository (http://unite.ut.ee/repository.php). The “species hypothesis” taxa were constructed using a two-tier clustering process, which first clusters sequences to the subgenus/genus level and then to the finer species level (Koljalg et al., 2013). Of these sequences, 25% have “informal” taxon names such as “__unidentified” in the lineage provided by UNITE. We removed the 1% of sequences with “informal” phylum-level taxon names, but kept the informal taxa at lower ranks.

“rmdup sets”: For each of the reference sets described above, we constructed a unique set by removing any sequence that is identical to, or is a substring of, another sequence in the same training set. These datasets were referred to as “rmdup” sets. Removing duplicates avoids inflated results in classification performance testing.
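The duplicate-removal rule can be sketched in Python as follows. Assumptions: when several sequences are identical, one copy is kept (the first encountered), and the input is a hypothetical `id -> sequence` mapping. The quadratic scan is for clarity only and would be slow on sets of tens of thousands of sequences.

```python
def rmdup(seqs):
    """seqs: dict mapping sequence ID to sequence. Returns the IDs kept."""
    # Longest first, so any containing sequence is seen before what it contains.
    # sorted() is stable, so identical sequences keep their input order.
    order = sorted(seqs, key=lambda sid: len(seqs[sid]), reverse=True)
    kept = []
    for sid in order:
        candidate = seqs[sid]
        # Drop the candidate if it equals, or is a substring of, a kept sequence.
        if any(candidate in seqs[other] for other in kept):
            continue
        kept.append(sid)
    return kept
```

Without this step, a held-out test sequence could have an identical copy left in the training set, which would make leave-one-out accuracy look better than it is.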

Important Environmental Fungal Genes:

!  Carbon Cycling

!  Disease Resistance

And More to Come!

! New: 95,365 aligned and annotated fungal LSU sequences.

!  Aligner uses a new Infernal secondary-structure aware alignment model trained on 183 LSU sequences from finished genomes.

! Updated RDP Fungal 28S Classifier (Liu, et al., 2012) with a new training set (11,442 sequences) providing increased coverage of the Glomeromycota, Chytridiomycota, and other basal lineages and better performance in separating fungal from non-fungal eukaryotes

! New: most RDP tools now work with fungal LSU data:

Sequences from Shared Genera (%)

Genera \ Sequences    WIU    CSIRO    UNITE
WIU                     -     93.4     97.6
CSIRO                76.7        -     99.1
UNITE                53.8     64.6        -

cbh1 - Cellobiohydrolase
lcc - Laccase
exc1 - Exochitinase
mnp - Manganese peroxidase
lip - Lignin peroxidase
vp1 - Versatile peroxidase
glx - Glyoxal oxidase

RDP’s Database & Tools for Fungal Analysis (NEW Rel. 11)

RDP FunGene for Fungal Gene Analysis

RDP Open Source

Rank      Count
Domain        1
Phylum        7
Class        40
Order       139
Family      451
Genus      1789

Few LSU sequences cover the entire alignment

Shared genera contain most sequences.

Few sequences are shared.

Shared sequences cross-classify in matching genera, but unshared sequences do not. This may indicate that shared genera are “circumscribed” differently in the three training sets.