Improving Interoperability of Text Mining Tools with BioC


Transcript of Improving Interoperability of Text Mining Tools with BioC


Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu

National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, 20894

Abstract

The lack of interoperability among text mining tools is a major bottleneck in building more complex applications. Although numerous methods and techniques are available for various text mining tasks, combining different tools requires substantial time and effort. In response, BioC offers a minimalistic approach to tool interoperability that requires only minimal changes to existing tools and applications. In this study, we introduce several state-of-the-art text mining tools developed at the NCBI and modify them to make them BioC compatible. Our toolkit can be accessed at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/.

Acknowledgments

We would like to thank Don Comeau, Rezarta Doğan, and John Wilbur for their discussion and help with the BioC tools. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.

References

Wei, C.H., Harris, B.R., Kao, H.Y., Lu, Z. (2013) tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics, 29(11), 1433-1439.
Wei, C.H., Kao, H.Y., Lu, Z. (2012) SR4GN: a species recognition software tool for gene normalization. PLoS ONE, 7, e38460.
Leaman, R., Islamaj Dogan, R., Lu, Z. (2013) DNorm: disease name normalization with pairwise learning to rank. Bioinformatics.
Wei, C.H., Kao, H.Y., Lu, Z. (2013) PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, 41, W518-522.
Wei, C.H., Kao, H.Y. (2011) Cross-species gene normalization by species inference. BMC Bioinformatics, 12 Suppl 8, S5.

The NCBI Text Mining Toolkit

[Figure: Concept Recognition and Annotation Toolkit. Input: PubMed abstracts or full-text articles. Each tool produces annotations for one bioconcept:

DNorm: disease mentions with MEDIC IDs
tmVar: mutation mentions
SR4GN: species mentions with Taxonomy IDs
tmChem: chemical mentions
GenNorm: gene mentions with Entrez IDs

Output: annotations for various bioconcepts.]

Table 1. Summary of Concept Recognition Tools

NER tool   Bioconcept   Programming language(s)   Method        F-measure
tmChem     Chemical     Java, Perl, C++           CRF           88.27%
DNorm      Disease      Java                      CRF           80.90%
tmVar      Mutation     Perl, C++                 CRF           91.39%
SR4GN      Species      Perl                      Rule based    85.42%
GenNorm    Gene         Perl                      Statistical   92.89%

Table 2. Compatible input/output formats

Format columns: PubMed/PMC XML, BioC, Free Text, PubTator, GenNorm

Tool       Bioconcept   Supported formats
tmChem     Chemical     √ √ √
DNorm      Disease      √ √ √
tmVar      Mutation     √ √ √ √
SR4GN      Species      √ √ √ √
GenNorm    Gene         √ √ √ √
PubTator   N/A          √ √ √

Building BioC Compatible Versions

Figure 1. The NCBI Toolkit

Figure 2. Tool Features

- tmChem achieved the best performance in the BioCreative IV CHEMDNER task on chemical entity mention recognition.
- GenNorm achieved the best performance in the BioCreative III Gene Normalization task.
- DNorm achieved the best performance in the 2013 ShARe/CLEF shared task for normalizing disease names in clinical notes.

Conclusions

BioC was easy to learn and straightforward to implement. Only minimal changes were required to re-package the NCBI toolkit with BioC. Our tools are now interoperable with each other and with several other tools, which makes it possible to build more powerful text mining applications and supports wider usage.

Figure 3. BioC Input and Output Format

BioC comprises:
- an XML format describing how to present text documents and annotations, and
- functions to read and write documents in the BioC XML format.

Steps to build BioC compatible tools:
- Modify the input/output format of the tool.
- Create a key file to interpret the BioC annotation file.
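To make the read side of this concrete, here is a minimal Python sketch that parses a small BioC-style XML document and lists its annotations. It uses only the standard library rather than the official BioC read/write functions. The sample document, the `identifier` infon key, the key file name, and the function name `read_annotations` are illustrative assumptions, not the output of any NCBI tool; in practice the key file defines which infon keys a collection uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the BioC XML layout (collection > document > passage > annotation).
# The infon key "identifier" and the values below are illustrative assumptions.
SAMPLE_BIOC = """<collection>
  <source>PubMed</source>
  <date>20130101</date>
  <key>example.key</key>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>Mutations in BRCA1 increase the risk of breast cancer.</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <infon key="identifier">672</infon>
        <location offset="13" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Disease</infon>
        <infon key="identifier">D001943</infon>
        <location offset="40" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>
"""


def read_annotations(bioc_xml):
    """Return one dict per annotation found in a BioC XML string."""
    root = ET.fromstring(bioc_xml)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            # Annotation offsets are document-level; the passage offset
            # converts them to positions inside the passage text.
            passage_offset = int(passage.findtext("offset", default="0"))
            passage_text = passage.findtext("text", default="")
            for ann in passage.iter("annotation"):
                infons = {i.get("key"): i.text for i in ann.findall("infon")}
                location = ann.find("location")
                start = int(location.get("offset"))
                length = int(location.get("length"))
                local = start - passage_offset
                mention = ann.findtext("text")
                rows.append({
                    "document": doc_id,
                    "type": infons.get("type"),
                    "identifier": infons.get("identifier"),
                    "offset": start,
                    "mention": mention,
                    # True when the offset and length point at the mention text.
                    "span_matches": passage_text[local:local + length] == mention,
                })
    return rows


if __name__ == "__main__":
    for row in read_annotations(SAMPLE_BIOC):
        print(row)
```

Checking each span against the passage text is a cheap way to catch offset errors when a tool's native output is converted to BioC.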

Table 2. Input/Output formats supported by our tools

Annotation elements labeled in Figure 3: offset, identifiers, mentions, types.
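The four annotation elements named here (offset, identifier, mention, type) can be illustrated on the write side as well. The sketch below wraps a tool's native output in BioC-style passage and annotation elements. It assumes a hypothetical tool output of `(start, end, type, identifier)` tuples and a hypothetical `identifier` infon key; the function names are illustrative and are not part of the BioC library or of any NCBI tool.

```python
import xml.etree.ElementTree as ET


def make_annotation(ann_id, concept_type, identifier, offset, mention):
    """Build one BioC-style <annotation> element from the four fields."""
    ann = ET.Element("annotation", id=str(ann_id))
    ET.SubElement(ann, "infon", key="type").text = concept_type
    # "identifier" is an assumed infon key; a real key file defines the names.
    ET.SubElement(ann, "infon", key="identifier").text = identifier
    ET.SubElement(ann, "location", offset=str(offset), length=str(len(mention)))
    ET.SubElement(ann, "text").text = mention
    return ann


def annotate_passage(text, mentions, passage_offset=0):
    """Wrap passage text and (start, end, type, identifier) tuples in a <passage>."""
    passage = ET.Element("passage")
    ET.SubElement(passage, "offset").text = str(passage_offset)
    ET.SubElement(passage, "text").text = text
    for n, (start, end, concept_type, identifier) in enumerate(mentions):
        passage.append(
            make_annotation(
                n,
                concept_type,
                identifier,
                passage_offset + start,  # document-level offset
                text[start:end],         # mention string taken from the text
            )
        )
    return passage


if __name__ == "__main__":
    sentence = "Mutations in BRCA1 increase the risk of breast cancer."
    tool_output = [(13, 18, "Gene", "672")]  # illustrative values
    element = annotate_passage(sentence, tool_output)
    print(ET.tostring(element, encoding="unicode"))
```

Deriving the mention string from the offsets, rather than storing it separately, keeps the two from drifting apart during conversion.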