New Software Developments on Chemical Information Extraction
Wei Deng (David)
Source: bulletin.acscinf.org/PDFs/Deng.pdf
ChemAxon’s Naming Technology
• Name to structure
  – IUPAC, traditional and common names
  – A common name library of existing drugs
  – Supports CAS Registry Numbers
  – Homology groups: alkyl, aryl …
  – Future: biological names (PDB code, EC # …)
• Structure to name
  – IUPAC names, traditional names, common names
  – Supports other structure features
    • Isotopes, pseudo-asymmetric stereocenters …
• Accuracy and coverage constantly improving
• Also available from the command line
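As an illustration of the CAS Registry Number support listed above, the following minimal sketch shows how a token could be pre-screened as a plausible CAS number before any lookup. It uses the publicly documented CAS check-digit rule; the function name `is_valid_cas` is hypothetical, and this is not ChemAxon's implementation.

```python
import re

# A CAS Registry Number has 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas(candidate: str) -> bool:
    """Return True if the string is a well-formed CAS Registry Number."""
    match = CAS_PATTERN.match(candidate.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit

print(is_valid_cas("50-78-2"))  # aspirin -> True
print(is_valid_cas("50-78-3"))  # wrong check digit -> False
```

A check like this only confirms that the number is well formed; resolving it to a structure still requires a registry lookup.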
ChemAxon’s “Document to Structure”
• Extract chemical information from documents
  – Names: powered by the Naming Technology
  – Also imports SMILES, InChI, CAS numbers …
  – Images: OSRA
  – Returns structures and their locations in the document
• Works with scanned PDFs since 5.8 (Feb 2012)
  – Great for patent mining
• OCR and syntax correction under continuous development
  – OCR output: 3-rnethyl-l-me- thoxynaphthalene
  – Corrected: 3-methyl-1-methoxynaphthalene
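The kind of OCR correction shown in the naphthalene example can be sketched with a few rewrite rules. This is a toy illustration only: the three rules and the function name `correct_ocr_name` are assumptions made for this example, not the correction logic used by Document to Structure, which is presumably far more extensive.

```python
import re

# Hypothetical, deliberately tiny table of OCR confusions ("rn" read for "m").
OCR_FRAGMENT_FIXES = {"rnethyl": "methyl", "rnethoxy": "methoxy"}

def correct_ocr_name(raw: str) -> str:
    """Apply a few rule-based repairs to an OCR-damaged chemical name."""
    # Rejoin a word hyphenated across a line break: "me- thoxy" -> "methoxy".
    name = re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", raw)
    # A lone "l" in locant position (not preceded by a letter, followed by a hyphen)
    # is treated as the digit "1".
    name = re.sub(r"(?<![A-Za-z])l(?=-)", "1", name)
    # Replace known misread fragments.
    for wrong, right in OCR_FRAGMENT_FIXES.items():
        name = name.replace(wrong, right)
    return name

print(correct_ocr_name("3-rnethyl-l-me- thoxynaphthalene"))
# 3-methyl-1-methoxynaphthalene
```

In practice a corrected candidate would only be accepted if the name-to-structure step can actually parse it, which guards against over-eager rewriting.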
From Document to Structures
Non-searchable patent (50 pages) → Structure (text + image) + location
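One way to picture the "structure + location" output is as a list of hit records that can be grouped by structure. The sketch below is an assumption-laden illustration: the `Hit` fields, the use of SMILES strings as keys, and the sample page numbers are all invented for this example and do not reflect the actual Document to Structure API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    structure: str     # assumed here to be a SMILES string
    source: str        # "text" (a recognized name) or "image" (e.g. via OSRA)
    page: int          # page on which the hit was found
    matched_text: str  # the name as it appears in the document, if any

def index_by_structure(hits):
    """Group hits by structure so each structure maps to all of its locations."""
    index = defaultdict(list)
    for hit in hits:
        index[hit.structure].append(hit)
    return index

# Illustrative data only.
hits = [
    Hit("CC(=O)Oc1ccccc1C(=O)O", "text", 3, "aspirin"),
    Hit("CC(=O)Oc1ccccc1C(=O)O", "image", 12, ""),
    Hit("c1ccccc1", "text", 7, "benzene"),
]
index = index_by_structure(hits)
print([h.page for h in index["CC(=O)Oc1ccccc1C(=O)O"]])  # [3, 12]
```

Exact string matching is used here only for brevity; a real structure search would compare canonicalized structures rather than raw strings.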
Search by Structure or Text
Non-searchable PDF is now Searchable
ChemAxon’s “Document to Structure”
• New features in 5.9 (Mar 2012)
  – MS Office documents: doc, docx, ppt, pptx, xls, xlsx, odt …
  – Embedded structure objects (ChemDraw, Symyx, Marvin …)
  – Progressive display of results
  – Speed improvement
  – Instant JChem integration; simplified API
• Currently in development for 5.10 (May 2012)
  – OSRA “Confidence”
  – Fragment groups integration with Markush generation
  – Collaboration with Linguamatics
  – IJC (OSRA, Location)
Free Online Service Chemicalize.org
• Extract chemical information from web pages and PDF documents
• Interactively display all structures and their predicted properties
• Search all structures extracted
• Gather links of interest to chemists for post processing (search, analysis, reporting, fun…)
• Recently reviewed in the Journal of Chemical Information and Modeling
Webpage - chemicalized
• All chemical names are highlighted with a dotted line
• Mousing over a name pops up the structure image
• Clicking on the image leads to the data page
• Links are “respected”
Data Page: Extensive Predicted Properties
• Customizable report layout for calculation results. Users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit
Webpage - chemicalized
• All structures are summarized above the chemicalized page
• Click on a structure to highlight all occurrences. Click again to navigate to the next occurrence
• All structures can be downloaded as MRV or SDF
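The "click again to navigate to the next occurrence" behaviour can be modelled as a cursor cycling over the positions of one structure in the page. This is a hypothetical sketch: the class name `OccurrenceNavigator`, the integer positions, and the wrap-around at the end are assumptions, not a description of how Chemicalize.org is implemented.

```python
class OccurrenceNavigator:
    """Cycle through the occurrences of one structure on a chemicalized page."""

    def __init__(self, occurrences):
        self._occurrences = list(occurrences)
        self._current = -1  # nothing selected before the first click

    def click(self):
        """Move to the next occurrence, wrapping around after the last one."""
        if not self._occurrences:
            return None
        self._current = (self._current + 1) % len(self._occurrences)
        return self._occurrences[self._current]

# Illustrative character offsets of one structure's name in a page.
navigator = OccurrenceNavigator([10, 42, 97])
print([navigator.click() for _ in range(4)])  # [10, 42, 97, 10]
```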
PDF File - chemicalized
Searching Chemicalize.org – Structure Search
Aspirin: query highlighted in results
Searching Chemicalize.org – Keyword Search
• Aspirin; web page hits - “show” related structures
• Autosuggest while typing
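Autosuggest while typing is commonly built as a prefix lookup over a sorted list of known names. The sketch below shows that general technique with Python's `bisect` module; the function `suggest` and the sample names are hypothetical, and nothing here is taken from the Chemicalize.org implementation.

```python
from bisect import bisect_left

def suggest(sorted_names, prefix, limit=5):
    """Return up to `limit` names that start with `prefix`.

    `sorted_names` must be lowercase and sorted.
    """
    prefix = prefix.lower()
    # Binary search for the first name that is >= the prefix.
    start = bisect_left(sorted_names, prefix)
    results = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix) or len(results) == limit:
            break
        results.append(name)
    return results

names = sorted(["aspirin", "aspartame", "asparagine", "benzene", "caffeine"])
print(suggest(names, "asp"))  # ['asparagine', 'aspartame', 'aspirin']
```

Because the list is sorted, all names sharing a prefix are adjacent, so the lookup costs one binary search plus a short scan.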
Everything is Published
• Recently viewed
  – Webpages
  – Structures
  – Documents
  – Searched queries (structure and keyword)
Availability and Customization
• Source code available
• Minor changes required to the example code for customization, such as
  – Importing extracted structures into other databases
  – Post-process filtering according to properties
  – Batch processing of multiple documents
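The customizations listed above might look roughly like the following. This is a sketch under stated assumptions: extracted structures are represented as plain dictionaries, and `extract` is a hypothetical callable standing in for whatever extraction entry point the example code provides; neither name comes from ChemAxon's API.

```python
from pathlib import Path

def filter_by_property(records, name, low, high):
    """Keep records whose property `name` lies within [low, high]."""
    return [r for r in records if name in r and low <= r[name] <= high]

def batch_process(folder, extract, pattern="*.pdf"):
    """Run the caller-supplied `extract` on each matching file; return {filename: result}."""
    return {path.name: extract(path)
            for path in sorted(Path(folder).glob(pattern))}

# Illustrative records; "mass" is the molecular weight in g/mol.
records = [
    {"name": "aspirin", "mass": 180.16},
    {"name": "benzene", "mass": 78.11},
    {"name": "caffeine", "mass": 194.19},
]
kept = filter_by_property(records, "mass", 100.0, 190.0)
print([r["name"] for r in kept])  # ['aspirin']
```

Keeping the filter independent of the extraction step means the same post-processing can be applied to results from a single document or from a whole batch.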
Hunting for Hidden Treasures
• A CINF Symposium regarding “chemical information in patents and other documents”
• ACS meeting in Philadelphia, August 19-23, 2012.
• Current speakers from
  – Content providers
  – Software providers
  – Pharmaceutical researchers