New Software Developments on Chemical Information Extraction
Wei Deng (David)
ChemAxon’s Naming Technology
• Name to structure
– IUPAC, traditional and common names
– A common name library of existing drugs
– Supports CAS Registry numbers
– Homology group: alkyl, aryl …
– Future: Biological names (PDB code, EC # …)
• Structure to Name
– IUPAC Name, traditional names, common names
– Supports other structure features: isotopes, pseudo-asymmetric stereocenters …
• Accuracy and coverage constantly improving
• Also available from the command line
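As a small illustration of the CAS Registry number support listed above, the sketch below finds candidate CAS numbers in text and validates them with the publicly documented check-digit rule. It is not ChemAxon's implementation; the function names `is_valid_cas` and `find_cas_numbers` are made up for this example.

```python
import re

# A CAS Registry number has the form NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def find_cas_numbers(text: str) -> list[tuple[str, int]]:
    """Return (cas_number, character_offset) for every valid CAS number in the text."""
    return [(m.group(0), m.start())
            for m in CAS_PATTERN.finditer(text)
            if is_valid_cas(m.group(0))]


if __name__ == "__main__":
    sample = "Aspirin (CAS 50-78-2) and caffeine (58-08-2); not a CAS number: 12-34-5."
    for cas, offset in find_cas_numbers(sample):
        print(cas, "at offset", offset)
```

The check digit filters out most digit groups that only look like CAS numbers, which matters when scanning long documents.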
ChemAxon’s “Document to Structure”
• Extract chemical information from documents
– Names: powered by the Naming Technology
– Also import SMILES, InChI, CAS number …
– Images: OSRA
– Returns structures and their locations in the document
• Works with scanned PDF since 5.8 (Feb 2012)
– Great for patent mining
• OCR and syntax correction under constant development
– Example: the OCR output “3-rnethyl-l-me- thoxynaphthalene” is corrected to “3-methyl-1-methoxynaphthalene”
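The kind of OCR repair shown in the example above can be sketched with a few rewrite rules. This is a toy illustration, not ChemAxon's correction algorithm: the three rules and the function name `correct_ocr_name` are assumptions chosen only to cover this one example.

```python
import re


def correct_ocr_name(raw: str) -> str:
    """Apply three toy OCR-repair rules to a chemical name."""
    # Rule 1: rejoin a word split by a line-break hyphen ("me- thoxy" -> "methoxy").
    name = re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", raw)
    # Rule 2: "rn" is a common OCR misreading of "m" ("rneth" -> "meth").
    name = name.replace("rneth", "meth")
    # Rule 3: a lone "l" or "I" in a locant position (no letter before it,
    # hyphen after it) is almost certainly the digit "1".
    name = re.sub(r"(?<![A-Za-z])[lI](?=-)", "1", name)
    return name


if __name__ == "__main__":
    print(correct_ocr_name("3-rnethyl-l-me- thoxynaphthalene"))
```

A production system would need many more rules, plus a check that the corrected name actually parses to a structure; the sketch only shows the shape of the approach.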
ChemAxon’s “Document to Structure”
• New Features in 5.9 (Mar 2012)
– MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …
– Embedded structure objects (ChemDraw, Symyx, Marvin …)
– Progressive display of results
– Speed improvement
– Instant JChem integration; simplified API
• Currently in development for 5.10 (May 2012)
– OSRA “Confidence”
– Fragment groups integration with Markush generation
– Collaboration with Linguamatics
– IJC (OSRA, Location)
From Document to Structures
Non-searchable patent (50 pages) → Structure (text + image) + location
Search by Structure or Text
Non-searchable PDF is now Searchable
Free Online Service Chemicalize.org
• Extract chemical information from web pages and PDF documents
• Interactively display all structures and their predicted properties
• Search all structures extracted
• Gather links of interest to chemists for post-processing (search, analysis, reporting, fun…)
• Recently reviewed in the Journal of Chemical Information and Modeling
Webpage - chemicalized
• All chemical names are highlighted with a dotted line
• Mousing over a name pops up the structure image
• Clicking the image leads to the data page
• Links are “respected”
• Customizable report layout for calculation results: users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit
Data Page: Extensive Predicted Properties
• All structures are summarized above the chemicalized page
• Click on a structure to highlight all occurrences. Click again to navigate to the next occurrence
• All structures can be downloaded as MRV or SDF
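Once structures are downloaded as SDF, they can be post-processed with very little code. The sketch below splits SDF text into records using only the format's standard markers (the `M  END` line, `> <FIELD>` data headers, and the `$$$$` record terminator). The sample records have empty connection tables, and the `CAS` field name is hypothetical; neither is taken from an actual Chemicalize.org export.

```python
import re

FIELD_HEADER = re.compile(r">\s*<([^>]+)>")


def parse_record(lines: list[str]) -> dict:
    """Turn the lines of one SDF record into title, molblock, and data fields."""
    # The connection table ends at the "M  END" line; data items follow it.
    end = next(i for i, line in enumerate(lines) if line.strip() == "M  END")
    fields, key = {}, None
    for line in lines[end + 1:]:
        header = FIELD_HEADER.match(line)
        if header:
            key = header.group(1)
            fields[key] = []
        elif line.strip() and key is not None:
            fields[key].append(line.strip())
        else:
            key = None  # a blank line closes the current data item
    return {
        "title": lines[0].strip(),
        "molblock": "\n".join(lines[: end + 1]),
        "fields": {name: "\n".join(values) for name, values in fields.items()},
    }


def read_sdf_records(sdf_text: str) -> list[dict]:
    """Split SDF text on the '$$$$' terminator and parse each record."""
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records


# Toy input: two records with empty connection tables (illustration only).
SAMPLE = "\n".join([
    "aspirin", "  sketch", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END", "> <CAS>", "50-78-2", "", "$$$$",
    "caffeine", "  sketch", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END", "> <CAS>", "58-08-2", "", "$$$$",
])

if __name__ == "__main__":
    for record in read_sdf_records(SAMPLE):
        print(record["title"], record["fields"])
```

For real work a cheminformatics toolkit would parse the connection table as well; this sketch only separates records and reads their data fields.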
Webpage - chemicalized
PDF File - chemicalized
Aspirin: query highlighted in results
Searching Chemicalize.org – Structure Search
• Aspirin; web page hits - “show” related structures
• Autosuggest while typing
Searching Chemicalize.org – Keyword Search
Everything is Published
• Recently viewed
– Webpages
– Structures
– Documents
– Searched queries (structure and keyword)
Availability and Customization
• Source code available
• Only minor changes to the example code are required for customization, such as
– Import extracted structures to other databases
– Post-process filtering according to properties
– Batch processing of multiple documents
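The three customizations above can be sketched end to end. Everything here is a stand-in: `extract_hits` is a hypothetical dictionary lookup that takes the place of a real extractor, and the `Hit` class, the table layout, and the molecular-weight filter are assumptions made for illustration, not part of the ChemAxon example code.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical stand-in for a real extractor: name -> (SMILES, molecular weight).
KNOWN = {
    "aspirin": ("CC(=O)OC1=CC=CC=C1C(=O)O", 180.16),
    "caffeine": ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", 194.19),
    "ethanol": ("CCO", 46.07),
}


@dataclass
class Hit:
    document: str
    name: str
    smiles: str
    mol_weight: float
    position: int  # character offset of the name in the document


def extract_hits(doc_id: str, text: str) -> list[Hit]:
    """Find every known name in the text, ordered by position."""
    hits, lowered = [], text.lower()
    for name, (smiles, weight) in KNOWN.items():
        start = lowered.find(name)
        while start != -1:
            hits.append(Hit(doc_id, name, smiles, weight, start))
            start = lowered.find(name, start + 1)
    return sorted(hits, key=lambda hit: hit.position)


def process_batch(documents: dict[str, str], min_weight: float = 0.0) -> list[Hit]:
    """Batch-process several documents and keep hits at or above min_weight."""
    kept = []
    for doc_id, text in documents.items():
        kept.extend(h for h in extract_hits(doc_id, text) if h.mol_weight >= min_weight)
    return kept


def store_hits(conn: sqlite3.Connection, hits: list[Hit]) -> None:
    """Import the extracted structures into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS hits "
                 "(document TEXT, name TEXT, smiles TEXT, mol_weight REAL, position INTEGER)")
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
                     [(h.document, h.name, h.smiles, h.mol_weight, h.position) for h in hits])
    conn.commit()


if __name__ == "__main__":
    docs = {"a.txt": "Aspirin was dissolved in ethanol.", "b.txt": "Caffeine and aspirin."}
    connection = sqlite3.connect(":memory:")
    store_hits(connection, process_batch(docs, min_weight=100.0))
    for row in connection.execute("SELECT document, name, position FROM hits"):
        print(row)
```

Keeping extraction, filtering, and storage in separate functions means the stub extractor can later be replaced by a real one without touching the rest.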
Hunting for Hidden Treasures
• A CINF symposium on “chemical information in patents and other documents”
• ACS meeting in Philadelphia, August 19-23, 2012
• Current speakers from
– Content providers
– Software providers
– Pharmaceutical researchers