Extract Chemical Information from Patents Using ...
Transcript of Extract Chemical Information from Patents Using ...
![Page 1: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/1.jpg)
Extract Chemical Information from Patents
Using Chemicalize and D2S
(Document to Structure)
Wei Deng (David), Daniel Bonniot, Andras Stracz
International Conference and Exhibition on
Computer Aided Drug Design & QSAR
Oct 30th, 2012
Chicago, IL, USA
![Page 2: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/2.jpg)
ChemAxon’s Naming Technology
● Structure to Name
● Name to Structure
● Document to Structure
● Document to Database
● Chemicalize
2
![Page 3: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/3.jpg)
ChemAxon’s Naming Technology
• Structure to Name
– IUPAC Name, traditional names
– Reaching maturity
– Still upcoming: peptides, some fused systems
• Name to structure
– IUPAC, CAS and systematic names
– Dictionary of common names and drug names
– Support CAS Registry number (webservice)
– Homology group: alkyl, aryl …
– Future: Biological names, polymers, ...
• Accuracy and coverage constantly improving
• Also available from command-line
3
![Page 4: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/4.jpg)
Name to Structure Internals
• Dictionary of common and drug names
– Uses nine different source dictionaries
– Harmonized using weighted consensus method
– 150K names for 40K unique structures
– Custom dictionary and plug-in system
• Systematic names
– Proprietary algorithm
– ”Understands” systematic names
– Example:
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
4
![Page 5: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/5.jpg)
Systematic Name Example
5
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 6: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/6.jpg)
Systematic Name: Parsing
6
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 7: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/7.jpg)
Systematic Name: Parsing
7
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 8: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/8.jpg)
Systematic Name: Parsing
8
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 9: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/9.jpg)
Systematic Name: Parsing
9
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 10: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/10.jpg)
Systematic Name: Parsing
1
0
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 11: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/11.jpg)
Systematic Name: Parsing
1
1
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 12: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/12.jpg)
Systematic Name: Structure Generation
1
2
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 13: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/13.jpg)
OCR Error Correction
1
3
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
![Page 14: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/14.jpg)
OCR Error Correction
1
4
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
![Page 15: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/15.jpg)
OCR Error Correction
1
5
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Λr-benzyl-Λr-[3-(lH-tetrazol-5-
yl)phenyl]propanamide
?-benzyl-?-[3-(?H-tetrazol-5-
yl)phenyl]propanamide
![Page 16: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/16.jpg)
OCR Error Correction
1
6
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
(2R)-2-methylsulfanyl-3-hydroxybutanedioate
Λr-benzyl-Λr-[3-(lH-tetrazol-5-
yl)phenyl]propanamide
N-benzyl-N-[3-(1H-tetrazol-5-
yl)phenyl]propanamide
![Page 17: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/17.jpg)
ChemAxon’s “Document to Structure”
• Extract chemical information from
documents – Names: powered by the Naming Technology
– Also import smiles, InChI, CAS number …
– Images: OSRA …
– Works with scanned non-searchable PDF
– Returns structures and their location in the document
• Supported formats: – MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …
– Embedded structure objects (ChemDraw, Symyx, Marvin,
…)
– PDF, text, XML, HTML
17
![Page 18: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/18.jpg)
From Document to Structures
1
8
Non-searchable patent (50 pages) Structure (text + image) + location
![Page 19: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/19.jpg)
Instant JChem Example
• In Instant JChem, search for
19
![Page 20: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/20.jpg)
Search by Structure or Text
20
![Page 21: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/21.jpg)
Non-searchable PDF is now Searchable
21
![Page 22: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/22.jpg)
“Document to Database”
22
![Page 23: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/23.jpg)
ChemAxon’s “Document to Database”
• Data in DB:
– Structures
– Source (name, smiles, embedded, …) and
location
– Documents, Authors, Metadata...
• Questions:
– What structures appear in a specific document?
– What documents contain a
structure/substructure/...?
– What documents written since 2010 in location X
contain substructure S?
... 23
![Page 24: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/24.jpg)
ChemAxon’s “Document to Database”
• Customizable:
– Metadata extracted from documents
– Interface (IJC forms, webapp)
• Demo:
– One month of US patents
– 85K unique structures from systematic names
– 1M occurrences
24
![Page 25: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/25.jpg)
Summary
● Extensive, improving naming technologies
(n2s, s2n)
● Increasing support for Document mining
(d2s, d2db, SharePoint)
● Putting it all together → going large scale,
extract and use valuable chemical
information
● Still component-based and responding to
our users requests
2
5
![Page 26: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/26.jpg)
Free Online Service Chemicalize.org
• Extract
• Interactively display
• Calculate
• Search
26
Recently reviewed in J. Chem. Inf. Model., 2012, 52 (2), pp 613–
615
![Page 27: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/27.jpg)
27
Webpage - Chemicalized
![Page 28: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/28.jpg)
• Customizable report layout for calculation results. Users can move, open, close, expand calculation boxes and this is remembered on the next visit
Data Page: Extensive Predicted Properties
![Page 29: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/29.jpg)
29
Webpage - Chemicalized
![Page 30: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/30.jpg)
PDF File - Chemicalized
![Page 31: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/31.jpg)
Aspirin: query highlighted in results
Searching Chemicalize.org – Structure Search
![Page 32: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/32.jpg)
• Aspirin; web page hits - “show” related structures
• Autosuggest while typing
Searching Chemicalize.org – Keyword Search
![Page 33: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/33.jpg)
Everything is Published
• Recent viewed
– Webpages
– Structures
http://www.chemicalize.org/blog/2012/09/26/1-million-
molecules/
– Documents
– Searched queries (structure and keyword)
33
![Page 34: Extract Chemical Information from Patents Using ...](https://reader031.fdocuments.in/reader031/viewer/2022012916/61c709fae69ac10897104c2f/html5/thumbnails/34.jpg)
Availability and Customization
• Source code available
• Minor changes required on example codes
for customization, such as
– Import extracted structures to other databases
– Post-process filtering according to properties
– Batch process of multiple documents
34