Extract Chemical Information from Patents Using ...

34
Extract Chemical Information from Patents Using Chemicalize and D2S (Document to Structure) Wei Deng (David), Daniel Bonniot, Andras Stracz International Conference and Exhibition on Computer Aided Drug Design & QSAR Oct 30 th , 2012 Chicago, IL, USA

Transcript of Extract Chemical Information from Patents Using ...

Page 1: Extract Chemical Information from Patents Using ...

Extract Chemical Information from Patents

Using Chemicalize and D2S

(Document to Structure)

Wei Deng (David), Daniel Bonniot, Andras Stracz

International Conference and Exhibition on

Computer Aided Drug Design & QSAR

Oct 30th, 2012

Chicago, IL, USA

Page 2: Extract Chemical Information from Patents Using ...

ChemAxon’s Naming Technology

● Structure to Name

● Name to Structure

● Document to Structure

● Document to Database

● Chemicalize

2

Page 3: Extract Chemical Information from Patents Using ...

ChemAxon’s Naming Technology

• Structure to Name

– IUPAC Name, traditional names

– Reaching maturity

– Still upcoming: peptides, some fused systems

• Name to structure

– IUPAC, CAS and systematic names

– Dictionary of common names and drug names

– Support CAS Registry number (webservice)

– Homology group: alkyl, aryl …

– Future: Biological names, polymers, ...

• Accuracy and coverage constantly improving

• Also available from command-line

3

Page 4: Extract Chemical Information from Patents Using ...

Name to Structure Internals

• Dictionary of common and drug names

– Uses nine different source dictionaries

– Harmonized using weighted consensus method

– 150K names for 40K unique structures

– Custom dictionary and plug-in system

• Systematic names

– Proprietary algorithm

– ”Understands” systematic names

– Example:

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

4

Page 5: Extract Chemical Information from Patents Using ...

Systematic Name Example

5

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 6: Extract Chemical Information from Patents Using ...

Systematic Name: Parsing

6

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 7: Extract Chemical Information from Patents Using ...

Systematic Name: Parsing

7

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 8: Extract Chemical Information from Patents Using ...

Systematic Name: Parsing

8

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 9: Extract Chemical Information from Patents Using ...

Systematic Name: Parsing

9

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 10: Extract Chemical Information from Patents Using ...

Systematic Name: Parsing

1

0

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 11: Extract Chemical Information from Patents Using ...

Systematic Name: Parsing

1

1

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 12: Extract Chemical Information from Patents Using ...

Systematic Name: Structure Generation

1

2

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 13: Extract Chemical Information from Patents Using ...

OCR Error Correction

1

3

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

Page 14: Extract Chemical Information from Patents Using ...

OCR Error Correction

1

4

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Page 15: Extract Chemical Information from Patents Using ...

OCR Error Correction

1

5

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Λr-benzyl-Λr-[3-(lH-tetrazol-5-

yl)phenyl]propanamide

?-benzyl-?-[3-(?H-tetrazol-5-

yl)phenyl]propanamide

Page 16: Extract Chemical Information from Patents Using ...

OCR Error Correction

1

6

(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate

(2R)-2-methylsulfanyl-3-hydroxybutanedioate

Λr-benzyl-Λr-[3-(lH-tetrazol-5-

yl)phenyl]propanamide

N-benzyl-N-[3-(1H-tetrazol-5-

yl)phenyl]propanamide

Page 17: Extract Chemical Information from Patents Using ...

ChemAxon’s “Document to Structure”

• Extract chemical information from

documents – Names: powered by the Naming Technology

– Also import smiles, InChI, CAS number …

– Images: OSRA …

– Works with scanned non-searchable PDF

– Returns structures and their location in the document

• Supported formats: – MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …

– Embedded structure objects (ChemDraw, Symyx, Marvin,

…)

– PDF, text, XML, HTML

17

Page 18: Extract Chemical Information from Patents Using ...

From Document to Structures

1

8

Non-searchable patent (50 pages) Structure (text + image) + location

Page 19: Extract Chemical Information from Patents Using ...

Instant JChem Example

• In Instant JChem, search for

19

Page 20: Extract Chemical Information from Patents Using ...

Search by Structure or Text

20

Page 21: Extract Chemical Information from Patents Using ...

Non-searchable PDF is now Searchable

21

Page 22: Extract Chemical Information from Patents Using ...

“Document to Database”

22

Page 23: Extract Chemical Information from Patents Using ...

ChemAxon’s “Document to Database”

• Data in DB:

– Structures

– Source (name, smiles, embedded, …) and

location

– Documents, Authors, Metadata...

• Questions:

– What structures appear in a specific document?

– What documents contain a

structure/substructure/...?

– What documents written since 2010 in location X

contain substructure S?

... 23

Page 24: Extract Chemical Information from Patents Using ...

ChemAxon’s “Document to Database”

• Customizable:

– Metadata extracted from documents

– Interface (IJC forms, webapp)

• Demo:

– One month of US patents

– 85K unique structures from systematic names

– 1M occurrences

24

Page 25: Extract Chemical Information from Patents Using ...

Summary

● Extensive, improving naming technologies

(n2s, s2n)

● Increasing support for Document mining

(d2s, d2db, SharePoint)

● Putting it all together → going large scale,

extract and use valuable chemical

information

● Still component-based and responding to

our users requests

2

5

Page 26: Extract Chemical Information from Patents Using ...

Free Online Service Chemicalize.org

• Extract

• Interactively display

• Calculate

• Search

26

Recently reviewed in J. Chem. Inf. Model., 2012, 52 (2), pp 613–

615

Page 27: Extract Chemical Information from Patents Using ...

27

Webpage - Chemicalized

Page 28: Extract Chemical Information from Patents Using ...

• Customizable report layout for calculation results. Users can move, open, close, expand calculation boxes and this is remembered on the next visit

Data Page: Extensive Predicted Properties

Page 29: Extract Chemical Information from Patents Using ...

29

Webpage - Chemicalized

Page 30: Extract Chemical Information from Patents Using ...

PDF File - Chemicalized

Page 31: Extract Chemical Information from Patents Using ...

Aspirin: query highlighted in results

Searching Chemicalize.org – Structure Search

Page 32: Extract Chemical Information from Patents Using ...

• Aspirin; web page hits - “show” related structures

• Autosuggest while typing

Searching Chemicalize.org – Keyword Search

Page 34: Extract Chemical Information from Patents Using ...

Availability and Customization

• Source code available

• Minor changes required on example codes

for customization, such as

– Import extracted structures to other databases

– Post-process filtering according to properties

– Batch process of multiple documents

34