Introducing CLiDE Pro: A chemical OCR tool Aniko T. Valko, Keymodule Ltd.

Page 1:

Introducing CLiDE Pro:

A chemical OCR tool

Aniko T. Valko, Keymodule Ltd.

Page 2:

Chemical structure diagrams

Chemical structure diagrams are a form of representation of chemical compounds.

Information contained in a structure diagram can be divided into three areas:

• Atom information: chemical elements, functional groups, generic elements, vertex label, charge, atomic weight, hybridization, etc.

• Bond information: bond orders, bond styles, bond labels

• Structural information: atom information, bond information, overall charge, structure label

[Figure: example structure diagrams illustrating atom, bond and structural information]

Page 3:

What is chemical OCR for?

Publication process: chemical structure diagrams are converted to images. All chemical information is lost!

Manual reproduction: slow and prone to errors

Chemical OCR: automatic extraction of chemical information from chemical structure depictions

20-90 seconds per page

[Figure: chemical structure diagram, with an excerpt of an MDL MOL connection table]

 29 31  0  0  0  0  0  0  0  0999 V2000
   -1.9417    2.3939    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3542    1.6794    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.9417    0.9649    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1167    0.9649    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7042    1.6794    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1167    2.3939    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.1792    1.6794    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5917    0.9649    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.5917    2.3939    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.0042    1.6794    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.1208    1.6794    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    0.7042    1.0961    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0927    2.4763    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7042    2.2628    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.5292    1.0961    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9417    0.3816    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
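To make the MOL connection table on this slide more concrete, here is a minimal Python sketch of how its counts line and atom lines could be read. The names `Atom`, `parse_counts_line` and `parse_atom_line` are hypothetical, and splitting atom lines on whitespace is a simplification (the V2000 format is fixed-width); this is not how CLiDE Pro itself reads or writes the file.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    symbol: str


def parse_counts_line(line: str) -> tuple[int, int]:
    """Return (atom_count, bond_count) from a V2000 counts line.

    The two counts sit in the first two fixed-width, 3-character fields.
    """
    return int(line[0:3]), int(line[3:6])


def parse_atom_line(line: str) -> Atom:
    """Parse one atom-block line (whitespace splitting is a simplification)."""
    fields = line.split()
    return Atom(float(fields[0]), float(fields[1]), float(fields[2]), fields[3])


# Counts line and two atom lines taken from the slide.
counts = " 29 31  0  0  0  0  0  0  0  0999 V2000"
atom_lines = [
    "   -1.9417    2.3939    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.1208    1.6794    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0",
]

n_atoms, n_bonds = parse_counts_line(counts)
atoms = [parse_atom_line(line) for line in atom_lines]
```

Here the counts line says the full molecule has 29 atoms and 31 bonds, of which the slide shows only the first 16 atom lines.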

Page 4:

CLiDE Pro

A chemical OCR software tool

The latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3].

[1] P. Ibison, M. Jacquot, F. Kam, A. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3), 338-344.

[2] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information 92. 1992, London, England.

[3] A. Simon and A.P. Johnson. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1), 109-116.

Page 5:

Features

Converts chemical images into connection tables

Interprets generic structures

Supports document-oriented processing as opposed to page-oriented processing: the whole document is loaded and processed at once rather than as individual pages.

Loads PDF documents, as well as TIFF and BMP image files

Handles various difficult drawing features

Exports chemical information into MDL MOL files

Operates in interactive or batch mode

Tools for structure and text editing

Page 6:

Three main problems involved in chemical OCR

1) Identification of chemical images within a document.

2) Compilation of chemical graphs of individual molecules from chemical images.

3) Interpretation of complex objects such as generic structures using the retrieved chemical graphs.

Page 7:

CLiDE Pro’s solutions to Problem 1

Problem 1: Identification of chemical images within a document

Document image segmentation

Digitized image of a document page of a patent

Segmented document highlighting recognized text blocks and graphic blocks

Identification of connected components

Bottom-up layout analysis by building the tree structure of the page
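The identification of connected components mentioned above can be illustrated with a generic labeling routine. The sketch below is a textbook breadth-first search over a binary image with 8-connectivity, not CLiDE Pro's own implementation; the function names and the tiny `page` array are made up for illustration. The bounding boxes it produces are the kind of leaf elements a bottom-up layout analysis could then group into larger blocks.

```python
from collections import deque


def connected_components(image):
    """Label 8-connected groups of black pixels (value 1) in a binary image.

    Returns a list of components, each a list of (row, col) coordinates.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component and flood-fill it breadth-first.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = pr + dr, pc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and image[nr][nc] == 1 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a component."""
    rs = [p[0] for p in pixels]
    cs = [p[1] for p in pixels]
    return min(rs), min(cs), max(rs), max(cs)


page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
components = connected_components(page)
boxes = [bounding_box(p) for p in components]
```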

Page 8:

CLiDE Pro’s solutions to Problem 2

Problem 2: Extraction of connection tables from chemical images

1) Chemical image

2) Classification of connected components into basic groups: characters, lines, dashes, graphics

3) Construction of dashed bonds, based on the Hough transform method [4]

4) Vectorization, based on a polygon approximation method [5]

5) Construction of atom labels: OCR, grouping characters into atom labels, recognition of superatoms

6) Construction of connection table: connecting lines to atoms, joining lines to form implicit carbon atoms

3D molecular structure after exporting the constructed connection table (CT) into an SDF file in 2D and converting the structure from 2D to 3D

[4] R.O. Duda and P.E. Hart. Use of the Hough Transform to Detect Lines and Curves in Pictures. Graphics Image Process. 1972, 1.

[5] J. Sklansky and V. Gonzalez. Fast Polygonal Approximation of Digitized Curves. Pattern Recognit. 1980, 12, 327-331.
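The connection-table construction step on this slide (connecting lines to atoms, joining lines to form implicit carbon atoms) might be sketched as follows. This is an illustrative simplification under stated assumptions: bonds arrive as already vectorized line segments, atom labels arrive as a symbol plus a single anchor point, and `tol` is a hypothetical snapping distance. The slides do not describe CLiDE Pro's actual data structures or thresholds.

```python
import math


def build_connection_table(segments, labels, tol=0.2):
    """Build atoms and bonds from bond segments and recognized atom labels.

    segments: list of ((x1, y1), (x2, y2)) line segments.
    labels:   list of (symbol, (x, y)) recognized atom labels.
    tol:      hypothetical snapping distance, in the units of the input.
    """
    atoms = [(symbol, pos) for symbol, pos in labels]  # explicit atoms first
    bonds = []

    def node_for(point):
        # Reuse an existing atom if the line end lies close enough to it.
        for index, (_, pos) in enumerate(atoms):
            if math.dist(point, pos) <= tol:
                return index
        # Otherwise the line end is an unlabeled vertex: an implicit carbon.
        atoms.append(("C", point))
        return len(atoms) - 1

    for start, end in segments:
        a, b = node_for(start), node_for(end)
        if a != b:
            bonds.append((a, b))
    return atoms, bonds


# Ethanol drawn as a zigzag: two lines and one "OH" label at the right end.
segments = [((0.0, 0.0), (1.0, 0.5)), ((1.0, 0.5), (2.0, 0.0))]
labels = [("OH", (2.1, 0.0))]
atoms, bonds = build_connection_table(segments, labels)
```

The two line ends that meet at (1.0, 0.5) collapse into a single implicit carbon, while the end near the label is attached to "OH" instead of creating a new atom.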

Page 9:

CLiDE Pro’s solutions to Problem 3

Problem 3: Interpretation of generic structures

[Figure: generic structure with R-groups X, Y1, Y2 and R (in CO2R)]

41a: X=N, Y1=H, Y2=Cl, R=Et
41b: X=CF, Y1=Y2=F, R=Et

1) Generic text interpretation (GTI): R-groups, substitution values, labels

Currently, GTI is limited to text in which an '=' sign separates the R-groups from the substituents.

However, combined assignments to R-groups are handled successfully.

2) Association of the generic text block with the structure by matching R-groups present in both the text and the structure
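As a rough illustration of the two steps on this slide, the sketch below parses '='-separated generic text, including a combined assignment such as Y1=Y2=F, and then checks whether the text covers the R-groups found in a structure. The function names are hypothetical and the parsing rules are an assumption based only on the example lines shown here, not CLiDE Pro's actual GTI logic.

```python
def parse_generic_text(line):
    """Parse 'label: R1=val, R2=R3=val' into (label, {R-group: value})."""
    label, _, body = line.partition(":")
    assignments = {}
    for clause in body.split(","):
        parts = [p.strip() for p in clause.split("=")]
        if len(parts) < 2:
            continue  # no '=' sign: outside the scope of this sketch
        *groups, value = parts
        for group in groups:  # 'Y1=Y2=F' assigns F to both Y1 and Y2
            assignments[group] = value
    return label.strip(), assignments


def matches_structure(assignments, structure_r_groups):
    """True if the text defines every R-group found in the structure."""
    return set(structure_r_groups) <= set(assignments)


text = ["41a: X=N, Y1=H, Y2=Cl, R=Et", "41b: X=CF, Y1=Y2=F, R=Et"]
parsed = dict(parse_generic_text(t) for t in text)
associated = matches_structure(parsed["41a"], ["X", "Y1", "Y2", "R"])
```

Each parsed entry can then be substituted into the generic structure to produce one specific molecule per label.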

Page 10:

Alignment of Atom Labels

Two types of alignment of atom labels with more than one character:

Horizontal atom labels

Vertical atom labels

Examples

Page 11:

Alignment of Atom Labels

Constructed molecule Input image

The interpreted structure in CLiDE Pro’s GUI:

Page 12:

Ambiguity in interpretation

Horizontal lines representing dashes of a dashed wedged bond

A horizontal line representing a negative charge

Contextual analysis

Page 13:

Ambiguity in interpretation

Constructed molecule Input image

The interpreted structure in CLiDE Pro’s GUI:

Page 14:

Ambiguity in interpretation

Contextual analysis

Vertical lines representing iodine atoms

A vertical line that is part of a double bond

Page 15:

Ambiguity in interpretation

Constructed molecule Input image

The interpreted structure in CLiDE Pro’s GUI:

Page 16:

Ambiguity in interpretation

Contextual analysis

Circles represent:

Oxygen atoms

aromatic rings

Page 17:

Input image

Ambiguity in interpretation

Constructed molecule

Page 18:

Crossing bonds in a bridged molecule

Input image Constructed molecule

No extra carbon atom is generated at the point where bonds cross each other

Functional groups are expanded in the exported structure
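One way to recognize that two bond lines cross, rather than meet at an atom, is a standard segment-intersection test. The sketch below uses the usual orientation (cross product) check and reports only crossings that are interior to both segments; it is a generic geometric technique offered as an assumption about how such a case could be detected, not a description of CLiDE Pro's method.

```python
def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)


def bonds_cross(bond_a, bond_b):
    """True if two segments intersect at a point interior to both.

    Segments that merely share or touch an endpoint return False, since
    such a point is an ordinary atom position rather than a crossing.
    """
    (a1, a2), (b1, b2) = bond_a, bond_b
    return (orientation(a1, a2, b1) * orientation(a1, a2, b2) < 0
            and orientation(b1, b2, a1) * orientation(b1, b2, a2) < 0)


# Two diagonals of a square cross in the middle.
crossing = bonds_cross(((0, 0), (2, 2)), ((0, 2), (2, 0)))
# Two bonds that meet at a shared vertex do not count as crossing.
sharing = bonds_cross(((0, 0), (1, 1)), ((1, 1), (2, 0)))
```

A pair flagged as crossing would be kept as two separate bonds, so no atom is created at the intersection point.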

Page 19:

A generic structure

Constructed molecule Input image

R = H

R = Me

Page 20:

Bad image quality

Constructed molecule Input image

Isolated black spots (noise from scanning)

Black spots touching one CC

Black spots merging two or more CCs
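The simplest of the three noise cases above, isolated black spots, could be handled with a size filter over connected components (CCs). The sketch below assumes components are already available as lists of pixels, and `min_pixels` is a hypothetical threshold; spots that touch or merge CCs need more than this, and the slides do not say how CLiDE Pro treats them.

```python
def remove_isolated_spots(components, min_pixels=4):
    """Split connected components into kept shapes and discarded specks.

    components: list of components, each a list of (row, col) pixels.
    min_pixels: hypothetical size threshold; anything smaller is noise.
    """
    kept, noise = [], []
    for pixels in components:
        (kept if len(pixels) >= min_pixels else noise).append(pixels)
    return kept, noise


bond_line = [(5, c) for c in range(10, 30)]   # a 20-pixel horizontal stroke
speck = [(40, 7), (40, 8)]                    # a 2-pixel scanning artifact
kept, noise = remove_isolated_spots([bond_line, speck])
```

A fixed threshold like this risks discarding small but meaningful marks, such as the dashes of a dashed bond or a charge sign, so a real system would need context before deleting anything.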

Page 21:

Bad image quality

Input image Constructed molecule

Page 22:

Conclusions and Outlook

CLiDE Pro, a chemical OCR tool

3 main problems in chemical OCR and CLiDE Pro’s solutions

The quality of interpretation depends on the ability to deal with difficult situations such as

- Ambiguous drawing features

- Distortions resulting from bad image quality

Goal: to extend CLiDE Pro to further chemical drawing features such as

- Reaction schemes (partly implemented)

- Improved generic text interpretation (dealing with tables of R-groups)

- Positional variation in Markush structures

- Frequency variation in Markush structures

- Other difficult situations (e.g. missing bonds between ring atoms)

Page 23:

Palytoxin – A complex structure

Constructed molecule

Input image

Page 24:

Further Information

CLiDE Pro is licensed with Keymodule Ltd. and SimBioSys Inc.

http://www.keymodule.co.uk
http://www.simbiosys.ca

Live demo at Booth #817

Acknowledgments

People who previously worked on CLiDE