Recent developments in the CLiDE tool for extraction of chemical structure data from patents and other documents

Aniko T. Valko

Keymodule Ltd.

Peter Johnson, Vilmos A. Valko

Summary

1) About CLiDE

● What is CLiDE for?

2) Performance against a benchmark set of images

● About the benchmark set

● Performance of CLiDE

● Enhancements made in CLiDE

● Comparison with selected systems

3) Performance against selected patents

● About patents

● Performance of CLiDE

● Comparison with selected systems

4) Conclusions and future work

Part 1:

About CLiDE

What is CLiDE for?

CLiDE is an Optical Chemical Structure Recognition (OCSR) application that converts chemical structure diagrams into computer-readable structures (i.e. connection tables).

● Input documents: PDF, DOC, DOCX, HTML

● Input images: BMP, GIF, JPEG, PBM, PGM, PNG, PNM, PPM, TIFF, XBM, XPM

● Output formats: Molfile, RGfile, SDfile, CDX, CML, MRV, XML

Issues flagged on the extracted structures:

● Valence-violated atom

● Non-interpreted atom

● Clashing atoms

● Small bond angle

● Atoms at which CLiDE broke up the structure

Part 2:

Performance against a benchmark set of images

Benchmark set

● Images of isolated structures, one structure per image

● Number of images: 5735

● US Patent Office Complex Work Units, for example:

US07321045-20080122-C00150
US07320974-20080122-C00070
US07323286-20080129-C00108
US07317070-20080108-C00008
US07316739-20080108-C00281
US07314700-20080101-C00001
US07320972-20080122-C00016
US07314876-20080101-C00035
US07314576-20080101-C00035
US07314511-20080101-C00002

● Available on the OSRA web site

● Verification set: each image is associated with a Molfile meant to describe the correct connection table

Test runs on the benchmark set

● Test environment

CPU: 3 GHz Core 2 Duo

Memory: 4 GB

Linux distribution: Ubuntu 10.04 (64-bit)

● Test run per image

1) CLiDE was run on an image

2) CLiDE analysed the image and generated a connection table

3) The connection table extracted by CLiDE was compared (using canonical SMILES) with the corresponding connection table from the verification set (the so-called ‘ground truth’)

● Performance measurements

1) Accuracy rate: the percentage of images that were correctly processed by CLiDE

2) Runtime: the total runtime measured over all the test runs
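The accuracy rate defined above can be sketched in a few lines. This is a minimal illustration, not CLiDE's test harness: it assumes that the extracted and ground-truth connection tables have already been converted to canonical SMILES by some cheminformatics toolkit outside the sketch, and the image IDs and SMILES strings are hypothetical.

```python
def accuracy_rate(extracted, ground_truth):
    """Percentage of images whose extracted canonical SMILES equals the ground truth.

    Both arguments map an image ID to a canonical SMILES string.
    An image with no extracted structure counts as incorrectly processed.
    """
    if not ground_truth:
        raise ValueError("ground truth set is empty")
    correct = sum(
        1
        for image_id, expected in ground_truth.items()
        if extracted.get(image_id) == expected
    )
    return 100.0 * correct / len(ground_truth)


# Hypothetical data: four benchmark images, two of them recognised correctly.
ground_truth = {
    "img-001": "CCO",
    "img-002": "c1ccccc1",
    "img-003": "CC(=O)O",
    "img-004": "CN",
}
extracted = {
    "img-001": "CCO",
    "img-002": "c1ccccc1",
    "img-003": "CC(=O)N",  # wrong structure
    # img-004 produced no output
}

print(f"Accuracy rate: {accuracy_rate(extracted, ground_truth):.2f}%")
```

Comparing strings is only meaningful if both sides were canonicalised by the same toolkit with the same settings, since different canonicalisation schemes can yield different strings for the same molecule.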

Performance against benchmark set

[Chart: accuracy rate and runtime (min:sec) on the benchmark set, by CLiDE version]

CLiDE version   Accuracy rate
3.2.0           57.62%
4.2.0           59.00%
4.4.0           58.91%
5.0.0           59.30%
5.2.0           59.30%
5.2.1           81.81%
5.4.0           82.75%
5.5.0           84.60%
5.5.1           85.78%
5.5.2           86.55%
5.5.3           87.79%
5.5.4           87.96%

Overall, the accuracy rate rose from 57.62% to 87.96%, and the total runtime fell from 44 min to 20 min.

Enhancements annotated on the chart:

● Optimization and improvements in CLiDE’s document segmentation method (see later)

● Auto correction of atom labels

● Better handling of aromatic rings

● Parsing chemical formulas

● Avoidance of loss of characters in atom labels

● Better handling of thick bonds

● Further improvements to chemical formula parsing

Enhancements in CLiDE

Corrections in atom labels (accuracy rate: 59.30% to 81.81%)

● Auto correction of OCR errors in atom labels

● Avoidance of misinterpretation of ‘Cl’ labels as Carbons
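One way to picture automatic correction of OCR errors in atom labels is a lookup of common character confusions, accepted only when the result is a valid element symbol. The sketch below is an assumption made for illustration and does not describe CLiDE's actual method: the confusion table, the reduced element list and the function name `correct_atom_label` are all hypothetical.

```python
# Small subset of element symbols, enough for the illustration.
ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "I", "Cl", "Br", "Si"}

# Hypothetical OCR confusions: misread character -> characters it may stand for.
CONFUSIONS = {"0": ["O"], "1": ["l", "I"], "I": ["l"], "5": ["S"], "8": ["B"]}


def correct_atom_label(label):
    """Return a valid element symbol for an OCR'd label, or None.

    A label that is already valid is returned unchanged. Otherwise every
    single-character substitution from CONFUSIONS is tried, and the
    correction is accepted only if exactly one valid symbol results.
    """
    if label in ELEMENTS:
        return label
    candidates = []
    for i, ch in enumerate(label):
        for replacement in CONFUSIONS.get(ch, []):
            candidate = label[:i] + replacement + label[i + 1:]
            if candidate in ELEMENTS and candidate not in candidates:
                candidates.append(candidate)
    return candidates[0] if len(candidates) == 1 else None


for raw in ["C1", "CI", "0", "5i", "Cl", "Xx"]:
    print(raw, "->", correct_atom_label(raw))
```

Rejecting ambiguous corrections keeps the sketch conservative: a label that could be repaired in more than one way is left for other evidence, such as valence checks, to resolve.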

Enhancements in CLiDE

Chemical formula parsing (accuracy rate: 82.75% to 84.60%)

Two-step process:

● Parsing the chemical formula into a sub connection table

● Generating atom coordinates for the sub connection table

Super Atom Database: over 1000 super atoms, e.g. Me, Ph, Boc, TBDMS

Problem categories:

● Super atoms in chemical formulas (―CO2Ph)

● Left- and right-aligned chemical formulas (―CH2NH2 vs NH2CH2―)

● Branching in chemical formulas (―OC(CH3)3)

● Chemical formulas with multiple attachments (―OCH2CH2O―)

Future work:

● Variables in chemical formulas (―CO2R, ―NHZ, ―SiR3)
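The first half of formula parsing, splitting a condensed formula into symbols, counts and branches, can be sketched as a small recursive parser. This is an illustration under stated assumptions, not CLiDE's implementation: the element list and the five-entry super atom set are hypothetical stand-ins for the real database, and the sketch stops before bonds or coordinates are assigned.

```python
from collections import Counter

# Hypothetical, heavily reduced symbol tables for the illustration.
ELEMENTS = {"H", "C", "N", "O", "F", "S", "P", "Si", "Cl", "Br", "I"}
SUPER_ATOMS = {"Me", "Et", "Ph", "Boc", "TBDMS"}
# Longest symbols first, so that "Cl" wins over "C" and "Ph" over "P".
SYMBOLS = sorted(ELEMENTS | SUPER_ATOMS, key=len, reverse=True)


def parse_formula(formula):
    """Parse a condensed formula into a list of (item, count) pairs.

    An item is a symbol string or, for a parenthesised branch, a nested list.
    """
    pos = 0

    def read_count():
        nonlocal pos
        start = pos
        while pos < len(formula) and formula[pos] in "0123456789":
            pos += 1
        return int(formula[start:pos]) if pos > start else 1

    def read_group(nested):
        nonlocal pos
        items = []
        while pos < len(formula):
            if formula[pos] == ")":
                if not nested:
                    raise ValueError(f"unexpected ')' at position {pos}")
                pos += 1
                return items
            if formula[pos] == "(":
                pos += 1
                branch = read_group(True)
                items.append((branch, read_count()))
                continue
            for symbol in SYMBOLS:
                if formula.startswith(symbol, pos):
                    pos += len(symbol)
                    items.append((symbol, read_count()))
                    break
            else:
                raise ValueError(f"unknown symbol at position {pos}")
        if nested:
            raise ValueError("missing ')'")
        return items

    return read_group(False)


def symbol_counts(items, multiplier=1):
    """Total count of every symbol, expanding branches."""
    counts = Counter()
    for item, count in items:
        if isinstance(item, list):
            counts.update(symbol_counts(item, multiplier * count))
        else:
            counts[item] += multiplier * count
    return counts


print(parse_formula("CO2Ph"))
print(dict(symbol_counts(parse_formula("OC(CH3)3"))))
```

Turning the parsed items into a sub connection table additionally requires valence rules and knowledge of which end of the formula attaches to the rest of the structure, which is where the left- and right-aligned and multiple-attachment cases listed above become relevant.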

Enhancements in CLiDE

Avoidance of loss of characters from atom labels (accuracy rate: 84.60% to 85.78%)

Enhancements in CLiDE

Better handling of thick bonds (stereo indicators) (accuracy rate: 85.78% to 86.55%)

Comparison with selected systems

[Charts: accuracy rate and runtime (hour:min) of OSRA, Imago and CLiDE on the benchmark set]

Accuracy rate

● OSRA and Imago versions, in chart order: 74.48%, 70.27%, 68.68%, 68.68%, 43.20%, 61.28%

● CLiDE: 57.62% and 87.96%

Runtime (hour:min)

● OSRA and Imago versions, in chart order: 05:51, 04:50, 04:50, 04:54, 00:15, 01:52

● CLiDE: 00:44 and 00:20
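Total runtimes are easier to compare across systems when converted to throughput. The sketch below does this for the two CLiDE runtimes reported for the 5735-image benchmark (00:44 and 00:20, hour:min); the helper names `to_minutes` and `images_per_minute` are hypothetical.

```python
def to_minutes(hhmm):
    """Convert an 'hour:min' string such as '04:50' to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def images_per_minute(n_images, runtime_hhmm):
    """Average number of images processed per minute over a whole run."""
    return n_images / to_minutes(runtime_hhmm)


N_IMAGES = 5735  # size of the benchmark set

for label, runtime in [("CLiDE, 00:44 run", "00:44"), ("CLiDE, 00:20 run", "00:20")]:
    rate = images_per_minute(N_IMAGES, runtime)
    print(f"{label}: {rate:.1f} images/min")
```

Because the reported runtimes are rounded to whole minutes, the resulting rates are approximate.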

Is the benchmark set correct?

Verification set

Number of Molfiles to be corrected: 117

● Anomalies: 10

● Stereo bonds: 22

● Incorrect sub connection tables for chemical formulae (e.g. NC, H3CO2S, OCF3): 63

● Errors in atom labels: 14

● Other kinds of error: 17

Examples (image and Molfile):

US07314693-20080101-C00370.TIF, US07314693-20080101-C00370.MOL
US07316472-20080108-C00239.TIF, US07316472-20080108-C00239.MOL
US07314872-20080101-C00024.TIF, US07314872-20080101-C00024.MOL

Is the benchmark set correct?

Input images

Number of images to be excluded: 16

● Incorrect chemical formula: 1

● Disconnected atom: 1

● Incorrect or ambiguous stereo bond: 6

● Arrow with unknown meaning: 8

Examples:

US07314693-20080101-C00112.TIF
US07314874-20080101-C00551.TIF
USRE039991-20080101-C00187.TIF
USRE039991-20080101-C00188.TIF
US07320974-20080122-C00022.TIF

Performance after corrections

● Number of images: 5735

● Number of corrected Molfiles: 117

● Number of excluded images: 16

System           Accuracy before corrections   Accuracy after corrections
CLiDE 5.5.4      87.96%                        90.11%
OSRA 1.4.0       68.68%                        69.84%
Imago 2.0 beta   61.28%                        61.91%

Part 3:

Performance against selected patents

About patents

Patent         Number of non-Markush structures
US6410540      218
WO2008019099   668

Challenges:

● Chemical structure diagrams have to be identified within the document page

● Interpretation of Markush structures (Markush structures were excluded from our tests)

Challenge: Document segmentation

[Figure: segmentation of page 65 of US6410540, which contains underlined text, by CLiDE 5.5.4 and CLiDE 4.4.0]

Challenge: Document segmentation

[Figure: segmentation of page 188 of WO2008019099, which contains a table, by CLiDE 5.5.4 and CLiDE 4.4.0]

Performance measurements

● Accuracy rate

● Runtime

● #Garbage structures: The number of structures that were assigned to non-chemical structure diagrams

Performance of CLiDE

[Charts: accuracy rate, runtime (hour:min) and #Garbage structures for CLiDE 4.4.0 and CLiDE 5.5.4 on each patent]

Patent         CLiDE version   Accuracy rate   Runtime (hour:min)   #Garbage structures
US6410540      4.4.0           88.90%          00:22                145
US6410540      5.5.4           87.15%          00:06                81
WO2008019099   4.4.0           57.63%          01:34                1225
WO2008019099   5.5.4           74.25%          00:04                29

Comparison with selected systems: comparison with OSRA

[Charts: accuracy rate, runtime (hour:min) and #Garbage structures for CLiDE and OSRA on each patent]

Patent         System   Accuracy rate   Runtime (hour:min)   #Garbage structures
US6410540      CLiDE    87.15%          00:06                81
US6410540      OSRA     70.18%          00:09                29
WO2008019099   CLiDE    74.25%          00:04                29
WO2008019099   OSRA     51.64%          00:58                183

Part 4:

Conclusions and future work

Conclusions

● There has been considerable progress in OCSR, but nevertheless there still remain many problems to be solved

● The test sets showcased the diversity and the frequency of the problem types

● Regarding performance:

• CLiDE has greatly improved during the last few years

• CLiDE compares well with the other OCSR systems available to us for testing

● In favourable cases, OCSR as exemplified by CLiDE now approaches OCR in accuracy (90%)

Future work

Short-term goals:

● Further improvements to structure recognition

● Filtering out garbage structures

● Identification and exclusion of non-chemical structure diagrams

● Further improvements to document segmentation

Long-term goals:

● Contextual document analysis, aimed at linking structures to text data

Flavours of CLiDE

CLiDE is released in three variants, designed for individual user needs

● CLiDE Standard: designed for the individual chemist who wishes to convert selected images into editable structures for use in reports etc.

● CLiDE Professional: GUI enterprise version to process whole documents with interactive editing

● CLiDE Batch: unsupervised extraction for database creation etc.

Further information

www.keymodule.co.uk

Acknowledgment

● Peter Johnson (Keymodule Ltd. and University of Leeds)

● Anthony P. Cook (University of Leeds)

● Vilmos A. Valko (Keymodule Ltd.)

● Reseller agents: SimBioSys Inc. (North America), NeoTrident Technology Ltd. (China), Hulinks Inc. (Japan)

● All users who gave us constructive feedback

Thank you for your attention