Imago OCR: Open-source toolkit for chemical structure image recognition
14/08/2012 GGA Software Services LLC 1
http://ggasoftware.com/opensource/imago/
Project goals
• Perform optical chemical structure recognition on a wide range of raster images:
– different image formats
– varying scan quality (or even photos)
– complex structures and uncommon features
• Provide a complete toolset for embedding the recognition engine in any other application
Applications
• Automated processing of articles and patents
– similarity analysis
• Chemical database search (PubChem, etc.)
• “Deep Web” indexing
– development of a universal chemical search engine;
– conversion of human-readable data to machine-readable formats
Use case
[Diagram: Source image → imago → MOL format]
Input:
• BMP, DIB, JPG, JPE, PNG, PBM, PGM, PPM, SR, RAS, TIFF
• Images from a scanner or camera
• PDF documents
Output:
• MDL Molfile
• SMILES (requires Indigo)
• Rendered image (requires Indigo)
Supported features
• Multiple bonds
• Single-up & single-down bonds
• Bridged bonds
• Aromatic rings
Supported features
• Superatom labels, charges, isotopes
• Abbreviation expansion
• R-group handling
• Query features
Engine structure
[Diagram: engine pipeline]
• Stages: Image loader → Prefilter & Binarization → Vectorization & Separation → Logical layout analyzer → Molecule export
• Levels: Raster level, Primitives level, Structural level
Preliminary filters
• Pass-through filter
– For rendered images (only binarization)
• Cross-correlation based filter
– For scanned images (quite fast)
• Logical analysis based filter
– For low-quality photos
– Takes some time for processing
• Imago can auto-detect the suitable filter
Cross-correlation based filter
[Figure panels: source image, strong threshold, weak threshold, filter result]
• Filter result: an image assembled from the weak-threshold segments whose cross-correlation (CC) value with the corresponding strong-threshold segments passes the restrictions
Logical analysis based filter
• Removes noise (spots, glare)
• Suitable for out-of-focus images
• Can process low-contrast images
• Removes unusual artifacts
• Deals with multicolor photos
• Keywords: Wiener filtering, wave algorithm, weak segmentation
Preliminary separation
• Separate labels and graphics:
– Hu moments classifier (d1)
– Contours analysis (d2)
– Approximation criteria (d3)
• An object is a symbol if f(d1, d2, d3) > c0
Vectorization
• Convert pixels to a matching polyline:
– Minimize the mean distance between the original and the vectorized structure
– Apply a penalty for extra segments
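The trade-off between fit quality and segment count can be sketched with a small dynamic program. This is only an illustration, not Imago's actual vectorizer: the function names, the cost definition (mean point-to-segment distance plus a fixed penalty per segment) and the sample penalty values are all assumptions made for this example.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so we stay on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def vectorize(points, segment_penalty):
    """Pick polyline vertices from an ordered pixel chain.

    Minimizes (mean distance of chain points to the polyline)
    + segment_penalty * (number of segments). Hypothetical cost function.
    """
    n = len(points)
    best = [math.inf] * n   # best[j]: minimal cost of a polyline ending at point j
    prev = [0] * n          # prev[j]: previous vertex on that polyline
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            fit = sum(point_segment_distance(points[k], points[i], points[j])
                      for k in range(i + 1, j)) / n
            cost = best[i] + fit + segment_penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Walk back from the last point to recover the chosen vertices.
    path = [n - 1]
    while path[-1] != 0:
        path.append(prev[path[-1]])
    return [points[i] for i in reversed(path)]

# An L-shaped pixel chain: a horizontal run followed by a vertical run.
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
corner_kept = vectorize(chain, segment_penalty=0.1)
corner_lost = vectorize(chain, segment_penalty=10.0)
```

With a small penalty the corner survives as a vertex; with a very large penalty a single segment becomes cheaper, which is why the penalty has to be balanced against the distance term.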
Logical layout analysis
• Mapping labels to bonds
– Grouping labels into superatoms
• Finding multiple bonds
– Dissolving short edges
– Connecting bridged bonds
• Removing clearly unrelated captions
• Detecting aromatic rings
– Determining stereo bond orientation and aromatizing the molecule if circles are present
Adaptive methods or particular cases?
• Adaptive methods
– Based on optimization of some function
– Wider input class range
– Probably better results in hard cases
• Particular-case methods
– Based on some criteria
– Stability
– Good performance
– Easier implementation
Particular case methods
• What is it?
• Line? Tested line criteria: no.
• Character? Tested against ‘A’: no. … Tested against ‘Z’: no.
• Ring? no.
• Unrecognizable object – ignore.
Adaptive methods
• What is it?
• Line? Approximation: d=1.6
• Character? Compared with ‘C’: d=6.1 … Compared with ‘L’: d=3.2
• Ring? Approximation: d=653.3
• Final decision depends on neighbors
Decision tree
[Diagram: decision tree for an ambiguous object next to a recognized bond and a “C” label]
• Already recognized: bond with d=0.0; “C” with d=0.1 (label with d=0.1, almost surely recognized)
• If the object is a bond, the segment group is recognized as bond + label with d=0.1+1.6=1.7
• If the object is the letter ‘l’, the segment group is recognized as bond + label of two characters with d=0.0+0.1+3.2=3.3
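The comparison made in this decision tree is a sum of per-element distances followed by picking the smaller total. The sketch below reproduces only that arithmetic with the values from the slide; the dictionary layout and the hypothesis names are assumptions made for illustration.

```python
def total_distance(parts):
    """Sum the metric values of all elements in one interpretation."""
    return sum(d for _, d in parts)

# Two competing interpretations of the same segment group (values from the slide).
hypotheses = {
    "object is a bond": [("label 'C'", 0.1), ("object as bond", 1.6)],
    "object is the letter 'l'": [("bond", 0.0), ("'C'", 0.1), ("object as 'l'", 3.2)],
}

totals = {name: total_distance(parts) for name, parts in hypotheses.items()}
best = min(totals, key=totals.get)
```

Here the bond interpretation wins because its total (about 1.7) is lower than that of the two-character label (about 3.3).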
Metrics
• For symbols
– Distance between Fourier descriptor sets
• For graphics
– Distance between the approximated and source images
• For single-up bonds
– f(average fill, relative size, etc.)
• For single-down bonds
– f(distance between segments, line thickness, etc.)
• … (every recognition method has a metric function)
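As an illustration of the symbol metric, the sketch below computes one common textbook form of Fourier descriptors (magnitudes of low-order DFT coefficients of the contour, normalized by the first harmonic) and the Euclidean distance between two descriptor sets. The slides do not say which descriptor variant or distance Imago uses, so the function names, the normalization and the number of coefficients are assumptions.

```python
import cmath
import math

def fourier_descriptors(contour, count=4):
    """Translation-, scale- and rotation-invariant descriptors of a closed contour.

    contour: list of (x, y) points, assumed to be traced counter-clockwise.
    """
    n = len(contour)
    z = [complex(x, y) for x, y in contour]

    def coefficient(k):
        # k-th coefficient of the discrete Fourier transform of the contour.
        return sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n

    scale = abs(coefficient(1))  # first harmonic carries the overall size
    if scale == 0:
        raise ValueError("degenerate contour: first harmonic is zero")
    # Skipping k = 0 removes the position; dividing by scale removes the size.
    return [abs(coefficient(k)) / scale for k in range(2, 2 + count)]

def descriptor_distance(contour_a, contour_b):
    """Euclidean distance between the descriptor sets of two contours."""
    return math.dist(fourier_descriptors(contour_a), fourier_descriptors(contour_b))

square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
moved_square = [(3 * x + 10, 3 * y - 5) for x, y in square]
rectangle = [(0, 0), (3, 0), (6, 0), (6, 1), (6, 2), (3, 2), (0, 2), (0, 1)]
```

A scaled and shifted copy of the square gives a distance of practically zero, while the elongated rectangle gives a clearly positive distance, which is the property a symbol metric needs.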
Labels correction
• Any recognized symbol can have alternatives:
– [symbol image]: A (metric value of 3.2), R (4.9), P (5.0)
• Imago keeps information about probable captions (periodic table, abbreviations)
• Labels correction: select the combination of symbol alternatives that forms a probable caption and has the minimal sum of metric values
• Allows recognition of partially broken labels
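A minimal sketch of this selection step follows. It enumerates every combination of per-symbol alternatives, keeps those found in a caption dictionary, and returns the one with the smallest sum of metric values. The dictionary contents, the alternative lists and the function name are made up for the example; Imago's real caption tables are not shown in the slides.

```python
from itertools import product

# Hypothetical subset of known captions (elements and abbreviations).
KNOWN_CAPTIONS = {"C", "Cl", "Br", "OH", "Ph", "Me"}

def correct_label(alternatives, known=KNOWN_CAPTIONS):
    """alternatives: one list of (character, metric value) pairs per symbol.

    Returns (caption, total metric) for the best known caption, or None.
    """
    best = None
    for combo in product(*alternatives):
        text = "".join(ch for ch, _ in combo)
        if text not in known:
            continue  # not a probable caption
        cost = sum(d for _, d in combo)
        if best is None or cost < best[1]:
            best = (text, cost)
    return best

# The second symbol looks most like the digit '1', but "C1" is not a known caption.
label = correct_label([
    [("C", 0.2), ("G", 1.5)],
    [("1", 0.3), ("l", 0.6), ("I", 0.8)],
])
```

The raw best guess would be "C1", yet the dictionary check steers the result to "Cl" at a slightly higher total metric, which is how a partially broken label can still be read.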
Recognition
• Image recognition is a search for the vectorized result that gives the minimal distance between the vectorized form and the original image
• Can be formalized depending on the metrics
• The search is exhaustive
– Needs some restrictions to achieve good speed
Trade-off: restricted adaptive methods
• Limit metric values: d < 0.5 – surely; d > 10.0 – impossible
• Limit Euclidean distances for the neighbor search (up to 100 pixels)
• Limit the number of alternatives (no more than 10)
• Assume the image filling rate is less than 10%
• Assume the distance between single-down bond segments is in the range 5..10 pixels
• Assume the symbol aspect ratio is in the range 0.5..2.0
• Some more assumptions with “magic” constants
• Gains speed and stability
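The first and third restrictions can be sketched as a pruning step applied to the candidate list before the search continues. The thresholds below are the ones quoted on the slide; the function name, the candidate representation and the early-accept behavior are assumptions for illustration.

```python
SURE_BELOW = 0.5          # d < 0.5: surely recognized
IMPOSSIBLE_ABOVE = 10.0   # d > 10.0: impossible
MAX_ALTERNATIVES = 10     # no more than 10 alternatives are kept

def prune_alternatives(candidates):
    """candidates: list of (name, d) pairs; returns the alternatives worth exploring."""
    ranked = sorted(candidates, key=lambda c: c[1])
    if ranked and ranked[0][1] < SURE_BELOW:
        return ranked[:1]  # accept outright and stop searching
    feasible = [c for c in ranked if c[1] <= IMPOSSIBLE_ABOVE]
    return feasible[:MAX_ALTERNATIVES]

sure = prune_alternatives([("C", 0.1), ("G", 2.0), ("O", 4.0)])
open_case = prune_alternatives([("A", 3.2), ("R", 4.9), ("P", 5.0), ("ring", 653.3)])
```

In the first call the search ends immediately with "C"; in the second the ring hypothesis is dropped as impossible and three alternatives remain for later stages.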
Configuration clusters
• For scanned images
– Strict adaptive method limits (fast, <300 ms per image)
• For photos and low-quality images
– Flexible limits (less than a second per image on average)
• For high-resolution images
– up to 5 seconds
• For handwritten structures
– up to 10 seconds in complex cases
• Imago supports auto-detection of the suitable configuration cluster
Configuration cluster creation
• Allows a better recognition success rate for a specific image type:
– different render type
– images captured differently (scanner type, lighting conditions, etc.)
• The process is automated
– a test set of the target image type is required
– takes some time
– machine learning application
Machine learning
• Test set: a number of pairs (image; related MDL molfile)
• Imago will tune the method parameters to gain the best score on the test collection
– Metrics included
– No information directly related to the test set (such as a character table) is stored
• Criteria for the complete set are derived from a small subset of the same type
Learning effectiveness
• Used the Image2Structure test set with a different renderer:
• Initial results (before training): 202/944 correct, similarity value: 74.54%
• Trained on a set of 50 images with the new renderer
• Trained results: 831/944 correct, similarity value: 98.33% on the whole set
Comparison: overall scores 1
• Image2Structure set from TREC 2011 Chemical IR Track (ambiguous and partial structures removed): original files

                         OSRA 1.4.0   Imago 1.0   Imago 2.0 beta 1
Absolutely correct       769 / 944    540 / 944   861 / 944
Almost correct (1)       +31          +49         +43
Average time             2.54 s       0.20 s      0.31 s
Average similarity (2)   94.57%       89.59%      98.26%

(1) similarity value is greater than 95%
(2) ratio of correct elements (atoms and bonds); extra and missing elements are counted too
Comparison: overall scores 2
• Image2Structure set re-rendered from the corresponding molfiles

                         OSRA 1.4.0   Imago 1.0   Imago 2.0 beta 1
Absolutely correct       796 / 944    604 / 944   831 / 944
Almost correct (1)       +20          +58         +29
Average time             4.57 s       0.47 s      1.24 s
Average similarity (2)   93.45%       95.38%      98.33%

(1) similarity value is greater than 95%
(2) ratio of correct elements (atoms and bonds); extra and missing elements are counted too
Common issues resolved
[Figure: source images compared with OSRA and Imago output for three cases: large gap, lines too close, no more symbols]
Imago Library
• API: a set of methods for
– Image loading
– Configuration cluster setup
– Retrieving molfile results
– Partial processing (filtering, approximation, validation)
• Bindings for C/C++, Java
• Cross-platform implementation (Windows, Linux, Mac)
• Dependencies:
– Boost library (Boost Software License)
– OpenCV library (BSD license)
– Indigo (optional)
Thank you for your attention!
• Imago OCR: http://ggasoftware.com/opensource/imago/
• Try the Imago recognition engine online: http://ggasoftware.com/opensource/imago/online/
Appendix A. Imago: technical details
Pass-through prefilter
• Count black, white, and other pixels
• If (black + white) > t0 ∙ others,
– recolor others to black → image is binarized
– else schedule another prefilter call
• Perform an accurate image downscale when the image is too large (>5 Mpix)
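The counting rule above can be sketched on a plain grayscale grid. This is a simplified illustration, not the library's code: it assumes 8-bit pixels where exactly 0 is black and 255 is white, uses a made-up default for t0, and leaves out the downscale step.

```python
def pass_through_prefilter(image, t0=10.0, black=0, white=255):
    """image: list of rows of grayscale values (0..255).

    Returns the binarized image, or None when another prefilter should be scheduled.
    The default t0 is a hypothetical value, not an Imago constant.
    """
    pixels = [p for row in image for p in row]
    n_black = sum(1 for p in pixels if p == black)
    n_white = sum(1 for p in pixels if p == white)
    n_other = len(pixels) - n_black - n_white
    if n_black + n_white > t0 * n_other:
        # Nearly two-color image: recolor the remaining pixels to black.
        return [[white if p == white else black for p in row] for row in image]
    return None  # too many intermediate tones, try a heavier prefilter

rendered = [
    [255, 255, 255, 255],
    [255,   0, 128, 255],
    [255, 255, 255, 255],
]
photo_like = [[120, 130, 140], [125, 135, 145]]

binarized = pass_through_prefilter(rendered)
rejected = pass_through_prefilter(photo_like)
```

A rendered image with a single antialiased pixel passes and is binarized, while an image made only of intermediate tones is handed on to another prefilter.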
Cross-correlation prefilter
• Smooth the source image → smoothed
– Pyramidal reduce 2x, then pyramidal upsample 2x
• Apply an adaptive threshold binarization filter to the smoothed image:
– With threshold t0 → strong
– With threshold t1 → weak
• Segment the (strong, weak) images using the wavemap algorithm
• For each weak segment, find the corresponding strong segment and calculate the intersection:
– If the ratio of the intersection area to the original segment area is less than c0, remove this segment (bad segment)
• If the reassembled image contains a rectangular structure R, crop the image to the inner dimensions of R (to locate molecules)
• Calculate the average pixel intensity of the good segments and try to add other pixels whose intensity passes this boundary (if they do not affect segment connectivity)
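The segment selection step (keep a weak segment only if enough of it is covered by a strong segment) can be sketched with pixel sets. The smoothing, thresholding, segmentation, cropping and pixel re-adding steps are not covered here. Representing segments as sets of (x, y) pixels, taking the weak segment's area as the "original segment area", and the default c0 are assumptions for the example.

```python
def filter_weak_segments(weak_segments, strong_segments, c0=0.3):
    """Keep weak-threshold segments that overlap a strong-threshold segment enough.

    Each segment is a set of (x, y) pixel coordinates. c0 is a hypothetical default.
    """
    kept = []
    for weak in weak_segments:
        if not weak:
            continue
        # Best-matching strong segment = the one sharing the most pixels.
        overlap = max((len(weak & strong) for strong in strong_segments), default=0)
        if overlap / len(weak) >= c0:
            kept.append(weak)  # good segment
    return kept

bond = {(0, 0), (0, 1), (0, 2), (0, 3)}           # weak segment of a real line
bond_core = {(0, 1), (0, 2)}                      # its strong-threshold counterpart
noise = {(5, y) for y in range(10)}               # weak-only blob, e.g. a shadow

good_segments = filter_weak_segments([bond, noise], [bond_core])
```

The line survives because half of its weak pixels also pass the strong threshold; the blob that appears only at the weak threshold is discarded.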
Separator details
• Given a binarized set of segments, classify them into two main groups: letters and chemical bond representations
• The classification result is based on the value of C = k0 ∙ r0 + k1 ∙ r1 + k2 ∙ r2
– where (r0, r1, r2) are the submethod results
– and (k0, k1, k2) are configurable weight constants
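The weighted combination can be written out directly. In the sketch below the weights and the decision threshold are made-up values, and the direction of the decision (a higher C means "bond") is an assumption based on the submethod results being described elsewhere in the appendix as bond indicators or bond probabilities.

```python
def separator_score(r, k):
    """C = k0*r0 + k1*r1 + k2*r2 for submethod results r and weights k."""
    return sum(ki * ri for ki, ri in zip(k, r))

def classify_segment(r, k=(0.4, 0.3, 0.3), threshold=0.5):
    """Hypothetical decision rule: treat the segment as a bond when C exceeds the threshold."""
    return "bond" if separator_score(r, k) > threshold else "symbol"

line_like = classify_segment((1, 0.9, 0.8))    # all submethods lean towards a bond
letter_like = classify_segment((0, 0.2, 0.1))  # all submethods lean towards a symbol
```

Because the weights are configurable, a configuration cluster could, for example, rely more on the contour analysis for scanned images; the values above only illustrate the mechanics.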
Separator: Hu moments
• Hu moments usually differ for characters and bonds, so a classification tree can be computed:
– symbols: r0 = 0
– bonds: r0 = 1
• Note: some objects cannot be classified that way
Separator: contours analysis
• Extract the outer contour of the binarized segment S:
– approximate the chain contour using the Teh-Chin chain approximation algorithm;
– approximate the polygon once again, taking the line thickness as the approximation parameter;
– calculate the offsets of the contour points by a clockwise step;
– the output is a chain of sequential vectors normalized by the perimeter.
• Compare the resulting chain to the set of patterns describing valid structures
– The set consists of 8x8 matrices where cell (j, k) denotes the probability of changing from the jth direction to the kth.
• The result of this method is r1 – the probability of {S is a bond}
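To make the 8x8 pattern matrices concrete, the sketch below quantizes the edges of a closed polygon into 8 directions, builds a direction transition matrix from an example shape, and scores another chain against it. How Imago builds its patterns and turns the comparison into r1 is not described in the slides, so the quantization, the row normalization and the averaging score are assumptions.

```python
import math

def edge_directions(points):
    """Quantize each edge of a closed polygon into one of 8 directions (45° steps)."""
    n = len(points)
    dirs = []
    for i in range(n):
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
        angle = math.atan2(y1 - y0, x1 - x0)
        dirs.append(round(angle / (math.pi / 4)) % 8)
    return dirs

def transition_matrix(points):
    """8x8 matrix: cell (j, k) is the observed frequency of turning from direction j to k."""
    dirs = edge_directions(points)
    counts = [[0] * 8 for _ in range(8)]
    for i, j in enumerate(dirs):
        counts[j][dirs[(i + 1) % len(dirs)]] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]

def chain_score(points, pattern):
    """Average pattern probability over the direction changes of the chain."""
    dirs = edge_directions(points)
    steps = [pattern[j][dirs[(i + 1) % len(dirs)]] for i, j in enumerate(dirs)]
    return sum(steps) / len(steps)

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
triangle = [(0, 0), (2, 0), (0, 2)]
square_pattern = transition_matrix(square)

same_shape = chain_score(square, square_pattern)
other_shape = chain_score(triangle, square_pattern)
```

A chain that follows the pattern's direction changes scores 1.0, while the triangle matches only one of its three turns and scores lower.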
Separator: approximation criteria
• For a given segment S, calculate its best approximation with n line segments (d0) and the closest distance to the most probable character (d1)
– If d1 < d0 and n > n0, the segment probably represents a character
• Check its width/height ratio and its height/average_height ratio: apply penalty p0 if these criteria are not met
• Result is r2 = 1 - (d1 [+ p0]) – probability of {S is a bond}
– Result is r2 = d0 – probability of {S is a bond}
Bonds skeleton analysis
• Dissolve short edges
• Join closest vertices
• Dissolve intermediate vertices
• Find multiple edges
• Connect bridged bonds
• Shrink short bonds
• Detect and mark suspicious edges
Basic labels analysis
• Location analysis: check against the baseline
– Subscripts are below the line
– Capitals are mostly above the line
• Calculate distances to all possible characters
• Alter distances using topological features
• Select the best candidate and calculate the recognition quality
Superatoms analysis
• Concatenate recognized characters into labels
• Check chemical validity
• If the validity check fails – try to find the most probable alternative using other distance map elements
• If no such alternative is found – try to recognize the less probable characters as bonds
• Handle R-group semantics and special characters: X, Q, A
Appendix B. Imago: workflow features
Related continuous integration system
[Screenshot: continuous integration system showing the versions list, results estimation, and test sets]
Explanation: continuous integration
• Some logically grounded changes may decrease the recognition rate → a convenient tracking tool is required
• Good way to improve overall stability
• Useful visual representation of the machine-learning progress
Embedded HTML-based logging system
[Screenshot: HTML log with embedded images, performance counters, variables and parameters dump, and call hierarchy]
Explanation: logging system
• Structured logs (reports) offer
– A convenient way to detect bugs;
– An exact visual representation of the internal processes;
• Several improvements may become evident just by looking through the logs
• The performance decrease is comparable to that of (usual) plaintext logs
• Stability is not affected
Authors
• Rostislav Chutkov
• Michael Rybalkin
• Kliton Andrea
• Victor Smolov
• GGA Software Services LLC