Datech2014-Session1-Document Representation Refinement for Precise Region Description
Document Representation Refinement for Precise Region Description
Christian Clausner, Stefan Pletschacher and Apostolos Antonacopoulos
PRImA Lab, School of Computing, Science and Engineering, University of Salford,
United Kingdom
Document Page Regions
DATeCH 2014 2
Segmentation, Classification
• Region (block, zone): Connected area of a document image with content of a single specific type
• Examples: Text, graphic, table
Region Representation
• By geometric objects
– Bounding box
– Stack of rectangles
– Polygon
• By pixels
– Bitmap
– Run-length encoding
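The representations listed above can be compared on a toy example. The following is a minimal sketch, assuming a region is given as a small 0/1 bitmap; the names `bounding_box` and `run_length_encode` and the tuple layouts are illustrative choices, not taken from any particular format.

```python
from typing import List, Tuple

Bitmap = List[List[int]]      # 1 = region pixel, 0 = background
Run = Tuple[int, int, int]    # (row, first column, run length), hypothetical layout


def bounding_box(bitmap: Bitmap) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom), inclusive, of all region pixels."""
    coords = [(x, y) for y, row in enumerate(bitmap)
              for x, value in enumerate(row) if value]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def run_length_encode(bitmap: Bitmap) -> List[Run]:
    """Encode every horizontal run of region pixels as (row, start, length)."""
    runs = []
    for y, row in enumerate(bitmap):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                runs.append((y, start, x - start))
            else:
                x += 1
    return runs


# An L-shaped region: the bounding box covers 12 pixels, the region only 8.
l_shape = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
box = bounding_box(l_shape)
runs = run_length_encode(l_shape)
```

For the L-shaped region the bounding box includes four pixels that do not belong to the region, while the run-length encoding describes it exactly.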
Need for Precise Region Descriptions
• Precise description is crucial for all but the most trivial document analysis and recognition applications
• For performance evaluation: The loss of quality introduced by imprecise regions can be larger than the variation in accuracy of the actual recognition method
The Situation
• Trend to more precise descriptions, but…
• Output of state-of-the-art OCR systems:
– Stacks of rectangles (ABBYY FineReader Engine 11)
– Bounding boxes (Tesseract OCR 3.02)
• Popular formats for layout analysis and OCR results:
– ALTO XML (boxes, ellipses, polygons (region level only))
– FineReader XML (stacks of rectangles (region level only))
– PAGE XML (polygons for all levels)
– HOCR (boxes)
Refinement through Polygonal Fitting
• Applicable to regions that have child objects in the document model
• A typical object hierarchy contains regions, text lines, words and glyphs (characters)
• Idea: Tightly wrap a polygon around the child objects
Polygonal Fitting Approach
1. Create bitmasks for the child objects and transfer them to an empty bitmap
2. Fill the gaps between the child objects by a smearing approach
3. Optional: Exclude neighbour regions
4. Trace the contour of the foreground and create a polygon
1 – Transferring Child Object to Bitmap
• Starting point: Polygonal object (e.g. text line, word, or glyph)
• Lossless conversion to rectangle-based interval representation
• Transferring the rectangles to the target bitmap
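The transfer step can be sketched as a scanline conversion. This is a minimal illustration, not the conversion used by the authors: it assumes polygon vertices lie on pixel edges, treats each interval as a one-pixel-high rectangle, and uses the hypothetical helpers `polygon_to_intervals` and `paint_intervals`.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Interval = Tuple[int, int, int]   # (row, first column, last column), inclusive


def polygon_to_intervals(polygon: List[Point]) -> List[Interval]:
    """Convert a polygon to one-pixel-high rectangles (even-odd scanline rule)."""
    ys = [y for _, y in polygon]
    intervals = []
    count = len(polygon)
    for row in range(int(min(ys)), int(max(ys)) + 1):
        centre = row + 0.5                      # sample at the pixel-row centre
        crossings = []
        for i in range(count):
            (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % count]
            if (y1 <= centre) != (y2 <= centre):    # edge crosses this scanline
                crossings.append(x1 + (centre - y1) * (x2 - x1) / (y2 - y1))
        crossings.sort()
        for left, right in zip(crossings[0::2], crossings[1::2]):
            first = math.ceil(left - 0.5)       # pixels whose centre lies inside
            last = math.floor(right - 0.5)
            if first <= last:
                intervals.append((row, first, last))
    return intervals


def paint_intervals(bitmap: List[List[int]], intervals: List[Interval]) -> None:
    """Set all pixels covered by the intervals to foreground (in place)."""
    for row, first, last in intervals:
        for x in range(first, last + 1):
            bitmap[row][x] = 1


# A word box given as a polygon, transferred to an empty 4 x 6 bitmap.
word = [(1, 1), (5, 1), (5, 3), (1, 3)]
target = [[0] * 6 for _ in range(4)]
word_intervals = polygon_to_intervals(word)
paint_intervals(target, word_intervals)
```

Calling `paint_intervals` once per child object accumulates all text lines, words or glyphs of a region in the same target bitmap. The sketch assumes every interval lies within the bitmap bounds.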
2 – Smearing Approach
• Goal: Connect all foreground components in the bitmap by filling the gaps in-between
1. Alternately fill horizontal and vertical gaps if they are smaller than a dynamic threshold (the threshold is increased after each iteration)
2. If necessary, use diagonal smearing to connect remaining components
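The iterative smearing can be sketched as follows. This is a simplified illustration under stated assumptions: the start threshold, the increment, the iteration limit and the use of 4-connectivity are all hypothetical choices, and the diagonal smearing fallback is not implemented.

```python
from typing import List

Bitmap = List[List[int]]   # 1 = foreground, 0 = background


def transpose(bitmap: Bitmap) -> Bitmap:
    return [list(column) for column in zip(*bitmap)]


def smear_rows(bitmap: Bitmap, max_gap: int) -> Bitmap:
    """Fill background gaps of at most max_gap pixels between foreground pixels of a row."""
    result = [row[:] for row in bitmap]
    for row in result:
        last_fg = None
        for x, value in enumerate(row):
            if value:
                if last_fg is not None and x - last_fg - 1 <= max_gap:
                    for k in range(last_fg + 1, x):
                        row[k] = 1
                last_fg = x
    return result


def count_components(bitmap: Bitmap) -> int:
    """Count 4-connected foreground components with an iterative flood fill."""
    height, width = len(bitmap), len(bitmap[0])
    seen = set()
    count = 0
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] and (x, y) not in seen:
                count += 1
                seen.add((x, y))
                stack = [(x, y)]
                while stack:
                    cx, cy = stack.pop()
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            stack.append((nx, ny))
    return count


def smear_until_connected(bitmap: Bitmap, start_gap: int = 1, step: int = 1,
                          max_iterations: int = 100) -> Bitmap:
    """Alternate horizontal and vertical smearing with a growing gap threshold."""
    result = [row[:] for row in bitmap]
    gap = start_gap
    for _ in range(max_iterations):
        if count_components(result) <= 1:
            break
        result = smear_rows(result, gap)                         # horizontal gaps
        result = transpose(smear_rows(transpose(result), gap))   # vertical gaps
        gap += step                                              # raise the threshold
    return result


# Two words on one line and a second line below them.
children = [
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
smeared = smear_until_connected(children)
```

Starting with a small threshold and increasing it gradually keeps the filled area close to the child objects: narrow gaps are closed first, and wider gaps are only bridged when components are still disconnected.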
3 – Subtraction of Neighbours
• Optional step to avoid overlap with adjacent regions
• Simply erase the corresponding pixels from the created bitmap
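Erasing neighbour pixels amounts to a per-pixel mask subtraction. A minimal sketch, assuming the region and its neighbours are available as 0/1 bitmaps of identical size; the function name `subtract_neighbours` is hypothetical.

```python
from typing import List

Bitmap = List[List[int]]   # 1 = foreground, 0 = background


def subtract_neighbours(region: Bitmap, neighbours: List[Bitmap]) -> Bitmap:
    """Return a copy of the region bitmap with all neighbour pixels erased."""
    result = [row[:] for row in region]
    for neighbour in neighbours:
        for y, row in enumerate(neighbour):
            for x, value in enumerate(row):
                if value:
                    result[y][x] = 0
    return result


# The smeared region spills into the column occupied by an adjacent region.
region_mask = [
    [1, 1, 1],
    [1, 1, 1],
]
adjacent_mask = [
    [0, 0, 1],
    [0, 0, 1],
]
trimmed = subtract_neighbours(region_mask, [adjacent_mask])
```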
4 – Outline Tracing
• Trace the contour of the foreground component in the created bitmap
• Create polygon on-the-fly by adding points for each change of direction (corner)
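Outline tracing with on-the-fly corner detection can be sketched by following the boundary edges between foreground and background pixels. This is an illustrative variant, not necessarily the tracing used by the authors; it assumes a single foreground component in which no two pixels touch only at a corner, and it returns the outer contour only.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]   # pixel-corner coordinates (x, y), y pointing down


def trace_outline(bitmap: List[List[int]]) -> List[Point]:
    """Trace the outer contour clockwise and return its corner points."""
    height, width = len(bitmap), len(bitmap[0])

    def is_fg(x: int, y: int) -> bool:
        return 0 <= x < width and 0 <= y < height and bitmap[y][x] == 1

    # Directed boundary edges, keyed by their start point.
    next_point: Dict[Point, Point] = {}
    for y in range(height):
        for x in range(width):
            if not is_fg(x, y):
                continue
            if not is_fg(x, y - 1):                    # top edge, heading right
                next_point[(x, y)] = (x + 1, y)
            if not is_fg(x + 1, y):                    # right edge, heading down
                next_point[(x + 1, y)] = (x + 1, y + 1)
            if not is_fg(x, y + 1):                    # bottom edge, heading left
                next_point[(x + 1, y + 1)] = (x, y + 1)
            if not is_fg(x - 1, y):                    # left edge, heading up
                next_point[(x, y + 1)] = (x, y)
    if not next_point:
        return []

    # The top-left corner of the first foreground pixel lies on the outer contour.
    start = min(next_point, key=lambda p: (p[1], p[0]))
    polygon: List[Point] = []
    previous_direction = None
    point = start
    while True:
        following = next_point[point]
        direction = (following[0] - point[0], following[1] - point[1])
        if direction != previous_direction:            # change of direction: a corner
            polygon.append(point)
        previous_direction = direction
        point = following
        if point == start:
            break
    return polygon


# An L-shaped foreground component.
l_shape = [
    [1, 0],
    [1, 1],
]
outline = trace_outline(l_shape)
```

Because points are added only where the direction changes, straight runs of boundary pixels contribute no intermediate points and the polygon stays compact.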
Experiments
• Carried out on a dataset of contemporary documents consisting of scanned magazine and technical article pages
• Processed with Tesseract OCR 3.02 (open source)
• Exported to PAGE XML with and without refinement
[Figure: example region outlines, original (unrefined) vs. refined]
Results
• Measurement of region overlaps (number and area)
                     Overlapping Regions    Overlap Area (Megapixel)
Original Outlines    621 (45.8%)            19.9
Refined Outlines     286 (21.1%)            2.5
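A measurement of this kind can be sketched as follows. How overlaps were counted in the experiments is not specified here, so this is only one plausible definition: a region counts as overlapping if it shares at least one pixel with another region, and the overlap area is the number of pixels covered by more than one region. The helpers `rectangle` and `overlap_statistics` are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def rectangle(left: int, top: int, right: int, bottom: int) -> Set[Pixel]:
    """Pixel set of a rectangle; right and bottom are exclusive."""
    return {(x, y) for y in range(top, bottom) for x in range(left, right)}


def overlap_statistics(regions: List[Set[Pixel]]) -> Tuple[int, int]:
    """Return (number of overlapping regions, overlap area in pixels)."""
    coverage = Counter(pixel for region in regions for pixel in region)
    shared = {pixel for pixel, hits in coverage.items() if hits > 1}
    overlapping = sum(1 for region in regions if region & shared)
    return overlapping, len(shared)


# Two bounding boxes that intersect and one isolated box.
page_regions = [
    rectangle(0, 0, 4, 4),
    rectangle(2, 2, 6, 6),
    rectangle(10, 10, 12, 12),
]
overlapping_count, overlap_area = overlap_statistics(page_regions)
```

Regions described by refined polygons would first be rasterised to pixel sets in the same way before being compared.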
Impact on Performance Evaluation
• Real-world scenario
• Measure the performance of Tesseract OCR engine
• Evaluation metrics of previous ICDAR page segmentation competitions
Average success rate using original outlines: 81.1%
Average success rate using refined outlines: 84.5%
Average improvement for all documents: 3.4%
Maximum improvement: 22.9%
Conclusion
• Existing geometric region data can be significantly refined by fitting precise polygons around child objects
• Validity and impact on real-world scenarios have been shown
• Refinement in performance evaluation helps to eliminate problems that arise from insufficient geometric descriptions → Concentrate on real issues of OCR methods
• Positive effect on accuracy of presentation/repurposing systems (highlighting, cropping, article tracking, etc.)
• Approach used in Aletheia ground truth editor and result viewer (primaresearch.org/tools)