IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
OCR for Typewritten Documents
Stefan Pletschacher
Overview
Stefan Pletschacher - OCR for Typewritten Documents, IMPACT Conference, London, 25.10.2011
Introduction to Typewritten OCR
Document Types and Challenges
Specific Approaches
Results
Hansen Writing Ball, Source: Wikipedia
(The) Short History
1870: first commercially manufactured typewriter
1970s-80s: first PCs and desktop printers
IBM 5150 PC, 1981, Source: Wikipedia
Sholes and Glidden typewriter, 1873, Source: Wikipedia
Typewritten Documents
Millions of pages of significant typewritten documents exist in archives and libraries
– The majority of administrative and individually produced documents of the 20th century
Typewritten documents pose unique challenges to recognition
– Each character is produced independently of the rest: glyphs can appear with different intensity/weight even within the same word
– Carbon copies are common: glyphs are blurred and connected to each other, and the background is textured
– Content: administrative documents with names, abbreviations, numbers etc., which render lexicon-based recognition approaches less useful
In addition, the usual degradations of historical documents are present due to ageing and use
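The uneven glyph intensity described above can be illustrated with a small sketch. It is not the enhancement method used in IMPACT, which the slides do not specify; it assumes a simple per-glyph linear contrast stretch, with each glyph given as a list of rows of grey values (0 = black ink, 255 = white paper). The function name `stretch_glyph_contrast` is hypothetical.

```python
def stretch_glyph_contrast(glyph):
    """Linearly rescale one glyph's grey values to the full 0..255 range.

    Assumption: `glyph` is a non-empty list of rows of integers in 0..255.
    Each glyph is treated on its own, so a lightly struck character ends up
    with the same contrast as a heavily struck one.
    """
    flat = [pixel for row in glyph for pixel in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Completely flat image: nothing to stretch, return a copy.
        return [row[:] for row in glyph]
    # Integer arithmetic keeps the result exact for the extreme values.
    return [[(pixel - lo) * 255 // (hi - lo) for pixel in row] for row in glyph]


# Two strikes of the same character: one heavy (dark ink), one light (faint ink).
heavy = [[255, 20], [20, 255]]
light = [[255, 140], [140, 255]]

normalised_heavy = stretch_glyph_contrast(heavy)
normalised_light = stretch_glyph_contrast(light)
```

After stretching, both strikes have identical pixel values, so a later classifier no longer sees the difference in key pressure. A global stretch over the whole page would not achieve this, because the faint glyph would stay faint relative to its neighbours.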
Document Types and Challenges
Document types:
– Manuscripts
– Scientific publications
– Index cards
– Administrative documents
– Letters
– …
Challenges:
– Annotations
– Abbreviations and names
– Carbon copies (low contrast)
– Punch holes, staples etc.
– Damage from regular handling (folds, tears, stains)
– Discoloured paper (often unevenly)
Some Examples
Specific Approaches
Incorporate background knowledge about typewritten documents
Pre-processing
– Improved glyph segmentation
– Enhancement of individual glyph images
Recognition
– Perform language-independent character recognition using specifically trained classifiers
– Voting engine
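How a voting engine might combine several character classifiers can be sketched as follows. The slides do not describe the actual combination rule, so this assumes the simplest plausible scheme: each classifier proposes a label with a confidence, and each classifier has a fixed weight. The function name, the classifier names and the weight values are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Pick the label with the highest weighted confidence.

    predictions: {classifier_name: (label, confidence)}
    weights:     {classifier_name: weight}; missing names default to 1.0
    (Assumed scheme, not the documented IMPACT voting rule.)
    """
    scores = defaultdict(float)
    for name, (label, confidence) in predictions.items():
        scores[label] += weights.get(name, 1.0) * confidence
    # Iterate over sorted labels so that ties are resolved deterministically.
    return max(sorted(scores), key=scores.get)


# The typewriter classic: lowercase "l" versus digit "1".
predictions = {
    "template_matching": ("l", 0.9),
    "feature_classifier": ("1", 0.8),
}

trust_features = weighted_vote(
    predictions, {"template_matching": 0.4, "feature_classifier": 0.6}
)
trust_templates = weighted_vote(
    predictions, {"template_matching": 0.6, "feature_classifier": 0.4}
)
```

With the first set of weights the digit "1" wins (0.48 against 0.36); with the second the letter "l" wins (0.54 against 0.32). Because no lexicon is involved, the decision stays language independent, which matches the stated goal of the recognition step.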
Typewritten OCR
[Workflow diagram; labels recovered from the slide in processing order]
– Document Image (greyscale)
– Binarisation → Document Image (black-and-white)
– Region Segmentation → PAGE XML (with text regions)
– Text Line Segmentation → PAGE XML (with text lines)
– Glyph Segmentation → Glyph Elements
– TOCR: Glyph Enhancement → Enhanced Glyphs → Composite Character Recognition (Template Matching, Feature-based Classifier, ...; combined by a Voting Engine with Weights) → Glyph Elements (with text)
– PAGE Exporter (includes word composition) → PAGE XML (completely filled)
PAGE XML skeleton shown for each PAGE XML stage: <?xml version="1.0"?><PcGts> <Page> <Region/> </Page></PcGts>
System developed in IMPACT
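The last step of the workflow, exporting recognised glyphs to PAGE XML with word composition, can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustration only: it assumes the region, text line, word and glyph nesting of the PAGE format, and it deliberately omits the XML namespace, coordinates and metadata, so the output is not schema-valid PAGE XML. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_text(parent, text):
    """Attach the recognised text to an element as TextEquiv/Unicode."""
    equiv = ET.SubElement(parent, "TextEquiv")
    ET.SubElement(equiv, "Unicode").text = text


def build_page(lines):
    """Build a simplified PAGE-like tree.

    lines: list of text lines, each a list of words, each word a sequence
    of recognised glyph characters (a plain string works too).
    """
    root = ET.Element("PcGts")
    page = ET.SubElement(root, "Page")
    region = ET.SubElement(page, "TextRegion", id="r1")
    for li, words in enumerate(lines, 1):
        line = ET.SubElement(region, "TextLine", id=f"r1l{li}")
        for wi, glyphs in enumerate(words, 1):
            word = ET.SubElement(line, "Word", id=f"r1l{li}w{wi}")
            for gi, char in enumerate(glyphs, 1):
                glyph = ET.SubElement(word, "Glyph", id=f"r1l{li}w{wi}g{gi}")
                add_text(glyph, char)
            # Word composition: the word text is the concatenated glyph text.
            add_text(word, "".join(glyphs))
        add_text(line, " ".join("".join(glyphs) for glyphs in words))
    return root


root = build_page([["IMPACT", "OCR"]])
xml_text = ET.tostring(root, encoding="unicode")
```

Text is stored at every level, so a consumer can read whole lines without losing the per-glyph results that the recognition stage produced.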
Some Results
Top: Commercial OCR
Bottom: IMPACT Typewritten OCR prototype
More complete results and thus higher overall accuracy
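The slides do not say how accuracy was measured. One common choice, assumed here purely for illustration, is character accuracy based on the edit distance between the ground truth and the OCR output. The sketch shows why an incomplete result (missing characters) lowers accuracy even when every recognised character is correct.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_text):
    """1 - (edit distance / ground-truth length), clipped at 0 (assumed metric)."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1 - errors / len(ground_truth))


complete = character_accuracy("typewriter", "typewriter")
truncated = character_accuracy("typewriter", "typewr")
```

Here `truncated` loses four of ten characters, so it scores well below `complete` although it contains no wrong character.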
For more information visit:
PRImA: http://www.primaresearch.org
IMPACT: http://www.impact-project.eu