IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR for Typewritten Documents Stefan Pletschacher

Transcript of IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

Page 1: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

OCR for Typewritten Documents
Stefan Pletschacher

Page 2: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

Overview

Stefan Pletschacher - OCR for Typewritten Documents, IMPACT Conference, London, 25.10.2011 2

Introduction to Typewritten OCR
Document Types and Challenges
Specific Approaches
Results

Hansen Writing Ball, Source: Wikipedia

Page 3: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

(The) Short History

1870: first commercially manufactured typewriter
1970s-80s: first PCs and desktop printers

Sholes and Glidden typewriter, 1873, Source: Wikipedia
IBM 5150 PC, 1981, Source: Wikipedia

Page 4: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

Typewritten Documents

Millions of pages of significant typewritten documents exist in archives and libraries
– Most administrative and individually produced documents of the 20th century

Typewritten documents pose unique challenges to recognition

– Each character is produced independently of the rest – glyphs can appear with different intensity/weight even within the same word

– Carbon copies are common – glyphs are blurred, connected to each other and the background is textured

– Content – administrative documents with names, abbreviations, numbers etc., which render lexicon-based recognition approaches less useful

In addition, the usual degradations of historical documents are present due to ageing and use
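The point about glyphs varying in intensity within a word can be illustrated with a small sketch. It is not the method used in the IMPACT prototype; it only shows, on made-up pixel values, why a single page-wide threshold can lose a faintly struck character while a threshold chosen per glyph image keeps it. The function names, the 3x3 "glyphs" and the threshold value are all assumptions for illustration.

```python
def binarise(pixels, threshold):
    """Return 1 for ink (darker than threshold) and 0 for background."""
    return [[1 if p < threshold else 0 for p in row] for row in pixels]


def per_glyph_threshold(pixels):
    """Midpoint between the darkest and lightest pixel of one glyph image.

    Assumes the image actually contains a glyph; on an empty patch this
    midpoint would simply amplify paper noise.
    """
    flat = [p for row in pixels for p in row]
    return (min(flat) + max(flat)) / 2


def ink_count(bitmap):
    """Number of pixels classified as ink."""
    return sum(sum(row) for row in bitmap)


# Hypothetical greyscale glyphs (0 = black, 255 = white) on paper of
# brightness 200: one key struck hard, one struck lightly.
strong = [[200, 40, 200], [200, 40, 200], [200, 40, 200]]
faint = [[200, 150, 200], [200, 150, 200], [200, 150, 200]]

GLOBAL_T = 120  # illustrative page-wide threshold

for name, glyph in (("strong", strong), ("faint", faint)):
    with_global = ink_count(binarise(glyph, GLOBAL_T))
    with_local = ink_count(binarise(glyph, per_glyph_threshold(glyph)))
    print(name, "global:", with_global, "per-glyph:", with_local)
```

With these values the faint stroke disappears under the global threshold and is recovered with the per-glyph one; real carbon copies with textured backgrounds would need considerably more than a midpoint rule.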

Page 5: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

Document Types and Challenges

Manuscripts
Scientific publications
Index cards
Administrative documents
Letters
…

Annotations
Abbreviations and names
Carbon copies (low contrast)
Punch holes, staples etc.
Damage from regular handling (folds, tears, stains)
Discoloured paper (often unevenly)

Page 6: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

Some Examples

Page 7: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

Specific Approaches

Incorporate background knowledge about typewritten documents

Pre-processing
– Improved glyph segmentation
– Enhancement of individual glyph images

Recognition
– Perform language-independent character recognition using specifically trained classifiers
– Voting engine
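A voting engine of the kind listed above could, in its simplest form, combine the per-glyph outputs of several classifiers by weighted confidence. The slides do not describe how the IMPACT voting engine works, so the following is only a minimal sketch of generic weighted voting; the classifier names, confidence values and weights are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Pick the label with the highest weighted confidence sum.

    predictions: {classifier_name: (label, confidence)}
    weights:     {classifier_name: weight}; missing names default to 1.0
    """
    scores = defaultdict(float)
    for name, (label, confidence) in predictions.items():
        scores[label] += weights.get(name, 1.0) * confidence
    # Iterate labels in sorted order so that ties resolve deterministically.
    return max(sorted(scores), key=lambda label: scores[label])


# Hypothetical outputs for one glyph: letter "O" versus digit "0".
predictions = {
    "template_matching": ("O", 0.60),
    "feature_classifier": ("0", 0.55),
    "third_classifier": ("0", 0.30),
}

equal = {"template_matching": 1.0, "feature_classifier": 1.0, "third_classifier": 1.0}
trusted_templates = {"template_matching": 2.0, "feature_classifier": 1.0, "third_classifier": 1.0}

print(weighted_vote(predictions, equal))              # two weaker votes outweigh one
print(weighted_vote(predictions, trusted_templates))  # higher weight flips the result
```

Because the decision is made per character and uses no dictionary, this style of combination stays language independent, which matters for documents full of names, abbreviations and numbers.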

Page 8: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

Typewritten OCR

[Workflow diagram]

Document Image (greyscale) → Binarisation → Document Image (black-and-white)
→ Region Segmentation → PAGE XML (with text regions)
→ Text Line Segmentation → PAGE XML (with text lines)
→ Glyph Segmentation → Glyph Elements
→ TOCR: Glyph Enhancement → Enhanced Glyphs → Composite Character Recognition (Template Matching, Feature-based Classifier, ... → Voting Engine with Weights)
→ Glyph Elements (with text) → PAGE Exporter (includes word composition) → PAGE XML (completely filled)

System developed in IMPACT
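The last step of the workflow, a PAGE Exporter that "includes word composition", can be sketched as follows. This is not the IMPACT exporter: it assumes recognised glyphs arrive as (left, right, character) tuples for a single text line, groups them into words by horizontal gap, and writes a simplified tree. The element names follow the PAGE format, but the namespace, coordinates and other required attributes are omitted, so the output is not schema-valid. The gap threshold and sample glyphs are made up.

```python
import xml.etree.ElementTree as ET


def compose_words(glyphs, max_gap):
    """Group glyphs (left, right, char) into words wherever the gap is small."""
    words, current, prev_right = [], [], None
    for left, right, char in sorted(glyphs):
        if current and left - prev_right > max_gap:
            words.append(current)
            current = []
        current.append((left, right, char))
        prev_right = right
    if current:
        words.append(current)
    return words


def add_text(parent, text):
    """Attach a TextEquiv/Unicode pair carrying the recognised text."""
    equiv = ET.SubElement(parent, "TextEquiv")
    ET.SubElement(equiv, "Unicode").text = text


def line_to_page_xml(glyphs, max_gap):
    """Build a simplified PcGts tree for one text line (no namespace, no Coords)."""
    root = ET.Element("PcGts")
    page = ET.SubElement(root, "Page")
    region = ET.SubElement(page, "TextRegion")
    line = ET.SubElement(region, "TextLine")
    word_texts = []
    for word in compose_words(glyphs, max_gap):
        word_el = ET.SubElement(line, "Word")
        for _, _, char in word:
            add_text(ET.SubElement(word_el, "Glyph"), char)
        text = "".join(char for _, _, char in word)
        add_text(word_el, text)
        word_texts.append(text)
    add_text(line, " ".join(word_texts))
    return root


# Hypothetical glyph boxes in pixels: "No." followed by a wider gap, then "7".
glyphs = [(0, 8, "N"), (10, 18, "o"), (20, 28, "."), (40, 48, "7")]
root = line_to_page_xml(glyphs, max_gap=5)
print(ET.tostring(root, encoding="unicode"))
```

Composing words only at export time keeps recognition itself glyph-based, which matches the workflow shown in the diagram.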

Page 9: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

Some Results

Top: Commercial OCR

Bottom: IMPACT Typewritten OCR prototype

More complete results and thus higher overall accuracy

Page 10: IMPACT Final Conference - Research Parallel Sessions - 03 typewritten ocr

For more information visit:

PRImA: http://www.primaresearch.org

IMPACT: http://www.impact-project.eu
