Evaluation and post-correction of OCR of digitised historical newspapers

A research project

Lotte Wilms (KB) & Janneke van der Zwaan (NL eScience Center)

@lottewilms @jvdzwa

• Digitised Dutch newspapers

• 1618-1995

• Images + metadata + text

• Now: 11 million pages (in 1,351,123 issues)

• Projected for 2020: 20 million pages

• Full text searchable on: www.delpher.nl

Delpher newspaper corpus

http://www.delpher.nl/

Crowdsourced corrections

type/format | level | comments
PDF | issue | Searchable text + scan
JPEG-2000 | page | Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm); Master: JPEG 2000 part 1, lossless compression, greyscale or colour
Dublin Core | iss./p./art. | Descriptive metadata
OCR | article | XML
ALTO | page |
mpeg21-didl | issue | Structural metadata

The data
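To illustrate the page-level ALTO files listed above, here is a minimal sketch that reads the words and their confidence values from an ALTO document. It assumes the standard ALTO `String` element with a `CONTENT` attribute and an optional `WC` (word confidence) attribute. The sample XML is invented for illustration, and real Delpher files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Nieuwe" WC="0.95"/>
    <SP/>
    <String CONTENT="Rotterdamfche" WC="0.41"/>
    <SP/>
    <String CONTENT="Courant" WC="0.88"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_words(xml_text):
    """Return (word, confidence) pairs; confidence is None when WC is absent."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            wc = elem.get("WC")
            confidence = float(wc) if wc is not None else None
            words.append((elem.get("CONTENT", ""), confidence))
    return words


words = alto_words(SAMPLE_ALTO)
page_text = " ".join(word for word, _ in words)
# The 0.5 threshold is an arbitrary choice for this example.
low_confidence = [word for word, wc in words if wc is not None and wc < 0.5]
```

Matching on the local tag name keeps the sketch independent of the ALTO schema version, since the namespace differs between versions.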

Aims of project

• Insight into quality of our OCR

• Insight into automated methods of post-correction

• Reprocessing images

• Machine learning approach

• Other?

Output of the project

• Representative sample set of digitised newspapers, with ground truth

• Report on quality of OCR of Delpher’s digitised newspapers

• Report on post-correction possibilities of OCR using automatic techniques

• Impact analysis of most likely method of improvement

• Prototype for OCR post-correction and evaluation using deep learning

Sample set

• 2000 pages

• Representative of the whole collection, taking into account:

• Date of publication

• Date of production

• Software used
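One way to draw such a sample is proportional stratified sampling: group the pages by the three criteria above and sample from each group in proportion to its size. The sketch below is only an assumption about how this could be done, not the project's actual procedure. The field names (`pub_year`, `prod_year`, `software`) and the data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(pages, key, sample_size, seed=0):
    """Draw a sample whose strata are roughly proportional to the collection."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for page in pages:
        strata[key(page)].append(page)

    total = len(pages)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        # At least one page per stratum, so rare combinations are not lost.
        quota = max(1, round(sample_size * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


def stratum_of(page):
    """Publication decade, production year and OCR software define a stratum."""
    return (page["pub_year"] // 10 * 10, page["prod_year"], page["software"])


# Invented collection: 600 pages from the 1850s, 400 pages from the 1940s.
pages = (
    [{"pub_year": 1850 + i % 10, "prod_year": 2008, "software": "ABBYY 8.1"}
     for i in range(600)]
    + [{"pub_year": 1940 + i % 10, "prod_year": 2012, "software": "ABBYY 9.0"}
       for i in range(400)]
)

sample = stratified_sample(pages, stratum_of, sample_size=20)
```

Because each quota is rounded and every stratum gets at least one page, the final sample can differ slightly from the requested size when there are many small strata.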

Production information

<OCRProcessing>

Database with production information

• Extracted from the metadata:

• Issue identifier

• Newspaper title

• Publication date

• Producer

• Production date

• Software used

• Some issues were processed twice, with ABBYY 8.1 and 9.0
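A rough sketch of how such a database could be built: read the processing details from the `<OCRProcessing>` block and store one row per processing run. The child element names follow the ALTO schema as we understand it (`processingDateTime`, `processingAgency`, `softwareName`, `softwareVersion`). The table layout, identifiers and sample values are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented metadata fragment; real files carry more fields and a namespace.
SAMPLE = """<alto><Description><OCRProcessing ID="OCR_1"><ocrProcessingStep>
  <processingDateTime>2010-05-12</processingDateTime>
  <processingAgency>Example Producer</processingAgency>
  <processingSoftware>
    <softwareName>ABBYY FineReader</softwareName>
    <softwareVersion>8.1</softwareVersion>
  </processingSoftware>
</ocrProcessingStep></OCRProcessing></Description></alto>"""

WANTED = {"processingDateTime", "processingAgency",
          "softwareName", "softwareVersion"}


def production_info(xml_text):
    """Collect the first occurrence of each wanted element, ignoring namespaces."""
    found = {}
    for elem in ET.fromstring(xml_text).iter():
        name = elem.tag.rsplit("}", 1)[-1]
        if name in WANTED and name not in found:
            found[name] = (elem.text or "").strip()
    return found


info = production_info(SAMPLE)
software = f'{info["softwareName"]} {info["softwareVersion"]}'

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE production (
    issue_id TEXT, title TEXT, publication_date TEXT,
    producer TEXT, production_date TEXT, software TEXT)""")

rows = [
    # First row uses the parsed values; the others are made up.
    ("issue-001", "Example Courant", "1888-03-01",
     info["processingAgency"], info["processingDateTime"], software),
    ("issue-001", "Example Courant", "1888-03-01",
     "Example Producer", "2011-02-03", "ABBYY FineReader 9.0"),
    ("issue-002", "Example Courant", "1888-03-02",
     "Example Producer", "2011-02-03", "ABBYY FineReader 9.0"),
]
conn.executemany("INSERT INTO production VALUES (?, ?, ?, ?, ?, ?)", rows)

# Issues that were processed with more than one software version.
reprocessed = [row[0] for row in conn.execute(
    "SELECT issue_id FROM production "
    "GROUP BY issue_id HAVING COUNT(DISTINCT software) > 1")]
```

A query like the last one is one way to find the issues that were processed twice with different ABBYY versions.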

Ground truth

Post-correction methods

• Deep learning (by Janneke van der Zwaan from the Netherlands eScience Center)

• PICCL by Martin Reynaert (UvT & RU)

• https://github.com/LanguageMachines/PICCL

• Proprietary software from a startup?

• More?


Improve OCR using Deep Learning

• Character-level language models

• Long Short-Term Memory (LSTM)

Also applicable for OCR evaluation without ground truth (GT)!
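The idea of evaluating OCR without ground truth can be sketched with a much simpler stand-in for the LSTM: a character trigram model with add-one smoothing. It is trained on clean text and then scores new text by its average number of bits per character. Text that looks less like the training material, for example because of OCR errors, gets a higher score. This is not the project's model, and the training sentence and test strings are invented.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram language model with add-one smoothing."""

    def __init__(self, order=3):
        self.order = order
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def _padded(self, text):
        # '^' marks the start of the text so the first characters have context.
        return "^" * (self.order - 1) + text

    def train(self, text):
        padded = self._padded(text)
        self.vocab.update(text)
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            self.ngram_counts[context + padded[i]] += 1
            self.context_counts[context] += 1

    def bits_per_char(self, text):
        """Average negative log2 probability per character (lower = more familiar)."""
        padded = self._padded(text)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            count = self.ngram_counts[context + padded[i]]
            prob = (count + 1) / (self.context_counts[context] + vocab_size)
            total -= math.log2(prob)
        return total / max(1, len(text))


model = CharNgramModel(order=3)
model.train("de koning heeft gisteren de stad bezocht en "
            "de burgemeester heeft hem ontvangen " * 50)

clean_score = model.bits_per_char("de koning heeft de stad bezocht")
noisy_score = model.bits_per_char("dc konmg hccft dc ftad bczocht")
```

With realistic training data, such a score could be used to rank pages by expected OCR quality. An LSTM plays the same role but can use longer context.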

Reprocessing images

• Service provider

• Access or master images?

• Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm)

• Master: JPEG 2000 part 1, lossless compression, greyscale and colour

• Standard software or new solutions?

Evaluation

• Focused on word-based searching

• Bag of words?

• Use existing tools or create our own?
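A bag-of-words comparison could look like the sketch below: both texts are reduced to word counts, ignoring word order, and the overlap is expressed as precision and recall. Recall is the most relevant figure for search, since it approximates the share of ground-truth words that a full-text query would still find. The tokenisation (lowercasing, `\w+`) is an assumption and the example sentences are invented.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; word order is deliberately ignored."""
    return Counter(re.findall(r"\w+", text.lower()))


def bow_scores(ocr_text, ground_truth):
    """Return (precision, recall) of OCR words against ground-truth words."""
    ocr, gt = bag_of_words(ocr_text), bag_of_words(ground_truth)
    overlap = sum((ocr & gt).values())  # multiset intersection
    precision = overlap / max(1, sum(ocr.values()))
    recall = overlap / max(1, sum(gt.values()))
    return precision, recall


ground_truth = "De koning bezocht gisteren de stad"
ocr_output = "De koning bczocht gisteren de ftad"

precision, recall = bow_scores(ocr_output, ground_truth)
# Four of the six ground-truth words survive the OCR errors.
```

This measure does not penalise layout or reading-order errors, which fits a search-oriented evaluation but would hide problems that matter for other uses.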

Impact analysis

• Most likely scenario

• Impact on the organisation

• Percentage of improvement on OCR

• Effort needed to implement method

• How to handle different versions of OCR

Any tips or questions?
