Evaluation and post-correction of OCR of digitised historical newspapers
A research project
Lotte Wilms (KB) & Janneke van der Zwaan (NL eScience Center)
@lottewilms @jvdzwa
Delpher newspaper corpus
• Digitised Dutch newspapers
• 1618-1995
• Images + metadata + text
• Now: 11 million pages (in 1,351,123 issues)
• Forecast for 2020: 20 million pages
• Full text searchable on: www.delpher.nl
The data
type/format | level | comments
PDF | issue | Searchable text + scan
JPEG-2000 | page | Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm); Master: JPEG 2000 part 1, lossless compression, greyscale or colour
Dublin Core | iss./p./art. | Descriptive metadata
OCR | article | XML
ALTO | page |
mpeg21-didl | issue | Structural metadata
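As an illustration of the page-level ALTO files listed for the data, the sketch below reads the OCR text out of an ALTO document. It relies only on the standard ALTO convention that each word is a `String` element with a `CONTENT` attribute inside a `TextLine`. The sample fragment, its namespace version, and the function names are assumptions made for the example; they are not taken from the Delpher files.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO-like fragment (hypothetical content, not from the corpus).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Nieuwe"/><SP/><String CONTENT="Courant"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Amsterdam"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the page text as a list of strings, one per TextLine element."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.attrib["CONTENT"] for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(ALTO_SAMPLE):
        print(line)
```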
Aims of project
• Insight into quality of our OCR
• Insight into automated methods of post-correction
• Reprocessing images
• Machine learning approach
• Other?
Output of the project
• Representative sample set of digitised newspapers, with ground truth
• Report on quality of OCR of Delpher’s digitised newspapers
• Report on post-correction possibilities of OCR using automatic techniques
• Impact analysis of most likely method of improvement
• Prototype for OCR post-correction and evaluation using deep learning
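The slides do not name the metric used to report OCR quality against the ground truth. One common choice is the character error rate (CER): the edit distance between the OCR text and the ground truth, divided by the length of the ground truth. A minimal sketch, assuming CER is an acceptable stand-in for whatever measure the project adopts:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

Because the distance is normalised by the ground-truth length, the value can exceed 1 when the OCR output contains many spurious characters.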
Sample set
• 2000 pages
• Representative of the whole collection, taking into account:
• Date of publication
• Date of production
• Software used
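One way to draw a sample that reflects the whole collection is proportional stratified sampling. The sketch below groups pages by publication decade and OCR software and draws from each group in proportion to its size. The field names (`publication_year`, `software`), the choice of strata, and the rounding rule are assumptions for illustration; production date could be added to the stratum key in the same way.

```python
import random
from collections import defaultdict


def stratified_sample(pages, sample_size, seed=0):
    """Draw pages proportionally from each (decade, software) stratum.

    Quotas are rounded per stratum, so the total can differ slightly
    from sample_size.
    """
    strata = defaultdict(list)
    for page in pages:
        decade = page["publication_year"] // 10 * 10
        strata[(decade, page["software"])].append(page)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional allocation, with at least one page per stratum.
        quota = max(1, round(sample_size * len(members) / len(pages)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```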
Database with production information
• Extracted from the metadata:
• Issue identifier
• Newspaper title
• Publication date
• Producer
• Production date
• Software used
Note: some issues were processed twice, with ABBYY 8.1 and 9.0
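The fields above can be modelled as one record per processing run, which makes issues that were processed more than once easy to find. This is a hypothetical sketch: the class, the field names, and the identifiers in the usage example are invented for illustration and do not describe the actual database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductionRecord:
    """One processing run of one newspaper issue (hypothetical schema)."""
    issue_id: str
    title: str
    publication_date: str   # ISO date string
    producer: str
    production_date: str    # ISO date string
    software: str


def issues_processed_more_than_once(records):
    """Map issue identifier to its sorted software versions, if more than one."""
    software_by_issue = defaultdict(set)
    for record in records:
        software_by_issue[record.issue_id].add(record.software)
    return {issue: sorted(versions)
            for issue, versions in software_by_issue.items()
            if len(versions) > 1}
```

For example, two records sharing the issue identifier `"issue-1"`, one with software `"ABBYY 8.1"` and one with `"ABBYY 9.0"`, would be reported under that identifier with both versions listed.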
Post-correction methods
• Deep learning (by Janneke van der Zwaan from the Netherlands eScience Center)
• PICCL by Martin Reynaert (UvT & RU)
• https://github.com/LanguageMachines/PICCL
• Proprietary software from a startup?
• More?
Improve OCR using Deep Learning
• Character-level language models
• Long Short-Term Memory (LSTM)
Also applicable to OCR evaluation without ground truth (GT)!
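The idea of evaluating OCR without ground truth can be illustrated with a much simpler model than an LSTM. The sketch below trains an add-one smoothed character bigram model on clean text and scores new text by its per-character perplexity: text the model finds surprising is more likely to contain OCR errors. The bigram model is a deliberate stand-in chosen to keep the example dependency-free; it is not the project's LSTM, and the class name and sample strings are assumptions.

```python
import math
from collections import Counter


class CharBigramModel:
    """Add-one smoothed character bigram model (a simple stand-in for an LSTM)."""

    def __init__(self, training_text):
        self.vocabulary = set(training_text)
        self.pair_counts = Counter(zip(training_text, training_text[1:]))
        self.context_counts = Counter(training_text[:-1])

    def perplexity(self, text):
        """Per-character perplexity of text; higher suggests noisier OCR."""
        if len(text) < 2:
            raise ValueError("need at least two characters")
        # The extra 1 reserves probability mass for characters unseen in training.
        vocab_size = len(self.vocabulary) + 1
        log_prob = 0.0
        for previous, current in zip(text, text[1:]):
            numerator = self.pair_counts[(previous, current)] + 1
            denominator = self.context_counts[previous] + vocab_size
            log_prob += math.log(numerator / denominator)
        return math.exp(-log_prob / (len(text) - 1))


if __name__ == "__main__":
    model = CharBigramModel("de krant van vandaag is gedrukt in amsterdam " * 20)
    print(model.perplexity("de krant van amsterdam"))
    print(model.perplexity("dc kr4nt v@n amstcrdam"))
```

A neural character-level model plays the same role with a much longer context, which is what makes it usable for correction as well as scoring.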
Reprocessing images
• Service provider
• Access or master images?
• Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm)
• Master: JPEG 2000 part 1, lossless compression, greyscale and colour
• Standard software or new solutions?
Impact analysis
• Most likely scenario
• Impact on the organisation
• Percentage of improvement on OCR
• Effort needed to implement method
• How to handle different versions of OCR
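The "percentage of improvement" can be made concrete as the relative reduction in an error rate measured before and after post-correction. The slides do not define the measure, so the formula and the function name below are assumptions.

```python
def relative_improvement(error_before, error_after):
    """Percentage reduction in error rate achieved by post-correction.

    A negative result means the correction made the text worse.
    """
    if error_before <= 0:
        raise ValueError("error_before must be positive")
    return 100.0 * (error_before - error_after) / error_before
```

With hypothetical error rates of 0.20 before and 0.15 after correction, this gives an improvement of about 25 percent.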