Humanistinen tiedekunta
Senka Drobac and Pekka Kauppinen and Krister Lindén
Improving OCR of historical newspapers and journals published in Finland by adding Swedish training data
Motivation
•Corpus of historical newspapers and magazines that has been digitized by the National Library of Finland
•OCR was done with the commercial software ABBYY FineReader
•Character accuracy rate (CAR): ~ 90-91%
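The slides do not define CAR. A minimal sketch, assuming the common definition of one minus the character-level edit distance divided by the length of the ground-truth text; the function names `edit_distance` and `accuracy_rate` are illustrative and not taken from any OCR toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """Accuracy in percent: 100 * (1 - edits / reference length)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))
```

Passing two strings gives a character-level rate. Passing two lists of words, for example `ref.split()` and `hyp.split()`, gives a word-level rate under the same assumption.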
Figure from: Vesanto, Aleksi, et al. "A system for identifying and exploring text repetition in large historical document corpora." Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. No. 131. Linköping University Electronic Press, 2017.
Ocropy
•Decided to train models with Ocropy + post processing
•Ocropy:
• Open source, uses LSTM, line based
• Tools for preprocessing, segmentation, training, recognition, evaluation
• Above 98.5% CAR on German 19th and 20th century
OCR workflow (Ocropy)
Image -> Binarization (pre-processing) -> Binarized image
Binarized image -> Line segmentation -> Line images
Line images -> OCR (pre-trained model) -> Text (lines)
Text (lines) -> Post-processing -> Text (lines)
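The workflow on this slide can be sketched as a sequence of Ocropy command-line calls. The tool names and flags below follow the Ocropy README as recalled here and should be checked against the installed version. The helper `ocropy_commands`, the working directory, and the model file name are hypothetical. The post-processing step is left out because the slides do not describe it.

```python
from pathlib import Path


def ocropy_commands(page_image, workdir, model):
    """Return the argv lists for one page, in pipeline order.

    Assumption: Ocropy expands the quoted glob patterns itself, as in
    the examples of its README.
    """
    work = Path(workdir)
    return [
        # 1. Binarization (pre-processing): image -> binarized image
        ["ocropus-nlbin", str(page_image), "-o", str(work)],
        # 2. Line segmentation: binarized image -> line images
        ["ocropus-gpageseg", str(work / "????.bin.png")],
        # 3. OCR with a trained model: line images -> text lines
        ["ocropus-rpred", "-m", str(model),
         str(work / "????" / "??????.bin.png")],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Ocropy is installed, for example with `ocropy_commands("page.png", "temp", "fin.pyrnn.gz")`.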
Data
1771-1919
Languages: Finnish and Swedish
Typefaces: Fraktur and Antiqua
Example pages:
• Good quality, Finnish Fraktur, one column
• Good quality, Swedish Antiqua, two columns
• Binarized image, difficult segmentation
• Binarized image, challenging segmentation, many different fonts on one page
• Both Finnish and Swedish on the same page
• Poor quality
Line examples - Fraktur
☛ För billigt pris: En kursläde i garden
Sananlennätinkonttori awoinna joka päiwä
-— Salama i s k i tiistai yönä klo
pitänyt tarpeellisena warata jonkunlaisen
Line examples – Antiqua
osakkaat kutsutaan täten varsinaiseen yhtiö-
nuksia määräämälleen rautatiease-
m stammanträda i nämnde kontors loka
Heines poetische Werke. I två band. 17 m.
Ocropy + post-processing results
•Finnish data sets:
• CAR: 93.5% - 94.83%
• After post-processing CAR: 93.68% - 95.21%
•It is better to randomly sample lines from the entire corpus than to train on all lines from 250 pages
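The sampling strategy in the last bullet can be sketched as follows. The slides do not say how the lines were drawn, so this assumes uniform sampling without replacement over every segmented line in the corpus; `sample_lines` and the page-id mapping are hypothetical names.

```python
import random


def sample_lines(pages, n_lines, seed=0):
    """Pick n_lines (page_id, line_index) pairs uniformly from all pages.

    pages maps a page id to the number of segmented lines on that page.
    """
    all_lines = [(page_id, i)
                 for page_id, count in sorted(pages.items())
                 for i in range(count)]
    if n_lines > len(all_lines):
        raise ValueError("requested more lines than the corpus contains")
    # A seeded generator keeps the selection reproducible.
    return random.Random(seed).sample(all_lines, n_lines)
```

Because every line in the corpus is a candidate, the selected lines can come from many more pages, typefaces, and print qualities than a fixed set of fully transcribed pages would cover.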
• Lots of Swedish material -> add Swedish training data
Finnish: ~10 000 training lines (randomly picked); ~75% Fraktur, ~25% Antiqua
Swedish: ~3 300 training lines (randomly picked); ~50% Fraktur, ~50% Antiqua
Experiments
Results show CAR (%) / WAR (%)

Test set       FIN MODEL        SWE MODEL
Fin-Fraktur    95.43 / 78.79    93.2 / 69.61
Fin-Antiqua    85.81 / 53.36    88.89 / 62.32
Swe-Fraktur    78.84 / 40.43    87.59 / 55.32
Swe-Antiqua    79.93 / 40.01    90.66 / 66.36

Test set       FIN + SWE 1      FIN + SWE 2      FIN + SWE 3
Fin-Fraktur    96.19 / 81.91    95.07 / 76.65    94.97 / 76.13
Fin-Antiqua    89.35 / 63.35    87.23 / 58.22    86.64 / 55.79
Swe-Fraktur    82.53 / 51.11    80.76 / 43.48    83.22 / 45.65
Swe-Antiqua    86.65 / 59.84    83.69 / 49.49    84.88 / 52.5

Training data:
FIN: 10 000 lines
SWE 1: 840 lines
SWE 2: 1 680 lines
SWE 3: 3 360 lines

•Not enough Finnish Antiqua in training
•Language is important!
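One way to assemble the three mixed training sets is sketched below. The subset sizes come from the slide. Everything else is an assumption: the slides do not say whether the Swedish subsets are nested or drawn independently, and `mixed_training_sets` is a hypothetical helper that draws each subset independently.

```python
import random

# Swedish subset sizes as listed on the slide.
SWE_SIZES = {"SWE 1": 840, "SWE 2": 1680, "SWE 3": 3360}


def mixed_training_sets(fin_lines, swe_lines, sizes=SWE_SIZES, seed=0):
    """Return {"FIN + SWE k": training lines} for each Swedish subset size.

    Assumption: each Swedish subset is an independent random sample
    of swe_lines; all Finnish lines are used in every set.
    """
    rng = random.Random(seed)
    sets = {}
    for name, size in sizes.items():
        subset = rng.sample(list(swe_lines), size)
        sets["FIN + " + name] = list(fin_lines) + subset
    return sets
```

With 10 000 Finnish lines, the three sets contain 10 840, 11 680, and 13 360 lines, so Swedish makes up roughly 8%, 14%, and 25% of the training data respectively.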
Conclusions
•Need more Swedish data
•Need more Finnish Antiqua data
•Is it possible to train one model for everything?