
Faculty of Humanities (Humanistinen tiedekunta)

Senka Drobac and Pekka Kauppinen and Krister Lindén

Improving OCR of historical newspapers and journals published in Finland by adding Swedish training data



Motivation

•Corpus of historical newspapers and magazines that has been digitized by the National Library of Finland

•OCR was done with the commercial software ABBYY FineReader

•Character accuracy rate (CAR): ~ 90-91%
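The slides do not spell out how CAR is computed. A common definition, assumed here, is 100 × (1 − edit distance / reference length), where the edit distance is the Levenshtein distance between the ground-truth line and the OCR output. The sketch below follows that assumption; the function names are illustrative, and the authors' own evaluation tooling may count errors differently. The same function applied to word tokens gives a word accuracy rate (WAR).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """Accuracy in percent: 100 * (1 - edits / reference length)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))


def car(ref_text, hyp_text):
    """Character accuracy rate: compare the lines character by character."""
    return accuracy_rate(ref_text, hyp_text)


def war(ref_text, hyp_text):
    """Word accuracy rate: compare whitespace-separated tokens."""
    return accuracy_rate(ref_text.split(), hyp_text.split())


# Ground truth taken from the Fraktur line examples; the OCR output is invented
# for illustration (two wrong characters in the last word).
reference = "awoinna joka päiwä"
ocr_output = "awoinna joka paiwa"
print(f"CAR = {car(reference, ocr_output):.2f}%, WAR = {war(reference, ocr_output):.2f}%")
```

Note that with this definition a single wrong character lowers WAR far more than CAR, which is one reason the two figures can diverge widely.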


Figure from: Vesanto, Aleksi, et al. "A system for identifying and exploring text repetition in large historical document corpora." Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden. No. 131. Linköping University Electronic Press, 2017.


Ocropy

•Decided to train models with Ocropy + post-processing

•Ocropy:

• Open source, uses LSTM, line based

• Tools for preprocessing, segmentation, training, recognition, evaluation

• Above 98.5% CAR reported on 19th- and 20th-century German text


OCR workflow (Ocropy)

Image → Binarization (pre-processing) → Binarized image → Line segmentation → Line images → OCR (pre-trained model) → Text (lines) → Post-processing → Text (lines)
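The workflow on this slide maps onto Ocropy's command-line tools roughly as follows. This is a minimal sketch, not the authors' actual scripts: the page image name, the working directory `book`, and the model file `model.pyrnn.gz` are placeholders, and the post-processing step is left out because the slides do not describe how it is implemented. The glob patterns are passed to Ocropy unexpanded, as in its README, where they are quoted on the shell command line.

```python
import shlex
import subprocess


def ocropy_commands(page_image, workdir="book", model="model.pyrnn.gz"):
    """Build the Ocropy commands for one page (all file names are placeholders)."""
    binarized_pages = f"{workdir}/????.bin.png"      # output of ocropus-nlbin
    line_images = f"{workdir}/????/??????.bin.png"   # output of ocropus-gpageseg
    return [
        # 1. Binarization (pre-processing)
        ["ocropus-nlbin", page_image, "-o", workdir],
        # 2. Line segmentation
        ["ocropus-gpageseg", binarized_pages],
        # 3. Line recognition with a trained model
        ["ocropus-rpred", "-m", model, line_images],
    ]


def run_pipeline(page_image, **kwargs):
    """Run the steps in order; requires Ocropy to be installed and on PATH."""
    for command in ocropy_commands(page_image, **kwargs):
        subprocess.run(command, check=True)


# Print the commands instead of running them, so the sketch works without Ocropy.
for command in ocropy_commands("page.png"):
    print(shlex.join(command))
```

Recognized text for each line image is written next to it by `ocropus-rpred`; a separate post-processing step would then read those text files.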


Data

Period: 1771-1919

Languages: Finnish and Swedish

Typefaces: Fraktur and Antiqua


• Good quality

• Finnish Fraktur

• One column


• Good quality

• Swedish Antiqua

• Two columns


• Binarized image

• Difficult segmentation


• Binarized image

• Challenging segmentation

• Many different fonts on one page


• Both Finnish and Swedish on the same page


• Poor quality


Line examples - Fraktur

☛ För billigt pris: En kursläde i garden

Sananlennätinkonttori awoinna joka päiwä

-— Salama i s k i tiistai yönä klo

pitänyt tarpeellisena warata jonkunlaisen


Line examples – Antiqua

osakkaat kutsutaan täten varsinaiseen yhtiö-

nuksia määräämälleen rautatiease-

m stammanträda i nämnde kontors loka

Heines poetische Werke. I två band. 17 m.


Ocropy + post-processing results

•Finnish data sets:

• CAR: 93.5% - 94.83%

• After post-processing CAR: 93.68% - 95.21%

•It is better to randomly sample lines from the entire corpus than to train on all lines from 250 pages
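The difference between the two selection strategies can be illustrated with a toy corpus. The sketch below is an assumption-laden example, not the authors' sampling code: the corpus layout (a dict from page ID to line IDs), the corpus size of 1000 pages, and the 40 lines per page are invented for illustration. It only shows why random line sampling exposes the model to many more pages, and therefore more fonts, layouts and print qualities, for the same number of training lines.

```python
import random


def sample_corpus_lines(pages, n_lines, seed=0):
    """Pick n_lines (page_id, line_id) pairs uniformly from the whole corpus."""
    all_lines = [(page_id, line) for page_id, lines in pages.items() for line in lines]
    if n_lines > len(all_lines):
        raise ValueError("corpus has fewer lines than requested")
    return random.Random(seed).sample(all_lines, n_lines)


def first_pages_lines(pages, n_pages):
    """Alternative strategy: take every line from the first n_pages pages."""
    selected = list(pages)[:n_pages]
    return [(page_id, line) for page_id in selected for line in pages[page_id]]


# Toy corpus: 1000 pages with 40 lines each (invented numbers).
toy_corpus = {
    f"page{p:04d}": [f"line{n:02d}" for n in range(40)] for p in range(1000)
}

sampled = sample_corpus_lines(toy_corpus, 10_000)
by_page = first_pages_lines(toy_corpus, 250)

pages_in_sample = len({page_id for page_id, _ in sampled})
pages_in_block = len({page_id for page_id, _ in by_page})
print(f"random sample: {len(sampled)} lines from {pages_in_sample} pages")
print(f"page-based:    {len(by_page)} lines from {pages_in_block} pages")
```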


• Lots of Swedish material -> add Swedish training data

Finnish: ~10 000 training lines (randomly picked), ~75% Fraktur, ~25% Antiqua

Swedish: ~3 300 training lines (randomly picked), ~50% Fraktur, ~50% Antiqua


Experiments

Test set       FIN model        SWE model
Fin-Fraktur    95.43 / 78.79    93.2 / 69.61
Fin-Antiqua    85.81 / 53.36    88.89 / 62.32
Swe-Fraktur    78.84 / 40.43    87.59 / 55.32
Swe-Antiqua    79.93 / 40.01    90.66 / 66.36

Test set       FIN + SWE 1      FIN + SWE 2      FIN + SWE 3
Fin-Fraktur    96.19 / 81.91    95.07 / 76.65    94.97 / 76.13
Fin-Antiqua    89.35 / 63.35    87.23 / 58.22    86.64 / 55.79
Swe-Fraktur    82.53 / 51.11    80.76 / 43.48    83.22 / 45.65
Swe-Antiqua    86.65 / 59.84    83.69 / 49.49    84.88 / 52.5

SWE 1: 840 lines

SWE 2: 1 680 lines

SWE 3: 3 360 lines

Results show CAR (%) / WAR (%), i.e. character accuracy rate / word accuracy rate

Not enough Finnish Antiqua in training

FIN: 10 000 lines

Language is important!
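To make the comparison across models easier to read, the figures from the two result tables can be put into one structure and the best model picked per test set. The numbers below are copied from the tables on this slide; the dictionary layout and the helper name `best_model` are illustrative only.

```python
# (CAR %, WAR %) per test set and model, copied from the result tables.
results = {
    "Fin-Fraktur": {
        "FIN": (95.43, 78.79), "SWE": (93.2, 69.61),
        "FIN + SWE 1": (96.19, 81.91), "FIN + SWE 2": (95.07, 76.65),
        "FIN + SWE 3": (94.97, 76.13),
    },
    "Fin-Antiqua": {
        "FIN": (85.81, 53.36), "SWE": (88.89, 62.32),
        "FIN + SWE 1": (89.35, 63.35), "FIN + SWE 2": (87.23, 58.22),
        "FIN + SWE 3": (86.64, 55.79),
    },
    "Swe-Fraktur": {
        "FIN": (78.84, 40.43), "SWE": (87.59, 55.32),
        "FIN + SWE 1": (82.53, 51.11), "FIN + SWE 2": (80.76, 43.48),
        "FIN + SWE 3": (83.22, 45.65),
    },
    "Swe-Antiqua": {
        "FIN": (79.93, 40.01), "SWE": (90.66, 66.36),
        "FIN + SWE 1": (86.65, 59.84), "FIN + SWE 2": (83.69, 49.49),
        "FIN + SWE 3": (84.88, 52.5),
    },
}

CAR, WAR = 0, 1  # positions inside each (CAR, WAR) tuple


def best_model(scores, metric=CAR):
    """Return the model name with the highest value for the chosen metric."""
    return max(scores, key=lambda model: scores[model][metric])


for test_set, scores in results.items():
    winner = best_model(scores)
    car_value, war_value = scores[winner]
    print(f"{test_set}: best CAR with {winner} ({car_value} / {war_value})")
```

On both metrics, FIN + SWE 1 scores highest on the two Finnish test sets, while the Swedish-only model scores highest on the two Swedish test sets.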


Conclusions

•Need more Swedish data

•Need more Finnish Antiqua data

•Is it possible to train one model for everything?