Download - Europeana Newspapers LFT Infoday Muehlberger

Text and Structure Recognition for Historical Newspapers

Günter Mühlberger

University of Innsbruck – Digitisation and Digital Preservation

This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the

Competitiveness and Innovation Framework Programme by the European Community

http://ec.europa.eu/ict_psp

Who we are

• Digitisation and Digital Preservation group @ University of Innsbruck

• Involved in digitisation and Optical Character Recognition (OCR) since the mid-1990s

• Research projects: LAURIN, METADATA ENGINE, books2u!, reUSE,

Digitisation on Demand, eBooks on Demand, IMPACT, PrestoPRIME,

ARROW+, Europeana Newspapers, tranScriptorium,…

• Our mission: “Digitisation of humanities” = Digital Humanities

• Selection of Digitisation projects

• Austrian Literature Online (since 2002)

• Digitisation of the Innsbrucker Newspaper Archive (2004-2006)

• Digitisation of the Tiroler Tageszeitung from 1945 to 2003 (2012-2014)

• Text recognition of 8 million newspaper pages within Europeana Newspapers

• Commercial services via the technology transfer platform of the University






Digitisation

Image capturing → Text & structure recognition → Natural language processing → Content representation





Example – Index card: Capturing






OCR Interface






Raw OCR Text


“Â.”- ikonogr.

religiös

V oragine , Jacob a ; LEGENDA AUREA Dresdae

ÄLipsiae 1846





Structure Recognition


“Â.”- ikonogr.

religiös

V oragine , Jacob a ; LEGENDA AUREA Dresdae

ÄLipsiae 1846





Natural Language Processing


Voragine, Jacob

LEGENDA AUREA

1846

Matching with reference database, e.g. WorldCat
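The step from the raw OCR text of the index card to the three fields above can be sketched with a few string heuristics. This is only a toy illustration: the function name `parse_index_card` and all of its rules (surname before the comma, title as the leading run of upper-case words, first four-digit number as the year) are assumptions made for this one card, not the method used in the project.

```python
import re

def parse_index_card(raw: str) -> dict:
    """Toy heuristic (assumption): split one raw OCR card into author, title, year."""
    text = " ".join(raw.split())              # collapse line breaks and repeated spaces
    author_part, _, rest = text.partition(";")
    surname, _, forenames = author_part.partition(",")
    surname = "".join(surname.split())        # "V oragine" -> "Voragine"
    forename_tokens = forenames.split()
    forename = forename_tokens[0] if forename_tokens else ""

    # Assumed rule: the title is the leading run of all-upper-case words.
    title_words = []
    for word in rest.split():
        if not word.isupper():
            break
        title_words.append(word)

    # Assumed rule: the first plausible four-digit year is the publication year.
    year = re.search(r"\b1[5-9]\d{2}\b", rest)
    return {
        "author": f"{surname}, {forename}".strip(", "),
        "title": " ".join(title_words),
        "year": year.group(0) if year else "",
    }

raw_card = "V oragine , Jacob a ; LEGENDA AUREA Dresdae\nÄLipsiae 1846"
record = parse_index_card(raw_card)
print(record)
```

The resulting record is what would then be matched against a reference database; heuristics like these break quickly on other cards, which is why reference data is needed.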





Matching with Reference (knowledge) data






The actual book






Content Representation

Instead of a scanned index card, we can access, link to and work with a full-featured catalogue entry and the digitised work itself

Instead of digitised newspapers we want to

access/link/work with the content/information/knowledge

contained in these newspapers!

OCR is one important step towards this overall objective!






OCR – Some Facts

• Optical Character Recognition

• “Old” technology: “pattern recognition”

• Largest progress in the late 1990s

• Market situation

• Two large companies: ABBYY, Nuance

• Cheap technology

• Open Source tools: Tesseract, Ocropus, Gamera,…

• Google: worked with ABBYY, has used Tesseract since 2012

• ABBYY

• Took part in two EU projects

• Gothic type and long “s” supported out of the box; “Old Italian” available as a language

• Direct export of Analysed Layout and Text Object (ALTO)






Output

• Processing

• University of Innsbruck: 32 ABBYY licences on 4 servers

• Per day: 10,000 large newspaper pages, 40,000 medium-size pages, or 150,000 book-size pages

• PDF

• Text above the image vs. text behind the image

• PDF/A Standard

• Tagged PDF

• XML - ALTO

• Keeps all the information: Blocks, type of blocks, languages, lines, words,

characters, confidence of words, etc.

• ALTO: de-facto standard – Library of Congress
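To give an impression of what ALTO keeps, the sketch below reads words and their confidence values from a tiny ALTO fragment with Python's standard library. The sample XML is hand-written for illustration (not output from the Innsbruck workflow), the ALTO v2 namespace is an assumption about the schema version, and the 0.8 threshold is arbitrary.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the ALTO v2 namespace; other versions use a different URI.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written sample, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String CONTENT="Innsbrucker" WC="0.98"/>
            <SP/>
            <String CONTENT="Nachrichten" WC="0.61"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

def words_with_confidence(alto_xml: str) -> list:
    """Return (word, confidence) pairs for every String element."""
    root = ET.fromstring(alto_xml)
    pairs = []
    for string_el in root.iterfind(".//alto:String", ALTO_NS):
        # WC is the word confidence; treat a missing value as 0.0 here.
        pairs.append((string_el.get("CONTENT"), float(string_el.get("WC", "0"))))
    return pairs

words = words_with_confidence(SAMPLE_ALTO)
low_confidence = [word for word, wc in words if wc < 0.8]  # arbitrary threshold
print(words, low_confidence)
```

Word confidence values like these are one possible starting point for post-processing, for example to flag words for manual checking.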






Accuracy rates

• What do we expect?

• Researchers: critical edition of Shakespeare's works: no errors accepted

• eBooks: fewer than 1 error per 1000 characters (= half a page)

• Users getting full-text searching offered as an additional feature?

• Academic staff working (copy & paste) with a text?

• Natural language processing?

• Knowledge extraction?

• Word Error Rate (WER) vs. Character Error Rate (CER)

• WER more meaningful to users

• WER easier to measure
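The two rates can be sketched with a plain Levenshtein (edit) distance, computed once over characters and once over words. The reference string below is an assumed ground-truth transcription of the index card line, chosen only for illustration; real evaluations use verified ground truth and often normalise punctuation and case first.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference: str, ocr: str) -> float:
    """Character Error Rate: character edits per reference character."""
    return edit_distance(reference, ocr) / len(reference)

def wer(reference: str, ocr: str) -> float:
    """Word Error Rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr.split()) / len(ref_words)

reference = "Dresdae & Lipsiae 1846"    # assumed ground truth, for illustration
ocr_output = "Dresdae ÄLipsiae 1846"    # raw OCR text from the index card

print(f"CER = {cer(reference, ocr_output):.3f}")
print(f"WER = {wer(reference, ocr_output):.3f}")
```

In this example two wrong characters out of 22 already affect two of the four words, which illustrates why the WER is usually much higher than the CER and closer to what a user searching for words experiences.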






IMPACT, EVA/MINERVA, 12th Nov. 2008















Outlook OCR

• ABBYY

• For small and medium volumes, up to some tens of millions of pages

• Tesseract

• Growing community

• Can be parallelized on High Performance Computing systems (e.g. several hundred or several thousand nodes)

• More experiments can be done for very large volumes, e.g. hundreds of

millions of pages

• Handwritten Text Recognition

• Next generation of engines for handwritten material

• Speech and face recognition as technological background

• Transcription and Recognition Platform

• Virtual Research Environment

• Will be released by the University of Innsbruck in 2015
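A back-of-the-envelope calculation shows why parallelisation matters for very large volumes. The sketch below uses the daily throughput figures quoted for the Innsbruck ABBYY setup; the `scale` factor and the assumption that throughput grows linearly with it are hypothetical and ignore I/O and scheduling overhead.

```python
import math

# Daily throughput of the Innsbruck setup as quoted in this talk.
PAGES_PER_DAY = {"large newspaper": 10_000, "medium": 40_000, "book": 150_000}

def days_needed(pages: int, pages_per_day: int, scale: int = 1) -> int:
    """Processing days, assuming (hypothetically) linear scaling by `scale`."""
    return math.ceil(pages / (pages_per_day * scale))

rate = PAGES_PER_DAY["large newspaper"]
days_8m = days_needed(8_000_000, rate)                    # Europeana Newspapers volume
days_100m = days_needed(100_000_000, rate)                # "hundreds of millions" scale
days_100m_scaled = days_needed(100_000_000, rate, scale=100)

print(days_8m, days_100m, days_100m_scaled)
```

At the quoted rate, 8 million large pages take 800 days on one such setup, and 100 million pages would take 10,000 days, so volumes of that size are only realistic with a much larger degree of parallelism.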






Structural Metadata

• Layout analysis

• Noise reduction (redundant text)

• A newspaper contains much more than edited articles

Content units

• One separation could be: edited articles – advertisements – entertainment

• Document Understanding

• Newspaper consists of repeated sections (“templates”)

• Unique vs. common content

E.g. local news, local advertisements, etc. vs. “world news”

• Common content may be found elsewhere in more detail

E.g. book announcement






Austrian Newspapers Online – ANNO - 1916






…more than edited articles



http://anno.onb.ac.at/cgi-content/anno?aid=ibn&datum=18700604&zoom=16




Edited articles vs. advertisements vs. entertainment


Innsbrucker Nachrichten, 4 June 1870





Innsbrucker Nachrichten 1870






Content units

• Types

• List of recently deceased persons

• Announcements of local associations

• Apartments to rent

• Obituaries

• Serialised novels

• …






Technical approaches

• Layout analysis

• Specific tools

• XML Output of OCR engine (cheap, easy to handle)

• Approaches

• Rule based approaches (experts needed)

• Machine learning approaches (large amounts of training samples needed)

• Functional Extension Parser (IMPACT project)

• Rule based approach for historical books (pre 1900)

• More than 80% accuracy for non-trivial features is hard to reach

• E.g. separation of edited text – advertisements – entertainment, running titles, section headings,…
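To make the rule-based idea concrete, the sketch below assigns a content-unit type to a text block with a few hand-written rules. Everything in it is hypothetical: the `Block` features, the keyword lists and the rule order are invented for illustration and are not the rules of the Functional Extension Parser. It mainly shows why such rules need expert knowledge of the specific newspaper.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A text block as it might come from the OCR XML output (hypothetical features)."""
    text: str
    has_border: bool = False   # assumed feature: layout analysis found a frame

# Illustrative keyword rules; a real rule set would be written per newspaper.
AD_KEYWORDS = ("zu vermieten", "zu verkaufen")      # "to rent", "for sale"
SERIAL_KEYWORDS = ("fortsetzung",)                  # "continuation" of a serial novel

def classify(block: Block) -> str:
    """Return 'advertisement', 'entertainment' or 'edited article'."""
    text = block.text.lower()
    if block.has_border or any(keyword in text for keyword in AD_KEYWORDS):
        return "advertisement"
    if any(keyword in text for keyword in SERIAL_KEYWORDS):
        return "entertainment"
    return "edited article"   # fallback when no rule fires

labels = [
    classify(Block("Wohnung zu vermieten")),
    classify(Block("(Fortsetzung folgt.)")),
    classify(Block("Innsbruck, 4. Juni.")),
]
print(labels)
```

The fallback rule is where such systems typically lose accuracy: anything the expert did not anticipate is silently treated as an edited article.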






Summary

• In many countries and regions, the digitisation of newspapers is still at an early stage

• OCR, though error-prone, is a must and cheap (compared to scanning)

• Post-processing of OCR is promising

• Structural metadata are a must as well; new approaches are needed (beyond article separation)

• Natural Language Processing and more advanced

operations will benefit

• Final goal of “document understanding” by machines



Thank you for your attention! Günter Mühlberger
