Text and Structure Recognition for
Historical Newspapers
Günter Mühlberger
University of Innsbruck – Digitisation and
Digital Preservation
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Who we are
• Digitisation and Digital Preservation group @ University of Innsbruck
• Involved in digitisation and Optical Character Recognition (OCR) since the
mid-1990s
• Research projects: LAURIN, METADATA ENGINE, books2u!, reUSE,
Digitisation on Demand, eBooks on Demand, IMPACT, PrestoPRIME,
ARROW+, Europeana Newspaper, tranScriptorium,…
• Our mission: “Digitisation of humanities” = Digital Humanities
• Selection of Digitisation projects
• Austrian Literature Online (since 2002)
• Digitisation of the Innsbrucker Newspaper Archive (2004-2006)
• Digitisation of the Tiroler Tageszeitung, issues 1945-2003 (2012-2014)
• Text recognition of 8 million newspaper pages within Europeana Newspapers
• Commercial services via the Technology Transfer platform of the University
Digitisation
Image capturing → Text & structure recognition → Natural language processing → Content representation
Example – Index card: Capturing
OCR Interface
Raw OCR Text
“Â.”- ikonogr.
religiös
V oragine , Jacob a ; LEGENDA AUREA Dresdae
ÄLipsiae 1846
Structure Recognition
“Â.”- ikonogr.
religiös
V oragine , Jacob a ; LEGENDA AUREA Dresdae
ÄLipsiae 1846
Natural Language Processing
Voragine, Jacob
LEGENDA AUREA
1846
Matching with reference database, e.g. WorldCat
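The step from the raw OCR text to the cleaned fields can be illustrated with a minimal sketch. It assumes the card follows the pattern "surname , forename ; TITLE place year"; the function name `parse_card` and all heuristics are hypothetical and are not the method used in the project. Matching against a reference database such as WorldCat is not shown.

```python
import re


def parse_card(raw_ocr: str) -> dict:
    """Heuristically split one OCR'd index card into author, title and year.

    Assumptions (hypothetical, for illustration only):
    - author and title are separated by ';'
    - surname and forename are separated by ','
    - the title is the leading run of all-uppercase words
    """
    text = " ".join(raw_ocr.split())            # collapse line breaks and spaces
    author_part, _, rest = text.partition(";")
    surname_raw, _, forename_raw = author_part.partition(",")

    # OCR split the surname ("V oragine"); glue the pieces back together.
    surname = "".join(surname_raw.split())
    forename_tokens = forename_raw.split()
    forename = forename_tokens[0] if forename_tokens else ""

    # Title: take words while they are fully uppercase and longer than 1 char.
    title_words = []
    for token in rest.split():
        if token.isupper() and len(token) > 1:
            title_words.append(token)
        else:
            break

    year_match = re.search(r"\b(1[5-9]\d{2})\b", rest)

    return {
        "author": f"{surname}, {forename}".strip(", "),
        "title": " ".join(title_words),
        "year": year_match.group(1) if year_match else "",
    }


raw = "V oragine , Jacob a ; LEGENDA AUREA Dresdae\nÄLipsiae 1846"
print(parse_card(raw))
```

Rules of this kind only hold for cards that follow the assumed pattern; the cleaned fields would then serve as the query for the reference database.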
Matching with Reference (knowledge) data
The actual book
Content Representation
Instead of a scanned index card we are able to
access/link/work with a full featured catalogue entry and the
actually digitised work
Instead of digitised newspapers we want to
access/link/work with the content/information/knowledge
contained in these newspapers!
OCR is one important step towards this overall objective!
OCR – Some Facts
• Optical Character Recognition
• “Old” technology: “pattern recognition”
• Largest progress in the late 1990s
• Market situation
• Two large companies: ABBYY, Nuance
• Cheap technology
• Open Source tools: Tesseract, Ocropus, Gamera,…
• Google: worked with ABBYY, switched to Tesseract in 2012
• ABBYY
• Took part in two EU projects
• Gothic type and long “s” supported out of the box; “Old Italian” available as a language
• Direct export of Analysed Layout and Text Object (ALTO)
Output
• Processing
• University of Innsbruck: 32 ABBYY licences on 4 servers
• Throughput per day: 10,000 large newspaper pages, 40,000 medium-size pages or
150,000 book-size pages
• Text above the image vs. text behind the image
• PDF/A Standard
• Tagged PDF
• XML - ALTO
• Keeps all the information: Blocks, type of blocks, languages, lines, words,
characters, confidence of words, etc.
• ALTO: de facto standard, maintained by the Library of Congress
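To show how the information kept in ALTO can be used, here is a minimal sketch that reads words and their confidence values from an ALTO file. The sample document is invented for illustration; the element and attribute names (`TextLine`, `String`, `CONTENT`, `WC`) come from the ALTO schema, while the helper names and the 0.8 threshold are assumptions. Tags are matched by local name so that the sketch does not depend on one ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Invented sample; real ALTO files carry many more attributes (coordinates etc.).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String CONTENT="LEGENDA" WC="0.95"/>
            <SP/>
            <String CONTENT="AUREA" WC="0.62"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def words_with_confidence(alto_xml: str) -> list:
    """Return (word, confidence) pairs; confidence is None if WC is missing."""
    root = ET.fromstring(alto_xml)
    pairs = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            wc = element.get("WC")
            pairs.append((element.get("CONTENT", ""),
                          float(wc) if wc is not None else None))
    return pairs


def text_lines(alto_xml: str) -> list:
    """Rebuild the plain text of each TextLine from its String children."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


def low_confidence_words(alto_xml: str, threshold: float = 0.8) -> list:
    """Words whose confidence is below the (assumed) threshold."""
    return [word for word, conf in words_with_confidence(alto_xml)
            if conf is not None and conf < threshold]


print(text_lines(SAMPLE_ALTO))
print(low_confidence_words(SAMPLE_ALTO))
```

Word confidence values of this kind could, for example, be used to select pages for post-correction.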
Accuracy rates
• What do we expect?
• Researchers: critical edition of Shakespeare's works: no errors accepted
• eBooks: fewer than 1 error per 1,000 characters (about half a page)
• Users getting full-text searching offered as an additional feature?
• Academic staff working (copy & paste) with a text?
• Natural language processing?
• Knowledge extraction?
• Word Error Rate (WER) vs. Character Error Rate (CER)
• WER more meaningful to users
• WER easier to measure
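The difference between the two measures can be made concrete with a short sketch. It assumes both rates are computed as the Levenshtein (edit) distance between a ground-truth text and the OCR output, divided by the length of the ground truth, counted in characters for CER and in whitespace-separated words for WER. The function names are chosen for illustration. On this definition, the eBook target of fewer than 1 error per 1,000 characters corresponds to a CER below 0.1%.

```python
def edit_distance(reference, hypothesis) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            current.append(min(
                previous[j] + 1,                              # deletion
                current[j - 1] + 1,                           # insertion
                previous[j - 1] + (ref_item != hyp_item),     # substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate relative to the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


ground_truth = "Voragine, Jacob LEGENDA AUREA 1846"
ocr_output = "Voragine, Jacob LEGENDA AUREA 1346"   # one wrong digit

print(f"CER: {cer(ground_truth, ocr_output):.3f}")
print(f"WER: {wer(ground_truth, ocr_output):.3f}")
```

A single wrong character makes the whole word wrong, so in this example the WER (1 of 5 words) is much higher than the CER (1 of 34 characters).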
[Figure: IMPACT slide, EVA/MINERVA, 12 Nov. 2008]
Outlook OCR
• ABBYY
• For small and medium amounts, up to some tens of millions of pages
• Tesseract
• Growing community
• Can be parallelized on High Performance Computing engines (e.g. several
hundreds or thousands of nodes)
• More experiments can be done for very large volumes, e.g. hundreds of
millions of pages
• Handwritten Text Recognition
• Next generation of engines for handwritten material
• Speech and face recognition as technological background
• Transcription and Recognition Platform
• Virtual Research Environment
• Will be released by University of Innsbruck in 2015
Structural Metadata
• Layout analysis
• Noise reduction (redundant text)
• A newspaper contains much more than edited articles
Content units
• One separation could be: edited articles – advertisements - entertainment
• Document Understanding
• Newspaper consists of repeated sections (“templates”)
• Unique vs. common content
E.g. local news, local advertisements, etc. vs. “world news”
• Common content may be found elsewhere in more detail
E.g. book announcement
Austrian Newspapers Online – ANNO - 1916
…more than edited articles
Edited articles vs. advertisements vs. entertainment
Innsbrucker Nachrichten, 4 June 1870
Innsbrucker Nachrichten 1870
Content units
• Types
• Lists of recently deceased persons
• Announcements of local associations
• Apartments to rent
• Obituaries
• Continued novels
• …
Technical approaches
• Layout analysis
• Specific tools
• XML Output of OCR engine (cheap, easy to handle)
• Approaches
• Rule based approaches (experts needed)
• Machine learning approaches (large amounts of training samples needed)
• Functional Extension Parser (IMPACT project)
• Rule based approach for historical books (pre 1900)
• More than 80% accuracy is hard to reach for non-trivial features
• E.g. separation of edited text, advertisements and entertainment; running titles; section headings
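A rule-based approach to separating content units can be sketched as follows. This is an illustration only, not the Functional Extension Parser: the keyword rules, the labels and the function name `classify_block` are all hypothetical, and the input is assumed to be the plain text of one block taken from the OCR engine's XML output.

```python
import re

# Hypothetical expert rules: (label, pattern). The first matching rule wins.
RULES = [
    ("advertisement",
     re.compile(r"\b(zu vermieten|zu verkaufen|billig)\b", re.IGNORECASE)),
    ("entertainment",
     re.compile(r"\b(fortsetzung folgt|feuilleton|rätsel)\b", re.IGNORECASE)),
]


def classify_block(text: str, default: str = "edited article") -> str:
    """Assign a content-unit type to one text block using keyword rules."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return default


blocks = [
    "Wohnung zu vermieten, Maria-Theresien-Straße",
    "Das Geheimnis des Schlosses. (Fortsetzung folgt.)",
    "Der Gemeinderat beschloss gestern den Bau einer neuen Brücke.",
]
for block in blocks:
    print(classify_block(block), "<-", block)
```

Such rules have to be written by experts for each newspaper and period. A machine learning approach would replace the hand-written rule list with a classifier trained on labelled blocks.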
Summary
• In many countries/regions, newspaper digitisation is still
at an early stage
• OCR, though error-prone, is a must and is cheap (compared to
scanning)
• Post-processing of OCR is promising
• Structural metadata are a must as well, new approaches are
needed (beyond article separation)
• Natural Language Processing and more advanced
operations will benefit
• Final goal of “document understanding” by machines
Thank you for your attention!
Günter Mühlberger