Presentation of Claus Gravenhorst, BnF Information Day
Optical Layout Recognition (OLR)
From unstructured to structured newspaper data
Claus Gravenhorst, CCS Content Conversion Specialists GmbH
ENP information day, Paris, November 27, 2014
Agenda
• About CCS
• General OLR-workflow for mass digitization
• Layout and structure analysis
• ENP OLR workflow
• Quality assurance
• Output – METS/ALTO package
• Use of structural data – Access and presentation
About CCS
• CCS Content Conversion Specialists GmbH (Hamburg), as technical project partner, will provide its expertise and docWorks technology to set up and operate a mass digitization workflow for creating high-quality structured content from about 2 million scanned newspaper pages provided by 5 library partners
• Page volume: BnF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
• The distributed OLR workflow enables the contribution of project partners (content providers) to the integrated quality assurance process
• CCS is also contributing to the specification of the ENMAP metadata model
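The page volumes listed above can be totalled as a quick check against the "2 million pages" figure. This is only a sketch: it assumes that "1.000 k" on the slide means 1,000 thousand pages, and the variable names are illustrative.

```python
# Page volumes per library partner, in thousands of pages (from the slide).
# Assumption: "1.000 k" uses a thousands separator and means 1,000 k pages.
page_volume_k = {"BNF": 1000, "NLE": 500, "SUB HH": 480, "NLF": 90, "SBB": 10}

# Total volume and each partner's share of it.
total_k = sum(page_volume_k.values())
shares = {partner: volume / total_k for partner, volume in page_volume_k.items()}

print(f"Total: {total_k} k pages")
for partner, share in shares.items():
    print(f"{partner}: {share:.1%}")
```

Under that reading the five partners add up to slightly more than the rounded 2 million.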
General workflow for mass digitization
[Workflow diagram. Elements shown:]
• Scanning: robot, book, document and microfilm scanners
• Item tracking: document UID, barcode, check in / check out
• Imaging, reject condition, re-scan
• Conversion: layout analysis, OCR, ISR
• QA + correction: automated QA; manual QA (in-house, near-shore, off-shore, multiple locations)
• Delivery QA (random)
• Final output: image and metadata to database / repository
• Metadata via Z39.50
Layout and structure analysis
• Layout analysis based on a "bottom-up" approach
• General rule system enables recognition of words, text lines, text blocks and columns, and classification of text blocks, illustrations, advertisements, tables and the following page types:
- title page (the title page of an issue)
- content page (a page that consists of content/text only)
- illustration page (a page that has at least one illustration)
- advertisement page (a page that contains adverts only)
• Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)
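The bottom-up idea and the page types above can be illustrated with a minimal sketch. It is not the docWorks rule system: the `Word` class, the `y_tolerance` threshold, the block labels and the rule order in `classify_page` are all hypothetical, chosen only to show words being merged into text lines and a page being typed from its block classes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def group_words_into_lines(words, y_tolerance=0.5):
    """Bottom-up step: merge words whose vertical centres are close.

    y_tolerance is a fraction of the reference word height (assumed value).
    """
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        centre = word.y + word.height / 2
        for line in lines:
            ref = line[0]
            ref_centre = ref.y + ref.height / 2
            if abs(centre - ref_centre) <= y_tolerance * ref.height:
                line.append(word)
                break
        else:
            # No existing line is close enough: start a new one.
            lines.append([word])
    # Order the words of each line from left to right.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def classify_page(block_types, is_first_page=False):
    """Assign one of the four page types from the slide.

    Assumptions: the title page is the first page of an issue, and a page
    without illustrations that is not adverts-only counts as a content page.
    """
    if is_first_page:
        return "title page"
    if block_types and all(t == "advertisement" for t in block_types):
        return "advertisement page"
    if "illustration" in block_types:
        return "illustration page"
    return "content page"


words = [
    Word("Hello", 0, 0, 40, 10),
    Word("world", 50, 1, 40, 10),
    Word("Next", 0, 20, 30, 10),
]
for line in group_words_into_lines(words):
    print(" ".join(w.text for w in line))
print(classify_page(["text", "illustration"]))
```

The same grouping step could be repeated at the next level (lines into blocks, blocks into columns), which is what makes the approach bottom-up.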
ENP OLR workflow | Conversion without scanning
[Workflow diagram. Elements shown:]
• Locations: material location, conversion facility
• Delivery: digital image, metadata
• Doc delivery; inspection / automatic QA; reject
• MD recording; conversion
• Return: digital object
Possible conversion scenarios
A) Conversion at library (on-site)
B) Conversion off-shore at CCS data center, final QA at the library via internet transfer (remote QA solution)
C) Conversion off-shore at CCS, final QA at the library by backup shipment
Scenario B | Remote QA at library
[Diagram of the remote QA setup. Elements shown:]
• Offshore processing @ CCS: storage, IN and OUT pools, dW Share (master), output METS/ALTO
• Internet connection between the two sites
• QA on-site @ library: storage, pool, dW Share, RQA, input
Quality assurance
• @ CCS | Automated markup and basic manual correction:
- Headlines, illustrations, tables, captions, advertisements, etc.
- Article segmentation and grouping of zones into articles (incl. continuation)
• @ Content Provider (Library)
Recommended:
- Zoning: correct classification of blocks as "text" or "illustration"
- Article segmentation: correct identification of headlines/text blocks/captions
- Grouping: correct grouping of blocks (text, illustration) to articles
- Metadata: correct title, issue date and issue number
Optional:
- Page types: correct page types
- Page numbers: correct page sequence
- OCR: perform text correction of specific zones (e.g. headlines, captions)
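Two of the checks above (metadata and page sequence) lend themselves to simple automated pre-checks before manual QA. The sketch below is an assumption about how such checks might look, not part of docWorks: the function names, the ISO date format for the issue date and the integer issue number are all hypothetical.

```python
from datetime import date


def check_issue_metadata(title, issue_date, issue_number):
    """Return a list of problems found in the issue metadata.

    Assumptions: issue_date is an ISO string (YYYY-MM-DD) and
    issue_number is a positive integer.
    """
    problems = []
    if not title or not title.strip():
        problems.append("title is empty")
    try:
        date.fromisoformat(issue_date)
    except (TypeError, ValueError):
        problems.append(f"issue date is not a valid date: {issue_date!r}")
    if not isinstance(issue_number, int) or issue_number < 1:
        problems.append(f"issue number is not a positive integer: {issue_number!r}")
    return problems


def check_page_sequence(page_numbers):
    """Return the indices where a page number does not follow its predecessor."""
    return [
        i
        for i in range(1, len(page_numbers))
        if page_numbers[i] != page_numbers[i - 1] + 1
    ]


print(check_issue_metadata("Hamburger Nachrichten", "1914-08-01", 178))
print(check_issue_metadata("", "1914-13-01", 0))
print(check_page_sequence([1, 2, 4, 5]))
```

Issues that fail such checks could be routed to manual correction, while the zoning and grouping checks still need a human reviewer.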
Output | METS/ALTO package
• METS/ALTO metadata schemas to describe the structured digital output object
• A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure, manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information of a page (print space, margins, coordinates, OCR results).
• Benefits of structural markup:
- better browsing and more precise text search
- better access and display on tablet and mobile devices
- automated article classification and clustering through data/text mining and linguistic technologies
- user engagement for manual online text correction, article classification, annotation, building personal collections, etc.
- sharing articles via social media platforms like Facebook, Twitter, etc.
_______________
METS = Metadata Encoding and Transmission Standard
ALTO = Analyzed Layout and Text Object
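To show how the page-level information in ALTO (coordinates, OCR results) can be consumed, here is a minimal sketch that reads text lines from an ALTO fragment with Python's standard library. The sample XML is invented for illustration and is much smaller than a real docWorks output file; the helper names are hypothetical. Element and attribute names (`TextLine`, `String`, `CONTENT`, `HPOS`, `VPOS`) follow the ALTO schema.

```python
import xml.etree.ElementTree as ET

# Invented sample; real files carry many more elements and attributes.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace HPOS="100" VPOS="100" WIDTH="1800" HEIGHT="2800">
        <TextBlock ID="TB1" HPOS="100" VPOS="100" WIDTH="800" HEIGHT="60">
          <TextLine HPOS="100" VPOS="100" WIDTH="800" HEIGHT="60">
            <String CONTENT="Optical" HPOS="100" VPOS="100" WIDTH="300" HEIGHT="60"/>
            <SP WIDTH="20"/>
            <String CONTENT="Layout" HPOS="420" VPOS="100" WIDTH="280" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(alto_xml):
    """Return the text and top-left position of every TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [child for child in element if local_name(child.tag) == "String"]
        lines.append({
            "text": " ".join(w.get("CONTENT", "") for w in words),
            "hpos": float(element.get("HPOS", 0)),
            "vpos": float(element.get("VPOS", 0)),
        })
    return lines


for line in extract_lines(SAMPLE_ALTO):
    print(line)
```

Word highlighting in a viewer, for example, relies on exactly these per-word coordinates, while the METS file ties the ALTO pages and images together into articles and issues.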
Access and Presentation (I)
• Sample presentation system (Veridian)
• Browse by date, title
• Text search
• Article hit list
• Word highlighting
Access and Presentation (II)
• Issue
• Table of contents
Access and Presentation (III)
• Text & image view
• User text correction
• Article clipping
• Print article
• Distribute via email and social media platforms
Thank you for your attention!
www.europeana-newspapers.eu