Bratislava WS - Conteh - BL - IMPACT overview_pdf

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. IMPACT Workshop, 7th May 2010, Bratislava. Aly Conteh, British Library. Overview of the IMPACT Project

Transcript of Bratislava WS - Conteh - BL - IMPACT overview_pdf

Page 1: Bratislava WS - Conteh - BL - IMPACT overview_pdf

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

IMPACT Workshop, 7th May 2010, Bratislava

Aly Conteh, British Library

Overview of the IMPACT Project

Page 2: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Background

Text that is not digital is virtually invisible

Digitised material is becoming available too slowly, in too small quantities and from too few sources

OCR (optical character recognition) technology does not produce satisfactory results for historical documents

There is a lack of institutional knowledge and expertise which causes inefficiency and ‘re-inventing the wheel’

Page 3: Bratislava WS - Conteh - BL - IMPACT overview_pdf

OBJECTIVES

Significantly improve mass digitisation of historical printed text by:

Innovating OCR software and language technology

Sharing expertise and building capacity across Europe

Ensuring that tools and services will be sustained after the end of the project

Page 4: Bratislava WS - Conteh - BL - IMPACT overview_pdf

The IMPACT Consortium - Original

Libraries
– National Library of the Netherlands (KB)
– The British Library (BL)
– Bibliothèque nationale de France (BNF)
– German National Library (DNB)
– Bavarian State Library (BSB)
– Göttingen State and University Library (UGOE)
– Austrian National Library (ONB)
– University of Innsbruck Library (UIBK)

Universities & Research centres
– Dutch Institute for Lexicology (INL)
– National Centre for Scientific Research – Demokritos (NCSR)
– University of Salford (USAL)
– University of Munich (CIS group)
– University of Innsbruck (InfMath group)
– University of Bath (UKOLN)

Industry partners
– IBM (Haifa Research Lab)
– ABBYY (Moscow)

Page 5: Bratislava WS - Conteh - BL - IMPACT overview_pdf

IMPACT Extension: objectives

To demonstrate the IMPACT tools for efficient lexicon building for language families outside the current IMPACT focus
→ Currently in IMPACT three Germanic languages: English, German, Dutch
→ Add Romance and Slavic languages

To demonstrate and disseminate project results in Southern and Eastern Europe, and support building capacity in digitisation in these countries

To reinforce cooperation and better exploitation of ICT R&D synergies across the enlarged European Union

To build strategic partnerships with the aim of gaining access to knowledge and developing standards and interoperable solutions

Page 6: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Extension in two iterations:

1. Second phase, foreseen in original IMPACT contract
→ 3 languages: French, Spanish, Polish
→ 5 partners (entry 1 February 2010)

2. Proposal in Objective ICT-2009.9.5, call 5 of FP7: Enlarged European Union
→ 3 languages: Slovene, Bulgarian and Czech
→ 6 partners (entry 1 April 2010)

All will be equal partners in the consortium

Full integration expected in June 2010

Page 7: Bratislava WS - Conteh - BL - IMPACT overview_pdf

New partners identified: second phase

22 Analyse et Traitement Informatique de la Langue Française ATILF FR

23 Biblioteca Nacional de España BNE ES

24 Fundación Biblioteca Virtual Miguel de Cervantes BVC ES

25 Poznań Supercomputing and Networking Center PSNC PL

26 University of Warsaw, Department of Formal Linguistics UW DFL PL

Page 8: Bratislava WS - Conteh - BL - IMPACT overview_pdf

New partners identified – IMPACT enlarged EU

16 Institute for Parallel Processing, Bulgarian Academy of Sciences BAS BG

17 “St. Cyril and Methodius” National Library NLB BG

18 Jožef Stefan Institute JSI SI

19 Narodna in univerzitetna knjižnica (National and University Library) NUK SI

20 Institute of the Czech National Corpus, Charles University Prague CUP CZ

21 Národní knihovna České republiky (National Library of the Czech Republic) NKC CZ

Page 9: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Page 10: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Facts and figures

Project supported by the European Community under the FP7 ICT Work Programme
Coordinated by the National Library of the Netherlands (KB)
Project type: Large-scale Integrating Project
EU funding: € 11 500 000
Start date: 1 January 2008
Duration: 48 months
From 2012: sustainable Centre of Competence
Contact: [email protected]
Web site: www.impact-project.eu

Page 11: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Project Structure

OPERATIONAL CONTEXT

Requirements, Benchmarking and Metrics

Best Practices and Guidelines

Technical Framework and Technical Integration

CAPACITY BUILDING

Published resources

Training and support

Demonstration

TEXT RECOGNITION

Pre-processing and segmentation

Adaptive and experimental OCR

Models and dictionaries

ENHANCEMENT & ENRICHMENT

Collaborative correction

Lexicons and gazetteers

Structural metadata

Page 12: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Tools for Text Recognition (OCR)

Technologies for the extraction of text in a digital form from the page

Adaptive OCR engine: the core of IMPACT, a cutting-edge software system tailored specifically to the needs of libraries. It adapts itself to the material during the OCR process, integrating several other tools:
– Image enhancement toolkit
– Segmentation toolkit
– Post-correction modules
– Other OCR engines

Experimental prototypes and tools
– Typewritten OCR prototype
– Wordspotting engine
– Inventory extraction prototype
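The adaptive OCR engine on this slide is described as integrating image enhancement, segmentation, post-correction and other OCR engines. As a purely illustrative sketch of that kind of staged design (not IMPACT code), the following chains interchangeable stages over a page. Every name here (`Page`, `enhance`, `segment`, `recognise`, `make_post_corrector`, `run_pipeline`) is hypothetical, and each stage is a toy stand-in that works on text rather than on real image data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    """Hypothetical record for one scanned page as it moves through the stages."""
    image: str                                   # stand-in for pixel data
    regions: List[str] = field(default_factory=list)
    text: str = ""
    history: List[str] = field(default_factory=list)


Stage = Callable[[Page], Page]


def enhance(page: Page) -> Page:
    # Stand-in for image enhancement: here we only trim surrounding whitespace.
    page.image = page.image.strip()
    page.history.append("enhance")
    return page


def segment(page: Page) -> Page:
    # Stand-in for segmentation: treat each non-empty line as one text region.
    page.regions = [line for line in page.image.splitlines() if line.strip()]
    page.history.append("segment")
    return page


def recognise(page: Page) -> Page:
    # Stand-in for an OCR engine: the toy "image" already holds characters.
    page.text = " ".join(region.strip() for region in page.regions)
    page.history.append("recognise")
    return page


def make_post_corrector(fixes: Dict[str, str]) -> Stage:
    # Build a post-correction stage from a table of known misreadings.
    def post_correct(page: Page) -> Page:
        page.text = " ".join(fixes.get(word, word) for word in page.text.split())
        page.history.append("post_correct")
        return page
    return post_correct


def run_pipeline(page: Page, stages: List[Stage]) -> Page:
    # Apply the stages in order; swapping or adding a stage changes only this list.
    for stage in stages:
        page = stage(page)
    return page


stages = [enhance, segment, recognise, make_post_corrector({"tbe": "the"})]
result = run_pipeline(Page(image="  tbe quick\n\nbrown fox  "), stages)
print(result.text)
print(result.history)
```

Keeping each tool behind the same function signature is one way to let a different OCR engine or an extra correction module be substituted without touching the rest of the chain.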

Page 13: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Tools for Enrichment (language technology)

Make the OCR results more accurate and more accessible

Collaborative correction
– Full web-based collaborative correction system: a web-based platform, suitable for massive volunteer participation, that validates and corrects OCR results. It is the first tool of its kind to be directly linked to an OCR engine

Lexicons and gazetteers
– General and Named Entities lexica for Dutch, German and English, as well as support for lexicon development in other European languages
– Toolboxes providing the means to overcome the historical language barrier
– Collaborative web-based workspace for named entity management

Structural metadata
– Functional Extension Parser: a set of web services that can be exploited to automatically detect and tag structural metadata of scanned material
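One way to picture how a historical lexicon can help overcome the historical language barrier mentioned on this slide is query expansion: a search for a modern word form also matches its older spellings. The sketch below is an assumption-laden illustration, not an IMPACT tool. The lexicon entries, the sample documents and the function names (`build_variant_index`, `expand_query`, `search`) are all invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Illustrative entries only: historical spelling -> modern form.
HISTORICAL_TO_MODERN: Dict[str, str] = {
    "musick": "music",
    "musicke": "music",
    "publick": "public",
    "shew": "show",
}


def build_variant_index(lexicon: Dict[str, str]) -> Dict[str, Set[str]]:
    # Invert the lexicon so each modern form lists its historical variants.
    index: Dict[str, Set[str]] = defaultdict(set)
    for variant, modern in lexicon.items():
        index[modern].add(variant)
    return dict(index)


def expand_query(term: str, index: Dict[str, Set[str]]) -> List[str]:
    # Return the modern term followed by its known variants, sorted for stability.
    term = term.lower()
    return [term] + sorted(index.get(term, set()))


def search(term: str, documents: Dict[str, str], index: Dict[str, Set[str]]) -> List[str]:
    # Match a document if it contains the term or any of its variants as a whole word.
    wanted = set(expand_query(term, index))
    return [
        doc_id
        for doc_id, text in documents.items()
        if wanted & set(text.lower().split())
    ]


index = build_variant_index(HISTORICAL_TO_MODERN)
documents = {
    "doc1": "A treatise of musick and dancing",
    "doc2": "The publick papers of the town",
    "doc3": "Modern music theory",
}
hits = search("music", documents, index)
print(expand_query("music", index))
print(hits)
```

A real lexicon would also need to handle inflection and OCR misreadings; this sketch covers exact spelling variants only.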

Page 14: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Strategic tools and services
– Web site provides access to all project outputs and forms the nucleus of a virtual network of all European digitisation centres of competence and associated research activities
– A set of Decision Support Tools that can be used to initiate, organise, manage and cost mass digitisation projects
– A learning resource toolbox will contain operational guidelines, providing guidance on real world implementation of all tools produced within the project

Training and support
– Help Desk system that brokers end-user requests to project partners and to other digitisation centres of competence
– Training programme dealing with large-scale digitisation issues and technologies, with a range of supporting documentation made available through the project website

Demonstration

Page 15: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Building a sustainable Centre of Competence

First Phase 2008: IMPACT core consortium of 15 partners
– Good mix of public and private partners
– Experience in mass digitisation and research in OCR, Language and Image processing

Second Phase 2010: extension with 11 additional partners
– Public collection holders and language institutes
– Adding wider set of European languages and experience in mass digitisation

Third Phase 2011: Open to all partners
– Other Centres of Competence
– Digitisation Suppliers
– Research Institutes
– Libraries, Archives and Museums

By 2012 IMPACT exists as a sustainable Centre of Competence

Page 16: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Page 17: Bratislava WS - Conteh - BL - IMPACT overview_pdf

http://www.impact-project.eu

Page 18: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Twitter: impactocr

Blog: impactocr.wordpress.com

Page 19: Bratislava WS - Conteh - BL - IMPACT overview_pdf

Thank you
