DML–CZ: asks and bids

Jiří Rákosník, Institute of Mathematics AS CR, Praha

Towards a European Virtual Library in Mathematics, Santiago de Compostela, 13.3.2009

2

DML–CZ, a brief description

Digital Mathematics Library of the relevant mathematical literature published in the territory of the Czech Republic and Slovakia

Funding: R&D programme Information Society of the Academy of Sciences, 2005–2009

3

Partners

Institute of Mathematics AS CR, Praha (J. Rákosník) – coordinator, material selection, copyright, mathematical supervision

Institute of Computer Science, Masaryk University, Brno (M. Bartošek, P. Kovář, M. Šárfy, V. Krejčíř) – content management system, metadata Q/A, long-term archiving

Faculty of Informatics, Masaryk University, Brno (P. Sojka) – formats and tools, technical coordination, information retrieval, indexing

Faculty of Mathematics and Physics, Charles University, Praha (O. Ulrych, J. Veselý) – harvesting and adjusting metadata

Library AS CR, Praha (M. Lhoták, M. Duda, A. Ryšánková) – document scanning, graphical adjustment and OCR in the Digitization Centre Jenštejn


4

The scope

journals for mathematical research and education

conference proceedings

monographs, textbooks

altogether more than 200 000 pages

5

Journals

Title: year ranges covered (retro (scan), retro-born-digital, born-digital)

Czechoslovak Mathematical Journal: 1951–1991, 1992–2008

Aplikace Matematiky / Applications of Mathematics: 1956–1993, 1994–2008

Archivum Mathematicum, Brno: 1965–1991, 1992–2007

Commentationes Mathematicae Universitatis Carolinae: 1960–1990, 1991–2008

Kybernetika: 1965–1997, 1998–2008

Časopis pro pěstování matematiky a fysiky: 1872–1950

Časopis pro pěstování matematiky: 1951–1990

Mathematica Bohemica: 1991, 1992–2008

Acta Univ. Palackianae Olomucensis. Mathematica: 1960–2003, 2004–2008

Acta Mathematica et Informatica Univ. Ostraviensis: 1993–2003

Acta Mathematica Univ. Ostraviensis: 2004–2008

Mathematica Slovaca: 1951–2008

Matematika–Fyzika–Informatika: 1991–2005, 2006–2009

Pokroky matematiky, fyziky a astronomie: 1956–2005, 2006–2009

2008 2009 2010–

pages: 106 000 133 000 30 000+

6

Proceedings

Title: number of volumes

Equadiff: 11

Toposym: 10

Asymptotic Statistics: 4

Winter School Abstract Analysis: 33

Nonlinear Analysis, Function Spaces, Applications: 8

Function Spaces, Differential Operators, Nonlinear Analysis: 6

2008 2009 2010–

pages: 7 750 6 900

7

Monographs

Title: number of volumes

Bernard Bolzano Collection: 21

From the collection of The Royal Czech Society for Sciences: 15

Other monographs: 2

2008 2009 2010–

pages: 4 500 1 000

8

Content

multilingual: Czech, Slovak, Russian, English, German, French, Italian

text, drawings, photographs (B&W)

maths, physics, chemistry, education, reviews, personalia, politics

9

Inspiration

GDZ: technology for scanning, text adjustment, OCR

Cellule MathDoc, NUMDAM: DML, document enhancement, presentation, services

10

Scanning

parameters – 600 dpi, 4-bit depth

scanning facilities – Digibook RGB 10000, A1 color book scanner, and two book scanners Zeutschel OS 7000, A2 B/W

software – BookRestorer to make the scanned pages uniform (graphical adjustment, white space around the text body etc.)

Sirius system for archival storage of scans (put on CDs as TIFFs)
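
To give a feel for what the scanning parameters mean for archival storage, the sketch below estimates the uncompressed size of one page at 600 dpi and 4-bit depth. The A4 page size (210 × 297 mm) and the function name `raw_scan_bytes` are assumptions made for illustration; the slides do not state page dimensions or actual file sizes.

```python
MM_PER_INCH = 25.4


def raw_scan_bytes(width_mm, height_mm, dpi=600, bits_per_pixel=4):
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size: A4. Real journal pages are often smaller.
a4 = raw_scan_bytes(210, 297)
print(f"A4 page at 600 dpi, 4-bit: {a4 / 1e6:.1f} MB uncompressed")
```

Multiplying such a per-page figure by the page count gives only an upper bound, since the archived TIFFs may be compressed.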

11

Optical Character Recognition

text OCR by two-phase DML-OCR implemented with ABBYY FineReader SDK 8.1

errors in maths reading → methods for separation of text OCR and mathematics OCR

maths: Infty system (Suzuki et al., Japan) – layout analysis, character recognition, structure analysis of mathematical expressions, manual error correction

PDF with one OCR layer, multilayer PDF with several OCR layers (text, math in TeX, math in MathML or OMDoc)

99 %+ accuracy for text, 96 %+ for mathematics
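
The slide does not say how the accuracy figures were measured. One common definition, assumed here, is character accuracy: one minus the edit distance between the OCR output and a hand-corrected reference, divided by the reference length. A minimal sketch under that assumption (both function names are illustrative):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Character accuracy of OCR output against a reference transcription."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


# A lost superscript marker counts as one character error.
print(char_accuracy("f(x) = x2", "f(x) = x^2"))
```

Under this definition, 99 % accuracy still leaves roughly one wrong character per hundred, which is why manual error correction remains part of the maths workflow.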

13

Metadata and image enhancement/processing

metadata standards – choice of standards (DC, MODS, METS are supported by DSpace)

Unicode with TeX → possible conversion to MathML

maths standards rather than librarians’ standards

metadata acquisition – Zbl/MR, OCR tagging, (retyping)

image enhancements – TIFF, PDF, JBIG2 compression as a measure of quality

semantic processing – document markup enhancement, document classification, citation linking, document clustering, indexing

references and fulltexts as part of metadata, English titles and MSC mandatory

OAI-PMH export trying to follow miniDML, T. Fischer etc.
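
As an illustration of what a Dublin Core record for OAI-PMH export can look like, the sketch below serialises a small metadata dictionary as `oai_dc` XML using the Python standard library. The two namespace URIs are the standard OAI and Dublin Core ones; the record content and the function name `build_oai_dc` are made up for illustration and are not taken from DML-CZ.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"


def build_oai_dc(record):
    """Serialise {dc_field: [values]} as an oai_dc XML string."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for field, values in record.items():
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{field}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record: an English title and an MSC code, both of which
# the slide lists as mandatory.
example = {
    "title": ["An illustrative article title"],
    "creator": ["Author, A."],
    "subject": ["46E35"],
    "language": ["eng"],
}
xml_record = build_oai_dc(example)
print(xml_record)
```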

15

Metadata Editor

metadata creation & DL integration

developed in Brno for DML-CZ

web-based application

web interface, suite of scripts, files in directories, internal database

16

Metadata Editor

input data loading

articles building

metadata editing

references processing

verification

PDF compilation

export to DML-CZ
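
The "articles building" step can be pictured as splitting the scanned page sequence of an issue into articles while skipping pages marked for exclusion. The sketch below is a hypothetical illustration of that idea only; the function `build_articles` and its arguments are assumptions, not the Metadata Editor's actual interface.

```python
def build_articles(page_count, article_starts, excluded=()):
    """Split pages 1..page_count into articles.

    article_starts: first page of each article.
    excluded: pages (e.g. adverts, blank pages) left out of every article.
    """
    starts = sorted(article_starts)
    skip = set(excluded)
    articles = []
    for idx, first in enumerate(starts):
        # An article runs up to the page before the next article starts.
        last = starts[idx + 1] - 1 if idx + 1 < len(starts) else page_count
        articles.append([p for p in range(first, last + 1) if p not in skip])
    return articles


# Hypothetical 10-page issue with two articles and two excluded pages.
issue = build_articles(10, article_starts=[1, 5], excluded=[4, 10])
print(issue)
```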

17–19

[Metadata Editor screenshots; visible labels: pages to be excluded, article 1, article 2]

20

Indexing, storage

indexing – multiple OCR, multiple attribute layers (lemmas, reviewer comments, semantic classifications, etc.)

space – no problem to store and index that for all mathematics literature so far

software – client/server architecture, Lucene indexing software (OSS)
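
The idea of indexing several attribute layers per document can be shown with a toy inverted index keyed by (layer, token). This is plain Python written for illustration, not Lucene and not the project's code; the class name, layer names, and sample texts are assumptions.

```python
from collections import defaultdict


class LayeredIndex:
    """Toy inverted index with a separate posting list per layer."""

    def __init__(self):
        self._postings = defaultdict(set)  # (layer, token) -> document ids

    def add(self, doc_id, layer, text):
        for token in text.lower().split():
            self._postings[(layer, token)].add(doc_id)

    def search(self, layer, query):
        """Return ids of documents containing all query tokens in a layer."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = set(self._postings.get((layer, tokens[0]), set()))
        for token in tokens[1:]:
            result &= self._postings.get((layer, token), set())
        return result


index = LayeredIndex()
index.add("doc1", "text", "Sobolev spaces were studied")
index.add("doc1", "lemma", "sobolev space be study")  # hand-written lemmas
index.add("doc2", "text", "Orlicz space")
index.add("doc2", "lemma", "orlicz space")

print(index.search("text", "space"))   # surface forms only
print(index.search("lemma", "space"))  # lemma layer also matches "spaces"
```

Querying the lemma layer finds documents whose surface text uses an inflected form, which is the benefit of keeping the layers side by side.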

21

Presentation

delivery – customised digital library system DSpace (open source, created at MIT) for final article delivery and search

Manakin interface planned

visualization techniques – “lost in hyperspace” fear, visualization of document clustering, Visual Browser (different user's eyes)

22

Delivery

web portal

unique and persistent URLs: PURL

interfaces to other services

OAI-PMH harvesting – necessary to set up the content for OAI-PMH

BibTeX export

Googlebot optimization of metadata
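
A BibTeX export amounts to formatting stored metadata fields as a BibTeX entry. The following is a minimal sketch with a made-up record and a placeholder URL; the function `to_bibtex` and the field names are assumptions, and real exports would also need to escape special characters.

```python
def to_bibtex(key, entry_type, fields):
    """Format a metadata dictionary as a single BibTeX entry."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical record; the URL stands in for a persistent identifier.
entry = to_bibtex(
    "example2009",
    "article",
    {
        "author": "Author, A.",
        "title": "An illustrative title",
        "journal": "Mathematica Bohemica",
        "year": "2009",
        "url": "https://example.org/persistent-id",
    },
)
print(entry)
```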

23

Further problems and questions

paper classification

automated MSC experiment

automated MSC learning

metadata from born-digital documents

search

OCR systems

OCR XML postprocessing

maths OCR

24

Bids

Metadata Editor

Applications for classification of publications

Document markup enhancement

algorithms of language identification (bi-gram, tri-gram based, paragraph or even sentence level)

Measuring mathematical similarity of publications

OCR experience (possibly capacity)

Adjusted metadata of high fidelity

Experience (both good and bad) in workflow conduct
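
Tri-gram based language identification at sentence level can be sketched as follows: build a character tri-gram profile per language and pick the language whose profile is closest to that of the input. This is a toy illustration with tiny, invented training sentences and cosine similarity as the (assumed) distance measure; it does not reproduce the DML-CZ algorithms.

```python
from collections import Counter


def trigram_profile(text):
    """Count character tri-grams of a lower-cased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p, q):
    """Cosine similarity between two tri-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = (sum(v * v for v in p.values())
            * sum(v * v for v in q.values())) ** 0.5
    return dot / norm if norm else 0.0


def identify_language(sentence, profiles):
    """Return the language whose profile best matches the sentence."""
    target = trigram_profile(sentence)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))


# Invented one-sentence training texts; real profiles need far more data.
TRAINING = {
    "en": "the function is continuous on the closed interval and the theorem holds",
    "de": "die funktion ist stetig auf dem abgeschlossenen intervall und der satz gilt",
    "fr": "la fonction est continue sur l'intervalle fermé et le théorème est vrai",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

print(identify_language("the theorem holds on the interval", PROFILES))
print(identify_language("der satz gilt auf dem intervall", PROFILES))
```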

25

Asks

Interlinking system (the EuDML core?)

Effective system for adjusting and standardizing scanned pages

Metadata standards and metadata conversion/export tools

Unified authority base, journal name abbreviations, …

Effective maths OCR

26

Asks

Coordinated effort/support in copyright issues

Directive 2001/29/EC on the harmonisation of certain aspects of copyright and related rights in the information society

Green Paper Copyright in the Knowledge Economy COM(2008) 466/3

Fifth Freedom in the single market: free movement of knowledge and innovation

ENCES (European Network for Copyright in support of Education and Science) http://www.ences.eu

moving wall

supporting Open Access activities

27

Asks

Document markup enhancement

context dependent mapping from visual to logical markup

document classification, metrics, ontology construction, comparison with MSC 2000 classification

semiautomatic bibliography markup and metrics, global mathematics citation index, “MathRank”

document clustering (for visualization, …), identification of plagiarism
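
The slide gives no definition of “MathRank”. Assuming the name alludes to a PageRank-style score computed over the citation graph, a minimal power-iteration sketch could look like this; the function name, damping factor, and toy graph are all assumptions.

```python
def citation_rank(citations, damping=0.85, iterations=50):
    """PageRank-style score over a citation graph.

    citations maps each paper to the list of papers it cites.
    """
    papers = set(citations) | {c for cited in citations.values() for c in cited}
    if not papers:
        return {}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                share = damping * rank[p] / len(cited)
                for c in cited:
                    new_rank[c] += share
            else:
                # A paper citing nothing spreads its score evenly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Toy graph: papers A and B both cite C.
scores = citation_rank({"A": ["C"], "B": ["C"], "C": []})
print(max(scores, key=scores.get))
```

Such a score depends on reliable citation linking, which is why the bibliography markup and the citation index appear together on the slide.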

28

Mathematician’s expectations

Reliability – rate of correspondence with the original document, persistency

Search – multilingual, reliable identification of authors, interlinking with Zentralblatt and Mathematical Reviews

29

Mathematician’s expectations

Copyright – free access / reasonable moving wall

User-friendly services – citation export in BibTeX/AMS-TeX format, interlinking between repositories, unified layout design

Sustainable development