NEDERLAB & friends

Today’s Nederlab spokesman: Martin Reynaert

Tilburg center for Cognition and Communication - Tilburg University
Centre for Language and Speech Technology - Radboud Universiteit Nijmegen

Symposium: Digitale historische kranten als ‘big data’.
Koninklijke Bibliotheek, The Hague. March 24, 2015



Nederlab: Consortium

Partners working together in Nederlab:


Nederlab: aims

The Nederlab project aims to bring together all digitized texts relevant to the Dutch national heritage (c. A.D. 800 to the present), terabytes of data in all, in one user-friendly and tool-enriched web interface that allows scholars to search and analyze textual data simultaneously in a virtual research environment.

The focus in Nederlab is currently on incorporating the vast digital text collections of the Koninklijke Bibliotheek (KB, the Dutch National Library; http://www.kb.nl/en) as well as the contents of the Digitale Bibliotheek voor de Nederlandse Letteren (DBNL, the Digital Library of Dutch Literature; http://www.dbnl.org/).

The KB text collections comprise newspapers from 1618 to 1995 and the Early Dutch Books Online or EDBO (http://www.delpher.nl/).


Nederlab: added value

Nederlab adds extra value to the corpora it incorporates:
Texts are uniformly reformatted in FoLiA XML (Format for Linguistic Annotation)
It provides OCR post-correction by means of Text-Induced Corpus Clean-up or TICCL
It provides extra linguistic annotations: lemmatisation, POS-tagging and Named Entity labeling
It enhances search and retrieval and provides better research opportunities


TICCL correction for ’Bataaffhe’

Most EDBO books are printed in Fraktur: long ‘s’
Most EDBO books are from the period of the ‘Bataafsche Republiek’ (late 18th century)
EDBO has 10,333 Dutch books, in all about 235 million words of text

Top 7 TICCL corrections with corpus frequencies:

Bataaffche    53,538
Bataaffchen   15,735
Bataaffehe     1,749
Bataafseh        796
Bataafiche       683
Bataaflche       443
Bataavfche       433

TICCL identified 1,445 variants for ‘Bataafsche’, of which 872 were hapaxes (single corpus occurrences)
In all, TICCL corrected 81,302 EDBO tokens into ‘Bataafsche’
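
The figures on this slide can be checked with a short Python sketch. It also illustrates why the long ‘s’ matters: a naive rule that turns ‘f’ back into ‘s’ and looks the result up in a lexicon recovers only the most frequent variant. This is not TICCL's actual algorithm; the function names and the one-word lexicon are hypothetical and serve only as illustration. The frequencies are copied from the list above.

```python
from itertools import product

# Top 7 OCR variants with corpus frequencies, copied from the slide.
TOP_VARIANTS = {
    "Bataaffche": 53538,
    "Bataaffchen": 15735,
    "Bataaffehe": 1749,
    "Bataafseh": 796,
    "Bataafiche": 683,
    "Bataaflche": 443,
    "Bataavfche": 433,
}
TOTAL_CORRECTED = 81302  # all EDBO tokens corrected into 'Bataafsche'


def long_s_candidates(word):
    """Yield every spelling obtained by turning any subset of 'f' into 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def resolve(word, lexicon):
    """Return the lexicon entries reachable via f -> s substitutions only."""
    return sorted(set(long_s_candidates(word)) & lexicon)


# Hypothetical one-word lexicon, for illustration only.
lexicon = {"Bataafsche"}

for variant in TOP_VARIANTS:
    print(variant, "->", resolve(variant, lexicon))

top7 = sum(TOP_VARIANTS.values())
share = 100 * top7 / TOTAL_CORRECTED
print(f"top 7 variants: {top7} of {TOTAL_CORRECTED} tokens ({share:.1f}%)")
```

The seven listed variants account for 73,377 of the 81,302 corrected tokens, roughly 90%. The naive rule resolves only ‘Bataaffche’; variants with additional confusions (such as ‘c’ read as ‘e’) need a more general method, which is the role TICCL plays here.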


Crowd Sourcing the text quality problem

Nederlab coordinator Nicoline van der Sijs runs a crowd-sourcing endeavour on the side
Volunteers are welcome, please check: http://www.meertens.knaw.nl/kranten_editor/

People retype the 17th-century KB newspapers
Some statistics:


@PhilosTEI in the CLARIN-NL Infrastructure

Project leader: philosopher Arianna Betti (UvA) - http://www.axiom.humanities.uva.nl

There is a system online that allows you to start building your very own corpus
It has an OCR engine (Tesseract)
It is multilingual
And it has Text-Induced Corpus Clean-up or TICCL
Throw in images and get post-corrected FoLiA or TEI XML!
It is free and Open-Source
It is to be further developed in CLARIAH into ‘PICCL’: Philosophical (or: Practical) Integrator of Computational and Corpus Libraries, i.e. a complete corpus building work flow


Poster CLARIN-NL: http://ticclops.uvt.nl/CLARINFinalEvent.PhilosTEI.pdf

Poster CLARIAH: http://ticclops.uvt.nl/CLIN2015-poster.pdf


Next phase in Nederlab

Incorporate corpus exploration and exploitation tools built in companion projects
OpenSoNaR has nice features
We will adopt them!


