17. kb.nederlab.20150324

NEDERLAB & friends. Today’s Nederlab spokesman: Martin Reynaert. Tilburg center for Cognition and Communication - Tilburg University; Centre for Language and Speech Technology - Radboud Universiteit Nijmegen. Symposium: Digitale historische kranten als ‘big data’. Koninklijke Bibliotheek, The Hague. March 24, 2015

Transcript of 17. kb.nederlab.20150324

Page 1: 17. kb.nederlab.20150324

NEDERLAB & friends

Today’s Nederlab spokesman: Martin Reynaert

Tilburg center for Cognition and Communication - Tilburg University

Centre for Language and Speech Technology - Radboud Universiteit Nijmegen

Symposium: Digitale historische kranten als ‘big data’. Koninklijke Bibliotheek, The Hague. March 24, 2015


Page 2: 17. kb.nederlab.20150324

Nederlab: Consortium

Partners working together in Nederlab:


Page 3: 17. kb.nederlab.20150324

Nederlab: aims

The Nederlab project aims to bring together all digitized texts relevant to the Dutch national heritage (c. A.D. 800 – present), consisting of terabytes of data, in one user-friendly and tool-enriched web interface, allowing scholars to simultaneously search and analyze textual data in a virtual research environment.

The focus in Nederlab is currently on incorporating the vast digital text collections of the Koninklijke Bibliotheek (http://www.kb.nl/en) (KB or Dutch National Library) as well as the contents of the Digitale Bibliotheek voor de Nederlandse Letteren (http://www.dbnl.org/) (DBNL - The Digital Library of Dutch Literature).

KB text collections comprise newspapers from 1618 to 1995 and the Early Dutch Books Online or EDBO (http://www.delpher.nl/).


Page 4: 17. kb.nederlab.20150324

Nederlab Portal: Home

http://www.nederlab.nl/onderzoeksportaal/


Page 5: 17. kb.nederlab.20150324

Nederlab Portal: Simple Query


Page 6: 17. kb.nederlab.20150324

Nederlab: added value

Nederlab adds extra value to the corpora it incorporates:

Texts are uniformly reformatted in FoLiA XML: Format for Linguistic Annotation

Provides OCR post-correction by means of Text-Induced Corpus Clean-up or TICCL

Provides extra linguistic annotations: lemmatisation, POS-tagging and Named Entity labeling

Enhances search and retrieval, provides better research opportunities


Page 7: 17. kb.nederlab.20150324

TICCL correction for ‘Bataaffhe’

Most EDBO books are printed in Fraktur: long ‘s’

Most EDBO books are from the period of the ‘Bataafsche Republiek’ (late 18th century)

EDBO has 10,333 Dutch books, in all about 235 million words of text

Top 7 TICCL corrections with corpus frequencies:

Bataaffche    53,538
Bataaffchen   15,735
Bataaffehe     1,749
Bataafseh        796
Bataafiche       683
Bataaflche       443
Bataavfche       433

TICCL identified 1,445 variants for ‘Bataafsche’, of which 872 were hapaxes (single corpus occurrences)

In all, TICCL corrected 81,302 EDBO tokens into ‘Bataafsche’
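The figures on this slide can be checked with a few lines of Python. The sketch below is not TICCL itself: it uses a plain Levenshtein edit distance as a stand-in to show how close the listed OCR variants are to ‘Bataafsche’, and it computes what share of the corrected tokens the top 7 variants would account for. That share rests on an assumption the slide does not state explicitly, namely that all seven listed variants count toward the 81,302 total. The names `TOP_VARIANTS`, `levenshtein` and `top_variant_share` are illustrative only.

```python
# Frequencies copied from the slide, with thousands separators removed.
TOP_VARIANTS = {
    "Bataaffche": 53538,
    "Bataaffchen": 15735,
    "Bataaffehe": 1749,
    "Bataafseh": 796,
    "Bataafiche": 683,
    "Bataaflche": 443,
    "Bataavfche": 433,
}
TOTAL_CORRECTED = 81302  # tokens corrected into 'Bataafsche' (from the slide)
TARGET = "Bataafsche"


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def top_variant_share() -> float:
    """Share of corrected tokens covered by the top 7 variants.

    Assumption: every listed variant is included in TOTAL_CORRECTED.
    """
    return sum(TOP_VARIANTS.values()) / TOTAL_CORRECTED


if __name__ == "__main__":
    for variant, frequency in TOP_VARIANTS.items():
        print(variant, frequency, levenshtein(variant, TARGET))
    print(f"top-7 share: {top_variant_share():.1%}")
```

Under that assumption the seven most frequent variants cover roughly nine tenths of the corrected tokens, and each of them lies within two character edits of the target form.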


Page 8: 17. kb.nederlab.20150324

Nederlab Portal: document hits before TICCL correction


Page 9: 17. kb.nederlab.20150324

Nederlab Portal: document hits after TICCL correction
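Why correction raises the number of document hits can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the Nederlab portal's search implementation: the three documents are invented snippets, and only the two variant spellings are taken from the TICCL slide. The names `DOCUMENTS`, `CORRECTIONS`, `normalise` and `document_hits` are hypothetical.

```python
# Invented mini-corpus for illustration only; this is not Nederlab data.
DOCUMENTS = {
    "doc1": "de Bataaffche Republiek",
    "doc2": "het Bataaflche volk",
    "doc3": "de Bataafsche vloot",
}

# OCR variant -> corrected form; both variants appear on the TICCL slide.
CORRECTIONS = {
    "Bataaffche": "Bataafsche",
    "Bataaflche": "Bataafsche",
}


def normalise(text: str, corrections: dict) -> str:
    """Replace every known variant token by its corrected form."""
    return " ".join(corrections.get(token, token) for token in text.split())


def document_hits(query: str, documents: dict) -> list:
    """Return the sorted ids of documents containing the exact query token."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if query in text.split())


before = document_hits("Bataafsche", DOCUMENTS)
corrected = {doc_id: normalise(text, CORRECTIONS)
             for doc_id, text in DOCUMENTS.items()}
after = document_hits("Bataafsche", corrected)

if __name__ == "__main__":
    print("hits before correction:", before)
    print("hits after correction:", after)
```

An exact-match query finds only the correctly spelled document before normalisation, and all three afterwards.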


Page 10: 17. kb.nederlab.20150324

Crowd Sourcing the text quality problem

Nederlab coordinator Nicoline van der Sijs runs a crowd-sourcing endeavour on the side

Volunteers are welcome, please check: http://www.meertens.knaw.nl/kranten_editor/

People retype the 17th century KB newspapers

Some statistics:


Page 11: 17. kb.nederlab.20150324

@PhilosTEI in the CLARIN-NL Infrastructure

Project leader: philosopher Arianna Betti (UvA) - http://www.axiom.humanities.uva.nl

There is a system online that allows you to start building your very own corpus

It has an OCR engine (Tesseract)

It is multilingual

And it has Text-Induced Corpus Clean-up or TICCL

Throw in images and get post-corrected FoLiA or TEI XML!

It is free and Open-Source

It is to be further developed in CLARIAH into ‘PICCL’: Philosophical (or: Practical) Integrator of Computational and Corpus Libraries, i.e. a complete corpus building work flow


Page 12: 17. kb.nederlab.20150324

@PhilosTEI in the CLARIN-NL Infrastructure

System: http://philostei.clarin.inl.nl

Poster CLARIN-NL: http://ticclops.uvt.nl/CLARINFinalEvent.PhilosTEI.pdf

Poster CLARIAH: http://ticclops.uvt.nl/CLIN2015-poster.pdf


Page 13: 17. kb.nederlab.20150324

Next phase in Nederlab

Incorporate corpus exploration and exploitation tools built in companion projects

OpenSoNaR has nice features

We will adopt them!


Page 14: 17. kb.nederlab.20150324

OpenSoNaR in the CLARIN-NL Infrastructure

System: http://opensonar.clarin.inl.nl

Poster: http://ticclops.uvt.nl/CLARINFinalEvent.OpenSoNaR.pdf


Page 15: 17. kb.nederlab.20150324

ENJOY!!

Thank you for your attention!

http://www.nederlab.nl/onderzoeksportaal/
