17. kb.nederlab.20150324
NEDERLAB & friends
Today’s Nederlab spokesman: Martin Reynaert
Tilburg center for Cognition and Communication - Tilburg University
Centre for Language and Speech Technology - Radboud Universiteit Nijmegen
Symposium: Digitale historische kranten als ‘big data’. Koninklijke Bibliotheek, The Hague. March 24, 2015
Nederlab: Consortium
Partners collaborating in Nederlab:
Nederlab: aims
The Nederlab project aims to bring together all digitized texts relevant to the Dutch national heritage (c. A.D. 800 to present), consisting of terabytes of data, in one user-friendly and tool-enriched web interface, allowing scholars to simultaneously search and analyze textual data in a virtual research environment.
The focus in Nederlab is currently on incorporating the vast digital text collections of the Koninklijke Bibliotheek (http://www.kb.nl/en) (KB or Dutch National Library) as well as the contents of the Digitale Bibliotheek voor de Nederlandse Letteren (http://www.dbnl.org/) (DBNL, the Digital Library of Dutch Literature).
KB text collections comprise newspapers from 1618 to 1995 and the Early Dutch Books Online or EDBO (http://www.delpher.nl/).
Nederlab Portal: Home
http://www.nederlab.nl/onderzoeksportaal/
Nederlab Portal: Simple Query
Nederlab: added value
Nederlab adds extra value to the corpora it incorporates:
Texts are uniformly reformatted in FoLiA XML (Format for Linguistic Annotation)
OCR post-correction by means of Text-Induced Corpus Clean-up, or TICCL
Extra linguistic annotations: lemmatisation, POS-tagging and Named Entity labeling
Enhanced search and retrieval, and better research opportunities
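To make the idea of uniformly annotated text concrete, here is a minimal sketch of reading word-level annotations from a FoLiA-style document with Python's standard library. The sample document is hand-written and simplified for illustration: it is not taken from Nederlab, it is not guaranteed to validate against the FoLiA schema, and the `class` values for part of speech and lemma are hypothetical placeholders rather than values from a real tag set.

```python
import xml.etree.ElementTree as ET

# FoLiA's XML namespace. The sample below is a simplified, hand-written
# illustration and may not validate against the real FoLiA schema.
FOLIA_NS = "http://ilk.uvt.nl/folia"

# Hypothetical two-word sample; the pos/lemma class values are made up.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1">
        <t>Bataafsche</t>
        <pos class="ADJ"/>
        <lemma class="Bataafs"/>
      </w>
      <w xml:id="example.s.1.w.2">
        <t>Republiek</t>
        <pos class="N"/>
        <lemma class="republiek"/>
      </w>
    </s>
  </text>
</FoLiA>"""


def read_tokens(xml_text):
    """Return (word, lemma, pos) tuples for every <w> element."""
    root = ET.fromstring(xml_text)
    ns = {"f": FOLIA_NS}
    tokens = []
    for w in root.iterfind(".//f:w", ns):
        word = w.findtext("f:t", default="", namespaces=ns)
        lemma = w.find("f:lemma", ns)
        pos = w.find("f:pos", ns)
        tokens.append((
            word,
            lemma.get("class") if lemma is not None else None,
            pos.get("class") if pos is not None else None,
        ))
    return tokens


if __name__ == "__main__":
    for word, lemma, pos in read_tokens(SAMPLE):
        print(word, lemma, pos)
```

Because every text carries its annotations in the same layout, a single reader like this could in principle serve all incorporated corpora, which is the point of the uniform reformatting.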
TICCL correction for ‘Bataaffhe’
Most EDBO books are printed in Fraktur: long ‘s’
Most EDBO books are from the period of the ‘Bataafsche Republiek’ (late 18th century)
EDBO has 10,333 Dutch books, in all about 235 million words of text
Top 7 TICCL corrections with corpus frequencies:
Bataaffche     53,538
Bataaffchen    15,735
Bataaffehe      1,749
Bataafseh         796
Bataafiche        683
Bataaflche        443
Bataavfche        433
TICCL identified 1,445 variants for ‘Bataafsche’, of which 872 were hapaxes (single corpus occurrences)
In all, TICCL corrected 81,302 EDBO tokens into ‘Bataafsche’
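The figures on this slide can be checked with a few lines of arithmetic. The sketch below sums the seven listed frequencies, relates them to the reported totals, and measures how far each variant is from ‘Bataafsche’ in plain Levenshtein edit distance. The edit-distance function is only an illustrative yardstick chosen here; the slides do not describe how TICCL itself matches variants, so this should not be read as TICCL's algorithm.

```python
# Frequencies of the seven most frequent variants, as listed on the slide.
TOP_VARIANTS = {
    "Bataaffche": 53538,
    "Bataaffchen": 15735,
    "Bataaffehe": 1749,
    "Bataafseh": 796,
    "Bataafiche": 683,
    "Bataaflche": 443,
    "Bataavfche": 433,
}
TOTAL_CORRECTED = 81302   # tokens corrected into 'Bataafsche'
TOTAL_VARIANTS = 1445     # distinct variants identified
HAPAX_VARIANTS = 872      # variants occurring only once
TARGET = "Bataafsche"


def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute cost 1).

    Used here only as an illustration; not claimed to be TICCL's method.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


top7_total = sum(TOP_VARIANTS.values())
top7_share = top7_total / TOTAL_CORRECTED
hapax_share = HAPAX_VARIANTS / TOTAL_VARIANTS
distances = {v: levenshtein(v, TARGET) for v in TOP_VARIANTS}

if __name__ == "__main__":
    print(f"top 7 variants: {top7_total} tokens ({top7_share:.1%} of corrections)")
    print(f"hapax share of variants: {hapax_share:.1%}")
    for variant, dist in distances.items():
        print(f"{variant:12s} distance {dist}")
```

The seven listed variants account for 73,377 of the 81,302 corrected tokens, roughly 90%, while about 60% of the distinct variants are hapaxes. Each of the seven lies within two edits of ‘Bataafsche’, mostly a long ‘s’ misread as ‘f’.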
Nederlab Portal: document hits before TICCL correction
Nederlab Portal: document hits after TICCL correction
Crowd Sourcing the text quality problem
Nederlab coordinator Nicoline van der Sijs runs a crowd-sourcing endeavour on the side
Volunteers are welcome, please check: http://www.meertens.knaw.nl/kranten_editor/
People retype the 17th century KB newspapers
Some statistics:
@PhilosTEI in the CLARIN-NL Infrastructure
Project leader: philosopher Arianna Betti (UvA) - http://www.axiom.humanities.uva.nl
There is a system online that allows you to start building your very own corpus
It has an OCR engine (Tesseract)
It is multilingual
And it has Text-Induced Corpus Clean-up or TICCL
Throw in images and get post-corrected FoLiA or TEI XML!
It is free and Open-Source
It is to be further developed in CLARIAH into ‘PICCL’: Philosophical (or: Practical) Integrator of Computational and Corpus Libraries, i.e. a complete corpus building workflow
@PhilosTEI in the CLARIN-NL Infrastructure
System: http://philostei.clarin.inl.nl
Poster CLARIN-NL: http://ticclops.uvt.nl/CLARINFinalEvent.PhilosTEI.pdf
Poster CLARIAH: http://ticclops.uvt.nl/CLIN2015-poster.pdf
Next phase in Nederlab
Incorporate corpus exploration and exploitation tools built in companion projects
OpenSoNaR has nice features
We will adopt them!
OpenSoNaR in the CLARIN-NL Infrastructure
System: http://opensonar.clarin.inl.nl
Poster: http://ticclops.uvt.nl/CLARINFinalEvent.OpenSoNaR.pdf
ENJOY!!
Thank you for your attention!
http://www.nederlab.nl/onderzoeksportaal/