Corpus Assembly as Text Data Integration from Digital Libraries and the Web
Jena University Language & Information Engineering (JULIE) Lab
https://julielab.de/
DFG Graduate School „Romanticism as a Model“
http://modellromantik.uni-jena.de
Friedrich Schiller University Jena, Germany
Jun 3, 2019 – Urbana-Champaign, IL – JCDL '19 – Session 1A – Generation and Linking
Udo Hahn & Tinghui Duan
Allgemeine Literatur-Zeitung (1785-1849)
General Literature Gazette, ALZ
Jena/Halle, Germany
Very important historical text source for literary studies in German Romanticism (1790-1830)
Allgemeine Literatur-Zeitung (1785-1849)
Traditional Workflow
Printed Book
• Scan
Scanned Picture
• OCR
Full Text
• Encode
• Assemble
Corpus
• Analyse
Research Result
Cost- and Time-Consuming
315 Volumes
≈ 150,000 Pages
≈ 150,000,000 Tokens
Allgemeine Literatur-Zeitung (1785-1849)
Alternative Workflow
Digital Libraries
Full Text
• Encode
• Assemble
Corpus
• Analyse
Research Result
Scattered Digital Resources of ALZ
USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan
UK: University of Oxford
Austria: Austrian National Library
Switzerland: University of Lausanne
Germany: Bavarian State Library
1,200+ Volumes
600,000+ Pages
600,000,000+ Tokens
Proposed Workflow
Digital Libraries and the Web
• Collect
• Correct Metadata
https://archive.org/details/bub_gb_udTjAAAAMAAJ/
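The Collect step can be illustrated with the item URL shown on the slide. The sketch below is only an assumption about how such an item could be addressed programmatically, not the authors' harvesting code: it extracts the item identifier from the details URL and builds the address of the Internet Archive's JSON metadata record for that item. The function names `ia_identifier` and `ia_metadata_url` are hypothetical, and no network request is made.

```python
from urllib.parse import urlparse


def ia_identifier(details_url: str) -> str:
    """Extract the item identifier from an archive.org /details/ URL."""
    parts = [p for p in urlparse(details_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not an archive.org details URL: {details_url}")
    return parts[1]


def ia_metadata_url(identifier: str) -> str:
    """Build the URL of the item's JSON metadata record."""
    return f"https://archive.org/metadata/{identifier}"


# The item URL from the slide; the helper names above are illustrative only.
item = ia_identifier("https://archive.org/details/bub_gb_udTjAAAAMAAJ/")
print(item)
print(ia_metadata_url(item))
```

Fetching and checking the returned record (title, year, volume number) would be the starting point for the Correct Metadata step; which fields need correction is not stated on the slides.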
Proposed Workflow
Digital Libraries and the Web
• Collect
• Correct Metadata
Full-Texts
• Evaluate
• Select
14 different full-text versions for this page!
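When several full-text versions of the same page exist, Evaluate and Select amount to scoring each version and keeping the best one. The slides do not say which quality measure was used, so the sketch below assumes a simple stand-in: the share of tokens found in a word list. The names `lexicon_hit_rate`, `select_best`, `LEXICON` and the two sample strings are hypothetical.

```python
from __future__ import annotations

import re


def lexicon_hit_rate(text: str, lexicon: set[str]) -> float:
    """Share of alphabetic tokens that appear in the lexicon (0.0 if no tokens)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)


def select_best(versions: dict[str, str], lexicon: set[str]) -> str:
    """Return the key of the version with the highest lexicon hit rate."""
    return max(versions, key=lambda source: lexicon_hit_rate(versions[source], lexicon))


# Toy data: a tiny word list and two invented OCR outputs of the same line.
LEXICON = {"allgemeine", "literatur", "zeitung", "vom", "jahre"}
versions = {
    "library_a": "Allgemeine Literatur-Zeitung vom Jahre",
    "library_b": "Allgemeiue Literatnr-Zeitnng vom Iahre",
}
best = select_best(versions, LEXICON)
print(best)
```

A real evaluation would need a lexicon suited to historical German spelling; a modern word list would penalise correct period orthography.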
Proposed Workflow
Digital Libraries and the Web
• Collect
• Correct Metadata
Full-Texts
• Evaluate
• Select
Best-Quality Full-Texts
• Encode
• Assemble
Target-Corpus
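The Encode and Assemble steps can be pictured as wrapping the selected page texts in markup and ordering them into volumes. The slides do not name the encoding scheme, so the element and attribute names below (`volume`, `page`, `id`, `n`) are placeholders, and `assemble_volume` is a hypothetical helper rather than part of the published corpus tooling.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def assemble_volume(volume_id: str, pages: dict[int, str]) -> ET.Element:
    """Wrap page texts in a <volume> element, ordered by page number."""
    volume = ET.Element("volume", {"id": volume_id})
    for number in sorted(pages):
        page = ET.SubElement(volume, "page", {"n": str(number)})
        page.text = pages[number]
    return volume


# Toy input: pages arrive unordered, as they might from different sources.
volume = assemble_volume("ALZ-example", {2: "second page", 1: "first page"})
xml_string = ET.tostring(volume, encoding="unicode")
print(xml_string)
```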
Result
Target-Corpus
The Largest Corpus for German Romanticism
https://github.com/JULIELab/ALZ
Complete ALZ: 315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens
Target-Corpus: 261 Volumes, 126,612 Pages, 120,369,005 Tokens
≈ 82% coverage
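The coverage figure can be checked against the counts on the slide. The sketch below divides the corpus counts by the totals for the complete ALZ, unit by unit. How the single ≈ 82% figure was derived is not stated; note also that the page and token totals for the complete ALZ are themselves approximations, so these ratios are rough.

```python
# Counts taken from the slide; totals for the complete ALZ are approximate.
full = {"volumes": 315, "pages": 150_000, "tokens": 150_000_000}
corpus = {"volumes": 261, "pages": 126_612, "tokens": 120_369_005}

# Coverage per unit: share of the complete ALZ present in the corpus.
coverage = {unit: corpus[unit] / full[unit] for unit in full}

for unit, ratio in coverage.items():
    print(f"{unit}: {ratio:.1%}")
```

The three ratios come out at roughly 83% (volumes), 84% (pages) and 80% (tokens), which brackets the reported ≈ 82%.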
Problems
• Restricted Accessibility
• Heterogeneous Digitization Conditions and OCR Quality
Conclusion
• The Largest Corpus for German Romanticism
• Great Potential of Digital Libraries (DLs) for Computational Literary Studies
• More Cooperation Between DLs Is Desirable
• Better Metadata and OCR Quality Are Desirable
Thank you!