Romanello tokyo

Structured Vs Unstructured: Extracting Information From Scholarly Texts in European Classical Studies
Matteo Romanello, Centre for Computing in the Humanities
EIRI - CCH Symposium on the Digitization in the Humanities, Keio University - Tokyo, 18th March 2010
Romanello (CCH), Extracting Information From Scholarly Texts, EIRI - CCH Symposium

description

Presentation of my research project, given at the EIRI – CCH Conference on the Digitization in the Humanities at Keio University (Tokyo).

Transcript of Romanello tokyo

  • 1. Structured Vs Unstructured: Extracting Information From Scholarly Texts in European Classical Studies. Matteo Romanello, Centre for Computing in the Humanities. EIRI - CCH Symposium on the Digitization in the Humanities, Keio University - Tokyo, 18th March 2010.

  • 2. Overview
      1 Introduction
      2 Motivations and Background
      3 Methodology
      4 Work Phases
      5 Expected Results

  • 4. Introduction: The Project at a Glance
      • Project started in October 2009
      • Field of application: Digital Humanities, Classics (particularly Greek literature)
      • Co-supervision between the CCH and the CS department at King's -> application of Computational Linguistics methods

  • 5. Introduction: Focus
      • Scholarly texts from the European scholarly tradition in Classical Studies
      • Secondary sources, e.g. journal papers, as opposed to primary sources, i.e. ancient texts
      • Sets of texts considered so far:
          • Princeton - Stanford Working Papers in Classics (PSWPC)
          • LEXIS online: classics journal available online under an Open Access policy
      • Goal -> information extraction

  • 6. Introduction: Goal
      Devising an automatic system to improve semantic information retrieval over a discipline-specific corpus of unstructured texts
      • focus on secondary sources
      • automatic -> scalable with huge amounts of data
      • information retrieval -> the task of retrieving information
      • unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML

  • 8. Motivations: The Million Book Library
      • archive.org, Google Books -> growth of the volume of information publicly available in electronic format
      • longer shelf-life of books in Classics/Humanities
      • need for effective tools to access information for research purposes

  • 9. Motivations: Information Extraction in Classics: challenges
      • lack of tools comparable to CiteSeerX, GoPubMed, etc.
      • results of traditional search engines -> high recall but low precision
      • need to go beyond TOCs or string matching-based IR
      • still issues with the encoding of Ancient Greek
      • no ad-hoc gold standards/training sets
      • lack of tools specifically tailored to Classics resources
      • electronically available text does not mean electronic text

  • 11. Methodology: Named Entities as Access Point to Information
      • mentions of entities matter for Classicists -> importance of print indexes in Classics
      • disambiguation, different spellings or translations of names
      • relating different expressions to the same entity

  • 12. Methodology: Named Entities as Access Point to Information
      Entities to be extracted:
      1 Place names (ancient and modern)
      2 Relevant person names (mythological names, ancient authors, modern scholars)
      3 References to primary and secondary sources (canonical texts and modern publications about them)

  • 13. Methodology: Reuse of Structured Information
      • Reuse of structured data sources, e.g. thesauri, authority lists, etc., produced by scholars over the last two decades -> to train machine-learning based tools to mine unstructured texts
      • Related work:
          • Research in the AI field -> Semantic Integration
          • Use of Wikipedia/DBpedia in NLP
      • Related projects: EROCS by IBM

  • 15. Work Phases

  • 16. Work Phases: Corpus Building
      • Getting materials
          • Crawling online archives
      • Extracting the text from collected documents
          • Tools for text extraction from PDF -> open issues with Ancient Greek encoding
          • re-OCR documents, even the native digital ones

  • 17. Work Phases: Corpus Building II
      • Corpora: open access, multilingual
          • Princeton/Stanford Working Papers in Classics (PSWPC)
          • Lexis online
          • 470 articles in 2 corpora
      • OCR
          • FineReader
          • OCRopus (layout analysis)
          • text extracted from PDFs (tools like pdftotext etc.)
          • alignment of multiple OCR outputs

  • 18. Work Phases: Building the Knowledge Base (KB)
      • Goal: integrate different data sources into a single KB
      • Why?
          • Information about the same entities is spread over several data sources
          • Data sources might use different output formats (raw text, DBs, HTML, XML etc.)
          • partial overlaps but no interoperability
      • How? Use of high-level ontologies to map records related to the same entity
      • Result: KB containing semantic data

  • 19. Work Phases: Building the Knowledge Base (KB) II
      • Ontologies -> in CS, a formalism to model data
      • Integrating data sources:
          • import each data source
          • map it to high-level ontologies (e.g., CIDOC-CRM)
          • find overlaps between data sources -> aligning the records
      • The resulting knowledge base will be used as support for all the text processing tasks
      • Implementation of the KB: RDF triple store with a SPARQL interface

  • 20. Work Phases: Corpus Processing
      1 sentence identification
      2 entity extraction (named entity recognition + disambiguation): KB used to build up an entity context
      3 canonical references extraction: KB provides training data
      4 modern bibliographic references extraction: KB provides lists of journals/place names/authors to improve the performance of the tool

  • 21. Work Phases

  • 22. Work Phases: Canonical References Extraction
      1 citations used specifically for primary sources (i.e. works of ancient authors)
      2 essential entry point to information: they refer to the research object, i.e. ancient texts
      3 logical instead of physical citation scheme (e.g., chapter/paragraph vs. page)
      4 variation -> time, style, language (regexp insufficient!)
      Example:
          Hom. Il. XII 1
          Aesch. Sept. 565-67, 628-30; Ar. Arch. 803
          Hes. fr. 321 M.-W.
          Callimaco, ep. 28 Pf., 5-6

  • 24. Expected Results
      Results:
      • Automatically provide multiple meaningful entry points to information
      • Enrich the corpus with links to resources (particularly primary sources)
      • Improve user access to the corpus
      • Demonstrate the scalability of the approach
      Tools/Resources:
      • Knowledge Base for Classics
      • Articles with improved text quality
      • (improved) corpora to be released
      • single tools for information extraction (e.g. CREX, Canonical References EXtractor)

  • 25. Expected Results: Possible Applications
      • Solutions to problems peculiar to Classics might help to improve the performance of existing tools/algorithms
      • Collections of secondary sources as corpora:
          • citation patterns
          • citation and co-citation networks
          • trends in Classics citation practice

  • 26. Expected Results: Thanks for your attention! [email protected] http://uk.linkedin.com/in/matteoromanello
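
Slide 22 remarks that regular expressions are insufficient for canonical references because of variation in time, style and language. The following is a minimal Python sketch of that point, not the CREX tool mentioned on slide 24: a deliberately naive pattern is run against the four example citations from the slide. The pattern `SIMPLE_CITATION` and the helper `find_simple_citations` are hypothetical names introduced here for illustration only.

```python
import re

# Hypothetical, deliberately naive pattern (an assumption, not the author's method).
# It only covers the shape "Author. Work. locator", e.g. "Hom. Il. XII 1".
SIMPLE_CITATION = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"              # abbreviated author, e.g. "Hom."
    r"(?P<work>[A-Z][a-z]+\.)\s+"                # abbreviated work, e.g. "Il."
    r"(?P<scope>(?:[IVXLC]+\s+)?\d+(?:-\d+)?)"   # e.g. "XII 1" or "565-67"
)


def find_simple_citations(text):
    """Return one dict (author, work, scope) per match of the naive pattern."""
    return [m.groupdict() for m in SIMPLE_CITATION.finditer(text)]


# The example citations as given on slide 22.
examples = [
    "Hom. Il. XII 1",
    "Aesch. Sept. 565-67, 628-30; Ar. Arch. 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ep. 28 Pf., 5-6",
]

for citation in examples:
    print(citation, "->", find_simple_citations(citation))
```

Under these assumptions the pattern finds the first citation, finds two references in the second line but silently drops the second range (628-30), and finds nothing in the fragment citation or in the Italian-language one. Each gap could be patched with a further rule, but every new style or language adds more rules, which is consistent with the slide's case for training a tool on data from the knowledge base.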