Corpus studio Erwin Komen
CorpusStudio web application Erwin R. Komen
Meertens Instituut // Radboud University Nijmegen // SIL-International
1. Background
• Existing software:
  • CorpusStudio – Windows
  • Cesax – Windows
  • Successfully used in linguistic research
• Web application version?
  • Central location for corpora (the latest version)
  • Platform independent: macOS/Linux/Windows
  • Fast parallel processing
2. Formats
• FoLiA xml
  • Dutch: Nederlab, CGN, Sonar/Lassy
• TEI-Psdx xml
  • English historical + SLA
  • Caucasian: Chechen, Lak, Lezgi
  • Old Welsh
  • Dutch
• Additional formats
  • Convert via ‘Cesax’ (Alpino, Negra, …)
  • Add handler into CorpusStudio
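As an illustration of what a format handler has to do with FoLiA input, the following minimal Python sketch reads the word tokens of each sentence from a small FoLiA-like fragment. The sample document is a simplified assumption (real FoLiA files carry metadata and annotation declarations), and the function name sentence_words is hypothetical; this is not CorpusStudio's own handler.

```python
import xml.etree.ElementTree as ET

# FoLiA elements live in this XML namespace.
FOLIA_NS = {"f": "http://ilk.uvt.nl/folia"}

# Simplified, hand-written sample (assumption: not taken from any real corpus).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Dit</t></w>
      <w xml:id="example.s.1.w.2"><t>is</t></w>
      <w xml:id="example.s.1.w.3"><t>FoLiA</t></w>
    </s>
  </text>
</FoLiA>"""


def sentence_words(xml_text):
    """Return one list of word strings per <s> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//f:s", FOLIA_NS):
        # Each <w> token holds its text in a <t> child.
        words = [
            w.findtext("f:t", default="", namespaces=FOLIA_NS)
            for w in s.iterfind("f:w", FOLIA_NS)
        ]
        sentences.append(words)
    return sentences


if __name__ == "__main__":
    print(sentence_words(SAMPLE))
```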
4. Defining queries
• Definition editor
  • Constants
  • Functions (XQuery)
• Query editor
  • Subcategorization (XQuery)
• Constructor editor
  • Execution order
  • Options (examples, output, complement)
• Result database feature editor
  • XQuery user functions calculate the features
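The division of labour between definitions (constants and functions), a query with subcategorization, and calculated features can be sketched as follows. In CorpusStudio these parts are written in XQuery; Python is swapped in here purely for illustration. The element and attribute names (forest, eTree, eLeaf, Label, Text) are a simplified assumption about Psdx-style input, and NP_PREFIX, is_np and run_query are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified Psdx-like sample (assumption: real files contain more structure).
SAMPLE = """<forestGrp>
  <forest forestId="1">
    <eTree Id="1" Label="IP-MAT">
      <eTree Id="2" Label="NP-SBJ"><eLeaf Type="Vern" Text="he"/></eTree>
      <eTree Id="3" Label="VBD"><eLeaf Type="Vern" Text="saw"/></eTree>
      <eTree Id="4" Label="NP-OB1"><eLeaf Type="Vern" Text="them"/></eTree>
    </eTree>
  </forest>
</forestGrp>"""

# "Definition" part: a constant and a function shared by queries.
NP_PREFIX = "NP"


def is_np(node):
    """True when the constituent label starts with the NP prefix."""
    return node.get("Label", "").startswith(NP_PREFIX)


def run_query(xml_text):
    """"Query" part: collect NP constituents, subcategorized by label."""
    root = ET.fromstring(xml_text)
    hits = []
    for forest in root.iter("forest"):
        for node in forest.iter("eTree"):
            if is_np(node):
                # "Feature" part: a value calculated for each hit.
                text = " ".join(leaf.get("Text", "") for leaf in node.iter("eLeaf"))
                hits.append({
                    "forest": forest.get("forestId"),
                    "subcat": node.get("Label"),
                    "text": text,
                })
    return hits


if __name__ == "__main__":
    hits = run_query(SAMPLE)
    print(Counter(hit["subcat"] for hit in hits))
```

Each hit carries its subcategory and one calculated feature, which mirrors the idea of a result database whose feature values are computed by user functions.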
6. Availability
• CorpusStudio sources (build your own version)
  • https://github.com/ErwinKomen
• CLARIN-NL access
  • http://www.clarin.nl/node/2095
7. References
Boag, Scott, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, and Jérôme Siméon. 2010. XQuery 1.0: An XML Query Language (Second Edition): W3C Recommendation, <http://www.w3.org/XML/Query>.
van Gompel, Maarten, and Martin Reynaert. 2013. FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal 3: 63-81.
Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12). Sandra Kübler, Petya Osenova & Martin Volk (eds), 85-96. Sofia, Bulgaria: The Institute of Information and Communication Technologies, Bulgarian Academy of Sciences.
[Figure: architecture diagram of the CorpusStudio web application. Labels recoverable from the diagram: User information; Project information; Meta Data Editor; Definition Editor; Query Editor; Constructor Editor; Input Selector; Database feature editor; Definitions; Queries; Corpus Research Project (.crpx); Search service: crpp; Query Executor; Database Creator; Output Monitor; Status; Documents (.xml); Results (.xml); Corpus Research Database (.xml); Result Grouping; Standard grouping (.json); Result Viewer; Table Viewer; Grouping Viewer; Corpus Viewer; Result database; Result dbase Viewer; Result dbase Editor. Data exchange labels: xml, json.]
3. Corpus Research Projects
• All information for one research project
  • Meta information (author, dates, goal)
  • Input (language, corpus, filter)
  • All definition and query files used
  • Execution order
  • Optional: result database features
• Exchange
  • Upload/download
  • Compatible with Windows CorpusStudio
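The items above can be pictured as a single record that travels with the project. The Python sketch below is only a model of that list: the class and field names are hypothetical, and the JSON round trip stands in for upload/download; it does not reproduce the actual .crpx file layout.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ResearchProject:
    """Hypothetical container for everything one research project needs."""
    # Meta information
    name: str
    author: str
    goal: str
    # Input
    language: str
    corpus: str
    input_filter: str = ""
    # Definition and query files, plus the order in which queries run
    definitions: List[str] = field(default_factory=list)
    queries: List[str] = field(default_factory=list)
    execution_order: List[str] = field(default_factory=list)
    # Optional result database features
    db_features: List[str] = field(default_factory=list)

    def unknown_steps(self):
        """Return execution-order entries that name no query in the project."""
        return [q for q in self.execution_order if q not in self.queries]


def to_json(project):
    """Serialize the project for exchange (stand-in for upload)."""
    return json.dumps(asdict(project), indent=2)


def from_json(text):
    """Rebuild a project from its serialized form (stand-in for download)."""
    return ResearchProject(**json.loads(text))


if __name__ == "__main__":
    project = ResearchProject(
        name="np-survey", author="A. Researcher", goal="Count NP types",
        language="eng_hist", corpus="sample", queries=["findNP"],
        execution_order=["findNP"],
    )
    print(to_json(project))
```

Keeping the execution order inside the same record as the queries makes it possible to check, before anything runs, that every step refers to a query the project actually contains.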
CorpusStudio components
Meta Data Editor
Definition Editor
Input Selector
Query Editor
Constructor Editor
Output Monitor
Query Executor
Result Viewer
Corpus Viewer
Database feature editor
5. Future
• Grouping editor
  • Group output over meta-data categories
  • User-definable (XQuery)
• Query/project wizard
  • Tabular input of principal components
  • Relations, names, feature calculations
• Result database editor
  • View and edit result database records
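Grouping output over meta-data categories amounts to a cross-tabulation of hits by a metadata value and a subcategory. The following Python sketch shows the idea under stated assumptions: the hit layout (a "meta" dictionary and a "subcat" value), the category name "period", and the function group_hits are all hypothetical, and the planned grouping editor would express the grouping in XQuery rather than Python.

```python
from collections import Counter, defaultdict


def group_hits(hits, category):
    """Count hits per subcategory within each value of a metadata category."""
    table = defaultdict(Counter)
    for hit in hits:
        # Hits whose text lacks the category fall into a separate bucket.
        group = hit["meta"].get(category, "(unknown)")
        table[group][hit["subcat"]] += 1
    return {group: dict(counts) for group, counts in table.items()}


# Invented sample hits (assumption: not real query output).
HITS = [
    {"subcat": "NP-SBJ", "meta": {"period": "OE"}},
    {"subcat": "NP-OB1", "meta": {"period": "OE"}},
    {"subcat": "NP-SBJ", "meta": {"period": "ME"}},
    {"subcat": "NP-SBJ", "meta": {"period": "OE"}},
    {"subcat": "NP-OB1", "meta": {}},
]

if __name__ == "__main__":
    for group, counts in group_hits(HITS, "period").items():
        print(group, counts)
```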