Towards Digital Coptic
Caroline T. Schroeder, University of the Pacific
Amir Zeldes, Humboldt-Universität zu Berlin
Berlin Digital Classicist Seminar, 14.1.2014
Searching and Visualizing Coptic Manuscript Data
Berlin, 14.1.2014 Schroeder & Zeldes / Towards Digital Coptic 1/37
Plan
Introduction
Coptic data
Annotations so far: normalizing, tokenizing and tagging
Search architecture
Searching through multiple segmentations: ANNIS
Dealing with corpus formats: TEI, SaltNPepper
Visualization
Dedicated visualizations
A reusable generic approach
Conclusion and outlook
Who are these people?
Prof. Caroline T. Schroeder – Religious and Classical Studies / Humanities Center Director University of the Pacific
Dr. Amir Zeldes – Korpuslinguistik / SFB 632 Information Structure (from March: eHumanities group KOMeT) Humboldt-Universität zu Berlin
Cooperation: Coptic SCRIPTORIUM, established at the 2012 NEH summer institute on "Text in a Digital Age" (Tufts): http://coptic.pacific.edu/
Why Coptic?
Last stage of Ancient Egyptian Language (starting 2nd Century)
Mediterranean in 1st millennium
Hellenistic period
Unique language
Longest continuous documentation
Contact language (with Greek)
Religious significance
Early Christianity
Rise of monasticism
Gnosticism
...
BMBF eHumanities - KOMeT / Zeldes: Coptic dialects
The data
Lots of material (thanks to the Egyptian desert)
Relatively little online, nothing like Greek and Latin (Perseus)
Lots of things you may want are not available:
New Testament (online, not normalized/lemmatized/annotated)
Old Testament
The Rule of St. Pachomius
Works of Shenoute of Atripe
Apophthegmata patrum
...
But some have been digitized at some point!
A word about the texts in this talk
So far we've concentrated on Shenoute's sermon Abraham our Father
"As for us, brethren, let us live by the truth so that we are upstanding in all our works, and so that the prophets, apostles and all the saints might dwell among us, ..."
Apophthegmata Patrum (sayings of the desert fathers)
"They said about the blessed Sarah the virgin that she spent sixty years living at the top of the river and she never set foot outside to see the river."
New Testament, esp. Gospel of Mark
see http://coptic.pacific.edu/ for corpora and tools
Getting from raw text to annotated corpora
Making the data searchable starts with:
Encoding manuscripts (EpiDoc TEI)
Segmentation of "word forms"
Normalization
Segmentation of morphemes
Part-of-speech tagging
More annotations...
Brief recap: Detailed talk in Leipzig last month (slides on my page)
Normalization
Automatic normalization, manual correction
handling of known diacritics, abbreviations
closed, growing list of known variants
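The normalization step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual normalizer: the function names, the handling of the backtick mark and the single entry in the variant list are assumptions made for the example.

```python
import unicodedata

# Hypothetical variant list for illustration; a real list would be
# much longer and maintained by hand as new variants are found.
KNOWN_VARIANTS = {
    "ⲓⲥ": "ⲓⲏⲥⲟⲩⲥ",  # assumed entry: abbreviated name expanded to full form
}

def strip_diacritics(text):
    """Drop combining marks (dots, overlines) from a transcribed form."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(word):
    """Automatic pass: strip marks, then look the form up in the variant list."""
    bare = strip_diacritics(word).replace("`", "")  # assumption: drop the ` mark too
    return KNOWN_VARIANTS.get(bare, bare)

print(normalize("ⲁ\u0307ⲛⲧⲱⲛⲓ\u0307ⲟⲥ"))
print(normalize("ⲓ\u0305ⲥ\u0305"))
```

Anything the lookup does not cover is passed through unchanged, which is where the manual correction step would come in.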
Tokenization
Identifying morphemes non-trivial (agglutinative language, different conventions; we follow Layton 2004)
ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ 'Since I became a monk' since-that-PAST-1sg-do-monk
ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ 'he who made us keep the ceremony' REL-PAST-3sgM-CAUS-1pl-do-the-observance
Word level segmentation: manual (no scriptio continua)
Morph segmentation: automatic (accuracy: 84% - 94%)
ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ`
ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ
of-a-son of-Abraham
of a son of Abraham
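The slides do not say how the automatic morph segmentation works. Purely to illustrate the task, here is a sketch using greedy longest-match lookup against a lexicon, which is a stand-in technique and not necessarily the project's method. The toy lexicon only contains the morphs from the example above.

```python
# Toy lexicon taken from the example on this slide; a real lexicon
# would be far larger and would need disambiguation.
LEXICON = {"ⲛ", "ⲟⲩ", "ϣⲏⲣⲉ", "ⲁⲃⲣⲁϩⲁⲙ"}

def segment(word, lexicon=LEXICON):
    """Split a normalized word form into morphs by greedy longest match."""
    morphs, i = [], 0
    max_len = max(map(len, lexicon))
    while i < len(word):
        # Try the longest candidate first; if nothing matches, the loop
        # ends with size == 1 and a single character is emitted.
        for size in range(min(max_len, len(word) - i), 0, -1):
            if word[i:i + size] in lexicon:
                break
        morphs.append(word[i:i + size])
        i += size
    return morphs

print(segment("ⲛⲟⲩϣⲏⲣⲉ"))
print(segment("ⲛⲁⲃⲣⲁϩⲁⲙ"))
```

Greedy matching fails on ambiguous strings, which is one plausible reason why accuracy figures for automatic segmentation stay below 100%.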
Part-of-speech tagging
POS tagging using TreeTagger (Schmid 1994) and a lexicon from the CMCL project (courtesy of Prof. Tito Orlandi)
Two tag sets:
fine grained (45 tags) and coarse (22 tags) (see http://coptic.pacific.edu/ for documentation)
Interannotator agreement: 94.19% agreement, kappa = 93.67 (considers chance agreement, cf. Artstein & Poesio 2008)
Accuracy:
In domain, 10-fold cross-validation: 94.04% (fine)
Out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse)
Main difficulties: open classes (N/V), disambiguating homonyms (ⲉ can have 6 different tags!)
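To make the difference between raw agreement and kappa concrete, here is a small sketch. It assumes Cohen's kappa for two annotators; the slide does not state which coefficient was used, and the tag sequences are invented. The result is on a 0 to 1 scale.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotators' tag sequences."""
    n = len(tags_a)
    # Observed agreement: share of positions with identical tags.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Expected agreement: chance of a match given each annotator's tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented example data, not project annotations.
annotator_1 = ["N", "V", "N", "PREP"]
annotator_2 = ["N", "V", "V", "PREP"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here raw agreement is 75%, but kappa is lower because some of those matches would be expected by chance.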
Further annotations
Many other layers are done manually:
Translation
Language of origin
Coreference
Entity tagging (people, places...)
Parallel alignment (with Greek)
Syntax trees (very preliminary tests)
Representing data – how to look at all this stuff?
We now have a lot of data to represent:
Diplomatic transcriptions (including character rendering!)
Normalization
Segmentation into words, morphemes, sometimes letters
Annotations
How do we encode this data for search and visualization?
The first challenge: minimal units
Minimal units, or tokens, are critical for searching:
Find all words preceding the word "God"
Give me any mentions of Saint Paphnutius, ±10 words
Search for the glosses father and son within 20 words
Two problems:
The concept of words is complex in Coptic
Annotations overlap parts of words: individual letters, line breaks... tokens are smaller than words!
ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ
he sAid "it's been e
ight years" –
The old man told him
Solution: segmentation layers in ANNIS
We use the open source ANNIS platform as a search interface (Zeldes et al. 2009)
Any annotation layer can be defined as a segmentation defining alternative views on:
Adjacency (in words, morphemes, etc.)
Proximity (in words, morphemes, etc.)
Context size (in words, morphemes, etc.)
But which segmentation layer do you want to see?
Remember, diplomatic and normalized layers don't match
Any segmentation layer is usable as "base text"
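The idea of segmentation layers can be sketched with a toy data model. This is an illustration only and not how ANNIS stores data: the `Span` class, the layer names and the English stand-in tokens are assumptions. Each layer groups the same minimal tokens differently, so adjacency and distance depend on the layer chosen.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # index of the first minimal token covered
    end: int     # index one past the last minimal token covered
    value: str

# Toy data: "bandits" is split into two minimal tokens by a line break.
TOKENS = ["a", "band", "of", "ban", "dits"]
LAYERS = {
    "tok": [Span(i, i + 1, t) for i, t in enumerate(TOKENS)],
    "word": [Span(0, 1, "a"), Span(1, 2, "band"),
             Span(2, 3, "of"), Span(3, 5, "bandits")],
}

def following(layer, value):
    """Values of the units that directly follow `value` in the given layer."""
    spans = LAYERS[layer]
    return [b.value for a in spans if a.value == value
            for b in spans if b.start == a.end]

def distance(layer, first, second):
    """Number of units in `layer` from `first` to `second` (1 = adjacent)."""
    values = [s.value for s in LAYERS[layer]]
    return values.index(second) - values.index(first)

print(following("tok", "of"), following("word", "of"))
```

The unit after "of" is "ban" when counting minimal tokens but "bandits" when counting words, which is the kind of difference a user-selected segmentation is meant to resolve.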
Switching segmentations in ANNIS
Different contexts
Example search: entity="person"
Hit: Abba Antonius
Some options:
±5 words, diplomatic: (less than -5 found, since start of text) Ⲁⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉⲟⲩⲛⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇ ⲙ̇ⲙⲟⲕ
±10 morphs, normalized: ⲁ ⲩ ϭⲱⲗⲡ ⲉⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁⲛⲧⲱⲛⲓⲟⲥ ϩⲓ ⲡ ϫⲁⲓⲉ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ ⲉ ϥ ⲉⲓⲛⲉ ⲙⲙⲟ ⲕ
±5 tokens: Ⲁ ⲩ ϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ ϫⲁⲓ̇ⲉ̇ · ϫⲉ
Ⲁⲩϭⲱⲗⲡ̇ 5 ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇
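The context options above amount to taking a window of ±n units around the hit in whichever segmentation is selected. A minimal sketch, assuming each layer is simply a list of units and the hit is given by its first and last index (the function name and index convention are made up for the example):

```python
# Normalized units from the example above, as word forms and as morphs.
WORDS = ["ⲁⲩϭⲱⲗⲡ", "ⲉⲃⲟⲗ", "ⲛⲁⲡⲁ", "ⲁⲛⲧⲱⲛⲓⲟⲥ", "ϩⲓⲡϫⲁⲓⲉ"]
MORPHS = ["ⲁ", "ⲩ", "ϭⲱⲗⲡ", "ⲉⲃⲟⲗ", "ⲛ", "ⲁⲡⲁ", "ⲁⲛⲧⲱⲛⲓⲟⲥ", "ϩⲓ", "ⲡ", "ϫⲁⲓⲉ"]

def context(units, first, last, n):
    """Return the hit units[first..last] plus up to n units on either side."""
    start = max(0, first - n)          # clip at the start of the text
    return units[start:last + 1 + n]   # slicing clips at the end of the text

# Hit covers two units in either layer; the window size is counted per layer.
print(" ".join(context(WORDS, 2, 3, 5)))
print(" ".join(context(MORPHS, 5, 6, 2)))
```

With ±5 words the window is cut short on the left because the hit sits near the start of the text, as noted in the first option above.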
Searching with AQL (see http://www.sfb632.uni-potsdam.de/annis/)
Basic principle of ANNIS Query Language (AQL):
search for some annotations (#1, #2, #3...)
stipulate relationships between them (operators)
Example: verbs of Greek origin
pos="V" & source_lang="Greek" & #1 _=_ #2
Example hits: "The head bandit repented", "I have faith in God"
_=_ is the identical coverage operator
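What the query asks for can be spelled out in a few lines of Python. This is a sketch of the semantics only, not of how ANNIS evaluates AQL; the `Anno` tuple and the sample annotations are invented for the example.

```python
from typing import NamedTuple

class Anno(NamedTuple):
    name: str    # annotation layer, e.g. "pos"
    value: str   # annotation value, e.g. "V"
    start: int   # first token covered
    end: int     # one past the last token covered

# Invented sample annotations over some token positions.
ANNOS = [
    Anno("pos", "V", 3, 4), Anno("source_lang", "Greek", 3, 4),
    Anno("pos", "N", 5, 6), Anno("source_lang", "Greek", 5, 6),
    Anno("pos", "V", 8, 9),
]

def same_span(annos, first, second):
    """Pairs (#1, #2) matching the two conditions and covering identical tokens."""
    return [(a, b)
            for a in annos if (a.name, a.value) == first
            for b in annos if (b.name, b.value) == second
            and (a.start, a.end) == (b.start, b.end)]

hits = same_span(ANNOS, ("pos", "V"), ("source_lang", "Greek"))
print(hits)
```

Only the verb at position 3 is returned: the noun at position 5 is Greek but not a verb, and the verb at position 8 has no Greek-origin annotation.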
Referencing segmentations
There are many operators
. (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...
> (dominance), -> (pointing relation), >@l (left child)...
...
Possible to use segmentations in queries:
#1 . #2 - one followed by two
#1 .word #2 - two is the next word after one
#1 .norm,1,10 #2 - within 1 to 10 norm units
...
Adding metadata
Metadata is like any other constraint, with meta:: prefix
Can use regular expressions and negation
pos!="V" & source_lang="Greek" & #1 _=_ #2 & meta::msName=/MONB.*/
For metadata names and values we use TEI/EpiDoc as a guideline
More information on AQL: http://www.sfb632.uni-potsdam.de/annis/
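A metadata constraint can be thought of as a filter on documents before any annotations are matched. The sketch below assumes that the regular expression has to match the whole metadata value; the document names and msName values are made up.

```python
import re

# Hypothetical documents with TEI-style metadata; values are invented.
DOCS = [
    {"name": "doc_a", "msName": "MONB.XX"},
    {"name": "doc_b", "msName": "OTHER.01"},
]

def filter_by_meta(docs, key, pattern):
    """Keep documents whose metadata value for `key` fully matches `pattern`."""
    rx = re.compile(pattern)
    return [d for d in docs if rx.fullmatch(d.get(key, ""))]

print([d["name"] for d in filter_by_meta(DOCS, "msName", r"MONB.*")])
```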
Architecture and formats
Different formats are suitable for different parts of the data
TEI ideal for manuscript structure, metadata
Linguistic formats for computational corpus linguistics: tagging, parsing, coreference
Convert and merge data using SaltNPepper (Zipser & Romary 2010)
SaltNPepper (Zipser & Romary 2010)
Metamodel Salt for multiformat conversion
Work on extending TEI support: 2014-15
Salt as internal representation in ANNIS
How can we view the data?
Even if we can query everything at once:
people who are indirect objects of the verb "show" aligned with Greek neuters...
Can we also look at everything at once?
Excerpt from a Salt graph view of two words:
Breaking it down
Different annotations require different visualizations
Two conflicting requirements:
Ideal representation for each layer (syntax -> trees)
Stay generic and minimize the number of visualizations
How can we avoid programming new visualizations with each new annotation layer?
Generic versus dedicated
For some purposes, dedicated visualizations cannot be avoided
Special interactive functionality
Special layout algorithms
For other purposes, we can reuse visualizations by making them flexible and configurable
Need to take segmentations into account
Some dedicated examples
Syntax trees
Coreference view (interactive)
Taking segmentations into account
Visualizations must be configurable to be aware of different base texts
Syntax tree is based on normalized "word"-internal morphs
Sometimes one syntactic unit has multiple tokens
Example (figure): the tree covers "came upon a band of bandits and found them drinking. [...]", while the diplomatic line break at line 15 splits "ban|dits" into two tokens
Reusing dedicated visualizers?
In some cases, some creative uses can be found for existing visualizations
Using the coreference visualizer for parallel alignment:
apophthegmata patrum
Generic visualizations
Two main generic visualizers:
Annotation grid:
just mark borders of annotations
good for flat information
HTML visualizer:
generates HTML elements based on annotations
defined using two simple stylesheets
can look like (almost) anything
Multiple grids
All annotations in one grid can lead to visual overload
Often better to separate groups of annotations:
The HTML visualizer
norm.config:
p p
word span; style="word"
norm span; style="norm" value
trans t:title; style="trans" value

norm.css:
div.htmlvis {
  font-family: Antinoou, sans-serif; width: 500px; white-space: normal !important;
}
.trans:hover{color: red}
.word:after{content: " ";}
Any specific visualization is configured by two style sheets: a config file and a CSS file
Result
<p>
  <t class="trans" title="Abraham our father wished to have children with Sarah.">
    <span class="word"> <span class="norm"> ⲁⲃⲣⲁϩⲁⲙ </span> </span>
    <span class="word"> <span class="norm"> ⲡⲉⲛ </span> <span class="norm"> ⲉⲓⲱⲧ </span> </span>
  </t>
  ...
</p>
Abraham our Father
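How a config file of this kind can drive HTML generation is sketched below. This is loosely modeled on the norm.config rules shown earlier and is not the ANNIS implementation: the rule parser, the nested-tuple input format and the reading of `style` as a class name are assumptions made for the illustration.

```python
import re
from html import escape

CONFIG = '''
p p
word span; style="word"
norm span; style="norm" value
trans t:title; style="trans" value
'''

# annotation name, element[:attribute], optional style, optional "value" flag
RULE = re.compile(r'^(\S+)\s+(\w+)(?::(\w+))?(?:;\s*style="([^"]*)")?(?:\s+(value))?$')

def parse_config(text):
    """Map each annotation name to (element, attribute, css class, show value)."""
    rules = {}
    for line in text.strip().splitlines():
        name, elem, attr, style, value = RULE.match(line.strip()).groups()
        rules[name] = (elem, attr, style, value is not None)
    return rules

def render(node, rules):
    """Render a nested (annotation name, value, children) tuple as HTML."""
    name, value, children = node
    elem, attr, style, use_value = rules[name]
    attrs = f' class="{escape(style)}"' if style else ""
    if attr:  # the annotation value goes into the named attribute
        attrs += f' {attr}="{escape(value)}"'
    text = escape(value) if use_value and not attr else ""
    inner = text + "".join(render(child, rules) for child in children)
    return f"<{elem}{attrs}>{inner}</{elem}>"

TREE = ("p", "", [
    ("trans", "Abraham our father wished to have children with Sarah.", [
        ("word", "", [("norm", "ⲁⲃⲣⲁϩⲁⲙ", [])]),
        ("word", "", [("norm", "ⲡⲉⲛ", []), ("norm", "ⲉⲓⲱⲧ", [])]),
    ]),
])

print(render(TREE, parse_config(CONFIG)))
```

Because the rules are data, a different view of the same annotations only needs a different config and stylesheet, not new rendering code.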
Reusing the HTML visualizer
dipl.config
tok span value
lb div; style="line"
pb table:title; style="pb" value
pb tr
cb td; style="cb"
hi_rend hi_rend:rend value
Visualizing TEI @rend attributes
dipl.css
div.line{display: block; height: 22px; counter-increment: linecount;}
div.line:nth-of-type(5n):before{ content: counter(linecount)" "}
...
.pb{border-style:solid;}
.cb{counter-reset: linecount 0; width: 160px; min-width: 160px}
...
hi_rend[rend*=superscript] {vertical-align: super; font-size: 80%}
hi_rend[rend*=red] {color: red}
hi_rend[rend*=tall] {font-size: 120%}
hi_rend[rend*=extralarge] {font-size: 160%}
Aggregate visualizations
Latest version of ANNIS offers basic frequency analysis
Open question: How much more should we build?
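A basic frequency analysis over query hits is little more than counting annotation values. The sketch below shows the idea with invented values; it does not reproduce the ANNIS feature itself.

```python
from collections import Counter

# Made-up annotation values standing in for the hits of a query.
hits = ["Egyptian", "Greek", "Egyptian", "Egyptian", "Greek"]

def frequency_table(values):
    """Rows of (value, count, share of all hits), most frequent first."""
    total = len(values)
    return [(v, n, n / total) for v, n in Counter(values).most_common()]

for value, count, share in frequency_table(hits):
    print(f"{value}\t{count}\t{share:.0%}")
```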
Aggregate visualizations
Other visualizations are currently done e.g. in R:
[Plot: vocabulary of 11 apophthegmata patrum vs. Gospel of Mark 1, with regions labelled "Egyptian vocabulary" and "Greek vocabulary". Glossed items include: old man, said, you.SG.M, Abba, eat, wine, I/me, synagogue, impure, baptism, John, Jesus, Holy Ghost, Gospel]
Conclusion
Annotation projects should not be limited by corpus architectures:
annotate whatever you want, however often you want
link anything to anything
Why annotate all of these things in the corpus? (and not just in a separate spreadsheet)
Plots of just the verbs? Proper names? -> POS tagging
Highlight, search and link place-names? -> Entity tagging
Collapse inflected variants? -> Lemmatization
Collapse prominent referents? -> Coreference annotation
Dispersion of any of the above, alignment ... and much more
Conclusion
Anything can be made queryable with more layers:
typical constructions and objects of verbs?
Greek vs. native verbs -> add language of origin layer
Translation behavior -> add alignment layer
...
Fitting visualization facilities
should be easy to re-use
optimized to the task, display relevant portions of information
for many purposes, they must be sensitive to segmentations
Outlook
This March: BMBF funded young researcher group on eHumanities at HU Berlin
KOMeT: KOrpuslinguistische Methoden für ePhilologie mit TEI
Focus on marrying TEI resources with computational linguistics methods and formats
Developing NLP tools, search and visualization for ancient world textual resources
Pilot phase (2014, approved): Coptic
Main phase (2015-2019, pending): Other languages as well
Currently looking for a student assistant (60h/month)
Stay tuned for more!
Ⲙⲓⲱⲧⲛ ⲧⲱⲛⲟⲩ! well-being+your.PL greatly => Thanks!
References
Artstein, Ron & Massimo Poesio (2008), Inter-Coder Agreement for Computational Linguistics. Computational Linguistics 34(4), 556–596.
Layton, Bentley (2004), A Coptic Grammar. Second Edition, Revised and Expanded. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.
Schmid, Helmut (1994), Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of the Conference on New Methods in Language Processing. Manchester, UK, 44–49. Available at: http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf.
Zeldes, Amir, Julia Ritz, Anke Lüdeling & Christian Chiarcos (2009), ANNIS: A Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus Linguistics 2009. Liverpool, UK.
Zipser, Florian & Laurent Romary (2010), A Model Oriented Approach to the Mapping of Annotation Formats using Standards. In: Proceedings of the Workshop on Language Resource and Language Technology Standards, LREC-2010. Valletta, Malta, 7–18.
Links
Coptic SCRIPTORIUM: http://coptic.pacific.edu/
ANNIS: http://www.sfb632.uni-potsdam.de/annis/
Search engine for our corpora: https://korpling.german.hu-berlin.de/annis3/scriptorium
Papyri.info: http://papyri.info/
CMCL: http://cmcl.let.uniroma1.it/