
Efficient Stemmatology: a Graph Database Application in the Digital Humanities

Tara Andrews1, Ido Gershoni2, Ramona Imhof2, Sascha Kaufmann1, Jakob Schaerer2, Thomas Studer3, and Severin Zumbrunn2

1 Digital Humanities, University of Bern, [email protected]

2 University of Bern, Switzerland
{ido.gerschoni, ramona.imhof, jakob.schaerer, severin.zumbrunn}@students.unibe.ch

3 Institute of Computer Science, University of Bern, Switzerland
[email protected]

Abstract. Stemmatology is a branch of textual scholarship that is concerned with constructing the family tree, or stemma, of manuscript copies of a given text. There are several rival methods and algorithms for stemma construction, all of which rely on different forms of simplification of the textual data. In this paper we propose a backend for stemmatology applications that stores textual data in a graph database. Our evaluation shows that this backend scales very well compared to a previous attempt to store the data in a relational database, and supports efficient access to large data sets and transformation of that data into the different formats required by competing algorithms.

Keywords: Digital humanities, graph database, Neo4j, performance, stemmatology

1 Introduction

Stemmatology is a branch of textual studies that is concerned with the reconstruction of the transmission history of a given text. In the case of ancient and medieval texts, this normally means creating a hypothesis about the order in which the surviving texts were copied, and what texts must have once existed in order to explain the creation of all current copies. In this form the problem of stemmatology has a large overlap, both in concept and in method, with similar problems in evolutionary biology and historical linguistics [9].

Stemmatologists speak of ‘witnesses’, which are the existing copies of the texts; the ‘archetype’, which is the ancestor of all known witnesses; and the ‘original’, which may be the archetype or may be a text from which the archetype was ultimately copied. There are essentially two rival methods for the construction of text stemmata on the basis of the textual content of witnesses. The first, known as the ‘Lachmannian method’, has at its core the premise that shared error indicates shared origin. That is, when two witnesses share a reading that was not in the text of the archetype (an ‘error’), and which could not reasonably have arisen due to coincidence (the error is ‘significant’), then the likely explanation is that they have a shared ancestor which contained the same reading (or that one is the ancestor of the other). These coincidences of error, selected judiciously, allow the textual scholar to produce a ‘stemma’, that is, an evolutionary tree of witnesses with the archetype at the root.

There are two main drawbacks to the Lachmannian method of stemmatology. The first is that it requires the textual scholar to distinguish accurately between readings that would have been in the archetype and those that were not, as well as to judge whether a given ‘error’ (non-archetypal reading) was likely to have been made independently by two scribes working from two unrelated exemplars (parents). The second drawback is that Lachmannian stemma construction assumes that each manuscript text has one and only one exemplar. When a manuscript is copied with reference to two texts, this is known as ‘contamination’, and its occurrence prevents further analysis with this method. These problems were acknowledged as early as the 1920s [4], and attempts were made to develop a method of stemma construction that did not rely so heavily on a priori judgment of textual scholars as to what constitutes an error [6, 7].

Development of alternative methods for stemmatology accelerated with the appearance of computational methods for evolutionary phylogenetics, and text phylogenetics is now a thriving sub-field within stemmatology [8, 12, 13]. Despite the wide variety of phylogenetic methods that can be applied to textual data, and the demonstrated utility of many of them in reconstructing the history of a textual tradition, phylogenetic methods remain controversial. This is largely due to the tendency of textual scholars to refrain from attempting to distinguish significant error, which allows the possibility that a textual similarity in two manuscripts that actually arises from the archetype might be mistaken for evidence that the two manuscripts are closely related. In addition, most phylogenetic algorithms retain a feature of a model more suited to evolutionary biology, in that they preserve an assumption that any extant ‘organism’ cannot be the ancestor of another. The controversy over the application of stemmatic method has led to the development of counter-alternative algorithms, normally classed as ‘neo-Lachmannian’, that seek to provide a means for identification of ‘significant errors’ on which a satisfactory stemma can be based [5, 11].

This multiplicity of methods, and the debate over the validity of their application to the stemma problem, was the impetus for the ‘Tree of Texts’ project that ran from 2010 to 2012 at the KU Leuven. The goal of the project was to develop computational models of text variation and text transmission that are robust enough to be used in empirical evaluations of the different methods [2, 3]. One result of the project was the collection of software tools known as Stemmaweb, now publicly available and used by a number of textual scholars to examine their data.

The Stemmaweb tools work on the basis of two underlying data models, one to represent the variation that appears in separate copies of a given text (otherwise known as the collation) and another to represent a hypothesis concerning the history of the transmission of that text (a.k.a. a stemma). Both of these are treated as graphs. Each stemma is a connected rooted directed acyclic graph (CRDAG), but need not be a tree; this feature allows almost all sorts of stemmas, including those produced by Lachmannian or phylogenetic methods as well as those produced manually on the basis of external evidence, to be represented. The collation is also initially a CRDAG, with a root node at the beginning of the text and a sink node at the end. The sequence of readings (words, roughly speaking) in a text is represented as a sequence of nodes, and each witness takes a single non-looping path through these nodes from beginning to end. Alternative readings may also have undirected ‘relationship’ edges between them that indicate correspondences (e.g. spelling or grammatical variation of the same root word, synonyms, etc.).
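The CRDAG property can be checked mechanically. The following is a minimal sketch in plain Java, assuming the stemma is given as a map from each node to its children; the class name `StemmaCheck` and the method `isCrdag` are hypothetical and are not part of Stemmaweb.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StemmaCheck {
    /** True if the graph has exactly one root, no directed cycle, and every node is reachable from the root. */
    static boolean isCrdag(Map<String, List<String>> children) {
        // Count incoming edges for every node that appears as a key or as a child.
        Map<String, Integer> inDegree = new HashMap<>();
        for (var e : children.entrySet()) {
            inDegree.putIfAbsent(e.getKey(), 0);
            for (String c : e.getValue()) inDegree.merge(c, 1, Integer::sum);
        }
        List<String> roots = new ArrayList<>();
        for (var e : inDegree.entrySet()) {
            if (e.getValue() == 0) roots.add(e.getKey());
        }
        if (roots.size() != 1) return false;

        // Kahn's algorithm: repeatedly remove nodes that have no remaining incoming edge.
        Deque<String> queue = new ArrayDeque<>(roots);
        int removed = 0;
        while (!queue.isEmpty()) {
            String n = queue.poll();
            removed++;
            for (String c : children.getOrDefault(n, List.of())) {
                if (inDegree.merge(c, -1, Integer::sum) == 0) queue.add(c);
            }
        }
        // Nodes on a cycle are never removed, so a shortfall signals a cycle.
        return removed == inDegree.size();
    }
}
```

A stemma in which a witness has two parents (contamination) passes this check, since it is still acyclic and has a single root, whereas a tree-only check would reject it.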

The representation of the text as a graph makes certain sorts of queries straightforward and economical. For instance, to retrieve the text of a particular witness, the collation is traversed from start to end along the path that bears the given witness label. Since the graph is a CRDAG it can be ranked, and the ranks can be used to produce a character matrix suitable for phylogenetic algorithms. By finding the points of divergence and re-convergence within the graph we can produce a list of variants within the text.
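These two queries can be illustrated on a toy in-memory collation. The sketch below is an assumption-laden stand-in, not the Stemmaweb data model: the class `CollationSketch`, its methods, and the choice of "longest path from the start node" as the rank are all hypothetical, and every reading must be registered with `reading` before it is used in `sequence`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CollationSketch {
    record Edge(String to, Set<String> witnesses) {}

    final Map<String, String> text = new HashMap<>();           // reading id -> word
    final Map<String, List<Edge>> out = new LinkedHashMap<>();  // reading id -> outgoing sequence edges

    void reading(String id, String word) {
        text.put(id, word);
        out.putIfAbsent(id, new ArrayList<>());
    }

    void sequence(String from, String to, String... sigla) {
        out.get(from).add(new Edge(to, Set.of(sigla)));
    }

    /** Follows the path labelled with the given witness sigil from start to end. */
    List<String> witnessText(String sigil, String start, String end) {
        List<String> words = new ArrayList<>();
        String current = start;
        while (!current.equals(end)) {
            String here = current;
            current = out.get(here).stream()
                    .filter(e -> e.witnesses().contains(sigil))
                    .map(Edge::to)
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException("no path for " + sigil + " at " + here));
            if (!current.equals(end)) words.add(text.get(current));
        }
        return words;
    }

    /** Rank of a reading = length of the longest path from start (well defined on an acyclic graph). */
    Map<String, Integer> ranks(String start) {
        Map<String, Integer> inDegree = new HashMap<>();
        out.values().forEach(es -> es.forEach(e -> inDegree.merge(e.to(), 1, Integer::sum)));
        Map<String, Integer> rank = new HashMap<>(Map.of(start, 0));
        Deque<String> ready = new ArrayDeque<>(List.of(start));
        while (!ready.isEmpty()) {
            String n = ready.poll();
            for (Edge e : out.get(n)) {
                rank.merge(e.to(), rank.get(n) + 1, Integer::max);
                // A node is processed only after all of its predecessors have been.
                if (inDegree.merge(e.to(), -1, Integer::sum) == 0) ready.add(e.to());
            }
        }
        return rank;
    }
}
```

Readings that share a rank occupy the same column of the character matrix, so two alternative readings at one position of the text end up aligned even though they are distinct nodes.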

A list of textual variants extracted from a collation model can be easily cross-correlated with a stemma hypothesis of its witnesses. Each variant at a particular location in the text can be assigned a color, and each witness in the stemma is colored according to the particular variant it carries. The stemma is considered to be an adequate explanation of the distribution of variants (that is, the variation is considered text-genealogical) if each variant color originates in only one place on the graph. Figure 1 gives an example of genealogical and non-genealogical variation; in each case, the node represents a manuscript and its color represents the precise reading it contains.
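The "originates in only one place" criterion can be sketched as a small check. This is a simplification under a strong assumption, namely that the reading of every node in the stemma is known, including hypothetical ancestors; the class `GenealogyCheck` and its method are hypothetical names and do not reproduce the Stemmaweb analysis itself. A node counts as an origin of its reading when none of its parents carries the same reading.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GenealogyCheck {
    /**
     * @param parents   witness -> its parents in the stemma (absent or empty for the root)
     * @param readingOf witness -> the reading ("color") it carries at one location
     * @return true if every reading originates at no more than one node
     */
    static boolean isGenealogical(Map<String, List<String>> parents, Map<String, String> readingOf) {
        Map<String, Integer> origins = new HashMap<>();
        for (var e : readingOf.entrySet()) {
            String witness = e.getKey();
            String reading = e.getValue();
            // The reading is inherited if at least one parent carries it too.
            boolean inherited = parents.getOrDefault(witness, List.of()).stream()
                    .anyMatch(p -> reading.equals(readingOf.get(p)));
            if (!inherited) origins.merge(reading, 1, Integer::sum);
        }
        return origins.values().stream().allMatch(n -> n <= 1);
    }
}
```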

2 Design Overview

The first implementation of the Stemmaweb tool, carried out in 2011, focused on functionality in preference to performance. In particular, it used a Perl object storage library (KiokuDB) backed by a traditional relational database [14] to persistently store the text traditions. The expertise of the primary programmer led to the choice of Perl as development language and the ‘Graph’ module for graph traversal operations. The object storage solution was chosen in order to avoid the need to deconstruct and resurrect the graph models from a relational database architecture. However, these choices led to an unnecessarily close coupling of the underlying text model with the functionality of the front-end Stemmaweb tools, and also to very slow response times due to the complexity of the different connections between elements in different text traditions.

StemmaRest is a new backend for Stemmaweb that uses the graph database Neo4j for the persistence layer [10, 15]. It provides a RESTful web service to access and analyze the traditions. Thus it is easy to connect the existing graphical user interface to the new backend.

Fig. 1: Variant distribution on a stemma. (a) Genealogical variant (b) Non-genealogical variant

StemmaRest is developed in Java and uses the Jersey framework to implement the web services. Figure 2 shows the class model of StemmaRest. In the Jersey framework all provided resources have to be registered, which means that they have to implement the IResource interface. All IResource objects that access the database use the class GraphDatabaseServiceProvider. This class provides a static GraphDatabaseServiceObject, which can be used by IResource to access the database.

The classes named ...Model are data classes that contain the data models. They are also used to serialize and deserialize XML and JSON strings. The parser classes are used to import GraphML and dot files into the database.

When a client sends a request to the StemmaRest service, the following sequence of actions is performed (see Figure 3):

1. The Jersey framework instantiates the requested IResource object.

2. This object creates an instance of the class GraphDatabaseServiceProvider, requesting the singleton database instance.

3. The IResource object can then access the database; the transaction is executed and a response is sent to the client.
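The three steps can be sketched without Jersey or Neo4j. In the following illustration every class is a hypothetical stand-in (`FakeDatabase` for the Neo4j handle, `ProviderSketch` for GraphDatabaseServiceProvider, `ReadingResourceSketch` for an IResource implementation); it only shows how short-lived resource objects can share a single database instance and is not the StemmaRest code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Stand-in for the database handle; in StemmaRest this role is played by Neo4j. */
final class FakeDatabase {
    final Map<String, String> readings = new ConcurrentHashMap<>();
}

/** Every provider instance hands out the same, lazily created database. */
final class ProviderSketch {
    private static FakeDatabase db;   // shared by all provider instances

    ProviderSketch() {
        synchronized (ProviderSketch.class) {
            if (db == null) db = new FakeDatabase();
        }
    }

    FakeDatabase getDatabase() {
        return db;
    }
}

/** Step 1: the framework would create one such object per request. */
final class ReadingResourceSketch {
    String get(String id) {
        FakeDatabase db = new ProviderSketch().getDatabase();   // step 2: obtain the singleton
        return db.readings.getOrDefault(id, "not found");        // step 3: query and respond
    }
}
```

Because the provider object itself is cheap and holds no state of its own, creating one per request costs little, while the expensive database handle is opened only once.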

3 Database

As mentioned before, StemmaRest’s persistence layer is based on the graph database Neo4j. In stemmatology applications, executing queries often amounts to finding objects in a list that are connected to certain other objects. In a traditional relational database we had to employ multiple joins to deal with these connections, which made computing the final result very inefficient. Using Neo4j, this task becomes much easier since the graph database natively supports graph traversal using either breadth- or depth-first algorithms for given relations between nodes.

Fig. 2: Class-Overview

Fig. 3: Request Sequence

The StemmaRest database is essentially one big graph with different labels marking different nodes and relationships. Table 1 shows the labels that are used in the database. Figure 4 shows the structure of the database.

Nodes        Relationships
ROOT         OWNS TRADITION
STEMMA       HAS STEMMA
WITNESS      HAS WITNESS
TRADITION    OWNS TRADITION
SECTION      PART
READING      COLLATION
USER         SEQUENCE
             RELATIONSHIP
             HAS END

Table 1: Labels used in the database

Since Neo4j stores each label in a separate file, searching and traversing the graph along a given relationship (or within a given type of nodes) is highly efficient.

Neo4j uses a query language called Cypher. Cypher is a declarative graph query language that allows for expressive and efficient querying and updating of the graph store. Cypher queries, though, need to be interpreted and translated into an execution plan. This is the reason why they are not always as fast as the native Java traversal API, which is, therefore, the common query tool we use in StemmaRest.

The code snippet in Figure 5 (from the class Witness) shows a query implementation using the traversal API. This code finds a requested witness in the database and returns it as a string according to given start and end readings.

4 Evaluation

The main goal of StemmaRest was to create a scalable store for stemmatology data that makes it possible to work with large data sets. When discussing the performance of our approach, we are only interested in data complexity [1]. Since our backend provides a fixed set of services, query complexity is not an issue here.

The performance of StemmaRest may be measured with respect to two different parameters: one is, as usual, the size of the whole database; the other is the size of the working tradition. We will discuss these two approaches separately.

Fig. 4: Database structure

    // Collect the witness text between startRank and endRank, skipping lacunae.
    for (Node node : traverseReadings(startNode, layer)) {
        long nodeRank = Long.parseLong(node.getProperty("rank").toString());
        if (nodeRank >= startRank && nodeRank <= endRank
                && !booleanValue(node, "is_lacuna")) {
            // Add a separating space unless a join flag suppresses it.
            if (!joinPrior && !booleanValue(node, "join_next") && !witnessAsText.equals(""))
                witnessAsText += " ";
            witnessAsText += node.getProperty("text").toString();
            joinPrior = booleanValue(node, "join_prior");
        }
    }

where the traversal is defined as:

    // Depth-first walk along outgoing SEQUENCE edges, filtered by the evaluator e.
    db.traversalDescription().depthFirst()
        .relationships(ERelations.SEQUENCE, Direction.OUTGOING)
        .evaluator(e)
        .uniqueness(Uniqueness.RELATIONSHIP_PATH)
        .traverse(startNode)
        .nodes()
        .forEach(x -> {
            // The end node is a marker, not a reading, so it is left out.
            if (!booleanValue(x, "is_end")) {result.add(x);}
        });

Fig. 5: Code for finding a witness
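To try the loop of Figure 5 outside a database, the following sketch mirrors its condition on plain Java records instead of Neo4j nodes. The record `Reading`, the class `WitnessTextSketch`, and the method `textBetween` are hypothetical names introduced for illustration; the join flags are carried over unchanged from the figure.

```java
import java.util.List;

final class WitnessTextSketch {
    /** One reading on a witness path, with the properties the figure reads from the node. */
    record Reading(long rank, String text, boolean isLacuna, boolean joinPrior, boolean joinNext) {}

    /** Concatenates the readings whose rank lies in [startRank, endRank], skipping lacunae. */
    static String textBetween(List<Reading> path, long startRank, long endRank) {
        StringBuilder out = new StringBuilder();
        boolean joinPrior = false;
        for (Reading r : path) {
            if (r.rank() >= startRank && r.rank() <= endRank && !r.isLacuna()) {
                // Same test as in the figure: a space is added unless a join flag is set
                // or nothing has been written yet.
                if (!joinPrior && !r.joinNext() && out.length() > 0) out.append(' ');
                out.append(r.text());
                joinPrior = r.joinPrior();
            }
        }
        return out.toString();
    }
}
```

Using a `StringBuilder` instead of repeated string concatenation keeps the loop linear in the length of the output, which matters for long witnesses.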

4.1 Performance with Respect to the Whole Database

We ran performance tests with data sets of different sizes to evaluate the scalability of StemmaRest. The main finding is that our backend scales very well. Indeed, for many queries the response time does not depend on the overall size of the database; that is, for a given tradition, many algorithms run in constant time with respect to the database size.

Since StemmaRest provides a web service, our performance tests measure the response time of the service, that is, the overall time needed to execute all operations for a certain request. This includes the time to transmit the data over HTTP, the time to execute the internal algorithms and the time to access the database. Since we want to measure the platform’s performance only (and minimize the influence of the network performance), we used the local loopback interface to transmit the data.

For the purpose of the tests, we populated the database with a random graph that contains several valid traditions on which the REST requests can be executed. We performed tests using databases with 1,000 nodes (readings), 100,000 nodes, and 1,000,000 nodes.

The first set of diagrams (Figures 6 to 8) shows the results of the tests with different database sizes. The results show that, given a working tradition, the RESTful service response time does not depend on the size of the whole database in a significant way. The reason for this good behavior is that in a graph database a query can search a sub-graph without filtering the whole database.
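The effect can be reproduced on a toy in-memory graph. The sketch below is not a Neo4j benchmark; it assumes, purely for illustration, that each tradition is a simple chain of readings and that the database is a hash map from node id to successors. The class `SubgraphSketch` and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SubgraphSketch {
    /** Builds `traditions` disjoint chains of `length` nodes each; node ids are consecutive integers. */
    static Map<Integer, List<Integer>> build(int traditions, int length) {
        Map<Integer, List<Integer>> next = new HashMap<>();
        for (int t = 0; t < traditions; t++) {
            for (int i = 0; i < length; i++) {
                int id = t * length + i;
                next.put(id, i + 1 < length ? List.of(id + 1) : List.of());
            }
        }
        return next;
    }

    /** Breadth-first traversal from one start node; returns how many nodes were touched. */
    static int visit(Map<Integer, List<Integer>> next, int start) {
        Set<Integer> seen = new HashSet<>(List.of(start));
        Deque<Integer> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            for (int n : next.get(queue.poll())) {
                if (seen.add(n)) queue.add(n);
            }
        }
        return seen.size();
    }
}
```

Starting from the first node of a 100-node tradition, `visit` touches 100 nodes whether the map holds 1,000 or 100,000 nodes in total, because the traversal only follows edges and never scans the other traditions.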

Fig. 6: Database with 1,000 nodes, working tradition with 100 nodes

Fig. 7: Database with 100,000 nodes, working tradition with 100 nodes

Fig. 8: Database with 1,000,000 nodes, working tradition with 100 nodes

The implementation of StemmaRest uses some search-node-by-id methods (a part of the Neo4j framework), which search the complete database. It is important to note that those queries require O(log n) time (in the worst case), but their cost is lost in the noise of the other operations during the tests. This can be seen in the diagrams for methods like getReading, getNextReadingOfAWitness, etc., which use the search-node-by-id method: their execution time hardly changes even in a very big database. In a much larger database those methods will slow down the REST requests, though it is not to be expected that the database will grow so big that such operations will have any impact.

4.2 Performance with Respect to the Working Tradition

As mentioned before, the response time is almost independent of the database size due to the fact that each tradition can be selected as a sub-graph and the algorithms only have to search that, rather than the whole database. Obviously, the tradition size does affect the speed of the implemented algorithms, because the working subset, which in most cases is the tradition itself, grows with it. Most of the algorithms that work on a tradition are in O(log n). Note that there are a few import and export functions that run in linear time since they need to handle each node and relation.

In Figures 9 and 10 we see that the method getAllReadingsFromATradition does not exhibit logarithmic behavior. In fact, this method parses each reading of the given tradition to a JSON object and returns a list of those objects. Hence it requires linear time. Since we do not expect traditions with more than 10,000 nodes in practice, the utilized linear time methods should still be applicable for online query processing.

Fig. 9: Database with 10,000 nodes, working tradition with 1,000 nodes

Fig. 10: Database with 10,000 nodes, working tradition with 10,000 nodes

5 Conclusion

We have presented StemmaRest, a novel backend for stemmatology applications that stores textual data in the graph database Neo4j. Our implementation supports efficient access to large data sets and transformation of that data into the different formats required by the applications.

Our performance evaluation of StemmaRest shows that

1. given a working tradition, the service response time does not depend on the size of the overall database;

2. most of the algorithms that work on a tradition use logarithmic time in the size of that tradition.

Thus our graph database solution scales very well and is much more efficient than a previous attempt that used a relational database.

References

1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases: The Logical Level. Addison-Wesley (1995)

2. Andrews, T.L.: Analysis of variation significance in artificial traditions using stemmaweb. Digital Scholarship in the Humanities (2014)

3. Andrews, T.L., Macé, C.: Beyond the tree of texts: Building an empirical model of scribal variation through graph analysis of texts and stemmata. LLC 28(4), 504–521 (2013)

4. Bédier, J.: La tradition manuscrite du Lai de l’Ombre. Réflexions sur l’art d’éditer les anciens textes. Romania 54, 161–96, 321–56 (1928)

5. Camps, J.B., Cafiero, F.: Genealogical Variant Locations and Simplified Stemma: A Test Case. In: Andrews, T.L., Macé, C. (eds.) Analysis of Ancient and Medieval Texts and Manuscripts: Digital Approaches, vol. 1, pp. 69–94. Brepols, Turnhout (Dec 2014), http://boris.unibe.ch/60728/

6. Dearing, V.A.: Principles and practice of textual analysis. University of California Press, Berkeley (1974)

7. Greg, W.W.: The calculus of variants: an essay on textual criticism. Clarendon Press, Oxford (1927)

8. Howe, C.J., Connolly, R., Windram, H.F.: Responding to Criticisms of Phylogenetic Methods in Stemmatology. Studies in English Literature 1500-1900 52(1), 51–67 (2012)

9. Platnick, N.I., Cameron, H.D.: Cladistic Methods in Textual, Linguistic, and Phylogenetic Analysis. Systematic Zoology 26(4), 380–385 (1977), http://www.jstor.org/stable/2412794

10. Robinson, I., Webber, J., Eifrem, E.: Graph Databases. O’Reilly (2013)

11. Roelli, P., Bachmann, D.: Towards Generating a Stemma of Complicated Manuscript Traditions: Petrus Alfonsi’s Dialogus. Revue d’histoire des textes n.s. 5, 307–21 (2010)

12. Roos, T., Heikkilä, T.: Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets. Literary and Linguistic Computing 24(4), 417–433 (Dec 2009), http://llc.oxfordjournals.org/content/24/4/417

13. Roos, T., Zou, Y.: Analysis of Textual Variation by Latent Tree Structures. In: IEEE International Conference on Data Mining. Vancouver (2011), http://www.cs.helsinki.fi/u/ttonteri/pub/icdm2011.pdf

14. Studer, T.: Relationale Datenbanken: Von den theoretischen Grundlagen zu Anwendungen mit PostgreSQL. Springer (2016)

15. Vukotic, A., Watt, N.: Neo4j in Action. Manning (2014)
