Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called...

148
Friedrich-Alexander-Universit¨atErlangen-N¨ urnberg Technische Fakult¨ at, Department Informatik ROLAND VALLERY MASTER THESIS TREE-BASED EDIT ANALYSIS OF WIKI ARTICLES Submitted on 11 May 2016 Supervisors: Dipl.-Inf. Hannes Dohrn Prof. Dr. Dirk Riehle, M.B.A. Professur f¨ ur Open-Source-Software Department Informatik, Technische Fakult¨ at Friedrich-Alexander-Universit¨ at Erlangen-N¨ urnberg

Transcript of Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called...

Page 1: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Friedrich-Alexander-Universitat Erlangen-Nurnberg

Technische Fakultat, Department Informatik

ROLAND VALLERY

MASTER THESIS

TREE-BASED EDIT ANALYSIS

OF WIKI ARTICLES

Submitted on 11 May 2016

Supervisors:Dipl.-Inf. Hannes DohrnProf. Dr. Dirk Riehle, M.B.A.Professur fur Open-Source-SoftwareDepartment Informatik, Technische FakultatFriedrich-Alexander-Universitat Erlangen-Nurnberg

Page 2: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Versicherung

Ich versichere, dass ich die Arbeit ohne fremde Hilfe und ohne Benutzung andererals der angegebenen Quellen angefertigt habe und dass die Arbeit in gleicher oderahnlicher Form noch keiner anderen Prufungsbehorde vorgelegen hat und vondieser als Teil einer Prufungsleistung angenommen wurde. Alle Ausfuhrungen,die wortlich oder sinngemaß ubernommen wurden, sind als solche gekennzeichnet.

Erlangen, 11 May 2016

License

This work is licensed under the Creative Commons Attribution 4.0 Internationallicense (CC BY 4.0), see https://creativecommons.org/licenses/by/4.0/

Erlangen, 11 May 2016

i

Page 3: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Abstract

With the aim to improve editing support, this thesis examines frequent patternsin the English Wikipedia’s edit history. These patterns are assumed to resembleedit transformations Wikipedia authors had in mind when changing Wikipediaarticles and might be of interest to facilitate text refactorings.

The transformations are expected to have a complex structure with multiple rela-tions between elements of Wikipedia articles and edit operations, which cannot betrivially modeled. This thesis tackles this problem by encoding the information,which is hidden in the edit history, in graphs, representing edit operations andelements of Wikipedia articles as graph nodes and relations as links between thesenodes. The resulting edit script graphs are mined for frequent subgraphs withthe intention of retrieving interesting frequent patterns in the form of graphs.

As results we list, visualize and analyze the discovered frequent patterns and cre-ate pattern clusters, which resemble real-world text transformations at differentlevels of abstraction.

ii

Page 4: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Contents

1 Introduction 1

2 Research Chapter 22.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 State of the Art in Pattern Mining . . . . . . . . . . . . . 32.2.2 Wikitext Parsing and Edit Pattern Mining . . . . . . . . . 4

2.3 Research Question . . . . . . . . . . . . . . . . . . . . . . . . . . 52.4 Research Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.4.1 The Wiki Object Model (WOM) . . . . . . . . . . . . . . 62.4.2 The HDDiff Edit Script . . . . . . . . . . . . . . . . . . . 72.4.3 Mining of Edit Script Graphs . . . . . . . . . . . . . . . . 102.4.4 Pattern Mining Workflow . . . . . . . . . . . . . . . . . . 112.4.5 Pattern Clustering and Pattern Hierarchy . . . . . . . . . 11

2.5 Pattern Mining Framework . . . . . . . . . . . . . . . . . . . . . . 132.5.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.5.2 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 132.5.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 142.5.4 Edit Script to Graph Transformation Algorithms . . . . . 142.5.5 Conceptual Model . . . . . . . . . . . . . . . . . . . . . . 182.5.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 192.5.7 Evaluation of the Pattern Mining Framework . . . . . . . 19

2.6 Research Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.7 Results Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.7.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.7.2 Exemplary Discovered Patterns . . . . . . . . . . . . . . . 25

2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Appendices 28Appendix A Table of Most Frequent Patterns . . . . . . . . . . . . . 28Appendix B Table of Largest Frequent Patterns . . . . . . . . . . . 29Appendix C Table of Runs . . . . . . . . . . . . . . . . . . . . . . . 32

iii

Page 5: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D Selected Frequent Pattern Graphs . . . . . . . . . . . . 41

References 141

iv

Page 6: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

List of Figures

2.1 Edit script retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 First revision of example article . . . . . . . . . . . . . . . . . . . 62.3 Wikitext of first revi sion . . . . . . . . . . . . . . . . . . . . . . . 72.4 WOM of first revision . . . . . . . . . . . . . . . . . . . . . . . . . 82.5 Second revision of example article . . . . . . . . . . . . . . . . . . 92.6 Wikitext of second revision . . . . . . . . . . . . . . . . . . . . . . 92.7 WOM of second revision . . . . . . . . . . . . . . . . . . . . . . . 102.8 Pattern mining framework process . . . . . . . . . . . . . . . . . . 122.9 Edit script graph of edit operations as nodes algorithm . . . . . . 162.10 Edit script graph of edit operations linking WOM trees algorithm 172.11 Pattern mining entities . . . . . . . . . . . . . . . . . . . . . . . . 202.12 Pattern support and vertex number . . . . . . . . . . . . . . . . . 232.13 Pattern cluster hierarchy . . . . . . . . . . . . . . . . . . . . . . . 242.14 Transclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262.15 Pattern no. 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462.16 Pattern no. 186 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

v

Page 7: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

1 Introduction

Big data, allowing the extraction of new meaning from huge chunks of data, hasgained widespread attention throughout recent years (Dean & Ghemawat, 2008;Abouzeid, Bajda-Pawlikowski, Abadi, Silberschatz, & Rasin, 2009). This thesis’contribution in this context is a structured analysis of the English Wikipedia’sedit history.

In more detail, the main objective of this thesis is

• to convert the English Wikipedia’s edit history into a sequence of edit scriptsand

• to mine this sequence for recurring and interesting patterns of user behavior.

For this purpose this thesis covers the review of relevant literature as well as theactual crawling of the English Wikipedia’s edit history based upon

• an existing tree-based format of wiki-content called Wiki Object Model(WOM) (Dohrn & Riehle, 2011) and

• the existing HD tree diff algorithm, which computes the difference betweentwo documents and is especially suited for WOM based wiki articles (Dohrn& Riehle, 2014).

As resulting artifacts this thesis delivers

• a software that mines the edit script history for patterns and

• a summary of the discovered patterns.

Thereby, this thesis focuses on exploratory research and provides ground workfor a detailed analysis of the found patterns for text refactorings.

1

Page 8: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

2 Research Chapter

Mechanical art’s focus uponcontent—as opposed to form,aura and originality—is pertinentto digital art.

— Melissa Langdon (Langdon,2014)

2.1 Introduction

Since the launch of the English version of Wikipedia in 2001, more than 5 mil-lion articles have been written (“Wikipedia statistics”, n.d.). Editing articlesis tedious work, and it contains many repetitive tasks, such as formatting lists.Better interfaces and automation tools could potentially reduce effort and in-crease productivity (Dohrn & Riehle, 2013). However, this first requires solidunderstanding of typical editing processes in collaborative environments. Onepromising path towards this understanding is to observe and analyze past userbehavior. A valuable source is Wikipedia itself, as besides the actual content,Wikipedia also stores the complete revision history of articles. This means theedit history contains the complete chronicle from a blank page to the current stateof an article. In this huge pile of historical data, a vast amount of interestingknowledge could be hidden, which remains yet to be explored.

The challenge in analyzing these large amounts of material is to extract interestinginformation which is not blurred by the idiosyncrasies of the articles, but whichrepresents recurring patterns that originate when transforming one revision intoanother.

This thesis mines the edit history of the English Wikipedia for frequent patternsof user behavior. It is based on the assumption that the edit history can bemodeled as a graph structure and that frequent common subgraphs in this graph

2

Page 9: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

structure represent recurring transformations the user had in mind on a higherabstraction level. In the course of this thesis we model a complete pattern miningprocess, consisting of

• applying the existing change detection algorithm HDDiff on Wikipedia ar-ticle revision pairs and

• transforming the detected changes into graphs as well as

• mining these graphs for subgraphs and finally

• presenting the discovered frequent patterns

The structure of this thesis is as follows: first, related work and the state of theart in the field of pattern mining are reviewed in chapter 2.2.2. Based on thistheoretical background the approach of using graphs to mine the edit historyis justified in chapter 2.4. Then, a graph mining framework which retrievespatterns in the form of graphs is developed in chapter 2.5 and the actual miningis performed in chapter 2.6. The theses ends with a discussion of the miningresults in chapter 2.7.

2.2 Related Work

2.2.1 State of the Art in Pattern Mining

The main goal of this thesis is to search the edit history of the English Wikipediato extract unknown information about recurring transformations. This task fallsunder the scope of data mining, defined by Han (2005) as “the process of discover-ing interesting patterns and knowledge from large amounts of data”. Han (2005)also describes the knowledge discovery (KDD) process which involves steps likedata preprocessing, data mining, pattern evaluation and knowledge presentation.This thesis aims to model a complete instance of the KDD process for the purposeof mining frequent patterns in the English Wikipedia’s edit history. Therefore, acareful look at existing techniques in the field of data mining is essential to makeintelligent use of approaches which have already proven to be beneficial.

The basic approach to frequent pattern mining is the mining of frequent itemsets(Aggarwal & Han, 2014). A frequent itemset is a set of items which happen tooccur frequently together as a group in a set of transactions which contain severalitems. However, the approach of mining frequent itemsets is only applicable todata in the form of a set and not for more complex data types, which fall un-der the category of dependency-oriented data as named by Aggarwal (2015). Tomine complex data structures like graphs, as this thesis will accomplish, differentmining algorithms have been developed. In particular, the problem of finding

3

Page 10: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Figure 2.1: Edit script retrieval

frequent patterns in graphs can be tackled by algorithms which mine graphs forfrequent subgraphs. Various algorithms for finding frequent subgraphs exist, forexample gSpan (Yan & Han, 2002), GASTON (Nijssen & Kok, 2004) and SUB-DUE (Ketkar, Holder, & Cook, 2005). However, the problem of finding commonsubgraphs is in general NP-hard (Gartner, Flach, & Wrobel, 2003). Therefore,increasing the number of simultaneously mined graphs increases the run-time ofthe algorithms substantially. “ParSeMiS - the Parallel and Sequential MiningSuite” (n.d.) provides implementations of frequent subgraph mining algorithms.

Existing approaches to mine the English Wikipedia’s edit history can be found in(Viegas, Wattenberg, & Dave, 2004), who visualize the activity on Wikipedia andfocus on the procedural and collaborative side of Wikipedia, but do not explicitlysearch for edit transformations users had in mind.

The author is not aware of any existing attempts to mine the English Wikipedia’sediting history by means of frequent subgraph mining, which constitutes theresearch approach of this thesis as laid out in section 2.4.

2.2.2 Wikitext Parsing and Edit Pattern Mining

While the last section gave a brief overview of the general state of the art in pat-tern mining, this section illustrates the context of this thesis and shows concreteexisting techniques, which this thesis builds upon. Figure 2.1 shows how existingbuilding blocks can be used to convert the English Wikipedia edit history intoedit scripts, which forms the basis for the mining task this thesis performs. Thefirst artifacts in this process are the revision texts, which Wikipedia authors havecontributed to an article. These revisions are freely available in a format calledwikitext.

The Sweble parser has been developed to convert wikitext into a tree-based objectmodel called WOM (Dohrn & Riehle, 2011) The WOM allows to analyze thestructure of articles at a higher abstraction level. To detect changes between

4

Page 11: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

articles in the WOM format, the HDDiff comparison algorithm can be applied(Dohrn & Riehle, 2014). The HDDiff algorithm operates on revision pairs andcreates an edit script, that enumerates the differences between the two revisionsin form of a list of operations. If these operations are applied to the WOM ofone revision, the other revision is retrieved. The HDDiff algorithm combinesthe advantages of tree-to-tree based comparison algorithms with the advantagesof purely textual comparison algorithms by splitting text nodes to get a bettermatching between WOM nodes in the old and new revision. Thus, in contrastto common text diff algorithms, the edit scripts produced by HDDiff also featuremove operations. The possible operations in the HDDiff edit script are insert,update, move and delete. In addition to the operations, the HDDiff edit scriptalso refers to the nodes which are affected by the operations.

The edit scripts form the starting point of the pattern mining process performedin this thesis, which will be described in the following sections.

2.3 Research Question

This thesis performs exploratory research by retrieving possible frequent patternsfrom the English Wikipedia’s edit history in a systematic and structured manner.The underlying research question is: What are frequent edit operations performedby authors on Wikipedia?

However, this thesis will not confirm the validity of the found patterns, meaningthat the necessary confirmatory research of analyzing the relevance of the dis-covered frequent patterns is subject to future work. This confirmatory researchcould be performed by actual human beings manually assessing the true meaningof the mechanically found patterns.

2.4 Research Approach

This section describes the approach this thesis follows to retrieve frequent pat-terns of user behavior from the English Wikipedia’s edit history. The reasoningbehind our approach is based on a close look at the elements that change fromone revision of an article to another. Analyzing these changes, multiple and com-plex associations between these elements can be noticed. For example, elements,which are inserted into the article to shape the new revision, can be linked to theelements surrounding the newly inserted elements, making up the context of theinsertion.

5

Page 12: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Indo-European Family

Germanic Languages• English• Dutch• German

Roman Languages• French• Spanish• Basque

Figure 2.2: First revision of example article

In the course of this section we demonstrate how we cope with this complexityand how we strive to retrieve patterns from the English Wikipedia’s edit historyby using graph mining techniques. To this end we first explain the existingWOM format for Wikipedia articles. Then, we continue with a description of theexisting change detection algorithm HDDiff and elucidate why we need to convertthe outcome of the HDDiff algorithm applied on revision pairs into graphs for thepurpose of retrieving interesting editing patterns. We conclude with a descriptionof the methods we use to mine the constructed graphs for patterns.

As a use case, that is supposed to motivate and illustrate our research approach,the restructuring of a list in a made up article about language families is givenin Figure 2.2. Throughout this chapter, the changes to this article are assumedto be exemplary transformations users perform on Wikipedia articles. The textin Figure 2.2 shall form an excerpt of the first version, i.e. the first revision, ofthis article.

2.4.1 The Wiki Object Model (WOM)

Wikipedia stores articles in a wiki markup language, which is used as inputformat by the authors, and which can be converted into HTML by the MediaWikiSoftware behind the English Wikipedia. In this wikitext format, the first revisionof the example article can be formulated as shown in Figure 2.3.

To process the content of Wikipedia articles, a format which represents the ele-ments of an article at a higher abstraction level than plain wikitext is advanta-geous. To this purpose the WOM (Wiki Object Model) format has been devel-oped. The WOM format stores articles in a tree structure and provides means tooperate on the article and its elements. The WOM can be represented in various

6

Page 13: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

== Indo-European Family ==Germanic Languages* English* Dutch* GermanRoman Languages* French* Spanish* Basque

Figure 2.3: Wikitext of first revi sion

ways, for example, as an XML document or as a data structure in a program-ming language, and can be created from wikitext by the Sweble parser (see section2.2.2). Figure 2.4 shows a graphical representation of the WOM of the examplearticle, drawn with adapted existing visualization techniques. The meaning ofthe syntactical elements of the wikitext can be recognized in the WOM, e.g. theconstruct inside the markup ”==” is presented as a heading. The symbol ”\n”stands for a line break. Strings prefixing labels (for example ”mww:”) functionas namespace identifiers. The colors used in the graphic will be explained later.The WOM also contains RTD (Round Trip Data) information, which preservethe original input of the user. For reasons of clarity, RTD elements are omittedin Figure 2.4.

2.4.2 The HDDiff Edit Script

As the research question states, this thesis’ interest does not primarily lie inthe Wikipedia articles themselves, but in the changes between article revisions.Therefore, this thesis applies the existing HDDiff change detection algorithmon pairs of article revisions in WOM format. The HDDiff algorithm matcheselements of two revisions and determines the operations that are necessary toperform on the matched elements in order to convert one revision of the articleinto another. Figures 2.5 and 2.6 show the layout and the wikitext of a newrevision of the example article and Figure 2.7 shows the resulting WOM tree ofthis revision created by the Sweble parser. Identical colors of nodes in Figures2.4 and 2.7 indicate that the respective article elements have been matched byapplying the HDDiff algorithm on these revision pairs. For a match elementsdon’t necessarily have to be identical. Instead, it is sufficient if they are similarand form the best match. Non-leaf nodes are displayed in gray.

The actual result of the HDDiff algorithm is a so-called edit script. As explained

7

Page 14: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

#doc

umen

t

artic

le

body

sect

ion

head

ing

body

"=="

" In

do-E

urop

ean

Fam

ily "

"=="

"\n\

n"p

"\n"

ul"\

n"p

"\n"

ul"\

n"

"Ger

man

ic L

angu

ages

"li

lili

"*"

" "

"Eng

lish"

"\n"

"*"

" D

utch

""\

n""*

""

Ger

man

"

"Rom

an L

angu

ages

"li

lili

"*"

" Fr

ench

""\

n""*

""

Span

ish"

"\n"

"*"

" B

asqu

e"

Figure 2.4: WOM of first revision

8

Page 15: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Indo-European Family

Germanic Languages• English• Dutch• German

Roman Languages• French• Italian

Figure 2.5: Second revision of example article

== Indo-European Family ==Germanic Languages* [[English language—English]]* Dutch* GermanRoman Languages* French* Italian

Figure 2.6: Wikitext of second revision

in section 2.2.2, an HDDiff edit script is a list of edit operations illustrating thechanges from one document revision to another. Each operation in the edit scriptholds references to the nodes in the WOM that are affected by this operation.More concretely, the edit script for the two revision pairs of the example articleis a list containing:

• 7 insert operations, one insert operation per inserted element. The insertedelements are part of an intlink and displayed in white color in Figure 2.7.An intlink is an internal link in a wiki software and is marked by doublesquare brackets in the wikitext.

• 4 delete operations, one delete operation per deleted element. The deletedelements, belonging to the text ”Basque”, are displayed in white color inFigure 2.4.

• 1 move operation, as one element has been moved to a new location, i.e.the text node containing the text ”English” has been made a child of thenewly inserted target element of the intlink element.

• 1 update operation, as the substitution of the text ”Spanish” by the text

9

Page 16: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

#doc

umen

t

artic

le

body

sect

ion

head

ing

body

"=="

" In

do-E

urop

ean

Fam

ily "

"=="

"\n\

n"p

"\n"

ul"\

n"p

"\n"

ul"\

n"

"Ger

man

ic L

angu

ages

"li

lili

"*"

" "

intli

nk"\

n"

"[["

mw

w:ta

rget

title

"]]"

"Eng

lish

lang

uage

""|"

"Eng

lish"

"*"

" D

utch

""\

n""*

""

Ger

man

"

"Rom

an L

angu

ages

"li

li

"*"

" Fr

ench

""\

n""*

""

Italia

n"

Figure 2.7: WOM of second revision

”Italian” has been reported as an update by the HDDiff algorithm.

2.4.3 Mining of Edit Script Graphs

The complex relations between article elements of different revisions are reflectedin the HDDiff edit script, as described in the previous section. However, in orderto apply common frequent pattern mining algorithms, the form of the data to bemined should be appropriate to serve as input for the mining algorithms. The editscript itself cannot be easily mined by conventional pattern mining algorithms,as the structure of the edit script is a plain list that does not give direct clues

10

Page 17: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

about the frequent patterns to search for.

Therefore, we construct representations of the edit script that highlight the com-plex relations between WOM nodes and edit operations in the HDDiff edit script:based on the reasonable assumption that graphs are naturally well suited to modelstructured data with multiple relationships (Aggarwal & Han, 2014), this thesistransforms HDDiff edit scripts into graphs and applies existing graph mining al-gorithms to discover frequent subgraphs in these graphs. The frequent subgraphsare supposed to correspond to operations Wikipedia authors had in mind on ahigher abstraction level.

2.4.4 Pattern Mining Workflow

Figure 2.8 gives an overview of the resulting work flow. We read random revisionpairs from the English Wikipedia’s edit history, call the HDDiff algorithm onthese revision pairs and convert the resulting edit scripts into edit script graphs.Consecutively, the edit script graphs are passed to a frequent subgraph miningalgorithm which determines the frequent subgraphs/patterns. Thus, the input tothe subgraph mining algorithm consists of a set of graphs. Each of those graphsrepresents an edit script, created by applying the HDDiff algorithm on a randomrevision pair.

If a pattern occurs more than once in an edit script, the support count is incre-mented only by one for reasons of simplicity. For the same reason we restrict thefound patterns to connected patterns.

The interesting subgraphs have to be frequent, meaning they must have a min-imum support, i.e. they have to appear in a minimum percentage of edit scriptgraphs.

Moreover, graphs with a larger number of vertices and edges are expected torepresent more complex transformations. As trivial changes are not consideredto be interesting this thesis only mines subgraphs with a minimum number ofnodes.

Reasonable thresholds for the minimum support and the minimum number ofnodes parameters have to be found during the mining process and will be givenin section 2.6.

2.4.5 Pattern Clustering and Pattern Hierarchy

Some of the resulting patterns can be very similar, e.g. differing only in a singlenode. Therefore, and to reduce the complexity of a large number of patterns,

11

Page 18: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Read Random Sample of Revision Pairs from English

Wikipedia DB

Revision Pairs

Apply HDDiff Comparison Algorithm on Revision Pairs

Transform Edit Scripts to Graphs

Edit Scripts

Interesting Patterns

Frequent Common Subgraphs

Edit Script Graphs

Apply Frequent Subgraph Mining on

Edit Script Graphs

Analyze Common Subgraphs

Figure 2.8: Pattern mining framework process

12

Page 19: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

each pattern is mapped to a pattern cluster. Patterns with a similar structure(identified by similar subgraph relations) are assigned to the same cluster. Sub-sequently, the resulting clusters can be analyzed with respect to the subgraph re-lations between patterns to get a pattern cluster hierarchy. This hierarchy shoulddemonstrate, to what extent larger transformations, represented by graphs witha larger number of vertices, consist of smaller transformations.

We propose to calculate the similarity of patterns based on the following distancefunction:

1−number of common common subpatterns and superpatterns of compared patterns

number of all subpatterns and superpatterns of compared patterns

The two patterns to compare are themselves included in the sets of their respectivesubpatterns and superpatterns. Therefore, this function yields a value between0 and 1. The smaller the return value, the more similar the patterns are.

2.5 Pattern Mining Framework

2.5.1 Purpose

In this chapter the pattern mining framework, which is a central contributionof this thesis, will be presented. This framework serves to mine and analyzefrequent patterns in the form of graphs from the English Wikipedia’s edit history.Furthermore, the derived conceptual model of the relevant entities (patterns,edit scripts, transformation algorithms, mining runs, etc. ) constitutes a precisecommon terminology.

2.5.2 Requirements

The requirements for the pattern mining framework can be immediately derivedfrom the research approach illustrated in section 2.4:

1. The framework has to implement the complete pattern mining processshown in Figure 2.8

2. The architecture of the framework should permit iterative improvements.This allows to repeatedly tune the configuration parameters which affectthe pattern mining process.

13

Page 20: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

3. The framework should offer different edit script to graph transformationalgorithms, which allows specifying different characteristics of the patternsto be searched

4. The framework should include a graphical user interface to quickly assessthe resulting patterns.

5. The framework should be smoothly integrated into the existing WOM andHDDiff infrastructure.

6. It would be advantageous if the framework provided support for subgraphrelations among patterns to describe pattern hierarchies.

2.5.3 Architecture

The proposed framework follows a pipeline architecture, propagating data streamsfrom pipeline step to pipeline step. This architecture enables the mining stepsto save their results temporarily to file, thus avoiding constantly starting fromthe very beginning if a step fails to execute or if a step should be rerun with adifferent parameter configuration. The pipeline steps and the data structures,which are passed between the steps, correspond to the constituents of the miningprocess illustrated in Figure 2.8.

As the mining parameters have to be gradually adapted, the mining process isan iterative process, meaning that the steps can also be repeated from the verybeginning, resulting in a loop structure. Another reason for the iterative natureof the process is that in a single mining run the number of input graphs thatcan be mined is rather limited. Due to computability restrictions the miningof an arbitrarily large number of graphs for frequent subgraphs is infeasible,especially because of the fact that the frequent subgraph mining problem is NP-hard (Gartner et al., 2003).

2.5.4 Edit Script to Graph Transformation Algorithms

Before performing the actual mining, we have to convert the data to be mined,i.e. the edit scripts, into a form that is appropriate for the mining algorithms.This step is essential, as practical experience in the field of data mining suggeststhat data preprocessing accounts for a large part of the total effort of the overallmining process (Han, 2005). In our case the revision pairs, being converted intoedit scripts by the HDDiff algorithm, form the basis for the mining process. Aspart of this thesis we propose two algorithms to transform edit scripts returnedby the HDDiff algorithm into graphs, on which the frequent subgraph miningalgorithms can be applied:

14

Page 21: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

• the algorithm OPS AS NODES BETWEEN WOM NODES, in which twoWOM nodes are always connected by an edit operation in-between and

• the algorithm OPS AS LINKS BETWEEN TREES, which links the com-plete WOM tree of the old revision with the complete WOM tree of thenew revision

The output of the transformation algorithms is restricted to the informationpresent in the edit scripts. This means, for example, that the actual order ofthe edit operations the user carried through is not available, since the user is notpermanently tracked while editing a page.

In the following sections we will describe the two developed transformation algo-rithms in detail exemplified by the fictional revision pair introduced in chapter2.4. For both algorithms we only use labeled nodes and avoid labeling edges toreserve edge labels for future usages. Instead, we use labeled nodes functioning asrole nodes between two nodes, thereby mimicking labeled edges between nodes.Furthermore, we only use undirected graphs to keep the structure of the graphsin a general format and support a broad range of mining algorithms.

The first algorithm is based on the idea to directly link each edit operation withthe WOM nodes affected by this operation. Figure 2.9 shows the graph resultingfrom applying the OPS AS NODES BETWEEN WOM NODES algorithm onthe output of the HDDiff algorithm applied to the sample revision pair.

In Figure 2.9 the roles the WOM nodes play in an edit operation are drawn ingreen. These role nodes link the edit operation nodes, drawn in blue, with theWOM element nodes, drawn in red. The edit script graph can be divided intothree connected components, one comprising the nodes involved in the insertionand moving of nodes and the other three components showing the deleted nodes,which are not connected to other nodes in the graph. The node for the intlink isinserted below its parent node - the ”li” node - and serves as parent node for thetarget and title node of the link.

The second algorithm follows the instinct to connect matching nodes from theWOM trees of the compared revisions. Furthermore, the context of the nodes isadded, i.e. the surrounding nodes in the WOM tree. This matching is speciallydesigned to illustrate how move operations modify the WOM by shifting a WOMnode to another destination in a different environment. The resulting edit scriptgraph of the sample revisions can be seen in Figure 2.10. Red nodes indicateWOM nodes of the old revision and black nodes indicate WOM nodes of thenew revision. The edit operations, drawn in blue, connect WOM nodes from thesame or different WOM trees. The context of the nodes is marked by the green”parent” nodes.

In this section we have elucidated the purpose, design and output of the two

15

Page 22: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

INSE

RT In

serte

dNod

ePa

rent

wom

:intli

nkw

om:li

INSE

RT In

serte

dNod

ePa

rent

mw

w:ta

rget

INSE

RT In

serte

dNod

ePa

rent

wom

:text

INSE

RT In

serte

dNod

ePa

rent

wom

:title

MO

VE

Mov

edN

ode

ToPa

rent

wom

:text

UPD

ATE

Upd

ated

Nod

e

wom

:text

DEL

ETE

Del

eted

Nod

e

wom

:text

DEL

ETE

Del

eted

Nod

e

wom

:li

Figure 2.9: Edit script graph of edit operations as nodes algorithm

16

Page 23: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

wom

:li

pare

nt_o

f_in

serte

d_no

depa

rent

wom

:intli

nk

pare

nt_o

f_in

serte

d_no

depa

rent

_of_

inse

rted_

node

mw

w:ta

rget

pare

nt_o

f_in

serte

d_no

de

wom

:text

wom

:title

wom

:text

mov

ed_t

o wom

:text

wom

:title

pare

nt

wom

:text

upda

ted_

to

wom

:text

wom

:li

pare

nt

wom

:li

pare

nt

wom

:text

dele

ted_

from

_par

ent_

node

wom

:li

dele

ted_

from

_par

ent_

node

wom

:ul

Figure 2.10: Edit script graph of edit operations linking WOM trees algorithm

17

Page 24: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

developed edit script to graph transformation algorithms. Both algorithms havein common that they abstract away details of the actual textual content for thepurpose of emphasizing commonalities in different articles. For example, the textstring ”English” does not appear in the edit script graphs and has been general-ized into a wom:text node, which makes it possible for the mining algorithms todetect common structures in multiple graphs without being distracted by specifictext strings.

2.5.5 Conceptual Model

Figure 2.11 shows the relevant classes of the model of the pattern mining frame-work and their relationships.

The loop structure of the mining framework is modeled by the Run entity, whichcontains the input configuration parameters of a single run of the mining loop,i.e. particularly the minimum support and the minimum number of nodes thefrequent subgraphs have to fulfill. Results like the number of found patterns arealso part of the Run class.

A set of edit scripts, which are arguably parts of runs, is the basic unit of inputfor the mining process. The minimum support parameter defines in how manyedit scripts a pattern has to be found to be considered frequent. Edit scripts areuniquely identified by two revision IDs. A graph transformation algorithm,which is selected as a run parameter, converts an edit script into a graph. Thus,edit scripts as well as patterns are represented by graphs, which model complexrelationships between article elements and edit operations.

Patterns, which are the main targets of this thesis, can be found in multipleruns, so that an entity PatternRun is created for each pattern found in a run.A pattern always belongs to a specific transformation algorithm, which is linkedto the pattern via instances of the PatternRun and Run class. Patterns obtainedby the same graph transformation algorithm can be assigned to a cluster withsimilar graphs.

As the same pattern can occur multiple times in the same edit script graph, thesePatternOccurrences are instances of a PatternRun entity.

Subgraph relations among patterns are formed through a recursive relationship.If the graph of a pattern is a subgraph of an edit script graph, we regard thepattern itself as ”occurring in the edit script”, and if the graph of a pattern isa subgraph of the graph of another pattern, we regard the pattern itself as a”subpattern” of the other pattern, which is called the ”superpattern”.

Instances of all of the classes shown in Figure 2.11 - except the graphs belongingto edit scripts - are persisted to a database to keep track of the mining results,

18

Page 25: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

allowing to verify the findings of this thesis. As edit script graphs can comprisea huge number of nodes and therefore claim huge amounts of storage, the editscript graphs are not stored to disk. Instead, the edit script graphs can easily berestored by running the HDDiff algorithm on the revision pairs again and sub-sequently reapplying the graph transformation algorithm on the returned editscripts. The pattern graphs however are stored to disk, as these graphs form themining results and have to be compared and matched to graphs gained from dif-ferent runs. Patterns from different runs can be matched by reusing the frequentsubgraph algorithm used in the mining process: the largest common subgraph oftwo patterns with the same number of nodes is the graph of the pattern itself ifand only if the two patterns are identical.

The Transformation class, representing the actual transformations users hadin mind, is drawn transparently in Figure 2.11 to demonstrate that this class iscurrently not part of the framework. The mapping from patterns to transforma-tions is subject of future work and requires human intervention. Likewise, tagsof a pattern will have to be added.

2.5.6 Implementation

The mining framework is implemented with standard Java EE technologies andincludes a graphical web interface which visualizes the results and offers func-tionalities to the user to name and tag the discovered patterns. To mine the editscript graphs for frequent subgraphs we use the ParSeMiS (Parallel and Sequen-tial Graph Mining Suite) framework (“ParSeMiS - the Parallel and SequentialMining Suite”, n.d.), which “search[es] for interesting substructures” and imple-ments frequent subgraph mining algorithms. The source code of the ParSeMiSproject is available online (“ParSeMiS project”, n.d.).

2.5.7 Evaluation of the Pattern Mining Framework

In this section the developed pattern mining framework is evaluated against therequirements stated in chapter 2.5.2. In total, the developed pattern miningframework comprises the models of the relevant entities as well as the miningprocess, which are the key requirements. Thereby, the framework incorporatesall steps of the KKD process outlined in chapter 2.2.1 for the purpose of miningthe English Wikipedia’s editing history. Furthermore, the pipeline architectureof the pattern mining framework allows to iteratively optimize the configurationparameters as requirement 2 postulated.

The design and implementation of two different edit script to graph transfor-mation algorithms allows to compare and validate their results. The devel-

19

Page 26: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Edit

scri

pt

revi

sio

nA

revi

sio

nB

nu

mb

erO

fOp

erat

ion

s

Run m

inSu

pp

ort

min

Nu

mb

erO

fNo

des

/nu

mb

erO

fEd

itsc

rip

ts/n

um

ber

OfP

atte

rns

Pat

tern

InR

un

sup

po

rt/r

ank

Patt

ern

nam

eta

gsd

fsC

od

e

*

1

Gra

ph

pri

nt

()

1

1

has

pat

tern

gra

ph

grap

hR

epre

sen

tati

on

/nu

mb

erO

fNo

des

<<f

utu

re w

ork

>>

Tran

sfor

mat

ion

rep

rese

nts

<<us

e>>

<<I

nte

rfa

ce>

>

Ed

itsc

rip

tGra

ph

Re

nd

eri

ngA

lgo

rith

m

cre

ateE

dit

scri

ptG

rap

h (

Ed

itsc

rip

t) :

Gra

ph

*1

* is s

ubp

atte

rn o

f

*

1

yie

lds

edit

scri

ptgr

aph

* 1

ren

der

**

Pat

tern

Occ

uren

ce

<<in

sta

nceO

f>>

*

*

occ

urs

in

Patt

ern

Clu

ster

rep

rese

nts

*1

/

Figure 2.11: Pattern mining entities

20

Page 27: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

oped framework includes a graphical user interface and connects existing compo-nents, particularly implementations of the WOM, the HDDiff algorithm and theParSeMiS framework.

All in all the framework fulfills the requirements stated in section 2.5.2. Whetherthe framework is suited to mine frequent graph patterns from Wikipedia articlerevision histories will be evaluated in section 2.7.

2.6 Research Results

The following results are restricted to the graph algorithmOPS AS NODES BETWEEN WOM, as the alternative algorithmOPS AS LINKS BETWEEN TREES did not produce satisfying results, whichwill be discussed in the next section.

The mining process was decomposed into two phases:

1. In the first phase 134.936 edit script graphs were mined for frequent patternsin 323 runs with different mining parameters and 315 ”pattern candidates”were gained in 2.819 edit script graphs. The revision pairs were restrictedto immediately succeeding revisions of random articles. If ParSeMiS suc-cessfully mined the edit scripts and found frequent subgraphs satisfying themining parameters of that run the frequent subgraphs were saved to fileas pattern candidates. The highest number of edit script graphs ParSeMiSwas able to process in a single run was 995. In this phase 257 runs of all323 runs led to an out of memory error in ParSeMiS.

2. In the second phase, called the final summary run, the pattern candidatesfrom phase one were validated. I.e. the pattern candidates were checkedfor occurrences in all of those edit script graphs which in phase one couldbe successfully mined for frequent subgraphs. This step was necessary todetect patterns that were only frequent by chance, i.e. because of the lim-ited number of graphs in a single run. Phase two could be computed bychecking pairs of pattern candidates and edit script graphs for a subgraphrelation instead of checking all subgraph relations in a single run. In thefinal summary run 2.819 edit script graphs and 315 pattern candidateswere checked for occurrences of the given pattern in the given edit script.Of those 887.985 possible combinations of edit script graphs and patterncandidates, 875.047 combinations (about 98%) could be successfully pro-cessed by ParSeMiS while 12.938 combinations led to an error when runningParSeMiS. 62.306 matches, i.e. occurrences of patterns in edit scripts, werefound, not counting multiple occurrences of the same pattern in the sameedit script graph.

21

Page 28: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

The parameters used in the mining runs for the number of edit script graphsto mine, the minimum node count and the minimum support, can be found inAppendix C.

Figure 2.12 shows a scatter plot visualizing the relationship between the supportpercentage of a pattern in the final summary run and its number of vertices.

We checked all combinations of patterns for subgraph relations. Based on thediscovered subgraph relations we generated the pattern hierarchy as proposed insection 2.4.5. The resulting pattern hierarchy is visualized in Figure 2.13. Thearrows in this Figure indicate a part-of-relationship, e.g. at least one pattern incluster 2 is a subpattern of at least one pattern in cluster 12 and so forth.

Detailed results including the largest patterns with a support count of at least 3and the most frequent patterns can be found in the appendix.

2.7 Results Discussion

2.7.1 Evaluation

The first configuration parameter in the mining process, which was the edit scriptto graph transformation algorithm, showed unexpected results. Here, the graphtransformation algorithm OPS AS LINKS BETWEEN TREES, which was de-signed to link the complete WOM tree of the old revision with the completeWOM tree of the new revision did not reveal meaningful insights. The pat-terns gained from this algorithm merely contained nodes of the WOM trees andmostly did not include the edit operations connecting the WOM tree nodes, i.e.the patterns only showed frequent article elements and missed the frequent trans-formations actually sought after. Therefore, this graph transformation algorithmwas not pursued further. Instead, the focus was solely on the transformation algo-rithm OPS AS NODES BETWEEN WOM NODES, in which two WOM nodesare always connected by an edit operation in-between.

An idiosyncrasy of the OPS AS NODES BETWEEN WOM NODES algorithmis that some edit operations occupy more nodes in the edit script graph thanothers. For example, the update operation only refers to a single node (the up-dated node), whereas the insert operation refers to the inserted node as wellas to the parent node of the inserted node, which leads to at least two nodesin the edit script graph instead of one. As a consequence of this idiosyncrasy,patterns with insert operations tend to be larger and are more likely to exceedthe minimum number of nodes threshold. This makes the OPS AS NODES-BETWEEN WOM NODES algorithm favor insert operations over update op-

22

Page 29: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

20 40 60 80 100

05

00

10

00

15

00

20

00

Number of vertices

Su

pp

ort

Figure 2.12: Pattern support and vertex number

23

Page 30: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Figure 2.13: Pattern cluster hierarchy

1

16

18

2

34

56

7

9

1112

14

17

2326

27

30

33

44

46

19

2140

4145

22

8

48

10

4950

13

15

20

24

25

28 29

3132

3435

36

37

3839

4243

47

24

Page 31: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

erations (regardless of this peculiarity, insert operations obviously occur muchmore frequently than all other operation types in the edit scripts).

Regarding the actual frequent subgraph mining step, the results showed that in80% of the runs the mining process ended abruptly with an out of memory error.In consequence, the mining process had to be restricted to a set of 50 graphs ina single run, which is in fact not very high in magnitude. This indicates that theproblem of finding common frequent subgraphs is in general too hard to allowmining a large number of edit script graphs simultaneously.

The patterns shown in the appendix cannot always be easily interpreted. Inthe current state of the pattern mining framework, it is non-trivial to recognizewhich actual transformation in the article refers to a found pattern. This isdue to the fact that currently the nodes of a pattern cannot be exactly matchedto the associated article elements, which would help the user in classifying thepatterns. Instead, the user is left with the pattern graph and a list of revisionpairs and their edit script graphs in which the pattern occurs but often it isunknown which text elements in the article are associated with which nodes inthe pattern. Future versions of the developed graph mining framework mightimprove this shortcoming and keep track of the article elements associated witha node in pattern graphs and edit script graphs.

With respect to the clusters patterns were assigned to, it can be said that theseclusters seem to be a good fit. At a first glance similar patterns appear to beassigned to the same cluster. The automatically generated pattern cluster hier-archy grants a first glimpse at the possible semantic meanings and relationshipsbetween the discovered patterns, indicating in how far larger transformations usesmaller transformations as building blocks. However, a thorough evaluation ofthe proposed clustering remains future work.

2.7.2 Exemplary Discovered Patterns

In this section we present exemplary frequent patterns gained from the miningruns.

By far the most frequent pattern is pattern No. 6 (see Figure 2.15) which reflectsthe insertion of an internal link encompassing the target of the link. It doesnot come as a surprise that in a strongly connected document repository likeWikipedia, links make up the vast majority of discovered patterns (the numberof occurences of pattern No. 6 is 1.934 in 2.819 revision pairs. The secondfrequent pattern, pattern No. 5, has a frequency of 1.302.)

The largest frequent pattern with a support count of above 3 is pattern number186 (Figure 2.16), which stands for an interesting transformation as well: This

25

Page 32: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Figure 2.14: Transclusion

pattern shows the insertion of a transclusion with four attributes and values. Atransclusion is the embedding of another document inside the current documentand is often used for template mechanisms like infoboxes. Figure 2.14 shows anexample infobox for a music band generated by a transclusion with attributesand values for origin, genre, Years active, Labels and Members.

2.8 Conclusion

In this thesis the problem of finding frequent patterns in the English Wikipedia’sedit history was reduced to the problem of searching frequent subgraphs. Also,the relationships among these patterns were analyzed, particularly by exploitingsubgraph relations. Future work can now focus on the visual inspection andmanual or automated analysis of the discovered patterns.

The number of edit scripts successfully searched for frequent patterns - 2.819- was quite small in comparison to the total number of articles on Wikipedia.This is mostly due to the research approach, which was based on graphs insteadof sets. This approach has the advantage of giving the opportunity to modelcomplex relationships, which this thesis exploited heavily by modeling structuredrelationships between article elements and edit operations. The downside of thisapproach however is that the resulting frequent subgraph mining problem is NP-hard, leading to a smaller number of edit scripts being mineable than by usingsets or frequent itemsets.

Possible variants to the approach taken in this thesis are manifold. Future workcould

26

Page 33: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

• use different parameters, for example directed graphs instead of undirectedgraphs

• restrict edit script graphs to trees. That way, more efficient subgraph min-ing algorithms could be applied, for example SLEUTH (Zaki, 2005), whichmight discover rarer and larger common subgraphs as more edit scriptgraphs could be included in a single run.

• use data types completely different from graphs. These types could be setsfor frequent itemset mining or completely unstructured data which could forexample be processed by the commercial IBM ’Watson’ software (Ferrucci,2011).

• analyze the whole edit history of a single article or even between articlesinstead of only considering changes between two revisions of the same article

Future work might also take more semantic aspects into account. Whereas thisthesis focused on purely structural changes (e.g. the indentation of a list), adifferent promising approach might be to shed more light on the semantic contentthat has changed (e.g. why the list was indented).

27

Page 34: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix A: Table of Most Frequent Patterns

Appendix A Table of Most Frequent Patterns

Id VertexCount

Support Cluster Transformation Name

6 12 1934 2 insertion of intlink with target withtext

5 13 1302 2 insertion of intlink with target at-tribute below paragraph

8 12 1206 2 insertion of intlink with target andother item below paragrah

4 16 1161 2 insertion of intlink with target andother item below new paragraph

7 13 1142 2 insertion of intlink with target3 17 1123 2 insertion of intlink with target and

text below new paragraph9 10 823 2 insertion of three items below para-

graph15 11 788 2 insertion of text and two other ele-

ments below paragraph19 11 765 3 insertion of an element and a para-

graph with child element belowbody

16 12 760 3 insertion of text and paragraphwith child element below body

13 12 741 2 insertion of two text nodes and an-other element below paragraph

17 12 640 2 insertion of intlink with title andtarget below parent element

66 12 635 3 insertion of element and paragraphwith text below body

12 16 634 2 insertion of intlink with title andtarget with text below element

18 15 633 2 insertion of link with target and twoother elements below paragraph

26 13 630 3 insertion of text and paragraphwith text below body

11 19 612 2 insertion of intlink with target withtext and text element 2 other ele-ments below paragraph

28

Page 35: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Id VertexCount

Support Cluster Transformation Name

14 16 604 2 insertion of text and intlink withtarget and another element belowparagraph

23 15 592 2 insertion of intlink with title andtarget below element

10 20 591 2 insertion of intlink with target withtext and text element and other el-ement below paragraph

21 19 585 2 insertion of intlink with title andtarget below other element

22 16 580 2 insertion of intlink with title withtext and target below other element

69 17 575 2 insertion of 2 text elements and anintlink with target below paragraph

20 20 573 2 insertion of intlink with title withtext and target with text belowother element

44 21 565 2 insertion of 2 text elements and anintlink with target with text belowparagraph

Appendix B Table of Largest Frequent Patterns

Id VertexCount

Support Cluster Transformation Name

186 81 48 18 insertion of transclusion with 4 at-tributes with name and value

78 76 28 16 insertion of transclusion with twoargument with name and value andone argument with value

187 72 9 16 insertion of named transclusionwith 5 arguments with value

299 64 11 9 insertion of definition list item with2 intlink2 with title and target and4 text elements and another ele-ment below definition list

29

Page 36: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix B: Table of Largest Frequent Patterns

Id VertexCount

Support Cluster Transformation Name

278 62 10 5 insertion of text and 3 intlinks withtarget and title and 2 other ele-ments below paragraph

117 61 17 9 insertion of definition list with twolist items of which one has 3 text el-ements and 2 intlinks with title andtarget

143 59 9 5 insertion of a paragraph with 2intlinks with target and 4 text el-ements and another element belowbody

121 56 42 5 insertion of 2 intlinks with targetand title and 3 text elements andanother element below paragraph

144 54 18 5 insertion of an intlink with targetand title and another intlink withtitle and 4 text elements and an-other element below paragraph

246 54 3 9 insertion of definition list with listitem with bold text and 4 other textelements and another element

149 53 29 5 insertion of 4 text elements and3 intlinks with target below para-graph

235 53 13 17 insertion of list item with intlinkwith target and 3 text nodes belowunordered list

87 53 57 5 insertion of 3 intlinks with targetand title and 2 text elements belowparagraph

145 52 53 5 insertion of section with headingand body with inserted text and in-serted paragraph with 2 text ele-ments and intlink with target andtitle below body

220 49 34 5 insertion of paragraph with 3intlinks with target and title and atext element below paragraph

30

Page 37: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Id VertexCount

Support Cluster Transformation Name

88 49 49 5 insertion of 3 intlinks with targetand 3 text elements below para-graph

188 49 69 18 insertion of transclusion with 4 ar-guments with value

79 48 61 5 insertion of an element and 2intlinks with target and an intlinkwith title and target below para-graph

164 48 9 9 insertion of definition list with adefinition list item with 3 text el-ements and an intlink with targetand title and another element anda text element below body

175 48 8 5 insertion of a paragraph with textand 2 other text elements and aparagraph with an intlink with tar-get and with 3 text elements andanother element below body

267 45 89 5 insertion of 3 intlinks with targetand two text elements below para-graph

123 45 33 17 insertion of a list item below with3 intlinks with target below un-ordered list

185 45 53 14 insertion of section with headingwith text and body with 2 text ele-ments and an unordered list with alist item with an intlink with targetbelow body

119 45 77 5 insertion of section with headingand body with text and paragraphwith intlink with target and twotext elements below body

297 44 74 5 insertion of paragraph with 2 textelements and an intlink with targetand an intlink with target and titlebelow other element

31

Page 38: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix C: Table of Runs

Appendix C Table of Runs

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

Summary315 2683322 0 50 10 12 TRUE321 0 50 10 12 TRUE320 0 50 10 12 TRUE319 0 50 10 12 TRUE318 0 50 10 12 TRUE317 0 50 10 12 TRUE316 0 50 10 12 TRUE315 0 50 10 12 TRUE314 24 48 10 4 FALSE313 0 50 10 4 TRUE312 24 47 10 4 FALSE311 0 50 10 4 TRUE310 0 50 10 4 TRUE309 0 50 10 4 TRUE308 0 50 10 4 TRUE307 0 50 10 4 TRUE306 0 50 10 4 TRUE305 0 50 10 12 TRUE304 0 50 10 12 TRUE303 0 50 10 12 TRUE302 17 47 10 4 FALSE301 0 50 10 4 TRUE300 0 50 10 4 TRUE299 0 50 10 4 TRUE298 18 47 10 4 FALSE297 0 50 10 4 TRUE296 0 50 10 4 TRUE295 0 50 10 4 TRUE294 0 50 10 4 TRUE293 32 47 10 4 FALSE292 0 50 10 4 TRUE291 26 48 10 4 FALSE290 0 50 10 4 TRUE289 0 50 10 4 TRUE288 0 50 10 4 TRUE

32

Page 39: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

287 0 50 10 4 TRUE286 0 50 10 4 TRUE285 14 47 10 4 FALSE284 0 50 10 4 TRUE283 0 50 10 4 TRUE282 0 50 10 4 TRUE281 0 50 10 4 TRUE280 0 50 10 4 TRUE279 0 50 10 4 TRUE278 0 50 10 4 TRUE277 0 50 10 4 TRUE276 0 50 10 4 TRUE275 0 50 10 4 TRUE274 0 50 10 4 TRUE273 0 50 10 4 TRUE272 0 50 10 4 TRUE271 0 50 10 4 TRUE270 0 50 10 4 TRUE269 0 50 10 4 TRUE268 48 46 10 4 FALSE267 0 50 10 4 TRUE266 0 50 10 4 TRUE265 0 50 10 4 TRUE264 0 50 10 4 TRUE263 0 50 10 4 TRUE262 0 50 10 4 TRUE261 0 50 10 4 TRUE260 0 50 10 4 TRUE259 0 50 10 4 TRUE258 0 50 10 4 TRUE257 0 50 10 4 TRUE256 0 50 10 4 TRUE255 0 50 10 4 TRUE254 0 50 10 4 TRUE253 0 50 10 4 TRUE252 20 48 10 4 FALSE251 0 50 10 4 TRUE250 0 50 10 4 TRUE249 0 50 10 4 TRUE

33

Page 40: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix C: Table of Runs

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

248 22 47 10 4 FALSE247 0 50 10 4 TRUE246 0 50 10 4 TRUE245 17 47 10 4 FALSE244 19 48 10 4 FALSE243 22 47 10 4 FALSE242 20 47 10 4 FALSE241 0 50 10 4 TRUE240 46 48 10 4 FALSE239 0 50 10 4 TRUE238 0 50 10 4 TRUE237 0 50 10 4 TRUE236 0 50 10 4 TRUE235 0 50 10 4 TRUE234 0 50 10 4 TRUE233 0 50 10 4 TRUE232 0 50 10 4 TRUE231 0 50 10 12 TRUE230 13 48 10 4 FALSE229 0 50 10 4 TRUE228 0 50 10 4 TRUE227 0 50 10 4 TRUE226 0 50 10 4 TRUE225 16 49 10 4 FALSE224 0 50 10 4 TRUE223 0 50 10 4 TRUE222 0 50 10 4 TRUE221 0 50 10 4 TRUE220 0 50 10 4 TRUE219 0 50 10 4 TRUE218 0 50 10 4 TRUE217 0 50 10 4 TRUE216 37 45 10 4 FALSE215 0 50 10 4 TRUE214 0 50 10 4 TRUE213 0 50 10 4 TRUE212 0 50 10 4 TRUE211 20 48 10 4 FALSE210 0 50 10 4 TRUE

34

Page 41: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

209 0 50 10 4 TRUE208 0 50 10 4 TRUE207 0 50 10 4 TRUE206 0 50 10 4 TRUE205 0 50 10 4 TRUE204 28 44 10 4 FALSE203 0 50 10 4 TRUE202 0 50 10 4 TRUE201 0 50 10 4 TRUE200 0 50 10 4 TRUE199 0 50 10 4 TRUE198 0 50 10 4 TRUE197 0 50 10 4 TRUE196 0 50 10 4 TRUE195 26 50 10 4 FALSE194 17 42 10 4 FALSE193 0 50 10 4 TRUE192 0 50 10 12 TRUE191 0 50 10 4 TRUE190 0 50 10 4 TRUE189 28 48 10 4 FALSE188 0 50 10 4 TRUE187 0 500 10 10 TRUE186 0 500 10 10 TRUE185 5 469 10 10 FALSE184 8 465 10 10 FALSE183 8 485 10 10 FALSE182 5 464 10 10 FALSE181 0 500 10 10 TRUE180 0 500 10 10 TRUE179 0 500 10 10 TRUE178 0 500 10 10 TRUE177 6 473 10 10 FALSE176 0 500 10 10 TRUE175 0 500 10 10 TRUE174 0 500 10 10 TRUE173 0 500 10 10 TRUE172 0 500 10 10 TRUE171 0 500 10 10 TRUE

35

Page 42: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix C: Table of Runs

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

170 0 50 30 3 TRUE169 3 50 30 3 FALSE168 0 50 30 3 TRUE167 0 50 30 3 TRUE166 0 50 30 3 TRUE165 0 44 30 3 FALSE164 3 50 30 3 FALSE163 0 50 30 3 TRUE162 0 50 30 3 TRUE161 0 50 30 3 TRUE160 0 50 30 3 TRUE159 2 45 30 3 FALSE158 0 50 30 3 TRUE157 0 50 30 3 TRUE156 0 50 30 3 TRUE155 0 50 30 3 TRUE154 0 50 30 3 TRUE153 0 50 30 3 TRUE152 0 50 30 3 TRUE151 0 50 30 3 TRUE150 0 50 30 3 TRUE149 1 45 30 3 FALSE148 0 50 30 3 TRUE147 3 49 30 3 FALSE146 0 50 30 3 TRUE145 4 44 30 3 FALSE144 0 50 30 3 TRUE143 0 50 30 3 TRUE142 0 50 30 3 TRUE141 0 50 30 3 TRUE140 0 50 30 3 TRUE139 1 47 30 3 FALSE138 0 50 30 3 TRUE137 0 50 30 3 TRUE136 0 50 30 3 TRUE135 0 50 30 3 TRUE134 0 50 30 3 TRUE133 2 47 30 3 FALSE132 0 2000 10 8 TRUE

36

Page 43: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

131 13 180 10 8 FALSE130 0 200 10 8 TRUE129 12 189 10 8 FALSE128 13 191 10 8 FALSE127 0 200 10 8 TRUE126 0 200 10 8 TRUE125 10 188 10 8 FALSE124 0 200 10 8 TRUE123 0 200 10 8 TRUE122 0 200 10 8 TRUE121 0 200 10 8 TRUE120 0 200 10 8 TRUE119 0 200 10 8 TRUE118 18 181 10 8 FALSE117 0 200 10 8 TRUE116 0 200 10 8 TRUE115 12 182 10 8 FALSE114 17 187 10 8 FALSE113 0 200 10 8 TRUE112 10 193 10 8 FALSE111 13 193 10 8 FALSE110 0 200 10 7 TRUE109 0 200 10 6 TRUE108 0 200 10 6 TRUE107 0 200 10 6 TRUE106 0 200 10 6 TRUE105 0 200 10 6 TRUE104 8 197 10 6 FALSE103 0 200 10 6 TRUE102 0 200 10 6 TRUE101 10 171 10 7 FALSE100 0 200 10 7 TRUE99 0 200 10 7 TRUE98 10 182 10 7 FALSE97 8 191 10 8 FALSE96 9 187 10 8 FALSE95 13 96 10 5 FALSE94 0 125 10 5 TRUE93 0 250 10 5 TRUE

37

Page 44: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix C: Table of Runs

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

92 0 250 10 5 TRUE91 0 250 10 5 TRUE90 0 500 10 5 TRUE89 0 500 10 5 TRUE88 0 500 10 5 TRUE87 0 500 10 5 TRUE86 0 500 10 5 TRUE85 0 500 10 5 TRUE84 0 500 10 5 TRUE83 20 49 10 5 FALSE82 0 50 10 5 TRUE81 0 50 10 5 TRUE80 14 45 10 5 FALSE79 12 50 10 5 FALSE78 4 50 10 10 FALSE77 0 500 10 10 TRUE76 4 474 10 11 FALSE75 4 474 10 12 FALSE74 0 474 20 12 FALSE73 0 500 20 10 TRUE72 0 500 20 8 TRUE71 0 500 20 2 TRUE70 0 500 30 1 TRUE69 4 474 10 12 FALSE68 0 1050 10 12 TRUE67 0 550 10 12 TRUE66 0 550 20 1 TRUE65 0 1050 30 1 TRUE64 0 1050 20 1 TRUE63 0 1050 10 1 TRUE62 0 50 10 1 TRUE61 0 50 10 12 TRUE60 0 50 10 12 TRUE59 0 50 10 12 TRUE58 1 50 10 12 FALSE57 0 50 10 12 TRUE56 0 50 10 12 TRUE55 23 499 10 12 FALSE54 0 500 10 12 TRUE

38

Page 45: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

53 0 500 10 12 TRUE52 0 500 10 12 TRUE51 0 500 10 12 TRUE50 0 500 20 6 TRUE49 0 500 20 6 TRUE48 0 500 20 6 TRUE47 0 1500 10 12 TRUE46 0 1500 10 12 TRUE45 0 1500 10 12 TRUE44 17 499 10 12 FALSE43 17 499 10 12 FALSE42 0 500 10 12 TRUE41 0 500 10 12 TRUE40 0 500 10 8 TRUE39 0 500 10 8 TRUE38 0 500 10 8 TRUE37 0 500 10 8 TRUE36 0 1000 10 8 TRUE35 0 1000 10 8 TRUE34 0 10000 10 8 TRUE33 0 10000 20 1 TRUE32 0 10000 10 5 TRUE31 0 10000 10 5 TRUE30 0 10000 10 5 TRUE29 0 10000 16 5 TRUE28 0 10000 16 5 TRUE27 0 10000 16 5 TRUE26 0 10000 16 5 TRUE25 0 10000 10 5 TRUE24 0 10000 10 12 TRUE23 0 10000 10 12 TRUE22 0 10000 10 12 TRUE21 0 50000 10 12 TRUE20 0 50000 10 12 TRUE19 0 50000 10 12 TRUE18 0 100000 10 12 TRUE17 0 100000 10 12 TRUE16 0 100000 10 12 TRUE15 0 100000 10 12 TRUE

39

Page 46: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix C: Table of Runs

Id NumberOf FoundPatterns

Number OfEdit Scripts

MinimumNode Count

MinimumSupport

ExceptionOccured

14 0 100000 10 12 TRUE13 0 100000 10 12 TRUE12 0 100000 10 12 TRUE11 0 100000 10 12 TRUE10 0 100000 10 12 TRUE9 0 100000 10 12 TRUE8 0 100000 10 12 TRUE7 0 100000 10 12 TRUE6 0 100000 10 12 TRUE5 0 100000 10 12 TRUE4 0 100000 10 12 TRUE3 0 50 10 12 TRUE2 0 50 10 12 TRUE1 9 995 10 12 FALSE

40

Page 47: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D Selected Frequent Pattern Graphs

Pattern No. 3

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

wom:text

InsertedNode

INSERT

Parent

41

Page 48: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 175239 revision B id:175260; revision A id: 279188 revision Bid: 279189; revision A id: 282193 revisionB id: 282194;

graph is subgraph of patterns : 10 34 43 44 50 51 52 74 75 80 81 87 88 8990 114 116 119 120 121 122 142 143 144145 146 147 148 149 150 166 173 174 175176 177 178 220 221 222 223 224 237 238239 240 241 267 271 272 278 279 297

graph is supergraph of patterns :

42

Page 49: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 4

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

wom:text

InsertedNode

INSERT

Parent

43

Page 50: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 175239 revision B id:175260; revision A id: 279188 revision Bid: 279189; revision A id: 282193 revisionB id: 282194;

graph is subgraph of patterns : 10 11 34 43 44 50 51 52 53 54 55 56 74 7579 80 81 85 87 88 89 90 91 114 116 119 120121 122 142 143 144 145 146 147 148 149150 166 167 173 174 175 176 177 178 220221 222 223 224 225 237 238 239 240 241267 271 272 278 279 280 281 282 297

graph is supergraph of patterns :

44

Page 51: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 5

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

wom:text

InsertedNode

INSERT

Parent

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 175239 revision B id:175260; revision A id: 279188 revision Bid: 279189; revision A id: 282193 revisionB id: 282194;

graph is subgraph of patterns : 10 11 34 43 44 50 51 52 53 54 55 56 74 7579 80 81 85 87 88 89 90 91 92 111 112 113114 115 116 119 120 121 122 137 142 143144 145 146 147 148 149 150 166 167 173174 175 176 177 178 203 220 221 222 223224 225 237 238 239 240 241 244 245 267271 272 278 279 280 281 282 297

graph is supergraph of patterns :

45

Page 52: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 6

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

wom:text

InsertedNode

INSERT

Parent

Figure 2.15: Pattern no. 6

46

Page 53: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 58452 revision B id: 217584;revision A id: 98746 revision B id: 98749;revision A id: 108099 revision B id:405282; revision A id: 108224 revision Bid: 108236;

graph is subgraph of patterns : 10 11 12 20 21 24 25 34 35 41 42 43 44 5051 52 53 54 55 56 59 60 74 75 76 79 80 8182 83 85 87 88 89 90 91 92 93 111 112 113114 115 116 117 119 120 121 122 123 124135 137 142 143 144 145 146 147 148 149150 151 164 166 167 168 173 174 175 176177 178 184 185 192 203 209 220 221 222223 224 225 226 227 228 235 236 237 238239 240 241 244 245 246 247 248 267 271272 278 279 280 281 282 285 289 297 299300 301 307

graph is supergraph of patterns :

47

Page 54: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 7

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:p

wom:text

InsertedNode

INSERT

Parent

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 175239 revision B id:175260; revision A id: 279188 revision Bid: 279189; revision A id: 282193 revisionB id: 282194;

graph is subgraph of patterns : 10 14 34 43 44 50 51 52 69 74 75 77 80 8187 88 89 90 114 116 119 120 121 122 142143 144 145 146 147 148 149 150 166 173174 175 176 177 178 220 221 222 223 224237 238 239 240 241 253 254 267 271 272278 279 297

graph is supergraph of patterns :

48

Page 55: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 8

InsertedNode

INSERT

Parent

wom:p

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 175239 revision B id:175260; revision A id: 279188 revision Bid: 279189; revision A id: 282193 revisionB id: 282194;

graph is subgraph of patterns : 10 11 14 18 34 43 44 50 51 52 53 54 55 5669 74 75 77 79 80 81 85 87 88 89 90 91 114116 119 120 121 122 142 143 144 145 146147 148 149 150 166 167 173 174 175 176177 178 220 221 222 223 224 225 237 238239 240 241 253 254 259 260 267 271 272278 279 280 281 282 295 297

graph is supergraph of patterns :

49

Page 56: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 9

InsertedNode

INSERT

Parent

InsertedNode

INSERT

Parent

wom:p

InsertedNode

INSERT

Parent

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 10 11 13 14 15 18 43 44 51 52 57 65 68 6975 77 79 80 81 85 87 88 89 90 116 119 120121 122 125 126 142 143 144 145 146 147148 149 150 153 154 155 166 167 169 170173 174 175 176 177 178 194 195 202 205207 220 221 222 225 237 238 239 240 241249 250 251 252 253 254 267 271 272 274275 276 278 279 297

graph is supergraph of patterns :

50

Page 57: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 10

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

wom:text

InsertedNode

INSERT

51

Page 58: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 43 44 51 52 75 80 81 87 88 89 90 116 119120 121 122 142 143 144 145 146 147 148149 150 166 173 174 175 176 177 178 220221 222 237 238 239 240 241 267 271 272278 279 297

graph is supergraph of patterns : 3 4 5 6 7 8 9 11 14 15 18

52

Page 59: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 11

InsertedNode

INSERT

Parent

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

wom:text

InsertedNode

INSERT

53

Page 60: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 10 43 44 51 52 75 79 80 81 85 87 88 89 90116 119 120 121 122 142 143 144 145 146147 148 149 150 166 167 173 174 175 176177 178 220 221 222 225 237 238 239 240241 267 271 272 278 279 297

graph is supergraph of patterns : 4 5 6 8 9

54

Page 61: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 12

wom:title

InsertedNode

INSERT

Parent

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

wom:text

InsertedNode

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 108224 revision B id:108236; revision A id: 175239 revision Bid: 175260; revision A id: 279188 revisionB id: 279189; revision A id: 308634 revi-sion B id: 308661;

graph is subgraph of patterns : 20 21 41 50 74 79 87 89 90 91 92 112 117121 122 123 124 137 142 143 144 145 146147 164 174 177 178 184 192 203 220 225227 228 246 248 272 278 279 280 281 282285 297 299 300 301 307

graph is supergraph of patterns : 6

55

Page 62: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 13

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

wom:text

InsertedNode

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 43 44 51 52 57 69 75 80 81 87 88 89 90 116119 120 121 122 125 126 142 143 144 145146 147 148 149 150 153 154 155 166 169170 173 174 175 176 177 178 194 195 202237 238 239 240 241 249 250 251 252 267271 272 274 275 276 297

graph is supergraph of patterns : 9

56

Page 63: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 14

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:p

Parent

wom:text

InsertedNode

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 10 43 44 51 52 69 75 77 80 81 87 88 89 90116 119 120 121 122 142 143 144 145 146147 148 149 150 166 173 174 175 176 177178 220 221 222 237 238 239 240 241 253254 267 271 272 278 279 297

graph is supergraph of patterns : 7 8 9

57

Page 64: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 15

InsertedNode

INSERT

Parent

InsertedNode

INSERT

Parent

wom:p

Parent

wom:text

InsertedNode

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 10 43 44 51 52 57 65 69 75 77 80 81 87 8889 90 116 119 120 121 122 125 126 142 143144 145 146 147 148 149 150 153 154 155166 169 170 173 174 175 176 177 178 194195 202 220 221 222 237 238 239 240 241249 250 251 252 253 254 267 271 272 274275 276 278 279 297

graph is supergraph of patterns : 9

58

Page 65: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 16

InsertedNode

INSERT

Parent

wom:p

InsertedNode

INSERT

Parent

wom:body

Parent

wom:text

InsertedNode

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 108099 revision B id:405282; revision A id: 175239 revision Bid: 175260; revision A id: 279562 revisionB id: 279563;

graph is subgraph of patterns : 26 34 36 61 70 72 73 74 89 111 112 113 119120 125 143 145 146 154 173 174 175 176180 181 193 200 201 234 242 244 249 250251 290 309 310

graph is supergraph of patterns :

59

Page 66: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 17

Parent

wom:title

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 108224 revision B id:108236; revision A id: 175239 revision Bid: 175260; revision A id: 279188 revisionB id: 279189; revision A id: 308634 revi-sion B id: 308661;

graph is subgraph of patterns : 20 21 22 23 41 50 74 79 87 89 90 91 92 112117 121 122 123 124 137 142 143 144 145146 147 164 174 177 178 184 192 203 220225 227 228 246 248 272 278 279 280 281282 283 284 285 297 299 300 301 307

graph is supergraph of patterns :

60

Page 67: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 18

InsertedNode

INSERT

Parent

InsertedNode

INSERT

Parent

wom:p

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 10 43 44 51 52 69 75 77 79 80 81 85 87 8889 90 116 119 120 121 122 142 143 144 145146 147 148 149 150 166 167 173 174 175176 177 178 220 221 222 225 237 238 239240 241 253 254 267 271 272 278 279 297

graph is supergraph of patterns : 8 9

61

Page 68: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 19

InsertedNode

INSERT

Parent

wom:body

Parent

InsertedNode

INSERT

Parent

wom:p

InsertedNode

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 108099 revision B id:405282; revision A id: 175239 revision Bid: 175260; revision A id: 279562 revisionB id: 279563;

graph is subgraph of patterns : 26 34 36 61 66 70 72 73 74 89 111 112 113119 120 125 143 145 146 154 173 174 175176 180 181 193 200 201 234 242 244 249250 251 290 309 310

graph is supergraph of patterns :

62

Page 69: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 20

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 108224 revision B id:108236; revision A id: 175239 revision Bid: 175260; revision A id: 279188 revisionB id: 279189; revision A id: 308634 revi-sion B id: 308661;

graph is subgraph of patterns : 41 50 74 79 87 89 90 91 112 117 121 122123 124 137 142 164 177 184 192 225 228248 272 278 279 285 297 299 300 301 307

graph is supergraph of patterns : 6 12 17 21 22 23

63

Page 70: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 21

InsertedNode

INSERT

Parent

wom:title

InsertedNode

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 108224 revision B id:108236; revision A id: 175239 revision Bid: 175260; revision A id: 279188 revisionB id: 279189; revision A id: 308634 revi-sion B id: 308661;

graph is subgraph of patterns : 20 41 50 74 79 87 89 90 91 112 117 121 122123 124 137 142 143 144 145 146 147 164174 177 178 184 192 225 228 246 248 272278 279 285 297 299 300 301 307

graph is supergraph of patterns : 6 12 17

64

Page 71: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 22

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

Parent

INSERT

InsertedNode

mww:target

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 108224 revision B id:108236; revision A id: 175239 revision Bid: 175260; revision A id: 279188 revisionB id: 279189; revision A id: 308634 revi-sion B id: 308661;

graph is subgraph of patterns : 20 41 50 74 79 87 89 90 91 112 117 121 122123 124 137 142 164 177 184 192 225 228248 272 278 279 285 297 299 300 301 307

graph is supergraph of patterns : 17

65

Page 72: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 23

InsertedNode

INSERT

Parent

wom:title

InsertedNode

Parent

INSERT

InsertedNode

mww:target

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 108224 revision B id:108236; revision A id: 175239 revision Bid: 175260; revision A id: 279188 revisionB id: 279189; revision A id: 308634 revi-sion B id: 308661;

graph is subgraph of patterns : 20 41 50 74 79 87 89 90 91 112 117 121 122123 124 137 142 143 144 145 146 147 164174 177 178 184 192 225 228 246 248 272278 279 285 297 299 300 301 307

graph is supergraph of patterns : 17

66

Page 73: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 26

wom:text

InsertedNode

wom:text

InsertedNode

INSERT

Parent

wom:body

Parent

INSERT

InsertedNode

wom:p

Parent

INSERT

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 108099 revision B id:405282; revision A id: 175239 revision Bid: 175260; revision A id: 279562 revisionB id: 279563;

graph is subgraph of patterns : 34 36 61 70 72 73 74 89 119 120 125 143145 146 154 173 174 175 176 180 181 193200 201 242 249 250 251 290 309 310

graph is supergraph of patterns : 16 19 66

67

Page 74: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 44

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

wom:text

InsertedNode

INSERT

68

Page 75: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 51 52 75 80 81 87 88 89 90 116 119 120 121122 142 143 144 145 146 147 148 149 150166 173 174 175 176 177 178 237 238 239240 241 267 271 272 297

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 13 14 15 18 69

69

Page 76: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 66

InsertedNode

INSERT

Parent

wom:body

Parent

INSERT

InsertedNode

wom:p

Parent

INSERT

InsertedNode

wom:text

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 98746 revision B id: 98749;revision A id: 108099 revision B id:405282; revision A id: 175239 revision Bid: 175260; revision A id: 279562 revisionB id: 279563;

graph is subgraph of patterns : 26 34 36 61 70 72 73 74 89 119 120 125 143145 146 154 173 174 175 176 180 181 193200 201 242 249 250 251 290 309 310

graph is supergraph of patterns : 19

70

Page 77: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 69

mww:target

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 12033 revision B id: 12046;revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 332508 revisionB id: 686549; revision A id: 359376 revi-sion B id: 359392;

graph is subgraph of patterns : 43 44 51 52 75 80 81 87 88 89 90 116 119120 121 122 142 143 144 145 146 147 148149 150 166 173 174 175 176 177 178 237238 239 240 241 267 271 272 297

graph is supergraph of patterns : 7 8 9 13 14 15 18

71

Page 78: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 78

72

Page 79: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:name

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:name

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

mww:transclusion

Parent

INSERT

InsertedNode

mww:name

Parent

INSERT

InsertedNode

wom:text

73

Page 80: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 90853654 revision B id:106357619; revision A id: 149216395 re-vision B id: 149791818; revision A id:181855521 revision B id: 188528728; re-vision A id: 268363914 revision B id:268364383; revision A id: 281074108 re-vision B id: 281074807;

graph is subgraph of patterns :

graph is supergraph of patterns : 1 2 38 84 95 172 188 189 302 314

74

Page 81: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 79

75

Page 82: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

InsertedNode

mww:target

wom:text

InsertedNode

INSERT

Parent

76

Page 83: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 175239 revision B id:175260; revision A id: 308634 revision Bid: 308661; revision A id: 359376 revisionB id: 359392; revision A id: 621714 revi-sion B id: 622105; revision A id: 1207572revision B id: 1238684;

graph is subgraph of patterns : 142 225 278

graph is supergraph of patterns : 4 5 6 8 9 11 12 17 18 20 21 22 23 54 68 9192 137 167 281 282 284

77

Page 84: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 87

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:text

78

Page 85: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 3034654 revision B id:3034706; revision A id: 5359016 revisionB id: 8141976; revision A id: 61048870revision B id: 61050239; revision A id:73469485 revision B id: 73509280; revisionA id: 77437862 revision B id: 77438215;

graph is subgraph of patterns : 121 142

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 20 2122 23 43 44 50 51 54 57 65 68 69 90 91 92116 137 153 166 167 170 177 178 222 237240 254 272 279 281 282 284

79

Page 86: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 88

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:text

80

Page 87: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 1378554 revision B id:1378561; revision A id: 4734004 revision Bid: 4734726; revision A id: 5359016 revi-sion B id: 8141976; revision A id: 7421939revision B id: 7720063; revision A id:8845074 revision B id: 9117591;

graph is subgraph of patterns : 142 148 149

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 13 14 15 18 43 44 5154 57 65 68 69 81 116 153 166 167 170 222237 240 253 254 267

81

Page 88: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 117

wom:text

InsertedNode

INSERT

Parent

wom:title

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:dd

Parent

INSERT

InsertedNode

wom:dl

Parent

INSERT

InsertedNode

wom:dd

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

82

Page 89: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 3489847 revision B id:4061708; revision A id: 73469485 revisionB id: 73509280; revision A id: 103960624revision B id: 104012474; revision A id:125522905 revision B id: 125522953; re-vision A id: 262260060 revision B id:262308574;

graph is subgraph of patterns :

graph is supergraph of patterns : 6 12 17 20 21 22 23 24 25 27 28 29 30 3132 33 124 192 209 215 217 258 285 300 301

83

Page 90: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 119

84

Page 91: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:heading

InsertedNode

INSERT

Parent

wom:body

Parent

INSERT

InsertedNode

wom:section

Parent

INSERT

InsertedNode

wom:body

Parent

INSERT

InsertedNode

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

85

Page 92: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 3206207 revision B id:3206266; revision A id: 4991103 revision Bid: 5000801; revision A id: 5295032 revi-sion B id: 5295067; revision A id: 5359016revision B id: 8141976; revision A id:5561215 revision B id: 5580959;

graph is subgraph of patterns : 145

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 13 14 15 16 18 19 26 3444 52 55 56 61 66 69 101 102 113 114 115120 134 154 155 201 207 208 271 275 286290

86

Page 93: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 121

87

Page 94: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

Parent

INSERT

InsertedNode

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

88

Page 95: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 3034654 revision B id:3034706; revision A id: 5359016 revisionB id: 8141976; revision A id: 73469485revision B id: 73509280; revision A id:116455033 revision B id: 116455138; re-vision A id: 150823988 revision B id:172390487;

graph is subgraph of patterns : 142

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 20 2122 23 43 44 50 51 52 53 54 56 57 65 68 6987 90 91 92 114 115 116 137 147 153 155166 167 170 177 178 222 237 240 241 254272 279 281 282 284 297

89

Page 96: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 123

90

Page 97: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:ul

Parent

INSERT

InsertedNode

wom:li

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

91

Page 98: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 2443498 revision B id:5131858; revision A id: 5042796 revision Bid: 6445130; revision A id: 5072619 revi-sion B id: 5072645; revision A id: 6520278revision B id: 6548195; revision A id:12748001 revision B id: 13565511;

graph is subgraph of patterns :

graph is supergraph of patterns : 6 12 17 20 21 22 23 35 39 40 41 42 46 5960 67 83 93 135 151 184 198 228 248 289

92

Page 99: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 143

93

Page 100: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

wom:text

InsertedNode

INSERT

Parent

wom:body

Parent

INSERT

InsertedNode

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

94

Page 101: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 14223369 revision B id:14223881; revision A id: 246808850 re-vision B id: 247360688; revision A id:362265788 revision B id: 363129554; re-vision A id: 382022859 revision B id:382032428; revision A id: 382032428 re-vision B id: 382046870;

graph is subgraph of patterns :

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1921 23 26 34 43 44 51 52 53 54 55 56 57 6165 66 68 69 81 92 113 114 115 116 120 125126 166 167 170 173 174 176 178 201 202207 208 222 237 240 241 253 254 271 274275 276 281 282 284

95

Page 102: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 144

96

Page 103: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Parent

INSERT

InsertedNode

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

97

Page 104: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 5359016 revision B id:8141976; revision A id: 14223369 revisionB id: 14223881; revision A id: 44286761revision B id: 46077804; revision A id:127355053 revision B id: 127403088; re-vision A id: 208374819 revision B id:208376565;

graph is subgraph of patterns : 142

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 21 2343 44 51 52 53 54 56 57 65 68 69 81 92 114115 116 126 166 167 170 178 202 222 237240 241 253 254 276 281 282 284

98

Page 105: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 145

99

Page 106: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

wom:body

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

wom:heading

InsertedNode

INSERT

Parent

wom:section

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

wom:body

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

100

Page 107: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 4991103 revision B id:5000801; revision A id: 5359016 revisionB id: 8141976; revision A id: 6965245revision B id: 6965270; revision A id:14223369 revision B id: 14223881; revisionA id: 16130709 revision B id: 24768319;

graph is subgraph of patterns :

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1819 21 23 26 34 44 52 55 56 61 66 69 92 101102 113 114 115 119 120 134 201 207 208271 275 282 284 286 290

101

Page 108: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 149

102

Page 109: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

103

Page 110: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 1378554 revision B id:1378561; revision A id: 4734004 revision Bid: 4734726; revision A id: 5359016 revi-sion B id: 8141976; revision A id: 7421939revision B id: 7720063; revision A id:16077479 revision B id: 102146373;

graph is subgraph of patterns : 142

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 13 14 15 18 43 44 51 5457 65 68 69 81 88 116 166 167 170 202 222237 240 253 254 267 276

104

Page 111: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 164

105

Page 112: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:body

Parent

INSERT

InsertedNode

wom:dl

Parent

INSERT

InsertedNode

wom:dd

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

106

Page 113: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 49157044 revision B id:49157157; revision A id: 57771190 revisionB id: 57824325; revision A id: 139221613revision B id: 139226806; revision A id:151423544 revision B id: 151424123; re-vision A id: 179973422 revision B id:180115908;

graph is subgraph of patterns :

graph is supergraph of patterns : 6 12 17 20 21 22 23 24 25 27 28 29 30 3132 33 192 209 215 217 255 258 285 300 301

107

Page 114: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 175

wom:text

InsertedNode

INSERT

Parent

wom:p

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:body

Parent

INSERT

InsertedNode

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

wom:text

InsertedNode

INSERT

Parent

108

Page 115: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 3034654 revision B id:3034706; revision A id: 9340385 revisionB id: 9361391; revision A id: 58893019revision B id: 59142580; revision A id:77074188 revision B id: 77074256; revisionA id: 135381416 revision B id: 135384471;

graph is subgraph of patterns :

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 13 14 15 16 18 19 26 3437 44 52 55 56 57 61 63 64 65 66 68 69 111113 114 115 116 120 125 126 140 153 154155 166 167 170 193 201 207 208 237 240241 242 244 250 251 254 268 270 271 274275 309 310

109

Page 116: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 185

110

Page 117: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

wom:body

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

wom:heading

InsertedNode

INSERT

Parent

wom:section

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:body

Parent

INSERT

InsertedNode

wom:ul

Parent

INSERT

InsertedNode

wom:li

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

wom:text

InsertedNode

INSERT

Parent

111

Page 118: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 6380368 revision B id:6470369; revision A id: 9172344 revisionB id: 9406450; revision A id: 12262527revision B id: 12263196; revision A id:12748001 revision B id: 13565511; revisionA id: 16316290 revision B id: 25213012;

graph is subgraph of patterns :

graph is supergraph of patterns : 6 37 60 64 93 101 102 134 139 140 286 288313

112

Page 119: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 186

113

Page 120: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

wom:text

InsertedNode

INSERT

Parent

mww:name

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:name

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:name

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

mww:transclusion

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

mww:name

InsertedNode

INSERT

Parent

mww:arg

Parent

INSERT

InsertedNode

mww:value

wom:text

InsertedNode

INSERT

Parent

Figure 2.16: Pattern no. 186

114

Page 121: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 40259015 revision B id:41645610; revision A id: 72502662 revisionB id: 82348989; revision A id: 90853654revision B id: 106357619; revision A id:135052356 revision B id: 153450757; re-vision A id: 145120921 revision B id:145545842;

graph is subgraph of patterns :

graph is supergraph of patterns : 2 95 172

115

Page 122: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 187

116

Page 123: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

mww:name

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

mww:transclusion

Parent

INSERT

InsertedNode

mww:arg

Parent

INSERT

InsertedNode

mww:value

wom:text

InsertedNode

INSERT

Parent

117

Page 124: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 48080311 revision B id:56106659; revision A id: 78800907 revisionB id: 107533037; revision A id: 170664988revision B id: 170817765; revision A id:268363914 revision B id: 268364383; re-vision A id: 343278029 revision B id:346817266;

graph is subgraph of patterns :

graph is supergraph of patterns : 1 2 38 84 172

118

Page 125: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 188

119

Page 126: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:value

InsertedNode

INSERT

Parent

mww:arg

InsertedNode

INSERT

Parent

mww:transclusion

Parent

INSERT

InsertedNode

mww:arg

Parent

INSERT

InsertedNode

mww:value

wom:text

InsertedNode

INSERT

Parent

120

Page 127: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 13518044 revision B id:16942617; revision A id: 40259015 revisionB id: 41645610; revision A id: 46586073revision B id: 55476879; revision A id:48080311 revision B id: 56106659; revisionA id: 51071612 revision B id: 55936807;

graph is subgraph of patterns : 78

graph is supergraph of patterns : 2 172

121

Page 128: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 220

122

Page 129: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

123

Page 130: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 175239 revision B id:175260; revision A id: 621714 revision Bid: 622105; revision A id: 1207572 revi-sion B id: 1238684; revision A id: 5359016revision B id: 8141976; revision A id:15595117 revision B id: 15595255;

graph is subgraph of patterns : 142 278

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 12 14 15 17 18 54 65 6892 167 222 254 281 282 284

124

Page 131: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 235

125

Page 132: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

wom:ul

Parent

INSERT

InsertedNode

mww:xml-entity-ref

InsertedNode

INSERT

Parent

wom:for

InsertedNode

INSERT

Parent

wom:subst

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:li

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

126

Page 133: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 290011761 revision B id:291655560; revision A id: 325813725 re-vision B id: 326126815; revision A id:343339961 revision B id: 343564464; re-vision A id: 346376733 revision B id:346378833; revision A id: 361997002 re-vision B id: 363448256;

graph is subgraph of patterns :

graph is supergraph of patterns : 6 35 39 40 42 46 58 59 60 67 82 83 93 135151 197 198 236 243 247 289

127

Page 134: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 246

128

Page 135: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:b

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

Parent

INSERT

InsertedNode

wom:dl

Parent

INSERT

InsertedNode

wom:dd

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

129

Page 136: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 14797615 revision B id:14798837; revision A id: 375703076 re-vision B id: 375718764; revision A id:380561896 revision B id: 380563345;

graph is subgraph of patterns :

graph is supergraph of patterns : 6 12 17 21 23 24 25 27 28 29 30 31 32 33209 215 217

130

Page 137: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Pattern No. 267

131

Page 138: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

mww:target

Parent

INSERT

InsertedNode

wom:text

132

Page 139: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 1378554 revision B id:1378561; revision A id: 1464792 revision Bid: 2754088; revision A id: 4734004 revi-sion B id: 4734726; revision A id: 5359016revision B id: 8141976; revision A id:7331750 revision B id: 7331807;

graph is subgraph of patterns : 88 142 148 149

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 13 14 15 18 44 51 54 5765 68 69 116 167 222 240 254

133

Page 140: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 278

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

InsertedNode

INSERT

Parent

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

Parent

INSERT

InsertedNode

wom:intlink

Parent

INSERT

InsertedNode

134

Page 141: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 1207572 revision B id:1238684; revision A id: 5359016 revisionB id: 8141976; revision A id: 103155489revision B id: 110087167; revision A id:216324329 revision B id: 216436425; re-vision A id: 244588905 revision B id:271529496;

graph is subgraph of patterns : 142

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 12 14 15 17 18 20 21 2223 50 54 65 68 79 91 92 137 167 220 222253 254 279 281 282 284

135

Page 142: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 297

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:p

wom:text

InsertedNode

INSERT

Parent

136

Page 143: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 308634 revision B id:308661; revision A id: 621714 revision Bid: 622105; revision A id: 1464792 revisionB id: 2754088; revision A id: 3034654 revi-sion B id: 3034706; revision A id: 5359016revision B id: 8141976;

graph is subgraph of patterns : 121 142

graph is supergraph of patterns : 3 4 5 6 7 8 9 10 11 12 13 14 15 17 18 20 2122 23 44 50 51 52 53 54 56 57 65 68 69 9091 92 114 115 116 122 137 147 155 167 177178 222 241 254 272 279 281 282 284

137

Page 144: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

Pattern No. 299

138

Page 145: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

wom:dl

Parent

INSERT

InsertedNode

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

mww:target

InsertedNode

INSERT

Parent

wom:intlink

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:text

InsertedNode

INSERT

Parent

wom:dd

Parent

INSERT

InsertedNode

wom:text

InsertedNode

INSERT

Parent

wom:title

InsertedNode

INSERT

Parent

wom:intlink

Parent

INSERT

InsertedNode

mww:target

wom:text

InsertedNode

INSERT

Parent

139

Page 146: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Appendix D: Selected Frequent Pattern Graphs

graph has occurrences in (at least) the fol-lowing revision pairs :

revision A id: 3489847 revision B id:4061708; revision A id: 125522905 revisionB id: 125522953; revision A id: 139221613revision B id: 139226806; revision A id:204869581 revision B id: 204871924; re-vision A id: 211901416 revision B id:211901998;

graph is subgraph of patterns :

graph is supergraph of patterns : 6 12 17 20 21 22 23 24 25 28 29 31 192 209217 258

140

Page 147: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

References

Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., & Rasin, A.(2009, August). Hadoopdb: an architectural hybrid of mapreduce and dbmstechnologies for analytical workloads. Proc. VLDB Endow. 2 (1), 922–933.doi:10.14778/1687627.1687731

Aggarwal, C. C. (2015). Data mining - the textbook. Springer. doi:10.1007/978-3-319-14142-8

Aggarwal, C. C. & Han, J. (Eds.). (2014). Frequent pattern mining. Springer.doi:10.1007/978-3-319-07821-2

Cormen, T. H., Stein, C., Rivest, R. L., & Leiserson, C. E. (2001). Introductionto algorithms (2nd). McGraw-Hill Higher Education.

Dean, J. & Ghemawat, S. (2008, January). Mapreduce: simplified data processingon large clusters. Commun. ACM, 51 (1), 107–113. doi:10.1145/1327452.1327492

Dohrn, H. & Riehle, D. (2011). Design and implementation of the sweble wikitextparser: unlocking the structured data of wikipedia. In Proceedings of the7th international symposium on wikis and open collaboration (pp. 72–81).WikiSym ’11. Mountain View, California: ACM.

Dohrn, H. & Riehle, D. (2013). Design and implementation of wiki content trans-formations and refactorings. In Proceedings of the 9th international sym-posium on open collaboration (2:1–2:10). WikiSym ’13. Hong Kong, China:ACM.

Dohrn, H. & Riehle, D. (2014). Fine-grained change detection in structured textdocuments. In Proceedings of the 2014 acm symposium on document engi-neering (pp. 87–96). DocEng ’14. Fort Collins, Colorado, USA: ACM.

Ferrucci, D. A. (2011, June). Ibm’s watson/deepqa. SIGARCH Comput. Archit.News, 39 (3). doi:10.1145/2024723.2019525

Gartner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: hardness resultsand efficient alternatives. In In: conference on learning theory (pp. 129–143).

Han, J. (2005). Data mining: concepts and techniques. San Francisco, CA, USA:Morgan Kaufmann Publishers Inc.

141

Page 148: Tree-based Edit Analysis of Wiki Articles · an existing tree-based format of wiki-content called Wiki Object Model (WOM) (Dohrn & Riehle, 2011) and the existing HD tree di algorithm,

Selected Frequent Pattern Graphs

Ketkar, N. S., Holder, L. B., & Cook, D. J. (2005). Subdue: compression-basedfrequent pattern discovery in graph data. In Proceedings of the 1st inter-national workshop on open source data mining: frequent pattern miningimplementations (pp. 71–76). OSDM ’05. Chicago, Illinois: ACM. doi:10.1145/1133905.1133915

Langdon, M. (2014). The work of art in a digital age: art, technology and global-isation. Springer. doi:10.1007/978-1-4939-1270-4

Meinl, T., Worlein, M., Fischer, I., & Philippsen, M. (2006). Mining moleculardatasets on symmetric multiprocessor systems. In International conferenceon systems, man and cybernetics, 2006. smc ’06 (pp. 1269–1274).

Nijssen, S. & Kok, J. N. (2004). A quickstart in frequent structure mining canmake a difference. In Proceedings of the tenth acm sigkdd international con-ference on knowledge discovery and data mining (pp. 647–652). KDD ’04.Seattle, WA, USA: ACM. doi:10.1145/1014052.1014134

ParSeMiS project. (n.d.). Retrieved April 20, 2016, from https://github.com/timtadh/parsemis

ParSeMiS - the Parallel and Sequential Mining Suite. (n.d.). Retrieved April 20,2016, from www2.cs.fau.de/EN/research/zold/ParSeMiS/index.html

Wikipedia statistics. (n.d.). Retrieved April 20, 2016, from https://en.wikipedia.org/wiki/Wikipedia:Statistics

Samatova, N. F., Hendrix, W., Jenkins, J., Padmanabhan, K., & Chakraborty,A. (2013). Practical graph mining with r. Chapman & Hall/CRC.

Viegas, F. B., Wattenberg, M., & Dave, K. (2004). Studying cooperation andconflict between authors with history flow visualizations. In Proceedings ofthe sigchi conference on human factors in computing systems (pp. 575–582).CHI ’04. Vienna, Austria: ACM. doi:10.1145/985692.985765

Yan, X. & Han, J. (2002). Gspan: graph-based substructure pattern mining.In Proceedings of the 2002 ieee international conference on data mining(pp. 721–). ICDM ’02. Washington, DC, USA: IEEE Computer Society.

Yan, X. & Han, J. (2003). Closegraph: mining closed frequent graph patterns. InProceedings of the ninth acm sigkdd international conference on knowledgediscovery and data mining (pp. 286–295). KDD ’03. Washington, D.C.:ACM.

Zaki, M. J. (2005, March). Efficiently mining frequent embedded unordered trees.Fundamenta Informaticae, 66 (1-2), 33–52. special issue on Advances inMining Graphs, Trees and Sequences.

142