Prototyping the Tree Automata Workbench Marbles · 2012-11-14

Prototyping the Tree Automata Workbench Marbles

Petter Ericson

Supervisor: Frank Drewes
Assistant supervisor: Brink van der Merwe

Department of Computing Science, Umeå University
S–901 87 Umeå, Sweden, [email protected]

Abstract. In [Dre09], Drewes outlines Marbles, a programming framework for working in a generic and systematic way not only on trees, for which several frameworks already exist, but also on tree recognisers, transducers, generators and other formal devices. This thesis presents a prototype of a proposed implementation of this framework, demonstrating its functionality by using it as a base for implementing a well-known algorithm on tree transducers.

Table of Contents

1 Introduction and Motivation
  1.1 Previous work
  1.2 Project goals
  1.3 Outline
2 Preliminaries
  2.1 Introduction to trees and automata theory
    Trees
    Contexts, variables and multicontexts
    String automata
    Tree automata
  2.2 Project plan
3 Basic Implementation of Marbles
  3.1 Choice of language
    Java
    Haskell
    Scala
  3.2 Basics of Scala
  3.3 Scala example
    Fields and functions of Tree
    The object Tree
    Tree parsing
  3.4 General Marbles organisation
4 Tree Recognisers and Transducers in Marbles
  4.1 Recognisers
  4.2 Semirings and weighted automata
  4.3 Generators
  4.4 Transducers
    Weighted transducers
5 Algorithms on Tree Automata
  5.1 Functionalising tree automata algorithms
  5.2 Further transducer background
  5.3 Bottom-up transducer splitting
    BUFTT splitting example
  5.4 Top-down transducer splitting
    TDFTT splitting example
  5.5 Splitting algorithm implementation
  5.6 Top-down splitter implementation
6 Conclusions and future work
7 Acknowledgements

1 Introduction and Motivation

Techniques based on trees and various tree formalisms have seen increasing use in many areas in recent years [NP92][CDG+02]. Perhaps the best known is the widespread use of XML as a data exchange medium [Sch07], but tree-based techniques have found their place in many other areas, such as natural language processing [KG05], model checking [AJMd02], and compiler optimisation.

Tree techniques, and specifically tree automata techniques, are largely based on the work done by Chomsky et al. in exploring various string formalisms and their respective restrictions, such as the Chomsky hierarchy [Cho56]. During the 1960s, researchers started exploring whether the results obtained in string automata theory and formal languages could be extended to trees. As it turned out, they could, and since then tree automata research has been an ever-growing area, though practical applications of this research are much more recent.

Despite this rather long history, there has been relatively little programming language and operating system support for exploring the capabilities of tree automata. Instead, toolkits and applications tend to focus on solving a specific problem, or on being well suited to one particular area. While the lack of a generic toolkit has obviously not stopped research into tree techniques and various automata, it may still be argued that the fragmented nature of the research codebase has slowed the pace. Further, if such a generic toolkit were available and widely used, researchers would presumably find it easier to collaborate by producing and exchanging source code.

The tree automata workbench Marbles, proposed by Drewes in [Dre09], is intended to be a generic and extensible programming framework for working with trees and tree automata. Specifically, it is intended for exploring the relationships between, and capabilities of, various categories of tree automata, and how algorithms can transform these automata in different ways.

1.1 Previous work

Of course, several projects exist which allow for working on trees and tree automata in various capacities. However, most of these are somewhat narrow in scope, or are not presented in an organised fashion but rather used informally within the confines of a research group.

Certain tools have seen wider distribution, though, and some of these have served as an inspiration for the ideas behind Marbles.

– Treebag [Dre98] is a workbench for working with tree generators and transducers, stepping through derivations and transductions and inspecting the effects on the tree. It is further possible to write algebras that work on the trees to produce some output. The specific context for which Treebag was developed was tree-based picture generation, but it is usable for many other tasks relating to trees and tree generators as well. Treebag is in many ways an ancestor of Marbles, having the same originator and a similar ambition of generality.

– ForestFire [Cle09] is a toolkit for pattern-matching and tree acceptance problems, with various algorithms organised into taxonomies.

– Tiburon [MK06] is a package of algorithms for working on various kinds of tree automata, notably weighted tree transducers and regular tree grammars. Though the focus is on natural language processing and related problems in machine learning, the algorithms are general enough to potentially see use in certain other domains as well.


– Timbuk [GB03] is a collection of tools for working on reachability proofs for term rewriting systems. Recent versions (3.0 and onward) no longer include the tree automata manipulation tools that warrant its inclusion here, but older versions include various algorithms for emptiness checking and for boolean operations such as automata intersection, union and inversion.

– LearnLib [RSB05] is a library for finite automata learning and experimentation that primarily focuses on learning algorithms for string automata. However, certain aspects of its organisation were useful as an inspiration for the Marbles prototype.

1.2 Project goals

While the above projects and systems are very useful in their specific domains, they are nevertheless developed to explore those domains, and in some sense constrained by them. Marbles, in contrast, aims to be a jack-of-all-trades programming framework, supporting exploration of tree automata and tree-based algorithms through being extensible and flexible enough for first-approximation work in practically any domain.

As may be apparent, a complete framework of this size and complexity is not a viable goal for a thesis at the MSc level. Instead, we aim to propose a viable basis for further research into a complete framework. Specifically, we aim to
– find a programming language suitable for implementing the Marbles system, with a view to making the system easy for external researchers to expand upon,
– choose a reasonable subset of tree formalisms for implementation in the prototype, given the time constraints and the desire for coverage of the relevant automata classes, and
– make a viable prototype for the Marbles system. In particular, the prototype should
  • be usable for at least a small part of the tasks covered by the full system,
  • include a basic GUI for interacting with the automata,
  • include concrete implementations of a subset of the concepts described in [Dre09], and
  • have a reasonable (i.e. consistent and logical) architecture, suitable for further implementation work, with a view to eventually being expanded into the complete framework.

1.3 Outline

Section 2 covers the theoretical foundations required for the rest of the thesis, while Section 3 describes the basics of the Marbles implementation, including a discussion of the choice of programming language.

Section 4 continues with both theory and practice, giving the formal definition as well as the implementation of each automaton type provided in the prototype.

Section 5 describes a proof-of-concept implementation of two related algorithms on tree automata using the types and methods described in Sections 3 and 4.

Finally, Section 6 contains a number of closing remarks, and introduces the next steps in making the prototype into an actual working framework, usable for research purposes.


2 Preliminaries

In order to fully appreciate the potential applications and programming patterns used in the Marbles prototype, it is necessary to first go through some basics of formal tree language theory. In principle, the extension from string languages is simple: allow more than one successor to each symbol. This is obviously not precise enough for a formal setting, however.

2.1 Introduction to trees and automata theory

We define an alphabet to be any nonempty set $\Sigma$ of symbols, which can be extended to a ranked alphabet by adding a mapping $R$ from $\Sigma$ to $\mathbb{N}$.

The number $k = R(s)$ is called the rank of the symbol $s \in \Sigma$. We also define the sets $\Sigma_k = \{ s \in \Sigma \mid R(s) = k \}$ for all $k \in \mathbb{N}$. As a convention, we may use a subscript to make the rank explicit, i.e. a symbol $s$ with rank $R(s) = 2$ may be written $s_2$. Requiring that symbols have one rank only is not in general necessary, but makes some proofs and theorems easier to state.

Trees A tree can be defined in many ways: as an acyclic graph with a designated root node or as terms, for example. We prefer to view trees as a special case of strings, however, and reach this definition:

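Let $\{\,[\,,\ ]\,,\ ,\,\}$ (the two brackets and the comma) be a set of auxiliary symbols, disjoint from any other alphabet considered herein. The set $T_\Sigma$ of trees over the (ranked) alphabet $\Sigma$ is the set of strings defined inductively as follows:
– $\Sigma_0 \subseteq T_\Sigma$,
– for $a \in \Sigma_k$, $k \geq 1$, and $t_1, \ldots, t_k \in T_\Sigma$, the string $t = a[t_1, \ldots, t_k]$ is in $T_\Sigma$.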

Fig. 1. A simple graphical representation of the tree a[b[c], d]

In the tree $t = a[b[c], d]$ (shown graphically in Figure 1), the symbol $a$ is the root of the tree, while $b[c]$ and $d$ are child trees, or direct subtrees. The set of all subtrees of a particular tree, $\mathit{subtrees}(t)$, is defined inductively as follows:
– $t$ is in the set $\mathit{subtrees}(t)$,
– if $t'$ is in the set $\mathit{subtrees}(t)$, then all child trees of $t'$ are in $\mathit{subtrees}(t)$.

Further, a tree with no direct subtrees (e.g. $d$) is called a leaf. Thus, $\Sigma_0$ is a set of trees, as well as a set of symbols. A tree language over $\Sigma$ is any subset of $T_\Sigma$. We can again use $\Sigma_0$ as an example, as it can be viewed as the tree language consisting of only leaves.

The yield of a tree $t \in T_\Sigma$ is the string over $\Sigma_0$ obtained by reading the leaves of the tree from left to right.
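The definitions of trees, subtrees and yield can be illustrated with a minimal Scala sketch. The names RankedTree, allSubtrees and yieldOf are hypothetical, chosen only for this illustration; they are not part of Marbles, whose own Tree class is developed in Section 3.3.

```scala
object TreeDefinitionsSketch {
  // A tree is a root symbol together with an ordered list of direct subtrees.
  final case class RankedTree(root: String, children: List[RankedTree] = Nil) {
    // String form a[t1,...,tk]; a leaf is written as its symbol alone.
    override def toString: String =
      if (children.isEmpty) root else root + children.mkString("[", ",", "]")
  }

  // subtrees(t): t itself, plus all subtrees of its child trees.
  def allSubtrees(t: RankedTree): List[RankedTree] =
    t :: t.children.flatMap(allSubtrees)

  // yield(t): the leaf symbols read from left to right.
  def yieldOf(t: RankedTree): String =
    if (t.children.isEmpty) t.root else t.children.map(yieldOf).mkString

  // The tree a[b[c],d] from Figure 1.
  val example: RankedTree =
    RankedTree("a", List(RankedTree("b", List(RankedTree("c"))), RankedTree("d")))

  def main(args: Array[String]): Unit = {
    println(example)                            // a[b[c],d]
    println(allSubtrees(example).mkString(" ")) // a[b[c],d] b[c] c d
    println(yieldOf(example))                   // cd
  }
}
```

Note that the sketch does not check ranks; a ranked alphabet would additionally require that every occurrence of a symbol has the same number of children.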


It should be noted that for all trees considered in this thesis, there is an ordering of the direct subtrees of a tree such that each direct subtree can be given an index. However, it is likely that the complete Marbles system would contain support for unordered trees as well.

Contexts, variables and multicontexts A context $c$ over the ranked alphabet $\Sigma$ is a tree with a special symbol $\Box \notin \Sigma$ occurring exactly once, as a leaf. The substitution of any tree $t \in T_\Sigma$ in place of the symbol $\Box$ is denoted $c[t]$ and (obviously) yields a tree in $T_\Sigma$. The set of all contexts over the ranked alphabet $\Sigma$ is denoted $C_\Sigma$.

We define a set $X$ of variables $x_1, x_2, \ldots$, which is disjoint from any specific ranked alphabet considered in this thesis. In terms of ranks, $X = X_0$, that is, every variable has rank 0. We use the notation $X_k$ to denote the first $k$ elements of $X$.

A multicontext $c_k$ of rank $k$ over the ranked alphabet $\Sigma$ is a tree in $T_{\Sigma \cup X_k}$ in which each variable occurs exactly once. We use the notation $c[t_1, \ldots, t_k]$ to denote the substitution of each variable $x_i$ in $c$ by the tree $t_i$.
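Substitution into a multicontext can be sketched in a few lines of Scala. The types Term, Sym and Var and the function substitute are assumptions made for this illustration only and do not correspond to Marbles classes; variables are modelled as a separate case rather than as symbols of rank 0.

```scala
object MulticontextSketch {
  // Trees over an alphabet extended with variables: a symbol with children, or a variable.
  sealed trait Term
  final case class Sym(label: String, children: List[Term] = Nil) extends Term
  final case class Var(index: Int) extends Term // the variable x_index (1-based)

  // c[t1,...,tk]: replace every variable x_i in c by the tree ts(i - 1).
  def substitute(c: Term, ts: IndexedSeq[Term]): Term = c match {
    case Var(i)               => ts(i - 1)
    case Sym(label, children) => Sym(label, children.map(substitute(_, ts)))
  }

  // The multicontext a[x1, b[x2]] of rank 2.
  val context: Term = Sym("a", List(Var(1), Sym("b", List(Var(2)))))

  def main(args: Array[String]): Unit = {
    // Substituting c for x1 and d for x2 gives the tree a[c, b[d]].
    println(substitute(context, Vector(Sym("c"), Sym("d"))))
  }
}
```

A context in the sense above is then the special case with a single variable playing the role of the special symbol.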

String automata Recall that a deterministic finite string automaton (DFSA) is a 5-tuple $A = (\Sigma, Q, R, F, q_0)$, where
– $\Sigma$ is the alphabet,
– $Q$ is the set of states,
– $R$ is the set of rules of the form $qa \to q'$, where $q, q' \in Q$ and $a \in \Sigma$, such that each left-hand side occurs at most once,
– $F \subseteq Q$ is the set of final states, and
– $q_0 \in Q$ is the initial state.

An intermediate string $s_i$ of the DFSA $A$ working on the string $s$ is a string $p_i q_i r_i$, where $p_i$ is a prefix of the string $s$ and $r_i$ the suffix such that $p_i r_i = s$, while $q_i$ is a state in $Q$. Informally, $p_i$ may be seen as the part of the string that has been processed, $q_i$ as the current state, and $r_i$ as the part of the string that is left.

A valid run of a DFSA $A$ on a string $s$ is a sequence $s_l \ldots s_k$ of intermediate strings where consecutive strings $s_i$ and $s_{i+1}$ are related to each other and to $R$ as follows:
– $p_{i+1}$ in $s_{i+1}$ is exactly $p_i a$, and $r_i$ is exactly $a r_{i+1}$, for some symbol $a$ in $s$, and
– there is a rule $q_i a \to q_{i+1}$ in $R$.

An accepting run $s_0 \ldots s_n$ of a DFSA $A$ on a string $s$ is a valid run such that
– $s_0 = q_0 s$, and
– $s_n = s q$ where $q \in F$.

The regular language accepted by $A$ is the set $L(A)$ of strings on which accepting runs of $A$ can be constructed.

Further, a DFSA can be seen as a function from the set of strings $\Sigma^*$ to the boolean values, where every string in the language is mapped to true, and every other string to false.

By dropping the requirement that each left-hand side occurs at most once in the rule set, we obtain non-deterministic finite string automata (NFSA). No additional expressive power is gained with this change, though individual languages may have an exponentially smaller representation as NFSA than as DFSA [Sip06].
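The DFSA definition, and the view of a DFSA as a function from strings to booleans, can be made concrete with a small Scala sketch. The class Dfsa and the example automaton evenAs are hypothetical names for illustration; states are plain strings, and the alphabet and state set are left implicit in the rule map.

```scala
object DfsaSketch {
  // Rules q a -> q' are stored as a map from (q, a) to q', so each
  // left-hand side occurs at most once, as determinism requires.
  final case class Dfsa(rules: Map[(String, Char), String],
                        finals: Set[String],
                        initial: String) {
    // Process the string symbol by symbol; None if no rule applies at some point.
    def run(s: String): Option[String] =
      s.foldLeft(Option(initial)) { (state, a) =>
        state.flatMap(q => rules.get((q, a)))
      }

    // Accept if the whole string is processed and the last state is final.
    def accepts(s: String): Boolean = run(s).exists(finals.contains)
  }

  // Example: strings over {a, b} containing an even number of a's.
  val evenAs: Dfsa = Dfsa(
    rules = Map(
      ("even", 'a') -> "odd", ("even", 'b') -> "even",
      ("odd", 'a') -> "even", ("odd", 'b') -> "odd"),
    finals = Set("even"),
    initial = "even")

  def main(args: Array[String]): Unit = {
    println(evenAs.accepts("abab")) // true
    println(evenAs.accepts("ab"))   // false
  }
}
```

An NFSA could be sketched along the same lines by mapping each left-hand side to a set of states instead of a single state.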

Tree automata While we will define tree automata more rigorously in the next section, a brief note is in order to explain how string automata are extended to work on trees. Basically, there are two approaches. In the first, the automaton has an initial state which is applied to the root, after which the computation runs in parallel


top-down through the various branches. The alternative is to have no particular starting state, but instead “leaf transitions”, moving from a leaf directly to a state, and then working bottom-up through the tree, culminating in a state that is or is not part of the set of final states. Naturally, these automata exist in deterministic and non-deterministic versions as well.
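As an informal preview of the bottom-up approach, the following Scala sketch runs a deterministic bottom-up automaton on a tree. It is only an assumption-laden illustration: Node, BottomUp and the example rules are hypothetical names, and the formal definitions and the actual Marbles types follow in later sections.

```scala
object BottomUpSketch {
  final case class Node(label: String, children: List[Node] = Nil)

  // Rules map a symbol and the states reached on its direct subtrees to a
  // state; "leaf transitions" are the rules with an empty list of states.
  final case class BottomUp(rules: Map[(String, List[String]), String],
                            finals: Set[String]) {
    // State reached at the root, or None if some node has no applicable rule.
    def run(t: Node): Option[String] = {
      val childStates = t.children.map(run)
      if (childStates.forall(_.isDefined)) rules.get((t.label, childStates.flatten))
      else None
    }

    def accepts(t: Node): Boolean = run(t).exists(finals.contains)
  }

  // Example: boolean expressions over and/not/true/false that evaluate to true.
  val rules: Map[(String, List[String]), String] = Map(
    ("true", Nil) -> "T",
    ("false", Nil) -> "F",
    ("not", List("T")) -> "F",
    ("not", List("F")) -> "T",
    ("and", List("T", "T")) -> "T",
    ("and", List("T", "F")) -> "F",
    ("and", List("F", "T")) -> "F",
    ("and", List("F", "F")) -> "F")

  val evaluatesToTrue: BottomUp = BottomUp(rules, finals = Set("T"))

  def main(args: Array[String]): Unit = {
    val t = Node("and", List(Node("true"), Node("not", List(Node("false")))))
    println(evaluatesToTrue.accepts(t)) // true
  }
}
```

A top-down automaton would instead start from an initial state at the root and assign states to the children of each node.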

2.2 Project plan

The original plan was for the project to run over 6 months, with a bird's-eye view of the intended activities detailed in Table 1. The various stages are explained below:

– Initial planning was to include not only the planning, but also the final decision on what the project would actually entail. The preliminary readings were focused on the initial Marbles paper, as well as a small number of surveys of similar projects and the programming languages used there.

– Design and prototyping: The taxonomy constructions detailed in [Cle08] were to be considered at this stage, while making a further study of reasonable programming language choices and the module structures implied by them.

– Basic definitions and implementation: The choice of programming language having been made, the basic types and organisation of the prototype were to be considered and implemented at this stage.

– Concrete class implementation: Actual automata and algorithm types wereto be implemented here, and connected to a basic GUI.

– Concluding implementation, writing a report: The final weeks were to be dedicated to bug hunting, minor design issues and writing of the report.

Week     Activity
1        Initial planning, preliminary reading of materials, initial design choices
2-4      Further design and prototyping, including work on the taxonomies discussed in [Cle08]
5-10     Basic definitions and implementation
10-15    Concrete class implementation, simple GUI construction
16-22    Concluding implementation, writing a report

Table 1. Initial project plan


3 Basic Implementation of Marbles

3.1 Choice of language

As the prototype was intended to function as a base on which the full system could be built, much thought was spent on the choice of language. Ideally, the language would be familiar to a large number of researchers, while having several desirable features, such as platform independence, easy extensibility and a powerful type system. Initially, the languages considered were C++, Java and Haskell (C# being seen as far too closely tied to the Microsoft Windows platform). However, as C++ is both hard to distribute in a platform-independent form and notoriously counterintuitive, the deliberations quickly centered on Java and Haskell.

Java Java is intuitively a good fit, being familiar to most researchers, and additionally having the desired platform independence and extensibility. However, there are a number of typing “tricks” that are quite difficult to pull off using Java, such as deciding on type parameter co- and contravariance (that is, if S is a supertype of T, is P<S> a super- or subtype of P<T>?). Java also has a number of other undesirable features, e.g. the distinction between primitive types and objects, the lack of operator overloading and of implicit conversions between user-defined types, and a general lack of easy prototyping constructions. Further, programming in Java tends to require much so-called “boilerplate” code, i.e. simple and often-used concepts take much code to express properly. As an example, Java requires the types of arguments, return values and variables to be stated at all times, even when the type of a particular variable is both obvious (to a human reader) and easy to infer (for the compiler).

Haskell Haskell [HPJW+92], in contrast, is a purely functional language which, while not having the mass appeal of Java, is still well known in the research community. However, it lacks the compiled portability of Java, as well as any sort of easy integration with other languages. Further, having no option to use anything but functional programming (albeit with monads etc. playing the roles of objects) makes certain algorithms hard to implement. It does feature a very powerful type system, including type inference to reduce boilerplate type declarations. There is quite a bit more support for fast prototyping than in Java, including an interactive shell (ghci) for trying out the functions and monads that have been defined.

Scala With both Haskell and Java being problematic in their own ways, Scala [OMM+04] appeared as an alternative. While at the time it was not as well known as even Haskell, it nevertheless had an active community, and showed great promise for the future. Further, it touted easy integration into the Java ecosystem, meaning researchers interested in using Marbles would likely be able to write their client code in Java and use the Scala parts of the framework behind the scenes. Strong, static typing with heavy use of type inference further tipped the scales, promising to remove much of the boilerplate required in Java code. Scala also allows for much coding to be done using a functional programming paradigm, which offers certain other benefits in code readability, extensibility and reuse. Other benefits of Scala include the ease with which user-written classes and types can be integrated into the language. Notably, infix operators are simply method applications with the dot and parentheses omitted, and by defining properly named methods, pattern matching and other features traditionally implemented as language constructs can easily be applied to any imaginable class.


Scala also allows for quick and easy prototyping through its interactive shell and concise syntax. The integration with Java works both ways as well; it is trivial to use any existing Java library as a component in a Scala program, meaning integrating existing Java code with Marbles would likely be comparatively easy. These features combined to make Scala appear an ideal choice.

3.2 Basics of Scala

Scala is a functional/object-oriented hybrid language with static typing and a syntax designed to remove boilerplate and increase legibility. It is designed to run on both the Java Virtual Machine (JVM) and Microsoft's .NET infrastructure, and has a large standard library that handles much of the underlying complexity of various common tasks.

Everything in Scala is an object, down to the integral data types (Int and so on) and functions. Further, as opposed to Java or C#, there is no such thing as a static method, which makes Scala in some respects even more object-oriented than those two languages. To provide the equivalent functionality, Scala allows the definition of singleton objects through the keyword object. A class and a singleton object defined with the same name are called the companion class and companion object of each other, which means that they can access each other's private members, accomplishing the functionality of static members.
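A minimal sketch of a companion class and object may make this concrete. The name Counter and its members are hypothetical and serve only as an illustration of how private members are shared between the two.

```scala
object CompanionSketch {
  // Both the constructor and the field are private, yet visible to the companion object.
  class Counter private (private val count: Int) {
    def next: Counter = new Counter(count + 1)
  }

  // Same name, same scope: this is the companion object of class Counter.
  object Counter {
    val start: Counter = new Counter(0) // plays the role of a static constant
    def read(c: Counter): Int = c.count // plays the role of a static method
  }

  def main(args: Array[String]): Unit =
    println(Counter.read(Counter.start.next.next)) // 2
}
```

Code outside the class and its companion can neither call the constructor nor read count directly.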

Probably the most significant difference between Scala and C#/Java is multiple implementation inheritance. Java classes inherit from only a single class, but potentially from multiple interfaces. However, each interface only describes methods that need to be present in the class; no actual implementation code is included in an interface. Through personal communication with the author, it has been established that the lack of this feature was a major stumbling block in the (Java) Treebag implementation. By nature, automata implementation lends itself well to the use of multiple implementation inheritance, and the steps required to reproduce the same behaviour in Java were cumbersome and forced.

In contrast, Scala traits are “rich”, in the sense that they can use the abstract methods they declare to provide more functionality. For example, by inheriting (mixing in) the trait Ordered[T] and implementing the single abstract method compare, all of the comparison operators (< > <= >=) become available, as well as various sorting methods on collections of the class and many other similarly useful functions.
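The following sketch shows the Ordered[T] trait in use. The class Version is a hypothetical example and not part of Marbles; only compare is written by hand, while the comparison operators and the ordering used by sorted come from the standard library.

```scala
object OrderedSketch {
  // Mixing in Ordered[Version] and implementing compare yields <, >, <= and >=.
  final case class Version(major: Int, minor: Int) extends Ordered[Version] {
    def compare(that: Version): Int =
      if (major != that.major) Integer.compare(major, that.major)
      else Integer.compare(minor, that.minor)
  }

  def main(args: Array[String]): Unit = {
    val versions = List(Version(2, 0), Version(1, 4), Version(1, 10))
    println(Version(1, 4) < Version(1, 10)) // true
    println(versions.sorted)                // List(Version(1,4), Version(1,10), Version(2,0))
  }
}
```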

3.3 Scala example

In order to familiarise ourselves with Scala syntax, we will gradually construct parts of the class Tree as implemented in Marbles. Starting off, recall that a tree is defined as having a finite number of subtrees. A simple implementation of trees in Scala might thus look like this:

class Tree (val subtrees: Seq[Tree])

The keyword class starts a class definition, just as in Java and C#, but as is readily apparent, such definitions may be much more concise than in those two languages. The parenthesised parameter list simultaneously defines the primary constructor of the class and its instance variables, in this case a single sequence (Seq) of Trees. Furthermore, this field is declared immutable (the val keyword), meaning that it cannot be reassigned during the lifetime of the object. Instance variables and methods are public by default in Scala. A sample usage of this class is the simple script

val t = new Tree(Nil)

println(t.subtrees)


which creates a leaf tree and prints its subtrees. The printout of this script will simply be List(), which is how the empty list Nil is rendered.

This tree class can only represent skeletons of trees, however, as there is no way to associate a symbol with a specific position. We can easily add a “root” instance variable to take care of this, and by using type parameters, we can even make trees with roots of any type:

class Tree[+T] (val root:T, val subtrees: Seq[Tree[T]])

The type parameterisation should be familiar to anyone with experience of languages like C++, C# or Java. The only unfamiliar part would be the + symbol, which in this case represents the variance of the type parameterisation. That is, if S is a superclass of T, is a Tree[S] a superclass of Tree[T] (covariance)? Is it a subclass of Tree[T] (contravariance)? In Scala, covariance is indicated by the + sign, while a - sign would indicate contravariance. Note that if Trees were not immutable, then they would not actually be covariant, as one could potentially do

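// Hypothetical: assumes root were a mutable var, which the compiler rejects for +T
val t1 = new Tree[Int](2, Nil)
val t2: Tree[Any] = t1   // permitted by covariance (Int is a subtype of Any, not of AnyRef)
t2.root = "abc"          // would store a String as the root of a Tree[Int]
println(3.0 / t1.root)   // would fail at run time: the root is no longer an Int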

which would result in a runtime error if it were allowed. Making objects immutable is thus not only useful for verification, but also affects which restrictions are reasonable to place on type conversions.

As defined above, the Tree class can represent the actual data of a tree properly, but it still lacks any method implementations. For example, printing a Tree with println would use the standard AnyRef toString method, which simply prints the class name and an identity hash of the object. Likewise, the default equals method compares object references rather than object contents to determine its truth value. Adding these methods we arrive at

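class Tree[+T](val root: T, val subtrees: Seq[Tree[T]]) {
  // A leaf is rendered as its root; otherwise the root is followed by the bracketed subtrees
  override def toString: String = subtrees.size match {
    case 0 => root.toString
    case _ => root.toString + subtrees.mkString("[", ",", "]")
  }

  // Structural equality: equal roots and pairwise equal subtrees
  override def equals(other: Any) = other match {
    case that: Tree[_] => this.root == that.root &&
                          this.subtrees == that.subtrees
    case _ => false
  }

  // Kept consistent with equals: equal trees get equal hash codes
  override def hashCode =
    41 * (41 + root.hashCode) + subtrees.hashCode
}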

As is apparent, the def keyword introduces a method definition, while override indicates that the method overrides an implementation in a superclass. Further, note that toString and hashCode lack parentheses. This is a Scala convention indicating that they, though they might require calculation, will not change the object.


A function definition includes an equals sign that indicates the beginning of the function body. This may be followed by a block enclosed in braces, but may also simply be followed by a single expression computing the result, as in these cases.

The pattern matching showcased in equals and toString is similar to that of most other functional languages, though the specifics differ. Note the use of the underscore character (_) as a general default, or “uninteresting”, value. Further, mkString is a method of the Seq class which constructs a string consisting of the first argument, followed by the members of the sequence interspersed with the second argument, and ending with the third argument. In other words, it produces what one would generally expect the string rendition of a tree to be.

The hashCode definition is included so as to make sure that if two objects are equal, then they also have the same hashCode. The specific implementation was inspired by an example in the book Programming in Scala [OSV11]. The basic idea is to use a reasonably large prime (e.g. 41) and combine it with the hashCodes of all instance variables relevant for object equality, to arrive at a reasonably fast and well-spread hash code.
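The effect of these three methods can be checked with a short, self-contained sketch. It repeats the class in compact form inside a hypothetical wrapper object, TreeEqualitySketch, so that it can be run on its own; the wrapper and the sample trees are not part of Marbles.

```scala
object TreeEqualitySketch {
  class Tree[+T](val root: T, val subtrees: Seq[Tree[T]]) {
    override def toString: String =
      if (subtrees.isEmpty) root.toString
      else root.toString + subtrees.mkString("[", ",", "]")

    override def equals(other: Any): Boolean = other match {
      case that: Tree[_] => root == that.root && subtrees == that.subtrees
      case _             => false
    }

    override def hashCode: Int = 41 * (41 + root.hashCode) + subtrees.hashCode
  }

  // Builds a fresh copy of the tree a[b[c],d] on every call.
  def sample(): Tree[String] =
    new Tree("a", List(new Tree("b", List(new Tree("c", Nil))), new Tree("d", Nil)))

  def main(args: Array[String]): Unit = {
    val t1 = sample()
    val t2 = sample()
    println(t1)               // a[b[c],d]
    println(t1 == t2)         // true: structurally equal
    println(t1 eq t2)         // false: two distinct objects
    println(Set(t1, t2).size) // 1: equal hash codes let the set merge them
  }
}
```

Without the hashCode override, hash-based collections such as Set could treat the two equal trees as distinct elements.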

Fields and functions of Tree For the latter parts of the implementation of Tree, we will omit the already defined parts, and only show the added fields and functions.

depth The depth (or height) of a tree is the length of the longest path from the root to a leaf, counted in nodes, so that a leaf has depth 1. In Tree:

def depth: Int = subtrees match {
  case Nil => 1                                  // a leaf has depth 1
  case _ => (subtrees map (_.depth)).max + 1     // deepest subtree plus this node
}

Again, we use pattern matching to give the proper results. However, we also utilise two new Scala constructs: anonymous functions and the map method. Taking these concepts in order, an anonymous function can be defined in many different ways. The canonical way is to define the arguments, with types, and then the body, as follows:

val f = (x:Tree[T]) => x.depth

However, Scala allows for various shorthands. In particular, if the type of the argument can be inferred somehow, it may be omitted. Moreover, if each argument only appears once in the function, one may omit the list of arguments, and instead introduce “holes” (denoted by underscores) in the function body. Thus the above function is identical to the argument given to map in the definition of depth.

Moving on, map is a method of practically every collection in the Scala collections framework. It takes as its argument a function from the element type T to some other type U, and produces a collection of the results of applying the function to every element of the collection. Thus subtrees map (_.depth) returns a sequence of the depths of each subtree. Taking the maximum and adding one results in the depth of the current tree.
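The three ways of writing an anonymous function can be compared side by side in a small sketch. It uses a list of strings rather than trees to stay self-contained; the object and value names are chosen for this illustration only.

```scala
object AnonymousFunctionSketch {
  val words: List[String] = List("tree", "automaton", "yield")

  val explicit: List[Int]  = words.map((w: String) => w.length) // argument type written out
  val inferred: List[Int]  = words.map(w => w.length)           // argument type inferred
  val shorthand: List[Int] = words.map(_.length)                // underscore as a "hole"

  def main(args: Array[String]): Unit = {
    println(shorthand)         // List(4, 9, 5)
    println(shorthand.max + 1) // 10, the same "maximum plus one" pattern as in depth
  }
}
```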

leaves The leaves of a tree are often of interest in various algorithms and automata implementations. Accessing them in Marbles is handled by the leaves function:

def leaves: Seq[Tree[T]] = subtrees match {
  case Nil => Seq(this)                    // a leaf is its own only leaf
  case _ => subtrees flatMap (_.leaves)    // concatenate the leaves of all subtrees
}


The only thing that warrants explanation in this function is the flatMap method. In Scala, flatten can be applied to a collection of collections (e.g. a List of Lists), and results in the inner collections being “flattened”, i.e. the members of the inner collections are instead made members of the outer:

scala> val l = List(Set(1),Set(2,3),Set(2))

l: List[Set[Int]] = List(Set(1), Set(2, 3), Set(2))

scala> l.flatten

res1: List[Int] = List(1, 2, 3, 2)

flatMap, then, is simply a map followed by flatten.
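This equivalence can be checked directly. In the sketch below, copies is a hypothetical helper that returns a list for every element, so that mapping it produces a list of lists.

```scala
object FlatMapSketch {
  // A function returning a collection for each element: n copies of n.
  def copies(n: Int): List[Int] = List.fill(n)(n)

  val xs: List[Int] = List(1, 2, 3)
  val viaFlatMap: List[Int] = xs.flatMap(copies)
  val viaMapThenFlatten: List[Int] = xs.map(copies).flatten

  def main(args: Array[String]): Unit = {
    println(viaFlatMap)                      // List(1, 2, 2, 3, 3, 3)
    println(viaFlatMap == viaMapThenFlatten) // true
  }
}
```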

map With mapping functions being so central to a functional way of programming, the Tree class would have been clearly lacking without one:

def map[To](f : (T) => (To)):Tree[To] =

new Tree(f(root), subtrees map (_ map f))

Note the (function) type of the argument.

subst In addition to simply mapping the nodes, one might want to keep most of the tree intact, but substitute specific subtrees based on the value of the root. This is accomplished in Marbles by the subst method:

def subst[U >: T](subs : Map[U,Tree[U]]) : Tree[U] =

if(subs isDefinedAt root)

subs(root)

else

new Tree[U](root,subtrees map(_.subst(subs)))

The type parameter of subst showcases another part of the Scala type system, namely lower bounds. This means that the type parameter (in this case U) must be a supertype of the indicated type (in this case T). That this restriction is reasonable in the case of the subst method is easily verified: as some subtrees of type Tree[T] are likely to remain, it is not reasonable to have no restrictions whatsoever on the target type, nor is it reasonable to allow subtypes of T. However, a T is also a member of all its supertypes, meaning that (due to covariance) Tree[T]s are also members of Tree[U] for U a supertype of T. These kinds of restrictions are very hard to realise in a consistent manner in languages such as Java.
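The effect of the lower bound can be seen in a small experiment. The sketch below is not the Marbles source: it assumes a covariant, pared-down Tree class, and SubstDemo with its members is invented for the illustration. A Tree[Int] is widened to a Tree[Any] by substituting a string-labelled subtree.

```scala
// Pared-down, covariant stand-in for the Marbles Tree class (an assumption)
case class Tree[+T](root: T, subtrees: Seq[Tree[T]] = Nil) {
  // U may be T itself or any supertype of T
  def subst[U >: T](subs: Map[U, Tree[U]]): Tree[U] =
    if (subs isDefinedAt root) subs(root)
    else Tree[U](root, subtrees map (_.subst(subs)))
}

object SubstDemo {
  val ints: Tree[Int] = Tree(1, Seq(Tree(2), Tree(3)))

  // Replacing the subtree rooted in 2 by a String tree forces U = Any
  val subs: Map[Any, Tree[Any]] = Map[Any, Tree[Any]](2 -> Tree("two"))
  val mixed: Tree[Any] = ints.subst(subs)
}
```

The untouched parts of the tree (the root 1 and the leaf 3) remain in the result, which is only well-typed because every Int is also an Any.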


The object Tree We briefly mentioned the concept of companion objects, which are, among other things, how Scala provides the functionality usually associated with static members.

apply In addition to the more traditional uses of static methods (constants, parsing etc.), a Scala convention is to define only a default constructor in the class, and use the apply functions of the companion object as factory methods for additional sets of arguments. This in turn relies on another Scala shorthand notation, namely that in calls to the apply function, the function name can be omitted. That is, given the singleton object

object Incr {
  def apply(i:Int):Int = i + 1
}

then instead of

Incr.apply(5)

you can write

Incr(5)

The constructor/factory parts of the Tree companion object look like this:

object Tree {
  def apply[T](rt:T,ss:Seq[Tree[T]]) = new Tree(rt,ss)
  def apply[T](rt:T) = new Tree(rt,Nil)
}

Note that since the object itself is a singleton, it has no default constructor and no type parameter. Instead, the type parameters are attached to the apply methods.

unapply In order to facilitate pattern matching on user-defined types, Scala defines the “inverse” function to apply: unapply. In the case of Tree, it is implemented as follows:

def unapply[T](t:Tree[T]):Option[ (T, Seq[Tree[T]]) ] =

Some((t.root,t.subtrees))

The predefined type Option is used to indicate optional values, and is either equal to Some(value), as in the unapply method above, or to None. In general, unapply methods can perform an arbitrary computation on the input and return None in case the pattern does not match. In this case, the pattern match is used like this:

val t = Tree("a",Nil)
t match {
  case Tree(x,Nil) => println(x + " with no subtrees")
  case Tree(x,subs) => println(x + " with subtrees " + subs)
}
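An unapply method that may fail can be sketched as follows. The Leaf extractor and the ExtractorDemo object are hypothetical and not part of Marbles; the Tree class and its companion are reduced to the parts discussed above, with return types written out explicitly.

```scala
// Reduced version of the Tree class and companion described in the text
class Tree[T](val root: T, val subtrees: Seq[Tree[T]])

object Tree {
  def apply[T](rt: T, ss: Seq[Tree[T]]): Tree[T] = new Tree(rt, ss)
  def apply[T](rt: T): Tree[T] = new Tree(rt, Nil)
  def unapply[T](t: Tree[T]): Option[(T, Seq[Tree[T]])] =
    Some((t.root, t.subtrees))
}

// Hypothetical extractor: the pattern matches only trees without subtrees
object Leaf {
  def unapply[T](t: Tree[T]): Option[T] =
    if (t.subtrees.isEmpty) Some(t.root) else None
}

object ExtractorDemo {
  def describe(t: Tree[String]): String = t match {
    case Leaf(x)       => x + " is a leaf"
    case Tree(x, subs) => x + " has " + subs.size + " subtrees"
  }
}
```

When Leaf.unapply returns None, the match simply falls through to the next case.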


Tree parsing Scala provides a framework for combinator parsing, which is used heavily in Marbles to provide persistent test cases. Fully examining the parsing framework is outside the scope of this thesis, as it requires quite a few concepts of Scala not covered in the sections above. However, a few notes are in order before the source code is shown:

– A Parser[T] is the parser type defined by Scala, which is used in building combined parsing expressions.
– The implicit keyword is introduced in the code below. It has three different meanings:
  • Before the declaration of an object, it means that that object can be implicitly accessed wherever it is in scope (usually, the same class, the companion object, and anywhere that class is explicitly used).
  • Before a function parameter, it means that if the parameter is omitted, a search will be made to find an implicit object of the correct type, which will be inserted as a parameter.
  • By defining a converter function T -> S as implicit, an object might be automatically converted from type T to S, anywhere that conversion function is in scope and the typing rules demand it. A notable instance is the implicit conversion from AnyRef to String, analogous to Java. No implicit converter is explicitly shown in the code below.
– The ~ operator combines multiple Parsers into a Parser parsing the sequence of the segments. This is the basic operator of the combinator parsing framework.
– The class ElementParsers[T] was created to serve as a way to pass a Parser[T] as an implicit parameter, something which proved otherwise complicated due to the design of the Scala combinator parsing framework.

/** A parser for trees of the form "root[tree,...]" or "root" if the
  * tree is a leaf.
  */
implicit def treeParser[T](implicit rootParsers:ElementParsers[T]) =
  new ElementParsers[Tree[T]] {
    val root:Parser[T] = rootParsers
    def tree:Parser[Tree[T]] =
      (root~opt("["~>repsep(tree,",")<~"]")) ^^ {
        case root~None => new Tree(root,Nil)
        case root~Some(subs) => new Tree(root,subs)
      }
    def start = tree
  }

For further details on Scala syntax and programming, the excellent book Programming in Scala [OSV11] by Odersky et al. is highly recommended.


3.4 General Marbles organisation

The current Marbles codebase is organised into three modules:

– algorithm holds the actual higher-level algorithms that have been implemented as part of this thesis. Future versions will likely have baseline classes and traits available to help facilitate working with the as of yet unimplemented Marbles GUI, but as of now, the algorithms are relatively self-contained.

– automaton contains the various types of tree automata that have been implemented. These classes will likely also be further split up into base classes, traits and subclasses in the future, to avoid code duplication as far as possible. However, at the moment there are a few simple interface traits defined, with little code being shared between the classes.

– util is a catch-all module for holding the basics of the Marbles system. Notably, alphabets and trees reside here, as do the basics of the parsing system.

A (partial) class diagram is shown in Figure 2.

Fig. 2. Partial class diagram of the Marbles prototype


4 Tree Recognisers and Transducers in Marbles

As described in Section 2.1, finite automata are constructs that in general have a few things in common, notably an alphabet, a state set, and a rule set. In Marbles, the Alphabet is simply a Set of some type T, while a RankedAlphabet is a map from some T to Int. States could conceptually be of any type, but in the current prototype they are Strings. The type of the rule sets obviously varies between different automata types, but in general they are Maps from tuples representing the left-hand side to Sets of the appropriate type collecting the various right-hand sides. All of the assertions regarding the expressive power of various unweighted automata expressed below have been proven since at least the 1970s, and the proofs can generally be found in Joost Engelfriet's lecture notes on Tree Automata and Tree Grammars [Eng75], though in many cases more elegant variants have emerged.

4.1 Recognisers

Extending string recognisers (FSA) to the tree case entails, as mentioned, some way of handling branches, with a choice being made as to moving from the root downwards (top-down) or from the leaves up (bottom-up) during processing. Both of these approaches are available in Marbles, using the TDNFTA and BUNFTA classes, respectively. With more theoretical rigour:

A bottom-up non-deterministic finite tree automaton (BUNFTA) is a 4-tuple $A = (\Sigma, Q, R, F)$ where
– $\Sigma$ is the (ranked) tree alphabet,
– $Q$ is a ranked alphabet of states such that $Q = Q_1$,
– $R$ is a set of rules of the form $a[q_1 \ldots q_k] \to q'$ for $q', q_1, \ldots, q_k \in Q$, $a \in \Sigma_k$, and
– $F \subseteq Q$ is a set of final states.

In Marbles, this is represented by the BUNFTA class, which contains the instance variables sigma, states, rules, and fin, which correspond exactly to the structure described above. The rule set is of the type Map[(T,Seq[String]),Set[String]], which, again, corresponds rather exactly to how we describe rule sets in algorithms. As was mentioned in the introduction to this section, we use a Set to gather the various right-hand sides corresponding to a particular left-hand side.

Moving on with the theoretical definition, an intermediate tree $t_i$ of a BUNFTA $A$ is a tree over $\Sigma \cup Q$.

A valid run of a BUNFTA $A$ is a sequence of intermediate trees $t_l \ldots t_m$ such that the trees $t_i$ and $t_{i+1}$ are related as follows:
– There is a subtree $a[q_1[s_1] \ldots q_k[s_k]]$, $a \in \Sigma$, $q_1, \ldots, q_k \in Q$, $s_1, \ldots, s_k \in T_\Sigma$, at a position $p$ in $t_i$,
– there is a subtree $q'[a[s_1, \ldots, s_k]]$, $q' \in Q$, at position $p$ in $t_{i+1}$,
– $t_i$ and $t_{i+1}$ are otherwise equal, and
– there is a rule $a[q_1 \ldots q_k] \to q'$ in $R$.

An accepting run $t_0 \ldots t_n$ of a BUNFTA $A$ on a tree $t$ is a run where
– $t_0 = t$ and
– $t_n = q[t]$, $q \in F$.

The set $L(A)$ of trees on which an accepting run can be constructed for a BUNFTA $A$ is the language of the automaton. The class of languages recognised by BUNFTA is the class of regular tree languages.


In Marbles, running the automaton on a tree to see if it is part of the language is a simple matter of using the apply method, either explicitly or through just using the object(arguments) syntax. Further, the applyState method will reveal the exact state set that a specific subtree ends up in, while isDeterministic checks if the automaton is deterministic (i.e. that every Set of right-hand sides is of size at most one). Also, parsing of an automaton has been implemented using combinator parsing. In addition, BUNFTA mixes in (inherits) the Scala trait PartialFunction[Tree[T],Boolean], which makes it integrate smoothly into the Scala software ecosystem.

By restricting the rule set such that each left-hand side appears at most once, we arrive at the deterministic variant of bottom-up tree automata (BUDFTA). The expressive power of BUDFTA is exactly equal to that of BUNFTA [Eng75], and both recognise the class of regular tree languages. Though the proof of this assertion is outside the scope of this introduction, we provide an example of a single language implemented with and without nondeterminism. The language is the set of trees over $\Sigma = \{f^{(2)}, g^{(1)}, a^{(0)}, b^{(0)}\}$ such that each subtree whose root is an $f$ contains both $a$s and $b$s, and the two automata are, respectively

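Example 1. $N = (\Sigma, \{q_a, q_b\}, R_N, \{q_a, q_b\})$ where $R_N$ is

$$\begin{array}{ll}
a \to q_a & b \to q_b \\
g[q_a] \to q_a & g[q_b] \to q_b \\
f[q_a, q_b] \to q_a & f[q_a, q_b] \to q_b \\
f[q_b, q_a] \to q_a & f[q_b, q_a] \to q_b
\end{array}$$

and $D = (\Sigma, \{q_a, q_b, q_{ab}\}, R_D, \{q_a, q_b, q_{ab}\})$ where $R_D$ is

$$\begin{array}{ll}
a \to q_a & b \to q_b \\
g[q_a] \to q_a & g[q_b] \to q_b \\
f[q_a, q_b] \to q_{ab} & f[q_b, q_a] \to q_{ab} \\
f[q_{ab}, q_a] \to q_{ab} & f[q_{ab}, q_b] \to q_{ab} \\
f[q_a, q_{ab}] \to q_{ab} & f[q_b, q_{ab}] \to q_{ab} \\
g[q_{ab}] \to q_{ab} & f[q_{ab}, q_{ab}] \to q_{ab}
\end{array}$$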

The proof of the general equivalence is based on the same principle as the proof of the equivalence of deterministic and nondeterministic string automata. That is, the state set $Q$ of the nondeterministic automaton is replaced by the set $\mathcal{P}(Q)$, and transitions are added accordingly. Though $D$ lacks a state corresponding to the empty set and the requisite transitions, it demonstrates the key elements of the proof: the addition of states and transitions in a systematic manner to recognise the same language as $N$ deterministically.
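To make Example 1 concrete, the following minimal sketch runs the nondeterministic automaton N bottom-up. It is not the Marbles implementation: the Tree and BUNFTA classes are pared-down stand-ins that borrow only the rule type Map[(T,Seq[String]),Set[String]] and the method names apply and applyState from the description above, and BUNFTADemo is invented for the illustration.

```scala
// Pared-down stand-in for the Marbles Tree class (an assumption)
case class Tree[T](root: T, subtrees: Seq[Tree[T]] = Nil)

// Hypothetical stand-in for the BUNFTA class, keeping only rules and final states
case class BUNFTA[T](rules: Map[(T, Seq[String]), Set[String]], fin: Set[String]) {

  // All states the automaton can reach at the root of t
  def applyState(t: Tree[T]): Set[String] = {
    val childStates: Seq[Set[String]] = t.subtrees.map(applyState)
    // Every way of picking one reachable state per subtree
    val combos: Seq[Seq[String]] =
      childStates.foldRight(Seq(Seq.empty[String])) { (states, acc) =>
        for (q <- states.toSeq; rest <- acc) yield q +: rest
      }
    combos.flatMap(qs => rules.getOrElse((t.root, qs), Set.empty[String])).toSet
  }

  // A tree is accepted if some reachable state is final
  def apply(t: Tree[T]): Boolean = applyState(t).exists(fin)
}

object BUNFTADemo {
  // The rule set R_N of Example 1
  val rulesN: Map[(String, Seq[String]), Set[String]] = Map(
    ("a", Seq.empty[String]) -> Set("qa"),
    ("b", Seq.empty[String]) -> Set("qb"),
    ("g", Seq("qa"))         -> Set("qa"),
    ("g", Seq("qb"))         -> Set("qb"),
    ("f", Seq("qa", "qb"))   -> Set("qa", "qb"),
    ("f", Seq("qb", "qa"))   -> Set("qa", "qb"))

  val n = BUNFTA(rulesN, Set("qa", "qb"))

  val fab = Tree("f", Seq(Tree("a"), Tree("b")))
  val faa = Tree("f", Seq(Tree("a"), Tree("a")))
}
```

Here f[a,b] reaches both states and is accepted, while f[a,a] reaches none, since there is no rule with left-hand side f[qa,qa].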

Moving on to the top-down case, a top-down non-deterministic finite tree automaton (TDNFTA) is a 4-tuple $A = (\Sigma, Q, R, q_0)$ where
– $\Sigma$ is the (ranked) tree alphabet,
– $Q$ is a ranked alphabet of states such that $Q = Q_1$,
– $R$ is a set of rules of the form $q[a[x_1 \ldots x_k]] \to q_1[x_1] \ldots q_k[x_k]$ where $q, q_1, \ldots, q_k \in Q$, $a \in \Sigma_k$, and $\{x_1, \ldots, x_k\} = X_k$ are variables, and
– $q_0$ is the initial state.

Again, the Marbles implementation stays close to what is defined in the theory, with the variables being named sigma, state, rules and q0, respectively, and with the rule set being a Map[(T,String),Set[Seq[String]]]. The one thing to note is that the right-hand sides are contained in a Set of Seqs. That is, the state that the automaton uses to traverse downward is dependent on what states are applied to the sibling trees. The distinction may seem unimportant, but as will become apparent, it is critical to how the tree automaton works non-deterministically. We introduce the notation $\lambda$ to denote the empty sequence (i.e. in rules involving leaves on the left-hand side).

An intermediate tree of a TDNFTA $A$ is, as for BUNFTA, a tree over $\Sigma \cup Q$.

A valid run of the TDNFTA $A$ is a sequence of intermediate trees $t_l \ldots t_m$ where $t_i$ and $t_{i+1}$ relate to each other as follows:
– There is a subtree $q[a[s_1, \ldots, s_k]]$, $q \in Q$, $a \in \Sigma_k$, $s_1, \ldots, s_k \in T_\Sigma$, at position $p$ in $t_i$,
– there is a subtree $a[q_1[s_1] \ldots q_k[s_k]]$, $q_1, \ldots, q_k \in Q$, at position $p$ in $t_{i+1}$,
– $t_i$ and $t_{i+1}$ are otherwise equal, and
– there is a rule $q[a[v_1 \ldots v_k]] \to q_1[v_1] \ldots q_k[v_k]$ in $R$.

An accepting run $t_0 \ldots t_n$ of a TDNFTA $A$ on a tree $t$ is a run where $t_0 = q_0[t]$ and $t_n \in T_\Sigma$, that is, no states remain in the final intermediate tree.

The set $L(A)$ of trees on which an accepting run can be constructed for the TDNFTA $A$ is the language accepted by $A$. TDNFTA recognise the class of regular tree languages, just as BUNFTA and BUDFTA do. Deterministic top-down tree automata (TDDFTA) can be defined similarly to BUDFTA, that is, we restrict the rule set such that each left-hand side occurs at most once.

TDDFTA recognise a proper subclass of the regular tree languages, i.e. there are regular tree languages for which no TDDFTA can be constructed. An example of such a language is the language $\{f[a, b], f[b, a]\}$. To prove this, assume that there is a rule

$$q_0[f[v_1, v_2]] \to q_1[v_1], q_2[v_2]$$

in $R$. This, however, means that in order for both $f[a, b]$ and $f[b, a]$ to be in $L(A)$, there must be rules

$$q_1[a] \to \lambda \qquad q_2[a] \to \lambda \qquad q_1[b] \to \lambda \qquad q_2[b] \to \lambda$$

in $R$ as well, meaning that both $f[a, a]$ and $f[b, b]$ are in $L(A)$, which would result in the automaton recognising the wrong language. It should be fairly obvious that it is rather trivial to construct a BUDFTA recognising the correct language (and more generally, that all finite tree languages, like all finite string languages, are regular).

Further, by allowing non-determinism in the top-down case, we can amend the automaton to have the rule set

$$\begin{array}{l}
q_0[f[v_1, v_2]] \to q_a[v_1], q_b[v_2] \\
q_0[f[v_1, v_2]] \to q_b[v_1], q_a[v_2] \\
q_a[a] \to \lambda \\
q_b[b] \to \lambda
\end{array}$$

showing that top-down automata can recognise the same language. This example also shows non-determinism in top-down automata, and specifically the nondeterministic choice of not only the possible states for a specific subtree, but the possible combinations of states for sibling subtrees, which is what allows TDNFTA to recognise languages TDDFTA cannot.
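The amended rule set can be tried out with a small sketch of a top-down run. As before, this is an illustration under assumptions and not the Marbles code: Tree and TDNFTA are pared-down stand-ins that borrow only the rule type Map[(T,String),Set[Seq[String]]] and the two-argument applyState from the text, and TDNFTADemo is made up for the example.

```scala
// Pared-down stand-in for the Marbles Tree class (an assumption)
case class Tree[T](root: T, subtrees: Seq[Tree[T]] = Nil)

// Hypothetical stand-in for the TDNFTA class, keeping only rules and initial state
case class TDNFTA[T](rules: Map[(T, String), Set[Seq[String]]], q0: String) {

  // Can t be processed completely when started in state q?
  def applyState(t: Tree[T], q: String): Boolean =
    rules.getOrElse((t.root, q), Set.empty[Seq[String]]).exists { qs =>
      // Each right-hand side assigns one state to every subtree at once
      qs.length == t.subtrees.length &&
        t.subtrees.zip(qs).forall { case (s, qi) => applyState(s, qi) }
    }

  def apply(t: Tree[T]): Boolean = applyState(t, q0)
}

object TDNFTADemo {
  // The empty sequence plays the role of lambda in the leaf rules
  val rules: Map[(String, String), Set[Seq[String]]] = Map(
    ("f", "q0") -> Set(Seq("qa", "qb"), Seq("qb", "qa")),
    ("a", "qa") -> Set(Seq.empty[String]),
    ("b", "qb") -> Set(Seq.empty[String]))

  val automaton = TDNFTA(rules, "q0")

  def f(l: String, r: String): Tree[String] = Tree("f", Seq(Tree(l), Tree(r)))
}
```

Because each right-hand side fixes the states of both siblings together, f[a,b] and f[b,a] are accepted while f[a,a] and f[b,b] are not.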

The functions apply, isDeterministic, and parsing all work similarly to how they work for BUNFTA, but applyState no longer takes only a tree as an argument. Instead, it takes both a tree and a state, and reports whether it is possible for the tree to be processed starting in the specified state.


4.2 Semirings and weighted automata

A semiring is an algebraic structure, used to define weighted tree automata (WTA), and, using WTA, recognisable tree series. Specifically, a semiring is a set $O$ equipped with two binary operations, $+$ and $\cdot$ (addition and multiplication), such that
– $+$ is an associative, commutative operation on $O$ with identity element $0$,
– $\cdot$ is an associative operation on $O$ with identity element $1$,
– $\cdot$ distributes over $+$, and
– multiplication with the additive identity $0$ annihilates $O$, that is, $a \cdot 0 = 0 \cdot a = 0$ for all $a \in O$.

Marbles defines the Semiring[T] and SemiringFactory[T] traits, which may be implemented by the user. Alternatively, one may use one of the predefined semirings provided in semirings.scala, that is, either
– the Reals semiring, which is basically the real numbers, with $+$, $\cdot$, $0$ and $1$ as would be expected,
– the MaxPlus semiring, with
  $0 := -\infty$
  $1 := 0$
  $+ := \max$
  $\cdot := +$
– the Boolean semiring, with
  $0 :=$ false
  $1 :=$ true
  $+ :=$ OR
  $\cdot :=$ AND

(here, symbols to the left of ‘:=’ denote semiring components and symbols to the right of it have their usual meaning).
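What such a semiring abstraction might look like in Scala is sketched below. The member names zero, one, plus and times are assumptions made for this illustration; the actual interface of the Marbles Semiring[T] trait is not shown in this text, and the objects below are stand-ins rather than the contents of semirings.scala.

```scala
// Hypothetical interface; the real Marbles Semiring[T] trait may differ
trait Semiring[T] {
  def zero: T                // identity of addition, annihilates under times
  def one: T                 // identity of multiplication
  def plus(a: T, b: T): T
  def times(a: T, b: T): T
}

// Max-plus: 0 := -infinity, 1 := 0, + := max, * := +
object MaxPlusSketch extends Semiring[Double] {
  val zero = Double.NegativeInfinity
  val one = 0.0
  def plus(a: Double, b: Double): Double = math.max(a, b)
  def times(a: Double, b: Double): Double = a + b
}

// Boolean: 0 := false, 1 := true, + := OR, * := AND
object BooleanSketch extends Semiring[Boolean] {
  val zero = false
  val one = true
  def plus(a: Boolean, b: Boolean): Boolean = a || b
  def times(a: Boolean, b: Boolean): Boolean = a && b
}
```

With these definitions the semiring laws can be spot-checked on concrete values, for instance that the additive identity annihilates under multiplication.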

Informally, a weighted tree automaton (WTA) computes a function from $T_\Sigma$ to some semiring $O$, using the multiplication and addition operations to deduce a value from the tree. Using the previous definition of a BUNFTA as a basis, we extend this to the weighted case as follows:

A bottom-up weighted finite tree automaton (BUWFTA) is a 5-tuple $A = (O, \Sigma, Q, R, F)$ where
– $O$ is a semiring,
– $\Sigma$ and $Q$ are as in BUNFTA,
– $R$ is a set of rules of the form $a[q_1 \ldots q_k] \to_w q'$ for $q', q_1, \ldots, q_k \in Q$, $a \in \Sigma_k$, where $w \in O$ is called the weight of the rule, and
– $F$ is a mapping from $Q$ to $O$ of final weights.

The Marbles representation of this structure is the BUWFTA class, which is defined in much the same way as the BUNFTA class, save that the set containing the right-hand sides now contains tuples of resulting state and weight. Further, the set of final states has been replaced by a Map from state to a final weight, and instead of inheriting from PartialFunction[Tree[T],Boolean], the resulting value is of type R (i.e. the semiring type parameter).

We define the (potentially infinite) alphabet $\Gamma = \Gamma_1$ of pairs of state and weight. That is, for $q \in Q$ and $w \in O$, $(q, w) \in \Gamma_1$. An intermediate tree $t_i$ of a BUWFTA $A$ is a tree over $\Sigma \cup \Gamma$.

A valid run of a BUWFTA $A$ is a sequence of intermediate trees $t_l \ldots t_m$ such that the trees $t_i$ and $t_{i+1}$ are related as follows:


– There is a subtree $a[(q_1, w_1)[s_1] \ldots (q_k, w_k)[s_k]]$, $a \in \Sigma$, $q_1, \ldots, q_k \in Q$, $w_1, \ldots, w_k \in O$, $s_1, \ldots, s_k \in T_\Sigma$, at a position $p$ in $t_i$,
– there is a subtree $(q', w')[a[s_1, \ldots, s_k]]$, $q' \in Q$, at position $p$ in $t_{i+1}$,
– $t_i$ and $t_{i+1}$ are otherwise equal,
– there is a rule $a[q_1 \ldots q_k] \to_w q'$ in $R$, and
– $w' = w \cdot \prod_{i=1}^{k} w_i$.

A successful run $r = t_0 \ldots t_n$ of a BUWFTA $A$ on a tree $t$ is a run where
– $t_0 = t$ and
– $t_n = (q, w)[t]$, where $q \in Q$. In this case, $w_r(t) = F(q) \cdot w$ is the weight contributed by the run $r$ of the BUWFTA $A$ on the tree $t$.

The weight of the tree $t$ as given by the BUWFTA $A$ is the sum of all weights $w_r(t)$, where $r$ is a successful run of $A$ on $t$.

The tree series defined by $A$ is the mapping from the trees in $T_\Sigma$ to their weights as given by $A$. As was mentioned at the beginning of this section, BUWFTA define the class of recognisable tree series.
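The weight computation can be sketched as follows. This is a simplified illustration and not the Marbles BUWFTA class: the Semiring interface, the class layout and all names are assumptions. Instead of enumerating runs one by one, the sketch accumulates one total weight per state at each node, which by distributivity gives the same sum; how distinct runs are counted is glossed over here.

```scala
// Hypothetical semiring interface (the real Marbles trait may differ)
trait Semiring[W] {
  def zero: W
  def one: W
  def plus(a: W, b: W): W
  def times(a: W, b: W): W
}

object MaxPlus extends Semiring[Double] {
  val zero = Double.NegativeInfinity
  val one = 0.0
  def plus(a: Double, b: Double): Double = math.max(a, b)
  def times(a: Double, b: Double): Double = a + b
}

// Pared-down stand-in for the Marbles Tree class (an assumption)
case class Tree[T](root: T, subtrees: Seq[Tree[T]] = Nil)

case class BUWFTA[T, W](
    rules: Map[(T, Seq[String]), Set[(String, W)]],
    fin: Map[String, W],
    sr: Semiring[W]) {

  // Total weight with which each state is reached at the root of t
  def stateWeights(t: Tree[T]): Map[String, W] = {
    val below: Seq[Map[String, W]] = t.subtrees.map(stateWeights)
    // One (state, weight) choice per subtree, with the weights multiplied
    val combos: Seq[(Seq[String], W)] =
      below.foldRight(Seq((Seq.empty[String], sr.one))) { (m, acc) =>
        for ((q, w) <- m.toSeq; (qs, ws) <- acc) yield (q +: qs, sr.times(w, ws))
      }
    val reached: Seq[(String, W)] = for {
      (qs, wSub) <- combos
      (q, w) <- rules.getOrElse((t.root, qs), Set.empty[(String, W)]).toSeq
    } yield (q, sr.times(w, wSub)) // w' = w * product of the subtree weights
    reached.groupBy(_._1).map { case (q, ws) =>
      q -> ws.map(_._2).foldLeft(sr.zero)(sr.plus)
    }
  }

  // Sum over all states of F(q) * w
  def apply(t: Tree[T]): W =
    stateWeights(t).foldLeft(sr.zero) { case (acc, (q, w)) =>
      sr.plus(acc, sr.times(fin.getOrElse(q, sr.zero), w))
    }
}

object WeightDemo {
  // In max-plus, multiplying rule weights adds them, so this counts leaves
  val rules: Map[(String, Seq[String]), Set[(String, Double)]] = Map(
    ("a", Seq.empty[String]) -> Set(("q", 1.0)),
    ("f", Seq("q", "q"))     -> Set(("q", 0.0)))
  val leafCounter = BUWFTA[String, Double](rules, Map("q" -> 0.0), MaxPlus)
  val weight = leafCounter(Tree("f", Seq(Tree("a"), Tree("f", Seq(Tree("a"), Tree("a"))))))
}
```

For the tree f[a, f[a,a]] the single run collects weight 1 at each of the three leaves, so the computed weight is 3.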

The top-down case is defined analogously, though obviously using TDNFTA as its base rather than BUNFTA. For a thorough survey on weighted automata theory in general, including formal proofs of various properties mentioned above, refer to [DKV09].

4.3 Generators

The “inverse” formal devices of recognisers are various kinds of grammars. Notable in the string case is the context-free grammar, which is much more readily used to define context-free languages than the appropriate recogniser, the push-down automaton. Likewise, the standard regular expression is as easily converted to a grammar as to a finite string automaton. For regular tree languages the equivalent construction is the regular tree grammar. While recognisers are reasonable to define in both a top-down and a bottom-up manner, it would be hard to know in advance how many leaves to start with in a bottom-up generator. Further, as with the string variants, deterministic grammars are obviously unreasonable, as such grammars would only define a single string or tree. Formally:

A regular tree grammar (RTG) is a 4-tuple $G = (\Sigma, N, R, S)$ where
– $\Sigma$ is the ranked alphabet of terminal (output) symbols,
– $N$ is a ranked alphabet of non-terminal symbols such that $N_0 = N$,
– $R$ is a set of rules of the form $A \to t$, where $A \in N$ and $t$ is a tree over $\Sigma \cup N$, and
– $S \in N$ is the starting symbol.

In Marbles, the class RTGrammar is more or less set up as expected, with the instance variables being named sigma, nonterminals, rules and start, respectively. The rules variable is a Map[String,Set[Tree[Either[String,T]]]]. In contrast with the recogniser automata, generators like RTGs cannot in a natural way be represented as functions in Scala. Instead, we choose to model them as Iterator[Tree[T]], i.e. devices that iterate over a (potentially infinite) collection of items (in this case the Tree[T]s of a tree language).


An intermediate tree $t$ of an RTG $G$ is a tree over $\Sigma \cup N$.

A valid sequence of $G$ is a sequence $t_0 \ldots t_n$ of intermediate trees where each pair of trees $t_i$, $t_{i+1}$ is related as follows:
– There is a nonterminal $A$ at position $p$ in $t_i$,
– there is a subtree $t'$ at position $p$ in $t_{i+1}$,
– $t_i$ and $t_{i+1}$ are otherwise identical, and
– there is a rule $A \to t'$ in $R$.

A generation of a tree $t \in T_\Sigma$ by the RTG $G$ is a valid sequence of intermediate trees such that $t_0 = S$ and $t_n = t$.

The language $L(G)$ generated by the RTG $G$ is the set of trees that can be generated by $G$. Though the proof is outside the scope of this thesis, it can be shown that RTG correspond exactly to BUNFTA, and thus are another way to define the regular tree languages. As an example, consider the language defined by $D$ and $N$ in Example 1. An RTG defining the same language looks as follows:

Example 2. $G = (\Sigma, \{A, B, S\}, R_G, S)$ where $R_G$ is

$$\begin{array}{ll}
S \to A & S \to B \\
A \to a & B \to b \\
A \to g[A] & B \to g[B] \\
A \to f[A, B] & B \to f[A, B] \\
A \to f[B, A] & B \to f[B, A]
\end{array}$$

It may be illustrative to connect the nonterminals $A$ and $B$ with the states $q_a$ and $q_b$, respectively, and compare $R_G$ with $R_N$. Note that, apart from the start rules involving $S$, the rules are identical but “inverted”. In fact, the constructive proof of the expressive equivalence of BUNFTA and RTG involves adding a start state that goes to every state/nonterminal in $F$, and then simply inverting the rules.

In Marbles, actual tree generation proved to be much more of a problem than anticipated. Eventually, a solution was found, though the algorithm is less elegant than might be desired. At the centre of the algorithm is a “rule-choice” tree, which dictates what rule to apply at each nonterminal. The iteration works on this tree, while constructing the current output tree of the iteration.

Making things more formal: for the RTG $G = (\Sigma, N, R, S)$, we define a partial mapping $d_r$ (the “rule-depth”) from the nonterminals $N$ to the natural numbers as follows: $d_r$ is the unique partial function $d_r : N \to \mathbb{N}$ such that, for all $A \in N$, $d_r(A)$ is the smallest natural number for which there exists a rule $A \to t$ where, for all nonterminals $B$ in $t$, $d_r(B) < d_r(A)$. Starting with the completely undefined mapping, we can determine $d_r$ iteratively:
– If there is a rule $A \to t$ in $R$ such that $t$ consists of only terminal symbols, then $d_r(A) = 0$.
– Loop, while $d_r$ becomes more defined:
  • Let $d_r^{old} = d_r$.
  • For each rule $A \to c[B_1, \ldots, B_l]$ in $R$, where $c$ is a multicontext over $\Sigma$ and $B_1, \ldots, B_l$ are nonterminals,
    – if $d_r^{old}(A)$ is not defined, but $d_r^{old}(B_i)$ for $i = 1, \ldots, l$ is, let $d_r(A) = \max_{1 \le i \le l} d_r^{old}(B_i) + 1$.

The mapping $d_r$ thus denotes how many “levels” each nonterminal is from a finished output tree. This is used to order the rules involving a specific left-hand side, such that the most “shallow” rule comes first. If two rules are equally deep, the ordering is based on the output trees.
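The fixed-point computation of the rule depth can be sketched in a few lines of Scala. This is an illustration under assumptions, not the Marbles code: the Tree class is a pared-down stand-in, terminals are simply Strings, and the names RuleDepth, nt, sym and ruleDepth are invented here. Where several rules for the same nonterminal become applicable in the same round, the sketch takes the smallest resulting value.

```scala
import scala.annotation.tailrec

// Pared-down stand-in for the Marbles Tree class (an assumption)
case class Tree[T](root: T, subtrees: Seq[Tree[T]] = Nil)

object RuleDepth {
  // Left = nonterminal, Right = terminal, mirroring Tree[Either[String,T]]
  type Rhs = Tree[Either[String, String]]

  def nt(n: String): Rhs = Tree[Either[String, String]](Left(n))
  def sym(s: String, subs: Rhs*): Rhs = Tree[Either[String, String]](Right(s), subs)

  // All nonterminal occurrences in a right-hand side
  def nonterminals(t: Rhs): Seq[String] = {
    val here = t.root match {
      case Left(n)  => Seq(n)
      case Right(_) => Seq.empty[String]
    }
    here ++ t.subtrees.flatMap(nonterminals)
  }

  def ruleDepth(rules: Map[String, Set[Rhs]]): Map[String, Int] = {
    @tailrec
    def loop(dr: Map[String, Int]): Map[String, Int] = {
      // Nonterminals that become defined in this round, based on the old dr
      val added = for {
        (a, rhss) <- rules.toSeq
        if !dr.contains(a)
        depths = rhss.toSeq
          .map(nonterminals)
          .filter(_.forall(dr.contains))
          .map(ns => if (ns.isEmpty) 0 else ns.map(dr).max + 1)
        if depths.nonEmpty
      } yield a -> depths.min
      if (added.isEmpty) dr else loop(dr ++ added)
    }
    loop(Map.empty)
  }

  // The grammar of Example 2
  val exampleRules: Map[String, Set[Rhs]] = Map(
    "S" -> Set(nt("A"), nt("B")),
    "A" -> Set(sym("a"), sym("g", nt("A")), sym("f", nt("A"), nt("B")), sym("f", nt("B"), nt("A"))),
    "B" -> Set(sym("b"), sym("g", nt("B")), sym("f", nt("A"), nt("B")), sym("f", nt("B"), nt("A"))))

  val depths = ruleDepth(exampleRules)
}
```

For the grammar of Example 2, A and B have terminal-only rules and therefore depth 0, while S needs one further level.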


Given a rule-choice tree, we define the corresponding output tree as follows:

FUNCTION computeCurrent(n: nonterminal, rt: rule-choice tree)
    LET root[subtrees] = rt
    LET ot: output-tree = rules(n)[root] // i.e. the root:th rule
                                         // corresponding to the
                                         // left-hand side n
    FOREACH(nonterminal sn in ot)
        IF subtrees IS EMPTY
            // Find the shortest path to a complete output tree
            replace sn with computeCurrent(sn, 0[])
        ELSE
            replace sn with computeCurrent(sn, subtrees.head)
            LET subtrees = subtrees.tail
        ENDIF
    END
    RETURN ot
END

LET output = computeCurrent(startSymbol, ruleChoiceTree)

Thus, the rule-choice tree 0[] will return the smallest tree of the language, even though it may require more than a single rule application to get there.
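A functional rendering of this procedure might look as follows. It is a sketch under assumptions and not the Marbles implementation: the Tree class is a stand-in, the rules are given as already ordered sequences (with the shallowest rule first, which the recursion on 0[] relies on for termination), the particular order of equally deep rules is a guess, and the names RuleChoice, expand and output are invented. The grammar is that of Example 2.

```scala
// Pared-down stand-in for the Marbles Tree class (an assumption)
case class Tree[T](root: T, subtrees: Seq[Tree[T]] = Nil)

object RuleChoice {
  // Left = nonterminal, Right = terminal
  type Rhs = Tree[Either[String, String]]
  def nt(n: String): Rhs = Tree[Either[String, String]](Left(n))
  def sym(s: String, subs: Rhs*): Rhs = Tree[Either[String, String]](Right(s), subs)

  def computeCurrent(rules: Map[String, Seq[Rhs]])(n: String, rt: Tree[Int]): Tree[String] = {
    // Expand a right-hand side, consuming rule choices from left to right
    def expand(t: Rhs, choices: List[Tree[Int]]): (Tree[String], List[Tree[Int]]) =
      t.root match {
        case Left(sn) =>
          choices match {
            case c :: rest => (computeCurrent(rules)(sn, c), rest)
            // No choice left: take the shortest path to a complete tree
            case Nil       => (computeCurrent(rules)(sn, Tree(0)), Nil)
          }
        case Right(s) =>
          val start = (Vector.empty[Tree[String]], choices)
          val (subs, rest) = t.subtrees.foldLeft(start) { case ((done, cs), sub) =>
            val (out, cs2) = expand(sub, cs)
            (done :+ out, cs2)
          }
          (Tree(s, subs), rest)
      }
    // rt.root selects the rule; rt.subtrees are the choices for its nonterminals
    expand(rules(n)(rt.root), rt.subtrees.toList)._1
  }

  // Example 2, with an assumed ordering of the rules for each nonterminal
  val rules: Map[String, Seq[Rhs]] = Map(
    "S" -> Seq(nt("A"), nt("B")),
    "A" -> Seq(sym("a"), sym("g", nt("A")), sym("f", nt("A"), nt("B")), sym("f", nt("B"), nt("A"))),
    "B" -> Seq(sym("b"), sym("g", nt("B")), sym("f", nt("A"), nt("B")), sym("f", nt("B"), nt("A"))))

  def output(rt: Tree[Int]): Tree[String] = computeCurrent(rules)("S", rt)
}
```

Under this ordering, the rule-choice tree 0[] yields a, 0[2] yields f[a,b], and 0[1[2]] yields g[f[a,b]].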

Iterating over the trees of the language is accomplished using a depth-first search with iterative deepening, though certain complications are introduced because of certain rule-chains ending in dead ends, among other things. In addition, some rule combinations will result in the same output tree being generated twice. This can either be ignored as an irrelevant side effect of the algorithm, or alleviated by keeping track of the output trees that have already been used, and simply continuing to iterate until a new tree is found. This is guaranteed to terminate by the rule set and alphabet both being finite. As an example, for the language discussed in Example 2, the first few rule-choice trees we expect from the iteration are:

0[]

1[]

0[1]

0[2]

0[3]

1[1]

//We skip 1[2] and 1[3] since the outputs are equal to 0[2] and 0[3]

0[1[1]]

0[1[2]]


The iteration itself is not particularly interesting. It is a simple matter of updating the positions of the tree, and substituting the subtrees with fresh copies with the proper number of subtrees in the cases of an internal node being updated. Indeed, implementing the algorithm in a functional manner took far more time than understanding its general outline. A simplified version of the functional implementation looks as follows:

FUNCTION iterateSubtree(rt : rule-choice tree,
                        depth : integer,
                        alreadyUpdated : boolean
                       ): (rule-choice tree, boolean)
    IF(alreadyUpdated)
        (rt, true) // Simply move on, the tree has already been updated
    ELSE
        IF(depth == 0)
            IF(rules left for this nonterminal)
                // Simple, we can update here and move on
                (Tree(rt.root + 1, Nil), true)
            ELSE
                // We need to update somewhere else
                (Tree(0, Nil), false)
            ENDIF // Rules left
        ELSE
            // We are not yet at the target depth
            LET newsubs = FOREACH( srt IN rt.subtrees )
                // Iterate downwards, and collect the changed subtrees
                (newsub, alreadyUpdated) = iterateSubtree(srt,
                                                          depth - 1,
                                                          alreadyUpdated)
                YIELD newsub
            END
            // Did we get our desired change yet?
            IF(alreadyUpdated)
                // Yes, return it, then
                (Tree(rt.root, newsubs), true)
            ELSE
                // We need to update this node
                IF(rules left for this nonterminal)
                    // Fill the tree below this level with the proper
                    // number and arrangement of zeroes
                    (fillTree(nonterminal, rt.root + 1, depth), true)
                ELSE
                    // Just fill with zeros for now
                    (fillTree(nonterminal, 0, depth), false)
                ENDIF // Rules left
            ENDIF // Update below this node
        ENDIF // At the target depth
    ENDIF // Update before we even got to this node
END

In the actual iteration, an iterate function tries to run the current rule-choice tree through the iterateSubtree function, and increases the depth if it is not possible to update. This function also incorporates the duplicate checking code.


4.4 Transducers

Tree transducers are formalised automata that take a tree $t$ as input and use that to construct an output tree $t'$ (possibly linked to some other value). Because of their use in areas such as translation and XML processing, as well as having other interesting properties, they have been the focus of quite a bit more research than the recogniser classes.

As for recognisers, it is reasonable to define both bottom-up and top-down variants of tree transducers, and it will be shown that both variants have interesting properties. Informally, we can think of tree transducers as tree recognisers that, apart from producing a state at each node, also produce a tree. Additionally, there is a specified way the trees at each node are combined into one final output tree. More formally:

A bottom-up finite tree transducer (BUFTT) is a 5-tuple $T = (\Sigma, \Delta, Q, R, F)$, where
– $\Sigma$ is the (ranked) input alphabet,
– $\Delta$ is the (ranked) output alphabet,
– $Q$ is a ranked alphabet of states such that $Q = Q_1$,
– $R$ is a set of rules of the form

$$s[q_1[x_1], \ldots, q_k[x_k]] \to q[t']$$

  where $q, q_1, \ldots, q_k \in Q$, $s \in \Sigma_k$, $x_1, \ldots, x_k$ are variables, and $t' \in T_{\Delta \cup X_k}$,
– and $F \subseteq Q$ is a set of final states.

As for the previous automata types, the Marbles equivalent is fairly close to the theoretical definition: sigma, delta, states and fin all have the types one would expect, while rules is of the type Map[(F,Seq[String]),Set[(VarTree[T],String)]], where F is the type parameter of the input alphabet, and T that of the output alphabet. VarTree[T] is in principle a Tree over Either[Int,T], though there are a number of extra methods implemented for easing the tasks associated with tree transducers and similar constructs.

An intermediate tree $t$ of a BUFTT $T$ is a tree over $\Sigma \cup \Delta \cup Q$.

A computation of a BUFTT $T$ is a sequence $t_l, \ldots, t_m$ of intermediate trees such that $t_i$ and $t_{i+1}$ relate to each other and $T$ as follows:
– there exists a tree $s[q_1[t_1], \ldots, q_k[t_k]]$ at position $p$ in $t_i$,
– there exists a tree $q[t'']$ at position $p$ in $t_{i+1}$,
– $t_i$ and $t_{i+1}$ are otherwise equal,
– there is a rule $s[q_1[x_1], \ldots, q_k[x_k]] \to q[t']$ in $R$, and
– $t''$ is the tree one obtains by taking $t'$ and substituting each instance of $x_j$ by the subtree $t_j$, for $j = 1, \ldots, k$.

A successful computation of a BUFTT $T$ on a tree $t \in T_\Sigma$ is a computation $t_0, \ldots, t_n$ where $t_0 = t$ and $t_n = q[t_{out}]$, where $q \in F$ and $t_{out} \in T_\Delta$. The trees $t$ and $t_{out}$ are the input and output trees, respectively, of this computation. As the BUFTT may be nondeterministic, each input tree defines a set of output trees, and the BUFTT as a whole defines a relation $U$ on $T_\Sigma \times T_\Delta$, where $(t, t_{out}) \in U$ if and only if there is a successful computation of $T$ such that $t$ and $t_{out}$ are its input and output trees, respectively.
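The definition can be turned into a compact bottom-up evaluation. The sketch below is an illustration under assumptions and not the Marbles BUTreeTransducer: Tree is a stand-in, VarTree is modelled as a plain type alias without the extra methods mentioned above, variables are numbered from 0 rather than 1, and all names in BUFTTSketch are invented. The example transducer copies its input, turning every unary f into a binary g.

```scala
// Pared-down stand-in for the Marbles Tree class (an assumption)
case class Tree[T](root: T, subtrees: Seq[Tree[T]] = Nil)

object BUFTTSketch {
  // Left(i) is the variable with index i (0-based here), Right(s) an output symbol
  type VarTree[T] = Tree[Either[Int, T]]

  // Replace every variable i by args(i)
  def instantiate[T](vt: VarTree[T], args: Seq[Tree[T]]): Tree[T] = vt.root match {
    case Left(i)  => args(i)
    case Right(s) => Tree(s, vt.subtrees.map(instantiate(_, args)))
  }

  case class BUFTT[F, T](
      rules: Map[(F, Seq[String]), Set[(VarTree[T], String)]],
      fin: Set[String]) {

    // All (state, output tree) pairs that can be produced at the root of t
    def process(t: Tree[F]): Set[(String, Tree[T])] = {
      val below = t.subtrees.map(process)
      // One (state, output) result per subtree
      val combos = below.foldRight(Seq(Seq.empty[(String, Tree[T])])) { (results, acc) =>
        for (r <- results.toSeq; rest <- acc) yield r +: rest
      }
      (for {
        combo <- combos
        (rhs, q) <- rules.getOrElse((t.root, combo.map(_._1)), Set.empty[(VarTree[T], String)]).toSeq
      } yield (q, instantiate(rhs, combo.map(_._2)))).toSet
    }

    // Outputs of successful computations, i.e. those ending in a final state
    def apply(t: Tree[F]): Set[Tree[T]] =
      process(t).collect { case (q, out) if fin(q) => out }
  }

  def v(i: Int): VarTree[String] = Tree[Either[Int, String]](Left(i))
  def s(sym: String, subs: VarTree[String]*): VarTree[String] =
    Tree[Either[Int, String]](Right(sym), subs)

  // a -> q[a],  f[q[x0]] -> q[g[x0, x0]]
  val doubler = BUFTT[String, String](
    Map[(String, Seq[String]), Set[(VarTree[String], String)]](
      ("a", Seq.empty[String]) -> Set((s("a"), "q")),
      ("f", Seq("q"))          -> Set((s("g", v(0), v(0)), "q"))),
    Set("q"))

  val output = doubler(Tree("f", Seq(Tree("a"))))
}
```

Since the subtree is fully processed before it is copied, both copies in the output are identical, which is exactly the bottom-up behaviour described above.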

In Marbles, a BUTreeTransducer[F,T] inherits from TreeTransducer[F,T], which as of this writing is simply a “forwarding” trait that inherits from the basic PartialFunction[Tree[F],Set[Tree[T]]] trait. This inserts the transducer at the appropriate place in the Scala ecosystem, and allows one to use various interesting constructions, such as making a RegularTreeGrammar and then mapping a tree transducer on top of it, to end up with an Iterator over the sets of output trees. Alternatively, by using flatMap, the individual trees are accessed. As for the other automata types, a parser is included in the companion object.

Just as BUNFTA relate to BUFTT, TDNFTA relate to top-down finite tree transducers (TDFTT). Formally:

A top-down finite tree transducer (TDFTT) is a 5-tuple $T = (\Sigma, \Delta, Q, R, q_0)$, where
– $\Sigma$ is the (ranked) input alphabet,
– $\Delta$ is the (ranked) output alphabet,
– $Q$ is a ranked alphabet of states such that $Q = Q_1$,
– $R$ is a set of rules of the form

$$q[s[x_1, \ldots, x_k]] \to c[q_1[x_{i_1}], \ldots, q_n[x_{i_n}]]$$

  where $q_1, \ldots, q_n, q \in Q$, $s \in \Sigma_k$, $k \in \mathbb{N}$, $i_1, \ldots, i_n \in \{1, \ldots, k\}$ and $c$ is a multicontext of rank $n$ over $\Delta$,
– and $q_0 \subseteq Q$ is a set of initial states.

The Marbles implementation is again fairly close to the theory, but with rules being of a quite interesting type: Map[(F,String),Set[(VarTree[T],Seq[(String,Int)])]]. Here, (F,String) quite obviously corresponds to the left-hand side, but the right-hand side is more complex: each VarTree has a number of variables that may be larger or smaller than the number of subtrees of $s$, so the Seq of (String,Int) records what state should be used for each particular variable, and what subtree of $s$ should be inserted at that point. Obviously, the Seq needs to have the same size as the number of variables in the VarTree.

We forego formal definitions of the computations of TDFTT at this time to focus on what makes TDFTT fundamentally different from BUFTT. In short: instead of choosing a tree based on symbol and states from below, and inserting the subtrees at their respective places, we work from the top, transforming the tree and nondeterministically choosing the states and trees as we move downward. This becomes relevant only when the transducer is non-linear, in the sense that subtrees are copied during the processing. Specifically, in TDFTT, we can initiate processing of two copies of the same subtree using two different states, while in BUFTT any processing will already be complete by the time we are able to apply any copying. This important distinction means that there are relations that can be defined by BUFTT but not by TDFTT, and vice versa. This relationship will be further explored in Section 5.

Weighted transducers In a similar way to how weighted automata associate a weight with a tree, weighted transducers associate a weight with an input-output tree pair. This is useful for multiple purposes, such as associating probabilities with various translations of a natural language sentence.


5 Algorithms on Tree Automata

In order to demonstrate that the implemented prototype serves one of the intended purposes of the complete Marbles system (i.e. as a means to quickly and easily test algorithms on various tree automata), several algorithms on tree automata were implemented. During the implementation work, certain problems turned out to be common to most, if not all, of the algorithms.

5.1 Functionalising tree automata algorithms

While it is quite possible to use an imperative programming style even in Scala, the language is designed to be used in a functional way, and many parts of the standard library, the collections framework among them, offer methods and structures that allow for easy functional programming. This can be contrasted with many if not most descriptions of algorithms in the literature, where imperative pseudocode seems prevalent. Converting the algorithms from imperative to functional style requires a rather deeper understanding of the fundamentals of the algorithm. For this reason, implementation is often slow to start, and may have to be restarted several times as the understanding of the problem grows. As compensation, the final algorithm implementation may in some cases be simpler, more elegant, and less bug-prone than the more traditional implementations. In addition, while implementation using imperative languages may at times be more straightforward, it most often still requires the programmer to consider various implementation details that are left undefined by the algorithm.

5.2 Further transducer background

In order to properly appreciate the example transducer splitting algorithms, we require some more theory and definitions. Recall the definitions in Subsection 4.4. By placing constraints on the structures, we can find different classes of tree transductions. Specifically, we call the class of tree transductions definable by TDFTT T, and the class definable by BUFTT B. Further, by restricting the number of rules with the same left-hand side to at most one, we arrive at the deterministic TDFTT and BUFTT, respectively; the classes of transductions definable by these are denoted by DT and DB. These, and the classes defined below, are all described in [Eng75]. Additionally, the proofs of the various relations between the classes can be found there, as well as the algorithms implemented below.

Additionally, we define the following constraints:
– A transducer is total deterministic, if there is exactly one right-hand side for each possible left-hand side.
– A transducer is linear, if each variable that occurs on the left-hand side of a rule occurs at most once in each right-hand side.
– A transducer is non-deleting, if each variable that occurs on the left-hand side of a rule occurs at least once in each right-hand side.
– A transducer is single-state, or pure, if $|Q| = 1$.

By prepending Dt, L, N, and P to T or B, we denote the class of transductions defined by TDFTT and BUFTT with the above constraints, respectively. For example, DLB is the class of transductions definable by deterministic linear BUFTT.

In Subsection 4.4, we briefly mentioned that there are transductions that can be defined by a BUFTT but not by a TDFTT, and vice versa, i.e. that T and B are incomparable:

\[ T \not\subseteq B \not\subseteq T \]


We further mentioned that the defining differences between BUFTT and TDFTT are that a BUFTT can process an input subtree nondeterministically, and then copy the results using a later rule, or alternatively discard the results entirely. A TDFTT, by contrast, can copy an input subtree first, and then apply different states to the two copies, or alternatively use the same state, but process the two copies with nondeterministic differences.

It would seem natural to use the restrictions detailed above to find a “natural” common subset of transductions of T and B. As it turns out, several such common subsets exist:

First of all, we note that by eliminating copying of subtrees (i.e. imposing linearity), all that makes TDFTT more powerful than BUFTT is eliminated, i.e.:

\[ LT \subsetneq LB \subsetneq B \]

Further restricting the deletion of subtrees would seem to eliminate any advantage of using BUFTT as well, leading us to state that

\[ NLB = NLT \subsetneq LT \]

This equality is not immediately applicable in this thesis, but an important subset of this class of transductions is, namely that of finite state relabelings (QREL), where we further restrict the right-hand sides of rules such that rules of TDFTT have the form

\[ q[s[x_1, \ldots, x_k]] \to s'[q_1[x_1], \ldots, q_k[x_k]] \]

and rules of BUFTT have the form

\[ s[q_1[x_1], \ldots, q_k[x_k]] \to q[s'[x_1, \ldots, x_k]] \]

where in both cases, $s'$ is a single symbol in $\Delta_k$.

A different subset of both B and T can be obtained by restricting the actual passing of (state) information, as long as no nondeterministic copying is allowed. That is,

PDtB = PDtT

Again, a formal proof is outside the scope of this thesis, but the equality is well-known. This set is also known as the set of tree homomorphisms (HOM).

While algorithms or properties could relatively easily be implemented in Marbles to identify homomorphisms and finite state relabelings, this is not currently available.

If $A$ and $B$ are classes of transductions, let $A \circ B$ denote the class of transductions obtained by first applying a transduction from $A$, and then applying one from $B$ to the result; that is, we compose a transduction in $A$ with one in $B$. It is well known that while string transductions are closed under composition (that is, one gains no additional possible string transductions by using two transducers rather than one), the same is not true for tree transducers, that is

\[ T \subsetneq T \circ T \]

and

\[ B \subsetneq B \circ B \]


However, it is further known that every TDFTT can be represented by a tree homomorphism followed by a linear TDFTT, both of which can be represented by BUFTT. In addition, each BUFTT can be represented by a finite state relabeling followed by a tree homomorphism. That is,

\[ T \subsetneq \mathrm{HOM} \circ LT \subsetneq B \circ B \]

and

\[ B \subsetneq \mathrm{QREL} \circ \mathrm{HOM} \subsetneq T \circ T \]

In the following sections, we will describe and implement algorithms that can be used in constructive proofs of the last of the above statements.

5.3 Bottom-up transducer splitting

Recall that a BUFTT is a 5-tuple $B = (\Sigma, \Delta, Q, R, F)$, and a TDFTT is a 5-tuple $T = (\Sigma, \Delta, Q, R, q_0)$, as defined in Subsection 4.4. Further, according to the above reasoning, the features of BUFTT that cannot be realised using a single TDFTT are that of deleting subtrees based on the features of that subtree, and that of copying an already processed output tree into multiple copies. However, a BUFTT can be realised using a finite state relabeling composed with a tree homomorphism, both of which can be implemented using TDFTT. Thus, we can decompose a BUFTT into two TDFTT.

The idea of the BUFTT decomposition algorithm is to define a “transitional” alphabet, which is used by the finite state relabeling to store information indicating by which tree-piece each node should be replaced to produce the final output tree. After that, a homomorphism is used to actually apply the replacements. Formally, let $B = (\Sigma, \Delta, Q, R, F)$ be the (input) BUFTT.

– $\Omega$ is the transitional alphabet, defined as follows: if

\[ s[q_1[x_1], \ldots, q_k[x_k]] \to q[t] \]

is a rule in $R$, then $d_t$ is a (new) symbol in $\Omega_k$.
– We define the TDFTT $T_1 = (\Sigma, \Omega, Q, R_1, F)$, a finite state relabeling, with $R_1$ defined as follows: if

\[ s[q_1[x_1], \ldots, q_k[x_k]] \to q[t] \]

is a rule in $R$, then

\[ q[s[x_1, \ldots, x_k]] \to d_t[q_1[x_1], \ldots, q_k[x_k]] \in R_1 \]

– We define the TDFTT $T_2 = (\Omega, \Delta, \{q_{only}\}, R_2, \{q_{only}\})$, a tree homomorphism, with $R_2$ defined as follows: $R_2 = \{ q_{only}[d_t[x_1, \ldots, x_k]] \to t' \mid d_t \in \Omega_k \}$, where $t'$ is the tree obtained by replacing each occurrence of a variable $x_i$ in $t$ by $q_{only}[x_i]$.

Note that $\Omega$ is finite, as $R$ has a finite number of right-hand sides. Further, the rules in $R_1$ all conform to the pattern specified for TDFTT implementing finite state relabelings, meaning that the transduction defined by $T_1$ is in QREL. For $T_2$, note that its state set is a singleton, and that we define exactly one rule in $R_2$ for every possible left-hand side, meaning that the transduction defined by $T_2$ is in PDtT = HOM.
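The construction can be sketched in a few lines of Scala. This is an illustration on simplified data types, not the Marbles implementation: Tree, BuRule, TdRule and the string naming of the symbols $d_t$ are all assumptions made for this sketch. Top-down right-hand sides follow the convention described in Subsection 4.4, where variable j of the output tree is paired with entry j of a list of (state, input subtree index) pairs.

```scala
object BuSplitSketch {
  sealed trait Tree
  case class Node(sym: String, kids: List[Tree] = Nil) extends Tree
  case class Var(i: Int) extends Tree // Var(0) stands for x1, Var(1) for x2, ...

  // Bottom-up rule  sym[q1[x1], ..., qk[xk]] -> state[rhs]
  case class BuRule(sym: String, childStates: List[String], state: String, rhs: Tree)
  // Top-down rule  state[sym[x1, ..., xk]] -> rhs, where calls(j) gives the
  // state and input subtree index used for variable Var(j) of rhs
  case class TdRule(state: String, sym: String, rhs: Tree, calls: List[(String, Int)])

  def show(t: Tree): String = t match {
    case Var(i)       => s"x${i + 1}"
    case Node(s, Nil) => s
    case Node(s, ks)  => s + ks.map(show).mkString("[", ",", "]")
  }

  // Name of the transitional symbol d_t for a right-hand side t
  def d(t: Tree): String = "d_" + show(t)

  def split(rules: List[BuRule]): (List[TdRule], List[TdRule]) = {
    // T1: relabel each node with d_t and push the child states downwards
    val relabeling = rules.map { r =>
      val k = r.childStates.size
      TdRule(r.state, r.sym, Node(d(r.rhs), List.tabulate(k)(Var(_))),
             r.childStates.zipWithIndex)
    }
    // T2: one rule per transitional symbol, replacing d_t by t
    val hom = rules.map(r => (r.rhs, r.childStates.size)).distinct.map {
      case (t, k) => TdRule("q_only", d(t), t, List.tabulate(k)(i => ("q_only", i)))
    }
    (relabeling, hom)
  }

  // A small BUFTT with three states, used only to exercise the sketch
  val x1 = Var(0); val x2 = Var(1)
  val example = List(
    BuRule("c", Nil, "q_even", Node("i")),
    BuRule("b", List("q_even"), "q_odd", Node("g", List(x1))),
    BuRule("b", List("q_even"), "q_odd", Node("h", List(x1))),
    BuRule("b", List("q_odd"), "q_even", Node("g", List(x1))),
    BuRule("b", List("q_odd"), "q_even", Node("h", List(x1))),
    BuRule("a", List("q_even", "q_odd"), "q_a", Node("f", List(x2, x2))),
    BuRule("a", List("q_odd", "q_even"), "q_a", Node("f", List(x1, x1)))
  )
}
```

Every input rule yields one relabeling rule, while the homomorphism needs only one rule per distinct right-hand side, which is why the seven rules of the sample transducer give a transitional alphabet of five symbols.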


BUFTT splitting example In lieu of a formal proof of $L(B) = L(T_1 \circ T_2)$, we provide an example transduction using the two methods to illustrate how each step of the transduction is preserved in the decomposition.

Consider the example BUFTT $B = (\Sigma, \Delta, Q, R, F)$ where
– $\Sigma = \{a^{(2)}, b^{(1)}, c^{(0)}\}$
– $\Delta = \{f^{(2)}, g^{(1)}, h^{(1)}, i^{(0)}\}$
– $Q = \{q_a, q_{even}, q_{odd}\}$
– $R$ contains the rules

\[
\begin{array}{l}
c \to q_{even}[i] \\
b[q_{even}[x_1]] \to q_{odd}[g[x_1]] \\
b[q_{even}[x_1]] \to q_{odd}[h[x_1]] \\
b[q_{odd}[x_1]] \to q_{even}[g[x_1]] \\
b[q_{odd}[x_1]] \to q_{even}[h[x_1]] \\
a[q_{even}[x_1], q_{odd}[x_2]] \to q_a[f[x_2, x_2]] \\
a[q_{odd}[x_1], q_{even}[x_2]] \to q_a[f[x_1, x_1]]
\end{array}
\]

– $F = \{q_a\}$

That is, each input subtree of the form $b[b[\ldots b[c]]]$ that consists of an odd number of $b$s is copied (including the nondeterministic differences), while a subtree with an even number of $b$s is deleted. We place heavy constraints on which trees are accepted in order to make the rule set smaller and simpler. Notably, we only accept input trees of the form $a[b[b[\ldots b[c] \ldots]], b[b[\ldots b[c] \ldots]]]$.

As our standing example input in dealing with this transduction, we use the simple, valid input tree $a[b[b[c]], b[c]]$, which results in the output tree set $\{f[g[i], g[i]], f[h[i], h[i]]\}$, according to the rules governing BUFTT.

Using the BUTSplitter algorithm to split $B$, we start by finding the relevant intermediate alphabet:

\[ \Omega = \{d_i, d_{g[x_1]}, d_{h[x_1]}, d_{f[x_2,x_2]}, d_{f[x_1,x_1]}\} \]

where $d_i \in \Omega_0$, $d_{g[x_1]}$ and $d_{h[x_1]}$ are in $\Omega_1$, and $d_{f[x_2,x_2]}$ and $d_{f[x_1,x_1]}$ are in $\Omega_2$.

With this done, we move on to the first TDFTT $T_1$ (the finite state relabeling), where
– $\Sigma = \{a^{(2)}, b^{(1)}, c^{(0)}\}$,
– $\Delta = \Omega = \{d_i, d_{g[x_1]}, d_{h[x_1]}, d_{f[x_2,x_2]}, d_{f[x_1,x_1]}\}$,
– $Q = \{q_a, q_{even}, q_{odd}\}$
– $R$ contains the rules

\[
\begin{array}{l}
q_{even}[c] \to d_i \\
q_{even}[b[x_1]] \to d_{g[x_1]}[q_{odd}[x_1]] \\
q_{even}[b[x_1]] \to d_{h[x_1]}[q_{odd}[x_1]] \\
q_{odd}[b[x_1]] \to d_{g[x_1]}[q_{even}[x_1]] \\
q_{odd}[b[x_1]] \to d_{h[x_1]}[q_{even}[x_1]] \\
q_a[a[x_1, x_2]] \to d_{f[x_1,x_1]}[q_{odd}[x_1], q_{even}[x_2]] \\
q_a[a[x_1, x_2]] \to d_{f[x_2,x_2]}[q_{even}[x_1], q_{odd}[x_2]]
\end{array}
\]

– $q_0 = \{q_a\}$

Note that even though only one of the two subtrees of the $d_{f[\ldots]}$ is used in the final output, we still need to do computations on both subtrees to determine that their heights are correct (even and odd, respectively). Any rejection of an input tree happens in this transducer, through failing to apply rules to a subtree.


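The homomorphism $T_2$ that completes the tree transduction is comparatively simple; in TDFTT form:
– $\Sigma = \Omega = \{d_i, d_{g[x_1]}, d_{h[x_1]}, d_{f[x_2,x_2]}, d_{f[x_1,x_1]}\}$,
– $\Delta = \{f^{(2)}, g^{(1)}, h^{(1)}, i^{(0)}\}$
– $Q = \{q_{only}\}$
– $R$ contains the rules

\[
\begin{array}{l}
q_{only}[d_i] \to i \\
q_{only}[d_{g[x_1]}[x_1]] \to g[q_{only}[x_1]] \\
q_{only}[d_{h[x_1]}[x_1]] \to h[q_{only}[x_1]] \\
q_{only}[d_{f[x_1,x_1]}[x_1, x_2]] \to f[q_{only}[x_1], q_{only}[x_1]] \\
q_{only}[d_{f[x_2,x_2]}[x_1, x_2]] \to f[q_{only}[x_2], q_{only}[x_2]]
\end{array}
\]

– $q_0 = \{q_{only}\}$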

The rule set of this homomorphism should be sufficient to demonstrate how deletion and copying are translated in the BUTSplitter algorithm. Specifically, the $d_{f[\ldots]}$ rules are again the rules that are of interest.

By applying only the relabeling $T_1$ to the input tree $a[b[b[c]], b[c]]$, we arrive at the intermediate tree set

df [x2,x2][dh[x1][dh[x1][i]], dh[x1][i]],

df [x2,x2][dh[x1][dg[x1][i]], dh[x1][i]],

df [x2,x2][dg[x1][dh[x1][i]], dh[x1][i]],

df [x2,x2][dg[x1][dg[x1][i]], dh[x1][i]],

df [x2,x2][dh[x1][dh[x1][i]], dg[x1][i]],

df [x2,x2][dh[x1][dg[x1][i]], dg[x1][i]],

df [x2,x2][dg[x1][dh[x1][i]], dg[x1][i]],

df [x2,x2][dg[x1][dg[x1][i]], dg[x1][i]]

Obviously, all derivations starting with the rule

\[ q_a[a[x_1, x_2]] \to d_{f[x_1,x_1]}[q_{odd}[x_1], q_{even}[x_2]] \]

will fail, as it is not possible to construct a successful run from $q_{odd}$ on the subtree $b[b[c]]$, nor from $q_{even}$ on $b[c]$. Nevertheless, the number of trees may look strange, given that the correct output according to $B$ is a set of only two trees. However, the homomorphism will delete the larger subtrees (which are all that differs within each of the two groups of four trees), leaving the correct set of two trees:

\[ \{ f[g[i], g[i]],\; f[h[i], h[i]] \} \]
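The example can be replayed with a small interpreter for nondeterministic top-down transducers. The following is a sketch under simplifying assumptions and is not taken from Marbles: trees carry string symbols, the transitional symbols are abbreviated to plain names (d_f11 for $d_{f[x_1,x_1]}$ and d_f22 for $d_{f[x_2,x_2]}$), and a right-hand side is an output tree whose variable j refers to entry j of a list of (state, input subtree index) pairs.

```scala
object TdRunSketch {
  sealed trait Tree
  case class Node(sym: String, kids: List[Tree] = Nil) extends Tree
  case class Var(j: Int) extends Tree // refers to entry j of the rule's call list

  type Rhs = (Tree, List[(String, Int)])
  type Rules = Map[(String, String), Set[Rhs]]

  def rhs(t: Tree, calls: (String, Int)*): Rhs = (t, calls.toList)

  // All outputs of processing input tree t in state q
  def run(rules: Rules, q: String, t: Tree): Set[Tree] = t match {
    case Node(sym, kids) =>
      rules.getOrElse((q, sym), Set.empty[Rhs]).flatMap {
        case (out, calls) => expand(rules, out, calls, kids)
      }
    case Var(_) => Set.empty[Tree] // input trees contain no variables
  }

  // Instantiate the variables of a right-hand side in every possible way
  def expand(rules: Rules, out: Tree, calls: List[(String, Int)],
             kids: List[Tree]): Set[Tree] = out match {
    case Var(j) =>
      val (state, child) = calls(j)
      run(rules, state, kids(child))
    case Node(sym, outKids) =>
      // Cartesian product of the alternatives for each child
      val choices = outKids.foldRight(Set(List.empty[Tree])) { (kid, acc) =>
        for (k <- expand(rules, kid, calls, kids); rest <- acc) yield k :: rest
      }
      choices.map(ks => Node(sym, ks): Tree)
  }

  val v0 = Var(0); val v1 = Var(1)
  val t1: Rules = Map( // the relabeling
    ("q_even", "c") -> Set(rhs(Node("d_i"))),
    ("q_even", "b") -> Set(rhs(Node("d_g", List(v0)), ("q_odd", 0)),
                           rhs(Node("d_h", List(v0)), ("q_odd", 0))),
    ("q_odd", "b")  -> Set(rhs(Node("d_g", List(v0)), ("q_even", 0)),
                           rhs(Node("d_h", List(v0)), ("q_even", 0))),
    ("q_a", "a")    -> Set(rhs(Node("d_f11", List(v0, v1)), ("q_odd", 0), ("q_even", 1)),
                           rhs(Node("d_f22", List(v0, v1)), ("q_even", 0), ("q_odd", 1)))
  )
  val t2: Rules = Map( // the homomorphism
    ("q", "d_i")   -> Set(rhs(Node("i"))),
    ("q", "d_g")   -> Set(rhs(Node("g", List(v0)), ("q", 0))),
    ("q", "d_h")   -> Set(rhs(Node("h", List(v0)), ("q", 0))),
    ("q", "d_f11") -> Set(rhs(Node("f", List(v0, v1)), ("q", 0), ("q", 0))),
    ("q", "d_f22") -> Set(rhs(Node("f", List(v0, v1)), ("q", 1), ("q", 1)))
  )

  val c = Node("c")
  val input = Node("a", List(Node("b", List(Node("b", List(c)))), Node("b", List(c))))
  val intermediate = run(t1, "q_a", input)          // eight relabeled trees
  val output = intermediate.flatMap(run(t2, "q", _)) // two output trees
}
```

Representing the result of each step as a Set means that the four intermediate trees in each group collapse to a single output tree once the homomorphism has deleted the subtree in which they differ.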


5.4 Top-down transducer splitting

In the case of TDFTT, the capability that a single BUFTT is unable to reproduce is, as stated, that of copying an input subtree and using different processing for different copies, either through starting in different states or through nondeterminism having different outcomes. However, by using a tree homomorphism composed with a linear TDFTT, we can achieve the same effect, and as all linear TDFTT can be realised using BUFTT, a TDFTT can be decomposed into two BUFTT, implementing the transductions discussed above.

We again start by defining a transitional alphabet $\Omega$. However, instead of storing information about the nondeterminism, copying and deletion in the labels, the first BUFTT is a homomorphism, “exploding” the input tree into a much “wider” version of itself. Specifically, each subtree $s[t_1, \ldots, t_k]$ is converted into a new subtree $s[t_1^n, t_2^n, \ldots, t_k^n]$, where $t_1^n$ represents a sequence of $n$ copies of $t_1$. The number $n$ is chosen such that no variable occurs more than $n$ times in any right-hand side of any rule in the input TDFTT. After copying is done, a linear BUFTT does the computations required on all copies simultaneously, after which the unneeded subtrees are simply discarded. Formally:

– Let $T = (\Sigma, \Delta, Q, R, q_0)$ be the input TDFTT, then
– $\Omega$ is the transitional alphabet, defined as follows: if $n$ is the maximum number of copies made by any rule in $R$ of any subtree, and $s \in \Sigma_k$, then $s \in \Omega_{k \cdot n}$.
– We define a homomorphism, $B_1 = (\Sigma, \Omega, \{q_{only}\}, R_1, \{q_{only}\})$ where

\[ R_1 = \{ s[q_{only}[x_1], \ldots, q_{only}[x_k]] \to q_{only}[s[x_1^n, x_2^n, \ldots, x_k^n]] \mid k \in \mathbb{N}, s \in \Sigma_k \} \]

and $x_i^n$ denotes a sequence of $n$ copies of $x_i$.
– We define a linear BUFTT, $B_2 = (\Omega, \Delta, Q, R_2, q_0)$, where $R_2$ is discussed further below.

R2 is significantly more complex than R1 in the bottom-up case. Suppose that

\[ r_t = q[s[x_1, \ldots, x_k]] \to c[q_1[x_{i_1}], \ldots, q_l[x_{i_l}]] \]

is a rule in $R$. The first step in the transformation is to move to a linear rule from $\Omega$ to $\Delta$ instead of from $\Sigma$.

Intuitively, let $\langle p \rangle = \lceil p/n \rceil$ denote the position of the $p$th child of a symbol before each subtree was copied $n$ times. For example, $\langle 5 \rangle$ with $n = 3$ yields 2, as subtree 5 in the $\Omega$ tree is a copy of the second subtree in the original tree. Now, we turn the rule $r_t$ into the linear rule

\[ r_l = q[s[x_1, \ldots, x_{k \cdot n}]] \to c[q_1[x_{p_1}], \ldots, q_l[x_{p_l}]] \]

where $p_1, \ldots, p_l \in \{1, \ldots, k \cdot n\}$ are chosen in such a way that they are pairwise distinct and $\langle p_j \rangle = i_j$ for $j = 1, \ldots, l$. That is, we let each state $q_j$ work on its “own” (input) copy $x_{p_j}$ of the original input variable $x_{\langle p_j \rangle}$.
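The two conditions on the positions can be written down directly. The sketch below uses 1-based positions as in the text; the names original and validChoice are made up for this illustration.

```scala
object LinearIndexSketch {
  // <p> = ceil(p / n) for positions p >= 1, using integer arithmetic
  def original(p: Int, n: Int): Int = (p + n - 1) / n

  // Checks a candidate choice ps = p_1..p_l against is = i_1..i_l for a
  // symbol of original rank k and copy count n: all positions lie in
  // 1..k*n, are pairwise distinct, and satisfy <p_j> = i_j
  def validChoice(ps: List[Int], is: List[Int], k: Int, n: Int): Boolean =
    ps.size == is.size &&
      ps.distinct.size == ps.size &&
      ps.forall(p => 1 <= p && p <= k * n) &&
      ps.zip(is).forall { case (p, i) => original(p, n) == i }
}
```

For instance, with $k = 2$ and $n = 2$, a right-hand side that uses $x_1$ twice can be linearised with the positions 1 and 2, whereas the positions 1 and 1 (not distinct) or 1 and 3 (position 3 is a copy of $x_2$) are rejected.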

Now, in order to “flip” these rules around to make a BUFTT, we need to not only change the positions of the states and variables, but do so in a way that preserves both the ordering of the input subtrees and the state-subtree relations that are in the output of the above rules. In other words, for each subtree $q_j[x_{p_j}]$ in the right-hand side of a top-down rule, $q_j$ should be the state of the $p_j$th child of $s$ in the bottom-up rule. Though it is not hard to grasp intuitively that this is possible, defining the precise relationships takes more effort. Specifically, as the rank $k$ of the symbol $s$ may be greater than the number $l$ of subtrees that actually occur in the output, we need some way of handling the subtrees that are discarded. To this end, we introduce an “irrelevant” state, $q_{nop}$, for which we define transitions as follows:


For each symbol $s \in \Omega_k$, there is a rule

\[ r_s = s[q_{nop}[x_1], \ldots, q_{nop}[x_k]] \to q_{nop}[a] \]

in $R_2$, where $a \in \Delta_0$. Thus, $q_{nop}$ will be a resulting state of every subtree in $T_\Omega$.

Now, each linear rule

\[ r_l = q[s[x_1, \ldots, x_{k \cdot n}]] \to c[q_1[x_{p_1}], \ldots, q_l[x_{p_l}]] \]

is flipped into the bottom-up rule

\[ r_b = s[t_1, \ldots, t_{k \cdot n}] \to q[c[x_{p_1}, \ldots, x_{p_l}]] \]

where

\[
t_i = \begin{cases}
q_j[x_i] & \text{if } p_j = i \text{ for a } j \in \{1, \ldots, l\} \\
q_{nop}[x_i] & \text{if no such } j \text{ exists.}
\end{cases}
\]

$R_2$ thus consists of all rules $r_s$ and $r_b$ sketched above. Note that as $q_{nop}$ is not in $q_0$, and furthermore all rules $r_b$ discard any output subtrees associated with $q_{nop}$, whatever output is produced by the $r_s$ rules is irrelevant.
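The case distinction for $t_i$ amounts to a table lookup with $q_{nop}$ as the default. A minimal sketch, with a hypothetical helper childStates that computes only the states on the left-hand side of $r_b$ (positions are 1-based as in the text):

```scala
object FlipSketch {
  // calls: the (state q_j, position p_j) pairs of a linear rule r_l.
  // Returns the states of t_1 .. t_{k*n} on the left-hand side of r_b.
  def childStates(k: Int, n: Int, calls: List[(String, Int)]): List[String] = {
    val stateAt = calls.map { case (q, p) => p -> q }.toMap
    List.tabulate(k * n)(i => stateAt.getOrElse(i + 1, "q_nop"))
  }
}
```

With $k = 2$ and $n = 2$, a linear rule that applies a state q to $x_1$ and $x_2$ gives the left-hand side states (q, q, q_nop, q_nop), while one that applies q to $x_3$ and $x_4$ gives (q_nop, q_nop, q, q).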

TDFTT splitting example As our working example for splitting a TDFTT, we use a transduction very similar to the one used for the BUFTT variant. Consider $T = (\Sigma, \Delta, Q, R, q_0)$, where
– $\Sigma = \{a^{(2)}, b^{(1)}, c^{(0)}\}$
– $\Delta = \{f^{(2)}, g^{(1)}, h^{(1)}, i^{(0)}\}$
– $Q = \{q_a, q_{odd}, q_{even}\}$
– $R$ contains the rules

\[
\begin{array}{l}
q_a[a[x_1, x_2]] \to f[q_{odd}[x_1], q_{odd}[x_1]] \\
q_a[a[x_1, x_2]] \to f[q_{odd}[x_2], q_{odd}[x_2]] \\
q_{odd}[b[x_1]] \to g[q_{even}[x_1]] \\
q_{odd}[b[x_1]] \to h[q_{even}[x_1]] \\
q_{even}[b[x_1]] \to g[q_{odd}[x_1]] \\
q_{even}[b[x_1]] \to h[q_{odd}[x_1]] \\
q_{even}[c] \to i
\end{array}
\]

– $q_0 = \{q_a\}$

The transduction defined by $T$ differs from that of $B$ in the BUFTT example in several respects:
– No guarantee exists about the two output subtrees being equal; indeed, every combination of different possible outputs will exist in the output tree set.
– There is no longer any requirement that the unused subtree must be of even height; instead, two input subtrees of odd heights will result in the full outputs of both heights.

Though this example showcases several traits that are unique to TDFTT, one notable feature is unused: that of starting off the processing of two copies of the same subtree using different states. It is an interesting property, but requires rather more complex examples to demonstrate.

Looking at the rule set, it is immediately obvious that the maximum number of copies made in any rule application is 2. Going through the algorithm, this tells us that the intermediate alphabet $\Omega$ will be:

\[ \Omega = \{a^{(4)}, b^{(2)}, c^{(0)}\} \]


Continuing on, the homomorphism $B_1$ will be as follows:
– $\Sigma = \{a^{(2)}, b^{(1)}, c^{(0)}\}$
– $\Delta = \Omega = \{a^{(4)}, b^{(2)}, c^{(0)}\}$
– $Q = \{q_{only}\}$
– $R$ contains the rules

\[
\begin{array}{l}
c \to q_{only}[c] \\
b[q_{only}[x_1]] \to q_{only}[b[x_1, x_1]] \\
a[q_{only}[x_1], q_{only}[x_2]] \to q_{only}[a[x_1, x_1, x_2, x_2]]
\end{array}
\]

– $F = \{q_{only}\}$

As we mentioned while describing the algorithm, this is a fairly simple homomorphism, even more so than the one used for splitting BUFTT. Enough copying is applied at every stage that every possible combination of output trees can be produced in the next stage. Note that this will result in intermediate trees that are generally exponentially larger than the original trees.

The second BUFTT, $B_2$, will look like this:
– $\Sigma = \Omega = \{a^{(4)}, b^{(2)}, c^{(0)}\}$
– $\Delta = \{f^{(2)}, g^{(1)}, h^{(1)}, i^{(0)}\}$
– $Q = \{q_a, q_{odd}, q_{even}, q_{nop}\}$
– $R$ contains the rules

\[
\begin{array}{l}
a[q_{nop}[x_1], q_{nop}[x_2], q_{nop}[x_3], q_{nop}[x_4]] \to q_{nop}[i] \\
b[q_{nop}[x_1], q_{nop}[x_2]] \to q_{nop}[i] \\
c \to q_{nop}[i] \\[1ex]
a[q_{odd}[x_1], q_{odd}[x_2], q_{nop}[x_3], q_{nop}[x_4]] \to q_a[f[x_1, x_2]] \\
a[q_{nop}[x_1], q_{nop}[x_2], q_{odd}[x_3], q_{odd}[x_4]] \to q_a[f[x_3, x_4]] \\
b[q_{odd}[x_1], q_{nop}[x_2]] \to q_{even}[g[x_1]] \\
b[q_{odd}[x_1], q_{nop}[x_2]] \to q_{even}[h[x_1]] \\
b[q_{even}[x_1], q_{nop}[x_2]] \to q_{odd}[g[x_1]] \\
b[q_{even}[x_1], q_{nop}[x_2]] \to q_{odd}[h[x_1]] \\
c \to q_{even}[i]
\end{array}
\]

– $F = q_0 = \{q_a\}$

While $\Sigma$, $\Delta$ and $F$ offer no surprises, $Q$ has been supplemented by the $q_{nop}$ state, to make sure that every subtree (even an irrelevant one) at least gives rise to some output, and does not halt the execution of the automaton. The rules in $R$ have also been changed in interesting ways:
– Three new rules have been introduced to handle the $q_{nop}$ state, simply moving from $q_{nop}$ in all subtrees to produce a $q_{nop}$, with the simplest possible output tree (a leaf).
– The b rules necessarily deal with a second, irrelevant, subtree, but simply assuming that it results in $q_{nop}$, and not including it in the output, is sufficient to reach the desired outcome.
– The a rules are further complicated by not only discarding subtrees using $q_{nop}$ matching and exclusion from the output, but also by including two copies of the same subtree that have been processed differently. This is obviously a requirement in order to implement the same transduction as $T$.


Fig. 3. The example input tree a[b[c], b[b[c]]]

Showing the correctness of $B_1 \circ B_2$ is again more complex than in the bottom-up case, mainly because the first step, $B_1$, is the relatively trivial one, which does no actual state-based processing and rejects no input trees. However, running $T$ on the example input tree $a[b[c], b[b[c]]]$, shown in Figure 3, we can quickly determine that this results in the output tree set

\[ \{ f[g[i], g[i]],\; f[g[i], h[i]],\; f[h[i], g[i]],\; f[h[i], h[i]] \} \]

Further, running $B_1$ on the same input tree results in the $\Omega$-tree $t_\Omega = a[b[c, c], b[c, c], b[b[c, c], b[c, c]], b[b[c, c], b[c, c]]]$ shown in Figure 4.

Fig. 4. The output of the homomorphism B1, run on the tree a[b[c], b[b[c]]]
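The effect of the “exploding” homomorphism is easy to reproduce on a plain tree type. The sketch below is an assumption-laden illustration rather than the Marbles code: Tree is a minimal case class with string symbols, and explode applies the copying directly instead of going through transducer rules.

```scala
object ExplodeSketch {
  case class Tree(sym: String, kids: List[Tree] = Nil) {
    override def toString: String =
      if (kids.isEmpty) sym else sym + kids.mkString("[", ", ", "]")
  }

  // Replace each direct subtree by n copies of its exploded version
  def explode(t: Tree, n: Int): Tree =
    Tree(t.sym, t.kids.flatMap(k => List.fill(n)(explode(k, n))))

  // Number of nodes in a tree
  def size(t: Tree): Int = 1 + t.kids.map(size).sum

  val c = Tree("c")
  val input = Tree("a", List(Tree("b", List(c)), Tree("b", List(Tree("b", List(c))))))
  val exploded = explode(input, 2)
}
```

Here exploded prints as the tree $t_\Omega$ given above, and the node count grows from 6 to 21, which illustrates the size increase mentioned for the intermediate trees.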

Showing the complete output of $B_2$ on $t_\Omega$ seems pointless, as this is already postulated to be the same as for $T$ on $t$. Instead, we show a number of intermediate trees of a run of $B_2$ on $t_\Omega$. In Figure 5, we show two successive intermediate trees of such a run, between which a single application of a b-rule has deleted an irrelevant $q_{nop}$ subtree and generated a g output symbol. Note that in this case, the run will not be successful, as the right-most subtree of a will terminate in $q_{even}$ instead of $q_{nop}$.

Supposing that the right-most a subtree instead had used the $q_{nop}$-rules exclusively, we could arrive at the left-hand intermediate tree in Figure 6. Note that the left-most two subtrees were computed from two copies of the same subtree of $t$ (as were the right-most two, obviously). Applying the relevant a-rule (discarding the two irrelevant $q_{nop}$ subtrees, and using the two different copies of $b[c]$), we arrive at the right-hand intermediate tree, which is of the form $q[t']$, $q \in F$, $t' \in T_\Delta$, meaning $t'$ is an output tree of $(B_1 \circ B_2)(t)$.


Fig. 5. Two intermediate trees, before and after the application of a b-rule

Fig. 6. Two intermediate trees, before and after the application of an a-rule


5.5 Splitting algorithm implementation

Implementing the above algorithms was for the most part a question of translating the concepts to Scala equivalents. The bottom-up splitter in particular required very little in the way of specific methods and posed few programming difficulties once the steps of the algorithm had been properly defined. Of particular use was the fact that Trees can have any type T as elements. This meant that the intermediate alphabet Ω could be realised using the original VarTrees as elements, with the finite state relabeling simply inserting the trees at the proper nodes of the original tree:

    class BUTSplitter[F,T](val but: BUTreeTransducer[F,T]) {

      // The alphabet of right-hand sides of the bottom-up transducer
      private val omega = new RankedAlphabet(
        (for (rhs <- but.rules.values;
              v <- rhs)
          yield (v._1, v._1.rank)).toMap
      )

      val rel: TDTreeTransducer[F,VarTree[T]] = // Relabeling
        new TDTreeTransducer(
          but.sigma,  // Same input alphabet
          omega,      // Right-hand sides as 'state markers'
          but.states, // Same states in the relabeling
          // The for-loop below creates a sequence of pairs which fit
          // into a TD transducer; however, the right-hand sides still
          // need to be organised into sets to make a proper map
          (for (((sym, states), pairs) <- but.rules.toList;
                (tree, state) <- pairs) yield
            (
              (sym, state),
              // Note: tree is the root of a height-1 tree
              (VarTree(tree, tree.rank), states.zipWithIndex)
            )).groupBy(_._1).map { case (lhs, rhss) =>
              (lhs, rhss.map(_._2).toSet) },
          but.fin // The final states become the initial states
        )

      val hom: TDTreeTransducer[VarTree[T],T] = // Homomorphism
        new TDTreeTransducer(
          omega,     // Output from the relabeling is input for this
          but.delta, // While the output is that of the original
          Set("q"),  // Only a single state needed for a homomorphism
          (for ((tree, rank) <- omega.map) yield
            ((tree, "q"),
             Set((tree, Seq.fill(rank)("q").zipWithIndex)))).toMap,
          Set("q")   // The single state is also initial
        )
    }


The specific lines constructing the rule sets may warrant further explanation:

    (for (((sym, states), pairs) <- but.rules.toList;
          (tree, state) <- pairs) yield
      (
        (sym, state),
        // Note: tree is the root of a height-1 tree
        (VarTree(tree, tree.rank), states.zipWithIndex)
      )).groupBy(_._1).map { case (lhs, rhss) =>
        (lhs, rhss.map(_._2).toSet) },

Here, we begin by extracting the components of each rule and placing them into the variables sym and states for the left-hand side of the rule, and tree and state for the right-hand side. The next four lines produce the list of rules of the relabeling, where sym and state together make up the left-hand side of a top-down rule. The VarTree produced by the factory method VarTree[T](rt:T,rk:Int) is a one-level tree, where rt is the root, and its subtrees are variables numbered 0 to rk-1. The states.zipWithIndex makes a Seq[(String,Int)] that propagates the states that were previously on the left-hand side downwards into the tree, completing the top-down right-hand side.

The last two lines use various Scala constructs to organise the List of individual rules produced by the for-loop into a Map from a specific left-hand side ((F,String)) to a Set of right-hand sides. They do this by first using groupBy to transform the list of rules to a Map from left-hand sides to complete rules. Then, a two-level map is applied to keep the left-hand side as it is in the outer Map, but discard it in the internal List of rules, finally using toSet to arrive at a Set of right-hand sides. The components of the tuples are either matched out using pattern matching, or accessed through the _1 and _2 instance variables, which are defined in the Tuple classes.
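The grouping step can be tried in isolation. In the sketch below, the rule components are replaced by plain strings (an assumption made purely to keep the example self-contained); only groupBy, map and toSet from the standard collections library are used.

```scala
object GroupRulesSketch {
  // (left-hand side, right-hand side) pairs, as a for-loop might yield them
  val pairs: List[((String, String), String)] = List(
    (("b", "q_odd"), "d_g"),
    (("b", "q_odd"), "d_h"),
    (("c", "q_even"), "d_i")
  )

  // groupBy keeps the complete pairs as values; the inner map drops the
  // left-hand sides again before the right-hand sides are collected in a Set
  val rules: Map[(String, String), Set[String]] =
    pairs.groupBy(_._1).map { case (lhs, rhss) => (lhs, rhss.map(_._2).toSet) }
}
```

The two pairs sharing the left-hand side ("b", "q_odd") end up in one map entry with a two-element set, while ("c", "q_even") gets a singleton set.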


5.6 Top-down splitter implementation

Implementing the TDFTT splitting algorithm posed a much more significant challenge, for many of the same reasons detailed in Subsection 5.4, in particular the translation of indices from the rules of the input automaton to those of the second output automaton. For this reason, it is impractical to include the complete implementation of the algorithm. Still, there are a number of functions and constructions that warrant further study.

The initial homomorphism that “explodes” the input tree looks much as one would expect, while the linear automaton is significantly more complex than the relabeling used in the bottom-up variant. Specifically, the index changes required to make each input subtree end up in its proper place require quite a lot of bookkeeping. This is all handled by the mangleIndices function, which takes as input the state-index pairs of a single input rule$^1$, and outputs two Maps: the first is suitable for use in a substitution on the right-hand side VarTree, replacing the previous top-down related indices with new ones corresponding to the new exploded input. The second map connects the new indices with the state that should be expected there (that is, the state that would be used there if the rule were top-down). We show here the inner function, mangleIndicesInternal, which recurses over the input state-index pairs and assigns the new indices based on the degree of copying and the number of copies seen so far.

    private def mangleIndicesInternal(stsixs: Seq[(String,Int)],
                                      ixmap: Map[Int,Int]
                                     ): List[(String,Int)] =
      stsixs.headOption match {
        case Some((s, x)) => // There are still states to process
          val occ = ixmap.getOrElse(x, 0) // Has this tree already
                                          // occurred in the sequence?
          (s, x*maxCopy + occ) :: // The state is unchanged, but
                                  // we compute and insert the new index
            mangleIndicesInternal(stsixs.tail, ixmap.updated(x, occ + 1))
        case None => Nil
      }

With this detail solved in a proper manner, the rest of the transducer construction boiled down to applying the maps to the correct constructs in the proper order and then using the now familiar groupBy construction to organise the rules into sets according to the left-hand side.
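To see the index computation at work, the recursion can be exercised outside of the splitter class. The following standalone variant is a sketch: it fixes maxCopy to 2 (in the splitter this value would be derived from the input rules) and uses list pattern matching instead of headOption, but computes the same new indices x*maxCopy + occ.

```scala
object MangleSketch {
  val maxCopy = 2 // assumed here: no subtree is copied more than twice

  // Assigns to each (state, old index) pair the index of the next unused
  // copy of that subtree in the exploded tree (all indices 0-based)
  def mangle(stsixs: List[(String, Int)],
             seen: Map[Int, Int] = Map.empty): List[(String, Int)] =
    stsixs match {
      case (s, x) :: rest =>
        val occ = seen.getOrElse(x, 0) // copies of subtree x used so far
        (s, x * maxCopy + occ) :: mangle(rest, seen.updated(x, occ + 1))
      case Nil => Nil
    }
}
```

Two uses of subtree 0 are mapped to the exploded indices 0 and 1, and two uses of subtree 1 to the indices 2 and 3, so that every state receives its own copy.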

$^1$ Recall that the implementation of top-down tree transducers uses VarTrees where the variables are numbered 0 to l−1 rather than 0 to k−1, relying instead on a sequence of state-index pairs to process the proper subtrees and place them at the correct position.


6 Conclusions and future work

While the larger goal of having a usable GUI and proper Java integration was not attained, a viable prototype of the Marbles system was eventually produced. That it is sufficient for certain algorithm implementation and exploration tasks has been shown in Section 5, although more thorough and varied tests would obviously have been ideal.

In particular, further testing of algorithms on weighted automata could have exposed weaknesses that would require changes to the basic architecture used in their implementation, as was done for the unweighted automata.

Due to the amount of time that has elapsed between the initiation of this project and its conclusion, it is currently unknown to the author whether any other project of similar scope and vision has been successfully initiated, though personal communication with people in the field seems to indicate that it has not. Regardless, it is likely that the Marbles system will receive further attention and development, provided that uses are found for the current prototype. Several additions, expansions and changes would greatly improve the chances of this being the case:

– First of all, several of the current automata types share code in various ways, and it might be of interest to generalise this in some way. For example, there is little reason, apart from efficiency, to implement the unweighted automata as anything but a specific subtype of the weighted type. Indeed, even the efficiency argument has not been tested, and as the framework is intended not to act as a tool to produce production-ready code, but to test various algorithms, it is not likely to be an issue.

– For various typing reasons, it might be more prudent to make the automata parameterised not in the type of the alphabet, but in the various alphabets themselves. It is not known at this time if even Scala's powerful typing system is capable of this.

– Implementing various other types of automata, apart from those outlined in this thesis, would generally be a large concern. Having more types of automata to test one's algorithms on is likely to produce unexpected benefits. In particular, macro tree transducers, attributed tree transducers, various context-free tree automata, and diverse rewrite systems are of interest, as well as e.g. residual tree automata.

– Using the ideas found in [Cle08], one might implement the notion of language class taxonomies, and attempt to find common constructions and define new types of automata with potentially interesting properties.

– The original intention was to have a small proof-of-concept Java integration project ready to show how the prototype could be used by client programs written by researchers. This was not done, but considering the ease with which generic Scala code can be used in Java projects, it is likely that Java integration will not pose much of a problem. The main concerns lie in documenting how getter and setter methods work over the Java/Scala boundary. Alternatively, one could define suitable interface traits through which Java code would work. However, which method is the most suitable for this particular application is not known at this time, and further implementation work would have to be done to find out.

– In Treebag, most of the “exploration” of the effect of tree transducers and tree grammar rules is done in the graphical workbench UI. The project plan called for a similar interface to be available for the Marbles prototype, but for several reasons, this proved to be less than feasible. The first and most important reason is that time was short, and having a usable prototype at all had a higher priority than any graphical interfaces. However, during the course of the project, the thought of a Treebag-type graphical interface, “stepping” through the individual steps of the computations, seemed less appealing than simply running the algorithm on the automata and outputting the result in some human-readable form. Thus, effort was made to make the parsing and “unparsing” system more powerful, rather than to complicate things with some graphical interface toolkit.

7 Acknowledgements

Many people have played various roles in making sure that this thesis has beenbrought to conclusion. My supervisor Frank Drewes has contributed not only withadvice, encouragement, administerial expertise and (occasional) prodding, but ex-tensive editorial comments on the various drafts as well. For this and much more, Iowe him thanks in abundance.

Brink van der Merwe of the University of Stellenbosch, who served as my assis-tant supervisor, likewise has contributed immensely, not only as my on-site super-visor during my stay in Stellenbosch, but in many other ways, both in interactionswith administration and in otherwise making my integration into South Africa arelatively smooth one.

Many other people, in Sweden and in South Africa, also deserve my thanks, though any sort of list will necessarily be incomplete. In no particular order, a number of names come to mind, however: Hanna, Gun, Lars, Tove, Gideon, McElory, Johan, Johanna, Rickard, Tonje, Jock, Benjamin and Eric. All of these and more have had a part to play in making this thesis into what it is.


References

[AJMd02] Parosh Aziz Abdulla, Bengt Jonsson, Pritha Mahata, and Julien d’Orso. Regular tree model checking. In E. Brinksma and K. Guldstrand Larsen, editors, Proc. 14th Intl. Conf. on Computer Aided Verification (CAV’02), volume 2404 of Lecture Notes in Computer Science, pages 555–568, 2002.

[CDG+02] Hubert Comon, Max Dauchet, Rémi Gilleron, Florent Jacquemard, Denis Lugiez, Sophie Tison, and Marc Tommasi. Tree Automata Techniques and Applications, 2002. Internet publication available at http://www.grappa.univ-lille3.fr/tata.

[Cho56] Noam Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2:113–124, 1956.

[Cle08] Loek Cleophas. Tree algorithms: two taxonomies and a toolkit. PhD thesis, Technische Universiteit Eindhoven, 2008.

[Cle09] Loek Cleophas. Forest fire and fire wood: Tools for tree automata and tree algorithms. In Proceedings of the 2009 conference on Finite-State Methods and Natural Language Processing: Post-proceedings of the 7th International Workshop FSMNLP 2008, pages 191–198, Amsterdam, The Netherlands, 2009. IOS Press.

[DKV09] Manfred Droste, Werner Kuich, and Heiko Vogler. Handbook of Weighted Automata. Springer Publishing Company, Incorporated, 2009.

[Dre98] Frank Drewes. TREEBAG—a tree-based generator for objects of various types. Report 1/98, Univ. Bremen, 1998.

[Dre09] Frank Drewes. Towards the tree automata workbench Marbles, 2009.

[Eng75] Joost Engelfriet. Tree automata and tree grammars. Technical Report DAIMI FN-10 (Lecture Notes), Aarhus University, 1975.

[GB03] Thomas Genet and Yohan Boichut. Timbuk - for reachability analysis and tree automata calculations, 2003.

[HPJW+92] Paul Hudak, Simon Peyton Jones, Philip Wadler, Brian Boutel, Jon Fairbairn, Joseph Fasel, María M. Guzmán, Kevin Hammond, John Hughes, Thomas Johnsson, Dick Kieburtz, Rishiyur Nikhil, Will Partain, and John Peterson. Report on the programming language Haskell: a non-strict, purely functional language version 1.2. SIGPLAN Not., 27(5):1–164, May 1992.

[KG05] Kevin Knight and Jonathan Graehl. An overview of probabilistic tree transducers for natural language processing. In Alexander F. Gelbukh, editor, Proc. 6th Intl. Conf. on Computational Linguistics and Intelligent Text Processing (CICLing 2005), volume 3406 of Lecture Notes in Computer Science, pages 1–24. Springer, 2005.

[MK06] Jonathan May and Kevin Knight. Tiburon: A weighted tree automata toolkit. In Oscar Ibarra and Hsu-Chun Yen, editors, Implementation and Application of Automata, volume 4094 of Lecture Notes in Computer Science, pages 102–113. Springer Berlin / Heidelberg, 2006.

[NP92] Maurice Nivat and Andreas Podelski, editors. Tree Automata and Languages. Elsevier, Amsterdam, 1992.

[OMM+04] Martin Odersky, Stéphane Micheloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, Matthias Zenger, et al. An overview of the Scala programming language. Technical report, 2004.

[OSV11] Martin Odersky, Lex Spoon, and Bill Venners. Programming in Scala. Artima Inc, 2 edition, 2011.

[RSB05] Harald Raffelt, Bernhard Steffen, and Therese Berg. LearnLib: a library for automata learning and experimentation. In Proceedings of the 10th international workshop on Formal methods for industrial critical systems, FMICS ’05, pages 62–71, New York, NY, USA, 2005. ACM.

[Sch07] Thomas Schwentick. Automata for XML – a survey. Journal of Computer and System Sciences, 73(3):289–315, 2007.

[Sip06] Michael Sipser. Introduction to the Theory of Computation. Cengage Learning, 2 edition, 2006.
