Implementing persistent data structures using C++

SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 28(15), 1559–1579 (25 DECEMBER 1998)

Implementing Persistent Data Structures Using C++

Allen Parrish¹, Brandon Dixon¹, David Cordes¹, Susan Vrbsky¹ and John Lusth²

¹ Department of Computer Science, The University of Alabama, Box 870290, Tuscaloosa, AL 35487, USA (email: {parrish,dixon,cordes,vrbsky}@cs.ua.edu)

² Department of Mathematics and Computer Science, Boise State University, Boise, Idaho 83725, USA (email: lusth@cs.idbsu.edu)

SUMMARY

Persistent data structures allow efficient access to, and modification of, previous values of the data structure. In this paper, we illustrate a class-based implementation of persistence. Our implementation provides a mechanism to transform a given (non-persistent) class to a persistent form without making any significant modifications to the class. We then demonstrate how our implementation may be used to improve the efficiency of a previously devised procedure for class testing of object-oriented software. © 1998 John Wiley & Sons, Ltd.

key words: persistence; object-orientation; software testing

INTRODUCTION

Persistent data structures [1] have been used to develop improved algorithms in several contexts, including computational geometry [2], text file editing [3], high level programming languages [4] and the management of time-evolving databases [5]. The idea behind persistent data structures is that old values of the data structure (values that have since been changed) may be retrieved and/or modified. The algorithms given in Reference 1 are efficient in both time and space.

Previous work in this area has been mostly theoretical. An exception is the work reported in Reference 5, which included a persistence-based implementation for their database model. In this paper, we develop an object-oriented implementation of persistence, using C++ as our implementation language. Unlike the database application reported in Reference 5, our implementation is reusable, in the sense that it can be utilized in any application where persistence is desired. Also unlike Reference 5, our implementation is transparent, in the sense that it may be utilized in an existing non-persistent object-oriented system without any significant modifications to the existing code. In this regard, our implementation also differs significantly from the approach proposed (but not implemented) in Reference 1, in which the implementations of persistence and the underlying data structure are closely intertwined.

CCC 0038–0644/98/151559–21$17.50. © 1998 John Wiley & Sons, Ltd. Received 4 July 1997; revised and accepted 20 July 1998.


As a working application, we show how our persistence implementation can improve an existing approach to testing object-oriented software. Much of the previous work in object-oriented software testing has centred around the idea of testing classes as independent units [6–9]. In previous work [8], we presented a class testing technique called state tree generation, which involves the systematic execution of as many combinations of class methods as possible. This technique, although quite thorough in the number and types of combinations that it executes, is relatively inefficient. We utilize persistence to make our state tree generation testing technique more efficient. Specifically, by implementing the class to be tested using persistent data structures, we can ‘roll back’ test objects to previous states. State tree generation thus becomes more efficient, as states may now be generated incrementally from previous states, rather than completely independently. The fact that little modification is required when making an existing class persistent means that this technique does not require any significant overhead prior to testing.

As an aside, we note that the use of the term ‘persistence’ here is somewhat different from the typical usages of the term in the context of databases and ‘persistent objects’. Such usages typically refer to the idea of making permanent an in-memory object (e.g. by storing it in an object-oriented database). In this paper, ‘persistent’ does not necessarily imply permanence. Instead, persistent simply means that the value of an object is not lost just because the object has changed. Thus, the lifetime of an object’s value extends past subsequent changes to that object; however, unless a version of persistence is implemented that involves permanent storage, an object’s lifetime terminates when program execution ceases.

Persistent vs. ephemeral structures

In an ordinary data structure, making a change to any field in the data structure creates a new version of the data structure and, in the process, destroys the old (original) version. We call this type of data structure ephemeral. As discussed above, a persistent data structure [1] allows access to any/all of the old versions of the data. There are two types of persistence defined in Reference 1. A data structure is partially persistent if we allow read access to the old versions, but restrict modifications of the data structure to only the latest (current) version. Full persistence, on the other hand, allows a new version of the data structure to be created from any of the previous versions (without loss of any version).

To access previous versions of the data structure, some mechanism is needed to identify each unique version. A natural way to do this is through the use of a clock. Every version of the data structure is associated with some particular (and unique) value of the clock. When we create a new version of the data structure, we associate it with the current value of the clock. The clock can be viewed as an integer, with larger clock values representing later times. The clock is incremented each time any change is made to the data structure.

We use a simple example to explain the nature of both partial and full persistence. Consider the notion of a persistent integer whose values (at various clock times) are as follows:

    Clock    1      2      3      4      5
    Value    145    57     59     47     567


Suppose that the current clock value is 5, indicating that we have just set the persistent integer to 567. With partial persistence, we can look back at all previous values. For example, we can print the value of the persistent integer at time 3 (i.e. 59). With full persistence, not only can we revisit the value at previous times, but we can also modify the integer at those times. For example, we could revisit time 3 (value 59) and add one to this version of the integer, obtaining a new version with value 60. We are now operating on an ‘alternative timeline’ as further modifications are made to this value. Effectively, with full persistence, the history of values associated with a specific object forms a tree.
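The tree-shaped history that full persistence produces can be illustrated with a minimal sketch. This is not the algorithm of Reference 1 and not part of the implementation described in this paper; the names `Version`, `VersionTreeSketch` and `derive` are hypothetical, and each version simply records the version it was derived from.

```cpp
#include <vector>

// One version of the integer: its value and the version it was derived from.
struct Version {
    int value;
    int parent;  // index of the parent version, or -1 for the initial version
};

// Hypothetical illustration of a version tree; versions are numbered from 0.
class VersionTreeSketch {
public:
    // Create a new version derived from `parent` and return its index.
    int derive(int parent, int value) {
        Version v = {value, parent};
        versions_.push_back(v);
        return static_cast<int>(versions_.size()) - 1;
    }
    int value(int id) const { return versions_.at(id).value; }
    int parent(int id) const { return versions_.at(id).parent; }

private:
    std::vector<Version> versions_;
};

// Builds the five versions from the table (clock times 1..5), then branches
// from the version at clock time 3 by adding one to it.
int branch_from_time_3() {
    VersionTreeSketch t;
    int v1 = t.derive(-1, 145);
    int v2 = t.derive(v1, 57);
    int v3 = t.derive(v2, 59);
    int v4 = t.derive(v3, 47);
    t.derive(v4, 567);
    int branch = t.derive(v3, t.value(v3) + 1);  // alternative timeline
    return t.value(branch);
}
```

In this sketch the new version has the time-3 version as its parent, so the versions created at times 4 and 5 are left untouched and the history is no longer a simple chain.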

Of course, the real contribution of this work is not persistent integers (or scalars in general), but rather our ability to make large data structures persistent. In Reference 1, algorithms are given to implement any pointer-based data structure in a persistent fashion. These algorithms are efficient in both time and space. If a set of m operations on the ephemeral structure takes time f(m), then the same set of operations on the persistent version will take time O(f(m)). In other words, the persistent operations take only a constant factor more time than the corresponding ephemeral operations. The persistent version requires only O(f(m)) space as well. This extra space is required to retain ‘old’ versions of the data structure. For example, consider implementing a persistent integer. While the ephemeral version only requires the space to store the single integer, the persistent version will need to store every older version (as many as f(m) versions) of the integer.

Although the techniques reported in Reference 1 are efficient, they are not transparent. In Reference 1, the implementation of persistence is closely coupled to the implementation of the data structure itself. That is, the data structure must be designed and implemented with extensive use of the various persistence mechanisms. In this paper, we provide an alternative implementation of persistence for a C++ class, one that is completely independent of the actual data structure. As a result, it is relatively straightforward to convert an existing ephemeral data structure (implemented as its own C++ class) to its persistent analog.

Our primary interest is in implementing partial persistence in C++. We also implement a variant of partial persistence (which we refer to as ‘rollback’ persistence) in order to demonstrate our application to software testing. We do not present an implementation of full persistence, as the implementation issues for full persistence are conceptually similar to those for partial persistence.

PERSISTENT INTEGERS

Although our ultimate implementation is completely independent of the object being made persistent, we initially illustrate our implementation of persistence in the specific (and simplified) context of integers. To support our goals of transparency in client programs, we implement a partially persistent int class exporting the standard integer operations, allowing clients to treat persistent integers more-or-less as standard C++ ints. Such an implementation is in the spirit of Reference 10, where integers with various initialization semantics were implemented as transparent replacements for the built-in int type. This implementation is also similar to the Java idea of ‘wrapper’ classes, where new semantics for standard types are implemented for transparent client usage.

Our implementation of partial persistence for integers is straightforward. Although partial persistence only requires maintaining a set of values at a linearly ordered series of clock times, we implement a persistent integer (PerInt) as a balanced binary search tree to support efficient retrieval of old values. Each node in the tree contains an integer data value and a unique clock value. When a new value is assigned to a PerInt object, the value is inserted into the tree using the current time stamp; for every reference to this integer (e.g. its appearance on the right-hand-side of an assignment statement), the value from the tree appropriate for that time stamp is retrieved. Since the clock is automatically incremented every time any PerInt object is changed, then if there are multiple objects, some objects will retain the same value over multiple time stamps. Thus, retrieval at a particular instance of time means either retrieving the value from the tree with the desired time stamp or at the nearest time stamp preceding the desired time stamp.
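The retrieval rule just described (the exact time stamp if present, otherwise the nearest preceding one) can be sketched with an ordered map. This is only an illustration under stated assumptions: `std::map` stands in for the balanced binary search tree, and the class name `HistorySketch` is hypothetical; the `insert` and `retrieve` names follow the operations mentioned in the text, but the signatures are guesses rather than the interface of Figure 1.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for the History class: maps clock values to data.
// std::map is an ordered container, so insert and lookup are O(log n).
class HistorySketch {
public:
    // Record `value` as the version in effect from time `t` onward.
    void insert(int t, int value) { versions_[t] = value; }

    // Return the value stored at time `t`, or at the nearest earlier time.
    // Precondition: at least one version exists at or before `t`.
    int retrieve(int t) const {
        std::map<int, int>::const_iterator it = versions_.upper_bound(t);
        assert(it != versions_.begin());  // nothing recorded at or before t
        --it;                             // last entry with key <= t
        return it->second;
    }

private:
    std::map<int, int> versions_;
};

// An object that changed only at clock times 1 and 4; the other clock ticks
// are assumed to have been caused by changes to other objects.
int value_of_a_at(int t) {
    HistorySketch a;
    a.insert(1, 145);
    a.insert(4, 47);
    return a.retrieve(t);
}
```

Here a query at time 3 falls between two stored time stamps and therefore returns the value recorded at time 1.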

Using a balanced search tree scheme, insertions and retrievals both take O(log n) time. While this is a log n factor beyond what is required by the approach in Reference 1, the approach in Reference 1 only applied to pointer-based structures. Our approach also supports transparency in the implementation of non-pointer-based structures, as discussed in the next section. Also, if an upper bound of T is assigned to be the largest possible time stamp, then a van Emde Boas priority queue [11] can be used for insertions and retrievals, with the running time of both operations being O(log log T). While these running times are not strictly comparable, the O(log log T) bound can represent a significant reduction if a sufficiently small bound for T is assumed.

Figure 1 contains the interface for the history class (i.e. the balanced binary search tree) which is used in the implementation of our persistent integer class. Figure 2 then contains the persistent integer (PerInt) class. Given our objective of obtaining client transparency in the style of Reference 10, we overload operators to allow basic integer manipulation. As in Reference 10, these operators include:

1. A default constructor.
2. A copy constructor to construct a PerInt from an int.
3. A copy constructor to construct a PerInt from another PerInt.
4. A copy (assignment) operator.
5. A type conversion operator (to convert a PerInt to an int when necessary).

Figure 1. History Class

Figure 2. PerInt Class

Operations (1), (2), (3) and (4) may be categorized as l-value operations, in that they all assign values to PerInt objects. In each case, they insert values into an object’s history by calling the insert operation from class History. Operation (5) is strictly an r-value operation, in that it retrieves a value from the object’s history by calling the retrieve operation from class History. In practice, some of the l-value operations must also perform a retrieve for the value being copied (e.g. the right-hand-side PerInt object in the assignment operator). (Note that in addition to these five operations, there is also an increment operator (++), which we discuss later.)

Each of the operations treats the clock as a global object (clock). If the clock were not a global object, then it would have to be passed as a parameter to all PerInt operations. This would destroy the transparency of PerInt. To properly retrieve old values, this (global) clock must be incremented every time a value changes; thus all l-value operations in PerInt contain calls to the clock increment operation. Each operation must also capture the current clock value (via the capture operation) in order to insert and retrieve values as of the current time from the History object. Finally, class Clock also provides a reset operation, which sets the clock to an arbitrary time; this operation is needed by clients when the clock must be ‘backed up’ to previous times so as to obtain old values of a given PerInt. Because reset and capture are used by clients, they are public operations; increment is private to class PerInt. Figure 3 contains the complete Clock class.

Note that the Clock class guarantees that the partial persistence protocol is followed (i.e. that old PerInt values are observed, but not changed). This restriction is enforced by class Clock via the inclusion of a maximum value (maxtime) that reflects the greatest time reached by the clock. The clock increment operation will raise an exception if an attempt is made to increment the clock when its value is less than its maximum value.
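The protocol check described above might look roughly as follows. This is a sketch, not the Clock class of Figure 3: the class name `ClockSketch`, the exception types, and the range check in `reset` are assumptions, and `increment` is left public here for brevity, whereas the text restricts it to PerInt.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical clock that enforces partial persistence via a maxtime field.
class ClockSketch {
public:
    ClockSketch() : time_(0), maxtime_(0) {}

    // Current clock value, used as the time stamp for inserts and retrievals.
    int capture() const { return time_; }

    // Back the clock up (or restore it) to a time that has been reached.
    // The range check is an assumption; the text only says "arbitrary time".
    void reset(int t) {
        if (t < 0 || t > maxtime_)
            throw std::out_of_range("reset: time was never reached");
        time_ = t;
    }

    // Advance the clock; refused while the clock is backed up, because that
    // would amount to modifying an old version.
    void increment() {
        if (time_ < maxtime_)
            throw std::logic_error("increment: clock is backed up");
        ++time_;
        maxtime_ = time_;
    }

private:
    int time_;
    int maxtime_;
};

// Returns true if an increment at an old time is rejected and an increment
// at the latest time is accepted.
bool old_version_is_read_only() {
    ClockSketch c;
    c.increment();
    c.increment();
    c.increment();
    assert(c.capture() == 3);
    c.reset(1);  // look at an old version
    bool rejected = false;
    try {
        c.increment();
    } catch (const std::logic_error&) {
        rejected = true;
    }
    c.reset(3);     // return to the latest time
    c.increment();  // allowed again
    return rejected && c.capture() == 4;
}
```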

An extensive explanation of why each of the PerInt operators is needed is given in Reference 10 in the context of SafeInts. With regard to the three constructor operators:

(a) The default constructor is automatically invoked upon encountering a declaration of PerInt.

(b) The PerInt-to-PerInt copy constructor ((3) above) is invoked in two instances: (i) initializing a PerInt to another PerInt within a declaration, and (ii) initializing a formal PerInt parameter passed by copy.

(c) The int-to-PerInt copy constructor ((2) above) is invoked whenever a PerInt is initialized to an ordinary int (e.g. an int constant), or an ordinary int is passed by copy to a PerInt formal parameter.

The copy/assignment operator supports standard assignment, although the semantics are somewhat different than the standard assignment operator. Consider the statement a = b, where a and b are both PerInts. The typical semantics of assignment are to make either a ‘shallow’ (pointer) copy or a ‘deep’ (all object components) copy of b and store the result in a. We want the object a to maintain its history prior to the assignment; making an identical copy of b and storing it in a eliminates a’s history prior to the assignment statement. To avoid this problem, we retrieve a copy of the int in b at the desired time t (based on the current value of the global clock), and then insert that value into a’s history using t as the time stamp. Thus, the semantics are that of integer copy, maintaining object histories as appropriate.

Figure 3. Clock Class

The type conversion (int()) operator supports the retrieval of a standard int from a PerInt object. This operator allows a PerInt object to be used in statements where a standard int is expected. Consider, for example, the statement c = a + b, where a, b and c are PerInts. While there is no + operator defined for PerInt, this statement works as expected since the int() operator converts both a and b to standard ints and then applies the standard + operator. More generally, the compiler applies the following sequence of operations to a statement of this form (here, a = b + c):

(1) The int() type conversion operator is applied to both b and c, returning standard ints as of the current time stamp.

(2) The standard + operator for ints is applied to the results of (1), returning another standard int.

(3) Given that the copy operator requires a PerInt on both sides of the =, and since (2) produced an int on the right-hand side, the int-to-PerInt copy constructor is performed on the int resulting from (2). This copy constructor produces a PerInt.

(4) The copy operator (=) is now performed, using a as the left-hand-side value, and the PerInt resulting from (3) as the right-hand-side value.

Thus, the int() operator permits standard int operators to be used without specifically overloading them for PerInt. This works fine for r-value operators, since int() simply retrieves an int value according to the current clock time. However, this does not work for l-value operators; the program will not compile if a PerInt is used in any operation which normally expects an l-value int (i.e. an int&). In general, we do not want non-PerInt operations making modifications to PerInt objects, as such operations will not properly maintain the appropriate object histories. The primary implication of this is that standard int l-value operators (e.g. ++, --, -=, +=, etc.) must be overloaded for the PerInt class in order to function properly. To demonstrate this, we have overloaded the ++ operator in Figure 2. The remaining operators have been omitted in the interest of brevity, but are implemented similarly.

Figure 4 contains a sample PerInt client, reflecting our ability to reset the clock and obtain prior values of PerInt objects. Note that our PerInt class is completely transparent; the operations performed on PerInts are the same operations that are used on ints.
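To make the five operations and the overloaded ++ concrete, the following is a minimal sketch of a persistent integer and a small client. It is not the code of Figures 2 to 4: the names `PerIntSketch`, `SimpleClock` and `g_clock` are hypothetical, `std::map` replaces the History class, the clock omits the maxtime check, and initializing a default-constructed object to 0 is an assumption.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Simplified global clock (no maxtime check); `g_clock` is a hypothetical name.
struct SimpleClock {
    int now;
    SimpleClock() : now(0) {}
    int capture() const { return now; }
    void increment() { ++now; }
    void reset(int t) { now = t; }
};
SimpleClock g_clock;

class PerIntSketch {
public:
    PerIntSketch() { store(0); }                                      // (1)
    PerIntSketch(int v) { store(v); }                                 // (2)
    PerIntSketch(const PerIntSketch& other) { store(other.load()); }  // (3)
    PerIntSketch& operator=(const PerIntSketch& rhs) {                // (4)
        store(rhs.load());  // copy the int only; this object's history is kept
        return *this;
    }
    operator int() const { return load(); }                           // (5)
    PerIntSketch& operator++() {  // l-value operator, so it must be overloaded
        store(load() + 1);
        return *this;
    }

private:
    // l-value path: advance the clock, then record the value at the new time.
    void store(int v) {
        g_clock.increment();
        history_[g_clock.capture()] = v;
    }
    // r-value path: value at the current time or the nearest earlier time.
    int load() const {
        std::map<int, int>::const_iterator it =
            history_.upper_bound(g_clock.capture());
        assert(it != history_.begin());
        --it;
        return it->second;
    }
    std::map<int, int> history_;
};

// Client: returns (value of a at the saved time, value of a at the end).
std::pair<int, int> old_and_current() {
    PerIntSketch a = 145;
    int saved = g_clock.capture();
    PerIntSketch b = 2;
    a = a + b;  // int() on both operands, built-in +, int-to-PerInt, then =
    ++a;
    int current = a;
    int latest = g_clock.capture();
    g_clock.reset(saved);   // back the clock up
    int old_value = a;      // reads the value a had at the saved time
    g_clock.reset(latest);  // restore the clock
    return std::make_pair(old_value, current);
}
```

The client code reads like ordinary int code; only the calls on the clock reveal that old values are being retained.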

MAKING GENERALIZED DATA STRUCTURES PERSISTENT

We now use the basic techniques defined in the previous section to make arbitrary data structures persistent. Our approach involves implementing class PerInt as a template class, which we call Per. Since the persistent data are actually stored in a History object, the History class must also be made persistent. The template implementation of Per appears in Figure 5; since there are no substantive changes to History, we do not reproduce it here.

Per is used as an independent mechanism to make an existing ephemeral class persistent. Our approach minimizes the number of changes that must be made to the existing ephemeral class (although a few such changes are required). For example, consider the following analogous ephemeral and persistent stack classes:

Figure 4. PerInt Client

    Ephemeral                      Persistent

    class Stack {                  class Stack {
    private:                       private:
        int top;                       Per<int> top;
        char data[100];                Per<char> data[100];
    public:                        public:
        ...                            ...
    };                             };

To construct a persistent stack from its ephemeral analog, the private instance variables must be changed as shown above. Now consider the stack client shown in Figure 6. In this example, two push operations are executed on a persistent stack object, and then the current clock value is saved. Next, a pop of the top value on the stack is performed, and then the clock is reset to the previously saved value. At this point, any references to the stack value (such as in cout << s) will result in referencing the stack’s value at the previous time.

Figure 5. Per Class

We note that class Per is designed for parameterization by scalar types only. Thus, it would have been inappropriate for the stack client in Figure 6 to have attempted to declare a persistent stack simply via the instantiation Per<Stack>. The problem is that Stack’s l-value operations are not designed to maintain the necessary object histories, and cannot be transparently modified to handle this task. Fortunately, our implementation guarantees the persistence of a given class as long as all of its constituent scalar components are persistent. Since our implementation relies on a single global clock that is incremented every time a modification is made, these scalar components consist of a history of values at different time stamps. Retrieving the value of an object of this class means retrieving all constituent scalar values as of the given time stamp; thus, we have a global view of the object for that time stamp.
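The stack scenario (two pushes, save the clock, pop, reset the clock) can be sketched as follows. This is an illustration, not the code of Figures 5 and 6: `PerSketch` and `g_time` are hypothetical names, the clock is reduced to a bare counter, and the `push`, `pop` and `peek` bodies are assumed, since the paper elides the public part of Stack.

```cpp
#include <cassert>
#include <map>
#include <utility>

static int g_time = 0;  // hypothetical global clock, reduced to a counter

// Minimal persistent scalar: every write is stamped with a new clock value.
template <typename T>
class PerSketch {
public:
    PerSketch() { store(T()); }
    PerSketch(const T& v) { store(v); }
    PerSketch(const PerSketch& other) { store(other.load()); }
    PerSketch& operator=(const PerSketch& rhs) {
        store(rhs.load());
        return *this;
    }
    operator T() const { return load(); }

private:
    void store(const T& v) { history_[++g_time] = v; }
    T load() const {
        typename std::map<int, T>::const_iterator it =
            history_.upper_bound(g_time);
        assert(it != history_.begin());
        --it;  // last write at or before the current clock value
        return it->second;
    }
    std::map<int, T> history_;
};

// The stack logic is written as for an ephemeral stack; only the member
// declarations use PerSketch.
class Stack {
public:
    Stack() : top(0) {}
    void push(char c) { int t = top; data[t] = c; top = t + 1; }
    char pop() { int t = top; assert(t > 0); top = t - 1; return data[t - 1]; }
    char peek() const { int t = top; assert(t > 0); return data[t - 1]; }

private:
    PerSketch<int> top;
    PerSketch<char> data[100];
};

// Returns (top of stack as of the saved time, top of stack after the pop).
std::pair<char, char> rollback_demo() {
    Stack s;
    s.push('x');
    s.push('y');
    int saved = g_time;        // save the clock after two pushes
    s.pop();
    char after_pop = s.peek();
    int latest = g_time;
    g_time = saved;            // back the clock up
    char old_top = s.peek();   // every member is read as of the saved time
    g_time = latest;           // restore the clock
    return std::make_pair(old_top, after_pop);
}
```

Because both `top` and `data` are read against the same global clock, backing the clock up yields a consistent view of the whole stack at the saved time.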

The set of scalar types with which Per may be parameterized contains all C++ scalar types, including pointers. Consider, as a simple example involving pointers, a linked list class that is implemented as a pointer to some type of dynamically allocated node structure. We might have the following:

    class Node {                   class List {
        ...                            ...
    private:                       private:
        int data;                      Node *head;
        Node *next;                };
    };


Figure 6. Stack Client

In this case, both class Node and class List must be made persistent. This is accomplished by making all components of these two classes persistent. Since both of these classes consist only of scalar objects, it is sufficient to replace the types of these objects with instantiations of the generic template class Per, as follows.

    class Node {                   class List {
        ...                            ...
    private:                       private:
        Per<int> data;                 Per<Node *> head;
        Per<Node *> next;          };
    };

In general, both the pointer and the object referenced by this pointer must be persistent in order to guarantee persistent behavior. Making the pointer persistent ensures that a history is maintained whenever the address is changed; making the referenced object persistent ensures that a history is maintained whenever that object’s value is changed (yet its address remains constant). In the above example, the referenced object is a complex type (divisible into individual scalar objects). In contrast, consider a case where the referenced object is itself a scalar (e.g. int*). To guarantee the persistence of an int* object, it is necessary to change its declaration to Per<Per<int>*> (i.e. a persistent pointer to a persistent integer).* In either case, we are declaring a persistent pointer to a persistent object; however, the fact that the object is persistent is only explicit in the case of scalar objects. Structured objects are made persistent by making their constituent components persistent.

The previous examples have involved classes whose constituent components are scalars. We note that our approach is also applicable to classes whose components may consist in part of non-scalar objects (possibly objects of other classes). For example, consider a Queue class that has been implemented using a pair of stacks:

    class Queue {
        ...
    private:
        Stack s1, s2;
    };

In this case, in order for Queue to be persistent, it is only necessary that class Stack be persistent. However, since class Stack is non-scalar, one must ensure that its components are persistent to guarantee the persistence of the class itself. If Stack consists entirely of scalar components, then those components must be explicitly made persistent (i.e. for each scalar declaration involving type X, it must be converted to Per<X>). If Stack itself contains non-scalar components, then these declarations are unchanged; however, all classes associated with those components must be examined (ensuring that all of their scalar components are explicitly made persistent, and all non-scalar components are further examined, etc.). In general, as long as all scalar components at the lowest levels of such a ‘class composition hierarchy’ are explicitly made persistent via declaration changes, then all classes in the hierarchy are persistent. This is true because every class in the hierarchy ultimately consists of the (persistent) scalar components at the lowest level. By our earlier arguments, persistence of all scalar components is both necessary and sufficient to ensure persistence of a class which aggregates those components.

There are two issues that impact the overall utility and applicability of this technique: garbage collection and library functions. With regard to garbage collection, a well-written class should deallocate memory when it is no longer needed. For example, with our linked list class above, memory for a node should be deallocated whenever that node is deleted from the list. However, for the Per class to properly maintain the history associated with the list, these old values must be retained. Thus, all delete statements (which free memory) must be removed from the class being made persistent. While this is not a desirable modification, it is inherently necessary in any scheme for persistence; if we are to retain copies of old values, we must also retain space in which to store those values.
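The interaction between persistent pointers and deallocation can be sketched with a small linked list. This is an illustration under stated assumptions, not the paper's code: `PerSketch` and `g_time` are hypothetical names, the list operations are invented for the example, and keeping removed nodes in an owning vector until the list itself is destroyed is one possible way to honour the "no delete" rule, not something the paper prescribes.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

static int g_time = 0;  // hypothetical global clock, reduced to a counter

// Minimal persistent scalar; works for pointer types as well.
template <typename T>
class PerSketch {
public:
    PerSketch() { store(T()); }
    PerSketch(const T& v) { store(v); }
    PerSketch(const PerSketch& other) { store(other.load()); }
    PerSketch& operator=(const PerSketch& rhs) {
        store(rhs.load());
        return *this;
    }
    operator T() const { return load(); }

private:
    void store(const T& v) { history_[++g_time] = v; }
    T load() const {
        typename std::map<int, T>::const_iterator it =
            history_.upper_bound(g_time);
        assert(it != history_.begin());
        --it;
        return it->second;
    }
    std::map<int, T> history_;
};

struct Node {
    PerSketch<int> data;
    PerSketch<Node*> next;  // persistent pointer to a persistent node
    Node(int d, Node* n) : data(d), next(n) {}
};

class List {
public:
    void push_front(int v) {
        Node* old_head = head;
        std::unique_ptr<Node> n(new Node(v, old_head));
        Node* raw = n.get();
        nodes_.push_back(std::move(n));  // the list keeps every node alive
        head = raw;
    }
    void pop_front() {
        Node* old_head = head;
        assert(old_head != 0);
        head = old_head->next;
        // Deliberately no `delete old_head`: older versions still reach it.
    }
    int front() const {
        Node* h = head;
        assert(h != 0);
        return h->data;
    }

private:
    PerSketch<Node*> head;
    std::vector<std::unique_ptr<Node> > nodes_;  // freed with the list
};

// Returns the front element as of the time just before the removal.
int front_before_pop() {
    List l;
    l.push_front(1);
    l.push_front(2);
    int saved = g_time;
    l.pop_front();
    assert(l.front() == 1);
    int latest = g_time;
    g_time = saved;              // back the clock up
    int old_front = l.front();   // follows the old head to the removed node
    g_time = latest;             // restore the clock
    return old_front;
}
```

Had `pop_front` deleted the node, the read at the saved time would follow the old head pointer into freed memory.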

* The GNU C++ compiler does not permit such a declaration, but does permit the following two statements (providing an equivalent definition): typedef Per<int> PerInt; Per<PerInt *>.

With regard to library functions, the standard C/C++ libraries obviously have been developed independently of our notion of persistence. Consider an arbitrary formal parameter of some type T for such a library function. There are fundamentally four possibilities for such an object:

(1) The object might be passed by value, with a formal parameter type of T. (In this case, we assume that T itself is not defined to contain any pointers, either by a typedef or class. We defer such a case to (4) below.)

(2) The object might be passed by reference, with a formal parameter type of T&. (As in (1), we assume that T itself does not contain any pointers.)

(3) An explicit pointer to the object might be passed, of type T*.

(4) The object of type T is passed by value or reference, but contains one or more pointers (either because T is defined using a typedef or a class).

In case (1), we may wish to pass a Per<T> object in for the formal parameter of type T. This poses no problem, as the type conversion operator T() will obtain a value of type T from a Per<T> object at the current clock value. Thus, a pass-by-value behaves as a normal variable reference, as one might expect.

For case (2), if we attempt to pass a Per<T> object as a parameter when a T& is expected, a compilation error will occur due to the lack of a T&() type conversion operator in class Per. However, the underlying problem is that we do not want to allow the library function to change an object of type T, since the implementation of the function does not incorporate persistence semantics. Moreover, since we (typically) do not have access to the source code for the library function, we cannot change the formal parameter to Per<T>& (which would force our persistence semantics to be incorporated).

One solution is to generate ‘wrapper’ functions for library functions. As an example of such a function, consider a library function foo that expects a parameter of type int&. A wrapper function for foo is as follows:

    void Foo(Per<int>& x) {
        int y = x;
        foo(y);
        x = y;    // Invokes Per-based = op
    }

With Foo, the formal parameter is of type Per<int>&, and thus consistent with the persistent actual parameter (of type Per<int>). The formal parameter is copied to a local variable (y in this case) that is compatible with the formal parameter type of the library function. Once the library function finishes, the (modified) local variable (y) is copied back to the persistent object (x). This copy invokes the = operation from Per, which appropriately modifies the history of the persistent object.
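The effect of the wrapper on the object's history can be checked with a small self-contained sketch. The library function `foo` below is a stand-in invented for the example (it doubles its argument), and `PerSketch` and `g_time` are hypothetical simplifications of Per and the global clock rather than the code of Figure 5.

```cpp
#include <cassert>
#include <map>
#include <utility>

static int g_time = 0;  // hypothetical global clock, reduced to a counter

// Minimal persistent scalar, standing in for the Per template.
template <typename T>
class PerSketch {
public:
    PerSketch() { store(T()); }
    PerSketch(const T& v) { store(v); }
    PerSketch(const PerSketch& other) { store(other.load()); }
    PerSketch& operator=(const PerSketch& rhs) {
        store(rhs.load());
        return *this;
    }
    operator T() const { return load(); }

private:
    void store(const T& v) { history_[++g_time] = v; }
    T load() const {
        typename std::map<int, T>::const_iterator it =
            history_.upper_bound(g_time);
        assert(it != history_.begin());
        --it;
        return it->second;
    }
    std::map<int, T> history_;
};

// Stand-in for a library function that knows nothing about persistence.
void foo(int& x) { x *= 2; }

// Wrapper in the style described above.
void Foo(PerSketch<int>& x) {
    int y = x;  // r-value conversion at the current clock value
    foo(y);     // the library function modifies the plain int
    x = y;      // persistent assignment records the new value
}

// Returns (value before the call, value after the call).
std::pair<int, int> before_and_after() {
    PerSketch<int> v = 21;
    int saved = g_time;
    Foo(v);
    int after = v;
    int latest = g_time;
    g_time = saved;   // back the clock up
    int before = v;   // the pre-call value is still in the history
    g_time = latest;  // restore the clock
    return std::make_pair(before, after);
}
```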

When considering cases (3) and (4), we note that the problem is similar to that of case (2), but no similar solution appears to exist. For case (3), a persistent object corresponding to T* would be of type Per<Per<T>*> (i.e. a persistent pointer to a persistent object of type T). Syntactically, we are confronted with a situation where no operator exists to convert a Per<Per<T>*> object to a T* object. Like the case with reference parameters, the real problem is that the library function does not incorporate persistence semantics. Thus, changes to the underlying object via pointer dereferencing do not modify the object’s history. With reference parameters, we could modify the object history by assigning the non-persistent actual parameter to a persistent object outside of the function; however, in this case, the parameter itself is just the (unchanged) pointer. Changes to the underlying object inside the function cannot be detected and recorded. Case (4) presents a similar problem. However, if the parameter is passed by value, no syntax error would occur (by the arguments for case (1) above). Nonetheless, changes made to non-persistent data objects via the pointer would not be historically maintained.

AN APPLICATION TO SOFTWARE TESTING

We now illustrate how persistent objects can be utilized in a realistic application. We have incorporated this idea into our object-oriented testing technique called state tree generation [8]. The basic goal of this technique is to test an individual class by generating a large number of objects of that class. Each generated object is then examined to determine whether or not it is ‘correct’. Although state tree generation was shown to be useful in some cases in Reference 8, its implementation was relatively inefficient. In this section, we show an efficient implementation of state tree generation using persistent objects.

To illustrate state tree generation, we consider an example of a simple ordered (integer) list class with six methods: Add, Delete, Tail, Member, Length and the constructor operation List (that initializes an empty list). A C++ interface for this class appears in Figure 7.

It is important to classify these methods (operations) in the following sense [12]:

(a) List is categorized as a constructor (as well as a constructor in the C++ usage of the term). List qualifies as a constructor as it produces a list from scratch, without taking a previous list object as an input.

(b) Add, Delete and Tail are transformers. That is, each of these three methods takes an initial list and produces a modified list as a result (effectively changing the list’s state).

(c) Member and Length are observers. That is, these two methods take a list and report some information about the list without making modifications to the list (no change in state).

Figure 7. List Class

As noted above, our goal is to generate a large number of list objects, and then determine whether or not each of those objects is ‘correct’. There are a number of techniques for evaluating whether a given object is correct. We utilize the well-known concept of a class invariant [13,14], as described in Reference 15. The class invariant should be true for all list objects; for example, one clause of a list class invariant can be expressed as L.Add(v).Length = L.Length + 1. That is, for any list object L, adding some item v to the list and taking the length of the result should be equivalent to taking the length of the original list L and adding 1. Normal class invariants contain several such clauses.

Our overall approach can be defined as:

(1) Generate n object states (where n is determined by the amount of time and space available for testing, as well as the desired reliability of the application).

(2) Check the class invariant (all clauses) for all object states generated. If the class invariant is satisfied for all object states, then no defect is revealed.

(3) If the class invariant fails for some object state(s), then a defect exists which should be identified and corrected.

As Reference 15 suggests, we include a special method in the class interface called CheckInvariant that is invoked for every generated object to determine whether or not the class invariant is satisfied. This method does not appear in the above class interface, but would be added before the class is tested.

We use state tree generation to generate object states. To do this, we first identify the methods that have an impact on the object’s state. From our discussion above, this includes the constructor method (List) and three transformer methods (Add, Delete and Tail). We then generate a state tree containing object states produced from executing various combinations of these methods. Figure 8 contains a partial (small) state tree for our list class. The labels at each node represent the method sequences used to generate this particular state (e.g. LAD refers to the state generated by executing List, followed by Add, followed by Delete).* To produce a given state, the constructor method is executed first (since an initial instance of the object must be created), followed by one or more transformer methods. The goal is to execute all constructors with as many combinations of transformers as resources permit, producing a (large) number of object states for testing our class invariants. It should be noted that our technique is intended for defensively designed classes, where no errors are introduced due to the use of a method in an inappropriate context (e.g. Delete from an empty list).

Figure 8. State tree for ordered list class

* Note that this technique ignores the selection of non-List parameters to the individual methods. For example, Add requires an int parameter (the item to be added to the list). Such parameters are chosen at random in the current implementation.

This state tree can become large relatively quickly. For a class with k transformer methods, there are $(k^{n+1} - 1)/(k - 1)$ states in an n-level tree. Thus, a class with three modifier methods would result in 21,523,360 states after only 15 levels. After 20 levels, this grows to over 5 billion states. More problematically, states become progressively more complex deeper into the tree, and the number of redundant executions increases rapidly. For example, consider the transition from state LA at level 2 to its child states at level 3 (LAA, LAD, LAT). To generate LAA, LAD and LAT as three independent states requires executing nine separate methods (three chains of three separate messages), of which the first two methods in each chain are the same.
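The state counts quoted here can be reproduced with a few lines of code. The following sketch sums the number of states level by level, which is equivalent to the closed form $(k^{n+1} - 1)/(k - 1)$; the function name StateCount is chosen for this example only.

```cpp
#include <cassert>
#include <cstdint>

// Number of states in a tree with k transformer methods and levels 0..n,
// i.e. 1 + k + k^2 + ... + k^n = (k^(n+1) - 1) / (k - 1).
std::uint64_t StateCount(std::uint64_t k, unsigned n) {
    std::uint64_t total = 0;
    std::uint64_t level = 1;  // k^0 = 1 state at the root level
    for (unsigned i = 0; i <= n; ++i) {
        total += level;
        level *= k;           // number of states on the next level
    }
    return total;
}
```

For k = 3 this gives 13 states for n = 2 (the tree of Figure 8), 21,523,360 for n = 15 and 5,230,176,601 for n = 20.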

What we desire is to efficiently generate state trees in either breadth-first or depth-first order. We would actually prefer breadth-first order, as this results in the simpler states near the top of the tree being generated first. By generating simpler states before more complex ones, the process of error removal is simplified [8]. However, it turns out that breadth-first generation requires full persistence, which is quite expensive in terms of space requirements. On the other hand, depth-first generation can be accomplished relatively efficiently (in terms of its space requirements).

With depth-first state tree generation, we would generate the states in Figure 8 in the order: L, LA, LAA, LAD, LAT, LD, LDA, LDD, LDT, LT, LTA, LTD, LTT. We utilize persistence to accomplish this efficiently, and identify the following sequence of events. (Note that CheckInvariant is implicitly executed upon the construction of each state.)

(1) Execute the List constructor to generate (and check) state L.

(2) Execute the Add operation on state L to generate (and check) state LA.

(3) Execute the Add operation on state LA to generate (and check) state LAA.

(4) Reset the clock to state LA.

(5) Execute the Delete operation on state LA to generate (and check) state LAD.

(6) Reset the clock to state LA.

(7) Execute the Tail operation on state LA to generate (and check) state LAT.

(8) Reset the clock to state L.

(9) Continue this process to generate (and check) the remainder of the states.
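The order of events above follows a standard recursive depth-first traversal. The sketch below only generates the state labels, not the states themselves; the function name and the use of one letter per transformer (A, D, T) are conventions adopted for this example. The point at which each recursive call returns corresponds to a "reset the clock" step.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Appends the labels of the subtree rooted at `label` in depth-first order.
// `ops` holds one letter per transformer method; `depth` is the number of
// levels still to be generated below this node.
void GenerateDepthFirst(const std::string& label, const std::string& ops,
                        int depth, std::vector<std::string>& out) {
    out.push_back(label);  // "generate (and check)" this state
    if (depth == 0) return;
    for (char op : ops) {
        GenerateDepthFirst(label + op, ops, depth - 1, out);
        // Returning from the call corresponds to resetting the clock
        // to the state named by `label`.
    }
}
```

Calling GenerateDepthFirst("L", "ADT", 2, out) fills out with L, LA, LAA, LAD, LAT, LD, LDA, LDD, LDT, LT, LTA, LTD, LTT, the order given in the text.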

An efficient implementation: rollback persistence

This approach was used in Reference 8 to test several different classes with good results; a discussion of the merits of the approach in terms of finding defects is found in Reference 8. However, the implementation reported in Reference 8 was not efficient, thus substantially limiting the number of states that could be generated using this technique. We now consider a more efficient implementation using persistent data structures.

Since this process requires that we modify old states (states with clock values in the past), we cannot accomplish this process using partial persistence. However, full persistence is not required either, as we do not need to maintain any ‘alternative timelines’. Thus, we define an intermediate form of persistence, which we call rollback persistence. With rollback persistence, we are allowed to modify old states, but we discard all later states upon resetting the clock to a particular point in time (unlike full persistence). (Semantically, this is the same concept as standard database rollback [16], although our implementation is different from the standard database rollback implementation.) For example, once we roll back from state LAA to state LA (step 4 above), we no longer retain state LAA. This ‘rollback’ model is actually distinct from both the full and partial persistence models, as states are never ‘thrown away’ in either of those models.

To implement rollback persistence, it is necessary to modify the underlying History class to reflect the rollback paradigm. Given the order in which the old states are accessed, it is more efficient to maintain the history in an array-based stack rather than in a binary search tree. (The use of an array allows us to eliminate memory allocation operations.) Such a stack is maintained in reverse time stamp order (i.e. more recent times are closer to the top of the stack). For example, the following list for a persistent integer object is ordered according to this strategy:

Clock            10    7    5    2    1
Integer Value    145   57   59   47   567

Retrieving the value at time 5 involves removing the values at times 10 and 7 from the stack. Such values are no longer needed in this model. As a result, the rollback model allows us to free memory for object values when we no longer need them, thus conserving space (which was not permitted with the two previously defined persistence models).
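The following is a minimal sketch of such a stack-based history, written under several assumptions: the class name RollbackHistory and the Store, Retrieve and Size operations are hypothetical, a std::vector with reserved capacity stands in for the fixed array, and the clock value is passed in explicitly rather than read from a Clock object. It is not the History class of Figure 9.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical rollback history for one persistent value of type T.
// Entries are (time stamp, value) pairs; the most recent entry is at the
// back of the vector, which plays the role of the top of the stack.
template <typename T>
class RollbackHistory {
public:
    explicit RollbackHistory(std::size_t capacity = 64) {
        entries_.reserve(capacity);  // avoid reallocation in the common case
    }

    // Record `value` at `time`, discarding all later versions first.
    void Store(long time, const T& value) {
        DiscardAfter(time);
        if (!entries_.empty() && entries_.back().first == time)
            entries_.back().second = value;  // overwrite the same instant
        else
            entries_.emplace_back(time, value);
    }

    // Return the value current at `time`, discarding all later versions.
    // Assumes some version was stored at or before `time`.
    const T& Retrieve(long time) {
        DiscardAfter(time);
        assert(!entries_.empty());
        return entries_.back().second;
    }

    std::size_t Size() const { return entries_.size(); }

private:
    // Pop every entry whose time stamp is greater than `time`.
    void DiscardAfter(long time) {
        while (!entries_.empty() && entries_.back().first > time)
            entries_.pop_back();
    }

    std::vector<std::pair<long, T>> entries_;
};
```

With the five versions from the list above stored in time order, Retrieve(5) returns 59 and leaves three entries on the stack, since the versions at times 10 and 7 are discarded.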

Figure 9 contains the modified History class. These modifications include the conversion to a linear structure, as well as the automatic elimination of all values with time stamps greater than the current clock value in the retrieval operation.

Figure 9. Modified History class

The Clock class also required some minor modifications for the rollback approach. First, the restriction preventing the modification of old object versions must be removed, since we are allowed to modify old versions under the rollback model. Secondly, to implement our depth-first state tree generation algorithm, clock values cannot simply monotonically increase with the generation of each new object state. Our state tree generation algorithm (described below) requires that the clock be advanced and retracted to an assortment of different non-adjacent numeric values. Thus, we remove the increment operation, and remove all invocations of increment from Per. Instead, we permit the client to arbitrarily invoke reset to set the clock as needed. Figure 10 contains the modified Clock class. (For the sake of brevity, we do not reproduce Per.)

Since our state tree generation algorithm requires that the clock be advanced and retracted to arbitrary non-adjacent values, we are no longer able to include the assertion in reset that the clock not be randomly set to some future time (i.e. assert(time <= maxtime)). However, this assertion is not necessary, given that the new Clock class is specifically oriented toward the testing application. Thus, an assumption is made within the new Clock class regarding the conditions under which reset is used; specifically, within the testing application, no attempt is made to reset the clock to the future, accessing a value that doesn’t exist (resets are only performed when a new object value is generated).

We now consider our state tree generation algorithm. The clock is maintained such that, of all of the nodes seen by the depth-first search, only the ancestors of the current node have a smaller timestamp. Thus, all siblings (along with their descendants) to the left of a given node are effectively ignored, and in fact eliminated from the history stack, when encountered. This approach requires that the clock be advanced by $m^k$ when creating the leftmost child of a node, where m is the number of modifier operations and the node is k levels from the bottom of the depth-first tree. The clock is then decreased by $(1 - m^k)/(1 - m)$ for each subsequent child. The need for this type of flexible client control necessitated the changes in the clock’s operation as described in the previous paragraphs. Figure 11 illustrates the state tree from Figure 8 with the appropriate clock values for each node.
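To make the timestamp rule concrete, the sketch below assigns clock values to a small tree. It rests on two assumptions that the text does not state explicitly: the root is given clock value 0, and k is read as the number of levels between the node being expanded and the bottom of the tree. The values shown in Figure 11 may therefore differ by an offset. The function names are chosen for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Integer power m^k for small non-negative k.
long Power(long m, int k) {
    long result = 1;
    for (int i = 0; i < k; ++i) result *= m;
    return result;
}

// Assigns a clock value to every node of the subtree rooted at `label`.
// `height` is the number of levels between this node and the bottom of the
// tree; `ops` holds one letter per modifier operation (at least two).
void AssignTimes(const std::string& label, long time, const std::string& ops,
                 int height, std::map<std::string, long>& times) {
    times[label] = time;
    if (height == 0) return;
    const long m = static_cast<long>(ops.size());
    const long mk = Power(m, height);
    long child = time + mk;                // leftmost child: advance by m^k
    const long step = (1 - mk) / (1 - m);  // size of one child's subtree
    for (char op : ops) {
        AssignTimes(label + op, child, ops, height - 1, times);
        child -= step;                     // each subsequent child
    }
}
```

For the tree of Figure 8 (m = 3, two levels below the root) this yields L = 0, LA = 9, LAA = 12, LAD = 11, LAT = 10, LD = 5, LDA = 8, LDD = 7, LDT = 6, LT = 1, LTA = 4, LTD = 3 and LTT = 2. When the search reaches LD at time 5, every node already visited except its ancestor L carries a larger timestamp, as the invariant requires.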

The actual algorithm for state tree generation is a standard recursive algorithm for depth-first search. To ‘visit’ a node in the tree is to simply generate the node at a particular timestamp (by executing the appropriate method on the parent state and setting the timestamp for that state as described above). Return visits to ancestor nodes involve rolling the clock back to the ancestor’s timestamp. For classes that do not dynamically allocate memory, such a rollback poses no problem. However, rolling back from a node whose terminal operation allocates memory will result in a memory leak; such a memory leak is not affordable in state trees containing large numbers of nodes. One way to address this problem is to implement a garbage collection algorithm within the test driver that is run periodically. Another approach might be to implement this scheme for a language such as Java, where garbage collection is automatic. Future research is needed to achieve an ideal resolution of this problem.

Figure 10. Modified Clock class

Figure 11. OrderedList class state tree with timestamps

Even though this testing methodology requires instrumentation changes to the class under test, it is possible to make error corrections in the original class while updating the instrumented version every time the original class is changed. Consider a class C intended for testing. It is possible to automatically generate C′ from C, where C′ is identical to C except that C′’s scalar data members have been made persistent (i.e. by replacing type X with Per<X>). Thus, one could test C′, identify defects, make changes to C, throw away the previous C′ and regenerate it, retest, etc. Similarly, if it is desired to test C which is part of a hierarchy of classes related via composition (i.e. data members of a class are objects of another class), one could automatically generate a persistent version of the entire hierarchy by generating a Per<X> type for every scalar type X found within the hierarchy.
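The transformation from a class to its persistent counterpart can be pictured with a deliberately simplified example. Everything in the sketch below is a stand-in written for illustration: the Clock and Per templates are much reduced versions of the ideas described in this paper (not the classes of Figures 9 and 10), and Counter and CounterP are hypothetical classes playing the roles of the original class and its generated persistent version. As in the text, the sketch assumes the client never resets the clock forward past versions it has not yet discarded.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical global clock with a client-controlled reset.
struct Clock {
    static long& Now() { static long t = 0; return t; }
    static void Reset(long t) { Now() = t; }
};

// Minimal stand-in for Per<X>: reads and writes go through a rollback
// history whose newest version sits at the back of the vector.
template <typename X>
class Per {
public:
    Per(const X& v = X()) { history_.emplace_back(Clock::Now(), v); }

    // Assignment stores a new version at the current clock value.
    Per& operator=(const X& v) {
        Trim();
        if (history_.back().first >= Clock::Now())
            history_.back() = std::make_pair(Clock::Now(), v);
        else
            history_.emplace_back(Clock::Now(), v);
        return *this;
    }

    // Conversion reads the version that is current at the clock value.
    operator X() const { Trim(); return history_.back().second; }

private:
    // Discard versions newer than the clock, always keeping the oldest one.
    void Trim() const {
        while (history_.size() > 1 && history_.back().first > Clock::Now())
            history_.pop_back();
    }
    mutable std::vector<std::pair<long, X>> history_;
};

// Original (ephemeral) class.
class Counter {
public:
    void Bump() { count_ = count_ + 1; }
    int Value() const { return count_; }
private:
    int count_ = 0;
};

// Generated persistent class: identical except that int became Per<int>.
class CounterP {
public:
    void Bump() { count_ = count_ + 1; }
    int Value() const { return count_; }
private:
    Per<int> count_ = 0;
};
```

The method bodies of Counter and CounterP are textually identical; only the declared type of the data member changes, which is what makes regenerating the persistent version after each correction a mechanical step.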

Performance analysis

Our implementation of state tree generation using persistence is substantially more efficient than the non-persistence-based approach given in Reference 8. In particular, the approach in Reference 8 recomputed the entire sequence of operations every time that a new node in the tree was produced. This seems particularly wasteful when one considers that the operations required to test a sibling node differ only in the last operation. The space required can be minimized by using an iterative scheme to construct the operator sequences; nevertheless, if all operations take time c1 then the total time to generate a tree of depth n with k modifier operations per node is

$$c_1 \cdot \frac{k - (n+1)k^{n+1} + n k^{n+2}}{(k-1)^2}$$

If we take the earlier example of a tree with three operations at a depth of 15 nodes, then this cost becomes c1·258,280,326.
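The closed form above is the standard value of the sum of $i \cdot k^i$ for i from 1 to n, that is, the cost of rebuilding each of the $k^i$ nodes on level i with i operations. The sketch below checks the closed form against the direct sum; the function names are chosen for this example, and the interpretation of the cost as this particular sum is an assumption. Under that reading, the sum for k = 3 and n = 15 evaluates to 312,088,728, which does not match the figure quoted in the text; the quoted figure may rest on a different counting convention, so the sketch should be taken as a check on the closed form only.

```cpp
#include <cassert>
#include <cstdint>

// Direct sum: each of the k^i nodes on level i is rebuilt with i operations.
std::uint64_t NaiveOpsBySum(std::uint64_t k, unsigned n) {
    std::uint64_t total = 0;
    std::uint64_t level = 1;
    for (unsigned i = 1; i <= n; ++i) {
        level *= k;          // k^i nodes on level i
        total += i * level;
    }
    return total;
}

// Closed form (k - (n+1) k^(n+1) + n k^(n+2)) / (k-1)^2, for k >= 2, n >= 1.
std::uint64_t NaiveOpsClosedForm(std::uint64_t k, unsigned n) {
    std::uint64_t kn1 = 1;
    for (unsigned i = 0; i <= n; ++i) kn1 *= k;  // k^(n+1)
    // Rearranged as k + k^(n+1) * (n*k - n - 1) so that every intermediate
    // value is non-negative and unsigned arithmetic can be used.
    return (k + kn1 * (n * k - n - 1)) / ((k - 1) * (k - 1));
}
```

Both functions return 21 for k = 3, n = 2 (three nodes built with one operation each and nine built with two).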


Table I. List class

Levels    Persistence approach (seconds)    Naive approach (seconds)
12        1                                 5
13        4                                 17
14        14                                55
15        45                                175

With our persistence-based (depth-first) generation scheme, the time required to construct a new node in the tree is the time needed to apply the modifier operation on the persistent version of the data structure. The only other work (besides the operation itself) is the traversal of the linked list of object versions. Since we are disposing of all versions with a larger time stamp, we can amortize the cost of traversing the list to those disposals. Thus each operation performs only a constant amount of extra work. If every operation takes time c2 then the time to create a tree of depth n with k modifier operations per node is reduced to $O(c_2 \cdot (k^{n+1} - 1)/(k - 1))$. This is (roughly) better by a factor of n. With our example tree this value becomes c2·21,523,360.

Of course, the constants for the two approaches (c1 and c2) are different. Since our persistence-based approach involves additional work beyond the naive approach, we would expect c2 > c1. Trees with large numbers of levels are infeasible with either approach, making n relatively small. Thus, without determining c1 and c2, there is the theoretical risk that c2/c1 may be larger than any feasible n, thus eliminating any time savings via the persistence-based approach. To address this question, we timed the generation of trees of varying numbers of levels for the List class example. Table I gives the results of this test.

In this case, the persistence-based approach improved on the naive approach, although not by the factor of n suggested by our theoretical analysis. Thus, the constant c2 is indeed larger than c1, although not large enough to eliminate the overall savings associated with persistence. However, our hypothesis was that the increase in the constant factor was associated with the increased cost of memory allocation of persistent objects. This hypothesis was confirmed by a second test, where we evaluated a stack class with three operations: push, pop and doubletop (doubles the integer item on the top of the stack). This stack class was implemented with simple arrays involving no memory allocation. Table II gives the result of this test.

Table II. Stack class

Levels    Persistence approach (seconds)    Naive approach (seconds)
12        negligible                        7
13        1                                 21
14        3                                 65
15        8                                 195
16        26                                592


In this case, the performance improvement was greater than the factor of n suggested by our theoretical analysis. This suggests that the constant associated with persistence is in fact smaller than the constant associated with the naive approach. We attribute the level of improvement beyond our theoretical analysis to the improved locality of reference obtainable from the persistence-based approach. With the naive approach, the redundant generation of similar states requires expensive memory references to non-local memory. With the persistence-based approach, small modifications are being made to object states via accesses to recently accessed states that are still cached in local memory.

These two examples represent the two basic approaches that an operation might use to manage memory: heap-based and stack-based allocation. An increase in the number of memory accesses per operation will not change the relative running times between the persistence-based and naive approaches. (If the time associated with a memory access in the naive approach is 1 and the time associated with a memory access in the persistence-based approach is c, then the time associated with n memory accesses with the two approaches is n versus nc, which still differ by a factor of c regardless of n.) The other possible way to perturb an operation is to increase the amount of computation time not involving memory accesses. In such a case, the effect of memory accesses on overall performance is decreased, and the difference between constants associated with the two strategies is reduced. As such, the performance differential approaches our theoretical analysis, where the persistence-based approach is superior by roughly a factor of n.

As for the space requirements of this approach, the algorithm only needs to keep versions for one path in the tree at any given time. This implies that the space requirements are only O(n) for a tree of depth n, if all operations only modify a constant number of locations in the structure. Note that this is independent of k, and is exponentially smaller than the size of the whole tree. In our example tree, our space requirement is O(15).

Despite the advantages in debugging, we have elected not to implement a persistence-based, breadth-first generation scheme. A proper breadth-first implementation using full persistence yields the same time bounds as the depth-first scheme. The problems arise from the additional space requirements. As mentioned above, we are required to keep a version of the data structure for each leaf of the tree if we are to expand the tree in breadth-first order. If each operation changes the data structure at only a constant number of locations then the total space requirements are $O((k^{n+1} - 1)/(k - 1))$ for the state tree. While this is only a constant amount of storage for each node in the tree, it is exponentially more space than the depth-first scheme. In contrast, we are able to use the depth-first scheme to generate state trees that are much larger than the size of available memory, simply because we need not retain states once we have visited them.

CONCLUSION

This paper has discussed the development of a mechanism that converts an existing C++ class into a persistent class, requiring a relatively small number of changes to the existing class. Persistent data structures allow us to revisit old values of the data structure via the association of old values with unique time stamps and then supporting retrieval for a particular time stamp. This work is based primarily on Reference 1. However, Reference 1 proposed algorithms for implementing data structures in a persistent fashion from the very start, while our approach is oriented toward making existing ephemeral structures persistent in a transparent fashion. As such, our approach is not quite as efficient as that of Reference 1, although it is within a log n factor.

We also demonstrated an application of persistence to a fundamental software engineering problem: the problem of class testing. Our class testing approach (originally reported in Reference 8) involves generating a number of object states that are minor mutations of each other. In Reference 8, we simply regenerated each state from scratch. Persistence allows us to generate states, and then ‘back up’ to a previous point in time to generate the mutation, without having to generate the mutation from scratch. Accesses to old states can then occur in constant time.

Future work will involve finding additional practical applications of this technique. In particular, Reference 5 identifies algorithms for applying persistence to time-evolving databases. However, no actual implementation is provided in Reference 5. We are exploring the feasibility of implementing the techniques defined in Reference 5 in a transparent fashion. It is hoped that efficient rollback and ‘peekback’ can be added to existing database implementations in a relatively transparent fashion, without a total redesign of the entire implementation.

REFERENCES

1. J. Driscoll, N. Sarnak, D. Sleator and R. Tarjan, ‘Making data structures persistent’, J. Computer and System Sciences, 38(1), 86–124 (February 1989).

2. N. Sarnak and R. E. Tarjan, ‘Planar point location using persistent search trees’, Comm. ACM, 29, 669–679 (1986).

3. T. Reps, T. Teitelbaum and A. Demers, ‘Incremental context-dependent analysis for language based editors’, ACM Trans. Program. System. Lang., 5, 449–477 (1983).

4. R. Hood and R. Melville, ‘Real-time queue operations in pure LISP’, Inform. Process. Lett., 13, 50–54 (1981).

5. V. Tsotras, B. Gopinath and G. Hart, ‘Efficient management of time-evolving databases’, IEEE Trans. Knowledge and Data Engineering, 7(4), 591–607 (August 1995).

6. R. Doong and P. Frankl, ‘The ASTOOT approach to testing object-oriented programs’, ACM Trans. Software Engineering and Methodology, 3(2), 101–130 (April 1994).

7. D. Hoffman and P. Strooper, ‘The TestGraphs methodology: Automated testing of collection classes’, J. Object-Oriented Programming, 8(6) (November/December 1995).

8. A. Parrish, D. Cordes and D. Brown, ‘An environment to support micro-incremental class development’, Annals of Software Engineering, 2, 213–236 (1996).

9. G. Murphy, P. Townsend and P. Wong, ‘Experiences with cluster and class testing’, Comm. ACM, 37(9), 39–47 (September 1994).

10. A. Parrish, D. Cordes, R. Borie and S. Edara, ‘Illustrating client and implementation readability tradeoffs in Ada and C++’, Software—Practice and Experience, 26(7), 799–814 (July 1996).

11. P. van Emde Boas, R. Kaas and E. Zijlstra, ‘Design and implementation of an efficient priority queue’, Math. Systems Theory, 10, 99–127 (1977).

12. B. Liskov and J. Guttag, Abstraction and Specification in Program Development, McGraw-Hill, 1986.

13. B. Liskov and J. Wing, ‘Specifications and their use in defining subtypes’, Proc. OOPSLA ’93, 1993, pp. 16–28.

14. B. Meyer, Object-Oriented Software Construction, Prentice-Hall, 1988.

15. C. Horstmann, Mastering Object-Oriented Design in C++, Wiley, 1995.

16. R. Elmasri and S. Navathe, Fundamentals of Database Systems, Benjamin/Cummings, 1994.
