

Rheinisch-Westfälische Technische Hochschule Aachen

Implementing Automatic Addition and Verification of Fault Tolerance

Implementierung einer Methode zur automatischen Synthese fehlertoleranter Systeme

Diploma Thesis in Computer Science by

Bastian Braun

August 11, 2006

Advisor and First Examiner: Prof. Dr. Felix Freiling (University of Mannheim)

Second Examiner: Prof. Dr. Ir. Joost-Pieter Katoen (RWTH Aachen University)


I hereby declare that I have written this thesis independently, that I have used no sources or aids other than those indicated, and that I have marked all quotations as such.

Aachen, August 11, 2006 Bastian Braun


Abstract

Fault tolerance is a central aspect of dependability in distributed systems. The presence of a fault does not cause a system failure if the system tolerates this fault. There is an approach for automatically adding fault tolerance to existing programs, the Arora-Kulkarni method. However, this technique is restrictive regarding its input, i.e. it assumes so-called fusion closed specifications. There is a method to “preprocess” inputs with non-fusion closed specifications, the Gärtner-Jhumka method. In this work, the Gärtner-Jhumka method is implemented. The preprocessing constructs an intermediate program that satisfies the preconditions of the restrictive Arora-Kulkarni method, which can then complete the fault tolerance synthesis.

To increase trust in the results of the synthesis, a model checker is used to verify the resulting program. It checks whether the occurrence of the defined fault leads to a specification violation of the final system, i.e. a failure. A program is developed to manage the complete synthesis. After the preprocessing, the intermediate program is given as input to the Arora-Kulkarni tool. The resulting program is then checked by the model checker. Finally, the output reports whether the fault tolerance synthesis was successful.

In the context of this work, a fault in the Gärtner-Jhumka method was found. This fault is rare and does not cause incorrect results, but it can prevent the method from producing any result. So, there are programs that cannot be synthesized.


Summary

Fault tolerance denotes the ability of systems to tolerate faults, i.e. the functionality of the system is not impaired by the occurrence of the faults. A method already exists for making existing programs fault tolerant after the fact, the Arora-Kulkarni method. For this purpose, a new program is generated to which a fault tolerance mechanism has been added. However, this method is very restrictive with respect to the possible inputs: only so-called fusion closed specifications are accepted. In this work, a new method is implemented that also accepts other specifications, the Gärtner-Jhumka method. This method works as a preprocessing step and outputs an intermediate program that meets the restrictive requirements of the Arora-Kulkarni method. The rest of the fault tolerance synthesis can then be carried out by the latter.

To increase trust in the outputs of the synthesis, each of them is verified by a model checker. It checks whether the occurrence of the faults leads to a loss of functionality. A program is developed that controls the course of the fault tolerance synthesis. First, the preprocessing is carried out by means of the Gärtner-Jhumka method. The resulting intermediate program is passed to the Arora-Kulkarni method as input. Finally, the resulting program is checked by the model checker. The output states whether the program tolerates the faults.

In the course of this work, a fault in the Gärtner-Jhumka method was identified. This fault occurs only rarely and does not lead to incorrect results. However, in some cases it is responsible for no result being computed at all.


Contents

List of Figures

List of Tables

1 Introduction
    1.1 Motivation
    1.2 Context
    1.3 Contributions
    1.4 Roadmap

2 Background
    2.1 Introduction
    2.2 System Model
        2.2.1 The Program
        2.2.2 Traces, Properties and Specifications
        2.2.3 Extensions, Faults and Fault Tolerant Versions
        2.2.4 Summary
    2.3 Fusion Closure
    2.4 The Arora-Kulkarni method (AK)
    2.5 The Gärtner-Jhumka method (GJ)
    2.6 Summary

3 The Fault in GJ
    3.1 Introduction
    3.2 The Fault in the Correctness Proof
    3.3 An Example
    3.4 Summary

4 Using FTSyn
    4.1 Introduction
    4.2 The Origins of FTSyn
    4.3 Examples
        4.3.1 The Input
        4.3.2 The Output
        4.3.3 Summary
    4.4 Deficiencies of FTSyn
    4.5 Summary

5 Model Checking with SPIN
    5.1 Introduction
    5.2 Model Checking
    5.3 SPIN
        5.3.1 Overview of SPIN
        5.3.2 Examples
            5.3.2.1 Input
            5.3.2.2 Usage
            5.3.2.3 Output
            5.3.2.4 Using “never claims”
    5.4 Summary

6 Implementing GJ
    6.1 Introduction
    6.2 Theoretical Fundamentals
    6.3 Algorithms and Data Structures
        6.3.1 Encoding Programs, Faults, and Specifications
        6.3.2 Identifying Bad Fusion Points
        6.3.3 Elimination of Bad Fusion Points
        6.3.4 Interfacing with FTSyn
        6.3.5 Summary
    6.4 Using the System
        6.4.1 Examples
    6.5 Summary

7 Validating the Implementation
    7.1 Introduction
    7.2 Interfacing with SPIN
        7.2.1 Deficiencies and Format Casting
    7.3 Handling the System
    7.4 Illustrated Examples
        7.4.1 First Example
        7.4.2 Second Example
        7.4.3 Third Example
    7.5 Summary

8 Conclusions and Future Work
    8.1 Conclusions
    8.2 Future Work

A The original Output of FTSyn

B The Second Output of SPIN

C The FTSyn Source Code for the example of Figure 2.3

Bibliography


List of Figures

1.1 The implementation of the Arora-Kulkarni method, FTSyn.
1.2 The interrelationship of FCPre and FTSyn.

2.1 Examples of extension monotonic and not extension monotonic fault models.
2.2 Example of not exploiting non-fusion closure.
2.3 Example of exploiting non-fusion closure.
2.4 Example of a not exploitable non-fusion closed specification.
2.5 Example of the Arora-Kulkarni method.
2.6 Example of a program that cannot be synthesized.
2.7 Second example of a program that cannot be synthesized.
2.8 Overview of ways to solve general fail-safe transformation problem.
2.9 Example of a bad fusion point.
2.10 The graphical illustration of the removal of a bad fusion point.

3.1 The graphical illustration of the fault in GJ correctness lemma.
3.2 The sequel to the illustration of the fault in GJ correctness lemma.

4.1 Graphical representation of our example of FTSyn functioning.
4.2 The Example from Figure 2.5 as a valid input code for FTSyn.
4.3 Example of introduction of multiple transitions by one guarded command.
4.4 Output of FTSyn belonging to the Input given in Figure 4.2.
4.5 The fault tolerant version of the example from Figure 4.1.

5.1 Visualization of Synthesis String ending with SPIN.
5.2 Program Tree visualizing depth-first search.
5.3 Example of a PROMELA input using active process.
5.4 Example of an output of SPIN using the active process method.
5.5 Example of a fault found by SPIN using active process method.
5.6 Example of the system coding using a never claim.
5.7 Example of the specification coding using a never claim.

6.1 Classification of FCPre and FTSyn.
6.2 Example of a Specification Tree.
6.3 Illustration of the identification of traces α and β.
6.4 Illustration of the workflow in FCPre.

7.1 The shell script that manages the fault tolerance synthesis.
7.2 Example of a complete fault tolerance synthesis, sequel to the example of Figure 2.3.
7.3 Second example of a complete fault tolerance synthesis.
7.4 Third example of a complete fault tolerance synthesis.
7.5 The final output of SPIN after fault tolerance synthesis of the first example.
7.6 The final output of SPIN after fault tolerance synthesis of the second example.
7.7 The final output of SPIN after fault tolerance synthesis of the third example.

B.1 Example of the SPIN output using a never claim.
B.2 Example of the violating program path using a never claim.

C.1 The FTSyn Source Code of the program given in Figure 2.3.
C.2 The appropriate output of FCPre.


List of Tables

6.1 Valid Operators to encode the Specification for FCPre.
6.2 The Syntax of Specification Formula for FCPre.


1 Introduction

1.1 Motivation

Safety is a very important aspect in the application of computer systems. It is a central property of dependable systems. Dependability, “the ability to deliver service that can justifiably be trusted” [ALR04], is one main goal of computer engineering. According to Laprie's definition, a dependable system has to provide (cf. [ALR04], [Lap92])

• availability, i.e. readiness for correct service,

• reliability, i.e. continuity of correct service,

• security, i.e. confidentiality (absence of unauthorized disclosure of information), integrity (absence of improper system alterations) and availability in the presence of attackers, and finally

• safety, i.e. absence of catastrophic consequences on the user(s) and the environment.

Safety can be realized in different ways. One approach is fault avoidance or fault prevention. According to Laprie et al. (cf. [ALR01]), fault avoidance means “preventing the occurrence or introduction of faults. It is attained by quality control techniques employed during the design and manufacturing of hardware and software.” The idea is the following: A fault that does not occur cannot cause any problem, i.e. a failure.

The other approach is fault tolerance. Fault tolerance does not aim at preventing the occurrence of faults but at preventing their impact. It is intended to preserve the correct provision of service in spite of the presence of faults. An important aspect of fault tolerance is that faults are not avoided at all; instead, they are tolerated by the system.

As one can imagine, fault tolerance is always desirable. The more faults a program tolerates, the safer it is. However, designing a fault tolerant system is not trivial, so it is not wise to make a system robust against an arbitrary variety of faults. One has to reason carefully about which faults should be tolerated. To do so, the criticality of the faults as well as their likelihood has to be determined, and the cost of fault tolerance needs to be compared to the cost of a failure. After choosing the faults the system has to tolerate, one has to develop a mechanism that handles the faults when they occur. Finally, the mechanism has to be validated and tested to check that it is correct with respect to the specification, i.e. the purpose of the system.


There is an easier way to obtain fault tolerance than designing fault tolerant programs “from scratch”. Programs can be made fault tolerant by automatic methods that add fault tolerance. Many existing programs are fault intolerant. One of these is taken as the initial version and made fault tolerant in an automatic synthesis process.

It would be unrealistic to assume that no faults may occur after the fault tolerance algorithms have been applied. The assumption that a fault tolerant program may never crash is at least as unrealistic. Instead, it is much more reasonable to assume that the program is no longer susceptible to the faults modeled beforehand because it is “robust” against them, i.e. it has become fault tolerant against them.

1.2 Context

In [Jos96], Arora and Kulkarni have shown that an automatic addition of fault tolerance to existing systems can be implemented efficiently. The automatic synthesis of fault tolerant systems makes it possible to treat the fault tolerance property independently of the system design. To understand their method, its components are briefly presented.

We focus on distributed programs. These programs consist of multiple processes. From here on, programs are denoted by Σ. First, to model the misbehavior of a process, one has to know what the program is meant to do. This is defined by a specification. The specification determines what should happen in a normal execution of the system. If everything is done as it is supposed to, one says that the specification is satisfied. If something unwanted happens, the specification is violated or compromised. It is assumed that in the absence of faults, the specification is satisfied. Otherwise, no fault tolerant version with the same behavior would exist. It is the major property of the altered version to behave just like the original one if no fault occurs. Arora and Kulkarni assume safety specifications (cf. [Jos96]). Safety specifications are those conditions that guarantee the safety property of a system, i.e. they avoid catastrophic consequences in the presence of faults (cf. Section 1.1).

Second, a fault has to be modeled. The assumption is made that the fault compromises the specification of the program. Otherwise, it would not be necessary to make an effort and alter the program. Faults may happen independently of each other, and it is not possible to circumvent them. The challenge is to catch the faults and make the program react predictably, i.e. the specification still has to be satisfied. For safety specifications, this means that the system may simply stop executing steps before violating the specification.
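The idea that a program may simply stop before it violates a safety specification can be sketched in a few lines of Python. This is an illustrative sketch only, not the algorithm of [Jos96] and not the FTSyn implementation. The function name `make_fail_safe`, the representation of the specification as a set of bad states (which presupposes that violations can be recognized from the state alone), and the tiny example are all assumptions.

```python
def make_fail_safe(transitions, faults, bad_states):
    """Drop program transitions that could lead to a violation.

    transitions, faults: sets of (source, target) pairs (hypothetical encoding).
    bad_states: states in which the safety specification is violated.
    """
    # Collect the states from which fault transitions alone can reach a bad state.
    unsafe = set(bad_states)
    changed = True
    while changed:
        changed = False
        for src, dst in faults:
            if dst in unsafe and src not in unsafe:
                unsafe.add(src)
                changed = True
    # Faults cannot be prevented, so the program must never enter an unsafe state.
    return {(src, dst) for src, dst in transitions if dst not in unsafe}


# Hypothetical example: from state "b" a fault leads to the bad state "d".
program = {("a", "b"), ("a", "c"), ("b", "c")}
faults = {("b", "d")}
safe_program = make_fail_safe(program, faults, bad_states={"d"})
print(sorted(safe_program))
```

In this example the transition from "a" to "b" is dropped, because once the program is in "b" a fault could carry it into the bad state "d". The program stops taking that step instead.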

There is an implementation of the Arora-Kulkarni method, named FTSyn (cf. [FTSa]). It was implemented by Ebnenasir. Figure 1.1 displays the functioning of FTSyn. However, the Arora-Kulkarni method suffers from one major drawback: It only works for so-called fusion closed specifications. This assumption restricts the set of possible inputs. We want to give an idea of what fusion closure means.


[Figure: FTSyn takes a program, a specification, and fault assumptions as input, where Σ is fault intolerant with respect to the fault assumptions. It outputs Program′, Specification′, and Fault assumptions′, where Σ′ is fault tolerant with respect to the fault assumptions.]

Figure 1.1: The implementation of the Arora-Kulkarni method, FTSyn.

Fusion closure is defined as the property that at every point in a run of a program, one can decide whether the specification holds or is violated using only the information contained in the current state. The opposite is non-fusion closure. In the case of non-fusion closure, the following scenario may happen: While executing the program, a program state is reached where one cannot definitely decide whether the non-fusion closed specification is satisfied or violated. Note that even a non-fusion closed specification is always either satisfied or violated, but it may be impossible to decide which of the two alternatives is true using only the present information. It would have been necessary to record the past program states of the execution, or at least some of them, depending on the concrete specification.

This necessity describes the problem caused by non-fusion closure. Recording the history of an execution takes an exponential amount of space. There are ways to minimize the use of history variables, but the problem keeps growing with growing input size and growing complexity of the specification.

For example, assuming one program variable x, the specification

“never (x = 13)”

is fusion closed. It is easy to decide in every state whether the specification holds. One only has to check whether x has the value 13. So, one may forget the value of x after leaving each state. By contrast, the specification

“if (x = 5) then somewhere in the past (x = 2)”

is not fusion closed, because whether the specification is violated in state x = 5 depends on the past and cannot be decided without this information.
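The difference between the two specifications can be made concrete with a small Python sketch. It assumes the usual trace-based reading of fusion closure: if two traces that satisfy the specification pass through a common state, then the prefix of one joined with the suffix of the other satisfies it as well. The function names and the sample traces are purely illustrative.

```python
def never_13(trace):
    """Specification: never (x = 13). Decidable state by state."""
    return all(x != 13 for x in trace)


def five_needs_earlier_two(trace):
    """Specification: if (x = 5) then somewhere in the past (x = 2)."""
    seen_two = False
    for x in trace:
        if x == 5 and not seen_two:
            return False
        if x == 2:
            seen_two = True
    return True


def fuse(first, second, state):
    """Prefix of `first` up to `state`, followed by the suffix of `second` after it."""
    return first[: first.index(state) + 1] + second[second.index(state) + 1:]


# Traces are sequences of values of x; both satisfy both specifications.
alpha = [1, 3, 4]
beta = [2, 3, 5]
fused = fuse(alpha, beta, 3)  # [1, 3, 5]

print(never_13(fused))                # True
print(five_needs_earlier_two(fused))  # False
```

Fusing at the shared state x = 3 keeps the first specification intact but breaks the second one: the fused trace reaches x = 5 without ever having visited x = 2, which the state x = 3 alone cannot reveal.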

Even the inclusion of a second variable does not change anything. On the contrary, the problem will increase, as we will see later on. Introducing another variable y, the specification

“never (x = 13) and never (y = 23)”


is still fusion closed, just like

“never ((x = 13) or (y = 23))”

which is the same as the specification above. We state that the fusion closure property does not depend on the notation of a specification. So, it is not possible to transform a non-fusion closed specification into an equivalent fusion closed one. That is why

“if (x = 5) then previously (y = 2)”

stays non-fusion closed, no matter which equivalence-preserving transformation we apply. Note that “somewhere in the past” and “previously” denote the same logical operator, just as “if . . . then . . . ” and “implies” have the same meaning.

Non-fusion closure is handled by introducing a history variable. This is a set variable into which the states that were visited so far can be inserted. Unfortunately, this variable has to be added to every state, and this leads to exponential growth of the state space. If we transform the examples from above and denote the history variable by h, with the observed variable as subscript, we derive

“if (x = 5) then (2 ∈ h_x)”

and

“if (x = 5) then (2 ∈ h_y)”

respectively.
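A minimal sketch of the history-variable idea follows, under several assumptions: a state is just the value of x, the history h_x is modeled as a frozen set of earlier values, and all helper names are hypothetical. With the history attached, the specification can be evaluated on a single augmented state, but the number of possible augmented states grows exponentially with the number of values.

```python
def with_history(trace):
    """Pair every state with the set of states visited before it."""
    history = frozenset()
    augmented = []
    for x in trace:
        augmented.append((x, history))
        history = history | {x}
    return augmented


def holds_in_state(state):
    """if (x = 5) then (2 in h_x), evaluated on one augmented state only."""
    x, h_x = state
    return x != 5 or 2 in h_x


def augmented_state_bound(value_count):
    """Upper bound on augmented states: every value paired with every subset."""
    return value_count * 2 ** value_count


print([holds_in_state(s) for s in with_history([2, 3, 5])])  # [True, True, True]
print([holds_in_state(s) for s in with_history([1, 3, 5])])  # [True, True, False]
print(augmented_state_bound(4))  # 64
```

Four values of x already give up to 64 augmented states, which illustrates why the history variable is kept as small as possible.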

Gärtner and Jhumka proposed a method to extend the Arora-Kulkarni method (AK) to also handle non-fusion closed specifications. The Gärtner-Jhumka method (GJ) is based on the concept of bad fusion points and their removal. Bad fusion points are those states of a program where violation of or compliance with the specification cannot be definitely decided, i.e. it depends on the history. That is what makes non-fusion closure so hard. The removal of these bad fusion points is done as a preprocessing step so that the established algorithms may still be applied, as the specification violation can be definitely decided after the execution of GJ. Needless to say, it is not the states themselves that are removed, but their property of being a bad fusion point. This is done by duplicating these states. This way, we obtain a “good” and a “bad” state, where the specification holds or is violated, respectively. The implementation of GJ is named “FCPre” in this work. The context of FCPre and FTSyn is displayed in Figure 1.2.

1.3 Contributions

We implemented FCPre as a preprocessing extension of FTSyn, so both tools are needed for the synthesis of programs with non-fusion closed specifications. It is common in computer science to build new methods on the basis of existing ones.

In the course of this implementation, a mistake in GJ was discovered. There is a case where this method does not work. During the execution, it might be necessary to change or erase faults, which we exclude because we want to tolerate the modeled faults. Faults are assumed to be given and unchangeable. As there is a correctness proof of GJ in [GJ03], there has to be a mistake in the proof as well.

[Figure: Σ → FCPre → Σ′ → FTSyn → Σ′′]

Figure 1.2: The interrelationship of FCPre and FTSyn.

Finally, we validate our implementation using the model checker SPIN (cf. [SPIc]). To this end, we have implemented a program that translates the FTSyn output to SPIN input. SPIN checks whether the resulting program is in fact fault tolerant. In this way, we can provide more confidence in the results of the synthesis.

The correctness of the theoretical methods was proved by their respective authors; the correctness of the implementation, however, is verified by the model checker. Of course, the results can only be checked individually.

We show that the system delivers correct output by providing some examples. We give the intermediate steps of the synthesis, so that the functioning of the method is comprehensible. This way, fault tolerance can be obtained in an automated fashion.

1.4 Roadmap

Section 2 gives the theoretical background of fault tolerance. Some terms are defined and the methods AK and GJ are presented formally. Section 3 explains the fault in GJ that has been found in the context of this thesis. Section 4 describes the tool FTSyn that has been used for the synthesis of fault tolerance. We show its origin, present some examples of usage, and point out some deficiencies. Section 5 deals with model checking in general as well as SPIN in particular. Examples of syntax and usage give an impression of this software and the reasons why we chose it.

Section 6 then goes into the implementation of GJ, which we call FCPre (Fusion Closure Preprocessor). The input syntax, the encoding of the program and the occurring faults, the interface with FTSyn, the applied algorithms, and the usage are explained. We continue with Section 7, which presents the use of SPIN for our purposes, including the translation from FTSyn to SPIN. Furthermore, we present some examples of complete fault tolerance synthesis. We conclude with a summary and pose future research questions in Section 8.


2 Background

2.1 Introduction

This section deals with the formal background of fault tolerance. We first introduce the system model of this thesis, then define the problem of automatically adding fault tolerance, and finally describe the two algorithms to synthesize fault tolerance, which are implemented and tested in the remainder of this work. This formal background is taken from [GJ03]. The algorithms are AK (cf. [Jos96]) and GJ (cf. [GJ04]).

2.2 System Model

This section defines and describes the elements of the system model. We have already mentioned the program that should be synthesized. Further, we need a specification stating what the program has to do and what it should not do. Finally, we introduce the notion of a fault and define the concept of fault tolerance.

2.2.1 The Program

The program represents the main element of our efforts. This is the only variable part, the one we work on. The specification and the fault assumptions stay constant.

The program is defined as a directed graph. The vertices of the graph are the states of the program, and the edges of the graph define the control flow.

Definition 2.1 (state space C). A state space is an unstructured finite nonempty set C of states.

A state predicate over C is a boolean predicate over C. In general, we denote states by lowercase letters, e.g. a, b, c, . . . . These states are linked by state transitions.

Definition 2.2 (state transition over C). A state transition t over C is a pair (r, s) of states from C.

Altogether, a program is defined as follows.


Definition 2.3 (program). A program (or system) Σ consists of a triple (C, I, T) where I ⊆ C denotes the set of initial states and T the set of transitions.

The smallest possible program consists of only one state, which is also the initial state, and no transitions:

Σ0 = ({a}, {a}, ∅)

2.2.2 Traces, Properties and Specifications

We now define some constructs based on the definition of a program in the previous section and leading to the definition of the next major element, the specification.

We first have to define what we mean by a trace.

Definition 2.4 (trace over C). A trace over C is a nonempty sequence s_1, s_2, s_3, ... of states over C. Traces can be finite or infinite. A trace is finite if its length is finite. The concatenation of two traces α and β is denoted by α · β. A transition t occurs in a trace σ if there exists an i such that (s_i, s_{i+1}) = t.

To work with traces, we have to be able to classify them. This is done by setting up properties over C.

Definition 2.5 (property over C). A property over C is defined as a set of traces over C. A trace σ satisfies a property P iff σ ∈ P. If σ does not satisfy P we say that σ violates or compromises P.

For example, let us assume the state space C = {a, b, c, d}. Then a property P over C would be the set of all traces of length ≥ 3. Now, a, b, c satisfies P, while a, b does not satisfy P.

We have established the basic terms that are needed to approach the definition of a specification. Since we already know that the specification describes the behavior of the program, one can guess that a specification defines a property, i.e. a set of traces. This set of traces can be divided into two kinds of properties: safety and liveness properties.

Informally, one can say that liveness properties are properties that are violated in infinite time and fulfilled in finite (but arbitrary) time. They are characterized by the word eventually: a liveness property says that something good will eventually happen.

Consider the following example of a liveness property: “Eventually, the system will output a value.” The only way to violate this property is an infinitely long run of the system that never outputs a value. Therefore, it is violated in infinite time. However, the property is achieved in finite time: when a value is output. If we assume that the property is met (there will be a system output), we can conclude that it is met in finite time, but we do not know when it will be met. Further examples of liveness properties are “All variables are eventually initialized”, “Eventually, the counter variable will reach value 13”, or, assuming a distributed system, “All messages are eventually delivered”.

We do not consider liveness properties in this thesis, so we omit a formal definition of liveness.


By contrast, safety properties are violated in finite time and fulfilled in infinite time. A safety property denotes that something bad will not happen. Formally, a safety property is defined as follows.

Definition 2.6 (safety property over C). A safety property S over C is a property over C for which the following holds: For each trace σ which violates S there exists a prefix α of σ such that for all traces β, α · β violates S.

The set of initial states I and the set of transitions T together describe a safety property S, where S contains all traces starting in a state in I and using only transitions from T. S is denoted by safety-prop(Σ), or briefly Σ instead of safety-prop(Σ).
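As an illustration, a program Σ = (C, I, T) and the membership test for safety-prop(Σ) can be sketched in Python. The sketch only handles finite traces, and the names `Program` and `in_safety_prop` are assumptions made for this illustration; they do not come from the thesis or from FTSyn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    """Hypothetical encoding of a program (C, I, T)."""
    states: frozenset       # C
    initial: frozenset      # I, a subset of C
    transitions: frozenset  # T, pairs (r, s) of states from C


def in_safety_prop(program, trace):
    """Check that a finite trace starts in I and uses only transitions from T."""
    if not trace or trace[0] not in program.initial:
        return False
    steps = zip(trace, trace[1:])  # consecutive pairs (s_i, s_{i+1})
    return all(step in program.transitions for step in steps)


# Example program with states a, b, c, initial state a and transitions a->b, b->c.
sigma = Program(
    states=frozenset("abc"),
    initial=frozenset("a"),
    transitions=frozenset({("a", "b"), ("b", "c")}),
)
```

With this encoding, the trace a, b, c is accepted, while a, c is rejected because (a, c) is not a transition of the program.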

What the definition reveals is the characteristic of safety properties that, once they are violated, they cannot be “healed”, i.e. if a safety property is violated, there will be no future point of the computation where the property is satisfied. A safety specification is a safety property.

The next case gives an example of a safety property: “No value is ever divided by zero.” Of course, division by zero is unwanted in almost all systems. As one immediately sees, the property is violated in finite time, namely when the division is executed, and if it happens, it happens after finite time. If however the property is fulfilled, one will only know in eternity, so it needs infinite time. Further examples of safety properties are “If one value is output, it is the correct value”, “If a child process is generated, it will be killed before termination of the host process”, or, assuming a distributed system, “Every message is only delivered once by each host”.

In this work, we only consider safety specifications.

2.2.3 Extensions, Faults and Fault Tolerant Versions

It is appropriate to define how a fault can be modeled. Surely, it is nice if the system works fine, but this is not the topic here. A way to model unfavorable system behavior is presented in this section. Finally, we specify what we mean by a fault tolerant version of a program.

One goal of fault tolerance synthesis is to maintain the equivalence of the original program and the synthesized one, i.e. the program we construct and consider as the fault tolerant version has to behave the same way the original program behaves in fault-free scenarios. This claim forms the correctness part of the synthesis. The remaining part consists of the completeness of fault tolerance synthesis. Completeness demands a different behavior in the case of a specification violation that was caused by a fault. This is the reason why the fault tolerant program might have additional states. The first thing we have to do is to define the meaning of these states with respect to the original program.

Let us call the given program Σ1 = (C1, I1, T1) and the fault tolerant version Σ2 = (C2, I2, T2).

Definition 2.7 (state projection function). A state projection function π : C2 → C1 tells which states of Σ2 are equivalent with respect to Σ1. It is applicable to traces in the following way: for a trace s_1, s_2, s_3, ... over C2 holds that π(s_1, s_2, ...) = π(s_1), π(s_2), ....


The original system and the fault tolerant one must be the same under the state projection function, i.e. the state projection function maps duplicated states to their origin. More precisely, the fault tolerant version Σ2 must be an extension of the fault intolerant Σ1.

Definition 2.8 (program extension). Let Σ1 = (C1, I1, T1) and Σ2 = (C2, I2, T2) be two programs. Program Σ2 extends program Σ1 using state projection function π iff the following conditions hold:

1. C2 ⊇ C1,

2. π is a total mapping from C2 to C1 (for simplicity we assume that for any s ∈ C1 holds that π(s) = s), and

3. π(Σ2) = Σ1.

One can conclude intuitively that an extension only adds program states (that should be clear from the word extension), but it does not remove any. Further, the added states have to be relatable to some state of Σ1. We will see later how the program is extended in a sensible way.
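The pointwise projection of traces and the first two extension conditions can be sketched as follows. This is only an illustration under stated assumptions: π is represented as a Python dictionary, the state sets and the mapping of c to b are hypothetical example data, and condition 3 (π(Σ2) = Σ1) is deliberately not checked here.

```python
def project(pi, trace):
    """Apply the state projection function pointwise to a finite trace."""
    return [pi[s] for s in trace]


def satisfies_conditions_1_and_2(c1, c2, pi):
    """Check C2 ⊇ C1, totality of pi on C2 with values in C1, and pi(s) = s on C1."""
    superset = c2 >= c1
    total = all(s in pi and pi[s] in c1 for s in c2)
    identity_on_c1 = all(pi.get(s) == s for s in c1)
    return superset and total and identity_on_c1


# Hypothetical example: state c is added and treated as a copy of state b.
c1 = {"a", "b"}
c2 = {"a", "b", "c"}
pi = {"a": "a", "b": "b", "c": "b"}
```

Here the trace a, c, b over C2 is projected to a, b, b over C1.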

Since this thesis deals with fault tolerant systems, we must have a way to model faulty behavior. Therefore, we first have to ask ourselves “How do we model behavior in general?”. This question has been answered in the last sections (cf. Sections 2.2.1 and 2.2.2). There, we considered a program as a graph with directed edges. The (directed) program transitions lead from one program state to another. So, the transitions represent the program behavior.

Obviously, it is advisable to model faulty behavior as transitions, too. We abstract from the reason for such a transition; it is only important to prevent a specification violation originating in this transition. Furthermore, it is sufficient to restrict faults to the addition of transitions because additional transitions are the only way to compromise a safety specification. The loss of transitions would accordingly be the way to compromise a liveness specification (cf. Section 2.2.2). We exclude faults that modify initial states, i.e. those faults that occur before the system starts.

Formally, a fault model is defined as a program transformation F, i.e. a mapping from programs to programs. Let us assume we start with program Σ. There are no faults in Σ. After applying F, we obtain F(Σ), the fault-affected version of Σ. Sometimes, F(Σ) is called program Σ in the presence of faults F. The conditions for a mapping to be a fault model are defined as follows.

Definition 2.9 (fault model). A fault model F maps a program Σ = (C, I, T) to a program F(Σ) = (F(C), F(I), F(T)) such that the following conditions hold:

1. F(C) = C,

2. F(I) = I,

3. F(T) ⊃ T

For a given fault model F and a specification SPEC, we say that a program Σ is F-intolerant with respect to SPEC if Σ satisfies SPEC but F(Σ) violates SPEC.


This definition confirms the statement that we can focus on transition addition since safety specifications can only be violated by additional transitions and not by their removal. Needless to say, transitions are in fact added.

For example, let Σ = ({a, b, c}, {a}, {(a, b), (b, c)}).

Then F(Σ) = ({a, b, c}, {a}, {(a, b), (b, c), (a, c)})

would be an example of applying a fault model. By contrast, neither

F′(Σ) = ({a, b, c}, {a}, {(a, b)})

nor F′′(Σ) = ({a, b, c}, {a}, {(a, b), (a, c)})

nor F′′′(Σ) = ({a, b, c, d}, {a}, {(a, b), (b, c)})

fulfill the requirements of a fault model. F′ removes transition (b, c), F′′ diverts transition (b, c) to (a, c), and F′′′ adds state d to the system. Any fault function that modifies the set of initial states would not be a valid fault model either.
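The three conditions of Definition 2.9 can be checked mechanically for a single program and a candidate result. The following Python sketch does this for the four examples above; the function name `is_fault_model_result` and the encoding of programs as triples of sets are assumptions made for this illustration. Note that it checks one image F(Σ) only, not the mapping F on all programs.

```python
def is_fault_model_result(sigma, faulty):
    """Check F(C) = C, F(I) = I and F(T) ⊃ T for one program and its image.

    Both arguments are triples (C, I, T) of Python sets.
    """
    c, i, t = sigma
    fc, fi, ft = faulty
    return fc == c and fi == i and ft > t  # '>' tests for a proper superset


sigma = ({"a", "b", "c"}, {"a"}, {("a", "b"), ("b", "c")})

# F adds the fault transition (a, c).
f_sigma = ({"a", "b", "c"}, {"a"}, {("a", "b"), ("b", "c"), ("a", "c")})
# F' removes (b, c), F'' diverts (b, c) to (a, c), F''' adds state d.
f1_sigma = ({"a", "b", "c"}, {"a"}, {("a", "b")})
f2_sigma = ({"a", "b", "c"}, {"a"}, {("a", "b"), ("a", "c")})
f3_sigma = ({"a", "b", "c", "d"}, {"a"}, {("a", "b"), ("b", "c")})
```

Only `f_sigma` passes the check; the other three candidates are rejected for the reasons given in the text.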

Further on, let us assume two given programs Σ1, Σ2 where Σ2 extends Σ1, and a fault model F. F now influences Σ2 in the same way as Σ1, i.e. all fault transitions that exist in Σ1 also exist in Σ2. So, no fault disappears by making the program fault tolerant against it. However, new faults may be created by F in Σ2 concerning the newly created states. If a fault model has these properties, it is called extension monotonic.

Definition 2.10 (fault extension monotonicity). A fault model F is extension monotonic iff for any two programs Σ1 = (C1, I1, T1) and Σ2 = (C2, I2, T2) such that Σ2 extends Σ1 using π holds:

F(T1) \ T1 ⊆ F(T2) \ T2

For clarity, an example is given in Figure 2.1. The original system Σ1 is given in the left column and the extension Σ2 in the right column. The state projection function π maps state c to state b.

Example (a) is extension monotonic. The faults contained in Σ1 are also contained in the extension Σ2. The additional fault transition (a, c) in Σ2 does not matter. However, example (b) is not extension monotonic. There, the fault transition (a, b) that is contained in Σ1 is missing in the extension Σ2.
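The inclusion from Definition 2.10 can be tested for one pair of programs as sketched below. The transition sets are assumptions chosen to match the verbal description of the two examples (a fault transition (a, b) in Σ1, an additional fault transition (a, c) in Σ2); they are not read off the figure. The definition quantifies over all pairs of programs, whereas this sketch checks a single pair only.

```python
def fault_transitions(t, ft):
    """Fault transitions of a program: F(T) minus T."""
    return ft - t


def monotonic_for_pair(t1, ft1, t2, ft2):
    """Check that every fault transition of the first program also occurs in the second."""
    return fault_transitions(t1, ft1) <= fault_transitions(t2, ft2)


# Hypothetical data: neither program has program transitions of its own.
t1, t2 = set(), set()
ft1 = {("a", "b")}                 # F(T1): fault transition (a, b)
ft2_a = {("a", "b"), ("a", "c")}   # case (a): (a, b) is kept, (a, c) is new
ft2_b = {("a", "c")}               # case (b): (a, b) is missing
```

For case (a) the inclusion holds, for case (b) it fails because (a, b) has disappeared.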

After we excluded the loss of faults, we have to limit the generation of new faults. Fault extension monotonicity does not give any restrictions on faults referring to new states. Assume however that after every step of fault tolerance synthesis new faults are added. Then, in general, it would be impossible to build a fault tolerant version of the program since the procedure of adding fault tolerance mechanisms could go on endlessly. If the fault model guarantees that from some point on no new faults are generated, it is called a finite fault model.


Figure 2.1: Examples of extension monotonic and not extension monotonic fault models. (a) is extension monotonic, (b) is not extension monotonic. [Graphic not reproduced: in each panel, Σ1 has the states a, b and Σ2 has the states a, b, c.]

Definition 2.11 (finite fault model). An extension monotonic fault model F is finite iff for any infinite sequence of programs Σ1, Σ2, ... such that for all i, Σ_{i+1} extends Σ_i holds that there exists a j such that for all k ≥ j no new fault transition is introduced in Σ_k, i.e. F(T_{k+1}) \ T_{k+1} = F(T_k) \ T_k.

We assume a finite fault model in the framework of this thesis. Note that a finite fault model is always extension monotonic. It does not restrict the introduction of new faults until some step of expansion. An infinite fault model would in general need infinite redundancy to be tolerated. That is what we want to exclude. Suppose for example the following scenario: Starting with system Σ1, the first fault is identified and eliminated, introducing some new state and resulting in system Σ2 that is an extension of Σ1. Σ2 may then contain new faults referring to the new state. So, the program has to be made tolerant against these faults, too. In doing so, some new state may be introduced again. And again, new faults may occur regarding the new state. Obviously, it is necessary to bound this behavior. That is what the finite fault model is used for.

Now, as the assumptions and preconditions relating to the faults are clear, we can define our goals. As already mentioned above, we want to build a fault tolerant version. We start with a fault intolerant program Σ1, given a specification SPEC and a fault model F. We want the fault tolerant version Σ2 to act the same way if no fault occurs. In the presence of faults, however, the specification still has to be satisfied. Thereby, SPEC and F stay unchanged, of course. Remember that F is a function and not the set of fault transitions. So, if F stays the same, it does not necessarily mean that the faults are unchanged. If F is applied to Σ2, the result may be different from F(Σ1). The projection function π is responsible for relating the new states to the specification.

Definition 2.12 (fault tolerant version). Let F be a fault model, SPEC be a specification and Σ1 and Σ2 be programs. Assume that Σ1 satisfies SPEC but F(Σ1) violates SPEC. We call a program Σ2 the F-tolerant version of program Σ1 for SPEC using state projection function π iff the following conditions hold:

1. Σ2 extends Σ1 using π,

2. F(Σ2) satisfies SPEC.

2.2.4 Summary

All aspects of the program have now been defined and explained. A solution of the fault tolerance synthesis problem has to consist of changes to the program, but only to the program transitions and states, since fault transitions may not be erased, diverted or changed in any other way.

2.3 Fusion Closure

We now discuss the notion of fusion closure. Fusion closure is the central link between previous approaches and this one.

Above, we defined a safety specification to be a safety property. We recall the definition of a property over C to be a set of traces. Besides the partition into safety and liveness properties, one can classify specifications by their fusion closure property. A fusion closed set is defined as follows.

Definition 2.13 (fusion closed set). Let C be a state set, s ∈ C, X be a property over C, α and γ finite state sequences, and β, δ and σ be state sequences over C. The set X is fusion closed if the following holds: If α · s · β and γ · s · δ are in X then α · s · δ and γ · s · β are also in X.

Intuitively, in fusion closed specifications the past of a system run does not play any role for the decision whether the specification is satisfied or violated. That is what this definition states. Alternatively, one can say that the entire past of the computation is stored in the current state. Both views mean the same.

For a better understanding, imagine that α, β, γ, and δ each contain just one state, e.g. α = a, β = b, γ = c, and δ = d. What the definition says is that if a, s, b and c, s, d (we only need the state s for concatenation) are elements of X, then the last states (state sequences) may be interchanged and nothing will cause a specification violation. On the one hand, it is not important which way one took to reach the state s because α and γ are arbitrary and may change, as we see. Informally, this was expressed by “the past does not matter”. On the other hand, if a specification violation occurs on a path, it will occur again on the other path with the same state sequence at the beginning of the path. So, it is impossible that a specification violation occurs in β after passing α and s but that β does not compromise the specification after passing γ and s. We observe that a fusion closed set can be expressed as a set of bad transitions. If a bad transition occurs on a path, the path violates the specification.


From now on, we will focus on handling non-fusion closed specifications, i.e. specifications that do not follow the definition above. The next question is whether non-fusion closed specifications always cause more problems than fusion closed specifications or, phrased differently, whether we always have to handle non-fusion closed specifications differently from fusion closed ones. Taking everything into account, one can guess that it takes more than just non-fusion closure. The fault model F is important as well. We say that F has to “exploit” the non-fusion closed specification.

Figure 2.2: Example of a non-fusion closed specification where the fault model F does not exploit non-fusion closure. The specification is “(e implies previously c) and (never g)”. [Graphic not reproduced: program graph over the states a, b, c, d, e, f, g.]

For example, consider Figure 2.2. Obviously, the specification “(e implies previously c) and (never g)” is not fusion closed. If state e is reached, the compliance with the specification depends on the previous visit of state c. This dependency is characteristic for non-fusion closed specifications. However, and this is the most important thing for our purposes now, the case that e is reached while c was not reached before can never happen in F(Σ). So, the first part of the specification is always satisfied. And this is the non-fusion closed part. The rest of this specification is fusion closed.

Now, we go into the formal definition of exploiting non-fusion closure, i.e. the property of a specification to be non-fusion closed. Therefore, we first define the terms fusion and fusion point of traces.

Definition 2.14 (fusion and fusion point of traces). Let s be a state and α = α_pre · s · α_post and β = β_pre · s · β_post be two traces in which s occurs. Then we define

fusion(α, s, β) = α_pre · s · β_post

If fusion(α, s, β) ≠ α and fusion(α, s, β) ≠ β we call s a fusion point of α and β.

We observe the following statements: For the fusion of three traces α, β, γ holds: If s occurs before s′ in β then

fusion(α, s, fusion(β, s′, γ)) = fusion(fusion(α, s, β), s′, γ)

and fusion(γ, s′, fusion(α, s, β)) = fusion(γ, s′, β).

Informally, the fusion function is defined as a special kind of combination of two or more traces. If α and β are two traces each containing a state s, the fusion of α, s and β consists of the first part of α from the beginning to s, s itself and the remaining part of β from s to the end.


As an example, take α = a, b, c, d, e and β = f, g, c, h, i. Then the fusion point would be state c, and fusion(α, c, β) = a, b, c, h, i. Note that in general fusion(α, s, β) ≠ fusion(β, s, α). Here, fusion(β, c, α) = f, g, c, d, e.
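The fusion operation on finite traces can be sketched in a few lines of Python. One assumption is made beyond the definition: if s occurs more than once in a trace, the trace is split at the first occurrence of s. The function names are chosen for this illustration.

```python
def fusion(alpha, s, beta):
    """Return alpha_pre · s · beta_post for finite traces given as lists.

    Each trace is split at its first occurrence of s (an assumption of this
    sketch); list.index raises ValueError if s does not occur.
    """
    i = alpha.index(s)
    j = beta.index(s)
    return alpha[:i] + [s] + beta[j + 1:]


def is_fusion_point(alpha, s, beta):
    """s is a fusion point if the fusion differs from both input traces."""
    fused = fusion(alpha, s, beta)
    return fused != alpha and fused != beta


alpha = list("abcde")   # a, b, c, d, e
beta = list("fgchi")    # f, g, c, h, i
```

Applied to the example, `fusion(alpha, "c", beta)` yields a, b, c, h, i and `fusion(beta, "c", alpha)` yields f, g, c, d, e.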

The fusion closure of a specification SPEC, fusion-closure(SPEC), contains all traces of SPEC and all fusions of two traces in fusion-closure(SPEC). So, fusion-closure(SPEC) is the set that is closed under finite application of the fusion operator on SPEC.

Definition 2.15 (fusion closure). Given a specification SPEC, a trace σ is in fusion-closure(SPEC) iff

1. σ is in SPEC, or

2. σ = fusion(α, s, β) for traces α, β ∈ fusion-closure(SPEC) and a state s that occurs in α and β.

The next example shows how non-fusion closure can be exploited. The system given in Figure 2.3 has the corresponding specification “f implies previously d”. The trace a, b, e, f causes a specification violation. The detection however depends on the knowledge of the past of the run: one has to know that state d was not visited. As already mentioned, the deciding aspect is the exploitation of non-fusion closure. The non-fusion closure of SPEC is exploited for Σ in the presence of F if there is a bad trace that can be built by the fusion of two good traces at a fusion point s. The exploitation of non-fusion closure is defined as follows.

Definition 2.16 (exploiting non-fusion closure). Let Σ be a system, F be a fault model and SPEC be a specification which is satisfied by Σ. Then F(Σ) exploits the non-fusion closure of SPEC iff there exists a trace σ ∈ F(Σ) such that σ ∉ SPEC and σ ∈ fusion-closure(SPEC).

Figure 2.3: Example of a non-fusion closed specification where the fault model F does exploit non-fusion closure. The specification is “f implies previously d”. [Graphic not reproduced: program graph over the states a, b, c, d, e, f.]

We want to find a trace σ that meets the conditions of the definition. First, the trace has to make use of the fault transition. So, b, e is part of σ. Second, it has to violate the specification SPEC. By this, we know that state f has to be part of the trace. We obtain b, e, f. As we always have to start with state a since it is the only initial state, we get a, b, e, f. This is just the trace we intuitively identified above as the violating one. However, there is a third condition: the trace has to be an element of the fusion-closure of SPEC. So, there need to be two or more traces, each an element of the specification, whose fusion results in σ. We know that the last part has to include the state f. As there is only one trace including f which is in SPEC, it has to be

β = a, d, e, f.

The first part also has to start at the initial state a but it has to include the fault transition (b, e). This way, we derive

α = a, b, e.

That is all we need. The fusion point is state e, and

fusion(α, e, β) = a, b, e, f.
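The argument above can be replayed in Python. The sketch encodes the specification “f implies previously d” as a predicate on finite traces and computes the fusion closure of the two good traces α and β by a fixpoint iteration. It rests on several assumptions: traces are finite tuples, fusion splits at the first occurrence of the state, the closure is cut off at a maximal trace length `max_len` to guarantee termination, and membership of σ in F(Σ) is taken from the text rather than checked, since the graph of the figure is not reproduced here.

```python
def satisfies_spec(trace):
    """'f implies previously d': every occurrence of f is preceded by some d."""
    seen_d = False
    for state in trace:
        if state == "d":
            seen_d = True
        elif state == "f" and not seen_d:
            return False
    return True


def fusion(alpha, s, beta):
    """alpha_pre · s · beta_post for tuples, split at the first occurrence of s."""
    i, j = alpha.index(s), beta.index(s)
    return alpha[:i] + (s,) + beta[j + 1:]


def fusion_closure(traces, max_len=10):
    """Smallest superset of `traces` closed under fusion, up to length max_len."""
    closure = set(traces)
    changed = True
    while changed:
        changed = False
        for a in list(closure):
            for b in list(closure):
                for s in set(a) & set(b):
                    fused = fusion(a, s, b)
                    if len(fused) <= max_len and fused not in closure:
                        closure.add(fused)
                        changed = True
    return closure


alpha = ("a", "b", "e")
beta = ("a", "d", "e", "f")
sigma = fusion(alpha, "e", beta)

# sigma violates the specification although it is a fusion of two good traces.
exploited = (not satisfies_spec(sigma)) and sigma in fusion_closure({alpha, beta})
```

Because the closure of a subset of SPEC is contained in fusion-closure(SPEC), finding σ in the closure of {α, β} is sufficient for the third condition.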

Obviously, non-fusion closure cannot be exploited by any fault model F if the specification is fusion closed. However, the converse does not hold. If non-fusion closure cannot be exploited by any fault model, it does not necessarily mean that the specification is fusion closed. Consider for example Figure 2.4. The specification is “c implies previously a”. Since every trace definitely starts with state a (remember the initial state preservation), no fault model F can ever exploit non-fusion closure. The specification however is not fusion closed.

Figure 2.4: Example of a non-fusion closed specification where no fault model F can exploit non-fusion closure. The specification is “c implies previously a”. [Graphic not reproduced: program graph over the states a, b, c.]

The main interest of this work lies in systems where non-fusion closure can be exploited.

2.4 The Arora-Kulkarni method (AK)

At this point, it is important to expose the difference between the work by Arora and Kulkarni (cf. [AK98a]) and Ebnenasir and Kulkarni (cf. [EK05a]) on the one hand and Gärtner and Jhumka (cf. [GJ04]) and this work on the other hand. AK solves the fusion closed fail-safe transformation problem while GJ can be used to solve the general fail-safe transformation problem, as we now explain.

We first have to define formally the goals of AK. Therefore, a general version of the problem, the general fail-safe transformation problem, is defined before the more restricted version is presented. Afterwards, we will show their approach to reach the goals.

16

Page 31: Implementing Automatic Addition and Verification of Fault ... · Fault tolerance is a central aspect of dependability in distributed systems. The presence of a fault ... One approach

2.4 The Arora-Kulkarni method (AK)

Definition 2.17 (general fail-safe transformation problem). Given a fault model F and a program Σ1 which is F-intolerant with respect to a general safety specification SPEC1. The general fail-safe transformation problem consists of finding a fault tolerant version of Σ1, i.e., a program Σ2 such that Σ2 extends Σ1 and F(Σ2) satisfies SPEC1.

This definition makes no hard assumptions on the elements of the system. But as we have seen, and will see later, the precondition is the fusion closure property of the safety specification SPEC. That is why the fusion closed fail-safe transformation problem is now defined.

Definition 2.18 (fusion closed fail-safe transformation problem). The fusion closed fail-safe transformation problem consists of solving the general fail-safe transformation problem where SPEC1 is fusion closed.

The basic mechanism AK uses to reach this goal is the creation of redundant states. They utilize that fusion closed safety specifications can be expressed as a set of so-called “bad” transitions. Bad transitions are those whose execution causes a violation of the safety specification. Remember that fusion closure means that the satisfaction or violation of the specification can be decided in every state without any further information. So, there needs to exist at least one transition t = (a, b) with the following property: as long as the program has only reached state a, the specification is met. As soon as the transition is used and state b is reached, the specification is violated.

For example, assume the specification “never c”. It is quite easy to see that every transition leading to c must be a bad transition. To continue our presentation, the term maintains must be defined.

Definition 2.19 (maintains). Let Σ be a program, SPEC be a specification and α be a finite computation of Σ. We say that α maintains SPEC iff there exists a sequence of states β such that α · β ∈ SPEC.

The idea of bad transitions can be formulated in the following way.

Lemma 2.1 (Bad Transition Lemma). Let Σ = (C, I, T) be a system, SPEC be a safety specification which is fusion closed and assume that Σ violates SPEC and that for all x ∈ I holds that x maintains SPEC. Then there exists a transition (d, b) ∈ T such that for all traces σ of Σ holds: if (d, b) occurs in σ then σ ∉ SPEC.

The idea of the previous lemma is the following: Let us assume that SPEC is a safety property. Then every trace σ with σ ∉ SPEC has a maximal prefix α with α ∈ SPEC. But after the next step the trace is no longer in SPEC, i.e. α · b ∉ SPEC and α · b is a prefix of σ. At this point, the trace “switches from ‘good’ to ‘bad’” (cf. [GJ03, p. 8]). Let us say now that a is the last state of α. Then t = (a, b) is a bad transition whose occurrence has to be prevented. We know from Arora and Kulkarni (cf. [AK98b]) that the execution of this transition would cause any trace to violate the specification. So, t has to be made unreachable. To guarantee that, one has to distinguish between the following cases:

17

Page 32: Implementing Automatic Addition and Verification of Fault ... · Fault tolerance is a central aspect of dependability in distributed systems. The presence of a fault ... One approach

2 Background

• If t is a reachable program transition, specification violations may occur without the incidence of a fault. This scenario is excluded through our assumptions.

• If t is a non-reachable program transition, it can be simply removed.

• If t is a fault transition, it cannot be just removed. In this case, a has to be made unreachable. But this is only possible if there is a non-reachable program transition in α that can be simply removed. Otherwise, there is no fault tolerant version.

This line of reasoning does not work if we assume a non-fusion closed specification. It is easy to imagine that if the violation of the specification depends on the past (see above), a transition is not necessarily good or bad but it may be both. The method still works in case of non-fusion closed specifications if the fault model does not exploit non-fusion closure.

Figure 2.5: Example of the Arora-Kulkarni method. The specification is “never h”. [Graphic not reproduced: program graph over the states a, b, c, d, e, f, g, h.]

Let us analyze an example for better understanding. Consider Figure 2.5. The specification is “never h”. As already stated, the challenge is to make state h unreachable. First, one has to check if the safety specification can be violated in the absence of faults. Since this is not the case here, we can continue. The first transition leading to state h is (g, h). It is a program transition and it is reachable from the initial state. So, this transition (g, h) is to be eliminated. We may just remove it because the removal does not change the behavior of the system in the fault free scenario. The other transition ending at state h is (f, h). This however is a fault transition which cannot be removed. Remember that fault transitions have to be tolerated but not removed. The solution is to make state f, the origin of the transition, unreachable. This procedure goes on until we arrive at state e. It can be cut off by eliminating transition (d, e). This transition is redundant as well, i.e. it is a program transition that is not reachable from any initial state by just using program transitions.

Figure 2.6: Example of a program that cannot be synthesized. The specification is “never h”. [Graphic not reproduced: program graph over the states a, b, c, d, e, f, g, h.]

18

Page 33: Implementing Automatic Addition and Verification of Fault ... · Fault tolerance is a central aspect of dependability in distributed systems. The presence of a fault ... One approach

2.5 The Gartner-Jhumka method (GJ)

Figure 2.5 also helps to show examples that cannot be synthesized. Imagine there were a transition (c, h) (cf. Figure 2.6). In this case, state h would be reachable by program transitions only, so that no transition could be removed without changing the program behavior in the absence of faults. Such a removal would destroy the equivalence of the fault intolerant and the fault tolerant version. We obtain the other example by introducing a fault transition (b, h) (cf. Figure 2.7). This transition could not be removed. Furthermore, state b could not be made redundant. The impossibility of fault tolerance synthesis in these cases is not a drawback of the method; rather, it shows that the fault assumption was chosen too strong.

Figure 2.7: Another example of a program that cannot be synthesized. The specification is “never h”. [Graphic not reproduced: program graph over the states a, b, c, d, e, f, g, h.]
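The pruning idea behind the three cases can be sketched in Python. This is a simplified illustration of the reasoning in this section, not the algorithm as published by Arora and Kulkarni and not the FTSyn implementation. The function names are hypothetical, and the transition sets below are assumptions loosely modelled on the description of Figures 2.5 to 2.7, because the actual edges of the figures are not reproduced here.

```python
def reachable(initial, transitions):
    """States reachable from the initial states using only the given transitions."""
    seen, stack = set(initial), list(initial)
    while stack:
        state = stack.pop()
        for (src, dst) in transitions:
            if src == state and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def add_fail_safe(initial, program, faults, is_bad):
    """Return the pruned set of program transitions, or None if pruning fails."""
    # States that must become unreachable: sources of bad fault transitions and
    # every state from which fault transitions alone lead into such a state.
    unsafe = {src for (src, dst) in faults if is_bad((src, dst))}
    changed = True
    while changed:
        changed = False
        for (src, dst) in faults:
            if dst in unsafe and src not in unsafe:
                unsafe.add(src)
                changed = True
    # Program transitions to delete: bad ones and those entering an unsafe state.
    removed = {t for t in program if is_bad(t) or t[1] in unsafe}
    fault_free = reachable(initial, program)
    if unsafe & set(initial):
        return None  # an initial state cannot be made unreachable
    if any(src in fault_free for (src, _) in removed):
        return None  # a reachable program transition would have to be removed
    return program - removed


def never_h(transition):
    """Bad transitions for the specification 'never h' are those entering h."""
    return transition[1] == "h"


# Hypothetical transition sets in the spirit of Figure 2.5.
initial = {"a"}
program = {("a", "b"), ("b", "c"), ("d", "e"), ("g", "h")}
faults = {("c", "d"), ("c", "g"), ("e", "f"), ("f", "h")}

pruned = add_fail_safe(initial, program, faults, never_h)
```

With these assumed edges, the sketch removes (g, h) and (d, e), as in the walk-through above. Adding a program transition (c, h) or a fault transition (b, h) makes the function return None, which corresponds to the two cases that cannot be synthesized.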

An overview of the situation that built the basis for this work has been given in this section. The workof Gartner and Jhumka followed this of Arora and Kulkarni and served as theoretical foundation forthe present thesis.

2.5 The Gartner-Jhumka method (GJ)

After presenting the base work of Arora and Kulkarni above, we now give an overview of the subse-quent work of Gartner and Jhumka. The differences and similarities in their approaches are presentedin this section.

There are two ways to add fault tolerance to fault intolerantsystems with a non-fusion closed spec-ification. The first way is the construction of a new method from scratch. It is represented by thetop part of Figure 2.8. The second way — and this is the one thatis chosen by GJ — is to make useof already existing approved methods. The idea is to “prepare” the not compliant input to make itapplicable to AK. This approach is displayed by the bottom ofFigure 2.8.

The GJ method aims to prevent the exploitation of non-fusion closure. As we pointed out above, the exploitation of non-fusion closure is the most important property of a system that one wants to synthesize.

Let us assume that a system Σ1 and a fault model F are given. They have to satisfy the assumptions of the general fail-safe transformation problem (cf. Definition 2.17), i.e. Σ1 has to be F-intolerant with respect to SPEC1. The aim is to find a corresponding regular input for the fusion closed fail-safe transformation problem, i.e. the specification SPEC2 must be fusion closed. Finally, the output Σ2 has to meet the requirements of Definition 2.17. So, Σ2 must be an extension of Σ1 and F(Σ2) has to satisfy SPEC1.

Figure 2.8: Overview of ways to solve the general fail-safe transformation problem. (Diagram: the general method leads directly from Σ1, fault intolerant w.r.t. SPEC1, to Σ2, fault tolerant w.r.t. SPEC1; the Gärtner-Jhumka method leads from Σ1 to Σ′2 with fusion closed SPEC2, from which a “standard” fail-safe transformation, e.g. the Arora-Kulkarni method, leads to Σ2.)

The first step is the following: construct SPEC2 by applying the fusion closure to SPEC1, i.e.

SPEC2 = fusion-closure(SPEC1).

Second, the intermediate program Σ′2 is generated from Σ1 such that F(Σ′2) does not exploit the non-fusion closure of SPEC1. As we want to be able to apply the definitions of the fail-safe transformation problem, the following two conditions have to hold:

1. Σ′2 extends Σ1 using some state projection function π.

2. F(Σ′2) does not exploit the non-fusion closure of SPEC1.

We need the first condition because the extension of an extension is still an extension, i.e. the extension property is transitive.

The second condition is necessary since the second step of the method solves the fusion closed fail-safe transformation problem. As we stated above, it is appropriate to treat the specification as fusion closed if the non-fusion closure is not exploited by the fault model. GJ claims that the program that results from first applying GJ and afterwards applying AK satisfies the requirements of the general fail-safe transformation problem. That means that Σ2 is an F-tolerant version of Σ1 with respect to SPEC1.

It is advisable to go into the details of the constructive method of GJ now. Therefore, we take a look at those states where specification violation cannot be definitely decided. These states are called bad fusion points.


Definition 2.20 (bad fusion point). Let SPEC be a specification, Σ be a system satisfying SPEC, s be a state of Σ, and F a fault model such that F(Σ) violates SPEC. State s is a bad fusion point of Σ for SPEC in the presence of F iff there exist traces α, β ∈ SPEC such that

1. s is a fusion point of α and β,

2. fusion(α, s, β) ∈ F(Σ), and

3. fusion(α, s, β) ∉ SPEC.

For example, consider Figure 2.3. There, state e is a bad fusion point. Let α be a, b, e, and β be a, d, e, f (note that α, β ∈ SPEC), then e is a fusion point of α and β, fusion(α, e, β) = a, b, e, f ∈ F(Σ) and finally, fusion(α, e, β) = a, b, e, f ∉ SPEC.
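The fusion operation and the conditions of Definition 2.20 can be made concrete with a small Python sketch. It assumes finite traces given as sequences of state names and uses the first occurrence of s in each trace. The two membership tests `in_spec` and `in_faulty_system` are hypothetical stand-ins for SPEC and F(Σ), which the sketch does not model; they only encode the memberships stated in the example.

```python
from typing import Callable, Sequence, Tuple

Trace = Tuple[str, ...]


def fusion(alpha: Sequence[str], s: str, beta: Sequence[str]) -> Trace:
    """Prefix of alpha up to and including s, followed by the suffix of beta after s."""
    # Assumption: s occurs in both traces; the first occurrence is used.
    i = list(alpha).index(s)
    j = list(beta).index(s)
    return tuple(alpha[: i + 1]) + tuple(beta[j + 1:])


def is_bad_fusion_point(
    s: str,
    alpha: Sequence[str],
    beta: Sequence[str],
    in_spec: Callable[[Trace], bool],
    in_faulty_system: Callable[[Trace], bool],
) -> bool:
    """Check the conditions of Definition 2.20 for one given pair of traces."""
    if not (in_spec(tuple(alpha)) and in_spec(tuple(beta))):
        return False
    if s not in alpha or s not in beta:  # condition 1: s is a fusion point
        return False
    fused = fusion(alpha, s, beta)
    return in_faulty_system(fused) and not in_spec(fused)  # conditions 2 and 3


# Traces from the example above.
alpha = ("a", "b", "e")
beta = ("a", "d", "e", "f")
fused = fusion(alpha, "e", beta)  # ("a", "b", "e", "f")

# Hypothetical stand-ins: only the memberships stated in the text are encoded.
in_spec = lambda trace: trace != ("a", "b", "e", "f")
in_faulty_system = lambda trace: True
result = is_bad_fusion_point("e", alpha, beta, in_spec, in_faulty_system)
```

With these stand-ins, `result` is true for state e, in line with the example. A full check would have to search over all pairs of traces rather than test one given pair.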

Another example is given in Figure 2.9. The specification is “d implies previously b”. There, state c is a bad fusion point.

Figure 2.9: Example of a program with bad fusion point c (states a to d). The specification is “d implies previously b”.

If we look closely at Figure 2.5, we can state that there is no bad fusion point. This may have been clear since the specification is fusion closed. So, we note that there are no bad fusion points if the specification is fusion closed. In fact, one can claim even a little more.

Lemma 2.2 (Bad Fusion Point criterion). Let SPEC be a specification, Σ be a system satisfying SPEC and F be a fault model. The following two statements are equivalent:

1. Σ has no bad fusion point for SPEC in the presence of F.

2. F(Σ) does not exploit the non-fusion closure of SPEC.

As shown above, it is both necessary and sufficient to remove all bad fusion points from the considered program. Therefore, bad fusion points have to be identified first. We focus on necessary conditions to build a set of so-called bad fusion point candidates. We recall the definition and observe that a bad fusion point has at least two incoming transitions (one contained in α and one in β). Secondly, since β ∈ SPEC and α ∈ F(Σ), one past must contain a fault transition. Afterwards, for every state in the candidate set, matching α and β are searched for.
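The two necessary conditions can be turned into a simple filter. The following Python sketch is one possible reading of them, not the procedure of [GJ03]: it keeps every state that has at least two incoming transitions and that can be reached from an initial state along a path containing a fault transition. The function name and the small graph at the end are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Transition = Tuple[str, str]


def bad_fusion_point_candidates(
    initial: Iterable[str],
    program: Set[Transition],
    faults: Set[Transition],
) -> Set[str]:
    """States satisfying both necessary conditions for being a bad fusion point."""
    all_transitions = program | faults

    # Necessary condition 1: at least two incoming transitions.
    in_degree: Dict[str, int] = {}
    for _, dst in all_transitions:
        in_degree[dst] = in_degree.get(dst, 0) + 1

    # Necessary condition 2: some path to the state contains a fault transition.
    # We explore pairs (state, fault_seen) breadth-first.
    seen = {(state, False) for state in initial}
    queue = deque(seen)
    while queue:
        state, fault_seen = queue.popleft()
        for src, dst in all_transitions:
            if src != state:
                continue
            successor = (dst, fault_seen or (src, dst) in faults)
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    reached_after_fault = {state for state, fault_seen in seen if fault_seen}

    return {s for s, n in in_degree.items() if n >= 2 and s in reached_after_fault}


# Hypothetical four-state program with one fault transition (a, c).
candidates = bad_fusion_point_candidates(
    initial=["a"],
    program={("a", "b"), ("b", "c"), ("c", "d")},
    faults={("a", "c")},
)
```

In this hypothetical graph only c has two incoming transitions and a fault in one of its pasts, so it is the only candidate. Whether a candidate is a true bad fusion point still has to be decided by searching for matching α and β.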

If a true bad fusion point is found, the following steps can be executed for removal: Let s be the bad fusion point that is to be removed and Σ2 = (C2, I2, T2) be the system resulting from removing s from Σ1 = (C1, I1, T1). Then we duplicate s, naming the new state s′ and adding it to C2, i.e. C2 = C1 ∪ {s′} where s′ is a new state. The set of initial states stays unchanged (I2 = I1). The transitions belonging to α are kept untouched, too. Those belonging to β are diverted to the new state s′. This diversion affects incoming as well as outgoing transitions. So, afterwards, s is not an element of β anymore. Figure 2.10 displays the method. The upper part shows the situation before the algorithm is applied, the lower part shows the altered program.

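Figure 2.10: The graphical illustration of the removal of a bad fusion point. (Diagram: in the upper part, α and β both pass through s; in the lower part, α passes through s and β through the new state s′.)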

Finally, the state projection function π has to be defined to map s′ to s. The algorithm eliminates in general one bad fusion point per execution. In some cases, states have to be extended more than once because they are bad fusion points for several different combinations of traces. Moreover, a dynamic fault model is assumed, i.e. new fault transitions may be added after extending a state, including the new state (changing the program implies the possibility to change the fault model). But since this fault model has to be finite, the repeated application of the algorithm will terminate. Otherwise, if the fault model were not finite, the task of adding fault tolerance would be infeasible. The next lemma summarizes the properties that the above method guarantees.
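One duplication step can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the implementation of this thesis: states are strings, the system is reduced to a set of states and a set of transitions, every transition of β is treated as divertible, and the names `split_state` and `projection` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Transition = Tuple[str, str]


def split_state(
    states: Set[str],
    transitions: Set[Transition],
    projection: Dict[str, str],
    s: str,
    beta: Sequence[str],
) -> Tuple[Set[str], Set[Transition], Dict[str, str], List[str]]:
    """Duplicate s as a fresh state and divert the transitions of beta that touch s."""
    s_new = s + "'"
    while s_new in states:  # make sure the duplicate is a new state
        s_new += "'"

    new_states = set(states) | {s_new}
    new_transitions = set(transitions)
    for src, dst in zip(beta, beta[1:]):
        if s in (src, dst):
            # Divert incoming and outgoing transitions of beta from s to s_new.
            new_transitions.discard((src, dst))
            new_transitions.add((s_new if src == s else src, s_new if dst == s else dst))

    new_projection = dict(projection)
    new_projection[s_new] = projection.get(s, s)  # pi maps the duplicate back to s
    new_beta = [s_new if state == s else state for state in beta]
    return new_states, new_transitions, new_projection, new_beta


# Hypothetical example: alpha = a, c and beta = a, b, c, d meet in state c.
states2, transitions2, pi, beta2 = split_state(
    states={"a", "b", "c", "d"},
    transitions={("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")},
    projection={},
    s="c",
    beta=["a", "b", "c", "d"],
)
```

After the step, β runs through the duplicate c' instead of c, the transition (a, c) of α is untouched, and π maps c' to c. The set of initial states is not modeled because it stays unchanged.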

Lemma 2.3 (GJ Correctness Lemma). Let F be a fault model, SPEC1 be a non-fusion closed specification, and Σ1 be a program such that Σ1 satisfies SPEC1 but F(Σ1) violates SPEC1. The program Σ′2 which results from applying the constructive method described above satisfies the following properties:

1. Σ′2 extends Σ1 using some state projection π and

2. F(Σ′2) does not exploit the non-fusion closure of SPEC1.


It is worth mentioning that we found a fault in the correctness proof of the above lemma given in [GJ03, p. 19 f]. There is a case where GJ does not work. Section 3 gives more details.

Finally, as we want to maintain the properties of the general fail-safe transformation problem, the next lemma points out the correctness of the second part under the assumption of a fusion closed specification.

Lemma 2.4 (Solution of the fusion closed specification problem). Given F, SPEC1, and Σ1 as in Lemma 2.3, let SPEC2 = fusion-closure(SPEC1) and let Σ2 be the result of applying any of the known methods that solve the fusion closed transformation problem (e.g. AK) to Σ′2 with respect to F and SPEC2, where Σ′2 results from Σ1 through the application of the constructive method GJ. Then the following statements hold:

1. Σ2 extends Σ1 using some state projection π.

2. If F(Σ2) satisfies SPEC2 then F(Σ2) satisfies SPEC1.

The proofs belonging to the last two lemmata are given in [GJ03, pp. 20 ff]. As we now have both correctness parts of our fault tolerance method, we summarize the results in the following theorem.

Theorem 2.1. Given a fault model F and a program Σ1 which is F-intolerant with respect to a non-fusion closed specification SPEC1, the composition of the constructive method GJ and the fail-safe transformation methods for fusion closed specifications solves the general transformation problem, i.e. it constructs a program Σ2 such that Σ2 extends Σ1 and F(Σ2) satisfies SPEC1.

2.6 Summary

In this section, we have defined all important terms regarding the topic of fault tolerance synthesis with respect to fusion closure. We have given an introduction to the modeling of programs and their main elements. We have pointed out the meaning of fusion closure and that the exploitation of non-fusion closure is at least as important as non-fusion closure itself.

Finally, we have examined two approaches to adding fault tolerance. On the one hand, the Arora-Kulkarni method assumes that the associated specification is fusion closed. On the other hand, the Gärtner-Jhumka method delivers systems with fusion closed specifications as output while it takes those with non-fusion closed specifications as input. So, we can conclude that both approaches complement each other, and that they are both necessary and sufficient to properly solve the general fail-safe transformation problem.


3 The Fault in GJ

3.1 Introduction

In this section, we describe the fault that we have found in GJ. There is a case where GJ cannot continue synthesizing the program. We have to state that this is not a fatal error. It is rare, and its occurrence is evident, so that the intermediate result does not appear to be the final result.

3.2 The Fault in the Correctness Proof

As there is a fault in the method and the correctness of the method has been proven, there needs to be a mistake in the proof. In this section, we want to describe the mistake.

At step 1.2 in the proof of the first property (cf. [GJ03, p. 20]), it is stated that “every execution σ′ of Σ′2 evolves from an execution of Σ1 by splitting fusion paths and adapting π appropriately. Therefore, under the projection function π both executions look the same.”

Unfortunately, this argument abstracts from the kind of incoming and outgoing transitions. It is just assumed that every transition can be diverted. However, we assumed that fault transitions cannot be diverted. So, there is the implicit assumption that the adjacent transitions are program transitions. It would be preferable to make a case distinction at this point. So, the method itself might not be faulty but potentially incomplete. If one introduced an instruction for what to do in the case of a fault transition, the method would work in all cases.

3.3 An Example

We want to demonstrate the fault case by the following example.

Let us look at Figure 3.1. The specification says “g implies previously (b or e)”. The first bad fusion point c is duplicated. Afterwards, state d is a bad fusion point candidate. We look for applicable α and β and find α = a, c, d and β = a, b, c′, d, f, g. Note that α, β ∈ SPEC but fusion(α, d, β) = a, c, d, f, g ∉ SPEC. The next step is the duplication of state d. The transition (c′, d) ∈ β is diverted to (c′, d′). Finally, one would have to add the transition (d′, f). This, however, is a fault transition that cannot be newly generated.
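The problematic situation can be detected before a state is duplicated. The following Python sketch is our own illustration and not part of GJ: it checks whether the transition of β that leaves the bad fusion point is a fault transition. The function name is hypothetical, and the fault set contains only the one fault transition named in the text; the remaining transitions of Figure 3.1 are not modeled.

```python
from typing import Optional, Sequence, Set, Tuple

Transition = Tuple[str, str]


def blocking_fault_transition(
    beta: Sequence[str],
    s: str,
    fault_transitions: Set[Transition],
) -> Optional[Transition]:
    """Return the transition of beta leaving s if it is a fault transition, else None."""
    for src, dst in zip(beta, beta[1:]):
        if src == s and (src, dst) in fault_transitions:
            return (src, dst)
    return None


# beta from the example above; (d, f) is the fault transition named in the text.
beta = ["a", "b", "c'", "d", "f", "g"]
blocked = blocking_fault_transition(beta, "d", {("d", "f")})
unblocked = blocking_fault_transition(beta, "c'", {("d", "f")})
```

For the bad fusion point d the check returns the transition (d, f), which signals that the duplication step cannot be completed. For c' it returns nothing, since the transition (c', d) leaving it is not in the fault set.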


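Figure 3.1: The graphical illustration of the fault in the GJ correctness lemma. The specification is “g implies previously (b or e)”. (Diagram in three stages: (1) the original program with states a to g; (2) the program after duplicating c, with the new state c′; (3) the program after duplicating d, with the new states c′ and d′.)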


Let us assume for a moment that adding new fault transitions were allowed, but not their redirection or deletion. In this case, we would add a new fault transition (d′, f) (cf. Figure 3.2). The next bad fusion point would be f, with α = a, c, d, f and β = a, b, c′, d′, f, g. After the duplication of f, the fault transition (d′, f) would have to be diverted to (d′, f′). This cannot work.

Figure 3.2: The sequel to the graphical illustration of the fault in the GJ correctness lemma. The specification is “g implies previously (b or e)”. (Diagram in two stages: (1) states a to g with the new states c′ and d′; (2) the same program with the additional new state f′.)

3.4 Summary

The correctness proof of GJ is correct as long as the transition that belongs to the good path β and starts at the bad fusion point is not a fault transition. In the following, we restrict ourselves to this case.


4 Using FTSyn

4.1 Introduction

This section deals with FTSyn, the implementation of the Arora-Kulkarni method that was discussed in Section 2. As outlined above, FTSyn is used as a building block within the Gärtner-Jhumka method. So, FTSyn is equally important for implementing a tool that solves the general fail-safe transformation problem.

We explain the origins of FTSyn and its documentation, and give some examples of its usage. Finally, we cover the deficiencies that we faced in the framework of this thesis.

4.2 The Origins of FTSyn

It is the aim of this section to present the author of FTSyn and the circumstances under which FTSyn was created. Furthermore, we give an overview of its availability and of the documentation that exists.

FTSyn was implemented by Ali Ebnenasir (see [Ebn]), now a postdoctoral researcher at the Software Engineering and Network Systems Laboratory (SENS) at the Computer Science and Engineering Department, Michigan State University. The development of FTSyn was done as part of his PhD thesis (cf. [Ebn05]). In the process, he was advised by Sandeep Kulkarni (see [Kul]), one of the developers of the above-mentioned Arora-Kulkarni method. Ebnenasir describes his work the following way:

Thus far, Dr. Kulkarni and I have been involved in the development of algorithms that add fault tolerance properties to distributed programs. Based on these synthesis algorithms, I have developed an extensible framework for the synthesis of fault tolerant distributed programs from their fault intolerant versions. The framework is extensible in the sense that developers of fault tolerance can easily integrate their own heuristics and algorithms into the framework. [Des]

FTSyn has a project website, see [FTSa], which gives further details and documentation. The source code is available for download upon request. FTSyn was written in the Java programming language. However, the execution of FTSyn is restricted to the Windows operating system since it uses a dynamic link library (dll) as a solver for the Boolean satisfiability problem (SAT) (cf. [SiDMS97]).


Substantial documentation is available on the homepage. There is a technical report (cf. [EK05b]) that discusses the capabilities such as extensibility, estimates the degree of complexity and scalability, and gives an example. Practical aspects are brought in connection with the corresponding theoretical background. Furthermore, there is a user manual (cf. [FTSc]). It mainly consists of program code and the code of an example input file.

The design class hierarchy that belongs to FTSyn is available at [FTSb]. However, only constructors and methods are enumerated. A list of attributes is missing, as is a description of what each method does and which values are returned.

To summarize, the development of FTSyn as an implementation of AK was meant as a proof of concept. It should be considered a finished project whose source code is free for further enhancements.

4.3 Examples

In this section, we present some examples of the usage of FTSyn. The syntax of the input language is explained, and the output is described and verified.

4.3.1 The Input

We first give an account of the general syntax and afterwards present a concrete example.

We assume that a fault intolerant program is given as input to FTSyn. So, the input file always starts with program <program name>. Next, the program variables are declared. The supported data types include Integer (int) and Boolean (boolean). In contrast to the announcement on the manual pages, the declaration of a Boolean variable using the bool type declarator did not work. For every Integer variable that is declared, one needs to indicate a domain of variable values.

The subsequent part denotes the process definition. The process header consists of the keyword process followed by the process name. The actions have the form

< guard > → < statement >

where the guard means the precondition that has to be fulfilled so that the statement can be executed. This form is called guarded commands (cf. [Dij75, Cha88]) after the command, i.e. the statement, that is guarded by a condition. So, in general, the guard encodes a Boolean expression. If this expression evaluates to true, the statement is executed. We assume that the statement denotes an assignment of new values to the variables. So, every guarded command can be taken as a variable update. Remembering that a program state means exactly one assignment of values to variables, one can see that each guarded command encodes a state transition in the program transition graph. Finally, the read and write permissions for the process are declared, i.e. it is denoted which variables may be read and changed by the respective process.
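The relation between guarded commands and state transitions can be illustrated with a short Python sketch. It is not FTSyn code: the guard is modeled as a predicate over a variable valuation, the statement as a function returning the updated variables, and the name `induced_transitions` is hypothetical. The variable stateno with the domain 1 to 8 anticipates the example discussed below.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, int]


def induced_transitions(
    guard: Callable[[State], bool],
    update: Callable[[State], State],
    domains: Dict[str, Sequence[int]],
) -> List[Tuple[State, State]]:
    """Enumerate all state transitions that one guarded command encodes."""
    names = sorted(domains)
    result = []
    for values in product(*(domains[name] for name in names)):
        state = dict(zip(names, values))
        if guard(state):
            # The statement assigns new values; untouched variables keep theirs.
            result.append((state, {**state, **update(state)}))
    return result


# Guarded command (stateno == 1) -> stateno = 2 over the domain 1..8.
transitions = induced_transitions(
    guard=lambda st: st["stateno"] == 1,
    update=lambda st: {"stateno": 2},
    domains={"stateno": range(1, 9)},
)
```

Here the guard holds in exactly one state, so the guarded command encodes the single transition from stateno = 1 to stateno = 2.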


We have assumed that faults can be modeled as transitions. This is the same in FTSyn. One defines the fault transitions in a separate process, but instead of the keyword process, the keyword fault followed by a name is used. The actions are again encoded as guarded commands.

After the notation of the program’s invariant, we come to the specification. Up to now, we have always indicated specifications by Boolean expressions. This has been done for better readability. As one can imagine, it is more difficult to parse a Boolean expression than to parse a set of “bad” transitions. As we already stated in Section 2.4, a fusion closed safety specification can be denoted by such a set of transitions that definitely lead to its violation. The specification part can be divided into two subdivisions. The first subdivision is the destination part. There, those states are listed that should not be reached. The other subdivision is the relation part. This is the place for the transitions that are not allowed to be used. So, if one state should not be reached at all, it would be sensible to denote this in the destination part. If, however, only some of the transitions leading to this state are bad, one would prefer the relation construct.

The final part of the input file is used for the encoding of the initial state(s) of the program.

Let us now give an example. We take Figure 4.1 as the graphical representation. The valid input code for FTSyn of this instance of the fusion closed fail-safe transformation problem can be seen in Figure 4.2.

Figure 4.1: Graphical representation of our example of FTSyn functioning (states a to h). The specification is “never h”.

We named the program “never h” because this is the dedicated specification. First, an Integer variable is declared that models the actual state. Of course, it is more sensible to take Integer values instead of Strings because they can be handled much more easily (apart from the fact that FTSyn does not accept String variables). The domain is restricted to the number of states of our program.

There is one process trans that implements the transitions of the program, e.g. the first guarded command states that if the variable stateno has value 1, then the new value 2 will be assigned to stateno. That is the first program transition (a, b). The process has the permission to read variable stateno and to write to it. Needless to say, it would not work otherwise. These restrictions are meant to prohibit unwanted variable changes. The faults are coded in Malfunction. The notation is the same as in the process clause. Notice that the first two guarded commands have the same guard. The semantics of FTSyn in this case is that the statement which is executed is chosen non-deterministically among all guarded commands whose guards evaluate to true in a given state.


Figure 4.2: The Example from Figure 2.5 as a valid input code for FTSyn.


It is possible that the guard only refers to a subset of the program variables, so that two or more program states and their outgoing transitions may apply. Assume for example another program variable b of type boolean. Then the whole program state space would be duplicated. Each pair of values for (stateno, b) refers to one state. If, however, the guard only refers to b, e.g. (b == 0) -> b = 1, this one guarded command encodes a set of transitions. One can imagine this by introducing a transition from every state of the actual program to the duplicated state, see Figure 4.3. The states at the top are those where b = 0, those at the bottom have b = 1. This one guarded command from above now introduces all transitions from the top to the bottom. Of course, at every point in time only one of them can be executed. So, there is no non-determinism because there is only one guarded command.

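Figure 4.3: Example of introduction of multiple transitions by one guarded command. (Diagram: the states a to h at the top and their duplicates a′ to h′ at the bottom; the introduced transitions lead from each state at the top to its duplicate at the bottom.)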
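The number of transitions that this single guarded command introduces can be checked with a few lines of Python. The sketch assumes the eight values of stateno from our example and encodes the Boolean variable b as 0 and 1, as in the guard above; it only counts state pairs and does not model FTSyn itself.

```python
from itertools import product

STATENO = range(1, 9)  # the eight program states a to h
B = (0, 1)             # the additional Boolean variable b

# Guarded command (b == 0) -> b = 1, evaluated on every pair (stateno, b).
induced = [((n, b), (n, 1)) for n, b in product(STATENO, B) if b == 0]
```

The list contains one transition per value of stateno, i.e. eight transitions in total, each leading from a state with b = 0 to its duplicate with b = 1.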

The invariant of the program defines the accepted range of program variable values. As mentioned above, the subsequent specification consists of two parts. The first part, destination, is used to state that state 8 (or h respectively) should not be reached. The letter d that is appended to the variable name denotes that the value at the destination is meant. In this section of the input file, no other construct makes sense. In the next section, however, the relation is denoted by variable values with s and d (source and destination) attached to the variable name respectively. If we wanted to exclude, for example, that transition (7, 8) is used, we would have to write ((statenos == 7) && (statenod == 8)).
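The difference between the two parts of the specification can be sketched in Python. The sketch does not reproduce the syntax or the internals of FTSyn; it merely models the destination part as a set of forbidden target states and the relation part as a set of forbidden transitions. The function names are hypothetical.

```python
from typing import Sequence, Set, Tuple

Transition = Tuple[int, int]


def violates(
    src: int,
    dst: int,
    bad_destinations: Set[int],
    bad_relations: Set[Transition],
) -> bool:
    """A transition is bad if it enters a forbidden state or is itself forbidden."""
    return dst in bad_destinations or (src, dst) in bad_relations


def transitions_are_safe(
    trace: Sequence[int],
    bad_destinations: Set[int],
    bad_relations: Set[Transition],
) -> bool:
    """Check every transition of a trace against both parts of the specification."""
    return not any(
        violates(src, dst, bad_destinations, bad_relations)
        for src, dst in zip(trace, trace[1:])
    )


# Destination part: state 8 must never be entered, whatever the source state is.
never_h = violates(4, 8, bad_destinations={8}, bad_relations=set())

# Relation part: only the transition (7, 8) is forbidden.
only_7_to_8 = violates(4, 8, bad_destinations=set(), bad_relations={(7, 8)})
```

With the destination part, the transition (4, 8) counts as a violation; with the relation part alone, it does not, while (7, 8) is bad in both variants.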

Finally, the variable initialization with stateno = 1 defines the set of initial states. If more states are meant to be initial states, another block beginning with state and followed by the corresponding variable assignment would have to be added.

4.3.2 The Output

As we have explained what an input file for FTSyn looks like, we now give an idea of the output file. The output is most important, of course, since it represents the final result of the fault tolerance synthesis process.


We do not want to analyze the different ways in which FTSyn obtains the result. We only mention that FTSyn offers two possibilities: an “automatic” and a “semi-automatic” way. If the user chooses the automatic way, all results will be saved in one output file and the process is done completely automatically. In the other case, intermediate results are saved in different files and the synthesis is done step by step. However, these steps have to follow a given order. It is not our aim to analyze the single steps of FTSyn, so we restrict our view to the automatic function.

The output file contains some general information, e.g. timing data, as well as information about the fault intolerant and the fault tolerant version of the program. It is not sensible to comment on the general structure of the file, but we want to exemplify the output now, see Figure 4.4.

The output of Figure 4.4 is not the original output. It is an abbreviated version since the original output includes about two pages of ----------- The list of minterms is empty ---------- lines all over the file. When asked about the reason for and meaning of these lines, Ebnenasir answered that they were meant for debugging. One can examine the original output in Appendix A.

First, the time that FTSyn needed to expand the reachability graph is given (110 ms), followed by the number of states. Afterwards, the fault intolerant program, i.e. the program that was given as input, is displayed. The notation takes some getting used to. Each transition is encoded in three lines. The first line denotes the value of our program variable in the source state, e.g. stateno == 1. The second line consists of a seemingly unordered sequence of braces. Other program variables would be coded there, and in the end, these empty braces remain. So, this line can safely be ignored. The last of the three lines gives us the destination of the transition. It is introduced by a leading arrow (->). The program variable that is assigned a new value is always framed by a set_<variable name>_val<variable value> construct, e.g. set_stateno_val2. So, the reassignment reads more like a command. Note that only program transitions can be found in the file. Unfortunately, there is no way to find the fault transitions, which would be desirable for our purposes.
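Extracting the program transitions from such an output could look roughly like the following Python sketch. It relies only on the two constructs described above, a `<variable> == <value>` source line and a `set_<variable>_val<value>` destination line that starts with an arrow, and it handles a single integer variable. The sample text is made up to match this description; the exact whitespace, the brace line, and the surrounding lines of a real FTSyn output file are assumptions and may differ.

```python
import re
from typing import List, Optional, Tuple

SOURCE = re.compile(r"(\w+)\s*==\s*(\d+)")
TARGET = re.compile(r"->.*?set_(\w+?)_val(\d+)")


def parse_transitions(text: str) -> List[Tuple[int, int]]:
    """Collect (source value, destination value) pairs for one integer variable."""
    transitions = []
    source: Optional[int] = None
    for line in text.splitlines():
        target = TARGET.search(line)
        if target and source is not None:
            transitions.append((source, int(target.group(2))))
            source = None
            continue
        match = SOURCE.search(line)
        if match:
            source = int(match.group(2))
        # All other lines (braces, comment lines) are ignored.
    return transitions


# Hypothetical sample in the spirit of the description, not real FTSyn output.
SAMPLE = """\
stateno == 1
{}{}
-> set_stateno_val2
stateno == 2
{}{}
-> set_stateno_val3
"""
parsed = parse_transitions(SAMPLE)  # [(1, 2), (2, 3)]
```

Ignoring every line that matches neither pattern also skips the debugging lines mentioned above, at the price of silently dropping anything the two patterns do not anticipate.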

As FTSyn continues, it periodically reports the amount of time that it used to compute the next step, e.g. to remove malicious states or transitions, and the number of states that the actual program has. Note that this number can only decrease, since FTSyn, in contrast to FCPre, does not add any states. But some states may become unreachable because the transitions that lead to them are cut off. This is not a fault; in most cases, it is intended. In our example, we know that the final fault tolerant system has five reachable states. Additionally, two program transitions remain, (1, 2) and (2, 3). See Figure 4.5 for a graphical output.

Figure 4.5 results from taking the original program with eight states and adding the two program transitions and the four fault transitions that cannot be diverted or erased. The program now includes the reachable states a, b, c, d and g, or 1, 2, 3, 4 and 7 respectively. The other three states are not reachable and can thus be disregarded. It is important to state that this is the correct result. The specification “never h” is satisfied even in the case of faults, but the system behavior remains unchanged in the fault-free scenario.


Figure 4.4: Output of FTSyn belonging to the Input given in Figure 4.2.


Figure 4.5: The fault tolerant version of the example from Figure 4.1 (states a to h). The specification is “never h”. The transitions (d, e) and (g, h) have been removed by FTSyn.

4.3.3 Summary

The above examples have given an overview of the work with FTSyn. We have seen what a valid input looks like and how the output is printed. The syntax of input and output is very important because FTSyn is used as a black box by other modules of this thesis. Of course, as the source code is generally available, everyone is invited to adjust these things as needed. However, this job is not trivial (especially changing the input parser) and would go beyond the scope of this thesis, in addition to the primary challenge.

Interestingly, the output file has Linux file format although FTSyn was compiled and executed on a Windows operating system. The format affects the coding of line breaks. The comment lines and the format of the transitions make the output very hard to parse.

4.4 Deficiencies of FTSyn

We finally report on some additional deficiencies of FTSyn which added a couple of new challenges to the work of this thesis.

We already mentioned that FTSyn runs only under the Windows operating system. This is not a failure but an unnecessary restriction. Needless to say, we did not repair this drawback because it would have required too much work. There is other software that is bound to a specific platform. The real drawback is that the portability of Java programs was not exploited, even though FTSyn was written in the Java programming language.

Another gratuitous constraint is the following: assume a guarded command A → B. As we explained above, that means that the command B is executed only if precondition A is fulfilled. But B consists of a variable reassignment. Unfortunately, FTSyn only allows |B| = 1, i.e. only one variable can get a new value. As one immediately recognizes, this is a real limitation, since, assuming at least two program variables, a lot of transitions from one program state to another are not possible because the destination is not reachable with one change of value. Because of this issue, our implementation has to “bypass” FTSyn, i.e. much information is stored before FTSyn is executed, the program states are identified by unique state numbers, and after the execution, the complete information of each state is read again to be delivered to the model checker. There is no possibility to change more than one variable at a time, not even by using brackets. Of course, one can argue that one program step can only change one variable value. However, in the synthesis of programs, it might be necessary to store more information in one state, e.g. the history. So, this is a real restriction for automatic fault tolerance synthesis.
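The idea of identifying program states by unique state numbers can be illustrated with a mixed-radix encoding. The following Python sketch is only one possible scheme and is not taken from our implementation; the function names, the variable b, and the numbering starting at 1 are assumptions made for the illustration.

```python
from typing import Dict, List

Domains = Dict[str, List[int]]


def encode(valuation: Dict[str, int], domains: Domains) -> int:
    """Map a valuation of all variables to a unique state number starting at 1."""
    number = 0
    for name, domain in domains.items():
        number = number * len(domain) + domain.index(valuation[name])
    return number + 1


def decode(number: int, domains: Domains) -> Dict[str, int]:
    """Recover the valuation that belongs to a state number."""
    number -= 1
    valuation = {}
    for name, domain in reversed(list(domains.items())):
        number, digit = divmod(number, len(domain))
        valuation[name] = domain[digit]
    return valuation


# Two variables: stateno with eight values and a hypothetical Boolean b.
domains = {"stateno": list(range(1, 9)), "b": [0, 1]}
number = encode({"stateno": 3, "b": 1}, domains)  # 6
restored = decode(number, domains)
```

With such an encoding, a transition that changes several variables at once becomes a change of the single state number, which fits the restriction |B| = 1, and the original valuation can be restored after FTSyn has finished.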

The alternative to tolerating the drawbacks of FTSyn is to correct at least some of them. This, however, is complicated by the inconsistent use of data structures. In particular, the use of the data types Vector and LinkedList changes frequently. It is not recognizable why a particular data structure was used. Approximately the same, or sometimes exactly the same, data is stored in one structure at one time and in the other the next time. That makes it really hard to enhance or correct the code.

Finally, missing or undocumented interfaces constrain the usability. In most cases, one cannot access object attributes through a method. Instead, it is necessary to go into the implementation file and search for the attribute variable that seems plausible. Afterwards, one can change or read it directly, which should not be the favored way, but for lack of well-founded alternatives it is the only one. Regrettably, the methods that are available are not documented at all, i.e. they are listed on the website but not explained. So, one can only guess what the parameters are meant for. The same holds for the returned data: the return type is given, so the syntax can be satisfied, but the meaning of the data is not described. One can guess the meaning from the methods’ names, but this remains a guess.

4.5 Summary

FTSyn is a tool that implements the AK method for solving the fusion closed fail-safe transformation problem. FTSyn was designed as a proof of concept and was not meant for productive use. In this section, we have identified some deficiencies of FTSyn which made it a challenge to incorporate the system into the context of this thesis. However, the output of FTSyn has always been correct, and so it was possible to use it.


5 Model Checking with SPIN

5.1 Introduction

In this chapter, we describe the step that follows the automatic addition of fault tolerance: the practical proof of correctness. This is accomplished by giving the resulting program, which should be fault tolerant, as input to the model checker SPIN (Simple Promela INterpreter). SPIN serves as an additional back end to the synthesis procedure to increase confidence in the correctness of the transformation.

In the following, we define the term Model Checking and explain how it is used in this case. Afterwards, we describe the model checker SPIN, which we consider best suited for our purposes.

5.2 Model Checking

This section gives an overview of model checking. We explain what it is generally used for. We start with the definition of Model Checking.

Model Checking is an automatic technique for verifying finite-state reactive systems, such as sequential circuit designs and communication protocols. Specifications are expressed in temporal logic, and the reactive system is modeled as a state-transition graph. An efficient search procedure is used to determine whether or not the state-transition graph satisfies the specification (see [CGL94] for more details).

A model checker does not contribute to the synthesis itself. It just checks whether the given system is a model of the given specification. So, the use of a model checker is restricted to the verification of a design, e.g. the correctness proof of a system that was developed from scratch. The output of a model checker is restricted to “correct” and “not correct”. If the input system is not correct, this verdict is sometimes supplemented by an example of a violation of the specification. Hence, a model checker cannot be considered a productive part of development; it is responsible for assessing quality and reliability.

As an example, assume that a new computer chip has been designed in VHDL (Very High Speed Integrated Circuit Hardware Description Language, or VHSIC Hardware Description Language for short). VHDL is a hardware description language, comparable to a programming language, that enables the engineer to easily design complex digital systems. VHDL is based on a high abstraction level. This way, it abstracts from concrete electronic components and describes the desired behavior of the final circuit.

In addition to the designated functionality of the circuit, a given specification has to be met. This specification may contain safety as well as liveness properties (cf. Section 2.2.2). One should not confuse the functionality with the specification. The functionality defines what we want the circuit to do, e.g. “realize the addition of two binary numbers”. The specification instead denotes what has to happen or must not happen when the circuit is in use, e.g. “give an output after finite time” or “never compute a result smaller than zero if both numbers were positive”.

So, after the hardware engineer has finished the design of the circuit, the functionality as well as the specification have to be tested. In VHDL, there is a testing routine for the functionality, called behavioral simulation. After this simulation is done, the next step consists of the circuit arrangement. Finally, this configuration is tested for compliance with the specification, the so-called post-fit simulation. This last step is done by a model checker.

This example makes clear what we expect of a model checker. In the automatic addition of fault tolerance, it is useful for checking the resulting program for specification violations. If the method, or more precisely its implementation, is correct, the model checker will never find a violation.

To summarize, the use of the model checker is meant to increase the trust of the user in the software. Once the model checker confirms correctness, the final result is provably correct. The application of the external tool FTSyn (cf. Section 4) requires the use of a model checker in order to still be able to guarantee correctness.

5.3 SPIN

We focus on the model checker SPIN in this section. We explain why we consider SPIN to be the most suitable model checker for our purposes. Furthermore, some examples of input and output are given, and the usage of SPIN is explained.

5.3.1 Overview of SPIN

In this section, the properties of SPIN are explained. We point out its advantages and the reasons for our choice.

The model checker SPIN is used to verify the final results of FTSyn after each synthesis. To this end, FTSyn’s output, which is available as guarded commands, has to be transformed to PROMELA (PROcess MEta LAnguage) (cf. [PRO]), the input language of SPIN.

The use of SPIN is visualized in Figure 5.1. SPIN has to interact with FTSyn in a certain way, i.e. SPIN takes the output of FTSyn and finally decides about the correctness of the overall method. So, SPIN is as important as the rest of the synthesis elements. Beyond the meaning of each element in the production string, one should not forget the conversion of formats: a mistake in the transformation from FTSyn to SPIN is as bad as a mistake in one of the tools. We will go into more detail in Section 7. However, we can state that guarded commands (used in FTSyn) and PROMELA (used by SPIN, see below) have some similarities. This is certainly a great advantage.

[Figure: Σ → FCPre → Σ′ → FTSyn → Σ′′ → SPIN → “correct” / “not correct”]

Figure 5.1: Visualization of Synthesis String ending with SPIN.

Furthermore, SPIN has the advantage of portability across a large number of platforms, namely Unix, Linux, cygwin, Plan9, Inferno, Solaris, Mac, and Windows. Of course, one might argue that this is not necessary here, but a higher degree of portability is a quality feature of this tool and never a downside. Since FTSyn and FCPre are written in the Java programming language, which also provides a certain portability through its virtual machine, the portability of SPIN may become a big advantage.

Thirdly, very good documentation of SPIN is available (see [Spia]). A good tool that is not explained in any way is of little use, whereas an amply documented tool is much more valuable, especially if it is only used from the command line. The input language PROMELA is described in detail, as are the runtime options that can be used to adjust the method of model checking executed by SPIN. The methods are completely understood and well-founded.

Finally, the flexibility of the runtime options is another criterion for the choice of SPIN. One can choose between random and interactive, i.e. user-guided, simulation as well as exhaustive and partial proof techniques based on depth-first or breadth-first search. In the following, we concentrate on the exhaustive depth-first search since it meets our purposes best. The working of this method is visualized in Figure 5.2.

Assume that the initial program state is state 0. We abstract from program steps with only one decision branch, so there are several possible branches at every step. Depth-first search means that SPIN first goes deep and only afterwards takes the alternative paths in breadth. In our example, SPIN would go 0 → 0.0 → 0.0.0 . . . When it arrives at the “bottom”, it backtracks and takes the next possible way, i.e. when it returns to 0.0, it goes to 0.0.1 and so on. The exhaustive part ensures that SPIN keeps on searching until either a specification violation is found or all paths have been simulated, which means that there is no violation in the given system.

[Figure: tree with root 0; children of 0: 0.0 and 0.1; children of 0.0: 0.0.0, 0.0.1, 0.0.2, . . .]

Figure 5.2: Program Tree visualizing depth-first search.
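The exhaustive depth-first strategy described above can be sketched in a few lines. The following Python fragment is a simplified illustration and not SPIN's actual algorithm or data structures; the function name, the dictionary encoding of the tree, and the violation predicate are assumptions of this sketch. It visits the states in depth-first order and stops either at the first violating state or when all reachable states have been explored.

```python
def depth_first_search(initial, successors, is_violation):
    """Return (trail, visiting order); trail is None if no violation is reachable."""
    visited_order = []
    seen = set()
    stack = [[initial]]                  # each entry is a path from the initial state
    while stack:
        path = stack.pop()
        state = path[-1]
        if state in seen:
            continue
        seen.add(state)
        visited_order.append(state)
        if is_violation(state):
            return path, visited_order   # comparable to a trail
        # Push the branches in reverse so that the first branch is explored first.
        for nxt in reversed(successors.get(state, [])):
            stack.append(path + [nxt])
    return None, visited_order           # all paths explored, no violation


# The tree of the example: 0 branches to 0.0 and 0.1, and 0.0 has three children.
tree = {"0": ["0.0", "0.1"], "0.0": ["0.0.0", "0.0.1", "0.0.2"]}
trail, order = depth_first_search("0", tree, lambda state: state == "0.1")
```

In this run, the search first descends into the subtree below 0.0, backtracks, and only then reaches the (hypothetically) violating state 0.1.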

The use of a model checker serves as a practical proof of correctness. The model checker proves the correctness of the implementations for each solution. For this reason, an exhaustive search for specification violations is chosen by default.

5.3.2 Examples

In this section, we give some examples of the usage of SPIN as well as of its input and output. The input language PROMELA is presented, and we demonstrate the use of the model checker.

We describe the syntax and two different ways to give input to SPIN. They differ in the coding of the specification. The syntax is described in general and illustrated by examples. We limit the description to the elements used in this work.

5.3.2.1 Input

When we make use of SPIN, we have to start with coding the input. It is very similar to the input of the synthesis systems. There is a set of legal procedures and an additional one to model the fault. Of course, the occurrence of the fault is bound to a state.

In the input file, variables have to be declared first. The types of interest are bool and byte. bool declares a boolean variable, i.e. 0 or 1; byte denotes an unsigned character, i.e. the range 0 . . . 255.

The second part of the input consists of the declaration of processes. These are equivalent to those previously declared in FTSyn. An additional process is designed to play the role of the malfunction. There is no special construct in PROMELA for this.


Finally, the initialization is responsible for giving initial values to the variables and for starting the processes. The interaction of the processes is afterwards simulated by SPIN using exhaustive depth-first search.

An example is given in Figure 5.3. It is the PROMELA coding of the example given in Figure 2.5 with the specification “never h”. Note that the states a, b, c, . . . , h are renamed to the dedicated state numbers 1, 2, 3, . . . , 8 respectively.

Figure 5.3: Example of a PROMELA input using active process.

PROMELA offers a special process type called “active”. This keyword makes the process run from the point of declaration on, i.e. it is started before the initialization. The specification process is declared active. This ensures that the specification is checked at every point in time, e.g. before the first normal process starts.

Furthermore, there are assertions that can be used to check boolean conditions. If an assertion is violated on a program path, an error is raised. That is what we expect from a supervisor (specification) process. So, the specification process regularly checks the specified violation condition and reaches the assert(false) part if it matches. The reason for this construct is mainly performance, since this way fewer redundant states are created by SPIN than by using assert(<safety specification>).

The timeout condition is inserted because SPIN reported errors otherwise. Note that for model checking we have to make use of history variables in the case of non-fusion closed specifications, but this is a weakness of the correctness proof by means of a model checker, not of our method.

5.3.2.2 Usage

We assume a cygwin (see [cyg]) environment for the execution of SPIN (cf. Section 7.3). In the first step, a verifier is produced by executing SPIN with

spin -a <filename>.prom .

This verifier is written in automatically generated C code. The execution of this command delivers a set of files, all named pan with different file extensions. One of these is pan.c, the verifier generated by SPIN as mentioned above. The executable we derive by compilation with

gcc -c -mno-cygwin pan.c

gcc -o pan.exe -mno-cygwin pan.o

searches for violating program paths. If one is found, it is written to a so-called trail file. An example of the execution of the verifier by

./pan.exe

is given in Figure 5.4.

5.3.2.3 Output

The trail file resulting from a violation can be evaluated by SPIN. The appearance of such an evaluation using

spin -t -p <filename>.prom

is depicted in Figure 5.5. It is possible to refine the displayed information by command line options. We chose the option “-p”, which is why the executed statements are printed. One can observe from the values of the stateno variable that the path that led to the violation is 1, 2, 7, 8.

5.3.2.4 Using “never claims”

The next example that we deal with is nearly the same as the one above. It only differs in the coding of the specification. This second method makes use of two input files. The first file contains the system itself, i.e. the variable declarations, the process declarations including the fault process, and the initialization. The second file consists of the specification coded as a so-called never claim, i.e. a formula that states what should never happen in the execution of this program.

Figure 5.4: Example of an output of SPIN using the active process method.

Figure 5.5: Example of a fault found by SPIN using active process method.

Figure 5.6: Example of the system coding using a never claim. This part contains the processes, the faults, and the initialization. It is stored in a file ending with “.prom”.

Consider Figure 5.6 for the example of the system coding. It is nearly the same as the code in the previous example. The specification is removed, as are the “timeout” actions. The latter are no longer necessary when we make use of never claims. In fact, this is one reason why the never claim alternative was chosen for the development of this work. One should not forget that the introduction of the timeout conditions inserts new program states. Assuming an exhaustive search, these states do not change SPIN's decision on whether the given system is correct; however, the original system and the system that was given to SPIN differ slightly.

The remaining part of the correctness problem, the specification, is shown in Figure 5.7. At the beginning, it is possible to define so-called macros. Equality and inequality conditions have to be defined in these macros. Each macro gets a name so that it can be referenced. Furthermore, binary operators like conjunction and disjunction can be used to link macros as well as other boolean conditions.

After the macros have been defined, the never claim has to be written. It is advisable to let SPIN generate this claim automatically and just give it the LTL (Linear Temporal Logic) formula that should be expressed by the never claim. The command to generate such a claim is the following:

spin -f '<LTL-formula>' >> <filename>.ltl


Figure 5.7: Example of the specification coding using a never claim. It contains the never claim and the definition of a clause (macro). This is stored in a file ending with “.ltl”.

The formula may contain the implication operator ->, the negation !, the unary operators [] (globally) and <> (eventually) as well as the binary operators && (AND) and || (OR) and all the macros that were defined above. Generating the claim with the SPIN routine ensures that the claim is stutter invariant, i.e. the number of steps that were done by the program does not matter, which is important when considering liveness properties. Besides, the convenience of being able to use LTL formulas is reason enough.

The coded formula is always included as a comment after the beginning of the never claim. It is important to know that the never claim causes an error when it is fulfilled. The formula !([](!s1)) was composed in the following way: First, we have to remember that we never want s1 = (stateno == 8) to be true. The appropriate way is to write down the positive formula and to negate it at the end. So, we want that at every point in time (globally) s1 does not hold. This results in [](!s1). The final negation leads to the formula that was given to SPIN.

In fact, taking just s1 as the LTL formula that one never wants does not deliver the correct result. The reason is that SPIN generates a Büchi automaton for every declared process. In each round, it is decided fairly which automaton may take the next step.

“In effect, when a never claim is present, the system and the claim execute in lockstep. That is, we can think of system execution as always consisting of a pair of transitions: one in the claim and one in the system, with the second transition coming from any one of the active processes. The claim automaton always executes first. If the claim automaton does not have any executable transitions, no further move is possible, and the search along this path stops. The search will then backtrack so that other executions can be explored.” [Spib]

The never claim for s1 is exactly the same as the one for our formula above, with the only difference that the line :: (1) -> goto T0_init is missing. So, the never automaton could not take a step since s1 was not true, and the search finished immediately.
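The effect of the missing line can be reproduced with a strongly simplified model of the lockstep execution. The following Python sketch is not SPIN and not the content of Figure 5.7; the transition graph is invented so that it merely contains the two violating paths mentioned in the text, and the state name accept_all as well as all function names are assumptions of this illustration. The claim is a small automaton whose transitions are guarded by conditions on the current system state; the claim moves first, then the system moves.

```python
def claim_matches(claim, system, sys_init, claim_init="T0_init", accept="accept_all"):
    """Explore claim and system in lockstep; True iff the claim can reach `accept`."""
    stack = [(claim_init, sys_init)]
    seen = set()
    while stack:
        claim_state, sys_state = stack.pop()
        if (claim_state, sys_state) in seen:
            continue
        seen.add((claim_state, sys_state))
        # The claim automaton executes first and looks at the current system state.
        for guard, target in claim[claim_state]:
            if not guard(sys_state):
                continue
            if target == accept:
                return True               # the claim is fulfilled: a violation
            # Afterwards, one transition of the system is taken.
            for nxt in system.get(sys_state, []):
                stack.append((target, nxt))
        # If no claim transition was executable, the search along this path stops.
    return False


def s1(stateno):
    return stateno == 8


def always(stateno):
    return True


# Claim for !([](!s1)): wait in T0_init until s1 holds.
claim_with_loop = {"T0_init": [(s1, "accept_all"), (always, "T0_init")]}
# Claim for just s1: the line ":: (1) -> goto T0_init" is missing.
claim_without_loop = {"T0_init": [(s1, "accept_all")]}

# Hypothetical system containing the paths 1, 2, 7, 8 and 1, 2, 4, 5, 6, 8.
system = {1: [2], 2: [7, 4], 7: [8], 4: [5], 5: [6], 6: [8]}

found_with_loop = claim_matches(claim_with_loop, system, 1)
found_without_loop = claim_matches(claim_without_loop, system, 1)
```

With the self-loop, the claim can wait until state 8 is reached. Without it, the claim has no executable transition in the initial state, so the search ends at once and no violation is reported.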

It is interesting to see that the second method found the other violating path 1, 2, 4, 5, 6, 8. However, this is not relevant to this work. The main question is whether there is a violating path at all. If there are two or more, the order of their output does not play any role.


The output of SPIN using this second method is comparable to the one we described for the first method. That is why we do not present the output here. It can be examined in Appendix B.

Overall, these examples have given an overview of the usage of SPIN. It has become clear which ways of using SPIN (active process and never claim) were available and why we chose the second one, the never claim.

5.4 Summary

In this chapter, it has been shown that we do not just claim to automatically add fault tolerance to existing programs but that we can prove that we do it correctly. This justification is as important as a theoretical proof for a lemma claimed in a paper.


6 Implementing GJ

6.1 Introduction

In this chapter, we describe the main element of this work: the implementation of GJ. GJ denotes the theory of Gärtner and Jhumka (cf. [GJ04]), while its implementation is denoted by FCPre.

We first repeat some central theoretical aspects (cf. Section 2) before turning to practical issues of the implementation. Finally, instructions on how to use the system and some examples are given.

6.2 Theoretical Fundamentals

The aim is to briefly recapitulate the most important aspects of Section 2 for a better understanding of the practical details.

The overall job that we want to accomplish is solving the general fail-safe transformation problem:

Let F be a fault model and Σ1 a program which is F-intolerant with respect to a general safety specification SPEC1. The general fail-safe transformation problem consists of finding a fault tolerant version of Σ1, i.e., a program Σ2 such that Σ2 extends Σ1 and F(Σ2) satisfies SPEC1.

It is assumed that a tool is given that solves the fusion closed fail-safe transformation problem:

The fusion closed fail-safe transformation problem consists of solving the general fail-safe transformation problem where SPEC1 is fusion closed.

This tool is named FTSyn, and it is based on the theory of Arora and Kulkarni (AK). The aim of FCPre is to close the gap between the general and the fusion closed fail-safe transformation problem, i.e. the input to the general problem is rendered fusion closed so that it can serve as input to the fusion closed problem, which in turn is solved by FTSyn. The context is displayed in Figure 6.1.

A program Σ = (C, I, T) as a possible input consists of the set C of program states, the initial states I ⊆ C, and the set T of transitions between the states. The safety specification SPEC has to be classified as fusion closed or non-fusion closed. It is the special interest of this work to handle non-fusion closed specifications, i.e. the violation of the specification may depend on the path that led to the state. Additionally, it is assumed that the fault model exploits the non-fusion closure property. Otherwise, the specification could be treated as fusion closed, disregarding the non-fusion closed part. The faults are implemented as additional transitions in the system. This is sufficient to violate safety specifications.

[Figure: block diagram with the inputs fault assumption F, general safety SPEC, and program Σ, the stages “Computation of Fusion Closure”, “Bad Fusion Point Elimination”, and “Arora-Kulkarni-Tool”, and the labels F, SPEC′ (fusion closed), Σ′ (without bad fusion points), Σ′′, and Σ′′′]

Figure 6.1: Classification of FCPre and FTSyn in the context of the general fail-safe transformation problem. FCPre computes the fusion closure of the specification and eliminates the bad fusion points. FTSyn is represented by the Arora-Kulkarni-Tool and solves the fusion closed fail-safe transformation problem.

The fault tolerant version Σ2 of the fault intolerant program Σ1 has to be an extension of Σ1 using a state projection function π. Since only an intermediate result is computed by FCPre, this result is named Σ′2. In the framework of the implementation, it is assumed that the fault is static, which is an instance of the former assumption of a finite fault model, i.e. no new fault transitions are added after the fault model has been defined (F(T1) = F(T′2)). Note that this is not a real drawback. If additional faults are desired, they can be modeled after the first synthesis and introduced to the system, and the resulting system can be synthesized again.

The fusion function is important to model the job of FCPre. The definition of a fusion point follows from the definition of the fusion function.

Let s be a state and α = α_pre · s · α_post and β = β_pre · s · β_post be two traces in which s occurs. Then we define

fusion(α, s, β) = α_pre · s · β_post

If fusion(α, s, β) ≠ α and fusion(α, s, β) ≠ β, we call s a fusion point of α and β.
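The definition can be made concrete with a small sketch. The following Python fragment is only an illustration and not part of FCPre; traces are represented as tuples of state names, and as an assumption of this sketch the first occurrence of s in each trace is used to split it.

```python
def fusion(alpha, s, beta):
    """Return alpha_pre · s · beta_post for traces given as tuples of states."""
    i = alpha.index(s)    # alpha_pre = alpha[:i]
    j = beta.index(s)     # beta_post = beta[j + 1:]
    return alpha[:i] + (s,) + beta[j + 1:]


def is_fusion_point(alpha, s, beta):
    """s is a fusion point iff the fused trace differs from both input traces."""
    fused = fusion(alpha, s, beta)
    return fused != alpha and fused != beta


alpha = ("a", "s", "b")
beta = ("c", "s", "d")
fused_ab = fusion(alpha, "s", beta)   # prefix of alpha, suffix of beta
fused_ba = fusion(beta, "s", alpha)   # prefix of beta, suffix of alpha
```

Fusing a trace with itself returns the trace unchanged, so in that case s is not a fusion point.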

This definition is used to combine traces and helps to determine whether non-fusion closure can be exploited or not. For this purpose, however, the additional property of being a bad fusion point is necessary. A bad fusion point is defined as follows.

Let SPEC be a specification, Σ be a system satisfying SPEC, s be a state of Σ, and F a fault model such that F(Σ) violates SPEC. State s is a bad fusion point of Σ for SPEC in the presence of F iff there exist traces α, β ∈ SPEC such that

1. s is a fusion point of α and β,

2. fusion(α, s, β) ∈ F(Σ), and

3. fusion(α, s, β) ∉ SPEC.

Finally, we know that Σ has no bad fusion point for SPEC in the presence of F iff F(Σ) does not exploit the non-fusion closure of SPEC. We can conclude that it is necessary and sufficient to search for bad fusion points of the system and to eliminate them. The latter is done by duplicating the state that has the bad fusion point property and diverting the transitions that do not lead to a specification violation to the new state.
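The search for bad fusion points can be sketched for a finite set of traces. The following Python fragment is a brute-force illustration of the three conditions of the definition and not the algorithm used in FCPre; as assumptions of this sketch, F(Σ) is given as a finite set of traces, SPEC is given as a predicate on traces, and the traces α and β are only drawn from those traces of F(Σ) that satisfy SPEC. The example specification is “b implies previously a”.

```python
def fusion(alpha, s, beta):
    """alpha_pre · s · beta_post, splitting at the first occurrence of s."""
    i, j = alpha.index(s), beta.index(s)
    return alpha[:i] + (s,) + beta[j + 1:]


def bad_fusion_points(faulty_traces, in_spec):
    """Return the states that satisfy the three conditions of the definition."""
    good = [trace for trace in faulty_traces if in_spec(trace)]
    bad = set()
    for alpha in good:
        for beta in good:
            for s in set(alpha) & set(beta):
                fused = fusion(alpha, s, beta)
                if fused == alpha or fused == beta:
                    continue                      # 1. s is not a fusion point
                if fused in faulty_traces and not in_spec(fused):
                    bad.add(s)                    # 2. in F(Sigma), 3. not in SPEC
    return bad


def in_spec(trace):
    """Example specification: state b may only occur after state a."""
    seen_a = False
    for state in trace:
        if state == "b" and not seen_a:
            return False
        seen_a = seen_a or state == "a"
    return True


# Hypothetical traces of F(Sigma); ("c", "s", "b") reaches b without visiting a.
faulty_traces = {("a", "s", "b"), ("c", "s", "d"), ("a", "s", "d"), ("c", "s", "b")}
result = bad_fusion_points(faulty_traces, in_spec)
```

Here the state s is reported because the prefix of ("c", "s", "d") can be fused with the suffix of ("a", "s", "b") into a violating trace of the faulty system. Duplicating s, as described above, would separate the runs that come from a from those that come from c.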

Regarding Figure 6.1, we have described how Σ′ is computed. The fusion closed specification SPEC′ is given as a set of bad transitions, i.e. those transitions that lead to a specification violation.


6.3 Algorithms and Data Structures

In this section, we give an overview of the details of the implementation of GJ. The implementation is named FCPre. We show how the specification SPEC is encoded as well as the fault F and the system Σ. Data structures and algorithms are presented. Finally, we show how FCPre interfaces with FTSyn, which completes the synthesis.

6.3.1 Encoding Programs, Faults, and Specifications

The first thing that we want to consider is the encoding of the program Σ and the fault assumption F. It was already mentioned that the fault assumption is generally modeled as additional transitions. Here, this is the case as well.

We recall that the only difference between FTSyn and FCPre concerning the input is the specification. In both cases, a reachability graph is expanded in which program and fault transitions link the program states. So, we made use of the existing routine of FTSyn to initialize the reachability graph for FCPre. That is why one has to generate an input file that follows the FTSyn syntax. After the graph has been expanded, it is passed to FCPre for further processing. Consequently, the entry in the specification field of this file is irrelevant in our case. It is simply ignored since only the graph is passed on, not the specification.

The encoding of the specification is more interesting. The specification has to be given in an extra file. There, it is encoded as a boolean formula with certain operators. The formula is parsed and afterwards written to a specification tree. Below, we present the operators that may be used to code the specification. An overview with a short description is given in Table 6.1.

Most operators are known as standard boolean functions. However, in this case, at least one special operator is needed, the previously operator. As pointed out above, a previously condition is significant for non-fusion closed specifications. A previously condition is in general bound to another criterion. In most cases, the occurrence of an event implies the previous occurrence of another. For example, reaching a program state, or a group of states that share a common criterion, implies that another program state, or some state of a group, has been visited previously. This way, it is possible to assert that program states are visited in the correct order or that some states are visited at all. One can imagine the following case. Assume a program with a state b where variable x is used for the first time, i.e. it occurs on the right side of an assignment, e.g. y = x+1. Then, one wants to ensure that this variable has been correctly initialized. The program state in which variable x is initialized is denoted by a, e.g. int x = 0. So, the specification could say “b implies previously a”, or “(b) -> (~(a))” for short. Let us assume a path α and the specification from above. Then

\[
\alpha \cdot b \;
\begin{cases}
\in SPEC, & \text{if } a \in \alpha \text{, i.e. } \alpha = \alpha' \cdot a \cdot \alpha'' \text{ for some traces } \alpha', \alpha'' \\
\notin SPEC, & \text{otherwise}
\end{cases}
\]


Table 6.1: Valid Operators to encode the Specification for FCPre.

Operator Name     Symbol  Comment                                                                         Example
equals            ==      Returns true iff both sides have the same value                                 x == 3
unequal           !=      Returns true iff both sides have different values                               x != 3
greater than      >       Returns true iff left side is greater than right side                           x > 3
less than         <       Returns true iff left side is less than right side                              x < 3
greater or equal  >=      Returns true iff left side is greater than or at least equal to the right side  x >= 3
less or equal     <=      Returns true iff left side is less than or at most equal to the right side      x <= 3
not               !       Returns true iff the following argument returns false                           !(x == 3)
previously        ~       Returns true iff anywhere in the past the following argument has been true      ~(x == 3)
and               &&      Returns true iff both sides return true                                         (x == 3)&&(~(x == 2))
or                ||      Returns true iff at least one of the sides returns true                         (x == 3)||(x == 2)
implies           ->      Returns true iff the left side is false or the right side is true               (x == 3)->(~(x == 2))


Consider the grammar in Table 6.2 for a formal description of the specification syntax. Note that the semicolon at the end and the correct placement of brackets are important for the correct parsing of the specification file.

Table 6.2: The Syntax of Specification Formula for FCPre.

Expression         −→  BooleanExpression ;
BooleanExpression  −→  !(BooleanExpression)
                    |  ~(BooleanExpression)
                    |  (BooleanExpression) && (BooleanExpression)
                    |  (BooleanExpression) || (BooleanExpression)
                    |  (BooleanExpression) -> (BooleanExpression)
                    |  Comparison
Comparison         −→  IDENTIFIER == INTEGERLITERAL
                    |  IDENTIFIER != INTEGERLITERAL
                    |  IDENTIFIER >= INTEGERLITERAL
                    |  IDENTIFIER <= INTEGERLITERAL
                    |  IDENTIFIER > INTEGERLITERAL
                    |  IDENTIFIER < INTEGERLITERAL
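To illustrate how a formula of this grammar can be read, the following Python fragment sketches a small recursive descent parser. It is not the parser of FCPre, which is written in Java; the token pattern, the class and function names, and the tuple representation of the result are assumptions of this sketch, as are the use of the ASCII tilde for the previously operator and the restriction to unsigned integer literals.

```python
import re

# Tokens: two-character operators first, then single characters, names, numbers.
TOKEN = re.compile(r"\s*(->|&&|\|\||==|!=|>=|<=|[()!~<>;]|[A-Za-z_]\w*|\d+)")
COMPARATORS = {"==", "!=", ">=", "<=", ">", "<"}
BINARY = {"&&", "||", "->"}


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise ValueError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def bracketed(self):
        self.expect("(")
        node = self.boolean_expression()
        self.expect(")")
        return node

    def boolean_expression(self):
        token = self.peek()
        if token in ("!", "~"):                  # unary operator
            self.pos += 1
            return (token, self.bracketed())
        if token == "(":                         # (left) operator (right)
            left = self.bracketed()
            operator = self.peek()
            if operator not in BINARY:
                raise ValueError(f"expected binary operator, found {operator!r}")
            self.pos += 1
            return (operator, left, self.bracketed())
        return self.comparison()

    def comparison(self):
        part = self.tokens[self.pos:self.pos + 3]
        if (len(part) != 3 or not re.fullmatch(r"[A-Za-z_]\w*", part[0])
                or part[1] not in COMPARATORS or not part[2].isdigit()):
            raise ValueError(f"malformed comparison near token {self.pos}")
        self.pos += 3
        return (part[1], part[0], int(part[2]))


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.boolean_expression()
    parser.expect(";")                           # the final semicolon is mandatory
    if parser.peek() is not None:
        raise ValueError("input after the final semicolon")
    return tree


tree = parse("(stateno == 2) -> (~(stateno == 1));")
```

Because every operand of a unary or binary operator is bracketed, one token of lookahead is enough to decide which production applies. This is also why a missing bracket or semicolon makes the input unparsable.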

After the specification is read from the input file, it is stored in a data structure that has the form of a (binary) tree. Assuming that the complete specification can be divided into two subclauses connected by a binary operator, the operator becomes the root node and the subclauses are placed in the left and right child nodes, respectively. A comparison is stored in the same way, except that the left and the right children are then leaf nodes. In the case of a unary operator, only the left child is used. In summary, inner nodes are always operators, while the identifiers and the integer values form the leaf nodes.

For better illustration, let us examine an example. We recall the specification from above. Let states a and b be denoted by state numbers 1 and 2, respectively. Then, we obtain the specification “(state number == 2) -> (~(state number == 1))”. The corresponding specification tree is the one in Figure 6.2. Observe that the implication operator is placed in the root node because it is the top-level operator with respect to the brackets.

The previously operator is implemented in the following way. During the simulation of a run, the past is recorded. If the specification has to be checked and contains a previously part, the subtree below the previously operator (its left child) is taken as a new tree. This sub-specification is then checked against every state in the past. If at least one state matches, the previously part is fulfilled; otherwise, it is violated. For example, the previously condition could say “previously, state b has been visited”. Of course, there might be a negated condition as well, e.g. “previously, state c has not been visited”. If state c occurs in the past, this part evaluates to “false”.

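[Figure 6.2 shows the tree: the root “->” has the left child “==” (with the leaves stateno and 2) and the right child “~”, whose only child is “==” (with the leaves stateno and 1).]

Figure 6.2: Example of a Specification Tree.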
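To make the evaluation of such a tree concrete, the following Python sketch checks a specification tree against a recorded run. It is an illustration under assumptions and not the code of FCPre: the names Node, holds, and satisfies are hypothetical, a state is modelled as a dictionary of variable values, and the past is assumed to exclude the current state.

```python
import operator
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Node:
    """Tree node: inner nodes carry an operator, leaves an identifier or integer."""
    value: Union[str, int]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


COMPARE = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
           "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def holds(node, state, past):
    """Evaluate `node` in `state`; `past` lists the earlier states, oldest first."""
    op = node.value
    if op in COMPARE:   # comparison: identifier leaf on the left, integer leaf on the right
        return COMPARE[op](state[node.left.value], node.right.value)
    if op == "!":
        return not holds(node.left, state, past)
    if op == "~":       # previously: the subtree must hold in at least one past state
        return any(holds(node.left, old, past[:i]) for i, old in enumerate(past))
    if op == "&&":
        return holds(node.left, state, past) and holds(node.right, state, past)
    if op == "||":
        return holds(node.left, state, past) or holds(node.right, state, past)
    if op == "->":
        return (not holds(node.left, state, past)) or holds(node.right, state, past)
    raise ValueError(f"unknown operator {op!r}")


def satisfies(tree, run):
    """Assumption: a run complies if the formula holds in every state of the run."""
    return all(holds(tree, state, run[:i]) for i, state in enumerate(run))
```

For the tree of Figure 6.2, a run that visits state 1 before state 2 is accepted by this sketch, whereas a run that reaches state 2 without having visited state 1 is rejected.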

6.3.2 Identifying Bad Fusion Points

As we are now able to process non-fusion closed specifications, we go on with the identification of possible bad fusion points. First, all states that satisfy two necessary conditions are marked as “bad fusion point candidates”. The first condition is that there are at least two incoming transitions, one that belongs to the “good” path and one that belongs to the “bad” path. It does not matter whether these transitions are program or fault transitions. However, if all incoming transitions are fault transitions, then there might be no fault tolerant version of the program since all of them will lead to a specification violation. Additionally, none of them can be diverted. On the other hand, it might suffice to eliminate outgoing program transitions to prevent a specification violation, which is not the task of FCPre but of FTSyn. So, either our method would not succeed or this state does not serve as a bad fusion point. Nevertheless, such a state is marked as a candidate at first. As we will see later, it is rejected in a subsequent exact inspection. The second condition is that there is at least one fault transition in the past of one incoming path. We already pointed out that a fault definitely leads to a violation of the specification and that we exclude “harmless” faults. So, we know that one of the paths has to be bad. Of course, the other path has to satisfy the specification, which is checked in the next step.
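The two necessary conditions can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name and the representation of transitions as triples (source, destination, is_fault) are hypothetical, and a fault transition that directly enters a state is counted as lying in the past of that incoming path.

```python
from collections import defaultdict, deque


def bad_fusion_point_candidates(transitions, initial_states):
    """Return the states that fulfil both necessary conditions."""
    incoming = defaultdict(int)
    successors = defaultdict(list)
    for src, dst, is_fault in transitions:
        incoming[dst] += 1
        successors[src].append((dst, is_fault))

    # Explore pairs (state, fault_seen): fault_seen records whether the path
    # taken from an initial state contains at least one fault transition.
    seen = {(state, False) for state in initial_states}
    queue = deque(seen)
    while queue:
        state, fault_seen = queue.popleft()
        for dst, is_fault in successors[state]:
            nxt = (dst, fault_seen or is_fault)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

    # Condition 1: at least two incoming transitions.
    # Condition 2: reachable along a path that contains a fault transition.
    return {state for state, fault_seen in seen
            if fault_seen and incoming[state] >= 2}
```

On a hypothetical toy graph with the program transitions a→b→c→d→e→f and a single fault transition from b to e, only e is returned: f is also reachable after a fault but has just one incoming transition.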

Next, we present the identification of the traces α and β that lead to a definite bad fusion point, see Figure 6.3. Recall that both traces have to be elements of the specification SPEC, the fusion has to be in F(Σ) but not in SPEC, and s is the fusion point of these traces. Furthermore, there is a set of pre-computed bad fusion point candidates. W.l.o.g. we assume that α is the bad trace, i.e. the one that contains the fault transition. So, we first try to find a path β from an initial state to an end state, i.e. a state that has no successor. This path has to satisfy the specification and it must contain s. We take the bad fusion point candidates as s, one after the other. If the algorithm succeeds in finding α and β that match the definition, the bad fusion point candidate is a real bad fusion point. We start the search algorithm for all initial states.

[Figure 6.3 has two panels: (1) the trace β from the initial state i via s to the end state e; (2) the same trace β together with the trace α from i to s.]

Figure 6.3: Illustration of the identification of traces α and β. First, β is identified, (1). Afterwards, α is identified in a backwards search starting from fusion point candidate s, (2). Of course, α starts in state i and leads to s. State i marks an initial state, state e is an end state.

find_traces(DAG F(Σ), bad fusion point candidate s) {
    for all c ∈ I(F(Σ)) do
        find_β(F(Σ), c, <c>, SPEC, s);
    end for
}

So, the following procedure looks for matching traces with β ∈ SPEC and s ∈ β. First, all paths through the directed acyclic graph (DAG) are recorded. If a path l matches the criteria, it is taken as β and the algorithm that finds α is started.

find_β(DAG F(Σ), actual state c, β candidate l, specification SPEC,
       bad fusion point candidate s) {
    if (c has no successor) then
        /* l is β candidate */
        if (l ∈ SPEC) and (s ∈ l) then
            find_α(F(Σ), s, <s>, SPEC, l);
        end if
    else /* c has successor */
        for all successors t of c do
            find_β(F(Σ), t, l · <t>, SPEC, s);
            break if a bad fusion point has been found;
        end for
    end if
}

It is important to break the search after a bad fusion point has been found. Since the procedure is recursive, it might return to that point of execution after a matching state has been found and duplicated. So, the graph has changed, and the collected data could have become inconsistent. The path that has been recorded might not exist any more. The find_traces procedure must be executed for each bad fusion point.

At this point, we have found a specification-compliant path β that includes a bad fusion point candidate s. The part of β after s is used for the fusion. What is still needed is the first part: it suffices to find a path from an initial state to s that satisfies the specification. According to the definition, the fusion of both traces has to violate the specification. The following algorithm is meant to find the trace α. The path is temporarily stored in variable l.

find_α(DAG F(Σ), actual state c, α candidate l, specification SPEC,
       trace β) {
    if (c has no predecessor) then
        /* l is α candidate */
        if (l ∈ SPEC) then
            if (fusion(l, s, β) ∉ SPEC) then
                /* α is found: l */
                /* s is bad fusion point */
                eliminate bad fusion point s with respect to α and β;
            end if
        end if
    else /* c has predecessors */
        for all predecessors t of c do
            find_α(F(Σ), t, <t> · l, SPEC, β);
            break if a bad fusion point has been found;
        end for
    end if
}

The algorithm works in the following way. It starts at the bad fusion point candidate s and considers all paths from an initial state to s by going backwards from s. The continuation of α to an end state would probably cause a specification violation, but this is not important here. What matters is only that α itself does not violate SPEC. So, all criteria are met:

• α, β ∈ SPEC,

• s is a fusion point of α and β,

• fusion(α, s, β) ∈ F(Σ), and

• fusion(α, s, β) ∉ SPEC.
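The search for α and β and the four criteria above can be condensed into the following Python sketch. It is a simplification under assumptions, not the implementation of FCPre: traces are lists of states, the graph is a successor dictionary of a DAG, SPEC is passed as a predicate on traces, and all function names are hypothetical. Instead of eliminating the bad fusion point, the sketch simply returns the first matching pair of traces.

```python
def paths_to_end(succ, path):
    """Yield every extension of `path` that ends in a state without successors."""
    nexts = succ.get(path[-1], [])
    if not nexts:
        yield path
    for t in nexts:
        yield from paths_to_end(succ, path + [t])


def paths_from_initial(pred, initial, path):
    """Yield every path from an initial state to path[-1], built backwards."""
    preds = pred.get(path[0], [])
    if not preds and path[0] in initial:
        yield path
    for t in preds:
        yield from paths_from_initial(pred, initial, [t] + path)


def fusion(alpha, s, beta):
    """alpha ends in s; append the part of beta that follows s."""
    return alpha + beta[beta.index(s) + 1:]


def find_traces(succ, initial, spec, s):
    """Return (alpha, beta) if candidate s is a real bad fusion point, else None."""
    pred = {}
    for src, dsts in succ.items():
        for dst in dsts:
            pred.setdefault(dst, []).append(src)
    for c in initial:
        for beta in paths_to_end(succ, [c]):
            if s not in beta or not spec(beta):
                continue                      # beta must contain s and satisfy SPEC
            for alpha in paths_from_initial(pred, initial, [s]):
                if spec(alpha) and not spec(fusion(alpha, s, beta)):
                    return alpha, beta        # all four criteria are met
    return None
```

With the specification “f implies previously d” and a hypothetical graph a→b→c→d→e→f with an additional transition from b to e, the candidate e is confirmed with α = ⟨a, b, e⟩ and β = ⟨a, b, c, d, e, f⟩, because the fusion ⟨a, b, e, f⟩ reaches f without visiting d.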

6.3.3 Elimination of Bad Fusion Points

Having explained how a bad fusion point is identified, we now present the removal of bad fusion points. The algorithm has already been outlined in Section 2.5. The idea is the separation of both traces α and β. Therefore, a new state s′ is generated as an exact copy of the bad fusion point s. The transitions that belong to β are then diverted to s′. Additionally, the outgoing transition of s that belongs to β is copied with s′ as its source state. Let us assume that the transition (s, t) is part of β. Then, after the removal, (s′, t) is introduced into the system. Note that the transition (s, t) is not immediately removed. However, this transition is now part of the bad trace and will probably be removed in the next step by FTSyn. Regarding the faults, it is important to be able to distinguish the “clone” states s and s′. We explain below how this is done.
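The duplication step can be sketched as follows in Python. The representation is an assumption made for illustration only: states map a state name to its variable values, transitions map a pair (source, destination) to the label "program" or "fault", and the function name is hypothetical.

```python
def eliminate_bad_fusion_point(states, transitions, beta, s, clone):
    """Split the bad fusion point s along the good trace beta.

    Returns new dictionaries; the inputs are left unchanged.
    """
    k = beta.index(s)
    new_states = dict(states)
    new_states[clone] = dict(states[s])          # s' is an exact copy of s
    new_transitions = dict(transitions)
    if k > 0:
        # Divert the incoming transition of beta from s to s'.
        label = new_transitions.pop((beta[k - 1], s))
        new_transitions[(beta[k - 1], clone)] = label
    if k + 1 < len(beta):
        # Copy the outgoing transition (s, t) of beta as (s', t); (s, t) stays.
        t = beta[k + 1]
        new_transitions[(clone, t)] = new_transitions[(s, t)]
    return new_states, new_transitions
```

In this sketch the transition (s, t) remains in the graph, as described above, so that a later step can classify and remove it.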

Now, the computation of bad transitions is explained. Recall that after the removal of all bad fusion points, there are just good or bad transitions. The fusion closed specification can be expressed as a set of bad transitions. The algorithm starts at an initial state and traverses all transitions. If a transition violates the specification, it is added to the set of bad transitions and the path following it is not considered.

find_all_Bad_Transitions(DAG F(Σ), specification SPEC) {
    for all initial states i do
        find_bad_transitions(F(Σ), SPEC, <>, i);
    end for
}

find_bad_transitions(DAG F(Σ), specification SPEC, trace p,
                     actual state s) {
    for all transitions t originating at s do
        p′ = p · <t>;   /* extend a copy so that every transition at s starts from p */
        if (p′ violates SPEC) then
            t is bad transition;
            stop this path;
        else /* SPEC is not violated by t */
            find_bad_transitions(F(Σ), SPEC, p′, destination(t));
        end if
    end for
}
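A compact Python sketch of this traversal is given below. It rests on assumptions: the graph is a successor dictionary of a DAG, a trace is a list of states rather than of transitions, the specification is passed as a predicate violates on traces, and the function name is hypothetical.

```python
def find_all_bad_transitions(succ, initial, violates):
    """Collect the first specification-violating transition on every path."""
    bad = set()

    def visit(path):
        for t in succ.get(path[-1], []):
            extended = path + [t]            # a fresh list for every branch
            if violates(extended):
                bad.add((path[-1], t))       # bad transition: stop this path
            else:
                visit(extended)

    for i in initial:
        visit([i])
    return bad
```

On a hypothetical graph in which e has already been duplicated (a→b→c→d→e′→f on the good path, plus b→e→f on the bad one), the specification “f implies previously d” yields exactly one bad transition, namely (e, f).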

6.3.4 Interfacing with FTSyn

We have presented the marking algorithms of FCPre. The next topic is interfacing with FTSyn. Unfortunately, as presented in Section 4, FTSyn limits the number of variables that may be changed in one step to one. This becomes a problem if a transition needs to change more than one value in one step. In our case, it is important to make duplicated states distinguishable but not different in the meaning of their variable values. The only way to achieve both is to disregard the program variable values and to work just on a unique variable: the state number that is delivered by the graph initialization method of FTSyn. We already mentioned that this method is invoked by FCPre for graph generation. The generation of the FTSyn input has to be adjusted. The program and fault transitions refer to the state number and the specification has to do so as well. However, since the specification is given as a set of bad transitions, this is not a big problem. The major drawback appears at interfacing with the model checker. Obviously, the “original” specification has to be checked and not the fusion closed one. The aim is to make the program satisfy the given specification and not an altered one. As the original specification refers to program variable values, one has to restore the relation between state numbers and these values. This relation is recorded in an extra file and re-read after the execution of FTSyn for the generation of the SPIN input. The details of the SPIN input generation are treated in the next section. There is no method in FTSyn to generate valid FTSyn input, so we implemented such a method ourselves. A description of the FTSyn input was given in Section 4.3.1. There, it was shown that FTSyn supports specifications in terms of transitions.

It is worth mentioning that the state projection function is implemented indirectly. The most important aspect of the state projection function is to ensure that duplicated states are relevant for compliance with the specification, i.e. if a duplicated state is visited, the effect should be the same as if the original state is visited. For example, assume the specification “e implies previously b”. The state projection function guarantees that if e is visited after b′ but b is not visited at all, the specification is satisfied. If e′ is visited, b or an equivalent state has to be visited before. This job is done by construction. Remember that the specification relies on program variables. As mentioned above, the duplicated state has the same values as its original. So, all specification clauses that concern a state s concern its duplicate s′ in the same way. For example, assume a state s with the unique state number 3. The program variable x has the value 4. If s is duplicated, a new state s′ is generated. It gets a unique state number, e.g. 13. The program variable x has the value 4 as in s. So, s and s′ differ in at least one variable value (the state number), but they are treated the same by the specification. The specification can only refer to x because the state number variable is introduced after the definition of the specification.
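The example with state numbers 3 and 13 can be written down as a small Python sketch. The class State and the functions project and duplicate are hypothetical names chosen for illustration; they only show why a specification that reads program variables cannot tell a state from its duplicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    number: int        # unique state number, unknown to the specification
    variables: tuple   # program variables as (name, value) pairs


def project(state):
    """State projection: forget the state number, keep the program variables."""
    return state.variables


def duplicate(state, fresh_number):
    """Create the clone s' with the same variable values and a new state number."""
    return State(fresh_number, state.variables)
```

Here duplicate(State(3, (("x", 4),)), 13) differs from the original state as an object, while both are mapped to the same value by project.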

6.3.5 Summary

We have presented the most important aspects of the implementation of FCPre. One can recognize most elements mentioned in the theoretical part. Other details are implementation specific. As one can imagine, there were difficulties regarding the format transformation and the encoding of the specification. Nevertheless, the combination of FCPre and FTSyn generates fault tolerant systems and works dependably.

6.4 Using the System

The working principles of FCPre have been explained. Next, we explain how FCPre is used. For the moment, we restrict ourselves to the use of FCPre. The use of the overall system including FCPre, FTSyn, and SPIN is explained in the next section.

6.4.1 Examples

This section deals with the usage of FCPre. The aim is to define the input, the execution, and the output. We continue to use the example based on Figure 2.3. The “workflow” of the elements can be seen in Figure 6.4.

[Figure 6.4, workflow diagram:
source.txt → MyParser → Fault.java, InitialStates.java, Invariant.java, OutputGenerator.java, Parameters.java, ProblemSpecific.java, ProgramImplementation.java, SafetySpecification.java, State.java, Tool.java, trans.java
spec.txt → PreParser → spec.txt.ltl, spec.txtpre.ltl (for SPIN) and spec.syn
spec.syn → Preprocessor → states.obj, states.txt, spec.txt.out, debug*, FaultTrans.txt, sourceFTSynIn.txt]

Figure 6.4: Illustration of the workflow in FCPre. MyParser is the graph generator of FTSyn. PreParser parses the non-fusion closed specification. Preprocessor executes the algorithm of GJ.

First, we have to explain the assumed directory structure. In the project root directory (denoted by \ from now on), there need to be two subdirectories, namely FTSyn and Preprocessing. One has to start with the generation of the input file, i.e. the problem description. Since it follows the FTSyn syntax, one can find a detailed manual in Section 4.3.1. For this example, it is given in Appendix C. We continue with the generation of the appropriate reachability graph. As already mentioned, the reachability graph is generated by FTSyn. Therefore, the source file, let us say source.txt, is copied to the folder \FTSyn\InputGen. By the command

java MyParser source.txt

one executes the graph generation routine of FTSyn. After the deletion of possibly existing .class files in \Preprocessing, the eleven files that are output by MyParser plus the source file source.txt are moved to \Preprocessing.

The following operations are done in this directory. The next step consists of parsing the non-fusion closed specification that is, for example, encoded in spec.txt. Generally, this file only contains one line. In our case, this line is “(stateno == 6) -> (~(stateno == 4));”. Remember that the states that were formerly denoted by letters are now denoted by numbers. First, PreParser.java has to be compiled by

javac PreParser.java

Executing the PreParser by typing

java PreParser spec.txt

leads to the generation of some new files. There are two files ending with .ltl. These are needed for the final model checking with SPIN. The more important file for our current purposes is the one ending with .syn, e.g. spec.syn. It contains the specification tree. We will return to it later. Next, all .class files have to be deleted again. Afterwards, the preprocessor of FCPre and one of the files generated by the FTSyn parser are compiled by

javac Preprocessor.java

javac trans.java

Finally, the preprocessor can be executed.

java Preprocessor spec.syn

This program executes the algorithms that we presented in the last section. The results are recorded in several files for further use. The states.obj file retains the correlation of state number and variable values (remember that FTSyn cannot change more than one variable in a single step). This file is not human readable because it is encoded by the Java “object to file” routine. However, the same information is given in states.txt in ASCII format. The first line contains the variables. The following lines list the states, one state per line, with their variable values in the same order.

The file ending with .out describes the program Σ′_2, i.e. the intermediate result after the execution of FCPre. It is written in a human readable manner. An example is given in Appendix C. After the number of states and their enumeration, each state is listed. The notation starts with the (internal) state number (State) followed by the program variables and values (e.g. stateno=1). Finally, the program and fault transitions starting at this state are listed. The destination is given by the state number, which must not be confused with the chosen variable stateno.

The files starting with debug are generated by those parts of FTSyn that were reused for FCPre. The file FaultTrans.txt is used to record the fault transitions of the program for the input of SPIN since FTSyn does not output any information about them. As these transitions do not change, they can be recorded at any point of the synthesis. The file ending with FTSynIn.txt encodes the output graph of FCPre in FTSyn syntax, i.e. it contains the same information as the .out file but in the syntax expected by FTSyn. So, this file is given as input to FTSyn for further processing.

6.5 Summary

This section has shown the major implementation details. It has been made clear how the theoretical model has been implemented and which problems have occurred. The encoding of the input program as well as of the faults and the specification has been presented. Additionally, we have pointed out the difficulties of interfacing with FTSyn, which was never meant for interfacing with a preprocessor.

7 Validating the Implementation

7.1 Introduction

In this section, we discuss interfacing with SPIN. We point out difficulties in converting inputs and outputs of two programs that were not made for interacting. Moreover, we show how the whole system (starting with FCPre, passing FTSyn, and ending with SPIN) can be used. Finally, some examples are presented.

7.2 Interfacing with SPIN

This section deals with those difficulties that occurred when interfacing the fault tolerance synthesis programs, i.e. FCPre and FTSyn, with SPIN.

7.2.1 Deficiencies and Format Casting

As we had no influence on either the output format of FTSyn or the input format of SPIN, one has to convert one into the other, though it might be hard to maintain semantics. The challenge is to convert the output as it is given in Appendix A into a valid PROMELA input as it is given in Figure 5.6 on page 46. Not all needed information is encoded in the FTSyn output file, e.g. the fault transitions and the specification do not appear. This is the reason why additional files have to be generated by FCPre for SPIN. These files contain detailed information about the states in addition to the fault transitions and the specification.

Furthermore, there are deficiencies in the documentation of SPIN. If one makes use of the never claim method, there needs to be a new line at the end of the .prom file, because SPIN concatenates the .prom file with the .ltl file before the problem specific model checker in the C programming language is generated. So, if the last new line is missing, the first line of the second file ends up on the same line as the last line of the first file. This causes a syntax error. SPIN neither seems to be able to insert a line break nor generates a usable error message. This property is not mentioned in the SPIN documentation. Regrettably, error messages of SPIN are generally not very helpful.
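A generator script can guard against the missing line break before SPIN is called. The following Python sketch is not part of FCPre or of the synthesis script; the function name ensure_trailing_newline is hypothetical, and the sketch simply appends a line break to a file that does not end with one.

```python
from pathlib import Path


def ensure_trailing_newline(path):
    """Append a line break if the file does not end with one.

    Returns True if the file was changed, False otherwise.
    """
    file = Path(path)
    data = file.read_bytes()
    if data.endswith(b"\n"):
        return False
    file.write_bytes(data + b"\n")
    return True
```

Applying the function to the .prom file before the model checker is generated leaves a correct file untouched and repairs one whose last line is not terminated.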

Finally, the implication operator -> is indeed documented; however, its usage is restricted to the top-level formula given to SPIN. As already pointed out in Section 5.3.2, it is possible to define some LTL clauses and link them with an identifier. According to the documentation, it is possible to use all operators and variables and even other clause identifiers in these clauses. Yet, the use of the implication operator in the definition of a clause causes parse errors. In the formula that is finally given to SPIN to generate a stutter invariant never claim (cf. Section 5.3.2), this operator may be used without problems. So, the automatic generation of clauses has to stop on the level where the first implication operator occurs. The upper levels of the specification tree have to be coded in one formula and may not be split into clauses. Of course, as this operator is used frequently in non-fusion closed safety specifications, this is a real drawback that has to be dealt with.

7.3 Handling the System

This section gives short instructions on handling the whole system that includes FCPre, FTSyn, and SPIN.

FCPre can be obtained from [FCP]. For the availability of FTSyn, one has to contact Ali Ebnenasir (see [Ebn]). SPIN has been used in the version 4.26. It is freely available from [SPIc]. Cygwin is necessary for uniform platform assumptions. We have used version 1.5.19 of the cygwin1.dll. The software is available for download from [cyg].

Many operations are necessary to make the system work, e.g. compiling source code, executing programs, and moving files. So, a bash shell script is provided that handles nearly all tasks. We explain everything one needs to know to use this script.

Of course, one might ask why a Linux shell script and not a Windows script is used although FTSyn needs the Windows operating system to work. We assume that all the processes run in a Cygwin bash (see [cyg]). Cygwin is a free Linux-like environment for Windows. It provides a command line and a collection of tools. Most Linux tools that are helpful for file management are available for Cygwin. It consists of a DLL (cygwin1.dll) and is extensible by a large number of packages. Cygwin is available for all 32-bit versions of Microsoft Windows, i.e. Windows 95/98/ME/NT/2000/XP/2003. For download and documentation, see [cyg].

Cygwin is the best way to even out the differences between Windows shells, first of all those between Windows 9x/ME on the one hand and Windows NT/2000/XP on the other hand. By the use of Cygwin, we can abstract from these differences and there is one implementation for all systems. Additionally, Cygwin provides a native C compiler, namely gcc. It is needed to compile the model checker that is generated by SPIN. By means of parameters, gcc can be made to generate Windows compliant executable files.

Part of the assumed directory structure was mentioned in the last section. We go into more detail here. We assume \ to be the root directory of the project, i.e. of each synthesis. It is advisable to use different directories for different input programs because generated files might get mixed up and so result in an inconsistent state. Although the provided algorithm is designed to prevent such mistakes, this cannot be guaranteed. Furthermore, there are three subdirectories \FTSyn\, \Preprocessing\, and \Spin\. They contain the files for the respective programs.

The root directory \ contains the script syn.sh that executes the synthesis. As a parameter, it expects a project name. This name can be freely chosen though it should not contain spaces. In the following, we denote this name by “projectname”. The initial file that encodes the fault intolerant program needs to be located in the root directory, too, as well as the file that encodes the non-fusion closed specification. These files have to be named projectname.txt and projectnamePre.txt, respectively. The process name in the program file is assumed to be trans, as has been the case in all examples in this document. The synthesis script can be called by

./syn.sh projectname

in the Cygwin shell where the current working directory needs to be the project root directory. The script will start and work until the execution of FTSyn. As FTSyn accepts no controlling parameters, one has to choose a for automatic execution when prompted. Afterwards, one is asked for a file name to store the output. It is necessary to name this file projectname.out. Finally, FTSyn returns to the start menu where t for termination should be chosen. The rest of the synthesis is done automatically. The last lines of the output in the shell state whether the synthesis was correct or not. If it was not correct, the fault analysis of SPIN is given in a file RESULT.txt in the project root directory \. The complete script code is shown in Figure 7.1.

7.4 Illustrated Examples

We illustrate the functioning of the complete fault tolerance synthesis by the graphical presentation of some examples. The state projection function π is represented by a vertical split, i.e. those states that are situated one below the other are mapped to the same state by π.

7.4.1 First Example

The first example has already been given in Figure 2.3. Here, we give all synthesis steps in Figure 7.2 in order to finish the synthesis of this example. It is taken from [GJ03, p. 15]. There, however, it is not synthesized.

First, bad fusion point e is duplicated, and “good” transitions are diverted by FCPre. The bad transition (e, f) that definitely leads to a specification violation is afterwards eliminated by FTSyn. Finally, SPIN checks the program that is given in (3) and states that there is no specification violation.

7.4.2 Second Example

The next example is given in Figure 7.3. We chose this example because it was treated in [GJ03, p. 23].

Figure 7.1: The shell script that manages the fault tolerance synthesis.

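[Figure 7.2 has three panels (1) to (3) over the states a, b, c, d, e, f. Panels (2) and (3) additionally contain the duplicated state e′, which is placed below e under the projection π.]

Figure 7.2: The first example of a complete fault tolerance synthesis, the sequel to the example of Figure 2.3. The specification is “f implies previously d”.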

There are two bad fusion points, namely c and d. They are removed by FCPre, c first and d afterwards, before both violating transitions (c, d) and (d, e), each violating one part of the specification, are deleted by FTSyn. As a proof of correctness, SPIN verified the fault tolerance property.

In [GJ03], there is one more step: after (3), d is duplicated another time. The transition (c, d) is diverted to (c, d′′), and the new transition (d′′, e) is inserted. This is another error in [GJ03]. FCPre has shown that this duplication should not be made because d is not a bad fusion point in (3).

7.4.3 Third Example

The last example is shown in Figure 7.4. It can be considered as a model of an if-else clause. State b depicts the decision branch. If the condition is fulfilled, the program continues with state c, otherwise with state e. The faults compromise the condition check. This is a novel example not treated elsewhere.

First, FCPre removes bad fusion point e, then the same is done with bad fusion point c. Finally, violating program transitions can be removed. FTSyn eliminates the transitions (c, d) and (e, f). The correctness is again verified by SPIN.

7.5 Summary

This section pointed out the difficulties of interfacing FTSyn and SPIN and how the SPIN input is generated. Furthermore, we explained how the software can be used and exemplified its working. The last examples are typical tasks for FCPre and FTSyn. The synthesis tool works properly on these examples. SPIN verified all results successfully. The final output of the synthesis, i.e. the output of SPIN, can be seen in Figures 7.5, 7.6, and 7.7 for the three examples above, respectively.

The algorithms of FCPre are not optimized for runtime. There may be faster algorithms for these purposes. The algorithms that are used were developed for stability and traceability. However, it takes less than half a minute on an old computer (900 MHz, 256 MB RAM) to synthesize each of the examples presented above. Most of the time is needed to compile the source files with javac. Testing the system on larger examples is part of future work.

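[Figure 7.3 has four panels (1) to (4) over the states a, b, c, d, e. Panel (2) additionally contains the duplicated state c′; panels (3) and (4) contain the duplicated states c′ and d′, placed below their originals under the projection π.]

Figure 7.3: The second example of a complete fault tolerance synthesis. The specification is “(d implies previously b) and (e implies previously c)”.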

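[Figure 7.4 has four panels (1) to (4) over the states a, b, c, d, e, f. Panel (2) additionally contains the duplicated state e′; panels (3) and (4) contain the duplicated states e′ and c′, placed below their originals under the projection π.]

Figure 7.4: The third example of a complete fault tolerance synthesis. The specification is “(d implies previously b) and (f implies previously b)”.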

Figure 7.5: The final output of SPIN after fault tolerance synthesis of the first example.

Figure 7.6: The final output of SPIN after fault tolerance synthesis of the second example.

Figure 7.7: The final output of SPIN after fault tolerance synthesis of the third example.

8 Conclusions and Future Work

8.1 Conclusions

The aim of this thesis was to implement a preprocessor for standard automatic fault tolerance synthesis methods. This preprocessor is meant to extend the set of possible inputs to problems with a non-fusion-closed specification. The resulting software is able to synthesize programs from a given fault model and a general safety specification. In other words, it solves the general fail-safe transformation problem.

The model checker SPIN is successfully embedded in the synthesis process to increase trust in the results. If a fault remains, it is traceable.

About six programs interact in the fault tolerance synthesis process, including the model checking part. These programs are managed by a shell script. There are programs for the synthesis itself as well as for format conversion in between. The script handles the compilation, execution, removal, and moving of files.

Runtime criteria are not considered in this thesis. Operating experience has shown that runtime aspects can be neglected for the examples considered. Memory can also be disregarded as a limiting factor. Recall that FCPre was designed to reduce memory usage in comparison to methods that make use of standard history variables.

FCPre has been developed for research purposes, just like FTSyn. It has never been claimed that FCPre is suitable for commercial or production use. Thus, the code is freely available. In case of interest, please visit the project web site [FCP]. All changes that are made have to be made freely available, too. No guarantee of soundness or completeness is given. Everyone is invited to enhance the code. However, no support is offered beyond the present documentation.

8.2 Future Work

A task for future research would be the consideration of liveness or mixed specifications, i.e. non-masking fault tolerance. If other kinds of specifications are possible, new kinds of faults need to be examined as well. In the case of liveness properties, the removal of a transition might lead to a specification violation.
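The last point can be illustrated with a small sketch. It is not taken from FCPre or FTSyn: the transition system, the state names, and the function `reachable` are hypothetical. Reachability of a goal state is used here as a simplified stand-in for a liveness property, since it is only a necessary condition for “eventually c”.

```python
from collections import deque


def reachable(transitions, start):
    """Return the set of states reachable from `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for source, target in transitions:
            if source == state and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical program: a -> b -> c, plus an unsafe transition b -> x.
program = {("a", "b"), ("b", "c"), ("b", "x")}

# Removing the unsafe transition restores safety ("never reach x")
# and keeps the goal state c reachable.
safe = program - {("b", "x")}
print("x" in reachable(safe, "a"))  # False
print("c" in reachable(safe, "a"))  # True

# Removing b -> c as well keeps safety but makes c unreachable,
# so "eventually c" can no longer hold.
pruned = safe - {("b", "c")}
print("c" in reachable(pruned, "a"))  # False
```

For safety specifications, removing transitions can only remove violating behaviour. For liveness, each removal has to be checked against the progress the program still needs to make.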

Another possible extension is the development of a tool with a graphical user interface (GUI), or at least such an interface for FCPre. However, the graphical representation of graphs of unknown size is not trivial. If they should be planar, the problem becomes NP-complete. The positioning of an unknown number of nodes is the hardest challenge. One approach might be to place a fixed number of nodes, e.g. five, in a row and to use as many rows as needed to display the complete graph. The nodes might then be connected arbitrarily. Yet, this might lead to an unrecognizable illustration of the graph.
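The row-based placement can be sketched as follows. This is a minimal illustration under stated assumptions, not an existing part of FCPre: the function name `grid_layout`, the default of five nodes per row, and the spacing of 100 units are chosen only for the example.

```python
def grid_layout(nodes, per_row=5, spacing=100):
    """Map each node to an (x, y) position, filling rows from left to right."""
    positions = {}
    for index, node in enumerate(nodes):
        row, column = divmod(index, per_row)
        positions[node] = (column * spacing, row * spacing)
    return positions


layout = grid_layout(["a", "b", "c", "d", "e", "f", "g"])
print(layout["e"])  # (400, 0)
print(layout["f"])  # (0, 100)
```

The placement ignores the edges entirely. This keeps it trivial to compute for any number of nodes, but it is also the reason why the resulting drawing may contain many crossings.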

Further, one might implement a translation of C or Java source code to the guarded commands of FTSyn and vice versa. This way, programs written in a high-level language could be synthesized, though the resulting code after re-translation will not be easily readable by humans. The most important aspect is that the applicability would be increased. This would also make it possible to test the entire system on larger examples and to evaluate the scalability of the approach.

A The original Output of FTSyn

This chapter displays the original output of FTSyn for the input that was given in Figure 4.2. An abbreviated version was given in Figure 4.4.

The execution time of reachability graph expansion (msec.) :110
No. of states: 8

******************* The fault-intolerant program *******************
No. of states: 8

******************* The fault-intolerant program *******************

---------- The actions of Process trans ----------

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------( stateno == 1) &&( ) )

-> set_stateno_val2

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------( stateno == 2) &&( ) )

-> set_stateno_val3

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------( stateno == 4) &&( ) )

-> set_stateno_val5

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------( stateno == 7) &&( ) )

-> set_stateno_val8

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------

*************** After identifying ms states ***************
The execution time of Identify_ms (msec.):0

*************** After removing mt transitions ***************
The execution time of Remove_mt (msec.):0
No. of states in the reachability graph: 5

*************** After marking invariant states ***************
The execution time of Mark_Inv (msec.):0

*************** After solving deadlocks ***************
The execution time of solveDeadlock (msec.):0
No. of states: 5

******************* The fault-tolerant program *******************
No. of states: 5

******************* The fault-tolerant program *******************

---------- The actions of Process trans ----------

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------( stateno == 1) &&( ) )

-> set_stateno_val2

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------( stateno == 2) &&( ) )

-> set_stateno_val3

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------

----------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty --------------------- The list of minterms is empty ----------

B The Second Output of SPIN

In chapter 5, the output of SPIN using the active process method was given (cf. Figures 5.4, 5.5). The output of SPIN using the never claim method is shown in Figure B.1.

Figure B.1: Example of the SPIN output using a never claim.

It was generated by the command ./pan.exe -a.

Afterwards, one can examine the program path to the specification violation by

spin -p <filename>.prom .

This output can be seen in Figure B.2.

Figure B.2: Example of the violating program path using a never claim.

C The FTSyn Source Code for the example of Figure 2.3

Figure C.1: The FTSyn Source Code of the program given in Figure 2.3.

Figure C.2: The corresponding output of FCPre.

Bibliography

[AK98a] Anish Arora and Sandeep S. Kulkarni. Component based design of multitolerant systems. IEEE Transactions on Software Engineering, 24(1):63–78, 1998.

[AK98b] Anish Arora and Sandeep S. Kulkarni. Detectors and correctors: A theory of fault-tolerance components. In International Conference on Distributed Computing Systems, pages 436–443, 1998.

[ALR01] A. Avizienis, J. Laprie, and B. Randell. Fundamental concepts of dependability. Technical Report 1145, LAAS-CNRS, April 2001. published in “IEEE Transactions on Dependable and Secure Computing 2004”.

[ALR04] Algirdas Avizienis, Jean-Claude Laprie, and Brian Randell. Dependability and its threats: A taxonomy. Technical report, IFIP Congress Topical Sessions, 2004.

[CDH+00] James C. Corbett, Matthew B. Dwyer, John Hatcliff, Shawn Laubach, Corina S. Pasareanu, Robby, and Hongjun Zheng. Bandera: extracting finite-state models from Java source code. In International Conference on Software Engineering, pages 439–448, 2000.

[CGL94] Edmund M. Clarke, Orna Grumberg, and David E. Long. Model checking and abstraction. ACM Transactions on Programming Languages and Systems, 16(5):1512–1542, September 1994.

[Cha88] K. Mani Chandy. Parallel program design: A foundation. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1988.

[CM01] Muffy Calder and Alice Miller. Using SPIN for feature interaction analysis: A case study. In SPIN ’01: Proceedings of the 8th international SPIN workshop on Model checking of software, pages 143–162. Springer-Verlag New York, Inc., New York, NY, USA, 2001.

[cyg] cygwin main page. http://www.cygwin.com/, accessed: July 25, 2006.

[Des] Ali Ebnenasir Work Description. http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn-desc.htm, accessed: August 8, 2006.

[DG01] Fabrice Derepas and Paul Gastin. Model checking systems of replicated processes with SPIN. In SPIN ’01: Proceedings of the 8th international SPIN workshop on Model checking of software, pages 235–251, New York, NY, USA, 2001. Springer-Verlag New York, Inc.

[Dij75] Edsger W. Dijkstra. Guarded commands, non-determinacy and formal derivation of programs. Comm. ACM, 18(8):453–457, 1975.

[Ebn] Ali Ebnenasir. http://www.cse.msu.edu/~ebnenasi/, accessed: August 8, 2006.

[Ebn02] Ali Ebnenasir. A Framework for the Synthesis of Fault-Tolerant Programs. Michigan State University, Computer Science and Engineering Department, East Lansing, Michigan, 2002.

[Ebn05] Ali Ebnenasir. Automatic Synthesis of Fault Tolerance. PhD thesis, Michigan State University, 2005.

[EK05a] Ali Ebnenasir and Sandeep S. Kulkarni. A framework for automatic synthesis of fault-tolerance. International Journal of Software Tools for Technology Transfer, 2005.

[EK05b] Ali Ebnenasir and Sandeep S. Kulkarni. A framework for automatic synthesis of fault-tolerance. Technical report, Software Engineering and Network Systems Laboratory, Department of Computer Science and Engineering, Michigan State University, East Lansing MI 48824 USA, 2005.

[FCP] Fcpre. http://pi1.informatik.uni-mannheim.de, accessed: August 9, 2006.

[FTSa] Ftsyn. http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm, accessed: August 8, 2006.

[FTSb] Ftsyn design class hierarchy. http://www.cse.msu.edu/~ebnenasi/research/tools/doc/class/overview-tree.html, accessed: August 8, 2006.

[FTSc] Ftsyn user manual. HTML version: http://www.cse.msu.edu/~ebnenasi/research/tools/doc/usermanual/node1.html, accessed: August 8, 2006. PDF version: http://www.cse.msu.edu/~ebnenasi/research/tools/userman.pdf, accessed: August 8, 2006.

[GJ03] Felix C. Gärtner and Arshad Jhumka. Automating the addition of fail-safe fault-tolerance: Beyond fusion-closed specifications. Technical Report IC/2003/23, Swiss Federal Institute of Technology (EPFL), School of Computer and Communication Sciences, April 2003.

[GJ04] Felix Gärtner and Arshad Jhumka. Automating the addition of fail-safe fault-tolerance: Beyond fusion-closed specifications. In Proceedings of Formal Techniques in Real-Time and Fault-Tolerant Systems (FTRTFT), number 3253 in Lecture Notes in Computer Science, Grenoble, France, 2004.

[Gou98] Mohamed G. Gouda. Elements of Network Protocol Design. John Wiley & Sons Inc, 1 edition, 1998.

[GPVW95] Rob Gerth, Doron Peled, Moshe Y. Vardi, and Pierre Wolper. Simple on-the-fly automatic verification of linear temporal logic. In Protocol Specification Testing and Verification, pages 3–18, Warsaw, Poland, 1995. Chapman & Hall.

[Hav99a] K. Havelund, editor. Java PathFinder: A Translator from Java to Promela. Theoretical and Practical Aspects of SPIN Model Checking, 5th and 6th International SPIN Workshops, 1999.

[Hav99b] K. Havelund. Java pathfinder user guide. Technical report, NASA Ames Research Center, USA, 1999.

[Hol97] Gerard J. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23(5):279–295, 1997.

[HP99] Gerard J. Holzmann and Anuj Puri. A minimized automaton representation of reachable states. International Journal on Software Tools for Technology Transfer, 2(3):270–278, 1999.

[HP00] K. Havelund and T. Pressburger. Model checking Java programs using Java pathfinder. International Journal on Software Tools for Technology Transfer, 2(4), April 2000.

[Jos96] M. Joseph, editor. Automating the addition of fault tolerance, In 6th International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems, FTRTFT’00. S. Kulkarni and A. Arora, 1996.

[KE03] Sandeep S. Kulkarni and Ali Ebnenasir. Adding fault-tolerance using pre-synthesized components. Technical Report MSU-CSE-03-28, Department of Computer Science, Michigan State University, East Lansing, Michigan, October 2003.

[Kul] Sandeep Kulkarni. http://www.cse.msu.edu/~sandeep/, accessed: August 8, 2006.

[Lap92] Jean-Claude Laprie, editor. Dependability: Basic Concepts and Terminology in English, French, German, Italian and Japanese. Springer-Verlag, Vienna, 1992.

[Lit] Literate programming. http://www.literateprogramming.com, accessed: June 14, 2006.

[MG04] T. M. McGuire and Mohamed G. Gouda. The Austin Protocol Compiler. Springer, 1 edition, December 2004.

[PRO] Promela language reference. http://spinroot.com/spin/Man/promela.html, accessed: July 12, 2006.

[SiDMS97] NSF Science, Technology Center in Discrete Mathematics, and Theoretical Computer Science. Satisfiability Problem: Theory and Applications. Dimacs Series in Discrete Mathematics and Theoretical Computer Science. Jun Gu and Panos M. Pardalos and Ding-Zhu Du, October 1997. ISBN: 0821804790.

[Spia] Spin documentation. http://spinroot.com/spin/Man/index.html, accessed: July 12, 2006.

[Spib] Spin documentation of never claim. http://spinroot.com/spin/Man/never.html, accessed: July 12, 2006.

[SPIc] Spin. http://spinroot.com/spin/whatispin.html, accessed: July 12, 2006.

[VHBP00a] W. Visser, K. Havelund, G. Brat, and S. Park, editors. Java PathFinder - Second Generation of a Java Model Checker, In Proc. of Post-CAV Workshop on Advances in Verification, Chicago, 2000.

[VHBP00b] W. Visser, K. Havelund, G. Brat, and S. Park, editors. Model checking programs, In International Conference on Automated Software Engineering, 2000.

[VW86] Moshe Y. Vardi and Pierre Wolper, editors. An automata-theoretic approach to automatic program verification, number 1st in Proceedings, Symposium on Logic in Computer Science, Cambridge, Massachusetts, June 1986. IEEE Computer Society. pages 332–344.

[Yah01] E. Yahav, editor. Verifying safety properties of concurrent Java programs using 3-valued logic, Proc. of POPL ’01, 2001.
