Automated Revision of Existing Programs


Software Engineering and Network Systems Laboratory (SENS)

Borzoo Bonakdarpour

Motivation

Question: Is it possible to revise the model automatically so that it satisfies the failed property while preserving the other properties?

[Figure: model-checking workflow with the labels Model, Property, Model Checker, and Counterexample]

Motivation (cont'd)

Question: Is it possible to add a newly discovered property to an existing program automatically?

[Figure: design workflow with the labels Requirements, Designer, Specification, Incomplete Specification, Program, and New Property]

Outline

- What is program revision?
- Adding properties to existing programs
  - Results
  - Example
- Adding fault-tolerance to existing real-time programs
  - Results
  - Example
- Ongoing research
- Open problems and future work

Program Revision

Revision by synthesis:

- From the specification (comprehensive revision)
  - Highly expensive
  - No reusability
- From the existing program plus a new property (local revision)
  - Provides reusability
  - In some cases offers lower classes of time and space complexity
  - Does not need the entire specification of the existing program

[Figure: Revision Algorithm box with the labels Program P and Property]

Our Goal

- Identify classes of interesting properties that are typically used in specifying reactive systems
- Design synthesis methods for which revising existing programs is feasible in both time and space

Question: Why is comprehensive revision highly complex?

Answer: Because of the expressiveness of specifications.

Part I

Adding Properties to Existing Programs

Preliminary Concepts

A program $p$ is a triple $p = \langle S_p, I_p, \delta_p \rangle$, i.e., a finite state space, a set of initial states, and a set of program transitions.

A state predicate is any subset of $S_p$.

A state sequence $\sigma = s_0, s_1, \ldots$ is a computation iff

- $s_0 \in I_p$,
- $\forall i > 0 : (s_{i-1}, s_i) \in \delta_p$, and
- if $\sigma$ terminates in state $s_f$, then there does not exist $s$ such that $(s_f, s) \in \delta_p$.
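The definitions above map onto a small data structure. The following is a minimal Python sketch, assuming states are plain integers; the names `Program`, `is_prefix`, and `is_terminated_computation` are illustrative and do not come from the slides. Since an infinite computation cannot be enumerated, the checks cover finite sequences only.

```python
from typing import FrozenSet, NamedTuple, Sequence, Tuple

State = int
Transition = Tuple[State, State]


class Program(NamedTuple):
    """A program as a triple: state space, initial states, transitions."""
    states: FrozenSet[State]
    init: FrozenSet[State]
    delta: FrozenSet[Transition]


def is_prefix(p: Program, seq: Sequence[State]) -> bool:
    """True if seq starts in an initial state and follows transitions of p."""
    if not seq or seq[0] not in p.init:
        return False
    return all((a, b) in p.delta for a, b in zip(seq, seq[1:]))


def is_terminated_computation(p: Program, seq: Sequence[State]) -> bool:
    """True if seq is a prefix whose last state has no outgoing transition."""
    if not is_prefix(p, seq):
        return False
    last = seq[-1]
    return all(a != last for a, _ in p.delta)


# Hypothetical three-state program: 0 -> 1 -> 2, where 2 is a deadlock state.
EXAMPLE = Program(
    states=frozenset({0, 1, 2}),
    init=frozenset({0}),
    delta=frozenset({(0, 1), (1, 2)}),
)
```

In this example the sequence 0, 1, 2 is a finite computation because state 2 has no successor, whereas 0, 1 is only a prefix.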

Preliminary Concepts (cont'd)

Sample properties:

- Safety
  - $P$ unless $Q$
  - stable($P$)
  - invariant($P$)
- Liveness
  - $P$ leads-to $Q$, written $P \mapsto Q$

[Figure: example state sequences illustrating each property]

Preliminary Concepts (cont'd)

A specification is a conjunction of a set of properties:

$spec = L_1 \wedge L_2 \wedge \ldots \wedge L_n$

A computation $\sigma$ satisfies $spec$ iff $(\forall i \mid 0 < i \le n : \sigma \text{ satisfies } L_i)$.

A program $p$ satisfies $spec$ iff all computations of $p$

1. are infinite, and
2. satisfy $spec$.
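For a finite-state program, both conditions can be checked by reachability. The sketch below is an illustration under two assumptions that are not stated on the slides: the program is given as explicit Python sets, and invariant(P) is read as "P holds in every state of every computation". The function names are hypothetical.

```python
def reachable(init, delta):
    """States reachable from init by following transitions in delta."""
    succs = {}
    for a, b in delta:
        succs.setdefault(a, set()).add(b)
    seen = set(init)
    stack = list(init)
    while stack:
        a = stack.pop()
        for b in succs.get(a, ()):
            if b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def all_computations_infinite(init, delta):
    """Condition 1: no reachable state is a deadlock state."""
    has_successor = {a for a, _ in delta}
    return reachable(init, delta) <= has_successor


def satisfies_invariant(init, delta, pred):
    """Conditions 1 and 2 for spec = invariant(pred), under the assumed reading."""
    return (all_computations_infinite(init, delta)
            and reachable(init, delta) <= set(pred))
```

For example, a program that alternates between states 0 and 1 forever satisfies invariant({0, 1}) but not invariant({0}).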

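Problem Statement

Given a program $p = \langle S_p, I_p, \delta_p \rangle$ that satisfies its existing specification $spec_e$, and a new specification $spec_n$, a synthesis algorithm should produce a program $p' = \langle S_{p'}, I_{p'}, \delta_{p'} \rangle$ such that

- $S_{p'} = S_p$,
- $I_{p'} = I_p$,
- $\delta_{p'} \subseteq \delta_p$, and
- all computations of $p'$ are infinite and satisfy $spec_n$.

[Figure: Synthesis Algorithm with inputs Program $p$ and Specification $spec_n$, and output Program $p'$]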

Adding a Single Leads-to Property ($R \mapsto T$)

Case 1: Deadlock states

Remove transitions $(s_1, s_2)$ if $s_2 \in R$ and $T$ is not reachable from $s_2$.

[Figure: state space $S_p$ with regions $I_p$, $R$, and $T$ and states $s_0$, $s_1$, $s_2$]

Adding a Single Leads-to Property ($R \mapsto T$)

Case 2: Cycles

Break cycles that are reachable from $R$ and do not reach $T$.

[Figure: state space $S_p$ with regions $I_p$, $R$, and $T$ and states $s_0$ through $s_4$]
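The two cases can be combined in a small transition-pruning procedure. The Python sketch below is a simplified illustration, not the algorithm from the talk: it ranks each state by its shortest distance to T, keeps only rank-decreasing transitions in the region reachable from R outside T (which breaks cycles), and then repeatedly removes transitions into deadlock states. It is conservative and may return `None` on inputs that a more careful algorithm could fix. The function name and the explicit-set representation are assumptions.

```python
from collections import deque


def add_leads_to(delta, init, R, T):
    """Return a subset of delta in which R leads-to T holds, or None."""
    init, R, T = set(init), set(R), set(T)
    preds, succs = {}, {}
    for a, b in delta:
        preds.setdefault(b, set()).add(a)
        succs.setdefault(a, set()).add(b)

    # rank[s] = length of a shortest path from s to T (backward BFS).
    rank = {t: 0 for t in T}
    queue = deque(T)
    while queue:
        b = queue.popleft()
        for a in preds.get(b, ()):
            if a not in rank:
                rank[a] = rank[b] + 1
                queue.append(a)

    # region = states outside T that are reachable from R without entering T.
    region = R - T
    stack = list(region)
    while stack:
        a = stack.pop()
        for b in succs.get(a, ()):
            if b not in T and b not in region:
                region.add(b)
                stack.append(b)

    # Inside the region, keep only transitions that move strictly closer to T.
    inf = float("inf")
    kept = {(a, b) for a, b in delta
            if a not in region or rank.get(b, inf) < rank.get(a, inf)}

    # Remove transitions into deadlock states until nothing changes.
    while True:
        live = {a for a, _ in kept}
        pruned = {(a, b) for a, b in kept if b in live}
        if pruned == kept:
            break
        kept = pruned

    live = {a for a, _ in kept}
    return kept if init <= live else None
```

As a usage example, with transitions 0→1, 1→2, 2→1, 2→3, 3→3, 1→4, 4→4, initial state 0, R = {1}, and T = {3}, the sketch removes the cycle edge 2→1 and the edges 1→4 and 4→4 that lead to a state from which T is unreachable.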

Soundness and Completeness

Theorem 1: The algorithm for adding multiple safety properties along with a leads-to property is sound and complete.

[Callout: Fixability]

Theorem 2: The algorithm for adding multiple safety properties along with a leads-to property runs in polynomial time.

Adding Two Leads-to Properties

Adding two leads-to properties one after another.

[Figure: state space $S_p$ with regions $I_p$, $R$, $T$, $P$, and $Q$ and states $s_0$ through $s_9$]

Other Results

- The problem of simultaneously adding two leads-to properties is NP-complete.
- The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete.

Another Problem

Adding two eventually properties:

1. true leads-to $Q$
2. true leads-to $T$

This problem is also NP-complete.

[Figure: regions $Q$ and $T$]

Example: Real-Time Resource Allocation

Two processes $j \in \{1, 2\}$. Each has two tasks to complete (each takes 1 time unit):

- Submitting a request
- Performing an I/O operation

$RQ_j :: req_j \wedge (x = 1) \xrightarrow{\{x\}} io_j, req_j := true, false$

$IO_j :: io_j \wedge (x = 1) \xrightarrow{\{x\}} req_j, io_j := true, false$

$WT :: 0 \le x \le 1 \longrightarrow wait$

Bounded response:

$L \equiv (io_1 \mapsto_{\le 2} req_1)$

[Figure: region graph of the resource allocation example. Each region is labeled with $req_1$ or $io_1$, $req_2$ or $io_2$, and constraints on the clocks $x$ and $t$. Legend: initial region; regions where $io_1$ becomes true; edges participating in a shortest path; edges in additional shortest paths; edges removed during addition; regions made unreachable.]

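Example (cont'd)

$RQ_1 :: req_1 \wedge (x = 1) \xrightarrow{\{x, t\}} io_1, req_1 := true, false$

$IO_1 :: io_1 \wedge (x = 1) \xrightarrow{\{x\}} req_1, io_1 := true, false$

$RQ_2 :: req_2 \wedge (x = 1) \wedge (io_1 \Rightarrow t \le 1) \xrightarrow{\{x\}} io_2, req_2 := true, false$

$IO_2 :: io_2 \wedge (x = 1) \wedge (io_1 \Rightarrow t \le 1) \xrightarrow{\{x\}} req_2, io_2 := true, false$

$WT :: 0 \le x \le 1 \longrightarrow wait$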

Part II

Adding Fault-Tolerance to Existing Programs

The Byzantine Agreement Problem

Variables:

- General: decision $d_g \in \{0, 1\}$
- Non-generals: decisions $d_j, d_k, d_l \in \{0, 1, \bot\}$ and final flags $f_j, f_k, f_l \in \{false, true\}$

Program (for each non-general $j$):

$(d_j = \bot) \wedge (f_j = false) \longrightarrow d_j := d_g$

$(d_j \neq \bot) \wedge (f_j = false) \longrightarrow f_j := true$

The Byzantine Agreement Problem

Byzantine flags: $b_g \in \{false, true\}$ for the general and $b_j, b_k, b_l \in \{false, true\}$ for the non-generals.

Faults:

$(b_j, b_k, b_l, b_g = false) \longrightarrow b_j := true$

$(b_j = true) \longrightarrow d_j := 0|1$

The Byzantine Agreement Problem

Safety specification:

- Agreement: The final decisions of any two non-Byzantine processes cannot differ.
- Validity: If the general is non-Byzantine, then the final decision of every non-Byzantine process must be the same as that of the general.

Fact: The program does not meet the safety specification in the presence of faults.

[Figure: counterexample with the labels $d_g = 0$, $b_g = true$, $d_j = 0$, $f_j = false$, $f_j = true$, $d_k = 1$, $f_k = true$]
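One way to read the counterexample is as an interleaving of program and fault actions. The Python sketch below replays such an interleaving; the order of the steps, the dictionary-based state, and the choice that the Byzantine general changes its decision from 0 to 1 are assumptions made for illustration.

```python
BOT = None  # stands for the undecided value of a non-general


def copy_decision(state, j):
    """Program action: (d_j = BOT) and (f_j = false) -> d_j := d_g."""
    if state["d"][j] is BOT and not state["f"][j]:
        state["d"][j] = state["d"]["g"]


def finalize(state, j):
    """Program action: (d_j != BOT) and (f_j = false) -> f_j := true."""
    if state["d"][j] is not BOT and not state["f"][j]:
        state["f"][j] = True


def general_turns_byzantine(state, new_value):
    """Fault actions: the general becomes Byzantine, then changes d_g."""
    if not any(state["b"].values()):
        state["b"]["g"] = True
    if state["b"]["g"]:
        state["d"]["g"] = new_value


def agreement_holds(state):
    """Agreement: finalized non-Byzantine non-generals share one decision."""
    finals = {state["d"][j] for j in ("j", "k", "l")
              if state["f"][j] and not state["b"][j]}
    return len(finals) <= 1


def run_counterexample():
    state = {
        "d": {"g": 0, "j": BOT, "k": BOT, "l": BOT},
        "f": {"j": False, "k": False, "l": False},
        "b": {"g": False, "j": False, "k": False, "l": False},
    }
    copy_decision(state, "j")           # d_j = 0
    general_turns_byzantine(state, 1)   # b_g = true, d_g = 1
    copy_decision(state, "k")           # d_k = 1
    finalize(state, "j")                # f_j = true
    finalize(state, "k")                # f_k = true
    return state
```

After this run, processes j and k are both non-Byzantine and finalized, yet they hold decisions 0 and 1, so agreement is violated.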

The Byzantine Agreement Problem

Question: Is it possible to revise a distributed program in order to add fault-tolerance with respect to a safety specification and a class of faults?

Answer: Yes, but the problem is NP-complete, which is itself a problem.

Solution: A set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs.

Some Terminology

- Boolean variables: $V = \{v_1, \ldots, v_n\}$
- State: $s = \bigwedge_{0 < j \le n} l(v_j)$, where $l(v_j)$ is a literal
- State predicate: a finite set of states, $S = \bigvee_{0 \le i \le m} s_i$
- Transition: let $V' = \{v' \mid v \in V\}$; a transition is $(s, s') = s \wedge s'$
- Transition predicate: a set of transitions, $P = \bigvee_{0 \le i \le m} (s_i \wedge s'_i)$
- Program: a program is defined by a transition predicate $P$
- Closure: a state predicate $S$ is closed in program $P$ iff $((s \models S) \wedge ((s, s') \models P)) \Rightarrow (s' \models S)$
- Computation of $P$: a sequence of states $s_0, s_1, \ldots$ where $(s_j, s_{j+1}) \models P$
- Safety specification: a set of bad transitions that should not occur in any program computation, defined by a transition predicate $SPEC$
- Satisfaction: a program $P$ satisfies $SPEC$ from $S$ iff
  1. $S$ is closed in $P$, and
  2. for all computations $s_0, s_1, \ldots$ where $s_0 \models S$: $(s_j, s_{j+1}) \not\models SPEC$
- Invariant: if $P$ satisfies $SPEC$ from $S$ and $S$ is nonempty, then $S$ is an invariant of $P$
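Closure and satisfaction can be checked directly when predicates are stored as explicit sets rather than Boolean formulas. The following Python sketch makes that assumption: a state predicate is a set of states and a transition predicate is a set of pairs. The function names are illustrative. Because a closed S keeps every computation that starts in S inside S, condition 2 reduces to checking that no transition of P that starts in S is a bad transition.

```python
def is_closed(S, P):
    """S is closed in P: every transition of P that starts in S ends in S."""
    return all(b in S for a, b in P if a in S)


def satisfies_from(P, SPEC, S):
    """P satisfies SPEC from S: S is closed in P and no bad transition starts in S."""
    if not is_closed(S, P):
        return False
    return not any(a in S and (a, b) in SPEC for a, b in P)


def is_invariant(P, SPEC, S):
    """S is an invariant of P: S is nonempty and P satisfies SPEC from S."""
    return bool(S) and satisfies_from(P, SPEC, S)
```

For example, with P = {(0, 1), (1, 0), (2, 3)} and SPEC = {(2, 3)}, the predicate {0, 1} is an invariant, while {0, 1, 2} is not even closed.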

Preliminaries (cont'd)

Faults: a set $F$ of transitions.

Fault-span: a state predicate $T$ such that

1. $S \subseteq T$, and
2. $T$ is closed in $P \cup F$.

[Figure: finite state space containing the fault-span, which contains the invariant; program transitions are labeled p and fault transitions are labeled f]

Fault-tolerance: a program $P$ is $F$-tolerant to $SPEC$ from $S$ if

1. $P$ satisfies $SPEC$ from $S$, and
2. there exists $T$ such that
   - $T$ is an $F$-span of $P$,
   - $P \cup F$ satisfies $SPEC$ from $T$ (safety), and
   - after faults stop occurring, every computation of $P$ reaches $S$ (liveness).
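The smallest fault-span is the set of states reachable from the invariant under program and fault transitions together. The Python sketch below computes it and checks only the safety part of the definition; the liveness part (recovery to S) is not covered. Explicit sets and the function names are assumptions. Checking the smallest span is enough for the safety condition, since a bad transition that starts in it also starts in every larger span.

```python
def fault_span(S, P, F):
    """Smallest T that contains S and is closed in the union of P and F."""
    succs = {}
    for a, b in set(P) | set(F):
        succs.setdefault(a, set()).add(b)
    T = set(S)
    stack = list(S)
    while stack:
        a = stack.pop()
        for b in succs.get(a, ()):
            if b not in T:
                T.add(b)
                stack.append(b)
    return T


def safe_under_faults(S, P, F, SPEC):
    """True if no bad transition of P or F starts inside the smallest fault-span."""
    T = fault_span(S, P, F)
    return not any(a in T and (a, b) in SPEC for a, b in set(P) | set(F))
```

For example, with S = {0, 1}, P = {(0, 1), (1, 0), (2, 3)}, F = {(1, 2)}, and SPEC = {(2, 3)}, the fault takes the program to state 2, from which the bad transition is enabled. Replacing (2, 3) with a recovery transition (2, 0) makes the check pass.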

Levels of Fault-Tolerance [3]

In the presence of faults:

- A failsafe program satisfies the safety specification.
- A nonmasking program satisfies the liveness specification.
- A masking program satisfies both the safety and liveness specifications.

In the absence of faults, the program must continue to satisfy its entire specification.

Question: Are these levels able to capture the requirements for modeling fault-tolerant real-time programs?

[3] A. Arora and S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. TSE, 1998.

Fault-Tolerant Real-Time Programs

Railroad crossing:

- Safety: If the train is passing the crossing, the gate must be closed.
- Bounded response: Once the gate is closed, it should reopen within 4 min.
- Fault-tolerance: The gate must be closed while the train is passing the intersection, even in the presence of faults.

Boiler controller:

- Bounded response: Once the pressure gauge reads 30 pounds per square inch, the controller must issue a command to open a valve within 20 s.
- Fault-tolerance: Faults should not affect the functionality of the controller.

New Levels of Fault-Tolerance for RT Programs

Soft fault-tolerance: A program is soft fault-tolerant if it is not required to satisfy its timing constraints in the presence of faults. However, it should continue to meet its timing constraints in the absence of faults.

Hard fault-tolerance: A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults.

Levels of Fault-Tolerance (cont'd)

[Table: rows Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, Hard-masking; columns Safety, Liveness, Timing constraints, Bounded-time recovery; cell entries not preserved]

Problem Statement

[Figure: Synthesis Algorithm. Inputs: program $P$, safety specification, set of faults $f$, and the desired level of fault-tolerance (soft/hard-failsafe, nonmasking, soft/hard-masking). Output: fault-tolerant program $P'$.]

Problem Statement

Given a program $P$, an invariant $S$, a set of faults $F$, and a safety specification $SPEC$, identify a program $P'$ with invariant $S'$ such that

- $S' \subseteq S$,
- $P' \subseteq P$, and
- $P'$ is $F$-tolerant to $SPEC$ from $S'$.

Current Results

[Table: rows Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, Hard-masking; columns Safety, Liveness, Timing constraints, Bounded-recovery, Complexity; the surviving complexity entries are "Polynomial space, sound and complete algorithm" and "Open problem"]

[2] S. Kulkarni and A. Arora. Automating the Addition of Fault-Tolerance. FTRTFT, 2000.

Example: Altitude Switch

[Figure: timed automaton of the altitude switch with the labels Init, Standby, Await, Actuator Power on, LowAlt, Reset, SensorFail, Power, and CorruptSensor; clocks $x$, $y$, $z$, and $t$; and the fault-span marked]

The Altitude Switch is required to:

1. HFS: Tolerate the delay fault while in Standby mode: $((Standby \wedge CorruptSensor) \Rightarrow \Diamond_{\le 2}\,(Init))$
2. HFS: Not power on the actuators while it fails to read the altitude sensors
3. SMK: Recover within 1 s after the occurrence of faults

Ongoing Research

- Implementation of the algorithms
- Coping with the state explosion problem
  - Synthesis using zone automata
  - Parallelizing state space
- Coping with NP-hardness results
  - Heuristics
  - Decision procedures (Yices)
- Fault-tolerance in hybrid systems

[8] http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm

Conclusion

We study the problem of automated revision of programs inside their state space:

- Adding properties to programs
- Adding fault-tolerance


Questions

Automated Revision of Existing Programs 2

Motivation

Question Is it possible to revise the model automatically such that it satisfies the failed property while preserving the other properties

CounterexampleCounterexample

Model

Property

ModelChecker

Automated Revision of Existing Programs 3

Motivation (contrsquod)

Requirements

Question Is it possible to add a newly discovered property to an existing program automatically

SpecificationSpecification

Designer Program

Incomplete Incomplete SpecificationSpecification

New Property

Automated Revision of Existing Programs 4

Outline

What is program revision Adding properties to existing programs

Results Example

Adding fault-tolerance to existing real-time programs Results Example

Ongoing research Open problems and Future work

Automated Revision of Existing Programs 5

Program Revision

Revision by synthesis From specification Comprehensive revision

Highly expensive No reusability

From the existing program + new property Local revision Provides reusability In some cases offers lower classes of time and space complexity Does not need to have the entire specification of the existing program

RevisionAlgorithm

Program P

Property P

Automated Revision of Existing Programs 6

Our Goal

We identify classes of interesting properties typically used in specifying reactive systems

Designing synthesis methods where revising existing programs is feasible time-wise and space-wise

QuestionQuestion Why comprehensive revision is highly Why comprehensive revision is highly complexcomplexAnswerAnswer Expressiveness of specifications Expressiveness of specifications

Automated Revision of Existing Programs 7

Part I

Adding PropertiesProperties to Existing Programs

Automated Revision of Existing Programs 8

Preliminary Concepts

A program p is a triple p = Sp Ip p ie finite state space set of initial states and program transitions

A state predicate is any subset of Sp

A computation is a state sequence s0 s1 hellip iff s0 Ip

i gt 0 (si-1 si) p

If terminates in state sf then there does not exist s such that (sf s) p

Automated Revision of Existing Programs 9

Preliminary Concepts (cont)

Sample Properties Safety

P unless Q

stable(P)

invariant(P)

Liveness P leads-to Q P Q

P PP P P P

PP P

P P P Q

Automated Revision of Existing Programs 10

Preliminary Concepts (cont)

A specification is a conjunction of a set of properties

spec = L1 L2 hellip Ln

A computation satisfies spec iff (i | 0 i n satisfies Li)

A program p satisfies spec iff all computations of p1 are infinite

2 satisfy spec

Automated Revision of Existing Programs 11

Problem Statement

Formulation of the problem p satisfies existing specification spece

Sp = Sp

Ip = Ip

p p

All computations of p are infinite satisfy specn

SynthesisAlgorithm

Program p = Sp Ip p

Program p = Sp Ip pA Specification specn

Automated Revision of Existing Programs 12

Adding a Single Leads-to Property (R T )

Sp

Ip R T

Case 1 Deadlock statesRemove transitions (s1 s2) if s2 R and T is not reachable from S2

s0

s1

s2

Automated Revision of Existing Programs 13

Adding a Single Leads-to Property (R T )

Sp Ip R T

Break cycles reachable from R without reaching Q

s1

s4

s2

s3

Case 2 Cycles

s0

Automated Revision of Existing Programs 14

Soundness and Completeness

Theorem (1) The algorithm for adding multiple safety properties along with a leads-to property is sound and complete Fixability

Theorem (2) The complexity of the algorithm for adding multiple safety properties along with a leads-to property is polynomial-time

Automated Revision of Existing Programs 15

Adding Two Leads-to Properties Adding two leads-to properties one after another

Sp

Ip R T

s5

s6

s3s4

s7

P Qs9

s0

s6

s1

s2

s8

Automated Revision of Existing Programs 16

Other Results

The problems of simultaneous addition of two leads-to properties is NP-complete

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete

Automated Revision of Existing Programs 17

Another Problem

Adding two eventually properties

1 true leads-to Q

2 true leads-to T

This problem is also NP-complete

Q

T

Automated Revision of Existing Programs 18

Example Real-Time Resource Allocation

Two processes j 1 2 Each has two tasks to complete (each takes 1 time unit)

Submitting a request Performing IO operation

RQj reqj (x = 1) ioj reqj = true false

IOj ioj (x = 1) reqj ioj = true false WT 0 x 1 1048576 wait

Bounded response

L (io1 2 req1)

x

x

Automated Revision of Existing Programs 19

req10 lt x t lt 1

req2

req1x t = 0req2

req1x t = 1req2

io1x t = 0 req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

io10 lt x t lt 1

req2

io1x t = 0

io2

io10 lt x t lt 1

io2

io1x = 0 t gt 2

req2

io1x = 1 t gt 2

req2

io1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2req2

req1x = 0 t gt 2

req2

req1x = 1 t gt 2

req2

io10 lt x lt 1

t gt 2io2

req1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2 io2

req1x = 1 t gt 2

io2

io1x t = 0 2

io2

req1x t = 0 2

req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

req1x t = 0 2

io2

req1x t = 0 1

req2

req10 lt x lt 11 lt t lt 2

req2

req1x t = 1 2

req2

Regions whereio1 becomes true

Edges removed during addition

Regions made unreachable

Legend

Initial region

Edges participatingin a shortest path

io1x t = 0 1

req2

io1x t = 1

io2

io10 lt x lt 11 lt t lt 2

req2

io1x t = 1 2

req2

Edges in additionalshortest paths

io1x t = 1req2

io1x t = 01

io2

io10 lt x lt 11 lt t lt 2

io2

io1x t = 1 2

io2

io1x t = 0 2

req2

io1x = 1 t gt 2

io2

io10 lt x lt 1

t gt 2req2

Automated Revision of Existing Programs 20

Example (contrsquod)

RQ1 req1 (x = 1) io1 req1 = true false

IO1 io1 (x = 1) req1 io1 = true false

RQ2 req2 (x = 1) (io1 t 1) io2 req2 = true false

IO1 io2 (x = 1) (io1 t 1) req2 io2 = true false

WT 0 x 1 1048576 wait

x t

x

x

x

Automated Revision of Existing Programs 21

Part II

Adding Fault-ToleranceFault-Tolerance to Existing Programs

Automated Revision of Existing Programs 22

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Decision dg 0 1

(dj = ) ( fj = false) dj = dg

(dj ) ( fj = false) fj = true

dj

dk 0 1

dl

Decision

fj

fk false true

fl

Final

GENERAL

NON-GENERALS

Program

Automated Revision of Existing Programs 23

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Byzantine

bg false true

bj

bk false true

bl

Byzantine

(bj bk bl bg = false) bj = true

(bj = true) dj = 0|1Faults

Automated Revision of Existing Programs 24

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Safety specification Agreement The final decision of any two non-Byzantine

process cannot be different

Validity If the general is non-Byzantine then the final decision of a non-Byzantine process must be the same as that of the general

Fact The program does not meet the safety specification in the presence of faults

dg = 0 bg = true dj = 0 fj = false fj = true dk = 1 fk = true

Automated Revision of Existing Programs 25

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

QuestionIs it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults

AnswerYes but the problem is NP-complete which itself is a problem

SolutionA set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs

Automated Revision of Existing Programs 26

Some Terminalogy

Boolean Variables V = v1 vn State where l(vj) is a literal State predicate A finite set of states Transition Let V = v | v V A transition (s s) = s s Transition predicate is a set of transitions Program A program is defined by a transition predicate P Closure A state predicate S is closed in program P iff

(s ||S) (s ||S)

Computation of P A sequence of states s0 s1 where (sj sj+1) || P Safety specification A set of bad transitions that should not occur in any

program computation defined by transition predicate SPEC Satisfaction A program P satisfies SPEC from S iff

1 S is closed in P

2 For all s0 s1 where s0 || S (sj sj+1) || SPEC Invariant If P satisfies SPEC from S and S then S is an invariant of P

imi sS 0

)(0 jnj vls

)(0 iimi ssP

Pss ||)(

Automated Revision of Existing Programs 27

Preliminaries (contrsquod)

Faults A set F of transitions

Fault-span A state predicate T such that

1 S T

2 T is closed in P F

Invariant

f

f

ff

f

Fault-Span

Finite state space

p

pp

p

Fault-tolerance A program P is F-tolerant to SPEC from S if1 P satisfies SPEC from S2 There exists T such that

T is an F-span of P P F satisfies SPEC from T (safety) after faults stop occurring every computation of P reaches S (liveness)

Automated Revision of Existing Programs 28

Levels of Fault-Tolerance [3]

In the presence of faults

A failsafe program satisfies the safety specification A nonmasking program satisfies the liveness specification A masking program satisfies both safety and liveness specifications

In the absence of faults the program must continue to satisfy its entire specification

Question Are these levels able to capture the requirements for modeling fault-tolerant real-time programs

[3] A Arora S Kulkarni [3] A Arora S Kulkarni Designing masking fault-tolerance via nonmasking fault-toleranceDesigning masking fault-tolerance via nonmasking fault-tolerance TSE 1998TSE 1998

Automated Revision of Existing Programs 29

Fault-Tolerant Real-Time Programs

Railroad Crossing Safety If the train is passing the crossing the gate must be closed Bounded response Once the gate is closed it should reopen within

4min Fault-tolerance The gate must be closed while the train is passing the

intersection even in the presence of faults

Boiler Controller Bounded response Once the pressure gauge reads 30 pounds per

square inch the controller must issue a command to open a valve within 20s

Fault-tolerance Faults should not affect the functionality of the controller

Automated Revision of Existing Programs 30

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults However it should continue to meet its timing constraints in the absence of faults

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults

Automated Revision of Existing Programs 31

Levels of Fault-Tolerance (contrsquod)

Safety LivenessTiming

constraintsBounded-time

recovery

Soft-failsafe

Hard-failsafe

Nonmasking

Soft-masking

Hard-masking

Automated Revision of Existing Programs 32

Problem Statement

SynthesisAlgorithm

Program PSafety spec

Fault-tolerant

program PSet of faults f

Desired level of fault-tolerance SoftHard-Failsafe

NonmaskingSoftHard-Masking

Automated Revision of Existing Programs 33

Problem Statement

Given program P Invariant S a set of faults F and safety specification SPEC identify a program P with invariant S such that

S S P P and P is F-tolerant to SPEC from S

Automated Revision of Existing Programs 34

Current Results

Safety LivenessTiming

constraintsBounded-recovery

Complexity

Soft-failsafe

Polynomial space sound and complete

algorithm

Hard-failsafe

Nonmasking

Soft-masking

Hard-masking

Open problem

[2] S Kulkarni A Arora [2] S Kulkarni A Arora Automating the Addition of Fault-ToleranceAutomating the Addition of Fault-Tolerance FTRTFT 2000 FTRTFT 2000

Automated Revision of Existing Programs 35

Example Altitude Switch

Initx 06

Standby

Await (Power z 2 )

gtgt Actuator Power on ltlt

ltlt LowAlt gtgt[z = 0]

ltlt Reset gtgt(CorruptSensor y 2)

(x 06)

ltlt Reset gtgt

[x y z = 0]

(Power z 2 )

(z gt 2 )[t = 0]

(x gt 06)[t = 0]

ltlt SensorFail gtgt[y = 0]

(y gt 2)[t = 0]

y 1y 2

z 2

(Power CorruptSensor)( Power)

Fault-span

Fault-span

(t 1)

The Altitude Switch is required to

1) HFS Tolerate the delay fault while in Standby mode

((Standby CorruptSensor) lozloz22 (Init))

2) HFS Not power on the actuators while it fails to read the altitude sensors

3) SMK Recover within 1s after occurrence of faults

The Altitude Switch is required to

1) HFS Tolerate the delay fault while in Standby mode

((Standby CorruptSensor) lozloz22 (Init))

2) HFS Not power on the actuators while it fails to read the altitude sensors

3) SMK Recover within 1s after occurrence of faults

Automated Revision of Existing Programs 37

Ongoing Research

Implementation of the algorithms

Coping with the state explosion problem Synthesis using zone automata Parallelizing state space

Coping with NP-hardness results Heuristics Decision procedures (Yices)

Fault-tolerance in hybrid systems

[8] httpwwwcsemsueduebnenasiresearchtoolsftsynhtm

Automated Revision of Existing Programs 38

Conclusion

We study the problem of automated revision of programs inside their state space

Adding properties to programs Adding fault-tolerance

Automated Revision of Existing Programs 39

Questions

Automated Revision of Existing Programs 3

Motivation (contrsquod)

Requirements

Question Is it possible to add a newly discovered property to an existing program automatically

SpecificationSpecification

Designer Program

Incomplete Incomplete SpecificationSpecification

New Property

Automated Revision of Existing Programs 4

Outline

What is program revision Adding properties to existing programs

Results Example

Adding fault-tolerance to existing real-time programs Results Example

Ongoing research Open problems and Future work

Automated Revision of Existing Programs 5

Program Revision

Revision by synthesis From specification Comprehensive revision

Highly expensive No reusability

From the existing program + new property Local revision Provides reusability In some cases offers lower classes of time and space complexity Does not need to have the entire specification of the existing program

RevisionAlgorithm

Program P

Property P

Automated Revision of Existing Programs 6

Our Goal

We identify classes of interesting properties typically used in specifying reactive systems

Designing synthesis methods where revising existing programs is feasible time-wise and space-wise

QuestionQuestion Why comprehensive revision is highly Why comprehensive revision is highly complexcomplexAnswerAnswer Expressiveness of specifications Expressiveness of specifications

Automated Revision of Existing Programs 7

Part I

Adding PropertiesProperties to Existing Programs

Automated Revision of Existing Programs 8

Preliminary Concepts

A program p is a triple p = Sp Ip p ie finite state space set of initial states and program transitions

A state predicate is any subset of Sp

A computation is a state sequence s0 s1 hellip iff s0 Ip

i gt 0 (si-1 si) p

If terminates in state sf then there does not exist s such that (sf s) p

Automated Revision of Existing Programs 9

Preliminary Concepts (cont)

Sample Properties Safety

P unless Q

stable(P)

invariant(P)

Liveness P leads-to Q P Q

P PP P P P

PP P

P P P Q

Automated Revision of Existing Programs 10

Preliminary Concepts (cont)

A specification is a conjunction of a set of properties

spec = L1 L2 hellip Ln

A computation satisfies spec iff (i | 0 i n satisfies Li)

A program p satisfies spec iff all computations of p1 are infinite

2 satisfy spec

Automated Revision of Existing Programs 11

Problem Statement

Formulation of the problem p satisfies existing specification spece

Sp = Sp

Ip = Ip

p p

All computations of p are infinite satisfy specn

SynthesisAlgorithm

Program p = Sp Ip p

Program p = Sp Ip pA Specification specn

Automated Revision of Existing Programs 12

Adding a Single Leads-to Property (R T )

Sp

Ip R T

Case 1 Deadlock statesRemove transitions (s1 s2) if s2 R and T is not reachable from S2

s0

s1

s2

Automated Revision of Existing Programs 13

Adding a Single Leads-to Property (R T )

Sp Ip R T

Break cycles reachable from R without reaching Q

s1

s4

s2

s3

Case 2 Cycles

s0

Automated Revision of Existing Programs 14

Soundness and Completeness

Theorem (1) The algorithm for adding multiple safety properties along with a leads-to property is sound and complete Fixability

Theorem (2) The complexity of the algorithm for adding multiple safety properties along with a leads-to property is polynomial-time

Automated Revision of Existing Programs 15

Adding Two Leads-to Properties Adding two leads-to properties one after another

Sp

Ip R T

s5

s6

s3s4

s7

P Qs9

s0

s6

s1

s2

s8

Automated Revision of Existing Programs 16

Other Results

The problems of simultaneous addition of two leads-to properties is NP-complete

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete

Automated Revision of Existing Programs 17

Another Problem

Adding two eventually properties

1 true leads-to Q

2 true leads-to T

This problem is also NP-complete

Q

T

Automated Revision of Existing Programs 18

Example Real-Time Resource Allocation

Two processes j 1 2 Each has two tasks to complete (each takes 1 time unit)

Submitting a request Performing IO operation

RQj reqj (x = 1) ioj reqj = true false

IOj ioj (x = 1) reqj ioj = true false WT 0 x 1 1048576 wait

Bounded response

L (io1 2 req1)

x

x

Automated Revision of Existing Programs 19

[Figure: region graph of the resource allocation program over the clocks x and t. Legend: initial region; regions where io1 becomes true; edges participating in a shortest path; edges in additional shortest paths; edges removed during addition; regions made unreachable.]

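Example (cont'd)

$RQ_1 :: req_1 \wedge (x = 1) \xrightarrow{\{x,\,t\}} io_1, req_1 := true, false$

$IO_1 :: io_1 \wedge (x = 1) \xrightarrow{\{x\}} req_1, io_1 := true, false$

$RQ_2 :: req_2 \wedge (x = 1) \wedge (\neg io_1 \vee t \le 1) \xrightarrow{\{x\}} io_2, req_2 := true, false$

$IO_2 :: io_2 \wedge (x = 1) \wedge (\neg io_1 \vee t \le 1) \xrightarrow{\{x\}} req_2, io_2 := true, false$

$WT :: 0 \le x \le 1 \rightarrow wait$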
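The effect of the strengthened guards on process 2 can be checked on a discretized model. The sketch below is an illustration under stated assumptions, not the synthesis algorithm from the talk: it assumes that every action fires exactly when x = 1 (so one action fires per time unit), that RQ1 is the only action that resets t, and that the added guard on process 2 reads "not io1, or t <= 1". The function names are hypothetical.

```python
def successors(state, revised):
    """States reachable by letting one time unit pass and firing one action.

    state = (req1, io1, req2, io2, t), where t is the time elapsed since RQ1
    last fired. Values of t above 2 are lumped together as 3 (assumption:
    only t <= 1 and t >= 2 are ever compared).
    """
    req1, io1, req2, io2, t = state
    t = min(t + 1, 3)  # wait until x = 1
    succ = []
    if req1:
        succ.append((False, True, req2, io2, 0))  # RQ1 (assumed to reset t)
    if io1:
        succ.append((True, False, req2, io2, t))  # IO1
    # Assumed reading of the added guard on process 2 in the revised program.
    p2_enabled = (not revised) or (not io1) or t <= 1
    if p2_enabled and req2:
        succ.append((req1, io1, False, True, t))  # RQ2
    if p2_enabled and io2:
        succ.append((req1, io1, True, False, t))  # IO2
    return succ


def violates_bounded_response(revised):
    """True if io1 can still hold 2 time units after it became true."""
    start = (True, False, True, False, 0)
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        if s[1] and s[4] >= 2:
            return True
        for n in successors(s, revised):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return False


original_violates = violates_bounded_response(revised=False)
revised_violates = violates_bounded_response(revised=True)
```

Under these assumptions, the original program lets process 2 fire twice in a row while io1 holds, whereas the revised guards force IO1 to fire by t = 2.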

Part II

Adding Fault-Tolerance to Existing Programs

The Byzantine Agreement Problem

Variables

General, decision: $d_g \in \{0, 1\}$
Non-generals, decision: $d_j, d_k, d_l \in \{0, 1, \bot\}$
Non-generals, final: $f_j, f_k, f_l \in \{false, true\}$

Program (for a non-general $j$)

$(d_j = \bot) \wedge (f_j = false) \rightarrow d_j := d_g$

$(d_j \neq \bot) \wedge (f_j = false) \rightarrow f_j := true$

The Byzantine Agreement Problem

Variables

General, Byzantine: $b_g \in \{false, true\}$
Non-generals, Byzantine: $b_j, b_k, b_l \in \{false, true\}$

Faults

$(b_j, b_k, b_l, b_g = false) \rightarrow b_j := true$

$(b_j = true) \rightarrow d_j := 0 \mid 1$

The Byzantine Agreement Problem

Safety specification

Agreement: The final decisions of any two non-Byzantine processes cannot be different.

Validity: If the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general.

Fact: The program does not meet the safety specification in the presence of faults.

Counterexample: $d_g = 0$; $b_g = true$; $d_j = 0$; $f_j = false$; $f_j = true$; $d_k = 1$; $f_k = true$
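The violation of agreement can be replayed step by step. The following is a minimal sketch, assuming that the general has fault actions analogous to those shown for a non-general (it may become Byzantine and then rewrite its decision); the helper names and the dictionary layout are hypothetical.

```python
BOT = None  # stands for the undecided value


def initial_state():
    return {
        "d": {"g": 0, "j": BOT, "k": BOT, "l": BOT},   # decisions
        "f": {"j": False, "k": False, "l": False},      # finalized flags
        "b": {"g": False, "j": False, "k": False, "l": False},  # Byzantine flags
    }


def copy_decision(s, p):
    # Program action: (d.p = BOT) and (f.p = false) -> d.p := d.g
    assert s["d"][p] is BOT and not s["f"][p]
    s["d"][p] = s["d"]["g"]


def finalize(s, p):
    # Program action: (d.p != BOT) and (f.p = false) -> f.p := true
    assert s["d"][p] is not BOT and not s["f"][p]
    s["f"][p] = True


def become_byzantine(s, p):
    # Fault action: no process is Byzantine yet -> b.p := true
    assert not any(s["b"].values())
    s["b"][p] = True


def byzantine_write(s, p, value):
    # Fault action: b.p -> d.p := 0 | 1
    assert s["b"][p] and value in (0, 1)
    s["d"][p] = value


def agreement_holds(s):
    # Final decisions of non-Byzantine non-generals must coincide.
    finals = {s["d"][p] for p in "jkl" if s["f"][p] and not s["b"][p]}
    return len(finals) <= 1


s = initial_state()
become_byzantine(s, "g")    # b.g = true
copy_decision(s, "j")       # d.j = 0
finalize(s, "j")            # f.j = true
byzantine_write(s, "g", 1)  # the Byzantine general changes its decision
copy_decision(s, "k")       # d.k = 1
finalize(s, "k")            # f.k = true
violated = not agreement_holds(s)
```

Processes j and k are both non-Byzantine, yet they finalize different values, so agreement fails in this run.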

The Byzantine Agreement Problem

Question: Is it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults?

Answer: Yes, but the problem is NP-complete, which is itself a problem.

Solution: A set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs.

Some Terminology

Boolean variables: $V = \{v_1, \ldots, v_n\}$

State: $s = \bigwedge_{0 \le j \le n} l(v_j)$, where $l(v_j)$ is a literal

State predicate: a finite set of states, $S = \bigvee_{0 \le i \le m} s_i$

Transition: let $V' = \{v' \mid v \in V\}$; a transition is $(s, s') = s \wedge s'$

Transition predicate: a set of transitions, $P = \bigvee_{0 \le i \le m} (s_i \wedge s'_i)$

Program: a program is defined by a transition predicate $P$

Closure: a state predicate $S$ is closed in program $P$ iff $((s \models S) \wedge ((s, s') \models P)) \Rightarrow (s' \models S)$

Computation of $P$: a sequence of states $\langle s_0, s_1, \ldots \rangle$ where $(s_j, s_{j+1}) \models P$

Safety specification: a set of bad transitions that should not occur in any program computation, defined by transition predicate $SPEC$

Satisfaction: a program $P$ satisfies $SPEC$ from $S$ iff

1. $S$ is closed in $P$

2. For all $\langle s_0, s_1, \ldots \rangle$ where $s_0 \models S$: $(s_j, s_{j+1}) \not\models SPEC$

Invariant: if $P$ satisfies $SPEC$ from $S$ and $S \neq \emptyset$, then $S$ is an invariant of $P$
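The closure and satisfaction definitions above translate directly into set operations. The sketch below is illustrative only: it represents states as small integers and predicates as explicit Python sets rather than Boolean formulas, and the function names are hypothetical.

```python
def is_closed(S, P):
    """S is closed in P iff every transition of P that starts in S ends in S."""
    return all(t in S for (s, t) in P if s in S)


def satisfies_from(P, SPEC, S):
    """P satisfies SPEC from S iff S is closed in P and no computation
    that starts in S executes a transition of SPEC.

    Every state of S is a possible start state and S is closed, so it is
    enough to inspect the transitions of P that start in S.
    """
    if not is_closed(S, P):
        return False
    return not any((s, t) in SPEC for (s, t) in P if s in S)


# Hypothetical three-state program.
P = {(0, 1), (1, 0), (1, 2), (2, 2)}
SPEC = {(1, 2)}  # the single bad transition

closed_small = is_closed({0, 1}, P)        # (1, 2) leaves {0, 1}
closed_all = is_closed({0, 1, 2}, P)
ok_original = satisfies_from(P, SPEC, {0, 1, 2})
ok_pruned = satisfies_from(P - SPEC, SPEC, {0, 1})
```

Removing the bad transition makes {0, 1} closed, and the pruned program then satisfies SPEC from that smaller predicate.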

Preliminaries (cont'd)

Faults: a set $F$ of transitions

Fault-span: a state predicate $T$ such that

1. $S \subseteq T$

2. $T$ is closed in $P \cup F$

[Figure: finite state space containing the invariant inside the fault-span; program transitions are labeled p and fault transitions are labeled f.]

Fault-tolerance: a program $P$ is $F$-tolerant to $SPEC$ from $S$ if

1. $P$ satisfies $SPEC$ from $S$

2. There exists $T$ such that
   $T$ is an $F$-span of $P$
   $P \cup F$ satisfies $SPEC$ from $T$ (safety)
   after faults stop occurring, every computation of $P$ reaches $S$ (liveness)
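One candidate fault-span is the set of states reachable from the invariant under program and fault transitions together; it is the smallest predicate that contains S and is closed in P ∪ F. The sketch below computes it and checks the safety half of the definition. It is a simplified illustration with explicit sets and hypothetical names, and it does not check the liveness (recovery) condition.

```python
def fault_span(S, P, F):
    """Smallest T with S a subset of T and T closed in P union F."""
    steps = P | F
    T, frontier = set(S), list(S)
    while frontier:
        a = frontier.pop()
        for (x, b) in steps:
            if x == a and b not in T:
                T.add(b)
                frontier.append(b)
    return T


def safe_in_presence_of_faults(S, P, F, SPEC):
    """True if no transition of P union F starting in the fault-span is bad."""
    T = fault_span(S, P, F)
    return not any((a, b) in SPEC for (a, b) in (P | F) if a in T)


# Hypothetical example: state 0 is the invariant, the fault moves 0 to 1.
S = {0}
P = {(0, 0), (1, 0), (1, 2), (2, 2)}
F = {(0, 1)}
SPEC = {(1, 2)}

span = fault_span(S, P, F)
unsafe = not safe_in_presence_of_faults(S, P, F, SPEC)
safe_after_pruning = safe_in_presence_of_faults(S, P - SPEC, F, SPEC)
```

Dropping the bad program transition shrinks the reachable fault-span to {0, 1} and restores safety in this example.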

Levels of Fault-Tolerance [3]

In the presence of faults:

A failsafe program satisfies the safety specification.
A nonmasking program satisfies the liveness specification.
A masking program satisfies both safety and liveness specifications.

In the absence of faults, the program must continue to satisfy its entire specification.

Question: Are these levels able to capture the requirements for modeling fault-tolerant real-time programs?

[3] A. Arora, S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. TSE, 1998.

Fault-Tolerant Real-Time Programs

Railroad Crossing
Safety: If the train is passing the crossing, the gate must be closed.
Bounded response: Once the gate is closed, it should reopen within 4 min.
Fault-tolerance: The gate must be closed while the train is passing the intersection, even in the presence of faults.

Boiler Controller
Bounded response: Once the pressure gauge reads 30 pounds per square inch, the controller must issue a command to open a valve within 20 s.
Fault-tolerance: Faults should not affect the functionality of the controller.

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults. However, it should continue to meet its timing constraints in the absence of faults.

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults.

Levels of Fault-Tolerance (cont'd)

Table columns: Safety | Liveness | Timing constraints | Bounded-time recovery

Table rows: Soft-failsafe; Hard-failsafe; Nonmasking; Soft-masking; Hard-masking

Problem Statement

Inputs to the synthesis algorithm:

Program P
Safety spec
Set of faults f
Desired level of fault-tolerance: Soft/Hard-Failsafe, Nonmasking, Soft/Hard-Masking

Output: Fault-tolerant program P'

Problem Statement

Given program $P$, invariant $S$, a set of faults $F$, and safety specification $SPEC$, identify a program $P'$ with invariant $S'$ such that

$S' \subseteq S$, $P'|S' \subseteq P|S'$, and $P'$ is $F$-tolerant to $SPEC$ from $S'$

Current Results

Table columns: Safety | Liveness | Timing constraints | Bounded-recovery | Complexity

Table rows: Soft-failsafe; Hard-failsafe; Nonmasking; Soft-masking; Hard-masking

Complexity entries: Polynomial space, sound and complete algorithm; Open problem

[2] S. Kulkarni, A. Arora. Automating the Addition of Fault-Tolerance. FTRTFT, 2000.

Example: Altitude Switch

[Figure: timed automaton of the altitude switch with locations Init, Standby, and Await; events LowAlt, Reset, SensorFail, and Actuator Power on; clocks x, y, z, and t; the fault-span regions are marked.]

The Altitude Switch is required to:

1) HFS: Tolerate the delay fault while in Standby mode

$((Standby \wedge CorruptSensor) \Rightarrow \Diamond_{\le 2}\,(Init))$

2) HFS: Not power on the actuators while it fails to read the altitude sensors

3) SMK: Recover within 1 s after occurrence of faults

Ongoing Research

Implementation of the algorithms

Coping with the state explosion problem: synthesis using zone automata; parallelizing state space

Coping with NP-hardness results: heuristics; decision procedures (Yices)

Fault-tolerance in hybrid systems

[8] http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm

Conclusion

We study the problem of automated revision of programs inside their state space:

Adding properties to programs
Adding fault-tolerance

Questions


Our Goal

We identify classes of interesting properties typically used in specifying reactive systems.

Designing synthesis methods where revising existing programs is feasible time-wise and space-wise.

Question: Why is comprehensive revision highly complex?
Answer: Expressiveness of specifications.

Part I

Adding Properties to Existing Programs

Preliminary Concepts

A program $p$ is a triple $p = \langle S_p, I_p, \delta_p \rangle$, i.e., finite state space, set of initial states, and program transitions.

A state predicate is any subset of $S_p$.

A computation $\sigma$ is a state sequence $\langle s_0, s_1, \ldots \rangle$ iff

$s_0 \in I_p$

$\forall i > 0 : (s_{i-1}, s_i) \in \delta_p$

If $\sigma$ terminates in state $s_f$, then there does not exist $s$ such that $(s_f, s) \in \delta_p$.

Preliminary Concepts (cont)

Sample Properties

Safety: P unless Q; stable(P); invariant(P)

Liveness: P leads-to Q ($P \mapsto Q$)

Preliminary Concepts (cont)

A specification is a conjunction of a set of properties:

$spec = L_1 \wedge L_2 \wedge \ldots \wedge L_n$

A computation $\sigma$ satisfies $spec$ iff $(\forall i \mid 1 \le i \le n : \sigma \text{ satisfies } L_i)$

A program $p$ satisfies $spec$ iff all computations of $p$

1. are infinite

2. satisfy $spec$
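These definitions can be made concrete with a small data structure. The sketch below is illustrative: the `Program` class and the helper names are hypothetical, states are plain integers, and "all computations are infinite" is checked as "no reachable state lacks an outgoing transition".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    states: frozenset  # finite state space
    init: frozenset    # initial states
    trans: frozenset   # program transitions as (s, s') pairs


def is_computation_prefix(p, seq):
    """seq starts in an initial state and follows program transitions."""
    return (len(seq) > 0 and seq[0] in p.init
            and all((a, b) in p.trans for a, b in zip(seq, seq[1:])))


def reachable(p):
    """States reachable from the initial states."""
    seen, stack = set(p.init), list(p.init)
    while stack:
        a = stack.pop()
        for (x, b) in p.trans:
            if x == a and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def all_computations_infinite(p):
    """No reachable state is a deadlock (a state without successors)."""
    has_successor = {a for (a, _) in p.trans}
    return reachable(p) <= has_successor


p = Program(frozenset({0, 1, 2}), frozenset({0}),
            frozenset({(0, 1), (1, 0), (1, 2)}))
q = Program(p.states, p.init, p.trans | {(2, 2)})
```

Here `p` can terminate in state 2, while `q` adds a self-loop on state 2 and therefore has only infinite computations.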

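Problem Statement

Formulation of the problem: given a program $p = \langle S_p, I_p, \delta_p \rangle$ that satisfies the existing specification $spec_e$, and a new specification $spec_n$, find a program $p' = \langle S_{p'}, I_{p'}, \delta_{p'} \rangle$ such that

$S_{p'} = S_p$

$I_{p'} = I_p$

$\delta_{p'} \subseteq \delta_p$

All computations of $p'$ are infinite and satisfy $spec_n$

The synthesis algorithm takes the program $p$ and the specification $spec_n$ as input and produces the program $p'$.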

Adding a Single Leads-to Property ($R \mapsto T$)

Case 1: Deadlock states

Remove transitions $(s_1, s_2)$ if $s_2 \in R$ and $T$ is not reachable from $s_2$.

[Figure: state space Sp with initial states Ip, predicates R and T, and states s0, s1, and s2.]

Adding a Single Leads-to Property ($R \mapsto T$)

Case 2: Cycles

Break cycles that are reachable from R and do not reach T.

[Figure: state space Sp with initial states Ip, predicates R and T, and states s0 to s4.]
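The two cases (deadlock states and cycles) can be combined in a rank-based sketch. This is an illustrative reading under stated assumptions, not necessarily the algorithm presented in the talk: the rank of a state is assumed to be the length of its shortest path to T, states in R that cannot reach T are made unreachable, and states that are pending (reached from R before T) keep only rank-decreasing transitions. All names are hypothetical, and the sketch only removes transitions.

```python
from collections import deque


def ranks(states, trans, targets):
    """Shortest distance from each state to `targets` (backward BFS)."""
    preds = {s: set() for s in states}
    for a, b in trans:
        preds[b].add(a)
    rank = {s: 0 for s in targets}
    queue = deque(targets)
    while queue:
        b = queue.popleft()
        for a in preds[b]:
            if a not in rank:
                rank[a] = rank[b] + 1
                queue.append(a)
    return rank  # states absent from the dict cannot reach the targets


def add_leads_to(states, init, trans, R, T):
    """Return a subset of trans in which R leads-to T, or None on failure."""
    trans, bad = set(trans), set()
    while True:
        # Case 1: R-states that cannot reach T, and deadlock states, are bad.
        rank = ranks(states, trans, T - bad)
        has_successor = {a for a, _ in trans}
        new_bad = {s for s in R - T if s not in rank} | (states - has_successor)
        if new_bad <= bad:
            break
        bad |= new_bad
        trans = {(a, b) for a, b in trans if a not in bad and b not in bad}
    if init & bad:
        return None  # an initial state would have to be removed
    # States reached from R before T is reached.
    pending = set(R - T - bad)
    frontier = list(pending)
    while frontier:
        a = frontier.pop()
        for x, b in trans:
            if x == a and b not in T and b not in pending:
                pending.add(b)
                frontier.append(b)
    # Case 2: pending states keep only transitions that decrease the rank,
    # which breaks every cycle that avoids T.
    inf = float("inf")
    return {(a, b) for a, b in trans
            if not (a in pending and a in rank and rank.get(b, inf) >= rank[a])}


states = {0, 1, 2, 3, 4}
trans = {(0, 1), (1, 2), (2, 1), (1, 3), (3, 3), (2, 4), (4, 4)}
revised = add_leads_to(states, {0}, trans, R={1}, T={3})
```

In this example the cycle between states 1 and 2 is broken by removing (1, 2), and the transition (2, 4) into a region that cannot reach T is removed as well.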

Automated Revision of Existing Programs 5

Program Revision

Revision by synthesis From specification Comprehensive revision

Highly expensive No reusability

From the existing program + new property Local revision Provides reusability In some cases offers lower classes of time and space complexity Does not need to have the entire specification of the existing program

RevisionAlgorithm

Program P

Property P

Automated Revision of Existing Programs 6

Our Goal

We identify classes of interesting properties typically used in specifying reactive systems

Designing synthesis methods where revising existing programs is feasible time-wise and space-wise

QuestionQuestion Why comprehensive revision is highly Why comprehensive revision is highly complexcomplexAnswerAnswer Expressiveness of specifications Expressiveness of specifications

Automated Revision of Existing Programs 7

Part I

Adding PropertiesProperties to Existing Programs

Automated Revision of Existing Programs 8

Preliminary Concepts

A program p is a triple p = Sp Ip p ie finite state space set of initial states and program transitions

A state predicate is any subset of Sp

A computation is a state sequence s0 s1 hellip iff s0 Ip

i gt 0 (si-1 si) p

If terminates in state sf then there does not exist s such that (sf s) p

Automated Revision of Existing Programs 9

Preliminary Concepts (cont)

Sample Properties Safety

P unless Q

stable(P)

invariant(P)

Liveness P leads-to Q P Q

P PP P P P

PP P

P P P Q

Automated Revision of Existing Programs 10

Preliminary Concepts (cont)

A specification is a conjunction of a set of properties

spec = L1 L2 hellip Ln

A computation satisfies spec iff (i | 0 i n satisfies Li)

A program p satisfies spec iff all computations of p1 are infinite

2 satisfy spec

Automated Revision of Existing Programs 11

Problem Statement

Formulation of the problem p satisfies existing specification spece

Sp = Sp

Ip = Ip

p p

All computations of p are infinite satisfy specn

SynthesisAlgorithm

Program p = Sp Ip p

Program p = Sp Ip pA Specification specn

Automated Revision of Existing Programs 12

Adding a Single Leads-to Property (R T )

Sp

Ip R T

Case 1 Deadlock statesRemove transitions (s1 s2) if s2 R and T is not reachable from S2

s0

s1

s2

Automated Revision of Existing Programs 13

Adding a Single Leads-to Property (R T )

Sp Ip R T

Break cycles reachable from R without reaching Q

s1

s4

s2

s3

Case 2 Cycles

s0

Automated Revision of Existing Programs 14

Soundness and Completeness

Theorem (1) The algorithm for adding multiple safety properties along with a leads-to property is sound and complete Fixability

Theorem (2) The complexity of the algorithm for adding multiple safety properties along with a leads-to property is polynomial-time

Automated Revision of Existing Programs 15

Adding Two Leads-to Properties Adding two leads-to properties one after another

Sp

Ip R T

s5

s6

s3s4

s7

P Qs9

s0

s6

s1

s2

s8

Automated Revision of Existing Programs 16

Other Results

The problems of simultaneous addition of two leads-to properties is NP-complete

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete

Automated Revision of Existing Programs 17

Another Problem

Adding two eventually properties

1 true leads-to Q

2 true leads-to T

This problem is also NP-complete

Q

T

Automated Revision of Existing Programs 18

Example Real-Time Resource Allocation

Two processes j 1 2 Each has two tasks to complete (each takes 1 time unit)

Submitting a request Performing IO operation

RQj reqj (x = 1) ioj reqj = true false

IOj ioj (x = 1) reqj ioj = true false WT 0 x 1 1048576 wait

Bounded response

L (io1 2 req1)

x

x

Automated Revision of Existing Programs 19

req10 lt x t lt 1

req2

req1x t = 0req2

req1x t = 1req2

io1x t = 0 req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

io10 lt x t lt 1

req2

io1x t = 0

io2

io10 lt x t lt 1

io2

io1x = 0 t gt 2

req2

io1x = 1 t gt 2

req2

io1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2req2

req1x = 0 t gt 2

req2

req1x = 1 t gt 2

req2

io10 lt x lt 1

t gt 2io2

req1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2 io2

req1x = 1 t gt 2

io2

io1x t = 0 2

io2

req1x t = 0 2

req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

req1x t = 0 2

io2

req1x t = 0 1

req2

req10 lt x lt 11 lt t lt 2

req2

req1x t = 1 2

req2

Regions whereio1 becomes true

Edges removed during addition

Regions made unreachable

Legend

Initial region

Edges participatingin a shortest path

io1x t = 0 1

req2

io1x t = 1

io2

io10 lt x lt 11 lt t lt 2

req2

io1x t = 1 2

req2

Edges in additionalshortest paths

io1x t = 1req2

io1x t = 01

io2

io10 lt x lt 11 lt t lt 2

io2

io1x t = 1 2

io2

io1x t = 0 2

req2

io1x = 1 t gt 2

io2

io10 lt x lt 1

t gt 2req2

Automated Revision of Existing Programs 20

Example (contrsquod)

RQ1 req1 (x = 1) io1 req1 = true false

IO1 io1 (x = 1) req1 io1 = true false

RQ2 req2 (x = 1) (io1 t 1) io2 req2 = true false

IO1 io2 (x = 1) (io1 t 1) req2 io2 = true false

WT 0 x 1 1048576 wait

x t

x

x

x

Automated Revision of Existing Programs 21

Part II

Adding Fault-ToleranceFault-Tolerance to Existing Programs

Automated Revision of Existing Programs 22

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Decision dg 0 1

(dj = ) ( fj = false) dj = dg

(dj ) ( fj = false) fj = true

dj

dk 0 1

dl

Decision

fj

fk false true

fl

Final

GENERAL

NON-GENERALS

Program

Automated Revision of Existing Programs 23

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Byzantine

bg false true

bj

bk false true

bl

Byzantine

(bj bk bl bg = false) bj = true

(bj = true) dj = 0|1Faults

Automated Revision of Existing Programs 24

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Safety specification Agreement The final decision of any two non-Byzantine

process cannot be different

Validity If the general is non-Byzantine then the final decision of a non-Byzantine process must be the same as that of the general

Fact The program does not meet the safety specification in the presence of faults

dg = 0 bg = true dj = 0 fj = false fj = true dk = 1 fk = true

Automated Revision of Existing Programs 25

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

QuestionIs it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults

AnswerYes but the problem is NP-complete which itself is a problem

SolutionA set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs

Automated Revision of Existing Programs 26

Some Terminalogy

Boolean Variables V = v1 vn State where l(vj) is a literal State predicate A finite set of states Transition Let V = v | v V A transition (s s) = s s Transition predicate is a set of transitions Program A program is defined by a transition predicate P Closure A state predicate S is closed in program P iff

(s ||S) (s ||S)

Computation of P A sequence of states s0 s1 where (sj sj+1) || P Safety specification A set of bad transitions that should not occur in any

program computation defined by transition predicate SPEC Satisfaction A program P satisfies SPEC from S iff

1 S is closed in P

2 For all s0 s1 where s0 || S (sj sj+1) || SPEC Invariant If P satisfies SPEC from S and S then S is an invariant of P

imi sS 0

)(0 jnj vls

)(0 iimi ssP

Pss ||)(

Automated Revision of Existing Programs 27

Preliminaries (contrsquod)

Faults A set F of transitions

Fault-span A state predicate T such that

1 S T

2 T is closed in P F

Invariant

f

f

ff

f

Fault-Span

Finite state space

p

pp

p

Fault-tolerance A program P is F-tolerant to SPEC from S if1 P satisfies SPEC from S2 There exists T such that

T is an F-span of P P F satisfies SPEC from T (safety) after faults stop occurring every computation of P reaches S (liveness)

Automated Revision of Existing Programs 28

Levels of Fault-Tolerance [3]

In the presence of faults

A failsafe program satisfies the safety specification A nonmasking program satisfies the liveness specification A masking program satisfies both safety and liveness specifications

In the absence of faults the program must continue to satisfy its entire specification

Question Are these levels able to capture the requirements for modeling fault-tolerant real-time programs

[3] A Arora S Kulkarni [3] A Arora S Kulkarni Designing masking fault-tolerance via nonmasking fault-toleranceDesigning masking fault-tolerance via nonmasking fault-tolerance TSE 1998TSE 1998

Automated Revision of Existing Programs 29

Fault-Tolerant Real-Time Programs

Railroad Crossing Safety If the train is passing the crossing the gate must be closed Bounded response Once the gate is closed it should reopen within

4min Fault-tolerance The gate must be closed while the train is passing the

intersection even in the presence of faults

Boiler Controller Bounded response Once the pressure gauge reads 30 pounds per

square inch the controller must issue a command to open a valve within 20s

Fault-tolerance Faults should not affect the functionality of the controller

Automated Revision of Existing Programs 30

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults However it should continue to meet its timing constraints in the absence of faults

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults

Automated Revision of Existing Programs 31

Levels of Fault-Tolerance (contrsquod)

Safety LivenessTiming

constraintsBounded-time

recovery

Soft-failsafe

Hard-failsafe

Nonmasking

Soft-masking

Hard-masking

Automated Revision of Existing Programs 32

Problem Statement

SynthesisAlgorithm

Program PSafety spec

Fault-tolerant

program PSet of faults f

Desired level of fault-tolerance SoftHard-Failsafe

NonmaskingSoftHard-Masking

Automated Revision of Existing Programs 33

Problem Statement

Given program P Invariant S a set of faults F and safety specification SPEC identify a program P with invariant S such that

S S P P and P is F-tolerant to SPEC from S

Automated Revision of Existing Programs 34

Current Results

Safety LivenessTiming

constraintsBounded-recovery

Complexity

Soft-failsafe

Polynomial space sound and complete

algorithm

Hard-failsafe

Nonmasking

Soft-masking

Hard-masking

Open problem

[2] S Kulkarni A Arora [2] S Kulkarni A Arora Automating the Addition of Fault-ToleranceAutomating the Addition of Fault-Tolerance FTRTFT 2000 FTRTFT 2000

Automated Revision of Existing Programs 35

Example Altitude Switch

Initx 06

Standby

Await (Power z 2 )

gtgt Actuator Power on ltlt

ltlt LowAlt gtgt[z = 0]

ltlt Reset gtgt(CorruptSensor y 2)

(x 06)

ltlt Reset gtgt

[x y z = 0]

(Power z 2 )

(z gt 2 )[t = 0]

(x gt 06)[t = 0]

ltlt SensorFail gtgt[y = 0]

(y gt 2)[t = 0]

y 1y 2

z 2

(Power CorruptSensor)( Power)

Fault-span

Fault-span

(t 1)

The Altitude Switch is required to:

1) HFS: Tolerate the delay fault while in Standby mode

   $((Standby \wedge CorruptSensor) \Rightarrow \Diamond_{\le 2}\,(Init))$

2) HFS: Not power on the actuators while it fails to read the altitude sensors

3) SMK: Recover within 1 s after the occurrence of faults

Automated Revision of Existing Programs 37

Ongoing Research

Implementation of the algorithms

Coping with the state explosion problem:
  Synthesis using zone automata
  Parallelizing state space

Coping with NP-hardness results:
  Heuristics
  Decision procedures (Yices)

Fault-tolerance in hybrid systems

[8] http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm

Automated Revision of Existing Programs 38

Conclusion

We study the problem of automated revision of programs inside their state space:

  Adding properties to programs
  Adding fault-tolerance

Automated Revision of Existing Programs 39

Questions

Automated Revision of Existing Programs 6

Our Goal

We identify classes of interesting properties typically used in specifying reactive systems

Designing synthesis methods where revising existing programs is feasible time-wise and space-wise

Question: Why is comprehensive revision highly complex?

Answer: Expressiveness of specifications.

Automated Revision of Existing Programs 7

Part I

Adding Properties to Existing Programs

Automated Revision of Existing Programs 8

Preliminary Concepts

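A program p is a triple $p = \langle S_p, I_p, \delta_p \rangle$, i.e., a finite state space, a set of initial states, and a set of program transitions.

A state predicate is any subset of $S_p$.

A computation is a state sequence $\langle s_0, s_1, \dots \rangle$ iff:

$s_0 \in I_p$

$\forall i > 0 : (s_{i-1}, s_i) \in \delta_p$

If the sequence terminates in state $s_f$, then there does not exist a state s such that $(s_f, s) \in \delta_p$.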
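As a rough illustration of these definitions (not code from the talk), a finite-state program can be held as a triple of plain Python sets. The names `Program`, `successors`, `is_computation_prefix`, `is_terminating_computation`, and the three-state example are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    states: frozenset   # S_p: finite state space
    init: frozenset     # I_p: initial states
    delta: frozenset    # delta_p: set of (source, target) pairs


def successors(p, s):
    """States reachable from s in one program transition."""
    return {t for (u, t) in p.delta if u == s}


def is_computation_prefix(p, seq):
    """Check that seq starts in an initial state and follows program transitions."""
    if not seq or seq[0] not in p.init:
        return False
    return all((a, b) in p.delta for a, b in zip(seq, seq[1:]))


def is_terminating_computation(p, seq):
    """A finite sequence is a computation only if it ends in a deadlock state."""
    return is_computation_prefix(p, seq) and not successors(p, seq[-1])


# Hypothetical three-state program: 0 -> 1 -> 2, where 2 has no successor.
p = Program(frozenset({0, 1, 2}), frozenset({0}), frozenset({(0, 1), (1, 2)}))
print(is_computation_prefix(p, [0, 1]))          # True
print(is_terminating_computation(p, [0, 1]))     # False: state 1 still has a successor
print(is_terminating_computation(p, [0, 1, 2]))  # True
```

Explicit sets only scale to small examples; they are used here because they mirror the set-based definitions directly.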

Automated Revision of Existing Programs 9

Preliminary Concepts (cont)

Sample Properties

Safety:
  P unless Q
  stable(P)
  invariant(P)

Liveness:
  P leads-to Q ($P \mapsto Q$)

[Figure: state-sequence illustrations of the properties above]
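The slide's illustrations of the safety properties did not survive extraction. The sketch below assumes the standard UNITY readings (P unless Q: a step from a state in P and not in Q stays in P or enters Q; stable(P) is P unless false; invariant(P) is stable(P) plus P holding initially). The function names and the example program are hypothetical.

```python
def unless(delta, P, Q):
    """P unless Q: every transition from a state in P and not in Q ends in P or in Q."""
    return all(t in P or t in Q for (s, t) in delta if s in P and s not in Q)


def stable(delta, P):
    """stable(P) is 'P unless false': no transition leaves P."""
    return unless(delta, P, set())


def invariant(init, delta, P):
    """invariant(P): P holds in every initial state and is stable."""
    return init <= P and stable(delta, P)


# Hypothetical program: 0 -> 1 -> 2 -> 2, with initial state 0.
delta = {(0, 1), (1, 2), (2, 2)}
init = {0}
print(unless(delta, {0, 1}, {2}))         # True
print(stable(delta, {2}))                 # True
print(stable(delta, {0}))                 # False: 0 -> 1 leaves {0}
print(invariant(init, delta, {0, 1, 2}))  # True
```

Each check only inspects single transitions, which is why safety properties of this kind are cheap to verify on a finite transition relation.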

Automated Revision of Existing Programs 10

Preliminary Concepts (cont)

A specification is a conjunction of a set of properties:

$spec = L_1 \wedge L_2 \wedge \dots \wedge L_n$

A computation $\sigma$ satisfies spec iff $(\forall i \mid 0 < i \le n : \sigma \text{ satisfies } L_i)$.

A program p satisfies spec iff all computations of p (1) are infinite and (2) satisfy spec.

Automated Revision of Existing Programs 11

Problem Statement

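Formulation of the problem: p satisfies the existing specification $spec_e$. Given a new specification $spec_n$, find a program p' such that:

$S_{p'} = S_p$

$I_{p'} = I_p$

$\delta_{p'} \subseteq \delta_p$

All computations of p' are infinite and satisfy $spec_n$.

[Diagram: program $p = \langle S_p, I_p, \delta_p \rangle$ and specification $spec_n$ are the inputs to the Synthesis Algorithm, which outputs program $p' = \langle S_p, I_p, \delta_{p'} \rangle$]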

Automated Revision of Existing Programs 12

Adding a Single Leads-to Property ($R \mapsto T$)

Case 1: Deadlock states. Remove transitions $(s_1, s_2)$ if $s_2 \in R$ and T is not reachable from $s_2$.

[Figure: state space $S_p$ containing $I_p$, R, and T, with states s0, s1, and s2]

Automated Revision of Existing Programs 13

Adding a Single Leads-to Property ($R \mapsto T$)

Case 2: Cycles. Break cycles that are reachable from R and do not reach T.

[Figure: state space $S_p$ containing $I_p$, R, and T, with states s0 through s4]
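The two cases can be combined in a small sketch. This is not the algorithm from the talk: it is a conservative variant that only removes transitions, and it assumes explicit sets of states and a rank (shortest distance to T) computed by backward breadth-first search. The names `ranks_to` and `add_leads_to` are hypothetical.

```python
from collections import deque

INF = float("inf")


def ranks_to(T, delta):
    """Length of the shortest path from each state to T (backward BFS)."""
    preds = {}
    for s, t in delta:
        preds.setdefault(t, set()).add(s)
    rank = {t: 0 for t in T}
    queue = deque(T)
    while queue:
        t = queue.popleft()
        for s in preds.get(t, ()):
            if s not in rank:
                rank[s] = rank[t] + 1
                queue.append(s)
    return rank


def add_leads_to(states, init, delta, R, T):
    """Return a subset of delta in which every reachable R-state leads to T, or None."""
    delta = set(delta)
    while True:
        rank = ranks_to(T, delta)
        has_succ = {s for s, _ in delta}
        # Case 1: R-states that cannot reach T, and deadlock states, must become unreachable.
        bad = {s for s in states
               if s not in has_succ or (s in R and rank.get(s, INF) == INF)}
        doomed = {(s, t) for (s, t) in delta if t in bad}
        if not doomed:
            break
        delta -= doomed
    if bad & set(init):
        return None  # an initial state cannot be repaired by removing transitions
    # Case 2: outside T, keep only rank-decreasing transitions, which breaks cycles.
    return {(s, t) for (s, t) in delta
            if s in T or s not in rank or rank.get(t, INF) < rank[s]}


states = {0, 1, 2, 3, 4}
delta = {(0, 1), (1, 2), (2, 1), (2, 3), (3, 3), (3, 4), (4, 4)}
revised = add_leads_to(states, {0}, delta, R={1, 4}, T={3})
print(sorted(revised))  # [(0, 1), (1, 2), (2, 3), (3, 3)]
```

In the example, the transition into state 4 is removed because 4 is in R and cannot reach T, and the transition (2, 1) is removed because it closes a cycle outside T. Restricting every state outside T to rank-decreasing transitions removes more nondeterminism than strictly necessary, which keeps the sketch short.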

Automated Revision of Existing Programs 14

Soundness and Completeness

Theorem (1) The algorithm for adding multiple safety properties along with a leads-to property is sound and complete Fixability

Theorem (2) The algorithm for adding multiple safety properties along with a leads-to property runs in polynomial time

Automated Revision of Existing Programs 15

Adding Two Leads-to Properties

Adding two leads-to properties one after another

[Figure: state space $S_p$ containing $I_p$, R, T, P, and Q, with states s0 through s9]

Automated Revision of Existing Programs 16

Other Results

The problem of simultaneously adding two leads-to properties is NP-complete.

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete.

Automated Revision of Existing Programs 17

Another Problem

Adding two eventually properties:

1. true leads-to Q

2. true leads-to T

This problem is also NP-complete.

[Figure: state space with regions Q and T]

Automated Revision of Existing Programs 18

Example: Real-Time Resource Allocation

Two processes $j \in \{1, 2\}$. Each has two tasks to complete (each takes 1 time unit):
  Submitting a request
  Performing an I/O operation

$RQ_j :: req_j \wedge (x = 1) \xrightarrow{\{x\}} io_j, req_j := true, false$

$IO_j :: io_j \wedge (x = 1) \xrightarrow{\{x\}} req_j, io_j := true, false$

$WT :: 0 \le x \le 1 \rightarrow wait$

Bounded response:

$L \equiv (io_1 \mapsto_{\le 2} req_1)$

Automated Revision of Existing Programs 19

[Figure: region graph of the resource allocation program. Each region is labeled with the state of process 1 (req1 or io1), the state of process 2 (req2 or io2), and constraints on the clocks x and t. Legend: initial region; regions where io1 becomes true; regions made unreachable; edges removed during addition; edges participating in a shortest path; edges in additional shortest paths]

Automated Revision of Existing Programs 20

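Example (cont'd)

$RQ_1 :: req_1 \wedge (x = 1) \xrightarrow{\{x, t\}} io_1, req_1 := true, false$

$IO_1 :: io_1 \wedge (x = 1) \xrightarrow{\{x\}} req_1, io_1 := true, false$

$RQ_2 :: req_2 \wedge (x = 1) \wedge (io_1 \Rightarrow t \le 1) \xrightarrow{\{x\}} io_2, req_2 := true, false$

$IO_2 :: io_2 \wedge (x = 1) \wedge (io_1 \Rightarrow t \le 1) \xrightarrow{\{x\}} req_2, io_2 := true, false$

$WT :: 0 \le x \le 1 \rightarrow wait$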

Automated Revision of Existing Programs 21

Part II

Adding Fault-Tolerance to Existing Programs

Automated Revision of Existing Programs 22

The Byzantine Agreement Problem

GENERAL
  Decision: $d_g \in \{0, 1\}$

NON-GENERALS
  Decision: $d_j, d_k, d_l \in \{0, 1, \bot\}$
  Final: $f_j, f_k, f_l \in \{false, true\}$

Program (for a non-general j):
  $(d_j = \bot) \wedge (f_j = false) \rightarrow d_j := d_g$
  $(d_j \ne \bot) \wedge (f_j = false) \rightarrow f_j := true$

Automated Revision of Existing Programs 23

The Byzantine Agreement Problem

GENERAL
  Byzantine: $b_g \in \{false, true\}$

NON-GENERALS
  Byzantine: $b_j, b_k, b_l \in \{false, true\}$

Faults:
  $(b_j, b_k, b_l, b_g = false) \rightarrow b_j := true$
  $(b_j = true) \rightarrow d_j := 0|1$

Automated Revision of Existing Programs 24

The Byzantine Agreement Problem

Safety specification:
  Agreement: The final decisions of any two non-Byzantine processes cannot be different.
  Validity: If the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general.

Fact: The program does not meet the safety specification in the presence of faults.

[Figure: a violating computation with $d_g = 0$, $b_g = true$, $d_j = 0$, $f_j = false$, then $f_j = true$, $d_k = 1$, $f_k = true$]
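The violation can be replayed with a short simulation. This is a sketch under stated assumptions: `BOT` (here `None`) stands in for the undecided value, the state is a plain dictionary, and the fault that lets a Byzantine general change its decision is assumed to mirror the non-general fault shown on the previous slide. All function names are hypothetical.

```python
BOT = None  # hypothetical stand-in for the "undecided" value


def copy_decision(state, j):
    """(d_j = bottom) and (f_j = false) -> d_j := d_g"""
    if state["d"][j] is BOT and not state["f"][j]:
        state["d"][j] = state["d"]["g"]


def finalize(state, j):
    """(d_j != bottom) and (f_j = false) -> f_j := true"""
    if state["d"][j] is not BOT and not state["f"][j]:
        state["f"][j] = True


def byzantine_general_changes(state, value):
    """Assumed fault: a Byzantine general may change its decision arbitrarily."""
    if state["b"]["g"]:
        state["d"]["g"] = value


def agreement_holds(state):
    """No two non-Byzantine processes hold different final decisions."""
    finals = {state["d"][j] for j in ("j", "k", "l")
              if state["f"][j] and not state["b"][j]}
    return len(finals) <= 1


state = {
    "d": {"g": 0, "j": BOT, "k": BOT, "l": BOT},
    "f": {"j": False, "k": False, "l": False},
    "b": {"g": True, "j": False, "k": False, "l": False},
}
copy_decision(state, "j")            # j copies 0 from the general
finalize(state, "j")
byzantine_general_changes(state, 1)  # fault: the general now shows 1
copy_decision(state, "k")            # k copies 1
finalize(state, "k")
print(agreement_holds(state))        # False
```

Processes j and k are both non-Byzantine and both finalized, yet they hold different decisions, so Agreement is violated.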

Automated Revision of Existing Programs 25

The Byzantine Agreement Problem

Question: Is it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults?

Answer: Yes, but the problem is NP-complete, which is itself a problem.

Solution: A set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs.

Automated Revision of Existing Programs 26

Some Terminology

Boolean variables: $V = \{v_1, \dots, v_n\}$

State: $s = \bigwedge_{0 < j \le n} l(v_j)$, where $l(v_j)$ is a literal

State predicate: a finite set of states, $S = \bigvee_{0 \le i \le m} s_i$

Transition: let $V' = \{v' \mid v \in V\}$. A transition is $(s, s') = s \wedge s'$

Transition predicate: a set of transitions, $P = \bigvee_{0 \le i \le m} (s_i \wedge s'_i)$

Program: a program is defined by a transition predicate P

Closure: a state predicate S is closed in program P iff $((s \models S) \wedge ((s, s') \models P)) \Rightarrow (s' \models S)$

Computation of P: a sequence of states $s_0, s_1, \dots$ where $(s_j, s_{j+1}) \models P$ for every j

Safety specification: a set of bad transitions that should not occur in any program computation, defined by transition predicate SPEC

Satisfaction: a program P satisfies SPEC from S iff
  1. S is closed in P
  2. For all computations $s_0, s_1, \dots$ where $s_0 \models S$: $(s_j, s_{j+1}) \not\models SPEC$ for every j

Invariant: if P satisfies SPEC from S and $S \ne \emptyset$, then S is an invariant of P
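Closure and satisfaction can be checked directly when predicates are held as explicit sets instead of Boolean formulas. That representation, and the names `is_closed` and `satisfies_from`, are assumptions of this sketch.

```python
def is_closed(S, P):
    """S is closed in P: every transition of P that starts in S ends in S."""
    return all(t in S for (s, t) in P if s in S)


def satisfies_from(P, S, SPEC):
    """P satisfies SPEC from S: S is closed in P and no P-transition from S is a bad transition."""
    return is_closed(S, P) and not any((s, t) in SPEC for (s, t) in P if s in S)


# Hypothetical program and specification: (2, 3) is the only bad transition.
P = {(0, 1), (1, 0), (2, 3)}
SPEC = {(2, 3)}
print(is_closed({0, 1}, P))                # True
print(satisfies_from(P, {0, 1}, SPEC))     # True
print(satisfies_from(P, {0, 1, 2}, SPEC))  # False
```

Because S is closed, a computation that starts in S never leaves it, so checking the transitions that start in S covers every computation from S.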

Automated Revision of Existing Programs 27

Preliminaries (cont'd)

Faults: a set F of transitions

Fault-span: a state predicate T such that
  1. $S \subseteq T$
  2. T is closed in $P \cup F$

[Figure: finite state space containing the invariant inside the fault-span; program transitions are labeled p and fault transitions are labeled f]

Fault-tolerance: a program P is F-tolerant to SPEC from S if
  1. P satisfies SPEC from S
  2. There exists T such that
     T is an F-span of P
     $P \cup F$ satisfies SPEC from T (safety)
     after faults stop occurring, every computation of P reaches S (liveness)
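Any T meeting the two fault-span conditions qualifies. One natural choice, sketched here as an assumption and not as the talk's construction, is the smallest one: the set of states reachable from S by program and fault transitions. The name `fault_span` and the example are hypothetical.

```python
def fault_span(S, P, F):
    """Smallest T with S inside T and T closed in P union F (forward reachability)."""
    T = set(S)
    frontier = list(S)
    steps = set(P) | set(F)
    while frontier:
        s = frontier.pop()
        for (u, t) in steps:
            if u == s and t not in T:
                T.add(t)
                frontier.append(t)
    return T


# Hypothetical example: the fault (1, 2) perturbs the program outside S = {0, 1}.
P = {(0, 1), (1, 0), (2, 0), (3, 3)}
F = {(1, 2)}
print(sorted(fault_span({0, 1}, P, F)))  # [0, 1, 2]
```

State 3 stays outside the fault-span because neither program nor fault transitions reach it from S, and the program transition (2, 0) provides the recovery back to S.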

Automated Revision of Existing Programs 28

Levels of Fault-Tolerance [3]

In the presence of faults:
  A failsafe program satisfies the safety specification
  A nonmasking program satisfies the liveness specification
  A masking program satisfies both safety and liveness specifications

In the absence of faults, the program must continue to satisfy its entire specification.

Question: Are these levels able to capture the requirements for modeling fault-tolerant real-time programs?

[3] A. Arora, S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. TSE 1998.

Automated Revision of Existing Programs 29

Fault-Tolerant Real-Time Programs

Railroad Crossing
  Safety: If the train is passing the crossing, the gate must be closed.
  Bounded response: Once the gate is closed, it should reopen within 4 min.
  Fault-tolerance: The gate must be closed while the train is passing the intersection, even in the presence of faults.

Boiler Controller
  Bounded response: Once the pressure gauge reads 30 pounds per square inch, the controller must issue a command to open a valve within 20 s.
  Fault-tolerance: Faults should not affect the functionality of the controller.

Automated Revision of Existing Programs 8

Preliminary Concepts

A program p is a triple p = Sp Ip p ie finite state space set of initial states and program transitions

A state predicate is any subset of Sp

A computation is a state sequence s0 s1 hellip iff s0 Ip

i gt 0 (si-1 si) p

If terminates in state sf then there does not exist s such that (sf s) p

Automated Revision of Existing Programs 9

Preliminary Concepts (cont)

Sample Properties Safety

P unless Q

stable(P)

invariant(P)

Liveness P leads-to Q P Q

P PP P P P

PP P

P P P Q

Automated Revision of Existing Programs 10

Preliminary Concepts (cont)

A specification is a conjunction of a set of properties

spec = L1 L2 hellip Ln

A computation satisfies spec iff (i | 0 i n satisfies Li)

A program p satisfies spec iff all computations of p1 are infinite

2 satisfy spec

Automated Revision of Existing Programs 11

Problem Statement

Formulation of the problem p satisfies existing specification spece

Sp = Sp

Ip = Ip

p p

All computations of p are infinite satisfy specn

SynthesisAlgorithm

Program p = Sp Ip p

Program p = Sp Ip pA Specification specn

Automated Revision of Existing Programs 12

Adding a Single Leads-to Property (R T )

Sp

Ip R T

Case 1 Deadlock statesRemove transitions (s1 s2) if s2 R and T is not reachable from S2

s0

s1

s2

Automated Revision of Existing Programs 13

Adding a Single Leads-to Property (R T )

Sp Ip R T

Break cycles reachable from R without reaching Q

s1

s4

s2

s3

Case 2 Cycles

s0

Automated Revision of Existing Programs 14

Soundness and Completeness

Theorem (1) The algorithm for adding multiple safety properties along with a leads-to property is sound and complete Fixability

Theorem (2) The complexity of the algorithm for adding multiple safety properties along with a leads-to property is polynomial-time

Automated Revision of Existing Programs 15

Adding Two Leads-to Properties Adding two leads-to properties one after another

Sp

Ip R T

s5

s6

s3s4

s7

P Qs9

s0

s6

s1

s2

s8

Automated Revision of Existing Programs 16

Other Results

The problems of simultaneous addition of two leads-to properties is NP-complete

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete

Automated Revision of Existing Programs 17

Another Problem

Adding two eventually properties

1 true leads-to Q

2 true leads-to T

This problem is also NP-complete

Q

T

Automated Revision of Existing Programs 18

Example Real-Time Resource Allocation

Two processes j 1 2 Each has two tasks to complete (each takes 1 time unit)

Submitting a request Performing IO operation

RQj reqj (x = 1) ioj reqj = true false

IOj ioj (x = 1) reqj ioj = true false WT 0 x 1 1048576 wait

Bounded response

L (io1 2 req1)

x

x

Automated Revision of Existing Programs 19

req10 lt x t lt 1

req2

req1x t = 0req2

req1x t = 1req2

io1x t = 0 req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

io10 lt x t lt 1

req2

io1x t = 0

io2

io10 lt x t lt 1

io2

io1x = 0 t gt 2

req2

io1x = 1 t gt 2

req2

io1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2req2

req1x = 0 t gt 2

req2

req1x = 1 t gt 2

req2

io10 lt x lt 1

t gt 2io2

req1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2 io2

req1x = 1 t gt 2

io2

io1x t = 0 2

io2

req1x t = 0 2

req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

req1x t = 0 2

io2

req1x t = 0 1

req2

req10 lt x lt 11 lt t lt 2

req2

req1x t = 1 2

req2

Regions whereio1 becomes true

Edges removed during addition

Regions made unreachable

Legend

Initial region

Edges participatingin a shortest path

io1x t = 0 1

req2

io1x t = 1

io2

io10 lt x lt 11 lt t lt 2

req2

io1x t = 1 2

req2

Edges in additionalshortest paths

io1x t = 1req2

io1x t = 01

io2

io10 lt x lt 11 lt t lt 2

io2

io1x t = 1 2

io2

io1x t = 0 2

req2

io1x = 1 t gt 2

io2

io10 lt x lt 1

t gt 2req2

Automated Revision of Existing Programs 20

Example (contrsquod)

RQ1 req1 (x = 1) io1 req1 = true false

IO1 io1 (x = 1) req1 io1 = true false

RQ2 req2 (x = 1) (io1 t 1) io2 req2 = true false

IO1 io2 (x = 1) (io1 t 1) req2 io2 = true false

WT 0 x 1 1048576 wait

x t

x

x

x

Automated Revision of Existing Programs 21

Part II

Adding Fault-ToleranceFault-Tolerance to Existing Programs

Automated Revision of Existing Programs 22

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Decision dg 0 1

(dj = ) ( fj = false) dj = dg

(dj ) ( fj = false) fj = true

dj

dk 0 1

dl

Decision

fj

fk false true

fl

Final

GENERAL

NON-GENERALS

Program

Automated Revision of Existing Programs 23

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Byzantine

bg false true

bj

bk false true

bl

Byzantine

(bj bk bl bg = false) bj = true

(bj = true) dj = 0|1Faults

Automated Revision of Existing Programs 24

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Safety specification Agreement The final decision of any two non-Byzantine

process cannot be different

Validity If the general is non-Byzantine then the final decision of a non-Byzantine process must be the same as that of the general

Fact The program does not meet the safety specification in the presence of faults

dg = 0 bg = true dj = 0 fj = false fj = true dk = 1 fk = true

Automated Revision of Existing Programs 25

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

QuestionIs it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults

AnswerYes but the problem is NP-complete which itself is a problem

SolutionA set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs

Automated Revision of Existing Programs 26

Some Terminalogy

Boolean Variables V = v1 vn State where l(vj) is a literal State predicate A finite set of states Transition Let V = v | v V A transition (s s) = s s Transition predicate is a set of transitions Program A program is defined by a transition predicate P Closure A state predicate S is closed in program P iff

(s ||S) (s ||S)

Computation of P A sequence of states s0 s1 where (sj sj+1) || P Safety specification A set of bad transitions that should not occur in any

program computation defined by transition predicate SPEC Satisfaction A program P satisfies SPEC from S iff

1 S is closed in P

2 For all s0 s1 where s0 || S (sj sj+1) || SPEC Invariant If P satisfies SPEC from S and S then S is an invariant of P

imi sS 0

)(0 jnj vls

)(0 iimi ssP

Pss ||)(

Automated Revision of Existing Programs 27

Preliminaries (contrsquod)

Faults A set F of transitions

Fault-span A state predicate T such that

1 S T

2 T is closed in P F

Invariant

f

f

ff

f

Fault-Span

Finite state space

p

pp

p

Fault-tolerance A program P is F-tolerant to SPEC from S if1 P satisfies SPEC from S2 There exists T such that

T is an F-span of P P F satisfies SPEC from T (safety) after faults stop occurring every computation of P reaches S (liveness)

Levels of Fault-Tolerance [3]

In the presence of faults:
A failsafe program satisfies the safety specification.
A nonmasking program satisfies the liveness specification.
A masking program satisfies both safety and liveness specifications.

In the absence of faults, the program must continue to satisfy its entire specification.

Question: Are these levels able to capture the requirements for modeling fault-tolerant real-time programs?

[3] A. Arora, S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. TSE, 1998.

Fault-Tolerant Real-Time Programs

Railroad Crossing
Safety: If the train is passing the crossing, the gate must be closed.
Bounded response: Once the gate is closed, it should reopen within 4 min.
Fault-tolerance: The gate must be closed while the train is passing the intersection, even in the presence of faults.

Boiler Controller
Bounded response: Once the pressure gauge reads 30 pounds per square inch, the controller must issue a command to open a valve within 20 s.
Fault-tolerance: Faults should not affect the functionality of the controller.

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults. However, it should continue to meet its timing constraints in the absence of faults.

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults.

Levels of Fault-Tolerance (cont'd)

[Table: columns are Safety, Liveness, Timing constraints, and Bounded-time recovery; rows are Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, and Hard-masking.]

Problem Statement

[Figure: the synthesis algorithm takes as input a program P, a safety specification, a set of faults f, and the desired level of fault-tolerance (Soft/Hard-Failsafe, Nonmasking, Soft/Hard-Masking), and produces a fault-tolerant program P'.]

Problem Statement

Given a program $P$, an invariant $S$, a set of faults $F$, and a safety specification $SPEC$, identify a program $P'$ with invariant $S'$ such that
$S' \Rightarrow S$,
$P' \Rightarrow P$, and
$P'$ is $F$-tolerant to $SPEC$ from $S'$.

Current Results

[Table: columns are Safety, Liveness, Timing constraints, Bounded-recovery, and Complexity; rows are Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, and Hard-masking. The complexity entries read "Polynomial space, sound and complete algorithm" and "Open problem".]

[2] S. Kulkarni, A. Arora. Automating the Addition of Fault-Tolerance. FTRTFT, 2000.

Example: Altitude Switch

[Figure: timed automaton of the altitude switch with locations Init, Standby, and Await; clocks x, y, z, and t; events LowAlt, Reset, SensorFail, and "Actuator Power on"; guards over Power and CorruptSensor; the fault-span is marked around the model.]

The Altitude Switch is required to:

1) HFS: Tolerate the delay fault while in Standby mode

((Standby CorruptSensor) $\Diamond_{\le 2}$ (Init))

2) HFS: Not power on the actuators while it fails to read the altitude sensors

3) SMK: Recover within 1 s after occurrence of faults

Ongoing Research

Implementation of the algorithms

Coping with the state explosion problem:
Synthesis using zone automata
Parallelizing state space

Coping with NP-hardness results:
Heuristics
Decision procedures (Yices)

Fault-tolerance in hybrid systems

[8] http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm

Conclusion

We study the problem of automated revision of programs inside their state space:
Adding properties to programs
Adding fault-tolerance

Questions

Preliminary Concepts (cont.)

Sample Properties

Safety:
P unless Q
stable(P)
invariant(P)

Liveness:
P leads-to Q ($P \mapsto Q$)

[Figure: example state sequences labeled with P and Q illustrating each property.]
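The slides name these properties without defining them. The sketch below uses the usual UNITY-style readings as an assumption: `P unless Q` means a step from a state in P and outside Q lands in P or Q, `stable(P)` is `P unless false`, `invariant(P)` adds that all initial states are in P, and `P leads-to Q` means every computation from a P-state eventually reaches Q. The graph encoding and function names are illustrative.

```python
def unless(delta, P, Q):
    """Every step from a state in P and not in Q ends in P or in Q."""
    return all(t in P or t in Q for (s, t) in delta if s in P and s not in Q)


def stable(delta, P):
    """P unless false: once P holds, it holds forever."""
    return unless(delta, P, set())


def invariant(delta, init, P):
    """P holds initially and is stable."""
    return init <= P and stable(delta, P)


def leads_to(delta, P, Q):
    """Every computation from a P-state reaches Q (no fairness assumed)."""
    succ = {}
    for s, t in delta:
        succ.setdefault(s, set()).add(t)
    states = {x for tr in delta for x in tr}
    reach_q = set(Q)
    changed = True
    while changed:
        changed = False
        for s in states - reach_q:
            nxt = succ.get(s, set())
            # s inevitably reaches Q once all of its successors do
            if nxt and nxt <= reach_q:
                reach_q.add(s)
                changed = True
    return P <= reach_q


# Hypothetical example: 0 -> 1 -> 2, with a self-loop on 2.
delta = {(0, 1), (1, 2), (2, 2)}
print(unless(delta, {0, 1}, {2}))       # True
print(stable(delta, {0, 1}))            # False: 1 -> 2 leaves {0, 1}
print(invariant(delta, {0}, {0, 1, 2})) # True
print(leads_to(delta, {0}, {2}))        # True
```

Adding the transition `(1, 0)` creates a cycle that avoids state 2, so `leads_to` would then return False for the same P and Q.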

Preliminary Concepts (cont.)

A specification is a conjunction of a set of properties:

$spec = L_1 \wedge L_2 \wedge \ldots \wedge L_n$

A computation $\sigma$ satisfies $spec$ iff $(\forall i \mid 1 \le i \le n : \sigma \text{ satisfies } L_i)$

A program $p$ satisfies $spec$ iff all computations of $p$
1. are infinite
2. satisfy $spec$

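Problem Statement

[Figure: the synthesis algorithm takes a program $p = \langle S_p, I_p, \delta_p \rangle$ and a specification $spec_n$, and produces a program $p' = \langle S_{p'}, I_{p'}, \delta_{p'} \rangle$.]

Formulation of the problem: $p$ satisfies the existing specification $spec_e$, and the revised program $p'$ must satisfy
$S_{p'} = S_p$
$I_{p'} = I_p$
$\delta_{p'} \subseteq \delta_p$
All computations of $p'$ are infinite and satisfy $spec_n$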

Adding a Single Leads-to Property ($R \mapsto T$)

Case 1: Deadlock states
Remove transitions $(s_1, s_2)$ if $s_2 \in R$ and $T$ is not reachable from $s_2$.

[Figure: state space $S_p$ with initial states $I_p$ and state predicates R and T; states s0, s1, and s2.]

Adding a Single Leads-to Property ($R \mapsto T$)

Case 2: Cycles
Break cycles reachable from R without reaching T.

[Figure: state space $S_p$ with initial states $I_p$ and state predicates R and T; states s0 through s4, with a cycle among them.]
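The two cases can be illustrated on an explicit transition graph. The sketch below is not the algorithm from the talk: it is a simplified stand-in that ranks each state by its shortest distance to T, drops transitions into R-states that cannot reach T (Case 1), and, for states reachable from R, keeps only transitions that strictly decrease the rank, which breaks every cycle that avoids T (Case 2). The function name `add_leads_to` and the encoding are assumptions.

```python
import math
from collections import deque


def add_leads_to(delta, R, T):
    """Return a subset of delta in which every computation visiting R reaches T."""
    succ, pred = {}, {}
    for s, t in delta:
        succ.setdefault(s, set()).add(t)
        pred.setdefault(t, set()).add(s)

    # rank[s] = length of a shortest path from s to T (backward BFS)
    rank = {t: 0 for t in T}
    queue = deque(T)
    while queue:
        t = queue.popleft()
        for s in pred.get(t, ()):
            if s not in rank:
                rank[s] = rank[t] + 1
                queue.append(s)

    # region = states reachable from R before T has been reached
    region = set(R) - set(T)
    stack = list(region)
    while stack:
        s = stack.pop()
        for t in succ.get(s, ()):
            if t not in T and t not in region:
                region.add(t)
                stack.append(t)

    revised = set()
    for s, t in delta:
        if t in R and t not in rank:
            continue  # Case 1: target is in R but can never reach T
        if s in region and s in rank and rank.get(t, math.inf) >= rank[s]:
            continue  # Case 2: step does not move closer to T
        revised.add((s, t))
    return revised


# Hypothetical example: cycle 1 <-> 2 and a dead end at 4; R = {1}, T = {3}.
delta = {(0, 1), (1, 2), (2, 1), (2, 3), (1, 4), (4, 4)}
print(sorted(add_leads_to(delta, {1}, {3})))
# [(0, 1), (1, 2), (2, 3), (4, 4)]
```

Removing transitions in Case 1 can leave a source state with no outgoing transition; a complete algorithm has to make such states unreachable as well, which this sketch does not attempt.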


Soundness and Completeness

Theorem (1) The algorithm for adding multiple safety properties along with a leads-to property is sound and complete Fixability

Theorem (2) The complexity of the algorithm for adding multiple safety properties along with a leads-to property is polynomial-time

Adding Two Leads-to Properties

Adding two leads-to properties one after another

[Figure: state space $S_p$ with initial states $I_p$, state predicates R, T, P, and Q, and states s0 through s9.]

Other Results

The problem of simultaneously adding two leads-to properties is NP-complete.

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete.

Another Problem

Adding two eventually properties:

1. true leads-to Q

2. true leads-to T

This problem is also NP-complete.

[Figure: state space with state predicates Q and T.]

Example: Real-Time Resource Allocation

Two processes $j \in \{1, 2\}$. Each has two tasks to complete (each takes 1 time unit):
Submitting a request
Performing an I/O operation

$RQ_j$: $req_j \wedge (x = 1) \xrightarrow{\{x\}} io_j, req_j := \mathit{true}, \mathit{false}$

$IO_j$: $io_j \wedge (x = 1) \xrightarrow{\{x\}} req_j, io_j := \mathit{true}, \mathit{false}$

$WT$: $0 \le x \le 1 \rightarrow \mathit{wait}$

Bounded response:

$L \equiv (io_1 \mapsto_{\le 2} req_1)$

[Figure: region graph of the resource-allocation program. Each region is labeled with req1 or io1, req2 or io2, and constraints on the clocks x and t. Legend: initial region; regions where io1 becomes true; edges removed during addition; regions made unreachable; edges participating in a shortest path; edges in additional shortest paths.]

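Example (cont'd)

$RQ_1$: $req_1 \wedge (x = 1) \xrightarrow{\{x, t\}} io_1, req_1 := \mathit{true}, \mathit{false}$

$IO_1$: $io_1 \wedge (x = 1) \xrightarrow{\{x\}} req_1, io_1 := \mathit{true}, \mathit{false}$

$RQ_2$: $req_2 \wedge (x = 1) \wedge (io_1 \Rightarrow t \le 1) \xrightarrow{\{x\}} io_2, req_2 := \mathit{true}, \mathit{false}$

$IO_2$: $io_2 \wedge (x = 1) \wedge (io_1 \Rightarrow t \le 1) \xrightarrow{\{x\}} req_2, io_2 := \mathit{true}, \mathit{false}$

$WT$: $0 \le x \le 1 \rightarrow \mathit{wait}$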

Part II

Adding Fault-Tolerance to Existing Programs

The Byzantine Agreement Problem

General: decision $d_g \in \{0, 1\}$

Non-generals $j$, $k$, $l$:
Decision: $d_j, d_k, d_l \in \{0, 1, \bot\}$
Final: $f_j, f_k, f_l \in \{\mathit{false}, \mathit{true}\}$

Program:
$(d_j = \bot) \wedge (f_j = \mathit{false}) \rightarrow d_j := d_g$
$(d_j \neq \bot) \wedge (f_j = \mathit{false}) \rightarrow f_j := \mathit{true}$

The Byzantine Agreement Problem

Byzantine: $b_g \in \{\mathit{false}, \mathit{true}\}$ for the general and $b_j, b_k, b_l \in \{\mathit{false}, \mathit{true}\}$ for the non-generals

Faults:
$(b_j, b_k, b_l, b_g = \mathit{false}) \rightarrow b_j := \mathit{true}$
$(b_j = \mathit{true}) \rightarrow d_j := 0|1$

The Byzantine Agreement Problem

Safety specification:
Agreement: The final decisions of any two non-Byzantine processes cannot be different.
Validity: If the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general.

Fact: The program does not meet the safety specification in the presence of faults.

[Figure: counterexample with $d_g = 0$, $b_g = \mathit{true}$; $d_j = 0$, $f_j = \mathit{false}$, then $f_j = \mathit{true}$; $d_k = 1$, $f_k = \mathit{true}$.]
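The violation can be replayed step by step. The sketch below encodes the two program actions and the two fault actions as Python functions over a dictionary state, then runs one faulty computation that matches the values listed on the slide. The encoding, the function names, and the assumption that the fault actions also apply to the general are illustrative choices, not part of the original model.

```python
BOTTOM = None  # stands for the undecided value


def initial_state():
    return {
        "d": {"g": 0, "j": BOTTOM, "k": BOTTOM, "l": BOTTOM},
        "f": {"j": False, "k": False, "l": False},
        "b": {"g": False, "j": False, "k": False, "l": False},
    }


def copy_decision(st, p):
    """Program action: an undecided non-general copies the general's decision."""
    assert st["d"][p] is BOTTOM and not st["f"][p]
    st["d"][p] = st["d"]["g"]


def finalize(st, p):
    """Program action: a decided non-general finalizes its decision."""
    assert st["d"][p] is not BOTTOM and not st["f"][p]
    st["f"][p] = True


def turn_byzantine(st, p):
    """Fault action: a process becomes Byzantine if no process is Byzantine yet."""
    assert not any(st["b"].values())
    st["b"][p] = True


def byzantine_write(st, p, value):
    """Fault action: a Byzantine process sets its decision arbitrarily."""
    assert st["b"][p] and value in (0, 1)
    st["d"][p] = value


def agreement(st):
    """Finalized non-Byzantine non-generals all hold the same decision."""
    finals = {st["d"][p] for p in ("j", "k", "l")
              if st["f"][p] and not st["b"][p]}
    return len(finals) <= 1


def counterexample():
    st = initial_state()
    turn_byzantine(st, "g")      # b_g = true, d_g = 0
    copy_decision(st, "j")       # d_j = 0
    finalize(st, "j")            # f_j = true
    byzantine_write(st, "g", 1)  # the Byzantine general changes its decision
    copy_decision(st, "k")       # d_k = 1
    finalize(st, "k")            # f_k = true
    return st


print(agreement(counterexample()))  # False: j finalized 0, k finalized 1
```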



Automated Revision of Existing Programs 11

Problem Statement

Formulation of the problem p satisfies existing specification spece

Sp = Sp

Ip = Ip

p p

All computations of p are infinite satisfy specn

SynthesisAlgorithm

Program p = Sp Ip p

Program p = Sp Ip pA Specification specn

Automated Revision of Existing Programs 12

Adding a Single Leads-to Property (R T )

Sp

Ip R T

Case 1 Deadlock statesRemove transitions (s1 s2) if s2 R and T is not reachable from S2

s0

s1

s2

Automated Revision of Existing Programs 13

Adding a Single Leads-to Property (R T )

Sp Ip R T

Break cycles reachable from R without reaching Q

s1

s4

s2

s3

Case 2 Cycles

s0

Automated Revision of Existing Programs 14

Soundness and Completeness

Theorem (1) The algorithm for adding multiple safety properties along with a leads-to property is sound and complete Fixability

Theorem (2) The complexity of the algorithm for adding multiple safety properties along with a leads-to property is polynomial-time

Automated Revision of Existing Programs 15

Adding Two Leads-to Properties Adding two leads-to properties one after another

Sp

Ip R T

s5

s6

s3s4

s7

P Qs9

s0

s6

s1

s2

s8

Automated Revision of Existing Programs 16

Other Results

The problems of simultaneous addition of two leads-to properties is NP-complete

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete

Automated Revision of Existing Programs 17

Another Problem

Adding two eventually properties

1 true leads-to Q

2 true leads-to T

This problem is also NP-complete

Q

T

Automated Revision of Existing Programs 18

Example Real-Time Resource Allocation

Two processes j 1 2 Each has two tasks to complete (each takes 1 time unit)

Submitting a request Performing IO operation

RQj reqj (x = 1) ioj reqj = true false

IOj ioj (x = 1) reqj ioj = true false WT 0 x 1 1048576 wait

Bounded response

L (io1 2 req1)

x

x

Automated Revision of Existing Programs 19

req10 lt x t lt 1

req2

req1x t = 0req2

req1x t = 1req2

io1x t = 0 req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

io10 lt x t lt 1

req2

io1x t = 0

io2

io10 lt x t lt 1

io2

io1x = 0 t gt 2

req2

io1x = 1 t gt 2

req2

io1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2req2

req1x = 0 t gt 2

req2

req1x = 1 t gt 2

req2

io10 lt x lt 1

t gt 2io2

req1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2 io2

req1x = 1 t gt 2

io2

io1x t = 0 2

io2

req1x t = 0 2

req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

req1x t = 0 2

io2

req1x t = 0 1

req2

req10 lt x lt 11 lt t lt 2

req2

req1x t = 1 2

req2

Regions whereio1 becomes true

Edges removed during addition

Regions made unreachable

Legend

Initial region

Edges participatingin a shortest path

io1x t = 0 1

req2

io1x t = 1

io2

io10 lt x lt 11 lt t lt 2

req2

io1x t = 1 2

req2

Edges in additionalshortest paths

io1x t = 1req2

io1x t = 01

io2

io10 lt x lt 11 lt t lt 2

io2

io1x t = 1 2

io2

io1x t = 0 2

req2

io1x = 1 t gt 2

io2

io10 lt x lt 1

t gt 2req2

Automated Revision of Existing Programs 20

Example (contrsquod)

RQ1 req1 (x = 1) io1 req1 = true false

IO1 io1 (x = 1) req1 io1 = true false

RQ2 req2 (x = 1) (io1 t 1) io2 req2 = true false

IO1 io2 (x = 1) (io1 t 1) req2 io2 = true false

WT 0 x 1 1048576 wait

x t

x

x

x

Automated Revision of Existing Programs 21

Part II

Adding Fault-ToleranceFault-Tolerance to Existing Programs

Automated Revision of Existing Programs 22

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Decision dg 0 1

(dj = ) ( fj = false) dj = dg

(dj ) ( fj = false) fj = true

dj

dk 0 1

dl

Decision

fj

fk false true

fl

Final

GENERAL

NON-GENERALS

Program

Automated Revision of Existing Programs 23

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Byzantine

bg false true

bj

bk false true

bl

Byzantine

(bj bk bl bg = false) bj = true

(bj = true) dj = 0|1Faults

Automated Revision of Existing Programs 24

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Safety specification Agreement The final decision of any two non-Byzantine

process cannot be different

Validity If the general is non-Byzantine then the final decision of a non-Byzantine process must be the same as that of the general

Fact The program does not meet the safety specification in the presence of faults

dg = 0 bg = true dj = 0 fj = false fj = true dk = 1 fk = true

Automated Revision of Existing Programs 25

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

QuestionIs it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults

AnswerYes but the problem is NP-complete which itself is a problem

SolutionA set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs

Automated Revision of Existing Programs 26

Some Terminalogy

Boolean Variables V = v1 vn State where l(vj) is a literal State predicate A finite set of states Transition Let V = v | v V A transition (s s) = s s Transition predicate is a set of transitions Program A program is defined by a transition predicate P Closure A state predicate S is closed in program P iff

(s ||S) (s ||S)

Computation of P A sequence of states s0 s1 where (sj sj+1) || P Safety specification A set of bad transitions that should not occur in any

program computation defined by transition predicate SPEC Satisfaction A program P satisfies SPEC from S iff

1 S is closed in P

2 For all s0 s1 where s0 || S (sj sj+1) || SPEC Invariant If P satisfies SPEC from S and S then S is an invariant of P

imi sS 0

)(0 jnj vls

)(0 iimi ssP

Pss ||)(

Automated Revision of Existing Programs 27

Preliminaries (contrsquod)

Faults A set F of transitions

Fault-span A state predicate T such that

1 S T

2 T is closed in P F

Invariant

f

f

ff

f

Fault-Span

Finite state space

p

pp

p

Fault-tolerance A program P is F-tolerant to SPEC from S if1 P satisfies SPEC from S2 There exists T such that

T is an F-span of P P F satisfies SPEC from T (safety) after faults stop occurring every computation of P reaches S (liveness)

Levels of Fault-Tolerance [3]

In the presence of faults:

A failsafe program satisfies the safety specification.
A nonmasking program satisfies the liveness specification.
A masking program satisfies both safety and liveness specifications.

In the absence of faults, the program must continue to satisfy its entire specification.

Question: Are these levels able to capture the requirements for modeling fault-tolerant real-time programs?

[3] A. Arora, S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. TSE 1998.

Fault-Tolerant Real-Time Programs

Railroad Crossing
Safety: If the train is passing the crossing, the gate must be closed.
Bounded response: Once the gate is closed, it should reopen within 4 min.
Fault-tolerance: The gate must be closed while the train is passing the intersection, even in the presence of faults.

Boiler Controller
Bounded response: Once the pressure gauge reads 30 pounds per square inch, the controller must issue a command to open a valve within 20 s.
Fault-tolerance: Faults should not affect the functionality of the controller.
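A bounded response requirement such as the boiler example can be checked on a recorded, timestamped trace. The sketch below is a minimal illustration under stated assumptions: the trace is a finite list of (time in seconds, set of observed propositions) pairs, and the proposition names `pressure_30` and `open_valve` are hypothetical labels chosen for this example. It is a runtime check on one trace, not the synthesis method discussed in the talk.

```python
def bounded_response(trace, trigger, response, bound):
    """Return True if every occurrence of `trigger` in the trace is
    followed (at the same or a later sample) by `response` within
    `bound` time units. `trace` is a list of (time, propositions)."""
    for i, (t_i, props_i) in enumerate(trace):
        if trigger in props_i:
            answered = any(
                response in props_j and t_j - t_i <= bound
                for (t_j, props_j) in trace[i:]
            )
            if not answered:
                return False
    return True


# Hypothetical traces for the boiler controller (times in seconds).
good_trace = [(0, set()), (5, {"pressure_30"}), (18, {"open_valve"})]
late_trace = [(0, {"pressure_30"}), (25, {"open_valve"})]

print(bounded_response(good_trace, "pressure_30", "open_valve", 20))
print(bounded_response(late_trace, "pressure_30", "open_valve", 20))
```

In the first trace the valve command arrives 13 s after the reading, which meets the 20 s bound; in the second it arrives after 25 s, which does not.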

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults. However, it should continue to meet its timing constraints in the absence of faults.

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults.

Levels of Fault-Tolerance (cont'd)

[Table]
Columns: Safety, Liveness, Timing constraints, Bounded-time recovery
Rows: Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, Hard-masking

Problem Statement

[Figure: Synthesis Algorithm]
Inputs: program P; safety spec; set of faults f; desired level of fault-tolerance (Soft/Hard-Failsafe, Nonmasking, Soft/Hard-Masking)
Output: fault-tolerant program P'

Problem Statement

Given program $P$, invariant $S$, a set of faults $F$, and safety specification $SPEC$, identify a program $P'$ with invariant $S'$ such that

$S' \subseteq S$, $P' \subseteq P$, and $P'$ is $F$-tolerant to $SPEC$ from $S'$

Current Results

[Table]
Columns: Safety, Liveness, Timing constraints, Bounded-recovery, Complexity
Rows: Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, Hard-masking
Complexity entries: "Polynomial space, sound and complete algorithm"; "Open problem"

[2] S. Kulkarni, A. Arora. Automating the Addition of Fault-Tolerance. FTRTFT 2000.

Example: Altitude Switch

[Figure: timed automaton of the Altitude Switch with locations Init, Standby, and Await; events LowAlt, Reset, SensorFail, and Actuator Power on; guards and resets over x, y, z, and t; the fault-span is marked]

The Altitude Switch is required to:

1) HFS (hard-failsafe): Tolerate the delay fault while in Standby mode

$(Standby \wedge CorruptSensor) \Rightarrow \Diamond_{\le 2}\, Init$

2) HFS (hard-failsafe): Not power on the actuators while it fails to read the altitude sensors

3) SMK (soft-masking): Recover within 1 s after occurrence of faults

Ongoing Research

Implementation of the algorithms

Coping with the state explosion problem
  Synthesis using zone automata
  Parallelizing state space

Coping with NP-hardness results
  Heuristics
  Decision procedures (Yices)

Fault-tolerance in hybrid systems

[8] http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm

Conclusion

We study the problem of automated revision of programs inside their state space:

Adding properties to programs
Adding fault-tolerance

Questions

Adding a Single Leads-to Property (R leads-to T)

Case 1: Deadlock states

Remove transitions (s1, s2) if s2 ∈ R and T is not reachable from s2.

[Figure: state space Sp with invariant Ip and state predicates R and T; states s0, s1, s2]
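The rule in Case 1 can be sketched on an explicit transition set. The Python below is a minimal, single-pass illustration under stated assumptions: the program is a set of (source, target) pairs, R and T are sets of states, and the function names and the example graph are hypothetical. It does not handle deadlocks that the removal itself may create, and it is not presented as the authors' full algorithm.

```python
def can_reach(transitions, targets):
    """Backward reachability: all states with a path to some target."""
    reach = set(targets)
    changed = True
    while changed:
        changed = False
        for (a, b) in transitions:
            if b in reach and a not in reach:
                reach.add(a)
                changed = True
    return reach


def remove_dead_end_transitions(P, R, T):
    """Drop every transition (s1, s2) where s2 is in R and T is not
    reachable from s2, so such R-states can no longer be entered."""
    ok = can_reach(P, T)
    return {(a, b) for (a, b) in P if not (b in R and b not in ok)}


# Hypothetical example: s2 is in R but has no path to T = {"t"}.
P = {("s0", "s1"), ("s1", "s2"), ("s0", "s3"), ("s3", "t")}
R = {"s2"}
T = {"t"}

revised = remove_dead_end_transitions(P, R, T)
print(sorted(revised))
```

Here only the transition ("s1", "s2") is removed, since s2 is an R-state from which T cannot be reached.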

Adding a Single Leads-to Property (R leads-to T)

Case 2: Cycles

Break cycles reachable from R without reaching T.

[Figure: state space Sp with invariant Ip and state predicates R and T; states s0 to s4]
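One possible way to break such cycles is to rank each state by its shortest distance to T and, for states reachable from R, keep only transitions that strictly decrease the rank. The sketch below illustrates that idea; the ranking strategy is an assumption made for this example and is not confirmed by the slides as the authors' method, and all names and the example graph are hypothetical.

```python
from collections import deque


def rank_to_target(P, T):
    """Breadth-first search backwards from T: rank[s] is the length of
    a shortest path from s to T. States that cannot reach T get no rank."""
    preds = {}
    for (a, b) in P:
        preds.setdefault(b, []).append(a)
    rank = {t: 0 for t in T}
    queue = deque(T)
    while queue:
        b = queue.popleft()
        for a in preds.get(b, []):
            if a not in rank:
                rank[a] = rank[b] + 1
                queue.append(a)
    return rank


def reachable_from(P, R):
    """Forward reachability from the states in R."""
    succs = {}
    for (a, b) in P:
        succs.setdefault(a, []).append(b)
    seen = set(R)
    stack = list(R)
    while stack:
        a = stack.pop()
        for b in succs.get(a, []):
            if b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def break_cycles(P, R, T):
    """For ranked states reachable from R (and outside T), keep only
    transitions that strictly decrease the rank; keep all others."""
    rank = rank_to_target(P, T)
    region = reachable_from(P, R)
    kept = set()
    for (a, b) in P:
        if a not in region or a in T or a not in rank:
            kept.add((a, b))          # outside the region being revised
        elif b in rank and rank[b] < rank[a]:
            kept.add((a, b))          # moves strictly closer to T
    return kept


# Hypothetical example with the cycle s1 -> s2 -> s3 -> s1.
P = {("s0", "s1"), ("s1", "s2"), ("s2", "s3"), ("s3", "s1"), ("s2", "s4")}
R = {"s1"}
T = {"s4"}

print(sorted(break_cycles(P, R, T)))
```

In the example the transition ("s2", "s3") is removed because it moves away from T, which breaks the cycle while every state reachable from R keeps a path to T.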

Soundness and Completeness

Theorem (1): The algorithm for adding multiple safety properties along with a leads-to property is sound and complete.

Fixability

Theorem (2): The complexity of the algorithm for adding multiple safety properties along with a leads-to property is polynomial-time.

Adding Two Leads-to Properties

Adding two leads-to properties one after another

[Figure: state space Sp with invariant Ip and state predicates R, T, P, and Q; states s0 to s9]

Other Results

The problem of simultaneously adding two leads-to properties is NP-complete.

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete.

Another Problem

Adding two eventually properties:

1. true leads-to Q
2. true leads-to T

This problem is also NP-complete.

[Figure: state predicates Q and T]

Example: Real-Time Resource Allocation

Two processes $j \in \{1, 2\}$. Each has two tasks to complete (each takes 1 time unit):

Submitting a request
Performing an I/O operation

RQ_j: $req_j \wedge (x = 1) \rightarrow io_j, req_j := true, false$ (resets clock $x$)
IO_j: $io_j \wedge (x = 1) \rightarrow req_j, io_j := true, false$ (resets clock $x$)
WT: $0 \le x \le 1 \rightarrow$ wait

Bounded response:

$L \equiv (io_1 \mapsto_{\le 2} req_1)$

[Figure: region graph of the resource allocation program; each region is labeled with the values of req1/io1 and req2/io2 and with constraints on the clocks x and t. Legend: initial region; regions where io1 becomes true; regions made unreachable; edges removed during addition; edges participating in a shortest path; edges in additional shortest paths]

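Example (cont'd)

RQ1: $req_1 \wedge (x = 1) \rightarrow io_1, req_1 := true, false$ (resets clocks $x$, $t$)
IO1: $io_1 \wedge (x = 1) \rightarrow req_1, io_1 := true, false$ (resets clock $x$)
RQ2: $req_2 \wedge (x = 1) \wedge (\neg io_1 \vee t \le 1) \rightarrow io_2, req_2 := true, false$ (resets clock $x$)
IO2: $io_2 \wedge (x = 1) \wedge (\neg io_1 \vee t \le 1) \rightarrow req_2, io_2 := true, false$ (resets clock $x$)
WT: $0 \le x \le 1 \rightarrow$ wait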

Part II

Adding Fault-Tolerance to Existing Programs

The Byzantine Agreement Problem

GENERAL
Decision: $d_g \in \{0, 1\}$

NON-GENERALS
Decision: $d_j, d_k, d_l \in \{0, 1, \bot\}$
Final: $f_j, f_k, f_l \in \{false, true\}$

Program:
$(d_j = \bot) \wedge (f_j = false) \rightarrow d_j := d_g$
$(d_j \ne \bot) \wedge (f_j = false) \rightarrow f_j := true$

The Byzantine Agreement Problem

GENERAL
Byzantine: $b_g \in \{false, true\}$

NON-GENERALS
Byzantine: $b_j, b_k, b_l \in \{false, true\}$

Faults:
$(b_j, b_k, b_l, b_g = false) \rightarrow b_j := true$
$(b_j = true) \rightarrow d_j := 0|1$

The Byzantine Agreement Problem

Safety specification:

Agreement: The final decisions of any two non-Byzantine processes cannot be different.

Validity: If the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general.

Fact: The program does not meet the safety specification in the presence of faults.

[Figure: example state with dg = 0, bg = true; dj = 0, fj changing from false to true; dk = 1, fk = true]
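The violation stated in the Fact can be replayed with a small simulation. The Python sketch below is illustrative only: it encodes the two program actions and the two fault actions from the preceding slides as functions over a dictionary state, and it assumes the fault actions shown for non-general j apply to the general g in the same way. The function names, the use of `None` for the undecided value, and the particular schedule are hypothetical choices for this example.

```python
BOT = None               # stands for the undecided value (bottom)
PROCS = "gjkl"           # the general g and non-generals j, k, l


def initial_state(dg):
    s = {"d.g": dg, "b.g": False}
    for p in "jkl":
        s.update({"d." + p: BOT, "f." + p: False, "b." + p: False})
    return s


def copy_decision(s, p):
    """Program action: (d.p = bottom) and (f.p = false) -> d.p := d.g"""
    assert s["d." + p] is BOT and not s["f." + p]
    return {**s, "d." + p: s["d.g"]}


def finalize(s, p):
    """Program action: (d.p != bottom) and (f.p = false) -> f.p := true"""
    assert s["d." + p] is not BOT and not s["f." + p]
    return {**s, "f." + p: True}


def become_byzantine(s, p):
    """Fault action: if no process is Byzantine, p becomes Byzantine."""
    assert not any(s["b." + q] for q in PROCS)
    return {**s, "b." + p: True}


def byzantine_write(s, p, value):
    """Fault action: a Byzantine process sets its decision to 0 or 1."""
    assert s["b." + p] and value in (0, 1)
    return {**s, "d." + p: value}


def agreement_holds(s):
    """No two finalized non-Byzantine non-generals decide differently."""
    finals = {s["d." + p] for p in "jkl" if s["f." + p] and not s["b." + p]}
    return len(finals) <= 1


# Schedule: the general turns Byzantine and changes its decision
# between the moments at which k and j copy it.
s = initial_state(dg=1)
s = become_byzantine(s, "g")
s = finalize(copy_decision(s, "k"), "k")   # k finalizes 1
s = byzantine_write(s, "g", 0)
s = finalize(copy_decision(s, "j"), "j")   # j finalizes 0

print(agreement_holds(s))
```

The final state has d.g = 0, b.g = true, d.j = 0, f.j = true, d.k = 1, and f.k = true, and Agreement does not hold in it.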



Automated Revision of Existing Programs 14

Soundness and Completeness

Theorem (1) The algorithm for adding multiple safety properties along with a leads-to property is sound and complete Fixability

Theorem (2) The complexity of the algorithm for adding multiple safety properties along with a leads-to property is polynomial-time

Automated Revision of Existing Programs 15

Adding Two Leads-to Properties Adding two leads-to properties one after another

Sp

Ip R T

s5

s6

s3s4

s7

P Qs9

s0

s6

s1

s2

s8

Automated Revision of Existing Programs 16

Other Results

The problems of simultaneous addition of two leads-to properties is NP-complete

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete

Automated Revision of Existing Programs 17

Another Problem

Adding two eventually properties

1 true leads-to Q

2 true leads-to T

This problem is also NP-complete

Q

T

Automated Revision of Existing Programs 18

Example Real-Time Resource Allocation

Two processes j 1 2 Each has two tasks to complete (each takes 1 time unit)

Submitting a request Performing IO operation

RQj reqj (x = 1) ioj reqj = true false

IOj ioj (x = 1) reqj ioj = true false WT 0 x 1 1048576 wait

Bounded response

L (io1 2 req1)

x

x

Automated Revision of Existing Programs 19

req10 lt x t lt 1

req2

req1x t = 0req2

req1x t = 1req2

io1x t = 0 req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

io10 lt x t lt 1

req2

io1x t = 0

io2

io10 lt x t lt 1

io2

io1x = 0 t gt 2

req2

io1x = 1 t gt 2

req2

io1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2req2

req1x = 0 t gt 2

req2

req1x = 1 t gt 2

req2

io10 lt x lt 1

t gt 2io2

req1x = 0 t gt 2

io2

req10 lt x lt 1

t gt 2 io2

req1x = 1 t gt 2

io2

io1x t = 0 2

io2

req1x t = 0 2

req2

req1x t = 0 1

io2

req10 lt x lt 11 lt t lt 2

io2

req1x t = 1 2

io2

req1x t = 0 2

io2

req1x t = 0 1

req2

req10 lt x lt 11 lt t lt 2

req2

req1x t = 1 2

req2

Regions whereio1 becomes true

Edges removed during addition

Regions made unreachable

Legend

Initial region

Edges participatingin a shortest path

io1x t = 0 1

req2

io1x t = 1

io2

io10 lt x lt 11 lt t lt 2

req2

io1x t = 1 2

req2

Edges in additionalshortest paths

io1x t = 1req2

io1x t = 01

io2

io10 lt x lt 11 lt t lt 2

io2

io1x t = 1 2

io2

io1x t = 0 2

req2

io1x = 1 t gt 2

io2

io10 lt x lt 1

t gt 2req2

Automated Revision of Existing Programs 20

Example (contrsquod)

RQ1 req1 (x = 1) io1 req1 = true false

IO1 io1 (x = 1) req1 io1 = true false

RQ2 req2 (x = 1) (io1 t 1) io2 req2 = true false

IO1 io2 (x = 1) (io1 t 1) req2 io2 = true false

WT 0 x 1 1048576 wait

x t

x

x

x

Automated Revision of Existing Programs 21

Part II

Adding Fault-ToleranceFault-Tolerance to Existing Programs

Automated Revision of Existing Programs 22

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Decision dg 0 1

(dj = ) ( fj = false) dj = dg

(dj ) ( fj = false) fj = true

dj

dk 0 1

dl

Decision

fj

fk false true

fl

Final

GENERAL

NON-GENERALS

Program

Automated Revision of Existing Programs 23

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Byzantine

bg false true

bj

bk false true

bl

Byzantine

(bj bk bl bg = false) bj = true

(bj = true) dj = 0|1Faults

Automated Revision of Existing Programs 24

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Safety specification Agreement The final decision of any two non-Byzantine

process cannot be different

Validity If the general is non-Byzantine then the final decision of a non-Byzantine process must be the same as that of the general

Fact The program does not meet the safety specification in the presence of faults

dg = 0 bg = true dj = 0 fj = false fj = true dk = 1 fk = true

Automated Revision of Existing Programs 25

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

QuestionIs it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults

AnswerYes but the problem is NP-complete which itself is a problem

SolutionA set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs

Automated Revision of Existing Programs 26

Some Terminalogy

Boolean Variables V = v1 vn State where l(vj) is a literal State predicate A finite set of states Transition Let V = v | v V A transition (s s) = s s Transition predicate is a set of transitions Program A program is defined by a transition predicate P Closure A state predicate S is closed in program P iff

(s ||S) (s ||S)

Computation of P A sequence of states s0 s1 where (sj sj+1) || P Safety specification A set of bad transitions that should not occur in any

program computation defined by transition predicate SPEC Satisfaction A program P satisfies SPEC from S iff

1 S is closed in P

2 For all s0 s1 where s0 || S (sj sj+1) || SPEC Invariant If P satisfies SPEC from S and S then S is an invariant of P

imi sS 0

)(0 jnj vls

)(0 iimi ssP

Pss ||)(

Automated Revision of Existing Programs 27

Preliminaries (contrsquod)

Faults A set F of transitions

Fault-span A state predicate T such that

1 S T

2 T is closed in P F

Invariant

f

f

ff

f

Fault-Span

Finite state space

p

pp

p

Fault-tolerance A program P is F-tolerant to SPEC from S if1 P satisfies SPEC from S2 There exists T such that

T is an F-span of P P F satisfies SPEC from T (safety) after faults stop occurring every computation of P reaches S (liveness)

Automated Revision of Existing Programs 28

Levels of Fault-Tolerance [3]

In the presence of faults

A failsafe program satisfies the safety specification A nonmasking program satisfies the liveness specification A masking program satisfies both safety and liveness specifications

In the absence of faults the program must continue to satisfy its entire specification

Question Are these levels able to capture the requirements for modeling fault-tolerant real-time programs

[3] A Arora S Kulkarni [3] A Arora S Kulkarni Designing masking fault-tolerance via nonmasking fault-toleranceDesigning masking fault-tolerance via nonmasking fault-tolerance TSE 1998TSE 1998

Automated Revision of Existing Programs 29

Fault-Tolerant Real-Time Programs

Railroad Crossing Safety If the train is passing the crossing the gate must be closed Bounded response Once the gate is closed it should reopen within

4min Fault-tolerance The gate must be closed while the train is passing the

intersection even in the presence of faults

Boiler Controller Bounded response Once the pressure gauge reads 30 pounds per

square inch the controller must issue a command to open a valve within 20s

Fault-tolerance Faults should not affect the functionality of the controller

Automated Revision of Existing Programs 30

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults However it should continue to meet its timing constraints in the absence of faults

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults

Automated Revision of Existing Programs 31

Levels of Fault-Tolerance (cont'd)

[Table: rows Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, Hard-masking; columns Safety, Liveness, Timing constraints, Bounded-time recovery (cell marks not shown)]

Automated Revision of Existing Programs 32

Problem Statement

[Diagram: the Synthesis Algorithm takes a program P, a safety spec, a set of faults f, and the desired level of fault-tolerance (Soft/Hard-Failsafe, Nonmasking, Soft/Hard-Masking) as input and produces a fault-tolerant program P']

Automated Revision of Existing Programs 33

Problem Statement

Given a program $P$ with invariant $S$, a set of faults $F$, and a safety specification SPEC, identify a program $P'$ with invariant $S'$ such that

$S' \subseteq S$, $P' \subseteq P$, and $P'$ is $F$-tolerant to SPEC from $S'$
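To make the shape of the problem concrete, here is a simplified, untimed sketch of failsafe revision on an explicit state graph. It is not the algorithm behind the results on these slides: it ignores timing constraints and distribution, and only removes transitions and states, so the output can only shrink the program and its invariant. The function name `add_failsafe` and the encoding (integer states, transitions as pairs) are assumptions made for illustration.

```python
def add_failsafe(P, F, SPEC, S):
    """Return (P2, S2) with P2 a subset of P and S2 a subset of S, or None."""
    P, F, SPEC, S = set(P), set(F), set(SPEC), set(S)

    # ms: states from which fault transitions alone can violate SPEC.
    ms = {s for (s, t) in F if (s, t) in SPEC}
    grew = True
    while grew:
        grew = False
        for (s, t) in F:
            if t in ms and s not in ms:
                ms.add(s)
                grew = True

    # Drop program transitions that are bad or that enter ms.
    P1 = {(s, t) for (s, t) in P if (s, t) not in SPEC and t not in ms}

    # Shrink the invariant: remove ms, then remove states that had a
    # successor in the original program but are now deadlocked.
    had_successor = {s for (s, _) in P}
    S1 = S - ms
    while True:
        inside = {(s, t) for (s, t) in P1 if s in S1 and t in S1}
        alive = {s for (s, _) in inside}
        stuck = {s for s in S1 if s in had_successor and s not in alive}
        if not stuck:
            break
        S1 -= stuck
    if not S1:
        return None  # no nonempty invariant survives

    # Keep the new invariant closed: drop transitions that leave it.
    P2 = {(s, t) for (s, t) in P1 if s not in S1 or t in S1}
    return P2, S1
```

Because the sketch never adds a transition or a state, the conditions on the revised program and the revised invariant hold by construction; the remaining work in a real tool is deciding which transitions to remove while keeping the invariant nonempty.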

Automated Revision of Existing Programs 34

Current Results

[Table: rows Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, Hard-masking; columns Safety, Liveness, Timing constraints, Bounded-recovery, Complexity. Complexity entries on the slide: "Polynomial space, sound and complete algorithm" and "Open problem" (row assignment not shown)]

[2] S. Kulkarni, A. Arora. Automating the Addition of Fault-Tolerance. FTRTFT 2000.

Automated Revision of Existing Programs 35

Example: Altitude Switch

[Figure: state-transition diagram of the altitude switch with modes Init, Standby, and Await; events LowAlt, Reset, SensorFail, and Actuator Power on; predicates Power and CorruptSensor; variables x, y, z, and t; the fault-span is outlined]

The Altitude Switch is required to:

1) HFS (hard-failsafe): Tolerate the delay fault while in Standby mode

((Standby CorruptSensor) $\Diamond_{\le 2}$ (Init))

2) HFS: Not power on the actuators while it fails to read the altitude sensors

3) SMK (soft-masking): Recover within 1 s after the occurrence of faults


Automated Revision of Existing Programs 37

Ongoing Research

Implementation of the algorithms

Coping with the state explosion problem

  Synthesis using zone automata

  Parallelizing state space

Coping with NP-hardness results

  Heuristics

  Decision procedures (Yices)

Fault-tolerance in hybrid systems

[8] http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm

Automated Revision of Existing Programs 38

Conclusion

We study the problem of automated revision of programs inside their state space:

Adding properties to programs

Adding fault-tolerance

Automated Revision of Existing Programs 39

Questions

Automated Revision of Existing Programs 15

Adding Two Leads-to Properties

Adding two leads-to properties one after another

[Figure: state space with states s0 through s9 and regions labeled Sp, Ip, P, Q, R, and T]

Automated Revision of Existing Programs 16

Other Results

The problem of simultaneously adding two leads-to properties is NP-complete

The problem of adding one leads-to property while maintaining maximum non-determinism is NP-complete

Automated Revision of Existing Programs 17

Another Problem

Adding two eventually properties:

1. true leads-to Q

2. true leads-to T

This problem is also NP-complete

[Figure: state space with regions Q and T]
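Checking whether a given finite-state program already satisfies a property of the form "true leads-to Q" is straightforward; the hardness result concerns revising the program so that two such properties hold together. The sketch below only does the checking part, assuming integer states, transitions as pairs, and that a deadlock outside Q counts as a violation. The name `always_reaches` is hypothetical.

```python
def always_reaches(P, states, Q):
    """True iff every computation of P from every state in `states` reaches Q."""
    P, states, good = set(P), set(states), set(Q)
    changed = True
    while changed:
        changed = False
        for s in states - good:
            succs = {t for (u, t) in P if u == s}
            # s is good once all of its successors are known to reach Q.
            if succs and succs <= good:
                good.add(s)
                changed = True
    return states <= good


def satisfies_both(P, states, Q, T):
    """Check 'true leads-to Q' and 'true leads-to T' on the same program."""
    return always_reaches(P, states, Q) and always_reaches(P, states, T)
```

With `P = {(0, 1), (1, 2), (2, 2)}` the program satisfies "true leads-to {2}" but not "true leads-to {1}", because the computation that stays in state 2 never visits state 1.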

Automated Revision of Existing Programs 18

Example: Real-Time Resource Allocation

Two processes $j \in \{1, 2\}$. Each has two tasks to complete (each takes 1 time unit):

Submitting a request

Performing an IO operation

$RQ_j :: req_j \wedge (x = 1) \xrightarrow{\{x\}} io_j, req_j := true, false$

$IO_j :: io_j \wedge (x = 1) \xrightarrow{\{x\}} req_j, io_j := true, false$

$WT :: 0 \le x \le 1 \rightarrow \text{wait}$

Bounded response:

$L \equiv (io_1 \mapsto_{\le 2} req_1)$

Automated Revision of Existing Programs 19

[Figure: region graph of the resource allocation example. Each region is labeled with the values of req1/io1 and req2/io2 and with constraints on the clocks x and t. Legend: initial region; regions where io1 becomes true; edges participating in a shortest path; edges in additional shortest paths; edges removed during addition; regions made unreachable]

Automated Revision of Existing Programs 20

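Example (cont'd)

$RQ_1 :: req_1 \wedge (x = 1) \xrightarrow{\{x, t\}} io_1, req_1 := true, false$

$IO_1 :: io_1 \wedge (x = 1) \xrightarrow{\{x\}} req_1, io_1 := true, false$

$RQ_2 :: req_2 \wedge (x = 1) \wedge (io_1 \Rightarrow t \le 1) \xrightarrow{\{x\}} io_2, req_2 := true, false$

$IO_2 :: io_2 \wedge (x = 1) \wedge (io_1 \Rightarrow t \le 1) \xrightarrow{\{x\}} req_2, io_2 := true, false$

$WT :: 0 \le x \le 1 \rightarrow \text{wait}$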

Automated Revision of Existing Programs 21

Part II

Adding Fault-Tolerance to Existing Programs

Automated Revision of Existing Programs 22

The Byzantine Agreement Problem

GENERAL

Decision: $d_g \in \{0, 1\}$

NON-GENERALS

Decision: $d_j, d_k, d_l \in \{0, 1, \bot\}$

Final: $f_j, f_k, f_l \in \{false, true\}$

Program

$(d_j = \bot) \wedge (f_j = false) \rightarrow d_j := d_g$

$(d_j \neq \bot) \wedge (f_j = false) \rightarrow f_j := true$
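The two program actions can be written out as a successor function. This is a minimal sketch under stated assumptions: a state is encoded as a tuple `(dg, d, f)` with one entry per non-general, the undecided value is represented by `None`, and the name `program_successors` is hypothetical.

```python
from typing import List, Optional, Tuple

BOT = None  # stands for the undecided value of a non-general

State = Tuple[int, Tuple[Optional[int], ...], Tuple[bool, ...]]


def program_successors(state: State) -> List[State]:
    """All states reachable in one fault-free program step."""
    dg, d, f = state
    result = []
    for j in range(len(d)):
        if d[j] is BOT and not f[j]:
            # Undecided and not finalized: copy the general's decision.
            result.append((dg, d[:j] + (dg,) + d[j + 1:], f))
        elif d[j] is not BOT and not f[j]:
            # Decided but not finalized: finalize the decision.
            result.append((dg, d, f[:j] + (True,) + f[j + 1:]))
    return result
```

Starting from a state in which the general has decided 1 and every non-general is undecided, repeatedly taking any successor ends in the state where all three non-generals hold the decision 1 and have finalized it.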

Automated Revision of Existing Programs 23

The Byzantine Agreement Problem

GENERAL

Byzantine: $b_g \in \{false, true\}$

NON-GENERALS

Byzantine: $b_j, b_k, b_l \in \{false, true\}$

Faults

$(b_j, b_k, b_l, b_g = false) \rightarrow b_j := true$

$(b_j = true) \rightarrow d_j := 0|1$

Automated Revision of Existing Programs 24

The Byzantine Agreement Problem

Safety specification

Agreement: The final decisions of any two non-Byzantine processes cannot be different

Validity: If the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general

Fact: The program does not meet the safety specification in the presence of faults

[Figure: counterexample with $d_g = 0$, $b_g = true$; $d_j = 0$ with $f_j$ changing from false to true; $d_k = 1$, $f_k = true$]
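A small explicit-state search reproduces this fact. The sketch below explores the program actions together with the fault actions and looks for a state that violates Agreement. It assumes three non-generals, at most one Byzantine process, and that a Byzantine general may change its decision in the same way a Byzantine non-general can; that last fault action is an assumption by analogy and is not written out on the slide. All function names are hypothetical.

```python
from collections import deque

BOT = None  # undecided
N = 3       # number of non-generals


def put(t, i, v):
    """Return a copy of tuple t with position i set to v."""
    return t[:i] + (v,) + t[i + 1:]


def successors(state):
    dg, bg, d, f, b = state
    # Program actions of each non-general.
    for j in range(N):
        if d[j] is BOT and not f[j]:
            yield (dg, bg, put(d, j, dg), f, b)
        elif d[j] is not BOT and not f[j]:
            yield (dg, bg, d, put(f, j, True), b)
    # Fault: one process becomes Byzantine if none is Byzantine yet.
    if not bg and not any(b):
        yield (dg, True, d, f, b)
        for j in range(N):
            yield (dg, bg, d, f, put(b, j, True))
    # Fault: a Byzantine process changes its decision arbitrarily.
    if bg:  # assumed by analogy with the non-general fault
        for v in (0, 1):
            yield (v, bg, d, f, b)
    for j in range(N):
        if b[j]:
            for v in (0, 1):
                yield (dg, bg, put(d, j, v), f, b)


def violates_agreement(state):
    _, _, d, f, b = state
    final = {d[j] for j in range(N) if f[j] and not b[j]}
    return len(final) > 1


def find_violation():
    """Breadth-first search for a reachable state that violates Agreement."""
    init = [(dg, False, (BOT,) * N, (False,) * N, (False,) * N)
            for dg in (0, 1)]
    seen, queue = set(init), deque(init)
    while queue:
        s = queue.popleft()
        if violates_agreement(s):
            return s
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return None
```

Under these assumptions `find_violation()` returns a state in which the general is Byzantine and two non-Byzantine non-generals have finalized different decisions, which matches the scenario sketched on the slide.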

Automated Revision of Existing Programs 25

The Byzantine Agreement Problem

Question: Is it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults?

Answer: Yes, but the problem is NP-complete, which itself is a problem

Solution: A set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs

Automated Revision of Existing Programs 26

Some Terminology

Boolean variables: $V = \{v_1, \ldots, v_n\}$

State: $s = \bigwedge_{0 \le j \le n} l(v_j)$, where $l(v_j)$ is a literal

State predicate: A finite set of states, $S = \bigvee_{0 \le i \le m} s_i$

Transition: Let $V' = \{v' \mid v \in V\}$. A transition is $(s, s') = s \wedge s'$

Transition predicate: A set of transitions, $P = \bigvee_{0 \le i \le m} (s_i \wedge s'_i)$

Program: A program is defined by a transition predicate $P$

Closure: A state predicate $S$ is closed in program $P$ iff

$((s \models S) \wedge ((s, s') \models P)) \Rightarrow (s' \models S)$

Computation of $P$: A sequence of states $s_0, s_1, \ldots$ where $(s_j, s_{j+1}) \models P$

Safety specification: A set of bad transitions that should not occur in any program computation, defined by transition predicate SPEC

Satisfaction: A program $P$ satisfies SPEC from $S$ iff

1. $S$ is closed in $P$

2. For all computations $s_0, s_1, \ldots$ where $s_0 \models S$: $(s_j, s_{j+1}) \not\models SPEC$

Invariant: If $P$ satisfies SPEC from $S$ and $S$ is nonempty, then $S$ is an invariant of $P$
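On an explicit state space these definitions reduce to simple set checks. The following is a minimal sketch, assuming states are integers and that a program and a safety specification are both given as sets of `(source, target)` pairs; the function names are hypothetical.

```python
from typing import Set, Tuple

State = int
Transition = Tuple[State, State]


def is_closed(S: Set[State], P: Set[Transition]) -> bool:
    """S is closed in P iff every transition of P that starts in S ends in S."""
    return all(t in S for (s, t) in P if s in S)


def satisfies(P: Set[Transition], SPEC: Set[Transition], S: Set[State]) -> bool:
    """P satisfies SPEC from S iff S is closed in P and no transition of P
    that starts in S is one of the bad transitions in SPEC."""
    if not is_closed(S, P):
        return False
    return all((s, t) not in SPEC for (s, t) in P if s in S)


def is_invariant(P: Set[Transition], SPEC: Set[Transition], S: Set[State]) -> bool:
    """S is an invariant of P iff S is nonempty and P satisfies SPEC from S."""
    return bool(S) and satisfies(P, SPEC, S)
```

Because S is closed, every computation that starts in S stays in S, so it is enough to inspect the transitions that start in S rather than enumerate computations.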

Automated Revision of Existing Programs 27

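Preliminaries (cont'd)

Faults: A set $F$ of transitions

Fault-span: A state predicate $T$ such that

1. $S \subseteq T$

2. $T$ is closed in $P \cup F$

[Figure: the invariant nested inside the fault-span, which lies inside the finite state space; program transitions are labeled p and fault transitions are labeled f]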
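One natural fault-span is the set of states reachable from the invariant by program and fault transitions together; it is the smallest predicate that meets both conditions. A minimal sketch, assuming integer states and transitions as pairs, with the hypothetical name `fault_span`:

```python
from collections import deque


def fault_span(S, P, F):
    """Smallest T that contains S and is closed in the union of P and F."""
    succ = {}
    for s, t in set(P) | set(F):
        succ.setdefault(s, set()).add(t)
    T = set(S)
    queue = deque(T)
    while queue:
        s = queue.popleft()
        for t in succ.get(s, ()):
            if t not in T:
                T.add(t)
                queue.append(t)
    return T
```

For `S = {0, 1}`, `P = {(0, 1), (1, 0), (2, 0)}` and `F = {(1, 2), (3, 4)}`, the fault `(1, 2)` adds state 2 to the span, while states 3 and 4 stay outside because they are unreachable from S.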




Automated Revision of Existing Programs 18

Example Real-Time Resource Allocation

Two processes j 1 2 Each has two tasks to complete (each takes 1 time unit)

Submitting a request Performing IO operation

RQj reqj (x = 1) ioj reqj = true false

IOj ioj (x = 1) reqj ioj = true false WT 0 x 1 1048576 wait

Bounded response

L (io1 2 req1)

x

x

Automated Revision of Existing Programs 19

[Figure: region graph of the resource allocation example. Each region is labeled with the status of process 1 (req1 or io1), the status of process 2 (req2 or io2), and constraints on the clocks x and t. Legend: initial region; regions where io1 becomes true; edges participating in a shortest path; edges in additional shortest paths; edges removed during addition; regions made unreachable.]

Automated Revision of Existing Programs 20

Example (cont'd)

  RQ1 :: req1 ∧ (x = 1) → io1, req1 := true, false   (resets clocks x, t)
  IO1 :: io1 ∧ (x = 1) → req1, io1 := true, false   (resets clock x)
  RQ2 :: req2 ∧ (x = 1) ∧ (io1 ⇒ t ≤ 1) → io2, req2 := true, false   (resets clock x)
  IO2 :: io2 ∧ (x = 1) ∧ (io1 ⇒ t ≤ 1) → req2, io2 := true, false   (resets clock x)
  WT  :: 0 ≤ x ≤ 1 → wait
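To make the effect of the added guard concrete, here is a minimal sketch that explores a unit-step abstraction of the two-process example. Several things here are assumptions and not part of the talk: time is sampled only at integer instants (the talk works on a dense-time region graph), the added guard on process 2 is read as "io1 implies t ≤ 1", the clock t is tracked as an observer in the original program too, and the names `successors` and `violates_bounded_response` are hypothetical.

```python
from collections import deque

T_CAP = 3  # every value of clock t above 2 is represented by 3 (assumption)


def successors(state, revised):
    """Yield the successors of (p1, p2, x, t) in a unit-step abstraction."""
    p1, p2, x, t = state
    if x == 0:
        # WT: let one time unit pass.
        yield (p1, p2, 1, min(t + 1, T_CAP))
        return
    # x == 1: one action fires and resets clock x.
    if p1 == "req":
        yield ("io", p2, 0, 0)   # RQ1: io1 becomes true, so t is reset as well
    else:
        yield ("req", p2, 0, t)  # IO1
    # Process 2 may always move in the original program; in the revised
    # program it may move only if io1 is false or t <= 1.
    if not revised or p1 != "io" or t <= 1:
        yield (p1, "io" if p2 == "req" else "req", 0, t)


def violates_bounded_response(revised):
    """Return True if some reachable state has io1 true for more than 2 time units."""
    init = ("req", "req", 0, 0)
    seen = {init}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        p1, _, _, t = state
        if p1 == "io" and t > 2:
            return True
        for nxt in successors(state, revised):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Under these assumptions, process 2 can starve process 1 in the original program by taking every time slot, while the guard in the revised program blocks process 2 once t reaches 2, so process 1 is the only one that can move.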

Automated Revision of Existing Programs 21

Part II

Adding Fault-Tolerance to Existing Programs

Automated Revision of Existing Programs 22

The Byzantine Agreement Problem

General
  Decision: dg ∈ {0, 1}

Non-generals (j, k, l)
  Decision: dj, dk, dl ∈ {0, 1, ⊥}
  Final: fj, fk, fl ∈ {false, true}

Program (for each non-general j)
  (dj = ⊥) ∧ (fj = false) → dj := dg
  (dj ≠ ⊥) ∧ (fj = false) → fj := true

Automated Revision of Existing Programs 23

The Byzantine Agreement Problem

General
  Byzantine: bg ∈ {false, true}

Non-generals (j, k, l)
  Byzantine: bj, bk, bl ∈ {false, true}

Faults
  (bj = false) ∧ (bk = false) ∧ (bl = false) ∧ (bg = false) → bj := true
  (bj = true) → dj := 0|1

Automated Revision of Existing Programs 24

The Byzantine Agreement Problem

Safety specification:

  Agreement: The final decisions of any two non-Byzantine processes cannot be different.
  Validity: If the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general.

Fact: The program does not meet the safety specification in the presence of faults.

  Example: dg = 0, bg = true, dj = 0, fj = false, fj = true, dk = 1, fk = true (with a Byzantine general, non-generals j and k finalize different decisions)
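The fault scenario on this slide can be replayed step by step. The sketch below is an illustration under stated assumptions: the undecided value of a non-general is modeled as `None`, the step in which the Byzantine general changes its decision from 0 to 1 is implied by the slide but not shown on it, and all function names are hypothetical.

```python
BOTTOM = None  # stands for the "undecided" value of a non-general (assumption)
NON_GENERALS = ("j", "k", "l")


def initial_state(dg):
    """All non-generals undecided, nobody finalized, nobody Byzantine."""
    return {
        "d": {"g": dg, **{p: BOTTOM for p in NON_GENERALS}},
        "f": {p: False for p in NON_GENERALS},
        "b": {p: False for p in ("g",) + NON_GENERALS},
    }


def copy_decision(s, p):
    """Program action: an undecided, non-finalized p copies the general's decision."""
    if s["d"][p] is BOTTOM and not s["f"][p]:
        s["d"][p] = s["d"]["g"]


def finalize(s, p):
    """Program action: a decided, non-finalized p finalizes its decision."""
    if s["d"][p] is not BOTTOM and not s["f"][p]:
        s["f"][p] = True


def become_byzantine(s, p):
    """Fault action: p becomes Byzantine if no process is Byzantine yet."""
    if not any(s["b"].values()):
        s["b"][p] = True


def byzantine_flip(s, p, value):
    """Fault action: a Byzantine p changes its decision arbitrarily."""
    if s["b"][p]:
        s["d"][p] = value


def agreement_holds(s):
    """Agreement: finalized non-Byzantine non-generals share one decision."""
    finals = {s["d"][p] for p in NON_GENERALS if s["f"][p] and not s["b"][p]}
    return len(finals) <= 1


def replay_counterexample():
    s = initial_state(dg=0)
    become_byzantine(s, "g")   # bg = true
    copy_decision(s, "j")      # dj = 0
    finalize(s, "j")           # fj = true
    byzantine_flip(s, "g", 1)  # implied step: the general now shows 1
    copy_decision(s, "k")      # dk = 1
    finalize(s, "k")           # fk = true
    return s
```

After the replay, j and k are both non-Byzantine and finalized, yet they hold different decisions, which is the agreement violation the slide points at.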

Automated Revision of Existing Programs 25

The Byzantine Agreement Problem

Question: Is it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults?

Answer: Yes, but the problem is NP-complete, which is itself a problem.

Solution: A set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs.

Automated Revision of Existing Programs 26

Some Terminology

Boolean variables: V = {v1, …, vn}
State: s = ⋀(1 ≤ j ≤ n) l(vj), where l(vj) is a literal
State predicate: a finite set of states, S = ⋁(0 ≤ i ≤ m) si
Transition: let V' = {v' | v ∈ V}; a transition is (s, s') = s ∧ s'
Transition predicate: a set of transitions, P = ⋁(0 ≤ i ≤ m) (si, s'i)
Program: a program is defined by a transition predicate P
Closure: a state predicate S is closed in program P iff

  ((s ⊨ S) ∧ ((s, s') ⊨ P)) ⇒ (s' ⊨ S)

Computation of P: a sequence of states ⟨s0, s1, …⟩ where (sj, sj+1) ⊨ P
Safety specification: a set of bad transitions that should not occur in any program computation, defined by transition predicate SPEC
Satisfaction: a program P satisfies SPEC from S iff

  1. S is closed in P
  2. For all computations ⟨s0, s1, …⟩ where s0 ⊨ S: (sj, sj+1) ⊭ SPEC

Invariant: if P satisfies SPEC from S and S is nonempty, then S is an invariant of P
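The definitions of closure, satisfaction, and invariant can be checked directly on a small explicit-state model. The sketch below is only an illustration: it represents state predicates as Python sets of states and transition predicates as sets of pairs, where the talk uses Boolean formulas, and the function names are hypothetical.

```python
def is_closed(S, P):
    """S is closed in P: every P-transition that starts in S also ends in S."""
    return all(t in S for (s, t) in P if s in S)


def satisfies(P, SPEC, S):
    """P satisfies SPEC from S: S is closed in P and no computation that
    starts in S takes a bad transition. Because S is closed, it is enough
    to inspect the P-transitions whose source state lies in S."""
    if not is_closed(S, P):
        return False
    return not any(s in S and (s, t) in SPEC for (s, t) in P)


def is_invariant(S, P, SPEC):
    """S is an invariant of P if S is nonempty and P satisfies SPEC from S."""
    return bool(S) and satisfies(P, SPEC, S)


# A four-state example: the transition (2, 3) is the only bad transition.
P = {(0, 1), (1, 0), (2, 3), (3, 0)}
SPEC = {(2, 3)}
```

With this example, `{0, 1}` is an invariant, `{0, 1, 2}` is not closed because the transition (2, 3) leaves it, and `{0, 1, 2, 3}` is closed but contains the bad transition.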

Automated Revision of Existing Programs 27

Preliminaries (cont'd)

Faults: a set F of transitions

Fault-span: a state predicate T such that

  1. S ⊆ T
  2. T is closed in P ∪ F

[Figure: the invariant nested inside the fault-span, which lies inside the finite state space; program transitions are labeled p and fault transitions are labeled f.]

Fault-tolerance: a program P is F-tolerant to SPEC from S if

  1. P satisfies SPEC from S
  2. There exists T such that
       T is an F-span of P
       P ∪ F satisfies SPEC from T (safety)
       after faults stop occurring, every computation of P reaches S (liveness)
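As an illustration of these definitions, the sketch below checks F-tolerance on an explicit finite state space. It is an assumption-laden sketch and not the method of the talk: predicates are plain Python sets, the candidate fault-span is taken to be the set of states reachable from S under program and fault transitions (the smallest set that contains S and is closed in both), a deadlock outside S counts as a failure to recover, and all names are hypothetical.

```python
def reachable(start, transitions):
    """States reachable from `start` by following `transitions`."""
    seen, frontier = set(start), list(start)
    while frontier:
        s = frontier.pop()
        for (a, b) in transitions:
            if a == s and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen


def is_closed(S, P):
    return all(b in S for (a, b) in P if a in S)


def satisfies(P, SPEC, S):
    """S is closed in P and no P-transition from S is a bad transition."""
    return is_closed(S, P) and not any(a in S and (a, b) in SPEC for (a, b) in P)


def fault_span(S, P, F):
    """Smallest T that contains S and is closed in P and F together."""
    return reachable(S, P | F)


def always_recovers(T, S, P):
    """Every computation of P alone that starts in T reaches S."""
    safe = set(S)
    changed = True
    while changed:
        changed = False
        for s in T - safe:
            succ = {b for (a, b) in P if a == s}
            # s surely reaches S if it can move and all its moves do.
            if succ and succ <= safe:
                safe.add(s)
                changed = True
    return safe >= T


def is_f_tolerant(P, F, SPEC, S):
    if not satisfies(P, SPEC, S):
        return False
    T = fault_span(S, P, F)
    return satisfies(P | F, SPEC, T) and always_recovers(T, S, P)
```

Checking only the smallest fault-span is enough here, because any larger candidate contains more states and therefore can only make the safety and recovery conditions harder to meet.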

Automated Revision of Existing Programs 28

Levels of Fault-Tolerance [3]

In the presence of faults:

  A failsafe program satisfies the safety specification.
  A nonmasking program satisfies the liveness specification.
  A masking program satisfies both the safety and liveness specifications.

In the absence of faults, the program must continue to satisfy its entire specification.

Question: Are these levels able to capture the requirements for modeling fault-tolerant real-time programs?

[3] A. Arora, S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. TSE, 1998.

Automated Revision of Existing Programs 29

Fault-Tolerant Real-Time Programs

Railroad Crossing

  Safety: If the train is passing the crossing, the gate must be closed.
  Bounded response: Once the gate is closed, it should reopen within 4 min.
  Fault-tolerance: The gate must be closed while the train is passing the intersection, even in the presence of faults.

Boiler Controller

  Bounded response: Once the pressure gauge reads 30 pounds per square inch, the controller must issue a command to open a valve within 20 s.
  Fault-tolerance: Faults should not affect the functionality of the controller.

Automated Revision of Existing Programs 30

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults. However, it should continue to meet its timing constraints in the absence of faults.

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults

Automated Revision of Existing Programs 31

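Levels of Fault-Tolerance (cont'd)

[Table: levels of fault-tolerance against the requirements each level meets in the presence of faults. Columns: Safety; Liveness; Timing constraints; Bounded-time recovery. Rows: Soft-failsafe; Hard-failsafe; Nonmasking; Soft-masking; Hard-masking.]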

Automated Revision of Existing Programs 32

Problem Statement

Inputs to the synthesis algorithm:

  Program P
  Safety spec
  Set of faults f
  Desired level of fault-tolerance: Soft/Hard-Failsafe, Nonmasking, or Soft/Hard-Masking

Output:

  Fault-tolerant program P'

Automated Revision of Existing Programs 33

Problem Statement

Given a program P, an invariant S, a set of faults F, and a safety specification SPEC, identify a program P' with invariant S' such that

  S' ⊆ S, P' ⊆ P, and P' is F-tolerant to SPEC from S'
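For intuition about what such a revision can look like, here is a simplified, untimed sketch of adding failsafe fault-tolerance on an explicit state space. It is written in the spirit of the addition of fault-tolerance to untimed programs and is not the real-time algorithm of this talk. The assumptions: predicates are Python sets, every state of the invariant is expected to have an outgoing program transition, recovery and timing are ignored, and the name `add_failsafe` is hypothetical.

```python
def add_failsafe(P, F, SPEC, S):
    """Return (P1, S1) with P1 ⊆ P and S1 ⊆ S, or None if S1 becomes empty.

    P, F, SPEC are sets of (source, target) pairs; S is a set of states.
    """
    # ms: states from which fault transitions alone can execute a bad transition.
    ms = {a for (a, b) in F if (a, b) in SPEC}
    changed = True
    while changed:
        changed = False
        for (a, b) in F:
            if b in ms and a not in ms:
                ms.add(a)
                changed = True

    # Drop program transitions that are bad themselves or that enter ms.
    P1 = {(a, b) for (a, b) in P if (a, b) not in SPEC and b not in ms}

    # Shrink the invariant: remove ms, then repeatedly remove states that
    # have no remaining transition into the shrunken invariant.
    S1 = set(S) - ms
    changed = True
    while changed:
        changed = False
        for s in list(S1):
            if not any(a == s and b in S1 for (a, b) in P1):
                S1.discard(s)
                changed = True
    if not S1:
        return None

    # Keep S1 closed: drop transitions that leave it.
    P1 = {(a, b) for (a, b) in P1 if not (a in S1 and b not in S1)}
    return P1, S1
```

The sketch only removes states and transitions, so the result stays inside the original invariant and the original program, which matches the two containment conditions of the problem statement.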

Automated Revision of Existing Programs 34

Current Results

[Table: current results for each level of fault-tolerance. Columns: Safety; Liveness; Timing constraints; Bounded recovery; Complexity. Rows: Soft-failsafe; Hard-failsafe; Nonmasking; Soft-masking; Hard-masking. Complexity entries: "Polynomial space, sound and complete algorithm" and "Open problem".]

[2] S. Kulkarni, A. Arora. Automating the Addition of Fault-Tolerance. FTRTFT, 2000.

Automated Revision of Existing Programs 35

Example: Altitude Switch

[Figure: timed automaton of the altitude switch with locations Init, Standby, Await, and Actuator Power on; events Reset, LowAlt, and SensorFail; clock variables x, y, z, and t; and the fault-span marked.]

The Altitude Switch is required to:

  1) HFS (hard-failsafe): Tolerate the delay fault while in Standby mode

       □((Standby ∧ CorruptSensor) ⇒ ◇≤2 (Init))

  2) HFS (hard-failsafe): Not power on the actuators while it fails to read the altitude sensors

  3) SMK (soft-masking): Recover within 1 s after the occurrence of faults


Automated Revision of Existing Programs 37

Ongoing Research

Implementation of the algorithms

Coping with the state explosion problem:

  Synthesis using zone automata
  Parallelizing state space

Coping with NP-hardness results:

  Heuristics
  Decision procedures (Yices)

Fault-tolerance in hybrid systems

[8] http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm

Automated Revision of Existing Programs 38

Conclusion

We study the problem of automated revision of programs inside their state space:

  Adding properties to programs
  Adding fault-tolerance

Automated Revision of Existing Programs 39

Questions




Automated Revision of Existing Programs 22

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Decision dg 0 1

(dj = ) ( fj = false) dj = dg

(dj ) ( fj = false) fj = true

dj

dk 0 1

dl

Decision

fj

fk false true

fl

Final

GENERAL

NON-GENERALS

Program

Automated Revision of Existing Programs 23

The Byzantine Agreement ProblemThe Byzantine Agreement Problem

Byzantine

bg false true

bj

bk false true

bl

Byzantine

(bj bk bl bg = false) bj = true

(bj = true) dj = 0|1Faults

Safety specification:
  Agreement: The final decisions of any two non-Byzantine processes cannot be different.
  Validity: If the general is non-Byzantine, then the final decision of a non-Byzantine process must be the same as that of the general.

Fact: The program does not meet the safety specification in the presence of faults.

[Figure: counterexample passing through $d_g = 0$, $b_g = \mathit{true}$; $d_j = 0$, $f_j = \mathit{false}$; $f_j = \mathit{true}$; $d_k = 1$, $f_k = \mathit{true}$]
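The counterexample can be replayed step by step. The following is a minimal Python sketch, not the author's tool: the dictionary encoding of states, the function names, and the two fault actions on the general (the slides only show the fault actions of a non-general j, so the general's analogues are an assumption) are all hypothetical.

```python
BOT = None  # stands for the undecided value of a non-general
NON_GENERALS = ("j", "k", "l")


def initial_state():
    """General has decided 0; non-generals are undecided; nobody is Byzantine."""
    s = {"d_g": 0, "b_g": False}
    for p in NON_GENERALS:
        s.update({f"d_{p}": BOT, f"f_{p}": False, f"b_{p}": False})
    return s


def copy_decision(s, p):
    """Program action: (d_p = BOT) and (f_p = false) -> d_p := d_g."""
    assert s[f"d_{p}"] is BOT and not s[f"f_{p}"], "guard does not hold"
    return {**s, f"d_{p}": s["d_g"]}


def finalize(s, p):
    """Program action: (d_p != BOT) and (f_p = false) -> f_p := true."""
    assert s[f"d_{p}"] is not BOT and not s[f"f_{p}"], "guard does not hold"
    return {**s, f"f_{p}": True}


def fault_general_becomes_byzantine(s):
    """Assumed fault: the general turns Byzantine if no process is Byzantine."""
    assert not s["b_g"] and not any(s[f"b_{p}"] for p in NON_GENERALS)
    return {**s, "b_g": True}


def fault_general_changes_decision(s, value):
    """Assumed fault: a Byzantine general may change its decision arbitrarily."""
    assert s["b_g"] and value in (0, 1)
    return {**s, "d_g": value}


def agreement_holds(s):
    """Agreement: finalized non-Byzantine non-generals hold the same decision."""
    finals = {s[f"d_{p}"] for p in NON_GENERALS
              if s[f"f_{p}"] and not s[f"b_{p}"]}
    return len(finals) <= 1


def replay_counterexample():
    s = initial_state()
    s = fault_general_becomes_byzantine(s)    # b_g = true, d_g = 0
    s = copy_decision(s, "j")                 # d_j = 0, f_j = false
    s = finalize(s, "j")                      # f_j = true
    s = fault_general_changes_decision(s, 1)  # Byzantine general flips to 1
    s = copy_decision(s, "k")                 # d_k = 1
    s = finalize(s, "k")                      # f_k = true
    return s


final_state = replay_counterexample()
print(agreement_holds(final_state))
```

In the final state, j and k are both non-Byzantine and finalized but hold different decisions, so `agreement_holds` returns `False`.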

Question: Is it possible to revise a distributed program in order to add fault-tolerance to the original program with respect to a safety specification and a class of faults?

Answer: Yes, but the problem is NP-complete, which itself is a problem.

Solution: A set of polynomial-time heuristics has been developed to add fault-tolerance to a large class of distributed programs.

Some Terminology

Boolean variables: $V = \{v_1, \ldots, v_n\}$

State: $s = \bigwedge_{0 \leq j \leq n} l(v_j)$, where $l(v_j)$ is a literal

State predicate: A finite set of states, $S = \bigvee_{0 \leq i \leq m} s_i$

Transition: Let $V' = \{v' \mid v \in V\}$. A transition is $(s, s') = s \wedge s'$

Transition predicate: A set of transitions, $P = \bigvee_{0 \leq i \leq m} (s_i \wedge s'_i)$

Program: A program is defined by a transition predicate $P$

Closure: A state predicate $S$ is closed in program $P$ iff

  $((s \models S) \wedge ((s, s') \models P)) \Rightarrow (s' \models S)$

Computation of $P$: A sequence of states $s_0 s_1 \ldots$ where $(s_j, s_{j+1}) \models P$

Safety specification: A set of bad transitions that should not occur in any program computation, defined by transition predicate SPEC

Satisfaction: A program $P$ satisfies SPEC from $S$ iff

  1. $S$ is closed in $P$
  2. For all computations $s_0 s_1 \ldots$ where $s_0 \models S$: $(s_j, s_{j+1}) \not\models \mathit{SPEC}$

Invariant: If $P$ satisfies SPEC from $S$ and $S \neq \{\}$, then $S$ is an invariant of $P$
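The definitions of closure, satisfaction, and invariant can be checked directly on a small explicit model. The sketch below is only an illustration under stated assumptions: states are plain Python values, predicates are explicit sets rather than Boolean formulas, and the function names and the three-state example are hypothetical.

```python
def is_closed(S, P):
    """S is closed in P iff every transition of P that starts in S ends in S."""
    return all(t in S for (s, t) in P if s in S)


def satisfies_from(P, SPEC, S):
    """P satisfies SPEC from S iff S is closed in P and no transition of P
    that starts in S is one of the bad transitions in SPEC."""
    if not is_closed(S, P):
        return False
    return not any((s, t) in SPEC for (s, t) in P if s in S)


def is_invariant(S, P, SPEC):
    """S is an invariant of P iff S is nonempty and P satisfies SPEC from S."""
    return bool(S) and satisfies_from(P, SPEC, S)


# Hypothetical three-state example.
P = {("idle", "busy"), ("busy", "idle"), ("error", "idle")}
SPEC = {("idle", "error")}          # the single bad transition
S = {"idle", "busy"}

print(is_closed(S, P))              # the program never leaves S
print(is_closed({"idle"}, P))       # ("idle", "busy") leaves {"idle"}
print(is_invariant(S, P, SPEC))
```

Because a closed predicate is never left, checking the transitions that start in S is enough to cover every computation that starts in S.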

Preliminaries (cont'd)

Faults: A set $F$ of transitions

Fault-span: A state predicate $T$ such that

  1. $S \subseteq T$
  2. $T$ is closed in $P \cup F$

[Figure: a finite state space containing the invariant inside the fault-span; program transitions are labeled p and fault transitions are labeled f]

Fault-tolerance: A program $P$ is F-tolerant to SPEC from $S$ if

  1. $P$ satisfies SPEC from $S$
  2. There exists $T$ such that
     $T$ is an F-span of $P$
     $P \cup F$ satisfies SPEC from $T$ (safety)
     After faults stop occurring, every computation of $P$ reaches $S$ (liveness)
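On an explicit finite model, the fault-span and the F-tolerance conditions can be checked by simple graph searches. The sketch below is an assumption-laden illustration, not the algorithm of the talk: it uses explicit sets of states and transitions, takes the smallest fault-span (the states reachable from S under program and fault transitions, which is sufficient because it is contained in every other fault-span), and treats a deadlock outside S as a failure to recover. All names are hypothetical.

```python
def smallest_fault_span(S, P, F):
    """States reachable from S using program (P) and fault (F) transitions."""
    succ = {}
    for s, t in P | F:
        succ.setdefault(s, set()).add(t)
    T, frontier = set(S), list(S)
    while frontier:
        s = frontier.pop()
        for t in succ.get(s, ()):
            if t not in T:
                T.add(t)
                frontier.append(t)
    return T


def safe_from(states, transitions, SPEC):
    """No transition that starts in `states` is a bad transition."""
    return not any((s, t) in SPEC for (s, t) in transitions if s in states)


def always_recovers(T, S, P):
    """Every computation of P that starts in T reaches S (no cycle or
    deadlock outside S)."""
    good = set(S)
    changed = True
    while changed:
        changed = False
        for s in T - good:
            out = [t for (a, t) in P if a == s]
            if out and all(t in good for t in out):
                good.add(s)
                changed = True
    return T <= good


def is_f_tolerant(P, S, F, SPEC):
    closed = all(t in S for (s, t) in P if s in S)
    if not (closed and safe_from(S, P, SPEC)):
        return False                      # P must satisfy SPEC from S
    T = smallest_fault_span(S, P, F)
    return safe_from(T, P | F, SPEC) and always_recovers(T, S, P)


# Hypothetical example: a fault throws the program from state 0 to state 2.
S = {0}
F = {(0, 2)}
SPEC = {(2, 0)}                           # never jump from 2 straight back to 0
P_recovering = {(0, 0), (2, 1), (1, 0)}   # recovers via state 1
P_looping = {(0, 0), (2, 1), (1, 2)}      # cycles between 1 and 2 forever

print(is_f_tolerant(P_recovering, S, F, SPEC))
print(is_f_tolerant(P_looping, S, F, SPEC))
```

The first program recovers to S after the fault and never takes the bad transition; the second stays in the fault-span forever, so only the first is reported as F-tolerant.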

Levels of Fault-Tolerance [3]

In the presence of faults:
  A failsafe program satisfies the safety specification.
  A nonmasking program satisfies the liveness specification.
  A masking program satisfies both safety and liveness specifications.

In the absence of faults, the program must continue to satisfy its entire specification.

Question: Are these levels able to capture the requirements for modeling fault-tolerant real-time programs?

[3] A. Arora, S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. TSE 1998.

Fault-Tolerant Real-Time Programs

Railroad Crossing:
  Safety: If the train is passing the crossing, the gate must be closed.
  Bounded response: Once the gate is closed, it should reopen within 4 min.
  Fault-tolerance: The gate must be closed while the train is passing the intersection, even in the presence of faults.

Boiler Controller:
  Bounded response: Once the pressure gauge reads 30 pounds per square inch, the controller must issue a command to open a valve within 20 s.
  Fault-tolerance: Faults should not affect the functionality of the controller.
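A bounded-response requirement such as the boiler controller's can be checked on a recorded, timestamped trace. The following is a minimal sketch under stated assumptions: the trace format (a time-ordered list of `(time_in_seconds, event)` pairs), the event names, and the function name are hypothetical, and a trigger whose response does not appear in the finite trace is counted as a violation.

```python
def bounded_response(trace, trigger, response, bound):
    """True iff every `trigger` event is followed by a `response` event
    no later than `bound` time units afterwards."""
    for i, (t0, event) in enumerate(trace):
        if event != trigger:
            continue
        answered = any(
            e == response and t0 <= t <= t0 + bound
            for (t, e) in trace[i + 1:]
        )
        if not answered:
            return False
    return True


# Hypothetical boiler traces: the valve must be commanded open within 20 s.
on_time = [(0.0, "pressure_30psi"), (12.5, "open_valve")]
too_late = [(0.0, "pressure_30psi"), (25.0, "open_valve")]

print(bounded_response(on_time, "pressure_30psi", "open_valve", 20.0))
print(bounded_response(too_late, "pressure_30psi", "open_valve", 20.0))
```

The same check applies to the railroad crossing by using a gate-closed event as the trigger, a gate-reopened event as the response, and a bound of 240 s.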

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults. However, it should continue to meet its timing constraints in the absence of faults.

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults.

Levels of Fault-Tolerance (cont'd)

[Table: the levels Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, and Hard-masking against the columns Safety, Liveness, Timing constraints, and Bounded-time recovery]

Problem Statement

[Figure: a synthesis algorithm takes as input a program P, a safety specification, a set of faults f, and the desired level of fault-tolerance (Soft/Hard-Failsafe, Nonmasking, or Soft/Hard-Masking), and outputs a fault-tolerant program P']

Given a program $P$, invariant $S$, a set of faults $F$, and safety specification SPEC, identify a program $P'$ with invariant $S'$ such that

  $S' \subseteq S$, $P' \subseteq P$, and $P'$ is F-tolerant to SPEC from $S'$
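For the failsafe case, the flavor of such a revision can be sketched on an explicit finite model. The code below is only an illustration in the spirit of the addition of failsafe fault-tolerance described in the fault-tolerance literature; it is not presented as the algorithm of this talk. It assumes explicit sets of states and transitions, assumes every state of the original invariant has an outgoing program transition, and only removes transitions and states (so the revised program and invariant are subsets of the originals). All names are hypothetical.

```python
def add_failsafe(P, S, F, SPEC):
    """Return (P1, S1) with P1 a subset of P and S1 a subset of S such that
    no bad transition is reachable from S1 under P1 and F, or None if the
    invariant becomes empty."""
    # ms: states from which fault transitions alone can violate SPEC.
    ms = {s for (s, t) in F if (s, t) in SPEC}
    changed = True
    while changed:
        changed = False
        for (s, t) in F:
            if t in ms and s not in ms:
                ms.add(s)
                changed = True

    # Remove program transitions that are bad or that lead into ms.
    P1 = {(s, t) for (s, t) in P if (s, t) not in SPEC and t not in ms}
    S1 = set(S) - ms

    # Shrink S1 until it is closed in P1 and contains no deadlock state.
    changed = True
    while changed:
        changed = False
        P1 = {(s, t) for (s, t) in P1 if not (s in S1 and t not in S1)}
        dead = {s for s in S1 if not any(a == s for (a, _) in P1)}
        if dead:
            S1 -= dead
            changed = True
    return (P1, S1) if S1 else None


# Hypothetical example: a fault moves the program from state 1 to state 2,
# from which the original program could take the bad transition (2, 3).
P = {(0, 1), (1, 0), (2, 3), (2, 0)}
S = {0, 1}
F = {(1, 2)}

print(add_failsafe(P, S, F, SPEC={(2, 3)}))   # drops only the bad transition
print(add_failsafe(P, S, F, SPEC={(1, 2)}))   # the fault itself is bad
```

In the first call only the bad transition `(2, 3)` is removed and the invariant is unchanged. In the second call the fault from state 1 directly violates the specification, so state 1 must be avoided, state 0 then deadlocks, and no nonempty invariant remains.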

Current Results

[Table: the levels Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, and Hard-masking against the columns Safety, Liveness, Timing constraints, Bounded-recovery, and Complexity; the Complexity entries are "Polynomial space, sound and complete algorithm" and "Open problem"]

[2] S. Kulkarni, A. Arora. Automating the Addition of Fault-Tolerance. FTRTFT 2000.

Example: Altitude Switch

[Figure: timed automaton of the Altitude Switch with locations Init, Standby, and Await; variables x, y, z, and t; events LowAlt, Reset, and SensorFail; the output "Actuator Power on"; guards over Power and CorruptSensor; and the fault-span marked]






Automated Revision of Existing Programs 27

Preliminaries (contrsquod)

Faults A set F of transitions

Fault-span A state predicate T such that

1 S T

2 T is closed in P F

Invariant

f

f

ff

f

Fault-Span

Finite state space

p

pp

p

Fault-tolerance A program P is F-tolerant to SPEC from S if1 P satisfies SPEC from S2 There exists T such that

T is an F-span of P P F satisfies SPEC from T (safety) after faults stop occurring every computation of P reaches S (liveness)

Automated Revision of Existing Programs 28

Levels of Fault-Tolerance [3]

In the presence of faults:

A failsafe program satisfies the safety specification.

A nonmasking program satisfies the liveness specification.

A masking program satisfies both safety and liveness specifications.

In the absence of faults, the program must continue to satisfy its entire specification.

Question: Are these levels able to capture the requirements for modeling fault-tolerant real-time programs?

[3] A. Arora, S. Kulkarni. Designing masking fault-tolerance via nonmasking fault-tolerance. TSE, 1998.

Fault-Tolerant Real-Time Programs

Railroad Crossing

Safety: If the train is passing the crossing, the gate must be closed.

Bounded response: Once the gate is closed, it should reopen within 4 min.

Fault-tolerance: The gate must be closed while the train is passing the intersection, even in the presence of faults.

Boiler Controller

Bounded response: Once the pressure gauge reads 30 pounds per square inch, the controller must issue a command to open a valve within 20 s.

Fault-tolerance: Faults should not affect the functionality of the controller.
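
A bounded-response requirement such as "reopen within 4 min" can be checked on a recorded, timestamped run. The Python sketch below is an illustration only. The event names, the trace format, and the function `bounded_response` are assumptions, and treating a trigger that is still pending at the end of a finite trace as a violation is a modeling choice, not something stated in the slides.

```python
def bounded_response(trace, trigger, response, bound):
    """Check that every `trigger` is followed by a `response` within `bound`.

    `trace` is a list of (timestamp, event) pairs sorted by timestamp.
    """
    pending = []  # timestamps of triggers still waiting for a response
    for time, event in trace:
        # A pending trigger older than the bound has missed its deadline.
        if any(time - start > bound for start in pending):
            return False
        if event == response:
            pending.clear()
        elif event == trigger:
            pending.append(time)
    # Assumption: a trigger left unanswered at the end counts as unmet.
    return not pending


# Hypothetical railroad-crossing runs, with time measured in seconds.
on_time = [(0, "gate_closed"), (180, "gate_open")]
late = [(0, "gate_closed"), (300, "gate_open")]

on_time_ok = bounded_response(on_time, "gate_closed", "gate_open", 240)
late_ok = bounded_response(late, "gate_closed", "gate_open", 240)
```

The same function would cover the boiler requirement by using the pressure reading as the trigger, the valve command as the response, and a bound of 20.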

New Levels of Fault-Tolerance for RT Programs

Soft Fault-Tolerance

A program is soft fault-tolerant if it is NOT required to satisfy its timing constraints in the presence of faults. However, it should continue to meet its timing constraints in the absence of faults.

Hard Fault-Tolerance

A program is hard fault-tolerant if it is required to satisfy its timing constraints even in the presence of faults.

Levels of Fault-Tolerance (cont'd)

[Table: the levels Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, and Hard-masking against the columns Safety, Liveness, Timing constraints, and Bounded-time recovery.]

Problem Statement

[Figure: a synthesis algorithm takes a program P, a safety specification, a set of faults f, and the desired level of fault-tolerance (soft/hard-failsafe, nonmasking, or soft/hard-masking) and produces a fault-tolerant program P′.]

Problem Statement

Given a program P, its invariant S, a set of faults F, and a safety specification SPEC, identify a program P′ with invariant S′ such that

S′ ⊆ S, P′|S′ ⊆ P|S′, and P′ is F-tolerant to SPEC from S′.
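
For intuition, the problem statement can be sketched for the simplest case: failsafe tolerance in an untimed, finite-state setting. The Python code below is a simplified illustration loosely following the idea behind the failsafe construction cited as [2]. It is not the real-time algorithm of the talk. Modeling the safety specification as a set of bad transitions, and the names `add_failsafe` and `bad_transitions`, are assumptions made for the example.

```python
def add_failsafe(program, invariant, faults, bad_transitions):
    """Untimed failsafe revision sketch.

    program, faults, bad_transitions: sets of (source, target) pairs,
    where bad_transitions are the transitions that violate safety.
    Returns (revised_program, revised_invariant).
    """
    # ms: states from which faults alone can lead to a safety violation.
    ms = {src for src, dst in faults if (src, dst) in bad_transitions}
    changed = True
    while changed:
        changed = False
        for src, dst in faults:
            if dst in ms and src not in ms:
                ms.add(src)
                changed = True
    # Drop program transitions that violate safety or enter ms.
    kept = {(s, d) for s, d in program
            if (s, d) not in bad_transitions and d not in ms}
    # Shrink the invariant until every state has a successor inside it.
    inv = set(invariant) - ms
    while True:
        inside = {(s, d) for s, d in kept if s in inv and d in inv}
        live = {s for s, _ in inside}
        if live == inv:
            break
        inv = live
    # From invariant states, keep only transitions that stay inside.
    revised = {(s, d) for s, d in kept if s not in inv or d in inv}
    return revised, inv


# Hypothetical example: a fault from state "c" violates safety directly.
S = {"a", "b", "c"}
P = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "a")}
F = {("c", "x")}
BAD = {("c", "x")}
P_revised, S_revised = add_failsafe(P, S, F, BAD)
```

In this example the revised invariant stays inside the original one and no transitions are added, matching the first two constraints. An empty revised invariant would indicate that this sketch found no failsafe revision.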

Current Results

[Table: the levels Soft-failsafe, Hard-failsafe, Nonmasking, Soft-masking, and Hard-masking against the columns Safety, Liveness, Timing constraints, Bounded-recovery, and Complexity. The complexity entries read "Polynomial space sound and complete algorithm" and "Open problem".]

[2] S. Kulkarni, A. Arora. Automating the Addition of Fault-Tolerance. FTRTFT, 2000.

Example: Altitude Switch

[Figure: timed automaton of the Altitude Switch with the labels Init, Standby, Await, LowAlt, Reset, SensorFail, and Actuator Power on; guards and resets over the clocks x, y, z, and t; the Boolean variables Power and CorruptSensor; and the fault-span region.]

The Altitude Switch is required to:

1) HFS (hard-failsafe): Tolerate the delay fault while in Standby mode

((Standby ∧ CorruptSensor) → ◊≤2 (Init))

2) HFS (hard-failsafe): Not power on the actuators while it fails to read the altitude sensors

3) SMK (soft-masking): Recover within 1 s after occurrence of faults

Ongoing Research

Implementation of the algorithms

Coping with the state explosion problem: synthesis using zone automata; parallelizing state space

Coping with NP-hardness results: heuristics; decision procedures (Yices)

Fault-tolerance in hybrid systems

[8] http://www.cse.msu.edu/~ebnenasi/research/tools/ftsyn.htm

Conclusion

We study the problem of automated revision of programs inside their state space:

Adding properties to programs

Adding fault-tolerance

Questions
