Pontus Boström and Marina Waldén Åbo Akademi University/ TUCS
description
Transcript of Pontus Boström and Marina Waldén Åbo Akademi University/ TUCS
Pontus Boström and Marina Waldén Åbo Akademi University/ TUCS
Development of Fault Tolerant Grid
Applications Using Distributed B
Motivation
•Grids have become widespread in organizations• Handle large amount of information• Manage computational resources
•Difficult to implement “correct” Grid applications • Formal methods useful in order to ensure correctness of
specifications• Can be difficult to implement
•The specification language should take into account the features of the underlying platform
•Fault tolerance also important for correctness due to the nature of the grid environment
Grids
•Used for large-scale distributed systems• Scientific computing, e.g., in Physics and engineering• Business applications
•Share information and computational resources over organizational boundaries
•Loosely coupled systems• SOA
•Client – Server architecture
Grid services
•Services in a grid environment that can be accessed by clients
•Similar to remote objects in CORBA and RMI• Remote procedures used for communication
HostClient
Client
Grid service
Grid service
Grid Services
•Based on Web Services• XML• SOAP• WSDL
•Extends Web services with• Potentially transient services containing state• Service data• Notifications
•Globus Toolkit middleware
Client Grid service
Remote procedure call
Notification
The architecture of a grid application
OS OS
Globus T. Globus T.
Client Grid service
RPC
Notification
Host 1 Host 2
Faults
• Remote procedure calls can fail in four different ways1. The server grid service instance has crashed before the
call2. The network connection fails when calling a remote
procedure3. The server instance fails during the call4. The network connection fails when returning the result
• A notification can fail to arrive for two reasons1. The sending grid service crashed before sending the
notification2. The network connection fails during sending
• The client crashes when using a server grid service• The server grid service becomes an orphan
Fault tolerance using GT
•No support for advanced fault tolerance mechanisms such as replication or check-pointing
•Exception is raised in the caller when a call to a remote procedure fails
• Not easy to know what caused the exception to be raised
•That a notifications is lost can be discovered with timers in the client
•The most difficult error to handle is removal of orphan grid service instances
Orphan control
•Need to remove orphan grid service instances, since they waste resources
•New remote procedure isAlive
•Timer in both client and server grid service
Orphan control
•Need to remove orphan grid service instances, since they waste resources
•New remote procedure isAlive
•Timer in both client and server grid service
•An exception is raised in the client when a call to isAlive fails
•The server grid service is deleted when a timeout occurs in it
Event B
•Extension of the B Method•Developed by J. R. Abrial•Based on Action Systems by Back and Kurki-Suonio•Related To B Action Systems
SYSTEM CVARIABLES xINVARIANT Inv_CINITIALISATION x := x0EVENTS C_Evt1 = ANY u WHERE G1(u,x) THEN S1 END; C_Evt2 = SELECT G2 THEN S2 END;END
Formal development of Grid applications
•We like to have a formal method suitable for developing fault tolerant grid applications•Difficult to create implementable specifications of grid applications in Event B
• No grid communication mechanisms such as remote procedures and notifications
• No fault tolerance mechanisms• Difficult to implement due to synchronization issues and
atomicity of events
•We need to extend Event B with constructs for • Specifying grid services• Remote procedure calls and notifications• Fault tolerance
•Extensions should be introduced in a manner that simplifies implementation
Distributed B
•Provides two new types of B machines• GRIDSERVICE• GRID_REFINEMENT
•Take into account grid specific features• Remote procedures• Notifications• Timeouts due to lost notifications• Exceptions due to failed calls to isAlive
•Enables us to prove properties about the entire system
•Are translated to ordinary B for verification• New constructs get their semantics from the translation• Automatic generation of proof obligations
•Enable automatic or semi-automatic translation of the specification to a programming language
Grid service machine
•Abstract specification of a grid service
•A grid service machine is a template that clients obtain instances of
• Compare to Classes in OO
•Remote procedures• Ordinary B procedures called from
a client
•Events• Executed independently of a client
•Notifications• Sent when all events have
become disabled
Proc(p)
(J1 J2) Q
J2T2
J1T1
Grid service
Remote procedures:
Events:
Notifications:
Grid refinement machine (1)
•A client that uses grid service machine instances
•Refines GRIDSERVICE, ordinary SYSTEM or REFINEMENT
•Clause for enabling dynamic management of grid service machine instances
• Instances are used as variables• When a failed instance is discovered it is marked as no
longer in use and deleted from the application
•Clause for refining remote procedures
•Clause for refining events
Grid refinement machine (2)
•Special substitution used in events for making remote procedures calls
• Enables the exceptions for failed calls to be handled
•Special events that consists of two parts for handling notifications
• First part enabled when a notification has been sent from a grid service
• Second part enabled when a timeout occurs• Executed once for each notification/timeout
•Special event for handling failed calls to isAlive• Enabled for each grid service instance in use.• Non-deterministically models failures of instances
The behaviour of grid components
Proc(p)
(J1 J2) Q
G2S2
NotifHandler
J2T2G3S3
J1T1
Grid serviceGrid refinement
G1S1
Remote procedures:
Events:
Notifications:
Notification handlers:
Events:
Grid service machine
GRIDSERVICE A
VARIABLES yINVARIANT Inv_AINITIALISATION y := y0
REMOTE_PROCEDURES
Proc(p) = PRE P(p) THEN TEND
EVENTS
A_Evt1 = ANY u WHERE J1(u,y) THEN T1 END;
A_Evt2 = SELECT J2 THEN T2
END;
NOTIFICATIONS Notif =
GUARANTEES Q END
END
Grid refinement machine (1)
GRID_REFINEMENT C2REFINES C1
REFERENCES AVARIABLES z,x,a_inst
INVARIANT a_inst:A & Inv_C’
INITIALISATION x := x0 || z:=z0 || a_inst::A
EVENTS C_Evt1 =
SELECT G1’ THEN
CALL a_inst.Proc(x) EXCEPTION SE
END || S1’ END;
C_Evt2 = SELECT G2’ THEN S2’ END
Grid refinement machine (2)
NOTIFICATION_HANDLERS NotifHandler = NOTIFICATION Notif SOURCE v:A THEN S3 TIMEOUT ST END
IS_ALIVE_HANDLERS IAHandler = SOURCE v:A
THEN SIA
ENDEND
Conclusions
•Enables construction of correct fault tolerant grid applications
•Automatic generation of proof obligations
•Implementable architecture by construction
•These Event B extensions can also use other middleware for distributed systems