Availability and reliability
Uploaded by sommerville-videos
Transcript of Availability and reliability
Availability and reliability, 2013 Slide 1
Availability and Reliability
Principal dependability properties
• Reliability
– The probability of failure-free system operation over a specified time in a given environment for a given purpose.
• Availability
– The probability that a system, at a point in time, will be operational and able to deliver the requested services.
Availability specification
• Both reliability and availability attributes can be expressed as numbers:
– Availability of 0.999 means that the system is up and running for 99.9% of the time.
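To make the 0.999 figure concrete, the sketch below converts an availability value into expected downtime. The function name and the choice of a 365-day year are assumptions made for illustration, not part of the slides.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8760 hours


def expected_downtime_hours(availability: float,
                            period_hours: float = HOURS_PER_YEAR) -> float:
    """Expected hours out of service over a period, for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


# Availability of 0.999 leaves 0.1% of the year as downtime.
print(f"{expected_downtime_hours(0.999):.2f} hours per year")  # 8.76 hours per year
```

Under these assumptions, "three nines" still allows almost nine hours of downtime a year.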
Reliability specification
• Probability of failure on demand (POFOD) of 0.0001 means that, on average, 1 in 10,000 demands for service from a system will fail in some way.
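The arithmetic behind this figure can be sketched as follows. The function name is hypothetical, and the calculation assumes each demand fails independently with the same probability.

```python
def expected_failed_demands(pofod: float, demands: int) -> float:
    """Average number of failed demands, assuming independent demands with a fixed POFOD."""
    if not 0.0 <= pofod <= 1.0:
        raise ValueError("pofod must be between 0 and 1")
    if demands < 0:
        raise ValueError("demands must be non-negative")
    return pofod * demands


# With POFOD = 0.0001, about one demand in 10,000 is expected to fail.
print(round(expected_failed_demands(0.0001, 10_000), 6))  # 1.0
```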
Availability and reliability
• Availability and reliability are closely related.
– Obviously, if a system is unavailable it is not delivering the specified system services.
• However, it is possible to have systems with low reliability that must be available.
– So long as system failures can be repaired quickly and do not damage data, some system failures may not be a problem.
• Availability is therefore best considered as a separate attribute reflecting whether or not the system can deliver its services.
• Availability takes repair time into account, if the system has to be taken out of service to repair faults.
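The slides give no formula for the effect of repair time. One common steady-state formulation, assumed here, is availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair. The sketch below uses it to compare a system that fails often but recovers quickly with one that fails rarely but is slow to repair. All names and figures are illustrative.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time in service, assuming availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Low reliability: fails every 10 hours on average, but is repaired in about 36 seconds.
frequent_but_quick = steady_state_availability(mttf_hours=10, mttr_hours=0.01)

# Higher reliability: fails every 1000 hours on average, but takes 10 hours to repair.
rare_but_slow = steady_state_availability(mttf_hours=1000, mttr_hours=10)

print(frequent_but_quick > rare_but_slow)  # True
```

Under this assumed model, the less reliable system is the more available one, because its repairs are fast.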
Availability perception
• Availability is usually expressed as a percentage of the time that the system is available to deliver services e.g. 99.9%.
Subjective availability
• The number of users affected by the service outage.
– Loss of service in the middle of the night is less important for many systems than loss of service during peak usage periods.
• The length of the outage.
– The longer the outage, the more the disruption. Several short outages are less likely to be disruptive than one long outage. Long repair times are a particular problem.
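As a rough illustration of the first factor, the sketch below weights each outage by the number of users it affects. "User-minutes lost" is a hypothetical measure introduced only for this example, and it does not capture the second factor, that one long outage is more disruptive than several short ones.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Outage:
    minutes: float
    users_affected: int  # hypothetical count of users active during the outage


def user_minutes_lost(outages: Iterable[Outage]) -> float:
    """Sum of outage length times users affected (an illustrative measure only)."""
    return sum(o.minutes * o.users_affected for o in outages)


# Two outages of the same length, so the same percentage availability.
night = [Outage(minutes=30, users_affected=20)]
peak = [Outage(minutes=30, users_affected=5000)]

print(user_minutes_lost(night), user_minutes_lost(peak))  # 600 150000
```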
Reliability metrics
• Probability of failure on demand (POFOD)
– Probability that a system will not deliver a service correctly when requested
– Used for systems where demands are infrequent and intermittent
• Rate of occurrence of failure (ROCOF)
– Number of system failures in a given time period
– Used for transaction processing systems with frequent and regular transactions
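Both metrics can be estimated from observed counts. The sketch below is one simple way to do so; the function names and sample figures are assumptions for illustration, not values from the slides.

```python
def estimate_pofod(failed_demands: int, total_demands: int) -> float:
    """Estimate POFOD as the fraction of observed demands that failed."""
    if total_demands <= 0:
        raise ValueError("total_demands must be positive")
    if not 0 <= failed_demands <= total_demands:
        raise ValueError("failed_demands must be between 0 and total_demands")
    return failed_demands / total_demands


def estimate_rocof(failures: int, operating_hours: float) -> float:
    """Estimate ROCOF as failures per hour of operation."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return failures / operating_hours


# Infrequent demands, e.g. a protection system: 1 failure in 10,000 demands.
print(estimate_pofod(1, 10_000))   # 0.0001

# Frequent, regular transactions: 2 failures in 1,000 hours of operation.
print(estimate_rocof(2, 1_000.0))  # 0.002
```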
• Fault
– A characteristic of a software system that can lead to a system error.
• Error
– An erroneous system state that can lead to system behaviour that is unexpected by system users.
• Failure
– An event that occurs at some point in time when the system does not deliver a service as expected by its users.
Faults-errors-failures
[Figure: Fault → Error → Failure]
Faults and failures
• Failures are usually a result of system errors.
• The incorrect state causes undesirable system behaviour.
• The incorrect state is a consequence of executing faulty code.
• However, faults do not necessarily result in system errors.
– The erroneous system state resulting from the fault may be transient and ‘corrected’ before an error arises.
– The faulty code may never be executed.
• Errors do not necessarily lead to system failures.
– The error can be corrected by built-in error detection and recovery.
– The failure can be protected against by built-in protection facilities. These may, for example, protect system resources from system errors.
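The chain from fault to error to failure, and the ways it can be broken, can be sketched with a small example. The `Thermostat` class below is entirely hypothetical and contains a deliberately seeded fault; it is not taken from the slides.

```python
class Thermostat:
    """Hypothetical heating controller with a deliberately seeded fault."""

    SAFE_DEFAULT_C = 20.0

    def __init__(self) -> None:
        self.target_c = self.SAFE_DEFAULT_C

    def set_target_fahrenheit(self, temp_f: float) -> None:
        # Seeded FAULT: the Fahrenheit-to-Celsius conversion is missing,
        # so the stored state is wrong (an ERROR) whenever this code runs.
        self.target_c = temp_f

    def heater_on(self, room_c: float) -> bool:
        # Built-in error detection and recovery: an implausible target is
        # replaced by a safe default before it can affect the service.
        if not 5.0 <= self.target_c <= 30.0:
            self.target_c = self.SAFE_DEFAULT_C
        return room_c < self.target_c


# Case 1: the faulty code is never executed, so no error arises.
untouched = Thermostat()
print(untouched.heater_on(room_c=18.0))  # True

# Case 2: the fault runs and corrupts the state (68 is stored as Celsius),
# but detection and recovery stop the error from becoming a failure.
faulty = Thermostat()
faulty.set_target_fahrenheit(68.0)
print(faulty.heater_on(room_c=25.0))     # False
```

Without the range check in `heater_on`, the second case would switch the heater on in a 25 °C room, which is a failure as seen by the user.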
Reliability achievement
• Fault avoidance
– Development techniques are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults.
• Fault detection and removal
– Verification and validation techniques are used that increase the probability of detecting and correcting errors before the system goes into service.
• Fault tolerance
– Run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures.
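The slides do not name a specific run-time technique. As one possible illustration, the sketch below tries a list of alternative routines in turn and returns the first result that passes an acceptance test. The function `call_with_fallback` and the two sample routines are hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(alternatives: Sequence[Callable[[], T]],
                       acceptable: Callable[[T], bool]) -> T:
    """Return the first result that passes the acceptance test.

    An alternative that raises, or whose result is rejected, is skipped.
    """
    last_error: Optional[Exception] = None
    for alternative in alternatives:
        try:
            result = alternative()
        except Exception as exc:  # the fault showed up as an exception
            last_error = exc
            continue
        if acceptable(result):    # the erroneous result is caught here
            return result
    raise RuntimeError("all alternatives failed") from last_error


def primary() -> float:
    raise ZeroDivisionError("seeded fault in the primary routine")


def backup() -> float:
    return 4.0


print(call_with_fallback([primary, backup], acceptable=lambda r: r >= 0))  # 4.0
```

The caller still receives a valid result even though the primary routine is faulty, so the fault does not lead to a failure.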
Summary
• Availability is the probability that a system will be available when a service request is made.
• Reliability is the probability that a system will deliver a service as expected by users.
• Software faults lead to state errors, which in turn lead to operational failures.
• Fault avoidance, fault detection and fault tolerance are strategies for achieving reliability.