CSSE 477 – More on Availability & Reliability

1

CSSE 477 – More on

Availability & Reliability

Steve Chenoweth

Thursday, 9/22/11

Week 3, Day 3

Right – High availability with VMWare – the major goal is to eliminate all single points of failure. When the an ESX host goes down (the red X), the virtual machines running on the failed host migrate over to the other, healthy ESX hosts that have spare capacity. After migration, the virtual machines are restarted automatically, without intervention from the administrator. From http://www.vmracks.com/it-hosting/esx-high-availability.html.

2

Today• 5 Minute talks on availability project

• Tactics for software availability engineering…– A bit more, mostly from Musa’s book

• Tonight:– Project 2, final part– HW 3 (individual)

John Musa (1933-2009), the inventor of “software reliability engineering.”

3

Software failures

• Important to discuss with customers– Need to know what variations in system behavior

are tolerable

• Requirements provide a positive specification– Defining failures gives you a

“negative specification”• What the system must not do

– Adds another dimension to communication with the user

Below – A windmill goes south.

4

How to classify severity of software failures

• Based on cost:

(these are 1999 $ from Musa’s book, so you need to set your own scale)

Severity class Definition ($)

1 > 100,000

2 10,000 – 100,000

3 1000 – 10,000

4 < 1000

5

How to classify severity of software failures, cntd

• Based on operational impact:

Severity class

Definition

1 Unavailability to users of one or more key operations

2 Unavailability to users of one or more important operations

3 Unavailability to users of one or more operations but workarounds available

4 Minor deficiencies in one or more operations

6

How to classify severity of software failures, cntd

• Need to make such a table product-specific.• Suppose the product is “Fone Follower”:

Failure severity class

Failure definition

1 Failure that prevents calls from being forwarded

2 Failure that prevents entry of phone numbers to which calls will be forwarded

3 Failure that makes system administration more difficult, but possible through alternate means. Like GUI doesn’t work for some feature, but can use text I/F.

4 Failure that causes minor inconvenience – like screens don’t show current date.

7

Need to set “failure intensity objectives” for each release

• Need to define a global way of measuring the “intensity” – like failures per hour of operation that will be tolerated.

• Next goal is to convert these to global measures related to code, like failures per million lines of code run, for the critical subsystems.

• Heuristic: For a given release, the product of these factors tends to be roughly constant, related to the amount of new functionality added:1. Failure intensity

2. Development time

3. Development cost

8

Failure intensity vs reliability

• Generally,

= - ln R

t• Where R = reliability, = failure intensity, and t = number of natural

time units.• E.g., if reliability is 0.992 for 8 hours, the failure intensity is one

failure per 1000 hours.• Note that Musa’s definition of “reliability” is slightly more

sophisticated than usual: It’s the probability of execution without failure for a specified time interval (like hours).

• In contrast, usually we talk of the average time to failure, and so skip over the probability part (it’s 50%).

9

For a new system…• We don’t already have a track record of failure

intensities– So it’s tougher to judge how much time to spend

trying to get it right, for the “next release” – that’s release 1.0 !

• What to do?– Find operational data for similar systems

• Or, how reliable are the underlying systems you’ll use?

– Consider vendor warranties – what will we promise? (Or what do others promise?)

– Get experts to estimate, based on prior work

10

Availability Tactics

• Once you know what to prevent, you develop to counteract those problems:

• Try the 3 Strategies from Bass Ch 5:– Fault detection

– Fault recovery

– Fault prevention

• Musa’s variation on this list is:– Fault prevention

– Fault removal

– Fault tolerance

Opening rounds

Later

11

Musa’s development strategies

• Engineer the right balance among these reliability strategies– Determine where to focus them, to maximize

the likelihood of meeting the objectives in an economical way

– Components you buy and integrate are a problem – you have less control over those!

• The best you can do may be to test with them as thoroughly as possible, give feedback to their vendors

12

Musa’s strategies, cntd

• Testers should be in on setting the objectives and deciding the system architecture

• Fault prevention is done by:– Good development processes,

especially:• Having sound underlying methodologies

• Doing reviews – like to requirements & design

• Enforcing standards

• Using design tools that keep faults from being introduced

13


• How’s fault removal done?– Primarily by code reviews and testing

• You can measure the effectiveness of the code reviews by how many were caught vs how many remained to be caught in testing

• Measure the effectiveness of testing by the number of faults found by that testing, vs those found after (like by the customer!)

14


• How to achieve fault-tolerance?– Needs design:

• Anticipate what deviations are likely to occur and will lead to failures

• Implement “robust” software to counteract them– Like handling unexpected input from users or from other

systems

– How to minimize performance degradation, data corruption, or undesirable outputs

• Use hardware to help (see slide 1!)

– Can measure effectiveness by the reduction in failure intensity that results

15


• In large organizations:– Important to measure the effectiveness of

each of these availability tactics– Leads to sound decisions about what to

spend money on – “best strategies”

16

Software Safety

• A topic related to reliability

• Means freedom from mishaps– Mishap = loss of human life, injury, or property

damage– Most software failures just cause user

dissatisfaction

• Software safety is dependent on realistic testing– Need operational profiles to focus testing

17

Software vs hardware

• Hardware reliability is affected by aging and wear.

• Software – not so much.– Execution time affects reliability

• What is its “duty cycle” (in hours)? And • What are its failures per execution hour?

– To get ultrareliable functions, we need a test duration of several times the reciprocal of the failure intensity objective

• E.g., to get a failure intensity of 10-9 failures per hour, you’d need to test for several times 109 hours.

18

Musa on fault detection

• A less precise idea than software failure

• Absolute concept – an entity you can define without reference to software failures– Like a bad disk read

• Operational concept – an entity that exists only with reference to the fact that failures will occur if that state is executed– The fault is the defect causing the

failure

19

Absolute faults

• Makes it necessary to postulate a “perfect” program to which an actual program can be compared.

• Then a fault is incorrect code, in comparison– It’s defective, missing, or an extra instruction

– The defective program can be compared with the perfect program.

• Of course, in reality, there could be many “perfect” programs.– You don’t know about the bad code till you are trying to

fix it.

20

Operational faults

• The “fault” has some reality of its own, as an implementation defect.

• But really, a piece of bad code could be responsible for no failures, or lots of failures, etc.

• And “software failures” may not be in the code, per se.

• So this definition of “fault” has issues, too.

21

A realistic definition

• Musa ends up saying, like Bass, that a software fault is incorrect code that leads to actual or potential failures.

• Which leaves open the question of “What code should be included in the fault?”– Requires judgment– Would changing a few related instructions

prevent additional failures from occurring?

22

Availability in summary• Strategic thing is to work to improve it

– Definitely don’t ignore it on any real product

– “Hope is a city on de Nile”.

• Lots more to explore– Plenty of systems out there, where people will pay to have them

improve on this

– Like all the quality attributes, this could be a career activity

• A good additional course in this area – Dr. Radu’s ECE 497 – Design of fault tolerant digital systems

Right - A software cruise down “de Nile”

CSSE 477 – More on Availability & Reliability

Documents

Transcript of CSSE 477 – More on Availability & Reliability