Software Reliability CSCI 5801: Software Engineering.

• What you know after testing:
  – The software passes all cases in the test suite

• What the customer wants to know:
  – Is the code well written in general?
  – How often will it fail?
  – What has to happen for it to fail?
  – What happens when it fails?

Larger Context of Reliability

• Fault detection (testing and validation)
  – Detect faults before the system is put into operation

• Fault avoidance
  – Build systems with the objective of creating fault-free software

• Fault tolerance
  – Build systems that continue to operate when faults occur

Code Reviews

• Examining code without running it
  – Removes the dependency on test cases

• Methodology: look for typical flaws

• Best done by others who have a different point of view
  – Code walkthroughs done by other programmers
  – Pair programming in XP
  – Static analysis tools

• Goal: Detect flaws before they become faults (fault avoidance)

Code Walkthroughs

• Going through code by hand, statement by statement
  – 90 to 125 statements/hour on average

• Team with ~4 members, with specific roles:
  – Moderator: runs the session and ensures it proceeds smoothly
  – Code author
  – Inspectors (at least 2)
  – Scribe: writes down results/suggestions

• Estimated to find 60% to 90% of code errors

Code Walkthroughs

• Preparation
  – Developer provides colleagues with code listing and documentation
  – Participants study the documentation in advance

• Meeting
  – Developer leads reviewers through the code, describing what each section does and encouraging questions
  – Inspectors look for possible flaws and suggest improvements

Code Walkthroughs

• Example checklist:
  – Data faults: Initialization, constants, array bounds, character strings
  – Control faults: Conditions, loop termination, compound statements, case statements
  – Input/output faults: All inputs used; all outputs assigned a value
  – Interface faults: Parameter numbers, types, and order; structures and shared memory
  – Storage management faults: Modification of links, allocation and de-allocation of memory
  – Exceptions: Possible errors, error handlers

Static Analysis Tools

• Scan source code for possible faults and anomalies
  – Lint for C programs
  – PMD for Java

• Examples:
  – Control flow: Loops with multiple exit or entry points
  – Data use: Undeclared or uninitialized variables, unused variables, multiple assignments, array bounds
  – Interface faults: Parameter mismatches, unused function results, uncalled procedures
  – Storage management: Unassigned pointers, pointer arithmetic

Good programming practice eliminates all warnings from source code
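
To make the "Data use" category concrete, here is a minimal sketch of one such check: flagging variables that are assigned but never read. It uses Python's standard `ast` module on a made-up sample function. It is a toy that ignores scoping, not a description of how Lint or PMD work internally.

```python
import ast

def unused_variables(source: str) -> list:
    """Return names that are assigned somewhere but never read (scopes ignored)."""
    tree = ast.parse(source)
    assigned, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)   # name appears as an assignment target
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)       # name is read somewhere
    return sorted(assigned - used)

# Hypothetical code under analysis; it is parsed, never executed.
SAMPLE = """
def total(prices):
    count = 0
    result = 0
    for p in prices:
        result += p
    return result
"""

print(unused_variables(SAMPLE))
```

The sample is only inspected as text, which is the point of static analysis: the anomaly is found without any test case.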

PMD Example

Static Analysis Tools

• Cross-reference table: Shows every use of a variable, procedure, object, etc.

• Information flow analysis: Identifies input variables on which an output depends.

• Path analysis: Identifies all possible paths through the program.

Software Reliability

• Definition: Probability that the system will not fail during a certain period of time in a certain environment
  – Measured in failures/CPU hour, etc.

• Questions:
  – How much more testing is needed to reach the required reliability?
  – What is the expected reliability gain from further testing?
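
As a small worked example of the definition, the sketch below turns a failure rate into a probability of failure-free operation. It assumes a constant failure rate (exponential model), which the slides do not state, and the numbers are hypothetical.

```python
import math

def reliability(failure_rate: float, hours: float) -> float:
    """Probability of no failure during `hours`, assuming a constant failure rate."""
    return math.exp(-failure_rate * hours)

# Hypothetical figures: 0.002 failures per CPU hour, 100 CPU hours of operation.
print(round(reliability(0.002, 100.0), 3))
```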


Statistical Testing

• Testing software for reliability rather than fault detection
  – Measuring the number of errors/transaction allows the reliability of the software to be predicted

• Key problem: Software will never be 100% reliable!
  – An acceptable level of reliability should be specified in the RSD, and the software tested and modified until that level of reliability is reached
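
The measurement described above can be sketched as follows: count failed transactions in a test run and compare the observed rate with the level required by the RSD. The function names and all figures are hypothetical.

```python
def failures_per_transaction(outcomes):
    """outcomes: iterable of booleans, True meaning the transaction failed."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one transaction")
    return sum(outcomes) / len(outcomes)

def meets_requirement(outcomes, max_failure_rate):
    """True if the observed failure rate is within the required level."""
    return failures_per_transaction(outcomes) <= max_failure_rate

# Hypothetical test run: 3 failures in 2,000 transactions.
run = [True] * 3 + [False] * 1997
print(failures_per_transaction(run))
print(meets_requirement(run, max_failure_rate=0.002))
```

If the requirement is not met, the software is modified and the run repeated, as the slide describes.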

Reliability Prediction

• Reliability growth model
  – Mathematical model of how system reliability is predicted to change over time as faults are found and removed
  – Extrapolated from current data about failures

• Can be used to determine whether the system meets reliability requirements
  – Mean time to failure
  – Average failures per transaction

• Can be used to predict when testing will be completed and what level of reliability is feasible
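
A rough illustration of the data a growth model is fitted to: the sketch below computes mean time to failure from a log of failure times and compares the early and late halves of testing. The failure times are invented for illustration.

```python
def interfailure_times(failure_times):
    """Gaps between successive failures, given cumulative failure times in hours."""
    times = [0.0] + list(failure_times)
    return [later - earlier for earlier, later in zip(times, times[1:])]

def mttf(gaps):
    """Mean time to failure over a list of inter-failure gaps."""
    return sum(gaps) / len(gaps)

# Hypothetical cumulative test hours at which each failure was observed.
observed = [2, 5, 9, 15, 24, 36, 52, 75]
gaps = interfailure_times(observed)
half = len(gaps) // 2
early, late = mttf(gaps[:half]), mttf(gaps[half:])
print(early, late)
```

A late MTTF well above the early one suggests reliability is growing as faults are removed. A growth model extrapolates that trend.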

Operational Profile

• Problem: Statistical testing requires a large number of test cases for statistical significance (thousands)

• Where do such test cases come from?
  – Often too many to create by hand
  – Random generation is not sufficient

Operational Profile

• Operational profile: Set of test data whose frequency matches the actual frequency of these inputs from ‘normal’ usage of the system
  – A close match with actual usage is necessary, or the measured reliability will not be reflected in the actual usage of the system

• Can be generated from real data collected from an existing system or (more often) from assumptions made about the pattern of usage of the system
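
One way to generate thousands of test cases that follow an operational profile is weighted random sampling over input classes. The sketch below uses `random.choices` from the Python standard library. The input classes and their frequencies are assumptions made up for a course registration system.

```python
import random
from collections import Counter

# Hypothetical operational profile: input class -> share of normal usage.
PROFILE = {
    "view_schedule": 0.60,
    "add_course": 0.25,
    "drop_course": 0.10,
    "change_grading_option": 0.05,
}

def generate_test_inputs(profile, n, seed=0):
    """Draw n input classes with frequencies matching the profile."""
    rng = random.Random(seed)          # seeded so the test set is reproducible
    classes = list(profile)
    weights = [profile[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n)

cases = generate_test_inputs(PROFILE, 10_000)
counts = Counter(cases)
for name in PROFILE:
    print(name, counts[name] / len(cases))
```

Each sampled class would still need a concrete input and an expected result. The sampling only decides how often each class is exercised.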

Example Operational Profile

[Figure: number of inputs plotted against input classes]

Note that some types of inputs are much more likely than others

LPM Estimates

• Logarithmic Poisson execution time model (LPM)
  – Major bugs found quickly
  – Those major bugs cause most failures
  – Effectiveness of fault correction decreases over time
  – There exists a point at which further testing has little gain
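
The slides do not give the LPM equations. A commonly cited form is the Musa-Okumoto model, in which the failure intensity after execution time tau is lambda0 / (lambda0 * theta * tau + 1). The sketch below uses that form, with hypothetical values for lambda0 and theta, to show the diminishing return from further testing.

```python
import math

def expected_failures(tau, lambda0, theta):
    """Expected cumulative failures after tau units of execution time."""
    return math.log(lambda0 * theta * tau + 1.0) / theta

def failure_intensity(tau, lambda0, theta):
    """Failures per unit of execution time after tau units of testing."""
    return lambda0 / (lambda0 * theta * tau + 1.0)

def time_to_reach(target_intensity, lambda0, theta):
    """Execution time needed to bring the failure intensity down to the target."""
    return (1.0 / target_intensity - 1.0 / lambda0) / theta

LAMBDA0 = 10.0   # initial failures per CPU hour (hypothetical)
THETA = 0.02     # intensity decay per failure experienced (hypothetical)

for tau in (0, 10, 100, 1000):
    print(tau, round(failure_intensity(tau, LAMBDA0, THETA), 3))
print(time_to_reach(1.0, LAMBDA0, THETA))
```

Each tenfold increase in test time buys a smaller drop in intensity, which matches the last bullet above.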


Reliability Prediction

[Figure: reliability plotted against time, showing measured reliability points, the fitted reliability model curve, the required reliability level, and the estimated time of reliability achievement]

Reliability Measurement Problems

• Operational profile uncertainty
  – The operational profile may not be an accurate reflection of the real use of the system

• High costs of test data generation
  – Costs can be very high if the test data for the system cannot be generated automatically

• Statistical uncertainty
  – You need a statistically significant number of failures to compute the reliability, but highly reliable systems will rarely fail

Stress Testing

• Goal of stress testing: Determine what it will take to “break” the system
  – “Break” = no longer meets requirements in some way
    – Functional: fails to perform required functions
    – Reliability: fails more often than specified
    – Performance: slower than required

• Approaches:
  – Increase load/decrease resources until the system breaks
  – Perform “attacks” designed to produce an undesirable result

Stress Testing

• Increase load on the system in different ways
  – Number of students simultaneously adding courses
  – Size of files/databases that must be read
  – …

• Decrease resources available to the system (may require fault injection software)
  – Increase the number of other processes running on the system
  – Increase the lag time of networked resources

• Goal: the point at which the system fails should be well beyond the scenarios listed in the RSD
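
The "increase load until the system breaks" approach can be sketched as a simple ramp. Here the system under test is replaced by a toy latency model, and the capacity, latency limit, and required load are all hypothetical. A real stress test would drive the actual system instead.

```python
def simulated_latency(concurrent_users, capacity=400):
    """Toy stand-in for the system: latency (s) grows sharply near capacity."""
    if concurrent_users >= capacity:
        return float("inf")            # system stops responding
    return 0.2 / (1.0 - concurrent_users / capacity)

def find_breaking_load(latency_fn, max_latency, start=50, step=50, limit=10_000):
    """Raise the load step by step until the latency requirement is violated."""
    load = start
    while load <= limit:
        if latency_fn(load) > max_latency:
            return load                # first load at which the system "breaks"
        load += step
    return None                        # did not break within the tested range

REQUIRED_USERS = 100                   # hypothetical figure from the RSD
breaking = find_breaking_load(simulated_latency, max_latency=2.0)
print(breaking, breaking > REQUIRED_USERS)
```

The comparison in the last line reflects the goal above: the breaking point should sit well beyond the load the RSD requires.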

Stress Testing

• “Attack” testing common in security
  – Goal of normal testing: input for specific test case → System → desired response for specific test case
  – Goal of secure programming: any input → System → does not produce undesirable result

Stress Testing

• Based on risk analysis from design stage:
  – Can the roster database be deleted?
  – Can an intruder read files (in violation of FERPA)?
  – Can a student add a course but not be added to the roster?
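
An attack test checks that no input produces an undesirable result, rather than checking one expected output. The sketch below throws hostile inputs at a toy in-memory registration model and verifies two of the risks above: the roster still exists, and no student holds a course without being on its roster. The `Registrar` class and the input list are hypothetical.

```python
import random

class Registrar:
    """Toy in-memory model of course registration (hypothetical)."""
    def __init__(self, courses):
        self.rosters = {c: set() for c in courses}
        self.schedules = {}

    def add_course(self, student, course):
        if not isinstance(student, str) or not student.strip():
            return False                       # reject malformed student IDs
        if course not in self.rosters:
            return False                       # reject unknown courses
        self.rosters[course].add(student)
        self.schedules.setdefault(student, set()).add(course)
        return True

def consistent(reg):
    """Every course on a student's schedule lists that student on its roster."""
    return all(student in reg.rosters.get(course, ())
               for student, courses in reg.schedules.items()
               for course in courses)

def attack(reg, trials=1000, seed=0):
    """Return True if no hostile input crashes the system or breaks consistency."""
    rng = random.Random(seed)
    hostile = ["", " ", None, 42, "CSCI 5801", "'; DROP TABLE roster; --", "x" * 10_000]
    for _ in range(trials):
        student = rng.choice(hostile + ["alice", "bob"])
        course = rng.choice(hostile)
        try:
            reg.add_course(student, course)
        except Exception:
            return False                       # a crash counts as a failure
        if not consistent(reg):
            return False
    return True

reg = Registrar(["CSCI 5801"])
print(attack(reg), set(reg.rosters) == {"CSCI 5801"})
```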

Fault Tolerance

• Goals:
  – System continues to operate when problems occur
  – System avoids critical failures (data loss, etc.)

• Problems can occur from many sources
  – Anticipated at design stage
  – Unanticipated (hardware faults, etc.)

• Cannot prevent all failures!

Fault Tolerance

• Usually based on the idea of “backward recovery”
  – Record system state at specific events (checkpoints). After failure, recreate state at last checkpoint.
  – Combine checkpoints with a system log (audit trail of transactions) that allows transactions from the last checkpoint to be repeated automatically.

• Note that backward recovery software must also be thoroughly tested!
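
Backward recovery with checkpoints and a transaction log can be sketched in a few lines. This is a minimal in-memory illustration with hypothetical names. A real system would keep the checkpoint and log on durable storage so they survive the failure.

```python
import copy

class RecoverableStore:
    """Key-value state with a checkpoint plus a log of later transactions."""
    def __init__(self):
        self.state = {}
        self._checkpoint = {}
        self._log = []

    def apply(self, key, value):
        self._log.append((key, value))   # log the transaction first
        self.state[key] = value          # then update the live state

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()                # log now covers only later transactions

    def recover(self):
        """Recreate state at the last checkpoint, then replay the log."""
        self.state = copy.deepcopy(self._checkpoint)
        for key, value in self._log:
            self.state[key] = value

store = RecoverableStore()
store.apply("alice", ["CSCI 5801"])
store.checkpoint()
store.apply("bob", ["CSCI 5801"])
store.state = {}                         # simulate a failure that loses live state
store.recover()
print(store.state)
```

Replaying the log restores the transaction made after the checkpoint, so no work is lost.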
