1 Note content copyright © 2004 Ian Sommerville. NU-specific content copyright © 2004 M. E. Kabay. All rights reserved.
Critical Systems Development
IS301 – Software Engineering
Lecture #27 – 2004-11-03
M. E. Kabay, PhD, CISSP
Assoc. Prof. Information Assurance
Division of Business & Management, Norwich University
mailto:[email protected] V: 802.479.7937
First, take a deep breath. You are about to enter the fire-hose zone.
Objectives
To explain how fault tolerance and fault avoidance contribute to the development of dependable systems
To describe characteristics of dependable software processes
To introduce programming techniques for fault avoidance
To describe fault tolerance mechanisms and their use of diversity and redundancy
Topics covered
Dependable processes
Dependable programming
Fault tolerance
Fault tolerant architectures
Dependable Software Development
Programming techniques for building dependable software systems.
Software Dependability
In general, software customers expect all software to be dependable
For non-critical applications, they may be willing to accept some system failures
Some applications have very high dependability requirements
  Special programming techniques required
Dependability Achievement
Fault avoidance
  Software developed so human error is avoided and system faults are minimized
  Development process organized so faults in software are detected and repaired before delivery to customer
Fault tolerance
  Software designed so faults in delivered software do not result in system failure
Diversity and Redundancy
Redundancy
  Keep more than one version of a critical component available so that if one fails, a backup is available.
Diversity
  Provide the same functionality in different ways so that they will not fail in the same way.
However, adding diversity and redundancy adds complexity, and this can increase the chances of error.
Some engineers advocate simplicity and extensive verification & validation (V&V) as a more effective route to software dependability.
Diversity and Redundancy Examples
Redundancy
  Where availability is critical (e.g., in e-commerce systems), companies normally keep backup servers and switch to these automatically if failure occurs.
Diversity
  To provide resilience against external attacks, different servers may be implemented using different operating systems (e.g., Windows and Linux).
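The automatic switch to a backup server can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the servers are modeled as plain callables, and the names `call_with_failover`, `AllServersDown`, `primary_down` and `backup_up` are hypothetical, not part of any real product or of the lecture.

```python
from typing import Callable, Sequence


class AllServersDown(Exception):
    """Raised when neither the primary nor any backup server answers."""


def call_with_failover(servers: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant server in order and return the first successful reply."""
    for server in servers:
        try:
            return server(request)
        except ConnectionError:
            # This server has failed: switch automatically to the next one.
            continue
    raise AllServersDown(f"no server could handle {request!r}")


# Hypothetical stand-ins for a failed primary and a healthy backup.
def primary_down(request: str) -> str:
    raise ConnectionError("primary is down")


def backup_up(request: str) -> str:
    return f"backup handled {request}"
```

Calling `call_with_failover([primary_down, backup_up], "GET /cart")` returns the backup's reply. Redundancy alone does not give diversity: if both servers ran identical software, the same fault could take both down.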
Fault Minimization
Current methods of software engineering allow for production of fault-free software
Fault-free software means software that conforms to its specification
Does NOT mean software that will always perform correctly
Why not?
  Because of specification errors.
Cost of Producing Fault-Free Software (1)
Very high
Cost-effective only in exceptional situations
  Which?
May be cheaper to accept software faults
But who will bear costs?
  Users?
  Manufacturers?
  Both?
Will the risk-sharing be with full knowledge?
Cost of Producing Fault-Free Software (2)
The Pareto Principle
[Figure: Total % of Errors Fixed plotted against Costs; marked values 20%, 80%, 100%]
If the curve really is asymptotic to 100%, cost may approach infinity.
Cost of Producing Fault-Free Software (3)
[Figure: Cost per error detected plotted against Number of residual errors (Many, Few, Very few)]
Just a different way of looking at it.
Validation activities
Requirements inspections.
Requirements management.
Model checking.
Design and code inspection.
Static analysis.
Test planning and management.
Configuration management, discussed in Chapter 29, is also essential.
Safe Programming
Faults in programs are usually a consequence of programmers making mistakes.
These mistakes occur because people lose track of the relationships among program variables.
Some programming constructs are more error-prone than others so avoiding their use reduces programmer mistakes.
Fault-Free Software Development
Needs precise (preferably formal) specification
Requires organizational commitment to quality
Information hiding and encapsulation in software design essential
Use programming language with strict typing and run-time checking
Avoid error-prone constructs
Use dependable and repeatable development process
Structured Programming
First discussed in 1970s
Programming without goto
While loops and if statements as only control statements
Top-down design
Important because it promoted thought and discussion about programming
Programs easier to read and understand than old spaghetti code
Error-Prone Constructs (1)
Floating-point numbers
  Inherently imprecise – and machine-dependent
  Imprecision may lead to invalid comparisons
Pointers
  Pointers referring to wrong memory areas can corrupt data
  Aliasing can make programs difficult to understand and change
Dynamic memory allocation
  Run-time allocation can cause memory overflow
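The point about invalid floating-point comparisons can be shown with a minimal Python sketch. Python is used only for illustration, and the helper name `nearly_equal` and its tolerance values are assumptions chosen for this example.

```python
import math


def nearly_equal(a: float, b: float, rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Compare two floats within a tolerance instead of testing exact equality."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


total = 0.1 + 0.2

# Exact comparison fails: 0.1, 0.2 and 0.3 have no exact binary representation.
exact_match = (total == 0.3)

# Tolerant comparison treats the two values as equal.
tolerant_match = nearly_equal(total, 0.3)
```

Here `exact_match` is `False` and `tolerant_match` is `True`. A suitable tolerance depends on the application, so the defaults above should not be taken as a general recommendation.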
Error-Prone Constructs (2)
Parallelism
  Can result in subtle timing errors (race conditions) because of unforeseen interaction between parallel processes
Recursion
  Errors in recursion can cause memory overflow
Interrupts
  Interrupts can cause critical operation to be terminated and make program difficult to understand
  Similar to goto statements
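One common way to use parallelism "with great care" is to protect shared state with a lock so that a read-modify-write sequence cannot interleave with another thread's. The sketch below is one possible illustration in Python; `SafeCounter` and `run_workers` are hypothetical names invented for it.

```python
import threading


class SafeCounter:
    """Counter whose read-modify-write update is protected by a lock."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads could read the same old value
        # and one of the two updates would be lost (a race condition).
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def run_workers(counter: SafeCounter, workers: int, increments: int) -> int:
    """Start `workers` threads that each increment the counter `increments` times."""

    def work() -> None:
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

With the lock in place, `run_workers(SafeCounter(), 4, 1000)` returns 4000 on every run. Without it, the total could occasionally come out lower, which is exactly the kind of subtle timing error that is hard to reproduce in testing.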
Error-Prone Constructs (3)
Inheritance
  Code not localized
  Can result in unexpected behavior when changes made
  Can be hard to understand
  Difficult to debug problems
None of these constructs has to be eliminated absolutely
  But all must be used with great care
Reliable Software Processes
Well-defined, repeatable software process:
  Reduces software faults
  Does not depend entirely on individual skills – can be enacted by different people
Process activities should include significant verification and validation
Process Validation Activities
Requirements inspections
Requirements management
Model checking
Design and code inspection
Static analysis
Test planning and management
Configuration management also essential
Fault Tolerance
Critical software systems must be fault tolerant
  System can continue operating in spite of software failure
Fault tolerance required where
  Availability requirements are high or
  System failure costs are very high
Even “fault-free” systems need fault tolerance
  There may be specification errors or
  Validation may be incorrect
Fault Tolerance Actions
Fault detection
  Incorrect system state has occurred
Damage assessment
  Identify parts of system state affected by fault
Fault recovery
  Return to known safe state
Fault repair
  Prevent recurrence of fault
  Identify underlying problem
  If not transient*, then fix errors of design, implementation, documentation or training that led to error
* E.g., hardware failure
Approaches to Fault Tolerance
Defensive programming
  Programmers assume faults in code
  Check state after modifications to ensure consistency
Fault-tolerant architectures
  HW & SW system architectures support redundancy and fault tolerance
  Controller detects problems and supports fault recovery
Complementary rather than opposing techniques
Fault Detection (1)
Strictly-typed languages
  E.g., Java and Ada
  Many errors trapped at compile-time
Some classes of error can only be discovered at run-time
Fault detection:
  Detecting erroneous system state
  Throwing exception to manage detected fault
Fault Detection (2)
Preventative fault detection
  Check conditions before making changes
  If bad state detected, don’t make change
Retrospective fault detection
  Check validity after system state has been changed
  Used when
    Incorrect sequence of correct actions leads to erroneous state or
    Preventative fault detection involves too much overhead
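The two styles of fault detection can be contrasted in a short Python sketch. The bank-account setting and the names `Account`, `InvalidWithdrawal` and `CorruptState` are assumptions made up for this illustration; they do not come from the lecture.

```python
class InvalidWithdrawal(Exception):
    """Raised by the preventative check, before any state is changed."""


class CorruptState(Exception):
    """Raised by the retrospective check, after the state has been changed."""


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance

    def withdraw(self, amount: int) -> None:
        # Preventative detection: check conditions first and refuse the change.
        if amount <= 0 or amount > self.balance:
            raise InvalidWithdrawal(f"cannot withdraw {amount} from {self.balance}")
        self.balance -= amount
        # Retrospective detection: check validity after the state has changed.
        self.check_invariant()

    def check_invariant(self) -> None:
        """Throw an exception if the account is in an erroneous state."""
        if self.balance < 0:
            raise CorruptState(f"negative balance {self.balance}")
```

The preventative check leaves the balance untouched when it rejects a request. The retrospective check is the safety net for faults the precondition cannot see, such as a sequence of individually valid actions that still produces a bad state.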
Damage Assessment
Analyze system state
Judge extent of corruption caused by system failure
Assess what parts of state space have been affected by failure
Generally based on ‘validity functions’
  Can be applied to state elements
  Assess whether their values are within allowed range
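A validity function can be as simple as a range check per state element. The sketch below is one possible shape for such checks in Python; the state element names and the allowed ranges are invented for illustration and carry no engineering meaning.

```python
from typing import Callable, Dict, List

# Hypothetical state elements of a controller, each with its own validity function.
VALIDITY_FUNCTIONS: Dict[str, Callable[[float], bool]] = {
    "temperature_c": lambda v: -40.0 <= v <= 125.0,
    "pressure_kpa": lambda v: 0.0 <= v <= 500.0,
    "valve_open_fraction": lambda v: 0.0 <= v <= 1.0,
}


def assess_damage(state: Dict[str, float]) -> List[str]:
    """Return the names of state elements that are missing or outside their allowed range."""
    damaged = []
    for name, is_valid in VALIDITY_FUNCTIONS.items():
        if name not in state or not is_valid(state[name]):
            damaged.append(name)
    return damaged
```

An empty result means no damage was found by these checks; a non-empty result tells the recovery step which parts of the state need repair or restoration.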
Damage Assessment Techniques
Checksums
  Used for damage assessment in data transmission
  Verify integrity after transmission
Redundant pointers
  Check integrity of data structures
  E.g., databases
Watch-dog timers
  Check for non-terminating processes
  If no response after certain time, there’s a problem
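The checksum technique can be sketched with the CRC-32 function in Python's standard `zlib` module. The framing format (payload followed by a 4-byte checksum) and the function names `frame` and `verify` are assumptions for this example, not a real protocol.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload before transmission."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(received: bytes) -> bool:
    """Recompute the checksum after transmission and compare it with the one received."""
    if len(received) < 4:
        return False  # too short to contain a checksum at all
    payload, checksum = received[:-4], received[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == checksum
```

A framed message passes `verify` unchanged, while flipping a single bit of it makes `verify` return `False`. A checksum of this kind only detects damage; repairing it requires the extra redundancy used in forward recovery.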
Fault Recovery
Forward recovery
  Apply repairs to corrupted system state
  Domain knowledge required to compute possible state corrections
  Forward recovery usually application specific
Backward recovery
  Restore system state to known safe state
  Simpler than forward recovery
  Details of safe state are maintained and replace corrupted system state
Forward Recovery
Data communications
  Add redundancy to coded data
  Use to repair data corrupted during transmission
Redundant pointers
  E.g., doubly-linked lists
  Damaged list / file may be repaired if enough links are still valid
  Often used for database and file system repair
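The doubly-linked list case can be sketched as follows. This is a simplified illustration that assumes the backward (`prev`) chain is intact and only forward (`next`) pointers are damaged; the names `Node`, `build`, `repair_forward_links` and `to_list` are hypothetical.

```python
from typing import List, Optional


class Node:
    def __init__(self, value: int) -> None:
        self.value = value
        self.next: Optional["Node"] = None
        self.prev: Optional["Node"] = None


def build(values: List[int]) -> List[Node]:
    """Build a doubly-linked list and return its nodes in order."""
    nodes = [Node(v) for v in values]
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left
    return nodes


def repair_forward_links(tail: Node) -> Node:
    """Rebuild every `next` pointer from the redundant `prev` pointers.

    Walks backward from the tail and returns the head of the repaired list.
    """
    node = tail
    while node.prev is not None:
        node.prev.next = node
        node = node.prev
    return node


def to_list(head: Node) -> List[int]:
    """Follow `next` pointers from the head and collect the values."""
    values = []
    node: Optional[Node] = head
    while node is not None:
        values.append(node.value)
        node = node.next
    return values
```

If one `next` pointer in a four-node list is overwritten with `None`, a forward traversal stops early; after `repair_forward_links` is called on the tail, the traversal again yields all four values. The repair works only because each link is stored twice.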
Backward Recovery
Transaction processing often uses conservative methods to avoid problems
Complete computations, then apply changes
Keep original data in buffers
Periodic checkpoints allow system to 'roll-back' to correct state
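Checkpoint and roll-back can be sketched in Python by copying the state before an update and restoring the copy if the update fails. This is a minimal in-memory illustration, not how a database implements transactions; `run_transaction` and its parameters are names assumed for the example.

```python
import copy
from typing import Any, Callable, Dict


def run_transaction(
    state: Dict[str, Any],
    update: Callable[[Dict[str, Any]], None],
    is_valid: Callable[[Dict[str, Any]], bool],
) -> bool:
    """Apply `update` to `state`; roll back to the checkpoint if anything goes wrong."""
    checkpoint = copy.deepcopy(state)  # keep the original data
    try:
        update(state)
        if not is_valid(state):
            raise ValueError("state failed validation")
        return True
    except Exception:
        # Backward recovery: replace the corrupted state with the known safe state.
        state.clear()
        state.update(checkpoint)
        return False
```

The function returns `True` when the update is kept and `False` when the state has been rolled back. Copying the whole state is the simplest possible checkpoint; for large states the cost of that copy is the overhead the slide's "periodic" checkpoints try to limit.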
Recovery Blocks (1)
[Figure: recovery blocks. Try Algorithm 1 and test for success with the acceptance test. If the acceptance test fails, retry with Algorithm 2, then Algorithm 3, retesting each time. Continue execution if the acceptance test succeeds; signal exception if all algorithms fail.]
Recovery Blocks (2)
Force different algorithm to be used for each version so they reduce probability of common errors
However, design of acceptance test difficult as it must be independent of computation used
Problems with approach for real-time systems because of sequential operation of redundant versions
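The recovery-block scheme can be sketched in Python using sorting as a stand-in task. Everything named here (`recovery_block`, `AllAlternatesFailed`, `faulty_sort`, `insertion_sort`, `is_sorted_permutation`) is hypothetical, and `faulty_sort` is deliberately wrong so that the acceptance test has something to reject.

```python
from collections import Counter
from typing import Callable, List, Sequence


class AllAlternatesFailed(Exception):
    """Signalled when no algorithm produces a result that passes the acceptance test."""


def recovery_block(
    alternates: Sequence[Callable[[List[int]], List[int]]],
    acceptance_test: Callable[[List[int], List[int]], bool],
    data: List[int],
) -> List[int]:
    """Run the alternates one after another until a result passes the acceptance test."""
    for algorithm in alternates:
        try:
            result = algorithm(list(data))  # each alternate gets a fresh copy of the input
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(data, result):
            return result
    raise AllAlternatesFailed("every algorithm failed the acceptance test")


def is_sorted_permutation(original: List[int], result: List[int]) -> bool:
    """Acceptance test that does not depend on how the result was computed."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


def faulty_sort(items: List[int]) -> List[int]:
    return sorted(items)[:-1]  # deliberately drops an element


def insertion_sort(items: List[int]) -> List[int]:
    out: List[int] = []
    for x in items:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out
```

`recovery_block([faulty_sort, insertion_sort], is_sorted_permutation, [3, 1, 2])` rejects the first result and returns `[1, 2, 3]` from the second algorithm. The alternates run strictly in sequence, which is the source of the real-time problem noted above: the worst-case time is the sum of all alternates plus their acceptance tests.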
Homework
Study Chapter 18 in detail using SQ3R
Required
  By next Wed 10 Nov 2004
  For 30 points
    20.1, 20.2, 20.4 – 20.6, 20.9 (@5) and pay attention to demands for examples
OPTIONAL
  By Wed 17 Nov 2004
  For up to 14 extra points, any or all of
    20.10 (@3), 20.11 (@3) – details please
    20.12 (@8) – detailed answers to all parts of this question
DISCUSSION