Socio-technical systems failure (LSCITS EngD 2012)
Human Failure, LSCITS, EngD course in Socio-technical Systems, 2012 Slide 1
Systems failure – a socio-technical perspective
Slide 2
Complex software systems
• Multi-purpose. Organisational systems that support different functions within an organisation
• System of systems. Usually distributed and normally constructed by integrating existing systems/components/services
• Unlimited. Not subject to limitations derived from the laws of physics (so, no natural constraints on their size)
• Data intensive. System data orders of magnitude larger than code; long-lifetime data
• Dynamic. Changing quickly in response to changes in the business environment
Slide 3
Systems of systems
• Operational independence
• Managerial independence
• Multiple stakeholder viewpoints
• Evolutionary development
• Emergent behaviour
• Geographic distribution
Slide 4
Complex system realities
• There is no definitive specification of what the system should ‘do’ and it is practically impossible to create such a specification
• The complexity of the system is such that it is not ‘understandable’ as a whole
• It is likely that, at all times, some parts of the system will not be fully operational
• Actors responsible for different parts of the system are likely to have conflicting goals
Slide 5
System failure
Slide 6
System dependability model
• System fault. A system characteristic that can (but need not) lead to a system error
• System error. An erroneous system state that can (but need not) lead to a system failure
• System failure. Externally-observed, unexpected and undesirable system behaviour
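The fault → error → failure chain above can be sketched in code. This is a hypothetical minimal example (not from the slides): a latent fault is always present, but it only produces an erroneous state when the faulty code is exercised, and that error only becomes a failure if it is observed externally.

```python
# Hypothetical illustration of the fault -> error -> failure chain.

def last_item(items):
    # FAULT: a latent off-by-one bug, always present in the code.
    # Correct code would use items[len(items) - 1].
    return items[len(items)]

# The fault need not lead to an error: if last_item is never called,
# the system state stays correct.

# ERROR: calling the faulty code puts the system into an erroneous
# state (here, a raised IndexError inside the system).
try:
    value = last_item([1, 2, 3])
    print(value)    # FAILURE: wrong behaviour observed externally
except IndexError:
    value = None    # the error is caught and masked here, so no
                    # externally visible failure occurs
```

The same fault can thus sit in a deployed system indefinitely without ever producing a failure, which is why the slides insist each step "can (but need not)" lead to the next.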
Slide 7
A hospital system
• A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit.
• It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system.
• Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available.
• In circumstances where the system reports that an incorrect number of beds are available, is this a failure?
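The scenario can be sketched as follows (a hypothetical model, with invented names; `known` plays the role of the variable B): the system's count lags behind reality, so it reports fewer empty beds than actually exist until the change is recorded.

```python
# Hypothetical sketch of the bed-reporting behaviour described above.
# 'known' is B, the number of empty beds known to the system; the
# actual number may be higher because the system's data lags reality.

class BedSystem:
    def __init__(self, actual_empty_beds):
        self.actual = actual_empty_beds
        self.known = actual_empty_beds   # B: beds known to the system

    def patient_discharged(self):
        # Reality changes immediately...
        self.actual += 1
        # ...but the system only learns of it when the discharge is
        # recorded, so meanwhile it under-reports the empty beds.

    def record_discharge(self):
        self.known = self.actual

    def report(self):
        return self.known

hospital = BedSystem(actual_empty_beds=3)
hospital.patient_discharged()
print(hospital.report())   # reports 3, though 4 beds are empty
```

Whether this under-report counts as a failure is exactly the question the slide poses: against the specification it is wrong, but for a user who only needs to know that at least one bed exists, it may not matter.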
Slide 8
What is failure?
• Technical, engineering view: a failure is ‘a deviation from a specification’.
• An oracle can examine a specification, observe a system’s behaviour and detect failures.
• Failure is an absolute - the system has either failed or it hasn’t
Slide 9
Bed management system
• No system users (0%) considered the system’s incorrect reporting of the number of available beds to be a failure.
• Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital.
• When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate.
• They used other methods to find out whether or not a bed was available for an incoming patient.
Slide 10
Failure is a judgement
• Specifications are a gross simplification of reality for complex systems.
• Users don’t read and don’t care about specifications
• Whether or not system behaviour should be considered to be a failure depends on the observer’s judgement
• This judgement depends on:
– The observer’s expectations
– The observer’s knowledge and experience
– The observer’s role
– The observer’s context or situation
– The observer’s authority
Slide 11
Failures are inevitable
• Technical reasons
– When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood
– Failures often can be considered to be failures in data rather than failures in behaviour
• Socio-technical reasons
– Changing contexts of use mean that the judgement on what constitutes a failure changes as the effectiveness of the system in supporting work changes
– Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’
Slide 12
Conflict inevitability
• Impossible to establish a set of requirements where stakeholder conflicts are all resolved
• Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders
• Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.
Slide 13
Normal failures
• ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and means that people have to spend more time on a task than necessary
• A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate or unexpected system behaviour
• This extra work constitutes the cost of recovery from system failure
Slide 14
The Swiss Cheese model
Slide 15
Failure trajectories
• Failures rarely have a single cause. Generally, they arise because several events occur simultaneously
– Loss of data in a critical system
• User mistypes command and instructs data to be deleted
• System does not check and ask for confirmation of destructive action
• No backup of data available
• A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure in the defensive layers in the system
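The data-loss example above can be sketched as a chain of defences that the trajectory must pass through. This is a hypothetical illustration (invented function, not any real system): data is lost only when the human trigger occurs and every defensive layer fails at the same time.

```python
# Hypothetical sketch: each defence can stop the trajectory; data is
# lost only if the trigger occurs AND every defence fails at once.

def delete_data(command_mistyped, confirmation_enabled, backup_exists):
    if not command_mistyped:
        return "no trigger"               # no initiating human action
    if confirmation_enabled:
        return "stopped by confirmation"  # first defence holds
    # the destructive action goes ahead
    if backup_exists:
        return "recovered from backup"    # second defence holds
    return "DATA LOST"                    # failure trajectory complete

print(delete_data(True, False, False))    # all defences breached
```

Restoring any single defence (the confirmation prompt or the backup) breaks the trajectory, which is why such failures are described as the coincidence of several events rather than a single cause.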
Slide 16
Vulnerabilities and defences
• Vulnerabilities
– Faults in the (socio-technical) system which, if triggered by a human or technical error, can lead to system failure
– e.g. missing check on input validity
• Defences
– System features that avoid, tolerate or recover from human error
– e.g. type checking that disallows allocation of incorrect types of value
• When an adverse event happens, the key question is not ‘whose fault was it’ but ‘why did the system defences fail?’
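The two example defences named above (an input-validity check and type checking) can be sketched in a few lines. This is a hypothetical illustration with an invented function name, reusing the bed-count domain from the earlier slides:

```python
# Hypothetical sketch of defences: validating an input before it can
# trigger a latent vulnerability (here, a nonsense bed count).

def set_empty_beds(value):
    # Defence 1: type check - disallow allocation of an incorrect
    # type of value (e.g. the string "ten" instead of an integer).
    if not isinstance(value, int):
        raise TypeError("bed count must be an integer")
    # Defence 2: validity check - a mistyped value like -5 would
    # otherwise silently corrupt the bed count.
    if value < 0:
        raise ValueError("bed count cannot be negative")
    return value

set_empty_beds(4)        # accepted
# set_empty_beds("ten")  would raise TypeError instead of corrupting state
# set_empty_beds(-5)     would raise ValueError instead of corrupting state
```

If a bad value nonetheless gets through, the slide's closing question applies: the useful diagnosis is not who typed it, but why these checks were missing or bypassed.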
Slide 17
Reason’s Swiss Cheese Model
Slide 18
Active failures
• Active failures
– Active failures are the unsafe acts committed by people who are in direct contact with the system, or failures in the system technology.
– Active failures have a direct and usually short-lived effect on the integrity of the defences.
• Latent conditions
– Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system and process misfit, alarm overload, inadequate maintenance, etc.
– Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.
Slide 19
Defensive layers
• Complex IT systems should have many defensive layers:
– some are engineered: alarms, physical barriers, automatic shutdowns;
– others rely on people: surgeons, anesthetists, pilots, control room operators;
– and others depend on procedures and administrative controls.
• In an ideal world, each defensive layer would be intact.
• In reality, they are more like slices of Swiss cheese, having many holes, although unlike in the cheese, these holes are continually opening, shutting, and shifting their location.
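Reason's image can be made quantitative with a toy simulation (hypothetical, not from the slides): treat each layer as having a "hole" with some probability on any given occasion, and count how often the holes in all layers line up at once.

```python
import random

# Hypothetical Monte Carlo sketch of the Swiss cheese idea: each
# defensive layer independently fails to stop the event ("has a hole")
# with some probability; an accident needs the holes in ALL layers to
# line up at the same moment.

def accident_probability(hole_probs, trials=100_000, seed=0):
    rng = random.Random(seed)
    accidents = sum(
        all(rng.random() < p for p in hole_probs)
        for _ in range(trials)
    )
    return accidents / trials

# Three layers, each leaky 10% of the time: accidents are roughly a
# thousand times rarer than any single layer's failure, which is why
# several imperfect layers still provide real protection.
print(accident_probability([0.1, 0.1, 0.1]))
```

The sketch assumes the layers fail independently; the slides' point about holes "opening, shutting, and shifting" is precisely that real layers are neither static nor independent, so the true alignment risk can be much higher than this idealised product.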
Slide 20
Dynamic vulnerabilities
• While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context where the system is used.
• For example
– vulnerabilities may be related to human actions whose performance is dependent on workload, state of mind, etc. An operator may be distracted and forget to check something
– vulnerabilities may depend on configuration: checks may depend on particular programs being up and running, so if program A is running in a system then a check may be made, but if program B is running, then the check is not made
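The configuration-dependent case can be sketched as follows (hypothetical code, with the "program A" / "program B" labels taken from the example above): the same input is checked in one configuration and silently accepted in the other.

```python
# Hypothetical sketch of a dynamic, configuration-dependent
# vulnerability: whether an input is checked depends on which
# program happens to be running.

def handle_request(value, running_program):
    if running_program == "A":
        # Program A validates its input before use: the check is made.
        if value < 0:
            raise ValueError("negative value rejected")
        return value
    # Program B performs no check: the vulnerability exists only
    # while this configuration is running.
    return value

try:
    handle_request(-1, "A")       # check made: bad input rejected
except ValueError:
    pass
result = handle_request(-1, "B")  # same input, no check: latent hazard
```

Testing against one configuration therefore says little about the other, which is what makes such vulnerabilities dynamic rather than fixed properties of the code.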
Slide 21
Recovering from failure
Slide 22
Coping with failure
• People are good at coping with unexpected situations when things go wrong.
– They can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things.
– People can prioritise and focus on the essence of a problem
Slide 23
Recovery strategies
• Local knowledge
– Who to call; who knows what; where things are
• Process reconfiguration
– Doing things in a different way from that defined in the ‘standard’ process
– Work-arounds, breaking the rules (safe violations)
• Redundancy and diversity
– Maintaining copies of information in different forms from that maintained in a software system
– Informal information annotation
– Using multiple communication channels
• Trust
– Relying on others to cope
Slide 24
Design for recovery
• Holistic systems engineering
– Software systems design has to be seen as part of a wider process of socio-technical systems engineering
• We cannot build ‘correct’ systems
– We must therefore design systems to allow the broader socio-technical systems to recognise, diagnose and recover from failures
• Extend current systems to support recovery
• Develop recovery support systems as an integral part of systems of systems
Slide 25
Recovery strategy
• Designing for recovery is a holistic approach to system design and not (just) the identification of ‘recovery requirements’
• Should support the natural ability of people and organisations to cope with problems
– Ensure that system design decisions do not increase the amount of recovery work required
– Make system design decisions that make it easier to recover from problems (i.e. reduce extra work required)
• Earlier recognition of problems
• Visibility to make hypotheses easier to formulate
• Flexibility to support recovery actions
Slide 26
Key points
• Failures are inevitable in complex systems because multiple stakeholders see these systems in different ways and because there is no single manager of these systems
• Failures are a judgement – they are not absolute – but depend on the system observer
• The Swiss cheese model is a failure model based on active failures (trigger events) and latent errors (system vulnerabilities).
• People have developed strategies for coping with failure and systems should not be designed to make coping more difficult.