Failure in Safety-Critical Systems: A HANDBOOK OF INCIDENT AND ACCIDENT REPORTING



  • Failure in Safety-Critical Systems:

    A HANDBOOK OF INCIDENT AND ACCIDENT

    REPORTING

    Chris Johnson

    Glasgow University Press


    Glasgow University Press,

    Publicity Services, No. 2 The Square, University of Glasgow, Glasgow, G12 8QQ, Scotland.

    Copyright © 2003 by C.W. Johnson. All rights reserved. No part of this manuscript may be reproduced in any form, by photostat, microform, retrieval system, or any other means without the prior written permission of the author.

    First printed October 2003.

    ISBN 0-85261-784-4.


  • Contents

    1 Abnormal Incidents 1
    1.1 The Hazards . . . . . . . . . . 2
    1.1.1 The Likelihood of Injury and Disease . . . . . . . . . . 6
    1.1.2 The Costs of Failure . . . . . . . . . . 8
    1.2 Social and Organisational Influences . . . . . . . . . . 9
    1.2.1 Normal Accidents? . . . . . . . . . . 9
    1.2.2 The Culture of Incident Reporting . . . . . . . . . . 10
    1.3 Summary . . . . . . . . . . 18

    2 Motivations for Incident Reporting 21
    2.1 The Strengths of Incident Reporting . . . . . . . . . . 21
    2.2 The Weaknesses of Incident Reporting . . . . . . . . . . 25
    2.3 Different Forms of Reporting Systems . . . . . . . . . . 28
    2.3.1 Open, Confidential or Anonymous? . . . . . . . . . . 28
    2.3.2 Scope and Level . . . . . . . . . . 39
    2.4 Summary . . . . . . . . . . 43

    3 Sources of Failure 45
    3.1 Regulatory Failures . . . . . . . . . . 46
    3.1.1 Incident Reporting to Inform Regulatory Intervention . . . . . . . . . . 46
    3.1.2 The Impact of Incidents on Regulatory Organisations . . . . . . . . . . 47
    3.2 Managerial Failures . . . . . . . . . . 48
    3.2.1 The Role of Management in Latent and Catalytic Failures . . . . . . . . . . 49
    3.2.2 Safety Management Systems . . . . . . . . . . 50
    3.3 Hardware Failures . . . . . . . . . . 50
    3.3.1 Acquisition and Maintenance Effects on Incident Reporting . . . . . . . . . . 52
    3.3.2 Source, Duration and Extent . . . . . . . . . . 53
    3.4 Software Failures . . . . . . . . . . 56
    3.4.1 Failure Throughout the Lifecycle . . . . . . . . . . 57
    3.4.2 Problems in Forensic Software Engineering . . . . . . . . . . 61
    3.5 Human Failures . . . . . . . . . . 63
    3.5.1 Individual Characteristics and Performance Shaping Factors . . . . . . . . . . 63
    3.5.2 Slips, Lapses and Mistakes . . . . . . . . . . 71
    3.6 Team Factors . . . . . . . . . . 73
    3.6.1 Common Ground and Group Communication . . . . . . . . . . 77
    3.6.2 Situation Awareness and Crew Resource Management . . . . . . . . . . 79
    3.7 Summary . . . . . . . . . . 82

    4 The Anatomy of Incident Reporting 85
    4.1 Different Roles . . . . . . . . . . 86
    4.1.1 Reporters . . . . . . . . . . 86
    4.1.2 Initial Receivers . . . . . . . . . . 89
    4.1.3 Incident Investigators . . . . . . . . . . 92
    4.1.4 Safety Managers . . . . . . . . . . 94
    4.1.5 Regulators . . . . . . . . . . 96
    4.2 Different Anatomies . . . . . . . . . . 100
    4.2.1 Simple Monitoring Architectures . . . . . . . . . . 100
    4.2.2 Regulated Monitoring Architectures . . . . . . . . . . 101
    4.2.3 Local Oversight Architectures . . . . . . . . . . 102
    4.2.4 Gatekeeper Architecture . . . . . . . . . . 103
    4.2.5 Devolved Architecture . . . . . . . . . . 104
    4.3 Summary . . . . . . . . . . 105

    5 Detection and Notification 107
    5.1 'Incident Starvation' and the Problems of Under-Reporting . . . . . . . . . . 108
    5.1.1 Reporting Bias . . . . . . . . . . 109
    5.1.2 Mandatory Reporting . . . . . . . . . . 111
    5.1.3 Special Initiatives . . . . . . . . . . 113
    5.2 Encouraging the Detection of Incidents . . . . . . . . . . 115
    5.2.1 Automated Detection . . . . . . . . . . 115
    5.2.2 Manual Detection . . . . . . . . . . 123
    5.3 Form Contents . . . . . . . . . . 131
    5.3.1 Sample Incident Reporting Forms . . . . . . . . . . 132
    5.3.2 Providing Information to the Respondents . . . . . . . . . . 134
    5.3.3 Eliciting Information from Respondents . . . . . . . . . . 138
    5.4 Summary . . . . . . . . . . 141

    6 Primary Response 143
    6.1 Safeguarding the System . . . . . . . . . . 147
    6.1.1 First, Do No Harm . . . . . . . . . . 147
    6.1.2 Incident and Emergency Management . . . . . . . . . . 151
    6.2 Acquiring Evidence . . . . . . . . . . 153
    6.2.1 Automated Logs and Physical Evidence . . . . . . . . . . 153
    6.2.2 Eye-Witness Statements . . . . . . . . . . 157
    6.3 Drafting A Preliminary Report . . . . . . . . . . 169
    6.3.1 Organisational and Managerial Barriers . . . . . . . . . . 169
    6.3.2 Technological Support . . . . . . . . . . 171
    6.3.3 Links to Subsequent Analysis . . . . . . . . . . 172
    6.4 Summary . . . . . . . . . . 175

    7 Secondary Investigation 177
    7.1 Gathering Evidence about Causation . . . . . . . . . . 180
    7.1.1 Framing an Investigation . . . . . . . . . . 180
    7.1.2 Commissioning Expert Witnesses . . . . . . . . . . 185
    7.1.3 Replaying Automated Logs . . . . . . . . . . 190
    7.2 Gathering Evidence about Consequences . . . . . . . . . . 196
    7.2.1 Tracing Immediate and Long-Term Effects . . . . . . . . . . 197
    7.2.2 Detecting Mitigating Factors . . . . . . . . . . 200
    7.2.3 Identifying Related Incidents . . . . . . . . . . 203
    7.3 Summary . . . . . . . . . . 206

    8 Computer-Based Simulation 209
    8.1 Why Bother with Reconstruction? . . . . . . . . . . 209
    8.1.1 Coordination . . . . . . . . . . 214
    8.1.2 Generalisation . . . . . . . . . . 216
    8.1.3 Resolving Ambiguity . . . . . . . . . . 218
    8.2 Types of Simulation . . . . . . . . . . 222
    8.2.1 Declarative Simulations . . . . . . . . . . 224
    8.2.2 Animated Simulations . . . . . . . . . . 230
    8.2.3 Subjunctive Simulations . . . . . . . . . . 238
    8.2.4 Hybrid Simulations . . . . . . . . . . 250
    8.3 Summary . . . . . . . . . . 253

    9 Modelling Notations 257

    9.1 Reconstruction Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

    9.1.1 Graphical Time Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258

    9.1.2 Fault Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

    9.1.3 Petri Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280

    9.1.4 Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

    9.2 Requirements for Reconstructive Modelling . . . . . . . . . . . . . . . . . . . . . . . 310

    9.2.1 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

    9.2.2 Expressiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

    9.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

    10 Causal Analysis 337

    10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

    10.1.1 Why Bother With Causal Analysis? . . . . . . . . . . . . . . . . . . . . . . . 338

    10.1.2 Potential Pitfalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

    10.1.3 Loss of the Mars Climate Orbiter & Polar Lander . . . . . . . . . . . . . . . 345

    10.2 Stage 1: Incident Modelling (Revisited) . . . . . . . . . . . . . . . . . . . . . . . . . 348

    10.2.1 Events and Causal Factor Charting . . . . . . . . . . . . . . . . . . . . . . . . 349

    10.2.2 Barrier Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

    10.2.3 Change Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368

    10.3 Stage 2: Causal Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

    10.3.1 Causal Factors Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391

    10.3.2 Cause and Contextual Summaries . . . . . . . . . . . . . . . . . . . . . . . . 402

    10.3.3 Tier Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409

    10.3.4 Non-Compliance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421

    10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427

    11 Alternative Causal Analysis Techniques 431

    11.1 Event-Based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432

    11.1.1 Multilinear Events Sequencing (MES) . . . . . . . . . . . . . . . . . . . . . . 432

    11.1.2 Sequentially Timed and Events Plotting (STEP) . . . . . . . . . . . . . . . . 441

    11.2 Check-List Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449

    11.2.1 Management Oversight and Risk Tree (MORT) . . . . . . . . . . . . . . . . . 449

    11.2.2 Prevention and Recovery Information System for Monitoring and Analysis (PRISMA) . . . . . . . . . . 463

    11.2.3 Tripod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473

    11.3 Mathematical Models of Causation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481

    11.3.1 Why-Because Analysis (WBA) . . . . . . . . . . . . . . . . . . . . . . . . . . 481

    11.3.2 Partition Models for Probabilistic Causation . . . . . . . . . . . . . . . . . . 495

    11.3.3 Bayesian Approaches to Probabilistic Causation . . . . . . . . . . . . . . . . 501

    11.4 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507

    11.4.1 Bottom-Up Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509

    11.4.2 Top-Down Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512

    11.4.3 Experiments into Domain Experts' Subjective Responses . . . . . . . . . . . 521

    11.4.4 Experimental Applications of Causal Analysis Techniques . . . . . . . . . . . 524

    11.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531


    12 Recommendations 535
    12.1 From Causal Findings to Recommendations . . . . . . . . . . 535
    12.1.1 Requirements for Causal Findings . . . . . . . . . . 537
    12.1.2 Scoping Recommendations . . . . . . . . . . 540
    12.1.3 Conflicting Recommendations . . . . . . . . . . 552
    12.2 Recommendation Techniques . . . . . . . . . . 562
    12.2.1 The 'Perfectability' Approach . . . . . . . . . . 565
    12.2.2 Heuristics . . . . . . . . . . 570
    12.2.3 Enumerations and Recommendation Matrices . . . . . . . . . . 573
    12.2.4 Generic Accident Prevention Models . . . . . . . . . . 584
    12.2.5 Risk Assessment Techniques . . . . . . . . . . 590
    12.3 Process Issues . . . . . . . . . . 599
    12.3.1 Documentation . . . . . . . . . . 599
    12.3.2 Validation . . . . . . . . . . 603
    12.3.3 Implementation . . . . . . . . . . 610
    12.3.4 Tracking . . . . . . . . . . 613
    12.4 Summary . . . . . . . . . . 617

    13 Feedback and the Presentation of Incident Reports 619
    13.1 The Challenges of Reporting Adverse Occurrences . . . . . . . . . . 619
    13.1.1 Different Reports for Different Incidents . . . . . . . . . . 620
    13.1.2 Different Reports for Different Audiences . . . . . . . . . . 623
    13.1.3 Confidentiality, Trust and the Media . . . . . . . . . . 626
    13.2 Guidelines for the Presentation of Incident Reports . . . . . . . . . . 635
    13.2.1 Reconstruction . . . . . . . . . . 635
    13.2.2 Analysis . . . . . . . . . . 649
    13.2.3 Recommendations . . . . . . . . . . 661
    13.3 Quality Assurance . . . . . . . . . . 673
    13.3.1 Verification . . . . . . . . . . 673
    13.3.2 Validation . . . . . . . . . . 691
    13.4 Electronic Presentation Techniques . . . . . . . . . . 698
    13.4.1 Limitations of Existing Approaches to Web-Based Reports . . . . . . . . . . 700
    13.4.2 Using Computer Simulations as an Interface to On-Line Accident Reports . . . . . . . . . . 702
    13.4.3 Using Time-lines as an Interface to Accident Reports . . . . . . . . . . 704
    13.5 Summary . . . . . . . . . . 707

    14 Dissemination 711
    14.1 Problems of Dissemination . . . . . . . . . . 711
    14.1.1 Number and Range of Reports Published . . . . . . . . . . 711
    14.1.2 Tight Deadlines and Limited Resources . . . . . . . . . . 713
    14.1.3 Reaching the Intended Readership . . . . . . . . . . 715
    14.2 From Manual to Electronic Dissemination . . . . . . . . . . 719
    14.2.1 Anecdotes, Internet Rumours and Broadcast Media . . . . . . . . . . 719
    14.2.2 Paper documents . . . . . . . . . . 723
    14.2.3 Fax and Telephone Notification . . . . . . . . . . 728
    14.3 Computer-Based Dissemination . . . . . . . . . . 729
    14.3.1 Infrastructure Issues . . . . . . . . . . 730
    14.3.2 Access Control . . . . . . . . . . 741
    14.3.3 Security and Encryption . . . . . . . . . . 742
    14.3.4 Accessibility . . . . . . . . . . 745
    14.4 Computer-Based Search and Retrieval . . . . . . . . . . 746
    14.4.1 Relational Data Bases . . . . . . . . . . 748
    14.4.2 Lexical Information Retrieval . . . . . . . . . . 769
    14.4.3 Case Based Retrieval . . . . . . . . . . 780
    14.5 Summary . . . . . . . . . . 790

    15 Monitoring 795
    15.1 Outcome Measures . . . . . . . . . . 811
    15.1.1 Direct Feedback: Incident and Reporting Rates . . . . . . . . . . 812
    15.1.2 Indirect Feedback: Training and Operations . . . . . . . . . . 817
    15.1.3 Feed-forward: Risk Assessment and Systems Development . . . . . . . . . . 820
    15.2 Process Measures . . . . . . . . . . 826
    15.2.1 Submission Rates and Reporting Costs . . . . . . . . . . 826
    15.2.2 Investigator Performance . . . . . . . . . . 829
    15.2.3 Intervention Measures . . . . . . . . . . 833
    15.3 Acceptance Measures . . . . . . . . . . 836
    15.3.1 Safety Culture and Safety Climate? . . . . . . . . . . 838
    15.3.2 Probity and Equity . . . . . . . . . . 843
    15.3.3 Financial Support . . . . . . . . . . 845
    15.4 Monitoring Techniques . . . . . . . . . . 848
    15.4.1 Public Hearings, Focus Groups, Working Parties and Standing Committees . . . . . . . . . . 848
    15.4.2 Incident Sampling . . . . . . . . . . 854
    15.4.3 Sentinel systems . . . . . . . . . . 859
    15.4.4 Observational Studies . . . . . . . . . . 863
    15.4.5 Statistical Analysis . . . . . . . . . . 867
    15.4.6 Electronic Visualisation . . . . . . . . . . 872
    15.4.7 Experimental Studies . . . . . . . . . . 883
    15.5 Summary . . . . . . . . . . 891

    16 Conclusions 895
    16.1 Human Problems . . . . . . . . . . 896
    16.1.1 Reporting Biases . . . . . . . . . . 897
    16.1.2 Blame . . . . . . . . . . 897
    16.1.3 Analytical Bias . . . . . . . . . . 898
    16.2 Technical Problems . . . . . . . . . . 900
    16.2.1 Poor Investigatory and Analytical Procedures . . . . . . . . . . 900
    16.2.2 Inadequate Risk Assessments . . . . . . . . . . 901
    16.2.3 Causation and the Problems of Counter-Factual Reasoning . . . . . . . . . . 902
    16.2.4 Classification Problems . . . . . . . . . . 904
    16.3 Managerial Problems . . . . . . . . . . 907
    16.3.1 Unrealistic Expectations . . . . . . . . . . 907
    16.3.2 Reliance on Reminders and Quick Fixes . . . . . . . . . . 908
    16.3.3 Flaws in the Systemic View of Failure . . . . . . . . . . 910
    16.4 Summary . . . . . . . . . . 912


  • List of Figures

    1.1 Components of Systems Failure . . . . . . . . . . 4
    1.2 Process of Systems Failure . . . . . . . . . . 5
    1.3 Normal and Abnormal States . . . . . . . . . . 16
    2.1 Federal Railroad Administration Safety Iceberg . . . . . . . . . . 23
    2.2 Heinrich Ratios for US Aviation (NTSB and ASRS Data) . . . . . . . . . . 37
    3.1 Levels of Reporting and Monitoring in Safety Critical Applications . . . . . . . . . . 45
    3.2 Costs Versus Maintenance Interval . . . . . . . . . . 53
    3.3 Failure Probability Distribution for Hardware Devices . . . . . . . . . . 54
    3.4 Cognitive Influences in Decision Making and Control . . . . . . . . . . 69
    3.5 Cognitive Influences on Group Decision Making and Control . . . . . . . . . . 76
    3.6 Influences on Group Performance . . . . . . . . . . 77
    4.1 A Simple Monitoring Architecture . . . . . . . . . . 100
    4.2 Regulated Monitoring Reporting System . . . . . . . . . . 101
    4.3 Local Oversight Reporting System . . . . . . . . . . 102
    4.4 Gatekeeper Reporting System . . . . . . . . . . 103
    4.5 Devolved Reporting System . . . . . . . . . . 105
    5.1 Generic Phases in Incident Reporting Systems . . . . . . . . . . 107
    5.2 Accident and Incident Rates for Rejected Takeoff Overruns . . . . . . . . . . 126
    5.3 CHIRP and ASRS Publications . . . . . . . . . . 127
    5.4 Web Interface to the CHSIB Incident Collection . . . . . . . . . . 129
    5.5 Incident Reporting Form for a UK Neonatal Intensive Care Unit [119] . . . . . . . . . . 132
    5.6 ASRS Reporting Form for Air Traffic Control Incidents (January 2000) . . . . . . . . . . 134
    5.7 The CIRS Reporting System [756] . . . . . . . . . . 135
    6.1 Generic Phases in Incident Reporting Systems . . . . . . . . . . 144
    6.2 US Army Preliminary Incident/Accident Checklist . . . . . . . . . . 148
    6.3 Interview Participation Diagram . . . . . . . . . . 162
    6.4 US Army Incident/Accident Reporting Procedures . . . . . . . . . . 172
    6.5 US Army Preliminary Incident/Accident Telephone Reports . . . . . . . . . . 173
    6.6 US Army Aviation and Missile Command Preliminary Incident Form . . . . . . . . . . 174
    8.1 Imagemap Overview of the Herald of Free Enterprise . . . . . . . . . . 225
    8.2 Imagemap Detail of the Herald of Free Enterprise . . . . . . . . . . 226
    8.3 QuicktimeVR Simulation of a Boeing 757 . . . . . . . . . . 229
    8.4 QuicktimeVR Simulation of Lukas Spreaders . . . . . . . . . . 229
    8.5 VRML Simulation of Building Site Incidents . . . . . . . . . . 233
    8.6 NTSB Simulation of the Bus Accident (HWY-99-M-H017) . . . . . . . . . . 234
    8.7 3 Dimensional Time-line Using DesktopVR . . . . . . . . . . 236
    8.8 Overview of Perspective Wall Using DesktopVR . . . . . . . . . . 237
    8.9 Detail of Perspective Wall Using DesktopVR . . . . . . . . . . 238



    8.10 Graphical Modelling Using Boeing's EASY5 Tool . . . . . . . . . . 240
    8.11 NTSB Simulated Crash Pulse Of School Bus and Truck Colliding . . . . . . . . . . 242
    8.12 NTSB Simulation of Motor Vehicle Accident, Wagner Oklahoma . . . . . . . . . . 243
    8.13 Biomechanical Models in NTSB Incident Simulations (1) . . . . . . . . . . 244
    8.14 Biomechanical Models in NTSB Incident Simulations (2) . . . . . . . . . . 245
    8.15 Multi-User Air Traffic Control (Datalink) Simulation . . . . . . . . . . 249
    8.16 EUROCONTROL Proposals for ATM Incident Simulation . . . . . . . . . . 251
    8.17 US National Crash Analysis Centre's Simulation of Ankle Injury in Automobile Accidents . . . . . . . . . . 252
    8.18 Integration of MIIU Plans, Models, Maps and Photographs . . . . . . . . . . 253
    8.19 NTSB Use of Simulations in Incident Reports . . . . . . . . . . 254

    9.1 Graphical Time-line Showing Initial Regulatory Background . . . . . . . . . . 258
    9.2 Graphical Time-line Showing Intermediate Regulatory Background . . . . . . . . . . 259
    9.3 Graphical Time-line Showing Immediate Regulatory Background . . . . . . . . . . 260
    9.4 Graphical Time-line of Events Surrounding the Allentown Explosion . . . . . . . . . . 261
    9.5 Graphical Time-line of the Allentown Explosion . . . . . . . . . . 263
    9.6 Two-Axis Time-line of the Allentown Explosion . . . . . . . . . . 264
    9.7 Fault tree components . . . . . . . . . . 265
    9.8 A Simple Fault Tree for Design . . . . . . . . . . 266
    9.9 Simplified Fault Tree Representing Part of the Allentown Incident . . . . . . . . . . 267
    9.10 Fault Tree Showing Events Leading to Allentown Explosion . . . . . . . . . . 269
    9.11 Using Inhibit Gates to Represent Alternative Scenarios . . . . . . . . . . 271
    9.12 Using House Events to Represent Alternative Scenarios . . . . . . . . . . 273
    9.13 Fault Tree Showing Post-Explosion Events . . . . . . . . . . 275
    9.14 Fault Tree Showing NTSB Conclusions about the Causes of the Explosion . . . . . . . . . . 276
    9.15 Fault Tree Showing Conclusions about Injuries and Loss of Life . . . . . . . . . . 279
    9.16 Fault Tree Showing Conclusions about Reliability of Excess Flow Valves . . . . . . . . . . 280
    9.17 Petri Net of Initial Events in the Allentown Incident . . . . . . . . . . 282
    9.18 A Petri Net With Multiple Tokens . . . . . . . . . . 284
    9.19 A Petri Net Showing Catalytic Transition . . . . . . . . . . 286
    9.20 A Petri Net Showing Conflict . . . . . . . . . . 287
    9.21 A Petri Net With An Inhibitor Avoiding Conflict . . . . . . . . . . 288
    9.22 A Sub-Net Showing Crew Interaction . . . . . . . . . . 290
    9.23 A Sub-Net Showing Alternative Reasons for the Foreman's Decision . . . . . . . . . . 292
    9.24 High-Level CAE Diagram for the Allentown Incident . . . . . . . . . . 307
    9.25 Representing Counter Arguments in a CAE Diagram (1) . . . . . . . . . . 308
    9.26 Representing Counter Arguments in a CAE Diagram (2) . . . . . . . . . . 309
    9.27 Representing Counter Arguments in a CAE Diagram (3) . . . . . . . . . . 310
    9.28 High-Level CAE Diagram Integrating Formal and Informal Material . . . . . . . . . . 311
    9.29 Extended CAE Diagram Integrating Formal and Informal Material (1) . . . . . . . . . . 312
    9.30 Extended CAE Diagram Integrating Formal and Informal Material (2) . . . . . . . . . . 312
    9.31 Subjective Responses to Modelling Notations . . . . . . . . . . 313
    9.32 Subjective Responses to Logic-Based Reconstruction: How Easy did you find it to understand the logic-based model? . . . . . . . . . . 315
    9.33 Qualitative Assessments Of CAE-Based Diagrams: How Easy Did You Find It to Understand the CAE Diagram? . . . . . . . . . . 316
    9.34 Qualitative Assessments of Hybrid Approach . . . . . . . . . . 317
    9.35 Allentown Fault Tree Showing Pre- and Post-Incident Events . . . . . . . . . . 320
    9.36 Cross-Referencing Problems in Incident Reports . . . . . . . . . . 322
    9.37 Using a Petri Net to Build a Coherent Model of Concurrent Events . . . . . . . . . . 323
    9.38 Lack of Evidence, Imprecise Timings and Time-lines . . . . . . . . . . 325
    9.39 Continuous changes and Time-lines . . . . . . . . . . 326
    9.40 Using Petri Nets to Represent Different Versions of Events . . . . . . . . . . 327


    9.41 Annotating Petri Nets to Resolve Apparent Contradictions . . . . . . . . . . 328
    9.42 Representing the Criticality of Distal Causes . . . . . . . . . . 330
    9.43 Representing the Impact of Proximal Causes . . . . . . . . . . 332
    9.44 Representing the Impact of Mitigating Factors . . . . . . . . . . 333
    9.45 Representing Impact in a Causal Analysis . . . . . . . . . . 334

    10.1 Overview of the Dept. of Energy's 'Core' Techniques . . . . . . . . . . 348
    10.2 Simplified Structure of an ECF Chart . . . . . . . . . . 349
    10.3 Components of ECF Chart . . . . . . . . . . 350
    10.4 High-Level ECF Chart for the Mars Climate Orbiter (MCO) . . . . . . . . . . 350
    10.5 Angular Momentum Desaturation Events Affect MCO Navigation . . . . . . . . . . 351
    10.6 High-Level ECF chart for the Mars Polar Lander (MPL) . . . . . . . . . . 352
    10.7 Premature MPL Engine Shut-Down and DS2 Battery Failure . . . . . . . . . . 353
    10.8 Integrating the Products of Barrier Analysis into ECF Charts . . . . . . . . . . 359
    10.9 Process Barriers Fail to Protect the Climate Orbiter . . . . . . . . . . 361
    10.10 Process Barriers Fail to Protect the Climate Orbiter (2) . . . . . . . . . . 362
    10.11 Process Barriers Fail to Protect the Climate Orbiter (3) . . . . . . . . . . 363
    10.12 Technological Barriers Fail to Protect the Climate Orbiter . . . . . . . . . . 365
    10.13 Technological Barriers Fail to Protect the Climate Orbiter (2) . . . . . . . . . . 368
    10.14 Integrating Change Analysis into an ECF Chart . . . . . . . . . . 374
    10.15 Representing Staffing Limitations within an ECF Chart . . . . . . . . . . 376
    10.16 Representing Risk Management Issues within an ECF Chart . . . . . . . . . . 378
    10.17 Representing Technological Issues within an ECF chart (1) . . . . . . . . . . 381
    10.18 Representing Technological Issues within an ECF chart (2) . . . . . . . . . . 382
    10.19 Using Change Analysis to Collate Contextual Conditions . . . . . . . . . . 385
    10.20 Integrating Development Issues into an ECF chart (1) . . . . . . . . . . 386
    10.21 Integrating Development Issues into an ECF chart (2) . . . . . . . . . . 387
    10.22 Integrating Review Issues into an ECF chart . . . . . . . . . . 390
    10.23 An ECF chart of the Deep Space 2 Mission Failure . . . . . . . . . . 393
    10.24 An ECF chart of the Polar Lander Mission Failure . . . . . . . . . . 395
    10.25 An ECF chart of the Climate Orbiter Mission Failure . . . . . . . . . . 399
    10.26 NASA Headquarters' Office of Space Science [569] . . . . . . . . . . 412
    10.27 JPL Space and Earth Sciences Programmes Directorate [569] . . . . . . . . . . 413

11.1 Abstract View of A Multilinear Events Sequence (MES) Diagram . . . 433
11.2 An Initial Multilinear Events Sequence (MES) Diagram . . . 437
11.3 A MES Flowchart showing Conditions in the Nanticoke Case Study . . . 438
11.4 A MES Flowchart showing Causation in the Nanticoke Case Study . . . 440
11.5 Causal Relationships in STEP Matrices . . . 442
11.6 STEP Matrix for the Nanticoke Case Study . . . 445
11.7 The Mini-MORT Diagram . . . 450
11.8 A Causal Tree of the Nanticoke Case Study . . . 464
11.9 The Eindhoven Classification Model [840] . . . 468
11.10 Classification Model for the Medical Domain [844] . . . 469
11.11 The Three Legs of Tripod . . . 474
11.12 Tripod-Beta Event Analysis of the Nanticoke Incident (1) . . . 477
11.13 Tripod-Beta Event Analysis of the Nanticoke Incident (2) . . . 479
11.14 Why-Because Graph Showing Halon Discharge . . . 483
11.15 Why-Because Graph for the Nanticoke Alarm . . . 484
11.16 Overview of the Why-Because Graph for the Nanticoke Incident . . . 486
11.17 Possible `Normative' Worlds for the Nanticoke Incident . . . 487
11.18 Bayesian Network Model for the Nanticoke Fuel Source . . . 505
11.19 Causal Tree from McElroy's Evaluation of PRIMA (1) . . . 526
11.20 Causal Tree from McElroy's Evaluation of PRIMA (2) . . . 528


13.1 Simplified Flowchart of Report Generation Based on [631] . . . 627
13.2 Data and Claims in the Navimar Case Study . . . 684
13.3 Qualification and Rebuttal in the Navimar Case Study . . . 685
13.4 More Complex Applications of Toulmin's Model . . . 687
13.5 Snowdon's Tool for Visualising Argument in Incident Reports (1) . . . 690
13.6 Snowdon's Tool for Visualising Argument in Incident Reports (2) . . . 691
13.7 MAIB On-Line Feedback Page . . . 697
13.8 MAIB Safety Digest (HTML Version) . . . 700
13.9 ATSB Incident Report (PDF Version) . . . 702
13.10 NTSB Incident Report (PDF Version) . . . 702
13.11 Douglas Melvin's Simulation Interface to Rail Incident Report (VRML Version) . . . 703
13.12 James Farrel's Simulation Interface to Aviation Incident Report (VRML Version) . . . 703
13.13 Peter Hamilton's Cross-Reference Visualisation (VRML Version) . . . 704

14.1 Perceived `Ease of Learning' in a Regional Fire Brigade . . . 736
14.2 The Heavy Rescue Vehicle Training Package . . . 737
14.3 Overview of the MAUDE Relations . . . 749
14.4 The MAUDE User Interface . . . 764
14.5 Precision and Recall . . . 777
14.6 Components of a Semantic Network . . . 782
14.7 Semantic Network for an Example MAUDE Case . . . 783
14.8 Using a Semantic Network to Model Stereotypes . . . 784
14.9 US Naval Research Laboratory's Conversational Decision Aids Environment . . . 786

15.1 Static Conventional Visualisation of SPAD Severity by Year . . . 874
15.2 Static `Column' Visualisation of SPAD Severity by Year . . . 875
15.3 Static `Radar' Visualisation of SPAD Severity by Year . . . 876
15.4 Computer-Based Visualisation of SPAD Data . . . 879
15.5 Signal Detail in SPAD Visualisation . . . 880
15.6 Dynamic Queries in the Visualisation of SPAD Incidents . . . 881
15.7 Eccentric Labelling in the Visualisation of SPAD Incidents . . . 882

  • Preface

Incident reporting systems have been proposed as means of preserving safety in many industries, including aviation [308], chemical production [162], marine transportation [387], military acquisition [287] and operations [806], nuclear power production [382], railways [664] and healthcare [105]. Unfortunately, the lack of training material or other forms of guidance can make it very difficult for engineers and managers to set up and maintain reporting systems. These problems have been exacerbated by a proliferation of small-scale local initiatives, for example within individual departments in UK hospitals. This, in turn, has made it very difficult to collate national statistics for incidents within a single industry.

There are, of course, exceptions to this. For example, the Aviation Safety Reporting System (ASRS) has established national reporting procedures throughout the US aviation industry. Similarly, the UK Health and Safety Executive have supported national initiatives to gather data on Reportable Injuries, Diseases and Dangerous Occurrences (RIDDOR). In contrast to the local schemes, these national systems face problems of scale. It can become difficult to search databases of 500,000 records to determine whether similar incidents have occurred in the past.

This book, therefore, addresses two needs. The first is to provide engineers and managers with a practical guide on how to set up and maintain an incident reporting system. The second is to provide guidance on how to cope with the problems of scale that affect successful local and national incident reporting systems.

In 1999, I was asked to help draft guidelines for incident reporting in air traffic control throughout Europe. The problems of drafting these guidelines led directly to this book. I am, therefore, grateful to Gilles le Gallo and Martine Blaize of EUROCONTROL for helping me to focus on the problems of international incident reporting systems. Roger Bartlett, safety manager at the Maastricht upper air space Air Traffic Control center, also provided valuable help during several stages in the writing of this book. Thanks are also due to Michael Holloway of NASA's Langley Research Center who encouraged me to analyze the mishap reporting procedures being developed within his organization. Mike O'Leary of British Airways and Neil Johnstone of Aer Lingus encouraged my early work on software development for incident reporting. Ludwig Benner, Peter Ladkin, Karsten Loer and Dmitri Zotov provided advice and critical guidance on the causal analysis sections. I would also like to thank Gordon Crick and Mark Bowell of the UK Health and Safety Executive, in particular, for their ideas on the future of national reporting systems.

I would like to thank the University of Glasgow for supporting the sabbatical that helped me to finish this work.

    Chris Johnson, Glasgow, 2003.


  • Chapter 1

    Abnormal Incidents

Every day we place our trust in a myriad of complex, heterogeneous systems. For the most part, we do this without ever explicitly considering that these systems might fail. This trust is largely based upon pragmatics. No individual is able to personally check that their food and drink is free from contamination, that their train is adequately maintained and protected by appropriate signalling equipment, that their domestic appliances continue to conform to the growing array of international safety regulations [278]. As a result we must place a degree of trust in the organisations who provide the services that we use and the products that we consume. We must also, indirectly, trust the regulatory framework that guides these organisations in their commercial practices. The behaviour of phobics provides us with a glimpse of what it might be like if we did not possess this trust. For instance, a fear of flying places us in a nineteenth century world in which it takes several days rather than a few hours to cross the Atlantic. The SS United States' record crossing took 3 days, 10 hours and 40 minutes in July 1952. Today, the scheduled crossings by Cunard's QEII take approximately 6 days. In some senses, therefore, trust and profit are the primary lubricants of the modern world economy. Of course, this trust is implicit and may in some cases be viewed as a form of complicit ignorance. We do not usually pause to consider the regulatory processes that ensure our evening meal is free of contamination or that our destination airport is adequately equipped.

From time to time our trust is shaken by failures in the infrastructure that we depend upon [70]. These incidents and accidents force us to question the safety of the systems that surround us. We begin to consider whether the benefits provided by particular services and products justify the risks that they involve. For example, the Valujet accident claimed the lives of a DC-9's passengers and crew when it crashed after takeoff from Miami. National Transportation Safety Board (NTSB) investigators found that SabreTech employees had improperly labelled oxygen canisters that were carried on the flight. These canisters created the necessary conditions for the fire, which in turn led to the crash. Prior to the accident, in the first quarter of 1996, Valujet reported a net income of $10.7 million. After the accident, in the final quarter of 1996, Valujet reported a loss of $20.6 million. These losses do not take into account the additional $262 million costs of settlements with the victims' relatives.

The UK Nuclear Installations Inspectorate's report into the falsification of pellet diameter data in the MOX demonstration facility at Sellafield also illustrates the consequences of losing international confidence [641]. In the wake of this document, Japan, Germany and Switzerland suspended shipments to and from the facility. The United States' government initiated a review of BNFL's participation in a $4.4bn contract to decommission former nuclear facilities. US Energy Secretary Bill Richardson sent a team to England to meet with British investigators. British Nuclear Fuels issued a statement which stated that they had nothing to hide and were confident that the US Department of Energy would satisfy itself on this point [106].

The Channel Tunnel fire provides another example of the commercial consequences of such adverse events. In May 1997, the Channel Tunnel Safety Authority made 36 safety recommendations after finding that the fire had exposed weaknesses in underlying safety systems. Insufficient staff training had led to errors and delays in dealing with the fire. Eurotunnel, therefore, took steps to



address these concerns by implementing the short-term recommendations and conducting further studies to consider those changes that involved longer-term infrastructure investment. However, the UK Consumer Association mirrored more general public anxiety when its representatives stated that it was `still worried' about evacuation procedures and the non-segregation of passengers from cars on the tourist shuttle trains [97]. The fire closed the train link between the United Kingdom and France for approximately six months and served to exacerbate Eurotunnel's 1995 loss of $925 million.

This book introduces the many different incident reporting techniques that are intended to reduce the frequency and mitigate the consequences of accidents, such as those described in previous paragraphs. The intention is that by learning more from `near misses' and minor incidents, these approaches can be used to avoid the losses associated with more serious mishaps. Similarly, if we can identify patterns of failure in these low consequence events we can also reduce the longer term costs associated with large numbers of minor mishaps. In order to justify why you should invest your time in reading the rest of this work it is important to provide some impression of the scale of the problems that we face. It is difficult to directly assess the negative impact that workplace accidents have upon safe and successful production [283]. Many low-criticality and `near miss' events are not reported even though they incur significant cumulative costs. In spite of such caveats, it is possible to use epidemiological surveys and reports from national healthcare systems to assess the effects of incidents and accidents on worker welfare.

    1.1 The Hazards

Employment brings with it numerous economic and health benefits. It can even improve our life expectancy over those of us who may be unfortunate enough not to find work. However, work exposes us to a range of occupational hazards. The World Health Organisation (WHO) estimate that there may be as many as 250 million occupational injuries each year, resulting in 330,000 fatalities [872]. If work-related diseases are included then this figure grows to 1.1 million deaths throughout the globe [873]. About the same number of people die from malaria each year. The following list summarises the main causes of occupational injury and disease.

Mechanical hazards. Many workplace injuries occur because of poorly designed or poorly screened equipment. Others occur because people work on, or with, unsafe structures. Badly maintained tools also create hazards that may end in injury. Musculo-skeletal disorders and repetitive strain injury are now the main cause of work-related disability in most of the developed world. The consequent economic losses can be as much as 5% of the gross national product in some countries [872]. The Occupational Safety and Health Administration's (OSHA) ergonomics programme has argued that musculo-skeletal disorders are the most prevalent, expensive and preventable workplace injuries in the United States. They are estimated to cost $15 billion in workers' compensation costs each year. Other hazards of the working environment include noise, vibration, radiation, extremes of heat and cold.

Chemical Hazards. Almost all industries involve exposure to chemical agents. The most obvious hazards arise from the intensive use of chemicals in the textile, healthcare, construction and manufacturing industries. However, people in most industries are exposed to cleaning chemicals. Others must handle petroleum derivatives and various fuel sources. Chemical hazards result in reproductive disorders, in various forms of cancer, respiratory problems and an increasing number of allergies. The WHO now ranks allergic skin diseases as one of the most prevalent occupational diseases [872]. These hazards can also lead to metal poisoning, damage to the central nervous system and liver problems caused by exposure to solvents and to various forms of pesticide poisoning.

Biological hazards. A wide range of biological agents contribute to workplace diseases and infections. Viruses, bacteria, parasites, fungi, moulds and organic dusts affect many different industries. Healthcare workers are at some risk from tuberculosis infections, Hepatitis B and C as well as AIDS. For agricultural workers, the inhalation of grain dust can cause asthma


and bronchitis. Grain dust also contains mould spores that, if inhaled, can cause fatal disease [321].

Psychological Hazards. Absenteeism and reduced work performance are consequences of occupational stress. These problems have had an increasing impact over the last decade. In the United Kingdom, the cost to industry is estimated to be in excess of $6 billion with over 40 million working days lost each year [90]. There is considerable disagreement over the causes of such stress. People who work in the public sector or who are employed in the service industries seem to be most susceptible to psychological pressures from clients and customers. High workload, monotonous tasks, exposure to violence and isolated work have all been cited as contributory factors. The consequences include unstable personal relationships, sleep disturbances and depression. There can be physiological consequences including higher rates of coronary heart disease and hypertension. Post traumatic stress disorder is also increasingly recognised in workers who have been involved in, or witnessed, incidents and accidents.

This list describes some of the hazards that threaten workers' health and safety. Unfortunately, these items tell us little about the causes of these adverse events or about potential barriers. For example, an OSHA report describes the way in which a sheet metal worker was injured by a mechanical hazard:

"...employee #1 was working at station #18 (robot) of a Hitachi automatic welding line. She had been trained and was working on this line for about 2 months... The lifting arm then rises and a robot arm moves out from the operator's side of the welding line and performs its task. Then there is a few seconds delay between functions as the robot arm finishes welding, rises, returns to home and the lifting arm lowers to home, ready for the finished length of frame steel to move on and another to take it's place. During the course of this operation the welding line is shut down intermittently so that the welding tips on the robot arms can be lubricated, preventing material build up. This employee, without telling anyone else or shutting down the line, tried to perform the lubrication with the line still in automatic mode. She thought this could be done between the small amount of time it took all parts to complete their functions and return to home. The employee did not complete the task in time, as she had anticipated. Her right leg was located between the protruding rods on the lifting arm and the openings the rods rest in. Her leg was trapped. When other employees were alerted, they had trouble trying to switch the line to manual because the computer was trying to complete it's function and the lifting arm was trying to return to home. The result was that one employee used a crowbar to help relieve pressure on her leg and another used the cellenoid which enabled the lifting arm to rise. The employee received two puncture wounds in the thigh (requiring stitches) and abrasions to the lower leg. Management once again instructed employees working this line on the serious need to wait until all functions are complete, the line shut down and not in the automatic mode before attempting any maintenance." (OSHA Accident Report ID: 0352420).

It is possible to identify a number of factors that were intended to prevent this incident from occurring. Line management had trained the employees not to intervene until the robot welding cycle was complete. Lubrication was intended to be completed when the line was `shut down' rather than in automated mode. It is also possible to identify potential factors that might have been changed to prevent the accident from occurring. For example, physical barriers might have been introduced into the working environment so that employees were prevented from intervening during automated operations. Similarly, established working practices may in some way have encouraged such risk taking, as the report comments that management `once again' instructed employees to wait until the line was shut down. These latent problems created the context in which the incident could occur [698]. The triggering event, or catalyst, was the employee's decision that she had enough time to lubricate the device. The lack of physical barriers then left her exposed to the potential hazard once she had decided to pursue this unsafe course of action. Observations about previously unsafe working practices in this operation may also have done little to dissuade her from this intervention.
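The distinction drawn above, between latent conditions, the triggering event and the barriers that failed to intervene, can be captured in a simple record structure. The following sketch is illustrative only; the class and field names are our own and are not taken from any reporting standard, and the free-text entries paraphrase the OSHA case discussed above.

```python
from dataclasses import dataclass, field

@dataclass
class IncidentAnalysis:
    """Minimal record separating long-standing flaws from the catalytic trigger."""
    latent_conditions: list = field(default_factory=list)  # context that made the incident possible
    trigger: str = ""                                      # the catalytic event itself
    failed_barriers: list = field(default_factory=list)    # defences that were absent or bypassed

# The welding-line case, classified using the factors identified above.
welding_case = IncidentAnalysis(
    latent_conditions=[
        "lubrication attempted while the line remained in automatic mode",
        "management had 'once again' to instruct staff to wait for shut-down",
    ],
    trigger="operator decided she had time to lubricate between cycles",
    failed_barriers=["no physical guard preventing access during automatic operation"],
)

print(len(welding_case.latent_conditions))  # 2
```

Even a minimal structure of this kind forces the analyst to distinguish the conditions that incubated over months from the single decision that exposed them, which is the distinction the following sections build on.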


Figure 1.1 provides a high level view of the ways in which incidents and accidents are caused by catalytic failures and weakened defences. The diagram on the left shows how the integration

    Figure 1.1: Components of Systems Failure

of working practices, working environment, line management and regulatory intervention together support a catalytic or triggering failure. Chapter 3 will provide a detailed analysis of the sources of such catalytic failures. For now, however, it is sufficient to observe that there are numerous potential causes, ranging from human error through stochastic equipment failures to deliberate violations of regulations and working practices. It should also be apparent that there may be catalytic failures of such magnitude that it would be impossible for any combination of the existing structures to withstand them for any length of time. In contrast, the diagram on the right of Figure 1.1 is intended to illustrate how weaknesses in the integration of system components can increase an application's vulnerability to such catalytic failures. For example, management might strive to satisfy the requirements specified by a regulator but if those requirements are flawed then there is a danger that the system will be vulnerable to future incidents. These failures in the supporting infrastructure are liable to develop over a much longer timescale than the triggering events that place the system under more immediate stress.

The diagrams in Figure 1.1 sketch out one view of the way in which specific failures place stress on the underlying defences that protect us from the hazards that were listed in previous paragraphs. A limitation of these sketches is that they provide an extremely static impression of a system as it is stressed by catalytic failures. In contrast, Figure 1.2 provides a more process oriented view of the development of an occurrence or critical incident. Initially, the system is in a `normal' state. Of course, this `normal' state need not itself be safe if there are flaws in the working practices and procedures that govern everyday operation. The system may survive through an incubation period in which any residual flaws are not exposed by catalytic failures. This phase represents a `disaster waiting to happen'. However, at some point such an event does cause the onset of an incident or accident. These failures may, in turn, expose further flaws that trigger incidents elsewhere in the same system or in other interrelated applications. After the onset of a failure, protection equipment and other operators may intervene to mitigate any consequences. In some


    Figure 1.2: Process of Systems Failure

cases, this may return the system to a nominal state in which no repair actions are taken. This has potentially dangerous implications because the flaws that were initially exposed by the triggering event may still reside in the system. Alternatively, a rescue and salvage period may be initiated in which previous shortcomings are addressed. In particular, a process of cultural readjustment is likely if the potential consequences of the failure have threatened the continued success of the organisation as a whole. For example, the following passage comes from a report that was submitted to the European Commission's Major Accident Reporting System (MARS) [229]:

"At 15:30 the crankcase of an URACA horizontal action 3 throw pump, used to boost liquid ammonia pressure from 300 psi to 3,400 psi, was punctured by fragments of the failed pump-ram crankshaft. The two operators investigating the previously reported noises from the pump were engulfed in ammonia and immediately overcome by fumes. Once the pump crankcase was broken, nothing could be done to prevent the release of the contents of the surge drum (10 tonnes were released in the first three minutes). The supply of ammonia from the ring main could only be stopped by switching off the supply pump locally. No one were able to do this as the two gas-tight suits available were preferentially used for search and rescue operations, and thus release of ammonia continued. Ammonia fumes quickly began to enter the plant control room and the operators hardly had the time to sound the alarms and start the plant shut-down before they had to leave the building using 10 minutes escape breathing apparatus sets. During the search and rescue operation the fire authorities did not use the gas-tight suits and fumes entered the gaps around the face piece and caused injuries to 5 men. The ammonia cloud generated by the initial release drifted off-site and remained at a relatively low level." (MARS report 814).

A period of normal operation led to an incubation period in which the pump-ram crankshaft was beginning to fail and required maintenance. The trigger event involved the puncture of the pump's


crankcase when the ram crankshaft eventually failed. This led to the onset of the incident in which two operators were immediately overcome. This then triggered a number of further, knock-on failures. For instance, the injuries to the firemen were caused because they did not use gas tight suits during their response to the initial incident. In this case, only minimal mitigation was possible as operators did not have the gas tight suits that were necessary in order to isolate the ammonia supply from the ring main. Those suits that were available were instead deployed to search and rescue operations.

Many of the stages shown in Figure 1.2 are based on Turner's model for the development of a system failure [790]. The previous figure introduces a mitigation phase that was not part of this earlier model. This is specifically distinguished from Turner's rescue and salvage stage because it reflects the way in which operators often intervene to `cover up' a potential failure by taking immediate action to restore a nominal state. In many instances, individuals may not even be aware that such necessary intervention should be reported as a focus for potential safety improvements. As Leveson points out, human intervention routinely prevents the adverse consequences of many more occurrences than are ever recorded in accident and incident reports [486]. This also explains our introduction of a feedback loop between the mitigation and the situation normal phases. These features were not necessary in Turner's work because his focus was on accidents rather than incidents. Figure 1.2 also introduces a feedback loop between the onset and trigger phases. This is intended to capture the ways in which an initial failure can often have knock-on effects throughout a system. It is very important to capture these incidents because they are increasingly common as we move to more tightly integrated, heterogeneous application processes.
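The stages and feedback loops discussed above can be summarised as a small state-transition table. This is only a sketch of the process in Figure 1.2 as described in the text; the state names are paraphrased labels for the phases, not a formal notation.

```python
# Transitions in the process model of Figure 1.2 (paraphrased from the text).
# Note the two feedback loops: mitigation can return the system to 'normal'
# without repair, and onset can trigger further knock-on failures.
TRANSITIONS = {
    "normal":       ["incubation"],             # residual flaws accumulate unexposed
    "incubation":   ["trigger"],                # a catalytic event occurs
    "trigger":      ["onset"],                  # the incident or accident begins
    "onset":        ["trigger", "mitigation"],  # knock-on failures, or intervention
    "mitigation":   ["normal", "rescue"],       # cover-up vs. rescue and salvage
    "rescue":       ["readjustment"],           # previous shortcomings addressed
    "readjustment": ["normal"],
}

def successors(state):
    """Return the phases reachable from a given phase of the model."""
    return TRANSITIONS.get(state, [])

# The dangerous path: mitigation restores a nominal state while the
# exposed flaws still reside in the system.
print("normal" in successors("mitigation"))  # True
```

Encoding the model this way makes the point in the text explicit: a path through `mitigation` back to `normal` never visits `rescue` or `readjustment`, so nothing forces the underlying flaws to be repaired or even reported.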

Previous paragraphs have sketched a number of ways in which particular hazards contribute to occupational injuries. They have also introduced a number of high-level models that can be used to explain some of the complex ways in which background failures and triggering events combine to expose individuals to those hazards. The following sections build on this analysis by examining the likelihood of injury to individuals in particular countries and industries. We also look at the costs of these adverse events to individuals and also to particular industries. The intention is to reiterate the importance of detecting potential injuries and illnesses before they occur.

    1.1.1 The Likelihood of Injury and Disease

Work-place incidents and accidents are relatively rare. In the United Kingdom, approximately 1 in every 200 workers reports an occupational illness or injury resulting in more than three days of absence from employment every year [331]. OSHA estimates that the rate of work-related injuries and illnesses dropped from 7.1 per year for every 100 workers in 1997 to 6.7 in 1998 [652]. These figures reflect significant improvements over the last decade. For example, the OSHA statistics show that the number of work-related fatalities has almost been halved since it was established by Congress in 1971. The Australian National Occupational Health and Safety Commission report that the rate of fatality, permanent disability or a temporary disability resulting in an absence from work of one week or more was 2.2 per 100 in 1997-8, 2.5 in 1996-7, 2.7 in 1995-6, 2.9 in 1994-95, 3.0 in 1993-4 and 2.8 in 1992-3 [44]. The following figures provide the same data per million hours worked: 13 in 1997-8, 14 in 1996-7, 16 in 1995-6, 16 in 1994-5, 17 in 1993-4, 19 in 1992-3.
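The two Australian series quoted above, per 100 workers and per million hours worked, are related by the average annual hours worked per employee. The sketch below makes the conversion explicit; the figure of roughly 1,700 hours per worker per year is an assumption chosen to illustrate the 1997-8 pair, not a number taken from the Commission's report, and the implied hours differ slightly from year to year.

```python
def per_million_hours(rate_per_100_workers, annual_hours=1700.0):
    """Convert an annual incidence rate per 100 workers
    into a rate per million hours worked."""
    per_worker = rate_per_100_workers / 100.0       # incidents per worker per year
    return per_worker / annual_hours * 1_000_000    # incidents per million hours

# 1997-8: 2.2 per 100 workers corresponds to roughly 13 per million hours.
print(round(per_million_hours(2.2)))  # 13
```

Conversions of this kind matter when comparing reporting systems, since a denominator of workers penalises economies with many part-time employees while a denominator of hours does not.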

These statistics hide a variety of factors that continue to concern governments, regulators, managers, operators and the general public. The first cause for concern stems from demographic and structural changes in the workforce. Many countries continue to experience a rising number of workers. This is both due to an increasing population and to structural changes in the workforce, for instance increasing opportunities for women. In the United Kingdom, the 1% fall between 1998 and 1999 in the over 3 day injury rate is being offset by a (small) rise in the total number of injuries from 132,295 to 132,307 in 1999-2000 [331]. Similarly the OSHA figures for injury and illness rates show a 40% decline since 1971. At the same time, however, U.S. employment has risen from 56 million workers at 3.5 million worksites to 105 million workers at nearly 6.9 million sites [652]. Population aging will also have an impact upon occupational injury statistics. Many industrialised countries are experiencing the twin effects of a falling birth rate and a rising life expectancy. This will increase pressure on the workforce for higher productivity and greater contributions to retirement provision. Recent estimates place the number of people aged 60 and over at 590 million worldwide.

  • 1.1. THE HAZARDS 7

By 2020, this number is projected to exceed 1,000 million [873]. Of this number, over 700 million older people will live in developing countries. These projections are not simply significant for the burdens that they will place on those in work. Older elements of the workforce are often the most likely to suffer fatal work-related injuries. In 1997-98, the highest rate of work-related fatalities in Australia occurred in the 55 plus age group with 1.3 deaths per 100 employees. They were followed by the 45-49 and 50-54 age groups with approximately 0.8 fatalities per 100 employees. The lowest number of fatalities occurred in workers under the age of 20 with 0.2 deaths per 100 employees. It can be difficult to interpret such statistics. For example, they seem to indicate that the rising risks associated with aging outweigh any beneficial effects from greater expertise across the workforce. Alternatively, the statistics may indicate that younger workers are more likely to survive injuries that would prove fatal to older colleagues. The UK rate of reportable injury is lower in men aged 16-19 than in all age groups except for those above 55 [326]. However, the HSE report that the differences between age groups are not statistically significant when allowing for the higher accident rates for those occupations that are mainly performed by younger men. There is also data that contradicts the Australian experience. Young men, aged 16-24, face a 40% higher relative risk of all workplace injury than men aged 45-54 even after allowing for occupations and other job characteristics.

The calculation of health and safety statistics has also been affected by social and economic change. Part-time work has important effects on the calculation of health and safety statistics per head of the working population [652, 326]. The rate of injury typically increases with the amount of time exposed to a workplace risk. However, it is possible to normalise the rate using an average number of weekly hours of work. The rate of all workplace injury in the UK is 8.0 per 100 for people working less than 16 hours per week. For people working between 16 and 29 hours per week it is 4.3, between 30 and 49 hours it is 3.8, between 50 and 59 it is 3.2, and people working 60 or more hours per week have an accident rate of 3.0 per 100 workers per annum. People who work a relatively low number of hours have substantially higher rates of all workplace and reportable injury than those working longer hours. The relatively high risk in workers with low hours remains after allowing for different occupational characteristics [326]. The growth of temporary work has similar implications for some economies. In the UK, the rate of injury to workers in their first 6 months is double that of their colleagues who have worked for at least a year. This relatively high risk for new workers remains after allowing for occupations and hours of work. 57% of temporary workers have been with their employer for less than 12 months.
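The hours-based normalisation described above can be sketched as a small calculation. The per-head rates are the UK figures quoted in the text; the 37.5-hour reference week and the midpoints assumed for each hours band are illustrative assumptions, not values taken from the source:

```python
# Sketch of hours-based normalisation of injury rates (illustrative only).
# Group rates are the UK all-workplace-injury figures quoted in the text.
# FULL_TIME_HOURS and the band midpoints are assumptions for illustration.

FULL_TIME_HOURS = 37.5

# band -> (assumed midpoint of weekly hours, rate per 100 workers per annum)
groups = {
    "under 16h": (8.0, 8.0),
    "16-29h":    (22.5, 4.3),
    "30-49h":    (39.5, 3.8),
    "50-59h":    (54.5, 3.2),
    "60h+":      (65.0, 3.0),
}

def fte_adjusted_rate(avg_hours: float, rate_per_100: float) -> float:
    """Re-express a per-head rate as a rate per 100 full-time equivalents."""
    exposure = avg_hours / FULL_TIME_HOURS  # fraction of a full-time week
    return rate_per_100 / exposure

for band, (hours, rate) in groups.items():
    print(f"{band:>9}: {rate:.1f} per 100 heads -> "
          f"{fte_adjusted_rate(hours, rate):.1f} per 100 FTE")
```

Even on this exposure-adjusted basis the under-16-hour group remains by far the riskiest, which is consistent with the source's observation that the higher risk of low-hours workers is not explained by exposure time alone.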

Figure 1.1 shows that accident rates are not uniformly distributed across industry sectors. For example, the three day rate for agriculture and fishing in the United Kingdom is 1.2 per 100 employees. The same rate for the services industries is approximately 0.4 per 100 workers.

Industry        UK          Germany     France  Spain       Italy
                1993  1994  1993  1994  1993    1992  1993  1991

Agriculture      7.3   8.5   6.0   6.7   9.8     9.1   5.4  18.4
Utilities        0.5   0.6   3.1   4.3   5.6    12.5  10.1   4.4
Manufacturing    1.6   1.2   2.3   1.6   2.3     6.7   4.9   3.3
Construction     8.9   6.9   7.9   8.0  17.6    21.0  19.3  12.8
Transport        2.2   2.0   7.2   7.5   6.5    13.0  10.7  11.2
Other services   0.3   0.4   1.0   1.2   1.9     1.4   1.5   0.9
All industries   1.2   0.9   3.3   3.2   3.9     6.4   5.1   5.5

Table 1.1: Industry Fatality Rates in UK, Germany, France, Spain & Italy [324]

Accident rates also differ with gender. Positive employment practices are exposing increasing numbers of women to a greater variety of risks in the workplace. The overall Australian National Occupational Health and Safety Commission rate of 2.2 injuries and illnesses per 100 workers hides a considerable variance [44]. For males the rate was 2.9 per 100 workers whilst it was 1.3 for females.


In 1997-8, the industries with the highest number of male fatalities were Transport and Storage (66) and Manufacturing (64), while for females Accommodation, Cafes and Restaurants (4) and Property and Business Services (4) were the highest. The male fatalities were mainly employed as Plant and Machine Operators, and Drivers (91). Female fatalities were mainly employed as Managers and Administrators (5). These differences may decline with underlying changes in workplace demographics. However, UK statistics suggest some significant residual differences between the genders:

"the rate of all workplace injury is over 75% higher in men than women, reflecting that men tend to be employed in higher risk occupations. After allowing for job characteristics, the relative risk of workplace injury is 20% higher in men compared with women. Job characteristics explain much of the higher rate of injury in men but not all because men still have an unexplained 20% higher relative risk." [326]

Table 1.1 illustrates how the rate of industrial injuries differs within Europe. Such differences are more marked when comparisons are extended throughout the globe. However, it is not always possible to find comparable data:

"The evaluation of the global burden of occupational diseases and injuries is difficult. Reliable information for most developing countries is scarce, mainly due to serious limitations in the diagnosis of occupational illnesses and in the reporting systems. WHO estimates that in Latin America, for example, only between 1 and 4% of all occupational diseases are reported. Even in industrialised countries, the reporting systems are sometimes fragmented." [873]

For example, the Australian statistics cited in previous paragraphs include some cases of coronary failure that would not have been included within the UK statistics. These problems are further exacerbated by the way in which local practices affect the completion of death certifications and other reporting instruments. For instance, the death of a worker might have been indirectly caused by a long running coronary disease or by the immediate physical exertion that brings on a heart attack. It is important to emphasise that even if it were possible to implement a consistent global reporting system for workplace injuries, it would still not be possible to draw inferences about the number of incidents and accidents directly from that data. Many incidents still go unreported even if well-established reporting systems are available. A further limitation is that injury and fatality statistics tell us little or nothing about `near miss' incidents that narrowly avoided physical harm.

    1.1.2 The Costs of Failure

In 1996 the UK Health and Safety Executive estimated that workers and their families lost approximately $558 million per year in reduced income and additional expenditure from work-related injury and ill health [322]. They also estimated that the loss of welfare in the form of pain, grief and suffering to employees and their families was equivalent to a further $5.5 billion. These personal costs also have wider implications for employers, for the local economy and ultimately for national prosperity. The same study estimated that the direct cost to employers was approximately $2.5 billion a year; $0.9 billion for injuries and $1.6 billion for illness. In addition, the loss caused by avoidable accidental events that do not lead to injury was estimated at between $1.4 billion and $4.5 billion per year. This represents 4-8% of all UK industrial and commercial companies' gross trading profits.

Employers also incur costs through regulatory intervention. These actions are intended to ensure that a disregard for health and safety will be punished whether or not an incident has occurred. Tables 1.2 and 1.3 summarise the penalties imposed by United States' Federal and State inspectors in the fiscal year 1999 [652]. Regulatory actions imposed a cost of $151,361,442 beyond the immediate financial losses incurred from incidents and accidents. These figures do not account for the numerous competitive disadvantages that are incurred when organisations are associated with high-profile failures [675].


Violations   Percent   Type               Penalties
       646       0.8   Willful          $24,460,318
    50,567        66   Serious          $50,668,509
     1,816         2   Repeat            $8,291,014
       226       0.3   Failure to abate  $1,205,063
       408      0.01   Unclassified      $3,740,082
    23,533        30   Other             $1,722,338
    77,196             Total            $90,087,324

Table 1.2: Federal Inspections Fiscal Year 1999

Violations   Percent   Type               Penalties
       441       0.3   Willful          $12,406,050
    57,010        40   Serious          $35,441,267
     2,162       1.5   Repeat            $4,326,620
       785       0.5   Failure to abate  $2,860,972
        46    0.0002   Unclassified      $2,607,900
    82,120        40   Other             $3,631,309
   202,962             Total            $61,274,118

Table 1.3: State Inspections Fiscal Year 1999
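The combined regulatory cost of $151,361,442 cited earlier is simply the sum of the Federal and State penalty totals in Tables 1.2 and 1.3. A short cross-check, using the per-category values transcribed from the tables:

```python
# Cross-check of the combined regulatory penalty figure from Tables 1.2
# and 1.3 (fiscal year 1999). All values are transcribed from the tables.

federal_penalties = {
    "Willful":          24_460_318,
    "Serious":          50_668_509,
    "Repeat":            8_291_014,
    "Failure to abate":  1_205_063,
    "Unclassified":      3_740_082,
    "Other":             1_722_338,
}

state_penalties = {
    "Willful":          12_406_050,
    "Serious":          35_441_267,
    "Repeat":            4_326_620,
    "Failure to abate":  2_860_972,
    "Unclassified":      2_607_900,
    "Other":             3_631_309,
}

federal_total = sum(federal_penalties.values())
state_total = sum(state_penalties.values())

print(federal_total)                 # matches the Table 1.2 total, $90,087,324
print(state_total)                   # matches the Table 1.3 total, $61,274,118
print(federal_total + state_total)   # the $151,361,442 cited in the text
```

The category penalties in each table do sum to the printed totals, and the two totals sum exactly to the combined figure quoted in the text.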

1.2 Social and Organisational Influences

These statistics illustrate the likelihood and consequences of occupational injuries. It is important, however, to emphasise that this data suffers from a number of biases. Many of the organisations that are responsible for collating the statistics are also responsible for ensuring that mishap frequencies are reduced over time. Problems of under-reporting can also complicate the interpretation of national figures. There is often a fear that some form of blame will attach itself to those organisations that return an occupational health reporting form. The OSHA record keeping guidelines stress that:

"Recording an injury or illness under the OSHA system does not necessarily imply that management was at fault, that the worker was at fault, that a violation of an OSHA standard has occurred, or that the injury or illness is compensable under workers' compensation or other systems." [653]

However, in many countries including the United States, organisations that have a higher reported rate of occupational illness or injury become the focus of increasing levels of regulatory inspection and intervention. This has a certain irony because, as OSHA acknowledge, relatively low levels of reported injuries and illnesses may be an indicator of poor health and safety management:

"...during the initial phases of identifying and correcting hazards and implementing a safety and health program an employer may find that its reported rate increases. This may occur because, as an employer improves its program, worker awareness and thus reporting of injuries and illnesses may increase. Over time, however, the employer's ... rate should decline if the employer has put into place an effective program." [648]

It is instructive to examine how our analysis relates to previous work on enhancing the safety of hazardous technologies. Two schools of thought can be identified; the first stems from the `normal accident' work of Perrow [675]; the second stems from the idea of `high reliability' organisations [718].

    1.2.1 Normal Accidents?

Perrow argues that the characteristics of high-risk technologies make accidents inevitable, in spite of the effectiveness of conventional safety devices. These characteristics include complexity and tight coupling. Complexity arises from our limited understanding of some transformation stages in modern processing industries. It stems from complex feedback loops in systems that rely on multiple, interacting controls. Complexity also stems from many common-mode interconnections between subsystems that cannot easily be isolated. More complex systems produce unexpected interactions and so can provoke incidents that are harder to rectify.

Perrow also argues that tight coupling plays a greater role in the adverse consequences of many accidents than the complexity of modern technological systems. This arises because many applications are deliberately designed with narrow safety margins. For example, a tightly coupled system may only permit one method of achieving a goal. Access to additional equipment, raw materials and personnel is often limited. Any buffers and redundancy that are allowed in the system are deliberately designed only to meet a few specified contingencies. In contrast, Perrow argues that accidents can be avoided through loose coupling. This provides the time, resources and alternative paths to cope with a disturbance.

There is evidence to contradict parts of Perrow's argument [710, 684]. Some `high reliability' organisations do seem to be able to sustain relatively low incident rates in spite of operating complex processes. Viller [847] identifies a number of key features that contribute to the perceived success of these organisations:

    The leadership in an organisation places a high priority on safety.

    High levels of redundancy exist even under external pressures to trim budgets.

Authority and responsibility are decentralised and key individuals can intervene to tackle potential incidents. These actions are supported by continuous training and by organisational support for the maintenance of an appropriate safety culture.

Organisational learning takes place through a variety of means, including trial and error but also through simulation and hypothesis testing.

These characteristics illustrate the important role that incident reporting plays for `high reliability' organisations. Such applications are an important means of supporting organisational learning. Table 1.4 summarises the main features of `Normal Accident' theory and `High Reliability' organisations. Sagan [718] used both of these approaches to analyse the history of nuclear weapons safety. His conclusions lend weight to Perrow's pessimistic assessment that some accidents are inevitable. They are significant because they hold important implications for the interpretation both of incident and accident reports. For example, Sagan argues that much of the evidence put forward to support high reliability organisations is based on data that those organisations help to produce. Accounts of good safety records in military installations are often dependent on data supplied by the military. This is an important caveat to consider during the following pages in which we will present incident and accident statistics. We may not always be able to rely upon the accuracy of information that organisations use to publicise improvements in their own safety record. Sagan also argues that social pressures act as brakes on organisational learning. He identifies ways in which stories about previous failures have been altered and falsified. He then goes on to show how the persuasive effects of such pressures can help to convince the originators of such stories that they are, in fact, truthful accounts of incidents and accidents. This reaches extremes when failures are re-painted as notable successes.

    1.2.2 The Culture of Incident Reporting

Sagan's work shows that a variety of factors can affect whether or not adverse events are investigated. These factors affect both individuals and groups within safety-critical organisations. The impact of cultural influences, of social and legal obligations, cannot be assessed without regard to individual differences. Chapter 3 will describe how subjective attitudes to risk taking and to the violation of rules can have a profound impact upon our behaviour. For now it is sufficient to observe that each of the following influences will affect individuals in a number of different ways.

In some groups, it can be disloyal to admit that either you or your colleagues have made a mistake or have been involved in a `failure'. These concerns take a number of complex forms. For example,

High Reliability Organisations           Normal Accidents Theory
---------------------------------------  ---------------------------------------
Accidents can be prevented through       Accidents are inevitable in complex
good organisational design and           and tightly coupled systems.
management.

Safety is the priority organisational    Safety is one of a number of
objective.                               competing objectives.

Redundancy enhances safety:              Redundancy often causes accidents:
duplication and overlap can make a       it creates interactive complexity
reliable system out of unreliable        and encourages risk taking.
parts.

Decentralised decision-making is         De-centralised control is needed for
needed to permit prompt and flexible     complex systems but centralised
operating responses to surprises.        control is needed for tight coupling.

A culture of reliability enhances        A military model of intense
safety by encouraging uniform and        discipline and isolation is
appropriate responses by operators.      incompatible with democratic values.

Continuous operations, training and      Organisations cannot train for
simulations can create and maintain      unimagined, highly dangerous or
high reliability operations.             politically unpalatable operations.

Trial and error learning from            Denial of responsibility, faulty
accidents can be effective and can be    reporting and reconstruction of
supplemented by anticipation and         history cripples learning efforts.
simulations.

Table 1.4: Competing Perspectives on Safety with Hazardous Technologies [718]

individuals may be prepared to report failures. However, individuals may be reluctant to face the retribution of their colleagues should their identity become known. These fears are compounded if they do not trust the reporting organisation to ensure their anonymity. For this reason, NASA go to great lengths to publicise the rules that protect the identity of contributors to the US Aviation Safety Reporting System.

Companies can support a good `safety culture' by investing in and publicising workplace reporting systems. A number of factors can, however, undermine these initiatives. The more active a company is in seeking out information about previous failures then the worse its safety record may appear. It can also be difficult to sustain the employee protection that encourages contributions when incidents have economic as well as safety implications. Individuals can be offered re-training after a first violation; re-employment may be required after a second or third.

The social influence of a company's `safety culture' is reinforced by the legal framework that governs particular industries. This is most apparent in the regulations that govern what should and what should not be reported to national safety agencies. For example, the OSHA regulations follow Part 1904.12(c) of the Code of Federal Regulations. These require that employers record information about every occupational death; every nonfatal occupational illness; and those nonfatal occupational injuries which involve one or more of the following: loss of consciousness, restriction of work or motion, transfer to another job, or medical treatment (other than first aid) [653]. As we shall see, this focus on accidents rather than `near-miss' incidents reflects an ongoing debate about the scope of Federal regulation and enforcement in the United States.

It is often argued that individuals will not contribute to reporting systems unless they are protected from self-incrimination through a `no blame' policy [700]. It is difficult for organisations to preserve this `no blame' approach if the information that they receive can subsequently be used during prosecutions. Conversely, a local culture of non-reporting can be reinforced or instigated by a fear of legal retribution if incidents are disclosed. These general concerns characterise a range of more detailed institutional arrangements. For example, some European Air Traffic Management providers operate under a legal system in which all incidents must be reported to the police. In neighbouring countries, the same incidents are investigated by the service providers themselves and, typically, fall under an informal non-prosecution agreement with state attorneys. Other countries have more complex legal situations in which specific industry arrangements also fall under more general regional and national legislation. For example, the Utah Public Officers and Employees' Ethics Act and the Illinois Whistle Blower Protection Act are among a number of state instruments that have been passed to protect respondents. These local Acts provide for cases that are also covered by Federal statutes including the Federal False Claims Act or industry specific provision for Whistle Blowers such as section 405 of the Surface Transportation Assistance Act. This has created some disagreement about whether state legislation preempts federal law in this area; several cases have been conducted in which claimants have filed both common law and statutory suits at the same time. Cases in Texas and Minnesota have shown that Federal statutes provide a base-line and not a ceiling for protection in certain states. Such legal complexity can deter potential contributors to reporting systems.

There are other ways in which the legislative environment can affect reporting behaviour. For example, freedom of information and disclosure laws are increasing public access to the data that organisations hold. The relatives or representatives of people involved in an accident can potentially use these laws to gain access to information about previous incidents. In such circumstances, there is an opportunity for punitive damages to be sought if previous, similar incidents were reported but not acted upon. These concerns arose in the aftermath of the 1998 Tobacco Settlement with cigarette manufacturers in the United States. Prior to this settlement, states alleged that companies had conspired to withhold information about the adverse health effects of tobacco [580].

The legislative environment for accident and incident reporting is partly shaped by higher-level political and social concerns. For example, both developed and developing nations have sought to deregulate many of their industries in an attempt to encourage growth and competition. Recent initiatives to liberalise the Indian economy have highlighted this conflict between the need to secure economic development whilst also coordinating health and safety policy. The Central Labour Institute has developed national standards for the reporting of major accidents. However, the Directorate General of Factory Advice Services and the Labour Institutes have not developed similar guidelines for incident and occurrence reporting. The focus has been on developing education and training programmes that can target specific health and safety issues after industries have become established within a region [156].

Some occupational health and safety reporting systems have, however, been extended to explicitly collect data about both actual accidents and `near-miss' incidents. For example, employers in the UK are guided by the Reporting of Injuries, Diseases and Dangerous Occurrences Regulations (RIDDOR) 1995. These cover accidents which result in an employee or a self-employed person dying, suffering a major injury, or being absent from work or unable to do their normal duties for more than three days. They also cover `dangerous occurrences' that do not result in injury but have the potential to do significant harm [320]. These include:

    The collapse, overturning or failure of load-bearing parts of lifts and lifting equipment.

    The accidental release of a biological agent likely to cause severe human illness.

    The accidental release of any substance which may damage health.

    The explosion, collapse or bursting of any closed vessel or associated pipework.

An electrical short circuit or overload causing fire or explosion.

An explosion or fire causing suspension of normal work for over 24 hours.

Similarly, Singapore's Ministry of Manpower requires that both accidents and `dangerous occurrences' must be reported. Under the fourth schedule of the national Factory Act, these may `under other circumstances' have resulted in injury or death [742]. The detailed supporting materials that accompany the act provide exhaustive guidance on the definition of such dangerous occurrences. These are taken to include incidents that involve the bursting of a revolving vessel, wheel, grindstone or grinding wheel.


Dangerous occurrences also range from an electrical short circuit or failure of electrical machinery, plant or apparatus, attended by explosion or fire or causing structural damage, to an explosion or failure of the structure of a steam boiler, or of a cast-iron vulcaniser.

A duty to report on incidents and accidents does not always imply that information about these occurrences will be successfully acted upon. This concern is at the heart of continuing attempts to impose a `duty to investigate' upon UK employers. At present, the UK regulatory framework is one in which formal accident investigation of the most serious incidents is undertaken by specially trained investigators. Employers are not, in general, obliged actively to find out what caused something to go wrong. Concern about this situation led to a 1998 discussion document published by the Health and Safety Commission (HSC). It was observed that:

"At present, there is no law which explicitly requires employers to investigate the causes of workplace accidents. Many employers do undertake accident investigation when there has been an event in the workplace which has caused injury in order to ensure lessons are learnt, and although there is no explicit legal duty to investigate accidents there are duties under some health and safety law which may lead employers to undertake investigation. The objective of a duty to investigate accidents would be to ensure employers draw any appropriate lessons from them in the interests of taking action to prevent recurrence." [314]

There are many organisational reasons why a body such as the HSC would support such an initiative. The first is the fa