Resilience Engineering: The changing nature of safety

26
© Erik Hollnagel, 2010 Resilience Engineering: The changing nature of safety Erik Hollnagel E-mail: [email protected] Professor & Industrial Safety Chair MINES ParisTech Sophia Antipolis, France Professor II NTNU Trondheim, Norge

Transcript of Resilience Engineering: The changing nature of safety

© Erik Hollnagel, 2010

Resilience Engineering:The changing nature of safety

Erik Hollnagel

E-mail: [email protected]

Professor &Industrial Safety ChairMINES ParisTechSophia Antipolis, France

Professor IINTNU

Trondheim, Norge

© Erik Hollnagel, 2010

Normal accident theory (1984)

“On the whole, we have complex systems because we don’t know how to produce the output through linear systems.”

© Erik Hollnagel, 2010

Mor

als,

soc

ial n

orm

s

Gov

ern m

ent

MTO view: sharp-end, blunt-end

Aut

hor it

y

Com

pan y

Man

age m

ent

Wor

kpla

ce

Activity

Blunt-end failures

happened earlier and

somewhere else

Sharp-end failures happen here and now

© Erik Hollnagel, 2010

Focus on operation (sharp end)

Design Maintenance

Organisation (management)

Scope of work system (1984)

Technology, automation

Upstream

Downstream

Work has clear objectives and takes place in well-defined situations. Systems and technologies are loosely coupled and

tractable.

© Erik Hollnagel, 2010

Scope of work system (2008)

Vertical and horizontal extensions

A vertical extension to cover the entire system, from technology to organisation

Design Maintenance

Organisation (management)

Upstream Technology, automation

Downstream

© Erik Hollnagel, 2010

Scope of work system (2008)

Vertical and horizontal extensions

One horizontal extension, to cover the lifecycle,

from design to maintenance

A vertical extension to cover the entire system, from technology to organisation

Design Maintenance

Organisation (management)

Upstream Technology, automation

Downstream

© Erik Hollnagel, 2010

DownstreamScope of work system (2008)

Vertical and horizontal extensions

One horizontal extension, to cover the lifecycle,

from design to maintenance

A vertical extension to cover the entire system, from technology to organisation

A second horizontal extension, to cover upstream and downstream processes

Design Maintenance

Organisation (management)

Upstream Technology, automation

© Erik Hollnagel, 2010

Tractable and intractable systems

Principles of functioning known (white box)

System does not change while being described

Description are simple with few details

Tractable system(independent, clockwork)

System changes before description is completed

Comprehensibility

Underspecified

Stability

Complicacy

Fully specified Partly specified

Intractable system(interdependent, teamwork)

System changes before description is completed

Underspecified

Principles of functioning unknown (black box)

Elaborate descriptions with many details

© Erik Hollnagel, 2010

Performance variability is necessary

Many socio-technical systems are intractable. The conditions of work therefore never completely match what has been specified or prescribed.

Systems are so complex that work situations always are underspecified – hence partly unpredictable

Few – if any – tasks can successfully be carried out unless procedures and tools are adapted to the situation. Performance variability is both normal and necessary.

?

Individuals, groups, and organisations normally adjust their performance to meet existing conditions, specifically actual resources and requirements.

Because resources (time, manpower, information, etc.) always are finite, such adjustments will always be approximate rather than exact.

Performance variability

Success

Failure

© Erik Hollnagel, 2010

If thoroughness dominates, there may be too little time to carry out the actions.

If efficiency dominates, actions may be badly

prepared or wrong

Neglect pending actionsMiss new events

Miss pre-conditionsLook for expected results

Thoroughness: Time to thinkRecognising situation.Choosing and planning.

Efficiency: Time to doImplementing plans. Executing actions.

Efficiency-Thoroughness Trade-Off

Time & resources needed

Time & resources available

© Erik Hollnagel, 2010

The ETTO principle

The ETTO principle describes the fact that people (and organisations) as part of their activities practically always must make a trade-off between the resources (time and effort) they spend on preparing an activity and the resources (time, effort and materials) they spend on doing it.

ETTOing favours thoroughness over efficiency if safety and quality are the dominant concerns, and efficiency over thoroughness if throughput and output are the dominant concerns.

The ETTO principle means that it is impossible to maximise efficiency and thoroughness at the same time. Neither can an activity expect to succeed, if there is not a minimum of either.

© Erik Hollnagel, 2010

Failures or successes?

Who or what are responsible for the remaining 10-20%?

When something goes right, e.g., 9.999 events

out of 10.000, are humans also responsible in 80-90% of the cases?

When something goes wrong, e.g., 1 event out of 10.000

(10E-4), humans are assumed to be responsible in

80-90% of the cases.

Who or what are responsible for the remaining 10-20%?

Investigation of failures is accepted as important.

Investigation of successes is rarely

undertaken.

© Erik Hollnagel, 2010

Why only look at what goes wrong?

Focus is on what goes wrong. Look for failures and malfunctions. Try to eliminate causes and improve barriers.

Focus is on what goes right. Use that to

understand normal performance, to do

better and to be safer.

Safety = Reduced number of adverse events.

10-4 := 1 failure in 10.000 events

1 - 10-4 := 9.999 non-failures in 10.000 events

Safety and core business help each other.

Learning uses most of the data available

Safety and core business compete for resources. Learning only uses a fraction of the data available

Safety = Ability to succeed under varying

conditions.

© Erik Hollnagel, 2010

Low Low Low Moderate

Risk profile (= lack of safety)

Very critical

Critical

Marginal

Negligible

Cons

eque

nce

ProbabilityCertainLikelyPossibleUnlikelyRare

Low Low Moderate High High

Moderate Moderate High High Extreme

High Extreme Extreme ExtremeHigh

Moderate

© Erik Hollnagel, 2010

Low

Benefit profile (= safety)

Innovative

Effective

Acceptable

Negligible

Cons

eque

nce

ProbabilityRare

Low

Moderate

High

Low

Unlikely

Low

Moderate

High

Possible

Very high

High

Low

Moderate

Moderate

Likely

High

Very high

High

Moderate

Certain

High

Very high

Very high

© Erik Hollnagel, 2010

Normal outcomes(things that go

right)

Serendipity

Very highPredictability

Range of event outcomes

Outcome

Very low

Posi

tive

Nega

tiv e

Neut

ral

Disasters

Near misses

Accidents

Incidents

Good luck

Mishaps

© Erik Hollnagel, 2010

Normal outcomes(things that go

right)

Serendipity

Very high Predictability

Frequency of event outcomes

Outcome

Very low

Posi

tive

Nega

tiv e

Neut

ral

Disasters

Accidents

Good luck

Mishaps

IncidentsNear misses

102

104

106

© Erik Hollnagel, 2010

Normal outcomes(things that go

right)

Serendipity

Very high Predictability

Being safe versus being unsafe

Outcome

Very low

Posi

tive

Nega

tiv e

Neut

ral

Disasters

Accidents

Good luck

Mishaps

IncidentsNear misses

102

104

106

Safe Safe FunctioningFunctioning

(invisible)(invisible)

Unsafe Unsafe FunctioningFunctioning

(visible)(visible)

© Erik Hollnagel, 2010

Consequence: Accidents are prevented by monitoring and damping variability. Safety requires constant ability to anticipate future events.

Non-linear accident model

Assumption:

Hazards-risks:

Functional Resonance Accident Model

C e r t i f i c a t i o nI

P

C

O

R

TF A A

L u b r i c a t i o nI

P

C

O

R

T

M e c h a n i c s

H i g h w o r k l o a d

G r e a s e

M a i n t e n a n c e o v e r s i g h t

I

P

C

O

R

T

I n t e r v a l a p p r o v a l s

H o r i z o n t a l s t a b i l i z e r

m o v e m e n tI

P

C

O

R

TJ a c k s c r e w

u p - d o w n m o v e m e n t

I

P

C

O

R

T

E x p e r t i s e

C o n t r o l l e ds t a b i l i z e r

m o v e m e n t

A ir c r a f t d e s i g n

I

P

C

O

R

T

A i r c r a f t d e s i g n k n o w l e d g e

A i r c r a f t p i t c h c o n t r o l

I

P

C

O

R

T

L i m i t i n g s t a b i l i z e r

m o v e m e n tI

P

C

O

R

T

L i m i t e ds t a b i l i z e r

m o v e m e n t

A i r c r a f t

L u b r i c a t i o n

E n d - p l a y c h e c k i n g

I

P

C

O

R

T

A l l o w a b l ee n d - p l a y

J a c k s c r e w r e p l a c e m e n tI

P

C

O

R

T

E x c e s s i v ee n d - p l a y

H i g h w o r k l o a d

E q u i p m e n t E x p e r t i s e

I n t e r v a l a p p r o v a l s

R e d u n d a n td e s i g n

P r o c e d u r e s

P r o c e d u r e s

* ETTO = Efficiency-Thoroughness Trade-Off

Accidents result from unexpected combinations (resonance) of variability of normal performance.

Emerge from combinations of normal variability (socio-technical system), hence looking for ETTO* and sacrificing decision

© Erik Hollnagel, 2010

Risks as non-linear combinations

CertificationI

P

C

O

R

TFAA

LubricationI

P

C

O

R

T

Mechanics

High workload

Grease

Maintenance oversight

I

P

C

O

R

T

Interval approvals

Horizontal stabilizer

movementI

P

C

O

R

TJackscrew up-down

movementI

P

C

O

R

T

Expertise

Controlledstabilizer

movement

Aircraft design

I

P

C

O

R

T

Aircraft design knowledge

Aircraft pitch control

I

P

C

O

R

T

Limiting stabilizer

movementI

P

C

O

R

T

Limitedstabilizer

movement

Aircraft

Lubrication

End-play checking

I

P

C

O

R

T

Allowableend-play

Jackscrew replacementI

P

C

O

R

T

Excessiveend-play

High workload

Equipment Expertise

Interval approvals

Redundantdesign

Procedures

Procedures

Systems at risk are intractable rather than tractable.

The established assumptions therefore have to be revised

CertificationI

P

C

O

R

TFAA

LubricationI

P

C

O

R

T

Mechanics

High workload

Grease

Maintenance oversight

I

P

C

O

R

T

Interval approvals

Horizontal stabilizer

movementI

P

C

O

R

TJackscrew up-down

movementI

P

C

O

R

T

Expertise

Controlledstabilizer

movement

Aircraft design

I

P

C

O

R

T

Aircraft design knowledge

Aircraft pitch control

I

P

C

O

R

T

Limiting stabilizer

movementI

P

C

O

R

T

Limitedstabilizer

movement

Aircraft

Lubrication

End-play checking

I

P

C

O

R

T

Allowableend-play

Jackscrew replacementI

P

C

O

R

T

Excessiveend-play

High workload

Equipment Expertise

Interval approvals

Redundantdesign

Procedures

Procedures

Unexpected combinations (resonance) of variability of

normal performance.

If accidents happen like

this ...

... then risks can be found

like this ...

Unexpected combinations (resonance) of variability of

normal performance.

© Erik Hollnagel, 2010

Dams Power grids

NPPs

Space missions

Nano-tech

Chemical plants

Marine transport

Military adventures

Assembly lines

Post offices

UncertaintyLow (tractable)

High (intractable)

Mining

Air traffic control

Universities

Public servicesManufacturing

Railways

Healthcare

Coupling

Tig

htLo

ose

Emergency department

Integrated operations

R&D companies

Hazard

Hig

hLo

w

Methods and realityFinancial markets

LSCITS

Simple linear

Focus on technology

FMECA

Fault tree

HAZOP

FMEA

© Erik Hollnagel, 2010

Dams Power grids

NPPs

Space missions

Nano-tech

Chemical plants

Marine transport

Military adventures

Assembly lines

Post offices

UncertaintyLow (tractable)

High (intractable)

Mining

Air traffic control

Universities

Public servicesManufacturing

Railways

Healthcare

Coupling

Tig

htLo

ose

Emergency department

Integrated operations

R&D companies

Hazard

Hig

hLo

w

Methods and realityFinancial markets

LSCITS

Simple linear

Complex linear

Focus on technology

Focus on Human Factors

FMECA

Fault tree

HAZOP

FMEAHERA

AEB

RCA

ATHEANA

HEATHCR

HPES

THERPCSNI

TRACEr

© Erik Hollnagel, 2010

Dams Power grids

NPPs

Space missions

Nano-tech

Chemical plants

Marine transport

Military adventures

Assembly lines

Post offices

UncertaintyLow (tractable)

High (intractable)

Mining

Air traffic control

Universities

Public servicesManufacturing

Railways

Healthcare

Coupling

Tig

htLo

ose

Emergency department

Integrated operations

R&D companies

Hazard

Hig

hLo

w

Hazard and uncertaintyFinancial markets

LSCITS

Simple linear

Complex linear

Complex linear

Focus on technology

Focus on Human Factors

Focus on organisations

Non-linear

Focus on normal functions

FMECA

Fault tree

HAZOP

FMEAHERA

AEB

RCA

ATHEANA

HEATHCR

HPES

THERPCSNI

TRACEr

STEP

MORT

TRIPOD

MTO

Swiss cheese

AcciMap

CREAM STAMP

MERMOS

FRAMRE

© Erik Hollnagel, 2010

All outcomes (positive and negative) are due to

performance variability..

From the negative to the positive

Negative outcomes are caused by failures and

malfunctions.

Safety = Reduced number of adverse

events.

Eliminate failures and malfunctions as far

as possible.

Safety = Ability to respond when

something fails.

Improve ability to respond to adverse

events.

Safety = Ability to succeed under varying

conditions.

Improve resilience.

© Erik Hollnagel, 2010

What is safety?

Particularistic (product view)Risk should be As Low As Reasonably Practicable

(ALARP)

The ability to succeed under varying conditions (respond, monitor, anticipate, learn)

The reduction of unnecessary harm to an acceptable

minimum

Systemic (process view)Safety should be As High As

Reasonably Practicable (AHARP)

© Erik Hollnagel, 2010

Conclusions so far ...

Complex socio-technical systems can only function if performance is adjusted to conditions (ETTO)

Performance variability is the reason why things go right, but also the reason why things sometimes go wrong.

We need to understand how things go right before we can understand how they go wrong.

Resilience engineering is about how we can ensure that systems remain productive and safe in expected and unexpected conditions alike.