
Living with Failure-Finding

RCM Notes series

Dr Mark Horton, Numeratis.com, March 2011

1 Introduction

This paper provides some background notes to the simple failure-finding interval formulae used in RCM analysis. These notes are primarily intended to support you in your role as a trainer and you should generally not use them as part of a training course.

Very few RCM group members ever question the basis of the failure-finding formulae. Of those who do, most do not have a strong mathematical background and the rigorous mathematical derivations are more likely to frighten than to enlighten them. Each derivation is therefore split into two parts: one is the formal derivation of the results; the other is a set of intuitive arguments that you may find more useful in explaining the principles. The mathematical section number has a suffix "M"; the conceptual section a suffix "C". If you ever encounter a real statistician among your trainees, you might like to give him or her a few hints and leave the derivations as an exercise...

2 Assumptions

Whether you go for the intuitive methods or mathematical rigour, you should be familiar with the assumptions below, which apply to all the formulae in this note.

• The failures of the protective device and of the protective system occur at random (both are pattern E in Nowlan and Heap terms)

• The failure-finding interval is much less than the mean time between failures of the protective device (preferably less than about 5% of Mdev, certainly less than 10% of Mdev)

• The failure-finding interval is much less than the mean time between failures of the protected function

It is possible to derive failure-finding formulae for more general cases, but they can be horribly complicated and results can often only be produced by numerical methods on a computer.

3 Mathematical Notation

This note uses the following mathematical notation.

A The availability

u The unavailability

R(t) The survival function

F(t) The probability that the system has failed at time t

Tff The failure-finding interval

λ The rate at which an individual protective device fails in such a way that it does not provide the required protection (λ = 1/Mdev)

µ The demand rate on the protective system (µ = 1/Mdem)

L The rate of multiple failures (L = 1/Mmf)

n The number of parallel independent protective devices making up a protective system

Mdem The mean time between demands on the protective system

Mdev The mean time between failures of individual protective devices

4 The Basics C

If a protective device fails at random, then we mean by definition that the chance of failure at any time is exactly the same as at any other time. This means that the instantaneous conditional probability of failure, generally better known as the hazard rate, is flat (Nowlan and Heap pattern E below).

RCM NOTES

2 Living with Failure-Finding Copyright © 2011-2012 numeratis.com

As shown in reliability books, if the hazard rate is constant, then the chance that the device is still working at some time in the future follows a negative exponential curve (below).

This is the curve we are interested in, but not quite in this form. If we express it slightly differently, we can show the chance that the device is in a failed state (i.e. not working) at any time.

Although the full curve has to be represented by an exponential function, the first part of it up to about 10% of Mdev can be approximated well by a straight line: this is the linear approximation to an exponential survival curve.

The relationship between time and the chance that the device is in a failed state is given by F = t/Mdev over this interval.
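As a rough check on how far the straight line can be trusted, the Python sketch below compares the exact exponential expression with F = t/Mdev. The function names and the value Mdev = 100 are illustrative assumptions and do not come from the original notes.

```python
import math

def failed_exact(t, m_dev):
    """Exact probability that the device is in a failed state at time t."""
    return 1.0 - math.exp(-t / m_dev)

def failed_linear(t, m_dev):
    """Linear approximation F = t / Mdev, intended for t much less than Mdev."""
    return t / m_dev

m_dev = 100.0  # assumed mean time between failures (any time unit)
for fraction in (0.01, 0.05, 0.10, 0.50):
    t = fraction * m_dev
    exact = failed_exact(t, m_dev)
    linear = failed_linear(t, m_dev)
    rel_error = (linear - exact) / exact  # positive: the line overestimates
    print(f"t = {fraction:4.0%} of Mdev: exact = {exact:.4f}, "
          f"linear = {linear:.4f}, error = {rel_error:.1%}")
```

Under these assumptions the straight line overstates the failed probability by roughly 2.5% at 5% of Mdev and roughly 5% at 10% of Mdev, and the error grows quickly beyond that point.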

4 The Basics M

If a device fails at random with a constant rate λ, then provided that we are certain that the device is functional at time t = 0, the probability that it will still operate at time t > 0 is given by the survival curve R(t):

$$R(t) = e^{-\lambda t}$$

The instantaneous unavailability of the protective device is

$$F(t) = 1 - R(t) = 1 - e^{-\lambda t}$$

These relationships are explained for general hazard rates in any book on reliability theory.

If λ


5 That Factor of Two M

If a device is restored to working condition at regular intervals T, the average unavailability of the device over that interval is

$$u(T) = \frac{1}{T} \int_0^{T} (1 - e^{-\lambda t}) \, dt$$

Under the approximations listed at the start of this document, the average unavailability of the protective device is

$$u(T) \approx \frac{\lambda T}{2} = \frac{T}{2M_{dev}}$$
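The factor of two can be checked numerically. The sketch below is an illustration only: it compares the exact average of 1 − exp(−t/Mdev) over one test interval with the approximation Tff/(2Mdev). The function names and the value Mdev = 100 are assumptions made for the example.

```python
import math

def average_unavailability_exact(t_ff, m_dev):
    """Average of 1 - exp(-t/Mdev) over one test interval of length Tff."""
    x = t_ff / m_dev
    # Closed form of (1/Tff) * integral from 0 to Tff of (1 - exp(-t/Mdev)) dt
    return 1.0 - (1.0 - math.exp(-x)) / x

def average_unavailability_linear(t_ff, m_dev):
    """Linear approximation: half the failed probability reached at Tff."""
    return t_ff / (2.0 * m_dev)

m_dev = 100.0  # assumed value, any time unit
for t_ff in (1.0, 5.0, 10.0):
    exact = average_unavailability_exact(t_ff, m_dev)
    linear = average_unavailability_linear(t_ff, m_dev)
    print(f"Tff = {t_ff:5.1f}: exact = {exact:.5f}, linear = {linear:.5f}")
```

The two figures agree closely while Tff stays below about 5% to 10% of Mdev, which is the range assumed throughout these notes.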

6 Parallel Devices C

So far we have been concerned with a single protective device. This section deals with two parallel redundant devices, where either device is able to respond fully to the demand.

A simple (but incorrect) treatment of two parallel devices could go like this. For a short time this “conceptual” section is going to become a little mathematical.

The average unavailability of a single protective device that fails with mean time between failures Mdev and which is tested at equal time intervals Tff is

$$u(T_{ff}) = \frac{T_{ff}}{2M_{dev}}$$

If there are two parallel devices, the protection is only completely unavailable if both devices have failed; so the unavailability we would expect is

$$u(T_{ff}) = \frac{T_{ff}}{2M_{dev}} \cdot \frac{T_{ff}}{2M_{dev}} = \frac{T_{ff}^2}{4M_{dev}^2}$$

As we will see, this is not the right answer. The rest of this section explains why it is wrong.

Imagine that we have two parallel protective devices and decide that we will check each at an interval given by Tff. We will not check them both at the same time, but we will check one device at time zero, then the second at time Tff/2, the first again at Tff and so on.

What have we achieved by staggering the tests? Remember that the chance of a protective device being in a failed state increases with time during the failure-finding interval. By staggering the tests, the period of high failure probability for one device corresponds to the period of low probability for the other, and vice versa.

Compare this with the situation where both devices are tested at the same time, shown below.


Now both devices "get old together": in other words, the periods of high failure probability coincide. Checking several parallel redundant devices at the same time therefore results in a lower overall availability than the alternative strategy of staggering the tests. It follows that, for a given required availability, we must check the devices more often if the tests are carried out at the same time.

The simple approach that introduced this section goes one step further: it assumes that each device is tested at an average interval of Tff, but that the actual time of any test is decided at random. If you ever see a maintenance management system which supports this type of scheduling, give me a call!

6 Parallel Devices M

These systems consist of several identical parallel protective devices, any of which alone can provide full protection when a demand is placed on the system.

A failure-finding task normally tests all the devices at the same time; any that are not working are repaired or replaced. Notice that it is important here to test the individual devices, and not just to test the overall function of the system; otherwise failed devices could be missed and the actual availability of the system could be far lower than expected.

The instantaneous probability that the whole protective system is disabled (unavailable) at a time t after the last test is

$$u(t) = (1 - e^{-\lambda t})^n$$

where n is the number of parallel protective devices employed. The average unavailability over the failure-finding interval Tff is

$$u(T_{ff}) = \frac{1}{T_{ff}} \int_0^{T_{ff}} (1 - e^{-\lambda t})^n \, dt$$

Under the approximations stated at the start of this document, this becomes

$$u(T_{ff}) = \frac{(\lambda T_{ff})^n}{n + 1}$$

As in the section above, this represents the average unavailability over time. The instantaneous availability of the protective system is higher than the average availability at the start of the period, but lower at the end. The rise in unavailability is nonlinear: quadratic, cubic and so on depending on the number of parallel devices. If the failure-finding interval is lengthened, the unavailability (and hence the potential multiple failure rate) increases as the nth power of the testing interval.
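To see how well the approximation (λTff)^n/(n + 1) tracks the exact average of (1 − exp(−λt))^n over the interval, the following sketch integrates the exact expression with a simple midpoint rule. The function names, the step count and the values λ = 0.01 and Tff = 5 are assumptions chosen for illustration.

```python
import math

def avg_unavailability_numeric(lam, t_ff, n, steps=10_000):
    """Midpoint-rule average of (1 - exp(-lam*t))**n over 0..Tff."""
    dt = t_ff / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt  # midpoint of the i-th slice
        total += (1.0 - math.exp(-lam * t)) ** n
    return total * dt / t_ff

def avg_unavailability_approx(lam, t_ff, n):
    """Small-interval approximation (lam*Tff)**n / (n + 1)."""
    return (lam * t_ff) ** n / (n + 1)

lam, t_ff = 0.01, 5.0  # assumed values, so lam * Tff = 0.05
for n in (1, 2, 3):
    numeric = avg_unavailability_numeric(lam, t_ff, n)
    approx = avg_unavailability_approx(lam, t_ff, n)
    print(f"n = {n}: numeric = {numeric:.3e}, approx = {approx:.3e}, "
          f"ratio = {numeric / approx:.3f}")
```

In this sketch the approximation slightly overstates the unavailability, so an interval calculated from it errs on the cautious side.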

The average multiple failure rate L is given by

$$L = \frac{\mu (\lambda T_{ff})^n}{n + 1}$$

So the failure-finding interval Tff for a given target multiple failure rate L is

$$T_{ff} = \frac{1}{\lambda} \left( \frac{(n + 1) L}{\mu} \right)^{1/n}$$

which translates into the following in terms of the mean time between failures of the device, the mean time between demands and the mean time between multiple failures.

$$T_{ff} = M_{dev} \left( \frac{(n + 1) M_{dem}}{M_{mf}} \right)^{1/n}$$
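As a worked illustration, the interval for simultaneous testing, Tff = Mdev ((n + 1) Mdem / Mmf)^(1/n), can be evaluated with a few lines of Python. The function name and the figures used (Mdev = 10 years, Mdem = 5 years, Mmf = 10,000 years) are hypothetical and serve only to show the arithmetic.

```python
def ffi_simultaneous(m_dev, m_dem, m_mf, n):
    """Failure-finding interval when all n parallel devices are tested together.

    m_dev: mean time between failures of one protective device
    m_dem: mean time between demands on the protective system
    m_mf:  target mean time between multiple failures
    All three must be in the same time unit; the result uses that unit too.
    """
    return m_dev * ((n + 1) * m_dem / m_mf) ** (1.0 / n)

# Hypothetical figures, in years
m_dev, m_dem, m_mf = 10.0, 5.0, 10_000.0
for n in (1, 2):
    t_ff = ffi_simultaneous(m_dev, m_dem, m_mf, n)
    # The formula assumes Tff is well below Mdev (about 5% to 10% at most)
    print(f"n = {n}: Tff = {t_ff:.4f} years ({t_ff / m_dev:.1%} of Mdev)")
```

It is worth checking the printed percentage against the assumptions at the start of this note: if the result is more than about 10% of Mdev, the linear approximation behind the formula no longer holds.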

As discussed in the previous section, the availability achieved is actually lower than if the tests were staggered or completely unrelated.

As an extreme example, suppose that the individual devices are tested at random with no relationship between the test times. The average unavailability of one device tested at interval Tff is


$$u(T_{ff}) = \frac{\lambda T_{ff}}{2}$$

The average unavailability of the protective system as a whole is found by multiplying the individual unavailability figures:

$$u(T_{ff}) = \frac{\lambda^n T_{ff}^n}{2^n}$$

The rate of multiple failures is therefore

$$L = \frac{\mu \lambda^n T_{ff}^n}{2^n}$$

So if we specify the acceptable rate of multiple failures and we know the mean time between failures of the protective device and the demand rate on the protective system, the required failure-finding interval is

$$T_{ff} = \frac{2}{\lambda} \left( \frac{L}{\mu} \right)^{1/n}$$

Expressed in more familiar terms, this becomes

$$T_{ff} = 2 M_{dev} \left( \frac{M_{dem}}{M_{mf}} \right)^{1/n}$$
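The sketch below puts the two interval formulae side by side: Tff = 2 Mdev (Mdem / Mmf)^(1/n) for tests at unrelated random times, and Tff = Mdev ((n + 1) Mdem / Mmf)^(1/n) for simultaneous tests. The function names and the input figures are hypothetical; they are chosen only to show the size of the difference.

```python
def ffi_random(m_dev, m_dem, m_mf, n):
    """Interval when each device is tested at unrelated, random times."""
    return 2.0 * m_dev * (m_dem / m_mf) ** (1.0 / n)

def ffi_simultaneous(m_dev, m_dem, m_mf, n):
    """Interval when all n devices are tested at the same time."""
    return m_dev * ((n + 1) * m_dem / m_mf) ** (1.0 / n)

# Hypothetical figures, in years
m_dev, m_dem, m_mf = 10.0, 5.0, 10_000.0
for n in (1, 2, 3):
    t_random = ffi_random(m_dev, m_dem, m_mf, n)
    t_simul = ffi_simultaneous(m_dev, m_dem, m_mf, n)
    print(f"n = {n}: random = {t_random:.4f}, simultaneous = {t_simul:.4f}, "
          f"ratio = {t_random / t_simul:.3f}")
```

For a single device the two formulae coincide. For two devices the ratio is 2/√3, about 1.15, whatever input figures are used, so the random-test formula permits an interval about 15% longer than simultaneous testing actually justifies.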

In a more realistic example, suppose that the protective system consists of two parallel devices. If they were tested at the same time, at intervals Tff, the average unavailability would be

$$u(T_{ff}) = \frac{\lambda^2 T_{ff}^2}{3}$$

Compare this with the situation where the devices are checked at the same interval, but the checks are offset by half of the failure-finding interval. So device 1 is tested at time 0, then device 2 at time Tff/2, device 1 again at Tff, device 2 at 3Tff/2 and so on.

Using the linear approximation to the survival curve, the probability that both devices are in a failed state at time t between 0 and Tff is

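$$u(t) = \lambda t \cdot \lambda \left( t + \frac{T_{ff}}{2} \right) \quad \text{for } 0 \le t < T_{ff}/2$$

and

$$u(t) = \lambda t \cdot \lambda \left( t - \frac{T_{ff}}{2} \right) \quad \text{for } T_{ff}/2 < t \le T_{ff}$$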

Integrating over the interval from 0 to Tff and dividing by Tff, the average unavailability is

$$u(T_{ff}) = \lambda^2 \left( \frac{T_{ff}^2}{3} - \frac{T_{ff}^2}{8} \right)$$

which is less than the unavailability for simultaneous testing by 37.5%.
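The 37.5% figure can be confirmed numerically. The sketch below averages the product of the two linearised failure probabilities over one interval, first for simultaneous tests and then for tests offset by Tff/2. The function names, the midpoint-rule integration and the values of λ and Tff are assumptions made for the example; the ratio does not depend on them.

```python
def both_failed_simultaneous(t, t_ff, lam):
    """Both devices were last tested at time 0, so both have age t."""
    return (lam * t) ** 2

def both_failed_staggered(t, t_ff, lam):
    """Device 1 was tested at time 0; device 2 is offset by Tff/2."""
    # Before Tff/2, device 2 was last tested half an interval before time 0
    age_2 = t + t_ff / 2 if t < t_ff / 2 else t - t_ff / 2
    return (lam * t) * (lam * age_2)

def average_over_interval(func, t_ff, lam, steps=100_000):
    """Midpoint-rule average of func over 0..Tff."""
    dt = t_ff / steps
    total = sum(func((i + 0.5) * dt, t_ff, lam) for i in range(steps))
    return total * dt / t_ff

lam, t_ff = 0.01, 5.0  # assumed values
u_simultaneous = average_over_interval(both_failed_simultaneous, t_ff, lam)
u_staggered = average_over_interval(both_failed_staggered, t_ff, lam)
reduction = 1.0 - u_staggered / u_simultaneous
print(f"simultaneous = {u_simultaneous:.3e}, staggered = {u_staggered:.3e}, "
      f"reduction = {reduction:.1%}")
```

The staggered average comes out at 5/8 of the simultaneous figure, which is the 37.5% reduction quoted above.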

Terms of use and Copyright

Neither the author nor the publisher accepts any responsibility for the application of the information and techniques presented in this document, nor for any errors or omissions. The reader should satisfy himself or herself of the correctness and applicability of the techniques described in this document, and bears full responsibility for the consequences of any application.

Copyright © 2011-2012 numeratis.com. Licensed for personal use only under a Creative Commons Attribution-Noncommercial-No Derivatives 3.0 Unported Licence. You may use this work for non-commercial purposes only. You may copy and distribute this work in its entirety provided that it is attributed to the author in the same way as in the original document and includes the original Terms of Use and Copyright statements. You may not create derivative works based on this work. You may not copy or use the images within this work except when copying or distributing the entire work.
