Living with Failure-Finding -...


  • Living with Failure-Finding
    RCM Notes series

    Dr Mark Horton, Numeratis.com, March 2011

    1 Introduction

    This paper provides some background notes to the simple failure-finding interval formulae used in RCM analysis. These notes are primarily intended to support you in your role as a trainer, and you should generally not use them as part of a training course.

    Very few RCM group members ever question the basis of the failure-finding formulae. Of those who do, most do not have a strong mathematical background, and the rigorous mathematical derivations are more likely to frighten than to enlighten them. Each derivation is therefore split into two parts: one is the formal derivation of the results; the other is a set of intuitive arguments that you may find more useful in explaining the principles. The mathematical section number has a suffix "M"; the conceptual section a suffix "C". If you ever encounter a real statistician among your trainees, you might like to give him or her a few hints and leave the derivations as an exercise...

    2 Assumptions

    Whether you go for the intuitive methods or mathematical rigour, you should be familiar with the assumptions below, which apply to all the formulae in this note.

    • The failures of the protective device and of the protective system occur at random (both are pattern E in Nowlan and Heap terms)

    • The failure-finding interval is much less than the mean time between failures of the protective device (preferably less than about 5% of Mdev, certainly less than 10% of Mdev)

    • The failure-finding interval is much less than the mean time between failures of the protected function

    It is possible to derive failure-finding formulae for more general cases, but they can be horribly complicated and results can often only be produced by numerical methods on a computer.

    3 Mathematical Notation

    This note uses the following mathematical notation.

    A       Availability
    u       Unavailability
    R(t)    Survival function
    F(t)    Probability that the system has failed at time t
    Tff     Failure-finding interval
    λ       Rate at which an individual protective device fails in such a way that it does not provide the required protection (λ = 1/Mdev)
    µ       Demand rate on the protective system (µ = 1/Mdem)
    L       Rate of multiple failures (L = 1/Mmf)
    n       The number of parallel independent protective devices making up a protective system
    Mdem    The mean time between demands on the protective system
    Mdev    The mean time between failures of individual protective devices

    4 The Basics C

    If a protective device fails at random, then by definition the chance of failure at any time is exactly the same as at any other time. This means that the instantaneous conditional probability of failure, generally better known as the hazard rate, is flat (Nowlan and Heap pattern E below).

  • RCM NOTES

    2 Living with Failure-Finding Copyright © 2011-2012 numeratis.com

    As shown in reliability books, if the hazard rate is constant, then the chance that the device is still working at some time in the future follows a negative exponential curve (below).

    This is the curve we are interested in, but not quite in this form. If we express it slightly differently, we can show the chance that the device is in a failed state (i.e. not working) at any time.

    Although the full curve has to be represented by an exponential function, the first part of it, up to about 10% of Mdev, can be approximated well by a straight line: this is the linear approximation to an exponential survival curve.

    The relationship between time and the chance that the device is in a failed state is given by F = t/Mdev over this interval.
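    To get a feel for how good the straight line is, the sketch below compares the exact exponential value with F = t/Mdev at a few fractions of Mdev. The function names and the value Mdev = 100 are illustrative assumptions, not part of the original note.

```python
import math

def failed_exact(t, m_dev):
    """Exact probability that a randomly failing device has failed by time t."""
    return 1.0 - math.exp(-t / m_dev)

def failed_linear(t, m_dev):
    """Linear approximation F = t / Mdev."""
    return t / m_dev

m_dev = 100.0  # hypothetical mean time between failures (any time unit)
for fraction in (0.01, 0.05, 0.10, 0.50):
    t = fraction * m_dev
    exact = failed_exact(t, m_dev)
    approx = failed_linear(t, m_dev)
    rel_error = (approx - exact) / exact
    print(f"t = {fraction:4.0%} of Mdev: exact {exact:.4f}, "
          f"linear {approx:.4f}, relative error {rel_error:.1%}")
```

    With these numbers the straight line overstates the failure probability by roughly 2.5% at 5% of Mdev and roughly 5% at 10% of Mdev, and the error grows quickly beyond that, which is consistent with the limits quoted in the assumptions.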

    4 The Basics M

    If a device fails at random with rate λ, then provided that we are certain that the device is functional at time t = 0, the probability that it will operate at time t > 0 is given by the survival curve R(t):

    $$R(t) = e^{-\lambda t}$$

    The instantaneous unavailability of the protective device is

    $$F(t) = 1 - R(t) = 1 - e^{-\lambda t}$$

    These relationships are explained for general hazard rates in any book on reliability theory.

    If  λ  


    5 That Factor of Two M

    If a device is restored to working condition at regular intervals T, the average unavailability of the device over that interval is

    $$u(T) = \frac{1}{T} \int_{0}^{T} \left(1 - e^{-\lambda t}\right) dt$$

    Under the approximations listed at the start of this document, the average unavailability of the protective device is

    $$u(T) = \frac{\lambda T}{2} = \frac{T}{2M_{dev}}$$
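    The factor of two can be checked numerically. The sketch below compares the exact interval average of 1 − exp(−t/Mdev) with the approximation Tff/(2 Mdev). The function names and the values Tff = 10 and Mdev = 100 are assumptions chosen for illustration.

```python
import math

def average_unavailability_exact(t_ff, m_dev):
    """Exact mean of 1 - exp(-t/Mdev) over one test interval [0, Tff]."""
    x = t_ff / m_dev
    return 1.0 - (1.0 - math.exp(-x)) / x

def average_unavailability_linear(t_ff, m_dev):
    """Approximate average unavailability Tff / (2 * Mdev)."""
    return t_ff / (2.0 * m_dev)

t_ff, m_dev = 10.0, 100.0  # hypothetical values: Tff is 10% of Mdev
exact = average_unavailability_exact(t_ff, m_dev)
approx = average_unavailability_linear(t_ff, m_dev)
end_of_interval = t_ff / m_dev  # linear failure probability just before the test
print(f"exact average  {exact:.5f}")
print(f"approx average {approx:.5f}")
print(f"end-of-interval value is {end_of_interval / approx:.1f} times the average")
```

    The average sits at half the end-of-interval value because, under the linear approximation, the failure probability rises in a straight line from zero just after a test to Tff/Mdev just before the next one.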

     

    6 Parallel Devices C

    So far we have been concerned with a single protective device. This section deals with two parallel redundant devices, where either device is able to respond fully to the demand.

    A simple (but incorrect) treatment of two parallel devices could go like this. For a short time this “conceptual” section is going to become a little mathematical.

    The average unavailability of a single protective device that fails with mean time between failures Mdev and which is tested at equal time intervals Tff is

    $$u(T_{ff}) = \frac{T_{ff}}{2M_{dev}}$$

    If there are two parallel devices, the protection is only completely unavailable if both devices have failed; so the unavailability we would expect is

    $$u(T_{ff}) = \frac{T_{ff}}{2M_{dev}} \cdot \frac{T_{ff}}{2M_{dev}} = \frac{T_{ff}^{2}}{4M_{dev}^{2}}$$

    As we will see, this is not the right answer. The rest of this section explains why it is wrong.

    Imagine that we have two parallel protective devices and decide that we will check each at an interval given by Tff. We will not check them both at the same time, but we will check one device at time zero, then the second at time Tff/2, the first again at Tff and so on.

    What have we achieved by staggering the tests? Remember that the chance of a protective device being in a failed state increases with time during the failure-finding interval. By staggering the tests, the period of high failure probability for one device corresponds to the period of low probability for the other, and vice versa.

    Compare this with the situation where both devices are tested at the same time, shown below.


    Now both devices "get old together": in other words, the areas of high failure probability now coincide. Therefore checking several parallel redundant devices at the same time results in a lower overall availability than the alternative strategy of staggering the tests. Since a fixed failure-finding interval gives a lower availability if the devices are tested at the same time, then for a given required availability we must check the devices more often if the tests are carried out at the same time. The simple approach that introduced this section goes one step further by assuming that each device is tested at an average interval of Tff, but that the actual time of any test is decided at random. If you ever see a maintenance management system which supports this type of scheduling, give me a call!
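    For a trainee who wants one line of arithmetic, the gap can be shown as follows. This is an illustrative aside rather than part of the original argument: it assumes two devices tested together at interval Tff and uses the linear approximation, so each device has failure probability t/Mdev at time t after the test. Multiplying the two average unavailabilities is not the same as averaging the product:

```latex
\begin{align*}
\text{product of the averages:}\quad
  & \left( \frac{1}{T_{ff}} \int_{0}^{T_{ff}} \frac{t}{M_{dev}} \, dt \right)^{2}
    = \frac{T_{ff}^{2}}{4 M_{dev}^{2}} \\
\text{average of the product:}\quad
  & \frac{1}{T_{ff}} \int_{0}^{T_{ff}} \left( \frac{t}{M_{dev}} \right)^{2} dt
    = \frac{T_{ff}^{2}}{3 M_{dev}^{2}}
\end{align*}
```

    The second figure is larger because both devices are at their most likely to be failed at the same moment, just before the joint test.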

    6 Parallel Devices M

    These systems consist of several identical parallel protective devices, any of which alone can provide full protection when a demand is placed on the system.

    A failure-finding task normally tests all the devices at the same time; any that are not working are repaired or replaced. Notice that it is important here to test the individual devices, and not just the overall function of the system; otherwise failed devices could be missed and the actual availability of the system could be far lower than expected.

    The instantaneous probability that the whole protective system is disabled (unavailable) at a time t after the last test is

    $$u(t) = \left(1 - e^{-\lambda t}\right)^{n}$$

    where n is the number of parallel protective devices employed. The average unavailability over the failure-finding interval Tff is

    $$u(T_{ff}) = \frac{1}{T_{ff}} \int_{0}^{T_{ff}} \left(1 - e^{-\lambda t}\right)^{n} dt$$

    Under the approximations stated at the start of this document, this becomes

    $$u(T_{ff}) = \frac{(\lambda T_{ff})^{n}}{n + 1}$$
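    The approximation (λTff)^n/(n + 1) can be compared with a direct numerical average of (1 − exp(−λt))^n. The sketch below uses a simple midpoint rule; the function names, the step count and the values Tff = 5 and Mdev = 100 are assumptions made for illustration only.

```python
import math

def system_unavailability_exact(t_ff, m_dev, n, steps=10_000):
    """Midpoint-rule average of (1 - exp(-t/Mdev))**n over [0, Tff]."""
    lam = 1.0 / m_dev
    dt = t_ff / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt  # midpoint of the i-th slice
        total += (1.0 - math.exp(-lam * t)) ** n
    return total * dt / t_ff

def system_unavailability_approx(t_ff, m_dev, n):
    """Approximate average unavailability (lambda * Tff)**n / (n + 1)."""
    return (t_ff / m_dev) ** n / (n + 1)

t_ff, m_dev = 5.0, 100.0  # hypothetical values: Tff is 5% of Mdev
for n in (1, 2, 3):
    exact = system_unavailability_exact(t_ff, m_dev, n)
    approx = system_unavailability_approx(t_ff, m_dev, n)
    print(f"n = {n}: exact {exact:.3e}, approx {approx:.3e}, "
          f"ratio {approx / exact:.3f}")
```

    The approximation errs on the pessimistic side (it slightly overstates the unavailability), and the overstatement grows with n, so the 5% guideline matters more as devices are added.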

    As in the section above, this represents the average unavailability over time. The instantaneous availability of the protective system is higher than the average availability at the start of the period, but lower at the end. The rise in unavailability is nonlinear: quadratic, cubic and so on, depending on the number of parallel devices. If the failure-finding interval is lengthened, the unavailability (and hence the potential multiple failure rate) increases as the nth power of the testing interval.

    The average multiple failure rate L is given by

    $$L = \frac{\mu (\lambda T_{ff})^{n}}{n + 1}$$

    So the failure-finding interval Tff for a given target multiple failure rate L is

    $$T_{ff} = \frac{1}{\lambda} \left( \frac{(n + 1) L}{\mu} \right)^{1/n}$$

    which translates into the following in terms of the device mean time between failures and the mean time between demands.

    $$T_{ff} = M_{dev} \left( \frac{(n + 1) M_{dem}}{M_{mf}} \right)^{1/n}$$
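    As a worked illustration, the formula Tff = Mdev ((n + 1) Mdem / Mmf)^(1/n) is easy to evaluate directly. The function name and the input values below (Mdev = 10 years, Mdem = 100 years, Mmf = 100,000 years) are hypothetical and chosen only to show the arithmetic.

```python
def ffi_simultaneous(m_dev, m_dem, m_mf, n):
    """Failure-finding interval for n parallel devices tested at the same time.

    Implements Tff = Mdev * ((n + 1) * Mdem / Mmf) ** (1 / n).
    All three mean times must be in the same unit; the result is in that unit.
    """
    return m_dev * ((n + 1) * m_dem / m_mf) ** (1.0 / n)

# Hypothetical inputs, all in years.
m_dev, m_dem, m_mf = 10.0, 100.0, 100_000.0
for n in (1, 2):
    t_ff = ffi_simultaneous(m_dev, m_dem, m_mf, n)
    print(f"n = {n}: Tff = {t_ff:.4f} years "
          f"({t_ff / m_dev:.1%} of Mdev)")
```

    With these inputs a single device needs testing about every 0.02 years, while two parallel devices tested together need testing about every 0.55 years. The second figure is a little over 5% of Mdev, so it is close to the limit at which the underlying approximations are comfortable.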

    As discussed in the previous section, the availability achieved is actually lower than if the tests were staggered or completely unrelated.

    As an extreme example, suppose that the individual devices are tested at random with no relationship between the test times. The average unavailability of one device tested at interval Tff is


    $$u(T_{ff}) = \frac{\lambda T_{ff}}{2}$$

    The average unavailability of the protective system as a whole is found by multiplying the individual unavailability figures:

    $$u(T_{ff}) = \frac{\lambda^{n} T_{ff}^{n}}{2^{n}}$$

    The rate of multiple failures is therefore

    $$L = \frac{\mu \lambda^{n} T_{ff}^{n}}{2^{n}}$$

    So if we specify the acceptable rate of multiple failures and we know the mean time between failures of the protective device and the demand rate on the protective system, the required failure-finding interval is

    $$T_{ff} = \frac{2}{\lambda} \left( \frac{L}{\mu} \right)^{1/n}$$

    Expressed in more familiar terms, this becomes

    $$T_{ff} = 2 M_{dev} \left( \frac{M_{dem}}{M_{mf}} \right)^{1/n}$$
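    To see how much the random-testing assumption flatters the result, the sketch below evaluates Tff = 2 Mdev (Mdem/Mmf)^(1/n) alongside the simultaneous-testing formula Tff = Mdev ((n + 1) Mdem/Mmf)^(1/n). The function names and input values are hypothetical.

```python
def ffi_random(m_dev, m_dem, m_mf, n):
    """Interval if the n devices are tested at unrelated, random times."""
    return 2.0 * m_dev * (m_dem / m_mf) ** (1.0 / n)

def ffi_simultaneous(m_dev, m_dem, m_mf, n):
    """Interval if all n devices are tested at the same time."""
    return m_dev * ((n + 1) * m_dem / m_mf) ** (1.0 / n)

# Hypothetical inputs, all in years.
m_dev, m_dem, m_mf = 10.0, 100.0, 100_000.0
for n in (1, 2, 3):
    t_random = ffi_random(m_dev, m_dem, m_mf, n)
    t_simultaneous = ffi_simultaneous(m_dev, m_dem, m_mf, n)
    print(f"n = {n}: random {t_random:.4f}, simultaneous {t_simultaneous:.4f}, "
          f"ratio {t_random / t_simultaneous:.3f}")
```

    The ratio of the two intervals is 2/(n + 1)^(1/n), which does not depend on the mean times: the two formulae agree for a single device, and the random-testing formula allows an interval about 15% longer for two devices and about 26% longer for three.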

    In a more realistic example, suppose that the protective system consists of two parallel devices. If they were tested at the same time, at intervals Tff, the average unavailability would be

    $$u(T_{ff}) = \frac{\lambda^{2} T_{ff}^{2}}{3}$$

    Compare this with the situation where the devices are checked at the same interval, but the checks are offset by half of the failure-finding interval. So device 1 is tested at time 0, then device 2 at time Tff/2, device 1 again at Tff, device 2 at 3Tff/2 and so on.

    Using the linear approximation to the survival curve, the probability that both devices are in a failed state at time t between 0 and Tff is

    $$u(t) = \lambda t \cdot \lambda \left( t + \frac{T_{ff}}{2} \right) \quad \text{for } 0 \le t < T_{ff}/2$$

    and

    $$u(t) = \lambda t \cdot \lambda \left( t - \frac{T_{ff}}{2} \right) \quad \text{for } T_{ff}/2 < t \le T_{ff}$$

    Integrating over the interval from 0 to Tff and dividing by Tff, the average unavailability is

    $$u(T_{ff}) = \lambda^{2} \left( \frac{T_{ff}^{2}}{3} - \frac{T_{ff}^{2}}{8} \right) = \frac{5 \lambda^{2} T_{ff}^{2}}{24}$$

    which is 37.5% less than the unavailability for simultaneous testing.
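    The 37.5% figure can be confirmed by averaging the two piecewise expressions numerically. The sketch below is only a check under the same linear approximation; the function names, the midpoint-rule integration and the values λ = 0.01 and Tff = 10 are assumptions.

```python
def both_failed_staggered(t, t_ff, lam):
    """Probability that both devices are failed at time t (linear approximation).

    Device 1 is tested at 0 and Tff; device 2 is tested at Tff/2.
    """
    age_1 = t
    # Device 2 was last tested at -Tff/2 before the offset test, at Tff/2 after it.
    age_2 = t + t_ff / 2 if t < t_ff / 2 else t - t_ff / 2
    return (lam * age_1) * (lam * age_2)

def both_failed_simultaneous(t, t_ff, lam):
    """Both devices tested together at 0 and Tff, so both have age t."""
    return (lam * t) ** 2

def interval_average(func, t_ff, lam, steps=10_000):
    """Midpoint-rule average of func over one interval [0, Tff]."""
    dt = t_ff / steps
    total = sum(func((i + 0.5) * dt, t_ff, lam) for i in range(steps))
    return total * dt / t_ff

lam = 0.01   # hypothetical failure rate (Mdev = 100 time units)
t_ff = 10.0  # hypothetical failure-finding interval
u_sim = interval_average(both_failed_simultaneous, t_ff, lam)
u_stag = interval_average(both_failed_staggered, t_ff, lam)
reduction = 1.0 - u_stag / u_sim
print(f"simultaneous {u_sim:.6f}, staggered {u_stag:.6f}, "
      f"reduction {reduction:.1%}")
```

    The reduction does not depend on the particular values of λ and Tff, since both averages scale with λ²Tff².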

    Terms of use and Copyright

    Neither the author nor the publisher accepts any responsibility for the application of the information and techniques presented in this document, nor for any errors or omissions. The reader should satisfy himself or herself of the correctness and applicability of the techniques described in this document, and bears full responsibility for the consequences of any application.

    Copyright © 2011-2012 numeratis.com. Licensed for personal use only under a Creative Commons Attribution-Noncommercial-No Derivatives 3.0 Unported Licence. You may use this work for non-commercial purposes only. You may copy and distribute this work in its entirety provided that it is attributed to the author in the same way as in the original document and includes the original Terms of Use and Copyright statements. You may not create derivative works based on this work. You may not copy or use the images within this work except when copying or distributing the entire work.