Applying Reliability Engineering Techniques


Tutorial Notes 2012 AR&MS

2012 Annual RELIABILITY and MAINTAINABILITY Symposium

Applying Reliability Engineering Techniques &

Best Practices to Achieve Functional Safety

William M. Goble, Ph.D. & Julia V. Bukowski, Ph.D.

William M. Goble, Ph.D., P.E., CSFE

Principal Partner, exida, LLC

61 N. Main Street

Sellersville, PA 18960 USA

Internet (e-mail): [email protected]

Julia V. Bukowski, Ph.D.

Dept of Electrical & Computer Engineering

Villanova University

Villanova, PA 19085 USA

Internet (e-mail): [email protected]



SUMMARY & PURPOSE

The purpose of this tutorial is to introduce the basics of functional safety and illustrate how a variety of conventional

reliability engineering techniques and best practices can be applied to the problem of achieving it. The material presented has

wide applicability in industries as diverse as petro-chemical, nuclear, automotive, pharmaceuticals, railroads, and power

generation, to name a few. The material is relevant to engineers and managers who work in situations which require functional

safety to be achieved and maintained; thus it is equally beneficial for designers of safety systems and products as well as end

users who rely on such safety systems and products. The tutorial assumes no prior knowledge of functional safety and is not

mathematically intense. After completing this tutorial, attendees should be conversant with the basic concepts of functional safety, understand the System and Product Safety Lifecycles, be knowledgeable about which reliability engineering techniques

and best practices to consider applying at various points in the safety lifecycles, and have a broad overview of the IEC 61508

safety standard.

William M. Goble, Ph.D., P.E., CSFE

William M. Goble is currently Principal Partner and co-founder of exida, a product certification and engineering consulting

company focused on automation system safety and reliability. He has over 30 years of experience in electronic design,

software, reliability analysis and management. He has a BSEE from Penn State University, an MSEE from Villanova University and a Ph.D. in Reliability Engineering from Eindhoven University of Technology. He is a registered professional

engineer in the State of Pennsylvania and a Certified Functional Safety Expert (CFSE). He is a fellow member of ISA and

author/co-author of three books.

Julia V. Bukowski, Ph.D.

Julia V. Bukowski recently retired from the Department of Electrical and Computer Engineering at Villanova University

where she was a member of the standing faculty for more than 25 years. She is currently engaged in a variety of research and

consulting activities as well as part-time teaching. She has more than 30 years of experience in the field of reliability and safety.

She received her BSEE and Ph.D. (Systems Engineering) from the University of Pennsylvania, and her DIC in Electronics

Engineering from Imperial College of Science and Technology, University of London. She has been a Fulbright Senior Lecturer

and Visiting Associate Professor with the Faculty of Industrial Engineering and Management at the Technion Israel Institute of

Technology in Haifa, Israel. She is a senior member of the IEEE and has been a guest editor for a special issue of the IEEE

Transactions on Reliability.

Table of Contents

1. Introduction

2. Background

3. Overview of IEC 61508

4. Reliability Engineering Techniques & Best Practices: System Level Application

5. Reliability Engineering Techniques & Best Practices: Product Level Application

6. IEC 61508 Certification

7. Cyber Security

8. Conclusions

9. References

10. Tutorial Visuals



1. INTRODUCTION

Many industries use automatic protection equipment to

safeguard people, property and the environment from

potentially hazardous events. The reliability engineering

techniques for optimal design of such automatic protection

equipment have evolved over the years and international

standards have been written to document best practices. This

area of engineering design is known as functional safety.

Since a best practice for achieving functional safety is

meeting the requirements of an appropriate safety standard,

we will use a well-recognized international safety standard

(IEC 61508 [1]) as a vehicle for discussing techniques and

best practices. This standard is an especially good choice for

several reasons. First, it does not prescribe any specific

techniques or practices which must be used. Rather it allows

the user to choose appropriate techniques and practices

provided the choice can be reasonably justified. Therefore,

many different techniques and practices can be highlighted in

this tutorial. Second, it relies on the concept of Safety

System and Product Lifecycles which permits us to highlight

techniques and practices used throughout the lifecycle

beginning with initial system/product concept through design, implementation, testing, validation, documentation,

commissioning, operations, and finally decommissioning.

Third, the standard covers complete systems as well as

hardware and software components. Various techniques and

practices for each of these areas are presented.

The remainder of this tutorial consists of the following

topics:

1. Background information to introduce basic concepts pertinent to functional safety

2. A broad overview of the IEC 61508 safety standard which is used as a framework for discussing various techniques

and practices

3. Details regarding engineering reliability techniques and best practices to achieve functional safety applied at the

a. system level

b. product level

4. Information on IEC 61508 product certification

5. Highlights of issues regarding cyber security

1.1 Notation and Acronyms

ACOS Advisory Committee of Safety

A/CV actuator and control valve

A/SV actuator and safety valve

BPCS basic process control system

CMMI Capability Maturity Model Integration

E/E/PE electrical/electronic/programmable electronic

EN European norm

DD dangerous detected failure

DU dangerous undetected failure

FD fail dangerous

FMEA failure modes & effects analysis

FMEDA failure modes, effects & diagnostic analysis

FS fail safe

FSM functional safety management

HAZOP hazard and operability study

IEC International Electro-technical Commission

ISCI ISA Security Compliance Institute

L/S logic solver

PES programmable electronic system

PFDavg average probability of failure on demand

PLC programmable logic controller

POS positioner

RR risk reduction

RRF risk reduction factor

RRFa risk reduction factor achieved by SF

RRFr risk reduction factor required of SF

SEN sensor

SF safety function

SFF safe failure fraction

SIL safety integrity level

SRS safety related system

SrS safety requirements specification

S/V solenoid valve

2. BACKGROUND

2.1 Hazards, Risks, & Risk Reduction Factor

IEC 61508 defines a hazard as a potential source of

physical injury, damage to the health of people, or damage to

property or the environment. It also links the hazard to its

potential consequences in order to establish a measure of risk.

Consider, for example, the steam turbine system and a basic

process control system (BPCS) illustrated in Figure 1. The

steam turbine system consists of a valve to control the inlet

steam, a turbine spun by the steam, a shaft turned by the

turbine, and an unseen load on the shaft. The BPCS consists

of a sensor (SEN) to monitor the shaft speed, a logic solver

(L/S) to determine if the shaft speed is appropriate or needs to

be altered, a positioner (POS) and an actuator and control valve (A/CV) to adjust the amount of steam driving the

turbine.

Figure 1 - Steam turbine system with a BPCS.

One can identify a number of hazards for this system; we


consider a few examples and their possible consequences. If

the shaft spins too fast, flying projectiles may result which

represent damage to the turbine system itself, but which may

further cause personnel injury or damage to adjacent

equipment. If the shaft spins too slowly, it will bend under the

load (equipment damage) unless the load is partially or fully

removed. If steam leaks, personnel in the vicinity may sustain

serious injury. A later example illustrates hazards with

environmental consequences as well.

Risk is a quantitative measure which incorporates both the likelihood and the consequences of a hazard, i.e., how

often can a hazard occur and what and how severe will the

consequences be if it does. The impacts of risk include

personnel, environment, equipment/property damage, business

interruption, business liability, and company image.

In performing risk analysis, we need to distinguish

between inherent and tolerable risk. Inherent risk is the risk

posed by the process (including its BPCS) unmitigated by

additional automatic protection equipment, i.e., unmitigated

by a safety function (SF) whose concept is detailed in the next

section. It is impossible and/or impractical to eliminate all

inherent risk. Tolerable risk is risk designated as that

acceptable to management, insurers, regulatory authorities, and the general public.

The concept of risk reduction factor (RRF) has two

distinct but related usages. The first is to specify the

minimum risk reduction required of an implemented SF in

order to decrease the overall risk from its inherent level to its

tolerable level. We designate this required RRF as RRFr

which is defined as

RRFr = inherent risk / tolerable risk. (1)

The second usage is to specify what RRF is achieved by a

particular SF implementation. We designate this achieved

RRF as RRFa which is defined as

RRFa = 1/PFDavg (2)

where PFDavg is the average probability of failure on

demand. Further details regarding PFDavg are described in

the sections following.

In order for an SF to be appropriate for a given

application, RRFa must be greater than or equal to RRFr.
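
Equations (1) and (2) and the adequacy condition can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, risk is assumed to be expressed as an event frequency per year, and the numerical values are invented for the example rather than taken from this tutorial.

```python
def required_rrf(inherent_risk: float, tolerable_risk: float) -> float:
    """Eq. (1): RRFr = inherent risk / tolerable risk (same units for both)."""
    if inherent_risk <= 0 or tolerable_risk <= 0:
        raise ValueError("risk values must be positive")
    return inherent_risk / tolerable_risk


def achieved_rrf(pfd_avg: float) -> float:
    """Eq. (2): RRFa = 1 / PFDavg."""
    if not 0 < pfd_avg <= 1:
        raise ValueError("PFDavg must lie in (0, 1]")
    return 1.0 / pfd_avg


def sf_is_adequate(rrf_a: float, rrf_r: float) -> bool:
    """An SF is appropriate only if RRFa >= RRFr."""
    return rrf_a >= rrf_r


# Hypothetical numbers (assumptions, not from the tutorial):
# inherent risk of 1e-2 events/year, tolerable risk of 1e-4 events/year,
# and an SF implementation with PFDavg = 5e-3.
rrf_r = required_rrf(inherent_risk=1e-2, tolerable_risk=1e-4)
rrf_a = achieved_rrf(pfd_avg=5e-3)
print(f"RRFr = {rrf_r:.0f}, RRFa = {rrf_a:.0f}, adequate = {sf_is_adequate(rrf_a, rrf_r)}")
```

With these assumed values the SF would need to reduce risk by a factor of about 100 and the implementation achieves a factor of about 200, so the condition RRFa >= RRFr is met.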

2.2 Concepts of a Safety Function & Safety Related System

We have already referred to an SF and, here, further

explore this concept. An SF is a collection of sensors, a logic

solver, and final elements used to implement automatic

mitigation of a specific hazard; see Figure 2. Again, consider

the turbine system, now illustrated in Figure 3 with an SF also

present. The SF consists of a speed SEN, an L/S, a solenoid valve (S/V) and an actuator and safety valve (A/SV) which are

referred to as final elements.

At first glance, it may appear that the SF directly

duplicates the function of the BPCS. However, there are

differences. For example, where the BPCS A/CV is designed

to adjust position to allow varying amounts of steam to drive

the turbine, the SF A/SV is designed merely to be either fully

opened (allowing the BPCS A/CV to control the amount of

steam) or fully closed (to deprive the turbine of steam in the

event of an over-speed hazard). The illustrated SF is designed

to protect against a specific hazard. Other SFs may be

required to protect against other hazards.

Figure 2 Illustration of the basic components

of a safety function.

Figure 3 Steam turbine system with a BPCS

and a safety function (SF).

For example, if the shaft spins too slowly, and the BPCS

does not or cannot appropriately compensate, the SF SEN will

measure the shaft speed, the SF L/S will determine that load

shedding is necessary and direct different (unseen) final

elements to perform this task. Thus, in general, a process will

need the additional protection of several SFs which will likely

share a common L/S and which may or may not share certain

sensors and final elements. A collection of SFs designed to

protect a process against several hazards is a safety related

system (SRS). Figure 4 illustrates this concept.

2.3 SF and SRS Failure Modes

SFs and SRSs are needed because we recognize that the process or its BPCS may fail. Clearly, an SF or SRS may also

fail due to the failure of one or more components of the SRS.

To properly analyze the impacts of an SF or SRS failure per

IEC 61508 we must distinguish between different failure

modes.

An SF or SRS is said to fail safe (FS) if, due to failure of

SF or SRS component(s), it erroneously determines that a

hazard exists and inappropriately intervenes in the process,




usually by executing a shutdown of the process that is not, in

fact, required. Safe failures are disruptive to the process but

do not pose any safety risks. On the other hand, an SF or SRS

is said to fail dangerously (FD) if, due to failure of SF or

SRS component(s), it is unable to intervene appropriately if it

is required to do so. Dangerous failures are of paramount

concern from a safety perspective because the RRFa of the SF

is a function of PFDavg, which is the average fraction of time the SF

spends in states of FD, i.e., the average fraction of time the SF is unable

to respond to a hazard (demand).

Figure 4 Illustration of a safety related system.

2.4 Automatic Diagnostics, Detected & Undetected Failures

SFs and SRSs usually have built-in diagnostics to monitor

the health of the safety system and to determine if the SRS has

entered an FD failure mode. The ability to automatically

detect FD failure modes is important because, once detected,

measures can be taken to reduce the amount of time the SF remains in an FD state thereby reducing the PFDavg and

increasing the RRFa.

During design analysis it should be possible to identify all

FD states in an SF. Ideally, then, we would like to design and

implement automatic diagnostics to detect all FD states.

Sometimes, it is not possible to implement an automatic

diagnostic for a particular FD state. This often arises with

mechanical final elements. Furthermore, even if an automatic

diagnostic could be designed for every FD, it is neither

practical nor prudent to implement diagnostics to cover all

possible FD states. Thus, it is normal practice to provide

automatic diagnostic coverage where practical for the most

likely and the most critical failures. Therefore, some SF/SRS

failures will be detected and others will be undetected.

Two strategies for minimizing the time spent in

dangerous detected failure (DD) states are to

1. Automatically convert a DD state to an FS state, or

2. Minimize the repair time needed to leave the DD state

and return the SF to a functioning state.

Undetected dangerous failures (DU), on the other hand,

can only be addressed by periodic manual testing.

Consequently, in most well-designed SRS, the time spent in

states of DU is the principal contributor to PFDavg.
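
To illustrate why DU time tends to dominate, the sketch below uses a commonly cited simplified approximation for a single-channel SF, PFDavg ≈ λDU·TI/2 + λDD·MTTR, which is assumed here and is not given in this tutorial. λDU and λDD are the DU and DD failure rates, TI is the manual proof test interval, and MTTR is the mean time to repair a detected failure. The approximation holds only when λ·TI is much less than 1, and all numbers are hypothetical.

```python
HOURS_PER_YEAR = 8760.0


def pfd_avg_single_channel(lambda_du, lambda_dd, proof_test_interval_h, mttr_h):
    """Simplified PFDavg for one channel (assumed approximation, see lead-in).

    Returns (DU contribution, DD contribution, total).
    """
    # A DU failure stays hidden, on average, for half the proof test interval.
    du_part = lambda_du * proof_test_interval_h / 2.0
    # A DD failure is announced by diagnostics and lasts only for the repair time.
    dd_part = lambda_dd * mttr_h
    return du_part, dd_part, du_part + dd_part


# Hypothetical failure rates (per hour), yearly proof test, 8-hour repair.
du, dd, total = pfd_avg_single_channel(
    lambda_du=1e-6,
    lambda_dd=4e-6,
    proof_test_interval_h=HOURS_PER_YEAR,
    mttr_h=8.0,
)
print(f"DU part = {du:.2e}, DD part = {dd:.2e}, PFDavg = {total:.2e}")
print(f"DU share of PFDavg = {du / total:.1%}")
```

Even though the assumed DD failure rate is four times the DU rate, the DU term is far larger because undetected failures persist until the next manual proof test.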

2.5 Architectures

Classical k-out-of-n models are familiar to reliability

engineers. Thus, in Figure 5 we illustrate the familiar 1-out-

of-2 and 2-out-of-2 reliability models. When reliability is

being analyzed, continuity from input to output is the key.

Safety architectures are different. The de-energize-to-trip

design is the most common safety architecture in which the SF and SRS are designed to deprive the process of energy, i.e., to

shutdown the process, in the event of a hazard. Thus, in safety

models, the key is the ability to interrupt continuity from input

to output. Figure 5 also illustrates two common de-energize-to-trip

safety models. Note how, in these two examples, the

nomenclature for the safety models is the reverse of that for

the comparable reliability models.
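
The reversal of nomenclature can be illustrated numerically. The sketch below is a simplified model under stated assumptions: two identical, independent channels, no common-cause failures, and each channel is either working, failed safe (it trips spuriously) or failed dangerous (it cannot trip). The function names and probabilities are hypothetical. In a de-energize-to-trip design, a 1-out-of-2 safety architecture trips if either channel opens the circuit, which is structurally a series connection, whereas the 1-out-of-2 reliability model is a parallel connection.

```python
def reliability_1oo2(p_work: float) -> float:
    """Parallel structure: continuity exists if at least one path works."""
    return 1.0 - (1.0 - p_work) ** 2


def reliability_2oo2(p_work: float) -> float:
    """Series structure: continuity exists only if both elements work."""
    return p_work ** 2


def safety_1oo2(p_fd: float, p_fs: float) -> dict:
    """Either channel can trip the process (contacts in series)."""
    return {
        "dangerous": p_fd ** 2,                 # both channels unable to trip
        "spurious": 1.0 - (1.0 - p_fs) ** 2,    # at least one false trip
    }


def safety_2oo2(p_fd: float, p_fs: float) -> dict:
    """Both channels must agree to trip (contacts in parallel)."""
    return {
        "dangerous": 1.0 - (1.0 - p_fd) ** 2,   # one stuck channel blocks the trip
        "spurious": p_fs ** 2,                  # both must trip falsely
    }


# Hypothetical per-channel probabilities.
s1 = safety_1oo2(p_fd=0.01, p_fs=0.02)
s2 = safety_2oo2(p_fd=0.01, p_fs=0.02)
print("1oo2 safety:", s1)
print("2oo2 safety:", s2)
print("1oo2 reliability:", reliability_1oo2(0.9), "2oo2 reliability:", reliability_2oo2(0.9))
```

Under these assumptions the 1oo2 safety architecture has the lower probability of dangerous failure but the higher probability of a spurious trip, and the 2oo2 safety architecture shows the opposite trade-off.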

Figure 5 Comparison of reliability and safety models.

3. OVERVIEW OF IEC 61508

3.1 Historical Perspective

IEC 61508 is an international standard for the functional

safety of electric, electronic, and programmable electronic

equipment. Development of this standard began in the mid-

1980s when the International Electro-technical Commission Advisory Committee of Safety (IEC ACOS) set up a task

force to consider standardization issues raised by the use of

programmable electronic systems (PES) in automatic

protection systems. At that time, many regulatory bodies

forbade the use of any software-based equipment in safety

critical applications. Work began within IEC

SC65A/Working Group 10 on a standard for PES used in

SRS. This group merged with Working Group 9 where a

standard on software safety was in progress. The combined

group treated safety as a system issue.

3.2 Structure of the Standard

The complete IEC 61508 standard is divided into seven

parts:

1. General requirements (required for compliance)

2. Requirements for electrical/electronic/programmable

electronic safety-related systems (required for

compliance)

3. Software requirements (required for compliance)

4. Definitions and abbreviations (supporting information)

5. Examples of methods for the determination of safety integrity levels (supporting information)

6. Guidelines on the application of parts 2 and 3 (supporting information)

7. Overview of techniques and measures (supporting information)

Parts 1, 3, 4, and 5 were approved in 1998. Parts 2, 6, and

7 were approved in February 2000.

Parts 1-4 are normative meaning the requirements

(interpreted using the official definitions) must be met for compliance with the standard. Parts 5-7 are informative

meaning that they provide examples, guidelines, techniques

and measures but do not mandate the use of any specific

guidelines, techniques or measures to be in compliance.

The normative parts of the standard comprise nearly 500

pages with thousands of requirements, i.e., sentences

including the term "shall" or "must" which need to be

correctly addressed for compliance with the standard. Broadly

speaking, these requirements fall into one of two groups

which relate directly to the two fundamental concepts of IEC

61508 discussed in Section 3.6 below:

One group of requirements covers the design lifecycle process. This is intended to provide a sufficient level of integrity against systematic failures of the system, i.e.,

fault avoidance.

One group of requirements covers the probabilistic analysis of all hardware involved in any safety function.

This is intended to provide a sufficient level of integrity

against random failures of the system.

3.3 Philosophy and Consequences of the Standard

All of the requirements are intended to help designers

create systems that work correctly (are reliable) or fail in a

predictable (hopefully fail-safe) manner. Most designers

consider the requirements of IEC 61508 to be classical,

common sense practices that come directly from prior quality

standards and general engineering practices.

The standard focuses attention on risk-based safety-

related system design, which should result in higher levels of

safety and far more cost-effective implementation. The

standard also requires the attention to detail that is vital to any

safe system design. Finally, the standard offers flexibility by

not prescribing specific techniques and measures, instead

offering alternatives to achieve compliance. Because of these

features and the large degree of international acceptance for a

single set of documents, many consider the standard to be

a major advance for the technical world.

3.4 Goals of the Standard

IEC 61508 is a basic safety publication of the IEC.

Lacking industry-specific language, it is an umbrella

document covering multiple industries and applications. A

primary goal of the standard is to help individual industries

develop supplemental standards, tailored specifically to those

industries, based on the original 61508 standard. Several such

industry specific standards have now been developed with

more on the way. IEC 61511 [2] has been written for the

process industries. IEC 62061 [3] addresses machinery safety.

IEC 61513 [4] deals with the nuclear industry. There are even

product-specific standards now being released that follow the

framework and the concepts of IEC 61508. One of these is IEC

61800-5-2 [5], Safety Requirements - Functional Safety, for

variable speed motor controllers. All of these standards build

directly on IEC 61508 and reference it accordingly.

A secondary goal of the standard is to enable the

development of electrical/electronic/programmable electronic

(E/E/PE) SRS where specific application sector standards do not already exist.

3.5 Scope of the Standard

Although originally conceived as a standard for E/E/PE

SRS, the IEC 61508 standard covers SRS when one or more

of such systems incorporate mechanical as well as E/E/PE

devices. Thus, these devices can include anything from ball

valves, clutch/brake assemblies, solenoid valves, electrical

relays and switches to complex computerized brake controls

and programmable logic controllers (PLC). The overall

program to ensure that the E/E/PE SRS brings about a safe

state when called upon to do so is defined as functional

safety.

IEC 61508 does not cover safety issues such as electric

shock, hazardous falls, long-term exposure to a toxic

substance, etc.; these issues are covered by other standards.

3.6 Two Fundamental Concepts

The standard is based on two fundamental concepts:

1. The safety lifecycle, a detailed engineering design process, intended to reduce or eliminate failures due to

systematic errors, and

2. Probabilistic failure performance analysis, quantified in order of magnitude levels - called safety integrity levels

(SIL) - intended to address random failures.

3.6.1 Safety Lifecycle

The safety lifecycle is defined as an engineering process

that includes all of the steps necessary to achieve required

functional safety. The safety lifecycle is included in the

standard to provide sufficient protection against systematic

errors, errors resulting in failures that are deterministically

related to a certain cause. Systematic errors are typically

design mistakes.

The basic philosophy of protection behind the safety

lifecycle is to develop and document a safety plan that

includes all engineering activities per the requirements of the

standard, execute that plan and document its execution (to show that the plan has been met). Changes along the way

must similarly follow the pattern of planning, execution,

validation, and documentation. Although the standard is

written in the context of a custom, turnkey system, the

requirements are applicable to general product design and

development.

SIL are order of magnitude levels of RRF. There are four

SIL defined in IEC 61508 as shown in Table 1. SIL1 has the

lowest level of risk reduction (RR); SIL4 has the highest level




of RR.

3.6.2 Probabilistic Failure Performance Analysis

Probabilistic failure performance analysis is the second

fundamental concept. Quantitative RR targets, i.e., RRFr, are

established and failure probability calculations are performed

to verify that each SF design meets its RRFr. This

performance-based approach allows the standard to avoid prescriptive rules for redundancy and self-test capability that

so often become obsolete soon after they are published.

Table 1 Correspondence between SIL and RRF

Safety Integrity Level (SIL)    Risk Reduction Factor (RRF)
SIL 1                           (10, 100]
SIL 2                           (100, 1,000]
SIL 3                           (1,000, 10,000]
SIL 4                           (10,000, 100,000]
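
The bands in Table 1 can be expressed as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, and returning None for values outside the tabulated bands is an assumption of this example, since Table 1 does not say how such values should be treated.

```python
from typing import Optional

# (SIL, exclusive lower bound, inclusive upper bound) taken from Table 1.
SIL_BANDS = [
    (1, 10.0, 100.0),
    (2, 100.0, 1_000.0),
    (3, 1_000.0, 10_000.0),
    (4, 10_000.0, 100_000.0),
]


def sil_for_rrf(rrf: float) -> Optional[int]:
    """Return the SIL whose band (low, high] contains rrf, or None if no band does."""
    for sil, low, high in SIL_BANDS:
        if low < rrf <= high:
            return sil
    return None


for value in (50, 100, 200, 25_000):
    print(value, "->", sil_for_rrf(value))
```

Because each band is open at the lower bound and closed at the upper bound, an RRF of exactly 100 falls in SIL 1, while any value just above 100 falls in SIL 2.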

IEC 61508 recognizes that not all failures are equal. Two

primary failure modes are defined, FS and FD as discussed in

the Background section.

3.6.3 The Standard from Different Viewpoints

Both of the fundamental concepts and supporting

concepts will be dealt with in greater detail later in this

tutorial. However, it is worth noting at this point that from an

installed system level viewpoint, which is usually that of the

owner-operator, the entire safety lifecycle needs to be

addressed for IEC 61508 compliance and the requirements for

this are treated primarily in Part 1 of the standard, although

Parts 2 & 3 apply to hardware and software design issues in

the lifecycle. On the other hand, from the viewpoint of a

manufacturer who is producing a component or system used in

a safety related application, Parts 2 & 3 of the standard are

paramount, though some aspects of Part 1, such as

documentation issues, must still be addressed.

3.7 Compliance with the Standard

The IEC 61508 standard states: "To conform to this

standard it shall be demonstrated that the requirements have

been satisfied to the required criteria specified (e.g., SIL) and

therefore, for each clause or sub-clause, all the objectives have

been met." This is often demonstrated by the use of a Safety

Case.

The Safety Case / Safety Justification methodology

provides a systematic and complete way to show compliance

to one or more functional safety standards. The methodology

was established in industries which deal with functional safety

of computerized automation in nuclear and avionics

applications [6, 7].

For the IEC 61508 standard, all requirements from IEC

61508 have been compiled in a number of industry databases

[8, 9]. Each requirement should be precisely documented

along with the reasoning behind the requirement. Arguments

/ Solutions provide a description of how each requirement is

met by listing design arguments, verification activities and test

cases relevant to that requirement. For full traceability, each

design argument and verification/test activity is linked with

evidence documents showing the results of the work.

When a safety case for IEC 61508 compliance of a

product is completed it must show all requirements along with

an argument for each requirement as to how the system /

product meets the requirement. A link to the evidence

document that supports the argument is also provided.

Additional fields are provided for the independent assessor to

record the results of the assessment and to communicate their

expectations with other assessors and the certifying individuals.

Overall, the safety case concept provides a single place to

store compliance information in an organized manner. The

use of a safety case provides a systematic means to ensure

completeness of any assessment. The Safety Case method

supports company learning over multiple projects by

establishing a knowledge base consisting of patterns of

fundamental requirements and related design arguments.

Templates and previous examples of evidence documents

provide the ability to reduce effort on subsequent projects.
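
The requirement, argument and evidence structure described in this section can be sketched as a small data model. This is a hypothetical illustration, not the schema of any actual safety case tool or database: the class names, field names, requirement identifiers and file names are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    """A design argument or verification/test activity for one requirement."""
    description: str
    evidence_links: List[str] = field(default_factory=list)  # evidence documents


@dataclass
class Requirement:
    """One requirement of the standard as tracked in the safety case."""
    identifier: str          # hypothetical ID, not a real IEC 61508 clause number
    text: str
    arguments: List[Argument] = field(default_factory=list)
    assessor_notes: str = ""  # field reserved for the independent assessor

    def is_traceable(self) -> bool:
        """True if at least one argument exists and every argument cites evidence."""
        return bool(self.arguments) and all(a.evidence_links for a in self.arguments)


def open_items(safety_case: List[Requirement]) -> List[str]:
    """List requirement IDs that still lack an argument or supporting evidence."""
    return [r.identifier for r in safety_case if not r.is_traceable()]


# Hypothetical safety case with one complete and one incomplete entry.
case = [
    Requirement(
        "REQ-001",
        "Safety plan shall be documented.",
        [Argument("Safety plan issued and reviewed", ["safety_plan_v2.pdf"])],
    ),
    Requirement(
        "REQ-002",
        "Proof test procedure shall be defined.",
        [Argument("Procedure drafted, evidence not yet linked")],
    ),
]
print(open_items(case))
```

A completeness check of this kind reflects the point made above: every requirement should carry an argument, and every argument should link to evidence, before the assessment can be considered complete.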

3.8 Legal Implications of the Standard

Because IEC 61508 is technically only a standard and not a regulation or law, compliance is not always legally required.

However, in many instances, compliance is identified as best

practice and thus can be cited in liability cases. Also, many

countries have incorporated IEC 61508 or large parts of the

standard directly into their safety codes, so in those instances,

it has the force of law. Finally, many industry and

government contracts for safety equipment, systems, and

services specifically require compliance with IEC 61508. So

although IEC 61508 originated as a standard, its wide

acceptance has led to legally required compliance in some

cases.

4. SYSTEM LEVEL APPLICATION

4.1 The System Safety Lifecycle Overview

IEC 61508 was written assuming that a complete custom

automatic protection system is being created. Thus the system

safety lifecycle process covers all activities from initial project

definition to de-commissioning of a system. These activities

are divided into three phases and are numbered to match their

depictions in the flowcharts described below. The three

system safety lifecycle phases are

Analysis phase consisting of the following activities:

1. Conceptual process design

2. Identification of potential risks

3. Consequence analysis

4. Layer of protection analysis

5. SF RRFr and SIL determination

6. Requirements documentation

Realization phase consisting of the following activities:

7a. SRS technology selection

7b. SRS architecture selection

7c. Test frequency determination

7d. Reliability and safety evaluation

8. SRS detailed design

9. SRS installation & commissioning planning

10. SRS installation, commissioning, & acceptance testing

Operation phase consisting of the following activities:

11. Validation planning

12. Safety review

13. Operating & maintenance planning

14. Start-up, operation, maintenance, periodic proof testing

15. Modifications

16. Decommissioning

A brief flowchart illustrating the three phases of the

system safety lifecycle process is shown in Figure 6. Note

that during the lifecycle, all modifications are required to be

fed back to the analysis phase. The individual activities and

their relationships are readily depicted in an extensive

flowchart which is illustrated, by phase, in Figures 7, 8 & 9.

Figure 6 Overview of system safety lifecycle.

Figure 7 Details of analysis phase of system safety lifecycle.

There is a requirement that documented procedures exist

for all safety lifecycle activities. The results of all safety

lifecycle activities must also be documented. Additionally,

IEC 61508 requires quality auditing be performed to ensure

that the lifecycle process is actually being followed on a

project. This is called functional safety management

(FSM). Depending on the SIL level of a project, different

levels of FSM independence are required with an

independent organization required for the higher SIL levels.

Practical interpretations of this FSM independence

requirement, determined by SIL level, are as follows:

SIL 1: Independent FSM auditor is an independent person(s) outside of the immediate design team/

development group.

SIL 2: Independent FSM auditor is an independent person(s) outside of the immediate department

responsible for design/development.

SIL 3: Independent FSM auditor is an independent organization commonly interpreted to mean an entity outside the design/development company.

The remainder of this section explains the three phases in

greater detail and highlights some of the specific activities.

Figure 8 Details of realization phase of

system safety lifecycle.

Figure 9 Details of operation phase of system safety lifecycle.

4.2 Analysis Phase

The overall objective of activities 1-4 is to identify where

dangerous situations are and how dangerous they may be.

Thus, hazards and consequences are identified and

documented, and the inherent risk (likelihood and

consequence) of each hazard (in a process without automatic




protection equipment) is estimated or calculated. IEC 61508

does not specify how these activities are to be accomplished.

There are a number of accepted methods depending on

industry. These methods are well documented and widely

practiced in many industries [10, 11]. Figures 10 and 11

provide an industrial example of hazards and consequences for

a platform separation process.

In activity 5, the inherent risk is compared to tolerable

risk criteria. Tolerable risk criteria are not included in IEC

61508.

Figure 10 Industrial example of platform separation

process.

Figure 11 Industrial example of hazards and consequences.

In some countries government regulators establish quantitative tolerable risk criteria, but in most cases tolerable risk is established by the owner-operator of the process. If the inherent risk exceeds the tolerable risk, then RR requirements for each hazard are established. Often RR is simply specified as an order-of-magnitude level designated by a SIL. In some cases, when quantitative risk frequency methods are used, the RRFr is calculated per (1). Figure 12 continues the platform separation process example through the calculation of an RRFr, and Figure 13 indicates the SIL level requirements.
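As a rough illustration of this step, the following Python sketch assumes that the required risk reduction factor is the ratio of the inherent (unprotected) event frequency to the tolerable event frequency, and it maps the result to a SIL using the order-of-magnitude bands commonly quoted for low demand mode (SIL 1 for an RRF above 10 up to 100, through SIL 4 for an RRF above 10,000 up to 100,000). The function names, the handling of band boundaries, and the example frequencies are hypothetical; they are not taken from Figures 12 and 13.

```python
def required_rrf(inherent_freq_per_year, tolerable_freq_per_year):
    """Required risk reduction factor, assumed here to be the ratio of
    inherent event frequency to tolerable event frequency."""
    if inherent_freq_per_year <= 0 or tolerable_freq_per_year <= 0:
        raise ValueError("frequencies must be positive")
    return inherent_freq_per_year / tolerable_freq_per_year


def sil_for_rrf(rrf):
    """Map a required RRF to a SIL using order-of-magnitude bands.

    Returns 0 when the RRF is at or below 10 (no SIL band applies) and
    raises when the RRF is beyond the SIL 4 band.
    """
    if rrf <= 10:
        return 0
    for sil, upper in ((1, 1e2), (2, 1e3), (3, 1e4), (4, 1e5)):
        if rrf <= upper:
            return sil
    raise ValueError("RRF beyond SIL 4; reduce the inherent risk by other means")


# Hypothetical hazard: expected once in 20 years without protection,
# tolerable once in 10,000 years.
rrf = required_rrf(1 / 20, 1 / 10_000)
print(f"RRFr = {rrf:.0f}, required SIL = {sil_for_rrf(rrf)}")
```

Specifying only the SIL discards information: any RRFr between 100 and 1,000 maps to the same level, which is one reason the RRFr itself is often carried into the SrS as well.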

In activity 6, an SF is defined to protect against each

hazard when an RR is required. The description of the SF

along with the required RR, i.e., the RRFr or the required SIL,

is documented in a safety requirements specification (SrS).

This document becomes the input to the realization phase of

the system safety lifecycle.

Figure 12 Example calculation of RRFr.

Figure 13 SIL level requirements to meet RRFr.

4.3 Realization Phase

After all the SFs are identified and documented, the

realization phase begins with a conceptual design. Specific

equipment is selected. Redundancy levels are chosen. Test

strategies are planned. Based on that information, a

probability of failure calculation is then performed to verify

that the design meets the RRFr. Often initial designs do not

meet the RRFr and the designer must make changes. When

the optimal design is reached through what is normally an

iterative process, the design details can be completed.

4.3.1 Equipment Selection

In activity 7a a conceptual design is performed by

choosing the desired equipment to perform the safety function.

Equipment must be chosen to sense the hazardous condition.

Typical sensors measure pressure, temperature,

flow, level, proximity, velocity or other variables. Often a

microprocessor-based product is chosen to implement the

protection logic. IEC 61508 calls this device the L/S. An SF

will also need a final element. This set of equipment




performs the protective action. Commonly in the chemical /

petro-chemical industries this is a remote actuated valve that

opens or closes to reduce energy. In machine safety there is

often a clutch/brake assembly that dissipates kinetic energy.

In many applications an electrical relay will de-energize a

motor or other load.

The equipment is selected based on the classical

requirements for needed functionality, accuracy and

environmental constraints. For functional safety, it is also

necessary to justify the equipment choices. Justification should consider experience in using a product in similar

applications and the product functional safety design features.

Often products that are third party certified to meet

requirements of IEC 61508 are selected. The designer must

also obtain failure rate and failure mode data for each piece of

equipment. IEC 61508 certified equipment is supplied with a

Safety Manual which contains this information along with all

needed information to support compliance with functional

safety standards.

4.3.2 Redundancy

In activity 7b, the safety architecture is specified.

Redundant equipment may be chosen so as to achieve high levels of safety integrity, high levels of availability or a

combination of both [12]. Unlike prescriptive standards,

there are no specific requirements for redundant equipment in

IEC 61508. Instead the designer may choose the type of

redundancy that is best for the application considering

maintenance capabilities and cost issues. For the equipment

chosen, the reliability and safety models and failure rates must

be obtained. Some redundant controller manufacturers

provide calculation tools that model their redundant systems.

Others provide the models and data in the Safety Manual.

4.3.3 Testing

Once the technology and architecture have been chosen,

the designers plan any potential on-line testing methods

during activity 7c. In some applications the equipment

comprising an SF will, hopefully, not be called on to activate

frequently. This situation is called low demand. In a low

demand application, the equipment often sits dormant for

years at a time. There is no overt indication as to whether the

equipment is still working. Final element equipment in

particular can corrode, cold-weld or otherwise fail in a

completely hidden way such that the SF will not work when

needed, a condition of FD. Therefore the equipment must be

completely inspected and tested at specified time intervals.

Equipment with automatic testing and annunciation is the preferred choice. However, even with automatic testing there

is normally some manual testing that must be done. This

manual testing can verify that the automatic testing continues

to work correctly and can detect FD states not covered by the

automatic test.

In some industries, the target periodic test interval

corresponds with a process shutdown and major maintenance

cycle. In other industries, a periodic inspection/test must be

done more frequently. If these tests must be performed while

the process is operating, on-line test facilities are designed into

the SRS. A periodic inspection and test plan must be created

for all the equipment in each SF.

4.3.4 Probabilistic Failure Analysis

Once the equipment, redundant architecture, and test

strategies are defined, the designers engage in activity 7d by

performing a probabilistic failure analysis to verify that the

design has met the target SIL, RRFr, and reliability

requirements. The effort requires gathering failure rate data as a function of failure modes for each piece of equipment in the

SF.

Most manufacturers that supply equipment intended for

functional safety applications have a failure modes effects and

diagnostics analysis (FMEDA) performed for their equipment

[13]. When those data are not available, designers can use

industry failure rate databases [14, 15, 16, 17].

IEC 61508 does not specify how to perform this failure

probability analysis. There is no specific requirement that

fault trees or Markov models be used although Part 6 of IEC

61508 does have example simplified equations. There is only

a statement that industry accepted methods shall be used.

There are requirements that some important variables in the analysis be included such as common cause failures [18, 19]

in redundant systems. There is no specific failure rate or

failure mode database in the standard either. So, again the

reliability engineer is only expected to use industry accepted

practices. Specialized analysis tools [20, 21] are available,

some with built-in failure rate databases. There is a

requirement that databases and tools be publicly available.

The results of the probabilistic failure evaluation typically

include a number of safety integrity and availability

measurements. Most importantly, however, the PFDavg and

the safe failure fraction (SFF) are calculated for low demand

mode. Probability of failure per hour is calculated for high

demand mode. From charts in IEC 61508 the SIL level that

the design achieves is determined.
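To make these metrics concrete, the sketch below uses the widely quoted single-channel (1oo1) approximation PFDavg = (DU failure rate x proof test interval) / 2, takes the achieved risk reduction factor (RRFa) as 1/PFDavg, and computes the SFF as the fraction of the total failure rate that is either safe or dangerous-detected. The approximation ignores repair time, imperfect proof tests and common cause, so it is an illustration only and not a substitute for the Part 6 equations or a specialized tool. The failure rates are invented, and the SIL returned reflects the PFDavg band alone, before any architectural constraints derived from the SFF are applied.

```python
HOURS_PER_YEAR = 8760


def pfd_avg_1oo1(lambda_du_per_hour, test_interval_hours):
    """Simplified average probability of failure on demand, single channel."""
    return lambda_du_per_hour * test_interval_hours / 2


def safe_failure_fraction(lambda_safe, lambda_dd, lambda_du):
    """SFF = (safe + dangerous detected) / total failure rate."""
    total = lambda_safe + lambda_dd + lambda_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (lambda_safe + lambda_dd) / total


def sil_from_pfd_avg(pfd_avg):
    """Low demand SIL band for a PFDavg value; 0 if even SIL 1 is not met."""
    if pfd_avg >= 1e-1:
        return 0
    if pfd_avg >= 1e-2:
        return 1
    if pfd_avg >= 1e-3:
        return 2
    if pfd_avg >= 1e-4:
        return 3
    return 4  # SIL 4 is the highest band defined


# Hypothetical device: failure rates per hour, proof tested once a year.
lam_s, lam_dd, lam_du = 3e-6, 5e-6, 2e-6
pfd = pfd_avg_1oo1(lam_du, HOURS_PER_YEAR)
print(f"PFDavg = {pfd:.2e}")
print(f"RRFa   = {1 / pfd:.0f}")
print(f"SFF    = {safe_failure_fraction(lam_s, lam_dd, lam_du):.0%}")
print(f"SIL (PFDavg only) = {sil_from_pfd_avg(pfd)}")
```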

Figure 14 Continuation of platform separation process

example.




Figure 15 - Output of a specialized analysis tool that computed the PFDavg, RRFa, and SIL achieved

by a conceptual design for an SF for the process.

Figures 14 and 15 continue the platform separation

process example. Figure 15 shows the output of a specialized analysis tool that computed the PFDavg, RRFa, and SIL

achieved by a conceptual design for an SF for the process.

Note that the SFF was also computed (though not shown on

Figure 15) and used to determine the SIL Architectural

Constraints. (See box on performance metrics in Figure 15.)

Further note that the current design does not meet the required

SIL level (see lower left hand corner of Figure 15) and that the

greatest contributor to PFDavg is the final element (see first

pie chart in lower left hand corner of Figure 15). Clearly

redesign is required with special attention to the final element.

Many initial designs do not meet failure probability requirements, and the designers have a choice regarding which changes to make. Designers may choose to:

• Increase manual proof test frequency for low demand systems. This results in more manual testing for the life of the system. While this decreases the PFDavg, it increases on-going maintenance cost.

• Choose equipment with higher SIL capability. Such equipment will have lower DU failure rates and this will reduce PFDavg and often increase SFF. However, it will typically increase capital expense.

• Add redundant equipment. Depending on the redundant architecture chosen, the PFDavg will decrease and the system availability may increase. Capital costs will increase and on-going maintenance costs will increase.
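The effect of each option on PFDavg can be compared with a small calculation. The sketch below assumes the common simplified equations for a single channel (1oo1) and for a redundant pair (1oo2) with a beta-factor common cause model; the failure rates, test intervals and beta value are hypothetical, and cost is deliberately left out.

```python
def pfd_1oo1(lambda_du, ti_hours):
    """Simplified PFDavg for a single channel."""
    return lambda_du * ti_hours / 2


def pfd_1oo2(lambda_du, ti_hours, beta):
    """Simplified PFDavg for a 1oo2 pair with beta-factor common cause.

    The first term covers independent failures of both channels, the
    second covers failures that defeat both channels at once.
    """
    independent = ((1 - beta) * lambda_du * ti_hours) ** 2 / 3
    common_cause = beta * lambda_du * ti_hours / 2
    return independent + common_cause


YEAR = 8760  # hours

# Hypothetical final element with a DU failure rate of 4e-6 per hour.
options = {
    "baseline: 1oo1, yearly proof test": pfd_1oo1(4e-6, YEAR),
    "1oo1, proof test every 6 months": pfd_1oo1(4e-6, YEAR / 2),
    "1oo1, device with lower DU rate": pfd_1oo1(1e-6, YEAR),
    "1oo2, yearly proof test, beta=0.1": pfd_1oo2(4e-6, YEAR, 0.1),
}

for name, pfd in options.items():
    print(f"{name:36s} PFDavg = {pfd:.2e}  RRF = {1 / pfd:.0f}")
```

In this model the benefit of redundancy is bounded by the common cause term, which is why the analysis is required to account for common cause failures in redundant systems.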

The advantage of the IEC performance-based approach is

that, unlike older prescriptive standards that dictated

levels of redundancy and equipment choice, IEC 61508 gives

the designers choices. The disadvantage of this approach,

however, is that designers must have the means to perform the

probabilistic failure analysis and enough knowledge to make

design tradeoffs. There is no cookbook in IEC 61508.

4.3.5 Detail Design through Acceptance Testing

When the optimal conceptual design is complete and

documented, the SrS is typically updated to include the new information about redundancy and test requirements.

Activities 8-10 can commence. Detailed design activities

include much of the normal project engineering that is

performed by integration companies and project engineering

teams. Wiring and piping diagrams are created. The PLC (if

used) is programmed and tested. A plan is created for the SRS

installation and commissioning. This step includes a

comprehensive test to validate that all requirements from the




original SrS have been completely and accurately

implemented. A revalidation plan, which is a subset of the

validation plan, is also completed for all changes. When the

installed system is tested and validated, the SRS is ready to

provide protection when actual operation begins.

4.4 Operation Phase

The operation phase of the system safety lifecycle

includes activities 11-16 and begins with a safety review of

the implemented SRS which should ensure that all requirements have been met and that the SRS has been

implemented, installed and commissioned correctly. All

maintenance procedures, management of change procedures,

and test procedures must be available. All training must have

been completed. The operation phase continues with all

needed maintenance and periodic testing. All changes must

feed back through the system safety lifecycle steps to be sure

that safety integrity is maintained. This continues until the

system is de-commissioned.

5. RELIABILITY ENGINEERING TECHNIQUES & BEST PRACTICES: PRODUCT LEVEL APPLICATION

While IEC 61508 was written for a complete turnkey system, the more common usage of the standard is the design

and certification of equipment and components. In this

context the two fundamental concepts still completely apply;

however, some system level requirements no longer apply.

The product safety lifecycle process requirements of IEC

61508 are intended to ensure a sufficient level of safety

integrity against systematic faults. In effect the process

should reduce design errors. The process requirements are

detailed in Parts 2 and 3 of IEC 61508. There are an extensive

number of requirements. However a study of the standard

should indicate to any professional that this material is not

radical but classic quality and software engineering techniques

[22] that have evolved over decades.

The level of detail and rigor varies with SIL Capability

rating. A SIL 1 process does not require as much procedure

and documentation as does a SIL 2 process. A SIL 3 process

has very high rigor with more methods and more

documentation requirements. A SIL 3 capable process has

been compared to somewhere between Capability Maturity

Model Integration (CMMI) Level 3 and Level 4 [23]. A SIL 4

process has the highest level of process requirements

including the use of formal methods. Most product

certifications per IEC 61508 have been performed to a SIL 3

capability level as many practitioners consider SIL 4 to be

impractical.

IEC 61508 treats products with documented field

experience differently than newly developed products. If a

product has a sufficient number of operational hours in the

field, this is considered as partial evidence of systematic

integrity. Therefore, in these cases, certain documentation and

process steps are not required.

5.1 Analysis Phase

Since products may be used in many diverse applications,

the specific hazards of all possible processes cannot be

analyzed. Therefore system level risk analysis now becomes,

at the product application level, a market requirement. A

product must be specified to be designed to a particular SIL

level. That product can then only be used in a system at that

SIL level or lower. The SIL capability requirement is one of

the product market requirements. All safety requirements for

a product development are contained in the product SrS. This

document may be separate or part of a general product

requirements document.

5.2 Realization Phase

IEC 61508 requires a documented new product

development process with over 1000 specific requirements for

that process. Example methods are suggested and alternatives

are permissible with justification. This amount of detail and

flexibility has helped increase the acceptance level of the

standard. Knowing that many alternatives are available, an

example process can show the general concepts.

5.2.1 Requirements Review and Acceptance

The example process begins with a review of the SrS in

order to make certain that the designers understand the requirements. A concept system is designed and the design is

verified against the requirements by performing a traditional

design failure modes and effects analysis (FMEA) [24]. If

design issues are identified, new requirements are added to the

SrS and the system design is modified. Typically after a

number of iterations, the system design would show no major

design flaws. IEC 61508 requires that at least a draft

validation concept document be created at this point in the

process. This is done primarily to show that the requirements

of the SrS are testable. When the requirements have been

shown to be understandable, sufficient and testable, they are

then allocated to various specific hardware or software

implementations.

5.2.2 Hardware Design Process

The hardware design process requirements from IEC

61508 are primarily common sense quality issues. All design

tools must be qualified and judged fit for use. This typically

means that designers have a good understanding of how each

tool works including limitations and bugs. Today, most

tools for mechanical and electronic design and analysis meet

the SIL 3 requirements of IEC 61508. However, design teams

must be careful to re-evaluate any new release of a design tool

and these evaluations must be documented.

5.2.3 Software Design Process

Most of the process requirements of IEC 61508 are

software process requirements. An IEC 61508 example

software process starts with software safety requirements.

They are reviewed and if understandable, a prototype design is

performed. Several design verification methods are suggested

but most common is the software FMEA, also known as

software hazard and operability study (HAZOP). When




software safety requirements are understandable, sufficient,

and testable, the software architecture design is complete.

Software design tools must be qualified and justified for

use in IEC 61508. Again, the key requirement is that those

using a tool understand how the tool works and its limitations.

Tool justification is typically done by a combination of testing

and experience. Like all design tools it is important to

completely evaluate any new revisions of compilers and test

tools. These evaluations must be documented.

Detail design and code implementation have a set of normal quality requirements. There are statements in IEC

61508 that require, for example, "The source code shall be

readable, understandable and testable." Many ask how this

can be proven to a third party auditor. A good code review

process can solve the problem. Who better to judge the

quality of the source code than those who must understand it

and make future changes to it? In addition, there are specific

language requirements. For any software language that is not

completely and unambiguously defined, a coding standard is

required to restrict the language to unambiguously defined

features. Therefore, effectively, a coding standard must be

created and actually used. Language constructs that are prone

to error should be banned. Language constructs that can be compiled differently by different compilers must be banned or

only one, completely understood compiler can be used.

Strong data typing is required. Static source code analyzers

are strongly recommended.

Documented module / unit testing of software is required.

Even module testing is planned and executed with

documented test results. A fault tracking system with

documented resolution of problems is required. Documented

software integration testing is required.

5.2.4 Integration Testing

Hardware and software integration test planning is

required with test results recorded indicating a pass/fail result.

A fault tracking system with documented resolution of

problems is required with version control performed on any

design revisions done from this point forward in the process.

5.2.5 Failure Modes Effects and Diagnostic Analysis

In order to support probabilistic analysis at the system

level for each set of equipment used in an SF, the failure rates

for each failure mode must be estimated and published. If

extensive accurate field failure records are available, they may

be analyzed and used to determine the failure rates for each

mode. However, realistically this never happens. Many

failure records are missing important information. Most field failure recording systems (outside of NASA and nuclear

facilities) are not complete. Therefore the FMEDA technique

is used. In an FMEDA, all components of a product are

considered. The failure rate and failure modes of each

component are translated into product level failure rates

primarily by summing failure rates of individual components

for each failure mode.
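The summation step can be sketched in a few lines of Python. The component list, the failure rates (in FIT, failures per 10^9 hours) and the category labels SD, SU, DD and DU (safe or dangerous, detected or undetected) are hypothetical and serve only to show the bookkeeping; a real FMEDA works from a complete bill of materials and a component failure mode database.

```python
from collections import defaultdict

# (component failure mode, category, failure rate in FIT) - invented values
rows = [
    ("R1 open", "SD", 2.0),
    ("R1 short", "DU", 1.0),
    ("C3 short", "DD", 4.0),
    ("U2 output stuck", "DU", 3.0),
    ("U2 output drift", "DD", 6.0),
    ("K1 contacts welded", "DU", 2.0),
    ("K1 coil open", "SU", 2.0),
]


def fmeda_totals(rows):
    """Sum component failure rates into product level rates per category."""
    totals = defaultdict(float)
    for _mode, category, fit in rows:
        totals[category] += fit
    return dict(totals)


def safe_failure_fraction(totals):
    """Fraction of all failures that are safe or dangerous-detected."""
    total = sum(totals.values())
    return (total - totals.get("DU", 0.0)) / total


def diagnostic_coverage(totals):
    """Fraction of dangerous failures that the diagnostics detect."""
    dd, du = totals.get("DD", 0.0), totals.get("DU", 0.0)
    return dd / (dd + du)


totals = fmeda_totals(rows)
print(totals)
print(f"SFF = {safe_failure_fraction(totals):.1%}")
print(f"Dangerous diagnostic coverage = {diagnostic_coverage(totals):.1%}")
```

The product level DU rate produced this way is the figure a system designer would then use in the probabilistic failure analysis of an SF.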

5.3 Operation Phase

At the product level, operations do not typically involve

the product manufacturer. However, in IEC 61508, the

product manufacturer has the responsibility to provide all

needed information for safe operation and maintenance of a

product. This includes suggested manual proof test

procedures to detect any internal failures of automatic

diagnostics or to detect any hidden FD states, i.e., any DU

states. Any special maintenance instructions must also be

provided to the user of the product.

6. IEC 61508 PRODUCT CERTIFICATION

There are several independent companies performing

third party technical assessments to certify products as IEC

61508 compliant at specific levels of SIL capability. Most

product certification programs are operated per EN45011 [25],

a product certification program quality guide.

A product could receive IEC 61508 certification if the

detailed assessment shows that the product meets all relevant

requirements of IEC 61508. In general this certification is an

indication of high design quality for hardware and software

and high manufacturing quality.

The certification trend is relatively new with few products

achieving this distinction prior to 2006. Starting in 2007 the available certified instrumentation products for the process

industries increased dramatically. See the charts in Figures 16

and 17.

Figure 16 Number of IEC 61508 certified sensors (chart axes: years 1996 to 2008; count 0 to 35).




Figure 17 Number of IEC 61508 certified mechanical

devices.

As an example, a solenoid valve, which obviously does not have any software, achieved certification with SIL 3 capability when the manufacturer:

• Demonstrated a design process that met IEC 61508 SIL 3 requirements

• Had an FMEDA performed on the product, resulting in failure rates per failure mode as well as useful life estimates

• Produced a safety manual, a document that contained a set of specific information as required by IEC 61508.

In another example an operating system supplier received a SIL 3 capability certification per IEC 61508 because the supplier:

• Demonstrated a software development process that met SIL 3 requirements

• Added several features to the operating system, including scheduling timeout failure detection and task memory protection, so that this software product would make it easier for the user to create IEC 61508 certified products.

Obviously no hardware is part of the product so no failure rates were produced.

There are hundreds of examples of products with both

hardware and software. In these cases, certification was

achieved by using a combination of good design processes,

FMEDA hardware analysis and user documentation. Figure

18 provides some details for the certification of a product with

both hardware and software components.

Figure 18 Certification details for device with both

hardware and software components.

7. CYBER SECURITY

IEC 61508 currently requires a cyber-security threat

analysis to be performed for all SFs involving software. If a

credible threat is identified, that threat must be addressed.

For the process industries, ISA Security Compliance Institute (ISCI) has completed a set of requirements for embedded product cyber security [26]. These requirements are patterned after IEC 61508 and include:

• Specific design reviews in which the ability to withstand a cyber attack is the objective

• A design process audit with very similar requirements to IEC 61508 for software quality

• Actual network attack testing, called network robustness testing, which involves stress conditions that may originate with external hackers as well as internal failure of equipment on a network.

Most IEC 61508 users are addressing cyber security via

current ISCI requirements.

8. CONCLUSIONS

There are many conventional reliability engineering

techniques and best practices that can be applied to the

problem of achieving functional safety. Safety standards such

as IEC 61508 provide structured frameworks for selecting

from among all techniques and practices those appropriately

suited to different phases of the safety lifecycle.

The IEC 61508 functional safety standard has been in

existence for ten years now. During that time it has found

wide acceptance in many industries. A common usage is the

third party certification of products to be used in safety critical

systems. It is also used for system level design.

The standard has strong requirements for engineering

processes to defend against systematic faults. It also

utilizes probabilistic failure calculations for the equipment set

used in each safety function to show sufficient protection

against random faults.

This performance-based approach has allowed the

standard to remain relevant even with the rapid advances of

new technologies. The performance-based approach has

allowed innovation in safety designs for both products and

systems.

It is a complicated, detailed standard but allows justified

alternatives to the many methods, techniques, and practices

presented as examples. Many industry specific standards have

been derived from IEC 61508 showing its value. IEC 61508

is having a major impact on the field of reliability engineering.

9. REFERENCES

1. IEC 61508, Functional Safety of electrical / electronic / programmable electronic safety-related systems, Geneva, Switzerland, 2000.

2. IEC 61511, Application of Safety Instrumented Systems for the Process Industries, Geneva, Switzerland, 2003.

3. IEC 62061, Safety of machinery - Functional safety of safety-related electrical, electronic and programmable electronic control systems, Geneva, Switzerland, 2005.

4. IEC 61513, Nuclear power plants - Instrumentation and control for systems important to safety - General requirements for systems, Geneva, Switzerland, 2001.

5. IEC 61800-5-2, Adjustable speed electrical power drive systems, Part 5-2: Safety Requirements - Functional, Geneva, Switzerland, 2007.

6. Bishop, P. G. and Bloomfield, R. E., "A Methodology for Safety Case Development", Proc 6th Safety-Critical Systems Symposium, Birmingham, U.K., Feb 1998.




7. Defence Standard 00-55, Parts 1 and 2, Issue 2, U.K. Ministry of Defence, Aug. 1997.

8. exida Safety Case Database Users Manual, exida, Sellersville, PA, 2002.

9. The CASS Guide to Functional Safety Capability Assessment, The CASS Scheme Ltd., U.K., Apr. 2000.

10. Guidelines for Hazard Evaluation Procedures, AIChE Center for Chemical Process Safety, 1992.

11. Marszal, E., and Scharpf, E., Safety Integrity Level Selection, ISA, Research Triangle Park, NC, 2003.

12. Goble, W. M., Control System Safety Evaluation and Reliability, 3rd Ed., ISA, Research Triangle Park, NC, 2010.

13. Goble, W. M. and Brombacher, A. C., "Using a Failure Modes, Effects and Diagnostic Analysis (FMEDA) to Measure Diagnostic Coverage in Programmable Electronic Systems", Reliability Engineering and System Safety, Vol. 66, No. 2, Nov 1999, pp. 145-148.

14. Telcordia 332, Issue 3, Reliability Prediction Procedure for Electronic Equipment, Jan 2011.

15. Handbook of 217Plus Reliability Prediction Models, The Reliability Information Analysis Center, 2006.

16. OREDA-97, Offshore Reliability Data, DNV Industry, Hovik, Norway, 1997.

17. Safety Equipment Reliability Handbook, exida, Sellersville, PA, 2003.

18. Dhillon, B. S. and Rayapati, S. N., "Common-cause Failures in Repairable Systems", 1988 Proc Ann Reliability and Maintainability Symp, Jan 1988, pp. 283-289.

19. Hokstad, P. and Bodsberg, L., "Reliability Model for Computerized Safety Systems", 1989 Proc Ann Reliability and Maintainability Symp, Jan 1989, pp. 435-440.

20. exSILentia Users Manual, exida, Sellersville, PA, 2008.

21. Industrie-Automatisierung, SILence Handbuch, HIMA Paul Hildebrandt GmbH, Brühl, Germany, 2003.

22. Pressman, R., Software Engineering: A Practitioner's Approach, McGraw-Hill, New York, NY, 2005.

23. Ahern, D. M., Clouse, A. and Turner, R., CMMI Distilled: A Practical Introduction to Integrated Process Improvement, Addison-Wesley, New York, NY, 2004.

24. McDermott, R. E., Mikulak, R. J., and Beauregard, M. R., The Basics of FMEA, Productivity, Inc., Portland, OR, 1996.

25. EN45011, ISO/IEC Guide 65, General requirements for bodies operating product certification systems, Geneva, Switzerland, 1996.

26. ISASecure Embedded Device Security Assurance Certification, www.isasecure.org.
