CEAS 2009Aug31 CopernicusTechnologyLimited NoFaultFound Paper(Final)


    Copernicus Technology Ltd 2009 1

    No Fault Found (NFF) occurrences and Intermittent Faults: improving

    Availability of aerospace platforms/systems by refining Maintenance

    Practices, Systems of Work and Testing Regimes to effectively identify

    their root causes

    J D Cockram BEng(Hons) CEng MRAeS, Copernicus Technology Limited

    G M Huby BEng(Hons) CEng MRAeS, Copernicus Technology Limited

    ABSTRACT

    The adoption of preventive and corrective maintenance strategies that both provide

    aircraft availability and assure safety, at minimum cost, is fundamental to aerospace

    operations in all sectors. To provide aircraft availability with even greater success, a

    change to traditional maintenance approaches is required: from assumptions-based

    approaches and speculative component replacements, to knowledge-based strategies.

    One key area where knowledge-based approaches remain unexploited is the No Fault

    Found (NFF) [a] scenario, for which intermittency in electrical and electronic component

    circuitry is a major cause.

    Tackling the NFF issue head on is, perhaps, what many maintenance managers would

    like to do, but it is more complex than simply trying to eliminate problems by

    speculative component changes and/or manpower resources alone. If it was that simple,

    the issue would have long been consigned to the history books, but it has been estimated

    that intermittency and NFFs account for a major proportion of fault occurrences in

    aerospace maintenance organisations.

    The key themes of this paper are as follows:

    • Treating a NFF occurrence as a Diagnostic Failure, and the impact and causes of those Diagnostic Failures.

    • Mitigation of human factors and culture issues in maintenance systems of work, as they pertain to NFF.

    • The capture of maintenance fault data and how it can contribute to diagnosing root causes of intermittent faults.

    • The contribution of test methodology to isolating intermittent fault root causes.

    • How the outcomes of functional testing can be enhanced by proving the integrity of components.

    • How all of these strands are brought together to define strategies to drive down NFF arisings, thus increasing aerospace platform/system availability.

    [a] Also referred to as Cannot Reproduce Fault (CNR), Cannot Duplicate (CND), Unable to Reproduce Fault, Re-Test OK (RTOK), No Trouble Found (NTF), No Fault Indicated (NFI).


    THE NO FAULT FOUND PHENOMENON

    Removals of equipment from service for reasons that cannot be verified by the

    maintenance process (shop or elsewhere) are a significant burden for aircraft

    operators. This phenomenon is commonly referred to as No Fault Found [1].

    People like to categorise or pigeon-hole problems with simple labels, from the Credit Crunch to GM Food, irrespective of the complexity of tangible and intangible

    interactions within the scenario concerned. NFF, and its corresponding effects on

    system-level and aircraft-level availability, is one such scenario. It is easy to label, easy

    to define and easy to see the effects of, but the ease with which this definition can be

    applied to a given scenario is at odds with the depth and breadth of the problem.

    In plain language, a NFF is a reported fault for which the root cause cannot be found.

    Note that this definition applies irrespective of whether the associated diagnostic and

    maintenance activity succeeded in reproducing the symptom(s) experienced by the

    person reporting the fault: whether a symptom is present or not at the time of diagnostic investigation is academic if the actual root cause of the fault cannot be isolated. It also

    applies equally whether the root cause of the symptom, as experienced by the user,

    resulted from a physical fault condition or from user error.

    The primary elements of a NFF occurrence are defined below, along with a simple car

    example.

    • There is the fault itself. This is usually reported by an end-user, such as a pilot (for faults occurring during a phase of flight) or by a maintenance technician (for faults which manifest themselves during other maintenance activity, whether related or unrelated to the fault concerned). The fault is the inability of a component or system to fulfil its intended function.

    • There is the symptom, or symptoms, of the fault. The symptoms are the set of circumstances that brought the fault to the attention of the end-user; it is the effect on the operation of the platform or system. Chronologically, although the symptom is a direct consequence of the fault, it is the symptom which provides the starting point of the corresponding maintenance/diagnostic activity.

    • There is the root cause of the fault. This is the primary failure mechanism which both caused the specific fault and led to the corresponding manifestation of symptoms.

    Car Example

    Fault: the car engine will not start.

    Symptom: when you turn the key in the ignition there is the sound of a

    click and then nothing else happens.

    Root Cause: there is a corroded connector on the starter-relay. This

    created a high resistance hence the relay could not operate.


    There is a chain of events from the fault occurrence, to the report of the fault and/or its

    symptoms by the end-user, to the point at which the maintenance task is either

    completed successfully or categorised as No Fault Found. At that point the

    airworthiness decision-maker must then direct what is to happen next, perhaps basing

    their decision on a combination of: personal experience, assumptions and knowledge;

    the advice of technician colleagues; information in maintenance manuals; or the

    aircraft's fault history, whether individually and/or at fleet level.

    If this was the first time that the fault had been reported on this specific aircraft, and

    you were the airworthiness decision-maker, what would you do?

    The aforementioned NFF chain of events culminates in an inability to identify and fix

    the root cause of the reported aircraft fault. In other words, to apply a different label, an

    NFF occurrence is a diagnostic failure.

    These definitions are pivotal to the concepts expounded in this paper and to how the

    problem of reducing NFF occurrences could be addressed. From these definitions, a simplistic conclusion would be to state that the way to stop a NFF occurrence or a

    diagnostic failure is to achieve diagnostic success. Hence, one must achieve

    diagnostic success in order to identify the root cause of the fault, and thus enable

    implementation of the necessary corrective maintenance activity. To achieve this

    effectively and efficiently necessitates a closed-loop system that can readily correlate

    data pertaining to the symptom, the fault and the successful rectification solution: the

    fix.
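    The closed-loop correlation described above can be pictured as a single record that ties symptom, fault, root cause and fix together. The following is a minimal sketch only, not the authors' system: the class name, field names, aircraft identifier and the wording of the fix are all hypothetical, and the sample data simply reuses the car example from earlier in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultRecord:
    """One reported fault, carried through to its outcome (hypothetical schema)."""
    platform_id: str
    symptom: str
    fault: str
    root_cause: Optional[str]  # None means the root cause was never isolated
    fix: Optional[str]         # None means no confirmed rectification

    @property
    def is_diagnostic_failure(self) -> bool:
        # An NFF occurrence is treated as a diagnostic failure.
        return self.root_cause is None


def fixes_for_symptom(records, symptom):
    """Count the confirmed fixes previously recorded against a given symptom."""
    return Counter(
        r.fix for r in records
        if r.symptom == symptom and not r.is_diagnostic_failure and r.fix
    )


def diagnostic_failure_rate(records):
    """Fraction of records that ended without an identified root cause."""
    return sum(r.is_diagnostic_failure for r in records) / len(records)


# Illustrative data based on the car example; the fix text is an assumption.
records = [
    FaultRecord("CAR-1", "click then nothing", "engine will not start",
                "corroded starter-relay connector", "clean and re-seat connector"),
    FaultRecord("CAR-1", "click then nothing", "engine will not start", None, None),
]
known_fixes = fixes_for_symptom(records, "click then nothing")
nff_rate = diagnostic_failure_rate(records)
```

    The point of the sketch is that the loop is only closed when the root cause and fix fields are populated; a record that stops at the symptom cannot inform the next diagnosis.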

    THE IMPACT OF NFF: DIRECT AND INDIRECT

    Aerospace statistics for NFF demonstrate that achieving diagnostic success is not

    simple, so merely changing NFF terminology to that of diagnostic failure is not going

    to solve the problem and improve aircraft availability.

    Published statistics reveal a wide range of perception in terms of the extent of the

    problem and the impact. Avionics constitute 75% of NFF occurrences in aerospace;

    furthermore, avionics NFF rates are typically in the region of 30% or higher [2]. The

    situation does not appear to have improved significantly in recent years when one

    examines 1996 figures for Boeing, which showed a 40% rate of incorrect parts removal from the airframe [3].

    The real financial cost of the problem is unclear. In 1997 the US Air Transport

    Association estimated the cost of impact as equating to $100 000 per aircraft per year [4].

    More recently, British Airways has estimated the financial impact at 20M per year [5].

    Calculating the financial impact of NFF is highly complex, depending on how far the

    effects are extrapolated in cost terms. Should the calculation only include the cost of

    unnecessary NFF repair investigations at second line workshops? Should it include the

    man-hour costs of unnecessary removals from aircraft, or include the additional spare

    LRU purchase costs in response to arising rates, and so on? These indirect costs are discussed in more detail later in this paper.


    The primary impact of NFF is on people. It prevents them from achieving their

    operational business objectives, whether commercial or military, which puts additional

    pressure and stress on the people that populate the operation. For example, there can be

    few situations more frustrating than that in which military aircrew spend hours planning

    and briefing for a complex training sortie as part of a formation, only to abandon it part

    way through the sortie because of an intermittent fault which the

    technicians are subsequently unable to diagnose successfully. The frustration of the aircrew is

    reflected in equal measure by the frustration felt by the maintenance technicians who

    apply their best endeavours to rectify the fault, only for it to result in a diagnostic

    failure. So how do they react to these situations? They want something done that is

    tangible in order to feel that the problem has been solved, putting a resultant

    'Serviceable' tag back on the maintenance planner's whiteboard. Returning to the

    earlier scenario of what the airworthiness decision-maker should do next, there are a

    number of common options available to them and to their technicians:

    • Firstly, they could rule out 'finger trouble', ie confirm that the end-user utilised the correct procedures to use the equipment.

    • They could insist that the technicians complete all the relevant functional tests of the system(s) concerned and, if it was the first occurrence on the aircraft, they might sign it off as NFF on the basis that it was a one-off.

    • They could insist that the technicians complete all the relevant functional tests of the system(s) concerned and then request that a limited flight test be carried out to see if the fault recurs in the same environmental conditions as originally.

    • They could ask a different team of technicians to review the symptom, fault and investigation carried out so far to identify the potential for additional diagnostic options.

    • They could have the fleet maintenance history examined for information of that fault type on other platforms of the same type, but only if this data is readily available.

    • Similarly, they might seek the advice of colleagues or the Design/Type Certification Authority on whether they have experience and/or ideas of this fault and how to rectify it.

    • They could examine the maintenance manual and then use their judgement to select the most likely component to replace, in the (calculated) hope that it rectifies the problem.

    • They could opt to replace a component in the system that is quick to change and readily available in stock, in the (somewhat less calculated) hope that it rectifies the problem.

    A typical outcome would be that, having ruled out user error, the decision is made to select what is deemed to be the most likely Line Replaceable Unit (LRU) and to

    replace it. Having replaced the LRU, functional checks are carried out to confirm the


    serviceability of the system; tools are returned to tool stores and the paperwork is

    completed. The aircraft is signed off as serviceable once more. It then flies again on

    another route or on another training sortie... and the fault does not immediately recur.

    Success! Replacing the LRU fixed it! Or did it?

    Did the LRU replacement fix the reported fault, by removing the actual source of the

    fault's root cause (albeit undiagnosed) from the system? Or was the fault's actual root cause successfully rectified when the LRU was replaced, because the root cause was in

    fact the intermittent integrity and security of the wiring connections which happened to

    be re-seated and made more secure as a consequence of the replacement activity? Or

    was the fault an intermittent fault that has yet to manifest itself again, perhaps owing to

    a slightly different flight profile experienced on the subsequent flight?

    The above scenario has outlined a typical set of circumstances where fault symptoms

    experienced by the end-user lead to a diagnostic failure which does not have a black or

    white solution. The resulting decision on how to deal with the diagnostic failure would

    be made first and foremost with safety and airworthiness in mind, but there will be other influences on the decision concerning resources available, skills/experience available

    and commercial or military priorities (slot times, time-on-target and the like). But in this

    scenario there is often no clear diagnostic approach to opt for, hence

    business/operational/resource/deadline pressures can have a disproportionate influence

    on the diagnostic process. Ironically, if the fault recurs shortly afterwards on a

    subsequent flight and the fault is reported to the same technician staff as before then, by

    default, the options of what to do have been narrowed immensely and the diagnostic

    process would (or should) be directed elsewhere. Alternatively, subject to anecdotal

    evidence, assumptions or recent experience, the system might be perceived to be

    unreliable, or the specific LRU might be perceived as a problem item, in which case

    an assumptions-based decision could be made in which the replacement LRU is deemed

    unserviceable on fit and might be replaced again. Or the maintenance organisation

    may deal with the issue proactively, possibly by means of a quality occurrence

    investigation.

    What are the implications of these scenarios, which are played out day after day at

    airfields all over the world?

    Firstly, in this scenario, the maintenance staff cannot provide a high level of confidence

    that the fault was fixed right first time and would therefore not recur during a flight in the immediate future, so there is a risk to business, whether that business is package

    holidays or precision bombing. Moreover, depending on the system concerned there

    may also be a risk to safety, either because it is safety critical or because the potential

    for a repeat fault erodes existing levels of system redundancy. In short, there is a

    performance or safety risk to the business output or effect required.

    If the direct impact on business output is the visible effect of NFF, like the tip of an

    iceberg, then below the waterline the main bulk of the iceberg comprises the major

    impact on the supply chain, on maintenance performance and capacity and, potentially,

    indirect impacts such as effects on customer perception of the airline. If the wrong LRU or component is replaced, whether through educated guesswork or simply hoping for

    the best, then this adds major costs to the organisation. These costs include the time


    incurred by maintenance and logistics staff in removing, processing and transporting the

    suspect LRU, all of it wasted because the correct rectification activity has yet to take place to fix the

    fault's actual root cause. Then there are the additional costs to bear for the wasted

    transportation incurred in sending the suspect LRU to the appropriate Maintenance,

    Repair & Overhaul (MRO) organisation or Original Equipment Manufacturer (OEM).

    Then there is the wasted time spent on diagnosing and testing an LRU that has nothing

    wrong with it. And then further logistics processing activity, storage costs, more

    transportation costs and so on.

    Supply chain information systems are not typically configured to recognise or correlate

    the relationship between second line shop repair NFF activity, and the scaling and

    resource consumption monitoring required as part of ongoing forecasting and

    procurement activity. The impact of this is illustrated as follows. If fault X generally

    leads to initial replacement of LRU Y in 80% of cases, irrespective of the reason why -

    even though the actual root cause in 80% of cases is actually to be found with LRU Z -

    then the supply chain information system will detect an increased consumption per

    flying hour of LRU Y and will forecast ahead to ensure that there is sufficient stock to

    meet forecast demand levels. This phenomenon is sometimes referred to as the

    Phantom Supply Chain, and it can be exacerbated even further as LRU Y becomes

    available in greater numbers; and so the initial speculative replacement activity becomes

    easier to justify in the context of the stock levels held. When operators calculate the

    cost of NFF do they just look at MRO NFF costs, or do they calculate the real cost by

    calculating the full cost of the Phantom Supply Chain? Assuming you wanted to

    calculate the full cost impact of NFF in this way, would you possess the data to

    successfully undertake such analysis in the first place?
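    The fault X example above can be worked through numerically. This is a rough sketch under stated assumptions: the figure of 100 arisings is invented for illustration, the 20% of root causes not located in LRU Z are assumed to lie in LRU Y, and every one of those genuine cases is assumed to fall among the initial LRU Y replacements. The function and key names are hypothetical.

```python
def phantom_demand(arisings, pct_y_replaced_first, pct_root_cause_in_y):
    """Split LRU Y removals for one fault type into genuine and phantom demand.

    arisings: number of occurrences of the fault over the period.
    pct_y_replaced_first: percentage of arisings where LRU Y is replaced first.
    pct_root_cause_in_y: percentage of arisings whose root cause really is in LRU Y.
    """
    removals_y = arisings * pct_y_replaced_first / 100
    # Assumption: every genuine LRU Y failure is among the LRU Y removals.
    genuine_y = min(removals_y, arisings * pct_root_cause_in_y / 100)
    return {
        "removals_y": removals_y,
        "genuine_y": genuine_y,
        "phantom_y": removals_y - genuine_y,
    }


# 100 arisings of fault X (illustrative), LRU Y replaced first in 80% of cases,
# but the root cause lies in LRU Z in 80% of cases (so at most 20% in LRU Y).
result = phantom_demand(arisings=100, pct_y_replaced_first=80, pct_root_cause_in_y=20)
```

    Under these assumptions, 60 of the 80 LRU Y removals are phantom demand, yet a forecasting system that sees only consumption would scale stock holdings against all 80.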

    The Phantom Supply Chain also influences the effect on the maintenance policy for an

    item. If LRU Y was assigned a maintenance policy as a consequence of Reliability &

    Maintainability (R&M) analysis - equating to On Condition or Run To Fail policy

    (in other words, once fitted on aircraft there is no need to replace it until it fails) - then

    what would the effect be on the Mean Time Between Failure of the erroneous

    replacements due to fault X? The R&M data would indicate an increase in arisings and

    the maintenance policy might have to change. The changes required could range from

    the introduction of scheduled inspection activity, to the assignment of a lifing limitation,

    to the instigation of a modification. In turn, the increased maintenance activity - a

    phantom maintenance policy, to continue the analogy - then generates a further associated impact on the supply chain, and so it continues. Yet another side effect that

    impacts on the phantom supply chain/maintenance policy is the relationship between

    repairs carried out at LRU/card level by OEMs or repair shops with the fault that caused

    the item to be sent for repair in the first place. The higher tolerances of test equipment

    used at these levels of repair may well uncover faults which have no relationship to the

    reported fault. In these circumstances, the conscientious repair body will execute the

    necessary repair, but is not guaranteed to isolate the fault which caused intermittency

    and thus caused the original, reported fault. So the newly-discovered fault is rectified (a

    fault was found, not the fault) and the item returned to the available stock. The cause of

    the intermittency lies dormant until the component is back in operational use and it then manifests further intermittent fault symptoms at a later date: and so the loop continues.
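    The question posed above about Mean Time Between Failure can be illustrated with simple arithmetic. The sketch below assumes the usual definition of MTBF as operating hours divided by the number of recorded failures; the hours and removal counts are invented for illustration and do not come from the paper.

```python
def mtbf(operating_hours, recorded_failures):
    """Mean Time Between Failure: operating hours per recorded failure."""
    if recorded_failures <= 0:
        raise ValueError("at least one recorded failure is required")
    return operating_hours / recorded_failures


# Illustrative fleet figures for LRU Y (assumptions, not data from the paper).
fleet_hours = 10_000
genuine_failures = 2       # root cause really was in LRU Y
erroneous_removals = 6     # LRU Y removed speculatively for fault X

true_mtbf = mtbf(fleet_hours, genuine_failures)
apparent_mtbf = mtbf(fleet_hours, genuine_failures + erroneous_removals)
```

    With these figures the true MTBF is 5,000 hours but the R&M data would show 1,250 hours, a fourfold apparent drop in reliability that could trigger an unnecessary change of maintenance policy.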


    The effects on consumption and stock levels that result from consuming LRUs for both

    genuine fault occurrences and NFF occurrences combine to

    exacerbate the impact of component obsolescence. If LRU Y has become

    categorised as obsolete, because the platform is a mature platform and that LRU's OEM

    no longer supports it, then it is intuitive that the operator can ill afford the cost and non-

    availability of that LRU that are caused by avoidable NFF occurrences.

    Irrespective of the specific details of each instance of NFF, the impact of NFF is felt at

    every level of flight operations, on pilots, customers, technicians and logisticians; and

    NFF occurrences result in major process waste, avoidable costs and wasted time.

    CAUSES OF DIAGNOSTIC FAILURE

    Basic scrutiny of the circumstances of diagnostic failure occurrences reveals that there

    are several factors that conspire against effective fault diagnosis and root cause analysis.

    These are listed below and discussed in the following paragraphs:

    • The inability to reproduce the symptom during maintenance/diagnostic activity.

    • The inability of test equipment to detect the root causes of intermittent, randomly-occurring faults.

    • The lack of availability of, or lack of access to, relevant corporate technical knowledge.

    • Human factors, including maintenance culture/practice.

    THE ABSENCE OF FAULT SYMPTOMS

    The symptom that does not manifest itself when attempting to diagnose a reported fault

    is an obvious and frustrating characteristic of an NFF occurrence. Assuming operator-

    error has been ruled out, logic decrees that there was a root cause of the fault symptom

    that was experienced and subsequently reported. The absence of the reported symptom

    during diagnosis means that the circumstances of maintenance on the ground have not

    resulted in the root cause precipitating the same effect. This is a well-documented concept, hence the extent of environmental stress screening testing carried out as part of

    LRU shop repair activity or as part of reliability growth testing during component

    design and development. The effect is to replicate the operating conditions that were in

    place at the time of the fault symptom occurrence: conditions that might comprise

    altitude, attitude, vibration, temperature and humidity. It may not be practicable to

    replicate all of these conditions during diagnostic activity, but the most significant

    aspects are sometimes attempted, such as vibration (with engines running) and by

    physical manipulation of the airframe, connectors or cable looms whilst carrying out,

    for example, continuity testing. If these approaches are unsuccessful, then the

    airworthiness decision-maker is then confronted by the various options and scenarios listed earlier in this paper.


    If part of the problem is trying to replicate physical conditions that influenced the root

    cause of the fault, the other part of the problem is the duration of that root cause

    occurrence. The short duration deviation from the normal operating conditions of the

    system is known as intermittency, a well documented phenomenon concerning electrical

    and electronic circuitry. Intermittency [6] has been shown to be influenced by mechanical

    stress (fretting corrosion, for example) and thus this leads to transient variations or

    intermittency in degraded contacts. These intermittent events can last for mere

    nanoseconds, but this contact intermittency can be enough to result in system failure or

    loss of information. Not only are these intermittent events of extremely short duration,

    they are also, by definition, random. With the probability of detecting a random,

    nanosecond-duration root cause event being marginal at best, the temptation of

    speculatively replacing an LRU in the hope of removing the fault's root cause from the

    system becomes a great one. By replacing the LRU, however, the electrical contact

    characteristics of the system have been changed but the susceptible components such as

    cables and connectors have been left unchanged. For connectors in particular, they

    cannot be permanently sealed and so they are susceptible to corrosion and debris ingress, plus they experience wear in use and as a consequence of maintenance.

    Contrast those usage and environmental effects on connectors with those same effects

    on an LRU: the LRU is far less susceptible to these factors than connectors and cables.
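    To give a feel for why the probability of catching a random, nanosecond-duration event is described above as marginal, the following sketch estimates the chance that an instantaneous measurement coincides with an event. It assumes events occur independently and do not overlap; the rate of 10 events per hour and the 100 ns duration are illustrative values chosen here, not figures from the paper, and the function names are hypothetical.

```python
import math


def fault_occupancy(events_per_hour, event_duration_s):
    """Fraction of time the circuit spends in the intermittent state."""
    return events_per_hour * event_duration_s / 3600.0


def detection_probability(p_single, n_measurements):
    """Chance that at least one of n independent instantaneous measurements
    coincides with an event, i.e. 1 - (1 - p)^n, computed stably for tiny p."""
    return -math.expm1(n_measurements * math.log1p(-p_single))


# Illustrative assumption: 10 events per hour, each lasting 100 nanoseconds.
p_one = fault_occupancy(events_per_hour=10, event_duration_s=100e-9)
p_thousand = detection_probability(p_one, 1000)
```

    Even a thousand independent point-in-time readings leave the chance of detection well below one in a million under these assumptions, which is consistent with the temptation to replace an LRU speculatively instead.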

    Intermittent micro-changes in a circuit's ohmic characteristics and contact resistance

    lead to performance deviations from the as designed condition and can occur at any

    level within a given system. Moreover, it has been shown from work carried out by

    Universal Synaptics Corporation [7] over the past 15 years that intermittency is

    predominantly found in what this paper will designate as the 3 Cs: cables, chassis (of

    LRUs) and connectors. This is not intended to ignore the feasible presence of intermittency at circuit board level within a component or LRU; however, the higher

    susceptibility of the 3 Cs to degradation mechanisms, compared with LRUs, means that

    the benefit to be gained in tackling the problem versus the effort to be applied is

    weighted heavily towards applying more resources towards intermittent faults found

    within the 3 Cs.

    Over time, left undetected, the physical mechanisms that are affecting the contact

    intermittency and precipitating the fault's root cause will degrade as a consequence of

    ageing, usage, environmental factors and maintenance factors. The intermittent events

    will become greater in duration and amplitude, degrading to the point where either the root cause is diagnosed and detected, and/or the fault has become permanent: a hard

    fault. Given the massive variation possible in the degrading factors mentioned, the

    evolution of the fault's root cause from initial intermittency to hard fault could take

    place over a lifecycle ranging from seconds to years. Therefore, the longer and more

    gradual this fault degradation life-cycle, the harder it is to detect.

    TEST EQUIPMENT CAPABILITY

    If intermittency can be found and a resultant fix carried out successfully, this provides a

    level of integrity to the item under test, ie the circuit under test, or unit under test

    (UUT), since it shows no signs of intermittency and can perform as designed without


    any deviations or minute changes in circuit characteristics. Whether specifically

    looking for intermittency or confirming the integrity of a UUT, functional testing must

    be carried out to ensure that the functionality of the system is as designed. Moreover,

    modern functional testing technology covers a great deal of the failures; this is

    particularly true in those cases where the Automatic Test Equipment (ATE) testing

    regime matures alongside the system or unit under test, due to the application of

    historical in-service experience.

    However, despite every effort to ensure the detection of known faults using ATE and

    traditional testing methods/equipment like continuity tests, time-domain reflectometers (TDRs) etc, these can only test

    the UUT at a single point in time. Additionally, sample rates and digital averaging

    techniques used to filter out noise have been developed in digital test equipment to

    improve numerical accuracy in measuring circuit attributes, eg resistance. However, the

    combination of measuring at a single point-in-time, sampling rates and digital averaging

    results in any intermittent occurrences being missed completely or masked. Therefore,

    successfully finding a randomly occurring fault or micro-change in a UUT requires a change from these methods, to an approach that significantly increases the probability

    of detection. Digital accuracy is not the solution to detecting random, intermittent

    events: the objective is to detect the event, not to measure it.
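    The masking effect described above can be shown with a toy simulation. This is a sketch under stated assumptions, not a model of any real test equipment: the resistance values, trace length, sampling stride and threshold are all invented for illustration, and the 'continuous' check simply stands in for any approach that observes every instant of the trace.

```python
NOMINAL_OHMS = 0.5     # assumed healthy contact resistance
GLITCH_OHMS = 50.0     # assumed resistance during an intermittent event
THRESHOLD_OHMS = 5.0   # assumed level above which an event is flagged


def make_trace(n_samples, glitch_start, glitch_len):
    """Build a resistance trace that is nominal except for one short glitch."""
    trace = [NOMINAL_OHMS] * n_samples
    for i in range(glitch_start, glitch_start + glitch_len):
        trace[i] = GLITCH_OHMS
    return trace


def event_detected(values, threshold=THRESHOLD_OHMS):
    """Detect (not measure) any excursion above the threshold."""
    return any(v > threshold for v in values)


trace = make_trace(n_samples=100_000, glitch_start=12_345, glitch_len=3)

single_reading = trace[50_000]              # one point-in-time measurement
sparse_samples = trace[::100]               # periodic sampling, every 100th instant
averaged_ohms = sum(trace) / len(trace)     # digital averaging over the whole trace

missed_by_single = single_reading <= THRESHOLD_OHMS
missed_by_sampling = not event_detected(sparse_samples)
masked_by_averaging = abs(averaged_ohms - NOMINAL_OHMS) < 0.01
caught_by_continuous = event_detected(trace)
```

    In this toy case the single reading and the sparse samples never land on the glitch, the average moves by about 0.0015 ohms, and only the check that looks at every instant flags the event, which mirrors the argument that the objective is detection and not measurement accuracy.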

    CORPORATE TECHNICAL KNOWLEDGE

    With the lack of a symptom to influence diagnostic thought processes and

    decision-making, a logical next step is to obtain additional, specialist information. There is a

    huge range of potential sources of such information and advice, ranging from

    maintenance colleagues on other shifts or locations, to contacting the OEM or Design

    Authority for advice, to analysis of the Maintenance Manual, to analysis of maintenance

    data for the specific aircraft or for the fleet type. The first shortfall in knowledge to be

    encountered is with colleagues or maybe even the manufacturers, because their

    knowledge is focussed on the characteristics of the system when it is working correctly,

    not when it is deviating from normal operating conditions. If the type of fault event has

    occurred before, there is the potential that a maintenance colleague or technical

    specialist will have come across it before and will recall the actual corrective actions

    that genuinely remedied the root cause of the fault, without repeat occurrences. If the

    operating agency and the design authority have a proactive and learning relationship, there may even be a process in place to capture diagnostic knowledge such as this in

    order to integrate it into maintenance documentation and procedures. But the fault

    symptom may not have occurred before.

    If specialist knowledge cannot be sourced from technical publications or staff, then the

    remaining option is historic maintenance data. Analysis of the data may show whether

    the same fault has occurred before on the aircraft or on another of the same type, and

    what action successfully rectified the problem; or it may show a trend of related failures

    on the same aircraft which would give an indication of where to focus fault diagnosis

    activity next. The way that the maintenance data is captured, configured and cross-referenced will have a huge bearing on the ease and extent to which it can be

    successfully interrogated to inform the fault diagnosis process.
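    One way to picture the interrogation of historic maintenance data is a query that asks, for a given symptom, which corrective actions were followed by a recurrence and which were not. The sketch below is illustrative only: the tuple layout, the 30-day recurrence window, the tail numbers and the sample entries are assumptions made here, not a description of any real maintenance information system.

```python
from collections import defaultdict
from datetime import date, timedelta


def action_outcomes(history, symptom, window_days, as_of):
    """Tally, per corrective action, how often the symptom stayed away
    ('held') or came back on the same aircraft within the window ('recurred').

    history: iterable of (tail_number, date_reported, symptom, action) tuples.
    Actions whose window has not yet elapsed at `as_of` are not counted.
    """
    by_tail = defaultdict(list)
    for tail, when, sym, action in history:
        if sym == symptom:
            by_tail[tail].append((when, action))

    tally = defaultdict(lambda: [0, 0])  # action -> [held, recurred]
    for events in by_tail.values():
        events.sort()
        for i, (when, action) in enumerate(events):
            deadline = when + timedelta(days=window_days)
            has_next = i + 1 < len(events)
            if has_next and events[i + 1][0] <= deadline:
                tally[action][1] += 1
            elif as_of >= deadline:
                tally[action][0] += 1
    return {action: tuple(counts) for action, counts in tally.items()}


# Hypothetical fleet history for one symptom.
history = [
    ("AC1", date(2009, 1, 5), "radio dropout", "replace LRU Y"),
    ("AC1", date(2009, 1, 12), "radio dropout", "re-terminate connector"),
    ("AC2", date(2009, 2, 1), "radio dropout", "replace LRU Y"),
    ("AC2", date(2009, 2, 9), "radio dropout", "re-terminate connector"),
    ("AC3", date(2009, 3, 1), "radio dropout", "replace LRU Y"),
]
outcomes = action_outcomes(history, "radio dropout", 30, date(2009, 6, 30))
```

    A query of this kind is only possible if each record links the symptom to the action taken and to the specific aircraft, which is why the way the data is captured and cross-referenced matters so much.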


    THE INFLUENCE OF HUMAN FACTORS

    Human Factors refers to the study of human capabilities and limitations in the

    workplace; it considers the interaction of personnel, the equipment they use, the written

    and verbal procedures and rules that they follow, and the environmental conditions of

    any system [8]. Having identified how issues concerning absent symptoms, shortfalls in

    technical knowledge and maintenance culture conspire to prevent diagnostic success, it is necessary to fully explore the underlying Human Factors and behaviours which

    contribute to instances of diagnostic failure. These contributory factors affect the

    smooth passage of the diagnostic process and the manner in which maintenance data

    is captured. These factors are discussed in the following paragraphs with respect to the

    m-SHEL modelb,9.

    m: Management Control of the System

    The system of work within an Aviation operation is highly complex and has a number

    of major influences bearing on it at any one time, and not necessarily in a complementary manner. These influences range from profit, to safety, to long-term

    business objectives, to maintenance policy, to resource limitations. Each organisation

    may implement and manage their system of work differently, but they will all have very

    similar objectives that are fundamental to their success. In short, they all want to

    achieve maximum output (ie profit, or military effect) to meet the customer's needs with

    the minimum consumption of input resources (ie spares, direct/indirect costs).

    In most cases these organisations will implement Performance Management systems,

    incorporating the collation and trend analysis of Key Performance Indicators (KPIs).

    However, there is a growing body of evidence that demonstrates that slavish adherence to these KPIs can actually undermine achievement of the business objectives, because

    individuals within the system of work modify their behaviour to pursue success against

    KPIs instead of against what matters: the needs of the customer10. If the aviation

    organisation measures success with KPIs for number of flying hours achieved, or the

    number of serviceable aircraft available at the start of the flying day, or the percentage

    of flights which embarked and took off on time, then the organisation's staff will

    pursue those targets. Refer this concept back to the dilemma of the airworthiness

    decision-maker described earlier, and it is evident that such business pressures can

    influence the action taken. Thus this often leads to the short-term palliative approach,

    the speculative LRU change for example, rather than the sustainment-driven approach which is to identify the root cause of the fault.

    Aside from Performance Management, access to the right data is crucial to business and

    to enhancing maintenance effectiveness. Depending on the magnitude of the

    organisation's operations, there may be a high turnover of airframes used, either within

    or across operating sites, and across normal operations and maintenance activity.

    Developing a knowledge-based system to underpin the sustained availability of these

    assets necessitates information systems that allow the effective storage, sharing and

    b Edwards' SHEL model (Software, Hardware, Environment, Liveware) for Human Factors was modified by Kawano by the addition of an 'm' for Management (Control of the System).

    accessibility of the knowledge and data required to run the business effectively.

    However, this is more complex to achieve if the operating organisation, the supply-

    chain management organisation and the MRO organisation are separate business

    entities.

    The organisation's maintenance culture will also be a major influence on the

    management system, and the effects of this will inexorably filter through to the tactical level to affect the manner in which the organisation deals with NFFs. Maintenance

    cultures are shaped by an array of factors including the parent organisation's culture, the

    organisational aims, training, nationality, the experiences and knowledge of its staff and

    the intensity of the imperative for the business to perform. In the context of NFFs, that

    culture influences maintenance practice which, in turn, influences how the maintenance

    organisation responds to NFF occurrences at the working level, ie by shop floor or flight

    line maintenance staff. A full psychological profile of aircraft maintenance staff is

    beyond the scope of this paper; suffice it to say that they like to get things done and they

    like to successfully meet targets. For many, crisis-management and fire-fighting are

    more fulfilling approaches to their jobs, and more fun, than studiously analysing data

    and forward planning11. This 'can do' attitude has its place, but if channelled

    inappropriately, however well intentioned, it can lead to the scenario whereby

    something tangible has to be seen to be done.

    An airline operator or a squadron pilot planning for an operational mission may not

    perceive extended fault diagnosis activity or analysis of maintenance data to find root

    causes as being a pro-active approach to returning an aircraft to use. It could even be

    interpreted as just the engineers fiddling with the aircraft. Compare that scenario with

    technicians running round changing lots of LRUs, and the illusion of activity is then

    easily associated with a concerted effort to solve the aircraft's problems. This approach

    becomes established as a norm, and the more established it becomes the harder it is for

    individuals within the system of work to assert themselves to break the pattern.

    In the absence of a clear symptom or symptoms, and with specialist technical advice

    that may well be assumptions-based, rather than knowledge-based, the airworthiness

    decision-maker is back to their list of possible courses of action. If time is pressing and

    there are spare LRUs readily available, a speculative LRU change is the course of action that is often selected;

    especially so if the airworthiness consequences of a recurrence were felt to be relatively

    minor. The major emphasis on LRUs can be seen in many arenas within aviation and

    they are like the NFF iceberg referred to earlier. The electrical contact components, ie

    the wiring, connectors and circuit breakers, are the 'glue' that holds together systems of

    LRUs and other components, but they are more time consuming to test and maintain

    using conventional methods and are less accessible than LRUs. These 'glue'

    components are classed as a system in their own right: the Electrical Wiring and

    Interconnection System (EWIS)12. LRUs on the other hand are easier to see, to replace,

    to supply chain manage and to apply BITEc to than alternative and more mundane

    EWIS components, and so they become the focus of attention. However, in

    considering connectors and the like and comparing their vulnerability to LRUs, it is

    c Built In Test Equipment. To carry out BITE on an item or system refers to the act of running its built-in test functions.

    logical that EWIS components are the very aspect to focus on: but they are part of the

    NFF iceberg's main bulk, submerged well out of sight below the waterline.

    S: Software

    In this Human Factors context, software refers to maintenance procedures, technical

    documentation, checklist layout etc. Having identified the complexity inherent in sharing and integrating data end-to-end across the specific aviation enterprise, or system

    of work, the next challenge concerns the corporate knowledge described previously. In

    particular, do maintenance manuals include troubleshooting information that is

    knowledge-based and that increases the likelihood that the maintenance technicians can

    diagnose and fix the fault right first time, every time? If they do not include that kind

    of information, why is this? Is this because the platform is brand new and that kind of

    knowledge is still developing? Or is it because it has never been requested by the

    customer? Or is it because the cost of including such data is prohibitive to the

    operators? Or is it because it is just too complex a task to collate all the necessary data?

    Collation of such data would be problematic depending on the quality and completeness

    of data captured for maintenance activity. Individual technicians will perceive and

    interpret symptoms and faults differently, or not at all, and they will record this

    information in a bewildering variety of ways. In the case of a hydraulic pipe failure,

    one of the symptoms is likely to be the presence of hydraulic fluid leaking in a specific

    area on the aircraft. The symptom could be recorded as 'split pipe', 'burst pipe',

    'hydraulic leak', 'seeping fluid', 'damaged' or 'unserviceable': the possible permutations are

    numerous. But where does this variability come from? Human beings in the same

    scenario would take in the same raw data as each other via their senses and

    corresponding sensors (eyes, ears, nose etc); thereafter, the raw data is filtered by the application of experience, knowledge, values and culture before the end result is internally

    re-presented to the individual13. The potential variability of the internal re-presentation

    process is infinite given the variability of how every human being would filter and

    modify that raw information. The effect is then exacerbated when more than one person

    is involved in the scenario, especially if they are involved at different points in the fault

    diagnosis chain of events.

    The vital point to note is not how human variability influences the maintenance system

    of work, but simply to note that it does; therefore, a method of mitigating its effects is

    required to improve the establishment of the corporate knowledge needed to address NFF problems.

    H: Hardware

    In this context hardware refers to tools, test equipment, physical structure of the aircraft

    etc. Modern culture is such that any new product has to be bigger (or smaller!), better,

    lighter and faster than its predecessor. These are the attributes we associate with

    progress, but often at the expense of overlooking what a product is for and its fitness-

    for-purpose at providing that function in a sustained and reliable way. The adages of 'if

    it isn't broken, don't fix it' and 'keep it simple, stupid' may be associated with more mature individuals in organisations, but the world of aviation has long known that it is

    more important to focus on what a product is required to do, and for it to do that

    effectively, repeatedly and at minimum cost. With the growth of IT solutions, test

    equipment methodology is often perceived as needing to keep pace - necessarily so if

    obsolescence is a high risk - but is it always necessary? For detecting intermittency it is

    not necessary, as digital accuracy has already been discussed as being an impediment

    rather than a benefit.

    E: Environment

    This ranges from the physical environment, such as conditions on aircraft operating

    surfaces, to the work environment, such as working patterns. Identifying the root cause

    of a fault can be hampered by working systems such as trade structure and shift

    systems. Trade structure can become an issue if there are feasible solutions involving

    more than one system and thus more than one trade: a fault investigation on a flying

    controls system, for example, involving avionics and mechanical trades. The solution

    selected may owe more to the strength of character of trade supervisors than to

    analysis of objective facts and data! Similarly, colleagues on subsequent shifts or

    reallocated from other tasks may question the diagnostic process undertaken so far and choose to take the activity in a different direction - rightly or wrongly. All of these

    factors impede the smooth chain of events from symptom to first-time-fix, and

    complicate unnecessarily the audit trail of corresponding maintenance data.

    L: Liveware

    This term refers to people. It includes the individual at the centre of an activity and the

    other people associated with the activity, in whatever guise that may be. The major

    influence of people has already been described in terms of how the variability of their

    behaviour affects maintenance data capture and the implementation of the diagnostic

    chain of events. A further human factor which impacts on the effectiveness of corporate

    technical knowledge is: knowledge retention and recall. If there is a human factors-

    caused incident, such as an aircraft towing incident involving ground equipment and an

    airframe, the details and the conclusions of the resulting investigation are publicised

    widely across the organisation. Several months later it happens again; 'how can that

    possibly happen again?' the managers ask themselves.

    How should the organisation genuinely learn from what had happened before?

    The key to this scenario is whether the root cause of the original incident was established, and whether a mitigating countermeasure was embedded fully into working

    practices (checklists, manuals, training syllabi etc). If the countermeasure is

    insufficiently embedded (for example, maintenance staff are merely briefed about the

    incident and asked to take more care in future), then the incident soon fades from recent

    memory, the effects are not as visual as they once were, and the corporate knowledge

    may be diluted further by the influx of any new staff. The same is equally true for

    how the maintenance culture deals with NFFs. If the predominant approach is shotgun

    maintenance, ie speculative LRU replacement, then there is limited opportunity to

    retain and transfer knowledge between individuals of what rectification actions are

    genuinely successful in eradicating the root cause of specific faults. Again, the

    organisation has not genuinely learned from what has happened before.

    A PROPOSED ROUTE TO DIAGNOSTIC SUCCESS

    Insanity: doing the same thing over and over again

    and expecting different resultsd.

    If diagnostic failure is caused by or exacerbated by the current test methodology used to

    attempt detection of intermittent faults, by the non-availability of relevant technical data and by the effects of certain maintenance behaviours and practices, then it follows that

    different things - rather than the same thing over and over again - must be done to

    assure diagnostic success. There are 3 crucial elements that must all be dealt with using

    new approaches to increase significantly the prospect of diagnostic success in order to

    drive down NFF arising rates and, ultimately, increase aircraft availability for the

    customer. These are:

    1. The maintenance data approach.

    2. The maintenance management approach.

    3. The Intermittent Fault Detection approach.

    1. THE MAINTENANCE DATA APPROACH

    Variability in its many forms has been a significant theme in the human factors

    discussions in this paper and how it can lead to inefficiency and waste in aircraft

    operations. Given what problems exist in terms of maintenance and operational practice

    and culture, what needs to be done differently to vastly improve aviation solutions to

    NFF? Data capture and analysis comprise the first step in a journey to 'work smarter,

    not harder' in order to significantly reduce NFF occurrences.

    Data capture has become much easier with the introduction of a multitude of BITE for

    on-board systems, data-logs of various data-buses and other health monitoring systems.

    Couple this with the vast amount of data captured for each flight in terms of the flight

    plan, debrief, work order cards etc, and it is evident that there is a substantial amount of

    data to analyse. Consider this alongside the burgeoning capability to

    manipulate and trend information using today's computer and software technology, and

    the situation does beg the question: so why is there a problem?

    In aviation organisations there is an array of disciplines, personalities, departments and

    cultures. This creates information gapse in the flow and recording of relevant and

    crucial data. As a result, the effective analysis of any captured data is adversely

    affected by the number and extent of the information gaps that exist in a given

    organisation's system.

    d Attributed to Albert Einstein, physicist, 1879-1955.

    e An Information Gap can be defined as a break in common communication between persons, departments etc, even though they may speak the same language. For example, a pilot will not necessarily think or speak or use the same terminology as the technician, and as a result an Information Gap is formed.

    An example that simply illustrates this phenomenon is the person who is the owner,

    driver and maintainer of a car. If a person performs all related operational and

    maintenance tasks, including being the budget holder, they have complete ownership,

    accountability and responsibility for all operational aspects of the car; hence no

    information gaps are induced. Therefore, for any symptom that arises, this is logged and

    translated into a possible fault and an associated fix, and its cost remembered; if the fault

    is intermittent and the fix was not effective, then a cost-benefit analysis would take

    place to ensure that any subsequent course of action which may be taken balances

    operational commitments and regulatory requirements. Therefore, if the same symptom

    occurs again and again, this is quickly realised, so depending on the last action taken

    other possible rectification activities might well take place until the symptom stops

    arising. In reality, it is highly likely that different courses of action would be taken by

    this lone operator and it is highly unlikely any of the applied fixes or courses of action

    would be repeated. Therefore, if the symptom was 'the car engine keeps cutting out at

    idle', the individual is unlikely to keep changing the engine management control unit at

    600 per unit; in fact this might be the last item they would change given its high value, unless it is definitively identified as the source of the fault's root cause. This is a clearly

    defined closed-loop system in which all the data is presented and correlated by the same

    person for the same vehicle, and by all the disciplines ie owner, budget holder,

    maintainer and operator.

    Introducing a second person into the scenario to borrow the car complicates the

    system. As they experience the symptom, they begin to process the raw information

    presented to them and on their return they report their findings to the car's owner; this

    may even be coupled with emotional anecdotes depending on the purpose behind

    borrowing the car and the extent of the problem experienced. Depending on the second person's expertise and experience, the presented synopsis of the car's problem could

    well range from 'the car is noisy' with no supporting information, to a full-blown

    diagnostic of the possible fault(s). This is how information gaps begin. With multiple

    cars, multiple drivers and multiple repair staff involved, the information flow is

    impeded further whilst the number and extent of information gaps grow. In short, there

    is no coherent correlation between the data pertaining to symptom - fault - fix.

    The foundation of a successful data capture process is correct definition at the outset,

    coupled with a robust and standardised format. It is vital that the symptom, as

    experienced by the user, is captured using a standard and repeatable methodology. These symptoms can then be codified and entered into a searchable data field within

    the operator/maintenance data management system. This eliminates two common

    information gaps. Firstly, the fault debrief process with the operator/user is carried out

    in their technical language, which prevents information from being missed. Secondly,

    the codification of the symptom allows searching and trending on the database field and

    therefore successfully eliminates the free-text syndrome. Base-lining of the symptom

    data using this standard codification approach not only bridges the information gaps,

    more significantly it enables the direct correlation of the symptom through to the fault,

    and then to the actual fix.

    To illustrate with another car example, consider the following symptom descriptions:

    flat tyre, puncture, flat, nail in tyre, tyre flat. They are all the same, but in a free-text

    database these would be seen as 5 different symptoms, none of which states which tyre

    has the problem. In addition the text focuses on the fault and omits the symptom that

    was experienced by the driver of the car, which was that the car was hard to steer. The

    standardising of symptoms, which can be constructed using historical data, experience

    and knowledge, narrows the possible outcome scenarios; once this has been achieved,

    the symptom can be codified into a discrete code that uses predetermined letters to

    represent certain states and conditions which are common throughout all systems within

    the aircraft. Referring back to the car example, to assist the driver in remembering as

    much of the information as practical, it is imperative to create the debrief in the logical

    manner as experienced by the driver, and not to debrief the operator as an engineer or

    technician might instinctively tend to. Therefore, the resulting debrief could look like

    Table 1:

    Field          Word Picture (options)                                 Code
    System         Steering                                               ST
    Symptom        Steering is: Hard / Loose / Impossible / X Not listed  H
    Speed (mph)    42                                                     42
    Weather        Dry / Wet / Icy                                        D
    Road Surface   Tar / Loose Gravel / Boulder / Mud                     T
    Road Debris    None / Rocks / Glass / Screws/nails                    S
    First Visual   Tyre: OK / Low / Flat                                  F
    Second Visual  Damage: OK / Scuffed / Cut                             C

    Table 1: Example codification of a fault's symptom

    The resultant code would be something like this: STH42DTSFC. In this scenario the

    last 2 aspects of this symptom diagnostic would probably be carried out by the receiving

    staff of the hire car company. Similarly, for aircraft scenarios, flight line mechanics

    might be required to complete the symptom debrief coding process in some cases to

    include, where appropriate, details of warning captions, BITE codes etc.
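
    The codification in Table 1 can be sketched as a pair of encode and decode functions. This is an illustrative sketch, not the authors' implementation: only the letters ST, H, D, T, S, F, C and X appear in the paper; every other option letter, the field names and the fixed field order after the speed are assumptions made for the example.

```python
import re

# Only ST, H, D, T, S, F, C and X come from Table 1.
# Every other letter below is an assumption made for illustration.
SYSTEMS = {"Steering": "ST"}
OPTIONS = {
    "symptom": {"Hard": "H", "Loose": "L", "Impossible": "I", "Not listed": "X"},
    "weather": {"Dry": "D", "Wet": "W", "Icy": "I"},
    "road_surface": {"Tar": "T", "Loose Gravel": "G", "Boulder": "B", "Mud": "M"},
    "road_debris": {"None": "N", "Rocks": "R", "Glass": "G", "Screws/nails": "S"},
    "first_visual": {"OK": "O", "Low": "L", "Flat": "F"},
    "second_visual": {"OK": "O", "Scuffed": "S", "Cut": "C"},
}
# Fields that follow the speed, in the order used by Table 1.
ORDER_AFTER_SPEED = ["weather", "road_surface", "road_debris",
                     "first_visual", "second_visual"]

CODE_PATTERN = re.compile(
    r"^(?P<system>[A-Z]{2})(?P<symptom>[A-Z])(?P<speed_mph>\d+)(?P<rest>[A-Z]{5})$"
)


def encode(debrief):
    """Turn a structured debrief (dict of option labels) into a symptom code."""
    parts = [
        SYSTEMS[debrief["system"]],
        OPTIONS["symptom"][debrief["symptom"]],
        str(debrief["speed_mph"]),
    ]
    parts += [OPTIONS[field][debrief[field]] for field in ORDER_AFTER_SPEED]
    return "".join(parts)


def _invert(mapping):
    return {letter: label for label, letter in mapping.items()}


def decode(code):
    """Turn a symptom code back into the structured debrief."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a valid symptom code: {code!r}")
    result = {
        "system": _invert(SYSTEMS)[match["system"]],
        "symptom": _invert(OPTIONS["symptom"])[match["symptom"]],
        "speed_mph": int(match["speed_mph"]),
    }
    for field, letter in zip(ORDER_AFTER_SPEED, match["rest"]):
        result[field] = _invert(OPTIONS[field])[letter]
    return result


# The debrief from Table 1.
EXAMPLE_DEBRIEF = {
    "system": "Steering", "symptom": "Hard", "speed_mph": 42,
    "weather": "Dry", "road_surface": "Tar", "road_debris": "Screws/nails",
    "first_visual": "Flat", "second_visual": "Cut",
}
```

    With these assumptions, encode(EXAMPLE_DEBRIEF) gives STH42DTSFC and decode("STH42DTSFC") recovers the original debrief. Because every field is chosen from a fixed list, two people debriefing the same event should arrive at the same code.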

    The output of this base-lined, standardised symptom codification activity is that the

    process is now repeatable, and can thus be carried out by the operator directly and

    without necessarily being dependent on an experienced technician to facilitate a debrief:

    and noting that different technicians would each facilitate the debrief in a different way, asking different questions - some relevant, some less so - and in a different order. The

    captured symptom code is meaningful and concise, and does away with the free-text

    problem so that there is now a genuine ability to conduct trend-analysis on

    the symptoms being experienced. Furthermore, this has enabled the capability to trend

    analyse parts of the symptom, for example search for all ST* symptoms reported.

    This approach of part trending allows the output to be defined according to data needs.

    Thus the first part of the code is more useful to strategic management because they

    might only need to know the total number of steering arisings that their customers are

    having. The entire code would be useful to the car's in-service support/maintenance

    organisation because it could be used to influence changes to manuals, to training or to

    modify the systems concerned.
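
    Part trending of this kind can be sketched with a simple wildcard match over the recorded codes. The sketch below is an assumption-laden illustration: the list of reported codes is invented, the BR prefix is a hypothetical second system, and the first two letters are assumed to identify the system as in the ST example.

```python
from collections import Counter
from fnmatch import fnmatchcase


def matching(codes, pattern):
    """Return the codes that match a wildcard pattern such as 'ST*'."""
    return [code for code in codes if fnmatchcase(code, pattern)]


def arisings_by_system(codes):
    """Count arisings per system, assuming a two-letter system prefix."""
    return Counter(code[:2] for code in codes)


# Invented sample of reported symptom codes; 'BR' is a hypothetical system.
REPORTED = ["STH42DTSFC", "STL30WTNOO", "BRX10DTNOO", "STH42DTSFC"]
```

    A strategic view might only need arisings_by_system(REPORTED), while the support organisation could narrow the search with a longer pattern such as matching(REPORTED, "STH*") to find every report of hard steering.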

    Overall, while the chosen example is a very simple one, the principles are the same for

    more complex scenarios. The key to its success is the definition of the symptom as

    experienced by the user of the equipment and in the language of the user. We have

    termed this concept Symptom Diagnostics.

    Applying Symptom Diagnostics to Intermittency and NFF

    Now that the first part of data capture for a given fault has been constructed and

    standardised, this codified symptom can be linked to the diagnosed fault and the

    resulting, successful fix. Given the benefits in the ability to trend symptoms and their

    related faults with the actual fix, this approach is therefore an invaluable tool to apply

    when tackling intermittent faults and the NFF phenomenon.

    To eliminate NFF, the fault has to be solved at its root cause (like any other fault, in

    fact), and while a number of speculative solutions may be attempted to eliminate the

    fault, the need to quickly identify subsequent symptoms is crucial in order to ascertain if

    the applied fix has been effective, ie it has been a diagnostic success. This use of Symptom Diagnostics trending is pivotal to ensuring that repeat fixes are not carried out

    and that other potential fixes are considered.

    While Symptom Diagnostics trending can enhance NFF and maintenance solutions

    activity at the single airframe level, it becomes far more powerful when applied over an

    entire fleet. As the Symptom Diagnostics successful fix data builds up for the fleet, it

    allows technical staff to see what proportion of different maintenance activities led to a

    diagnostic success for a specific Symptom code, for example: for fault X the successful

    fixes were 80% for LRU Z and 20% for LRU Y. This data could be used to guide the

    technicians and airworthiness decision-makers in using the analysed, historical data to direct resources for maximum and long-term effect. Considering the concepts already

    developed in this paper, if LRU Y takes 20 minutes to replace and LRU Z takes 2 hours

    to replace, then it is possible for this to influence the diagnostic process; however,

    coupled with the aforementioned Symptom Diagnostic data there is now the opportunity

    to make a far more informed decision, based on knowledge and not on assumptions and

    not on stock levels or on time-to-replace.
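
    The fleet-level proportions described above can be derived directly from symptom-fault-fix records. The following is a minimal sketch under stated assumptions: each record is taken to be a tuple of symptom code, fix applied and whether the fix was a diagnostic success, and the sample records are invented so as to reproduce the 80%/20% split used in the text.

```python
from collections import Counter, defaultdict


def successful_fix_shares(records):
    """For each symptom code, the share of successful fixes per fix type.

    `records` is an iterable of (symptom_code, fix, was_successful) tuples;
    this layout is an assumption made for the example.
    """
    tallies = defaultdict(Counter)
    for code, fix, was_successful in records:
        if was_successful:
            tallies[code][fix] += 1
    shares = {}
    for code, counter in tallies.items():
        total = sum(counter.values())
        shares[code] = {fix: n / total for fix, n in counter.items()}
    return shares


# Invented fleet records for a single symptom code 'X'.
RECORDS = [
    ("X", "LRU Z", True),
    ("X", "LRU Z", True),
    ("X", "LRU Z", True),
    ("X", "LRU Z", True),
    ("X", "LRU Y", True),
    ("X", "LRU Y", False),  # speculative change that did not clear the symptom
]
```

    Unsuccessful fixes are deliberately excluded from the shares, so a quick but ineffective replacement does not gain weight simply because it is attempted often.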

    The symptom-fault-fix methodology provides the foundation to tackle both hard and,

    more importantly, intermittent faults. For an aviation enterprise with established

    corporate maintenance knowledge for a mature platform, Symptom Diagnostics may not be an essential methodology for finding hard faults, but it does enable technicians to

    find fault root causes more quickly. The ultimate application of this approach would be for

    Symptom Diagnostics codes to be compiled and transmitted by the flight crew while the

    aircraft is still airborne. This would enable maintenance staff to prepare spares and

    resources in advance of the aircraft landing, similar to a Formula One pit crew being in

    position with their tools and tyres prior to the car entering the pits. In addition to this

    operational advantage, Symptom Diagnostics provides the fundamental base-lining

    methodology for tackling intermittency. In considering the issues of speculative

    maintenance, set against a backdrop of time pressure and KPIs to meet, then without a

    standardised means of accurately recording recurring symptoms, the random nature of the intermittent fault becomes hidden in a plethora of unnecessary maintenance

    activities and uncertainty, leading to increases in operating costs and inefficiency. This

    is back in the realm of the NFF iceberg, deep below the waterline.

    By using a Symptom Diagnostics method for trending for a fleet asset it is now possible

    to ascertain whether a speculative LRU change has been successful on an aircraft, or not.

    Therefore, without the need for expensive test set equipment, a Ship or Shelve policy

    could be used to provide the breathing space required to prevent an LRU from being returned for repair or overhaul. For example, if a symptom is reported and cannot be

    reproduced and is believed to result from an intermittent fault, then a considered LRU

    replacement could be carried out. The replaced LRU would then be quarantined on a

    designated rack within the stores system with an appropriate identification label stating

    the aircraft registration, reason for removal, time, date, Symptom Diagnostics code etc.

    After a predetermined time or conditions have been met, ie if the symptom did not

    reoccur within a specified number of flying hours/flights/usage cycles, then the LRU

    would be categorised as unserviceable and inputted into the reverse supply chain for

    repair. Alternatively, if the symptom did reoccur within the specified period, then the

    LRU could be categorised as serviceable, subject to any prerequisite functional checks,

    and returned to use. There are clear airworthiness implications to be considered with

    the ship or shelve policy, subject to the safety/performance-criticality of the LRU

    concerned, but the policy has been introduced with some success by certain operators 14.
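
    The decision logic of the ship or shelve policy, as described above, can be written down as a small function. This is a sketch only: the use of flying hours as the observation measure, the parameter names and the three outcome labels are assumptions, and a real policy would also need to reflect the airworthiness considerations noted in the text.

```python
from enum import Enum


class Disposition(Enum):
    HOLD = "keep in quarantine"
    SHIP = "unserviceable: enter reverse supply chain for repair"
    SHELVE = "serviceable: return to use after prerequisite functional checks"


def disposition(symptom_recurred, hours_since_swap, observation_hours):
    """Decide what to do with a quarantined LRU.

    symptom_recurred: True if the symptom came back on the aircraft within
    the observation period, after the LRU had been replaced.
    """
    if symptom_recurred:
        # The symptom persists without this LRU fitted, so the removed
        # LRU is unlikely to have been the root cause.
        return Disposition.SHELVE
    if hours_since_swap >= observation_hours:
        # The period elapsed with no recurrence: the removed LRU is the
        # probable cause of the symptom.
        return Disposition.SHIP
    return Disposition.HOLD
```

    For example, with an assumed 50-hour observation period, disposition(False, 10, 50) keeps the LRU in quarantine, disposition(False, 50, 50) ships it for repair, and disposition(True, 5, 50) shelves it.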

    Implementation of the above outlined procedural solutions enables the maintenance

    staff to identify faults' root causes much earlier in the diagnostic process than has

    traditionally been the case in NFF occurrences. In doing so, the resultant data reduces

    the instances of erroneous LRU replacements because trend analysis of the data

    highlights rogue LRUs or roguef aircraft. Having highlighted rogue assets and taken

    the appropriate action to isolate or limit their use, the next step in achieving diagnostic

    success is detection of the intermittent fault, fixing its root cause and returning the asset

    to service with no restrictions and also with, just as importantly, vastly increased

    confidence that the intermittency has been eradicated.
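
    Using the definition of a rogue asset given in this paper (an LRU or aircraft responsible for more than the average number of recorded symptom occurrences), the trend analysis can be sketched as follows. The serial numbers are invented, and the average here is taken only over assets with at least one recorded occurrence, which is a simplifying assumption.

```python
from collections import Counter
from statistics import mean


def rogue_assets(occurrences):
    """Return assets with more than the average number of symptom occurrences.

    `occurrences` is an iterable of asset identifiers (LRU serial numbers or
    aircraft registrations), one entry per recorded symptom occurrence.
    """
    counts = Counter(occurrences)
    if not counts:
        return []
    average = mean(counts.values())
    return sorted(asset for asset, n in counts.items() if n > average)


# Invented occurrence log: SN001 appears six times, SN002 once, SN003 twice.
OCCURRENCES = ["SN001"] * 6 + ["SN002"] + ["SN003"] * 2
```

    In this sample the average is three occurrences per asset, so only SN001 is flagged. A practical scheme would probably normalise by usage (flying hours or cycles) before comparing assets.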

    2. THE MAINTENANCE MANAGEMENT APPROACH

    This paper has examined the maintenance management and maintenance practice issues

    that contribute to NFF occurrences, or which prevent the arising rates from improving.

    The major hurdles which stand out in terms of their relative effect on diagnostic failure are the disproportionate focus on LRUs, the influence of KPIs and business pressures on

    fault diagnosis processes and the quantity of LRUs erroneously sent for repair.

    The engineering and degradation factors discussed in this paper, coupled with excessive

    NFF rates for LRUs at second line repair workshops, all suggest that the LRU is being

    treated as a 'sticking plaster' approach to rectifying NFF occurrences - treating the

    symptom, not the cause. Degradation mechanisms and operating environments mean

    that the 3 Cs are the more significant source of NFF root causes and that this is where

    diagnostic resources, including training, should increasingly be directed. However,

    f 'Rogue' is defined as an LRU or aircraft which, through data analysis, is proven to be responsible for more than the average number of symptom occurrences being recorded.

    before committing to this approach, use of Symptom Diagnostics data would provide

    the checks and balances to indicate that the approach was correct in practice and not just

    in theory. In parallel with a revised emphasis away from LRU replacements,

    implementation of a ship or shelve policy would complement all initiatives to reduce

    unnecessary LRU replacements and would successfully minimise the throughput of

    LRUs in the reverse supply chain.

    KPIs can have a positive motivational effect or can sub-optimise the performance of a

    system of work. If they are the right thing to revise for an organisation, then the

    Performance Management system should be amended for those maintenance KPIs

    which focus on long-term effectiveness, rather than short-term effect. The first-time-

    fix rate for, say, the top ten critical faults could be the major KPI of a maintenance

    organisation's diagnostic capability, particularly if the figures display a continual,

    downward trend.
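As an illustration of how such a KPI might be computed from maintenance records, the following is a minimal sketch. The field names `fault_code` and `attempts_to_fix` are hypothetical, and "top ten critical faults" is approximated here by the ten most frequently recorded fault codes, which is an assumption; an organisation may rank criticality differently.

```python
from collections import defaultdict


def first_time_fix_rates(fault_records, top_n=10):
    """Return the first-time-fix rate for the top_n most frequent fault codes.

    A fault counts as a first-time fix when "attempts_to_fix" equals 1
    (hypothetical field: number of maintenance actions before the fault cleared).
    """
    totals = defaultdict(int)
    first_time = defaultdict(int)
    for rec in fault_records:
        totals[rec["fault_code"]] += 1
        if rec["attempts_to_fix"] == 1:
            first_time[rec["fault_code"]] += 1
    # Rank by frequency (ties broken alphabetically) as a stand-in for criticality.
    top = sorted(totals, key=lambda code: (-totals[code], code))[:top_n]
    return {code: first_time[code] / totals[code] for code in top}


# Hypothetical records for two fault codes.
fault_records = [
    {"fault_code": "F-101", "attempts_to_fix": 1},
    {"fault_code": "F-101", "attempts_to_fix": 1},
    {"fault_code": "F-101", "attempts_to_fix": 2},
    {"fault_code": "F-101", "attempts_to_fix": 3},
    {"fault_code": "F-205", "attempts_to_fix": 1},
    {"fault_code": "F-205", "attempts_to_fix": 1},
]

for code, rate in first_time_fix_rates(fault_records).items():
    print(f"{code}: {rate:.0%} first-time fix")
```

Computing the figure per reporting period and plotting it over time would expose the trend that the KPI is intended to track.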

    Finally, the senior level of maintenance management should support airworthiness

    decision-makers in all efforts to prevent shotgun maintenance and focus on diagnostic success to identify faults' root causes. Company policy and maintenance standard

    operating procedures (SOPs) should reflect this and provide a documented process to

    follow to vastly increase the likelihood of a first-time-fix in conjunction with the

    additional measures outlined above.

    3. THE INTERMITTENT FAULT DETECTION APPROACH

    As previously alluded to with regard to human factors 'Hardware' issues, the inexorable

    advances of technology result in an increased focus on and fascination with the accuracy of digital equipment. Despite this, the probability of detecting intermittency is

    more relevant to isolating intermittency root causes than digital measuring capability.

    Therefore, to increase the probability of detection compared with conventional digital

    equipment, a technique is required that is continuous in its ability to detect nanosecond

    intermittency over a specified test period, and not sampled or averaged, or limited to a

    point-in-time testing method.

    Extending the logic of this concept one stage further, if multiple lines of continuous

    testing can be carried out simultaneously then, by default, the probability of detecting an

    intermittent fault during a specified period of time in a UUT is substantially increased.

    Combine this intermittency testing with an appropriate level of environmental

    stimulation for the UUT and this testing methodology provides maintenance staff with

    increased confidence that every part of the UUT has been subjected to representative

    conditions while all test points have been continuously and simultaneously tested.
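The effect of simultaneous testing can be illustrated with a toy simulation. This is a sketch under stated assumptions and not the authors' analysis: a single transient event is assumed to occur on one randomly chosen line at a uniformly random time within the test period; a one-line-at-a-time tester is assumed to divide the test period equally between the lines; and a monitored line is assumed always to register the event. The function name and parameters are hypothetical, and the model makes no claim about the exact scaling achieved by real equipment.

```python
import random


def detection_fraction(num_lines, trials, simultaneous, seed=0):
    """Estimate the fraction of single transient events that are observed.

    Sequential mode: line k is monitored only during the k-th slice of the
    test period. Simultaneous mode: every line is monitored throughout.
    """
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        faulty_line = rng.randrange(num_lines)
        event_time = rng.random()  # fraction of the test period, in [0, 1)
        if simultaneous:
            detected += 1
        else:
            # Index of the line being monitored when the event occurs.
            monitored = min(int(event_time * num_lines), num_lines - 1)
            if monitored == faulty_line:
                detected += 1
    return detected / trials


sequential = detection_fraction(num_lines=10, trials=20000, simultaneous=False)
parallel = detection_fraction(num_lines=10, trials=20000, simultaneous=True)
print(f"one line at a time: {sequential:.3f}")
print(f"all lines at once:  {parallel:.3f}")
```

Under these assumptions the sequential tester observes the event only when it happens to be watching the right line at the right moment, so its detection fraction falls as the number of lines grows, whereas the simultaneous tester is unaffected.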

    Analogue neural-networks provide the means to successfully achieve intermittent fault

    detection. A very low analogue signal allows the UUT to be tested continuously for a

    period of time, and without the aforementioned compromises introduced by digital

    techniques. The use of detection-optimised analogue versus measurement-optimised digital equipment means that nanosecond intermittency can be detected successfully.

    Widely-available Digital Multi-Meters are limited to point-in-time testing and their

    sampling rates restrict their potential capability to detection of millisecond

    intermittency, thus there is an increase in the probability of detection by analogue

    equipment of over 1 x 10^6. Using an analogue neural-network

    improves matters dramatically, because it uses a method whereby all the circuits of the

    UUT are inter-linked in a synaptic-like networked arrangement. The neural-network

    allows multiple test points (ie the circuits of the UUT) to be tested simultaneously and

    continuously without missing any intermittency events across all the points under test.
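The order of magnitude quoted above can be checked with a simple worked calculation. The model is an assumption made for illustration: an instrument that takes instantaneous samples at a fixed period catches a single event of a given duration, arriving at a random phase, with probability equal to the event duration divided by the sample period (capped at 1). The continuous detector is assumed to register every such event, which is the paper's claim rather than something verified here. The 1 ms sample period and 1 ns event duration are representative values taken from the "millisecond" and "nanosecond" figures in the text.

```python
def sampled_detection_probability(event_duration_s, sample_period_s):
    """Probability that periodic instantaneous sampling lands inside one event.

    Assumes the event starts at a uniformly random phase relative to the
    sampling clock; this is an illustrative model, not a meter specification.
    """
    return min(1.0, event_duration_s / sample_period_s)


NANOSECOND = 1e-9
MILLISECOND = 1e-3

p_sampled = sampled_detection_probability(NANOSECOND, MILLISECOND)
p_continuous = 1.0  # assumed: continuous monitoring registers every event
improvement = p_continuous / p_sampled

print(f"sampled probability:  {p_sampled:.1e}")
print(f"improvement factor:   {improvement:.1e}")
```

With these values the ratio is of the order of 10^6, consistent with the figure given in the text; a different sample period or event duration would change the factor proportionally.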

    Diagram 1 compares the detection capability of the analogue neural-network with that

    of the conventional ATE methodology.

    Diagram 1 - ATE vs Analogue Neural-Network

    Conducting the test simultaneously using a neural-network provides an increase in

    probability of detection proportional to the square of the number of circuits under test.

    This substantial increase in the probability of detection, combined with the reduction in

    the time taken to complete the test (because the testing is performed for multiple points

    simultaneously, rather than testing one line at a time), means that exploiting analogue

    neural-network equipment to detect and eradicate intermittent faults in electrical and

    electronic aerospace components is the most effective test methodology to use.

    THE REQUIRED OUTCOMES OF DIAGNOSTIC SUCCESS

    Aviation maintenance policy concerns itself with ensuring that aircraft and their systems

    are safe and fit-for-purpose throughout their full life cycle. In addition, commercial

    demands mean that this is delivered to customers in a sustainable and cost-effective

    manner. In short, this means providing the required aircraft availability at minimum

    whole-life cost. NFF occurrences and intermittent faults directly affect that simple

    equation, hence the required outcome of the route to diagnostic success must be to underpin and enhance availability levels by enabling the correct diagnosis and

    rectification of every fault (including intermittent faults), right first time, every time.

    An availability-focused maintenance strategy is not applied using a one-flight-at-a-time

    mentality, ie 'is it airworthy and serviceable for the next flight?' If this were the case then

    there would not be such considerable efforts invested in developments such as fatigue

    life and component life extension programmes. The concept of aircraft structural

    integrity is not new, and the approach to continuing airworthiness has spread to all

    systems on aircraft, including the integrity of the EWIS 15.

    In the context of NFFs therefore, analogue neural-network Intermittent Fault

    Detection equipment should be used to demonstrate the system integrity of a UUT. The

    combination of proving system integrity using this testing capability, as well as

    confirming serviceability or system functionality using traditional special-to-type

    test equipment will lead to enhanced levels of sustained availability of systems and

    platforms.

    Functional Test + Integrity Test = Increased Availability

    The combination of Intermittent Fault Detection using analogue neural-network test equipment (to enable rectification of the root cause) and functional testing using

    existing test equipment can therefore provide the highest level of assurance of system

    availability where the system can perform its function without interruptions from

    transient faults.

    CONCLUSIONS

    A No Fault Found occurrence describes the set of circumstances which starts with an

    end user experiencing a fault's symptoms and ends in a diagnostic failure. It is a reported fault for which the root cause cannot be found. NFFs impact on aircraft

    availability directly and indirectly, through causing aborted flights and maintenance

    rework; and through wasted time/money/resources involved in erroneous, speculative

    and avoidable component/LRU replacement, repair and procurement.

    Diagnostic failures are caused by:

    Maintenance Human Factors - including: the pressure to meet operational

    deadlines; an excessive troubleshooting focus on LRUs over EWIS components,

    overlooking the 3 Cs (cables/connectors/chassis) in particular; the variability of data captured for symptoms, faults and fixes; and the retention, accessibility and

    integrity of fault-finding and troubleshooting data.

    Nanosecond Intermittency - resulting in intermittent symptoms that may not

    manifest themselves during fault investigation activity; and which cannot routinely

    be detected by conventional ATE because of their sampling rates, single point-in-time

    application and digital averaging techniques, all of which sub-optimise their

    ability to detect randomly occurring nanosecond intermittency.

    To mitigate these obstacles to diagnostic success there are 3 key strategies which must all be applied:

    Symptom Diagnostics - the capture and correlation of standardised symptom-fault-fix

    data to improve corporate technical knowledge of successful first-time-fix

    methods for real symptoms; this knowledge informs fault diagnosis processes and

    diverts attention away from LRU changes and back towards root cause solutions,

    especially in the 3 Cs.

    Management - NFF-related maintenance management policy must underpin these

    new strategies by inculcating the improvements in the system of work through

    integration within the performance management system, within technician training

    and within NFF fault-finding SOPs.

    Intermittent Fault Detection - the use of analogue neural-network test

    technology to overcome the shortcomings of digital equipment in this specific

    context, to the extent that the probability of detecting intermittency is increased by

    multiples of 10^6.

    These strategies provide the foundation for an availability-focused maintenance strategy. Furthermore, the functional testing of electrical and electronic equipment by

    ATE can be enhanced by proving the integrity of circuitry, connectors and the EWIS.

    The combination of testing for Function and for Integrity leads to Increased

    Availability, a reduced supply chain and vastly increased confidence in the equipment

    being fitted to aircraft.

    The aerospace maintenance organisation that could harness the combined

    approach of this Intermittent Fault Detection methodology in concert with the

    Symptom-Fault-Fix approach would be genuinely world class in its ability to fix

    faults right first time, every time and, therefore, in its sustained ability to deliver increased aircraft availability.

    REFERENCES

    1 ARINC Report 672, (2008), Guidelines for the Reduction of No Fault Found (NFF), Avionics Maintenance Conference, Aeronautical Radio Inc.

    2 http://www.aviationweek.com/aw/generic/story_generic.jsp?channel=om&id=news/om207cvr.xml accessed at 1930 on 24 Aug 09.

    3 Knotts R, (1996), Analysis of the Impact of Reliability, Maintainability and Supportability on the Business and Economics of Civil Air Transport Aircraft Maintenance and Support, M.Phil. thesis, University of Exeter, UK.

    4 Reference 2.

    5 Blischke W R & Murthy D N P, (2003), Case Studies in Reliability & Maintenance, John Wiley & Sons, New York.

    6 Dunwoody S, Bock E, Sofia J, (1996), A Practical and Reliable Method for Detection of Nanosecond Intermittency, AMP Journal of Technology Vol 5.

    7 Kelly G, Sajecki A, Soresnson B A, Sorenson P W, (2001), An Analyzer for Detecting Aging Faults in Electronic Devices, Updated from 1994.

    8 CAP 715, (2002), An Introduction to Aircraft Maintenance Engineering Human Factors for JAR66, UK Civil Aviation Authority.

    9 Kawano R, (1997), Steps Towards the Realization of Human-Centred Systems, IEEE 6th Conference on Human Factors and Power Plants, Conference Proceedings, Orlando, pp 13/27-13/32.

    10 Seddon J, (2003), Freedom from Command and Control: a better way to make the work work, Vanguard Press.

    11 Repenning N P, Sterman J D, (2001), No-One Ever Gets Credit For Fixing Problems Before They Happened: Creating And Sustaining Process Improvement, California Management Review Vol 43 No 4.

    12 European Aviation Safety Agency, (2008), EASA Certification Specification for Large Aeroplanes CS-25 Subpart H, Amendment 5.

    13 Reference 8.

    14 Reference 1.

    15 Reference 12.

    Jim Cockram is a former RAF engineering officer, with a 25-year Service career that focussed heavily

    on the maintenance and logistics support of fast-jet fleets and guided weapons systems, from the vantage

    point of roles in Forward, Depth and Integrated Project Team environments. During this time he

    participated in many exercise and operational deployments, and was an early advocate of applying Lean

    Thinking to Defence organisations. His experiences in programme management and in running large,

    aircraft maintenance organisations led him to develop maintenance and data-exploitation strategies which

    he has subsequently employed successfully in business improvement projects in the private and public

    sectors. Jim is the Technical Director of Copernicus Technology Ltd and is an enthusiastic member of

    the Royal Aeronautical Society Highland Branch committee.

    Giles Huby was also an RAF engineering officer, whose 16-year Service career encompassed land-based

    and carrier-based fast-jet operations, plus Forward, Depth and Integrated Project Team roles in support of

    fast-jet fleets and guided weapons. Like Jim, Giles also possesses considerable experience of Defence

    programme management and running large, aircraft maintenance organisations. He was heavily involved

    in Lean process improvement activity in Defence; plus he amassed significant experience of Human

    Factors incident investigation and the associated development of enduring countermeasures to prevent

    further recurrences. Giles is the Managing Director of Copernicus Technology Ltd and the Chairman of

    the Royal Aeronautical Society Highland Branch committee.