System Testability Assessment for Integrated Diagnostics



Integrated Diagnostics

In this series of articles, we are interested in the ability to diagnose failures as part of an overall maintenance architecture. Testability is a means to that end. Its only purpose is to improve system maintenance and repair. Testability is a yardstick by which we measure our success in achieving design goals for various aspects of field maintenance. As we shall demonstrate, testability is not a single issue but comprises several issues involved in maintaining complex systems.

In the first article of the series,1 we presented an overview of the problem of analyzing testability and conducting diagnosis for complex systems. In our second article,2 we described in detail the form of an information flow model, including test representations, conclusions, logical interdependencies, groupings, and special logical constructs. Several tools3,4 incorporate this modeling approach, and we will use the information flow model as the basis of discussion in this and future articles. We now describe several measures that are useful in analyzing system testability.

WILLIAM R. SIMPSON

JOHN W. SHEPPARD

Arinc Research Corp.

In part 3 of their series on integrated diagnostics, the authors present techniques for analyzing the testability of a system. Based on the information flow model detailed in the second article, the measures presented here identify testability problems involving ambiguity, feedback, the test set, and multiple failures.

Model representation

A key element in the computation of testability is the information flow model, which represents the flow of diagnostic information by means of a network of logical constructs. For illustration we use the case study of an antitank missile launcher, introduced in our previous article. Figure 1 is a dependency model of the missile launcher system, and Figure 2 on p. 42 is a matrix representation of the same system. Because the representation is a bit matrix, we can replace any letter with a 1 and any empty area with a 0. The letters represent the method by which the relationships between elements (indicated by the row and column) were derived:

f: first-order (or input) relationships
h: higher order relationships derived by transitive closure
l: higher order relationships derived by logical closure
n: higher order relationships derived by incremental closure

With one exception (feedback analysis, which we will discuss later), it is not important what the letter designation is, only that a relationship exists.

Terminology

Before we can describe the mathematics, we must define the terminology.



Figure 1. Dependency diagram of antitank missile launcher case study.

In general, we use a lowercase letter to denote an individual member of a set and give the letter a subscript to indicate which member; thus, c15 is the 15th member of the conclusion set. An uppercase letter denotes the entire set, and the cardinality symbol (for example, |X|) indicates the number of set members. Several sets are of interest:

■ F: the set of fault isolation conclusions, including conclusions, inputs, and No Fault (F = C ∪ IN ∪ NF).
■ C: the set of fault isolation conclusions (not including inputs or No Fault).
■ T: the set of tests that can be evaluated (not including testable inputs).
■ IN: the set of inputs to the system being evaluated. There are two input sets: INU is the set of untestable inputs, and INT is the set of testable inputs (IN = INU ∪ INT).
■ NF: the set containing the element No Fault and for which the cardinality, |NF|, is 1.
■ I: the set of information sources, including the tests and testable inputs (I = T ∪ INT).
■ E: the set of elements in the model, including all the elements in F and I (E = F ∪ I). Note that the cardinality of E, |E|, cannot be computed as a sum of the cardinalities of F and I (|E| ≠ |F| + |I|) because testable inputs belong to both F and I (that is, F ∩ I = INT).
■ U: the set of elements in the model that have unique properties. There are two unique sets: UI is the set of unique information sources, and UF is the set of unique fault isolation conclusions (U = UI ∪ UF).
■ RU: the set of replaceable unit groups in the model.

To define uniqueness, we must first define two vectors: the failure signature (SF_i) and the test signature (ST_j).

A failure signature is a vector associated with a specific element in F that indicates all the tests that depend on f_i. The vector corresponds to a specific row in the test-to-conclusion dependency matrix. Thus, from row c3 in Figure 2,

$$SF_{c_3} = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \ldots) \qquad (1)$$

and from row c11 in Figure 2,

$$SF_{c_{11}} = (\ldots) \qquad (2)$$

We call this vector a failure signature because if the tests corresponding to row entries in the test-to-conclusion dependency matrix were to fail, we would expect the element corresponding to that row to fail. That is, if c3 fails, we expect the tests indicated in SF_c3 to fail, and if c11 fails, we expect the tests indicated in SF_c11 to fail.

Tests may also have failure signatures. For example, Figure 2 yields failure signatures such as (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...) and (0, 0, 1, 1, 1, 1, 1, 0, 0, 0, ...) for individual tests.

Say that we allow the orientation of both dependency matrices to define an ordering on F and I. Then any element in F is a member of UF if the element is not preceded by another member having a failure signature equal to that of the element:

$$f_j \in UF \ \text{ iff } \ SF_j \neq SF_k \quad \forall k \in (0, j),\ f_k, f_j \in F \qquad (8)$$

Under this formulation, the first occurrence of any failure signature is unique, and subsequent occurrences are not unique. Nonunique elements may be associated with the unique element whose signature they match. We do not consider the corresponding case for tests except when considering feedback.

A test signature is a vector associated with a specific element in I. It is a mapping of a specific column in both matrices; each column of Figure 2 thus yields a test signature. Any element in I is a member of UI if the element is not preceded by another member having a test signature equal to that of the element:

$$i_j \in UI \ \text{ iff } \ ST_j \neq ST_k \quad \forall k \in (0, j),\ i_k, i_j \in I \qquad (9)$$

Under this formulation, the first occurrence of any test signature is unique, and subsequent occurrences are not unique. Nonunique elements may be associated with the unique element whose signature they match.
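To make the signature bookkeeping concrete, here is a minimal sketch of deriving failure signatures, test signatures, and the unique sets UF and UI from a pair of boolean dependency matrices. The data layout (lists of rows named d_tt and d_tc), the function names, and the toy matrices are illustrative assumptions, not the representation used by the tools the article cites.

```python
# Illustrative sketch: signatures and uniqueness from closed dependency matrices.
# d_tc[i][j] = 1 if information source j depends on conclusion i (test-to-conclusion matrix).
# d_tt[i][j] = 1 if information source j depends on information source i (test-to-test matrix).

def failure_signatures(d_tc):
    """Failure signature of each conclusion: its row of the test-to-conclusion matrix."""
    return [tuple(row) for row in d_tc]

def test_signatures(d_tt, d_tc):
    """Test signature of information source j: column j of both matrices, stacked."""
    n_cols = len(d_tt[0])          # both matrices are assumed to share the same columns
    return [tuple(row[j] for row in d_tt) + tuple(row[j] for row in d_tc)
            for j in range(n_cols)]

def unique_indices(signatures):
    """First occurrence of each signature is 'unique'; later matches are not."""
    seen, unique = set(), []
    for idx, sig in enumerate(signatures):
        if sig not in seen:
            seen.add(sig)
            unique.append(idx)
    return unique

# Toy 3-conclusion, 3-source example (not the case-study data).
d_tt = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
d_tc = [[1, 0, 0], [1, 1, 0], [1, 1, 0]]
SF = failure_signatures(d_tc)
ST = test_signatures(d_tt, d_tc)
UF = unique_indices(SF)   # conclusions 0 and 1 are unique; conclusion 2 matches 1
UI = unique_indices(ST)
```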

The construction of matrix D will determine the order in which we evaluate the dependency structures and thus which member of a group of nonunique elements we declare to be unique. However, the construction of the matrix does not affect the members of the group or the total number of unique dependency structures. Table 1 lists the set memberships as they apply to the case study.

Figure 2. Closed dependency matrices of case study system: test-to-test dependency matrix (a) and test-to-conclusion dependency matrix (b).


Table 1. Set memberships in the case study.

Set   Members                                            Cardinality
F     c1 through c21, int1, int2, inu1, inu2, No Fault   |F| = 26
C     c1 through c21                                     |C| = 21
T     t1 through t18                                     |T| = 18
IN    int1, int2, inu1, inu2                             |IN| = 4
INU   inu1, inu2                                         |INU| = 2
INT   int1, int2                                         |INT| = 2
NF    No Fault                                           |NF| = 1
I     t1 through t18, int1, int2                         |I| = 20
E     all elements of F and I                            |E| = 44
UF    first occurrence of each failure signature         |UF| = 16
UI    first occurrence of each test signature            |UI| = 14

Groups are subsets of the set of elements that are not otherwise assigned to one of the labeled sets. They are mapped in accordance with the second article of this series. Group types include test, replaceable unit, multiple-failure, ambiguity, redundant-test, and feedback.

Computing testability

Using the information flow model, we can compute values for a number of measures associated with the ability to diagnose failures. Some of these measures concern maintenance factors observable in the field and were previously available only after a fault tree was developed. Because the information flow model incorporates the required system maintenance data, we can compute these measures without developing a fault tree. The following paragraphs examine measures that address ambiguity, feedback, the test set, and multiple failure. (Recall that our analysis assumes a single failure.1)

Ambiguity measures

Ambiguity exists when the tests provided in the information flow model cannot distinguish between two or more conclusions. Given that a group may contain one or more elements, an ambiguity group of cardinality 1 does not contribute to an ambiguity problem. The ability to distinguish among conclusions in the conclusion set is related to the failure signature SF_i. Ambiguous conclusions have identical failure signatures. Therefore, no combination of existing tests can distinguish among ambiguous conclusions.

Figure 3 on the next page shows two example ambiguity groups in the case study. Table 2 lists all the ambiguity groups in the case study. There are six ambiguity groups, each of which may or may not be significant (that is, require design changes) for meeting maintenance requirements.


Figure 3. Case study ambiguity analysis.

We have derived a number of measures to indicate the amount and type of ambiguity.

Isolation level. IL is the ratio of the number of isolatable groups to the number of isolatable elements. In our definition of unique conclusions, we defined the first element of each ambiguity group as being unique. Thus, the number of isolatable groups is |UF|, and

$$IL = \frac{|UF|}{|F|} \qquad (10)$$

The ideal value of IL would be 1.0000. For the case study, IL = 16/26 = 0.6154. Roughly 62% of the conclusions available can be drawn uniquely, which may or may not be a problem. If isolation to the element level is the design goal, there are serious problems in this case study. But if isolation to the group level is the goal, it may not be important that we achieve an isolation level of only 62%. The next measure clarifies the difference between the element level and the group level.
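A minimal sketch of the isolation-level calculation follows, assuming the failure signatures are available as tuples; the function name and toy data are illustrative only.

```python
# Isolation level: IL = |UF| / |F|, where UF keeps the first occurrence of each
# failure signature. Failure signatures are given here as tuples of test dependencies.
def isolation_level(failure_signatures):
    unique = len(set(failure_signatures))    # |UF| = number of isolatable groups
    return unique / len(failure_signatures)  # divided by |F|

# Toy example: four conclusions, two of which share a signature (one ambiguity group).
sfs = [(1, 0, 0), (1, 1, 0), (1, 1, 0), (0, 0, 1)]
print(isolation_level(sfs))  # 3/4 = 0.75; the case study gives 16/26 = 0.6154
```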

Operational isolation. A system's operational isolation level (OI[n]) is the percentage of observed faults that result in isolation to n or fewer replaceable units. To compute this measure, we must determine the number of replaceable units associated (ambiguous) with each conclusion in the model (a_i). For fault isolation conclusion f_i,

$$a_i = \sum_{k=1}^{|RU|} \begin{cases} 1; & \text{if } RU_k \text{ contains an element ambiguous with } f_i \\ 0; & \text{otherwise} \end{cases} \qquad (11)$$

Table 2. Ambiguity groups in the case study.

Group*   Members
1        c1, c2
2†       c7, c8, c9, c10
3        c11, c12
4        c13, No Fault
5        c16, c17, c18
6        c19, inu1

*Group data structure is in accordance with Sheppard and Simpson.2
†Ambiguity resulting from feedback.

Here |RU| represents the cardinality of the set of replaceable unit groups, and RU_k is the kth replaceable unit group. As shown in Figure 1, we have 13 replaceable unit groups (eight are the shaded defined groups, and five are the ungrouped conclusions: int1, int2, inu1, inu2, and No Fault). Table 3 provides the data necessary to compute a_i. The operational isolation is

$$OI[n] = \frac{1}{\sum_{f_i \in K} w_i} \sum_{f_i \in K} w_i \begin{cases} 1; & a_i \le n \\ 0; & \text{otherwise} \end{cases} \qquad (12)$$

Here w_i is a weighting factor associated with each fault isolation conclusion (usually the probability of occurrence), and K is a subset of F determined on the basis of the type of analysis being performed. A project office may specify operational isolation or something similar as part of the design criteria. The ideal value of OI[n] would be 1.0000 for every definition of operational isolation. Table 4 shows variations of the operational isolation measures for the case study.


Which operational isolation value is used depends on several factors, including the wording of specifications. For example, the last column of Table 4 shows values that include failure rate weighting but exclude inputs and No Fault. Thus, failures of inputs are not the responsibility of system testability, and nondetections (discussed in the next section) are not included in the calculation (they may be penalized separately).
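The sketch below illustrates Equation 12 under the assumption that each conclusion in the analysis set K carries its ambiguity count a_i and weight w_i; the names and toy values are placeholders, not case-study data.

```python
# Operational isolation OI[n]: weighted fraction of conclusions whose ambiguity
# spans n or fewer replaceable unit groups.
def operational_isolation(n, a, w):
    """a[i] = number of replaceable unit groups ambiguous with conclusion i,
    w[i] = weighting factor (for example, failure probability); both over the set K."""
    total = sum(w)
    hits = sum(wi for ai, wi in zip(a, w) if ai <= n)
    return hits / total

# Toy example with uniform weights over five conclusions.
a = [1, 2, 2, 3, 1]
w = [1.0] * len(a)
print(operational_isolation(1, a, w))  # 0.4
print(operational_isolation(2, a, w))  # 0.8
```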

Nondetection (ND). Of the six ambiguity groups in the case study (listed in Table 2), the fourth is of interest for nondetections. The ambiguity between c13 and No Fault indicates that we cannot detect the failure of c13 with the defined set of tests. Because no test depends on No Fault, SF_NoFault = (0, 0, ..., 0). Because any conclusion ambiguous with No Fault must have the same failure signature as No Fault, no tests in our test set detect a failure of c13. Thus, c13 is a nondetection item. We obtain a measure of nondetection by enumerating the occurrences of nondetections:

$$ND = \frac{\left(\displaystyle\sum_{f_i \in F} \begin{cases} 1; & SF_i = SF_{NoFault} \\ 0; & \text{otherwise} \end{cases}\right) - 1}{|F|} \qquad (13)$$

Table 3. Replaceable unit ambiguity groups in the case study: for each failed element (int1, int2, c1 through c21, inu1, inu2, and No Fault), the table lists its isolation ambiguity, the replaceable unit groups involved, a_i (Equation 11), and the failure frequency.*

*Failure frequencies are taken from Sheppard and Simpson2 in units of failures per 10,000 hours. The No Fault value (0.9095) is the analysis goal of the operational performance check (expected frequency of occurrence).

Table 4. Operational isolation values in the case study.

OI[n]   w_i = uniform, K = F   w_i = uniform, K = F − NF   w_i = failure prob.,* K = F   w_i = failure prob.,* K = F − NF   w_i = failure prob.,* K = F − IN   w_i = failure prob.,* K = F − IN − NF
1       0.3846                 0.4400                      0.0811                        0.9061                             0.0808                              0.9364
2       0.8462                 0.8400                      0.9975                        0.9724                             1.0000                              1.0000
3       1.0000                 1.0000                      1.0000                        1.0000                             1.0000                              1.0000

*Failure probability of the ith conclusion computed by normalizing its failure frequency (Table 3) over the failure frequencies of all conclusions.


The 1 in the numerator excludes the No Fault conclusion, which is only technically a nondetection.

The ideal value of ND would be 0.0000. For the case study, ND = (2 − 1)/26 = 0.0385.

In analyzing the impact of nondetections on the other measures, note that they will result in ambiguity between two or more replaceable unit groups, assuming No Fault is treated as a separate replaceable unit group. The case study illustrates the effect of nondetection on operational isolation. As shown in Table 4, when the probability of a conclusion's being drawn is 1/|F|, removing the No Fault conclusion from the analysis set K only slightly affects the operational isolation. However, when a high probability of drawing the No Fault conclusion exists, a significant difference in operational isolation values exists. The exact form of operational isolation is critical when we check for specification compliance, and we should control the computation to meet the letter of the specification.
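Because nondetections are simply conclusions whose failure signature matches the all-zero No Fault signature, the measure can be sketched directly from the signatures. The function name and toy data below are assumptions for illustration.

```python
# Nondetection ND: conclusions whose failure signature equals the all-zero
# No Fault signature are undetectable; the No Fault entry itself is excluded.
# Assumes the No Fault row is present among the signatures.
def nondetection(failure_signatures):
    zero = tuple(0 for _ in failure_signatures[0])
    undetected = sum(1 for sf in failure_signatures if sf == zero)
    return (undetected - 1) / len(failure_signatures)  # subtract the No Fault entry

# Toy example: No Fault plus one undetectable conclusion among four elements.
sfs = [(0, 0, 0), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
print(nondetection(sfs))  # (2 - 1)/4 = 0.25; the case study gives (2 - 1)/26 = 0.0385
```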

Feedback measures

Feedback in an information flow model represents a logical circularity. Such circularities may result from physical feedback, information flow feedback, or modeling errors. When we analyze information flow models, topological circularities most often correlate to physical feedback loops. We can determine the topological circularities for the case study by examining Figure 2. From the matrix representation, t_i is in feedback if and only if D_ii ≠ 0, and it is in topological feedback if and only if D_ii = "h" or "f." For all other cases, D_ii ≠ 0 indicates the test is in logical feedback. Thus, t6, t7, t8, t9, and t18 are in topological feedback, and t14, t15, t16, and t17 are in logical feedback. Further, the feedback loop, or circularity, has a unique failure signature. Therefore, any element in F having the same failure signature as a test in feedback is a member of the same feedback loop as the test.

Table 5. Circularity groups for the case study.

Group*   Failure signature (SF_i)   Members
1**      ...                        ...
2***     ...                        ...

*Group data structure is in accordance with Sheppard and Simpson.2
**Circularity is topological feedback.
***t13 has the same signature, but D'13,13 = 0.

In addition, any two tests in feedback having the same failure signature are members of the same feedback loop. Table 5 shows the circularity groups detected in the case study. Group 1 corresponds to topological feedback, which is identified by the "h" in the diagonal cell of D for each of the member tests in the table. We can verify the members of this feedback loop, including the conclusion elements, by carefully inspecting Figure 1.

The following measures indicate the amount and type of circularity. Because of the close relationship between topological circularity and physical feedback, we have restricted these measures to topological circularity.

Test feedback dominance. TFD is the fraction of information sources involved in topological feedback:

$$TFD = \frac{\displaystyle\sum_{i=1}^{|I|} \phi_i}{|I|}, \qquad \phi_i = \begin{cases} 1; & D'_{ii} = 1 \\ 0; & \text{otherwise} \end{cases} \qquad (14)$$

where D' is the dependency matrix following transitive closure but before logical closure (see our previous article2 for a complete discussion of the closure algorithms).

The ideal value of TFD would be 0.0000 (no feedback). For the case study, TFD = 5/20 = 0.2500. In other words, 25% of the information sources are tied up in feedback. In this case, only one feedback loop exists.

Component feedback dominance. CFD is the fraction of conclusions involved in topological feedback:

$$CFD = \frac{\displaystyle\sum_{i=1}^{|F|} \delta_i}{|F|}, \qquad \delta_i = \begin{cases} 1; & f_i \text{ is a member of a topological feedback loop} \\ 0; & \text{otherwise} \end{cases} \qquad (15)$$

CFD is important because it is a measure of ambiguity caused by feedback. Recall that each failure signature is the same in a circularity; that is also the definition of ambiguity.

The ideal value of CFD would be 0.0000 (no feedback). For the case study, CFD = 4/26 = 0.1538. In other words, 15% of the conclusions are tied up in feedback. Again, only one feedback loop exists.
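A small sketch of the two feedback-dominance measures, assuming D' is available as a square list of lists over the information sources and that the failure signatures of the feedback loops are known; names and toy data are illustrative.

```python
# Test feedback dominance (Eq. 14) and component feedback dominance (Eq. 15).
def test_feedback_dominance(d_prime):
    """Fraction of information sources with a nonzero diagonal entry in D'."""
    n = len(d_prime)
    return sum(1 for i in range(n) if d_prime[i][i]) / n

def component_feedback_dominance(failure_signatures, feedback_signatures):
    """Fraction of conclusions whose failure signature equals that of a feedback loop."""
    fb = set(feedback_signatures)
    return sum(1 for sf in failure_signatures if sf in fb) / len(failure_signatures)

# Toy example: 4 information sources, one of which (index 1) is in topological feedback.
d_prime = [[0, 1, 0, 0],
           [0, 1, 0, 0],   # nonzero diagonal entry -> feedback
           [1, 0, 0, 0],
           [0, 0, 1, 0]]
print(test_feedback_dominance(d_prime))  # 1/4 = 0.25, as in the case study's 5/20

sfs = [(0, 1, 1, 0), (1, 1, 0, 0), (0, 1, 1, 0)]
loop_sigs = [(0, 1, 1, 0)]  # failure signature of the single feedback loop
print(component_feedback_dominance(sfs, loop_sigs))  # 2/3
```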

Feedback modifications. Although feedback is a testability problem, physical feedback often is necessary for the system to perform. It is important that changes made to a system to improve testability do not affect system performance. Several options are available to eliminate the effect of circularity that results from physical feedback: saturating the feedback circuit, conducting test measurement before feedback occurs, inserting feedback loop breaks that are engaged only during a test cycle, and repackaging feedback loops into a single, replaceable element. We can modify the basic measures to reflect these options. For example, we modify the isolation level to ignore ambiguity resulting from feedback, as follows:

$$FMIL = \frac{|UF|}{|F| - \displaystyle\sum_{i=1}^{|F|} \delta_i + L} \qquad (16)$$

where FMIL is the feedback-modified isolation level, δ_i is as defined in Equation 15, and L is the number of feedback loops. The summation removes all elements in feedback, and L adds back one element for each feedback loop.

The ideal value of FMIL would be 1.0000. For the case study, FMIL = 16/(26 − 4 + 1) = 0.6957. The difference between FMIL and IL indicates that we can achieve an 8% improvement in component uniqueness by repackaging the feedback loop.

Test set measures

To utilize an effective mix of test resources, we must consider several issues related to determining tests for a system, including test adequacy, sufficiency, consistency, and efficiency. The following measures provide insight into the usefulness of the test set in relation to these issues.

Test leverage. TL measures the robustness of the test set, that is, the relative capability of the test set to determine system health:

$$TL = \frac{|I|}{|F|} \qquad (17)$$

For the case study, TL = 20/26 = 0.7692.

We derive theoretical limits of TL to provide a basis for comparing the actual test set with an ideal set. These limits, TLMAX and TLMIN, let us determine how appropriately we have specified the test set. In other words, is the test set overspecified (overtesting), underspecified (undertesting), or appropriate for the system?

Overtesting. We assign an upper limit to the test leverage by specifying one test for each conclusion. The absence of a failed-test outcome indicates No Fault, so under this extreme the number of tests would be |F| − |NF| or |F| − 1. The maximum test leverage (TLMAX) would then be

$$TLMAX = \frac{|F| - 1}{|F|} \qquad (18)$$

If TL exceeds this value, the test set has been overspecified.

For the case study, TLMAX = (26 − 1)/26 = 0.9615. Note that for extremely large systems, |F| − 1 ≅ |F|, and TLMAX ≅ 1.0:

$$\lim_{|F| \to \infty} \frac{|F| - 1}{|F|} = 1.0 \qquad (19)$$

Undertesting. To determine the minimum amount of testing needed to accomplish a diagnosis, we consider some fault isolation theory. For single failures and binary (two-value) tests, a test partitions the set of fault isolation conclusions F into two subsets.1 (This assumes binary-outcome tests. Multiple-outcome tests may actually partition F into many subsets. The modeling approach considers mostly binary-outcome tests.) The subsets include elements that are still feasible after a test outcome and elements that are not feasible after a test outcome. Because we do not know the outcome of a test before we execute it, the most efficient partition results from a test that divides F into two equal subsets.

The next test is most efficient if it again divides the feasible subset in half. This process, called the half-interval technique, is one method by which we can accomplish fault isolation. Real systems can rarely follow half-interval fault isolation, but the half-interval case represents the lowest number of tests for which we can accomplish diagnosis to the element level. Under this assumption, conducting one test reduces the feasible set from |F| to |F|/2, conducting two tests reduces the set to (|F|/2)/2, and so on. Conducting n tests reduces the feasible set to |F|/2^n. Fault isolation results when |F|/2^n = 1, so we want the minimum number of tests for fault isolation, n = log2 |F|. From this, we specify the lower bound of TL, or TLMIN:

$$TLMIN = \frac{\log_2 |F|}{|F|} \qquad (20)$$

Any value of TL that is less than this value indicates that the test set has been underspecified.

For the case study, TLMIN = (log2 26)/26 = 0.1808.

Test leverage modifications. We modify TL to prevent some elements in the model from affecting the TL value. For example, we compute input-modified test leverage (IMTL) as

$$IMTL = \frac{|I| - |INT|}{|F| - |IN|} \qquad (21)$$

Further, we compute feedback-modified test leverage (FMTL), which excludes tests and conclusions involved in feedback, as

$$FMTL = \frac{|I| - \displaystyle\sum_{i=1}^{|I|} \phi_i + L}{|F| - \displaystyle\sum_{i=1}^{|F|} \delta_i + L} \qquad (22)$$


where φ_i indicates, as in Equation 14, that information source i is in topological feedback, and δ_i is as defined in Equation 15. We can also modify TL for redundancy; this is discussed later.
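The test-leverage figures reduce to simple functions of the set cardinalities. The sketch below, with assumed function names, reproduces the case-study values quoted above from |I| = 20 and |F| = 26.

```python
import math

# Test leverage and its theoretical bounds (Equations 17, 18, and 20).
def test_leverage(num_sources, num_conclusions):
    return num_sources / num_conclusions            # TL = |I| / |F|

def tl_max(num_conclusions):
    return (num_conclusions - 1) / num_conclusions  # one test per conclusion, less No Fault

def tl_min(num_conclusions):
    return math.log2(num_conclusions) / num_conclusions  # half-interval lower bound

# Case-study cardinalities: |I| = 20 information sources, |F| = 26 conclusions.
print(test_leverage(20, 26))  # 0.7692
print(tl_max(26))             # 0.9615
print(tl_min(26))             # 0.1808
```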

Test uniqueness. TU measures the degree to which the tests in the test set provide unique information:

$$TU = \frac{|UI|}{|I|} \qquad (23)$$

The ideal value of TU would be 1.0000. For the case study, TU = 14/20 = 0.7000. Recall that the first occurrence of a configuration of ST_j is considered unique, so that each redundancy group is represented by one element in the set UI.

Test redundancy. We identify test redundancy (TR) whenever two or more test signatures are identical, that is, whenever ST_i = ST_j. Redundancy simply means that the evaluation of any member of the test redundancy group will provide the same information as the evaluation of any other member of that group. Figure 4 highlights two redundancy groups in the case study. Table 6 lists all the redundancy groups, which are identical to the circularity groups (Table 5), as is generally the case.

TR is the complement of test uniqueness. It is a measure of the percentage of tests that we can consider for elimination due to redundancy:

$$TR = 1 - TU = 1 - \frac{|UI|}{|I|} \qquad (24)$$

The ideal value of TR would be 0.0000. For the case study, TR = 1 − 14/20 = 0.3000.
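Test uniqueness, test redundancy, and the redundancy groups themselves all come from the same first-occurrence bookkeeping over test signatures; a short sketch with assumed names and toy data:

```python
from collections import defaultdict

# Test uniqueness TU = |UI|/|I| (Eq. 23), test redundancy TR = 1 - TU (Eq. 24),
# and the redundancy groups themselves (tests sharing a test signature).
def redundancy_groups(test_signatures):
    groups = defaultdict(list)
    for idx, sig in enumerate(test_signatures):
        groups[sig].append(idx)
    return [members for members in groups.values() if len(members) > 1]

def test_uniqueness(test_signatures):
    return len(set(test_signatures)) / len(test_signatures)

# Toy example: five tests, the last two of which carry identical signatures.
sts = [(1, 0), (0, 1), (1, 1), (0, 0), (0, 0)]
print(test_uniqueness(sts))      # 4/5 = 0.8; the case study gives 14/20 = 0.7
print(1 - test_uniqueness(sts))  # TR
print(redundancy_groups(sts))    # [[3, 4]]
```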

Redundancy modifications. Test redundancy may be desirable, and we can modify certain measures to ignore redundant tests. For example, the test leverage may indicate that the test set is overspecified because of desirable redundant tests. A new measure, nonredundant test leverage (NRTL), may be a more reasonable value to compare with TLMAX:

$$NRTL = \frac{|UI|}{|F|} \qquad (25)$$

For the case study, NRTL = 14/26 = 0.5385.

Figure 4. Case study redundancy analysis: test-to-test dependency matrix (a); test-to-conclusion dependency matrix (b).


Table 6. Redundancy groups for the case study.

Group*   Test signature (ST_i)**   Members
1        ...                       t6, t7, t8, t9, t18
2        ...                       t14, t17
3        ...                       t15, t16

*Group data structure is in accordance with Sheppard and Simpson.2
**For ease of cross-reference, i denotes the transition point between the test-to-test and test-to-conclusion matrices (Figure 4a, b).

Excess-test candidates. In assessing the value of the test set, one major objective is to minimize the number of tests required. Because each test must be specified, developed, documented, and validated, reducing the number can reduce costs significantly. Frequently, ad hoc methods of developing system diagnostics result in overtesting. A natural question for an analyst to ask is which tests can be eliminated without creating additional ambiguities among the conclusions. These tests are excess-test candidates, which can be individually eliminated if the analyst is satisfied with the system's single-failure testability and does not anticipate false alarms (discussed later in this article).

Using the matrix formulation, we identify ambiguity when conclusions have equal failure signatures. In considering a test for excess-test candidacy, we want to determine whether ambiguities will occur when that test is eliminated. If a new ambiguity is created, the test is not an excess-test candidate; if no new ambiguity is created, the test is a candidate. The former situation is shown in Figure 5 for t2 and t11. Neither test can be considered an excess-test candidate because the elimination of either will create new ambiguity groups.
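The candidacy check amounts to deleting one column from every failure signature and asking whether any previously distinct signatures collapse together. A sketch under that reading, with illustrative names and toy data:

```python
# A test is an excess-test candidate if removing its column from every failure
# signature creates no new pair of identical signatures among the unique conclusions.
def is_excess_candidate(test_index, failure_signatures):
    def drop(sig):
        return sig[:test_index] + sig[test_index + 1:]
    unique_before = set(failure_signatures)       # one representative per group
    reduced = {drop(sig) for sig in unique_before}
    return len(reduced) == len(unique_before)     # no signatures collapsed

# Toy example: removing test 0 merges the first two conclusions, so it is not a candidate.
sfs = [(1, 0, 1), (0, 0, 1), (1, 1, 0)]
print(is_excess_candidate(0, sfs))  # False
print(is_excess_candidate(1, sfs))  # True
```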

Table 7 on the next page shows the result of eliminating each test in the test set. Testable inputs are not usually considered in the analysis, but they could be. If they are considered, both int1 and int2 are excess-test candidates.

Figure 5. Case study excess-test analysis.


We define excess-test measures in terms of either conclusions or replaceable unit groups. The excess-test measure for conclusions is

$$XMC = \frac{1}{|T|} \sum_{t_i \in T} (1 - \gamma_i), \qquad \gamma_i = \begin{cases} 0; & \text{iff, given } T - \{t_i\},\ SF_j \neq SF_k \ \ \forall f_j, f_k \in UF\ (j \neq k) \\ 1; & \text{otherwise} \end{cases} \qquad (26)$$

The excess-test measure for replaceable unit groups, XMR (Equation 27), replaces γ_i with an analogous indicator θ_i that registers only new replaceable unit ambiguity (RUA), that is, ambiguity created between replaceable unit groups rather than between individual conclusions. The excess-test measures including inputs, XMIC (Equation 28) and XMIR, extend the sums over all of I, counting the testable inputs as well as the tests, and divide by |I|.

Table 7. Ambiguity groups created by eliminating individual tests: for each test eliminated (t1 through t18), the new conclusion ambiguity created (if any), the new replaceable unit ambiguity created (if any), and whether the test is an excess-test candidate at the conclusion level or the replaceable unit level.

The ideal value for XMC, XMR, XMIC, and XMIR would be 0.0000, without multiple-failure or false-alarm considerations. For the case study, XMC = 0.5560, XMR = 0.8890, XMIC = 0.6000, and XMIR = 0.9000, indicating that we can make significant reductions in the test set.

Excess-test analysis. The analysis we have just performed to determine which tests are excess-test candidates is not sufficient for recommending which tests we will actually eliminate. Clearly, the elimination of one test may affect our ability to eliminate another. For example, in Figure 4, we can see that t14 and t17 are redundant. But careful examination of Figure 4b indicates that if we eliminate both t14 and t17, c19 and c20 become ambiguous. For the analysis to recommend which tests to eliminate, it must first consider each test in terms of weighting criteria defined by the analyst. We will address multiple-criteria analysis in detail in an article on fault isolation later in this series; here we give only an overview of the use of multiple criteria to evaluate excess-test candidates.

First, we compute desirability values for each test, based on the criteria and weights determined by the analyst. We then rank the tests in decreasing order of desirability of having the test performed. The idea is to consider less desirable tests first for elimination. The desirability values may be based on multiple criteria, including (but not limited to) the supplied weights. Finally, we examine the tests serially. We eliminate the first test. If that increases the ambiguity of the system, we return the test to the model; otherwise, we update the model to reflect the elimination of the test. Then we go on to the next test, stopping the process when all tests have been considered.
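The serial procedure just described can be sketched as a greedy loop over tests ordered by desirability. The ranking input and the names below are placeholders for the multiple-criteria weighting the article defers to a later installment.

```python
# Greedy excess-test elimination: consider tests from least to most desirable,
# drop a test only if dropping it (given everything already dropped) creates
# no new ambiguity among the remaining failure signatures.
def eliminate_tests(failure_signatures, desirability):
    kept = set(range(len(failure_signatures[0])))          # indices of surviving tests
    def project(sig, cols):
        return tuple(sig[c] for c in sorted(cols))
    baseline = len({project(sf, kept) for sf in failure_signatures})
    removed = []
    for t in sorted(kept, key=lambda t: desirability[t]):  # least desirable first
        trial = kept - {t}
        if len({project(sf, trial) for sf in failure_signatures}) == baseline:
            kept = trial
            removed.append(t)
    return removed, kept

# Toy example: three tests with made-up desirability weights.
sfs = [(1, 0, 1), (0, 0, 1), (1, 1, 1)]
removed, kept = eliminate_tests(sfs, desirability=[0.9, 0.2, 0.5])
print(removed, kept)  # test 2 can go; tests 0 and 1 must stay
```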

When we apply this process to the case study, using the failure frequency data of Table 3 to compute weighting values and limiting isolation to replaceable unit groups, the analysis recommends t3, t5, t6, t7, t8, t9, t11, t16, and t17 for elimination. Many of these tests are recommended for elimination because they are redundant. For example, the algorithm saves t18 from redundancy group 1, t14 from redundancy group 2, and t15 from redundancy group 3, thereby recommending elimination of the rest (see Table 6 for the redundancy groups). The algorithm eliminates t5 because it is excess. The algorithm eliminates t11 and t3 because they cause no redundancy in replaceable unit groups and are therefore excess at the replaceable unit level. However, the algorithm makes no recommendation about several tests that are candidates in Table 7 (among them t1, t12, and t13), because the interactions of eliminations may preclude elimination of a specific test.

False alarms. False alarms are usually associated with built-in test, although they may occur in any type of diagnostic testing. As defined by military standards, a false alarm is an indication of failure in a system where no failure exists.5

False alarms result from imperfect testing. The better we understand a process or technology, the more accurate the testing becomes. False alarms generally become a problem when system complexity becomes great or the design pushes the state of the art. Because we cannot actually measure false alarms in the field,6,7 specifications should be based on cannot duplicate (CND) events instead of false alarms. We will address this issue in detail later in this series.

The following are four viable solutions to false-alarm problems:

■ Improving test science. We can avoid false alarms by sampling more often, modeling in greater detail, and accounting for a greater number of variables. In the case of built-in test, these steps create an increased software requirement and may require the addition of sensors to the built-in test equipment.
■ Increasing test tolerances. We can avoid false alarms by making the test less sensitive to anomalous behavior. Unfortunately, this may reduce the test's ability to detect real failures.
■ Conducting repeat polling. In repeat polling, we try to avoid false alarms by executing a test repeatedly. Each time the test is evaluated, the test algorithm uses the results to confirm any previous executions. Repeat polling is intended to allow transient characteristics to work their way through the system without triggering a failure indication. Repeat polling requirements are usually written as, for example, "three or more indications within 250 milliseconds." As this solution also may lead to missed detections, a better approach is to recognize transient characteristics by means of the first solution, improved test science.
■ Crosscorrelating test information. We can correlate an anomalous indication with other testing to either confirm or deny the original information. The information flow model can analyze this technique to assist in planning for false-alarm prosecution.

False-alarm tolerance. False-alarm tolerance (FAT) is a measure of our ability to perform test-to-test crosschecking. The test-to-test matrix in Figure 2a is a complete map of the higher order interrelationships between the tests. Recall that this map is generated in closure. FAT is the average percentage of tests in the test set that we can use as verifiers. For example, we can verify an anomalous outcome of t13 using t14, t15, t16, and t17 (the values in ST_t13). False-alarm tolerance then is given by

$$FAT = \frac{\displaystyle\sum_{i=1}^{|I|}\sum_{j=1}^{|I|} \begin{cases} 1; & D_{ij} = 1,\ i \neq j \\ 0; & \text{otherwise} \end{cases}}{|I|\,(|I| - 1)}$$

There is no ideal value for FAT, but a fully tested serial system will have a value of 0.5000. For the case study, FAT = 183/380 = 0.4816.

FAT typically decreases as systems become larger and excess tests are removed. We have found that real systems with FAT values below 10% should be carefully analyzed. One way to maintain a high false-alarm tolerance is to retain redundant and excess tests. Of course, merely having these tests available is not sufficient; they must be used in the diagnostic strategy, which means that additional testing will be specified. We will address this point in a future article in this series.
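Since FAT is the density of off-diagonal dependencies in the closed test-to-test matrix, it can be sketched in a few lines; the list-of-lists layout for D and the toy matrix are assumptions.

```python
# False-alarm tolerance: fraction of ordered pairs of information sources (i, j),
# i != j, for which the closed test-to-test matrix records a dependency.
def false_alarm_tolerance(d):
    n = len(d)
    verifiers = sum(1 for i in range(n) for j in range(n) if i != j and d[i][j])
    return verifiers / (n * (n - 1))

# Toy 4-test matrix with five off-diagonal dependencies.
d = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]]
print(false_alarm_tolerance(d))  # 5/12 = 0.4167; the case study gives 183/380 = 0.4816
```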

Multiple failure measures

Multiple failure is both a testability and a diagnosis problem. As we will discuss in future articles on diagnosis, the basic problem formulation is amenable to both single- and multiple-failure diagnostic paradigms. However, there are two cases in which we must be careful to provide testability in a multiple-failure situation because diagnosis cannot solve the corresponding problems. The first case is hidden failures, the masking of symptoms that occurs in most systems. The second is false failures, situations in which symptoms lead to wrong conclusions.

Hidden failures. The failure of one element may prevent the test set from determining that a multiple failure even exists. For example, if c11 in Figure 2 were to fail, the failure would be detected by t10, t11, t12, t13, t14, t15, t16, and t17. This is the failure signature of c11 (SF_c11) discussed earlier in this article. In addition, if c19 were to fail, it would be detected by t14, t15, t16, and t17. A joint failure of c11 and c19 would generally exhibit all the symptoms of each failure, which are just the symptoms of c11 (that is, all the symptoms of c19 are contained in c11). There is no symptom to separate this multiple failure from either of the single failures.

Even this situation is not generally a problem. Repair of c11 and subsequent testing will isolate c19. This is called peeling and is one of the rationales used in developing single-failure isolation techniques. Peeling works well except where the hidden failure is the cause (sometimes referred to as a root cause) of the isolated failure. Then, the repair of c11 is ineffective because the failure in c19 will again cause c11 to fail, and subsequent diagnosis will again identify c11. This situation occurs whether we use single- or multiple-failure diagnostic approaches.

Table 8. Hidden failures in the case study: for each element (the conclusions, inputs, and No Fault), the potential hidden failures in a multiple-failure situation.

Table 8 lists all the hidden failures for each conclusion in the case study. Analysis of this list will determine if root cause relationships exist. Through improved testability or maintenance-manual directions to repair the combination of components (that is, repair both c11 and c19), we can diagnose any failure in the second column that is a root cause of a failure in the corresponding row of the first column. If we do not improve testability, diagnosis will fail to identify the complete problem. We obtain a measure of the size of the analysis set by computing the average percentage of failures that may be hidden by any conclusion reached (HF):

$$HF = \frac{\displaystyle\sum_{i=1}^{|UF|}\sum_{j=1}^{|UF|} \Phi_{ij}}{(|UF| - 2)^2} \qquad (30)$$

$$\Phi_{ij} = \begin{cases} 1; & f_i, f_j \in UF\ (i \neq j),\ SF_i \subset SF_j \\ 0; & \text{otherwise} \end{cases} \qquad (31)$$

Here Φ_ij is the actual existence of a hidden-failure relationship. Note that we count only unique elements in F. The 2 in the denominator accounts for the fact that two elements have no hidden failures (No Fault and at least one other element).


For the case study, HF = 94/196 = 0.4796. On average, about 48% of the reachable conclusions are hidden from a given failure. We can modify HF to exclude inputs; inputs generally hide a large number of other failures and are not subject to root cause:

$$IMHF = \frac{\displaystyle\sum_{f_i \in UF - IN}\ \sum_{f_j \in UF - IN} \Phi_{ij}}{(|UF| - 2 - n)^2} \qquad (32)$$

where Φ_ij is as defined in Equation 31 and n is the number of inputs that do not belong to any ambiguity group of a size greater than 1.

For the case study, IMHF = 68/[(16 − 2 − 2)²] = 0.4722.
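Hidden-failure relationships are containments between failure signatures, so the HF bookkeeping can be sketched as below; the names, the toy signatures, and the proper-subset reading of Equation 31 are assumptions for illustration.

```python
# Hidden-failure relationships: a pair of unique conclusions is related when one
# failure signature is a proper subset of the other (one failure's symptoms are
# completely masked by the other's). HF averages the count of such ordered pairs.
def as_set(sig):
    return {k for k, bit in enumerate(sig) if bit}

def hidden_failure_pairs(unique_sfs):
    sets = [as_set(sf) for sf in unique_sfs]
    return [(i, j) for i in range(len(sets)) for j in range(len(sets))
            if i != j and sets[i] < sets[j]]      # proper containment

def hf_measure(unique_sfs):
    return len(hidden_failure_pairs(unique_sfs)) / (len(unique_sfs) - 2) ** 2

# Toy example over four unique conclusions.
sfs = [(1, 1, 1, 0), (1, 1, 0, 0), (0, 0, 1, 1), (0, 0, 0, 1)]
print(hidden_failure_pairs(sfs))  # [(1, 0), (3, 2)]
print(hf_measure(sfs))            # 2/4 = 0.5; the case study value is 94/196 = 0.4796
```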

False failures. The second class of multiple failures that will render ineffective both single- and multiple-conclusion diagnosis is the false failure. A false-failure indication occurs when the combined symptoms of two or more failures are identical to the symptoms presented by a single failure that is not a member of the failure set. We compute false failure by examining the union of failure signatures for a conclusion's hidden failures to see if there is a match. Figure 6 illustrates the process for pairs of failures. We repeat this process for three failures, four failures, and so on. The only conclusion in the case study that has a potential false-failure problem is c6. The associated failure pairs are (c1, c5) and (c2, c5). Mathematically, the false-failure condition exists if the following equality holds:

$$\rho_i:\quad \bigcup_{f_j \in G} SF_j = SF_i \ \text{ for some set } G \text{ of two or more hidden failures of } f_i,\quad SF_i \neq \Gamma \qquad (33)$$

where ρ is a logical variable and Γ denotes the null vector. We can compute a measure that provides the extent to which false failures may be a problem:

$$FF = \frac{\displaystyle\sum_{f_i \in F} \upsilon_i}{|F| - 2}, \qquad \upsilon_i = \begin{cases} 1; & \text{iff } \rho_i \text{ is true} \\ 0; & \text{otherwise} \end{cases} \qquad (34)$$

Here FF is the fraction of elements potentially falsely identified, and ρ is as defined in Equation 33. Two elements are removed for the No Fault and minimum-failure signature (other than No Fault) in the denominator.

The ideal value of FF would be 0.0000. For the case study, FF = 1/24 = 0.0417. We can modify this measure to exclude inputs:

$$IMFF = \frac{\displaystyle\sum_{f_i \in F - IN} \upsilon_i}{|F| - |IN| - 2} \qquad (35)$$

where υ_i is as defined in Equation 34.


The ideal value of IMFF would be 0.0000. For the case study, IMFF = 1/20 = 0.0500.
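Following the Figure 6 discussion, the pairwise false-failure check unions the signatures of a conclusion's hidden failures and looks for an exact match. The sketch below uses assumed names and toy data and covers only failure pairs.

```python
from itertools import combinations

# False failures (pairwise case): a conclusion i is potentially falsely identified
# if the union of the failure signatures of two of its hidden failures equals SF_i.
def false_failure_candidates(sfs):
    as_sets = [{k for k, bit in enumerate(sf) if bit} for sf in sfs]
    candidates = []
    for i, target in enumerate(as_sets):
        hidden = [s for j, s in enumerate(as_sets) if j != i and s and s < target]
        if any(a | b == target for a, b in combinations(hidden, 2)):
            candidates.append(i)
    return candidates

# Toy example: conclusion 0's signature is exactly the union of conclusions 1 and 2.
sfs = [(1, 1, 1, 0), (1, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
print(false_failure_candidates(sfs))  # [0]
```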

The mathematical formulation of diagnostic information flow allows us to closely examine and assess the testability of complex systems. The analysis demonstrated in this article is an initial analysis because it only demonstrates where the deficiencies in the system are located.

In developing any complex system, we can minimize maintenance costs if we identify testability problems and correct them early in system design. After applying the testability assessment techniques, our next step is to interpret the results and recommend design changes to improve system testability. Such improvements might include eliminating or reorganizing existing tests, developing new tests (and maybe test points), repackaging components to improve operational isolation, locating feedback loops, implementing crosschecks for false alarms, and improving the observability of some failures to minimize the effects of root causes or false failures.

We've explained how to identify the problem areas in a system design. We must point out, however, that the measures defined are indicators of testability and have no predetermined thresholds for use in the analysis. Analysts must use their own engineering judgment in the context of the problem being studied to determine what the indicators mean. We will demonstrate how to apply the testability analysis to redesigning the case study system in the next article in this series.

References

1. W. Simpson and J. Sheppard, "System Complexity and Integrated Diagnostics," IEEE Design & Test of Computers, Vol. 8, No. 3, Sept. 1991, pp. 16-30.
2. J. Sheppard and W. Simpson, "A Mathematical Model for Integrated Diagnostics," IEEE Design & Test of Computers, Vol. 8, No. 4, Dec. 1991, pp. 12-25.
3. F. Johnson and R. Unkle, "The System Testability and Maintenance Program (STAMP): A Testability Assessment Tool for Aerospace Systems," Proc. AIAA/NASA Symp. on Maintainability of Aerospace Systems, AIAA, New York, 1989.
4. J. Sheppard and W. Simpson, "Incorporating Model-Based Reasoning in Interactive Maintenance Aids," Proc. 42nd Nat'l Aerospace and Electronics Conf., IEEE Press, New York, 1990, pp. 1238-1242.
5. Definitions of Terms for Test, Measurement, and Diagnostic Equipment, MIL-STD-1309B, Washington, D.C., May 1975.
6. W. Simpson et al., "Prediction and Analysis of Testability Attributes: Organizational-Level Testability Prediction," RADC-TR-85-268, Rome Air Development Center, Griffiss AFB, N.Y., Feb. 1986.
7. E. Gilreath, B. Kelley, and W. Simpson, "Predictors of Organizational Level Testability Attributes," 1511-02-24179, Arinc Research Corp., Annapolis, Md., Nov. 1986.

Direct questions or comments on this article to either author at Arinc Research Corp., Advanced Research and Development Group, 2551 Riva Rd., Annapolis, MD 21401. Sheppard's e-mail address is sheppard@cs.jhu.edu.

John W. Sheppard is a senior research analyst in the Advanced Research and Development Group at Arinc Research Corp. He is also pursuing a PhD in computer science at Johns Hopkins University. His research interests include applying AI techniques to fault diagnosis, machine learning, neural networks, and nonstandard logic. He has developed algorithms to diagnose system failures, verify knowledge bases, and classify software. He was also a principal developer of Pointer, an intelligent, interactive maintenance aid, and assisted in the development of a prototype expert system that diagnoses system failures and reconfigures the system to keep functioning. He holds a BS from Southern Methodist University and an MS from Johns Hopkins University, both in computer science.

William Simpson, a research fellow in the Advanced Research and Development Group at Arinc Research Corp., works with testability and fault diagnosis. He helped develop the System Testability and Maintenance Program, which is based on an information flow model. He was also a principal developer of Pointer. He holds a BS from Virginia Polytechnic Institute and State University and an MS and a PhD in aerospace engineering from Ohio State University.
