Expert Systems With Applications, Vol. 1, pp. 249-269, 1990. 0957-4174/90 $3.00 + .00. Printed in the USA. © 1990 Pergamon Press plc

Dynamic Testing of Knowledge Bases Using the Heuristic Testing Approach

LANCE A. MILLER

Science Applications International Corporation, McLean, VA, USA

Abstract--How to develop knowledge-based and expert systems is today becoming more and more well understood; how to test these systems still poses some challenges. There has been considerable progress in developing techniques for static testing of these systems, checking for problems via formal examination methods, but there has been almost no work on dynamic testing, testing the systems under operating conditions. A novel approach for the dynamic testing of expert system rule bases is presented. This approach, Heuristic Testing, is based on the idea of first testing systems for disastrous safety and integrity problems before testing for primary functions and other classes of problems, and a prioritized series of 10 classes of faults is identified. The Heuristic Testing approach is intended to assure software reliability rather than simply find defects; the reliability is based on the 10 fault classes and is called competent reliability. General procedures for conceptualizing and generating test cases were developed for all fault classes, including a Generic Testing Method for generating key test-case values. One of the classes, error-metric, illustrates how complexity metrics, now used for predicting conventional software problems, could be developed for expert system rule bases. Two key themes are automation (automatically generating test cases) and fix-as-you-go testing (fixing a problem before continuing to test). The overall approach may be generalizable to static rule base testing, to testing of other expert system components, to testing of other nonconventional systems such as neural networks and object-oriented systems, and even to conventional software.

1. INTRODUCTION

FIVE YEARS AGO THERE was almost no activity concerned with the testing of expert systems, but interest has grown rapidly in the interim, with special workshops and more and more sections on the topic at software, expert systems, and general AI conferences.

It is useful for understanding expert system testing to contrast it with conventional software testing. The principle underlying both types of software testing is that one needs to test the system in two ways: (1) each component separately, and (2) as an integrated whole. In conventional programs the whole is a recursive composition of elemental procedures, and the complete panoply of testing procedures is applicable at all levels (though experts may disagree concerning the advantages of extensive component testing; e.g., see Boehm, 1981, pp. 382-386; Hoffman & Brealey, 1989).

Requests for reprints should be sent to Lance A. Miller, Science Applications International Corporation, McLean, VA 22102.

This testing proposal grows out of, in part, work performed by Science Applications International Corporation under contract to the Electric Power Research Institute concerning the Verification and Validation of Expert Systems (see Groundwater, 1989; Groundwater & Blanks, 1988; Groundwater, Donnell, & Archer, 1987; Kirk & Murray, 1988; Miller, 1989b). It also derives from work while the author participated as an industry advisor to NASA Space Station Software Environment activities, Reston, VA.

Expert systems, however, are composed of nonhomogeneous components; the knowledge base is considerably different from the inference engine, and both contrast strongly with typical system shell features (which provide the user environment and various utilities; see Table 1). The same testing methods cannot be applied to all components equally well. Because the task of testing--and fixing--components depends so much on the specific component, it makes a great deal of sense to test expert systems quite thoroughly, component by component, before system test.

Some of the growing concern has been on when to test and has emphasized life cycles appropriate for expert system development, with testing accomplished during the various life cycle stages (e.g., Culbert, Riley, & Savely, 1987; Miller, 1989b). Concerning what to test, almost all of the attention has been on static tests of knowledge bases--using formalized inspection procedures to search system components for problems when the system is not in operation (e.g., Kusiak, 1989; Miller, 1989a; Stachowitz & Coombs, 1987; there is also some work which focuses on specialized techniques for different components, see Miller, 1989c; Rushby, 1988).



It is well understood, however, that dynamic testing of knowledge bases--testing them under actual operating conditions--is needed to assure the reliability of expert systems. No matter how thorough and effective static testing is (e.g., see Mills, Dyer, & Linger, 1987), the bottom line is really whether the system performs correctly under real operational conditions. Trying to judge operating performance from static examinations of the knowledge base is much harder than with conventional software because of the inherent absence of modularity and organizing structures in rule bases; thus, dynamic testing is essential for assessing operational validity of rule-based expert systems. Dynamic testing of rule bases is also needed because of the possible nondeterministic interaction of the inference engine with the rule set, such that the same input conditions could lead to importantly different outcomes. And, finally, with more and more expert systems being embedded in real-time larger systems, the effects of failures in expert systems can be propagated well beyond their own boundaries; responsible program development managers will rightly insist on many kinds of testing, including dynamic.

A methodology is proposed in this paper for dynamic testing of one component of expert systems: knowledge bases which are represented explicitly as declarative IF-THEN rules. Left for the future are the questions of dynamic testing of other forms of knowledge-base representations (e.g., frames, objects, conventional data structures such as relational databases, and even neural networks); postponed, as well, are more extended treatments of the problems of dynamic testing of expert system components other than the knowledge base. The reason for focusing first on the knowledge base is its criticality; the rationale for concentrating on the rule form is its greater prevalence.

Nonetheless, the philosophy of the approach described in the paper would seem to apply to all components and the system as a whole. It rests on extending the distinction between software reliability and the proportion of bugs remaining (after some series of tests). The latter concerns the defect rates still remaining, normalized against the amount of code, for example, "1 defect per 1000 lines of source code." In this view a bug is a bug; they all equally contribute to the defect rate. In the former view of software reliability, the concern is not with bugs or defects per se, but with their effects: How long is the system up before it crashes (or suffers some other performance loss), what is the mean time between failures, how long before the system comes back up? Bugs are differentially more important to the extent that they negatively influence these measures. Highly reliable systems, then, are ones whose (remaining) bugs don't degrade availability very much. This notion is here extended to apply not only to availability but to the consequences of encountered bugs on the environment and the system itself. This is called competent reliability to indicate that the system has been tested so as to retain maximum competency--that is, it will much more likely fail in some inconsequential way than catastrophically or harmfully.

The key to insuring competent reliability is to a priori classify all possible software failures into categories based on their consequences, and then rate each category as to its importance for system competency. Those that are the least allowable will be tested for first, then the next most objectionable, and so on until all classes have been tested. This approach can be applied equally to other expert system components as to the knowledge base. We call this general approach Heuristic Testing, since it is a reasonable strategy (heuristic) to look for and eliminate the worst system problems first. We identify a four-step process for implementing the strategy, the heart of which is the problem of test-case selection. This process is also heuristic in the sense that we choose parametric test values which would most likely reveal the presence of errors. This four-step process also applies equally well to the system or to its components.

A key concern in this approach is the capability to minimize the human labor component in all stages of the testing procedure via automated techniques. Suggestions concerning how this may be accomplished are given throughout.

It is important to stress that the approach proposed in this paper is for dynamic component testing of the knowledge base, not for overall system testing. The test-case generation methods are intended to generate plausible test parameters for the variables occurring in the rules. However, these parameters are intended to be inserted directly into the data structures consulted by the inference engine during execution of the rules. The involvement of system components other than the inference engine and rule base is to be minimized. The testing proposed here could be accomplished at any time, but it is particularly appropriate for the last and formal stage of system validation.

In the sections below we first characterize the testing problem for expert systems and define a prioritized series of fault classes, followed by a detailed description of a generic test-case selection method. We then consider how test cases are to be generated for all fault classes, some using this generic method and some using other approaches. We conclude with a brief discussion of key testing concerns.

2. THE EXPERT SYSTEMS TESTING PROBLEM

There is both "good news" and "bad news" associated with the problem of testing expert systems. The bad news primarily derives from the diversity in number and character of expert system components, as illus-


Note that there are four major components, each with several subcomponents. It is usually, but not necessarily, the case that the knowledge base component (and sometimes the shell) is represented in a declarative form, usually within an AI language, and the remaining components are implemented in procedural code. Note also how qualitatively different in function the major components are from each other (as evidenced by their subcomponents).

There are three aspects to the "bad news" of testing expert systems relative to testing conventional procedural software:
• The components differ so much that each must be tested in different ways;
• Separate testing of knowledge base and inference engine components does not predict their interaction;
• Faults are more complex, less obvious, and harder to detect.
Concerning the first, each of the four major components is concerned with quite different types of functionality, and therefore specialized functional tests have to be devised for each. In particular, the knowledge base component is usually implemented in a declarative form and requires new testing procedures, especially for static (nonexecuted) testing. Second, the separation of knowledge from transfer-of-control, as reflected in the separate knowledge base and inference engine components, means that "unit" testing of each separately will not necessarily predict the interaction of the two components in dynamic execution mode. This is because of the potential for conflict among multiple rules which could be executed at a particular time, and possible indeterminacy in the inference engine. These factors could lead to different outcomes for the same input.

TABLE 1
Components of Expert Systems

Major Component | Subcomponents
Knowledge base | Rule Base; Simple Facts; Associated Data Bases (including texts); Frames (including models); Demons, other inline procedures
Shell | Utilities; User Interface; Knowledge Representation Facilities
Inference engine | Pattern Matcher; Decision Procedure(s); Rule Ordering/Conflict Resolution
External interfaces | Operating System; DB Management System; I/O Devices (User Interface); Programming Language Environments; Other Applications; Utilities, Functions, Procedure Calls


The third problem is that "failure" modes for expert systems are typically more complex and less obvious than those for conventional systems, which can frequently involve system "crashes" or "seizures" (lockups). They are thus less immediately detectable. Further, since expert systems typically exhibit higher levels of "intelligent reasoning" than their conventional counterparts, the detection of flawed reasoning may be harder and more expensive, requiring the use of "experts."

The "good news" about testing expert systems also pivots on the diversity of components and especially on the usual declarative nature of the knowledge base component. There are two aspects to the good news: • Problems in knowledge bases can be formally proven; • Expert system components are highly modular and

independent. Whereas proofs concerning conventional programs

require programmer-inserted assertions into the pro- cedural code, often coupled to formalized specifications (cf. Balzer, 1985), expert system knowledge bases are directly amenable to formalized testing. In a form of static knowledge-base testing called anomaly testing, rule bases are converted into an incidence matrix (e.g., Kusiak, 1989) or an attribute connection graph (e.g., Stachowitz & Coombs, 1987), and then these repre- sentations are subjected to various formal examination procedures using these representations. The latter is perhaps the most complete treatment of, admittedly, an open-ended set of anomalies including redundancy, inconsistency, irrelevance, incompleteness, and in- compat]bility (cf. Miller, 1989c). Figure 1 illustrates several alternative ways for representing rules, as em- ployed in different approaches.
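As a small illustration of the kind of representation these static checks operate over, the following sketch builds a rule-by-attribute incidence matrix; the rule encoding and attribute names are illustrative (drawn from Table 3), not the format used by the cited systems.

```python
# Minimal sketch: building a rule-by-attribute incidence matrix of the kind
# used in static anomaly testing (the rule encoding here is hypothetical).
rules = {
    "R1": {"if": ["Feedwater_pH", "Feedwater_oxy_pH"], "then": ["Diagnostic_13"]},
    "R2": {"if": ["Oxygen_ppb", "Blowdown_sodium_ppb", "Feedwater_oxy_pH"],
           "then": ["Primary_Alarm"]},
}

# Collect every attribute referenced by any rule, in a stable order.
attributes = sorted({a for r in rules.values() for a in r["if"] + r["then"]})

# incidence[rule][attribute] = 1 if the rule mentions the attribute.
incidence = {
    name: {a: int(a in r["if"] + r["then"]) for a in attributes}
    for name, r in rules.items()
}

for name, row in incidence.items():
    print(name, [row[a] for a in attributes])
```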

The second major good-news aspect about expert systems concerns their component modularity. Each component has separate, usually nonoverlapping functions to achieve, and components can be quite differently implemented. These factors make it easier to devise and execute good testing procedures. In particular, the validity of a complex expert system can be extensively, if not completely, tested by testing the interaction of two of its components--the knowledge base and the inference engine.

The net results of the good-news/bad-news testing features from the point of view of the dynamic testing of the knowledge base component are as follows: (a) many types of problems which might otherwise have had to be addressed by dynamic testing--the typical conventional situation--can be solved by formal static testing; indeed, formal solutions with broad coverage are already well formulated and developed (e.g., Stachowitz & Coombs, 1987); and (b) since the knowledge base implements the qualitative application requirements, it is possible to focus on testing the knowledge base for qualitative flaws in the overall system.


[FIGURE 1. Alternative ways of representing relationships among expert system rule elements: (a) incidence matrix of rules R_m and the attributes a_n evaluated in the rules; (b) rule graph, in which nodes represent rules and edges represent, e.g., an "enables-to-fire" relationship; (c) attribute graph, in which nodes represent attributes (tests on, or assertions concerning, their values) and edges represent input/output relationships.]


3. A THEORY OF FAULT CLASSIFICATION AND PRIORITIZATION

3.1. Definition of Fault Classes

At the heart of our proposed Heuristic Testing approach is the idea that faults concerning expert system requirements and performance characteristics can be prioritized, and that the testing can be coordinated to this gradient. This position endorses Rushby's (1988) advocacy for identifying and testing "minimum competency" requirements--requirements concerning how poorly the system is permitted to perform; it also is related to efforts which focus first on testing safety properties of systems (e.g., Franklin & Gabrielian, 1989).[2] We propose here a novel prioritized set of fault classes which, in total, provides for relatively complete testing of the knowledge base. Our 10 fault classes, roughly in order of decreasing "unallowability," are:
• basic safety;
• system integrity;
• essential function;
• robustness-failure;
• secondary function;
• incorrect input/output (I/O);
• user-interface;
• error-metric;
• resource consumption;
• "other."

[2] Rushby divides requirements into two sets: competency and service. Competency requirements relate to human-level "knowledge"; service requirements are the remainder. Competency requirements are subdivided into desired and minimum: desired implies how well the system is to perform, while minimum implies how badly it is allowed to perform, having strong similarity to safety requirements (cf. pp. 75-78).


TABLE 2
Priority and Character of the 10 Fault Classes

Fault Class | Tests For
1. Basic Safety | Harmful or Destructive Actions (Primary and Contributory)
2. System Integrity | Impairment of Expert System Components or of Attached Systems
3. Essential Function | Incomplete or Incorrect Essential Functionality
4. Robustness-Failure | Incorrect Processing Due to Less-Than-Perfect Input Characteristics
5. Secondary Function | Incomplete or Incorrect Secondary Functionality
6. Incorrect I/O | Deviation from Required Input-Output Relationships
7. User-Interface | Usability and User-Overload Problems
8. Error-Metric | Probable Problem-Causing Rules
9. Resource Consumption | Excessive Consumption of User or System Resources
10. Other | Other Problems


We explain each of these in turn in this subsection and then conclude with a general discussion of the recommended ordering in subsection 3.2.[3] Table 2 provides a short explanation of the fault classes to accompany the following descriptions. How these faults might be detected is considered in more detail in later sections.

[3] The approach taken here shares Rushby's emphasis on the importance of minimum competency, interpreted as safety as well as system integrity requirements. However, the present approach emphasizes an ordered, more clearly definable set of classes, avoiding distinctions made on the basis of references concerning how much human knowledge or skill would be involved.

Basic safety faults concern potential system actions which could cause unintended and generally undesirable harm or destruction to people, valued fauna or flora, or valuable property--particularly property whose degradation could lead to human harm. One procedure for determining which actions in the rule base might involve basic safety faults is as follows. List all of the actions which could be taken by the system rules; then engage the assistance of people experienced in the application domain and ask them to evaluate each action against the following six questions:
• Is there any circumstance under which a single correct occurrence of this action could lead to harm because of a particular existing or enabled context (e.g., opening an exterior door on a space vehicle when not all occupants are in spacesuits)?
• Is there any circumstance under which a single improperly executed occurrence of this action could lead to harm (e.g., requiring a user to operate a switch located a few inches away from an exposed high-voltage terminal that could be touched with a shaky hand)?
• Is there any circumstance under which a single failure of occurrence of this action could lead to harm (e.g., failure to put something away, such that someone trips on it and falls)?
• Is there any circumstance under which multiple correct occurrences of this action could lead to harm (e.g., multiple recent exposures to low-level radiation whose dosage is acceptable only once over long intervals)?
• Is there any circumstance under which multiple improperly executed occurrences of this action could lead to harm (e.g., instructing the user to lift heavy loads)?
• Is there any circumstance under which multiple failures of occurrence of this action could lead to harm (e.g., repeated failure to empty a discard facility causes overflowing, which causes damage)?

This class of safety faults is related to techniques known as hazard analysis and fault-tree analysis (e.g., Leveson, 1986).

The second most important fault class, system integrity, involves destruction, separation, inactivation, or degradation of essential components within the expert system or external to it, linked via data or communication interfaces. As with basic safety faults, these can be direct or indirect, and they can also be detected with the assistance of an experienced person via the above six-question technique. Examples are overwriting critical elements of a database, disabling an input device or channel, executing a high-priority infinite-loop system interrupt, "hanging up" a telephone data link, and so forth. This fault class is related to the software testing approach known as software fault-tree analysis, but the emphasis is on general failure-mode analyses, not so much on hazards or safety as with the first class (e.g., Rushby, 1988).

The third class, essential function faults, is the first class to involve close examination of system requirements. Essential functions may be identified from the requirements documentation on the basis of four recognition characteristics:
• They are explicitly labeled as essential (or comparable) in the requirements documents;
• They are those specified functions that, if not met or only partially met, lead to or enable basic safety or system integrity faults;
• They comprise that minimum set of functions that embody the minimum operational concept of the expert system, if only on a very restricted set of data, conditions, or inputs, and if only for a very limited level of capability vis-à-vis human performance;
• They are that set of functions that are not fully resolvable by the above tests but that are specified as being essential by the customer or by an experienced professional in the application domain.

Given identification of essential functions, faults of this class are determined whenever these functions are missing or are implemented incompletely.

Faults involving robustness-failure are those which cause inappropriate system behavior (e.g., undesirable external actions, lockup, erroneous processing) as a result of input conditions which deviate from those specifically allowed or expected. Examples are processing errors brought on by misspellings and ungrammaticalities in an input to a natural language-processing system, quantitative input parameter values out of bounds, character-string input to a function expecting fixed-binary values, arbitrary or capricious manipulation of initial conditions, and so forth.

Secondary function faults refer to faults detected with respect to specifications in the requirements documents which are other than the essential functions--all of the remaining functions (e.g., help, documentation, audit facilities).

Incorrect I/O faults are those which, in comparison to the requirements, fail to produce the expected output for a specified set of input conditions intended to produce that output. The major faults to look for at this point are violations of context in the requirements specifications--circumstances under which more specific I/O relationships are supposed to hold, but don't.[4]

The user-interface faults refer to all problems of usability not covered by any of the preceding categories. Since the choice of the actual manner and modality of information display and input is determined much more by the shell or other utility components than by the knowledge base, the nature of user-interface faults detectable from rules is restricted. However, if the tester can determine the correlation between rule syntax and I/O style (by trying out the system, by examination of user-interface utilities, etc.), then the tester can look for problematic situations. In particular, the knowledge base component should be checked for user-interface problems involving user-overload faults: those which cause users to operate beyond their profile capacity in the processing of information. These can occur when information is presented for processing too much at a time or too quickly (relative to the profile), when decisions or inferences are expected on too little information, when the formal or natural language of communication is not well understood, when the processing or skill demanded is not possessed, and so forth.

[4] There are other kinds of I/O faults--for example, required outputs being missing, a desired outcome not being connected to primary input states, specified inputs connecting to more than the specified output. Most of these should be caught by the static testing technique, anomaly testing, which should precede dynamic testing (cf. Miller, 1989c).


The error-metric faults are a statistical category based on hypothesized or (preferably!) empirical findings concerning the characteristics of those expert system rules which are (believed to be) associated with any of the other classes of faults; complexity of Boolean rule logic may be an example factor. The metrics of this class may take into account theories of programmer difficulties in specifying rules, as in error-based approaches to conventional program testing (cf. Howden, 1989), or they may simply be objective functions of rule features. Either way, these metrics are comparable in spirit and intent to those developed for procedural programs (e.g., Grady & Caswell, 1987; Halstead, 1977; McCabe, 1983).
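As an illustration only, a rule-level metric of this kind might look like the following sketch; the particular features counted and the weights are assumptions made for the example, not a validated error metric.

```python
# Illustrative rule-complexity metric in the spirit of the error-metric class:
# count terms, Boolean connectives, distinct attributes, and constants in a
# rule's antecedent, and combine them with assumed weights.
def rule_complexity(terms, connectives, attributes, constants,
                    w_terms=1.0, w_conn=2.0, w_attr=0.5, w_const=0.5):
    return (w_terms * terms + w_conn * connectives
            + w_attr * attributes + w_const * constants)

# Rule 2 of Table 3: three terms joined by two "and"s, three attributes,
# three constants.
print(rule_complexity(terms=3, connectives=2, attributes=3, constants=3))  # 10.0
```

Rules scoring above some empirically chosen threshold would then be singled out for additional dynamic test cases.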

Faults leading to unacceptable resource consumption may involve the user, the expert system itself, or other attached systems. Key user resource faults are excesses in the amount of time required to be engaged with the system to complete a function (e.g., 2 h vs. the intended 5 min) or the amount of attention needed to monitor or operate the system; also included is the total number of people required for operation or to support a single user. Intrinsic system resources primarily involve excesses in CPU time and in the amount of main and virtual memory. Extrinsic system resources include these types of measures for attached systems, plus the number of data or I/O channels occupied, required bandwidth, number of supporting CPUs, and the costs of these resources.

Finally, we have specified an other fault class to represent all of the remaining types of faults that could occur. Although this fault class is not well formed, the testing procedures appropriate for it are simple and well understood.

3.2. Fault Class Priorities

The Heuristic Testing approach is oriented towards the situation involving expert systems embedded in larger complex systems in which the opportunities for serious harm to life or property, or significant wasting of expensive resources, are quite real. The approach also is designed to provide for the common and very sensible practice of fix-as-you-go. This software repair strategy holds that one goes ahead and fixes a very serious problem (ideally via a design change and not simply a code change) without continuing to test for minor problems; the fix may completely reshape the code or the performance, such that new faults appear, old ones go away, and continued testing would have been wasteful. This is a particularly reasonable repair strategy for expert systems' knowledge base testing because rules, unlike procedural code, are not inherently organized into modules, control structures, or linear segments of sequentially executed code; it is very difficult to determine what other rules would be influenced by an error in a particular rule. A key element of the fix-as-you-go strategy is to employ regression testing, repetitions of selected previous tests to assure that the fix did not introduce new errors.

Under these assumptions the choice of the basic safety and system integrity fault classes as numbers one and two in priority directly follows. Having essential function faults in the third position guarantees that the system is somewhere in the ballpark of providing the absolutely necessary function. It is better to find out as soon as possible that one has problems on essential function than to test for nonfunctional problems and then discover key function problems.

Testing for robustness-failure faults should be next (fourth), particularly looking for problems involving the first three most critical classes. Inserting this type of stress testing at this point can reduce the overall number of test cases needed. At this juncture the system functionality has been roughly divided into major and minor segments, the latter containing secondary function, incorrect I/O, and user-interface faults. Testing for the minor functions would seem to have a much lower probability of revealing faults which could lead to a major problem than testing for robustness failures. On the other hand, testing for robustness at this stage tests not only for additional major problems but also could turn up minor problems (and possibly eliminate others, if the fix-as-you-go strategy is employed). Thus, it is expected that any set of robustness tests has a higher probability of flushing out competent reliability faults and other problems than any other set at this juncture.

Next we suggest checking for the remaining functionality: secondary functions first (fifth), then any remaining problems with I/O (sixth) and the user interface (seventh). With these seven classes all of the system functionality has been addressed.

The final three classes provide different approaches for getting at remaining difficult-to-detect problems. Error-metric (eighth) uses a statistical approach to discover likely problematic rules; the resource-consumption (ninth) faults are concerned with performance issues, which could affect functionality; and other (tenth) is all the remaining undetected faults, for which we suggest special remedies.

This ordering has been generated from the perspective of a very conservative minimax position, assuming there could be safety and system integrity hazards: minimize the probability of maximum harm. But there is nothing sacred about the order, particularly if there is very little chance of safety or integrity faults. There could be a number of reasons why one would prefer a different ordering. For example, many of the test-case generation methods discussed later in section 5 rely on documentation to guide the process; if there is no documentation, and there are no "experts" familiar with the system reasonably available for guidance, one might well want to start with approaches which don't require documentation--the error-metric approach and the brute-force tactic of robustness testing.

4. THE HEURISTIC KNOWLEDGE BASE TESTING APPROACH

The groundwork has now been laid to propose a dynamic testing approach for expert system knowledge bases. In this section the criteria for a multifaceted method are first made explicit; a four-stage model is then proposed which meets these criteria. The procedures to be used for generating test cases for each of the fault classes are discussed in detail in section 5.

4.1. Criteria for the Heuristic Testing Approach

We do not believe that the dynamic testing procedure of Heuristic Testing is the only appropriate approach for expert systems. Nevertheless, we believe it to be the best approach if the following eight criteria hold:
C1. A strong guarantee is desired that, however few test cases are run, the test cases will be ordered to test for the worst failure possibilities first, followed by the next worst, and so forth.
C2. For any particular class there is a way of increasing assurance of software reliability for that class independent of the assurance of reliability for other classes.
C3. The test-case selection procedures are computationally feasible.
C4. The test-case selection procedures, as well as the outcome-evaluation procedures, are amenable to automation.
C5. The testing approach is not dependent on the characteristics of the particular inference engine involved in testing; however, the approach does permit detection of interaction problems between the inference engine and the knowledge base.
C6. The testing approach is applicable to any type of declarative knowledge base expressed as IF-THEN rules, regardless of the particular syntax used to implement those rules.
C7. Dynamic testing will occur only after all possible static tests of the knowledge base have been conducted and the detected errors corrected.
C8. The fix-as-you-go repair strategy is adopted.

4.2. The Four-Step Heuristic Testing Process

Figure 2 shows the proposed four-step procedure involved in the Heuristic Testing process. The first step is simply to select the fault classes to be included in the test-case generation process, out of the 10 possible. In the second step test cases are selected for each of the fault classes included in step 1. We identify a procedure for test-case selection which establishes a base, or minimum, level of reliability for each fault class; we also indicate how the test-case generation process can be modified to achieve higher reliability levels.

[FIGURE 2. Stages in the dynamic testing of an expert system rule base using the Heuristic Testing approach: (1) selection of the fault classes to be included in the test (up to a total of 10); (2) selection of test cases for each fault class (base reliability, extended reliability); (3) preparation of the execution plan (specifications for test cases and scenarios, sequencing, regression tests); (4) execution and evaluation (measuring outcomes, interpreting outcomes, regression testing).]

The third step of the process is to prepare the testing plan. An important aspect of this step is the development of the test specifications for each test case: descriptions of the primary and subsequent inputs, any important environment conditions, and the expected output or environment changes as a result of the test-case execution. These specifications have been called scripts, and the work in procedural testing on developing formal-language descriptions of such scripts could well serve as a model for similar support for expert system testing (see Balcer, Hasling, & Ostrand, 1989).

A related consideration at this step is whether the test-case specifications are best developed into full scenarios. We define scenarios as the specification of a real-world context and rationale for the test-case specification which is understandable and plausible to persons experienced in the system's application area. Clearly, not all test cases could be developed into scenarios; however, it is the testing culture of some application domains (e.g., nuclear utilities) that scenario tests are required for full user and customer acceptance of software reliability testing programs. The issues involved in scenario development, however, are complex and outside the scope of this paper.

Another major consideration in this third step is the cost of setting up and executing the test cases. When test specifications call for "expensive" input profiles or environmental set-ups, the overall test plan should attempt to batch-sequence test cases which share the same costly specifications, to avoid repetitively incurring these expenses. An aspect of this step which is very important for success, but which we do not address here, is the problem of developing an overall testing program to meet specific budget and reliability requirements, including the development and execution of regression testing. We also do not consider cost/benefit functions for each fault class. These are essential considerations in any testing program, but we have not yet developed the needed analysis functions; we suggest following guidelines for conventional software development (e.g., see Boehm, 1981).

The final step in the four-step testing process is to execute the planned and sequenced test cases. The basic "raw" data concerning the outcomes of each test have to be collected, as dictated by the specification, and then these outcomes have to be analyzed and interpreted. It is frequently the case for conventional software systems that test cases which uncover faults will produce catastrophic errors or some form of error notification generated by the system components. With expert systems, however, such outcomes can be expected much less frequently. Therefore, interpretation of the results will generally be more difficult and time consuming. Moreover, since expert systems often attempt to produce highly competent behavior, at a level comparable to an experienced professional, interpretation may require special understanding of the domain well beyond that documented in the requirements statement. Also included in this last step is the routine execution of the regression tests after each set of fault-class tests, to insure that there are no new errors. This step is executed for all fault classes chosen.
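The overall flow of the four steps, with fix-as-you-go repair and regression testing folded into the final step, can be sketched as follows; the function names are placeholders for the activities described in the text, not an actual implementation.

```python
# Skeleton of the four-step Heuristic Testing process (all callables are
# placeholders supplied by the tester; sequencing/batching of step 3 omitted).
FAULT_CLASSES = ["basic safety", "system integrity", "essential function",
                 "robustness-failure", "secondary function", "incorrect I/O",
                 "user-interface", "error-metric", "resource consumption", "other"]

def heuristic_test(rule_base, selected_classes, generate_cases, run_case,
                   fix, regression_suite):
    """run_case returns True when the observed outcome matches the test script."""
    for fault_class in FAULT_CLASSES:                       # step 1: priority order
        if fault_class not in selected_classes:
            continue
        for case in generate_cases(rule_base, fault_class):  # step 2: select cases
            if not run_case(rule_base, case):                 # step 4: execute
                fix(rule_base, case)                          # fix-as-you-go repair
                for prior in regression_suite:                # regression tests
                    assert run_case(rule_base, prior), "fix introduced a new error"
```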


TABLE 3
Examples of Potentially Complex If-Then Rules

Rule 1. IF Feedwater_pH < 8.8 and Feedwater_oxy_pH < 9.2
        THEN activate Diagnostic_13, set up Alarm_trace_7

Rule 2. IF Oxygen_ppb > 15 and Blowdown_sodium_ppb > 20 and Feedwater_oxy_pH > 8.4
        THEN activate Primary_Alarm, CALL SHUTDOWN

Rule 3. IF Condensate_pH > 0.8 × Blowdown_pH and Total_chloride_ppb < .078 + .5 × Condensate_oxy_ppb + .2 × Feedwater_oxy_pH
        THEN engage Override_7, set Total_chloride_max = .3 × Blowdown_chloride_pH

Rule 4. IF System_state not_equal_to emergency or Feedwater_oxy_pH < 9.2 and Oxygen_ppb < 8.8
        THEN begin Drain5 of Secondary_Water_Supply

5. TEST-CASE GENERATION FOR THE 10 FAULT CLASSES

We now detail procedures for generating test cases for each of the fault classes. Before doing so, it is necessary to provide some detail of a generic procedure for transforming any rule into a form usable for generating test-case parameters. What is presented is a high-level conceptualization suitable for discussion, not the format and procedure we would actually use for implementation. Following this presentation the importance of requirements traceability is pointed out, and then the procedures for each of the fault classes in turn are discussed.

5.1. Test-Case Generation: The Generic Testing Method

Consider the four made-up rules in Table 3. In their present form they are very difficult to read, and certainly not easily amenable to analysis. We propose the type of transformation illustrated in Table 4 for both the antecedent and the consequent clauses of each rule, although Table 4 shows the analysis only for the former.

Using Rule 2 as an example, note that the text version in Table 3 specifies three conditions, or terms, for the IF clause; the terms are separated by and. Each term is listed separately in Table 4 and assigned a unique variable ID--in this case T3, T4, and T5. Next, each term is examined for two types of elements, ignoring the comparison operators for the moment: attributes and constants; these also are given unique variable IDs. Additional detailed information is entered for constants, particularly the unit of measurement, the minimum step (the default being the level of precision of the constant), and the maximum and minimum range values. Note that an attribute which appears in more than one term is given the same variable ID for all occurrences; thus Feedwater_oxy_pH is assigned the variable ID a2 for each of its several occurrences (actually once in each rule).[5] When all variables have been assigned, the logic of the term is represented in terms of these variables; thus, the logic for the third term of Rule 2 is a2 > C5. Finally, the overall logic of the rule is represented in terms of the term variables (i.e., T3 and T4 and T5).
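A minimal sketch of the same transformation, expressed as data structures for Rule 2 of Table 3, is given below; the field names and encoding are illustrative assumptions rather than the implementation format alluded to above.

```python
# Rule 2 of Table 3 after the Table 4 transformation: terms, attributes, and
# constants are given variable IDs, and term/rule logic is expressed over them.
attributes = {"a2": "Feedwater_oxy_pH",
              "a3": "Oxygen_ppb",
              "a4": "Blowdown_sodium_ppb"}
constants  = {"C3": {"value": 15,  "unit": "ppb", "step": 1},
              "C4": {"value": 20,  "unit": "ppb", "step": 1},
              "C5": {"value": 8.4, "unit": "pH",  "step": 0.1}}
terms      = {"T3": {"logic": ("a3", ">", "C3")},
              "T4": {"logic": ("a4", ">", "C4")},
              "T5": {"logic": ("a2", ">", "C5")}}
rule_2     = {"antecedent": ("and", ["T3", "T4", "T5"]),
              "consequent": ["activate Primary_Alarm", "CALL SHUTDOWN"]}
```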

A summary picture of the occurrences of attributes in terms is given in Figure 3. One can easily see, for example, that the attribute a2 occurs more frequently than any other (4 times), and the term T7 has the largest number of elements, both attributes and constants (3 each); note also that a whole complete term, T2, is repeated. These and similar incidence matrices are used in static analyses of knowledge bases for various kinds of anomalies (e.g., Kusiak, 1989). Such matrices are used in Heuristic Testing primarily for selecting test cases for the last, Other, fault class (see the discussion of random testing at the end of section 5.14).

There is a key assumption concerning the comparison of an attribute to a constant (or other value) in a rule: errors in the knowledge-base rule are assumed to be more likely in the close vicinity of the constant than far away. That is, we believe it more likely that the specifier made a small error in choosing the exact constant value, or else chose the incorrect operator (e.g., "<" instead of "<="), rather than making a gross error in picking the constant.

[5] Repeated constants would also be given the same variable ID, and it might initially appear that constants C1 and C11 are therefore mislabeled and should have the same variable ID, since the numerical value of both is 8.8. However, constants need to be subjected to a dimensional analysis to insure they have the same units as well as the same numerical value; doing this, we see that C1 is in pH units, while C11 is in ppb units. They are thus distinct.


TABLE 4
Transforming the Rules for Test-Case Generation: Assignment of Variables to the Terms, Attributes, and Constants of the Rules of Table 3

Rule 1 (rule logic: T1 and T2)
  T1: Feedwater_pH < 8.8 -- attribute a1 (Feedwater_pH); constant C1 = 8.8 (N, pH, step .1, range 1-11); term logic: a1 < C1
  T2: Feedwater_oxy_pH < 9.2 -- attribute a2 (Feedwater_oxy_pH); constant C2 = 9.2 (N, pH, step .1, range 1-11); term logic: a2 < C2

Rule 2 (rule logic: T3 and T4 and T5)
  T3: Oxygen_ppb > 15 -- attribute a3 (Oxygen_ppb); constant C3 = 15 (N, ppb, step 1); term logic: a3 > C3
  T4: Blowdown_sodium_ppb > 20 -- attribute a4 (Blowdown_sodium_ppb); constant C4 = 20 (N, ppb, step 1); term logic: a4 > C4
  T5: Feedwater_oxy_pH > 8.4 -- attribute a2; constant C5 = 8.4 (N, pH, step .1, range 1-11); term logic: a2 > C5

Rule 3 (rule logic: T6 and T7)
  T6: Condensate_pH > .8 × Blowdown_pH -- attributes a5 (Condensate_pH), a6 (Blowdown_pH); constant C6 = .8 (N, pH, step .1, range 1-11); term logic: a5 > C6 × a6
  T7: Total_chloride_ppb < .078 + .5 × Condensate_oxy_ppb + .2 × Feedwater_oxy_pH -- attributes a7 (Total_chloride_ppb), a8 (Condensate_oxy_ppb), a2 (Feedwater_oxy_pH); constants C7 = .078 (N, step .001), C8 = .5 (N, step .1), C9 = .2 (N, step .1); term logic: a7 < C7 + C8 × a8 + C9 × a2

Rule 4 (rule logic: T8 or (T2 and T9))
  T8: System_state not_equal_to Emergency -- attribute a9 (System_state); constant C10 = Emergency (S; step and range not applicable); term logic: a9 ≠ C10
  T2: Feedwater_oxy_pH < 9.2 -- attribute a2; constant C2 = 9.2 (N, pH, step .1, range 1-11); term logic: a2 < C2
  T9: Oxygen_ppb < 8.8 -- attribute a3; constant C11 = 8.8 (N, ppb, step .1); term logic: a3 < C11

N = Numeric datatype, S = String datatype.

TI "12 7"3 T4 "1"5 T6 T7 1"8 T2 I'9 Y.

al I I a2 I I I I 4 a3 I I 2 a4 I I

(z) ~ I I a6 I I a7 I I a8 1 1 a9 1 1

5: 1 1 1 1 1 2 3 1 1 1 13

C1 1 1 C2 1 1 2 C3 1 1 C4 1 1

(ii) C5 I I C6 I I C7 I I C8 1 l C9 1 l CIO 1 1 C l l 1 1

Y. I l I 1 1 I 3 I 1 1 12

FIGURE 3. Incidence matrices of (1) attribute occurrences in terms and (2) constant occurrences in terms.

TABLE 5
Generic Test-Case Values to Assign to Attributes in the Comparison Terms of IF-THEN Rules, for Three Increasing Levels of Software Reliability, According to the Generic Testing Method

Reliability Testing Level [1]: the Level 1 value is True or Just-True; the Level 2 values are Next-True/Just-True, Just-False, and Next Just-False; the Level 3 values are Far-True, Opposite Far-True, Far-False, and Opposite Far-False.

No. | Operator [2] | Level 1 | Level 2 | Level 3
1 | = | C | NA; C + Step; C - Step | NA; NA; C × 5; C ÷ 5
2 | not = | C + Step | C - Step; C; NA | C × 5; C ÷ 5; NA; NA
3 | < | C - Step | NA; C; C + Step | C ÷ 5; NA; C × 5; NA
4 | <= | C | C - Step; C + Step; NA | C ÷ 5; NA; C × 5; NA
5 | > | C + Step | NA; C; C - Step | C × 5; NA; C ÷ 5; NA
6 | >= | C | C + Step; C - Step; NA | C × 5; NA; C ÷ 5; NA

[1] Table entries refer to the value of the constant, C, and the minimum Step from that constant; NA means "Not Applicable."
[2] The negation of operators 3-6 is another of those; hence, negated < or > operators aren't listed separately.

This assumption is certainly consistent with the error-based testing approach (cf. Basili & Selby, 1987) and with empirical studies of programmer errors (e.g., Miller, 1974; we bring this point up again after presenting the approach).

There are three steps to the test-value selection process in this Generic Testing Method (GTM), which we illustrate for an attribute value compared to a single constant. This method provides several levels of testing reliability and is embodied in Table 5. Successive levels of testing probe different regions of the rule logic, as discussed below. The first step of the GTM is to convert the rule into symbolic form, as shown in Table 4, arriving at the Term Logic for the attributes in the term. Second, if necessary, algebraically manipulate the term logic such that the next attribute of interest appears alone on the left side of the comparison operator. Third, find the row in Table 5 which contains the comparison operator used in this term logic. This row will provide values for the attribute for three levels of reliability. Level 1 is the minimum level, and it tests for the term logic being exactly or minimally true. This is accomplished by selecting the value of the attribute to be exactly equal to the constant (row 1) or to be the value at one end of the true interval (rows 4 and 6). A minimally true value for the attribute is one just inside the true interval, computed as the constant C plus or minus the smallest step-interval, Step (rows 2, 3, and 5).

Level 2 testing consists of selecting two additional minimally true-or-false values for the attribute which are within one step-unit of the constant. These values complete the sampling of the attribute values in the immediate vicinity of the true-false boundary for the term logic. Level 3 testing samples values at a considerable distance from, and on both sides of, this boundary; these relatively extreme values are set in Table 5 as values five times or one-fifth the constant.

Thus, one, three, or five testing values can be generically selected to test the term logic for a rule. The value of increasing the number of values tested from one to five is the increased confidence that the rules contain no errors; the cost is the expense of running combinatorially three or five times more tests for a rule.
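A minimal sketch of how the entries of Table 5 might be applied mechanically is given below; the function and its encoding of the table are our own illustration, under the assumption of a simple attribute-versus-constant term.

```python
# Minimal sketch of the Generic Testing Method (GTM) of Table 5 for a single
# "attribute <op> constant" term.  Names and structure are illustrative
# assumptions, not the author's implementation.

def gtm_values(op, c, step, level=3):
    """Return candidate attribute test values for the term 'attr <op> c'.

    Level 1: the true or just-true value.
    Level 2: adds the remaining just-true/just-false values near the boundary.
    Level 3: adds far values (c x 5 and c / 5), per Table 5, without range clamping.
    """
    table = {
        "=":  ([c],        [c + step, c - step], [c * 5, c / 5]),
        "!=": ([c + step], [c - step, c],        [c * 5, c / 5]),
        "<":  ([c - step], [c, c + step],        [c / 5, c * 5]),
        "<=": ([c],        [c - step, c + step], [c / 5, c * 5]),
        ">":  ([c + step], [c, c - step],        [c * 5, c / 5]),
        ">=": ([c],        [c + step, c - step], [c * 5, c / 5]),
    }
    level1, level2, level3 = table[op]
    values = list(level1)
    if level >= 2:
        values += level2
    if level >= 3:
        values += level3
    return values

# Example: term T3 of Rule 2, "Oxygen_ppb > 15", with step 1.
print(gtm_values(">", 15, 1, level=1))   # [16]
print(gtm_values(">", 15, 1, level=3))   # [16, 15, 14, 75, 3.0]
```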

The same basic procedure can be used when the attribute is being compared to a more complex expression, as for attributes a5 and a7 in terms T6 and T7 in Table 4. First, however, acceptable values for the attributes on the right-hand side of the comparison must be chosen (as indicated by documentation or by an experienced professional). Then, one computes the value of the whole right-hand side with these chosen values; this result is used in Table 5 as if it were equal to C.[6]

For attributes whose datatype is string, not numeric, the method can still be applied. These typically involve the "=" comparison (row 1, Table 5). If the other allowable values for a string attribute are ordered, then one can use Table 5 such that "C - Step" and "C + Step" represent values just before and just after the comparison value; otherwise, sample randomly from the possible unordered values.

The GTM procedure is illustrated for the four rules shown in Table 3, as converted into variables a1-a7 in Table 4. The three testing-level values for each of the variables are shown in Table 6.

A final comment concerns the assumption made at the beginning of this section that an error, if it occurs, will be local to the value of the constant, and that the GTM procedure is therefore appropriate. Another way of arguing the same thing is to consider all the errors that a programmer might make in specifying a rule; we list the obvious ones in Table 7. The first three errors are highly conceptual ones, and it would take a pretty involved chain of reasoning to come up with a general test procedure for these, other than human inspection. The fourth error is easily caught with static testing. The remaining four possible errors are all testable with the three testing levels of the GTM approach, regardless of one's assumptions concerning likely programmer errors.

[6] Choosing test values for attributes found on the right-hand side of the comparison--such as a8 in T7, in Table 4--simply involves the second step of the generic procedure: rearranging the comparison so that the desired variable is on the left-hand side.


TABLE 6
Application of the Generic Testing Method of Table 5 to the Rules of Table 3

Variable to receive test values | Constant (value) | Level 1: Equal | Level 2: Just Above, Just Below | Level 3 [a]: Way Above (×5), Way Below (÷5) | Other variables set
a1 | C1 (8.8) | 8.8 | 8.9, 8.7 | 11, 1 | --
a2 | C2 (9.2) | 9.2 | 9.3, 9.1 | 11, 1 | --
a3 | C3 (15) | 15 | 16, 14 | 75, 1 | --
a4 | C4 (20) | 20 | 21, 19 | 100, 2 | --
a2 | C5 (8.4) | 8.4 | 8.5, 8.3 | 11, 1 | --
a5 [1] | C6 (.8) | .8U1 | .9U1, .7U1 | 4U1, .16U1 | a6 set to U1
a6 [2] | C6 (.8) | U2/.8 | U2/.9, U2/.7 | U2/.16, U2/4 | a5 set to U2
a7 [1] | C7 (.078), C8 (.5), C9 (.2) | .078 + .5U3 + .2U4 | .079 + .5U3 + .2U4, .077 + .5U3 + .2U4 | .39 + 2.5U3 + U4, .015 + .1U3 + .04U4 | a8 set to U3, a2 set to U4
a2 | C2 (9.2) | 9.2 | 9.3, 9.1 | 11, 1 | --
a3 | C11 (8.8) | 8.8 | 8.9, 8.7 | 43.5, 1.7 | --

[1] These attributes are the most logical to receive test values in the complex inequality.
[2] These attributes could receive test values if, for some reason, they cannot be assigned values but the other accompanying attributes must be estimated.
[a] Values are changed by a factor of 5 if there is sufficient range; else the top and bottom of the range are used.


5.2. Full Test Script Generation with Multiple Terms and with Rule Chaining

The generic procedure of 5.1 is complete for testing a rule if the consequent of that rule is the one under test, there is only one term in the rule, and the attributes in that term are either primary inputs or can be set directly.

TABLE 7
Potential Causes of Rule Errors and Appropriate Testing Methods for Each

Possible Rule Errors | Appropriate Test
1. Overall effect of legitimate rule is illogical re application requirements | Human inspection and test
2. Terms combined with the wrong Boolean logic; actions sequenced wrong | Human inspection and test
3. Term's comparison tests or actions illogical re application requirements | Human inspection and test
4. Nonexistent attributes/values; values out of range or inconsistent with definition | Static tests
5. Wrong comparison operator in a term | Generic Method (GM), Levels 1, 2
6. Constant a little too small in a term | GM, Levels 1, 2
7. Constant a little too large | GM, Levels 1, 2
8. Constant considerably off | GM, Level 3


When there are n terms in a rule, and t tests for each term, a total of t^n test cases can be required to test the rule. For our suggested three levels of testing involving 1, 3, and 5 test values per term, the number of test cases needed for a rule having three terms thus goes from 1 to 27 to 125.

The exponential increase occurs because one is systematically testing all test values for one term under all combinations of test values for the remaining terms, so-called orthogonal selection. A much-reduced sample is possible if techniques from applied statistical design known as fractional replications are employed. These provide a systematic but incomplete sampling of the complete orthogonal test-case situation (e.g., via lattice designs, Latin squares; cf. Winer, 1962, pp. 447 ff).
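To illustrate the combinatorial growth, the following sketch enumerates the full orthogonal selection for a three-term rule with three test values per term; the values shown are the near-boundary values for Rule 2 of Table 3, and the dictionary encoding is an assumption made for the example.

```python
from itertools import product

# Near-boundary GTM values for the three terms of Rule 2 of Table 3:
# Oxygen_ppb > 15, Blowdown_sodium_ppb > 20, Feedwater_oxy_pH > 8.4.
term_values = {
    "Oxygen_ppb":          [16, 15, 14],      # just-true, just-false, next just-false
    "Blowdown_sodium_ppb": [21, 20, 19],
    "Feedwater_oxy_pH":    [8.5, 8.4, 8.3],
}

# Full orthogonal selection: every combination of the per-term values.
cases = [dict(zip(term_values, combo)) for combo in product(*term_values.values())]
print(len(cases))   # 3**3 = 27 test cases; with 5 values per term it would be 125
```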

When the to-be-tested attributes are not primary inputs, and their values cannot be directly set (or one doesn't want to do this for some reason), then one needs to determine which rules have outcomes which supply values to this rule, determine test-level values for the antecedent attributes of these preceding rules, and continue to work backwards in this fashion until primary input antecedents are found. For each rule in the chain from primary input to output-of-interest, one should select at least the three test values in the vicinity of the true-false boundary, to assess the reliability of that chain; of course, the terminal rule is expected to fire only under the Level 1 test values.
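A minimal sketch of this backward walk is shown below, assuming a hypothetical map from each rule to the attributes it reads and sets (the RULES and PRIMARY_INPUTS structures are stand-ins, not the paper's representation).

```python
# Hedged sketch, not the paper's algorithm: walk backwards from a rule of
# interest until only primary inputs remain, collecting the chain of rules
# whose outcomes must be established first.
RULES = {
    # rule id: (antecedent attributes read, consequent attributes set)
    "R1": ({"a1", "a2"}, {"a5"}),
    "R2": ({"a5", "a3"}, {"a7"}),
    "R3": ({"a7", "a4"}, {"out"}),
}
PRIMARY_INPUTS = {"a1", "a2", "a3", "a4"}

def chain_to_primary_inputs(target_rule):
    """Return the set of rules whose outcomes feed the target rule's antecedent."""
    needed, chain, frontier = set(), set(), set(RULES[target_rule][0])
    while frontier:
        attr = frontier.pop()
        if attr in PRIMARY_INPUTS or attr in needed:
            continue
        needed.add(attr)
        for rule, (antecedent, consequent) in RULES.items():
            if attr in consequent and rule not in chain:
                chain.add(rule)
                frontier |= antecedent
    return chain

print(sorted(chain_to_primary_inputs("R3")))  # ['R1', 'R2']
```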

With chained rules it is strongly recommended that the primary input rules be tested first, following the practice of fix-as-you-go repair. This will provide insurance that the path to the terminal rule in the chain is free of within-rule errors.

5.3. Importance of Requirements Traceability

Heuristic Testing, like all functional testing strategies, relies on documentation to determine what system components are related to what functions. The best documentation is that which, in addition to other desirable features, indicates how the present form of a component is derived and developed from its previous form: relating the first design considerations for that component back to the requirements, linking detailed design back to high-level design, and connecting actual code or rule implementation back to design. This is requirements tracing, and it is an extremely valuable discipline for assuring that all requirements are implemented, as well as for determining what the purpose is of specific implementation elements.

When code is not so annotated it can be extremely difficult to determine the purpose(s) it serves. The rules of expert systems, while perhaps more immediately revealing than procedural code, can be equally obscure, especially when extensively chained. Thus, rule testing, as in our Heuristic Testing approach, is greatly aided by comments attached to rules which indicate which system functions are facilitated by those rules.

5.4. Writing Special Testing Software

In conventional software unit testing, and for some aspects of system testing, the test personnel may have to write special software, often called test drivers. These permit the test cases to be set up and run as desired, often simulating data inputs and other parts of the system. In the previous discussion of test scripts we pointed out that running particular test cases might require setting up certain data conditions, providing special prior context, establishing certain database configurations, and so forth, in order for those cases to be validly run; such scripts include determining prior rule chains which have to be "fired" or established as true for the specific rules of interest to be activated. Special test drivers may be required for only certain of such test scripts, or else the drivers may be needed to simulate, for example, data channels, for all test cases. In addition to drivers, special test software may be justifiable for making the necessary observations concerning the effects of the test cases. To minimize the amount of special code needed, it is important to determine the driver and measurement needs for all of the fault classes and to write reusable testing code which can be adapted for the different faults, driven by input data sets and parameters rather than by in-line code.
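As a hedged illustration of such a reusable, data-driven driver, the sketch below replays scripted cases against a hypothetical `engine` wrapper (its reset/set_fact/run/get_fact/trace calls are assumptions; real shells expose different interfaces) and records rule firings and outcomes for later analysis.

```python
# Minimal test-driver sketch under stated assumptions: `engine` is a hypothetical
# wrapper around the expert system shell; the CSV log format is arbitrary.
import csv

def run_test_script(engine, script_rows, log_path="firing_log.csv"):
    """Each row: (case id, dict of attribute values, expected outcome or None)."""
    with open(log_path, "w", newline="") as fh:
        log = csv.writer(fh)
        log.writerow(["case", "fired_rules", "outcome", "expected", "pass"])
        for case_id, inputs, expected in script_rows:
            engine.reset()                       # clear working memory
            for attr, value in inputs.items():   # establish the scripted context
                engine.set_fact(attr, value)
            engine.run()
            outcome = engine.get_fact("primary_output")
            fired = ";".join(engine.trace())     # rule audit for later analysis
            log.writerow([case_id, fired, outcome, expected,
                          expected is None or outcome == expected])
```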

The GTM procedure of section 5.1 is used for several of the fault classes, and it may well be worth the effort to develop software to produce something like the attribute graphs of Figure 1 to facilitate development of the rule chains. It is certainly cost-effective to develop special software to automate production of the information given in Table 4, and it might be worthwhile to develop code to compute the test-case values as shown in Table 5.

In addition, it is very worthwhile to develop code to trace rule outcomes (if not otherwise provided by the system) and to check these outcomes for recurring faults of interest--particularly the basic safety, system integrity, and essential function faults.

5.5. Basic Safety Test-Case Generation

The discussion of generating test cases for the 10 fault classes begins with the most important of these--Basic Safety. The first of two steps in generating test cases to assess Basic Safety reliability is to identify the possible safety-related outcomes, by one of two means. One can search the available system documentation (requirements, design, and operation manuals, etc.) for descriptions which mention hazardous or damage-causing actions. Alternatively, the tester can enlist the aid of a fully experienced application professional familiar with the systems interacting with the expert system. All rule actions which cause changes in external systems or the environment are to be identified and shown to this professional; the person is asked to consider whether any of these actions are safety related, according to the six-question schema detailed at the beginning of section 3.1. For each action judged to be safety related it is necessary to determine the particular contexts that make it so; rule-programmed actions will usually not be directly and unconditionally safety related in all contexts. All of those contributing conditions and states that make the action a hazardous one must also be identified.

In the second step, for each safety fault, the rules (a) which cause the specific hazardous action to occur, and (b) which establish the enabling or necessary context for that hazardous action, must be identified. The GTM of subsection 5.1 is then used to develop specific test-case values for each safety fault using this two-step method: (1) values for rule attributes are selected which will establish the enabling context for the hazardous action; this may involve value selection to cause certain rules to fire as well as to prevent others; (2) then, GTM values are selected for the specific hazardous action; we recommend all three testing levels for this fault class.
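Purely as an illustration of this two-step assembly, the sketch below builds safety test cases from a hypothetical hazard record and hypothetical per-attribute GTM values; the data structures are stand-ins for the analyses described above, not an implementation for any particular shell.

```python
# Hedged sketch: combine the enabling-context values (step 1) with GTM values
# for the hazardous action's attributes (step 2), at all three testing levels.
def build_safety_cases(hazard, gtm_values):
    """hazard: {'action_rule': rule id,
                'enabling_context': {attr: value, ...},   # step 1: fire the enablers
                'suppress': {attr: value, ...}}           # keep other rules from firing
       gtm_values: {attr: {'level1': [...], 'level2': [...], 'level3': [...]}}"""
    cases = []
    for attr, levels in gtm_values.items():
        for level, values in levels.items():              # all three levels recommended
            for v in values:
                inputs = dict(hazard["enabling_context"])  # step 1 values
                inputs.update(hazard["suppress"])
                inputs[attr] = v                           # step 2: GTM value under test
                cases.append({"rule": hazard["action_rule"],
                              "level": level, "inputs": inputs})
    return cases
```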

5.6. System Integrity Test Case Generation

Test cases are generated for this class in the same manner as for safety: Documentation and the rules are analyzed for possible integrity-failure situations, asking oneself or others the six questions of section 3.1; the necessary contexts for these integrity failures are determined; the integrity failures are ordered in terms of decreasing severity; and the rule(s) associated with the next worst case are used with the GTM procedure in the two-step manner described above.

5.7. Essential Function Test Case Generation

Essential function test cases are more complicated than those of the first two classes. The requirements and other documentation need to be searched for specifications of necessary functions, and these functions need to be divided into essential and secondary, as discussed in section 3.1. The idea of basing test selection on the functionality requirements has a strong base in the conventional testing community (e.g., specification-based testing; Ostrand et al., 1986; Richardson et al., 1989), and the fact that expert systems are being tested does not change the justification for this approach.

A particular essential function may be accomplished only after the firing of a number of rules and after a variety of input conditions. For each essential function all of these rules and inputs need to be identified (traced requirements are enormously useful here). In addition, states of other components--data channels, databases, and so forth--may well be involved and need to be specified as to their essential values for the function to be performed.

Before any dynamic tests are selected, the states of all necessary and enabling conditions for the function should be subjected to static tests, looking for the usual anomalies (inconsistency, incompleteness, etc.). Then the conditions should be ordered from primary input to final output. Some of this ordering may involve explicit rule chaining; other ordering may be inferable from documents or experienced personnel; and other conditions may be only partially orderable, or perhaps not orderable at all. Level 1 values of attributes should first be chosen by the GTM method to develop a "true" test path for this function. The path should be tested, rule by rule, from primary input to final output, fixing problems as they are found to minimize the overall testing effort.

In addition to generating tests for the essential functions, the above-ordered set of rules and inputs can be reviewed by experienced persons to check the external validity of these rules and sequences. Rather than present the person with the whole ordered situation, the information is better presented sequentially: New inputs can be shown, and the values of attributes also displayed; the terms and rules relying on these can then be shown in a parallel column, followed by the resultant actions; then the next inputs and attribute values in the sequence are shown, and so forth. Viewers with a good knowledge of the real operational situation are thereby freed from complex analyses and can instead consult their experience to determine whether these sets of events make sense or are erroneous in some way.

5.8. Robustness Failure Test Case Generation

There are two classes of tests for this fault class, and neither necessarily involves use of the GTM method. The first class, systematic, involves determining the implicit or explicit expectations concerning input or processing features and then systematically violating these expectations. The second class, brute-force, involves attempts to cause mayhem by "senseless" and capricious manipulation of entry and function keys and by nonsense entries at user prompts and for in-process parameter input. Of most interest and importance are results which cause faults of any of the three preceding classes: basic safety, system integrity, and essential function.

In developing systematic test cases one focuses primarily on the primary and user inputs. There are a number of expectations to look for, and three important categories are: (1) type constraints, (2) context constraints, and (3) well-formedness features. Type constraints refer to the data classified into various datatypes and include specifications of possible values, explicitly or implicitly (as ranges, functions, etc.). Context constraints specify the local circumstances under which a particular data item or input is valid--what can follow what (spatially or temporally), what can't be present when something else is, and so forth. And well-formedness criteria specify how something looks, when it's complete, what regularities are expected: these include things like appropriate data formatting, presence (or absence) of leading zeroes in numbers, correct capitalization and spelling, the designated sequence of items, and so forth. Test cases should be developed to violate all three of these categories of expectations. An effective tactic is to select input which violates a number of these factors and, if there is an error, devise follow-up test cases to resolve the problem locus.
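An illustrative sketch of systematic violation generation is given below; the input specification format (`name`, `type`, `range`, `valid_after`) is a hypothetical stand-in for whatever expectations the tester has extracted from the documentation.

```python
# Illustrative sketch, not a complete robustness harness: given hypothetical
# declarations of type, context, and well-formedness expectations for one
# input field, emit values that deliberately violate each expectation.
def violating_inputs(spec):
    """spec example: {'name': 'flow_rate', 'type': float,
                      'range': (0.0, 11.0), 'valid_after': 'pump_started'}"""
    cases = []
    lo, hi = spec["range"]
    cases += [("type", "not-a-number"), ("type", None)]              # type violations
    cases += [("range", lo - 1.0), ("range", hi * 100)]              # out-of-range values
    cases += [("well-formed", f"00{lo}"), ("well-formed", f"{hi} ")]  # leading zeroes, stray blank
    if "valid_after" in spec:
        # context violation: present the value before its enabling event
        cases.append(("context", {"send_before": spec["valid_after"], "value": lo}))
    return [(spec["name"], kind, value) for kind, value in cases]

print(violating_inputs({"name": "flow_rate", "type": float,
                        "range": (0.0, 11.0), "valid_after": "pump_started"}))
```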

The brute-force test cases do not rely on analysis of system functions and input expectations. Rather, they involve generating inputs which have some chance of causing havoc--pressing all the function keys singly and simultaneously, typing punctuation and other symbols instead of the alphabet, responding extremely quickly or slowly, and so forth. Exactly what constitutes brute-force testing depends greatly on the specific hardware and user interface of the system, but most testers will have good intuitions about what constitutes "stressful" input.

If any of the faults from the first three fault classes occur during these types of robustness testing, then this testing has achieved its primary goal: stressing the boundary conditions of operation to assure safety, integrity, and basic function. If any of these faults do occur, strong consideration should be given to providing additional input checking and other software guards to prevent such errors. As with other classes, all problems found at this stage are recommended to be corrected before continuing testing.

Making one's software robust is expensive, whether it is for expert systems or other kinds of software. Personal experience from a number of years of writing software to collect behavioral data from computer-naive subjects on programming and other problems demonstrated this: the amount of code needed to detect the whole range of user input errors could exceed 20% of the total code needed to set up the experimental conditions, collect data, and perform the initial data reduction (e.g., Miller, 1974, 1981).

5.9. Secondary Function Test Case Generation

Test cases for secondary functions are selected in exactly the same way as for essential functions, section 5.7.

5.10. Incorrect I/O Test Case Generation

If the fault classes have been tested in the order recommended, many of the I/O requirements will already have been evaluated, particularly by the test cases for robustness testing and essential/secondary functions. Nevertheless, the test coverage needed to test full I/O conformity should be determined, and then those tests selected which cover the as yet untested aspects.

There are two approaches for determining test cases for I/O faults, I/O requirements analyses and I/O sensitivity analyses. Both approaches depend on developing something equivalent to a decision table which shows how specific combinations of primary input values are supposed to lead to specific primary output values; sources for this analysis are both the requirements documents and the results of static analyses, which would have computed something like this (or an equivalent connection graph) for the total rule set. In reviewing the requirements documents for I/O constraints one should be particularly on the lookout for statements which imply actions being sensitive to, or invariant over, primary input values, for example, ". . . flow regulation depends on the settings of control valves 231 and 232; the greater these settings the more is the flow restriction" (an inverse relation between flow and valve setting); and ". . . the refresh rate for the heads-up display is not sensitive to altitude but . . ." (altitude values should not influence refresh rates). The decision table should show the specific values for inputs and outputs gleaned from all sources; in addition, a special symbol has to be used to indicate that an action is supposed to be insensitive to variations in a particular input variable. 7

The first approach, I/O requirements analyses, involves testing the specific tabled sensitivities of output actions to primary inputs, using the GTM approach to test each of these requirements, both for the true case and at least for the near-false case(s) (see Table 5).

7 Since primary output actions are typically dependent on only a few primary inputs, one must be careful in recording insensitivity information or else it would be too voluminous. Only when specific invariances have been explicitly mentioned in the requirements, or else are otherwise strongly implied, should invariance symbols be entered for input/output relations.

Sensitivity analyses involve generating additional tests based on the values in the computed decision table. Tests for invariance consist of picking extreme range values (plus one or more intermediate values) of an input variable for which a specific action is noted as being invariant. Where an action is shown as dependent on input values, additional extreme input values should be chosen which test both the "true" and the "false" conditions, corresponding to Level 3 testing in the GTM approach. For those actions which depend on a Boolean relation among several inputs (such as i1 and i2 and i3), additional test values should be chosen which systematically falsify the overall term by changing one input at a time.
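A small sketch of sensitivity-test generation from such a decision-table entry is shown below; the entry format (`action`, `depends_on`, `invariant_over`) is a hypothetical encoding of the table described above.

```python
# Hedged sketch of I/O sensitivity tests driven by a decision-table-like record.
def sensitivity_cases(entry):
    """entry example:
       {'action': 'restrict_flow',
        'depends_on': {'valve_231': (0, 100), 'valve_232': (0, 100)},
        'invariant_over': {'altitude': (0, 40000)}}"""
    cases = []
    for attr, (lo, hi) in entry["depends_on"].items():
        # Level-3-style extremes plus a midpoint; output is expected to vary.
        for v in (lo, (lo + hi) / 2, hi):
            cases.append((entry["action"], attr, v, "expect output change"))
    for attr, (lo, hi) in entry.get("invariant_over", {}).items():
        # Extremes plus a midpoint of a declared-invariant input; output should not vary.
        for v in (lo, (lo + hi) / 2, hi):
            cases.append((entry["action"], attr, v, "expect no output change"))
    return cases
```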

5.11. User-Interface Test Case Generation

Although outputs to the user interface can usually be recognized in rules, it will most often not be possible to determine how those outputs will appear in the display or output device; this will be determined by other system components. In checking the knowledge base for faults of this class, then, one is limited basically to assessing whether user-overload situations are present: where too much information is being displayed, perhaps too rapidly, or where too little information is available.

It is reasonable to assume that the user interface will be operational during knowledge-base testing. Therefore the basic test-case generation procedure is to find rules that result in information to be presented to the user, select test values that cause these rules to be fired, and observe the results, from an information-processing load point of view only (not whether the information is well organized or well marked as to salience, etc.).

In making overload judgments it is important that the tester review the documentation to have some idea of the task(s) the user is supposed to be performing when the various information is presented. Judgments of too much information being displayed depend on whether the users are supposed to fully assimilate the information or whether they are only to notice a few features or find a few targets. In judging whether too little information is presented, the concern is whether there is a memory overload: The user is asked to make a decision based on several pieces of information, but only the last is available; the other pieces may have been presented but are no longer available, and too great a memory demand occurs. Whether information is being presented too fast also depends on whether the task is simple monitoring for a narrow set of signals, in contrast to full comprehension of the total material.

5.12. Error-Metric Test Case Generation

Although this class is put towards the end of the testing sequence, it happens to be the class that is least dependent on documentation and understanding of requirements. If, for some reason, the system being tested has very little documentation, and the system's functions are not well understood, it would be quite reasonable to develop test cases for this fault class early in the testing procedure as a likely means of finding some problems (how one recognizes that one has problems is another question).

Error-metric testing focuses on rules having certain complexity features which suggest that errors may be associated with them. It is the expert system's correlate of software metrics for conventional procedural programs (e.g., Grady & Caswell, 1987; Halstead, 1977; McCabe, 1983). So far as we know, this is the first time this approach has been formulated for expert system rule bases. There is, at present, no empirical data to support the features we suggest as complexity-indicating candidates, just face validity based on the principle that factors causing high complexity are probably going to be more associated with errors than others. This approach is also related to fault-based testing, which assumes a basis for error causes (e.g., Morell, 1988).

In Table 8 is presented a list of 11 hypothesized complexity factors, seven for the antecedent clause of the rule (the IF clause) and four for the consequent clause (the THEN clause). In the absence of empirical data, the simplifying assumption is made that each of the j complexity factors, f_j (j ranges from 1 to 11 in this case), contributes linearly to the overall complexity C_i of some rule R_i, as determined by a weighting constant w_j for each factor. Thus, the overall complexity for a rule can be computed as:

C_i = w_1 f_1i + w_2 f_2i + . . . + w_11 f_11i

The weights have been chosen in the example of Table 8 such that rules having an overall complexity value of greater than 1 would seem likely to be quite complex and therefore likely associated with faults.
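The linear model is simple enough to compute directly once the 11 factor counts have been extracted from a rule. The sketch below assumes those counts are already in hand (factor extraction itself is shell-specific and not shown) and uses the example weights of Table 8.

```python
# Minimal sketch of the linear complexity score of Table 8.
WEIGHTS = [1/3, 1/4, 1/4, 1/4, 1/5, 1/10, 1/3, 1/6, 1/6, 1/8, 1/3]

def rule_complexity(factors):
    """factors: the 11 factor values f_1..f_11 for one rule.
    Negative 'less 1'/'less 3' factors are clipped to 0, per Table 8, note 2."""
    assert len(factors) == len(WEIGHTS) == 11
    return sum(w * max(f, 0) for w, f in zip(WEIGHTS, factors))

# Example: a rule with 2 antecedent terms, 2 "<"/">" comparisons, 2 attributes,
# and 2 consequent terms -- roughly Rule 1 of Table 9.
print(round(rule_complexity([2 - 1, 2, 0, 0, 2 - 3, 0, 0, 2 - 1, 0, 0, 0]), 2))  # ~1.0
```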

The rationale for each complexity factor is briefly explained. The first and eighth factors, the number of clause terms, represent the intuition that as the number of terms in a rule increases, with an increasing amount of Boolean logic needed to relate them, so does the opportunity for making a mistake or incorrectly representing the requirements; since all rules will have at least one term in the IF and THEN clauses, we don't start weighting the number of terms until 2 or more, hence the "less 1" aspect of the factor.

Factor 2 reflects our observation that "<" and ">" operators are often mistaken for "<=" and ">=" operators, and vice versa. Factor 3 reflects the abundant evidence that people have many problems with negation, while the next factor reflects similar evidence that they have difficulty with combined conjunction and disjunction. Factor 5 is based on the assumption that tests on multiple variables, or attributes, introduce more difficulty and complexity than just a few; we set the threshold at 3--any more and they contribute to the overall complexity estimate. The sixth factor reflects the possibility that terms involving more than one element on each side of the comparison may get complicated (but not terribly so, as indicated by the low weight of 1/10). Factors 7 and 11 reflect the possibility that calls to external functions or procedures have a much higher chance of being associated with an error than those that involve only local values; we weight these with our highest weight, 1/3. Factor 9 is based on the premise that assignments, being state changes possibly involved in other rules, are potentially error associated.

TABLE 8
Hypothesized Factors of Complexity in the Antecedent and Consequent Clauses of a Rule Which Combine to Form the Overall Complexity Rating C_i of a Rule R_i, Where C_i = W_1 f_1i + W_2 f_2i + . . . + W_11 f_11i

Complexity Factors                                              Factor ID    Example Weight (1)

ANTECEDENT CLAUSE FACTORS (2)
1. Number of terms, less 1                                      f_1          W_1 = 1/3
2. Number of comparisons involving "<" or ">"                   f_2          W_2 = 1/4
3. Number of negations                                          f_3          W_3 = 1/4
4. Antecedent has both and and or operators
   (if yes, factor = 1; if no, factor = 0)                      f_4          W_4 = 1/4
5. Total number of attributes in all terms, less 3              f_5          W_5 = 1/5
6. Number of terms with more than two components                f_6          W_6 = 1/10
7. Number of procedure or function calls                        f_7          W_7 = 1/3

CONSEQUENT CLAUSE FACTORS
8. Number of terms in consequent clause, less 1                 f_8          W_8 = 1/6
9. Number of variable assignments                               f_9          W_9 = 1/6
10. Number of actions with more than two components
    in an equation                                              f_10         W_10 = 1/8
11. Number of procedure or function calls                       f_11         W_11 = 1/3

(1) Factors and weights have been chosen such that rules with C_i > 1 are likely to be problematic.
(2) If a factor is negative, then let the factor have a value of 0.


Finally, factor 10 indicates our suspicion that the more things one tries to do as actions in a rule, the greater the chances for error.

This complexity-metric theory can be applied by computing each of the complexity factors for the four rules of Table 3, weighting each of the factors to give 11 products, and then summing them to provide an overall complexity value for each rule. These results are shown in Table 9: Rule 3 is the most complex, by our measures, and it is roughly twice as complex as the simplest rule, Rule 1. If one were to arbitrarily decide that the most complex half of our rule base was to be tested, then we would test rules 2 and 3; alternatively, a threshold of complexity, say 1.0, could be set and every rule with a complexity greater than this threshold could be tested--rules 2-4.

Although our discussion of rule-base metrics is hypothetical, a language-independent framework has been given as a starting point for whenever data become available concerning rules which cause errors; the Table 8 factors provide a basis for characterizing these error-causing rules, but it would be easy to add new factors. The linear model can be tested by linear regression techniques, to determine values of the weights which discriminate between error-causing and "good" rules.

Generating test cases for rules chosen by the error-metric method is accomplished by using the GTM procedure.

5.13. Resource Consumption Test Case Generation

There are two aspects to developing test cases for this kind of fault. First, analyze the actions performed by the rules and judge whether these may be resource hogs. The second approach is to determine whether execution of the rules themselves may be expensive.

TABLE 9
Complexity Factor Computations for the Rules of Table 3 and the Hypothesized Complexity Factors of Table 8

Rule 1
Complexity Factor   1      2     3     4     5     6     7     8     9     10    11
Weight              1/3    1/4   1/4   1/4   1/5   1/5   1/3   1/6   1/6   1/8   1/3
Frequency (1)       2-1    2     0     0     2-3   0     0     2-1   0     0     0
Product             1/3    1/2   0     0     0     0     0     1/6   0     0     0      Sum = .99

Rule 2
Complexity Factor   1      2     3     4     5     6     7     8     9     10    11
Weight              1/3    1/4   1/4   1/4   1/5   1/5   1/3   1/6   1/6   1/8   1/3
Frequency (1)       3-1    3     0     0     3-3   0     0     2-1   0     0     1
Product             2/3    3/4   0     0     0     0     0     1/6   0     0     1/3    Sum = 1.91

Rule 3
Complexity Factor   1      2     3     4     5     6     7     8     9     10    11
Weight              1/3    1/4   1/4   1/4   1/5   1/5   1/3   1/6   1/6   1/8   1/3
Frequency (1)       2-1    2     0     0     5-3   2     0     2-1   1     1     0
Product             1/3    1/2   0     0     2/5   2/5   0     1/6   1/6   1/8   0      Sum = 2.09

Rule 4
Complexity Factor   1      2     3     4     5     6     7     8     9     10    11
Weight              1/3    1/4   1/4   1/4   1/5   1/5   1/3   1/6   1/6   1/8   1/3
Frequency (1)       3-1    2     1     1     3-3   0     0     1-1   0     0     0
Product             2/3    1/2   1/4   1/4   0     0     0     0     0     0     0      Sum = 1.67

(1) When a frequency is <0, then let the number be 0.


Likely candidates for the first case are those actions which involve calls to external procedures, particularly for database access functions; also included are in-line functions and subroutines which perform calculations; excessive CPU usage could occur with either of these types. Likely candidates for excessive memory usage are those software calls which involve writing information to a database or internal files; other possibilities are operations in the expert system source language which cause the creation of internal data structures. Such operations may cause resource difficulties either directly, through consumption of allocated memory, or indirectly--by requiring time-consuming garbage collections, by competing with external utilities and programs for space, or even by slowing down processing because of large requirements for intermediate temporary storage.

One way to detect expensive actions is to write special software to measure the performance costs of suspected rules, as part of the effort to write test drivers as discussed in section 5.4. This type of instrumentation, and simpler rule audits (recording the sequence, or at least the frequency, of rules fired for particular test conditions), are also useful for the second case of testing for resource consumption. Having data on the number of times each rule was executed can identify rules which are possibly excessively fired because of poor design; similarly, highly frequent sequences of rule firings can suggest opportunities for performance improvements.
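A very small audit of this kind is sketched below; it assumes the shell can emit an ordered firing trace (the trace format is hypothetical) and simply counts individual firings and adjacent two-rule sequences.

```python
# Sketch of a simple rule audit over a hypothetical firing trace.
from collections import Counter

def audit(firing_sequence):
    """firing_sequence: ordered list of rule ids recorded during one test run."""
    counts = Counter(firing_sequence)                       # how often each rule fired
    pairs = Counter(zip(firing_sequence, firing_sequence[1:]))  # frequent firing sequences
    return counts.most_common(5), pairs.most_common(5)

hot_rules, hot_sequences = audit(["R2", "R7", "R2", "R7", "R2", "R9"])
print(hot_rules)       # e.g. [('R2', 3), ('R7', 2), ('R9', 1)]
print(hot_sequences)   # e.g. [(('R2', 'R7'), 2), (('R7', 'R2'), 2), ...]
```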

5.14. Other Test Cases

This last category addresses the problem of testing for errors that remain after all the other fault classes have been tested for. It draws heavily on traditional methods designed for agnostic (non-requirements-driven) software testing. There are a wide variety of techniques which could be applied (see Basili & Selby, 1987; Goodenough & Gerhart, 1975; Howden, 1986), but the two primary approaches which we recommend be adapted to expert system rule-base testing within the present Heuristic Testing approach are structural testing (e.g., Hamlet, 1989; Kiper, 1989; McCabe, 1983) and random testing (e.g., Duran & Ntafos, 1984; Howden, 1986). 8

The first method involves examination, for example, of the connection graph structure of the rule base and devising tests which insure that at least all nodes in the graph have been executed; a more stringent form of structural testing is to insure that all paths (edges) in the graph have been traversed by test cases. Although this method is believed to be much less effective at finding faults--in conventional software--than functional testing, or even random testing (cf. Howden, 1980; Rushby, 1988), the argument for including it here is as follows.

8 Other relatively new exotic techniques, like mutation testing, do not easily apply to expert systems (cf. Acree et al., 1979; DeMillo, Guindi, King, McCracken, & Offutt, 1988).

First of all, functional testing has already been accomplished with the essential and the secondary function test cases (and random testing will also be done). Second, we recommend only a small investment in this method: Test only those rule nodes which have not yet been exercised by previous test cases, and only those edges also not yet tested (if resources are available for accomplishing edge testing as well).

This approach thus derives its rationale from the principle of covering obvious possibilities for faults in the nodes and edges of the rule connection graph. It is quite possible that an error can exist for a rule node but that the test of it does not create the special conditions of variable states which enable the fault to occur. However, at least one has tested for the obvious catastrophic structural problems. If one has the resources, a somewhat more detailed strategy is to completely test the connection graph, regardless of the degree to which previous tests have exercised portions of it; however, choose contexts for previously tested aspects that are new.
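The incremental bookkeeping behind this small-investment tactic is easy to sketch; the graph representation below (rules as nodes, supplies-a-value relations as edges) is an assumed format, not the paper's notation.

```python
# Hedged sketch: report only the rules and edges of the connection graph that
# earlier fault-class test cases have not yet exercised.
def untested_structure(edges, fired_rules, fired_edges):
    """edges: set of (rule_from, rule_to) pairs in the connection graph."""
    nodes = {n for edge in edges for n in edge}
    untested_nodes = nodes - set(fired_rules)
    untested_edges = set(edges) - set(fired_edges)
    return untested_nodes, untested_edges

graph = {("R1", "R2"), ("R2", "R3"), ("R1", "R3"), ("R3", "R4")}
nodes_left, edges_left = untested_structure(graph,
                                            fired_rules={"R1", "R2", "R3"},
                                            fired_edges={("R1", "R2"), ("R2", "R3")})
print(nodes_left)  # {'R4'}
print(edges_left)  # {('R1', 'R3'), ('R3', 'R4')}
```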

In addition to the structural testing driven by the connection graph, we recommend some additional test cases derived from the frequency analyses of the incidence matrices (see section 5.1 and Fig. 3). The tactic is to test more completely those structures, or rule components, that occur more frequently. The rationale is not that bugs are more likely in these components but that the impact on system reliability is much greater if a bug occurs here than in some less frequent aspect. Using the four rules of Table 3 as an example, Figure 3 indicates the following most frequently occurring elements: the term T2 (twice), the attributes a2 (4 times) and a3 (twice), and the constant c2 (twice). Our approach would thus call for additional testing of term T2 using the GTM procedure, but under different attribute truth conditions than were tested in the previous fault classes. Only two of the four occurrences of the attribute a2 are accounted for by T2, so this attribute, as well as a3, should be additionally tested, again under different truth conditions than tested previously. Finally, the occurrences of constant c2 are both accounted for by the repetition of term T2, so no additional testing of this component is warranted.

Random testing can be much more effective in increasing the reliability of conventional software than structural testing (e.g., Currit, Dyer, & Mills, 1986), and it can be more effective than partition testing (dividing the program input domain into subsets for limited testing from each) under certain cost assumptions (e.g., Duran & Ntafos, 1984; Jeng & Weyuker, 1989). It is recommended as the "clean-up" procedure for the Heuristic Testing approach, and a specific tactic for generating subsets of input values to be sampled is suggested which is particularly aimed at "quantizing" continuous attribute ranges. Since this tactic involves sampling without replacement and organizes the possible input values into critical categories, sampling these before sampling final values within them, the term "random" is not particularly accurate; but it will serve.

The results of analyzing the rules into components, as shown in Table 4, provide the basis for value selection. For each attribute to be tested, a vector of values is to be created. If it is a nonnumeric attribute, then the vector consists of all the allowable values. If it is numeric, then the min and max of the attribute are the first vector elements; next, the constant values to which the attribute is compared, in different rules, are added; these are sorted in increasing value, and between each pair of values a variable (e.g., w, x, y, z) is inserted. Thus, for attribute a2, the first elements are 1 and 11, followed by 9.2 and 8.4, the two comparison constants (from terms T2 and T5); when sorted this vector is: 1 8.4 9.2 11. Adding in variables between value pairs produces the final vector: (1 w 8.4 x 9.2 y 11).

Each of these elements represents a range of possible values from which to select, and we suggest the following two-step procedure for generating these values. (1) For the numeric values (i.e., 1, 8.4, 9.2, and 11), generate six values from an end point, the first being two step-units away and then a single step-unit apart, and generate three values in both directions from a midvalue, the first being two step-units away and then a single step-unit apart. Thus, for value 8.4, the values (8.0, 8.1, 8.2, 8.6, 8.7, 8.8) would be generated; for value 1, the values (1.2, 1.3, 1.4, 1.5, 1.6) would be calculated. Then, (2) for the variables, the possible value range is that remaining after step 1; for variable w, the range is from 1.6 to 8.0, or 6.4; divide this value (6.4) by 7 to find the intervalue distance for six values, rounding off to the step unit (.1), getting .9; then add this value successively to the lower bound, 1.6, to obtain six intermediate values for the variable w: (2.5, 3.4, 4.3, 5.2, 6.1, 7.0).

Random sampling of input values can now occur by first randomly selecting one of the attribute vector values (1 from 4 for a2), and then randomly sampling one of the six calculated values for that category. This procedure insures that value sampling is forced to occur from the most critical regions of the attribute range, and it insures that sampled values won't be among those already chosen by the GTM method; it finally insures that values are sufficiently spaced so as to increase the probability of covering the range of the attribute.
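The worked example for attribute a2 (range 1 to 11, comparison constants 8.4 and 9.2, step unit .1) can be reproduced with the short sketch below; the helper names and the endpoint handling are assumptions, but the generated values match the ones given in the text.

```python
# Sketch of the "quantizing" sampling tactic for attribute a2 of the worked example.
import random

STEP = 0.1  # step unit assumed for this attribute

def near_anchor(value, range_min, range_max):
    """Values near a critical point: inward only at a range endpoint, both
    directions for an interior comparison constant; the first value is two
    step units away, then single step units."""
    if value <= range_min:
        return [round(value + STEP * k, 1) for k in (2, 3, 4, 5, 6)]   # 1 -> 1.2 .. 1.6
    if value >= range_max:
        return [round(value - STEP * k, 1) for k in (6, 5, 4, 3, 2)]   # 11 -> 10.4 .. 10.8
    return ([round(value - STEP * k, 1) for k in (4, 3, 2)] +
            [round(value + STEP * k, 1) for k in (2, 3, 4)])           # 8.4 -> 8.0 .. 8.8

def gap_values(lower_pool, upper_pool, n=6):
    """Fill the interval left between two neighbouring anchor pools with n spaced values."""
    lo, hi = max(lower_pool), min(upper_pool)
    spacing = round((hi - lo) / (n + 1), 1)
    return [round(lo + spacing * (k + 1), 1) for k in range(n)]

w = gap_values(near_anchor(1, 1, 11), near_anchor(8.4, 1, 11))
print(near_anchor(8.4, 1, 11))   # [8.0, 8.1, 8.2, 8.6, 8.7, 8.8], as in the text
print(w)                         # [2.5, 3.4, 4.3, 5.2, 6.1, 7.0], as in the text

# Sampling proceeds in two random draws: first a category (an anchor
# neighbourhood or a gap such as w), then one of the values within it.
category = random.choice([near_anchor(1, 1, 11), w, near_anchor(8.4, 1, 11)])
print(random.choice(category))
```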

How many samples to take depends on one's budget and one's concern for reliability, but 10-50 would seem reasonable at this point in testing; it has been established that the marginal utility of successive random tests drops off as the number of tests is increased (cf. Jeng & Weyuker, 1989), so one can't justify too many. Also, a problem with this approach is evaluating the results. The other fault classes have much more recognizable outcomes--hazardous actions occurring, key functionality not being present, and so forth. Evaluation of outcomes here may well require the use of an "expert" (often called a test oracle), and this can make these tests quite expensive.

One final aspect of testing under the random-testing umbrella is mentioned, that of input-order permutation. By this it is meant, for the same test-case situation, varying the order in which input values are provided to the expert system for that test case. The objective is to see whether changes in the order of inputs will cause changes in actions and primary outputs. If the same input, presented in a different order, causes differences in output, then this might signal system faults. Whether or not these permuted orders are applied to test cases on a random basis, the permutations themselves should be randomized.
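A small sketch of this check is given below; `run_case` stands for a driver such as the one sketched in section 5.4 and is an assumption, not an existing interface.

```python
# Sketch of input-order permutation: replay the same test case with the inputs
# presented in several random orders and flag any order whose primary output
# differs from the baseline.
import random

def order_sensitivity(run_case, inputs, trials=5, seed=0):
    rng = random.Random(seed)
    items = list(inputs.items())
    baseline = run_case(items)
    differing = []
    for _ in range(trials):
        shuffled = items[:]
        rng.shuffle(shuffled)                 # randomize the presentation order
        if run_case(shuffled) != baseline:
            differing.append([name for name, _ in shuffled])
    return differing                          # orders whose outputs disagreed
```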

6. DISCUSSION

This presentation concludes with five points, mostly concerns. First, the Heuristic Testing approach has been proposed for dynamic testing of knowledge rule bases, and this approach can be generalized to other knowledge base forms and other components. The concept of a prioritized set of 10 fault classes can be usefully and more generally applied to static tests of the knowledge base, to other types of knowledge bases than rules, to other expert system components, to other specialized systems such as object-oriented and neural systems, and to conventional testing as well. In particular, the fault classes provide the basis for an approach to software reliability testing called competent reliability.

Second, a "fix-as-you-go" testing strategy has been advocated, although this is not customary testing procedure. While the prioritized approach makes this strategy quite reasonable, there should be considerable discipline associated with those fixes which are not of simple errors but which relate to conceptual errors concerning the processing logic. These problems should not be fixed directly. Rather, they should be analyzed to determine the flaws in thinking about the design logic and the knowledge structures, and those elements should be changed; changes in rules should then be driven from these higher-level changes.

A related, third, concern is for regression testing. It is crucial, whenever program testing and program repair are interleaved, that the fixes do not cause new problems. It is essential to run a carefully selected (and evolving) set of regression test cases after each fix to insure that the system is behaving as it was before the repair, aside from the specific fault that was eliminated.

Also, on the sequencing of testing, the fourth concern is that very thorough static testing must have been completed before dynamic testing is begun. There is considerable evidence from conventional software testing results that static testing is as powerful as dynamic testing in locating faults (e.g., Mills et al., 1987; Rushby, 1988). There are many reasons to believe this is even more true for rule bases, and we believe that the power of the Heuristic Method to detect remaining faults derives greatly from having removed major kinds of faults detected by static means.

Finally, there is a strong concern with automation.

A deliberate attempt has been made to formulate the Generic Testing Method and the test-case generation procedures for each fault class in as algorithmic a fashion as possible, to permit eventual automation. But other aspects of this testing method can also be automated, including test-script preparation and the recording of test-case information and outcomes. Two other aspects might well be assisted by automated means. First, a simple text-parsing technique called left-corner parsing (which finds left sides of phrases), augmentable with expert systems, has been shown in pilot work to segment well-written requirements documents into specific requirement elements. This procedure could greatly facilitate the search for functional requirements as well as safety, integrity, and user-interface ones. Second, expert systems could be developed which could automatically accomplish some essential aspects of test-case evaluation, essentially via an application-modeling and formal proving approach.

As expert systems are more and more accepted, and as they increasingly are embedded in real-time interaction with much larger systems, the need to be able to assure their reliability becomes paramount--especially since they so often accomplish key intelligent functions and can greatly influence the overall quality of the embedding system. Therefore, automated testing methods, particularly those which formally prove quality assertions, are especially needed.

REFERENCES

Acree, A.T., DeMillo, R.A., Budd, T.A., & Sayward, F.G. (1979). Mutation analysis. Technical Report GIT-ICS-79/08, Georgia Institute of Technology, Atlanta, GA.

Balcer, M.J., Hasling, W.M., & Ostrand, T.J. (1989). Automatic generation of test scripts from formal test specifications. Software Engineering Notes, 14, 210-218.

Balzer, R. (1985). A 15-year perspective on automatic programming. IEEE Transactions on Software Engineering, SE-11, 1257-1268.

Basili, V., & Selby, R. (1987). Comparing the effectiveness of software testing strategies. IEEE Transactions on Software Engineering, SE-13, 1278-1296.

Boehm, B. (1981). Software engineering economics. Englewood Cliffs, NJ: Prentice-Hall, Inc.

Culbert, C., Riley, G., & Savely, R.T. (1987). Approaches to the verification of rule-based expert systems. In Proceedings of the First Annual Workshop on Space Operations Automation and Robotics Conference (SOAR '87), Houston, TX.

Currit, P.A., Dyer, M., & Mills, H.D. (1986). Certifying the reliability of software. IEEE Transactions on Software Engineering, SE-12, 3-11.

DeMillo, R.A., Guindi, D.S., King, K.N., McCracken, W.M., & Offutt, A.J. (1988). An extended overview of the Mothra software testing environment. In Proceedings of the Second Workshop on Software Testing, Verification and Analysis, Banff, Alberta.

Duran, J.W., & Ntafos, S.C. (1984). An evaluation of random testing. IEEE Transactions on Software Engineering, SE-10, 438-444.

Franklin, M.K., & Gabrielian, A. (1989). A transformational method for verifying safety properties in real-time systems. In Proceedings, Real-Time Systems Symposium (pp. 112-123).

Goodenough, J.B., & Gerhart, S.L. (1975). Toward a theory of test data selection. IEEE Transactions on Software Engineering, June.

Grady, R.B., & Caswell, D.L. (1987). Software metrics: Establishing a company-wide program. Englewood Cliffs, NJ: Prentice-Hall, Inc.

Groundwater, E.H. (1989). Verification and validation plan for the Water Chemistry Expert Monitoring System (WCEMS). Report prepared by Science Applications International Corporation for the Electric Power Research Institute (J. Naser, Project Manager).

Groundwater, E.H., & Blanks, M.W. (1988). Reactor emergency action level monitor expert-system prototype: Independent review (Vol. 3). Electric Power Research Institute Report No. NP-5719, Research Project 2582-6, prepared by Science Applications International Corporation.

Groundwater, E.H., Donnell, M.L., & Archer, M.A. (1987). Approaches to the verification and validation of expert systems for nuclear power plants. Electric Power Research Institute Report No. NP-5236, prepared by Science Applications International Corporation, Final Report, July.

Halstead, M.H. (1977). Elements of software science. New York: Elsevier North-Holland.

Hamlet, R. (1989). Theoretical comparison of testing methods. Software Engineering Notes, 14, 28-36.

Hoffman, D., & Brealey, C. (1989). Module test case generation. Software Engineering Notes, 14, 97-102.

Howden, W.E. (1980). Software validation techniques applied to scientific programs. ACM Transactions on Programming Languages and Systems, 2, 307-320.

Howden, W.E. (1986). A functional approach to program testing and analysis. IEEE Transactions on Software Engineering, SE-12(10), 997-1005.

Howden, W.E. (1989). Validating programs without specifications. Software Engineering Notes, 14, 2-9.

Jeng, B., & Weyuker, E.J. (1989). Some observations on partition testing. Software Engineering Notes, 14, 38-47.

Kiper, J.D. (1989). Structural testing of rule-based expert systems. In Proceedings of the IJCAI-89 Workshop on Verification, Validation and Testing of Knowledge-Based Systems.

Kirk, D.B., & Murray, A.E. (1988). Verification and validation of expert systems for nuclear power plant applications. Electric Power Research Institute Report No. NP-5978. Palo Alto, CA: Science Applications International Corporation.

Kusiak, A. (1989). Identification of anomalies in rule bases. Working Paper No. 89-12, University of Iowa, Department of Industrial and Management Engineering, Iowa City, IA.

Leveson, N.G. (1986). Software safety: Why, what, and how. ACM Computing Surveys, 18(2), 125-163.

McCabe, T.J. (1983). Structured testing: A testing methodology using the McCabe complexity metric. In T.J. McCabe (Ed.), Structured testing (pp. 19-48). Anaheim, CA: IEEE Computer Society.

Miller, L.A. (1989a, January). New challenges for the validation and verification of knowledge-based systems. AIAA paper presented at the session on Issues in Validation of Knowledge-Based Systems, Reno, NV.

Miller, L.A. (1989b, June). Tutorial on validation and verification of knowledge-based systems. Paper presented at the Conference on Expert Systems Applications for the Electric Power Industry, Orlando, FL.

Miller, L.A. (1989c, August). A comprehensive approach to the verification and validation of knowledge-based systems. Paper presented at the Workshop on Verification, Validation, and Testing of Knowledge-Based Systems, held at the International Joint Conference on Artificial Intelligence, Detroit, MI.

Miller, L.A. (1981). Natural language programming: Styles, strategies, and contrasts. IBM Systems Journal, 20, 184-215.

Miller, L.A. (1974). Programming by non-programmers. Journal of Man-Machine Studies, 6, 237-260.

Mills, H.D., Dyer, M., & Linger, R. (1987). Cleanroom software engineering. IEEE Software, 4, 19-25.

Morell, L.J. (1988). Theoretical insights into fault-based testing. In Proceedings of the Second Workshop on Software Testing, Verification, and Analysis, Banff, Alberta, July. IEEE Computer Society.

Ostrand, T.J., Sigal, R., & Weyuker, E.J. (1986). Design for a tool to manage specification-based testing. IEEE Transactions on Software Engineering, SE-12, 41-50.

Richardson, D.J., O'Malley, O., & Tittle, C. (1989). Approaches to specification-based testing. Software Engineering Notes, 14, 86-96.

Rushby, J. (1988). Quality measures and assurance for AI software. NASA Contractor Report #4187, prepared for Langley Research Center under contract NAS1-17067.

Stachowitz, R.A., & Coombs, J.B. (1987). Validation of expert systems. In Proceedings, Hawaii International Conference on System Sciences, Kona, HI.

Winer, B.J. (1962). Statistical principles in experimental design. New York: McGraw-Hill.