


SPECIAL FEATURE

PROGRAM TESTING: ART MEETS THEORY*

Edward F. Miller, Jr.

Software Research Associates

Introduction

The problems of providing quality assurance for computer software have received a good deal of attention from the computing community. Such areas as program proving, automatic programming, structured programming, and hierarchical design/development methodologies have all experienced significant growth, largely as a result of the increased attention focussed on them. Program testing, on the other hand, has not enjoyed the same level of intensive investigation, even though it has a number of technical and intuitive appeals:

* Testing is a practical activity that relates directly to what a programmer and a program do, rather than to an abstraction of what each is supposed to do.

* Techniques that deal directly with source programs tend to focus attention on what is actually going on in the program.

* Some research results suggest that testing techniques can be formalized in an effective way.

* There is hope (supported, to be sure, by a great deal of intuition) that testing technology may become the "solution" to the software unreliability problem, particularly for large-scale software systems.

Both art and theory operate in program testing today. The "art" of program testing suggests new theoretical routes which drive the development of additional "theory" which, in turn, drives the accumulation of further art.

*Portions of this paper were presented at the 1975 Texas Conference on Computing Systems, Austin, Texas, November 1975.

This paper describes some recent efforts to build a bridge linking the theory of program testing with its practice. Although building that bridge has been a desirable goal, only now has sufficient research insight and actual testing experience been gained to even begin contemplating the form this practically oriented but strongly founded bridge can take.

Let us consider the issues of testing art and testing theory in the following steps:

* A discussion of the interface between testing art and testing theory and what that interface implies.

* A description of present-day idealized program testing methodologies and the impact they have had in real-world applications.

* A presentation of some recent testing theory results suggesting that a systematic testing methodology can measurably improve software quality.

* A description of a systematic (and largely automatable) program testing methodology that could be put into practice using current knowledge and techniques and could accommodate later, more powerful theoretical results.

* A discussion of some of the issues remaining to be resolved.

Art vs. theory

It would probably be fair to say that prior to the Program Test Methods Symposium1 in 1972, whatever "art" of program testing existed was a closely held secret among knowledgeable programmers. Computer software systems were considered secondary to the hardware because they cost less. Or so everyone thought. This attitude generally carried over into the treatment of software reliability issues.

COMPUTER


"The software can be done easily, so we don't haveto worry about that," managers would say with elan;and they acted on their belief-despite the fact thatthe problems encountered with the first large-scaleoperating systems and application software projectsin industry and the military were already out of handand nobody knew what, if anything, could be doneabout them.2No one seemed to have ever taken a serious look

at testing. For example, I remember vividly a conversa-tion circa 1973 with a competent computer man whosuggested the easy way to solve the software relia-bility problem was to "... test all the possible com-binations, just as they do for hardware!" And yet, aback-of-the-envelope computation of the total numberof input combinations for a subroutine that computesthe 36-bit integer sum of two 36-bit integers (i.e.,272 combinations) revealed that simple, exhaustive input-space coverage could not be done for software or forhardware.
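That envelope arithmetic is easy to reproduce. The sketch below checks the figure, assuming a generous, hypothetical throughput of 10^9 tests per second (a rate not taken from the article):

```python
# Back-of-the-envelope check of the "test all combinations" idea for a
# routine adding two 36-bit integers.  The tests-per-second rate is an
# assumed figure chosen only to make the point.

combinations = 2 ** (36 + 36)            # every pair of 36-bit operands
tests_per_second = 10 ** 9               # assumed (generous) throughput
seconds = combinations / tests_per_second
years = seconds / (365 * 24 * 3600)

print(f"{combinations:.3e} combinations")
print(f"about {years:,.0f} years at 10^9 tests per second")
```

Even at that rate the run takes on the order of 150,000 years, which is the whole argument in one number.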

Clearly, something else must be found that works. In a purely economic view, the problem can be stated this way: a practical method must be found to demonstrate that software actually does what it was intended to do. Program proving methods have the same goal, but those techniques are probably a decade away from practical application.3

The interface between art and theory. The art and theory of program testing address fundamentally different questions from fundamentally different attitudes. To emphasize the similarities, however, one can make some direct comparisons between the art of program testing and the theory that supports it:

ART                                  THEORY
Practical "how-to's"                 Technical intuition
Keeping costs within bounds          Making sure enough is spent
How to test enough                   How much testing is enough?
How to apply the theory              How to factor in past experience
Concentrating on program behavior    Dealing with ancillary details or
                                       formalisms not necessarily
                                       related to program behavior
Informal methods                     Formal methods
Developing a methodology             Areas for further research

Correlations should be obvious. "Art" involves the practical issues of "how to do it," whereas "theory" converts these "how-to's" into intuitive thought. In a practical testing situation it is important to know whether enough testing has been accomplished, but the theoretician must consider just how much testing is required to make various statements about software quality. Similarly, development of a practical methodology is an "art," but the application of that methodology identifies areas for further theoretical research.

Characteristics of a testing methodology. The nature of the program proof process,3 which requires first the development of a set of verification conditions followed by their detailed analysis and proof (possibly by a mechanical theorem prover program), has some important limitations. For instance, a so-called "proof" does not necessarily imply that the program is actually correct, since there may be an error in the other stages of the process. An incorrect program can be proved correct, perfectly properly, relative to an incorrect set of assumptions.

Whereas program proving is a reductive process (one that converts or reduces facts about programs into other forms that can themselves be shown self-consistent), program testing is inherently an affirmative process, since everything done in testing can potentially contribute information about the quality of the program being tested.

For example, if in the process of testing one finds a test case that produces an "infinite loop," the program can then be changed to avoid that problem. Similarly, a test case that produces a "divide-by-zero fault" indicates either a failure to protect the program in some way or an important restriction of the program's activities.

In the broadest terms, a formal program testing methodology would attempt to systematize two kinds of information:

(1) The sets of tests which, when executed, would reliably identify program errors of various classes or, absent that, would produce some kind of anomalous program behavior to signal the existence of a fault.

(2) An understanding of the structure of the program in terms that would support a test engineer's intuition about the proper behavior of the program.

The latter point is very important if the methodology is going to be applied to a large-scale software system. The test engineer must have a relatively clear appreciation of the program's proper operation if he is going to interact effectively with a methodology that guides the testing process. With present techniques the testing process will require human judgment, and enlightening that judgment will reduce the overall cost of achieving increased confidence in software quality.

A testing methodology that applies only to single modules (individual subprograms that are invoked by other programs) or to very small sets of modules would be only a toy. Before the methodology can be adopted seriously, it must meet two important criteria:

(1) The methodology must apply to between-module testing at all levels in a complex computer program set, including the highest levels (so-called system testing).

(2) The methodology must provide facilities for dealing with testing situations in which automated assistance by a support tool is not possible. In other words, the methodology must not "give up" when a situation not previously covered is encountered.

July 1977

As the sections below make clear, present theory does not completely satisfy these two requirements. The methodology that is presented takes this into account, so that human interaction in the testing activity becomes a natural adjunct to procedures that are more rigorous and systematic.

Existing test methodologies

Existing program testing methodologies describe an idealized set of procedures that operate to accomplish a specific testing goal. A test consists of an execution of the program with specific input data, called test data, in a controlled environment where it is possible to measure certain properties of the program as it executes the test data. Ordinarily, the output of the program tested is allowed to happen in the normal way. The test engineers compare the output with that which was supposed to be generated for the given test data. However, this is normally done "by inspection."

Whereas an ultimate automatic software testing system would include the capability to compare inputs and outputs automatically, and potentially without the user's intervention, most contemporary methodological approaches to testing concentrate on measuring the test coverage. If the maximum amount of coverage is achieved during the testing process (which may involve one or many tests), the program testing activity can be considered complete.

The test coverage measure is used as a barometer of "how far the testing process has gone." Some of the earliest effective measures used were very simple ones; more recently, measures of coverage that relate directly to structural properties of the programs being tested have been devised. It will be useful to discuss the hierarchy of testing measures given below.

C0: Programmer's intuition.

C1: Every statement in a program exercised at least once.

C2: Every program predicate outcome exercised at least once.

C3: At least one element of each equivalence class of program flow exercised at least once.

C4: All usefully distinct program flow classes tested with "reliable test data" (see below), plus de facto testing of what cannot be tested reliably.

Cn: A sufficient set of tests so that the tests amount to a formal program proof of correctness.

C0 is included because it is the most commonly used test coverage measure. One literally tests according to the programmer's whims. The C1 measure is an intuitively satisfying one because it tends to corroborate the claim that "all the statements have been exercised, so there can't be any problem." The C2 measure, which implies C1, is based on considering computer programs as directed graphs with outways from each node corresponding to each possible program predicate outcome. If C2 is achieved, the only statements that cannot have been executed are those which could never be executed (so-called unreachable statements). C2 is usually interpreted as requiring only that there be some test that forces each possible predicate outcome within the test set. Requiring that each set of outcomes occur in each possible combination is the same as requiring that a program be exercised for all of its possible flows. That number may be finite or infinite, depending on whether the program contains iteration forms or not.

The C3 measures were designed to take advantage of auxiliary forms of program structure analysis (discussed in more detail later). The C4 measure is included because that is the measure suggested in the methodology given below. The techniques for defining, and achieving, the Cn measure of coverage remain an open question.
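The gap between C1 and C2 can be seen on a one-armed "if": a single test executes every statement yet exercises only one of the predicate's two outcomes. The function and the recording set below are invented examples, not drawn from the article:

```python
# A one-armed "if": the single test clamp(15) executes every statement
# (C1 satisfied) but exercises only the True outcome of the predicate,
# so C2 stands at 50 percent until a second test forces the False outcome.

outcomes_seen = set()            # records (predicate, outcome) pairs

def clamp(x):
    result = x
    taken = x > 10               # the program's only predicate
    outcomes_seen.add(("x > 10", taken))
    if taken:
        result = 10              # the only conditionally executed statement
    return result

clamp(15)                        # every statement executed: C1 achieved
print(len(outcomes_seen) / 2)    # 0.5 -- only half the outcomes: C2 not yet
clamp(5)                         # forces the False outcome
print(len(outcomes_seen) / 2)    # 1.0 -- C2 achieved
```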

Practical experience. Measuring C2 for a program under examination is a straightforward matter. What is needed is a mechanism for recording whether or not a program's flow-of-control passes through an action that results from the particular value of a particular predicate outcome. In other words, it is necessary only to instrument the program (while preserving its logical integrity) in such a way that each of the program's decisions is recorded in some manner. Then the recorded data, called a decisional trace, can be analyzed to determine the value of C2 achieved for each particular test. The aggregate results for a series of tests made against the C2 measure show the overall effect of the testing activity. This measure can be applied at the individual module or at the system level as the test engineer chooses.

Although a few experiments have been performed in controlled circumstances, it is difficult to state quantitative results achieved with this measure because of the apparently statistical nature of program errors. Some observations are worth noting, however.
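The recording mechanism just described can be sketched in a few lines: each decision reports its site and outcome to a trace, and aggregating traces over a test series yields the achieved C2 fraction. The class, site names, and instrumented program here are hypothetical, not taken from any of the tools discussed:

```python
# Minimal sketch of C2 measurement by instrumentation: every decision in
# the program reports its site and outcome, preserving the program's
# logic, and the aggregate over a test series gives the C2 fraction.

class DecisionalTrace:
    def __init__(self, sites):
        self.sites = sites                 # names of the predicate sites
        self.seen = set()                  # (site, outcome) pairs recorded

    def decide(self, site, outcome):
        """Record an outcome and pass it through unchanged."""
        self.seen.add((site, bool(outcome)))
        return outcome

    def c2(self):
        return len(self.seen) / (2 * len(self.sites))

trace = DecisionalTrace(["x>0", "x%2==0"])

def classify(x):                           # instrumented program under test
    if trace.decide("x>0", x > 0):
        return "even" if trace.decide("x%2==0", x % 2 == 0) else "odd"
    return "non-positive"

for test_input in (4, 7, -1):              # a three-test series
    classify(test_input)
print(trace.c2())                          # 1.0: all four outcomes exercised
```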

(1) In a methodology based on the C2 measure, the testing activity eliminated nearly 90 percent of the program faults.4 It wasn't clear whether this resulted simply from requiring the programmers to examine their code very carefully.

(2) The use of automated (testing coverage measurement) tools caught between 67 percent and 100 percent of the errors, and 2-5 months earlier than they would otherwise have been detected.5 The particular tools used applied the C2 measure.

Results of this sort account for the increasing use of methodologies based largely on the C2 measure.6

Impact on development. Merely obtaining the C2 measure of test coverage is not enough. The technique also requires generating test data so that a C2 measure of 100 percent can be obtained. This necessitates a variety of methods of examining program structure and content. Several automated program analysis systems have been developed that provide this kind of capability, in addition to the basic one of program instrumentation and data collection. For example, systems such as TDEM, PET, RXVP, DAVE, and JAVS provide a spectrum of support facilities.7-11

These systems, and similar ones under development,12 provide the test engineer with detailed information of importance to building new test data. Because the problems are most severe with large programs, the methods used are either automatic or highly automated. The typical problem the systems are asked to solve is the following: given that some program segment has not been executed (as the C2 measurement shows), how can a test be constructed that does execute that segment? Here, segment means a sequence of program statements that is always executed as a unit, so that if any statement in it is executed, all statements are executed.

Advances in technology have focused on the so-called test case data generation problem. There have been several interesting proposals but very little automatically generated test data.13-15 The search for an effective method goes on.

Software reliability measure. The foregoing description addresses the level of testing achieved in a rigorous testing discipline, but it fails to deal with a more basic issue: after testing, how reliably can one expect a program to perform? Later there is a discussion of "reliable tests" (where "reliable" has a somewhat different meaning) that are qualitatively equivalent to proofs. What can be said about achieved system readiness based on the completion of a rigorous testing methodology?

Other than the results cited in two instances where detailed error data was kept throughout a project, there is, sadly, little else to go on. What is needed is a measure of software readiness that can be used "in the field." That measure would have some known properties. First, it would increase in value only when each new test performs something functionally "different" from the prior set(s) of tests. Merely executing the same test data repeatedly doesn't say very much about the software, so the measure should characterize the variations included in the testing activity.

Second, the measure should be sensitive to such factors as system cost, cost of running a test, software system criticality, etc. As difficult as these factors are to quantify, it seems possible (in principle at least) to rate them and scale them into the measure in a reasonable way.

One method to achieve this is to use the continuing testing effort as the basis for an empirical Markov model of a program's decisional transition probabilities, i.e., the effective probabilities that the program will take a particular outcome whenever a decision within it is reached. As each new test occurs, it updates the probability values for the current model. The overall reliability/readiness measure can then be recomputed. This notion relies on an interpretation of a new test in terms of the tests that have already been done, and the Markov representation simply acts as the "memory" about the past testing process.
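A minimal sketch of this idea follows, with an assumed update rule (the article does not specify one): outcome counts per decision site are folded in from each test's decisional trace, probabilities are re-estimated, and a test that exercises no previously unseen outcome is flagged as adding no functionally new information:

```python
# Hedged sketch of the empirical Markov idea: keep a count of each
# outcome taken at each decision site, update the counts from every new
# test's decisional trace, and re-estimate the outcome probabilities.
# The count-based estimator and the "new information" flag are
# illustrative assumptions, not the article's actual model.

from collections import Counter, defaultdict

class DecisionModel:
    def __init__(self):
        self.counts = defaultdict(Counter)   # site -> outcome -> count

    def update(self, trace):
        """Fold one test's (site, outcome) trace into the model.
        Returns True if the test exercised a previously unseen outcome,
        i.e., contributed functionally new information."""
        new_info = False
        for site, outcome in trace:
            if self.counts[site][outcome] == 0:
                new_info = True
            self.counts[site][outcome] += 1
        return new_info

    def probability(self, site, outcome):
        total = sum(self.counts[site].values())
        return self.counts[site][outcome] / total if total else 0.0

model = DecisionModel()
print(model.update([("p1", True), ("p2", False)]))   # True: new outcomes
print(model.update([("p1", True), ("p2", False)]))   # False: nothing new
print(model.probability("p1", True))                 # 1.0 so far
```

The "memory" the text describes is just the accumulated counts; a repeated test changes the probabilities only slightly and raises no new-information flag.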

Current developments

Current research and development activities in program testing have a natural organization. The major research topics and the way they relate to one another are shown in Figure 1. Some of the topics are discussed in more detail below.

Program analysis techniques. The notion of having a set of programs which analyzes other programs was introduced a very long time ago: assemblers were probably the earliest example of this idea. The late 1960's and early 1970's saw the development of systems that performed automatic analysis (other than compilation) in a research and development environment. For example, FACES, DAVE, and RXVP are systems that have many powerful capabilities for dealing with programs and their properties.16,10,19 Such systems provide the basis for automation of function, which is needed in program structure analyzers and ultimately for reliable testing techniques.

[Figure 1. Interrelationships of research and development areas of program testing. Topics shown include program analysis techniques, reliable testing, program structure theory, general testing theory and methodology, coverage measures, fault categorization, automated test data generation, reliability measures, symbolic execution, inequality solution, and test data derivation; arrows denote "supports."]

Reliable testing techniques. A landmark paper by Goodenough and Gerhart17 and subsequent work by Goodenough18 show that under certain special circumstances testing methods could be the functional equivalent of a formal proof of correctness. The basic idea of this work is that, at least for certain kinds of program faults, tests can be constructed to distinguish between programs which do and do not have instances of those faults. The main result of this testing theory is as follows:

Theorem 1: A reliable, valid, complete, and successful testing activity is sufficient to prove that a program has no errors.17

Although this appears to be a very attractive result, it depends on the proper understanding of the four main terms:

(1) A reliable test is one for which there is a partition of the input domain within which selection of any value results either in an error or a successful test.

(2) A valid test is one in which all data that show the program is incorrect are in the corresponding input domain of the program.

(3) A successful test is one that produces normal program output when it is run.

(4) A complete test is one that, in the aggregate, tries to do all of the things the program is supposed to do (for that test).

There are detailed technical definitions of each of these terms in Reference 17. Meeting all of these requirements is not as easy as it might seem. In fact, it seems that there are certain kinds of program faults for which there are no reliable tests in the sense defined here.

Program structure theory. One of the core issues in program testing theory is the development of a good way to characterize the set of paths within a program. This characterization can be done for many reasons: (1) to organize program test case advice; (2) to assist in generating test data automatically; or (3) to provide the basis for constructing reliable tests. (This purpose is discussed later.)

Level-i path structures. The earliest development of program structure analysis techniques was based on analyzing the directed-graph representation of the predicate structure of the program. In the directed graph each program predicate is seated at one of the program graph's nodes; the outways of that node correspond to the two (or more) possible outcomes of that predicate. The entire directed graph, or digraph, is augmented with edges so that it is a single-entry/single-exit form.

In one approach by the writer, and extended by Paige, potential program flows were characterized by the way they affected the iteration level within the program.19,20 The iteration level is an index of the number of levels of algorithm iteration within which each statement of the program lies. For example, a non-iterative program flow is described as a level-zero path. Higher-level paths, discovered by automatic graph analysis programs, are assigned increasing levels. The overall structure that results forms a tree that is used by automatic analysis programs to provide the basis for test case generation assistance.21

The advice function provided was based on the idea that extracting potential program flow sequences for manual evaluation of their feasibility would simplify an otherwise impossibly difficult problem. This was particularly true for large programs that had "rat's nest" control structures. As a human-augmentation technique the method worked well, but it suffered from requiring lengthy and detailed analysis of the potential for program flow. Other techniques based on canonical representation of programs are much better, as will be seen later.
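The iteration-level notion can be illustrated on a simplified, structured representation of a program; nesting loop bodies as lists is an assumption made for brevity, in place of the general digraph analysis described above:

```python
# The iteration-level idea on a structured (nested-list) representation:
# each statement's level is the number of enclosing loops, and a
# level-zero path passes through no loop bodies.  The program shape and
# statement names are invented for illustration.

def iteration_levels(block, depth=0, out=None):
    """Assign each named statement its iteration level."""
    if out is None:
        out = {}
    for item in block:
        if isinstance(item, tuple) and item[0] == "loop":
            iteration_levels(item[1], depth + 1, out)   # descend into body
        else:
            out[item] = depth
    return out

program = [
    "read",                          # level 0
    ("loop", [                       # outer iteration
        "update",                    # level 1
        ("loop", ["accumulate"]),    # inner iteration: level 2
    ]),
    "write",                         # level 0
]

levels = iteration_levels(program)
print(levels)
# {'read': 0, 'update': 1, 'accumulate': 2, 'write': 0}
```

A level-zero path here is read-then-write with both loops skipped; higher-level paths enter progressively deeper iteration, matching the tree of levels the text describes.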

45

Page 5: SPECIALFEATURE ittDG

Equivalence classes. In an alternative development, Howden and others22,23 created techniques for automatically identifying a complete set of program cases. Here, a case is an instance of program flow that is distinct from all others. Although this has been done only for very restricted programming languages, the technique results in somewhat simpler statements about program flow.

The set of program cases is interpreted in terms of a set of inequalities involving program variables that define a set of conditions necessary for the particular program flow to actually occur. For practical programs the set of inequalities is nonlinear, and there results a very difficult problem in finding values for the program input that match up with the inequalities, even for the limited programming language used.

Automated test data generation. It has been of great interest to devise a method for finding a set of legitimate input values for a program that will cause it to execute a previously unexecuted segment. This problem is widely known to be undecidable in general.22 However, "undecidability" does not mean that other nonprocedural or heuristic techniques cannot be used.

The general problem of test case data generation reduces to one of finding a set of values that satisfy the collection of simultaneous nonlinear inequalities that in turn result from selecting a potential program path. The path might be selected manually or by one of the automated techniques just described. There are three routes to doing this which have received some serious attention.

Symbolic evaluation. This method involves either forward or backward symbolic interpretation of actual program statements that lie along the chosen path. Research in this area centers on selecting the particular statements to be included in the analysis and the ways to process the resulting formulas. Current efforts show promise, but no fully operational system has been demonstrated. King at IBM has been very active,23 and work is also going on at Stanford Research Institute.13
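A toy forward symbolic interpreter conveys the flavor: assignments update a symbolic store, and each branch taken along the chosen path contributes a condition expressed in the original input variables. The two-field statement format is an invented illustration, far simpler than the systems cited:

```python
# Minimal forward symbolic interpretation over one chosen path.  Each
# assignment rewrites the store; each branch condition is expressed in
# terms of the inputs by substituting the store's symbolic values.

import re

def substitute(expr, store):
    """Rewrite variable names in expr by their symbolic values."""
    return re.sub(r"\b[A-Za-z_]\w*\b",
                  lambda m: store.get(m.group(0), m.group(0)), expr)

def symbolic_walk(path, inputs):
    store = {v: v for v in inputs}       # inputs stand for themselves
    conditions = []
    for kind, text in path:
        if kind == "assign":             # e.g. ("assign", "y = x + 1")
            var, expr = [s.strip() for s in text.split("=", 1)]
            store[var] = "(" + substitute(expr, store) + ")"
        elif kind == "branch":           # e.g. ("branch", "y > 0")
            conditions.append(substitute(text, store))
    return store, conditions

path = [("assign", "y = x + 1"),
        ("assign", "y = y * y"),
        ("branch", "y > x")]
store, conds = symbolic_walk(path, ["x"])
print(conds)        # ['((x + 1) * (x + 1)) > x']
```

The accumulated conditions are exactly the path's execution inequalities, which is what hands the problem off to the inequality-solving step discussed next.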

Inequality solving. Given that the inequalities describing the condition of execution for a program path are found, they must be solved. In all but the most trivial of instances, the set of inequalities contains many nonlinear ones; this appears to be a natural attribute of practical computer programs. One partially successful technique for finding a solution involved linearizing the inequalities, solving the linearized set, and then attempting the solution for the nonlinear set. If the solution is found valid, the process stops; otherwise, the linearization process is continued but with a finer (i.e., more precise) approximation. The partial successes achieved with this method are treated in detail in Reference 13.
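The problem statement (though not the linearization method of Reference 13) can be made concrete: a path condition is a conjunction of inequalities, nonlinear in general, and any satisfying point is a usable test input. A naive bounded search, invented here purely for illustration, shows the shape of the task:

```python
# A path condition as a set of simultaneous inequalities; finding any
# satisfying input is the hard step.  The exhaustive grid search below
# only illustrates the problem; it is not the linearization technique.

def solve_by_search(inequalities, lo=-20, hi=20):
    """Return some (x, y) satisfying every inequality, or None."""
    for x in range(lo, hi + 1):
        for y in range(lo, hi + 1):
            if all(g(x, y) for g in inequalities):
                return x, y
    return None

# Path condition: x*x > y  AND  x + y < 10  AND  y > 3  (note the
# nonlinear first term, typical of practical path conditions)
path_condition = [lambda x, y: x * x > y,
                  lambda x, y: x + y < 10,
                  lambda x, y: y > 3]

print(solve_by_search(path_condition))
```

Grid search is hopeless for realistic input spaces, which is exactly why the linearize-solve-refine approach was worth pursuing.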

Test case derivation. A somewhat different approach is to find a way to alter an existing set of test data so that the altered data forces the program to execute a previously unexecuted segment. For example, suppose a test execution takes the program "near" an unexercised segment. One can then use the existing test data as the starting point for a series of heuristically guided searches or variations that generally can result in success. This approach was described in some detail in Reference 24 for a very restricted language.
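A hedged sketch of derivation by variation: start from existing test data that comes "near" the unexercised segment and mutate it under a simple heuristic until the segment executes. The target program and the single-step rule are invented for illustration:

```python
# Derivation by variation: mutate an existing test datum until a
# previously unexecuted segment is reached.  Program and step rule
# are invented examples; Reference 24 treats a far richer setting.

def program(x):
    """Report which segment the input exercises."""
    if x > 100 and x % 7 == 0:
        return "rare-segment"          # previously unexecuted
    return "common-segment"

def derive(start, goal="rare-segment", limit=500):
    candidate = start
    for _ in range(limit):
        if program(candidate) == goal:
            return candidate
        candidate += 1                 # heuristic step toward the segment
    return None

found = derive(50)                     # existing test datum: x = 50
print(found, program(found))           # 105 rare-segment
```

Starting from data that already executes nearby code is what distinguishes this route from solving the path inequalities from scratch.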


Reliable path testing

The reliability of path-based testing was analyzed in a recent paper by Howden.22 "Path-based testing" means that a program is treated as an aggregate of (possibly) distinct program paths, the existence of which implies a partition of the input domain of the program into subdomains which control exactly those statements on the path chosen and no others. In his investigation of the theoretical reliability, Howden took the view that

(1) The program treated is a member of a class of programs which differ only in terms of whether or not they are correct. The incorrect ones have errors of various (known) types.

(2) The objective was to find, if possible, a restricted set of programs for which path testing would be reliable.

It is valuable to restate and paraphrase some of the rather technical theorems to see their impact on the possibility of developing a general test methodology.

Constant-structure testing. Assuming that the error in a program does not change its control flow (i.e., that the set of path classes of whatever origin is not affected), Howden proved the following:

Theorem 2: Path-based testing is a reliable method to distinguish correct and incorrect programs as long as each path is feasible and as long as the input domain of an incorrect program implied by the path does not intersect with the input domain of the correct program (along the same path).22

Here, an infeasible path is simply one which cannot be executed with any set of input data. What this theorem means is that a relatively simple path test is a reliable one under this condition: one must be able to choose any value in the specified input space and know that the program will behave correctly.

This is demonstrated by Howden on an example drawn from Reference 25. The kinds of errors intercepted by this process are called action errors, roughly characterized as those errors arising from an incorrect computation. By the assumption above, an action error does not affect the structure of the program's control flow although it may affect other features, such as the number of times around an iteration, etc. In other words, there is a restricted class of programs for which action errors can be found reliably by testing. Here, "reliably" is interpreted in the sense given after Theorem 1. For all practical purposes, such a reliable test is equivalent to a proof of correctness. The class of programs involved can be identified by the fact that, because they have no control-flow errors, they do not change the characterization of flow used.
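A small invented example of the situation Theorem 2 covers: both versions below have the same control flow, and the action error disagrees with the correct program at every point of the affected path's subdomain, so any single test value drawn from that subdomain exposes it:

```python
# Theorem 2's setting in miniature: identical predicate structure, so
# both versions partition the input domain along the same two paths.
# The action error (wrong constant) disagrees with the correct program
# everywhere on the x >= 0 path, so one test point from that subdomain
# is enough.  Both functions are invented examples.

def correct(x):
    return x + 1 if x >= 0 else -x

def incorrect(x):                       # action error: wrong constant
    return x + 2 if x >= 0 else -x      # same control flow as correct()

# Every point of the x >= 0 subdomain distinguishes the two programs:
assert all(correct(x) != incorrect(x) for x in range(0, 50))
# The other path's subdomain is untouched by this particular error:
assert all(correct(x) == incorrect(x) for x in range(-50, 0))
print("one test per path suffices here")
```

The non-intersection condition of the theorem is what fails for subtler errors: if the two versions agreed on part of a path's subdomain, a test point landing there would miss the fault.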

Infinite loop checking. The same paper investigated another theorem, this one of importance to the phenomenon of testing for infinite loops:

Theorem 3: Path testing is a reliable method to discover the existence of infinite loop errors if and only if the input domain for the program contains at least one instance of data which will cause the program to execute infinitely.22

This allows reliable testing (in the sense defined above) for infinite loops as long as the program can be invoked (i.e., executed) with data that causes the infinite loop. More important, it assures that that form of testing is reliable.

Control-structure testing. For situations wherein theerror category treated involves an incorrect controlstatement, the results suggest there is much moreinvolved than for simple action errors. An error in acontrol statement is called a case error, since it affectsthe manner in which the flow paths partition the pro-gram's input domain. There are two possibilities: (1) theprogram's path classes remain unchanged because ofthe error; and (2) the error results in significant changein the path classes. Recall that a path class is simplythe set of program flows that are alike because similarinput data results in essentially equivalent use of theprogram's segments.Whether there is a reliable testing method for case

errors when the program structure is changed is an open question in the program testing research community. When the case errors do not change the overall structure of the program, there is this result:

Theorem 4: Path testing is a reliable method to discover case errors if and only if an incorrect program (one with a case error) has input domains that do not intersect the input domains of the correct program.22

The way one tells whether a case error exists is by examining the input domains. If there are values outside the correct subsets for the intended program behavior, then there is a case error of some kind. Although this sounds relatively simple, no good techniques have been found for systematically identifying and comparing input domains in this way.
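As a toy illustration of this input-domain comparison (a hypothetical example, not drawn from the paper), suppose the correct program branches on x < 10 while a faulty version branches on x <= 10; classifying sample inputs by the branch each version takes exposes exactly the points where the faulty program's input domains depart from the correct ones:

```python
# Hypothetical correct and faulty versions of the same branch; the
# relational operator in the faulty version is a case error.
def correct_branch(x):
    return "then" if x < 10 else "else"

def faulty_branch(x):
    return "then" if x <= 10 else "else"   # case error: <= instead of <

def domain_mismatch(inputs):
    """Return the sampled inputs whose branch classification differs,
    i.e. points where the two programs' input domains fail to coincide."""
    return [x for x in inputs if correct_branch(x) != faulty_branch(x)]
```

For inputs 0 through 19 the only mismatch is the boundary point x = 10; systematic techniques for finding such boundary points without exhaustive sampling are precisely what the text notes are still missing.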

Practical experience. Howden analyzed the errors in 18 example programs in Reference 25 with the following results: (1) of the 12 action errors found, nine could have been discovered by reliable path testing techniques; (2) of the four case errors, only one could have been found by path testing; and (3) for the one infinite-loop error, it wasn't clear whether the error could have been found by path testing of any kind.

Although this analysis is encouraging, it's clear that much

work needs to be done. Current efforts aimed at fault categorization (see Figure 1) will have a direct impact on the utility of program testing. They will provide the basis for devising reliable test procedures that guarantee low-cost discovery of (and protection against) such errors.

Hierarchical decompositions

Using Howden's results on reliable testing still requires detailed knowledge about the control structure of the programs being tested. Ideally, the characterization of the control space for a program should provide a number of capabilities:

(1) The method of selecting test paths for detailed examination should allow easy and natural selection of different test paths (or groups of test paths).

(2) The sets of test paths should be as "independent" of one another as possible so that different parts of a program's capability can be addressed in a series of tests that do not depend too highly on one another.

(3) The method should be applicable both to large individual modules (separately invokable programs or subprograms) and to large-scale software systems that are composed of many modules.



(4) The method should be applicable to practical languages such as Fortran, Cobol, and PL/I rather than only to "toy languages" that are not widely used.

At least some of these capabilities are provided by a technique of hierarchically decomposing program control structures in terms of a limited set of control-structure primitives.

Decomposition primitives. It is well known that all programs can be constructed with three programming primitives: succession, selection, and iteration.26 When one draws a picture of a program decomposition, each decomposition is indicated as follows:

(1) Succession: A "dot," with the rule that the left-hand descendant precedes the right-hand descendant.

(2) Selection: A "plus," with the rule that the left-hand descendant corresponds to the "true" outcome, and the right-hand descendant corresponds to the "false" outcome.

(3) Iteration: A "star," with the rule that the left-hand descendant corresponds to the "repeated action," and the right-hand descendant corresponds to the "escape" or "exit" condition.

For a selection statement that has more than one predicate outcome (such as a Fortran computed GOTO statement with more than two indicated targets), one can permit the decomposition to have more than one outway. Although this set of primitives is the most popular, other sets could have been used as well.
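As an illustrative sketch (in modern notation; the type and function names below are this sketch's own, not from the original), the three primitives map naturally onto tree-node types, with the statement sequences as weighted leaves:

```python
from dataclasses import dataclass

@dataclass
class Seq:      # leaf: a nondecisional statement sequence with its weight
    name: str
    weight: int

@dataclass
class Dot:      # succession: left-hand descendant precedes the right-hand one
    left: object
    right: object

@dataclass
class Plus:     # selection: left = "true" outcome, right = "false" outcome
    left: object
    right: object

@dataclass
class Star:     # iteration: body = repeated action, exit = escape condition
    body: object
    exit: object

def leaves(node):
    """Collect the statement-sequence leaves of a decomposition tree,
    left to right."""
    if isinstance(node, Seq):
        return [node.name]
    if isinstance(node, Star):
        return leaves(node.body) + leaves(node.exit)
    return leaves(node.left) + leaves(node.right)
```

With these node types, the property noted below (statement sequences always show up as leaves) falls out of the representation directly.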

Decomposition technique. The kind of tree structure that results from decomposing a program according to these primitives is shown for the example program in Figure 2. In the program the total "weight" of each nondecisional statement sequence is given after the sequence name; for example, sequence A has a "weight" of four units. Here, "weight" could be interpreted as the total number of separate statements in the sequence, or it could be related in some more complex way to the computational difficulty of the sequence.27

Figure 2. Example program and corresponding decomposition tree.

The tree shown in Figure 2 effectively illustrates the overall structure of the program P. Any two programs that have the same internal organization of control statements will have precisely the same tree. Note

that the statement sequences always show up in the tree as leaves.

The tree in this example can be constructed by inspection (as the reader can verify), but for practical-sized programs it is desirable to have a small program that identifies the tree automatically. This program, which will not be discussed here, effectively turns the program's control structure inside out and generates the tree from the bottom up.27,28

Once the tree is found, one can use it to assist in

constructing reasonable test paths. The set of all subtrees that includes at least one leaf corresponds roughly to the set of all possible program flows. The first objective of analyzing the tree is to identify the set of structurally feasible flows. These are the ones that remain after the structurally infeasible flows are excluded. For example, in the tree shown in Figure 2 it is not possible to have a flow which involves sequence A and any other sequence; thus, any potential subtree that does not involve A alone is automatically structurally infeasible.

Of the trees that remain, some may be semantically

infeasible, which means that a certain program action taken at one point results in a set of conditions that makes some other program action impossible. Although the obvious approach would be to examine sequence/predicate pairs in some natural order, it turns out that this is not really necessary. Because of the way the tree is constructed, only certain kinds of relations need be examined in detail.28 For purposes of illustration we assume every path that is structurally feasible is semantically feasible.

If we want to use some subset of the collection of feasible subtrees as a guide for testing, we need criteria for the selection of candidates. Here are two initial criteria:

(1) A minimum weight for each subtree selected must be commensurate with the desire to accomplish treatment of the program in easily manageable portions.

(2) Each leaf of the original tree must be included at least once in the set of subtrees selected.

The second criterion simply assures that the testing done accomplishes the C2 coverage criterion already suggested as the minimum one. The first criterion is intended to equalize the difficulty of each individual test case (or test-case class) considered. Once the feasible subtrees are found and weights are assigned, devising a good test structure devolves to finding a balanced covering set of subtrees.

For the program in Figure 2 there are nine feasible

subtrees; these are enumerated in Table 1. Each subtree is indicated simply by noting the program sequences that belong to it; the column just to the right gives the weight associated with that subtree. The four possible covers are indicated in the last four columns. The notation Dk indicates that the D segment is actually included a variable (but finite)

Table 1. Set of structurally feasible paths for example.

TREE NO.   SEGMENTS PRESENT   WEIGHT   COVER 1   COVER 2   COVER 3   COVER 4
   1       A                     4        X         X         X         X
   2       B, E                 10        X
   3       B, Dk, E             12                  X
   4       B, F                  5                            X
   5       B, Dk, F              7                                      X
   6       C, E                  8                                      X
   7       C, Dk, E             10                            X
   8       C, F                  3                  X
   9       C, Dk, F              5        X


[Figure 2 shows the example program P, with sequence weights in parentheses, and its decomposition tree, whose leaves are the sequences A through F:

P: IF P1
       A (4)
   ELSE
       IF P2
           B (4)
       ELSE
           C (2)
       END IF
       WHILE P3
           D (2)
       END WHILE
       IF P4
           E (6)
       ELSE
           F (1)
       END IF
   END IF]



number of times, since D resides inside an iteration construct.

A very simple mechanism can be used to choose among

the covers: simply multiply together the weights of the elements in each cover and choose the cover whose product is highest. Other things being equal, a candidate cover set that distributes the weights as evenly as possible among the elements will tend to be chosen. Note that for this particular program each cover must involve the A segment since it is the sole member of an essential subtree.

The computations suggested above result in the follow-

ing totals:

COVER NO.   SUBTREES   WEIGHTS    PRODUCT OF WEIGHTS
    1       1, 2, 9    4, 10, 5         200
    2       1, 3, 8    4, 12, 3         144
    3       1, 4, 7    4, 5, 10         200
    4       1, 5, 6    4, 7, 8          224

A good starting point evident from this enumeration is the set of subtrees (1, 5, 6), since it represents a cover and has the best distribution of program weight.
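This selection rule is easy to mechanize. The sketch below is illustrative code, not the article's algorithm; it reads the four covers of Table 1 as exact covers (each leaf appears in exactly one chosen subtree) and applies the multiply-the-weights rule to the Figure 2 program:

```python
from itertools import combinations

# Feasible subtrees from Table 1: segments present and subtree weight.
# ("D" stands for the Dk entry: D taken a finite number of times.)
subtrees = {1: ({"A"}, 4), 2: ({"B", "E"}, 10), 3: ({"B", "D", "E"}, 12),
            4: ({"B", "F"}, 5), 5: ({"B", "D", "F"}, 7), 6: ({"C", "E"}, 8),
            7: ({"C", "D", "E"}, 10), 8: ({"C", "F"}, 3), 9: ({"C", "D", "F"}, 5)}

ALL_SEGMENTS = {"A", "B", "C", "D", "E", "F"}

def weight_product(cover):
    p = 1
    for t in cover:
        p *= subtrees[t][1]
    return p

def best_cover():
    """Enumerate exact covers (every segment in exactly one chosen subtree)
    and return the one with the largest product of subtree weights."""
    best = None
    for c in combinations(subtrees, 3):
        segs = [subtrees[t][0] for t in c]
        exact = (set().union(*segs) == ALL_SEGMENTS
                 and sum(len(s) for s in segs) == len(ALL_SEGMENTS))
        if exact and (best is None or weight_product(c) > weight_product(best)):
            best = c
    return best
```

Running it reproduces the enumeration: only the four covers of Table 1 qualify, and cover (1, 5, 6), with product 224, is chosen.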

Naturally, this example is an oversimplification, but the points to be made are clear. Algorithms for doing all of the computations described already exist, and while choosing an optimum cover may be something of a stumbling block, there are certainly plenty of algorithms around to serve as good initial choices.

Non-structured programs. The example program is obviously well-structured. One is unlikely to find many programs that display this simple structure in the practical world, however, and it is important to point out how such programs can in effect be converted into this format. Program decompositions of the kind discussed here run into trouble when the analyzing programs encounter control graphs that have other than (1,1)-cycles. An (m,n)-cycle exists in the directed graph of the control structure when there is a closed sequence of edges (a loop) for which there are m different entering edges and n different exiting edges.

Programs that have other than purely (1,1)-cycles

can be effectively reduced by the following two-step technique:

(1) Each (m,n)-cycle is copied over m times to result in a set of m different (1,n)-cycles.

(2) Each (1,n)-cycle is then broken down into a (1,1)-cycle and a (1,n-1)-cycle, when n is greater than 1. Each (1,1)-cycle corresponds to an iteration primitive, and the remaining cycles are decomposed in turn until nothing other than (1,1)-cycles remains.27

An alternative would be to incorporate decomposition primitives different from the set used above in a way that allows for (1,n)-cycles, n > 1.
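A small sketch can check the bookkeeping of this reduction. It counts cycles rather than transforming an actual control graph: step 1 copies an (m,n)-cycle into m (1,n)-cycles, and step 2 repeatedly peels a (1,1)-cycle off each of those, so one (m,n)-cycle ultimately yields m*n iteration primitives:

```python
def reduce_cycles(cycles):
    """Count the (1,1)-cycles produced by the two-step reduction applied
    to a list of (m, n) cycle descriptions."""
    total = 0
    for m, n in cycles:
        work = [(1, n)] * m                 # step 1: m copies, each a (1,n)-cycle
        while work:
            _, k = work.pop()
            if k == 1:
                total += 1                  # a pure (1,1)-cycle: one iteration primitive
            else:
                total += 1                  # step 2: split off a (1,1)-cycle ...
                work.append((1, k - 1))     # ... leaving a (1,n-1)-cycle to decompose
    return total
```

For example, a single (2,3)-cycle reduces to six (1,1)-cycles. This is only the counting consequence of the two steps; the actual transformation operates on the edges of the control graph.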

Other attributes of the decomposition. Recall that one objective of developing the decomposition idea is to obtain a capability to deal with large (or even very large) programs in a reasonable way. There are several properties of the tree approach that support this objective:

(1) The effect of a program invocation is simple to deal with. Suppose a program P' is invoked at (or within) statement sequence S inside program P. In effect, the tree for P' can be copied onto the tree for P in place of the sequence S. In these terms one can visualize the tree for an entire program space. Although




large, this tree has the advantage of providing a uniform structure-based representation of the total program text.

(2) Subsetting with a complicated program tree can be accomplished via the same technique, simply by removing a part of the tree from consideration. This method corresponds to creating a special subroutine that carries all of the program text in the subtree removed from the original program text, resulting in a logically simpler program to test. This technique has sometimes been called program factoring since the effect is to break a program into pieces small enough to be tested easily.
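A minimal sketch of factoring on a decomposition tree (trees as nested tuples here, with names of this sketch's own choosing): the target subtree is replaced by a leaf naming the factored-out subroutine, and both pieces are returned so each can be tested separately:

```python
def factor(tree, target, call_name):
    """Replace the subtree equal to `target` with a leaf naming the
    factored-out subroutine; return (simplified tree, factored subtree).
    Trees are either string leaves or (op, left, right) tuples."""
    if tree == target:
        return call_name, target
    if isinstance(tree, str):
        return tree, None
    op, left, right = tree
    new_left, found = factor(left, target, call_name)
    if found is not None:
        return (op, new_left, right), found
    new_right, found = factor(right, target, call_name)
    return (op, left, new_right), found
```

For instance, factoring the selection subtree out of ("dot", "A", ("plus", "B", "C")) leaves the logically simpler ("dot", "A", "CALL_S1") plus the removed piece.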

Suggested methodology

The framework for a methodology that combines automated program analysis, reliable path-based testing, and hierarchical decompositions of programs should now be clear. Portions of the methodology can be used with present techniques as a practical basis for program testing. However, the methodology virtually requires automated assistance of some kind if it is going to be applied to realistic-sized programs. The reason for this is that the complexity of some of the computations quickly grows to the point where manual computation is plainly unthinkable for a 10,000-line or 100,000-line program.

The automatable functions required tend to fall into

these categories:

(1) Some form of automated test data generator, used

to develop test data after a particular program part has been identified as previously untested.

(2) A means to verify test execution coverage automatically so that data can be used to assist the human interacting with the system in choosing (and assessing) the next appropriate actions.

(3) A structure analyzer that performs the functions required to develop the program decomposition tree.

(4) A program of some kind that checks on the theoretical reliability of a planned test. This program applies the most recent theory of reliable testing (whatever that is) to the problem of deciding if a reliable test can be done for the indicated path.

(5) A support system to maintain the program data base, interface with the user interactively, and maintain a management-oriented summary of actions taken and current status.

These system components can be assembled with existing

techniques and can be engineered to assure they do not falter when confronted with the very large programs likely to be treated.

The methodology for combining all of these ingredients

is as follows, assuming that the program compiles correctly and there is at least one test case:

Step 1: The automated structure analyzer computes the tree for the entire program set and, possibly with human assistance, factors the tree into manageably small portions. Then, possibly also with human assistance to resolve problems in developing a cover, particular program paths are selected for detailed treatment. It is likely the first paths to be analyzed would be part of the initial test case.

Step 2: The automated test reliability analyzer determines if the chosen path can be tested by a known-to-be-reliable test method. If it can, the methodology proceeds with Step 3; otherwise, it continues with Step U.

Step 3: For the test path just selected, the automated test data generator is used to identify a set of


data which causes the test case chosen to execute. Since the program execution trace for prior tests is available, this information can be used to help find appropriate test data. The result is the data needed to perform a verifiable test of the program.

Step 4: The parts of the program that were identified as reliably testable by the verified test just accomplished can now be pruned from the program tree. The methodology continues with Step 1 until there are no more leaves on the tree.

Step U: This is the undetermined set of human actions which might have to be taken when (a) reliably testable program paths cannot be found but some segments have not been tested; or (b) the test data generation process fails for some reason. The actions taken at this point depend on human judgment as well as experience with the tools used. The methodology continues with Step 1.
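The four steps and Step U amount to a simple control loop. The skeleton below is a sketch with stand-in parameters for the automated components (all names are illustrative; a real system would plug in the structure analyzer, reliability analyzer, and test data generator described above):

```python
def run_methodology(tree_leaves, reliably_testable, generate_data):
    """Drive Steps 1-4 and Step U over the leaves of a (factored)
    decomposition tree. `reliably_testable` stands in for the test
    reliability analyzer; `generate_data` for the test data generator."""
    untested = set(tree_leaves)
    needs_human = []                        # paths deferred to Step U
    while untested:                         # Step 4 loops back to Step 1
        path = untested.pop()               # Step 1: select a path to treat
        if not reliably_testable(path):     # Step 2: reliability check
            needs_human.append(path)        # Step U(a): no reliable method
            continue
        data = generate_data(path)          # Step 3: find driving test data
        if data is None:
            needs_human.append(path)        # Step U(b): generation failed
            continue
        # Step 4: the path is verified and stays pruned from the tree.
    return needs_human
```

With toy stand-ins (say, every path but one is reliably testable), the loop terminates with the troublesome path handed off for human judgment.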

Some of the humanly set parameters to the system used in conjunction with the methodology may include:

(1) the "size" of the program factors to be treated individually;

(2) the variations allowed when developing a covering set for some chosen interior subtree of the program; and

(3) conventional resource limits and progress milestones.

Future problems

While all of the techniques just described are well within the state of the art, many technical issues remain. Moreover, the basic notion of applying proven testing methods only where they can do some good will provoke still additional questions. Some discussion of the presently apparent problems will focus the attention of researchers on the rough points.

The decomposition techniques used apply nicely to

nonrecursive programs and to programs written in high-level languages (such as Fortran, Cobol, and PL/I). It's clear there are many other program classes, including those written using recursion and those written in assembly language, that are important to address, and the extension of the structure analysis algorithms to a broader range is going to be quite important. A secondary problem is the method used to treat (or distinguish between) differing mechanisms of invocation, such as call-by-name or call-by-reference.

Little research has been accomplished that addresses methods for testing programs that run in "real time" or that participate in concurrent processes. One problem is to find techniques for verifying path sequences without either violating the program's time line or over-expanding the executable text beyond a machine's capacity. A general approach to testing the time-sensitive parts of application programs is also needed.

The classes of program errors currently treatable

are relatively small and need to be expanded significantly. In addition, a way to contend with case errors must be found. One suggestion is to employ the decomposition tree as a "pattern" for the program structure and assure that a correct pattern has been used by some other technique independent of the methods described here.

Although much research has been done concerning the

problems of control of precision and such issues as round-off, underflow, and overflow, this work has never



been factored into the program testing area. This topic would be of particular concern when considering building symbolic evaluation systems.

Experience on large software systems with a systematic

program testing methodology is needed so that overall costs can be estimated, possibly in comparison with costs for other quality assurance techniques, such as program proving. Although it seems likely that program testing will be far cheaper than program proving, the quantitative basis for this presumption does not yet exist.

Finally, the techniques will have to be extended to cover the full spectrum of computer applications, including in particular the newly burgeoning micro- and minicomputer marketplace where software quality is not yet (but should be) a serious matter.

References

1. W. C. Hetzel (ed.), Program Test Methods, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1973.

2. F. P. Brooks, Jr., The Mythical Man-Month, Addison-Wesley, Reading, Massachusetts, 1975.

3. R. London, "A View of Program Verification," Proc., 1975 International Conf. on Reliable Software, Los Angeles, California, pp. 534-545.

4. J. R. Brown, "Why Tools?" Proc., Workshop 4, Computer Science and Statistics: Eighth Annual Symposium on the Interface, Los Angeles, California, February 1975.

5. D. S. Alberts, "The Economics of Software Quality Assurance," AFIPS Conf. Proc., 1976 NCC, Montvale, New Jersey, 1976, pp. 433-442.

6. E. F. Miller, Jr., "Methodology for Comprehensive Software Testing," Rome Air Development Center, RADC-TR-75-161, June 1975.

7. J. R. Brown and M. Lipow, "Testing for Software Reliability," Proc., 1975 International Conf. on Reliable Software, Los Angeles, California, pp. 518-527.

8. L. Stucki, "A Prototype Automatic Program Testing Tool," AFIPS Conf. Proc., 1972 FJCC, Montvale, New Jersey, 1972, pp. 829-836.

9. E. F. Miller, Jr., "RXVP: An Automated Verification System for FORTRAN," Proc., Workshop 4, Computer Science and Statistics: Eighth Annual Symposium on the Interface, Los Angeles, California, February 1975.

10. L. J. Osterweil and L. D. Fosdick, "Data Flow Analysis as an Aid in Documentation, Assertion Generation, Validation, and Error Detection," CU-CS-055-74, University of Colorado, September 1974.

11. E. F. Miller, Jr., et al., "Jovial Automated Verification System," Rome Air Development Center, RADC-TR-76-20, February 1976.

12. E. F. Miller, Jr., "A General Purpose Program Analyzer," Software Research Associates, RN-210, March 1977.

13. R. S. Boyer, B. Elspas, and K. N. Levitt, "SELECT: A Formal System for Testing and Debugging Programs by Symbolic Execution," Proc., 1975 International Conf. on Reliable Software, Los Angeles, California, pp. 234-245.

14. M. Holthouse and E. S. Cosloy, "A Practical System for Automatic Testcase Generation," presented at 1976 NCC, New York, June 1976.

15. E. F. Miller, Jr., and R. A. Melton, "Automated Generation of Testcase Datasets," Proc., 1975 International Conf. on Reliable Software, Los Angeles, California, pp. 51-57.

16. C. V. Ramamoorthy and S. F. Ho, "Fortran Automatic Code Evaluation System," Electronics Research Laboratory, University of California, ERL-M466, July 1974.

17. J. B. Goodenough and S. L. Gerhart, "Toward a Theory of Test Data Selection," Proc., 1975 International Conf. on Reliable Software, Los Angeles, California, pp. 493-510.

18. J. B. Goodenough, "Program Testing Survey," to appear in InfoTech State-of-the-Art Report, 1977.

19. E. F. Miller, Jr., "A Survey of Major Techniques of Program Validation," General Research Corporation, RM-1731, October 1972.

20. M. R. Paige, "Program Graphs, An Algebra, and Their Implication for Programming," IEEE Trans. on Software Engineering, September 1975, pp. 286-291.

21. E. F. Miller, Jr., et al., "Structurally Based Automatic Program Testing," Proc., EASCON-74, Washington, D.C.

22. W. E. Howden, "Reliability of the Path Analysis Testing Strategy," IEEE Trans. on Software Engineering, September 1976, pp. 208-214.

23. J. C. King, "A New Approach to Program Testing," Proc., 1975 International Conf. on Reliable Software, Los Angeles, California, pp. 228-233.

24. M. R. Paige, "A Pragmatic Approach to Software Testcase Generation," Science Applications, Inc., September 1975.

25. B. W. Kernighan and P. J. Plauger, The Elements of Programming Style, McGraw-Hill, New York, 1974.

26. E. Ashcroft and Z. Manna, "The Translation of GOTO Programs to WHILE Programs," Information Processing 71, North-Holland Publishing, 1972, pp. 250-255.

27. J. E. Sullivan, "Measuring the Complexity of Computer Software," MITRE Corporation Report, 1973.

28. E. F. Miller, Jr., "Tree-based Program Decomposition and Test Planning Techniques," Software Research Associates, RN-208, March 1977.

Edward F. Miller, Jr., is an independent consultant and lecturer with Software Research Associates in San Francisco, California. His interests include software engineering management support, program testing technology, hierarchical design methods, program analysis tool development, automation of function in software engineering systems, and computer architecture.

Dr. Miller was previously director of the

Software Technology Center, Science Applications, San Francisco, and director of the Program Validation Project at General Research, Santa Barbara. He holds a BSEE from Iowa State University, an MS in applied mathematics from the University of Colorado, and a PhD from the University of Maryland, where he was an instructor from 1964 to 1968.

Dr. Miller is a member of IEEE, ACM, and SIAM, and editor of SIGARCH's bimonthly publication, Computer Architecture News. He is currently preparing a textbook and short-course material on program testing techniques.
