
Foundations of Software Testing: Dependability Theory

Dick Hamlet
Portland State University

Center for Software Quality Research

Abstract

Testing is potentially the best grounded part of software engineering, since it deals with the well-defined situation of a fixed program and a test (a finite collection of input values). However, the fundamental theory of program testing is in disarray. Part of the reason is a confusion of the goals of testing: what makes a test (or testing method) "good." I argue that testing's primary goal should be to measure the dependability of tested software. In support of this goal, a plausible theory of dependability is needed to suggest and prove results about what test methods should be used, and under what circumstances. Although the outlines of dependability theory are not yet clear, it is possible to identify some of the fundamental questions and problems that must be attacked, and to suggest promising approaches and research methods. Perhaps the hardest step in this research is admitting that we do not already have the answers.

1. Testing Theory — Testing Art

Glenford Myers' textbook is entitled The Art of Software Testing [Myers], and it retains its popularity 15 years after publication. Indeed, testing is commonly viewed as an art, with the purpose Myers assigns to it: executing software to expose failures.

Program testing occupies a unique niche in the effort to make software development an engineering discipline. Testing in engineering is the corrective feedback that allows improvement in design and construction methods to meet practical goals. Nowhere is the amorphous nature of software a more evident difficulty: testing is hard to do, and fails to meet engineering needs, because software failures defy easy understanding and categorization. When a software system fails, often we learn little except that there were too many possible failure modes, so one escaped analysis.

A number of new results and ideas about testing, experimental and theoretical, are emerging today, along with a renewed interest in software quality. There is reason to hope that an adequate scientific basis can be supplied for the testing art, and that it can become the part of software engineering with the strongest foundation. It is right to expect exact results, quantitative and precise, in software testing. Each of the other development phases involves essentially creative, subjective work. Better methods in those other phases are aids and checks to human effort. No one expects, for example, that a specification could be proved to be "99.9% correct," since "correct" has an essentially subjective meaning. But by the time we come to testing, there is a completely formal product (the program), and a precise measure of quality (it should not fail). Testing theory can be held to a high engineering standard.


1.1. Assessing Testing Methods

It is sobering to imagine advising a developer of safety-critical software about how to test it. While there are textbooks and research papers describing many testing methods, there is a dearth of experiments on what works, and there is no convincing theory to explain why a method should work, or under what circumstances it is indicated. Imagine giving a presentation on testing before the Federal Aviation Administration, and that the FAA would write its content into regulations. How safe would the speaker feel flying on planes whose flight-control programs were tested according to those regulations? I know what I would tell the FAA today: get the smartest test people available, give them whatever resources (particularly plenty of time) they need, encourage them to use any test methods they find useful; and then, everyone cross their fingers, and fervently hope the software doesn't fail. There is nothing wrong with our "best practices" of testing, and I would encourage the FAA to require their study and even their use, but to suggest relying on today's testing methods is unconscionable. This imaginary advice to the FAA recognizes that software testing today is not engineering. Its principles are not known, so they cannot be routinely learned and applied. Systematic effort and careful record-keeping are important parts of engineering, but they alone are only a facade. The substance is supplied by the underlying science that proves the methods work. In software testing it is the science that is lacking.

A good deal of the confusion in evaluating testing methods arises from implicit disagreements about what software testing is supposed to do. In the literature, it is possible to identify four epochs:

(1) Seeking failures by hard work. ([Myers], 1979.) Myers recognized that to uncover failures requires a negative mindset. The testing methods used to expose failure are mostly based on test coverage. An emphasis on systematic effort encourages the naive belief (not claimed by Myers) that when one has worked as hard as possible, the result must be all right. By concentrating on the software quality "process," the current fad diverts attention from its technical basis, and confuses following procedures with real success.

(2) Failure-detection probability. (Codification of (1), mid 1980s [Duran & Ntafos] to the present [Frankl & Weyuker].) Testing methods were first compared only anecdotally, or in circular terms. Today precise statements about failure-finding probabilities are expected in a comparison. An RADC-commissioned study [Lauterbach & Randall] illustrates the transition: it has a histogram showing that "branch coverage" got the highest coverage [branch!]; but it also measured methods' ability to expose failures.

(3) Software reliability engineering. (Beginning with the reliability theory developed at TRW [Thayer+], and currently most popular as models of debugging [Musa+].) Nelson's work at TRW attempted to apply standard engineering reliability to software. Reliability (the probability of correct behavior under given operating conditions for a given time period) is certainly an important property of any artifact. However, the application to software is suspect. Today "reliability" is usually associated with so-called "reliability-growth models" of the debugging process, which have been criticized as little more than curve fitting.

(4) Dependability theory. (Current research, e.g., [Hamlet & Voas].) Dave Parnas noted the difference between reliability and what he called "trustworthy" software: "trustworthiness" leaves out "given operating conditions" and "given time." Thus it is most like a confidence estimate that software is correct. I called this idea "probable correctness" [Hamlet87]; here it is called "dependability." Paradoxically, although it might seem harder to test software for dependability than for reliability, it may in fact be easier in practice.

These four ideas of the essence of software testing are quite different, and lead to quite distinct conclusions about what kind of testing is "good." Two people, unconsciously holding, say, viewpoints (2) and (3) respectively, can easily get into a fruitless argument about testing.

Proponents of all four ideas lay claim to the goal explicit in (4): everyone wants to believe that software can be trusted for use because it has been tested. Casting logic to the winds, some proponents of the other viewpoints claim to realize this goal. To give a current example, those who today use defect-detection methods claim a connection between those methods and confidence in the tested software, but the argument seems to be the following:

I've searched hard for defects in this program, found a lot of them, and repaired them. I can't find any more, so I'm confident there aren't any.

Consider the fishing analogy:

I've caught a lot of fish in this lake, but I fished all day today without a bite, so there aren't any more.

Quantitatively, the fishing argument is much the better one: a day's fishing probes far more of a large lake than a year's testing does of the input domain of even a trivial program.

The first step in testing theory is to be clear about which viewpoint is being investigated. Secondly, an underlying theory of dependability (4) is fundamental. If we had a plausible dependability theory, it might be possible to establish (or refute) claims that test coverage (1) and (2) actually establish dependability.

In this paper I try to identify the problems central to a theory of testing for software dependability, and suggest promising approaches to their solution. Unfortunately, there are more problems than promise. But the first step in scientific inquiry is to pinpoint important questions and to establish that we do not know the answers.

1.2. Questions Arising from Practice

Myers' book describes many good (in the sense (1) above!) testing ideas that arose from practice. The essential ideas are ones of systematic coverage, judging the quality of a test by how well it explores the nooks and crannies of program or specification. The ideas of functional coverage, based on the specification, and control coverage, based on the program's control structure, are the most intuitively appealing. These ideas have appeared in the technical literature for more than 20 years, but their penetration in practice is surprisingly shallow. The idea of mutation coverage [Hamlet77, DeMillo78] is not so well regarded. But mutation, a technique of practical origin with limited practical acceptance, has had an important influence on testing theory.

The foundational question about coverage testing is: What is its significance for the quality of the software tested?

Traditional practical testing in the form of trial executions is today under strong attack by advocates of two "up front" software technologies. Those who espouse formal development methods believe that defect-free software can be created, obviating the need for testing. The position is controversial, but can only support the need for basic testing theory. Formal-development methods themselves must be validated, and a sound theory of dependability testing would provide the means. A profound belief in the inevitability of human error comes with the software territory, so there will always be a need for better development methods, and a need to verify that those methods have been used properly in each case. The second attack on testing comes from the competing technology of software inspection. There is evidence to show that inspection is a cost-effective alternative to unit testing; the IBM FSC group responsible for the space shuttle code has found that if inspection technology is used to the fullest, unit testing seldom uncovers any failures [Kolkhorst].

As a consequence of the success of inspections, the promise of formal methods, and a renewed interest in reliability (the three are combined in the "cleanroom" method [Cobb & Mills]), it has been suggested that unit testing be eliminated. Practical developers are profoundly wary of giving up unit test.

Those who advocate coverage methods, attention to formal development, system testing in place of unit test, etc., argue for their positions on essentially non-scientific grounds: "do this because it is obviously a good idea." A scientific argument would require that "good" be defined and convincing arguments be presented as to why goodness results. Even if there is agreement that "good" means contributing to dependability, the arguments cannot be given in the absence of an accepted dependability theory. There is agreement on the intuitive meaning of dependable software: it does not fail in unexpected or catastrophic ways. But no one suggests that it will soon be possible to conduct convincing real-world controlled experiments. Even case studies are too expensive, require years' worth of hard-to-get field data, and could be commercially damaging to the company that releases real failure data. It appears that theory is the only available approach; but we have no accepted theory.

1.3. Foundational Theory to be Expected

The missing theory of testing is a "success" theory. What does it mean when a test succeeds (no failures occur)? It is hard to escape the intuition that such a theory must be probabilistic and fault-based. A success theory must treat the test as a sample, and must be of limited significance. The fault-based view arose from mutation [Morell & Hamlet] and from hardware testing [Foster]. Joe Duran has long been an advocate of the probabilistic viewpoint [Duran & Wiorkowski, Duran & Ntafos, Tsoukalas+].

In brief, a foundational theory of testing for software dependability will make precise the idea of sampling and not finding program failures. New ways of testing may be needed to realize dependability. But a major benefit to be expected is a precise analysis of existing testing methods, and believable answers to the questions about their significance. A good theory would also provide direction for experiment. It is probably impractical to measure dependability directly [Butler & Finelli], but a theory can be supported or disproved by deducing from it less-than-ultimate results, which can be checked.

In Section 2 to follow, some promising work is examined to show the pattern of an emerging body of analysis methods and results. Section 3 examines the validity of the statistical approach on which these results depend, and concludes that a better foundation is needed. Section 4 explores preliminary definitions of dependability.

2. Examples of Emerging Testing Theory

The driving force behind the emerging testing theory is probabilistic analysis. By accepting that the qualities of tests cannot be absolute, the way is opened for quantifying properties of testing.

2.1. Improving the ‘Subsumes’ Relation

As an example of changing research directions, consider analysis of the data-flow coverage methods. The initial work [Rapps & Weyuker] defined a hierarchy of methods (now usually called the subsumes hierarchy) such that if method Z subsumes method X, then it is impossible to devise a method-Z test that is not also a method-X test. The widespread interpretation of "Z subsumes X" is that method Z is superior to method X. (The most-used example is that branch testing is superior to statement testing, because branch coverage strictly subsumes statement coverage.) However, I suggested [Hamlet89] that subsumption could be misleading in the real sense that natural (say) branch tests fail to detect a failure that (different) natural statement tests find. A continued exploration [Weyuker+] showed that the algebraic relationship could be refined so that it was less likely to be misleading, and that it could be precisely studied by introducing a probability that each method would detect a failure. In a recent paper [Frankl & Weyuker], the subsumes relationship is refined to a relationship called "properly covers," and a probabilistic argument shows that "properly covers" cannot be misleading. The all-uses dataflow criterion, for example, properly covers branch testing. This analysis is the first comparison of methods on theoretically defensible grounds. It must be noted that the results apply to failure-detection, not to reliability or dependability.

2.2. ‘Partition’ Testing vs. Random Testing

Practical coverage testing could be called "partition testing," because its methods divide the input domain into subdomains, which constitute a partition in the two important cases of specification-based blackbox testing, and path-coverage structural testing. In the early 1980s, Duran and Ntafos conducted a seminal study contrasting partition testing with random testing [Duran & Ntafos]. Their study was presented as a brief for random testing, but its effect has been to illuminate basic questions of testing theory.

Duran and Ntafos considered two statistical measures of test quality: the probability that some failure(s) will be detected, and the expected value of the number of failures discovered. (The two measures gave similar results; only the former has been investigated subsequently.) Today the statistical treatment, and a goal clearly related to quality, seem merely appropriate; at the time the paper was presented, these were both novelties. The calculations required a detailed model for partition testing, and some simulation to realize the comparison. Despite the authors' care to give partition testing the advantage when assumptions were needed to make the comparison mathematically tractable, the results showed little difference between partition and random testing.

Subsequent work extended the comparison to study why partition testing failed to show a superiority that is commonly perceived. The model of partition testing was varied in one study [Hamlet & Taylor]; another [Jeng & Weyuker] used only analytical methods in a simplified setting. The two studies agreed that the failure-detection performance of partition testing depends on the variability of failure probability across the subdomains of the partition. If some subdomains are not known to be more failure-prone, then partitioning the input space will be of little advantage.
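To make that dependence concrete, the following sketch (not the published models themselves; the failure rates and test budget are invented) compares, by direct calculation, the chance that at least one failure is found by n random tests over the whole domain versus n/k tests in each of k equally weighted subdomains.

```python
# Probability of detecting at least one failure: random vs. partition testing.
# P_random    = 1 - (1 - theta)^n               (theta = overall failure rate)
# P_partition = 1 - prod_i (1 - theta_i)^(n/k)  (n/k tests per subdomain)
# All failure rates below are invented for illustration.

def p_random(theta, n):
    """Chance that n random tests over the whole input domain find a failure."""
    return 1.0 - (1.0 - theta) ** n

def p_partition(thetas, n):
    """Chance that n tests, spread evenly over the subdomains, find a failure."""
    per_subdomain = n // len(thetas)
    miss = 1.0
    for theta_i in thetas:
        miss *= (1.0 - theta_i) ** per_subdomain
    return 1.0 - miss

n, k = 100, 10
uniform = [0.01] * k                     # same failure rate everywhere
concentrated = [0.1] + [0.0] * (k - 1)   # same overall rate, one bad subdomain

for name, thetas in (("uniform", uniform), ("concentrated", concentrated)):
    overall = sum(thetas) / k            # equal-weight profile
    print(f"{name:12s} random: {p_random(overall, n):.3f} "
          f"partition: {p_partition(thetas, n):.3f}")
# With uniform rates the two are identical; partition testing pulls ahead
# only as failure probability varies more sharply across subdomains.
```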

The combined analytical and simulation techniques pioneered by Duran and Ntafos are easier to use than either pure analysis or pure experimentation. But they have not yet been exploited to examine questions of reliability or dependability.

3. Software Failure Intensity and Reliability

The results described in Section 2 rely on the conventional theory of reliability. Random testing [Hamlet94] supplies the connection between software quality and testing, because tests sample the behavior that will be observed in practice. However, the viewpoint that tests are statistical samples is controversial. The theory of software reliability, developed at TRW in the 1970s [Thayer+], remains a subject of dispute. This section investigates the question of whether this theory, or any statistical theory, can plausibly describe software failures.

3.1. Analogy to Physical Systems

Distrust of statistical ideas for software hinges on the non-random nature of software design flaws. The software failure process is utterly unlike random physical phenomena (such as wear, fabrication fluctuations, etc.) that make statistical treatment of physical systems plausible. All software failures are the result of discrete, explicit (if unintentional) design flaws. If a program is executed on inputs where it is incorrect, failure invariably occurs; on inputs where it is correct, failure never occurs. This situation is poorly described as probabilistic. Suppose that the program fails on a fraction Θ of its possible inputs. It is true that Θ is a measure of the program's quality, but not necessarily a statistical one that can be estimated or predicted. The conventional statistical parameter corresponding to Θ is the instantaneous hazard rate or failure intensity z, measured in failures/sec. For physical systems that fail over time, z itself is a function of time. For example, it is common to take z(t) as the "bathtub curve" shown in Figure 3.1-1.

[Figure 3.1-1. 'Bathtub' hazard rate: the hazard rate z(t) plotted against time t is high during "wear in," nearly constant in mid-life, and rises again at "wear out."]

When a physical system is new, it is more likely to fail because of fabrication flaws. Then it "wears in" and the failure intensity drops and remains nearly constant. Finally, near the end of its useful life, wear and tear makes the system increasingly likely to fail.

3.2. Abstraction to the Simplest Situation

What is the corresponding situation for software? Is there a sensible idea of a software failure intensity? There are several complications that interfere with understanding. Because this paper is concerned with fundamentals, it attempts to simplify the situation as much as possible, to abstract away from extraneous issues, without losing the essential character of the problem.

The first issue is time dependence of the failure intensity. A physical-system hazard rate is a function of time because the physical system changes. Software changes only if it is changed. Hence a time-dependent failure intensity is appropriate for describing the development process, or maintenance activities. (The question of changing usage is considered in Section 3.3 below.) Only the simplest case, of an unchanging, "released" program, is considered here. Thus we are not concerned with "reliability growth" during the debugging period [Musa+].

Some programs are in continuous operation, and their failure data is naturally presented as an event sequence. From recorded failure times $t_1, t_2, \ldots, t_n$ starting at 0, it is possible to calculate the mean time to failure (MTTF), $\bigl(t_1 + \sum_{i=1}^{n-1}(t_{i+1} - t_i)\bigr)/n$, which is the primary statistical quality parameter for such programs. But MTTF is of questionable statistical meaning for the same reasons that failure intensity is. It is a (usually unexamined) assumption of statistical theories for continuously operating programs that the inputs which drive the program's execution are "representative" of its use. The inputs supplied and their representativeness are fundamental to the theory; the behavior in time is peculiar to continuously operating programs. Exactly the same underlying questions arise for a "batch" program in which a single input instigates a "run" that either succeeds or fails, entirely independent of all other runs.
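As a quick sanity check on the formula, note that the sum telescopes, so the MTTF of a recorded sequence is simply the last failure time divided by the number of failures. A minimal sketch with invented failure times:

```python
# MTTF from recorded failure times t_1..t_n (time starts at 0).
# (t_1 + sum_i (t_{i+1} - t_i)) telescopes to t_n, so MTTF = t_n / n.
failure_times = [4.0, 9.5, 23.0, 31.0]     # invented example data, in hours

n = len(failure_times)
gaps = [failure_times[0]] + [failure_times[i + 1] - failure_times[i]
                             for i in range(n - 1)]
mttf = sum(gaps) / n
assert abs(mttf - failure_times[-1] / n) < 1e-12   # the telescoping identity
print(mttf)                                        # 7.75
```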

In summary, we take for analysis an unchanging, memoryless batch program, each of whose runs is instigated by a single input. The quality parameter corresponding to MTTF might now be called "mean runs to failure" (MRTF), and the instantaneous nature of the failure intensity is "per run." A great deal of complication has been eliminated, but the statistically questionable parameters remain.

3.3. Is Software Failure Intensity Meaningful?

If a statistical view of software failures is appropriate, failure intensity (or MRTF) can be measured for a program in an idealized experiment. Inputs are supplied, and the failure intensity is the long-term average of the ratio of failed runs to total runs. The experiment immediately raises the fundamental question of input distribution. If an exhaustive test can be performed, then it is possible to measure the failure intensity exactly. But whether or not failure intensity can be estimated with less than exhaustive testing depends on how the test inputs are selected. It is certainly possible to imagine selecting them to inadvertently emphasize incorrect executions, and thus to estimate failure intensity that is falsely high. The more dangerous possibility is that failures will be unfairly avoided, and the estimate will be too optimistic. When a release test exposes no failures, a failure-intensity estimate of zero is the only one possible. If subsequent field failures show the estimate to be wrong, it demonstrates precisely the anti-statistical point of view. A more subtle criticism questions whether MRTF is stable: is it possible to perform repeated experiments in which the measured values of MRTF obey the law of large numbers?

A partial response to problems in sampling inputs to estimate MRTF is to postulate an operational profile, a probability density function on the input space describing the likelihood that each input will be invoked when the software is actually used. If tests are drawn according to the operational profile, an MRTF can be estimated, can be evaluated for stability, and should apply to actual use. In practice there are myriad difficulties with the operational profile. Usage information may be expensive to obtain, or simply not available; different organizations (and different individuals within one organization) may have quite different profiles; and testing with the wrong profile always gives overly optimistic results (because when no failures are seen, it cannot be because failures have been overemphasized!).
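A minimal sketch of what sampling by an operational profile means in practice; the input classes, their probabilities, and the failure behavior below are all invented for illustration:

```python
import random

# Toy operational profile: input classes and the probability a user invokes each.
profile = {"small_update": 0.70, "bulk_load": 0.25, "malformed_input": 0.05}

def run_program(input_class):
    """Stand-in for one execution of the program; True means the run failed."""
    # Pretend the program fails only on some malformed inputs.
    return input_class == "malformed_input" and random.random() < 0.2

def estimate_failure_intensity(n_runs, profile):
    classes, weights = list(profile), list(profile.values())
    failures = sum(run_program(random.choices(classes, weights)[0])
                   for _ in range(n_runs))
    return failures / n_runs          # failures per run under this profile

theta_hat = estimate_failure_intensity(100_000, profile)
print("failure intensity:", theta_hat)                  # near 0.05 * 0.2 = 0.01
print("MRTF:", 1 / theta_hat if theta_hat else float("inf"))

# A "wrong" profile that never exercises malformed input estimates zero failures,
# i.e. an overly optimistic result, exactly the danger noted above.
print(estimate_failure_intensity(100_000, {"small_update": 0.8, "bulk_load": 0.2}))
```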

The concept of an operational profile does successfully explain changes observed over time in a program's (supposedly constant) failure intensity. It is common to experience a bathtub curve like Figure 3.1-1. When a program is new to its users, they subject it to unorthodox inputs, following what might be called the "novice" operational profile, and experience a certain failure intensity. But as they learn to use the program, and what inputs to avoid, they gradually shift to the "normal" user profile, where the failure intensity is lower, because this profile is closer to what the program's developer expected and tested. This transition corresponds to the "wear in" period in Figure 3.1-1. Then, as the users become "expert," they again subject the program to unusual inputs, trying to stretch its capabilities to solve unexpected problems. Again the failure intensity rises, corresponding to the "wear out" part of the figure.

Postulating an operational profile also allows us to derive Nelson's software-reliability theory [Thayer+], which is quantitative, but less successful than the qualitative explanation of the bathtub curve. Suppose that there is a meaningful constant failure intensity Θ (in failures/run) for a program, and hence an MRTF of 1/Θ runs. We wish to draw N random tests according to the operational profile, to establish an upper confidence bound α that Θ is below some level θ. These quantities are related by

(3.3.1)   $1 - \sum_{j=0}^{F} \binom{N}{j} \theta^j (1-\theta)^{N-j} \ge \alpha$

if the N tests uncover F failures. For the important special case F = 0, 1−α is plotted in Figure 4.2-1 below.

Equation 3.3.1 completely solves the fundamental testing problem, because it predicts software behavior based on testing, even in the practical release-testing case that no failures are observed. The only question is whether or not the theory's assumptions are valid for software. What is most striking about equation 3.3.1 is that it does not depend on any characteristics of the program being tested. Intuitively, we would expect the confidence bound in a given failure intensity to be lower for more complex software.
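For the F = 0 case, equation 3.3.1 reduces to α ≤ 1 − (1−θ)^N, which is easy to evaluate; a small sketch, only to show the orders of magnitude involved:

```python
import math

def confidence_no_failures(theta, n_tests):
    """Confidence (eq. 3.3.1 with F = 0) that the failure intensity is below
    theta, after n_tests failure-free random tests from the profile."""
    return 1.0 - (1.0 - theta) ** n_tests

def tests_needed(theta, alpha):
    """Failure-free tests required for confidence alpha that intensity < theta."""
    return math.ceil(math.log(1.0 - alpha) / math.log(1.0 - theta))

print(confidence_no_failures(1e-3, 3000))   # about 0.95
print(tests_needed(1e-4, 0.99))             # 46050 runs
print(tests_needed(1e-8, 0.99))             # roughly 4.6e8 runs: the ultra-reliability wall
```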

To the best of my knowledge, no experiments have ever been published to support or disprove the conventional theory. It is hard to believe that convincing direct experiments will ever be conducted. A careful case study would take years to collect the field data needed to establish the actual failure intensity of a real program, and would be subject to the criticism that the program is somehow not representative. However, the following could be tried:

Suppose that F_0 ≠ 0 failures of a program are observed in N_0 runs. Then F_0/N_0 is an estimate of its failure intensity. The experiment may be repeated with additional runs of the same program, to see if a stable estimate Θ̂ of the failure intensity is obtained from many trials. Then α in equation 3.3.1 should estimate the fraction of experiments in which the observed failure intensity exceeds Θ̂, for that is the meaning of the upper confidence bound.
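One way the bookkeeping for such an experiment might be organized is sketched below, with invented failure counts standing in for real field data; whether real programs calibrate this way is exactly the open question.

```python
from math import comb

def alpha_bound(failures, runs, theta):
    """Equation 3.3.1: confidence that the true failure intensity is below theta,
    given `failures` failures observed in `runs` runs."""
    tail = sum(comb(runs, j) * theta**j * (1.0 - theta) ** (runs - j)
               for j in range(failures + 1))
    return 1.0 - tail

# Invented stand-in for repeated experiments: (failures observed, runs) per trial.
experiments = [(3, 1000), (5, 1000), (2, 1000), (4, 1000), (6, 1000)]

estimates = [f / n for f, n in experiments]                  # per-trial intensities
theta_hat = sum(f for f, _ in experiments) / sum(n for _, n in experiments)

exceeding = sum(e > theta_hat for e in estimates) / len(estimates)
mean_alpha = sum(alpha_bound(f, n, theta_hat) for f, n in experiments) / len(experiments)

print("pooled estimate:", theta_hat)                  # 0.004
print("fraction of trials exceeding it:", exceeding)  # to be compared against...
print("mean confidence at theta_hat:", round(mean_alpha, 3))
```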

I do not believe the conventional theory would survive such experiments.

3.4. Thought Experiments with Partitions

The flaw in conventional reliability theory lies with the assumption that there is a sensible failure intensity defined through the input space. It is illuminating to consider subdividing the input domain, and applying the same conventional theory to its parts.

Suppose a partition of the input space creates k subdomains S_1, S_2, ..., S_k, and the probability of failure in subdomain S_i (the subdomain failure intensity) is constant at Θ_i. Imagine an operational profile D such that points selected according to D fall into subdomain S_i with probability p_i. Then the failure intensity Θ under D is

(3.4.1)   $\Theta = \sum_{i=1}^{k} p_i \Theta_i$.

However, for a different profile D′, different p_i′ may well lead to a different $\Theta' = \sum_{i=1}^{k} p_i' \Theta_i$. For all profiles, the failure intensity cannot exceed $\Theta_{\max} = \max_{1 \le i \le k} \Theta_i$, because at worst a profile can emphasize the worst subdomain to the exclusion of all others. By partition testing without failure, a bound can be established on Θ_max, and hence on the overall failure intensity for all distributions. (This analysis is a much-simplified approximation to an accurate calculation of the upper confidence bound for the partition case [Tsoukalas+].) Thus in one sense partition testing multiplies the reliability-testing problem by the number of subdomains. Instead of having to bound Θ using N tests from an operational profile, we must bound Θ_max using N tests from a uniform distribution over the worst subdomain; but, since we don't know which subdomain is worst, we must bound all k of the Θ_i, which requires kN tests. However, the payback is a profile-independent result. That is, a reliability estimate based on partition testing applies to all profiles.
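A small worked example of equation 3.4.1 and the Θ_max bound, with invented subdomain failure intensities and profiles:

```python
# Invented subdomain failure intensities Theta_i for k = 4 subdomains.
thetas = [0.0, 0.002, 0.0005, 0.01]

# Two invented operational profiles: probability of landing in each subdomain.
novice = [0.10, 0.20, 0.20, 0.50]
expert = [0.70, 0.10, 0.15, 0.05]

def failure_intensity(profile, thetas):
    """Equation 3.4.1: Theta = sum_i p_i * Theta_i under the given profile."""
    return sum(p * t for p, t in zip(profile, thetas))

print(failure_intensity(novice, thetas))   # 0.0055   (profile-dependent)
print(failure_intensity(expert, thetas))   # 0.000775 (profile-dependent)
print(max(thetas))                         # 0.01 bounds every profile
```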

The obvious flaw in the above argument is that the chosen partition is unconstrained. All that is required is that its subdomains each have a constant failure intensity. (This requirement is a generalization of the idea of "homogeneous" subdomains, ones in which either all inputs fail, or all inputs succeed.) But are there partitions with such subdomains? It seems intuitively clear that functional testing and path testing do not have subdomains with constant failure rates. (Again, experiments are lacking, but here they should be relatively easy to conduct.) Of course, the failure intensity of a singleton subdomain is either 1 or 0 depending on whether its point fails or succeeds, but these ultimate subdomains correspond to exhaustive testing, and are no help in a statistical theory.

3.5. Where Does Failure Intensity Belong?

Results in Section 2, and those in Section 4 below, depend on the existence of a statistically meaningful failure-intensity parameter. So a first problem to be attacked in a dependability theory is to find a better space for this parameter than the program input domain. I have argued [Hamlet92] that the appropriate sample space is the computation space. A failure occurs when a design flaw comes in contact with an unexpected system state, and such "error" states are reasonable to imagine as uniformly distributed over all possible computational states. (The much misused "error" is in fact IEEE standard terminology for an incorrect internal state.) Rough corroboration for this view comes from measurement of software "defects/line-of-code," which is routinely taken as a quality measure, and which does not vary widely over a wide range of programs.

4. Dependability Theory

"Dependability" must be defined as a probability,in a way similar to the definition of reliability. Ifa state-space-based failure parameter such as sug-gested in Section 3.5 can be defined, it would do.However, other possibilities exist.

4.1. Reliability-based Dependability

Attempts to use the Nelson TRW (input-domain) reliability to define dependability must find a way to handle the different reliability values that result from assuming different operational profiles, since dependability intuitively should not change with different users. It is the essence of dependability that the operating conditions cannot be controlled. Two ideas must be rejected:

(U) Define dependability as the Nelson reliability, but using a uniform distribution for the profile. This suggestion founders because some users with profiles that emphasize the failure regions of a program will experience lower reliability than the defined dependability. This is intuitively unacceptable.

(W) Define dependability as the Nelson reliability, but in the worst (functional) subdomain of each user's profile. This suggestion solves the difficulty with definition U, but reintroduces a dependency on a particular profile. In light of the dubious existence of constant failure intensities in subdomains (Section 3.4), the idea may not be well defined.

Other suggestions use reliability only incidentally, introducing essentially new ideas.

4.2. Testing for Probable Correctness

Dijkstra's famous aphorism that testing can establish only the incorrectness of software has never been very palatable to practical software developers, who believe in their hearts that extensive tests prove something about software quality. "Probable correctness" is a name for that elusive "something." However, the TRW reliability theory (Section 3.3) provides only half of what is needed. Statistical testing supports statements like "in 95% of usage scenarios the software should fail less than 1% of the time." These statements clearly involve software quality, but it is not very plausible to equate the upper confidence bound and the chance of success, and turn "99.9% confidence in failure intensity less than .1%" into "probable correctness of 99.9%" [Hamlet87].

Jeff Voas has proposed [Voas & Miller] that reliability be combined with testability analysis to do better. Testability is a lower-bound probability of failure if software contains faults, based on a model of the process by which faults become failures. In Voas's model, testability is estimated by executing a program and measuring the frequency with which each possible fault location is executed, the likelihood that a fault would corrupt the internal state there, and the likelihood that a corrupt state would not be later corrected. The combination of these factors identifies locations of low testability, places in the program where a fault could easily hide from testing. The program's testability is taken to be the minimum value over all its locations. A testability near 1 thus indicates a program that "wears its faults on its sleeve": if it can fail, it is very likely to fail under test.
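A rough sketch of the kind of per-location estimate this model calls for; the three probabilities would come from instrumented executions, the numbers here are invented, and multiplying the factors is just the simplest way to combine them (the published model is more careful):

```python
# Per-location testability in the spirit of Voas's model: a location hides a
# fault easily if it is rarely executed, rarely corrupts the state when it is
# executed, or rarely lets a corrupted state propagate to the output.
# All probabilities below are invented placeholders for measured values.
locations = {
    #  location        (P[executed], P[state corrupted], P[corruption propagates])
    "parse_header":    (0.95, 0.60, 0.80),
    "leap_year_check": (0.02, 0.50, 0.30),   # rarely executed: low testability
    "format_output":   (0.99, 0.70, 0.90),
}

def location_testability(p_exec, p_corrupt, p_propagate):
    """Chance that a fault at this location would show up as a test failure."""
    return p_exec * p_corrupt * p_propagate

scores = {name: location_testability(*p) for name, p in locations.items()}
program_testability = min(scores.values())   # the easiest place for a fault to hide

print(scores)
print("program testability:", program_testability)   # 0.003 here
```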

If testability estimates are made using an operational profile as the source of executions, and conventional random testing uses the same profile, a "squeeze play" is possible. Successful random testing demonstrates that failure is unlikely, but testability analysis shows that if there are any faults, failures would be seen. The only conclusion is that faults are unlikely to exist. The squeeze play can be made quantitative [Hamlet & Voas] as shown in Figure 4.2-1.

[Figure 4.2-1. 'Squeeze play': probability plotted against the chance of failure x, showing the falling curve Pr[failure more likely than x] from reliability testing and the step function Pr[not correct ⇒ failure less likely than x] rising at the observed testability h; d is the value of the falling curve at h.]

In Figure 4.2-1, the falling curve is the confidence from reliability testing from equation 3.3.1; the step function comes from observing a testability h. Together the curves make it unlikely that the chance of failure is large (testing), or that it is small (testability). The only other possibility is that the software is correct, for which 1−d is approximately the upper confidence bound, where d is the value of the falling curve at h in the figure. Confidence that the software is correct can be made close to 1 by forcing h to the right in Figure 4.2-1.
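With the F = 0 form of equation 3.3.1, the falling curve at x = h is d = (1−h)^N, so the squeeze-play confidence in correctness is roughly 1 − (1−h)^N; a minimal sketch with invented numbers:

```python
def squeeze_play_confidence(testability_h, n_failure_free_tests):
    """Approximate confidence in correctness from the squeeze play:
    1 - d, where d = (1 - h)^N is the falling reliability curve at x = h."""
    d = (1.0 - testability_h) ** n_failure_free_tests
    return 1.0 - d

# Invented example: observed testability h = 0.001 after 10,000 failure-free tests.
print(squeeze_play_confidence(0.001, 10_000))   # about 0.99995
# The same 10,000 tests alone bound the failure intensity only near 3e-4
# (at 95% confidence), far from the 1e-8 range needed for ultra-reliability.
```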

Manuel Blum has proposed [Blum & Kannan] a quite different idea as an adjunct to reliability. He argues that many users of software are interested in a particular execution of a particular program only: they want assurance that a single result can be trusted. Blum has found a way to sometimes exploit the low failure intensity of a "quality" program to gain this assurance. (Conventional reliability would presumably be used to estimate the program quality, but Blum has merely postulated that failure is unlikely.) Roughly, his idea is that a program can check its output by performing redundant computations. Even if these make use of the same algorithm, if the program is "close to correct," it is very unlikely that a sequence of checks could agree yet all be wrong.
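A toy illustration of the flavor of such self-checking (not Blum and Kannan's actual construction): a multiplication routine is re-run on randomly split arguments, so a wrong answer for the particular input would have to be matched by consistently wrong answers at random other points, which is unlikely if the routine fails only on a small fraction of inputs. Everything here (the routine, its fault, the check count) is invented for illustration.

```python
import random

def multiply(a, b):
    """Stand-in for a 'mostly correct' program: wrong on one specific input."""
    if (a, b) == (1234, 5678):          # an invented, rarely hit fault
        return a * b + 1
    return a * b

def checked_multiply(a, b, trials=20, bound=10**6):
    """Return multiply(a, b) only if random consistency checks agree.

    Uses the identity a*b == (a - r)*b + r*b with a fresh random r each trial,
    re-using the same (possibly faulty) routine for the redundant computations.
    Disagreement means the particular result cannot be trusted."""
    result = multiply(a, b)
    for _ in range(trials):
        r = random.randrange(bound)
        if result != multiply(a - r, b) + multiply(r, b):
            raise ValueError("self-check failed: result not trustworthy")
    return result

print(checked_multiply(20, 30))         # 600: checks agree, result accepted
try:
    checked_multiply(1234, 5678)        # the faulty input: checks disagree
except ValueError as err:
    print(err)                          # the program announces its own untrustworthiness
```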

4.3. Defining Dependability

Either Voas's or Blum's idea could serve as a definition for dependability, since both capture a probabilistic confidence in the correctness of a program, a confidence based on sampling.


Dependability might be defined as the confidence in correctness given by Voas's squeeze play. Even if conventional reliability is used for the testing part of the squeeze play, the dependability so defined depends in an intuitively correct way on program size and complexity, because even Voas's simple model of testability introduces these factors. His model also introduces an implicit dependence on the size of both input and internal state spaces, but this part of the model has not yet been explored.

Dependability might also be defined for a Blum self-checking program as the complement of the probability that the checks agree but their common value is wrong. This dependability may be different for different inputs, and must be taken to be zero when the checks do not agree. Thus a definition based on Blum's idea must allow software to announce its own untrustworthiness (for some inputs).

The promise of both Voas's and Blum's ideas is that they extend reliability to dependability and at the same time substantially reduce the testing cost. Instead of requiring "ultra-reliability" (roughly below 10⁻⁸ failures/run) that cannot be estimated in practice [Butler & Finelli], their ideas add a modest cost to reliability estimates of about 10⁻⁴ failures/run, estimates that can be made today. Blum's idea accrues the extra cost at runtime, for each result computed; Voas's idea is more like conventional testing in that it samples the whole input space, before release.

5. Conclusions

It has been argued that a foundational "dependability" theory of software testing must be statistical in nature. The heart of such a theory is program reliability derived from a constant failure intensity, but defining failure intensity over the input space is inappropriate. A more plausible reliability theory is needed, and the nearly constant defects/line-of-code data suggests that the failure intensity should be defined on the program state space. Dependability can then be defined using new ideas such as Voas's squeeze play, or Blum's pointwise correctness probability.

Although we do not yet have a plausible dependability theory, it is possible to imagine what the theory can establish. My vision is something like the following:

Dependable software will be developed using more front-end loaded methods than are now common. Testing to find failures will play a minor role in the development process. The resulting software will be tested for dependability by a combination of conventional random testing, and new methods that probe the state space. Random testing will establish reliability better than about 10⁻⁴ failures/run. The new methods will provide (quantitative!) high confidence in correctness, so that in use the software will fail less often than once in 10⁷ to 10⁹ runs.

The cost of testing in my vision may not be less than at present, but neither will such testing be intractable. The significance of passing the tests, however, will be far greater than at present. Instead of using release testing to find failures, as we do now, dependability tests that succeed will quantitatively predict a low probability that software will fail in use. Software that is badly developed will not pass these tests, and responsible developers will not release it.

References

[Blum & Kannan]
M. Blum and S. Kannan, Designing programs that check their work, Proc. 21st ACM Symposium on Theory of Computing, 1989, 86-96.

[Butler & Finelli]
R. Butler and G. Finelli, The infeasibility of experimental quantification of life-critical software reliability, Proc. Software for Critical Systems, New Orleans, LA, December, 1991, 66-76.

[Cobb & Mills]
R. H. Cobb and H. D. Mills, Engineering software under statistical quality control, IEEE Software, November, 1990, 44-54.

[DeMillo78]
R. DeMillo, R. Lipton, and F. Sayward, Hints on test data selection: help for the practicing programmer, Computer 11 (April, 1978), 34-43.

[Duran & Wiorkowski]
J. W. Duran and J. J. Wiorkowski, Quantifying software validity by sampling, IEEE Trans. Reliability R-29 (1980), 141-144.

[Duran & Ntafos]
J. Duran and S. Ntafos, An evaluation of random testing, IEEE Trans. Software Eng. SE-10 (July, 1984), 438-444.

[Foster]
K. A. Foster, Error sensitive test cases analysis (ESTCA), IEEE Trans. Software Eng. SE-6 (May, 1980), 258-264.

[Frankl & Weyuker]
P. G. Frankl and E. J. Weyuker, A formal analysis of the fault-detecting ability of testing methods, IEEE Trans. Software Eng. SE-19 (March, 1993), 202-213.

[Hamlet77]
R. Hamlet, Testing programs with the aid of a compiler, IEEE Trans. Software Eng. SE-3 (July, 1977), 279-290.

[Hamlet87]
R. Hamlet, Probable correctness theory, Inf. Proc. Let. 25 (April, 1987), 17-25.

[Hamlet89]
R. Hamlet, Theoretical comparison of testing methods, Proc. Symposium on Software Testing, Analysis, and Verification (TAV3), Key West, December, 1989, 28-37.

[Hamlet92]
D. Hamlet, Are we testing for true reliability?, IEEE Software, July, 1992, 21-27.

[Hamlet94]
D. Hamlet, Random testing, in Encyclopedia of Software Engineering, J. Marciniak, ed., Wiley, 1994, 970-978.

[Hamlet & Taylor]
D. Hamlet and R. Taylor, Partition testing does not inspire confidence, IEEE Trans. Software Eng. SE-16 (December, 1990), 1402-1411.

[Hamlet & Voas]
D. Hamlet and J. Voas, Faults on its sleeve: amplifying software reliability testing, ISSTA '93, Boston, June, 1993, 89-98.

[Jeng & Weyuker]
B. Jeng and E. Weyuker, Analyzing partition testing strategies, IEEE Trans. Software Eng. SE-17 (July, 1991), 703-711.

[Kolkhorst]
B. G. Kolkhorst, personal communication, January, 1993.

[Lauterbach & Randall]
L. Lauterbach and W. Randall, Experimental evaluation of six test techniques, Proc. COMPASS '89, June, 1989, Gaithersburg, MD, 36-41.

[Morell & Hamlet]
L. J. Morell and R. G. Hamlet, Error propagation and elimination in computer programs, University of Maryland Computer Science Technical Report TR-1065, July, 1981.

[Musa+]
J. D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.

[Myers]
G. Myers, The Art of Software Testing, Wiley, New York, 1979.

[Rapps & Weyuker]
S. Rapps and E. J. Weyuker, Selecting software test data using data flow information, IEEE Trans. Software Eng. SE-11 (April, 1985), 367-375.

[Thayer+]
R. Thayer, M. Lipow, and E. Nelson, Software Reliability, North-Holland, 1978.

[Tsoukalas+]
M. Z. Tsoukalas, J. W. Duran, and S. C. Ntafos, On some reliability estimation problems in random and partition testing, Proc. Second International Symposium on Software Reliability Engineering, Austin, TX, May, 1991.

[Voas & Miller]
J. M. Voas and K. W. Miller, Improving the software development process using testability research, Proc. Third International Symposium on Software Reliability Engineering, Research Triangle Park, NC, October, 1992, 114-121.

[Weyuker+]
E. J. Weyuker, S. N. Weiss, and D. Hamlet, Comparison of program testing strategies, Proc. Symposium on Software Testing, Analysis, and Verification (TAV4), Victoria, October, 1991, 1-10.