Transcript of Kapoor 03 Experimental

Page 1: Kapoor 03 Experimental

Experimental Evaluation of the Tolerance for Control-Flow Test Criteria

Kalpesh Kapoor and Jonathan Bowen
Centre for Applied Formal Methods
South Bank University, London, UK
{kapoork, bowenjp}@sbu.ac.uk

http://www.cafm.sbu.ac.uk

Abstract

For a given test criterion, the number of test-sets satisfying the criterion may be very large, with varying fault detection effectiveness. In recent work [29], the measure of variation in the effectiveness of a test criterion was defined as ‘tolerance’. This paper presents an experimental evaluation of tolerance for control-flow test criteria. The experimental analysis is done by exhaustive test-set generation, wherever possible, for a given criterion, which improves on earlier empirical studies that analysed only some test-sets chosen using random selection techniques. Four industrially used control-flow testing criteria, Condition Coverage (CC), Decision Condition Coverage (DCC), Full Predicate Coverage (FPC) and Modified Condition/Decision Coverage (MC/DC), have been analysed against four types of faults. A new test criterion, Reinforced Condition/Decision Coverage (RC/DC) [28], is also analysed and compared. The Boolean specifications considered were taken from a past research paper and also generated randomly. To ensure that it is the test-set property that influences the effectiveness and not the test-set size, the average test-set size was kept the same for all the test criteria except RC/DC. A further analysis of the variation in average effectiveness with respect to the number of conditions in the decision was also undertaken. The empirical results show that the MC/DC criterion is more reliable and stable compared to the other considered criteria. Though the number of test-cases in RC/DC test-sets is large, no significant improvement in effectiveness and tolerance was observed over MC/DC.

Keywords: Control-flow testing criteria, effectiveness, tolerance, empirical evaluation, CC, DCC, FPC, MC/DC, RC/DC, Boolean specification, fault-based testing.

1 Introduction

Acknowledgement: This research has benefited from participation in and discussions with colleagues on the UK EPSRC FORTEST Network (GR/R43150/01) on formal methods and testing (http://www.fortest.org.uk). Financial support for travel is gratefully acknowledged.

A test data adequacy criterion is a set of rules that prescribe some property for test-sets. More formally, a test criterion is a function C : Programs × Specifications → 2^Test-Sets used for determining whether the program P has been thoroughly tested by a test-set T, relative to the specification S [38]. If a test-set T ∈ C(P, S), then we say T is adequate for testing P with respect to S according to C, or that T is C-adequate for P and S. The two broad categories of test criteria are data-flow and control-flow test criteria. Data-flow criteria prescribe rules for test-sets that are based on data definitions and their usage in the program. As the objective of this paper is to explore control-flow testing criteria, we shall expand only on this aspect.

Control-flow test criteria [17, 23, 36] are meant to challenge the decisions made by the program with test-cases based on the structure and logic of the design and source code. Control-flow testing ensures that program statements and decisions are fully exercised by code execution. The well-studied control flow graph, which is widely used for static analysis of programs, is also used for defining and analysing control-flow test criteria. A flow graph is a directed graph in which every node represents a linear sequence of computations. Each edge in the graph represents a transfer of control and is associated with a Boolean decision that represents the condition of control transfer from one node to another. The control-flow test criteria check these Boolean decisions of the program based on the specification.

The effectiveness of a test criterion C [3–7, 12, 31, 37] is the probability that a test-set selected randomly from the set of all test-sets satisfying the criterion will expose an error. As there is an exponentially large number of test-sets satisfying a test criterion, experimental analysis of fault detection effectiveness is usually based on random selection of different test-sets satisfying the given criterion. Such an analysis is useful for finding the average effectiveness but may be misleading due to an uneven distribution of the fault detection capability of the test-sets.

For instance, consider the set of all the test-sets, T, that satisfy a criterion, C, for a given program and specification. It may be the case that some of the test-sets in T expose an error, while others do not. If a large percentage of the C-adequate test-sets expose an error, then C is an effective criterion for this program [4]. However, it may happen that there is a large variation in the fault detection effectiveness of different test-sets, with equal probabilities for high and low fault detection. In such a scenario, empirical studies based on random selection of test-sets may give incorrect results. Therefore it is also important to study the variation in effectiveness of the test-sets of a test criterion.

1.1 An Example

Consider, for example, the following decision D:

(¬A ∧ B ∧ C) ∨ (A ∧ ¬B ∧ C) ∨ (A ∧ B ∧ ¬C)

where A, B and C are atomic Boolean conditions. Let us assume that the implementation of the above decision includes a mistake, where one of the occurrences of an atomic condition is replaced by its negation; i.e. X is substituted by ¬X or vice versa. There can be nine such faulty decisions, as the conditions appear nine times in the decision D. The well-known condition coverage (CC) criterion requires that the test-set must exercise both true and false evaluations of every atomic condition. Another criterion, Modified Condition/Decision Coverage (MC/DC), which is a mandatory requirement for testing avionics software under DO-178B [24], requires that the test-set must exercise both the true and false branches and also check the independent effect of each atomic condition on the final outcome of the Boolean decision. The following figure shows the effectiveness of all possible MC/DC and CC test-sets.

It is evident from Figure 1 that fault detection effectiveness for a randomly selected CC test-set for the decision D could vary from 30 to 100 percent, as all the test-sets with effectiveness in this range have approximately the same probability of being selected (except those with 70–80% effectiveness). However, the variation in effectiveness of MC/DC is low, lying in the smaller range of 70–100%. Therefore, although the average effectiveness of the CC criterion is high (≈ 70%), the criterion may be unsuitable due to the high variation in effectiveness for the decision D.

[Figure 1: Distribution of effectiveness for example decision D. x-axis: % of faults (10–100); y-axis: % of test-sets (0–100); series: CC and MCDC.]
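The scenario above can be reproduced in a few lines of code. The sketch below (illustrative Python, not the authors' Haskell/Java tool; the encoding of D as lists of literal occurrences is our own) enumerates the nine condition-negation mutants of D and measures the fraction a given test-set detects:

```python
from itertools import product

# D = (!A & B & C) | (A & !B & C) | (A & B & !C), written as a sum of
# products; each term is a list of (variable index, negated?) occurrences.
TERMS = [[(0, True), (1, False), (2, False)],
         [(0, False), (1, True), (2, False)],
         [(0, False), (1, False), (2, True)]]

def evaluate(terms, tc):
    """Evaluate the sum-of-products decision on test-case tc = (A, B, C)."""
    return any(all(tc[v] != neg for (v, neg) in term) for term in terms)

def mutants(terms):
    """Yield the nine decisions obtained by negating one occurrence."""
    for ti, term in enumerate(terms):
        for li, (v, neg) in enumerate(term):
            m = [list(t) for t in terms]
            m[ti][li] = (v, not neg)
            yield m

def effectiveness(test_set):
    """Fraction of the nine faulty decisions exposed by the test-set."""
    killed = sum(any(evaluate(TERMS, tc) != evaluate(m, tc) for tc in test_set)
                 for m in mutants(TERMS))
    return killed / 9
```

Running effectiveness over every CC-adequate and MC/DC-adequate test-set would yield a distribution like that of Figure 1; the exhaustive test-set of all eight inputs detects all nine faults.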

1.2 Research Objectives

The research issue addressed by this empirical study is how good or bad a randomly selected test-set satisfying a given test criterion could be at exposing errors. This question is difficult to answer because, for a given program and adequacy criterion, there is typically a very large number of adequate test-sets [4]. If the program is incorrect, then usually some of these test-sets expose an error while others do not.

Previous empirical studies [20, 21, 30, 33] of this nature had the following limitations: (a) they used a relatively small number of Boolean decisions; (b) test-sets were selected at random from the large sample space.

We attempt to solve the first problem by generating a large number of Boolean decisions with a varying number of conditions (3–12). The second problem has been addressed by choosing all the test-sets instead of selecting only some (wherever possible). Of course, this requires the use of abstraction. The abstractions made are justified for the following reasons: (a) the scope of the study is limited to control-flow testing criteria, which are intrinsically designed to test the decisions made by the program at the control points; (b) if there is more than one fault, either at the same control point or at another control point, there is still a very high likelihood of finding all the faults. This is because research into the ‘fault coupling effect’, as demonstrated in [18], shows that complex faults are coupled to simple faults in such a way that a test data set that detects all simple faults in a program will detect a high percentage of the complex faults.

We considered Boolean specifications as the subject for our experiments. Faults were introduced one at a time to generate a faulty specification representing the implementation.

The main objectives of the study can be summarized as follows:

• To study and compare industrially used control-flow test criteria: Condition Coverage (CC), Decision Condition Coverage (DCC), Full Predicate Coverage (FPC) and Modified Condition/Decision Coverage (MC/DC). For example, a major aviation industry standard has a mandatory requirement to use MC/DC test-sets [24].

• To evaluate a new testing criterion, Reinforced Condition/Decision Coverage (RC/DC), proposed recently as a potential improvement on the MC/DC criterion [28].

• To study the variation in effectiveness by exhaustively generating (wherever possible) all the test-sets satisfying a criterion. Earlier work concentrated on choosing some test-sets according to a given criterion, which introduces a certain degree of inconsistency into the various results [10].

• To establish a connection between fault detection effectiveness and control-flow testing criteria. Since the considered criteria are widely used in industry and there are tools [26] which automatically generate the test data for a given test criterion, it can be guaranteed that certain types of faults are not present in the implementation.

The rest of the paper is organized as follows. The next section introduces the definitions and preliminary results used in other parts of the paper. Section 3 gives details of related work. The experimental design and the analysis of results are presented in sections 4 and 5 respectively. Section 6 presents the conclusions of the study.

2 Theoretical Background

Let P be a program and S be the specification that P is intended to satisfy. A program P is said to be correct on input x if P(x) = S(x), where P(x) denotes the value computed by P on input x, and S(x) denotes the intended output for x. Otherwise, P is said to fail, with x as the failure-causing input. A failure is defined to be any deviation between the actual output and the specified output for a given input. A fault in a program is a cause which results in a failure.

A test-case is an input. A test-set is a set of test-cases. If a test-case is a failure-causing input for a program, then the program is said to fail on that test-case, and such a test-case is said to expose a fault. A test-set, T, is said to detect a fault if it includes at least one failure-causing test-case.

A test criterion can be viewed either as a generator or as an acceptor [38]; the two views are mathematically equivalent. The acceptor view can be thought of as a function C : P × S × T → {T, F}, where T and F denote logical ‘true’ and ‘false’ respectively. When a test criterion C is used as an acceptor, test-cases are added to a test-set until it satisfies C. In other words, the acceptor view uses the criterion as a stopping rule, checking whether the goal has been achieved after each test-case is added to the test-set.

A criterion used as a generator can be considered as a function C : P × S → 2^T. Here, given a program and specification, the test criterion is used to generate a set of test-sets that satisfy the criterion. In other words, the criterion is used as a rule for selecting the test-cases that make up a test-set.

The control-flow testing criteria are based on the statements and Boolean decision points in the program. Boolean decisions consist of conditions connected by logical operators. Conditions are atomic predicates, which may be either Boolean variables or simple relational expressions. The following notation is used throughout the paper: ∧, ∨ and ¬ represent the ‘and’, ‘or’ and ‘not’ Boolean operations respectively.

A Boolean formula can be represented in various formats, such as Conjunctive Normal Form (CNF, also known as product-of-sums) and Disjunctive Normal Form (DNF, sum-of-products). As these normal forms may not be unique, there are also canonical representations which are unique for a given Boolean formula. One such canonical form is prime-DNF (PDNF), which is unique up to commutativity.

Consider a given Boolean decision D and an elementary conjunction K (i.e., K ≡ X1 ∧ ... ∧ Xk, where each Xi is either a condition or its negation). K is an implicant of D if K implies D. An implicant K is said to be a prime implicant of the decision D if there is no other implicant K1 of D such that K ∨ K1 = K1. The PDNF form of a given Boolean decision D is the sum of all prime implicants of D.

A condition in a Boolean decision D is said to be irredundant if the condition appears in the PDNF representation of D; otherwise it is said to be redundant. For example, B is redundant in the decision D ≡ A ∨ (A ∧ B), since the PDNF representation of D is A.
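Because a condition appears in the PDNF exactly when the decision's value depends on it, redundancy can be checked directly on the truth table rather than by computing prime implicants. A minimal sketch (illustrative Python; the helper names are our own):

```python
from itertools import product

def truth_table(f, n):
    """Truth table of an n-condition Boolean function as a tuple."""
    return tuple(f(*tc) for tc in product([False, True], repeat=n))

def is_redundant(f, n, i):
    """Condition i is redundant iff flipping it alone never changes f."""
    return all(f(*tc) == f(*(tc[:i] + (not tc[i],) + tc[i + 1:]))
               for tc in product([False, True], repeat=n))

# The example from the text: D = A | (A & B); its PDNF is just A.
D = lambda a, b: a or (a and b)
```

Here `is_redundant(D, 2, 1)` holds (B is redundant), while `is_redundant(D, 2, 0)` does not.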

2.1 Control-Flow Test Criteria

In this section, we review some existing control-flow test criteria [17, 23]. The basic control-flow testing criterion is Statement Coverage (SC), which requires that test-data must execute every statement of the program at least once. This is evidently very weak, as it ignores the branching structure of the design and implementation. The Decision Coverage (DC) criterion requires that, in addition to SC, every possible branch must be executed at least once. In other words, the Boolean decisions that appear in the program must take both T and F values.

The DC criterion treats a decision as a single node in the program structure, regardless of the complexity of the Boolean expression and the dependencies between the conditions constituting the decision. For critical real-time applications, where over half of the executable statements may involve Boolean expressions, the complexity of the expression is of concern. Therefore, other criteria have been developed to address the concerns arising from the complexity of Boolean decisions.

The Condition Coverage (CC) criterion includes SC and requires that every condition in each decision must take both T and F values at least once. Although CC considers the individual conditions that make up a decision, it does not prescribe any rule that checks the outcome of the decision.

The following two criteria improve on DC and CC by taking into account the values of both the decision and the conditions that occur in the decision.

• Decision/Condition Coverage (DCC): Every statement in the program has been executed at least once, and every decision and condition in the program has taken all possible outcomes at least once. DCC essentially requires that test-data satisfy SC, DC and CC.

• Full Predicate Coverage (FPC): Every statement in the program has been executed at least once, and each condition in a decision has taken all possible outcomes where the value of the decision is directly correlated with the value of the condition [19]. This means that the test-set must include test-cases such that the value of the decision differs when the condition changes.

It can be observed that while both DCC and FPC combine the SC, DC and CC test criteria, FPC also checks for the influence of conditions on the outcome of the decision. However, when more than one condition is present in a decision, certain combinations of the other conditions can mask a condition if it is not checked independently. The Modified Condition/Decision Coverage (MC/DC) criterion [1, 2, 13, 14, 34], which is a mandatory requirement for testing avionics software [24], addresses this drawback and is defined as follows:

Every point of entry and exit in the program has been invoked at least once, every condition in a decision in the program has taken on all possible outcomes at least once, and each condition has been shown to independently affect the decision’s outcome.

A condition is shown to independently affect a decision’s outcome by varying just that condition while holding all other possible conditions fixed. The criterion requires at least one pair of test-cases for every condition.
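Under this ‘unique-cause’ reading, the pairs required for a condition can be enumerated by toggling just that condition and keeping only the pairs whose decision outcomes differ. An illustrative Python sketch (our own helper, not the paper's tool):

```python
from itertools import product

def mcdc_pairs(f, n, i):
    """MC/DC pairs for condition i: test-cases differing only in
    condition i for which the decision outcome also differs."""
    pairs = []
    for tc in product([False, True], repeat=n):
        if not tc[i]:  # count each pair once, from its tc[i] = False side
            other = tc[:i] + (True,) + tc[i + 1:]
            if f(*tc) != f(*other):
                pairs.append((tc, other))
    return pairs

# The decision D from section 1.1: each condition has three such pairs.
D = lambda a, b, c: ((not a and b and c) or (a and not b and c)
                     or (a and b and not c))
```

A test-set satisfies the independence requirement for condition i if it contains both members of at least one such pair.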

The MC/DC criterion takes into account only those situations where an independent change in a condition causes a change in the value of the decision. The shortcoming of this approach is that it does not check the situations where a change in a condition should keep the value of the decision unchanged. To improve on this, a new criterion called Reinforced Condition/Decision Coverage (RC/DC) has been proposed in [28]. In addition to MC/DC, RC/DC requires that each condition be shown to independently keep the value of the decision the same, for decisions at both T and F.

One of the goals in defining the above criteria is to minimize the number of test-cases in a test-set while keeping effectiveness as high as possible. In order to completely verify a decision with n conditions, 2^n test-cases are required. Therefore, it is computationally expensive or infeasible to test all possible combinations as the number of conditions and decisions in the program grows. In this sense MC/DC and RC/DC provide a practical solution that only requires a linear number of test-cases in terms of n. In Table 1, we give a summary of the minimum and maximum number of test-cases in a test-set that can satisfy a given criterion. The table also shows whether the criterion takes into account the decision and condition outcomes.

Criterion   Min. test-cases   Max. test-cases   Decision outcome   Condition outcome
DC          2                 2                 √                  ×
CC          2                 n + 1             ×                  √
DCC         2                 n + 2             √                  √†
FPC         2                 2n                √                  √‡
MC/DC       n + 1             2n                √                  √*
RC/DC       n + 1             6n                √                  √*

Table 1: Summary of test criteria († uncorrelated, ‡ correlated, * correlated and independent)

As can be observed in Table 1, the average test-set size of CC, DCC and FPC differs from that of MC/DC and RC/DC. To ensure that it is the test-set property and not the test-set size that influenced the effectiveness, extra test-cases were added to the CC, DCC and FPC test-sets to keep the average test-set size the same as that of MC/DC. The extra test-cases were generated at random.

The criteria described above form a hierarchy, as the criteria defined later include some of those defined before them. For example, any test-set that satisfies MC/DC coverage will also satisfy DC coverage. Formally, a criterion C1 subsumes C2 if, for every program P and specification S, every test-set that satisfies C1 also satisfies C2. It might seem at first glance that if C1 subsumes C2, then C1 is guaranteed to be more effective than C2 for every program. However, this is not true, as it may happen that for some program P, specification S, and test generation strategy G, test-sets that only satisfy C2 expose more errors than those that satisfy C1. This issue has been explored formally by Frankl and Weyuker [5].

2.2 Types of Faults

One way to study fault detection effectiveness is to define a set of faults that occur in real-life situations and use them to generate implementations from the specification. In the past, various types of faults have been defined and used to study the effectiveness of test criteria; see for example [16, 22, 25, 30, 33]. An analysis of various fault classes and the hierarchy among them is presented in [15, 27].

We have considered the following four faults in our empirical study:

• Associative Shift Fault (ASF): A Boolean expression is replaced by one in which the association between conditions is incorrect; for example, X ∧ (Y ∨ Z) could be replaced by X ∧ Y ∨ Z. As the ‘∧’ operator has higher precedence than the ‘∨’ operator, the second expression is equivalent to (X ∧ Y) ∨ Z.

• Expression Negation Fault (ENF): A Boolean expression is replaced by its negation; for example, X ∧ (Y ∨ Z) by X ∧ ¬(Y ∨ Z).

• Operator Reference Fault (ORF): A Boolean operator is replaced by another; for example, X ∧ Y by X ∨ Y.

• Variable Negation Fault (VNF): A Boolean condition X is replaced by its negation ¬X. VNF considers only atomic conditions and is thus different from ENF, which negates logical sub-expressions rather than single conditions.

During the analysis, only one fault is introduced at a time. In a real-life scenario, more than one fault can be present at the same or other control points. However, research into the ‘fault coupling effect’ by Offutt [18] demonstrated that complex faults are coupled to simple faults in such a way that if a test data set detects all simple faults in a program, it will also detect a high percentage of the complex faults.
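As an illustration, a fault operator such as ORF can be written as a traversal of an expression tree; the tuple encoding below is our own assumption, not the paper's internal representation:

```python
# Expressions: ('and', l, r), ('or', l, r), ('not', e), or a variable name.
def orf_mutants(expr):
    """Operator Reference Fault: yield copies of expr in which exactly
    one binary operator has been swapped ('and' <-> 'or')."""
    if isinstance(expr, str):
        return
    op = expr[0]
    if op in ('and', 'or'):
        # swap the operator at this node
        yield (('or' if op == 'and' else 'and'),) + expr[1:]
        # or recurse to swap an operator in one subtree
        for i in (1, 2):
            for m in orf_mutants(expr[i]):
                yield expr[:i] + (m,) + expr[i + 1:]
    elif op == 'not':
        for m in orf_mutants(expr[1]):
            yield ('not', m)
```

Applied to D ≡ A ∨ (¬B ∧ C), the operator yields exactly two faulty decisions, A ∧ (¬B ∧ C) and A ∨ (¬B ∨ C), matching the ORF example given in section 4.1.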

3 Related Work

In [8], Goodenough and Gerhart introduced the notions of reliability and validity of a testing method. Essentially, a method M is said to be reliable if all the test-sets generated using M have the same fault detection effectiveness. The obvious advantage of using a reliable method is that it is enough to select only one test-set using any algorithm. Unfortunately, no practical criterion satisfies this property. A similar concept of reliable test-sets was also proposed by Howden [11]. In the current testing literature, ‘reliability’ has a standard engineering meaning in terms of failure probability over an interval of time [9].

An approach known as fault-based testing [16, 22] attempts to alleviate the problem of test-set reliability by hypothesizing faults that can occur in the implementation. The reliability of a technique is measured in terms of these hypothesized faults. Thus, fault-based testing provides a practical means of evaluating a test criterion.

The empirical evaluation of testing techniques for Boolean specifications has been studied in [20, 21, 30, 33]. Harman et al. [10] have proposed a model for conducting empirical studies in software testing. The model suggests standardizing the process to avoid conflicts between the results of different empirical studies. Experimental comparisons of the effectiveness of data-flow and control-flow testing criteria are considered in [3, 4, 12]. The subjects of these experiments were programs.

In [3], the space of test-sets was sampled and appropriate statistical techniques were used to interpret the results. The selection of test-sets was done by first creating a universe of possible test-cases; test-sets of a given size were then randomly selected from that universe. The distribution of effectiveness with respect to coverage is included.

Various types of formal relationships between testing criteria have also been studied, as in [37]. For example, the ‘subsumes’ relationship is based on the properties of test criteria. However, subsumption of one criterion by another does not mean that it is more effective in detecting faults. Frankl and Weyuker, in [5], proved that the subsumes relation between software test adequacy criteria does not guarantee better fault-detecting ability in the prior testing scenario.

Kuhn [15] has explored the relationship between various fault types. It is shown that ENFs are the weakest faults, in the sense that any test technique that catches stronger faults is very likely to find ENFs. The results have been further improved by Tsuchiya and Kikuno [27].

Effectiveness measures, as defined in [5], are given in terms of a specification and matching program. As has been pointed out in [32], it is difficult to say what a typical or representative program and specification is. A similar difficulty arises when classifying faults into different categories based on severity. Therefore it is not possible to use this measure of effectiveness in general. To overcome the difficulty of assessing test criteria based on the severity of faults, a framework is defined in [7].

Another relevant research question is whether minimization of the test-set size has any impact on fault detection effectiveness [35]. It was found that minimization of test-sets had minimal effect. The aim of this paper automatically includes this topic, as we consider all possible test-sets; if minimization reduces effectiveness, that should be reflected in the overall result. Further, under random selection of test-sets satisfying a criterion, choosing a test criterion with low variation in effectiveness gives more assurance of fault detection.

4 Experimental Design

The four main control-flow test criteria that we studied were CC, DCC, FPC and MC/DC. As the average test-set size of CC, DCC and FPC is less than that of MC/DC test-sets (see Table 1), extra test-cases were added to these test-sets to ensure that the average test-set size was equal. It is invalid to compare a test criterion C1 that yields large test-sets with another criterion C2 that selects small test-sets, as it is impossible to tell whether any reported benefit of C1 is due to its inherent properties or to the fact that the test-sets considered are larger and hence more likely to detect faults. Another newly proposed test criterion, RC/DC [28], was also analysed and compared with the above-mentioned criteria. The number of test-cases in an RC/DC test-set can vary from (n + 1) to 6n, where n is the number of conditions in the Boolean decision.

Because the test criteria of our study are based on decisions that appear in a program, we restricted our empirical evaluation to Boolean specifications and their implementations. At first sight, this may appear to be an over-simplification; however, it can be noted that for a given Boolean specification with n conditions, only one of the 2^(2^n) possible realizations is correct. Further complexity is added by the exponentially large number of test-sets satisfying a given criterion for a particular specification. Let us assume that the size of a test-set is 2n; then the possible candidate space of test-sets has size C(2^n, 2n). On the other hand, working with a Boolean specification allows the generation of a large number of random decisions with varying sizes and properties. Though the number of test-sets is large, it is still possible to generate (possibly) all test-sets for typical decisions that usually occur in real programs. For instance, Chilenski and Miller, in [1], investigated the complexity of expressions in different software domains. It was found that complex Boolean expressions are more common in the avionics domain, with the number of conditions in a Boolean decision as high as 30.

Our experimental analysis was done using software that was specifically written for this purpose. The prototype version of the software was written in the functional programming language Haskell. As we had to deal with an exponentially large number of test-sets, the prototype Haskell version failed to handle complex decisions. Therefore, the next version of the software was rewritten in Java. The software allows the analysis of a given or a randomly generated Boolean decision. The experimental steps involved in the empirical analysis of the above-mentioned test criteria were as follows:

1. Select subject Boolean decisions.

2. Generate all faulty decisions.

3. Generate all test-sets for a criterion.

4. Compute the effectiveness of each test-set.

5. Compute the variation in effectiveness of all test-sets.

Steps 1 and 2 are elaborated in the following section. Section 4.2 presents step 3, and section 4.3 gives the evaluation metrics used in steps 4 and 5.

4.1 Subject Boolean Specification and Implementation

The subject Boolean decisions were taken from a research paper [33] and were also generated using a random generator. The Boolean specifications used in [33] were taken from an aircraft collision avoidance system. We also analysed 500 decisions that were generated randomly. The number of conditions in the generated decisions varied from three to twelve.

The number of conditions is required as an input to the random Boolean decision generator. The seed of the random generator was fixed before starting the experiments. This was done to ensure that the experiments could be repeated, restarted and re-analysed by a third party if desired [10].

The Boolean decisions generated are guaranteed to have only irredundant conditions. A condition can appear more than once in a decision. The generation of decisions with irredundant conditions guarantees that a faulty implementation will not depend on any condition on which the original decision did not depend. Consider, for example, the Boolean decision D ≡ A ∧ (A ∨ B), in which only A is irredundant. An associative shift fault in D would change the decision to D_ASF ≡ A ∧ A ∨ B. But D_ASF is equivalent to A ∨ B, in which both A and B are irredundant.

We studied the fault detection effectiveness of test criteria for Boolean specifications with respect to four types of faults: ASF, ENF, ORF and VNF. The faulty Boolean decisions (implementations) were generated for each type of fault by introducing one fault at a time. For example, consider the decision D ≡ A ∨ ¬B ∧ C. Given D as input, the ORF fault operator would generate two Boolean decisions: A ∧ ¬B ∧ C and A ∨ ¬B ∨ C. Only one fault is present in each implementation. As mentioned before, if more than one fault is present at the same or another control point in a real-life scenario, there is still a high possibility of finding it [18].

4.2 Generation of All Possible Test-Sets

For a given Boolean decision with n conditions, there are 2^n entries in the matching truth table. As the objective of the experiments was to do an exhaustive analysis of all possible test-sets, all entries in the truth table were generated. Let the sets of test-cases for which a decision evaluates to T and F be S_T and S_F, respectively.
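Concretely, this partition of the truth table can be computed by brute force; a minimal sketch (the function name is ours):

```python
from itertools import product

def partition_truth_table(decision, n):
    """Enumerate all 2**n condition assignments and split them into
    S_T (decision evaluates to True) and S_F (decision evaluates to False)."""
    s_t, s_f = [], []
    for tc in product([False, True], repeat=n):
        (s_t if decision(*tc) else s_f).append(tc)
    return s_t, s_f

# D = A and (A or B) is equivalent to A, so S_T is exactly the rows with A true
s_t, s_f = partition_truth_table(lambda a, b: a and (a or b), 2)
print(len(s_t), len(s_f))  # 2 2
```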

A DCC test-set can be generated by choosing one test-case each from S_T and S_F and adding test-cases to the test-set until condition coverage is achieved. Thus, all DCC test-sets were generated by extending the pairs taken from the cross product of S_T and S_F. A similar approach was used for generating FPC test-sets, by extending them such that every test-set included pairs satisfying the correlation property. The methodology adopted to generate test-sets was chosen to work around the computational infeasibility problem. It can be noted that the space of possible test-sets of size 2k drawn from the truth table is \(\binom{2^n}{2k}\). With k equal to n and n ≥ 5 it is infeasible to consider all combinations in practice (for n = 5, \(\binom{32}{10} = 64{,}512{,}240\)).

CC test-sets were chosen purely by adding random test-cases till the criterion was satisfied. The number of CC test-sets generated was equal to |S_T| × |S_F|. In the case of the CC, DCC and FPC test-sets, it was ensured that the average size of the test-sets was the same as that of the MC/DC test-sets. The average was maintained by generating a random number between n and 2n as the size of the test-set. As all the numbers are equally likely to be chosen, the average size of the test-set is statistically expected to be the same for all the test criteria.
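The CC generation strategy described above (random test-cases added until every condition has taken both truth values) can be sketched as follows; the size-balancing step between n and 2n is omitted for brevity, and the function name is ours:

```python
import random
from itertools import product

def cc_testset(n, rng):
    """Build a Condition Coverage test-set by drawing random test-cases
    until every condition has evaluated to both True and False."""
    all_cases = list(product([False, True], repeat=n))
    chosen = []
    seen = [set() for _ in range(n)]  # truth values observed per condition
    while any(len(s) < 2 for s in seen):
        tc = rng.choice(all_cases)
        chosen.append(tc)
        for i, value in enumerate(tc):
            seen[i].add(value)
    return chosen

rng = random.Random(42)  # fixed seed so the run is repeatable, as in the experiments
ts = cc_testset(3, rng)
```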

A further evaluation was done to check whether a test-case pair in S_T × S_F satisfied the MC/DC property for any condition. A similar analysis was done for generating the RC/DC pairs by taking all possible pairs from the S_T and S_F sets. In that case, the test-case pair was added to the list maintained for every condition appearing in the Boolean decision. Generation of all MC/DC and RC/DC test-sets was done by considering all possible combinations of test-case pairs satisfying the property for every condition. If it was not possible to generate all possible test-sets, the test-sets were generated by making a random selection from the pair list of every condition that satisfied the criterion.
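Per-condition pair lists of the kind used for MC/DC can be computed directly from the truth table. The sketch below uses the strict unique-cause form of MC/DC (pairs that differ in exactly one condition and flip the decision); the function name is ours:

```python
from itertools import product

def mcdc_pairs(decision, n):
    """For each condition i, collect the test-case pairs from S_T x S_F that
    differ only in condition i and give different decision outcomes --
    the per-condition pair lists from which MC/DC test-sets are assembled."""
    cases = list(product([False, True], repeat=n))
    pairs = {i: [] for i in range(n)}
    for t_true, t_false in product(cases, repeat=2):
        diff = [i for i in range(n) if t_true[i] != t_false[i]]
        if len(diff) == 1 and decision(*t_true) and not decision(*t_false):
            pairs[diff[0]].append((t_true, t_false))
    return pairs

# D = A or (not B and C), the running example
p = mcdc_pairs(lambda a, b, c: a or (not b and c), 3)
```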

4.3 Evaluation Metrics

Let $D$ be a Boolean decision with $n$ conditions and let the number of faulty decisions generated using a given fault operator, $F$, for $D$ be $k$. The effectiveness as a percentage, $E_T$, of a test-set, $T$, satisfying a criterion, $C$, with respect to $F$ and $D$ is given by

\[ E_T(C, D, F, T) = \frac{m}{k} \times 100, \]

where $m$ is the number of faulty decisions identified by $T$ to be different from $D$. The effectiveness of a test criterion, $E_F$, with respect to a fault is the average of the effectiveness over all test-sets:

\[ E_F(C, D, F) = \frac{\sum_{T \text{ satisfies } C} E_T(C, D, F, T)}{\text{no. of } T \text{ satisfying } C} \]

Furthermore, the tolerance of a criterion [29], defined as the variation in effectiveness, is the variance of $E_T$ given by the following formula:

\[ Var(C, D, F) = \frac{\sum_{T \text{ satisfies } C} \left( E_F(C, D, F) - E_T(C, D, F, T) \right)^2}{(\text{no. of } T \text{ satisfying } C) - 1} \]
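The metrics $E_T$, $E_F$ and $Var$ translate directly into code; a small sketch with a toy decision and a single mutant (names and example data are ours):

```python
from statistics import mean

def effectiveness(spec, mutants, testset):
    """E_T: percentage of the k faulty decisions that the test-set
    distinguishes from the specification."""
    killed = sum(any(spec(*tc) != m(*tc) for tc in testset) for m in mutants)
    return 100.0 * killed / len(mutants)

def criterion_stats(spec, mutants, testsets):
    """E_F (mean effectiveness over all test-sets) and the sample
    variance used as the tolerance measure."""
    es = [effectiveness(spec, mutants, ts) for ts in testsets]
    e_f = mean(es)
    var = sum((e_f - e) ** 2 for e in es) / (len(es) - 1)
    return e_f, var

# Toy example: spec A or B, one mutant A and B, two singleton test-sets.
# The first test-set kills the mutant (100%), the second does not (0%).
spec = lambda a, b: a or b
mutants = [lambda a, b: a and b]
e_f, var = criterion_stats(spec, mutants, [[(True, False)], [(True, True)]])
print(e_f, var)  # 50.0 5000.0
```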

5 Analysis and Results

The effectiveness of each test criterion was measured by computing the average effectiveness of all test-sets. The graphs presented in this section show the percentage of faults plotted against the percentage of test-sets that detected them. For example, in Figure 2, for a given decision, on average 48% of CC test-sets were found to be between 90–100% effective.

[Figure 2: Distribution of effectiveness for expression negation fault. x-axis: % of faults; y-axis: % of test sets; series: CC, DCC, FPC, MC/DC, RC/DC.]

The MC/DC criterion was found to be highly effective for ENF, ORF and VNF faults (see section 2.2); a large variation in effectiveness was observed for ASF faults. In particular, it was observed that around 30% of MC/DC test-sets detected less than 10% of ASF faults. As is evident from Figures 2, 3 and 4, MC/DC and RC/DC test-sets were

always found to have more than 50% effectiveness. On the other hand, CC, DCC and FPC had fault detection effectiveness as low as 10%. As described in [15], expression negation faults are the weakest type of fault, so it is expected that ENF detection will be high, as can be observed in Figure 2.

[Figure 3: Distribution of effectiveness for operator reference fault. x-axis: % of faults; y-axis: % of test sets; series: CC, DCC, FPC, MC/DC, RC/DC.]

The variation in effectiveness of test-sets for ORF and VNF faults is similar, as they are close in the fault relation hierarchy (Figures 3 and 4). If for cost reasons few test-sets are to be selected, then the MC/DC criterion is more reliable for detecting faults. But as it is difficult to generate MC/DC test-sets compared to the other criteria, we may still detect a high number of faults by choosing a large number of test-sets satisfying the other criteria.

[Figure 4: Distribution of effectiveness for variable negation fault. x-axis: % of faults; y-axis: % of test sets; series: CC, DCC, FPC, MC/DC, RC/DC.]

As shown in Figure 5, the variation in effectiveness for ASF faults is different from that for the other faults. This is due to two reasons: (a) since ASF is the strongest fault class, most of the test techniques are unlikely to detect them, and (b) the number of ASF faults in a typical Boolean decision is usually between one and five, leading to an uneven distribution on a percentage scale. For example, if a Boolean decision has only one ASF fault, the effectiveness of a test-set would be either 0 or 100%. Here, DCC and FPC have similar variation, while MC/DC and RC/DC are also similar.

[Figure 5: Distribution of effectiveness for associative shift fault. x-axis: % of faults; y-axis: % of test sets; series: CC, DCC, FPC, MC/DC, RC/DC.]


The distribution of effectiveness for RC/DC was found to be similar to MC/DC, although the number of test-cases in RC/DC test-sets is higher. Therefore it can also be concluded that the test-set property has more influence on the fault detection effectiveness than the size of the test-set.

5.1 Variation in Effectiveness with Number of Conditions

5.1.1 Average Effectiveness

As noticed earlier, there is an exponential increase in the number of possible combinations and realisations with the increase in the number of conditions in a Boolean decision. However, the number of test-cases in a test-set remains linear with respect to the number of conditions. To answer the question of whether the increase in the number of conditions influences the effectiveness of test criteria, we also studied the average effectiveness of the different criteria with respect to the number of conditions.

[Figure 6: Distribution of avg. effectiveness w.r.t. no. of conditions for expression negation fault. x-axis: no. of conditions (3–12); y-axis: average effectiveness; series: CC, DCC, FPC, MC/DC, RC/DC.]

In general, the MC/DC and RC/DC criteria were found to be effective independent of the number of conditions in the Boolean decision. However, Figure 6 shows that the effectiveness of the other criteria decreased linearly as the number of conditions increased. The average effectiveness was found to be overlapping for DCC with FPC and for MC/DC with RC/DC.

For ENF, being the weakest fault, there is less decrease in the average effectiveness, as shown in Figure 6. In this case, the average effectiveness of all the criteria remains above 65%.

[Figure 7: Distribution of avg. effectiveness w.r.t. no. of conditions for operator reference fault. x-axis: no. of conditions (3–12); y-axis: average effectiveness; series: CC, DCC, FPC, MC/DC, RC/DC.]

There is a sharp drop in the average effectiveness for ORF and VNF in the case of the CC, DCC and FPC criteria (see Figures 7 and 8). As the number of conditions increases, the size of the truth table increases exponentially, leading to fewer changes in the truth table due to faults. This makes the faults indistinguishable to simple techniques which do not take into consideration the independent correlation between conditions. The effectiveness remained stable, and overlapping, for MC/DC and RC/DC.

[Figure 8: Distribution of avg. effectiveness w.r.t. no. of conditions for variable negation fault. x-axis: no. of conditions (3–12); y-axis: average effectiveness; series: CC, DCC, FPC, MC/DC, RC/DC.]

The variation in effectiveness with respect to the number of conditions for associative shift faults was slightly different in comparison to ENF, ORF and VNF faults. Figure 9 shows that the effectiveness of MC/DC and RC/DC increased with the number of conditions. The reason is that as the number of conditions rises, the number of associations is likely to grow, leading to a higher probability of fault detection. For example, one ASF fault leads to 30% effectiveness, while five ASF faults lead to more than 50% effectiveness. RC/DC was found to have the maximum average effectiveness. The other criteria do not show any improvement in effectiveness.

[Figure 9: Distribution of avg. effectiveness w.r.t. no. of conditions for associative shift fault. x-axis: no. of conditions (3–12); y-axis: average effectiveness; series: CC, DCC, FPC, MC/DC, RC/DC.]

5.1.2 Tolerance

We also studied the stability of a test criterion with respect to the number of conditions. Figures 10–13 depict the tolerance with respect to the number of conditions. Ideally, a test criterion should have a high average effectiveness along with a low standard deviation (high tolerance).

[Figure 10: Distribution of std. deviation (eff.) w.r.t. no. of conditions for expression negation fault. x-axis: no. of conditions (3–12); y-axis: S.D. (effectiveness); series: CC, DCC, FPC, MC/DC, RC/DC.]

For expression negation faults, the standard deviation of effectiveness for CC, DCC and FPC increased with the increase in the number of conditions, while MC/DC and RC/DC had a low and almost constant standard deviation. CC had the highest variation in effectiveness. The variation was overlapping for DCC with FPC and for MC/DC with RC/DC.

In the case of ORF and VNF, the standard deviation was observed to decrease with the increase in the number of conditions for CC, DCC and FPC, while it remained constant for MC/DC and RC/DC (see Figures 11 and 12). But as the average effectiveness for CC, DCC and FPC decreases while that of MC/DC and RC/DC remains constant, the latter can be considered more reliable (see Figures 7 and 8).

[Figure 11: Distribution of std. deviation (eff.) w.r.t. no. of conditions for operator reference fault. x-axis: no. of conditions (3–12); y-axis: S.D. (effectiveness); series: CC, DCC, FPC, MC/DC, RC/DC.]

The standard deviation of effectiveness with increasing number of conditions remained high for associative shift faults in comparison to the other faults (see Figure 13). The MC/DC and RC/DC test-sets show only a marginal decrease in standard deviation with increasing average effectiveness (cf. Figure 9).

[Figure 12: Distribution of std. deviation (eff.) w.r.t. no. of conditions for variable negation fault. x-axis: no. of conditions (3–12); y-axis: S.D. (effectiveness); series: CC, DCC, FPC, MC/DC, RC/DC.]

Comparison of Figures 10–12 with Figure 13 shows that the standard deviation is high for all the test criteria for ASF faults, while MC/DC and RC/DC show low standard deviation for ENF, ORF and VNF faults. This implies that the number of test-sets required to detect all ASF faults is higher compared with the other types of faults.

[Figure 13: Distribution of std. deviation (eff.) w.r.t. no. of conditions for associative shift fault. x-axis: no. of conditions (3–12); y-axis: S.D. (effectiveness); series: CC, DCC, FPC, MC/DC, RC/DC.]


5.2 Limitations of the Study

The controlled experiments conducted in this study attempt to evaluate the fault detection effectiveness of the control-flow test criteria only for abstract Boolean decisions. In this sense, this study is artificial in nature. In real programs, more than one control point (Boolean decision) is present, and it is not sufficient only to detect the faults; they must also propagate and cause failure. Therefore, it is important to study fault propagation as well.

The fault detection effectiveness was studied by introducing only one fault at a time. Though there is some empirical evidence, a more elaborate study is required to evaluate the effect of multiple faults on the effectiveness of a test criterion.

In real-life specifications and implementations, the conditions are often coupled. Also, the presence of multiple control points may cause some combinations of conditions to be infeasible. Both coupling and infeasibility of conditions may reduce the number of possible test-sets significantly and therefore might prune out effective test-sets. This also needs to be investigated further.

6 Conclusions

An empirical evaluation of the effectiveness of four major control-flow test criteria (CC, DCC, FPC and MC/DC) has been performed. The analysis of tolerance of a newly proposed test criterion, RC/DC, has also been included for comparison. The average size of the generated test-sets was kept the same to ensure that any difference in effectiveness is due to the property of the test-set. The study was based on exhaustive generation of all possible test-sets (wherever possible) against ENF, ORF, VNF and ASF faults. In the circumstances where it was not possible to analyse all the test-sets, a large set of test-sets was selected from only those test-sets that satisfied the test criterion.

Boolean decisions in programs, with varying numbers of conditions, were considered as the subjects of study. An analysis of the influence of the number of conditions on average fault detection effectiveness and on the standard deviation of effectiveness was undertaken for each test criterion. It was also observed that the property of the test-set had more influence on the fault detection effectiveness than the size of the test-set.

The empirical results for the DCC and FPC test criteria were found to be similar. This is due to the fact that a test-set satisfying the DCC property is likely to satisfy the FPC property and vice versa. In general, MC/DC was found to be effective in detecting faults. However, MC/DC test-sets had a large variation in fault detection effectiveness for ASF faults. The average effectiveness for CC, DCC and FPC was found to decrease with the increase in the number of conditions, while that of MC/DC remained almost constant. The standard deviation of effectiveness varied for CC, DCC and FPC but remained constant for MC/DC.

ENF was found to be the weakest fault, with all the test criteria giving good results. The distribution of effectiveness was uniform for ORF and VNF. ASF, being the strongest fault, displayed a different behaviour for all metrics.

In comparison to MC/DC, the RC/DC criterion did not show a great improvement in effectiveness and tolerance, although the number of test-cases in RC/DC test-sets is greater than for the other criteria. The MC/DC test criterion was thus found to be more stable and promising for practical applications. If, for cost reasons, only a small number of test-sets can be selected, it might be better to choose MC/DC test-sets. However, as it is more expensive to generate MC/DC test-sets, high fault detection is also possible by selecting a greater number of test-sets satisfying the other criteria.

References

[1] J. Chilenski and S. Miller. Applicability of modified condition/decision coverage to software testing. Software Engineering Journal, pages 193–200, September 1994.

[2] A. Dupuy and N. Leveson. An empirical evaluation of the MC/DC coverage criterion on the HETE-2 satellite software. In DASC: Digital Aviation Systems Conference, Philadelphia, October 2000.

[3] P. G. Frankl and O. Iakounenko. Further empirical studies of test effectiveness. In ACM SIGSOFT Software Engineering Notes, Sixth International Symposium on Foundations of Software Engineering, volume 23, pages 153–162, November 1998.

[4] P. G. Frankl and S. Weiss. An experimental comparison of the effectiveness of branch testing and data flow testing. IEEE Transactions on Software Engineering, 19(8):774–787, August 1993.

[5] P. G. Frankl and E. Weyuker. A formal analysis of the fault-detecting ability of testing methods. IEEE Transactions on Software Engineering, 19(3):202–213, March 1993.

[6] P. G. Frankl and E. J. Weyuker. Provable improvements on branch testing. IEEE Transactions on Software Engineering, 19(10):962–975, October 1993.

[7] P. G. Frankl and E. J. Weyuker. Testing software to detect and reduce risk. The Journal of Systems and Software, pages 275–286, 2000.

[8] J. B. Goodenough and S. L. Gerhart. Towards a theory of test data selection. IEEE Transactions on Software Engineering, 1(2):156–173, June 1975.

[9] D. Hamlet. What can we learn by testing a program. In ACM SIGSOFT International Symposium on Software Testing and Analysis, Clearwater Beach, Florida, USA, pages 50–52. ACM, 1998.

[10] M. Harman, R. Hierons, M. Holcombe, B. Jones, S. Reid, M. Roper, and M. Woodward. Towards a maturity model for empirical studies of software testing. In Fifth Workshop on Empirical Studies of Software Maintenance (WESS'99), Keble College, Oxford, UK, September 1999.

[11] W. E. Howden. Reliability of the path analysis testing strategy. IEEE Transactions on Software Engineering, 2(3):208–215, September 1976.

[12] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments on the effectiveness of dataflow and control-flow based test adequacy criteria. In 16th International Conference on Software Engineering (ICSE-16), pages 191–200, 1994.

[13] R. Jasper, M. Brennan, K. Williamson, B. Currier, and D. Zimmerman. Test data generation and feasible path analysis. In SIGSOFT: International Symposium on Software Testing and Analysis, Seattle, Washington, USA, pages 95–107. ACM, 1994.

[14] J. A. Jones and M. J. Harrold. Test-suite reduction and prioritization for modified condition/decision coverage. In International Conference on Software Maintenance (ICSM'01), Florence, Italy, pages 92–101. IEEE, November 2001.

[15] D. R. Kuhn. Fault classes and error detection capability of specification-based testing. ACM Transactions on Software Engineering and Methodology, 8(4):411–424, October 1999.

[16] L. J. Morell. A theory of fault-based testing. IEEE Transactions on Software Engineering, 16(8):844–857, August 1990.

[17] G. Myers. The Art of Software Testing. Wiley-Interscience, 1979.

[18] A. J. Offutt. Investigations of the software testing coupling effect. ACM Transactions on Software Engineering and Methodology, 1(1):5–20, 1992.

[19] A. J. Offutt, Y. Xiong, and S. Liu. Criteria for generating specification-based tests. In Fifth IEEE International Conference on Engineering of Complex Computer Systems (ICECCS'99), Las Vegas, Nevada, USA, pages 119–129, October 1999.

[20] A. Paradkar and K.-C. Tai. Test-generation for Boolean expressions. In Sixth International Symposium on Software Reliability Engineering (ISSRE'95), Toulouse, France, pages 106–115, 1995.

[21] A. Paradkar, K.-C. Tai, and M. A. Vouk. Automatic test-generation for predicates. IEEE Transactions on Reliability, 45(4):515–530, December 1996.

[22] D. J. Richardson and M. C. Thompson. An analysis of test data selection criteria using the RELAY model of fault detection. IEEE Transactions on Software Engineering, 19(6):533–553, June 1993.

[23] M. Roper. Software Testing. McGraw-Hill Book Company Europe, 1994.

[24] RTCA/DO-178B. Software considerations in airborne systems and equipment certification. Washington DC, USA, 1992.

[25] K.-C. Tai. Theory of fault-based predicate testing for computer programs. IEEE Transactions on Software Engineering, 22(8):552–562, August 1996.

[26] Software testing FAQ: Testing tools supplier. http://www.testingfaqs.org/tools.htm.

[27] T. Tsuchiya and T. Kikuno. On fault classes and error detection capability of specification-based testing. ACM Transactions on Software Engineering and Methodology, 11(1):58–62, January 2001.

[28] S. A. Vilkomir and J. P. Bowen. Reinforced condition/decision coverage (RC/DC): A new criterion for software testing. In D. Bert, J. P. Bowen, M. Henson, and K. Robinson, editors, ZB 2002: Formal Specification and Development in Z and B, volume 2272 of Lecture Notes in Computer Science, pages 295–313. Springer-Verlag, January 2002.

[29] S. A. Vilkomir, K. Kapoor, and J. P. Bowen. Tolerance of control-flow testing criteria. Technical Report SBU-CISM-03-01, SCISM, South Bank University, London, UK, March 2003.

[30] M. A. Vouk, K.-C. Tai, and A. Paradkar. Empirical studies of predicate-based software testing. In 5th International Symposium on Software Reliability Engineering, pages 55–64, 1994.

[31] E. Weyuker. Can we measure software testing effectiveness? In First International Software Metrics Symposium, Baltimore, USA, pages 100–107. IEEE, 21–22 May 1993.

[32] E. Weyuker. Thinking formally about testing without a formal specification. In Formal Approaches to Testing of Software (FATES'02), A Satellite Workshop of CONCUR'02, Brno, Czech Republic, pages 1–10, 24 August 2002.

[33] E. Weyuker, T. Goradia, and A. Singh. Automatically generating test data from a Boolean specification. IEEE Transactions on Software Engineering, 20(5):353–363, May 1994.

[34] A. White. Comments on modified condition/decision coverage for software testing. In IEEE Aerospace Conference, Big Sky, Montana, USA, volume 6, pages 2821–2828, 10–17 March 2001.

[35] W. Wong, J. Horgan, A. Mathur, and A. Pasquini. Test set size minimization and fault detection effectiveness: a case study in a space application. In Twenty-First Annual International Computer Software and Applications Conference (COMPSAC'97), pages 522–528, 13–15 August 1997.

[36] H. Zhu. Axiomatic assessment of control flow-based software test adequacy criteria. Software Engineering Journal, pages 194–204, September 1995.

[37] H. Zhu. A formal analysis of the subsume relation between software test adequacy criteria. IEEE Transactions on Software Engineering, 22(4):248–255, April 1996.

[38] H. Zhu, P. Hall, and H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29(4):366–427, December 1997.
