Cleansing test suites from coincidental correctness to enhance fault localization

Cleansing Coincidental Correctness to Enhance Fault Localization

Tao [email protected]

Software Engineering Laboratory
Department of Computer Science, Sun Yat-Sen University

The 2nd Joint Winter Workshop on Software Engineering
December 2010

Sun Yat-Sen University, Guangzhou, China

1/37

Outline

Coverage-Based Fault Localization: Introduction, Methodology, Evaluation, Discussion

Cleansing Coincidental Correctness: Methodology, Evaluation

Conclusion and Future Work

2/37

Introduction

Software debugging is an arduous task [1] that requires time, effort, and a good understanding of the source code.

Three steps to debug [2]: fault detection, fault localization, fault correction.

We focus on automatic fault localization…

[1] I. Vessey. Expertise in debugging computer programs: A process analysis. International Journal of Man-Machine Studies, 23(5):459–494, November 1985.

[2] D. Wieland. Model-Based Debugging of Java Programs Using Dependencies. PhD thesis, Technische Universität Wien, 2001.

3/37

Input of Fault Localization

Source code and test cases.

Source code:

//Find the maximum among a, b and c
int max (int a, int b, int c){
1   int temp = a;
2   if (b > temp ){
3     temp = b+1; //bug
4   }
5   if (c > temp ){
6     temp = c;
7   }
8   return temp;
}

Test cases:

Input (a, b, c)   Oracle
3, 2, 1           3
2, 1, 3           3
1, 2, 3           3
1, 2, 4           4
1, 2, 3           3
1, 3, 2           3

4/37

Output of Fault Localization

The suspiciousness of each statement, based on its likelihood of containing faults. A statement with higher suspiciousness should be examined before a statement with lower suspiciousness.

Suspiciousness results for the Jaccard coefficient:

Statement        S1    S2    S3    S4    S5    S6    S7    S8
Suspiciousness   0.33  0.33  0.5   0.33  0.33  0.25  0.33  0.33




Most suspicious: S3 (temp = b+1)

5/37

Coverage-Based Fault Localization (CBFL)

Based on which executable statements each test hits (coverage).

Input of CBFL: the coverage of each test and its execution result r (p = passed, f = failed).

a, b, c   S1  S2  S3  S4  S5  S6  S7  S8  r
3, 2, 1   1   1   0   1   1   0   1   1   p
2, 1, 3   1   1   0   1   1   1   1   1   p
1, 2, 3   1   1   1   1   1   0   1   1   p
1, 2, 4   1   1   1   1   1   1   1   1   p
1, 2, 3   1   1   1   1   1   1   1   1   f
1, 3, 2   1   1   1   1   1   0   1   1   f




6/37

Input of CBFL

a, b, c   S3  S6  Others  r
3, 2, 1   0   0   1       p
2, 1, 3   0   1   1       p
1, 2, 3   1   0   1       p
1, 2, 4   1   1   1       p
1, 2, 3   1   1   1       f
1, 3, 2   1   0   1       f




For brevity…

7/37

Methodology

Intuitively, four factors contribute to the suspiciousness of each statement S:

Factor    Meaning                        Suspiciousness as the factor grows
a00(S)    |Not cover S, Passed tests|    ↑
a10(S)    |Cover S, Passed tests|        ↓
a01(S)    |Not cover S, Failed tests|    ↓
a11(S)    |Cover S, Failed tests|        ↑

An example:

a, b, c   S3  S6  Others  r
3, 2, 1   0   0   1       p
2, 1, 3   0   1   1       p
1, 2, 3   1   0   1       p
1, 2, 4   1   1   1       p
1, 2, 3   1   1   1       f
1, 3, 2   1   0   1       f

a00(S)    2   2   0
a10(S)    2   2   4
a01(S)    0   1   0
a11(S)    2   1   2
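The four counters can be tallied directly from one coverage column and the pass/fail results. The following is a minimal sketch in C; the `Spectrum` struct and the `count_spectrum` function are illustrative names that do not appear in the slides.

```c
#include <stddef.h>

/* Spectrum counters for one statement S (names are illustrative). */
typedef struct {
    int a00; /* tests that do not cover S and passed */
    int a10; /* tests that cover S and passed */
    int a01; /* tests that do not cover S and failed */
    int a11; /* tests that cover S and failed */
} Spectrum;

/* covered[i] is 1 if test i executes S; failed[i] is 1 if test i failed. */
Spectrum count_spectrum(const int *covered, const int *failed, size_t n_tests) {
    Spectrum s = {0, 0, 0, 0};
    for (size_t i = 0; i < n_tests; i++) {
        if (covered[i] && failed[i])       s.a11++;
        else if (covered[i] && !failed[i]) s.a10++;
        else if (!covered[i] && failed[i]) s.a01++;
        else                               s.a00++;
    }
    return s;
}
```

Feeding it the S3 column (0, 0, 1, 1, 1, 1) with results (p, p, p, p, f, f) gives a00 = 2, a10 = 2, a01 = 0, a11 = 2, as in the example table.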

8/37

Jaccard [3]

Similarity of asymmetric binary attributes.

$$S_J(s) = \frac{a_{11}(s)}{a_{11}(s) + a_{01}(s) + a_{10}(s)} = \frac{\mathit{failed}(s)}{\mathit{totalfailed} + \mathit{passed}(s)}$$

a, b, c   S3   S6    Others  r
3, 2, 1   0    0     1       p
2, 1, 3   0    1     1       p
1, 2, 3   1    0     1       p
1, 2, 4   1    1     1       p
1, 2, 3   1    1     1       f
1, 3, 2   1    0     1       f

SJ(s)     0.5  0.25  0.33

[3] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. A. Brewer. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of 2002 International Conference on Dependable Systems and Networks (DSN 2002), pages 595–604, Bethesda, MD, USA, 23-26 June 2002. IEEE Computer Society.
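As a quick check of the Jaccard values, the coefficient can be written as a small C function. This is a sketch only: the function name `jaccard` and the zero-denominator guard are assumptions added here, not part of the slides.

```c
/* Jaccard suspiciousness: a11 / (a11 + a01 + a10).
   Returns 0 when no test fails or covers the statement
   (a guard assumed for this sketch). */
double jaccard(int a11, int a01, int a10) {
    int denom = a11 + a01 + a10;
    return denom == 0 ? 0.0 : (double)a11 / (double)denom;
}
```

With the counters of the running example, `jaccard(2, 0, 2)` is 0.5 for S3, `jaccard(1, 1, 2)` is 0.25 for S6, and `jaccard(2, 0, 4)` is about 0.33 for the other statements.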

9/37

Tarantula [4]

Used in the Tarantula fault localization tool.

$$S_T(s) = \frac{\dfrac{a_{11}(s)}{a_{11}(s) + a_{01}(s)}}{\dfrac{a_{11}(s)}{a_{11}(s) + a_{01}(s)} + \dfrac{a_{10}(s)}{a_{10}(s) + a_{00}(s)}} = \frac{\dfrac{\mathit{failed}(s)}{\mathit{totalfailed}}}{\dfrac{\mathit{failed}(s)}{\mathit{totalfailed}} + \dfrac{\mathit{passed}(s)}{\mathit{totalpassed}}}$$

a, b, c   S3    S6   Others  r
3, 2, 1   0     0    1       p
2, 1, 3   0     1    1       p
1, 2, 3   1     0    1       p
1, 2, 4   1     1    1       p
1, 2, 3   1     1    1       f
1, 3, 2   1     0    1       f

ST(s)     0.66  0.5  0.5

[4] J. A. Jones and M. J. Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In D. F. Redmiles, T. Ellman, and A. Zisman, editors, 20th IEEE/ACM International Conference on Automated Software Engineering (ASE 2005), pages 273–282, Long Beach, CA, USA, November 7-11 2005. ACM.
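The Tarantula coefficient can be sketched the same way. The function name `tarantula` and the handling of empty pass or fail sets are assumptions made for this illustration.

```c
/* Tarantula suspiciousness:
   (failed(s)/totalfailed) / (failed(s)/totalfailed + passed(s)/totalpassed).
   Ratios with an empty denominator are treated as 0 (assumed guard). */
double tarantula(int a11, int a01, int a10, int a00) {
    int total_failed = a11 + a01;
    int total_passed = a10 + a00;
    double fail_ratio = total_failed ? (double)a11 / total_failed : 0.0;
    double pass_ratio = total_passed ? (double)a10 / total_passed : 0.0;
    double denom = fail_ratio + pass_ratio;
    return denom == 0.0 ? 0.0 : fail_ratio / denom;
}
```

For the running example, `tarantula(2, 0, 2, 2)` is about 0.67 for S3, while S6 (`tarantula(1, 1, 2, 2)`) and the other statements (`tarantula(2, 0, 4, 0)`) both score 0.5.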

10/37

Ochiai [5]

Used in the molecular biology domain to measure genetic similarity.

$$S_O(s) = \frac{a_{11}(s)}{\sqrt{(a_{11}(s) + a_{01}(s))\,(a_{11}(s) + a_{10}(s))}} = \frac{\mathit{failed}(s)}{\sqrt{\mathit{totalfailed} \cdot (\mathit{failed}(s) + \mathit{passed}(s))}}$$

a, b, c   S3   S6    Others  r
3, 2, 1   0    0     1       p
2, 1, 3   0    1     1       p
1, 2, 3   1    0     1       p
1, 2, 4   1    1     1       p
1, 2, 3   1    1     1       f
1, 3, 2   1    0     1       f

SO(s)     0.7  0.41  0.57

[5] R. Abreu, P. Zoeteweij, and A. J. van Gemund. On the accuracy of spectrum-based fault localization. In P. McMinn, editor, Proceedings of the Testing: Academia and Industry Conference - Practice And Research Techniques (TAIC PART’07), pages 89–98, Windsor, United Kingdom, September 2007. IEEE Computer Society.
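For ranking purposes the square root in Ochiai can be avoided, because squaring a non-negative score preserves the order of the statements. The sketch below therefore computes the squared Ochiai value; the name `ochiai_squared` and the zero guard are assumptions for illustration, and taking `sqrt` of the result (from `<math.h>`) would recover the score itself.

```c
/* Squared Ochiai suspiciousness: a11^2 / ((a11 + a01) * (a11 + a10)).
   Squaring keeps the ranking unchanged because the score is non-negative.
   Returns 0 when the denominator is zero (assumed guard). */
double ochiai_squared(int a11, int a01, int a10) {
    double denom = (double)(a11 + a01) * (double)(a11 + a10);
    return denom == 0.0 ? 0.0 : ((double)a11 * (double)a11) / denom;
}
```

In the running example S3 gets 0.5 (square root about 0.71), S6 about 0.17 (square root about 0.41), and the other statements about 0.33 (square root about 0.58), so S3 still ranks first.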

11/37

Evaluation

Assign a score to every faulty version of each subject program.

Score [6]: the percentage of the program that need not be examined before the first bug-containing statement is reached.

Assumption: perfect bug detection, i.e., programmers can always correctly classify faulty code as faulty and non-faulty code as non-faulty.

[6] J. A. Jones and M. J. Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In D. F. Redmiles, T. Ellman, and A. Zisman, editors, 20th IEEE/ACM International Conference on Automated Software Engineering (ASE 2005), pages 273–282, Long Beach, CA, USA, November 7-11 2005. ACM.

12/37

Evaluation (cont’) - An example

Sorted suspiciousness results:

Statement        S6   S3   S1    S2    S4    S5    S7    S8
Suspiciousness   0.7  0.5  0.33  0.33  0.33  0.33  0.33  0.33




Step 1: examine S6, the top-ranked statement. Not the bug.

13/37

Step 1: examine S6. Not the bug.

Step 2: examine S3. Found it!



14/37

2 of the 8 statements in the program have been examined, so the score of this program is

1 - (2 ÷ 8) = 0.75

That is, 75% of the statements need not be examined.
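The score computation above can be sketched in C as follows. The function names are illustrative, and the tie-handling rule (statements tied with the faulty one are not counted as examined) is an assumption, since the slides do not say how ties are broken.

```c
#include <stddef.h>

/* Number of statements examined until the faulty one is reached when
   statements are inspected in descending order of suspiciousness.
   Assumption: ties with the faulty statement are resolved in its favour. */
int statements_examined(const double *susp, size_t n, size_t faulty) {
    int examined = 1; /* the faulty statement itself */
    for (size_t i = 0; i < n; i++) {
        if (susp[i] > susp[faulty]) examined++;
    }
    return examined;
}

/* Score = 1 - examined / total: the share of the program not examined. */
double score(int examined, int total) {
    return 1.0 - (double)examined / (double)total;
}
```

With the sorted ranking of the example (S6 at 0.7, S3 at 0.5, the rest at 0.33) and S3 as the faulty statement, `statements_examined` gives 2 and `score(2, 8)` gives 0.75.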


15/37

Evaluation (cont’)

Assign a score to every faulty version of the Siemens suite.

The effectiveness of existing techniques has been limited…

16/37

Discussion

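For brevity, let

$$C_J = a_{11} + a_{01}, \qquad C_T = \frac{a_{11} + a_{01}}{a_{10} + a_{00}}$$

The original coefficients are

$$S_T = \frac{\dfrac{a_{11}}{a_{11} + a_{01}}}{\dfrac{a_{11}}{a_{11} + a_{01}} + \dfrac{a_{10}}{a_{10} + a_{00}}}, \qquad S_J = \frac{a_{11}}{a_{11} + a_{01} + a_{10}}, \qquad S_O = \frac{a_{11}}{\sqrt{(a_{11} + a_{01})(a_{11} + a_{10})}}$$

Rewrite the coefficients as below [7]:

Tarantula (divide numerator and denominator by $a_{11}/(a_{11} + a_{01})$):

$$S_T = \frac{1}{1 + C_T \dfrac{a_{10}}{a_{11}}}$$

Jaccard (replace $a_{11} + a_{01}$ by $C_J$):

$$S_J = \frac{a_{11}}{C_J + a_{10}}$$

Ochiai (square, and divide numerator and denominator by $a_{11}$):

$$S_O^2 = \frac{a_{11}}{C_J \left(1 + \dfrac{a_{10}}{a_{11}}\right)}$$

which ranks statements in the same order as $a_{11} / (1 + a_{10}/a_{11})$.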

[7] R. Abreu, P. Zoeteweij, R. Golsteijn, and A. J. C. van Gemund. A practical evaluation of spectrum-based fault localization. Journal of Systems and Software, 82(11):1780–1792, 2009.

Both CT = (a11+a01)/(a10+a00) and CJ = a11+a01 are constant for all statements, so they do not influence the suspiciousness ranking.

So the rankings from the three coefficients depend only on a11 and a10.
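As an illustration of why only a11 and a10 matter, the Tarantula coefficient can be rewritten step by step. This is a worked sketch that assumes a11 > 0 and uses CT = (a11+a01)/(a10+a00) as defined above:

```latex
\begin{align*}
S_T &= \frac{\dfrac{a_{11}}{a_{11}+a_{01}}}
            {\dfrac{a_{11}}{a_{11}+a_{01}} + \dfrac{a_{10}}{a_{10}+a_{00}}} \\[4pt]
    &= \frac{1}{1 + \dfrac{a_{10}}{a_{10}+a_{00}} \cdot \dfrac{a_{11}+a_{01}}{a_{11}}} \\[4pt]
    &= \frac{1}{1 + C_T \, \dfrac{a_{10}}{a_{11}}}
\end{align*}
```

For S3 in the running example, CT = (2+0)/(2+2) = 0.5 and a10/a11 = 2/2 = 1, so ST = 1/(1 + 0.5) ≈ 0.67, which agrees with the Tarantula table.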

17/37

The impact of a11 and a10

The suspiciousness calculated by the coefficients has a positive correlation with a11 and a negative correlation with a10.

Assume that:
(1) whenever the fault is executed, the execution fails (increasing a11);
(2) whenever the fault is not executed, the execution passes (increasing a10);
(3) the test suite is adequate.

Then the faulty statement will always rank at the top.

Why ineffective? Any interferences?

18/37

Interferences

Factors that impair CBFL (interferences):

Coincidental correctness [8]: the fault is executed, but the execution does not fail.

Multiple faults: the fault is not executed, but the execution fails.

Coverage equivalence: the coverage of two statements is always the same.

[8] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi. An empirical study of the factors that reduce the effectiveness of coverage-based fault localization. In B. Liblit, N. Nagappan, and T. Zimmermann, editors, Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction with ISSTA 2009, pages 1–5, Chicago, Illinois, July 19 2009. ACM.

19/37

Coincidental Correctness

Condition            a, b, c   S3  S6  Others  r
a < b, b + 1 = c     1, 2, 3   1   0   1       p
a < b, b + 1 < c     1, 2, 4   1   1   1       p




Not all conditions for failure are met. The RIP (reachability-infection-propagation) model [9]:

Condition 1: the fault is executed.
Condition 2: the program has transitioned into an infectious state.
Condition 3: the infection has propagated to the output.
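For the running example, a coincidentally correct test can be detected by checking that the faulty statement ran and the output still matches the oracle. The sketch below is an illustration only: `max_ref` assumes the intended fix is `temp = b`, and the `hit_fault` flag is hand-written instrumentation that is not part of the slides.

```c
#include <stdbool.h>

/* Reference version used as the oracle (assumed fix: temp = b). */
int max_ref(int a, int b, int c) {
    int temp = a;
    if (b > temp) temp = b;
    if (c > temp) temp = c;
    return temp;
}

/* Faulty version from the slides; *hit_fault records whether
   statement 3 (the fault) was executed. */
int max_buggy(int a, int b, int c, bool *hit_fault) {
    int temp = a;
    *hit_fault = false;
    if (b > temp) {
        temp = b + 1; /* bug */
        *hit_fault = true;
    }
    if (c > temp) temp = c;
    return temp;
}

/* True when the fault is executed (Condition 1) but the output
   still equals the oracle, i.e. Condition 3 is not met. */
bool coincidentally_correct(int a, int b, int c) {
    bool hit;
    int out = max_buggy(a, b, c, &hit);
    return hit && out == max_ref(a, b, c);
}
```

The inputs (1, 2, 3) and (1, 2, 4) are reported as coincidentally correct, whereas (1, 3, 2) fails outright and (3, 2, 1) never reaches the fault.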

[9] P. Ammann and J. Offutt. Introduction to Software Testing. Cambridge University Press, 2008.

20/37

Multiple Faults



int max (int a, int b, int c){
1   int temp = a;
2   if (b > temp ){
3     temp = b+1; //bug
4   }
5   if (c > temp ){
6     temp = c+1; //bug
7   }
8   return temp;
}

Condition            a, b, c   S3  S6  r
a < b, b + 1 ≥ c     1, 2, 4   1   0   f
a ≥ b, a < c         3, 2, 4   0   1   f

The fault is not executed, but the execution fails (because another fault is executed).

21/37

Coverage Equivalence



int max (int a, int b, int c){
1   int temp = a+1; //bug
2   if (b > temp ){
3     temp = b;
4   }
5   if (c > temp ){
6     temp = c;
7   }
8   return temp;
}

Condition         a, b, c   S1  S8  r
a < b or a < c    1, 2, 3   1   1   p
otherwise         7, 2, 4   1   1   f

The coverage of the two statements is always the same. This is due to:

Inadequacy of the test suite
The inherent property of a program

22/37

Empirical Study

Coincidental correctness (72.1%) [8]:

Strong coincidental correctness (15.7%): meets Conditions 1 and 2 of the RIP (reachability-infection-propagation) model.

Weak coincidental correctness (56.4%): meets only Condition 1 of the RIP model.

A safety-reducing factor: it causes the faulty statement to receive a lower score than other statements.

[8] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi. An empirical study of the factors that reduce the effectiveness of coverage-based fault localization. In B. Liblit, N. Nagappan, and T. Zimmermann, editors, Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction with ISSTA 2009, pages 1–5, Chicago, Illinois, July 19 2009. ACM.

23/37

Cleansing Coincidental Correctness [10]

Input: a test suite and the coverage matrix.

Output: the subset of passing tests that are likely to be coincidentally correct.

Assumption: a good candidate for a cce is a program element that occurs in all failing runs and in a non-zero but not excessively large percentage of passing runs.

[10] W. Masri and R. Abou Assi. Cleansing test suites from coincidental correctness to enhance fault-localization. In Third International Conference on Software Testing, Verification and Validation (ICST 2010), pages 165–174, 2010. IEEE.

24/37

Technique - I

Notation:
T: a test suite
TF: failing tests
TP: passing tests
TCC: coincidentally correct tests

We estimate:
CCE: the set of program elements that are likely to be correlated with coincidentally correct tests
cce: an element in CCE
cct: a test that induces a cce
CCT: the estimate of TCC

Assumption:
fT(cce) = 1.0
0 < pT(cce) ≤ θ
where fT(cce) is the percentage of TF executing cce, pT(cce) is the percentage of TP executing cce, and θ < 1.0.

Populate CCE with program elements that are totally correlated with failures.

25/37

Technique - I (cont’)


Populate CCT with tests that execute one or more cce’s.
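The two steps of Technique I (populate CCE, then populate CCT) can be sketched in C as below. The function names, the row-major layout of the coverage matrix, and any concrete value of θ are assumptions made for this illustration; the slides only require θ < 1.0.

```c
#include <stddef.h>

/* cov is row-major: cov[i * n_elems + j] is 1 if test i executes element j.
   Returns 1 if element j is a cce candidate: executed by every failing test
   (fT = 1.0) and by a non-zero fraction, at most theta, of passing tests. */
int is_cce(const int *cov, const int *failed, size_t n_tests,
           size_t n_elems, size_t j, double theta) {
    int n_fail = 0, n_pass = 0, fail_hits = 0, pass_hits = 0;
    for (size_t i = 0; i < n_tests; i++) {
        int hit = cov[i * n_elems + j] != 0;
        if (failed[i]) { n_fail++; fail_hits += hit; }
        else           { n_pass++; pass_hits += hit; }
    }
    if (n_fail == 0 || n_pass == 0) return 0;
    if (fail_hits != n_fail) return 0;          /* fT(cce) must be 1.0 */
    double p = (double)pass_hits / (double)n_pass;
    return p > 0.0 && p <= theta;               /* 0 < pT(cce) <= theta */
}

/* Sets cct[i] = 1 for every passing test that executes at least one cce. */
void find_cct(const int *cov, const int *failed, size_t n_tests,
              size_t n_elems, double theta, int *cct) {
    for (size_t i = 0; i < n_tests; i++) cct[i] = 0;
    for (size_t j = 0; j < n_elems; j++) {
        if (!is_cce(cov, failed, n_tests, n_elems, j, theta)) continue;
        for (size_t i = 0; i < n_tests; i++) {
            if (!failed[i] && cov[i * n_elems + j]) cct[i] = 1;
        }
    }
}
```

On the running example with columns S3, S6 and Others and an assumed θ of 0.5, only S3 qualifies as a cce, and the passing tests (1, 2, 3) and (1, 2, 4) are marked as cct's.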

26/37

Technique - I - An example


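a, b, c   S3 (cce)  S6  Others  r
1, 2, 3   1         0   1       p
1, 2, 4   1         1   1       p
3, 2, 1   0         0   1       p
2, 1, 3   0         1   1       p
1, 2, 3   1         1   1       f
1, 3, 2   1         0   1       f

S3 is executed by all failing tests and by 2 of the 4 passing tests, so it is a cce.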




27/37

Technique - I - An example (cont’)


a, b, c   S3 (cce)  S6  Others  r
1, 2, 3   1         0   1       p   cct
1, 2, 4   1         1   1       p   cct
3, 2, 1   0         0   1       p
2, 1, 3   0         1   1       p
1, 2, 3   1         1   1       f
1, 3, 2   1         0   1       f




Found them: the passing tests marked cct are the coincidentally correct tests.

28/37

Technique - II

A cct with a higher weight is more likely to be a coincidentally correct test.

Weight (correlates with suspiciousness):

(average weight of the covered cce’s) + (percent of cce’s covered)

The lower-ranked cct’s are discarded.
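The ranking weight of a single cct can be sketched as follows. The function name, the per-cce `weights` array, and the choice to express "percent of cce's covered" as a fraction between 0 and 1 are all assumptions for illustration; the slides do not fix these details.

```c
#include <stddef.h>

/* Weight of one cct:
   (average weight of the cce's it covers) + (fraction of all cce's it covers).
   weights[j] is the weight of cce j; covers[j] is 1 if the test executes it.
   Returns 0 when there are no cce's or the test covers none of them. */
double cct_weight(const double *weights, const int *covers, size_t n_cce) {
    double sum = 0.0;
    size_t n_covered = 0;
    for (size_t j = 0; j < n_cce; j++) {
        if (covers[j]) {
            sum += weights[j];
            n_covered++;
        }
    }
    if (n_cce == 0 || n_covered == 0) return 0.0;
    return sum / (double)n_covered + (double)n_covered / (double)n_cce;
}
```

With two hypothetical cce's weighted 0.5 and 0.25, a test covering only the first gets 0.5 + 0.5 = 1.0, while a test covering both gets 0.375 + 1.0 = 1.375 and would rank higher.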

29/37

Technique - III

Partitions the cct’s into two clusters based on the similarity of the suspicious cce’s they execute.

Assumptions:

Typically, some cce’s are relevant to the fault and others are not.

The coincidentally correct tests exercise the fault-relevant cce’s, whereas the correct tests do not.

30/37

Evaluation

false negatives:

false positives:

safety change:

precision change:

coverage reduction:

31/37

Evaluation (cont’)

32/37

Evaluation (cont’)

Comparative results summaries

33/37

Conclusion

Without interferences, CBFL techniques are effective and efficient at automating fault localization.

Well-designed coefficients will be compatible with some interferences but not all of them.

Three variations of a technique are presented to identify coincidental correctness, a safety-reducing factor for CBFL.

34/37

Future Work

Explore more algorithms to identify coincidental correctness, e.g., cluster analysis and failure classification.

Evaluate whether different program elements can further reduce the rate of false positives, e.g., predicates, function calls, program paths.

Assess the impact of cleansing coincidental correctness on other fault localization approaches.

35/37

Q & A

36/37

Thank you! Contact me via [email protected]

37/37
