Cleansing test suites from coincidental correctness to enhance fault localization
Uploaded by elfinhe
Cleansing Coincidental Correctness to Enhance Fault Localization
Software Engineering Laboratory, Department of Computer Science, Sun Yat-Sen University
The 2nd Joint Winter Workshop on Software Engineering, December 2010
Sun Yat-Sen University, Guangzhou, China
1/37
Outline
Coverage-Based Fault Localization: Introduction, Methodology, Evaluation, Discussion
Cleansing Coincidental Correctness: Methodology, Evaluation
Conclusion and Future Work
2/37
Introduction
Software debugging is an arduous task [1] that requires time, effort, and a good understanding of the source code.
Three steps to debug [2]: fault detection, fault localization, and fault correction.
We focus on automatic fault localization.
[1] I. Vessey. Expertise in debugging computer programs: A process analysis. International Journal of Man-Machine Studies, 23(5):459–494, November 1985.
[2] D. Wieland. Model-Based Debugging of Java Programs Using Dependencies. PhD thesis, Technische Universität Wien, 2001.
3/37
Input of Fault Localization
Two inputs: the source code and the test cases.
//Find the maximum among a, b and c
int max (int a, int b, int c){
1 int temp = a;
2 if (b > temp ){
3 temp = b+1; //bug
4 }
5 if (c > temp ){
6 temp = c;
7 }
8 return temp;
}
Test cases:
Input (a, b, c)    Oracle
3, 2, 1            3
2, 1, 3            3
1, 2, 3            3
1, 2, 4            4
1, 2, 3            3
1, 3, 2            3
4/37
Output of Fault Localization
A suspiciousness value for each statement, based on its likelihood of containing a fault. Statements with higher suspiciousness should be examined before statements with lower suspiciousness.
Suspiciousness results for the Jaccard coefficient (S3 is the most suspicious):
Statement         S1    S2    S3    S4    S5    S6    S7    S8
Suspiciousness    0.33  0.33  0.5   0.33  0.33  0.25  0.33  0.33
5/37
Coverage-Based Fault Localization (CBFL)
CBFL is based on which executable statements each test hits (coverage).
Input of CBFL: the coverage of each test and its execution result r (p = passed, f = failed).
a, b, c    S1  S2  S3  S4  S5  S6  S7  S8  r
3, 2, 1    1   1   0   1   1   0   1   1   p
2, 1, 3    1   1   0   1   1   1   1   1   p
1, 2, 3    1   1   1   1   1   0   1   1   p
1, 2, 4    1   1   1   1   1   1   1   1   p
1, 2, 3    1   1   1   1   1   1   1   1   f
1, 3, 2    1   1   1   1   1   0   1   1   f
6/37
Input of CBFL
a, b, c    S3  S6  Others  r
3, 2, 1    0   0   1       p
2, 1, 3    0   1   1       p
1, 2, 3    1   0   1       p
1, 2, 4    1   1   1       p
1, 2, 3    1   1   1       f
1, 3, 2    1   0   1       f
For brevity, the columns other than S3 and S6 are merged into Others.
7/37
Methodology
Intuitively, for each statement S there are four factors that contribute to its suspiciousness. An example:
a, b, c    S3  S6  Others  r
3, 2, 1    0   0   1       p
2, 1, 3    0   1   1       p
1, 2, 3    1   0   1       p
1, 2, 4    1   1   1       p
1, 2, 3    1   1   1       f
1, 3, 2    1   0   1       f
a00(S)     2   2   0
a10(S)     2   2   4
a01(S)     0   1   0
a11(S)     2   1   2
How the suspiciousness SJ(S) moves as each factor grows:
Cue        SJ(S)  Definition
↑a00(S)    ↑      |Not cover S, Passed tests|
↑a10(S)    ↓      |Cover S, Passed tests|
↑a01(S)    ↓      |Not cover S, Failed tests|
↑a11(S)    ↑      |Cover S, Failed tests|
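The four counts can be derived mechanically from one column of the coverage matrix. The following is a minimal sketch in C, assuming a 0/1 encoding for both coverage and failure; the names `Spectrum`, `count_spectrum` and the three arrays are illustrative and do not come from the slides.

```c
#include <stddef.h>

/* Spectrum counters for one statement, named as on the slide. */
typedef struct {
    int a00; /* not covered, passed */
    int a10; /* covered, passed */
    int a01; /* not covered, failed */
    int a11; /* covered, failed */
} Spectrum;

/* covered[i] is 1 if test i executed the statement, else 0;
   failed[i] is 1 if test i failed, else 0 (assumed encoding). */
Spectrum count_spectrum(const int *covered, const int *failed, size_t n_tests)
{
    Spectrum s = {0, 0, 0, 0};
    for (size_t i = 0; i < n_tests; i++) {
        if (covered[i] && failed[i])       s.a11++;
        else if (covered[i] && !failed[i]) s.a10++;
        else if (!covered[i] && failed[i]) s.a01++;
        else                               s.a00++;
    }
    return s;
}

/* Columns of the example matrix, in the slide's row order. */
static const int S3_COVERED[6] = {0, 0, 1, 1, 1, 1};
static const int S6_COVERED[6] = {0, 1, 0, 1, 1, 0};
static const int FAILED[6]     = {0, 0, 0, 0, 1, 1};
```

Calling `count_spectrum(S3_COVERED, FAILED, 6)` gives a00 = 2, a10 = 2, a01 = 0, a11 = 2, which matches the S3 column of the table.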
8/37
Jaccard [3]
[3] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. A. Brewer. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of 2002 International Conference on Dependable Systems and Networks (DSN 2002), pages 595–604, Bethesda, MD, USA, 23-26 June 2002. IEEE Computer Society.
a, b, c    S3   S6    Others  r
3, 2, 1    0    0     1       p
2, 1, 3    0    1     1       p
1, 2, 3    1    0     1       p
1, 2, 4    1    1     1       p
1, 2, 3    1    1     1       f
1, 3, 2    1    0     1       f
SJ(s)      0.5  0.25  0.33
$$S_J(s) = \frac{a_{11}(s)}{a_{11}(s) + a_{01}(s) + a_{10}(s)} = \frac{\mathit{failed}(s)}{\mathit{totalfailed} + \mathit{passed}(s)}$$
A measure of the similarity of asymmetric binary attributes.
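The Jaccard coefficient translates directly into code. This is a small sketch in C; the function name `jaccard` and the choice to return 0.0 for an empty denominator are assumptions made for illustration.

```c
/* Jaccard suspiciousness: a11 / (a11 + a01 + a10).
   Returns 0.0 when the denominator is zero (an assumed convention,
   not something the slides specify). */
double jaccard(int a11, int a01, int a10)
{
    int denom = a11 + a01 + a10;
    return denom == 0 ? 0.0 : (double)a11 / denom;
}
```

With the example counts, `jaccard(2, 0, 2)` is 0.5 for S3 and `jaccard(1, 1, 2)` is 0.25 for S6, in line with the table.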
9/37
Tarantula [4]
[4] J. A. Jones and M. J. Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In D. F. Redmiles, T. Ellman, and A. Zisman, editors, 20th IEEE/ACM International Conference on Automated Software Engineering (ASE 2005), pages 273–282, Long Beach, CA, USA, November 7-11 2005. ACM.
a, b, c    S3    S6   Others  r
3, 2, 1    0     0    1       p
2, 1, 3    0     1    1       p
1, 2, 3    1     0    1       p
1, 2, 4    1     1    1       p
1, 2, 3    1     1    1       f
1, 3, 2    1     0    1       f
ST(s)      0.66  0.5  0.5
$$S_T(s) = \frac{\dfrac{a_{11}(s)}{a_{11}(s) + a_{01}(s)}}{\dfrac{a_{11}(s)}{a_{11}(s) + a_{01}(s)} + \dfrac{a_{10}(s)}{a_{10}(s) + a_{00}(s)}} = \frac{\dfrac{\mathit{failed}(s)}{\mathit{totalfailed}}}{\dfrac{\mathit{failed}(s)}{\mathit{totalfailed}} + \dfrac{\mathit{passed}(s)}{\mathit{totalpassed}}}$$
Used in the Tarantula fault localization tool.
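The Tarantula coefficient can be sketched in C in the same way. The function name `tarantula` and the treatment of empty denominators (each ratio falls back to 0.0) are assumptions for illustration only.

```c
/* Tarantula suspiciousness:
   (failed(s)/totalfailed) / (failed(s)/totalfailed + passed(s)/totalpassed).
   A ratio with a zero denominator is treated as 0.0 (assumed convention). */
double tarantula(int a11, int a01, int a10, int a00)
{
    int total_failed = a11 + a01;
    int total_passed = a10 + a00;
    double fail_ratio = total_failed ? (double)a11 / total_failed : 0.0;
    double pass_ratio = total_passed ? (double)a10 / total_passed : 0.0;
    double denom = fail_ratio + pass_ratio;
    return denom == 0.0 ? 0.0 : fail_ratio / denom;
}
```

For S3 in the example, `tarantula(2, 0, 2, 2)` evaluates 1 / (1 + 0.5), roughly 0.67, which the slide shows truncated to 0.66.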
10/37
Ochiai [5]
[5] R. Abreu, P. Zoeteweij, and A. J. van Gemund. On the accuracy of spectrum-based fault localization. In P. McMinn, editor, Proceedings of the Testing: Academia and Industry Conference - Practice And Research Techniques (TAIC PART’07), pages 89–98, Windsor, United Kingdom, September 2007. IEEE Computer Society.
a, b, c    S3   S6    Others  r
3, 2, 1    0    0     1       p
2, 1, 3    0    1     1       p
1, 2, 3    1    0     1       p
1, 2, 4    1    1     1       p
1, 2, 3    1    1     1       f
1, 3, 2    1    0     1       f
SO(s)      0.7  0.41  0.57
$$S_O(s) = \frac{a_{11}(s)}{\sqrt{(a_{11}(s) + a_{01}(s))\,(a_{11}(s) + a_{10}(s))}} = \frac{\mathit{failed}(s)}{\sqrt{\mathit{totalfailed} \times (\mathit{failed}(s) + \mathit{passed}(s))}}$$
Originally used in the molecular biology domain to measure genetic similarity.
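As a worked check, the Ochiai values in the table can be recomputed from the counts of the running example (a11, a01, a10 are 2, 0, 2 for S3; 1, 1, 2 for S6; 2, 0, 4 for Others). The results agree with the slide up to rounding.

```latex
\begin{align*}
S_O(S_3) &= \frac{2}{\sqrt{(2+0)(2+2)}} = \frac{2}{\sqrt{8}} \approx 0.71 \\
S_O(S_6) &= \frac{1}{\sqrt{(1+1)(1+2)}} = \frac{1}{\sqrt{6}} \approx 0.41 \\
S_O(\text{Others}) &= \frac{2}{\sqrt{(2+0)(2+4)}} = \frac{2}{\sqrt{12}} \approx 0.58
\end{align*}
```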
11/37
Evaluation
Assign a score to every faulty version of each subject program.
Score [6]: the percentage of the program that need not be examined before the first bug-containing statement is reached.
Assumption: perfect bug detection, i.e., programmers can always correctly classify faulty code as faulty, and non-faulty code as non-faulty.
[6] J. A. Jones and M. J. Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In D. F. Redmiles, T. Ellman, and A. Zisman, editors, 20th IEEE/ACM International Conference on Automated Software Engineering (ASE 2005), pages 273–282, Long Beach, CA, USA, November 7-11 2005. ACM.
12/37
Evaluation (cont’) - An example
Sorted suspiciousness results:
Statement         S6   S3   S1    S2    S4    S5    S7    S8
Suspiciousness    0.7  0.5  0.33  0.33  0.33  0.33  0.33  0.33
Step 1: examine S6. Not the bug.
13/37
Evaluation (cont’) - An example
Step 1: examine S6. Not the bug.
Step 2: examine S3. Find it!
14/37
Evaluation (cont’) - An example
2 statements have been examined, out of 8 statements in the program in total.
The score of this program is 1 - (2 ÷ 8) = 0.75, i.e., the percentage of statements that need not be examined.
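The score computation in this example can be sketched in C. The helper names are illustrative, and the tie-breaking rule (statements tied with the faulty one are not counted as examined) is an assumption that the slides do not address.

```c
#include <stddef.h>

/* Number of statements examined up to and including the faulty one,
   when statements are inspected in descending order of suspiciousness.
   Ties are resolved in the faulty statement's favour (an assumption). */
size_t statements_examined(const double *susp, size_t n, size_t faulty)
{
    size_t examined = 1; /* the faulty statement itself */
    for (size_t i = 0; i < n; i++) {
        if (susp[i] > susp[faulty]) {
            examined++;
        }
    }
    return examined;
}

/* Score = 1 - examined / total: the fraction of the program that
   need not be examined. */
double score(size_t examined, size_t total)
{
    return 1.0 - (double)examined / (double)total;
}
```

With the sorted values from the example (S6 = 0.7, S3 = 0.5, all others 0.33) and S3 as the faulty statement, two statements are examined and the score is 0.75.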
15/37
Evaluation (cont’)
Assign a score to every faulty version in the Siemens suite.
The effectiveness of existing techniques has been limited…
16/37
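Discussion
The original coefficients:
$$S_J = \frac{a_{11}}{a_{11} + a_{01} + a_{10}}$$
$$S_T = \frac{\dfrac{a_{11}}{a_{11} + a_{01}}}{\dfrac{a_{11}}{a_{11} + a_{01}} + \dfrac{a_{10}}{a_{10} + a_{00}}}$$
$$S_O = \frac{a_{11}}{\sqrt{(a_{11} + a_{01})(a_{11} + a_{10})}}$$
Rewrite the coefficients as below [7]. For brevity, let
$$C_T = \frac{a_{11} + a_{01}}{a_{10} + a_{00}}, \qquad C_J = a_{11} + a_{01}$$
Tarantula: divide numerator and denominator by $a_{11}/(a_{11} + a_{01})$.
$$S_T = \frac{1}{1 + C_T\,\dfrac{a_{10}}{a_{11}}}$$
Jaccard: replace $a_{11} + a_{01}$ by $C_J$.
$$S_J = \frac{a_{11}}{C_J + a_{10}}$$
Ochiai: square, and divide numerator and denominator by $a_{11}$.
$$C_J\,S_O^2 = \frac{a_{11}}{1 + \dfrac{a_{10}}{a_{11}}}$$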
[7] R. Abreu, P. Zoeteweij, R. Golsteijn, and A. J. C. van Gemund. A practical evaluation of spectrum-based fault localization. Journal of Systems and Software, 82(11):1780–1792, 2009.
Both CT = (a11+a01)/(a10+a00) and CJ = a11+a01 are constant for all statements, so they do not influence the suspiciousness ranking.
The rankings from the three coefficients therefore depend only on a11 and a10.
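As a sanity check on the rewritten forms, the counts for S3 from the running example (a11 = 2, a10 = 2, a01 = 0, a00 = 2) can be substituted. The results agree with the values obtained from the original formulas, up to rounding.

```latex
\begin{align*}
C_T &= \frac{a_{11}+a_{01}}{a_{10}+a_{00}} = \frac{2+0}{2+2} = 0.5, &
C_J &= a_{11}+a_{01} = 2 \\
S_T &= \frac{1}{1 + C_T\,\frac{a_{10}}{a_{11}}}
     = \frac{1}{1 + 0.5 \cdot \frac{2}{2}} \approx 0.67 \\
S_J &= \frac{a_{11}}{C_J + a_{10}} = \frac{2}{2+2} = 0.5 \\
C_J\,S_O^2 &= \frac{a_{11}}{1 + \frac{a_{10}}{a_{11}}} = \frac{2}{1+1} = 1
  \;\Rightarrow\; S_O = \sqrt{1/2} \approx 0.71
\end{align*}
```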
17/37
The impact of a11 and a10
The suspiciousness calculated by the coefficients has a positive correlation with a11 and a negative correlation with a10.
Assume that:
whenever the fault is executed, the execution fails (to increase a11);
whenever the fault is not executed, the execution passes (to increase a10);
the test suite is adequate.
Then the faulty statement will always rank at the top.
Why are the techniques ineffective in practice? Are there any interferences?
18/37
Interferences
Factors that impair CBFL (interferences):
Coincidental Correctness [8]: the fault is executed, but the execution does not fail.
Multiple Faults: the fault is not executed, but the execution fails.
Coverage Equivalence: the coverage of two statements is always the same.
[8] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi. An empirical study of the factors that reduce the effectiveness of coverage-based fault localization. In B. Liblit, N. Nagappan, and T. Zimmermann, editors, Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction with ISSTA 2009, pages 1–5, Chicago, Illinois, July 19-19 2009. ACM.
19/37
Coincidental Correctness
Condition           a, b, c    S3  S6  Others  r
a < b, b + 1 = c    1, 2, 3    1   0   1       p
a < b, b + 1 < c    1, 2, 4    1   1   1       p
//Find the maximum among a, b and c
int max (int a, int b, int c){
1 int temp = a;
2 if (b > temp ){
3 temp = b+1; //bug
4 }
5 if (c > temp ){
6 temp = c;
7 }
8 return temp;
}
Not all conditions for failure are met. The RIP (reachability-infection-propagation) model [9]:
Condition 1: the fault is executed.
Condition 2: the program has transitioned into an infectious state.
Condition 3: the infection has propagated to the output.
[9] P. Ammann and J. Offutt. Introduction to Software Testing. Cambridge University Press, 2008.
20/37
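Coincidental correctness on the running example can be reproduced with a short sketch in C. It reuses the faulty `max` from the slides with an added out-parameter that records whether statement 3 ran; `buggy_max`, `correct_max` and `is_coincidentally_correct` are hypothetical names, and `correct_max` stands in for the test oracle.

```c
/* The faulty max() from the slides. *hit_s3 is set to 1 when
   statement 3 (the fault) is executed, else 0. */
int buggy_max(int a, int b, int c, int *hit_s3)
{
    int temp = a;
    *hit_s3 = 0;
    if (b > temp) {
        temp = b + 1; /* bug */
        *hit_s3 = 1;
    }
    if (c > temp) {
        temp = c;
    }
    return temp;
}

/* Reference implementation, used here as the oracle. */
int correct_max(int a, int b, int c)
{
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    return m;
}

/* A run is coincidentally correct when the fault is executed
   but the output still matches the oracle. */
int is_coincidentally_correct(int a, int b, int c)
{
    int hit;
    int out = buggy_max(a, b, c, &hit);
    return hit && out == correct_max(a, b, c);
}
```

Inputs (1, 2, 3) and (1, 2, 4) both execute the fault and still return the right maximum, so they are coincidentally correct, whereas (1, 3, 2) executes the fault and returns 4 instead of 3.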
Multiple Faults
//Find the maximum among a, b and c
int max (int a, int b, int c){
1 int temp = a;
2 if (b > temp ){
3 temp = b+1; //bug
4 }
5 if (c > temp ){
6 temp = c+1; //bug
7 }
8 return temp;
}
Condition           a, b, c    S3  S6  r
a < b, b + 1 ≥ c    1, 2, 4    1   0   f
a ≥ b, a < c        3, 2, 4    0   1   f
The fault is not executed, but the execution fails (because another fault is executed).
21/37
Coverage Equivalence
//Find the maximum among a, b and c
int max (int a, int b, int c){
1 int temp = a+1; //bug
2 if (b > temp ){
3 temp = b;
4 }
5 if (c > temp ){
6 temp = c;
7 }
8 return temp;
}
Condition         a, b, c    S1  S8  r
a < b or a < c    1, 2, 3    1   1   p
otherwise         7, 2, 4    1   1   f
The coverage of the two statements is always the same, due to either the inadequacy of the test suite or an inherent property of the program.
22/37
Empirical Study
Coincidental Correctness (72.1%) [8]
Strong Coincidental Correctness (15.7%): meets Conditions 1 and 2 of the RIP (reachability-infection-propagation) model.
Weak Coincidental Correctness (56.4%): meets only Condition 1 of the RIP (reachability-infection-propagation) model.
A safety-reducing factor: it causes the faulty statement to have a lower score than others.
[8] W. Masri, R. Abou-Assi, M. El-Ghali, and N. Al-Fatairi. An empirical study of the factors that reduce the effectiveness of coverage-based fault localization. In B. Liblit, N. Nagappan, and T. Zimmermann, editors, Proceedings of the 2nd International Workshop on Defects in Large Software Systems: Held in conjunction with ISSTA 2009, pages 1–5, Chicago, Illinois, July 19-19 2009. ACM.
23/37
Cleansing Coincidental Correctness [10]
Input: a test suite and its coverage matrix.
Output: the subset of passing tests that are likely to be coincidentally correct.
Assumption: a good candidate for a cce is a program element that occurs in all failing runs and in a non-zero but not excessively large percentage of passing runs.
[10] W. Masri and R. Abou Assi. Cleansing test suites from coincidental correctness to enhance fault-localization. In 2010 Third International Conference on Software Testing, Verification and Validation, pages 165–174, 2010. IEEE.
24/37
Technique - I
Notation:
T: a test suite
TF: failing tests
TP: passing tests
TCC: coincidentally correct tests
We estimate:
CCE: the set of program elements that are likely to be correlated with coincidentally correct tests
cce: an element in CCE
cct: a test that induces a cce
CCT: the estimate of TCC
Assumption: fT(cce) = 1.0 and 0 < pT(cce) ≤ θ, where fT(cce) is the percentage of TF executing cce, pT(cce) is the percentage of TP executing cce, and θ < 1.0.
Populate CCE with program elements that are totally correlated with failures.
25/37
Technique - I (cont’)
Populate CCT with tests that execute one or more cce’s.
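The two population steps of Technique I can be sketched in C as follows. This is an illustration under assumptions, not the authors' implementation: the function name `estimate_cct`, the row-major 0/1 matrix layout, and the caller-chosen value of θ are all hypothetical.

```c
#include <stddef.h>

/* cov is a row-major n_tests x n_elems coverage matrix of 0/1 entries;
   failed[t] is 1 for a failing test. On return, is_cce[e] is 1 for each
   element with fT(e) = 1.0 and 0 < pT(e) <= theta, and is_cct[t] is 1
   for each passing test that executes at least one such element. */
void estimate_cct(const int *cov, const int *failed,
                  size_t n_tests, size_t n_elems, double theta,
                  int *is_cce, int *is_cct)
{
    /* Step 1: populate CCE. */
    for (size_t e = 0; e < n_elems; e++) {
        size_t n_fail = 0, n_pass = 0, fail_hits = 0, pass_hits = 0;
        for (size_t t = 0; t < n_tests; t++) {
            int hit = cov[t * n_elems + e] != 0;
            if (failed[t]) { n_fail++; if (hit) fail_hits++; }
            else           { n_pass++; if (hit) pass_hits++; }
        }
        double p = n_pass ? (double)pass_hits / (double)n_pass : 0.0;
        is_cce[e] = n_fail > 0 && fail_hits == n_fail
                    && p > 0.0 && p <= theta;
    }
    /* Step 2: populate CCT with passing tests that execute a cce. */
    for (size_t t = 0; t < n_tests; t++) {
        is_cct[t] = 0;
        if (failed[t]) continue;
        for (size_t e = 0; e < n_elems; e++) {
            if (is_cce[e] && cov[t * n_elems + e]) {
                is_cct[t] = 1;
                break;
            }
        }
    }
}
```

On the six-test example with columns S3, S6, Others and an assumed θ = 0.5, only S3 qualifies as a cce (fT = 1.0, pT = 0.5), and the two passing tests that execute S3 are flagged as cct's.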
26/37
Technique - I - An example
a, b, c    S3 (cce)  S6  Others  r
1, 2, 3    1         0   1       p
1, 2, 4    1         1   1       p
3, 2, 1    0         0   1       p
2, 1, 3    0         1   1       p
1, 2, 3    1         1   1       f
1, 3, 2    1         0   1       f
27/37
Technique - I - An example (cont’)
a, b, c    S3 (cce)  S6  Others  r  Note
1, 2, 3    1         0   1       p  cct
1, 2, 4    1         1   1       p  cct
3, 2, 1    0         0   1       p
2, 1, 3    0         1   1       p
1, 2, 3    1         1   1       f
1, 3, 2    1         0   1       f
Find them! Coincidental correctness.
28/37
Technique - II
A cct with a high weight is more likely to be a coincidentally correct test.
Weight of a cct (correlates with suspiciousness) = (average weight of the covered cce’s) + (percent of cce’s covered)
The lower-ranked cct’s are discarded.
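The weight formula can be sketched in C as below. The slides do not say how the weight of an individual cce is obtained, only that it correlates with suspiciousness, so `cce_weight` is a caller-supplied input here; the function name `cct_weight` and the 0.0 result for a test covering no cce are likewise assumptions.

```c
#include <stddef.h>

/* Weight of one candidate test:
   (average weight of the cce's it covers)
   + (fraction of all cce's it covers).
   cce_weight[i] is the caller-supplied weight of the i-th cce;
   covers[i] is 1 if the test executes that cce, else 0.
   Returns 0.0 if the test covers no cce (assumed convention). */
double cct_weight(const double *cce_weight, const int *covers, size_t n_cce)
{
    double sum = 0.0;
    size_t covered = 0;
    for (size_t i = 0; i < n_cce; i++) {
        if (covers[i]) {
            sum += cce_weight[i];
            covered++;
        }
    }
    if (covered == 0) {
        return 0.0;
    }
    return sum / (double)covered + (double)covered / (double)n_cce;
}
```

Candidate tests would then be sorted by this weight and the lower-ranked ones dropped; the cut-off is not given on the slide and is left to the caller.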
29/37
Technique - III
Partition the cct’s into two clusters based on the similarity of the suspicious cce’s they execute.
Assumptions: typically, some cce’s are relevant to the fault and others are not.
The coincidentally correct tests exercise these fault-relevant cce’s, whereas the correct tests do not.
30/37
Evaluation
Metrics: false negatives, false positives, safety change, precision change, coverage reduction.
31/37
Evaluation (cont’)
32/37
Evaluation (cont’)
Comparative results summaries
33/37
Conclusion
Without interferences, CBFL techniques are effective and efficient at automating fault localization.
Well-designed coefficients can tolerate some interferences, but not all of them.
Three variations of a technique are presented to identify coincidental correctness, a safety-reducing factor for CBFL.
34/37
Future Work
Apply more algorithms to identify coincidental correctness, e.g. cluster analysis and failure classification.
Evaluate whether different program elements, e.g. predicates, function calls, and program paths, can further reduce the rate of false positives.
Assess the impact of cleansing coincidental correctness on other fault localization approaches.
35/37
Q & A
36/37
Thank you! Contact me via [email protected]
37/37