Lecture slide transcript (posted 22-Dec-2015)
1
Revisiting Difficult Constraints
if (hash(x) == hash(y)) {
...
}
How do we cover this code?
Suppose we’re running (DART, SAGE, SMART, CUTE, SPLAT, etc.) – we get here, but hash(x) != hash(y). Can we solve for hash(x) == hash(y)?

Concrete values won’t help us much – we still have to solve for hash(x) == C1 or for hash(y) == C2…
Any ideas?
2
Today
A brief “digression” on causality and philosophy (of science)
Fault localization & error explanation
• Renieris & Reiss: Nearest Neighbors
• Jones & Harrold: Tarantula
• How to evaluate a fault localization
• PDGs (+ BFS or ranking)
• Solving for a nearest run (not really testing)
3
Causality
When a test case fails we start debugging
We assume that the fault (what we’re really after) causes the failure
• Remember RIP (Reachability, Infection, Propagation)?

What do we mean when we say that “A causes B”?
4
Causality
We don’t know
Though it is central to everyday life – and to the aims of science
• A real understanding of causality eludes us to this day
• Still no non-controversial way to answer the question “does A cause B?”
5
Causality
Philosophy of causality is a fairly active area, going back to Aristotle and (in more modern approaches) Hume
• General agreement that a cause is something that “makes a difference” – if the cause had not been, then the effect wouldn’t have been
• One theory that is rather popular with computer scientists is David Lewis’ counterfactual approach
• Probably because it (like probabilistic or statistical approaches) is amenable to mathematical treatment and automation
6
Causality (According to Lewis)
For Lewis (roughly – I’m conflating his counterfactual dependency and causal dependency):
• A causes B (in world w) iff
• In all possible worlds that are maximally similar to w, and in which A does not take place, B also does not take place
7
Causality (According to Lewis)
Causality does not depend on B being impossible without A
• Seems reasonable: when asking “Was Larry slipping on the banana peel causally dependent on Curly dropping it?” we don’t consider worlds in which new circumstances (Moe dropping a banana peel) are introduced
8
Causality (According to Lewis)
Many objections to Lewis in the literature
• e.g., that a cause must precede its effect in time seems not to be required by his approach

One objection is not a problem for our purposes
• Distance metrics (how similar is world w to world w’?) are problematic for “worlds”
• Counterfactuals are tricky
• Not a problem for program executions
• There may be details to handle, but no one has in-principle objections to asking how similar two program executions are
• Or philosophical problems with multiple executions (no run is “privileged by actuality”)
9
Causality (According to Lewis)

[Diagram: a failing execution contains events A and B. Let d be the distance to the most similar execution with neither A nor B, and d’ the distance to the most similar execution without A but with B. Did A cause B in this program execution? Yes, if d < d’. No, if d > d’.]
10
Formally
A predicate e is causally dependent on a predicate c in an execution a iff:
1. c(a) ∧ e(a)
2. ∃b . (¬c(b) ∧ ¬e(b) ∧ (∀b’ . (¬c(b’) ∧ e(b’)) ⇒ (d(a, b) < d(a, b’))))
11
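The definition above can be made executable over a finite set of runs. The sketch below is my own toy illustration, not from the slides: executions are dicts of predicate values, the distance metric is a simple Hamming count, and the banana-peel runs are invented data.

```python
# Toy sketch of Lewis-style causal dependence over a finite set of
# executions; predicate names and runs are invented for illustration.

def distance(a, b):
    # Hamming distance: number of predicates on which two runs differ.
    return sum(1 for k in a if a[k] != b[k])

def causally_dependent(e, c, a, runs):
    # e is causally dependent on c in run a iff c(a) and e(a) hold, and
    # some run with neither c nor e is strictly closer to a than every
    # run without c but with e.
    if not (c(a) and e(a)):
        return False
    no_c_no_e = [b for b in runs if not c(b) and not e(b)]
    no_c_yes_e = [b for b in runs if not c(b) and e(b)]
    return any(all(distance(a, b) < distance(a, b2) for b2 in no_c_yes_e)
               for b in no_c_no_e)

runs = [
    {"peel_dropped": True,  "larry_slips": True,
     "moe_drops_peel": False, "rain": False},   # the actual run
    {"peel_dropped": False, "larry_slips": False,
     "moe_drops_peel": False, "rain": False},   # similar: no peel, no slip
    {"peel_dropped": False, "larry_slips": True,
     "moe_drops_peel": True,  "rain": True},    # slip without peel, but far away
]
actual = runs[0]
peel = lambda r: r["peel_dropped"]
slip = lambda r: r["larry_slips"]
print(causally_dependent(slip, peel, actual, runs))  # True
```

The nearest no-peel run also has no slip (distance 2), and the only slip-without-peel run is farther away (distance 3), so the peel causes the slip here.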
What does this have to do with automated debugging??
A fault is an incorrect part of a program
In a failing test case, some fault is reached and executes
• Causing the state of the program to be corrupted (error)
• This incorrect state is propagated through the program (propagation is a series of “A causes B”s)
• Finally, bad state is observable as a failure – caused by the fault
12
Fault Localization
Fault localization, then, is:
• An effort to automatically find (one of the) causes of an observable failure
• It is inherently difficult because there are many causes of the failure that are not the fault
• We don’t mind seeing the chain of cause and effect reaching back to the fault
• But the fact that we reached the fault at all is also a cause!
13
Enough!
Ok, let’s get back to testing and some methods for localizing faults from test cases
• But – keep in mind that when we localize a fault, we’re really trying to automate finding causal relationships
• The fault is a cause of the failure
14
Lewis and Fault Localization
Causality:
• Generally agreed that explanation is about causality. [Ball, Naik, Rajamani], [Zeller], [Groce, Visser], [Sosa, Tooley], [Lewis], etc.

Similarity:
• Also often assumed that successful executions that are similar to a failing run can help explain an error. [Zeller], [Renieris, Reiss], [Groce, Visser], etc.
• This work was not based on Lewis’ approach – it seems that this point about similarity is just an intuitive understanding most people (or at least computer scientists) share
15
Distance and Similarity
We already saw this idea at play in one version of Zeller’s delta-debugging
• Trying to find the one change needed to take a successful run and make it fail
• Most similar thread schedule that doesn’t cause a failure, etc.

Renieris and Reiss based a general fault localization technique on this idea – measuring distances between executions
• To localize a fault, compare the failing trace with its nearest neighbor according to some distance metric
16
Renieris and Reiss’ Localization
Basic idea (over-simplified)
• We have lots of test cases
• Some fail
• A much larger number pass
• Pick a failure
• Find the most similar successful test case
• Report differences as our fault localization

“Nearest neighbor”
17
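The basic idea above can be sketched in a few lines. This is a toy, with invented spectra (statement → execution count); the real technique involves the spectrum and metric choices discussed below.

```python
# Sketch of nearest-neighbor fault localization (Renieris & Reiss,
# greatly simplified) over statement-count spectra; data is invented.

def nearest_neighbor_localization(failing, passing):
    # failing: spectrum of the failing run; passing: spectra of passing
    # runs. Find the most similar passing run and report the statements
    # on which the two runs differ.
    def dist(s1, s2):
        stmts = set(s1) | set(s2)
        return sum(abs(s1.get(s, 0) - s2.get(s, 0)) for s in stmts)
    neighbor = min(passing, key=lambda p: dist(failing, p))
    return {s for s in set(failing) | set(neighbor)
            if failing.get(s, 0) != neighbor.get(s, 0)}

failing = {1: 1, 2: 1, 4: 1, 7: 1}   # hypothetical coverage counts
passing = [
    {1: 1, 2: 1, 4: 1},              # near miss: differs only on statement 7
    {1: 1, 3: 5, 6: 2},              # accidentally very different run
]
print(nearest_neighbor_localization(failing, passing))  # {7}
```

The near-miss passing run wins, so only statement 7 is reported; if the suite contained only the accidentally-different run, the report would be much noisier, which is exactly the test-suite-quality worry raised below.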
Renieris and Reiss’ Localization
Collect spectra of executions, rather than the full executions
• For example, just count the number of times each source statement executed
• Previous work on using spectra for localization basically amounted to set difference/union – for example, find features unique to (or lacking in) the failing run(s)
• Problem: many failing runs have no such features – many successful test cases have R (and maybe I) but not P!
• Otherwise, localization would be very easy
18
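The set-difference baseline criticized above is easy to state, and its failure mode is easy to demonstrate. A sketch with invented coverage sets: as soon as some passing test also covers the faulty statement (reaching the fault without failing), the difference is empty and the localization says nothing.

```python
# Sketch of the set-difference baseline: statements covered by the
# failing run but by no passing run. Coverage sets are invented.

def set_difference_localization(failing_cov, passing_covs):
    covered_by_some_pass = set().union(*passing_covs)
    return failing_cov - covered_by_some_pass

fail_cov = {1, 2, 4, 7}   # statement 7 is the hypothetical fault

# Works only when no passing run touches the fault:
print(set_difference_localization(fail_cov, [{1, 2, 4}]))     # {7}
# A passing run that also covers 7 leaves nothing to report:
print(set_difference_localization(fail_cov, [{1, 2, 4, 7}]))  # set()
```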
Renieris and Reiss’ Localization
Some obvious and not so obvious points to think about
• Technique makes intuitive sense
• But what if there are no successful runs that are very similar?
• Random testing might produce runs that all differ in various accidental ways
• Is this approach over-dependent on test suite quality?
19
Renieris and Reiss’ Localization
Some obvious and not so obvious points to think about
• What if we minimize the failing run using delta-debugging?
• Now there are lots of differences with the original successful runs just due to length!
• We could produce a very similar run by using delta-debugging to get a 1-change run that succeeds (there will actually be many of these)
• Can still use Renieris and Reiss’ approach – because delta-debugging works over the inputs, not the program behavior, spectra for these runs will be more or less similar to the failing test case
20
Renieris and Reiss’ Localization
Many details (see the paper):
• Choice of spectra
• Choice of distance metric
• How to handle equal spectra for failing/passing tests?
Basic idea is nonetheless straightforward
21
The Tarantula Approach
Jones, Harrold (and Stasko): Tarantula
Not based on distance metrics or a Lewis-like assumption
A “statistical” approach to fault localization
Originally conceived of as a visualization approach: produces a picture of all source in the program, colored according to how “suspicious” it is
• Green: not likely to be faulty
• Yellow: hrm, a little suspicious
• Red: very suspicious, likely fault
22
The Tarantula Approach
23
The Tarantula Approach
How do we score a statement in this approach? (where do all those colors come from?)
Again, assume we have a large set of tests, some passing, some failing
“Coverage entity” e (e.g., statement)
• failed(e) = # tests covering e that fail
• passed(e) = # tests covering e that pass
• totalfailed, totalpassed = what you’d expect
24
The Tarantula Approach
How do we score a statement in this approach? (where do all those colors come from?)
suspiciousness(e) = (failed(e) / totalfailed) / ((passed(e) / totalpassed) + (failed(e) / totalfailed))
25
The Tarantula Approach
Not very suspicious: appears in almost every passing test and almost every failing test
Highly suspicious: appears much more frequently in failing than passing tests
suspiciousness(e) = (failed(e) / totalfailed) / ((passed(e) / totalpassed) + (failed(e) / totalfailed))
26
The Tarantula Approach
suspiciousness(e) = (failed(e) / totalfailed) / ((passed(e) / totalpassed) + (failed(e) / totalfailed))
Simple program to compute the middle of three inputs, with a fault.
mid()
   int x, y, z, m;
1  read (x, y, z);
2  m = z;
3  if (y < z)
4    if (x < y)
5      m = y;
6    else if (x < z)
7      m = y;
8  else
9    if (x > y)
10     m = y;
11   else if (x > z)
12     m = x;
13 print (m);
27
The Tarantula Approach
dtotalfaileefailed
dtotalpasseepassed
dtotalfaileefailed
enesssuspicious)()(
)(
)(
mid()
   int x, y, z, m;
1  read (x, y, z);
2  m = z;
3  if (y < z)
4    if (x < y)
5      m = y;
6    else if (x < z)
7      m = y;
8  else
9    if (x > y)
10     m = y;
11   else if (x > z)
12     m = x;
13 print (m);
Run some tests…
(3,3,5) (1,2,3) (3,2,1) (5,5,5) (5,3,4) (2,1,3)
Look at whether they pass or fail
Look at coverage of entities
Compute suspiciousness using the formula
Suspiciousness per line: 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.63, 5: 0.0, 6: 0.71, 7: 0.83, 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.5
Fault is indeed most suspicious!
28
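The worked example above can be reproduced end to end. The sketch below is my own harness: it simulates the slide’s faulty mid() in Python (recording the slide’s line numbers as coverage), runs the six tests against a median oracle, and computes Tarantula suspiciousness; the faulty line 7 does come out on top.

```python
# Sketch: Tarantula on the slides' faulty mid(), simulated in Python.
# cov collects the slide's line numbers for each run.

def mid(x, y, z, cov):
    cov |= {1, 2, 3}               # read; m = z; if (y < z)
    m = z
    if y < z:
        cov.add(4)                 # if (x < y)
        if x < y:
            cov.add(5); m = y
        else:
            cov.add(6)             # else if (x < z)
            if x < z:
                cov.add(7); m = y  # the fault: should be m = x
    else:
        cov |= {8, 9}              # else; if (x > y)
        if x > y:
            cov.add(10); m = y
        else:
            cov.add(11)            # else if (x > z)
            if x > z:
                cov.add(12); m = x
    cov.add(13)                    # print (m)
    return m

tests = [(3, 3, 5), (1, 2, 3), (3, 2, 1), (5, 5, 5), (5, 3, 4), (2, 1, 3)]
runs = []
for t in tests:
    cov = set()
    ok = mid(*t, cov) == sorted(t)[1]   # oracle: the true median
    runs.append((cov, ok))

total_passed = sum(1 for _, ok in runs if ok)
total_failed = len(runs) - total_passed

def suspiciousness(line):
    failed = sum(1 for cov, ok in runs if not ok and line in cov)
    passed = sum(1 for cov, ok in runs if ok and line in cov)
    if failed == 0:
        return 0.0
    f, p = failed / total_failed, passed / total_passed
    return f / (f + p)

scores = {ln: round(suspiciousness(ln), 2) for ln in range(1, 14)}
print(scores[7], max(scores, key=scores.get))  # 0.83 7 — the fault wins
```

Only (2, 1, 3) fails; line 7 is covered by that failing run and by just one passing run, giving it the top score of 0.83, matching the slide.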
The Tarantula Approach
Obvious benefits:
• No problem if the fault is reached in some successful test cases
• Doesn’t depend on having any successful tests that are similar to the failing test(s)
• Provides a ranking of every statement, instead of just a set of nodes – directions on where to look next
• Numerical, even – how much more suspicious is X than Y?
• The pretty visualization may be quite helpful in seeing relationships between suspicious statements
• Is it less sensitive to accidental features of random tests, and to test suite quality in general?
• What about minimized failing tests here?
29
Tarantula vs. Nearest Neighbor
Which approach is better?
• Once upon a time:
• Fault localization papers gave a few anecdotes of their technique working well, showed it working better than another approach on some example, and called it a day
• We’d like something more quantitative (how much better is this technique than that one?) and much less subjective!
30
Evaluating Fault Localization Approaches
Fault localization tools produce reports
We can reduce a report to a set (or ranking) of program locations
Let’s say we have three localization tools which produce
• A big report that includes the fault
• A much smaller report, but the actual fault is not part of it
• Another small report, also not containing the fault

Which of these is the “best” fault localization?
31
Evaluating a Fault Localization Report
Idea (credit to Renieris and Reiss):
• Imagine an “ideal” debugger, the perfect programmer
• Starts reading the report
• Expands outwards from nodes (program locations) in the report to associated nodes, adding those at each step
• If a variable use is in the report, looks at the places it might be assigned
• If code is in the report, looks at the condition of any ifs guarding that code
• In general, follows program (causal) dependencies
• As soon as a fault is reached, recognizes it!
32
Evaluating a Fault Localization Report
Score the reports according to
• How much code the ideal debugger would read, starting from the report
• Empty report: score = 0
• Every line in the program: score = 0
• Big report, containing the bug? mediocre score (0.4)
• Small report, far from the bug? bad score (0.2)
• Small report, “near” the bug? good score (0.8)
• Report is the fault: great score (0.9)
33
Evaluating a Fault Localization Report
Breadth-first search of the Program Dependency Graph (PDG), starting from the fault localization report:
• Terminate the search when a real fault is found
• Score is the proportion of the PDG that is not explored during the breadth-first search
• Score near 1.00 = report includes only faults
34
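The BFS scoring just described is short enough to sketch directly. The 12-node PDG below is an assumed toy (a simple chain; real PDGs are denser), chosen to echo the 12-node examples on the following slides.

```python
# Sketch of the Renieris/Reiss scoring method. pdg is an undirected
# adjacency map; the 12-node chain is an invented example.

def score(pdg, report, faults):
    # Layered BFS outward from the report, stopping once any fault has
    # been explored; score = fraction of PDG nodes never explored.
    explored = set(report)
    while not explored & set(faults):
        layer = {n for e in explored for n in pdg[e]} - explored
        if not layer:            # fault unreachable from the report
            break
        explored |= layer
    return (len(pdg) - len(explored)) / len(pdg)

# Hypothetical PDG: nodes 0..11 in a chain.
pdg = {i: [j for j in (i - 1, i + 1) if 0 <= j < 12] for i in range(12)}

print(round(score(pdg, report={6}, faults={6}), 2))   # 0.92: report is the fault
print(round(score(pdg, report={4}, faults={6}), 2))   # 0.58: two BFS layers needed
print(score(pdg, report=set(range(12)), faults={6}))  # 0.0: whole program reported
```

Reporting the whole program scores 0, just like an empty report would after exhausting the search: the score rewards reports that let the ideal debugger stop early.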
Details of Evaluation Method: PDG
12 total nodes in PDG
35
Details of Evaluation Method: PDG
12 total nodes in PDG
Fault
Report
36
Details of Evaluation Method: PDG
12 total nodes in PDG
Fault
Report + 1 Layer BFS
37
Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report + 1 Layer BFS
STOP: Real fault discovered
38
Details of Evaluation Method: PDG

12 total nodes in PDG

8 of 12 nodes not covered by BFS: score = 8/12 ≈ 0.67

Fault

Report + 1 Layer BFS
STOP: Real fault discovered
39
Details of Evaluation Method: PDG
12 total nodes in PDG
Fault
Report
41
Details of Evaluation Method: PDG
12 total nodes in PDG
Fault
Report + 2 layers BFS
42
Details of Evaluation Method: PDG
12 total nodes in PDG
Fault
Report + 3 layers BFS
43
Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report + 4 layers BFS
STOP: Real fault discovered
44
Details of Evaluation Method: PDG

12 total nodes in PDG

0 of 12 nodes not covered by BFS: score = 0/12 ≈ 0.00

Fault

Report + 4 layers BFS
45
Details of Evaluation Method: PDG

Fault = Report

12 total nodes in PDG

11 of 12 nodes not covered by BFS: score = 11/12 ≈ 0.92
46
Evaluating a Fault Localization Report
Caveats:
• Isn’t a misleading report (a small number of nodes, far from the bug) actually much worse than an empty report?
• “I don’t know” vs.
• “Oh, yeah man, you left your keys in the living room somewhere” (when in fact your keys are in a field in Nebraska)
• Nobody really searches a PDG like that!
• Not backed up by user studies to show high scores correlate to users finding the fault quickly from the report
47
Evaluating a Fault Localization Report
Still, the Renieris/Reiss scoring has been widely adopted by the testing community and some model checking folks
• Best thing we’ve got, for now
48
Evaluating Fault Localization Approaches
So, how do the techniques stack up?
Tarantula seems to be the best of the test suite based techniques
• Next best is the Cause Transitions approach of Cleve and Zeller (see their paper), but it sometimes uses programmer knowledge
• Two different Nearest-Neighbor approaches are next best
• Set-intersection and set-union are worst

For details, see the Tarantula paper
49
Evaluating Fault Localization Approaches
Tarantula got scores at the 0.99-or-better level 3 times more often than the next best approach

The trend continued at every ranking – Tarantula was always the best approach

Also appeared to be efficient:
• Much faster than the Cause-Transitions approach of Cleve and Zeller
• Probably about the same as the Nearest Neighbor and set-union/intersection methods
50
Evaluating Fault Localization Approaches
Caveats:
• Evaluation is over the Siemens suite (again!)
• But Tarantula has done well on larger programs
• Tarantula and Nearest Neighbor might both benefit from larger test suites produced by random testing
• Siemens is not that many tests, done by hand
51
Another Way to Do It
Question:
• How good would the Nearest Neighbors method be if our test suite contained all possible executions (the universe of tests)?
• We suspect it would do much better, right?
• But of course, that’s ridiculous – we can’t check for distance to every possible successful test case!
• Unless our program can be model checked
• Leads us into next week’s topic, in a roundabout way: testing via model checking
52
Explanation with Distance Metrics
Algorithm (very high level):
1. Find a counterexample trace (model checking term for “failing test case”)
2. Encode the search for a maximally similar successful execution under a distance metric d as an optimization problem
3. Report the differences (Δs) as an explanation (and a localization) of the error
53
Implementation #1
CBMC, a Bounded Model Checker for ANSI-C programs:
• Input: C program + loop bounds
• Checks for various properties:
  • assert statements
  • Array bounds and pointer safety
  • Arithmetic overflow
• Verifies within given loop bounds
• Provides a counterexample if a property does not hold
• Now provides error explanation and fault localization
54
14: assert (a < 4);
5: b = 4
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 10
13: a = 10
4: a = 5
Given a counterexample,
55
14: assert (a < 4);
5: b = 4
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 10
13: a = 10
4: a = 5
produce a successful execution that is as similar as possible
(under a distance metric)
56
14: assert (a < 4);
5: b = 4
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 10
13: a = 10
4: a = 5
14: assert (a < 4);
5: b = -3
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 3
13: a = 3
4: a = 5
produce a successful execution that is as similar as possible
(under a distance metric)
57
14: assert (a < 4);
5: b = 4
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 10
13: a = 10
4: a = 5
14: assert (a < 4);
5: b = -3
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 3
13: a = 3
4: a = 5
and examine the necessary differences:
58
14: assert (a < 4);
5: b = 4
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 10
13: a = 10
4: a = 5
14: assert (a < 4);
5: b = -3
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 3
13: a = 3
4: a = 5
and examine the necessary differences:
Δs
59
14: assert (a < 4);
5: b = 4
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 10
13: a = 10
4: a = 5
14: assert (a < 4);
5: b = -3
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 3
13: a = 3
4: a = 5
and examine the necessary differences: these are the causes
60
14: assert (a < 4);
5: b = 4
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 10
13: a = 10
4: a = 5
14: assert (a < 4);
5: b = -3
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 3
13: a = 3
4: a = 5
and the localization – lines 5, 12, and 13 are likely bug locations.
61
Explanation with Distance Metrics
How it’s done:
Model checker
P+spec
First, the program (P) and specification (spec) are sent to the model checker.
62
Explanation with Distance Metrics
How it’s done:
Model checker
P+spec C
The model checker finds a counterexample, C.
63
Explanation with Distance Metrics
How it’s done:
Model checker
BMC/constraint generator
P+spec C
The explanation tool uses P, spec, and C to generate (via Bounded Model Checking) a formula whose solutions are executions of P that are not counterexamples.
64
Explanation with Distance Metrics
How it’s done:
Model checker
BMC/constraint generator
P+spec C
S
Constraints are added to this formula for an optimization problem: find a solution that is as similar to C as possible, by the distance metric d. The formula + optimization problem is S.
65
Explanation with Distance Metrics
How it’s done:
Model checker
BMC/constraint generator
P+spec C
Optimization tool
S -C
An optimization tool (PBS, the Pseudo-Boolean Solver) finds a solution to S: an execution of P that is not a counterexample, and is as similar as possible to C. Call this execution -C.
66
Explanation with Distance Metrics
How it’s done:
Model checker
BMC/constraint generator
P+spec C
Optimization tool
S -C
C
-C
Δs

Report the differences (Δs) between C and -C to the user: explanation and fault localization
67
Explanation with Distance Metrics
The metric d is based on Static Single Assignment (SSA) form (plus loop unrolling)
• A variation on SSA, to be precise

The CBMC model checker (bounded model checker for C programs) translates an ANSI C program into a set of equations
An execution of the program is just a solution to this set of equations
68
“SSA” Transformation
int main () {
int x, y;
int z = y;
if (x > 0)
y--;
else
y++;
z++;
assert (y == z);
}
int main () {
int x0, y0;
int z0 = y0;
y1 = y0 - 1;
y2 = y0 + 1;
guard1 = x0 > 0;
y3 = guard1?y1:y2;
z1 = z0 + 1;
assert (y3 == z1);
}
69
Transformation to Equations

int main () {
int x0, y0;
int z0 = y0;
y1 = y0 - 1;
y2 = y0 + 1;
guard1 = x0 > 0;
y3 = guard1?y1:y2;
z1 = z0 + 1;
assert (y3 == z1);
}
(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == x0 > 0 ∧
 y3 == guard1?y1:y2 ∧
 z1 == z0 + 1 ∧
 y3 == z1)
70
Transformation to Equations

int main () {
int x0, y0;
int z0 = y0;
y1 = y0 - 1;
y2 = y0 + 1;
guard1 = x0 > 0;
y3 = guard1?y1:y2;
z1 = z0 + 1;
assert (y3 == z1);
}
(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == x0 > 0 ∧
 y3 == guard1?y1:y2 ∧
 z1 == z0 + 1 ∧
 y3 == z1)
Uninitialized variables in CBMC are unconstrained inputs.
71
Transformation to Equations

int main () {
int x0, y0;
int z0 = y0;
y1 = y0 - 1;
y2 = y0 + 1;
guard1 = x0 > 0;
y3 = guard1?y1:y2;
z1 = z0 + 1;
assert (y3 == z1);
}
(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == x0 > 0 ∧
 y3 == guard1?y1:y2 ∧
 z1 == z0 + 1 ∧
 y3 == z1)
CBMC (1) negates the assertion
72
Transformation to Equations

int main () {
int x0, y0;
int z0 = y0;
y1 = y0 - 1;
y2 = y0 + 1;
guard1 = x0 > 0;
y3 = guard1?y1:y2;
z1 = z0 + 1;
assert (y3 == z1);
}
(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == x0 > 0 ∧
 y3 == guard1?y1:y2 ∧
 z1 == z0 + 1 ∧
 y3 != z1)
(assertion is now negated)
73
Transformation to Equations

int main () {
int x0, y0;
int z0 = y0;
y1 = y0 - 1;
y2 = y0 + 1;
guard1 = x0 > 0;
y3 = guard1?y1:y2;
z1 = z0 + 1;
assert (y3 == z1);
}
(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == x0 > 0 ∧
 y3 == guard1?y1:y2 ∧
 z1 == z0 + 1 ∧
 y3 != z1)
then (2) translates to SAT and uses a fast solver to find a counterexample
74
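CBMC hands the negated-assertion formula to a SAT solver; purely as an illustration, the same equations can be brute-forced over a small input range. The harness below is mine; the equations are the slide’s.

```python
# Sketch: the slide's SSA equations as Python bindings, with the
# assertion negated; a tiny brute-force search stands in for SAT.

def execution(x0, y0):
    # Each SSA equation becomes one binding; x0, y0 are the
    # unconstrained (uninitialized) inputs.
    z0 = y0
    y1 = y0 - 1
    y2 = y0 + 1
    guard1 = x0 > 0
    y3 = y1 if guard1 else y2
    z1 = z0 + 1
    return {"x0": x0, "y0": y0, "z0": z0, "y1": y1, "y2": y2,
            "guard1": guard1, "y3": y3, "z1": z1}

def find_counterexample(bound=5):
    for x0 in range(-bound, bound + 1):
        for y0 in range(-bound, bound + 1):
            ex = execution(x0, y0)
            if ex["y3"] != ex["z1"]:   # negated assertion satisfied
                return ex
    return None

cex = find_counterexample()
print(cex["x0"] > 0)  # True: the assertion only fails on the x > 0 branch
```

A counterexample here is just a satisfying assignment of the negated-assertion formula; any x0 > 0 yields one, since y3 = y0 - 1 can never equal z1 = y0 + 1.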
Execution Representation
(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == x0 > 0 ∧
 y3 == guard1?y1:y2 ∧
 z1 == z0 + 1 ∧
 y3 != z1)
Remove the assertion to get an equation for any execution of the program
(take care of loops by unrolling)
75
Execution Representation
(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == x0 > 0 ∧
 y3 == guard1?y1:y2 ∧
 z1 == z0 + 1 ∧
 y3 != z1)
Execution represented by assignments to all variables in the equations
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
76
Execution Representation
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
Execution represented by assignments to all variables in the equations
x0 == 0
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == false
y3 == 6
z1 == 6
Successful execution
77
The Distance Metric d
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
d = number of changes (Δs) between two executions
x0 == 0
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == false
y3 == 6
z1 == 6
Successful execution
78
The Distance Metric d
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
d = number of changes (Δs) between two executions
x0 == 0
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == false
y3 == 6
z1 == 6
Successful execution
79
The Distance Metric d
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
d = number of changes (Δs) between two executions
x0 == 0
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == false
y3 == 6
z1 == 6
Successful execution
80
The Distance Metric d
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
d = number of changes (Δs) between two executions
x0 == 0
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == false
y3 == 6
z1 == 6
Successful execution
d = 3
81
The Distance Metric d
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
3 is the minimum possible distance between the counterexample and a successful execution
x0 == 0
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == false
y3 == 6
z1 == 6
Successful execution
d = 3
82
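The metric on these slides is simple enough to state in a couple of lines; here it is applied directly to the two executions shown, using the slide’s values.

```python
# Sketch: d as the count of SSA assignments on which two executions
# differ, applied to the slide's counterexample and successful run.

def d(run1, run2):
    return sum(1 for v in run1 if run1[v] != run2[v])

counterexample = {"x0": 1, "y0": 5, "z0": 5, "y1": 4, "y2": 6,
                  "guard1": True, "y3": 4, "z1": 6}
successful = {"x0": 0, "y0": 5, "z0": 5, "y1": 4, "y2": 6,
              "guard1": False, "y3": 6, "z1": 6}

print(d(counterexample, successful))  # 3: x0, guard1, and y3 changed
```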
The Distance Metric d
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
To compute the metric, add a new SAT variable for each potential change (Δ)
Δx0 == (x0 != 1)
Δy0 == (y0 != 5)
Δz0 == (z0 != 5)
Δy1 == (y1 != 4)
Δy2 == (y2 != 6)
Δguard1 == !guard1
Δy3 == (y3 != 4)
Δz1 == (z1 != 6)
New SAT variables
83
The Distance Metric d
x0 == 1
y0 == 5
z0 == 5
y1 == 4
y2 == 6
guard1 == true
y3 == 4
z1 == 6
Counterexample
And minimize the sum of the variables (treated as 0/1 values): a pseudo-Boolean problem
Δx0 == (x0 != 1)
Δy0 == (y0 != 5)
Δz0 == (z0 != 5)
Δy1 == (y1 != 4)
Δy2 == (y2 != 6)
Δguard1 == !guard1
Δy3 == (y3 != 4)
Δz1 == (z1 != 6)
New SAT variables
84
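What the pseudo-Boolean step computes can be imitated at toy scale by enumeration: among all successful executions, pick the one with the fewest Δs relative to the counterexample. The small search ranges below are my own assumption (a real solver works on the SAT formula directly, with no such bound).

```python
# Sketch: a brute-force stand-in for the PBS optimization over the
# slide's program; search ranges are an assumption for illustration.

def execution(x0, y0):
    # The slide's SSA equations.
    z0 = y0; y1 = y0 - 1; y2 = y0 + 1
    guard1 = x0 > 0
    y3 = y1 if guard1 else y2
    z1 = z0 + 1
    return {"x0": x0, "y0": y0, "z0": z0, "y1": y1, "y2": y2,
            "guard1": guard1, "y3": y3, "z1": z1}

def deltas(run1, run2):
    # The variables whose Δ would be set to 1.
    return [v for v in run1 if run1[v] != run2[v]]

cex = execution(1, 5)               # the slide's counterexample
assert cex["y3"] != cex["z1"]       # sanity: assertion really fails

successes = [execution(x0, y0)
             for x0 in range(-3, 4) for y0 in range(2, 9)
             if execution(x0, y0)["y3"] == execution(x0, y0)["z1"]]
closest = min(successes, key=lambda ex: len(deltas(cex, ex)))

print(sorted(deltas(cex, closest)))  # ['guard1', 'x0', 'y3']: d = 3
```

The minimum is 3 changes (x0, guard1, y3), matching the slide: flipping the guard is unavoidable, and doing so forces y3 to change as well.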
The Distance Metric d
An SSA-form oddity:
• The distance metric can compare values from code that doesn’t run in either execution being compared
• This can be the determining factor in which of two traces is most similar to a counterexample
• Counterintuitive but not necessarily incorrect: simply extends comparison to all hypothetical control flow paths
85
Explanation with Distance Metrics
Algorithm (lower level):
1. Find a counterexample using Bounded Model Checking (SAT)
2. Create a new problem: SAT for a successful execution + constraints for minimizing distance to the counterexample (fewest changes)
3. Solve this optimization problem using a pseudo-Boolean solver (PBS) (= 0-1 ILP)
4. Report the differences (Δs) to the user as an explanation (and a localization) of the error
86
Explanation with Distance Metrics
Model checker
BMC/constraint generator
P+spec C
Optimization tool
S -C
C
-C
Δs
CBMC
explain
PBS
87
Explanation with Distance Metrics
Details are hidden behind a Graphical User Interface (GUI) that keeps SAT and distance metrics away from users

The GUI automatically highlights likely bug locations and presents changed values
Next slides: GUI in action + a teaser for experimental results
88
[GUI screenshot]
89

[GUI screenshot]
90
Explaining Abstract Counterexamples
91
Explaining Abstract Counterexamples

The first implementation presents differences as changes in concrete values, e.g.:
• “In the counterexample, x is 14. In the successful execution, x is 18.”
Which can miss the point:
• What really matters is whether x is less than y
• But y isn’t mentioned at all!
92
Explaining Abstract Counterexamples

If the counterexample and successful execution were abstract traces, we’d get variable relationships and generalization for “free”
Abstraction should also make the model checking more scalable
• This is why abstraction is traditionally used in model checking, in fact
93
Model Checking + Abstraction
In abstract model checking, the model checker explores an abstract state space
In predicate abstraction, states consist of predicates that are true in a state, rather than concrete values:
• Concrete: x = 12, y = 15, z = 0
• Abstract: x < y, z != 1
94
Model Checking + Abstraction
In abstract model checking, the model checker explores an abstract state space.
In predicate abstraction, states consist of predicates that are true in a state, rather than concrete values:
• Concrete: x = 12, y = 15, z = 0
• Abstract: x < y, z != 1
Potentially represents many concrete states
95
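The abstraction map is easy to sketch: fix a predicate set and send each concrete state to the truth values of those predicates. The second concrete state below is my own invented example of a different state landing in the same abstract state.

```python
# Sketch: predicate abstraction as a map from a concrete state to the
# truth values of a fixed predicate set (the slide's two predicates).

predicates = {
    "x < y":  lambda s: s["x"] < s["y"],
    "z != 1": lambda s: s["z"] != 1,
}

def abstract(state):
    return {name: p(state) for name, p in predicates.items()}

s1 = {"x": 12, "y": 15, "z": 0}   # the slide's concrete state
s2 = {"x": 3,  "y": 99, "z": 7}   # a different concrete state...
print(abstract(s1) == abstract(s2))  # True: same abstract state
```

One abstract state thus stands for many concrete states, which is exactly what makes abstract counterexamples both more general and in need of a reality check.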
Model Checking + Abstraction
Conservative predicate abstraction preserves all erroneous behaviors in the original system
Abstract “executions” now potentially represent a set of concrete executions
Must check an execution to see if it matches some real behavior of the program: abstraction adds behavior
96
Implementation #2
MAGIC, a Predicate Abstraction Based Model Checker for C programs:
• Input: C program
• Checks for various properties:
  • assert statements
  • Simulation of a specification machine
• Provides a counterexample if a property does not hold
• Counterexamples are abstract executions – that describe real behavior of the actual program
• Now provides error explanation and fault localization
97
Model Checking + Abstraction

Predicates & counterexample are produced by the usual Counterexample Guided Abstraction Refinement (CEGAR) framework.

Explanation will work as in the first case presented, except:
• The explanation will be in terms of control flow differences and
• Changes in predicate values.
98
MAGIC Overview
[CEGAR flowchart: P + Spec → Abstraction → Model → Verification. If verification succeeds: Spec Holds. Otherwise an Abstract Counterexample is produced and checked: Counterexample Real? If Yes: report the Real counterexample. If No: Spurious Counterexample → Abstraction Refinement → New Predicates → back to Abstraction.]
99
Model Checking + Abstraction
Explain an abstract counterexample that represents (at least one) real execution of the program

Explain with another abstract execution that:
• Is not a counterexample
• Is as similar as possible to the abstract counterexample
• Also represents real behavior
101
14: assert (a < 4);
5: b = 4
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 10
13: a = 10
4: a = 5
14: assert (a < 4);
5: b = -3
6: c = -4
7: a = 2
8: a = 1
9: a = 6
10: a = 4
11: c = 9
12: c = 3
13: a = 3
4: a = 5
Abstract rather than concrete traces: represent more than one execution
Automatic generalization
102
14: assert (a < 4);
5: b > 2
6: c < 7
7: a >= 4
8: a <= 4
9: a >= 4
10: a <= 4
11: c >= 7
12: c >= 7
13: a >= 4
4: a >= 4
14: assert (a < 4);
5: b <= 2
6: c < 7
7: a > 4
8: a <= 4
9: a > 4
10: a <= 4
11: c >= 9
12: c < 7
13: a < 3
4: a >= 4
Abstract rather than concrete traces: represent more than one execution
Automatic generalization
103
14: assert (a < 4);
5: b > 2
6: c < 7
7: a >= 4
8: a <= 4
9: a >= 4
10: a <= 4
11: c >= 7
12: c >= 7
13: a >= 4
4: a >= 4
14: assert (a < 4);
5: b <= 2
6: c < 7
7: a > 4
8: a <= 4
9: a > 4
10: a <= 4
11: c >= 9
12: c < 7
13: a < 3
4: a >= 4
Automatic generalization
c >= 7:
c = 7, c = 8,
c = 9, c = 10…
c < 7:
c = 6, c = 5,
c = 4, c = 3…
104
14: assert (a < 4);
5: b > 2
6: c < 7
7: a >= 4
8: a <= 4
9: a >= 4
10: a <= 4
11: c >= 7
12: c >= a
13: a >= 4
4: a >= 4
14: assert (a < 4);
5: b <= 2
6: c < 7
7: a > 4
8: a <= 4
9: a > 4
10: a <= 4
11: c >= 9
12: c < a
13: a < 3
4: a >= 4
Relationships between variables
c >= a:
c = 7 ∧ a = 7,
c = 9 ∧ a = 6…

c < a:
c = 7 ∧ a = 10,
c = 3 ∧ a = 4…
105
An Example
1  int main () {
2    int input1, input2, input3;
3    int least = input1;
4    int most = input1;
5    if (most < input2)
6      most = input2;
7    if (most < input3)
8      most = input3;
9    if (least > input2)
10     most = input2;
11   if (least > input3)
12     least = input3;
13   assert (least <= most);
14 }
106
An Example
1  int main () {
2    int input1, input2, input3;
3    int least = input1;
4    int most = input1;
5    if (most < input2)
6      most = input2;
7    if (most < input3)
8      most = input3;
9    if (least > input2)
10     most = input2;
11   if (least > input3)
12     least = input3;
13   assert (least <= most);
14 }
107
An Example
1  int main () {
2    int input1, input2, input3;
3    int least = input1;
4    int most = input1;
5    if (most < input2)
6      most = input2;
7    if (most < input3)
8      most = input3;
9    if (least > input2)
10     most = input2;
11   if (least > input3)
12     least = input3;
13   assert (least <= most);
14 }
108
An Example
Value changed (line 2): input3#0 from 2147483615 to 0
Value changed (line 12): least#2 from 2147483615 to 0
Value changed (line 13): least#3 from 2147483615 to 0
109
An Example
Not very obvious what this means…

Value changed (line 2): input3#0 from 2147483615 to 0
Value changed (line 12): least#2 from 2147483615 to 0
Value changed (line 13): least#3 from 2147483615 to 0
110
An Example
Control location deleted (step #5): 10: most = input2
Predicate changed (step #5): was: most < least  now: least <= most
Predicate changed (step #5): was: most < input3  now: input3 <= most
------------------------
Predicate changed (step #6): was: most < least  now: least <= most
Action changed (step #6): was: assertion_failure
111
An Example
Control location deleted (step #5): 10: most = input2
Predicate changed (step #5): was: most < least  now: least <= most
Predicate changed (step #5): was: most < input3  now: input3 <= most
------------------------
Predicate changed (step #6): was: most < least  now: least <= most
Action changed (step #6): was: assertion_failure
Here, on the other hand:
112
An Example
Control location deleted (step #5): 10: most = input2
Predicate changed (step #5): was: most < least  now: least <= most
Predicate changed (step #5): was: most < input3  now: input3 <= most
------------------------
Predicate changed (step #6): was: most < least  now: least <= most
Action changed (step #6): was: assertion_failure
Here, on the other hand:
Line with error indicated
Avoid error by not executing line 10
113
An Example
Control location deleted (step #5): 10: most = input2
Predicate changed (step #5): was: most < least  now: least <= most
Predicate changed (step #5): was: most < input3  now: input3 <= most
------------------------
Predicate changed (step #6): was: most < least  now: least <= most
Action changed (step #6): was: assertion_failure
Predicates show howchange in control flowaffects relationship of
the variables
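The output above is the explanation's Δ: the SSA-level values that differ between the failing run and the closest passing run. The distance idea behind it can be sketched as a simple count of differing SSA assignments — an illustration of the metric, not the CBMC/explain implementation; the variable valuations below are hypothetical:

```python
# Sketch: distance between two concrete executions in SSA form,
# counted as the number of SSA variables whose values differ.
# The valuations are made up for the min/max example above.

def ssa_distance(failing, passing):
    """Number of SSA assignments that must change to turn
    the failing run into the passing one, plus the deltas."""
    deltas = {v: (failing[v], passing[v])
              for v in failing
              if v in passing and failing[v] != passing[v]}
    return len(deltas), deltas

failing_run = {"input1#0": 0, "input2#0": -1,
               "input3#0": 2147483615, "least#2": 2147483615,
               "least#3": 2147483615}
passing_run = {"input1#0": 0, "input2#0": -1,
               "input3#0": 0, "least#2": 0, "least#3": 0}

d, deltas = ssa_distance(failing_run, passing_run)
for var, (was, now) in sorted(deltas.items()):
    print(f"Value changed: {var} from {was} to {now}")
print("distance =", d)  # 3 changed values, as in the slide's output
```

Minimizing this count over all passing runs (via the pseudo-Boolean optimizer) is what yields the short Δ shown on the slide.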
114
Explaining Abstract Counterexamples

Implemented in the MAGIC predicate abstraction-based model checker

MAGIC represents executions as paths of states, not in SSA form

New distance metric resembles traditional metrics from string or sequence comparison:
• Insert, delete, replace operations
• State = PC + predicate values
115
Explaining Abstract Counterexamples

Same underlying method as for concrete explanation
Revise the distance metric to account for the new representation of program executions
[Tool-flow diagram, as in the concrete case: P + spec goes to the model checker (MAGIC), producing counterexample C; a BMC/constraint generator and an optimization tool (still PBS) produce the explanation (MAGIC/explain)]
116
CBMC vs. MAGIC Representations

CBMC: SSA Assignments

input1#0 == 0
input2#0 == -1
input3#0 == 0
least#0 == 0
most#0 == 0
guard0 == true
guard1 == false
least#1 == 0
…

MAGIC: States & actions

[Diagram: a path of states s0 → s1 → s2 → s3 joined by actions 0, 1, 2]
117
CBMC vs. MAGIC Representations

[Diagram: same as previous slide, with one MAGIC state expanded to show its contents]

Control location: Line 5
Predicates:
input1 > input2
least == input1
...
118
A New Distance Metric

[Diagram: two executions, one with states s0–s3 joined by actions 0–2, the other with states s'0–s'4 joined by actions 0–3]

Must determine which states to compare: maybe different number of states in the two executions

Make use of literature on string/sequence comparison & metrics
119
Alignment

[Diagram: the two executions, with control locations 1, 5, 7, 9 for s0–s3 and 1, 3, 7, 8, 11 for s'0–s'4]

1. Only compare states with matching control locations
120
Alignment

[Diagram repeated: candidate matches drawn between states with the same control location]
121
Alignment

[Diagram repeated: further matches between states with the same control location]
122
Alignment

[Diagram: the two executions with candidate alignments]

2. Must be unique
123
Alignment

[Diagram repeated: the alignment built so far]
124
Alignment

[Diagram: the two executions with candidate alignments]

3. Don't cross over other alignments
125
Alignment

[Diagram repeated: the completed, non-crossing alignment]
126
A New Distance Metric

[Diagram: the two aligned executions]

In sum: much like the traditional metrics used to compare strings, except the alphabet is over control locations, predicates, and actions
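The metric can be sketched as an ordinary edit distance over state sequences, where substitution (alignment) is only permitted between states at the same control location and costs the number of predicates on which the two states disagree. This is an illustrative reconstruction, not MAGIC's actual PBS encoding; the predicate vectors below are invented:

```python
# Sketch: edit distance between two abstract executions.
# A state is (control_location, predicate_values). States may only be
# aligned when their control locations match; aligned states cost the
# number of predicates on which they disagree. Unaligned states are
# inserted/deleted at cost 1.

def state_cost(s, t):
    loc_s, preds_s = s
    loc_t, preds_t = t
    if loc_s != loc_t:
        return None  # alignment not allowed between different locations
    # assumes both states track the same predicates, in order
    return sum(1 for p, q in zip(preds_s, preds_t) if p != q)

def trace_distance(a, b):
    n, m = len(a), len(b)
    # dp[i][j] = distance between prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(dp[i - 1][j] + 1,   # delete a[i-1]
                       dp[i][j - 1] + 1)   # insert b[j-1]
            sub = state_cost(a[i - 1], b[j - 1])
            if sub is not None:            # align states at same location
                best = min(best, dp[i - 1][j - 1] + sub)
            dp[i][j] = best
    return dp[n][m]

# Control locations from the slides (1,5,7,9 vs 1,3,7,8,11);
# the two boolean predicates per state are made up for illustration.
run1 = [(1, (True, False)), (5, (True, False)),
        (7, (False, False)), (9, (False, True))]
run2 = [(1, (True, False)), (3, (True, True)), (7, (False, True)),
        (8, (False, True)), (11, (False, True))]
print(trace_distance(run1, run2))
```

Note how the DP enforces constraints 2 and 3 from the slides for free: a standard edit-distance alignment is automatically one-to-one and non-crossing.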
127
A New Distance Metric

[Diagram: the two aligned executions]

Encoded using BMC and pseudo-Boolean optimization as in the first case, with variables for alignment and control, predicate, and action differences
128
Explaining Abstract Counterexamples

One execution                  (Potentially) many executions
Changes in values              Changes in predicates
Always real execution          May be spurious
                               - May need to iterate/refine
Execution as SSA values        Execution as path & states
- Counterintuitive metric      - Intuitive metric
- No alignment problem         - Must consider alignments:
                                 which states to compare?
BMC to produce PBS problem     BMC to produce PBS problem
(CBMC)                         (MAGIC)
129
Results
130
Results: Overview
Produces good explanations for numerous interesting case studies:
• μC/OS-II RTOS microkernel (3K lines)
• OpenSSL code (3K lines)
• Fragments of Linux kernel
• TCAS Resolution Advisory component
• Some smaller, “toy” linear temporal logic property examples

μC/OS-II, SSL, some TCAS bugs precisely isolated: report = fault
131
Results: Quantitative Evaluation
Very good scores by the Renieris & Reiss method for evaluating fault localization:
• Measures how much source code the user can avoid reading thanks to the localization; 1 is a perfect score

For the SSL and μC/OS-II case studies, scores of 0.999

Other examples (almost) all in range 0.720–0.993
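The scoring idea can be sketched as follows: expand breadth-first from the reported nodes over a dependence graph until a faulty node is covered, and score by the fraction of the program the user never had to examine. The graph below is a toy stand-in for a real program dependence graph, not the original tooling:

```python
from collections import deque

# Sketch of the Renieris & Reiss score: breadth-first expansion from
# the report over an (undirected, illustrative) dependence graph until
# a faulty node is reached.  score = 1 - |examined| / |all nodes|.

def rr_score(pdg, report, faulty):
    """pdg: {node: [neighbors]}; report, faulty: sets of nodes."""
    examined = set(report)
    frontier = deque(report)
    while not (examined & faulty) and frontier:
        # expand exactly one BFS layer before re-checking for the fault
        for _ in range(len(frontier)):
            for nxt in pdg[frontier.popleft()]:
                if nxt not in examined:
                    examined.add(nxt)
                    frontier.append(nxt)
    return 1 - len(examined) / len(pdg)

# Toy example: a 10-node chain; report is 2 hops from the fault,
# so the user examines 5 of 10 nodes before reaching it.
pdg = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
print(rr_score(pdg, report={3}, faulty={5}))  # → 0.5
```

A report that already contains the fault scores nearly 1; a report forcing the user to expand across the whole graph scores near 0, which is why 0 is treated as "useless" in the comparisons that follow.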
132
Results: Comparison
Scores were generally much better than Nearest Neighbor – when it could be applied at all
• Much more consistent
• Testing-based methods of Renieris and Reiss occasionally worked better
• Also gave useless (score 0) explanations much of the time
Scores a great improvement over the counterexample traces alone
133
Results: Comparison
Scores and times for various localization methods
Best score for each program highlighted
* alternative scoring method for large programs
Program     Explain        JPF             n-c     n-s     CBMC
            score   time   score   time    score   score   score
TCAS 1      0.91    4      0.87    1521    0.00    0.58    0.41
TCAS 11     0.93    7      0.93    5673    0.13    0.13    0.51
TCAS 31     0.93    7      -       -       0.00    0.00    0.46
TCAS 40     0.88    6      0.87    30482   0.83    0.77    0.35
TCAS 41     0.88    5      0.30    34      0.58    0.92    0.38
uCOS-ii     0.99    62     -       -       -       -       0.97
uCOS-ii*    0.81    62     -       -       -       -       0.00
134
Results: MAGIC
No program required iteration to find a non-spurious explanation: good abstraction already discovered

Program                score   time   CE length
mutex-n-01.c (lock)    0.79    0.04   6
mutex-n-01.c (unlock)  0.99    0.04   6
pci-n-01.c             0.78    0.07   9
pci-rec-n-01.c         0.72    0.09   8
SSL-1                  0.99    8.07   29
SSL-2                  0.99    3.45   52
uCOS-ii                0.00    0.76   19
135
Results: Time
Time to explain comparable to model checking time
• No more than 10 seconds for abstract explanation (except when it didn't find one at all…)
• No more than 3 minutes for concrete explanations
136
Results: Room for Improvement
Concrete explanation worked better than abstract in some cases
• When SSA-based metric produced smaller optimization constraints

For TCAS examples, user assistance was needed in some cases
• Assertion of form (A implies B)
• First explanation “explains” by showing how A can fail to hold
• Easy to get a good explanation – force model checker to assume A
137
Conclusions: Good News
Counterexample explanation and fault localization can provide good assistance in locating errors
The model checking approach, when it can be applied (usually not to large programs or those with complex data structures), may be the most effective
But Tarantula is the real winner, unless model checking starts scaling better
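Tarantula's ranking (Jones & Harrold) is simple enough to sketch: a statement is suspicious to the degree that failing tests, rather than passing ones, execute it. A minimal version with invented coverage data:

```python
# Sketch of Tarantula suspiciousness (Jones & Harrold):
#   susp(s) = (failed(s)/totalfailed) /
#             (failed(s)/totalfailed + passed(s)/totalpassed)
# The coverage data below is invented for illustration.

def tarantula(coverage, results):
    """coverage: {test: set of statements}; results: {test: 'pass'|'fail'}."""
    total_pass = sum(1 for r in results.values() if r == "pass")
    total_fail = sum(1 for r in results.values() if r == "fail")
    stmts = set().union(*coverage.values())
    susp = {}
    for s in stmts:
        p = sum(1 for t, cov in coverage.items()
                if s in cov and results[t] == "pass") / total_pass
        f = sum(1 for t, cov in coverage.items()
                if s in cov and results[t] == "fail") / total_fail
        susp[s] = f / (p + f) if p + f > 0 else 0.0
    return susp

coverage = {"t1": {1, 2, 3}, "t2": {1, 2, 4}, "t3": {1, 3, 4}}
results = {"t1": "pass", "t2": "pass", "t3": "fail"}
scores = tarantula(coverage, results)
# Statements 3 and 4 are executed by the failing test; statement 2
# only by passing tests, so it ranks least suspicious.
print(scores)
```

No model checking, no solver: just coverage counts — which is exactly why it scales where the model-checking approaches above do not.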
138
Future Work?
The ultimate goal: testing tool or model checker fixes our programs for us – automatic program repair!
That’s not going to happen, I think
But we can try (and people are doing just that, right now)
139
Model Checking and Scaling
Next week we’ll look at a kind of “model checking” that doesn’t involve building SAT equations or producing an abstraction
• We’ll run the program and backtrack execution
• Really just an oddball form of testing
• Can’t do “stupid SAT-solver tricks” like using PBS to produce great fault localizations, but has some other benefits