TMPA-2017: 5W+1H Static Analysis Report Quality Measure

5W+1H static analysis report quality measure

Maxim Menshchikov, Timur Lepikhin

March 3, 2017

Saint Petersburg State University, OKTET Labs

Authors

Maxim Menshchikov

Student, Saint Petersburg State University.

Software Engineer at OKTET Labs.

Timur Lepikhin

Candidate of Sciences, Associate Professor,

Saint Petersburg State University.

Static analysis quality evaluation

How is quality usually evaluated?

1. Precision.

$$\mathrm{PPV} = \frac{TP}{TP + FP}$$

2. Recall.

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

3. F1 (F-measure).

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

4. False-Positive Rate.

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

5. Accuracy.

$$\mathrm{ACC} = \frac{TP + TN}{P + N}$$

6. ...
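
The measures listed above are plain ratios over confusion-matrix counts. A minimal sketch in C follows; the struct confusion type and the function names are assumptions made for illustration and do not come from the slides. P and N are taken to be TP + FN and FP + TN, the usual convention.

```c
#include <assert.h>

/* Confusion-matrix counts for one analyzer run (hypothetical helper type). */
struct confusion {
    double tp; /* true positives  */
    double fp; /* false positives */
    double tn; /* true negatives  */
    double fn; /* false negatives */
};

/* Precision: TP / (TP + FP). */
double ppv(struct confusion c)
{
    assert(c.tp + c.fp > 0.0);
    return c.tp / (c.tp + c.fp);
}

/* Recall: TP / (TP + FN). */
double tpr(struct confusion c)
{
    assert(c.tp + c.fn > 0.0);
    return c.tp / (c.tp + c.fn);
}

/* F1: 2TP / (2TP + FP + FN). */
double f1(struct confusion c)
{
    assert(2.0 * c.tp + c.fp + c.fn > 0.0);
    return 2.0 * c.tp / (2.0 * c.tp + c.fp + c.fn);
}

/* False-positive rate: FP / (FP + TN). */
double fpr(struct confusion c)
{
    assert(c.fp + c.tn > 0.0);
    return c.fp / (c.fp + c.tn);
}

/* Accuracy: (TP + TN) / (P + N), with P = TP + FN and N = FP + TN. */
double acc(struct confusion c)
{
    double total = c.tp + c.fn + c.fp + c.tn;
    assert(total > 0.0);
    return (c.tp + c.tn) / total;
}
```

For instance, with TP = 111 and FP = 3, ppv gives 111/114, roughly 0.974. None of these functions looks at what a report actually says, only at whether it was counted.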

What’s missing in these measures?

Missing pieces

• Informational quality of messages

How good and informative is the message?

• Generalization of reports

Reports about errors can be either positive or negative:

“Error in line x”.

“No error in line x”.

• Error class identification [1]

Reports can relate to the same problem or point of interest in the code. Such reports should be combined.

• Utility support

Not every tested utility supports every kind of report.

[1] Not always missing :)

The input

Consider the following code sample:

#include <stdio.h>
int main()
{
    int input;
    if (scanf("%d", &input) == 1)
    {
        if (input == 2)
        {
            int *a;
            int *n = a;
            a = n;
            *n = 5;
        }
        else
        {
            printf("OK\n");
        }
    }
    return 0;
}

The output

Clang 3.9

main.cpp:10:13: warning: Assigned value is garbage or undefined: int *n = a;
main.cpp:5:5: note: Taking true branch: if (scanf("%d", &input) == 1)
main.cpp:7:13: note: Assuming 'input' is equal to 2: if (input == 2)
main.cpp:7:9: note: Taking true branch: if (input == 2)
main.cpp:9:13: note: 'a' declared without an initial value: int *a;
main.cpp:10:13: note: Assigned value is garbage or undefined: int *n = a;
main.cpp:11:13: warning: Value stored to 'a' is never read: a = n;
main.cpp:11:13: note: Value stored to 'a' is never read: a = n;

cppcheck 1.76

[main.cpp:12]: (style) Variable 'a' is assigned a value that is never used.
[main.cpp:10]: (error) Uninitialized variable: a

The difference

1. Clang shows which conditions must be met to encounter the bug.

2. Clang shows the text of the source code line, while cppcheck shows only the file and line number.

Both reports would count as “correct” under all of the previous measures, so they would be considered equal in their contribution to the result.

5W+1H

The “5Ws” are actively used in journalism and natural language processing.

Sometimes they are referred to as “5W+1H”, where “H” denotes “How?”.

• What?

• When?

• Where?

• Who?

• Why?

• How?

5W+1H

We suggest rephrasing the sixth question as “How to fix?”

• What?

The error and its consequences: what will happen if the error occurs.

• When?

Conditions when it happens.

• Where?

Source code line number, module name.

• Who?

Who wrote this line?

• Why?

A more or less formal reason why the code was treated as an error.

• How to fix?

The ways to fix the problem.

How it applies to the previous code sample

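| Question | Clang                              | Cppcheck                  |
|----------|------------------------------------|---------------------------|
| What?    | Assigned value is garbage          | Uninitialized variable: a |
| Who?     | -                                  | -                         |
| Where?   | lines 5-10                         | line 10                   |
| When?    | scanf(...) == 1, input == 2        | -                         |
| Why?     | 'a' declared without initial value | -                         |
| How?     | -                                  | -                         |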

5W+1H

• It is hard to prove its completeness. (Do you have any counter-example?)

• Some way to evaluate reports is still needed.

• You can always choose the most suitable question to associate report information with.

Generalization of reports

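| Factual error: presence | Report: correctness | Report: result kind | Report: usefulness |
|-------------------------|---------------------|---------------------|--------------------|
| No                      | Indeterminate [2]   | Indeterminate       | Yes                |
| No                      | Correct             | Positive            | No [3]             |
| No                      | Correct             | Negative            | Yes                |
| No                      | Incorrect           | Positive            | No                 |
| No                      | Incorrect           | Negative            | No                 |
| Yes                     | Indeterminate       | Indeterminate       | No                 |
| Yes                     | Correct             | Positive            | Yes                |
| Yes                     | Correct             | Negative            | Yes                |
| Yes                     | Incorrect           | Positive            | No                 |
| Yes                     | Incorrect           | Negative            | No                 |

[2] Or rather missing

[3] Something strange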

Report classes

A report class is an infinite set of reports that are equal from the end user’s point of view. Let’s group reports by their answers to the following questions:

• Why?

• What?

• Where?

Maths: propagate report classes

Consider a surjective function $f$ that maps each report $r$ from the set $R$ to its class in the set of unique classes $R'$.

$$f : R \to R', \quad r \in R$$

We’ll use $R$ as an alias for $R'$ later on.
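
One way to picture the mapping from reports to classes is a routine that assigns the same class index to reports whose Why, What and Where answers coincide. The sketch below is illustrative only: the struct report layout and the classify function are hypothetical, and it assumes the three answers were already normalized to comparable strings, since different tools word the same finding differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One report reduced to the three grouping answers (hypothetical layout). */
struct report {
    const char *why;
    const char *what;
    const char *where;
};

/* Two reports fall into the same class when all three answers match. */
static int same_class(const struct report *a, const struct report *b)
{
    return strcmp(a->why, b->why) == 0 &&
           strcmp(a->what, b->what) == 0 &&
           strcmp(a->where, b->where) == 0;
}

/*
 * Plays the role of f: stores a class index for every report in class_of
 * and returns the number of distinct classes. Quadratic, which is
 * acceptable for a sketch.
 */
size_t classify(const struct report *r, size_t n, size_t *class_of)
{
    size_t classes = 0;

    assert(n == 0 || (r != NULL && class_of != NULL));
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < i; j++) {
            if (same_class(&r[i], &r[j])) {
                class_of[i] = class_of[j]; /* reuse the earlier class */
                break;
            }
        }
        if (j == i)
            class_of[i] = classes++;       /* first report of a new class */
    }
    return classes;
}
```

Every class index is hit by at least one report, which is what makes the mapping surjective.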

Maths: introduce weights

Consider the set of questions:

$$\{\text{What}, \text{When}, \text{Where}, \text{Who}, \text{Why}, \text{HowToFix}\}$$

Let $W$ be the set of answer weights for questions 1-6, respectively.

$$W = \{w_1, w_2, \ldots, w_6\}$$

Then the following mapping can be applied [4].

$$W = \{0.2, 0.15, 0.1, 0.05, 0.2, 0.3\}$$

[4] Make your own mapping that satisfies the needs of your test.

Maths: introduce weights, pt.2

Let $I$ be the informational quality of the message and $A = \{a_1, a_2, \ldots, a_6\}$ be the set of answer qualities, where $a_i \in [0, 1]$, $i = 1..6$.

$$I = \sum_{i=1}^{6} w_i \cdot a_i \tag{1}$$

Let $I_{\max}$ be a measure of the maximal informational quality among $m$ utilities.

$$I_{\max} = \sum_{i=1}^{6} w_i \cdot \max_{j} a_{ij}, \quad j \in 1..m \tag{2}$$
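
Equations (1) and (2) can be sketched in C as follows. The function names are hypothetical, the weights are the example mapping from the previous slide, and the array is indexed as a[utility][question], the reverse of the subscript order in the formula.

```c
#include <assert.h>
#include <stddef.h>

#define QUESTIONS 6 /* What, When, Where, Who, Why, HowToFix */

/* Example weights from the slides; substitute your own mapping. */
static const double WEIGHTS[QUESTIONS] = {0.2, 0.15, 0.1, 0.05, 0.2, 0.3};

/* Equation (1): I = sum over i of w_i * a_i, each a_i in [0, 1]. */
double info_quality(const double a[QUESTIONS])
{
    double sum = 0.0;
    for (size_t i = 0; i < QUESTIONS; i++) {
        assert(a[i] >= 0.0 && a[i] <= 1.0);
        sum += WEIGHTS[i] * a[i];
    }
    return sum;
}

/*
 * Equation (2): for every question take the best answer quality among
 * the m utilities, then weight and sum. a[j][i] is the answer quality of
 * utility j for question i.
 */
double info_quality_max(const double a[][QUESTIONS], size_t m)
{
    double sum = 0.0;
    for (size_t i = 0; i < QUESTIONS; i++) {
        double best = 0.0;
        for (size_t j = 0; j < m; j++) {
            if (a[j][i] > best)
                best = a[j][i];
        }
        sum += WEIGHTS[i] * best;
    }
    return sum;
}
```

As a worked case, assume an answered question scores 1 and an unanswered one scores 0. The Clang report for the earlier code sample answers What, When, Where and Why, so I = 0.2 + 0.15 + 0.1 + 0.2 = 0.65. The Cppcheck report answers only What and Where, so I = 0.2 + 0.1 = 0.3. Over these two utilities, I_max is 0.65.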

Maths: introduce weights, pt.3

With $I_{\max}$ defined for each report, we can easily find its sum over all $n$ reports.

$$S_{R'} = \sum_{i=1}^{n} I_{\max,i} \tag{3}$$

Maths: introduce weights, pt.4

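Let $m \in \mathbb{N}$ be the number of tested static analyzers. Utility support for the $i$-th report can be abstractly represented as:

$$u_{ij} \in U_i, \quad u_{ij} \in \{0, 1\}, \quad j = 1..m, \quad i = 1..n \tag{4}$$

where $u_{ij}$ is a boolean value indicating whether the $j$-th utility supports the error type underlying the $i$-th report.

With that, we can find the sum of all reports for the $j$-th utility, taking utility support into account.

$$S_j = \sum_{i=1}^{n} I_{ij} \cdot \prod_{k=1}^{m} u_{ik} \tag{5}$$

The product index is written as $k$ here to keep it distinct from the utility index $j$. As written, a report contributes to $S_j$ only when every tested utility supports its error type.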

Maths: “IQ” measure

We can calculate the informational quality measure for the $j$-th utility.

$$S_j^{\mathrm{norm}} = \frac{S_j}{S_{R'}} \tag{6}$$

We call this measure IQ (Informational Quality).

TPI includes only true positives and FPI includes only false positives, in both cases with the informational value taken into account.
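
The sums in equations (3) and (5) and the ratio in equation (6) can be sketched in C as below. Function names and the flat row-major layout of the support matrix are assumptions for illustration; the code follows the formulas as they appear on the slides, so the support product filters the utility's sum but not the normalizing sum.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Equation (5): sum of I_ij over reports i, counting a report only when
 * all m utilities support its error type.
 *   info[i]            I_ij for the utility j being scored
 *   support[i * m + k] u_ik, 1 if utility k supports report i, else 0
 */
double utility_sum(const double *info, const int *support, size_t n, size_t m)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        int supported_by_all = 1;
        for (size_t k = 0; k < m; k++)
            supported_by_all = supported_by_all && support[i * m + k];
        if (supported_by_all)
            sum += info[i];
    }
    return sum;
}

/* Equation (3): sum of the per-report maxima I_max over all n reports. */
double max_sum(const double *info_max, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += info_max[i];
    return sum;
}

/* Equation (6): IQ of a utility is its sum divided by the maximal sum. */
double iq(double s_j, double s_max)
{
    assert(s_max > 0.0);
    return s_j / s_max;
}
```

With three reports scored 0.65, 0.3 and 0.5 for one utility, the third unsupported by the other utility, and per-report maxima of 0.65, 0.5 and 0.5, the utility's sum is 0.95, the maximal sum is 1.65, and the resulting IQ is about 0.576.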

What? Should I measure it manually?

No.

• You can write your own parsers, as we did.

• Many reports look alike. You can evaluate them once and apply the score to all of them.

• (It would have been easier if there were some kind of standardized output...)
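
To give a feel for how small such a parser can be, here is a sketch that reads one line in the cppcheck 1.76 format shown earlier. It is not the authors' parser: the struct, the function name and the buffer sizes are hypothetical, and it assumes file names contain no colon.

```c
#include <assert.h>
#include <stdio.h>

/* Fields recovered from one cppcheck-style line (hypothetical layout). */
struct cppcheck_report {
    char file[256];
    int  line;
    char severity[32];
    char message[256];
};

/*
 * Parses a line shaped like
 *   [main.cpp:10]: (error) Uninitialized variable: a
 * Returns 1 when all four fields were read, 0 otherwise.
 */
int parse_cppcheck_line(const char *text, struct cppcheck_report *out)
{
    assert(text != NULL && out != NULL);
    return sscanf(text, "[%255[^:]:%d]: (%31[^)]) %255[^\n]",
                  out->file, &out->line, out->severity, out->message) == 4;
}
```

The file and line fields answer “Where?” and the message answers “What?”. A line in this format carries nothing for the remaining questions, so such a report can be scored once and the score reused for every report of the same shape.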

Real world testing

We tested the measure on the Toyota ITC benchmarks [5].

• Clang 3.9, cppcheck 1.76, Frama-C Silicon, PVS-Studio (Linux) and ReSharper were tested.

• The original benchmark was forked, its errors were patched, and limited Win32 support was added.

• We wrote many quick, five-minute parsers capable of reading the output we got. They cannot be applied to all outputs.

• pthread tests were excluded from the comparison because not all utilities support them.

• We checked generic report informativeness.

• All measures were calculated and analyzed.

• The hypothesis: the measure is different from the Precision, Recall and F1 scores.

[5] https://github.com/mmenshchikov/itc-benchmarks

Test methodology

• Prepared the Toyota ITC benchmarks [6].

• Coded parsers for all tested utilities [7].

• Prepared scripts to do the comparison [8] and verify the results, except for the parts that cannot be automated.

• The scripts check only lines that carry special comments from Toyota.

• Reports were semi-automatically checked for correctness.

• Report quality was evaluated manually, applying the same score to similar reports (this takes very little time).

• The hypothesis was evaluated using a t-test.

[6] https://github.com/mmenshchikov/itc-benchmarks

[7] https://github.com/mmenshchikov/sa_parsers

[8] https://github.com/mmenshchikov/sa_comparison_003

Results: Informativeness

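| Question    | Clang  | cppcheck | Frama-C | PVS    | RS [9] |
|-------------|--------|----------|---------|--------|--------|
| What?       | 100%   | 100%     | 100%    | 100%   | 100%   |
| When?       | 97.41% | 0%       | 100%    | 0%     | 0%     |
| Where?      | 100%   | 100%     | 100%    | 100%   | 100%   |
| Who?        | 0%     | 0%       | 0%      | 0%     | 0%     |
| Why?        | 35.78% | 0%       | 99.77%  | 48.46% | 0%     |
| How to fix? | 0%     | 0%       | 0%      | 17.15% | 38.27% |

[9] ReSharper C++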

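Results: IQ

| Utility  | IQ    | TPI   | TP  | FPI  | FP | PPV [10] | TPR [11] | F1    |
|----------|-------|-------|-----|------|----|----------|----------|-------|
| Clang    | 0.52  | 57.75 | 111 | 1.55 | 3  | 0.974    | 0.183    | 0.308 |
| Cppcheck | 0.3   | 30    | 100 | 0.6  | 2  | 0.98     | 0.165    | 0.282 |
| Frama-C  | 0.649 | 196.1 | 302 | 57.2 | 88 | 0.774    | 0.498    | 0.606 |
| PVS      | 0.459 | 53.67 | 117 | 4.32 | 12 | 0.907    | 0.193    | 0.318 |
| RS [12]  | -     | -     | -   | -    | -  | -        | -        | -     |

[10] Precision

[11] Recall

[12] ReSharper was excluded as it found “other” defects, although we considered it general-purpose from the beginning.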

Results: dependency

In this test we found a dependency between Precision (PPV) and IQ.

Possible reasons, each with a way to address it:

• Utilities provide similar reports (so the measures for those reports are similar): test more utilities.

• Emitted messages are only error-related, with no messages about error absence: include tools that also report bug absence [13].

This result is not generally representative.

We evaluated the informational values ourselves, which decreases the reliability of the results.

[13] Many developers ignored our requests for academic versions.

What’s next

You can use this information to improve your utilities:

• Add answers to some of the questions (“Who?”, “When?”).

• Explain decisions more formally (“Why?”).

• Suggest fixes, if possible (“How to fix?”).

How to improve the measure:

• Prepare better-explained weights.

How to improve the test:

• Better rules, less automation.

• Richer selection of tools.

Questions?

Verbosity

• Good verbosity

More information on the analyzer’s decision.

You can still filter out unneeded information.

• Bad verbosity

Many messages about the same error.

A lot of “rubbish” messages that scatter the user’s attention.

Who?

It asks who wrote a bad line or made the most significant change to it.

• svn blame?

The information is too basic: e.g., if a constant in a function invocation is wrong, you will not know for sure who is to blame.

• The ethical aspects of blaming are out of scope

You can use static analysis results to automatically create tasks in a bug tracker and assign them to the right person.

5Ws

The term comes from journalism, natural language processing, problem-solving, etc.

Something similar was mentioned by various philosophers and rhetoricians.

It was taught in high-school journalism classes by 1917.
