Using Likely Invariants For Automated Software Fault Localization

Swarup Kumar Sahoo

John Criswell

Chase Geigle

Vikram Adve

Department of Computer Science, University of Illinois at Urbana-Champaign

Motivation - 1

• Software bugs cost ~$59.5 billion annually, about 0.6% of the GDP [NIST Report]

• Windows 2000: 35M LOC, 63,000 known bugs at release time, about 2 per 1000 lines [Mary Jo Foley]

• In 2006, a Mozilla developer admitted that almost 300 bugs appear in their Bugzilla every day [Anvik et al.]

Motivation - 2

• Cost of fault localization increases exponentially as the life cycle progresses

• Bug-fix cost is very high during the operational/maintenance phase

• Important to fix bugs quickly, before the application ships

(Source: Barry Boehm, “Equity Keynote Address”, 2007 & Stefan Priebsch in Advanced OOP and Design Pattern, Codeworks 2009)

Motivation - 3

• Debugging: the process of eliminating a software failure

• Automatic fault localization: automatically identify the root cause responsible for a failure

• Automatic fault localization can reduce development cost and time

[Figure: debugging steps. Reproduce failure → Locate and understand root cause → Fix root cause]

Goal

• Need an automated system to detect the root cause of bugs

– Efficient, scalable, and reports few false positives

[Figure: a .C program and a faulty input are fed to the bug localization tool]

Contributions

• Novel diagnosis mechanism for automated bug localization

– Likely invariants using auto-generated inputs "close to" the failing input

– First to combine the invariants approach with dynamic slicing in software

– Two novel heuristics for reducing false positives

• Used 8 bugs in Squid, Apache, MySQL for evaluation

– Tool provides only 5 to 17 program expressions as candidate root causes

Diagnosis with Invariants

• We use likely range invariants to find potential root causes

• Range of values computed by program instructions in correct runs

– Violated invariants give us a set of candidate root causes

• Invariants on load values, store values, and return values

Value Type   Example Instruction         Example Invariant
Return       return int %weekday         0 <= %weekday <= 6
Load         %value = load int* %p       0 <= %value
Store        store int %val, int* %q     100 <= %val <= 100
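A minimal sketch of how likely range invariants could be trained on passing runs and checked on the failing run. It assumes value traces have already been collected per instrumented instruction; the `RangeInvariant` class, the `train` and `violations` helpers, the instruction labels, and the sample values are illustrative assumptions, not the tool's actual interface or data.

```python
from dataclasses import dataclass


@dataclass
class RangeInvariant:
    """Likely invariant lo <= value <= hi observed over passing runs."""
    lo: int
    hi: int

    def holds(self, value: int) -> bool:
        return self.lo <= value <= self.hi


def train(good_traces):
    """Build one range invariant per instruction from passing-run traces.

    good_traces: list of dicts mapping an instruction label (hypothetical
    naming, e.g. "week_day:ret") to the values it produced in that run.
    """
    invariants = {}
    for trace in good_traces:
        for inst, values in trace.items():
            for v in values:
                inv = invariants.get(inst)
                if inv is None:
                    invariants[inst] = RangeInvariant(v, v)
                else:
                    inv.lo = min(inv.lo, v)
                    inv.hi = max(inv.hi, v)
    return invariants


def violations(invariants, bad_trace):
    """Return the instructions whose invariant fails on the failing run."""
    failed = []
    for inst, values in bad_trace.items():
        inv = invariants.get(inst)
        if inv is not None and any(not inv.holds(v) for v in values):
            failed.append(inst)
    return failed


# Made-up traces: two passing runs and one failing run.
good = [{"week_day:ret": [0, 3]}, {"week_day:ret": [6], "store:q": [100]}]
bad = {"week_day:ret": [-1], "store:q": [100]}
invs = train(good)
failed = violations(invs, bad)
```

With these made-up traces, the trained ranges match the table's Return and Store rows, and only the return-value invariant is reported as a candidate.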

Key Insights

• Tighter likely invariants:

– Compact (not precise) way to summarize & compare memory states

– Efficiently isolate initial candidates of root causes to a small set

• Novel way to generate invariants:

– Automatically generated good inputs "close to" the failing input

– Few close good inputs to train invariants

– Much tighter and more relevant invariants

– Missed root causes less likely (though possibly many candidates)

• Sequence of novel filtering techniques to reduce false positives

Diagnosis Tool Architecture

[Figure: tool pipeline. Inputs: the program, the failure-inducing (bad) input, and an optional input specification. Stages: Generate Inputs (produces good inputs and bad inputs) → Generate Invariants (instrumented program) → Test with Bad Input (produces failed invariants) → False-positive filters: Backward Slicing, Dependence Filtering, Multi-faulty Input Filter.]

Source Code Example

Failing MySQL input: SELECT DATE("0000-01-01",'%W %d %M %Y') as a

    long day_nr(uint year, uint month, uint day) {
      if (year == 0 && …)
        return (0);
      delsum = 365 * year + … ;
      if (month <= 2)
        year--;
      else
        delsum -= (month*4 + … ;
      temp = (year / 100 + … ;
      return (delsum + year/4 * temp);
    }

    int week_day(long daynr, bool first_weekday) {
      return (daynr + …) % 7;
    }

    bool make_date(…) {
      . . .
      weekday = week_day(day_nr(year, month, day), 0);
      str->append(…names[weekday] …. ;
      . . .
    }

Source Code Example – Invariant Failures

[Same day_nr / week_day / make_date listing as in the Source Code Example slide; the slide marks the invariant failures, which are the candidate root causes.]

Insights for Training Invariants

• Use training inputs very “similar” to the failing input

– Capture the key relevant differences between two similar runs

• Very few “similar” good inputs to train invariants

– Much tighter and more relevant invariants

– Less likely to miss root causes

– Though possibly many false-positive candidates

Training Input Construction – 2 Approaches

• Deletion-based, specification-independent approach

– A variation of the ddmin algorithm [Zeller et al.]

– Apply character-level rewriting/deletion

• Replacement-based, specification-dependent approach

– User gives an input specification: tokens in the input grammar and alternative tokens for each token

– Create variations for each input token depending upon its type

– Create inputs by using variations of 1 token at a time

– Possible to implement this by modifying the built-in parsers in the application

• Can be automated for given input specifications
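A rough sketch of the deletion-based idea, assuming a pass/fail oracle is available for the program. This is not ddmin itself and not the tool's algorithm: it only borrows ddmin's halving of chunk sizes to harvest passing inputs near the failing one. The `deletion_variants` function, the `is_failing` oracle, and the toy input are hypothetical.

```python
def deletion_variants(failing_input, is_failing, min_chunk=1):
    """Collect passing inputs obtained by deleting one chunk of characters.

    Chunk sizes are halved each round, in the spirit of ddmin. Unlike ddmin,
    this sketch does not minimize the failing input; it only gathers nearby
    passing inputs that could be used for training. min_chunk must be >= 1.
    """
    good = []
    seen = set()
    chunk = max(len(failing_input) // 2, min_chunk)
    while chunk >= min_chunk:
        for start in range(0, len(failing_input), chunk):
            candidate = failing_input[:start] + failing_input[start + chunk:]
            if candidate and candidate not in seen:
                seen.add(candidate)
                if not is_failing(candidate):
                    good.append(candidate)
        if chunk == min_chunk:
            break
        chunk = max(chunk // 2, min_chunk)
    return good


# Toy oracle (an assumption): the program "fails" when it sees year 0000.
toy_is_failing = lambda text: "0000" in text
good_inputs = deletion_variants("y=0000", toy_is_failing)
```

In practice many deleted variants would be rejected by the application's parser, so a real oracle would also need to tell a clean run apart from an input error.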

Replacement-Based Spec-Dependent Approach – Example


SELECT NAME_CONST('flag',1) * MAX(a) FROM t1;

SELECT NAME_CONST('flag',2) * MAX(a) FROM t1;

SELECT NAME_CONST('flag',3) * MAX(a) FROM t1;

SELECT NAME_CONST('flag',5) * MAX(a) FROM t1;

SELECT NAME_CONST('flag',9) * MAX(a) FROM t1;

SELECT NAME_CONST('flag',1) * SUM(a) FROM t1;

SELECT NAME_CONST('flag',1) * MIN(a) FROM t1;

SELECT NAME_CONST('flag',1) * AVG(a) FROM t1;

SELECT NAME_CONST('flag',1) * STD(a) FROM t1;
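The variations above can be reproduced with a small sketch of one-token-at-a-time replacement. The tokenization and the `alternatives` table stand in for the user-supplied input specification and are assumptions made for illustration; `replacement_variants` is a hypothetical helper, not part of the tool.

```python
def replacement_variants(tokens, alternatives):
    """Yield inputs that differ from the original in exactly one token.

    tokens: the original input split into tokens.
    alternatives: dict mapping a token index to its alternative tokens
    (a stand-in for the input specification).
    """
    for i, alts in sorted(alternatives.items()):
        for alt in alts:
            if alt != tokens[i]:
                yield "".join(tokens[:i] + [alt] + tokens[i + 1:])


# Hand-made tokenization of the first query in the example.
tokens = ["SELECT NAME_CONST('flag',", "1", ") * ", "MAX", "(a) FROM t1;"]
alternatives = {1: ["2", "3", "5", "9"], 3: ["SUM", "MIN", "AVG", "STD"]}
variants = list(replacement_variants(tokens, alternatives))
```

Each generated query would still have to be executed to learn whether it is a good or a bad input.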

Selecting Candidate Root Causes

• Invariant generation:

– Select a set of “close” good inputs based on edit distance

– Run the instrumented program on the good inputs to generate invariants

• Candidate root cause selection:

– Execute the program with the inserted invariants on the bad input

– Failed invariants provide a set of candidate root causes
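A sketch of selecting "close" good inputs by edit distance. The slides do not say which edit distance is used; plain character-level Levenshtein distance is assumed here, and `closest_good_inputs` and its parameter `k` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def closest_good_inputs(bad_input, good_inputs, k):
    """Pick the k passing inputs nearest to the failing one."""
    return sorted(good_inputs, key=lambda g: edit_distance(bad_input, g))[:k]


bad = "SELECT MAX(a) FROM t1;"
goods = ["SELECT SUM(a) FROM t1;", "SELECT 1;", "SELECT MIN(a) FROM t1;"]
nearest = closest_good_inputs(bad, goods, k=2)
```

Keeping `k` small follows the slides' point that a few very similar good inputs give tighter invariants.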

Filtering Techniques

• Dynamic Backward Slicing

– Remove candidates that do not affect the symptom

• Dependence Filtering

– Leverages DFS traversal of the DDG during slicing

– Discard a dependent failed invariant if there is no intervening passing invariant

– Processing time is linear in the number of edges in the DDG

• Multiple Faulty Input Filter

– Intersection of root causes for similar failing inputs
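A minimal sketch of the first and third filters, assuming the dependence graph (the slides' DDG) is available as a dictionary from each node to the nodes it directly depends on. The graph encoding, node names, and function names are illustrative assumptions.

```python
def backward_slice(deps, symptom):
    """Nodes the symptom transitively depends on, found by iterative DFS.

    deps: dict mapping a graph node to the nodes it directly depends on.
    """
    seen = set()
    stack = [symptom]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return seen


def slice_filter(failed_invariants, deps, symptom):
    """Keep only the failed invariants that can affect the symptom."""
    in_slice = backward_slice(deps, symptom)
    return [n for n in failed_invariants if n in in_slice]


def multi_input_filter(candidates_per_bad_input):
    """Intersect the candidate sets obtained from similar failing inputs."""
    sets = [set(c) for c in candidates_per_bad_input]
    return set.intersection(*sets) if sets else set()


# Toy graph: "d" depends on "a" but does not feed the crash.
deps = {"crash": ["c"], "c": ["b"], "b": ["a"], "d": ["a"]}
after_slice = slice_filter(["b", "c", "d"], deps, "crash")
common = multi_input_filter([["b", "c"], ["b", "e"]])
```

The slice visits each edge at most once, so its cost grows linearly with the number of edges in the graph.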

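Dependence Filtering - 1

[Figure: dependence chain of invariant checks (Inv Pass, Inv Fail, Inv Fail, Inv Pass, Inv Pass) leading to the crash symptom. One failed invariant is marked "Probably not a root cause" and the other "A possible root cause".]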

Dependence Filtering - 2

[Figure: dependence chain of invariant checks (Inv Pass, Inv Pass, Inv Fail, Inv Pass, Inv Pass, Inv Fail) leading to the crash symptom. Both failed invariants are marked "A possible root cause".]
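A sketch of the dependence-filtering rule stated earlier (discard a dependent failed invariant if no passing invariant intervenes), assuming the same node-to-dependencies dictionary encoding of an acyclic dependence graph. The function and node names are hypothetical. This version repeats a backward walk for every failed invariant, so it is simpler but slower than the single linear-time traversal the slides describe.

```python
def dependence_filter(failed, passed, deps):
    """Drop failed invariants explained by an upstream failed invariant.

    failed, passed: sets of graph nodes whose invariant failed or passed.
    deps: dict mapping a node to the nodes it directly depends on.
    A failed node is dropped when a backward walk reaches another failed
    node without first crossing a passing invariant.
    """
    kept = []
    for node in sorted(failed):
        stack = list(deps.get(node, ()))
        seen = set()
        explained = False
        while stack and not explained:
            cur = stack.pop()
            if cur in seen or cur in passed:
                continue  # a passing invariant blocks this path
            seen.add(cur)
            if cur in failed:
                explained = True
            else:
                stack.extend(deps.get(cur, ()))
        if not explained:
            kept.append(node)
    return kept


# Case 1: f2 depends directly on f1, so only f1 stays a candidate.
chain1 = {"crash": ["f2"], "f2": ["f1"], "f1": ["p1"], "p1": []}
kept1 = dependence_filter({"f1", "f2"}, {"p1"}, chain1)

# Case 2: a passing invariant sits between g1 and g2, so both stay.
chain2 = {"crash": ["g2"], "g2": ["p"], "p": ["g1"], "g1": []}
kept2 = dependence_filter({"g1", "g2"}, {"p"}, chain2)
```

The two toy chains mirror the two outcomes labelled in the figures: a failed invariant that merely propagates an upstream failure is dropped, while failures separated by a passing check are both kept.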

Source Code Example – Dependence Filtering

[Same day_nr / week_day / make_date listing as in the Source Code Example slide; the slide marks one failed invariant as "Probably not a root cause" and another as "A possible root cause".]

Methodology - Characteristics of 8 Bugs

Bug-Name          Symptom           Distance (Dyn #LLVM inst)  Distance (Static #LOC)  Distance (Static #Functions)
Squid-len         Buffer Overflow   12                         6                       2
SQL-int-overflow  Buffer Overflow   18                         7                       3
SQL-convert       Incorrect Output  86                         27                      10
SQL-aggregate     Buffer Overflow   4                          4                       2
SQL-precision     Incorrect Output  124                        41                      17
SQL-mul-overflow  Incorrect Output  114                        36                      17
SQL-dataloss      Incorrect Output  429                        16                      5
Apache-overflow   Buffer Overflow   32780                      6                       2

• The slice from root cause to symptom spans several functions

• This distance is high for incorrect-output bugs

• 8 software bugs in 3 large server applications: Squid, Apache & MySQL

• 12 bugs overall; the 4 missing-code bugs were not considered

• LLVM-2.6 (LLVM-gcc) used to compile programs & run our passes

Experimental Results – Bug Localization Effectiveness

Bug-Name          #Invs  #Failed-Invs  #Final Candidate Root Causes
Squid-len         3358   357           9
SQL-int-overflow  5917   95            16
SQL-convert       5942   93            6
SQL-aggregate     6847   156           8
SQL-precision     4566   130           17
SQL-mul-overflow  4652   83            5
SQL-dataloss      5836   153           11
Apache-overflow   2295   120           6

• Tool provides only 5–17 expressions as candidate root causes

Experimental Results – Effectiveness of Filters

Bug-Name          #Failed-Invs  Slice  Dependence-filter  Multiple-faulty-inputs
Squid-len         357           30     9                  9
SQL-int-overflow  95            36     16                 12
SQL-convert       93            27     9                  6
SQL-aggregate     156           44     14                 8
SQL-precision     130           34     18                 17
SQL-mul-overflow  83            13     7                  5
SQL-dataloss      153           35     17                 11
Apache-overflow   120           12     6                  6
% Reduction                     80%    58%                23%

• Slicing removes 80% of the failed invariants, dependence filtering 58% of what remains, and the multiple-faulty-inputs filter a further 23%
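The reduction percentages can be checked with a short calculation. The slides do not state how the percentages were computed; this sketch assumes each one is the drop in the total candidate count, summed over all eight bugs, from one stage to the next.

```python
# Per-bug candidate counts after each stage, copied from the table
# (order: Squid-len, SQL-int-overflow, SQL-convert, SQL-aggregate,
#  SQL-precision, SQL-mul-overflow, SQL-dataloss, Apache-overflow).
stages = {
    "failed": [357, 95, 93, 156, 130, 83, 153, 120],
    "slice": [30, 36, 27, 44, 34, 13, 35, 12],
    "dependence": [9, 16, 9, 14, 18, 7, 17, 6],
    "multi_input": [9, 12, 6, 8, 17, 5, 11, 6],
}


def reduction(before, after):
    """Percentage of candidates removed between two stages, over all bugs."""
    return round(100 * (sum(before) - sum(after)) / sum(before), 1)


names = list(stages)
reductions = {
    later: reduction(stages[earlier], stages[later])
    for earlier, later in zip(names, names[1:])
}
```

Under that assumption the three values come out at about 80.5, 58.4, and 22.9 percent, in line with the 80%, 58%, and 23% reported on the slide.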

Comparison With Tarantula and Ochiai

• Statistical techniques rank each statement based on a formula

• They performed better than our approach for the SQL-int-overflow & SQL-precision bugs

• They are good for bugs where control flow diverges from good runs

• Our approach is better than Tarantula/Ochiai in 6 out of 8 bugs
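For reference, a sketch of the ranking formulas as they are commonly defined for Tarantula and Ochiai, where each statement is scored from how many failing and passing runs execute it. The slides do not give the formulas or the configuration used in the comparison, so the function signatures and the sample counts below are assumptions.

```python
import math


def tarantula(failed_s, passed_s, total_failed, total_passed):
    """Tarantula suspiciousness: share of failing runs covering the
    statement, normalized by the sum of failing and passing shares."""
    fail_ratio = failed_s / total_failed if total_failed else 0.0
    pass_ratio = passed_s / total_passed if total_passed else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0


def ochiai(failed_s, passed_s, total_failed):
    """Ochiai suspiciousness: failed(s) / sqrt(total_failed * executed(s))."""
    denom = math.sqrt(total_failed * (failed_s + passed_s))
    return failed_s / denom if denom else 0.0


# Made-up counts: a statement run by 2 of 2 failing and 1 of 4 passing tests.
t_score = tarantula(failed_s=2, passed_s=1, total_failed=2, total_passed=4)
o_score = ochiai(failed_s=2, passed_s=1, total_failed=2)
```

Both scores depend only on which runs execute a statement, which fits the slide's remark that these techniques do well when control flow diverges from the good runs.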

Summary

• Tool reports only 5–17 candidate root causes

• Generates invariants using auto-generated similar inputs

• Very few “similar” good inputs to get tighter invariants

• Novel filters to reduce false positives

Questions?