An Empirical Study of Reported Bugs in Server Software with Implications for Automated Bug Diagnosis

Swarup Kumar Sahoo, John Criswell, Vikram Adve

Department of Computer Science

University of Illinois at Urbana-Champaign

1

Motivation

• In-the-field software failures are becoming increasingly common

– Software failures result in losses of billions of dollars every year [Charette et al., IEEE Spectrum, 2005]

– Increasing the reliability of systems is critical

• Off-site analysis of production-run failures is difficult

– Failures are difficult to reproduce at the development site

– The same bug may generate different faults at multiple production sites

– Customers have privacy concerns

2

Motivation – Production Site Diagnosis

• Problem: Failures need to be reproduced fast, and reliance on checkpoint-based replay limits the usefulness of diagnosis tools

Question: Will a simple restart/replay mechanism work?

• Problem: Minimal test case generation is too slow

Question: Can knowledge of fault types and the number of inputs help?

To answer these questions, we need to understand the characteristics of software bugs

3

Application Selection

• Server applications are widely used and mission critical

• Server applications are challenging for diagnosis

– Run for long periods of time (-)

– Handle large amounts of data (-)

– Concurrent (-)

– Inputs are well-structured (+)

We studied 266 randomly selected bug reports and 30 extra concurrency bug reports from 6 servers* (Apache, Squid, Tomcat, sshd, SVN, MySQL)

* A detailed spreadsheet of bugs can be found at http://sva.cs.illinois.edu/ICSE2010/bug_statistics.xls

4

Goals and key results of the study

• How many inputs are needed to trigger the symptoms?

– 77% of the bugs need just one input (12/266 bugs need >3)

• How much time passes from the first fault-triggering input to the symptom?

– For 57% of multi-input failures, all inputs are likely to occur within a short time

– Time between the first fault-triggering input and the symptom is usually small

• Which symptoms appear as a manifestation of bugs?

– The majority (63%) of bugs result in incorrect outputs

• Two applications have fewer incorrect outputs

• What fraction of failures is deterministic?

– 82% of bugs showed deterministic behavior

• Very few bugs are concurrency bugs; nearly all of them are non-deterministic, need many more inputs, and produce fewer incorrect outputs

5

Outline

• Motivation and Findings

• Methodology and Limitations

• Definitions and Terminology

• Classification of Software Bugs

• Analysis of Multiple Input Bugs

• Concurrency Bugs

• Implications

• Conclusions and Future Work

6

Bug Selection

• Selected a recent major version of the software that had been in production use for at least a year

• Selected bugs from the bug database using a set of filters (Status field RESOLVED, Resolution field FIXED)

• Randomly sampled bugs from the filtered list using a seeded rand() function

• Result: 472 server bugs
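
The seeded sampling step can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name `sample_bug_ids`, the ID range, and the seed value are all assumptions. It only shows why a fixed seed makes the selection repeatable.

```python
import random

def sample_bug_ids(bug_ids, sample_size, seed):
    """Return a reproducible random sample of bug IDs.

    A dedicated Random instance keeps the draw independent of any other
    use of the global random state.
    """
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order in which the
    # bug tracker returned the filtered reports.
    return rng.sample(sorted(bug_ids), sample_size)

# Hypothetical IDs standing in for the RESOLVED/FIXED reports.
candidate_ids = range(1000, 1200)
first = sample_bug_ids(candidate_ids, 10, seed=2010)
second = sample_bug_ids(candidate_ids, 10, seed=2010)
```

Running the draw twice with the same seed yields the same list, so the sampled set can be regenerated later for auditing.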

7

Bug Selection

• Manual Filtering

– Removed bugs in development code versions

– Removed trivial bugs such as build errors and documentation errors

– After filtering, 266 of the 472 bugs remained

• We analyzed each bug (reports, test cases, patches)

• Classified them into different categories based on

– Bug symptom

– Reproducibility

– #inputs

8

Applications and Software Bugs

Application         Description                        #LOC     #total bugs after sampling   #bugs selected
MySQL 4.x           Database server                    1,028K   90                           55
Tomcat 5            Servlet container and web server   274K     70                           53
sshd 3.6-3.x, 4.x   Secure shell server                27K      61                           54
Apache 2.0.x        Web server                         283K     65                           52
Squid 3.0.x         Caching web proxy                  93K      170                          40
SVN 1.0.0 - 1.6.0   Version control server             587K     16                           12
Total               ---                                2,018K   472                          266
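
As a quick arithmetic check, the bug counts in the table above can be summed and the share kept after manual filtering computed per application. The dictionary layout and the helper name `kept_fraction` are illustrative choices; the numbers are copied from the table.

```python
# (sampled, selected) bug counts per application, copied from the table.
bug_counts = {
    "MySQL": (90, 55),
    "Tomcat": (70, 53),
    "sshd": (61, 54),
    "Apache": (65, 52),
    "Squid": (170, 40),
    "SVN": (16, 12),
}

total_sampled = sum(sampled for sampled, _ in bug_counts.values())
total_selected = sum(selected for _, selected in bug_counts.values())

def kept_fraction(app):
    """Fraction of an application's sampled bugs that survived filtering."""
    sampled, selected = bug_counts[app]
    return selected / sampled

for app in bug_counts:
    print(f"{app:7s} kept {kept_fraction(app):.0%} after manual filtering")
print(f"Total   kept {total_selected} of {total_sampled}")
```

The sums reproduce the 472 sampled and 266 selected bugs reported in the Total row.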

9

Limitations

• Servers only

– Studied a subset of server applications

• Only two programming languages

– 5 servers are written in C/C++, 1 in Java

• Reported bugs only

– Unreported bugs are likely to be less frequent

– Difficult-to-reproduce bugs are possibly less likely to be reported

• Fixed bugs only

– Bugs unfixed for a long time may have different properties

• Human error

10

Definitions and Terminology

• An input is

– Logical input from client to server at the application level

• Login input, HTTP request, SQL query, command from SSH client

• An input is not

– Messages coming from sources other than client

• File system, back-end databases, DNS queries

– Inputs creating persistent environment

• SVN checkout command, create/insert/delete commands in database

Example (SQL session):

  Login
  Select Database db1
  Set sql_mode = FULL_GROUP_BY
  Insert into foo values (1,2)
  Select count(*) from foo group by a

Example (HTTP request):

  POST /login.jsp HTTP/1.1
  Host: www.mysite.com
  User-Agent: Mozilla/4.0
  Content-Length: 27
  Content-Type: application/x-www-form-urlencoded

  userid=joe&password=guessme…..

12

Definitions and Terminology

• Symptoms

– Incorrect program behavior which is externally visible

• Incorrect Output

– External program output is different from the correct output without any catastrophic symptom

13

Definitions and Terminology

• Deterministic Bug

– Triggers the same symptom each time the application is run with the same set of inputs in the same order on a fixed platform

• Timing Dependent Bug

– Timing of the inputs, in addition to their order, determines whether the symptom is triggered

– A special case of non-deterministic bug

• Ex: An input that arrives before a download input completes crashes the server

• Non-deterministic Bug

– Symptom may not be triggered each time the same requests are input into the application in the same order
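
These definitions suggest a simple experiment: replay the same ordered inputs several times and see how often the symptom appears. The sketch below assumes a hypothetical callback `run_once` that restarts the application, feeds it the inputs, and reports whether the symptom occurred. A finite number of trials can only suggest determinism, and the study itself classified bugs by reading reports, not by running code like this.

```python
from typing import Callable, Sequence

def classify_reproducibility(
    run_once: Callable[[Sequence[str]], bool],
    inputs: Sequence[str],
    trials: int = 20,
) -> str:
    """Label a bug from repeated replays of the same ordered inputs.

    run_once(inputs) is assumed to restart the application, feed it the
    inputs in order, and return True when the symptom appears.
    """
    hits = sum(1 for _ in range(trials) if run_once(inputs))
    if hits == trials:
        return "deterministic"
    if hits == 0:
        return "not reproduced"
    # Separating timing-dependent bugs from other non-deterministic ones
    # would need control over input arrival times, which this sketch lacks.
    return "timing-dependent or non-deterministic"
```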

14

Bug Symptoms

Most of the bugs (63%) result in incorrect outputs

* Memory errors include Seg Fault, Memory Leak, NULL Pointer Exception, etc.

16

Bug Symptoms

Squid and Tomcat have fewer incorrect outputs

Many more assertion violations (23%-28%)

17

Bug Symptoms - Implications

• Implications

– New techniques are needed to detect incorrect outputs at run time

– Adding assertions or automatically generated program invariants may help in detecting incorrect outputs
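
To make the assertion idea concrete, here is a small sketch of a run-time output check for an HTTP-style server. The function name `check_response_invariants` and the two invariants are illustrative assumptions chosen for this example; they are not detectors proposed or evaluated in the study.

```python
def check_response_invariants(status, headers, body):
    """Return a list of violated invariants for one HTTP-style response.

    status is the numeric status code, headers a dict of header values,
    and body the response payload as bytes.
    """
    violations = []
    declared = headers.get("Content-Length")
    # A declared length that disagrees with the payload is an incorrect
    # output that causes no crash and would otherwise go unnoticed.
    if declared is not None and int(declared) != len(body):
        violations.append("Content-Length does not match body size")
    # A 204 (No Content) response must not carry a payload.
    if status == 204 and body:
        violations.append("204 response carries a body")
    return violations
```

A server could call such a check just before sending each response and log any violations, turning a silent incorrect output into a detectable event.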

18

Bug Reproducibility

82% of bugs show deterministic behavior (similar to Chandra et al., DSN’02)

Few show timing-dependent or non-deterministic behavior

19

Bug Reproducibility - Implications

• Implications

– Tools should be able to reproduce most bugs by replaying inputs

– Need new techniques to reproduce the small fraction of bugs classified as timing-dependent or non-deterministic

• Time stamping inputs or controlling thread scheduling

20
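
Time stamping inputs could look roughly like the sketch below, which records each input with its arrival offset and replays the log with the same gaps. The names `RecordedInput`, `InputRecorder`, and `replay` are hypothetical, and the `send` callback stands in for whatever delivers an input to the server. Controlling thread scheduling is outside the scope of this sketch.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RecordedInput:
    arrival: float  # seconds since recording started
    payload: str

class InputRecorder:
    """Log each input together with its arrival offset."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._start = clock()
        self.log: List[RecordedInput] = []

    def record(self, payload: str) -> None:
        self.log.append(RecordedInput(self._clock() - self._start, payload))

def replay(log, send, sleep=time.sleep):
    """Resend inputs in order, preserving the recorded gaps between them."""
    previous = 0.0
    for item in log:
        sleep(max(0.0, item.arrival - previous))
        send(item.payload)
        previous = item.arrival
```

The clock and sleep functions are injectable so that the recorder can be exercised without real delays.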

Number of Bug Triggering Inputs

21

Number of Bug Triggering Inputs Excluding Session Setup Inputs

• Nearly 77% of the bugs need a single input to trigger

• 11% need more than one input

– Apache/SVN need at most 2 inputs, Squid/Tomcat at most 3 inputs

– Only 12 bugs (excluding the unclear cases) need more than 3 inputs

• The remaining 11% were unclear from the reports

22

Number of Bug Triggering Inputs - Implications

• Implications

– Most of the bugs can be reproduced with just a single input

– Nearly all of the bugs can be reproduced with a small number of inputs

• A few inputs from the session that triggers the bug are enough

– The failure symptom occurs shortly after the last faulty input is received (see paper)

• Except for hang or time-out bugs

23

Detailed Analysis

Classification of 22 non-deterministic bugs

Appl    # ≤1-input   # >1-input   Unclear
Total   9 (41%)      10 (45%)     3 (14%)

Classification of 30 multi-input bugs

Appl    Deterministic   Timing-dependent   Non-deterministic
Total   12 (40%)        8 (27%)            10 (33%)

24

Analysis of Multiple Input Bugs

• Goal: Characterize the time from the first fault-triggering input to the last input

• Classified into three categories

– Clustered: input requests must occur within some time bound

• Ex: All inputs must occur within the socket timeout period

– Likely clustered: fault-triggering inputs are likely to occur within a short duration in most cases

• Ex: Two successive login requests with wrong passwords

– Arbitrary: there is nothing to indicate that inputs must be or are usually clustered within a short duration

• Ex: Request a static file, then request the same file again

26

Analysis of Multiple Input Bugs

Appl.    Total   Clustered   Likely Clustered   Arbitrary
Squid    5       3           0                  2
Apache   3       0           1                  2
sshd     4       0           3                  1
SVN      3       1           2                  0
MySQL    8       2           2                  4
Tomcat   7       2           1                  4
Total    30      8           9                  13

• Out of 30 multi-input bugs

– 8 were clustered

– 9 were likely clustered

– 13 were arbitrary
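
The 57% figure quoted in the key results follows directly from these counts: clustered plus likely clustered bugs, divided by all multi-input bugs. The short check below copies the per-application rows from the table; the variable names are illustrative.

```python
# Rows copied from the table: (clustered, likely clustered, arbitrary).
multi_input_bugs = {
    "Squid": (3, 0, 2),
    "Apache": (0, 1, 2),
    "sshd": (0, 3, 1),
    "SVN": (1, 2, 0),
    "MySQL": (2, 2, 4),
    "Tomcat": (2, 1, 4),
}

# Sum each column across applications.
clustered, likely, arbitrary = (sum(col) for col in zip(*multi_input_bugs.values()))
total = clustered + likely + arbitrary
short_window_share = (clustered + likely) / total
print(f"{clustered + likely} of {total} = {short_window_share:.0%}")  # 17 of 30 = 57%
```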

27

Analysis of Multiple Input Bugs

• Implications

– The majority of multi-input bugs trigger the symptom shortly after the first faulty input

• Replay tools need to buffer the session inputs and a small suffix of the inputs

– Locality of the faulty inputs within an input stream can simplify creation of a reduced test case
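
A replay tool following this advice might keep the session-setup inputs plus a bounded window of recent inputs. The class below is a sketch under that assumption; `ReplayBuffer`, its default window of 3, and the `is_session_setup` flag are hypothetical names and choices, not part of any tool described in the talk.

```python
from collections import deque

class ReplayBuffer:
    """Keep all session-setup inputs plus the last `suffix_len` other inputs."""

    def __init__(self, suffix_len=3):
        self.session_inputs = []
        # A bounded deque silently drops the oldest entry when full.
        self.recent = deque(maxlen=suffix_len)

    def observe(self, payload, is_session_setup=False):
        if is_session_setup:
            self.session_inputs.append(payload)
        else:
            self.recent.append(payload)

    def replay_sequence(self):
        """Inputs to feed a freshly restarted server, in order."""
        return self.session_inputs + list(self.recent)
```

Memory use stays constant no matter how long the server runs, which matters for servers that handle large input streams over long periods.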

28

Study of Concurrency Bugs

• Found very few (3) concurrency bugs in our bug set

– Perhaps because servers process each input relatively independently

– Even for multi-threaded servers (Apache, MySQL, Tomcat)

• Separately selected 30 extra concurrency bugs

– From 3 server applications (Apache, MySQL, Tomcat)

– Searched on keywords like 'race(s)', 'atomic', 'concurrency', 'deadlock', 'lock(s)', and 'mutex(s)'

– 23 were data race/atomicity violation bugs, 5 were deadlock bugs, 2 were not clear
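
The keyword search can be approximated with a regular expression over report text. This is a sketch only: the pattern, the function name `matching_keywords`, and the handling of plural forms are assumptions about how such a search might be written, and the actual queries were presumably run against each project's bug database.

```python
import re

# Whole-word, case-insensitive match on the keywords listed above.
KEYWORD_PATTERN = re.compile(
    r"\b(races?|atomic|concurrency|deadlock|locks?|mutex(?:es|s)?)\b",
    re.IGNORECASE,
)

def matching_keywords(report_text):
    """Return the distinct keywords found in a bug report, lowercased."""
    return sorted({m.group(1).lower() for m in KEYWORD_PATTERN.finditer(report_text)})
```

A keyword hit only flags a candidate; each matching report still needs manual review to confirm that it describes a concurrency bug.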

30

Concurrency Bug Symptom Classification

• A much higher fraction of bugs are hangs or crashes

• Far fewer incorrect outputs (20% overall, but 45% in MySQL)

• Five (17%) of the concurrency bugs produced different symptoms in different executions

Appl.   Seg Fault   Crash    Assertion Violation   Hang      Incorrect Output   Multiple Symptoms
Total   3 (10%)     1 (3%)   6 (20%)               9 (30%)   6 (20%)            5 (17%)

31

Concurrency Bug Reproducibility

Appl.   Deterministic   Timing-dependent   Non-deterministic
Total   2 (7%)          2 (7%)             26 (87%)

Most of the bugs (87% overall, and 100% in Apache and Tomcat) show non-deterministic behavior.

32

Concurrency Bug Input Characteristics

Appl.   # 0-2 inputs   # 3-8 inputs   # >8 inputs   Unclear    Max #inputs
Total   0 (0%)         3 (10%)        17 (57%)      10 (33%)   15000 (max)

• All bugs need multiple inputs (>1) to trigger a symptom (excluding session setup inputs)

• Some of the cases need a large number of inputs

• Many bugs needed executions with multiple threads and multiple client connections for some time

• Most bugs can usually be triggered using 2 or 3 threads and client connections

33

Implications for Concurrency Bugs

• Very few reported bugs are concurrency bugs

• Implications for tools targeting concurrency bugs

– Need new techniques to reliably reproduce symptoms

– Need to buffer larger number of inputs

– Need to use inputs from multiple different client connections

• Validation of results for overall reported bugs

– Study of concurrency bugs successfully identified non-deterministic behavior and the need for multiple inputs

– The same methodology found a very low occurrence of these behaviors among the overall reported bugs
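
Replaying inputs from several client connections at once could be sketched with one thread per client, as below. The function `replay_concurrently` and the `send(client_id, payload)` callback are hypothetical. Plain threads give no control over interleaving, so this alone would not reliably reproduce a non-deterministic symptom.

```python
import threading

def replay_concurrently(client_streams, send):
    """Replay each client's inputs on its own thread.

    client_streams maps a client ID to that client's ordered inputs.
    send(client_id, payload) is a hypothetical callback that writes one
    input to the connection belonging to client_id.
    """
    def worker(client_id, payloads):
        # Order is preserved within a client, but not across clients.
        for payload in payloads:
            send(client_id, payload)

    threads = [
        threading.Thread(target=worker, args=(client_id, payloads))
        for client_id, payloads in client_streams.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```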

34

Implications for Automated Tools

• Diagnosis tools like DDmin (implements delta debugging) [Zeller et al., TOSE 02]

– Test small suffixes of inputs before trying a more general algorithm

– One can possibly try subsets of small sizes

• From our results, trying subsets of 2 or 3 inputs should work for most bugs

• Diagnosis tools like Triage [Tucek et al., SOSP 08]

– Can reduce the input stream to a much smaller set

– Symptoms can possibly be triggered by restarting the server and replaying a small number of inputs after the session establishment inputs

• Alleviates the need for checkpointing
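
The suffix-first and small-subset-first strategy can be sketched as a cheap pre-pass before a general minimizer. This is not ddmin and not Triage. The function `find_small_trigger`, its default limits, and the `fails` callback (assumed to restart the server, replay session-setup inputs plus the candidate, and report whether the symptom appears) are all hypothetical.

```python
from itertools import combinations

def find_small_trigger(inputs, fails, max_suffix=3, max_subset=3):
    """Look for a short failure-inducing input list before a general search.

    Returns the first candidate for which fails(candidate) is True, or
    None if no short suffix or small subset triggers the symptom.
    """
    # 1. Suffixes, shortest first: symptoms tend to follow the last faulty input.
    for length in range(1, min(max_suffix, len(inputs)) + 1):
        candidate = list(inputs[-length:])
        if fails(candidate):
            return candidate
    # 2. Order-preserving subsets of up to max_subset inputs.
    for size in range(1, max_subset + 1):
        for candidate in combinations(inputs, size):
            if fails(list(candidate)):
                return list(candidate)
    return None  # fall back to a general algorithm such as delta debugging
```

The subset pass grows combinatorially with the length of the input stream, so in practice it would be applied to a buffered window of recent inputs instead of the full history.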

39

Conclusion and Future Work

• We report the results of an empirical study of server bugs

– Most of the bugs were deterministic

– Most of the bugs (77%) needed a single input

– The set of inputs for a multi-input bug is usually small and clustered

– Many bugs produce incorrect outputs

– Very few bugs are concurrency bugs

– Most of the concurrency bugs need multiple inputs

• Future work

– Create light-weight detectors for incorrect outputs

– Build production-site automated tools that diagnose the root cause at the production site

• Reproduce failures

• Reduce the input stream to a minimal faulty set

41