An Empirical Study of Reported Bugs in Server Software with Implications
for Automated Bug Diagnosis
Swarup Kumar Sahoo, John Criswell,
Vikram Adve
Department of Computer Science
University of Illinois at Urbana-Champaign
Motivation
• In-the-field software failures are becoming increasingly
common
– Software failures result in losses of billions of dollars every year
[Charette et al., IEEE Spectrum, 2005]
– Increasing the reliability of systems is critical
• Off-site analysis of production run failures is difficult
– Difficulty in reproducing failures at development site
– Same bug may generate different faults at multiple production sites
– Customers have privacy concerns
Motivation – Production Site Diagnosis
• Problem: Failures must be reproduced quickly, but reliance on
checkpoint-based replay limits the usefulness of diagnosis tools
Question: Will a simple restart/replay mechanism work?
• Problem: Minimal test case generation is too slow
Question: Can knowledge of fault types and the number of inputs help?
To answer these questions, we need to understand the
characteristics of software bugs
Application Selection
• Server applications are widely used and mission critical
• Server applications are challenging for diagnosis
– Run for long periods of time (-)
– Handle large amounts of data (-)
– Concurrent (-)
– Inputs are well-structured (+)
We studied 266 randomly selected bug reports and
30 extra concurrency bug reports
from 6 servers*
(Apache, Squid, Tomcat, sshd, SVN, MySQL)
* A detailed spreadsheet of bugs can be found at
http://sva.cs.illinois.edu/ICSE2010/bug_statistics.xls
Goals and key results of the study
• How many inputs are needed to trigger the symptoms?
– 77% of the bugs need just one input (12/266 bugs need >3)
• Time duration from first fault-triggering input to symptom?
– For 57% of multi-input failures, all inputs are likely to occur within a short time
– Time between first fault-triggering input and symptom usually small
• Which symptoms appear as a manifestation of bugs?
– The majority (63%) of bugs result in incorrect outputs
• Two applications have fewer incorrect outputs
• What fraction of failures is deterministic?
– 82% of bugs showed deterministic behavior
• Very few bugs are concurrency bugs; nearly all of those are non-deterministic,
need many more inputs, and produce fewer incorrect outputs
Outline
• Motivation and Findings
• Methodology and Limitations
• Definitions and Terminology
• Classification of Software Bugs
• Analysis of Multiple Input Bugs
• Concurrency Bugs
• Implications
• Conclusions and Future Work
Bug Selection
• Selected a recent major version of the software that had been in production use for at least a
year
• Selected a set of bugs from the bug database with a set of filters (Status field as
RESOLVED, Resolution field as FIXED)
• Randomly selected bugs from the resulting list using a seeded rand() function
• Result: 472 server bugs
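The seeded random selection step can be sketched as follows. This is only an illustration, not the script used in the study: the bug IDs, the seed value, and the `sample_bugs` helper are hypothetical.

```python
import random

def sample_bugs(bug_ids, k, seed):
    """Pick k bug IDs reproducibly: the same seed yields the same sample."""
    rng = random.Random(seed)            # private generator, global state untouched
    return rng.sample(sorted(bug_ids), k)

# Hypothetical IDs of bugs that passed the RESOLVED/FIXED filter.
candidate_ids = list(range(1000, 1200))
picked = sample_bugs(candidate_ids, k=20, seed=2010)
```

Fixing the seed means the same sample can be regenerated later, which keeps the selection auditable.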
Bug Selection
• Manual Filtering
– Removed bugs in development code versions
– Removed trivial bugs such as build errors and documentation errors
– After filtering, 266 bugs remained out of 472 bugs
• We analyzed each bug (reports, test cases, patches)
• Classified them into different categories based on
– Bug symptom
– Reproducibility
– #inputs
Applications and Software Bugs
Application        Description                       #LOC     #total bugs after sampling   #bugs selected
MySQL 4.x          Database server                   1,028K   90                           55
Tomcat 5           Servlet container and web server  274K     70                           53
Sshd 3.6-3.x, 4.x  Secure shell server               27K      61                           54
Apache 2.0.x       Web server                        283K     65                           52
Squid 3.0.x        Caching web proxy                 93K      170                          40
SVN 1.0.0 - 1.6.0  Version control server            587K     16                           12
Total              ---                               2,018K   472                          266
Limitations
• Servers only
– Studied a subset of server applications
• Only two programming languages
– 5 were in C/C++, 1 in Java
• Reported bugs only
– Unreported bugs are likely to be less frequent
– Difficult-to-reproduce bugs are possibly less likely to get reported
• Fixed bugs only
– Bugs unfixed for a long time may have different properties
• Human error
Definitions and Terminology
• An input is
– Logical input from client to server at the application level
• Login input, HTTP request, SQL query, command from SSH client
• An input is not
– Messages coming from sources other than client
• File system, back-end databases, DNS queries
– Inputs creating persistent environment
• SVN checkout command, create/insert/delete commands in database
Example inputs (database session):
Login
Select Database db1
Set sql_mode = FULL_GROUP_BY
Insert into foo values (1,2)
Select count(*) from foo group by a
Example input (HTTP request):
POST /login.jsp HTTP/1.1
Host: www.mysite.com
User-Agent: Mozilla/4.0
Content-Length: 27
Content-Type: application/x-www-form-urlencoded
userid=joe&password=guessme…..
Definitions and Terminology
• Symptoms
– Incorrect program behavior which is externally visible
• Incorrect Output
– External program output is different from the correct output without
any catastrophic symptom
Definitions and Terminology
• Deterministic Bug
– Triggers the same symptom each time the application is run with the
same set of inputs in the same order on a fixed platform
• Timing Dependent Bug
– Timing, in addition to input order, determines whether the symptom is triggered
– A special case of non-deterministic bug
• Ex: An input arriving before a download input completes crashes the server
• Non-deterministic Bug
– Symptom may not be triggered each time the same requests are input
into the application in the same order
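These definitions suggest a simple experiment: replay the same ordered inputs several times and compare the symptoms. The sketch below assumes a hypothetical `run_once` callback that restarts the server, replays the inputs, and returns the observed symptom (or `None`). It cannot separate timing-dependent bugs from other non-deterministic ones, since that would require control over input timing.

```python
def classify_reproducibility(run_once, inputs, trials=10):
    """Replay the same ordered inputs several times and label the outcome."""
    symptoms = [run_once(inputs) for _ in range(trials)]
    # Deterministic: every run shows a symptom, and it is always the same one.
    if all(s is not None and s == symptoms[0] for s in symptoms):
        return "deterministic"
    # Some runs fail, or the symptom varies between runs.
    if any(s is not None for s in symptoms):
        return "non-deterministic"   # timing-dependent bugs also land here
    return "not reproduced"

# Stand-ins for a real server harness (hypothetical).
def always_crash(inputs):
    return "crash"

def make_flaky():
    calls = {"n": 0}
    def flaky(inputs):
        calls["n"] += 1
        return "hang" if calls["n"] % 3 == 0 else None   # fails every third run
    return flaky
```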
Bug Symptoms
Most of the bugs (63%)
result in incorrect outputs
* Memory errors include Seg Fault, Memory Leak, NULL Pointer Exception, etc.
Bug Symptoms
Squid and Tomcat have fewer
incorrect outputs
and many more assertion violations (23%-28%)
Bug Symptoms - Implications
• Implications
– New techniques needed to detect incorrect outputs at run time
– Adding assertions or automatically generated program invariants
may help in detecting incorrect outputs
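A light-weight run-time check for incorrect outputs could look like the sketch below. The invariants are hand-written examples for an HTTP-like response and are assumptions for illustration; they are not invariants derived in the study, and `check_response` is a hypothetical helper.

```python
def check_response(status, headers, body):
    """Return a list of violated invariants for one response (empty if none)."""
    violations = []
    if not 100 <= status <= 599:
        violations.append("status code out of range")
    declared = headers.get("Content-Length")
    if declared is not None and int(declared) != len(body):
        violations.append("Content-Length does not match body size")
    if status == 204 and body:
        violations.append("204 response carries a body")
    return violations
```

A server wrapper could log or flag any response for which the returned list is non-empty, turning a silent incorrect output into a detectable event.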
Bug Reproducibility
82% show deterministic
behavior (similar to Chandra et al.,
DSN’02)
Few show timing dependence or non-deterministic behavior
Bug Reproducibility - Implications
• Implications
– Tools should be able to reproduce most bugs by replaying inputs
– Need new techniques to reproduce small fraction of bugs classified
as timing-dependent or non-deterministic
• Time stamping inputs or controlling thread scheduling
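Time stamping inputs for replay can be sketched as below. The `InputLog` class is hypothetical; the clock and sleep functions are injectable so the recorded gaps between inputs can be reproduced (or checked in a test) without a real server.

```python
import time

class InputLog:
    """Record inputs with arrival times so a replay can preserve their spacing."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = []                # list of (timestamp, input)

    def record(self, data):
        self._entries.append((self._clock(), data))

    def replay(self, send, sleep=time.sleep):
        """Resend inputs in order, waiting the recorded gap between them."""
        previous = None
        for stamp, data in self._entries:
            if previous is not None:
                sleep(stamp - previous)
            send(data)
            previous = stamp
```

Preserving the gaps only approximates the original timing; controlling thread scheduling would need support beyond this sketch.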
Number of Bug Triggering Inputs Excluding Session Setup Inputs
• Nearly 77% of the bugs need a single input to trigger
• 11% needed more than one input
– Apache/SVN need a maximum of 2 inputs, Squid/Tomcat 3 inputs
– Only 12 bugs (excluding the unclear cases) need more than 3
inputs
• The remaining 11% were unclear from the reports
Number of Bug Triggering Inputs - Implications
• Implications
– Most of the bugs can be reproduced with just a single input
– Nearly all of the bugs can be reproduced with a small number of inputs
• A few inputs from the session that triggers the bug are enough
– Failure symptom occurs shortly after the last faulty input is received
(See paper)
• Except hang or time-out bugs
Detailed Analysis
Classification of 22 non-deterministic bugs
Appl    # ≤1-input   # >1-input   Unclear
Total   9 (41%)      10 (45%)     3 (14%)
Classification of 30 multi-input bugs
Appl    Deterministic   Timing-dependent   Non-deterministic
Total   12 (40%)        8 (27%)            10 (33%)
Analysis of Multiple Input Bugs
• Goal: Characterize the time from the first fault-triggering input to the last input
• Classified into three categories
– Clustered: input requests must occur within some time bound
• Ex: All inputs should occur within socket timeout period
– Likely clustered: fault-triggering inputs are likely to occur within a
short duration for most cases
• Ex: Two successive login requests with wrong passwords
– Arbitrary: there is nothing to indicate that inputs must be or are
usually clustered within a short duration
• Ex: Request a static file, Request the same file again
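The "clustered" category amounts to a time-bound check on the fault-triggering inputs. A minimal sketch, assuming arrival timestamps in seconds are available (the function name and the bound are hypothetical):

```python
def is_clustered(timestamps, bound_seconds):
    """True if all fault-triggering inputs arrived within bound_seconds of each other."""
    if not timestamps:
        return False
    return max(timestamps) - min(timestamps) <= bound_seconds
```

For example, with a 5-second socket timeout as the bound, inputs arriving at 10.0, 12.5, and 14.0 seconds would count as clustered.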
Analysis of Multiple Input Bugs
Appl.    Total   Clustered   Likely Clustered   Arbitrary
Squid    5       3           0                  2
Apache   3       0           1                  2
sshd     4       0           3                  1
SVN      3       1           2                  0
MySQL    8       2           2                  4
Tomcat   7       2           1                  4
Total    30      8           9                  13
• Out of 30 multi-input bugs
– 8 were clustered
– 9 were likely clustered
– 13 were arbitrary
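The 57% figure quoted in the key results follows from this table: clustered plus likely clustered bugs, divided by all multi-input bugs. The short tally below re-derives it from the per-application rows (the variable names are ours).

```python
# (clustered, likely clustered, arbitrary) per application, copied from the table.
rows = {
    "Squid":  (3, 0, 2),
    "Apache": (0, 1, 2),
    "sshd":   (0, 3, 1),
    "SVN":    (1, 2, 0),
    "MySQL":  (2, 2, 4),
    "Tomcat": (2, 1, 4),
}

clustered = sum(r[0] for r in rows.values())
likely = sum(r[1] for r in rows.values())
arbitrary = sum(r[2] for r in rows.values())
total = clustered + likely + arbitrary

# Share of multi-input bugs whose inputs are (likely) close together in time.
short_window_share = round(100 * (clustered + likely) / total)
```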
Analysis of Multiple Input Bugs
• Implications
– The majority of multi-input bugs will trigger the symptom shortly after the first
faulty input
• Replay tools need to buffer session inputs and a small suffix of the inputs
– Locality of the faulty inputs within an input stream can simplify
creation of a reduced test case
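The buffering idea above can be sketched with a bounded queue: keep all session-setup inputs, but only the most recent few regular inputs. The `ReplayBuffer` class and the default suffix length of 3 are assumptions for illustration.

```python
from collections import deque

class ReplayBuffer:
    """Keep session-setup inputs plus only the most recent `suffix_len` inputs."""

    def __init__(self, suffix_len=3):
        self.session = []                        # e.g. login, select database
        self.suffix = deque(maxlen=suffix_len)   # older inputs are dropped

    def add(self, data, is_session_setup=False):
        if is_session_setup:
            self.session.append(data)
        else:
            self.suffix.append(data)

    def replay_sequence(self):
        """Inputs to resend after a restart: session setup, then the suffix."""
        return self.session + list(self.suffix)
```

Memory use stays constant per session regardless of how long the server runs, which matters for long-running servers.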
Study of Concurrency Bugs
• Found very few (3) concurrency bugs in our bug set
– Perhaps because servers process each input relatively
independently
– Even for multi-threaded servers (Apache, MySQL, Tomcat)
• Separately selected 30 extra concurrency bugs
– From 3 server applications (Apache, MySQL, Tomcat)
– Searched on keywords like ’race(s),’ ’atomic,’ ’concurrency,’
’deadlock,’ ’lock(s),’ and ’mutex(s)’
– 23 were data race/atomicity violation bugs, 5 were deadlock bugs, 2
were not clear
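The keyword search can be approximated with a regular expression over bug report text. The pattern below is an assumption built from the listed keywords (with "mutexes" added as a plural form), and the sample report strings in the usage note are invented.

```python
import re

# Word boundaries avoid false hits such as "block" or "trace".
KEYWORDS = re.compile(
    r"\b(races?|atomic|concurrency|deadlock|locks?|mutex(?:es|s)?)\b",
    re.IGNORECASE,
)

def looks_like_concurrency_bug(report_text):
    """True if the report mentions any of the concurrency keywords."""
    return KEYWORDS.search(report_text) is not None
```

A report titled "Deadlock in worker on graceful restart" would match, while "Server blocks trace output" would not. Matches still need manual review, as keyword hits alone do not confirm a concurrency bug.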
Concurrency Bug Symptom Classification
• A much higher fraction of bugs are hangs or crashes
• Far fewer incorrect outputs (20% overall, but 45% in MySQL)
• Five (17%) of the concurrency bugs produced different
symptoms in different executions
Appl.   Seg Fault   Crash    Assertion Violation   Hang      Incorrect Output   Multiple Symptoms
Total   3 (10%)     1 (3%)   6 (20%)               9 (30%)   6 (20%)            5 (17%)
Concurrency Bug Reproducibility
Appl.   Deterministic   Timing-dependent   Non-deterministic
Total   2 (7%)          2 (7%)             26 (87%)
Most of the bugs (87% overall, and 100% in Apache and Tomcat) show non-deterministic behavior.
Concurrency Bug Input Characteristics
Appl.   # 0-2 input   # 3-8 input   # >8-input   Unclear    Max # inputs
Total   0 (0%)        3 (10%)       17 (57%)     10 (33%)   15000 (max)
• All bugs need multiple inputs (>1) to trigger a symptom
(excluding session setup inputs)
• Some of the cases need a large number of inputs
• Many bugs needed executions with multiple threads and
multiple client connections for some time
• Most bugs can usually be triggered using 2 or 3 threads or client
connections
Implications for Concurrency Bugs
• Very few reported bugs are concurrency bugs
• Implications for tools targeting concurrency bugs
– Need new techniques to reliably reproduce symptoms
– Need to buffer larger number of inputs
– Need to use inputs from multiple different client connections
• Validation of results for overall reported bugs
– Study of concurrency bugs successfully identified non-deterministic
behavior and need for multiple inputs
– Similar methodology found a very low occurrence of these behaviors
for overall reported bugs
Implications for Automated Tools
• Diagnosis tools like DDmin (implements delta debugging)
[Zeller et al., TOSE 02]
– Test small suffixes of inputs before trying a more general algorithm
– One can possibly try subsets of small sizes
• From our results, trying subsets of 2 or 3 inputs should work for most bugs
• Diagnosis tools like Triage [Tucek et al., SOSP 08]
– Can reduce the input stream to a much smaller set
– Symptoms can possibly be triggered by restarting the server and
replaying a small number of inputs after session establishment inputs
• Alleviates the need for checkpointing
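The "small suffixes first, then small subsets" strategy can be sketched as a pre-pass that runs before a general reduction algorithm. This is not DDmin or Triage itself: `reduce_inputs` and the `fails` oracle (restart the server, replay the candidate inputs in order, report whether the symptom appears) are hypothetical.

```python
from itertools import combinations

def reduce_inputs(inputs, fails, max_suffix=3, max_subset=3):
    """Look for a small failing input set before falling back to a general algorithm."""
    # 1. Try short suffixes: the last 1, 2, ... inputs of the stream.
    for n in range(1, min(max_suffix, len(inputs)) + 1):
        suffix = inputs[-n:]
        if fails(suffix):
            return suffix
    # 2. Try every subset of small size, keeping the original input order.
    for size in range(1, min(max_subset, len(inputs)) + 1):
        for idx in combinations(range(len(inputs)), size):
            candidate = [inputs[i] for i in idx]
            if fails(candidate):
                return candidate
    return None   # caller would fall back to full delta debugging

# Invented example: the symptom needs "set_mode" followed later by "query".
stream = ["login", "a", "set_mode", "b", "c", "query"]

def needs_mode_then_query(candidate):
    return ("set_mode" in candidate and "query" in candidate
            and candidate.index("set_mode") < candidate.index("query"))
```

The subset pass costs on the order of n choose k replays, so it is only practical for the small sizes (2 or 3) that the results suggest are usually sufficient.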
Conclusion and Future Work
• We report the results of an empirical study of server bugs
– Most of the bugs were deterministic
– Most of the bugs (77%) needed a single input
– Sets of inputs for multi-input bugs are usually small and clustered
– Many bugs produce incorrect outputs
– Very few bugs are concurrency bugs
– Most of the concurrency bugs need multiple inputs
• To create light-weight detectors to detect incorrect outputs
• To build production-site automated tools
– To automatically diagnose root cause at production site
• Reproduce failures
• Reduce input stream to a minimal faulty set