
WiDS Checker: Combating Bugs in Distributed Systems

Xuezheng Liu, Microsoft Research Asia

Wei Lin, Microsoft Research Asia

Aimin Pan, Microsoft Research Asia

Zheng Zhang, Microsoft Research Asia

Abstract

Despite many efforts, the predominant practice of debugging a distributed system is still printf-based log mining, which is both tedious and error-prone. In this paper, we present WiDS Checker, a unified framework that can check distributed systems through both simulation and runs reproduced from real deployment. All instances of a distributed system can be executed within one simulation process, multiplexed properly to observe the "happens-before" relationship and thus accurately reveal full system state. A versatile script language allows a developer to refine system properties into straightforward assertions, which the checker inspects for violations. Combining these two components, we are able to check distributed properties that are otherwise impossible to check. We applied WiDS Checker to a suite of complex, real systems and found non-trivial bugs, including one in a previously proven Paxos specification. Our experience demonstrates the usefulness of the checker and allows us to gain insights beneficial to future research in this area.

1 Introduction

From large clusters in machine rooms to large-scale P2P networks, distributed systems are at the heart of today's Internet services. At the same time, it is well recognized that these systems are difficult to design, implement, and test. Their protocols involve complex interactions among a collection of networked machines, and must handle failures ranging from network problems to crashing nodes. Intricate sequences of events can trigger complex errors as a result of mishandled corner cases. The most challenging bugs are not the ones that will crash the system immediately, but the ones that corrupt certain design properties and drive the system to unexpected behaviors after long runs.

Yet the predominant practice in debugging distributed systems has remained unchanged over the years: manually inspecting logs dumped at different machines. Typically, developers embed printf statements at various implementation points, perform tests, somehow stitch the logs together, and then look for inconsistencies. However, log mining is both labor-intensive and fragile. Log events are enormous in number, making the inspection tedious and error-prone. Latent bugs often affect application properties that are themselves distributed across multiple nodes, and verifying them from local events can be very difficult. More importantly, logs reflect only incomplete information about an execution, sometimes insufficient to reveal bugs. Pip [23], for instance, logs the application behavior in terms of communication structures, timing, and resource usage, and compares them against developer expectations. However, our experience shows that applications with correct message sequences can still do the wrong thing and corrupt internal state because of buggy logic. Therefore, it is impossible to catch such subtle bugs from communication logs alone, unless much more state is also logged.

It is a common experience that omitting a key logging point can miss a bug, defeating the entire debugging exercise, yet adding it can substantially change subsequent runs. The non-determinism of distributed applications, combined with the limitations of log-based debugging, makes such "Heisenbugs" a nightmare for developers. Building a time machine so that bugs can be deterministically replayed gets rid of the artifacts of using logs [8]. However, one still lacks a comprehensive framework to express correctness properties, catch violation points, and identify root causes.

We believe that a desirable debugging tool for distributed applications needs to: 1) efficiently verify application properties, especially distributed ones; 2) provide fairly complete information about an execution, so that developers can observe arbitrary application states rather than pre-defined logs; and 3) reproduce buggy runs deterministically and faithfully, and hence re-enable the cyclic debugging process.


In this paper, we address the above debugging requirements with a unified framework called WiDS Checker. This platform logs the actual execution of a distributed system implemented using the WiDS toolkit [15]. We can then apply predicate checking in a centralized simulator over a run that is either driven by testing scripts or deterministically replayed from the logs. The checker outputs violation reports along with message traces, allowing us to perform "time-travel" inside the Visual Studio IDE to identify the root causes.

1.1 Our Results

Evaluating the effectiveness of our tool is a challenge. The research community, though acutely aware of the difficulties of debugging distributed systems, has not succeeded in producing a comprehensive set of benchmarks with which different debugging approaches can be quantitatively compared. While we believe the set of applications we have experimented with is representative, there is also no clear methodology for quantifying the usefulness of the tool. To alleviate these problems, we resort to detailed discussions of case studies, hoping that other researchers working on similar systems can compare their experiences. Where possible, we also isolate the benefits of predicate checking from those of using logs and replay only. Our tool is targeted at the scenario in which the system is debugged by those who developed it, and thus assumes that the bugs are hunted by people who are intimately familiar with the system. How to propagate the benefits to others who are not as versed in the system itself is an interesting research question.

We applied WiDS Checker to a suite of distributed systems, including both individual protocols and a complete, deployed system. Within a few weeks, we found non-trivial bugs in all of them. We discovered both a deadlock and a livelock in the distributed lock service of Boxwood [19]. We checked an implementation of the Chord protocol [1] on top of Macedon [24] and found five bugs, three of them quite subtle. We checked our BitVault storage system [32], a deployed research prototype that has been incrementally improved for more than two years. We identified mishandling of race conditions that can cause the loss of replicas, and incorrect assumptions about transient failures. The most interesting experience was checking our Paxos [13] implementation, which revealed a bug in a well-studied specification itself [21]. Most of these bugs had not been discovered before; all our findings are confirmed by the authors or developers of the corresponding systems.

We also learned some lessons. Some bugs have deep paths and appear only at fairly large scale. They cannot be identified when the system is downscaled, which calls for more efficient handling of the state-explosion problem when a model checker is applied to an actual implementation. Most of the bug cases we found have correct communication structure and messages. Therefore, previous work that relies on verifying event ordering is unable to detect these bugs, and is arguably more effective for performance bugs.

1.2 Paper Roadmap

The rest of the paper is organized as follows. Section 2 gives an overview of the checker architecture. Section 3 provides implementation details. Section 4 presents our results, including our debugging experience and lessons learned. Section 5 contains related work, and we conclude with future work in Section 6.

2 Methodology and Architecture Overview

2.1 Replay-Based Predicate Checking

The WiDS Checker is built on top of the WiDS toolkit, which defines a set of APIs that developers use to write generic distributed applications (see Section 3.1 for details). Without modification, a WiDS-based implementation can be simulated in a single simulation process, simulated on a cluster-based parallel simulation engine, or deployed and run in a real environment. This is made possible by linking the application binary to three different runtime libraries (simulation, parallel simulation, and deployment) that implement the same API interface. WiDS was originally developed for large-scale P2P applications; its parallel simulation engine [14] has simulated up to 2 million instances of a product-strength P2P protocol, and revealed bugs that only occur at scale [29]. With a set of basic fault-injection utilities, WiDS allows a system to be well tested inside its simulation-based testing framework before its release to deployment. WiDS is available with full source code at [3].

However, it is impossible to root out all bugs inside the simulator. The deployed environment can embody different system assumptions, and the full state unfolds unpredictably. Tracking bugs becomes extremely challenging, especially for those that violate system properties that are themselves distributed. When debugging non-distributed software and stand-alone components, developers generally check memory states against design-specified correctness properties at runtime using invariant predicates (e.g., assert() in C++). This dynamic predicate checking technique has proven to be of great help in debugging; many advanced program-checking tools are effective at finding domain-specific bugs (e.g., race conditions [25, 31] and memory leaks [22, 10]) based on the same principle. However, this benefit does not extend to distributed systems, for two reasons.


First, distributed properties reside on multiple machines and cannot be directly evaluated in one place without significant runtime perturbation. Second, even if we can catch a violation, the cyclic debugging process is broken, because non-determinism across runs makes it next to impossible to repeat the same code path that leads to the bug.

To address this problem and to provide similar checking capabilities for distributed systems, we propose a replay-based predicate checking approach that allows the execution of the entire system to be replayed afterwards within a single machine, and at the same time checks node states during the replayed execution against user-defined predicates. At modest scale, this solves both problems outlined above.

2.2 Checking Methodology

User-defined predicates are checked at event granularity. An event can be the expiration of a timer, the receipt of a message from another node, or a scheduling or synchronization event (e.g., resuming/yielding a thread or acquiring/releasing a lock) specific to thread programming. WiDS interprets an execution of a single node or of the entire distributed system as a sequence of events, which are dispatched to corresponding handling routines. During replay, previously executed events from all nodes are re-dispatched, ordered according to the "happens-before" relationship [12]. This way the entire system is replayed in the simulator while preserving causality.

Each time an event is dispatched, the checker evaluates predicates and reports violations for the current step in the replay. The choice of event boundaries for predicate checking is due to a number of factors. First, the event model is the basis of many protocol specifications, especially the ones based on I/O-automata [18, 17]. A system built with the event model can be regarded as a set of state machines in which each event causes a state transition executed as an atomic step. Distributed properties thus change at event boundaries. Second, many widely adopted implementation models can be distilled into such a model. We have used the WiDS toolkit [15] to build large, deployed systems as well as many protocols; network middle-layers such as Macedon [24] and new overlay models such as P2 [16] can be regarded as event-based platforms as well. We believe event granularity is not only efficient, but also sufficient.

2.3 Architecture

Figure 1 shows the architecture of WiDS Checker.

Figure 1: Components of WiDS Checker. The upper-left box was employed as debugging support in WiDS before the Checker was developed.

Reproducing real runs. When the system is running across multiple machines, the runtime logs all non-deterministic events, including messages received from the network, data read from files, thread scheduling decisions, and many environmental system calls. The executable binary is then rerun inside the simulator, and all non-deterministic events are fed from the logs. We use Lamport's logical clock [12] to decide the replay order of events from different nodes, so that the "happens-before" relationship is preserved. Therefore, inside the simulator, we reconstruct the exact state of all instances as they ran in the real environment.
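For concreteness, the following sketch shows the standard Lamport-clock update rules as they would apply when stamping outgoing messages and merging stamps on receipt; it is an illustration only, and the structure and names are not the actual WiDS code.

    // Minimal sketch of Lamport-clock maintenance for consistent group replay.
    // All names (LamportClock, stamp_outgoing, on_receive) are hypothetical.
    #include <algorithm>
    #include <cstdint>

    struct LamportClock {
        uint64_t value = 0;

        // Called before an outgoing message is logged; the returned stamp is
        // embedded in the message header.
        uint64_t stamp_outgoing() {
            return ++value;
        }

        // Called when a message is received; merges the sender's stamp so that
        // "happens-before" is preserved when events are ordered in replay.
        uint64_t on_receive(uint64_t sender_stamp) {
            value = std::max(value, sender_stamp) + 1;
            return value;
        }
    };

    // During replay, events from all nodes are sorted by (clock value, node id)
    // to obtain one total order consistent with the logged causality.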

Checking user-defined predicates. We designed a versatile scripting language to specify the system states being observed and the predicates for invariants and correctness. After each step of event handling, the observed states are retrieved from the replayed instance in the simulator and refreshed in a database. Then, the checker evaluates predicates based on the current states from all replayed instances and reports violations. Because predicates generally reflect design properties, they are easy to reason about and write.

Screening out false alarms with auxiliary information. Unlike safety properties, liveness properties are only guaranteed to be true eventually. This poses a serious problem when checking liveness properties, since many violations can be false alarms. To screen out false alarms, we allow user-defined auxiliary information to be calculated and output along with each violation point. When a violation occurs, the auxiliary information is typically used to produce stability measures based on user-provided heuristics.

Tracing root causes using a visualization tool. In addition to the violation report, we generate a message flow graph based on event traces. All these facilities are integrated into the Visual Studio Integrated Development Environment (IDE). Thus, a developer can "time-travel" to violation points, and then trace backwards while inspecting the full state to identify root causes.


The first two components make it possible to check distributed properties, whereas the last two are critical features for a productive debugging experience. Note that developers can incrementally refine predicates and re-evaluate them on the same reproduced execution. In other words, by means of replay, cyclic debugging is re-enabled.

3 Design and Implementation

WiDS Checker depends critically on the replay functionality, which is implemented at the API level. In this section, we first briefly describe these APIs. We then explain the replay facility and the checker implementation. The section concludes with a discussion of known limitations. WiDS has close to 20K lines of code; the replay and checker components add 4K and 7K lines, respectively.

3.1 Programming with WiDS

Table 1 lists the classes of WiDS APIs with some examples. The WiDS APIs are mostly member functions of the WiDSObject class, which typically implements one node instance of a distributed system. The WiDS runtime maintains an event queue to buffer pending events and dispatches them to corresponding handling routines (e.g., OnMsgHandler()). Besides this event-driven model, WiDS also supports multi-threaded programming with its thread and synchronization APIs. The context switching of WiDS threads is encapsulated as events in the event queue. We use non-preemptive scheduling in which the scheduling points are WiDS APIs and blocking system calls, similar to many user-level thread implementations. The fault-injection utilities include dropping or changing the latency of messages, and killing and restarting WiDS objects.
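As a rough illustration of this programming model, the sketch below shows a node that sets a timer and exchanges messages using the API names from Table 1; the signatures, the NodeId type, and the constants are simplified assumptions rather than the real WiDS declarations.

    // Hedged sketch of a WiDS-style node: handlers are invoked by the runtime's
    // event queue. Signatures and types are simplified for illustration.
    class PingNode : public WiDSObject {
    public:
        void Start() {
            SetTimer(PING_TIMER, /*interval_ms=*/1000);   // assumed signature
        }

        // Invoked by the runtime when the timer event is dispatched.
        void OnTimerExpire(int timer_id) {
            if (timer_id == PING_TIMER)
                PostMsg(peer_, PING_MSG, /*payload=*/nullptr, /*len=*/0);
        }

        // Invoked by the runtime when a message event is dispatched.
        void OnMsgHandler(NodeId from, int msg_type, const char* payload, int len) {
            if (msg_type == PING_MSG)
                PostReliableMsg(from, PONG_MSG, nullptr, 0);
        }

    private:
        enum { PING_TIMER = 1, PING_MSG = 100, PONG_MSG = 101 };
        NodeId peer_;   // NodeId is a hypothetical identifier type
    };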

Macedon over WiDS. It is not easy to write a distributed application with a message-passing, event-driven model, and defining better language support is an active research area [16, 24]. WiDS can be used as the low-level mechanism to implement such languages, thus making its full functionality available to their applications. As an experiment, we ported Macedon [24], an infrastructure to specify and generate implementations of overlay protocols. Macedon has been used to reimplement many complex overlay protocols in a simple domain language and to conduct performance comparisons. Porting to the WiDS API is simplified because both Macedon and WiDS provide a set of APIs that work in an event-driven manner, and programming entities such as messages and timers exist in both platforms. An overlay protocol expressed with Macedon is composed of a .mac file, which our parser takes as input to generate a set of WiDS implementation files. Many overlay-specific functionalities provided by the Macedon library are replicated in a WiDS-based library. Finally, all these files are linked to generate an executable, which, with a proper driver, can be both simulated and deployed. Supporting Macedon is accomplished with about 4K lines of code.

3.2 Enabling replay

The primary goal of the deterministic replay is to reproduce the exact application memory states inside the simulator. To achieve this, we need to log all non-deterministic inputs to the application and feed them to the replay simulator.

Logging. The WiDS runtime logs the following two classes of non-determinism. The first is internal to WiDS: we record all WiDS events, the WiDS thread schedule decisions, and the content of incoming messages. The second class is OS system calls, including reads from files, the memory addresses returned by heap allocation and deallocation, and miscellaneous others such as the system time and random number generation. On Windows NT, each API call is redirected by the linker to the import address table (IAT), from which another jump is taken to reach the real API function. We change the address in the IAT so that the second jump leads to the appropriate logging wrapper, which logs the return results after the real API is executed. Furthermore, to enable consistent group replay [8], we embed a Lamport Clock [12] in each outgoing message's header to preserve the "happens-before" relation during replay. Table 1 describes the logging and replay mechanisms for API calls.
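The IAT-patching idea can be sketched as follows for a single import entry; the wrapper generation, slot lookup, and error handling of the real runtime are omitted, and the helper and variable names are invented for illustration.

    // Sketch of redirecting one IAT entry to a logging wrapper (PatchIatEntry,
    // g_realGetSystemTime, and g_log are illustrative names only).
    #include <windows.h>
    #include <cstdio>

    typedef void (WINAPI *GetTimeFn)(LPFILETIME);
    static GetTimeFn g_realGetSystemTime = nullptr;
    static FILE* g_log = nullptr;   // opened elsewhere by the logging runtime

    // Wrapper: run the real API, then log its non-deterministic result so the
    // same value can be fed back during replay.
    static void WINAPI LoggedGetSystemTimeAsFileTime(LPFILETIME ft) {
        g_realGetSystemTime(ft);
        if (g_log)
            fwrite(ft, sizeof(*ft), 1, g_log);
    }

    // Overwrite one IAT slot (located beforehand) so the import jumps to the
    // wrapper instead of the real API.
    static void PatchIatEntry(void** iatSlot) {
        g_realGetSystemTime = reinterpret_cast<GetTimeFn>(*iatSlot);
        DWORD oldProtect;
        VirtualProtect(iatSlot, sizeof(void*), PAGE_READWRITE, &oldProtect);
        *iatSlot = reinterpret_cast<void*>(&LoggedGetSystemTimeAsFileTime);
        VirtualProtect(iatSlot, sizeof(void*), oldProtect, &oldProtect);
    }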

In addition, we use a lightweight compressor to effectively reduce the log size. As we report in Section 4.5, the computation overhead for logging is small in the tests we performed.

Checkpoint. We use checkpoints to avoid excessive storage overhead from logging and to support partial replay. A checkpoint includes a snapshot of the memory of the WiDS process, the running contexts of user-level threads and sockets, and the buffered events in the event queue.

Replay. Replay can start from either the beginning or a checkpoint. Note that checking predicates requires all instances to be replayed with the causality among them preserved. Therefore, during replay, events from different instances are collected from the logs, serialized into a total execution order based on the Lamport Clock, and re-executed one by one in the simulator.

To improve replay performance and scalability, we use only one simulation process to replay all instances, and use file mapping to handle the memory switch between different instances.


WiDS API set

Event-driven program (SetTimer, KillTimer, OnTimerExpire): log the event type and the sequence; redo the same events in replay.

Message communication (PostMsg, PostReliableMsg, OnMsgHandler): embed a Lamport Clock to maintain causal order and log incoming message contents; replay with the correct partial order and feed message contents from the log.

Multi-threaded program (CreateThread, JoinThread, KillThread, YieldThread, Lock, Unlock): log the schedule decision and the thread context; ensure the same schedule decision and the same context during replay.

Socket APIs for network virtualization (WiDSSocket, WiDSListen, WiDSAccept, WiDSConnect, WiDSSend, WiDSRecv): log the operation along with all received data; feed the received data from the log during replay; sending operations become no-ops in replay.

Fault injection and message delay (ActivateNode, DeActivateNode, SetNetworkModel, OnCalculateDelay): log the activation/deactivation operation and redo it in replay.

Operating system APIs (for Windows)

File system (CreateFile, OpenFile, ReadFile, WriteFile, CloseHandle, SetFilePointer): log the operation along with all input data; feed the input data from the log during replay; write operations become no-ops in replay.

Memory management (VirtualAlloc/Free, HeapAlloc/Free): ensure an identical memory layout in replay.

Miscellaneous (GetSystemTimeAsFileTime, GetLastError): log the return value and feed the same value in replay.

Table 1: WiDS API set and OS APIs with logging and replay mechanisms

The state of an instance is stored in a memory-mapped file, and is mapped into the process memory space on demand. For example, to switch the replayed instance from A to B, we only update the entries in the page table of the simulation process to point to the base address of B's mapped memory. Starting a process for each replayed instance and switching between processes would require local procedure calls (LPC), which are typically tens of times slower than function calls; our approach therefore has significant advantages. When the aggregated working set of all replayed instances fits into physical memory, the only overhead of switching instances is the update to the page table. It also avoids the redundant memory usage caused by per-instance process context and executable binaries.
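A minimal sketch of this mapping trick, under the assumption that each instance's state lives in a single file-backed region mapped at a fixed base address, is shown below; the real implementation also manages thread and socket contexts, and the base address and region size here are invented.

    #include <windows.h>

    static const LPVOID kInstanceBase = (LPVOID)0x40000000;  // fixed base, illustrative
    static const SIZE_T kInstanceSize = 32 * 1024 * 1024;    // per-instance size, assumed

    // One mapping object per replayed instance, backed by a file on disk.
    struct InstanceImage {
        HANDLE mapping;   // created with CreateFileMapping over the backing file
    };

    // Map instance B's state at the fixed base address; whatever was mapped
    // there before is unmapped first. Only page-table entries change, so the
    // switch is cheap as long as the working sets fit in physical memory.
    static bool SwitchTo(const InstanceImage& image) {
        UnmapViewOfFile(kInstanceBase);   // ignore failure if nothing was mapped
        return MapViewOfFileEx(image.mapping, FILE_MAP_ALL_ACCESS,
                               0, 0, kInstanceSize, kInstanceBase) != nullptr;
    }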

The largest per-instance working set in our experiments is about 20MB, meaning that more than 40 instances can be replayed in 1GB of physical memory without going to disk. When the aggregated working set exceeds physical memory, we have observed a 10 to 100 times slowdown due to disk swapping, depending on the computation density of the replayed instances. The checker itself maintains a copy of the states being checked, and the size of that copy varies across applications. Ultimately, scalability is bounded by the disk size and the acceptable replay speed.

3.3 Checker

Deterministic replay that properly preserves causality has enabled the reconstruction of the memory states of a distributed system. The next step is to write predicate statements to catch the violation points of correctness properties. We define a simple scripting language for specifying predicates, so that developers can easily specify the structure of the investigated states, retrieve them from the memory of the instances, and evaluate properties from these states.

As mentioned before, predicate checking is invoked at event boundaries. Each time an event is re-executed in a replayed instance, the checker examines the state changes in the instance and re-evaluates the affected predicates. The states being checked are copies kept in a separate database. The checker refreshes these states in the database from the replayed instance and evaluates predicates based on the state copies of all instances. Therefore, predicate checking is decoupled from the memory layout of the instances, and we do not require all instances to reside in memory simultaneously to evaluate global properties. This makes both replay and the checker more scalable. Furthermore, maintaining state copies separately allows us to keep past versions of states if needed, which is useful for evaluating certain properties (see Section 4.1).

In this section, we explain the checker implementation, including the necessary reflection facilities that make memory states in C++ objects observable by the checker, state maintenance and predicate evaluation, and the auxiliary information associated with violations that deals with false alarms.

3.3.1 Observing memory states

For programming languages such as Java and C# that support runtime reflection, the type system and user-defined data structures are observable at runtime. Unfortunately, this is not the case for C++, which is what WiDS uses. To check application states, we need to record the memory address of each allocated C++ object, with its type information, during its lifetime. We use the Phoenix compiler infrastructure [2] to analyze the executable and inject our code to track class types and object addresses.


Phoenix provides a compiler-independent intermediate representation of binary code, from which we are able to list basic blocks, function calls, and the symbol table that contains all type definitions. We then inject our logging function into calls to the constructors and destructors of the classes. The logging function dumps the timestamp, the type of operation (i.e., construction or destruction), the address of the object, and the type information. This information is used by the checker to inspect memory states. The following assembly code shows an example of a constructor after code injection; the lines beginning with "*" are injected code. They call our injected logging function onConstruct with the index number of this class found in the symbol table. We perform similar code injection for object destruction. As a result, at each step of the replay, the checker is capable of enumerating the pointers of all objects of a certain class, and further reading their memory fields according to the symbol table. The runtime overhead is negligible since the actions are triggered only at object allocation and deallocation time.

$L1: (refs=0) START MyClass::MyClass
MyClass::MyClass: (refs=1)
this = ENTERFUNC
* [ESP], {ESP} = push 0x17                  // index number for MyClass
* call _imp__onConstruct@4, $out[ESP]       // call log func
[ESP], {ESP} = push EBP
EBP = mov ESP
[ESP], {ESP} = push ECX
...                                          // other code in original constructor
ESP = mov EBP
EBP, {ESP} = pop [ESP]
{ESP} = ret {ESP}, MyClass::MyClass
MyClass::MyClass: (refs=1) Offset: 32(0x0020)
EXITFUNC
$L2: (refs=0) END

This code injection is carried out in a fully automated way. In addition, we provide some APIs that allow developers to explicitly calculate and expose the states of an instance in the source code.
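The injected calls land in logging callbacks roughly like the sketch below; the record format and signatures are illustrative, not the exact WiDS Checker interface.

    // Illustrative logging callbacks. The checker later uses these
    // (operation, class index, address, sequence) records to enumerate the
    // live objects of each class at any replay step.
    #include <cstdint>
    #include <cstdio>

    static FILE* g_objLog = nullptr;      // opened when the instance starts (assumed)
    static uint64_t g_eventSeq = 0;       // advanced by the event loop (assumed)

    // 'C' marks construction, 'D' destruction.
    extern "C" void LogObjectEvent(char op, uint32_t classIndex, const void* objAddr) {
        if (g_objLog)
            fprintf(g_objLog, "%c %u %p %llu\n", op, classIndex, objAddr,
                    (unsigned long long)g_eventSeq);
    }

    // Called from the injected stubs. The real injected call passes only the
    // class index (note the @4 decoration above) and recovers the object pointer
    // from the constructor's context; both appear as explicit parameters here
    // only for clarity.
    extern "C" void onConstruct(uint32_t classIndex, const void* objAddr) {
        LogObjectEvent('C', classIndex, objAddr);
    }

    extern "C" void onDestruct(uint32_t classIndex, const void* objAddr) {
        LogObjectEvent('D', classIndex, objAddr);
    }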

3.3.2 Defining states and evaluating predicates

A script for predicate evaluation is composed of three parts: declarations of tables, declarations of internal variables, and the predicates. Figure 2 shows the script we used for checking the Chord protocol [28] implemented on Macedon (see Section 4.4 for details).

The first section (starting with "declare table") instructs the checker to observe objects of certain classes and refresh the states of certain member fields into the database. Each row of a table corresponds to one object in the system, and table columns correspond to member fields of the object. Each table has two built-in columns, instance id and memory addr, corresponding to the replayed instance and the object's memory address, respectively.

Figure 2: An example check script for Chord. The auxiliary information Stabilized is reset to 0 when joins or failures occur; otherwise it gradually grows to 1.

The declarations give the user shorthand notations to name the table and the states. A table stores global states from all instances; e.g., the table Node here maintains the investigated states of all Chord nodes in the system. Sometimes it is useful to keep a short history of a state for checking. We provide an optional keep version(N) after a column declaration to specify that the most recent N versions of the state should be kept in the table.

The second section allows users to define variables internal to the checker with the keyword declare derived. These variables can also have histories, again using keep version(N). Between begin python and end python are Python snippets that calculate the value of a named variable. A Python snippet has read access to the values of all prior declarations (i.e., data tables and internal variables), using the declared names. Data tables are regarded as enumerable Python containers, indexed by an (instance id, memory addr) pair.

The last section uses the keyword "predicate" to specify correctness properties based on all declared states and variables; these are evaluated after the tables are refreshed and the internal variables are evaluated. Each predicate is a Boolean expression. We support the set of common logical operators, e.g., and, or, imply. We also support two quantifiers, forall and exist, which specify the extent of validity of a predicate when dealing with tables. These built-in operators make it easy to specify many useful invariants. In Figure 2, the predicate states that the ring should be well formed: if node x believes node y to be its predecessor, then y must regard x as its successor. This is a critical property for the stabilization of the Chord topology.
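For illustration, the ring predicate's logic can be restated as a C++ sketch over per-node state snapshots; the NodeState structure and its field names are assumptions standing in for the declared table columns.

    // Illustrative restatement of the Figure 2 ring predicate: if node x believes
    // node y to be its predecessor, then y must list x as its successor.
    #include <cstdint>
    #include <map>

    struct NodeState {
        uint64_t id;            // the node's Chord identifier
        uint64_t predecessor;   // who this node believes precedes it
        uint64_t successor;     // who this node believes follows it
    };

    // 'nodes' maps instance id -> latest refreshed state, analogous to the Node table.
    bool RingWellFormed(const std::map<int, NodeState>& nodes) {
        for (const auto& x : nodes) {
            for (const auto& y : nodes) {
                if (x.second.predecessor == y.second.id &&
                    y.second.successor != x.second.id)
                    return false;   // predecessor/successor links disagree
            }
        }
        return true;
    }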


After each step of the replay, the checker does the following. First, it enumerates all objects of the classes declared in data tables in the memory of a replayed instance. It uses the type information and memory addresses provided by the log to refresh the tables, inserting or deleting rows and updating the columns accordingly. After updating the tables, the checker also knows which declared states have changed, and it re-evaluates only the affected derived values and predicates. When a predicate evaluates to "false," the checker outputs the violation into a violation report, which contains the violated predicate, the Lamport Clock value of the violation, and the auxiliary information defined in the script (discussed shortly).
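Putting these steps together, one replay step roughly follows the loop sketched below; the types and callables are placeholders for the checker's internals.

    // Sketch of the per-event checking loop (placeholder types and callables).
    #include <functional>
    #include <string>
    #include <vector>

    struct Event { int instanceId = 0; unsigned long long lamportClock = 0; };
    struct StateDatabase { /* state copies of declared tables and variables */ };

    struct Predicate {
        std::string name;
        std::function<bool(const StateDatabase&)> evaluate;  // compiled from the script
    };

    // Re-execute the event, refresh the checker's state copies, then re-evaluate
    // predicates and record violations. Replay and table refresh are passed in
    // as callables to keep the sketch self-contained.
    void CheckOneStep(const Event& e, StateDatabase& db,
                      const std::function<void(const Event&)>& replayEvent,
                      const std::function<void(const Event&, StateDatabase&)>& refreshTables,
                      const std::vector<Predicate>& predicates,
                      std::vector<std::string>& violations) {
        replayEvent(e);
        refreshTables(e, db);
        for (const auto& p : predicates)
            if (!p.evaluate(db))
                violations.push_back(p.name + " violated at clock " +
                                     std::to_string(e.lamportClock));
    }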

Sometimes it is useful to replay and check a segment of an execution, rather than starting from the beginning. The problem then is how to reconstruct the states maintained by the checker scripts when a checkpoint is loaded. To solve this, we support checkpoints in replay runs, which store both the replay context and the tables and variables used by the predicate scripts. These replay checkpoints can be used seamlessly for later checking. To start checking from an intermediate checkpoint taken during testing runs, developers have to provide additional scripts that set up the states required by the script from the memory of the instances in the checkpoint; otherwise, the checkpoint may lead to incorrect predicate evaluations.

3.3.3 Auxiliary information for violations

For safety properties that must hold all the time, every violation reveals a bug. In contrast, liveness properties are only guaranteed to hold eventually, so a violation of a liveness property is not necessarily a bug. For example, many overlay network systems employ self-stabilizing protocols to deal with churn, so most of their topology-related properties are liveness properties. As a result, checking liveness properties can generate a large number of false alarms that overwhelm the real violations. Adding a time bound to liveness properties is not always a desirable solution, because it is usually hard to derive an appropriate time bound.

To address this, we let users attach auxiliary information to predicates. The auxiliary information is a user-defined variable calculated along with the predicate, and it is output only when the predicate is violated. Developers can use this information to screen out false alarms or to prioritize violations. For liveness properties, an appropriate use of auxiliary information is to output a measure of a stabilization condition. For example, in Figure 2 we associate the eventual ring-consistency property with an auxiliary variable Stabilized ranging from 0 to 1, as a measure of stabilization that shows the "confidence" of the violation.
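One way to compute such a stabilization measure, assuming the checker exposes the time of the most recent membership-change event, is to let the confidence grow with the length of the quiet period, as in this sketch.

    // Illustrative auxiliary measure: 0 right after a join/failure event, growing
    // to 1 after 'settleSeconds' of quiet. The inputs are assumptions about what
    // the checker exposes to the script.
    double Stabilized(double nowSeconds, double lastChurnSeconds,
                      double settleSeconds /* e.g., 30.0 */) {
        double quiet = nowSeconds - lastChurnSeconds;
        if (quiet <= 0.0) return 0.0;
        if (quiet >= settleSeconds) return 1.0;
        return quiet / settleSeconds;   // linear ramp between 0 and 1
    }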

Figure 3: A screenshot of the message flow graph. The vertical lines represent the histories of different instances, arcs denote messages across instances, and the nodes correspond to event handlings. Arcs with both ends on the same vertical line are either timer events or messages the instance sends to itself.

We also maintain some built-in system parameters in the checker, e.g., the current time in the node, the current message type, and statistics of recent messages of each type. These parameters can be directly accessed in the scripts, and are useful in stabilization measurement. Our evaluation section contains more examples of using the auxiliary information.

3.4 Visualization tools

To pinpoint the root cause of a bug, a user often needs to trace back in time from a violation point. In addition to our replay facility, we provide a message flow graph generated from the message trace (Figure 3) to make this process easier. In our experience it is common practice to time-travel along the message flow, replay to a selected event point, and inspect the memory state of the replayed instance. We find that the visualization helps us understand system behaviors as well as root causes after catching violations.

4 Evaluation

In this section, we report our experience of finding bugs using WiDS Checker on a comprehensive set of real systems. Table 2 gives a summary of the results. The checking scripts that reveal the bugs are short and easy to write from system properties. For each of these systems, we give sufficient descriptions of their protocols and explain the bugs we found and how they were discovered.


We also summarize the benefits of WiDS Checker at the end of each case study. We then discuss relevant performance numbers and conclude with the lessons we have learned.

Application      # of lines   # of bugs   Lines of script
Paxos            588          2           29
Lock server      2,439        2           33
BitVault         17,582       3           181
Macedon-Chord    2,468        5           86

Table 2: Benchmark and result summary

4.1 Paxos

Paxos [13] provides a fault-tolerant distributed consensus service. It assumes an asynchronous communication model in which messages can be lost and delayed arbitrarily. Processes can operate at arbitrary speed, and may fail-stop and restart. Our implementation strictly follows the I/O-automata specification in [21], in which there are two types of processes: leaders and agents. Leaders propose values to the agents round by round, one value for each round. Agents can accept or reject each received proposal independently, and a decision is made if a majority of the agents agree on the same round/value pair. The important safety property in Paxos is agreement: once a decision is made, the decided value stays unchanged.

The main idea of the algorithm is to have leaders work cooperatively: endorse the latest accepted round/value pair in the system as they currently understand it, or propose their own value if there is none. The following is an informal description. The protocol works in two phases, learning (steps 1 and 2) and submitting (steps 3 and 4):

1. A leader starts a new round by sending a Collect request to all agents with its round number n. The round number must be unique and increasing.

2. An agent responds to a Collect request with its latest accepted round/value pair (if any), if and only if the round number in the request is greater than that of any Collect request to which it has already responded.

3. Once the leader has gathered responses from a majority of agents, it sends a Begin request to all agents to submit the value for round n. The submitted value is the previously accepted value in the highest-numbered round among the responses in Step 2, or the leader's own value if there is none.

4. An agent that receives a Begin message of round number n accepts the value (and accordingly updates its latest accepted value and round number), unless it has already responded to a Collect request with a higher round number than n.

Figure 4: Checker script for Paxos

In our implementation, leaders choose monotonically increasing round numbers from disjoint sets of integers. An agent broadcasts to all leaders when it accepts a value so that leaders can learn the decision. Each leader sleeps for a random period before starting a round, until it learns that a decision has been made. The test was carried out in simulation with seven processes acting as both leaders and agents. To simulate message loss, we randomly dropped messages with a probability of 0.3.

We wrote two predicates that are directly translated from the safety properties in the specification (see Figure 4). The Python snippet decision value calculates the value accepted by the majority of agents. The first predicate, GlobalConsistency, specifies the agreement property: all decision values should be identical. It is a distributed property derived from all agents. The second predicate, AcceptedRoundIncreasing, states a local property from the specification: the newly accepted round number increases monotonically in an agent.
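For illustration, the two properties can be restated roughly as follows; AgentState and the decision computation are simplified stand-ins for the script's tables and its Python snippet.

    // Illustrative restatement of the two Paxos predicates. AgentState mirrors
    // the checked fields; -1 denotes "nothing accepted yet". Names are assumptions.
    #include <map>
    #include <vector>

    struct AgentState {
        long long acceptedRound = -1;   // latest accepted round number
        long long acceptedValue = -1;   // latest accepted value
    };

    // The value accepted by a majority of agents, or -1 if no decision exists
    // yet (the role of the script's decision-value snippet).
    long long DecisionValue(const std::vector<AgentState>& agents) {
        std::map<long long, int> votes;
        for (const auto& a : agents)
            if (a.acceptedValue != -1) ++votes[a.acceptedValue];
        for (const auto& v : votes)
            if (2 * v.second > (int)agents.size()) return v.first;
        return -1;
    }

    // GlobalConsistency: once made, the decision never changes across steps.
    bool GlobalConsistency(long long previousDecision, long long currentDecision) {
        return previousDecision == -1 || currentDecision == previousDecision;
    }

    // AcceptedRoundIncreasing: an agent's accepted round number never decreases.
    bool AcceptedRoundIncreasing(const AgentState& before, const AgentState& after) {
        return after.acceptedRound >= before.acceptedRound;
    }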

The checker caught an implementation bug in Step 3 with the GlobalConsistency predicate, finding that after an agent accepted a new Begin message, the decision value changed to a different one. The root cause is that, in Step 3, after the leader gathers responses from a majority of agents, it sends a Begin request whose submitted value is taken from the most recently received agent response, instead of from the highest-numbered round among all responses. The predicate breaks immediately after the Begin message that changes the decision is handled. Tracing back a single step in the message flow and replaying the code for Step 3 allowed us to immediately identify the root cause.


The second bug is far more subtle. We found that, with very small probability, the accepted round number in an agent may decrease, which violates the second predicate. We ran our test several hundred times, each involving thousands of messages, but caught only one violation. Using replay to trace back from the violation point, we identified that the bug was not in the implementation, but in the specification itself. In Step 3 the Begin requests are sent to all agents; under message loss it is possible for an agent to receive a Begin request without the pairing Collect request of the round. This means that the agent can have its accepted round number greater than the round number to which it has responded in Collect requests (the two are kept in separate counters). Thus, based on Step 2 the agent may in the future respond to (and in Step 4 accept a value from) some smaller-numbered round, decreasing the accepted round number. With a small probability, the bug can actually break the ultimate agreement property. (We have constructed such a sequence of events starting from our violating trace.) However, this is a corner case and the protocol is robust enough that it never happened in our tests. After studying the specification carefully, we also understood where the original proof went wrong. This bug and our fix to the specification were confirmed by one of the authors.

Study on effectiveness. The checker gives precise bug points among thousands of events in both bug cases. After that, identifying the root cause becomes a matter of tracing back one or a few messages in replay. Replay (or logging alone, if we dump the states of Paxos nodes to the log) is necessary for understanding root causes; however, GlobalConsistency is a distributed property that cannot be directly verified from logs. Without the checker, a developer has to go through a huge number of events in the replay trace and check the correctness property manually, which is very inefficient. The second predicate, though a local one, proves the usefulness of rigorous predicate checking for distributed systems. Without this predicate, we would have missed the specification bug altogether.

4.2 Lock Server in Boxwood

The Boxwood [19] storage system implements a fault-tolerant distributed lock service that allows clients to acquire multiple-reader single-writer locks. A lock can be in one of three states: unlocked, shared, or exclusive. Multiple threads from one or more clients acquire or release locks by communicating with a server. The server uses a pendingqueue to record locks for which transactions are still in flight.

The system was written in C#. In order to check it, we ported it to C++ and the WiDS APIs in about two weeks; most of the effort was spent on the language differences (e.g., pointers).

Figure 5: Checker script for the lock server

In the resulting code, almost every line can be mapped to the original code. This enables us to map bugs found by WiDS Checker back to the original C# code. At first we checked the safety property that if multiple clients simultaneously own the same lock, they must all be in shared mode. However, during the experiments we did not find any violations in either simulated or reproduced runs. Next, we focused on deadlock and livelock checking at a larger scale and found both.

We ran the test inside the simulator with 20 clients, one server, and four locks. Each client has five threads that keep randomly requesting one of the locks and then releasing it after a random wait. Rather than writing sophisticated predicates to look for cycles of dependencies, we used a simple rule: if the protocol is deadlock-free, the pendingqueue should eventually be empty. The predicate is just that: return true if the pendingqueue is empty. Clearly there could be many false alarms. We attached an auxiliary output that computes how long the predicate has remained broken since the last time the queue size changed (Figure 5).
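A sketch of that predicate and its auxiliary output, with assumed inputs mirroring the server state the script reads, is shown below.

    // Illustrative lock-server check: the predicate itself is trivial, and the
    // auxiliary output measures how long it has been broken since the queue
    // size last changed. The parameters are assumptions about exposed state.
    struct PendingQueueCheck {
        bool predicate(int pendingQueueSize) const {
            return pendingQueueSize == 0;   // should eventually hold if deadlock-free
        }

        // Auxiliary information attached to each violation: seconds elapsed since
        // the last change in queue size. Large values hint at deadlock or livelock.
        double enoughTime(double nowSeconds, double lastQueueChangeSeconds) const {
            return nowSeconds - lastQueueChangeSeconds;
        }
    };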

Livelock. The livelock we discovered involves one lock and two clients. Based on replay and the message flow graph, we isolated the following bug scenario. Client A holds lock l in shared mode, and client B requests l in exclusive mode. The server inserts l into the pendinglock queue and sends a revoke message to A. While this message is on its way, another thread in A requests to upgrade l to exclusive. According to the protocol, A needs to release l first before acquiring it again with the desired mode. A does so by first setting a releasepending flag and then sending the server a release message.

According to the protocol, the server always denies a lock release request if the lock is in the pendinglock queue. The revoke message arrives at A and spawns a revoking thread, which in turn is blocked because the lock is being released (i.e., the releasepending flag is set).


Figure 6: Violations reported in the livelock case. The stripe at the bottom contains all the violations; the dark ones are those with enough time above the threshold.

When A finds its release is unsuccessful, it resets the releasepending flag and retries. However, the retry code is in a tight loop, so the release request is sent again and the releasepending flag is set again. As a result, the revoking thread wakes up only to find that it is blocked again. This cycle continues and repeats as a livelock. It is a livelock in the sense that there is a small possibility that the blocked revoking thread can cut in after the release response is handled but before the next release retry occurs.

Figure 6 visualizes the violation reports in a run that discovered the livelock. There were altogether five rounds of competition, and the bug appears in the final one. We use enough time to screen out false alarms among the violations; after screening, many false alarms are eliminated. The auxiliary output helped us prioritize the inspection of violations; otherwise the violations would be too numerous to inspect.

Deadlock. The deadlock case is more sophisticated. The client implementation has an optimization of using a socket pool for outstanding requests, and a thread will be blocked if it cannot find a socket to use. Because the socket pool is shared among threads, it creates extra dependencies and induces a deadlock.

We configured the socket pool to use only one socket. The resulting deadlock case involves four clients and two locks. The initial state is that A holds shared lock l1, and C holds exclusive lock l2. Next, B and D request exclusive mode on l2 and l1, respectively. After a convoluted sequence of events, including threads on A and B attempting to upgrade and release their locks, the system enters a deadlock state. Based on replay and the message flow graph, we found the deadlock cycle, which consists of a lock-acquiring thread whose request is blocked on the server because the lock is in the pendinglock queue, a revoking thread blocked by the lock's releasepending flag, and a release thread blocked by the unavailability of a socket.

Study on effectiveness. Unlike the Paxos case, in which the checker directly locates the bug point, here we do not have effective predicates that reveal the deadlock or livelock; the predicate based on pendingsize only provides hints for the starting point of the replay. The checker thus acts more like an advanced "conditional breakpoint" on the replay trace, and mining the root cause depends heavily on the replay facility and the message flow visualization tool.

Figure 7: BitVault ID space and index structure. B and C are owners of object x and y, respectively. Object x has three replicas, whereas object y has one dangling pointer.

4.3 BitVault Storage System

BitVault [32] is a scalable, content-addressable storage system that we built ourselves. The system is composed of more than 10 interdependent protocols, some of which were incrementally developed over a stretch of two years. BitVault achieves high reliability with object replication and fast parallel repair of lost replicas. Self-managing routines in BitVault ensure that eventually every existing object in the system has its specified replication degree.

To understand the bugs, it is necessary to describe BitVault's internal indexing structure and repair protocols. BitVault uses a DHT to index the replicas of each object. Each object has a unique 160-bit hash ID, and the entire ID space is uniformly partitioned by the hashes of the participating nodes, each node owning a particular ID range. An object's replicas can reside on any node, while its index (which records the locations of the object's replicas) is placed on the owner node that owns the ID of the object. The design of the index enables flexible control of replica placement. Figure 7 shows the index schema. The mapping between the ID space and nodes is maintained by a decentralized, weakly-consistent membership service, in which every node uses heartbeat messages to detect node failures within its ID neighborhood and maintains a local membership view of all nodes in the system. According to the membership view, the owner of an index can detect replica loss and then trigger the creation of another replica. A replica also republishes itself to the new owner node of the index when it detects a change of the owner node.

BitVault has the following correctness properties derived from its indexing scheme: 1) correct index ownership, i.e., for each object, eventually the owner node holds the index; 2) complete reference, i.e., when the system stabilizes, there should be neither dangling replica references nor orphan replicas; 3) correct replica degree, i.e., in a stabilized system every object has exactly degree (say 3) replicas. Because a node's membership view is only guaranteed to converge eventually, all of these must be considered liveness properties. We used these properties to check BitVault, and associated them with a stabilization measure based on membership-change events. All experiments were conducted on an 8-node configuration in a production run, and we found three bugs with the checker. Due to space limitations, we only explain two of them.
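For illustration, the replica-degree and reference-matching checks can be phrased as in the following sketch over assumed per-object snapshots gathered from index owners and replica holders.

    // Illustrative BitVault checks over assumed snapshots. ObjectView collects,
    // for one object ID, the replica locations recorded in its index and the
    // nodes actually holding a replica.
    #include <set>
    #include <string>

    struct ObjectView {
        std::set<std::string> indexedReplicas;   // locations listed in the index
        std::set<std::string> actualReplicas;    // nodes that really hold a replica
    };

    // Correct replica degree: in a stabilized system, exactly 'degree' replicas.
    bool CorrectReplicaDegree(const ObjectView& v, size_t degree /* e.g., 3 */) {
        return v.actualReplicas.size() == degree;
    }

    // Complete reference: no dangling index entries and no orphan replicas.
    bool CompleteReference(const ObjectView& v) {
        return v.indexedReplicas == v.actualReplicas;
    }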

Replica loss due to protocol races. BitVault passed intensive testing before we added a routine that balances load between nodes by moving replicas. After adding the load-balancing routine, we observed occasional replica loss from our monitoring GUI. Before we developed WiDS Checker, we did not know the root cause, because the bug case was buried in irrelevant log events.

The checker caught a violation of the replica-degree predicate, which states that an object's replica count should be no less than 3. From the violation point we traced back a few messages and quickly found how the replica count for this object went from 3 to 2. BitVault has a "remove-extra-replicas" routine that periodically removes over-repaired replicas, which are caused by transient failures and retries during replica repair. The bug was a race between the load-balancing routine and the remove-extra-replicas routine: the load-balancing routine copies the replica from node A to node B and deletes the replica on source node A, while at the same time the remove-extra-replicas routine detects that there are 4 replicas in total and chooses to delete the one on B.

Mishandling of transient failures. Sophisticated predicates enable more rigorous and thorough checks to catch bugs. As an example, the reference-correctness predicate ("no dangling references nor orphan replicas") checks the matching between index entries and replicas, and it helped us identify a subtle bug caused by transient failures that may degrade reliability. We killed a process on a node and restarted it after about 5 seconds. The predicate remained violated for quite a long time, even when the auxiliary measure for stabilization was high. We then refined the predicate to output the owners of the orphan replicas, which turned out to be the node that suffered the transient failure. The bug is caused by mishandling of the transient failure: the churn of the failed node cannot be observed by the other nodes, because the churn duration is shorter than the failure-detection timeout (15 seconds). As a result, other nodes will neither repair replicas nor re-publish the index to the failed node, whose memory-resident state (e.g., the index) has already been wiped out by the failure.

Study of effectiveness. Both bugs are non-deterministic, complicated, and sensitive to the environment, while catching them and understanding their root causes require detailed memory state. Deterministic replay is necessary because we cannot dump all the memory state into logs during production runs. As in the Paxos case, the predicates directly specify the complex correctness guarantees. Although they are liveness properties, predicate checking is very useful for pinpointing the starting point for inspection during replay. Suppose that we had only the replay facility, without checking. For the first bug, where replica loss had been observed, screening the traces to find the root cause would be tedious, though possible. In contrast, the second bug is difficult even to detect, because eventually the replicas and owner nodes notice the data loss through a self-managing routine in BitVault and repair it; the delay in repair thus goes undetected and silently degrades reliability.

4.4 Macedon-Chord

From the latest version of the Macedon release (1.2.1-20050531) [1], we generated WiDS-based implementations of three protocols: RandTree, Chord, and Pastry. Due to space limitations, we only report our findings on Macedon-Chord. The Macedon description files already contain the logic for initialization, joining, and routing, and we wrote drivers to add our testing scenarios. The test is carried out in the simulator: 10 nodes join the ring at around the same time, and one node is removed after stabilization. We use the predicate in Figure 2 to check that the ring is well-formed, and add another to check that the fingers point to their correct destinations.
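Figure 2 is not reproduced on this page; purely as a sketch of the idea, a ring-consistency check over a global snapshot could walk successor pointers and verify that they form a single cycle covering all live nodes. The accessors below are our assumptions, not the actual WiDS script.

def ring_consistent(nodes, successor):
    """nodes: set of live node IDs; successor: dict from node ID to its successor's ID."""
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, cur = set(), start
    while cur not in seen:
        if cur not in nodes:          # successor points outside the live set
            return False
        seen.add(cur)
        cur = successor[cur]
    # a well-formed ring closes back at the start and covers every live node
    return cur == start and seen == nodes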

Interestingly, this simple test revealed five bugs altogether. Two were not caught by the predicates: they are programming errors that crash the simulation (a divide-by-zero and dereferences of uninitialized pointers). The remaining three are protocol implementation bugs caught by the predicates.

The first bug caused a broken ring after the node leaves and was caught by the RingConsistency predicate³. Using replay to trace the origin of the incorrect successor value back from the violation point, we found that the field holding the successor's successor is wrongly assigned the hash of the corresponding node's IP address instead of the IP address itself.

The remaining two bugs, caught by the finger predicates, cause problems in the finger structures. One does not properly update the finger table, making the fingers converge to their correct targets much later than necessary. The last one is a mishandled boundary case. Let f be the ith finger, let f.curr denote its current value, and let f.start record the ID that is 2^i away from this node's ID. If f.start equals f.curr, then the current finger is the perfect one. When a new candidate node with ID y is evaluated, a checking function isinrange is called, and f.curr is replaced with y if y falls within [f.start, f.curr). isinrange regards the special case [f.start, f.start) as the entire ring and returns true for arbitrary y. As a result, the perfect finger is replaced by an arbitrary y. In later rounds the perfect finger makes it back, and the process continues to oscillate. Macedon-Chord was later rewritten in Mace. The author confirmed that the first four bugs were removed as a result of this exercise, yet the last one remains in the new version.
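The following is our reconstruction of this boundary case with hypothetical helper names; the Macedon code itself is not shown here, and the guard in the fixed version is only one plausible remedy. IDs are assumed to live on a ring of size 2^160.

def isinrange(y, start, end):
    """Half-open interval [start, end) on the ring; [x, x) is treated as the
    entire ring, which is the convention that triggers the bug below."""
    if start == end:
        return True                    # whole ring
    if start < end:
        return start <= y < end
    return y >= start or y < end       # interval wraps around zero

def update_finger_buggy(f_start, f_curr, y):
    # When f_curr == f_start the finger is already perfect, yet
    # isinrange(y, f_start, f_start) is True for any y, so the perfect
    # finger is replaced and the finger table starts to oscillate.
    return y if isinrange(y, f_start, f_curr) else f_curr

def update_finger_fixed(f_start, f_curr, y):
    # One possible fix: guard the perfect-finger case before consulting isinrange.
    if f_start == f_curr:
        return f_curr
    return y if isinrange(y, f_start, f_curr) else f_curr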

Study of effectiveness. Structured overlay protocols are perfect examples of the complexity of distributed logic. With predicates based on topology-structure properties, checking overlay protocols can be very effective and productive. Sometimes the violation report alone is sufficient to infer the root cause and find the buggy code; for example, the oscillation bug reveals the buggy logic for choosing the finger node ID without any need for replay.

4.5 Performance and overhead

The logging overhead depends heavily on the application. We performed a test run of BitVault with 4 nodes, each a commodity workstation with a 3 GHz Pentium IV CPU and 512 MB of memory; the results are shown in Figure 8. The experiment consists of inserting 100 objects of 100 KB each at the 3rd minute, crashing a node at the 6th minute, rejoining it at the 9th, and finally retrieving all objects at the 12th. The replication degree is set to 3. The peak of the performance hit occurs at the 3rd minute, with roughly a 2% runtime overhead. To collect logs more efficiently, we generate objects that mostly contain a single value so as to achieve high compression rates. As a result, the log size stays close to 600 KB after compression; uncompressed, it is around 60 MB.

Table 3 summarizes the performance of the checker. The second column shows how long the original run takes with logging turned on. Depending on the message rate and the complexity of predicate calculation, the checker slows down replay by a factor of 4 to 20; checking the 15-minute BitVault run takes about 37 minutes. We believe that, given the benefits of using the checker, these overheads are acceptable for debugging tasks.

Figure 8: Logging overhead: (a) running time; (b) disk space.

Application            Original run   w/o checker   w/ checker
Paxos (simu.)                  0.62          0.34         6.56
BitVault (deployed)          900.00        236.25      2219.77
Lock server (simu.)            5.34          3.14        11.99
Chord (simu.)                  1.64          0.719        3.00

Table 3: Running time (in seconds) for evaluating predicates. The "Original run", "w/o checker", and "w/ checker" columns show the running time for testing with logging turned on, replay alone, and replay with predicate checking, respectively.

4.6 Discussions

Our experience validates a number of design points. In almost all cases, on-the-fly scripting has allowed us to adaptively refine predicates in response to predicate errors and to chase newly discovered bugs. This iterative process is especially useful when we check reproduced runs from deployed experiments, since a trace can be reused to find all the bugs it contains. The experience also taught us a number of interesting lessons.

The advantage of predicate-based checking is its simplicity: it depends only on a snapshot of state (sometimes augmented with a short history) to define invariant properties and catch violations. This methodology does not require the developer to build a state machine inside the checker that mirrors the steps of the implementation. At the same time, however, it means that extra effort is needed to trace the root cause backwards from the violation point, which might be far behind. Time-travel with message flow definitely helps, yet it can be tedious if the violation involves long transactions and convoluted event sequences, as in the case of the Boxwood lock server. In these scenarios, predicate checking acts more like a programmable conditional breakpoint used to "query" the replay trace. Effectively pruning events so that only the relevant subset is replayed is one direction for our future work.

We also gained considerable insight into debugging distributed systems in general. The Paxos bug involves at least 5 processes; the deadlock case in the lock server involves 4 clients and 1 server, 2 locks, and 6 flags per client. In both cases the error is deep, and the system scale is larger than what a model checker is typically applied to, so it is unclear whether a model checker could explore such corner states successfully. Also, 9 of the 12 bugs exhibit correct message communication patterns. For example, the bug found in the Paxos specification does not cause any violation of the contract between proposers and acceptors on sending and receiving messages. Thus, without dumping detailed state, it is questionable whether log analysis could uncover those bugs.

5 Related Work

Our work is one of many proposals to deal with the challenging problem of debugging distributed systems. We contrast it with the three broad approaches below.

Deterministic replay. In response to the lack of control when testing in real deployments, a time machine that enables deterministic replay is needed [7, 8, 27]. Most of this work targets a single node. Our implementation is capable of reproducing the execution of a distributed system within one simulation process. Friday [9] enhances GDB with group replay and distributed watchpoints. We share their methodology that incremental refinement of predicates is an integral part of the cyclic debugging process; however, replay in WiDS Checker is much more efficient because we use one process to replay all instances. Beyond debugger extensions, WiDS Checker provides a unified framework with advanced features tailored for distributed systems. It allows simulation with controlled failures, which is valuable in early development stages. It can trace user-defined object states with historical versions and evaluate predicates only at event-handler boundaries, thus providing better functionality and performance for predicate checking. These unique features have proven important for debugging. Our current drawback is that the tool is limited to applications written with the WiDS API or Macedon.

Model-check actual implementation. Our work complements recent proposals [20, 30] that check actual implementations. Model checking explores the state space systematically, but state explosion typically restricts the scale. MaceMC [11] and WiDS Checker share many design concepts, but differ fundamentally in how a system is tested and how bugs are hit. MaceMC uses a model checker with heuristics to deal with liveness properties, but has to trade off system scale. As discussed in Section 4.6, some bugs rely on a fairly large scale and low-level implementation details, and cannot manifest in a downscaled or abstracted system (e.g., the deadlock case in Boxwood, which stems from exhausting the socket pool). For such cases, tools like WiDS Checker, which simulates low-level OS APIs and uses deployed testing with replay-based checking, will be necessary.

Log-based debugging. Communication structures encode rich information. Logs can be used to identify performance bottlenecks, as advocated by Magpie [5], Pinpoint [6], and many others [4, 23]. The logs that WiDS Checker captures contain enough information to enable performance debugging, but the focus of this study is on correctness debugging. Pip [23] also proposes that log-based analysis can root out structural bugs. In general, we are more confident that log-based analysis can reveal performance bugs than structural ones (a close read of Pip's results seems to confirm this). As demonstrated by the bug cases in this study, a correct communication pattern is only a necessary, not a sufficient, condition for correct properties. Logging detailed states is prohibitively expensive; it is easier to log non-deterministic events and reconstruct the states. From our experience, full-state inspection with time-travel is critical to identifying the root cause of a correctness bug. A nice by-product is that it also enables us to exhaustively check all the bugs contained in a given reproduced run, an endeavor that is usually quite labor-intensive. We also differ on how correctness is expressed. Pip can construct path expectations from history; since we also take logs, this alternative is available to us as well. However, we believe that our assertions are more powerful: they represent the minimum understanding of a system's correct properties, and are much easier to write and refine than a model.

Singh et al. [26] propose to build an online monitoring and debugging infrastructure based on a declarative development language [16]. They require the system to be programmed in a highly abstracted deductive model to enable checking, whereas WiDS Checker mimics low-level OS API semantics (threads, sockets, and file I/O) and enhances them with replay in order to check predicates and find bugs in existing systems. In addition, their online checking methodology is restricted by the communication and computation overheads of distributed systems, and thus their checking facility is less powerful than the offline checking used in WiDS Checker. Since online and offline checking complement each other, our future work will look at interesting combinations of the two methods.

6 Conclusion and On-going Work

In a unified framework, WiDS Checker is capable of checking an implementation using both simulated runs and deterministically reproduced runs reconstructed from traces logged in deployment. Its versatile script language allows a developer to write and incrementally refine assertions on the fly to check properties that are otherwise impossible to check. Single-console debugging and message-flow-based time-travel allowed us to quickly identify many non-trivial bugs in a suite of complex, real systems.

Our on-going work addresses the limitation in supporting legacy binaries by using function-interception techniques. We intercept OS system calls and APIs to turn them into events, and use an event queue to schedule the execution, including thread switches, OS notifications, and socket operations. We also intercept a layer beneath the socket interface to prepend a Lamport clock annotation to each network data chunk transparently. In this way an application can be logged and faithfully replayed without modification, and we can further bring the simulation, virtualization, and checking functions to legacy binaries in the same fashion as WiDS Checker.
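As a rough illustration of the clock-annotation idea only: a thin wrapper below the socket interface could prefix each outgoing chunk with the sender's Lamport clock and apply the merge rule on receipt. The framing format and class below are our assumptions, not the actual interception layer.

import socket
import struct

class ClockedSocket:
    def __init__(self, sock: socket.socket, clock: int = 0):
        self.sock = sock
        self.clock = clock            # Lamport clock of this process

    def send_chunk(self, payload: bytes) -> None:
        self.clock += 1               # local event: sending a message
        header = struct.pack("!QI", self.clock, len(payload))
        self.sock.sendall(header + payload)

    def recv_chunk(self) -> bytes:
        header = self._recv_exact(12)
        remote_clock, length = struct.unpack("!QI", header)
        self.clock = max(self.clock, remote_clock) + 1   # Lamport merge rule
        return self._recv_exact(length)

    def _recv_exact(self, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed connection")
            buf += chunk
        return buf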

7 Acknowledgments

We would like to thank our shepherd Petros Maniatis and the anonymous reviewers for their comments and suggestions. We are also indebted to Linchun Sun, Fangcheng Sun, Shuo Tang, and Rui Guo for their help with the WiDS Checker experiments, as well as Roberto De Prisco, Lidong Zhou, Charles Killian, and Amin Vahdat for verifying the bugs we found.

References

[1] Macedon. http://macedon.ucsd.edu/release/.
[2] Phoenix compiler framework. http://research.microsoft.com/phoenix/phoenixrdk.aspx.
[3] WiDS release. http://research.microsoft.com/research/downloads/details/1c205d20-6589-40cb-892b-8656fc3da090/details.aspx.
[4] AGUILERA, M. K., MOGUL, J. C., WIENER, J. L., REYNOLDS, P., AND MUTHITACHAROEN, A. Performance debugging for distributed systems of black boxes. In SOSP (2003).
[5] BARHAM, P., DONNELLY, A., ISAACS, R., AND MORTIER, R. Using Magpie for request extraction and workload modelling. In OSDI (2004).
[6] CHEN, M., KICIMAN, E., FRATKIN, E., FOX, A., AND BREWER, E. Pinpoint: Problem determination in large, dynamic, Internet services. In Int. Conf. on Dependable Systems and Networks (2002).
[7] DUNLAP, G. W., KING, S. T., CINAR, S., BASRAI, M. A., AND CHEN, P. M. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. SIGOPS Oper. Syst. Rev. 36, SI (2002).
[8] GEELS, D., ALTEKAR, G., SHENKER, S., AND STOICA, I. Replay debugging for distributed applications. In USENIX (2006).
[9] GEELS, D., ALTEKAR, G., MANIATIS, P., ROSCOE, T., AND STOICA, I. Friday: Global comprehension for distributed replay. In NSDI (2007).
[10] JUMP, M., AND MCKINLEY, K. S. Cork: Dynamic memory leak detection for garbage-collected languages. In POPL (2007).
[11] KILLIAN, C., ANDERSON, J. W., JHALA, R., AND VAHDAT, A. Life, death, and the critical transition: Finding liveness bugs in systems code. In NSDI (2007).
[12] LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (1978).
[13] LAMPORT, L. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (1998).
[14] LIN, S. D., PAN, A. M., GUO, R., AND ZHANG, Z. Simulating large-scale P2P systems with the WiDS toolkit. In MASCOTS (2005).
[15] LIN, S. D., PAN, A. M., ZHANG, Z., GUO, R., AND GUO, Z. Y. WiDS: An integrated toolkit for distributed system development. In HotOS (2003).
[16] LOO, B. T., CONDIE, T., HELLERSTEIN, J. M., MANIATIS, P., ROSCOE, T., AND STOICA, I. Implementing declarative overlays. SIGOPS Oper. Syst. Rev. 39, 5 (2005).
[17] LYNCH, N. Distributed Algorithms. 1996, ch. 8.
[18] LYNCH, N., AND TUTTLE, M. An introduction to input/output automata. Technical Memo MIT/LCS/TM-373 (1989).
[19] MACCORMICK, J., MURPHY, N., NAJORK, M., THEKKATH, C. A., AND ZHOU, L. Boxwood: Abstractions as the foundation for storage infrastructure. In OSDI (2004).
[20] MUSUVATHI, M., AND ENGLER, D. Model checking large network protocol implementations. In NSDI (2004).
[21] PRISCO, R. D., LAMPSON, B. W., AND LYNCH, N. A. Fundamental study: Revisiting the Paxos algorithm. Theoretical Computer Science 243, 1-2 (2000).
[22] QIN, F., LU, S., AND ZHOU, Y. SafeMem: Exploiting ECC-memory for detecting memory leaks and memory corruption during production runs. In HPCA (2005).
[23] REYNOLDS, P., KILLIAN, C., WIENER, J. L., MOGUL, J. C., SHAH, M. A., AND VAHDAT, A. Pip: Detecting the unexpected in distributed systems. In NSDI (2006).
[24] RODRIGUEZ, A., KILLIAN, C., BHAT, S., KOSTIC, D., AND VAHDAT, A. MACEDON: Methodology for automatically creating, evaluating, and designing overlay networks. In NSDI (2004).
[25] SAVAGE, S., BURROWS, M., NELSON, G., SOBALVARRO, P., AND ANDERSON, T. Eraser: A dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst. 15, 4 (1997).
[26] SINGH, A., ROSCOE, T., MANIATIS, P., AND DRUSCHEL, P. Using queries for distributed monitoring and forensics. In EuroSys (2006).
[27] SRINIVASAN, S. M., KANDULA, S., ANDREWS, C. R., AND ZHOU, Y. Flashback: A lightweight extension for rollback and deterministic replay for software debugging. In USENIX (2004).
[28] STOICA, I., MORRIS, R., LIBEN-NOWELL, D., KARGER, D. R., KAASHOEK, M. F., DABEK, F., AND BALAKRISHNAN, H. Chord: A scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Trans. Netw. 11, 1 (2003).
[29] YANG, H., PIUMATTI, M., AND SINGHAL, S. K. Internet-scale testing of PNRP using the WiDS network simulator. In P2P Conference (2006).
[30] YANG, J., TWOHEY, P., ENGLER, D., AND MUSUVATHI, M. Using model checking to find serious file system errors. ACM Trans. Comput. Syst. 24, 4 (2006).
[31] YU, Y., RODEHEFFER, T., AND CHEN, W. RaceTrack: Efficient detection of data race conditions via adaptive tracking. In SOSP (2005).
[32] ZHANG, Z., LIAN, Q., LIN, S. D., CHEN, W., CHEN, Y., AND JIN, C. BitVault: A highly reliable distributed data retention platform. Microsoft Research Tech Report MSR-TR-2005-179 (2005).

Notes

1. WiDS, a recursive acronym that stands for "WiDS implements Distributed System", is the name of the toolkit.
2. For conciseness, we omit the learners of Paxos in our description.
3. This is only a simplified ring-consistency predicate, for illustration purposes. A more complicated one that also detects disjoint and loopy rings would possibly catch more problems.
