

Parallel Symbolic Execution for

Automated Real-World Software Testing

Stefan Bucur Vlad Ureche Cristian Zamfir George Candea

School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

{stefan.bucur,vlad.ureche,cristian.zamfir,george.candea}@epfl.ch

Abstract

This paper introduces Cloud9, a platform for automated testing of real-world software. Our main contribution is the scalable parallelization of symbolic execution on clusters of commodity hardware, to help cope with path explosion. Cloud9 provides a systematic interface for writing “symbolic tests” that concisely specify entire families of inputs and behaviors to be tested, thus improving testing productivity. Cloud9 can handle not only single-threaded programs but also multi-threaded and distributed systems. It includes a new symbolic environment model that is the first to support all major aspects of the POSIX interface, such as processes, threads, synchronization, networking, IPC, and file I/O. We show that Cloud9 can automatically test real systems, like memcached, Apache httpd, lighttpd, the Python interpreter, rsync, and curl. We show how Cloud9 can use existing test suites to generate new test cases that capture untested corner cases (e.g., network stream fragmentation). Cloud9 can also diagnose incomplete bug fixes by analyzing the difference between buggy paths before and after a patch.

Categories and Subject Descriptors D.2.5 [Software Engineering]: Testing and Debugging—Symbolic Execution

General Terms Reliability, Verification

1. Introduction

Software testing is resource-hungry, time-consuming, labor-intensive, and prone to human omission and error. Despite massive investments in quality assurance, serious code defects are routinely discovered after software has been released [RedHat], and fixing them at so late a stage carries substantial cost [McConnell 2004]. It is therefore imperative

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

EuroSys’11, April 10–13, 2011, Salzburg, Austria. Copyright © 2011 ACM 978-1-4503-0634-8/11/04…$10.00. Reprinted from EuroSys’11, Proceedings of the 6th ACM SIGOPS/EuroSys Conference on Computer Systems, April 10–13, 2011, Salzburg, Austria, pp. 1–15.

to overcome the human-related limitations of software testing by developing automated software testing techniques.

Existing automated techniques, like model checking and symbolic execution, are highly effective [Cadar 2008, Holzmann 2008], but their adoption in industrial general-purpose software testing has been limited. We blame this gap between research and practice on three challenges faced by automated testing: scalability, applicability, and usability.

First, path explosion—the fact that the number of paths through a program is roughly exponential in program size—severely limits the extent to which large software can be thoroughly tested. One must be content either with low coverage for large programs, or apply automated tools only to small programs. For example, we do not know of any symbolic execution engine that can thoroughly test systems with more than a few thousand lines of code (KLOC).

Second, real-world systems interact heavily with the environment (e.g., through system calls, library calls) and may communicate with other parties (e.g., over sockets, IPC, shared memory). For an automated testing tool to be used in practice, it must be capable of handling these interactions.

Third, an important hurdle to adoption is that, in order to productively use most current tools, a user must become as versed in the underlying technology as the tool’s developers.

Our goal in building Cloud9 is to address these challenges: we envision Cloud9 as a testing platform that bridges the gap between symbolic execution and the requirements of automated testing in the real world. As will be seen later, doing so requires solving a number of research problems.

Cloud9 helps cope with path explosion by parallelizing symbolic execution in a way that scales well on large clusters of cheap commodity hardware. Cloud9 scales linearly with the number of nodes in the system, thus enabling users to “throw hardware at the problem.” Doing so without Cloud9 is hard, because single computers with enough CPU and memory to symbolically execute large systems either do not exist today or are prohibitively expensive.

We built into Cloud9 features we consider necessary for a practical testing platform: besides single-threaded single-node systems, Cloud9 also handles multi-threaded and distributed software. It offers fine-grained control over the behavior being tested, including the injection of faults and the scheduling of threads. Cloud9 embeds a new symbolic model of a program’s environment that supports all major aspects of the POSIX interface, including processes, threads, synchronization, networking, IPC, and file I/O. Cloud9 provides an easy-to-use API for writing “symbolic tests”—developers can concisely specify families of inputs and environment behaviors for which to test the target software, without having to understand how symbolic execution works, which program inputs need to be marked symbolic, or how long the symbolic inputs should be. By encompassing entire families of behaviors, symbolic tests cover substantially more cases than “concrete” regular tests.

This paper makes three contributions: (1) the first cluster-based parallel symbolic execution engine that scales linearly with the number of nodes; (2) a testing platform for writing symbolic tests; and (3) a quasi-complete symbolic POSIX model that makes it possible to use symbolic execution on real-world systems. We built a Cloud9 prototype that runs on Amazon EC2, private clusters, and multicore machines.

In this paper we present Cloud9 and report on our experience using it to test several parallel and distributed systems; we describe some of the bugs found and explain how we debugged flawed bug fixes. We describe the problem (§2) and Cloud9’s design (§3–§5), present our prototype (§6), evaluate it (§7), survey related work (§8), and conclude (§9).

2. Problem Overview

Testing a program consists of exercising many different paths through it and checking whether they “do the right thing.” In other words, testing is a way to produce partial evidence of correctness, and thus increase confidence in the tested software. Yet, due to the typically poor coverage one can get today, testing often turns into a mere hunt for bugs.

In practice, most software test harnesses consist of manually written tests that are run periodically; regression test suites provide an automated way of checking whether new bugs have entered the code [McConnell 2004]. Such suites tend to be tedious to write and maintain but, once in place, they can be reused and extended. In practice, the state of the art consists mostly of fuzzing, i.e., trying various inputs in the hope of finding bugs and improving test coverage.

In research, the state of the art consists of model checkers and automated test generators based on symbolic execution [Cadar 2008, Godefroid 2005]. Instead of running a program with regular concrete inputs (e.g., x=5), symbolic execution consists of running a program with “symbolic” inputs that can take on all values allowed by the type (e.g., x=λ, where λ ∈ ℕ). Whenever a conditional branch is encountered that involves a predicate π that depends (directly or indirectly) on x, state and execution are forked into two alternatives: one following the then-branch (π) and another following the else-branch (¬π). The two executions can now be pursued independently. When a bug is found, test generators can compute concrete values for program inputs that take the program to the bug location. This approach is efficient because it analyzes code for entire classes of inputs at a time, thus avoiding the redundancy inherent in fuzzing.

The first challenge for such tools is path explosion, as mentioned earlier. One way to cope is to memoize the symbolic execution of sub-paths into test summaries that can subsequently be reused when the same sub-path is encountered again, as done in compositional test generation [Godefroid 2007]. Alternatively, it is possible to use various heuristics to prioritize the most interesting paths first, as done in KLEE [Cadar 2008]. Another approach is to symbolically execute only paths that are of interest to the test, as done in selective symbolic execution [Chipounov 2009].

We pursue a complementary approach—parallel symbolic execution—in which we symbolically execute a program in parallel on a cluster, thus harnessing the machines into a “distributed computer” whose aggregate CPU and memory surpass that of an individual machine. An alternative to a cluster-based approach would be to run a classic single-node symbolic execution engine on a Blue Gene-like supercomputer with vast shared memory and CPUs communicating over MPI. Supercomputers, however, are expensive, so we favor instead clusters of cheap commodity hardware.

One way to parallelize symbolic execution is to statically divide up the task among nodes and have them run independently. However, when running on large programs, this approach leads to high workload imbalance among nodes, making the entire cluster proceed at the pace of the slowest node [Staats 2010]. If this node gets stuck, for instance, while symbolically executing a loop, the testing process may never terminate. Parallelizing symbolic execution on shared-nothing clusters in a way that scales well is difficult.

The second challenge is mediating between a program and its environment, i.e., symbolically executing a program that calls into libraries and the OS, or communicates with other systems, neither of which execute symbolically. One possible approach is to simply allow the call to go through into the “concrete” environment (e.g., to write a file) [Cadar 2006, Godefroid 2005]; unfortunately, this causes the environment to be altered for all forked executions being explored in parallel, thus introducing inconsistency. Another approach is to replace the real environment with a symbolic model, i.e., a piece of code linked with the target program that provides the illusion of interacting with a symbolically executing environment. For instance, KLEE uses a symbolic model of the file system [Cadar 2008]. Of course, real-world programs typically interact in richer ways than just file I/O: they fork processes, synchronize threads, etc.

We originally viewed the building of a complete environment model as an engineering task, but our “mere engineering” attempt failed: for any functionality that, in a normal execution, requires hardware support (such as enforcing isolation between address spaces), the core symbolic execution engine had to be modified. The research challenge therefore is to find the minimal set of engine primitives required to support a rich model of a program’s environment.

The third challenge is using an automated test generator in the context of a development organization’s quality assurance processes. To take full advantage of the automated exploration of paths, a testing tool must provide ways to control all aspects of the environment. For example, there needs to be a clean API for injecting failures at the boundary between programs and their environment, there must be a way to control thread schedules, and so on. There should be a way to programmatically orchestrate all environment-related events, but doing so should not require deep expertise in the technology behind the testing tools themselves.

The work presented here aims to address these three challenges. Cluster-based parallel symbolic execution (§3) provides the illusion of running a classic symbolic execution engine on top of a large, powerful computer. Without changing the exponential nature of the problem, parallel symbolic execution harnesses cluster resources to make it feasible to run automated testing on larger systems than was possible until now. Our work complements and benefits all tools and approaches based on symbolic execution. We describe a way to accurately model the environment (§4) with sufficient completeness to test complex, real software, like the Apache web server and the Python interpreter. We present the APIs and primitives that we found necessary in developing a true testing platform (§5). We show how using these APIs enables, for instance, finding errors in bug patches by reproducing environment conditions that would otherwise have been hard or impossible to set up with regular test cases.

3. Scalable Parallel Symbolic Execution

In this section we present the design of the Cloud9 engine, focusing on the algorithmic aspects: after a conceptual overview (§3.1), we describe how Cloud9 operates at the worker level (§3.2) and then at the cluster level (§3.3).

3.1 Conceptual Overview

Classic Symbolic Execution Cloud9 employs symbolic execution, an automated testing technique that has recently shown a lot of promise [Cadar 2008, Godefroid 2005].

A symbolic execution engine (SEE) executes a program with unconstrained symbolic inputs. When a branch involves symbolic values, execution forks into two parallel executions (see §2), each with a corresponding clone of the program state. Symbolic values in the clones are constrained to make the branch condition evaluate to false (e.g., λ ≥ MAX) respectively true (e.g., λ < MAX). Execution recursively splits into sub-executions at each subsequent branch, turning an otherwise linear execution into an execution tree (Fig. 1).

In this way, all execution paths in the program are ex-plored. To ensure that only feasible paths are explored,

void write(int x)
{
    if (x < MAX) {
        if (x > 0)
            ...
        else
            ...
    } else {
        if (x > 3)
            ...
        else
            ...
    }
}

[Execution tree: the root branches on x < MAX; the true side then branches on x > 0 and the false side on x > 3, with true/false edges yielding one leaf per path.]

Figure 1: Symbolic execution produces an execution tree.

the SEE uses a constraint solver to check the satisfiability of each branch’s predicate, and it only follows satisfiable branches. If a bug is encountered (e.g., a crash or a hang) along one of the paths, the solution to the constraints accumulated along that path yields the inputs that take the tested program to the bug—these inputs constitute a test case.
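This process can be made concrete with a toy sketch (ours, not Cloud9’s engine): each state carries a path condition, every symbolic branch forks it, and a “solver”—here just brute-force enumeration over a small integer range, standing in for a real constraint solver—prunes infeasible paths and emits one concrete test input per feasible path of the write() example from Fig. 1. MAX and the enumeration range are arbitrary illustration choices.

```python
MAX = 10  # arbitrary bound for the write() example

def satisfiable(pc):
    """Toy 'solver': find a concrete x satisfying every predicate in the
    path condition, or None if the path is infeasible."""
    return next((x for x in range(-50, 50) if all(p(x) for p in pc)), None)

def explore(pc, tree, tests):
    """Depth-first symbolic exploration of a branch structure:
    fork on each predicate, following only satisfiable branches."""
    if tree is None:                       # leaf: path constraints -> test case
        tests.append(satisfiable(pc))
        return
    pred, then_tree, else_tree = tree
    for cond, sub in ((pred, then_tree), (lambda x: not pred(x), else_tree)):
        if satisfiable(pc + [cond]) is not None:   # prune infeasible paths
            explore(pc + [cond], sub, tests)

# Branch structure of write() from Fig. 1: (predicate, then-subtree, else-subtree)
tree = (lambda x: x < MAX,
        (lambda x: x > 0, None, None),    # then-side: if (x > 0)
        (lambda x: x > 3, None, None))    # else-side: if (x > 3)

tests = []
explore([], tree, tests)
print(tests)  # one concrete input per feasible path; x >= MAX & x <= 3 is pruned
```

Note that only three of the four syntactic paths survive: the constraints x ≥ MAX and x ≤ 3 are contradictory, so the solver never generates a test for that path.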

Parallel Symbolic Execution Since the size of the execution tree is exponential in the number of branches, and the complexity of constraints increases as the tree deepens, state-of-the-art SEEs can quickly bottleneck on CPU and memory even for programs with just a couple KLOC. We therefore build a parallel SEE that runs on a commodity cluster and enables “throwing hardware at the problem.”

The key design goal is to enable individual cluster nodes to explore the execution tree independently of each other. One way of doing this is to statically split the execution tree and farm off subtrees to worker nodes. Alas, the contents and shape of the execution tree are not known until the tree is actually explored, and finding a balanced partition (i.e., one that will keep all workers busy) of an unexpanded execution tree is undecidable. Besides subtree size, the amount of memory and CPU required to explore a subtree is also undecidable, yet must be taken into account when partitioning the tree. Since the methods used so far in parallel model checkers [Barnat 2007, Holzmann 2008] rely on static partitioning of a finite state space, they cannot be directly applied to the present problem. Instead, Cloud9 partitions the execution tree dynamically, as the tree is being explored.

Dynamic Distributed Exploration Cloud9 consists of worker nodes and a load balancer (LB). Workers run independent SEEs, based on KLEE [Cadar 2008]. They explore portions of the execution tree and send statistics on their progress to the LB, which in turn instructs, whenever necessary, pairs of workers to balance each other’s workload. Encoding and transfer of work is handled directly between workers, thus taking the load balancer off the critical path.

The goal is to dynamically partition the execution tree such that the parts are disjoint (to avoid redundant work) and together they cover the global execution tree (for exploration to be complete). We aim to minimize the number of work transfers and associated communication overhead. A fortuitous side effect of dynamic partitioning is the transparent handling of fluctuations in resource quality, availability, and cost, which are inherent to large clusters in cloud settings.

Cloud9 operates roughly as follows: the first component to come up is the load balancer. When the first worker node W1 joins the Cloud9 cluster, it connects to the LB and receives a “seed” job to explore the entire execution tree. When the second worker W2 joins and contacts the LB, it is instructed to balance W1’s load, which causes W1 to break off some of its unexplored subtrees and send them to W2 in the form of jobs. As new workers join, the LB has them balance the load of existing workers. The workers regularly send the LB status updates on their load in terms of exploration jobs, along with current progress in terms of code coverage, encoded as a bit vector. Based on workers’ load, the LB can issue job transfer requests to pairs of workers in the form ⟨source worker, destination worker, # of jobs⟩. The source node decides which particular jobs to transfer.

3.2 Worker-level Operation

A worker’s visibility is limited to the subtree it is exploring locally. As Wi explores and reveals the content of its local subtree, it has no knowledge of what Wj’s (i ≠ j) subtree looks like. No element in the system—not even the load balancer—maintains a global execution tree. Disjointness and completeness of the exploration (see Fig. 2) are ensured by the load-balancing algorithm.

[Figure shows the local views of workers W1, W2, and W3 and the aggregate situation; tree nodes are marked as fence nodes, dead nodes, or candidate nodes.]

Figure 2: Dynamic partitioning of exploration in Cloud9.

As will be explained later, each worker has the root of the global execution tree. The tree portion explored thus far on a worker consists of three kinds of nodes: (1) internal nodes that have already been explored and are thus no longer of interest—we call them dead nodes; (2) fence nodes that demarcate the portion being explored, separating the domains of different workers; and (3) candidate nodes, which are nodes ready to be explored. A worker exclusively explores candidate nodes; it never expands fence or dead nodes.

Candidate nodes are leaves of the local tree, and they form the exploration frontier. The work transfer algorithm ensures that frontiers are disjoint between workers, thus ensuring that no worker duplicates the exploration done by another worker. At the same time, the union of all frontiers in the system corresponds to the frontier of the global execution tree. The goal of a worker Wi at every step is to choose the next candidate node to explore and, when a bug is encountered, to compute the inputs, thread schedule, and system call returns that would take the program to that bug.

The implementation of this conceptual model lends itself to many optimizations, some of which we cover in §6. Broadly speaking, judicious use of copy-on-write and a novel state-encoding technique ensure that actual program state is maintained only for candidate and fence nodes.

Worker-to-Worker Job Transfer When the global exploration frontier becomes poorly balanced across workers, the load balancer chooses a loaded worker Ws and a less loaded worker Wd and instructs them to balance load by sending n jobs from Ws to Wd. In the extreme, Wd is a new worker or one that is done exploring its subtree and has zero jobs left.

Ws chooses n of its candidate nodes and packages them up for transfer to Wd. Since a candidate node sent to another worker is now on the boundary between the work done by Ws and the work done by Wd, it becomes a fence node at the sender. This conversion prevents redundant work.

A job can be sent in at least two ways: (1) serialize the content of the chosen node and send it to Wd, or (2) send Wd the path from the tree root to the node, and rely on Wd to “replay” that path and obtain the contents of the node. Choosing one vs. the other is a trade-off between time to encode/decode and network bandwidth: option (1) requires little work to decode, but consumes bandwidth (the state of a real program is typically at least several megabytes), while encoding a job as a path requires replay on Wd. We assume that large commodity clusters have abundant CPU but meager bisection bandwidth, so in Cloud9 we chose to encode jobs as the path from the root to the candidate node. As an optimization, we exploit common path prefixes: jobs are not encoded separately; rather, the corresponding paths are aggregated into a job tree and sent as such.
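The prefix-sharing optimization can be sketched as follows (an illustrative sketch, not Cloud9’s wire format): each job is the string of branch decisions taken from the root, and a batch of jobs is merged into a trie so a shared prefix is transmitted only once; the destination worker then recovers the individual job paths.

```python
def build_job_tree(paths):
    """Aggregate root-to-candidate paths (strings of 0/1 branch choices)
    into a prefix tree, so shared prefixes are encoded only once."""
    tree = {}
    for path in paths:
        node = tree
        for bit in path:
            node = node.setdefault(bit, {})
        node["$"] = True  # mark a candidate (job) endpoint
    return tree

def decode_jobs(tree, prefix=""):
    """On the destination worker: recover the individual job paths."""
    jobs = []
    if tree.get("$"):
        jobs.append(prefix)
    for bit in ("0", "1"):
        if bit in tree:
            jobs.extend(decode_jobs(tree[bit], prefix + bit))
    return jobs

# Three candidate nodes sharing the prefix "01" travel as one small trie.
jobs = ["0100", "0101", "011"]
tree = build_job_tree(jobs)
assert sorted(decode_jobs(tree)) == sorted(jobs)
```

The same idea applies regardless of how a branch decision is encoded; the 0/1 strings here simply stand in for per-branch choices along a path.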

[Inline diagram: an exploration frontier containing materialized dead, materialized candidate, materialized fence, and virtual candidate nodes.]

When the job tree arrives at Wd, it is imported into Wd’s own subtree, and the leaves of the job tree become part of Wd’s frontier (at the time of arrival, these nodes may lie “ahead” of Wd’s frontier). Wd keeps the nodes in the incoming jobs as virtual nodes, as opposed to materialized nodes that reside in the local subtree, and replays paths only lazily. A materialized node is one that contains the corresponding program state, whereas a virtual node is an “empty shell” without corresponding program state. In the common case, the frontier of a worker’s local subtree contains a mix of materialized and virtual nodes, as shown in the accompanying diagram.

As mentioned earlier, a worker must choose at each step which candidate node to explore next—this choice is guided by a strategy. Since the set of candidate nodes now contains both materialized and virtual nodes, it is possible for the strategy to choose a virtual node as the next one to explore. When this happens, the corresponding path in the job tree is replayed (i.e., the symbolic execution engine executes that path); at the end of this replay, all nodes along the path are dead, except the leaf node, which has converted from virtual to materialized and is now ready to be explored. Note that, while exploring the chosen job path, each branch produces child program states; any such state that is not part of the path is marked as a fence node, because it represents a node that is being explored elsewhere, so Wd should not pursue it.

Summary A node N in Wi’s subtree has two attributes, Nstatus ∈ {materialized, virtual} and Nlife ∈ {candidate, fence, dead}. A worker’s frontier Fi is the set of all candidate nodes on worker Wi. The worker can only explore nodes in Fi, i.e., dead nodes are off-limits and so are fence nodes, except if a fence node needs to be explored during the replay of a job path. The union ∪Fi equals the frontier of the global execution tree, ensuring that the aggregation of worker-level explorations is complete. The intersection ∩Fi = ∅, thus avoiding redundancy by ensuring that workers explore disjoint subtrees. Fig. 3 summarizes the life cycle of a node.

[State diagram over the node states Materialized Candidate, Virtual Candidate, Materialized Fence, and Materialized Dead, with transitions labeled: explore; send job Jx to destination; job Jx returns; encountered during replay; after replay of job Jx; created during exploration; created during replay; created by job Jx.]

Figure 3: Transition diagram for nodes in a worker’s subtree.

As suggested in Fig. 3, once a tree node is dead, it has reached a terminal state; therefore, a dead node’s state can be safely discarded from memory. This enables workers to maintain program states only for candidate and fence nodes.
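Under our reading of Fig. 3 and the surrounding text, the per-node bookkeeping can be sketched as a small state machine over (status, life) pairs; the event names and the transition table are our illustration, not Cloud9’s API, and dead is deliberately terminal.

```python
# Transitions suggested by Fig. 3 and the text (our reading, not Cloud9's code).
TRANSITIONS = {
    ("materialized", "candidate"): {
        "explore": ("materialized", "dead"),       # expanding a node kills it
        "send_job": ("materialized", "fence"),     # transferred -> fence at sender
    },
    ("virtual", "candidate"): {
        "replay": ("materialized", "candidate"),   # lazily materialized on demand
    },
    ("materialized", "fence"): {
        "job_returns": ("materialized", "candidate"),
    },
    # ("materialized", "dead") has no outgoing transitions: terminal state,
    # so its program state can be discarded from memory.
}

def step(state, event):
    nxt = TRANSITIONS.get(state, {}).get(event)
    if nxt is None:
        raise ValueError(f"illegal transition: {state} on {event!r}")
    return nxt

# A node received in a job tree starts as a virtual candidate; the strategy
# picks it, replay materializes it, and exploring it makes it dead.
s = ("virtual", "candidate")
s = step(s, "replay")
s = step(s, "explore")
assert s == ("materialized", "dead")
```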

3.3 Cluster-level Operation

Load Balancing When jobs arrive at Wd, they are placed conceptually in a queue; the length of this queue is sent to the load balancer periodically. The LB ensures that the worker queue lengths stay within the same order of magnitude. The balancing algorithm takes as input the lengths li of each worker Wi’s queue Qi. It computes the average l̄ and standard deviation σ of the li values and then classifies each Wi as underloaded (li < max{l̄ − δ·σ, 0}), overloaded (li > l̄ + δ·σ), or OK otherwise; δ is a constant factor. The Wi are then sorted according to their queue length li and placed in a list. The LB then matches underloaded workers from the beginning of the list with overloaded workers from the end of the list. For each pair ⟨Wi, Wj⟩, with li < lj, the load balancer sends a job transfer request to the workers to move (lj − li)/2 candidate nodes from Wj to Wi.
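The classification and pairing just described can be sketched as follows. This is a simplified sketch: we use the population standard deviation, greedy pairing, and integer division, details the text leaves open.

```python
from statistics import mean, pstdev

def balance(queue_lengths, delta=1.0):
    """Classify workers by queue length and pair underloaded workers with
    overloaded ones; returns transfer requests (src, dst, n_jobs)."""
    l_bar, sigma = mean(queue_lengths), pstdev(queue_lengths)
    lo = max(l_bar - delta * sigma, 0)          # underloaded threshold
    hi = l_bar + delta * sigma                  # overloaded threshold
    order = sorted(range(len(queue_lengths)), key=lambda i: queue_lengths[i])
    under = [i for i in order if queue_lengths[i] < lo]           # least loaded first
    over = [i for i in reversed(order) if queue_lengths[i] > hi]  # most loaded first
    requests = []
    for dst, src in zip(under, over):           # match ends of the sorted list
        n = (queue_lengths[src] - queue_lengths[dst]) // 2
        if n > 0:
            requests.append((src, dst, n))      # move n jobs from src to dst
    return requests

# Worker 3 (110 jobs) is told to send half its surplus to worker 0 (0 jobs).
print(balance([0, 10, 100, 110]))  # -> [(3, 0, 55)]
```

A balanced cluster produces no requests: with equal queue lengths, σ = 0 and no worker falls outside the [lo, hi] band.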

Coordinating Worker-level Explorations Classic symbolic execution relies on heuristics to choose which state on the frontier to explore first, so as to efficiently reach the chosen test goal (code coverage, finding a particular type of bug, etc.). In a distributed setting, local heuristics must be coordinated across workers to achieve the global goal, while keeping communication overhead at a minimum. What we have described so far ensures that eventually all paths in the execution tree are explored, but it provides no aid in focusing on the paths desired by the global strategy. In this sense, what we described above is a mechanism, while the exploration strategies represent the policies.

Global strategies are implemented in Cloud9 using its interface for building overlays on the execution tree structure. We used this interface to implement distributed versions of all strategies that come with KLEE [Cadar 2008]; the interface is also available to Cloud9 users. Due to space limitations, we do not describe the strategy interface further, but provide below an example of how a global strategy is built.

A coverage-optimized strategy drives exploration so as to maximize coverage [Cadar 2008]. In Cloud9, coverage is represented as a bit vector, with one bit for every line of code; a set bit indicates that a line is covered. Every time a worker explores a program state, it sets the corresponding bits locally. The current version of the bit vector is piggybacked on the status updates sent to the load balancer. The LB maintains the current global coverage vector and, when it receives an updated coverage bit vector, ORs it into the current global coverage. The result is then sent back to the worker, which in turn ORs this global bit vector into its own, in order to enable its local exploration strategy to make choices consistent with the global goal. The coverage bit vector is an example of a Cloud9 overlay data structure.
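The OR-merging of coverage vectors can be sketched with Python integers standing in for bit vectors (an illustrative sketch; Cloud9’s actual encoding is not described here):

```python
class CoverageVector:
    """One bit per line of code; a set bit means the line is covered."""
    def __init__(self, n_lines):
        self.bits = 0
        self.n_lines = n_lines

    def cover(self, line):        # worker: mark a line as covered locally
        self.bits |= 1 << line

    def merge(self, other_bits):  # OR in a vector received from LB or a worker
        self.bits |= other_bits

    def covered(self):            # number of covered lines
        return bin(self.bits).count("1")

# Two workers cover overlapping line sets; the LB ORs their updates together
# and sends the global view back, which each worker ORs into its own vector.
w1, w2, lb = CoverageVector(8), CoverageVector(8), CoverageVector(8)
w1.cover(1); w1.cover(3)
w2.cover(3); w2.cover(5)
lb.merge(w1.bits); lb.merge(w2.bits)   # piggybacked status updates reach the LB
w1.merge(lb.bits)                      # global vector returned to the worker
assert lb.covered() == 3 and w1.bits == lb.bits
```

Because OR is idempotent, commutative, and associative, updates can arrive in any order and be applied repeatedly without corrupting the global view.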

4. The POSIX Environment Model

Symbolically executing real-world software is challenging not only because of path explosion but also because real-world systems interact with their environment in varied and complex ways. This section describes our experience building Cloud9's symbolic model of a POSIX environment, which supports most essential interfaces: threads, process management, sockets, pipes, polling, etc. We believe the described techniques are general enough to model other OSes and environments as well.

4.1 Environment Model Design

The goal of a symbolic model is to simulate the behavior of a real execution environment, while maintaining the necessary symbolic state behind the environment interface. The symbolic execution engine (SEE) can then seamlessly transition back and forth between the program and the environment.

While writing and maintaining a model can be laborious and prone to error [Chipounov 2011], there exist cases in which models provide distinct advantages. First, symbolic execution with a model can be substantially faster than without. For instance, in the Linux kernel, transferring a packet between two hosts exercises the entire TCP/IP networking stack and the associated driver code, amounting to over 30 KLOC. In contrast, Cloud9's POSIX model achieves the same functionality in about 1.5 KLOC. Requirements that complicate a real environment/OS implementation, such as performance and extensibility, can be ignored in a symbolic model. Second, when an interface is as stable as POSIX, investing the time to model it becomes worthwhile.

We designed a minimal yet general “symbolic system call” interface to the Cloud9 SEE, which provides the essential building blocks for thread context switching, address space isolation, memory sharing, and sleep operations. These are hard to provide solely through an external model. We give more details about symbolic system calls in §4.2.

In some cases, it is practical to have the host OS handle parts of the environment via external calls. These are implemented by concretizing the symbolic parameters of a system call before invoking it from symbolically executing code. Unlike [Cadar 2008; 2006, Godefroid 2005], Cloud9 allows external calls only for stateless or read-only system calls, such as reading a system configuration file from the /etc directory. This restriction ensures that external concrete calls do not clobber other symbolically executing paths.
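Such a policy amounts to a gate between the symbolic domain and the host OS. A minimal sketch of the check is below; the whitelist contents are purely illustrative assumptions on our part, since the exact set of calls Cloud9 permits is not enumerated here:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative whitelist: an external call may leave the symbolic domain
   only if it cannot mutate state visible to other execution paths. */
static const char *read_only_calls[] = { "stat", "readlink", "getpid", NULL };

int external_call_allowed(const char *syscall_name) {
    for (int i = 0; read_only_calls[i] != NULL; i++)
        if (strcmp(syscall_name, read_only_calls[i]) == 0)
            return 1;
    return 0;  /* e.g., close() could tear down state the SEE itself uses */
}
```

A stateful call such as close() is denied, matching the restriction to stateless or read-only operations described above.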


Figure 4: Architecture of the Cloud9 POSIX model.

Cloud9 builds upon the KLEE symbolic execution engine, and so it inherits from KLEE the mechanism for replacing parts of the C Library with model code; it also inherits the external calls mechanism. Cloud9 adds the symbolic system call interface and replaces parts of the C Library with the POSIX model. The resulting architecture is shown in Fig. 4.

Before symbolic execution starts, the Cloud9 system links the program under test with a special symbolic C Library. We built this library by replacing parts of the existing uClibc library in KLEE with the POSIX model code. Developers do not need to modify the code of to-be-tested programs in any way to make them run on Cloud9.

In the C Library, we replaced operations related to threads, processes, file descriptors, and network operations with their corresponding model ①, and augmented the API with Cloud9-specific extensions ②. A large portion of the C Library is reused, since it works out of the box (e.g., memory and string operations). Finally, parts of the original

Primitive Name              Description

cloud9_make_shared          Share object across a CoW domain
cloud9_thread_create        Create and destroy threads
cloud9_thread_terminate
cloud9_process_fork         Fork and terminate the current process
cloud9_process_terminate
cloud9_get_context          Get the current context (pid and tid)
cloud9_thread_preempt       Preempt a thread
cloud9_thread_sleep         Put the current thread to sleep on a waiting queue
cloud9_thread_notify        Wake threads from a waiting queue
cloud9_get_wlist            Create a new waiting queue

Table 1: Cloud9 primitives used to build the POSIX model.

C Library itself use the modeled code ④ (e.g., the standard I/O library stdio relies on the modeled POSIX file descriptors).

The modeled POSIX components interface with the SEE through symbolic system calls ⑤, listed in Table 1. Occasionally, the unmodified part of the C Library invokes external system calls ⑥, and the model code itself needs support from the host OS ⑦; to make sure the external calls do not interfere with the symbolic engine's own operations ⑧, such access is limited to read-only and/or stateless operations. This avoids problems like, for instance, allowing an external close() system call to close a network connection or log file that is actually used by the SEE itself.

4.2 Symbolic Engine Modifications

In order to support the POSIX interface, we augmented KLEE with two major features: multiple address spaces per state and support for scheduling threads and processes. This functionality is accessed by model code through the symbolic system call interface (Table 1). Additional models of non-POSIX environments can be built using this interface.

Address Spaces KLEE uses copy-on-write (CoW) to enable memory sharing between symbolic states. We extend this functionality in two ways. First, we enable multiple address spaces within a single execution state, corresponding to multiple processes encompassed in that state. Address spaces can thus be duplicated both across states (as in classic KLEE) and within a state, when cloud9_process_fork is invoked, e.g., as used by the POSIX model's fork().

Second, we organize the address spaces in an execution state as CoW domains that permit memory sharing between processes. A memory object can be marked as shared by calling cloud9_make_shared; it is then automatically mapped in the address spaces of the other processes within the CoW domain. Whenever a shared object is modified in one address space, the new version is automatically propagated to the other members of the CoW domain. The shared memory objects can then be used by the model as global memory for inter-process communication.

Multithreading and Scheduling Threads are created in the currently executing process by calling cloud9_thread_create. Cloud9's POSIX threads (pthreads) model makes use of this primitive in its own pthread_create() routine.

Cloud9 implements a cooperative scheduler: an enabled thread runs uninterrupted (atomically), until either (a) the thread goes to sleep; (b) the thread is explicitly preempted by a cloud9_thread_preempt call; or (c) the thread is terminated via the symbolic system calls for process/thread termination. Preemption occurs at explicit points in the model code, but it is straightforward to extend Cloud9 to automatically insert preemption calls at the instruction level (as would be necessary, for instance, when testing for race conditions).

When cloud9_thread_sleep is called, the SEE places the current thread on a specified waiting queue, and an enabled thread is selected for execution. Another thread may call cloud9_thread_notify on the waiting queue and wake up one or all of the queued threads.
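The bookkeeping behind these two calls can be pictured with a toy wait queue; this is a simplified sketch of our own, since Cloud9's real queues live inside the SEE:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 16

/* Toy wait queue: sleep appends the current thread id; notify pops one
   waiter (or all of them) and hands the ids back to the scheduler. */
typedef struct { unsigned tid[MAX_WAITERS]; size_t count; } wait_queue;

void wq_sleep(wait_queue *q, unsigned tid) {
    assert(q->count < MAX_WAITERS);
    q->tid[q->count++] = tid;
}

/* Wake the oldest waiter; returns its tid, or 0 if the queue is empty. */
unsigned wq_notify_one(wait_queue *q) {
    if (q->count == 0) return 0;
    unsigned t = q->tid[0];
    for (size_t i = 1; i < q->count; i++) q->tid[i - 1] = q->tid[i];
    q->count--;
    return t;
}

/* Wake every waiter; returns how many were woken. */
size_t wq_notify_all(wait_queue *q) {
    size_t n = q->count;
    q->count = 0;
    return n;
}
```

Because scheduling is cooperative, these operations never race with one another, which is what lets the model avoid locks entirely (see §4.3).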

Cloud9 can be configured to schedule the next thread deterministically, or to fork the execution state for each possible next thread. The latter case is useful when looking for concurrency bugs, but it can be a significant source of path explosion, so it should be disabled when not needed.

If no thread can be scheduled when the current thread goes to sleep, then a hang is detected, the execution state is terminated, and a corresponding test case is generated.

Note that parallelizing symbolic execution is orthogonal to providing the multithreading support described above. In the former case, the execution engine is instantiated on multiple machines and each instance expands a portion of the symbolic execution tree. In the latter case, multiple symbolic threads are multiplexed along the same execution path in the tree; execution is serial along each path.

4.3 POSIX Model Implementation

In this section, we describe the key design decisions involved in building the Cloud9 POSIX model, and we illustrate the use of the symbolic system call interface. This is of particular interest to readers who wish to build additional models on top of the Cloud9 symbolic system call interface.

The POSIX model uses shared memory structures to keep track of all system objects (processes, threads, sockets, etc.). The two most important data structures are stream buffers and block buffers, analogous to character and block device types in UNIX. Stream buffers model half-duplex communication channels: they are generic producer-consumer queues of bytes, with support for event notification to multiple listeners. Event notifications are used, for instance, by the polling component in the POSIX model. Block buffers are random-access, fixed-size buffers, whose operations do not block; they are used to implement symbolic files.

The symbolic execution engine maintains only basic information on running processes and threads: identifiers, running status, and parent–child information. However, the POSIX standard mandates additional information, such as open file descriptors and permission flags. This information is stored by the model in auxiliary data structures associated with the currently running threads and processes. The implementations of fork() and pthread_create() are in charge of initializing these auxiliary data structures and making the appropriate symbolic system calls.

Modeling synchronization routines is simplified by the cooperative scheduling policy: no locks are necessary, and all synchronization can be done using the sleep/notify symbolic system calls, together with reference counters. Fig. 5 illustrates the simplicity this engenders in the implementation of pthread mutex lock and unlock.

typedef struct {
  wlist_id_t wlist;
  char taken;
  unsigned int owner;
  unsigned int queued;
} mutex_data_t;

int pthread_mutex_lock(pthread_mutex_t *mutex) {
  mutex_data_t *mdata = *((mutex_data_t**)mutex);
  if (mdata->queued > 0 || mdata->taken) {
    mdata->queued++;
    cloud9_thread_sleep(mdata->wlist);
    mdata->queued--;
  }
  mdata->taken = 1;
  mdata->owner = pthread_self();
  return 0;
}

int pthread_mutex_unlock(pthread_mutex_t *mutex) {
  mutex_data_t *mdata = *((mutex_data_t**)mutex);
  if (!mdata->taken || mdata->owner != pthread_self()) {
    errno = EPERM;
    return -1;
  }
  mdata->taken = 0;
  if (mdata->queued > 0)
    cloud9_thread_notify(mdata->wlist);
  return 0;
}

Figure 5: Example implementation of pthread mutex operations in Cloud9's POSIX environment model.
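The model's error path can be exercised in isolation by pairing the Fig. 5 logic with single-threaded stubs for the symbolic system calls. The stubs and the renamed entry points below are ours, for illustration only:

```c
#include <assert.h>
#include <errno.h>

typedef unsigned long wlist_id_t;

/* Single-thread stubs: with one thread there is never contention, so
   sleep/notify are never reached on the paths exercised here. */
static void cloud9_thread_sleep(wlist_id_t w)  { (void)w; }
static void cloud9_thread_notify(wlist_id_t w) { (void)w; }
static unsigned int my_pthread_self(void) { return 1; }

typedef struct {
    wlist_id_t wlist;
    char taken;
    unsigned int owner;
    unsigned int queued;
} mutex_data_t;

/* Same logic as Fig. 5, operating directly on the model structure. */
int model_lock(mutex_data_t *mdata) {
    if (mdata->queued > 0 || mdata->taken) {
        mdata->queued++;
        cloud9_thread_sleep(mdata->wlist);
        mdata->queued--;
    }
    mdata->taken = 1;
    mdata->owner = my_pthread_self();
    return 0;
}

int model_unlock(mutex_data_t *mdata) {
    if (!mdata->taken || mdata->owner != my_pthread_self()) {
        errno = EPERM;
        return -1;
    }
    mdata->taken = 0;
    if (mdata->queued > 0)
        cloud9_thread_notify(mdata->wlist);
    return 0;
}
```

An unlock without a prior lock fails with EPERM, mirroring the ownership check in Fig. 5, while a lock/unlock pair on an uncontended mutex succeeds without ever touching the wait queue.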

Cloud9 inherits most of the semantics of the file model from KLEE. In particular, one can either open a symbolic file (its contents come from a symbolic block buffer), or a concrete file, in which case a concrete file descriptor is associated with the symbolic one, and all operations on the file are forwarded as external calls on the concrete descriptor.

In addition to file objects, the Cloud9 POSIX model adds support for networking and pipes. Currently, the TCP and UDP protocols are supported over IP and UNIX network types. Since no actual hardware is involved in the packet transmission, we can collapse the entire networking stack into a simple scheme based on two stream buffers (Fig. 6). The network is modeled as a single-IP network with multiple available ports—this configuration is sufficient to connect multiple processes to each other, in order to simulate and test


Figure 6: A TCP network connection is modeled in Cloud9 using TX and RX buffers implemented as stream buffers.

distributed systems. The model also supports pipes through the use of a single stream buffer, similar to sockets.

The Cloud9 POSIX model supports polling through the select() interface. All the software we tested can be configured to use select(), so it was not necessary to implement other polling mechanisms. The select() model relies on the event notification support offered by the stream buffers that are used in the implementation of blocking I/O objects (currently sockets and pipes).

The constraint solver used in Cloud9 operates on bit vectors; as a result, symbolic formulas refer to contiguous areas of memory. In order to reduce the constraint solving overhead, we aim to reduce the amount of intermixing of concrete and symbolic data in the same memory region. Thus, Cloud9's POSIX model segregates concrete from symbolic data by using static arrays for concrete data and linked lists (or other specialized structures) for symbolic data. We allocate into separate buffers potentially-symbolic data passed by the tested program through the POSIX interface.

In order to enable testing the systems presented in the evaluation section (§7), we had to add support for various other components: IPC routines, mmap() calls, time-related functions, etc. Even though laborious, this was mostly an engineering exercise, so we do not discuss it further.

5. Symbolic Test Suites

Software products and systems typically have large “hand-made” test suites; writing and maintaining these suites requires substantial human effort. Cloud9 aims to reduce this burden while improving the quality of testing, by offering an easy way to write “symbolic test suites.” First, a symbolic test case encompasses many similar concrete test cases in a single symbolic one—each symbolic test a developer writes is equivalent to many concrete ones. Second, a symbolic test case explores conditions that are hard to produce reliably in a concrete test case, such as the occurrence of faults, concurrency side effects, or network packet reordering, dropping, and delay. Furthermore, symbolic test suites can easily cover unknown corner cases, as well as new, untested functionality. In this section, we present the API for symbolic tests and illustrate it with a use case.

Function Name           Description

cloud9_make_symbolic    Mark memory regions as symbolic
cloud9_fi_enable        Enable/disable the injection of faults
cloud9_fi_disable
cloud9_set_max_heap     Set heap size for symbolic malloc
cloud9_set_scheduler    Set scheduler policy (e.g., round-robin)

Table 2: Cloud9 API for setting global behavior parameters.

Extended Ioctl Code   Description

SIO_SYMBOLIC          Turns this file or socket into a source of symbolic input

SIO_PKT_FRAGMENT      Enables packet fragmentation on this socket (must be a stream socket)

SIO_FAULT_INJ         Enables fault injection for operations on this descriptor

Table 3: Cloud9 extended ioctl codes to control environmental events on a per-file-descriptor basis.

5.1 Testing Platform API

The Cloud9 symbolic testing API (Tables 2 and 3) allows tests to programmatically control events in the environment of the program under test. A test suite simply needs to include the cloud9.h header file and make the requisite calls.

Symbolic Data and Streams The generality of a test case can be expanded by introducing bytes of symbolic data. This is done by calling cloud9_make_symbolic, a wrapper around klee_make_symbolic, with an argument that points to a memory region. klee_make_symbolic is a primitive provided by KLEE to mark data symbolic. In addition to wrapping this call, we added several new primitives to the testing API (Table 2). In Cloud9, symbolic data can be written/read to/from files, can be sent/received over the network, and can be passed via pipes. Furthermore, the SIO_SYMBOLIC ioctl code (Table 3) turns on/off the reception of symbolic bytes from individual files or sockets.

Network Conditions Delay, reordering, or dropping of packets causes a network data stream to be fragmented. Fragmentation can be turned on or off at the socket level using one of the Cloud9 ioctl extensions. §7 presents a case where symbolic fragmentation enabled Cloud9 to prove that a bug fix for the lighttpd web server was incomplete.
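Fragmentation is combinatorially rich: an n-byte stream can be delivered as any sequence of non-empty read() chunks, i.e., any composition of n, of which there are 2^(n-1). A tiny helper (ours, purely for illustration) makes the blow-up concrete:

```c
#include <assert.h>

/* Number of ways an n-byte stream can be split into a sequence of
   non-empty chunks: one binary choice (cut / no cut) between each
   adjacent pair of bytes, hence 2^(n-1) compositions for n >= 1. */
unsigned long fragmentation_patterns(unsigned n) {
    return n == 0 ? 1UL : 1UL << (n - 1);
}
```

Even a 10-byte request admits 512 distinct delivery patterns, which is why enumerating them symbolically beats hand-writing concrete fragmented test inputs.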

Fault Injection Calls in a POSIX system can return an error code when they fail. Most programs can tolerate such failed calls, but even high-quality production software misses some [Marinescu 2009]. Such error return codes are simulated by Cloud9 whenever fault injection is turned on.

Symbolic Scheduler Cloud9 provides multiple scheduling policies that can be controlled for purposes of testing on a per-code-region basis. Currently, Cloud9 supports a round-robin scheduler and two schedulers specialized for bug finding: a variant of the iterative context bounding scheduling


algorithm [Musuvathi 2008] and an exhaustive exploration of all possible scheduling decisions.

5.2 Use Case

Consider a scenario in which we want to test the support for a new X-NewExtension HTTP header, just added to a web server. We show how to write tests for this new feature.

A symbolic test suite typically starts off as an augmentation of an existing test suite; in our scenario, we reuse the existing boilerplate setup code and write a symbolic test case that marks the extension header symbolic. Whenever the code that processes the header data is executed, Cloud9 forks at all the branches that depend on the header content. Similarly, the request payload can be marked symbolic to test the payload-processing part of the system:

char hData[10];
cloud9_make_symbolic(hData);
strcat(req, "X-NewExtension: ");
strcat(req, hData);

The web server may receive HTTP requests fragmented in a number of chunks, returned by individual invocations of the read() system call—the web server should run correctly regardless of the fragmentation pattern. To test different fragmentation patterns with Cloud9, one simply enables symbolic packet fragmentation on the client socket:

ioctl(ssock, SIO_PKT_FRAGMENT, RD);

To test how the web server handles failures in the environment, we can ask Cloud9 to selectively inject faults when the server reads or sends data on a socket by placing in the symbolic test suite calls of the form:

ioctl(ssock, SIO_FAULT_INJ, RD | WR);

Cloud9 can also enable/disable fault injection globally for all file descriptors within a certain region of the code using calls to cloud9_fi_enable and cloud9_fi_disable. For simulating low-memory conditions, Cloud9 provides a cloud9_set_max_heap primitive, which can be used to test the web server with different maximum heap sizes.

6. Cloud9 Prototype

We developed a Cloud9 prototype that runs on private clusters as well as cloud infrastructures like Amazon EC2 [Amazon] and Eucalyptus [Eucalyptus]. The prototype has 10 KLOC; the POSIX model accounts for half of the total code size. Cloud9 workers embed KLEE [Cadar 2008], a state-of-the-art single-node symbolic execution engine; the Cloud9 fabric converts a cluster of individual engines into one big parallel symbolic execution engine. This section presents selected implementation decisions underlying the prototype. More details are available at http://cloud9.epfl.ch.

Broken Replays As discussed in §3.2, when a job is transferred from one worker to another, the replay done during materialization must successfully reconstruct the transferred state. Along the reconstruction path, the destination must execute the same instructions, obtain the same symbolic memory content, and get the same results during constraint solving as on the source worker. Failing to do so causes the replayed path to be broken: it either diverges, or terminates prematurely. In both cases, this means the state cannot be reconstructed, and this could affect exploration completeness.

The main challenge is that the underlying KLEE symbolic execution engine relies on a global memory allocator to service the tested program's malloc() calls. The allocator returns actual host memory addresses, and this is necessary for executing external system calls that access program state. Unfortunately, this also means that buffers are allocated at addresses whose values for a given state depend on the history of previous allocations in other states. Such cross-state interference leads to frequent broken replays.

We therefore replaced the KLEE allocator with a per-state deterministic memory allocator, which uses a per-state address counter that increases with every memory allocation. To preserve the correctness of external calls (which require real addresses), this allocator gives addresses in a range that is also mapped in the SEE address space using mmap(). Thus, before external calls are invoked, the memory content of the state is copied into the mmap-ed region.
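The essence of the per-state allocator is that the address of the Nth allocation depends only on that state's own allocation history, never on what other states allocate. A minimal bump-allocator sketch (ours, not the actual Cloud9 implementation):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each execution state carries its own counter; replaying the same
   allocation sequence on another worker then yields the same addresses. */
typedef struct { uintptr_t base; uintptr_t next; } state_allocator;

void state_alloc_init(state_allocator *s, uintptr_t base) {
    s->base = base;
    s->next = base;
}

uintptr_t state_alloc(state_allocator *s, size_t size) {
    uintptr_t addr = s->next;
    s->next += (size + 15) & ~(uintptr_t)15;  /* keep 16-byte alignment */
    return addr;
}
```

Two states initialized identically hand out identical address sequences for identical request sequences, which is exactly the property replay needs; mapping the range with mmap() (as described above) is what keeps external calls working.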

Constraint Caches KLEE implements a cache mechanism for constraint-solving results; this cache can significantly improve solver performance. In Cloud9, states are transferred between workers without the source worker's cache. While one might expect this to hurt performance significantly, in practice we found that the necessary portion of the cache is mostly reconstructed as a side effect of path replay, as the path constraints are re-sent to the local solver.

Custom Data Structures We developed two custom data structures for handling symbolic execution trees. Node pins are a kind of smart pointer customized for trees. Standard smart pointers (e.g., the ones provided by the Boost libraries) can introduce significant performance disruptions when used for linked data structures: chained destructors can introduce noticeable deallocation latency and may even overflow the stack and crash the system. The node pin allows trees to be treated akin to a “rubber band” data structure: as nodes get allocated, the rubber band is stretched, and some nodes act as pins to anchor the rubber band. When such a pin is removed, the nodes with no incoming references are freed up to the point where the rubber band reaches the pin next closest to the root. Tree nodes between two pins are freed all at once, avoiding the use of the stack for recursive destructor calls.

Another custom data structure is the tree layer. At first, Cloud9 used a single tree to represent the entire symbolic execution. As Cloud9 evolved, tree nodes acquired an increasing number of objects: program states, imported jobs, breakpoints, etc. This made tree searches inefficient, complicated synchronization, and generally impeded our development effort. We therefore adopted a layer-based structure similar to that used in CAD tools, where the actual tree is a superposition of simpler layers. When exploring the tree, one chooses the layer of interest; switching between layers can be done dynamically at virtually zero cost. Cloud9 currently uses separate layers for symbolic states, imported jobs, and several other sets of internal information.

7. Evaluation

There are several questions one must ask of a parallel symbolic execution platform, and we aim to answer them in this section: Can it be used on real-world software that interacts richly with its environment (§7.1)? Does it scale on commodity shared-nothing clusters (§7.2)? Is it an effective testing platform, and does it help developers gain confidence in their code (§7.3)? How do its different components contribute to overall efficiency (§7.4)?

For all our experiments, we used a heterogeneous cluster environment, with worker CPU frequencies between 2.3–2.6 GHz and with 4–6 GB of RAM available per core.

On each worker, the underlying KLEE engine used the best searchers from [Cadar 2008], namely an interleaving of random-path and coverage-optimized strategies. At each step, the engine alternately selects one of these heuristics to pick the next state to explore. Random-path traverses the execution tree starting from the root and randomly picks the next descendant node, until a candidate state is reached. The coverage-optimized strategy weighs the states according to an estimated distance to an uncovered line of code, and then randomly selects the next state according to these weights.

To quantify coverage, we report both line coverage and path coverage numbers. Line coverage measures the fraction of program statements executed during a test run, while path coverage reports how many execution paths were explored during a test. Path coverage is the more relevant metric when comparing the thoroughness of testing tools. Nevertheless, we also evaluate properties related to line coverage, since this is still a de facto standard in software testing practice. Intuitively, and confirmed experimentally, path coverage in Cloud9 scales linearly with the number of workers.

7.1 Handling Real-World Software

Table 4 shows a selection of the systems we tested with Cloud9, covering several types of software. We confirmed that each system can be tested properly under our POSIX model. In the rest of this section, we focus our in-depth evaluation on several networked servers and tools, as they are frequently used in settings where reliability matters.

Due to its comprehensive POSIX model, Cloud9 can test many kinds of servers. One example is lighttpd, a web server used by numerous high-profile web sites, such as YouTube, Wikimedia, Meebo, and SourceForge. For lighttpd, Cloud9 proved that a certain bug fix was incorrect, and the bug could still manifest even after applying the patch (§7.3.4). Cloud9 also found a bug in curl, an Internet transfer application

System                Size (KLOC)   Type of Software

Apache httpd 2.2.16      226.4      Web server
Lighttpd 1.4.28           39.5      Web server
Ghttpd 1.4.4               0.6      Web server
Memcached 1.4.5            8.3      Distributed object cache
Python 2.6.5             388.3      Language interpreter
Curl 7.21.1               65.9      Network utility
Rsync 3.0.7               35.6      Network utility
Pbzip 2.1.1                3.6      Compression utility
Libevent 1.4.14           10.2      Event notification library
Coreutils 6.10            72.1      Suite of system utilities
Bandicoot 1.0              6.4      Lightweight DBMS

Table 4: Representative selection of testing targets that runon Cloud9. Size was measured using the sloccount utility.

that is part of most Linux distributions and other operating systems (§7.3.2). Cloud9 also found a hang bug in the UDP handling code of memcached, a distributed memory object cache system used by many Internet services, such as Flickr, Craigslist, Twitter, and Livejournal (§7.3.3).

In addition to the testing targets mentioned above, we also tested a benchmark consisting of a multi-threaded and multi-process producer-consumer simulation. The benchmark exercises the entire functionality of the POSIX model: threads, synchronization, processes, and networking.

We conclude that Cloud9 is practical and capable of testing a wide range of real-world software systems.

7.2 Scaling on Commodity Shared-Nothing Clusters

We evaluate Cloud9 using two metrics:

1. The time to reach a certain goal (e.g., an exhaustive path exploration, or a fixed coverage level)—we consider this an external metric, which measures the performance of the testing platform in terms of its end results.

2. The useful work performed during exploration, measured as the number of useful (non-replay) instructions executed symbolically. This is an internal metric that measures the efficiency of Cloud9's internal operation.

A cluster-based symbolic execution engine scales with the number of workers if these two metrics improve proportionally with the number of workers in the cluster.

Time Scalability We show that Cloud9 scales linearly by achieving the same testing goal proportionally faster as the number of workers increases. We consider two scenarios.

First, we measure how fast Cloud9 can exhaustively explore a fixed number of paths in the symbolic execution tree. For this, we use a symbolic test case that generates all the possible paths involved in receiving and processing two symbolic messages in the memcached server (§7.3 gives more details about the setup). Fig. 7 shows the time required to finish the test case with a variable number of workers: every doubling in the number of workers roughly halves the time to completion. With 48 workers, the time to complete is about 10 minutes; for 1 worker, exploration time exceeds our 10-hour limit on the experiment.

Figure 7: Cloud9 scalability in terms of the time it takes to exhaustively complete a symbolic test case for memcached.

Second, we measure the time it takes Cloud9 to reach a fixed coverage level for the printf UNIX utility. printf performs a lot of parsing of its input (format specifiers), which produces complex constraints when executed symbolically. Fig. 8 shows that the time to achieve a coverage target decreases proportionally with the number of added workers. The low 50% coverage level can be easily achieved even with a sequential SEE (1-worker Cloud9). However, higher coverage levels require more workers, if they are to be achieved in a reasonable amount of time; e.g., only a 48-worker Cloud9 is able to achieve 90% coverage. The anomaly at 4 workers for 50% coverage is due to high variance; when the number of workers is low, the average (5 ± 4.7 minutes over 10 experiments) can be erratic due to the random choices in the random-path search strategy.

Figure 8: Cloud9 scalability in terms of the time it takes to obtain a target coverage level when testing printf.

Work Scalability We now consider the same scalability experiments from the perspective of useful work done by Cloud9: we measure both the total number of instructions (from the target program) executed during the exploration process, as well as normalize this value per worker. This measurement indicates whether the overheads associated with parallel symbolic execution impact the efficiency of exploration, or are negligible. Fig. 9 shows the results for memcached, confirming that Cloud9 scales linearly in terms of useful work done (top graph). The average useful work done by a worker (bottom graph) is relatively independent

Figure 9: Cloud9 scalability in terms of useful work done for four different running times when testing memcached.

of the total number of workers in the cluster, so adding more workers improves Cloud9's results proportionally.

In Fig. 10 we show the results for printf and test, UNIX utilities that are an order of magnitude smaller than memcached. We find that the useful work done scales in a similar way to memcached, even though the three programs are quite different from each other (e.g., printf does mostly parsing and formatting, while memcached does mostly data structure manipulations and network I/O).

Figure 10: Cloud9's useful work on printf (top) and test (bottom) increases roughly linearly in the size of the cluster.

In conclusion, Cloud9 scales linearly with the number of workers, both in terms of the time to complete a symbolic testing task and in terms of reaching a target coverage level.


7.3 Effectiveness as a Testing Platform

In this section we present several case studies that illustrate how Cloud9 can explore and find new bugs, confirm/disprove that existing bugs have been correctly fixed, and regression-test a program after it has been modified. In the common case, Cloud9 users start with a concrete test case (e.g., from an existing test suite) and generalize it by making data symbolic and by controlling the environment.

7.3.1 Case Study #1: UNIX Utilities

KLEE is an excellent tool for testing command-line programs, in particular UNIX utilities. It does not tackle more complex systems, like the ones in Table 4, mainly due to path explosion (since KLEE is a single-node engine) and insufficient environment support. We cannot compare Cloud9 to KLEE on parallel and distributed systems, but we can compare on the Coreutils suite of UNIX utilities [Coreutils].

We run KLEE on each of the 96 utilities for 10 minutes, and then run a 12-worker Cloud9 on each utility for 10 minutes. Fig. 11 reports the average coverage increase obtained with Cloud9 over 7 trials, using KLEE's 7-trial average results as a baseline; the experiment totals 2 × 7 × 96 × 10 = 13,440 minutes > 9 days. The increase in coverage is measured as additional lines of code covered, expressed as a percentage of program size (i.e., we do not report it as a percentage of the baseline, which would be a higher number).

Cloud9 covers up to an additional 40% of the target programs, with an average of 13% additional code covered across all Coreutils. In general, improving coverage becomes exponentially harder as the base coverage increases, and this effect is visible in the results: a 12× increase in hardware resources does not bring about a 12× increase in coverage. Our results show that Cloud9 allows "throwing hardware" at the automated testing problem, picking up where KLEE left off. In three cases, Cloud9 achieved 100% coverage in 10 minutes on real-world code. This experiment does not aim to show that Cloud9 is a "better" symbolic execution engine than KLEE (after all, Cloud9 is based on KLEE), but rather that Cloud9-style parallelization can make existing symbolic execution engines more powerful.

The way we compute coverage is different from [Cadar 2008]: whereas KLEE was conceived as an automated test generator, Cloud9 is meant to directly test software. Thus, we measure the number of lines of code tested by Cloud9, whereas [Cadar 2008] reports numbers obtained by running the concrete test cases generated by KLEE. Our method yields more conservative numbers, because a test generated by KLEE at the end of an incomplete path (e.g., one that terminated due to an environment failure) may execute further than the termination point when run concretely.

7.3.2 Case Study #2: Curl

Curl is a popular data transfer tool for multiple network protocols, including HTTP and FTP. When testing it, Cloud9 found a new bug which causes Curl to crash when given a URL regular expression of the form "http://site.{one,two,three}.com{". Cloud9 exposed a general problem in Curl's handling of the case when braces used for regular expression globbing are not matched properly. The bug was confirmed and fixed within 24 hours by the developers.

Figure 11: Cloud9 coverage improvements on the 96 Coreutils (1-worker Cloud9 vs. 12-worker Cloud9). (Top: baseline and additional code coverage [% of program LOC] per tested Coreutil; bottom: additional code covered [% of program LOC], utilities sorted by additional coverage.)

This problem had not been noticed before because the globbing functionality in Curl was shadowed by the same functionality in command-line interpreters (e.g., Bash). This case study illustrates a situation that occurs often in practice: when a piece of software is used in a way that has not been tried before, it is likely to fail due to latent bugs.

7.3.3 Case Study #3: Memcached

Memcached is a distributed memory object cache system, mainly used to speed up web application access to persistent data, typically residing in a database.

Memcached comes with an extensive test suite comprised of C and Perl code. Running it completely takes about 1 minute; it runs 6,472 different test cases and explores 83.66% of the code. While this is considered thorough by today's standards, two easy Cloud9 test cases further increased code coverage. Table 5 contains a summary of our results, presented in more detail in the following paragraphs.

Symbolic Packets The memcached server accepts commands over the network. Based on memcached's C test suite, we wrote a test case that sends memcached a generic, symbolic binary command (i.e., the command content is fully symbolic), followed by a second symbolic command. This test captures all operations that entail a pair of commands.
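To give an intuition for why a pair of fully symbolic commands induces many paths, the toy sketch below (not Cloud9 code; the opcode table and parser are made up) enumerates the distinct parser paths for a two-command sequence. A symbolic opcode byte stands for all 256 concrete values, but the parser distinguishes only a few equivalence classes, so symbolic execution forks one path per class per command:

```python
from itertools import product

# Hypothetical three-opcode binary protocol; any other opcode is invalid.
OPCODES = {0x00: "get", 0x01: "set", 0x04: "delete"}

def parse_command(opcode):
    """Return the parser path taken for a single command."""
    return OPCODES.get(opcode, "error")

# Each command takes one of 4 paths (get/set/delete/error), so a pair
# of symbolic commands yields 4 * 4 = 16 distinct two-command paths.
paths = {tuple(parse_command(op) for op in pair)
         for pair in product(range(256), repeat=2)}
print(len(paths))  # 16
```

In the real memcached test, commands carry symbolic payloads as well, which is why the path count reaches tens of thousands rather than a handful.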

A 24-worker Cloud9 explored in less than 1 hour all 74,503 paths associated with this sequence of two symbolic packets, covering an additional 1.13% of the code relative to the original test suite. What we found most encouraging in this result is that such exhaustive tests constitute first steps toward using symbolic tests to prove properties of real-world programs, not just to look for bugs. Symbolic tests may provide an alternative to complex proof mechanisms that is more intuitive for developers and thus more practical.

Testing Method                 Paths Covered   Isolated Coverage*   Cumulated Coverage**
Entire test suite              6,472           83.67%               —
Binary protocol test suite     27              46.79%               84.33% (+0.67%)
Symbolic packets               74,503          35.99%               84.79% (+1.13%)
Test suite + fault injection   312,465         47.82%               84.94% (+1.28%)

Table 5: Path and code coverage increase obtained by each symbolic testing technique on memcached. We show total coverage obtained with each testing method (*), as well as total coverage obtained by augmenting the original test suite with the indicated method (**); in parentheses, we show the increase over the entire test suite's coverage.

Symbolic Fault Injection We also tested memcached with fault injection enabled, whereby we injected all feasible failures in memcached's calls to the C Standard Library. After 10 minutes of testing, a 24-worker Cloud9 explored 312,465 paths, adding 1.28% over the base test suite. The fact that line coverage increased by so little, despite covering almost 50× more paths, illustrates the weakness of line coverage as a metric for test quality: high line coverage alone offers no strong confidence in the tested code's quality.

For the fault injection experiment, we used a special strategy that sorts the execution states according to the number of faults recorded along their paths, and favors the states with fewer fault injection points. This led to a uniform injection of faults: we first injected one fault in every possible fault injection point along the original C test suite path, then injected pairs of faults, and so on. We believe this is a practical approach to using fault injection as part of regular testing.
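The fewest-faults-first strategy can be sketched as a priority queue keyed on the fault count (a schematic illustration, not Cloud9's actual scheduler; the fault bound and number of injection points are made up):

```python
import heapq

# Schematic fewest-faults-first search: each state records how many
# faults were injected along its path, and the scheduler always
# resumes the state with the fewest faults.
class State:
    def __init__(self, path, faults):
        self.path, self.faults = path, faults

def explore(max_faults=2, injection_points=3):
    """Explore all fault combinations; return fault counts in pop order."""
    order, counter = [], 0          # counter breaks ties in the heap
    heap = [(0, counter, State([], 0))]
    while heap:
        faults, _, s = heapq.heappop(heap)
        order.append(faults)
        if len(s.path) == injection_points:
            continue                # path reached its last injection point
        # Fork: either the library call succeeds, or we inject a fault.
        counter += 1
        heapq.heappush(heap, (faults, counter, State(s.path + [0], faults)))
        if faults < max_faults:
            counter += 1
            heapq.heappush(heap, (faults + 1, counter,
                                  State(s.path + [1], faults + 1)))
    return order

order = explore()
# Zero-fault states are explored before one-fault states, and so on.
print(order == sorted(order))  # True
```

Because every child state carries at least as many faults as its parent, the popped fault counts are nondecreasing, which yields exactly the uniform one-fault-then-pairs progression described above.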

Hang Detection We tested memcached with symbolic UDP packets, and Cloud9 discovered a hang condition in the packet parsing code: when a sequence of packet fragments of a certain size arrives at the server, memcached enters an infinite loop, which prevents it from serving any further UDP connections. This bug can seriously hurt the availability of infrastructures using memcached.

We discovered the bug by limiting the maximum number of instructions executed per path to 5×10^6. The paths without the bug terminated after executing ∼3×10^5 instructions; the other paths, which hit the maximum, pointed us to the bug.
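The per-path instruction budget amounts to a simple watchdog in the interpreter loop; the sketch below is an illustration with the numbers from above, not Cloud9's implementation:

```python
# Schematic per-path instruction budget for hang detection.
MAX_INSTRUCTIONS = 5 * 10**6

def run_path(instructions_needed):
    """Interpret a path; flag it if it exhausts the budget."""
    executed = 0
    while executed < instructions_needed:
        executed += 1_000           # pretend we executed a basic block
        if executed >= MAX_INSTRUCTIONS:
            return "suspected hang"
    return "terminated"

print(run_path(3 * 10**5))          # a normal path: "terminated"
print(run_path(float("inf")))       # an infinite loop: "suspected hang"
```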

7.3.4 Case Study #4: Lighttpd

The lighttpd web server is specifically engineered for high request throughput, and it is quite sensitive to the rate at which new data is read from a socket. Alas, the POSIX specification offers no guarantee on the number of bytes that can be read from a file descriptor at a time. lighttpd 1.4.12 has a bug in the command-processing code that causes the server to crash (and connected clients to hang indefinitely) depending on how the incoming stream of requests is fragmented.

Fragmentation pattern (data sizes in bytes)   ver. 1.4.12 (pre-patch)   ver. 1.4.13 (post-patch)
1×28                                          OK                        OK
1×26 + 1×2                                    crash + hang              OK
2+5+1+5+2×1+3×2+5+2×1                         crash + hang              crash + hang

Table 6: The behavior of different versions of lighttpd in response to three ways of fragmenting the HTTP request "GET /index.html HTTP/1.0CRLFCRLF" (string length 28).

We wrote a symbolic test case to exercise different stream fragmentation patterns and see how different lighttpd versions behave. We constructed a simple HTTP request, which was then sent over the network to lighttpd. We activated network packet fragmentation via the symbolic ioctl() API explained in §5. We confirmed that certain fragmentation patterns cause lighttpd to crash (prior to the bug fix). However, we also tested the server right after the fix and discovered that the bug fix was incomplete, as some fragmentation patterns still cause a crash and hang the client (Table 6).
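Outside Cloud9, the essence of such a test can be approximated concretely: feed every fragmentation of a short request to an incremental parser and check that the result never depends on where the fragment boundaries fall. The sketch below uses a made-up stand-in parser, not lighttpd's code:

```python
def fragmentations(data):
    """Yield every way to split `data` into non-empty fragments."""
    if not data:
        yield []
        return
    for i in range(1, len(data) + 1):
        for rest in fragmentations(data[i:]):
            yield [data[:i]] + rest

class LineParser:
    """Toy incremental parser: accumulates bytes until the first CRLF."""
    def __init__(self):
        self.buf = b""
    def feed(self, fragment):
        self.buf += fragment
    def result(self):
        head, sep, _ = self.buf.partition(b"\r\n")
        return head if sep else None

request = b"GET /\r\n"
expected = LineParser()
expected.feed(request)

# Differential check: every fragmentation must yield the same parse.
for frags in fragmentations(request):
    p = LineParser()
    for f in frags:
        p.feed(f)
    assert p.result() == expected.result(), frags
print("all fragmentations agree")
```

A symbolic fragmentation variable, as used in the Cloud9 test, covers the same space without enumerating the (exponentially many) concrete splits one by one.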

This case study shows that Cloud9 can find bugs caused by specific interactions with the environment that are hard to test with a concrete test suite. It also shows how Cloud9 can be used to write effective regression test suites: had a stream-fragmentation symbolic test been run after the fix, the lighttpd developers would have promptly discovered the incompleteness of their fix.

7.3.5 Case Study #5: Bandicoot DBMS

Bandicoot is a lightweight DBMS that can be accessed over an HTTP interface. We exhaustively explored all paths handling the GET commands and found a bug in which Bandicoot reads from outside its allocated memory. The particular test we ran fortuitously did not result in a crash, as Bandicoot ended up reading from the libc memory allocator's metadata preceding the allocated block of memory. However, besides the read data being wrong, this bug could cause a crash depending on where the memory block was allocated.

Discovering and diagnosing this bug without Cloud9 would be difficult. First, a concrete test case has little chance of triggering the bug. Second, searching for the bug with a sequential symbolic execution tool seems impractical: the exhaustive exploration took 9 hours with a 4-worker Cloud9 (and less than 1 hour with a 24-worker cluster).

7.3.6 Discussion

Cloud9 inherits KLEE’s capabilities, being able to recognizememory errors and failed assertions. We did not add much

13

Page 14: Parallel Symbolic Execution for Automated Real-World ...

in terms of bug detection, only two mechanisms for detect-ing hangs: check if all symbolic threads are sleeping (dead-lock) and set a threshold for the maximum number of in-structions executed per path (infinite loop or livelock). Evenso, Cloud9 can find bugs beyond KLEE’s abilities becausethe POSIX model allows Cloud9 to reach more paths andexplore deeper portions of the tested program’s code—thisexposes additional potentially buggy situations. Cloud9 alsohas more total memory and CPU available, due to its dis-tributed nature, so it can afford to explore more paths thanKLEE. As we have shown above, it is feasible to offer proofsfor certain program properties: despite the exponential na-ture of exhaustively exploring paths, one can build small butuseful symbolic test cases that can be exhaustively executed.

7.4 Utility of Load Balancing

In this section we explore the utility of dynamic load balancing. Consider the example of exhaustively exploring paths with two symbolic packets in memcached, using 48 workers, but this time from a load balancing perspective. Fig. 12 shows that load balancing events occur frequently, with 3–6% of all states in the system being transferred between workers in almost every 10-second time interval.

Figure 12: The fraction of total states (candidate nodes) transferred between workers during symbolic execution. (States transferred [% of total] vs. time [minutes], in 10-second buckets.)

To illustrate the benefits of load balancing, we disable it at various moments in time and then analyze the evolution of total useful work done. Fig. 13 shows that eliminating load balancing at any moment during the execution significantly affects the subsequent performance of exploration, due to the ensuing imbalance. This demonstrates the necessity of taking a dynamic approach to parallel symbolic execution, instead of doing mere static partitioning of the execution tree.

Figure 13: Instruction throughput of Cloud9 with load balancing disabled at various points during the exhaustive test. (Useful work done [% of total instructions] vs. time [minutes]; curves for continuous load balancing, and for load balancing stopped after 6, 4, 2, and 1 minutes.)
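A minimal version of such a dynamic balancer (schematic; the slack threshold and transfer rule are illustrative, not Cloud9's actual policy) moves candidate states from workers above the average load to workers below it:

```python
def rebalance(loads, slack=0.25):
    """Move states from overloaded to underloaded workers.

    `loads` maps worker id -> number of candidate states. Workers more
    than `slack` away from the mean exchange states until all are close
    to it. Returns the list of (src, dst, count) transfers performed.
    """
    avg = sum(loads.values()) / len(loads)
    donors = {w: n for w, n in loads.items() if n > avg * (1 + slack)}
    takers = {w: n for w, n in loads.items() if n < avg * (1 - slack)}
    transfers = []
    for src, n in donors.items():
        for dst in list(takers):
            give = min(int(n - avg), int(avg - takers[dst]))
            if give <= 0:
                continue
            transfers.append((src, dst, give))
            loads[src] -= give
            loads[dst] += give
            takers[dst] += give
            n -= give
    return transfers

loads = {0: 1000, 1: 100, 2: 100, 3: 800}
rebalance(loads)
print(max(loads.values()) - min(loads.values()) < 500)  # True
```

Running this periodically, as the load balancer does every 10 seconds in the experiment above, keeps all workers busy even as subtree sizes diverge unpredictably.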

8. Related Work

To our knowledge, parallel symbolic execution was first described in [Ciortea 2009]. Cloud9 builds upon those ideas.

Recently, [Staats 2010] described an extension to Java PathFinder (JPF) that parallelizes symbolic execution by using parallel random searches on a static partition of the execution tree. JPF pre-computes a set of disjoint constraints that, when used as preconditions on a worker's exploration of the execution tree, steer each worker to explore a subset of paths disjoint from all other workers. In this approach, using constraints as preconditions imposes, at every branch in the program, a solving overhead relative to exploration without preconditions. The complexity of these preconditions increases with the number of workers, as the preconditions need to be more selective. Thus, per-worker solving overhead increases as more workers are added to the cluster. This limits scalability: the largest evaluated program had 447 lines of code and did not interact with its environment. Due to the static partitioning of the execution tree, total running time is determined by the worker with the largest subtree (as explained in §2). As a result, increasing the number of workers can even increase total test time instead of reducing it [Staats 2010]. Cloud9 mitigates these drawbacks.

Several sequential symbolic execution engines [Cadar 2008, Godefroid 2005; 2008, Majumdar 2007] have had great success in automated testing. These state-of-the-art tools exhaust available memory and CPU fairly quickly (as explained in §2). Cloud9 can help such tools scale beyond their current limits, making symbolic execution a viable testing methodology for a wider spectrum of software systems.

To our knowledge, we are the first to scalably parallelize symbolic execution on shared-nothing clusters. There has been work, however, on parallel model checking [Barnat 2007, Grumberg 2006, Kumar 2004, Lerda 1999, Stern 1997]. The SPIN model checker has been parallelized two-way for dual-core machines [Holzmann 2007]. Nevertheless, there are currently no model checkers that can scale to many loosely connected computers, mainly due to the overhead of coordinating the search across multiple machines and transferring explicit states. Cloud9 uses an encoding of states that is compact and enables better scaling.

Swarm verification [Holzmann 2008] generates a series of parallel verification runs with user-defined bounds on time and memory. Parallelism is used to execute different search strategies independently. The degree of parallelism is limited to the number of distinct search strategies. Cloud9 is not limited in this way: due to its use of dynamic load balancing, Cloud9 affords arbitrary degrees of parallelism.

Korat [Misailovic 2007] is a parallel testing system that demonstrated a ∼500× speedup when using 1,024 nodes. Korat is designed for cases when all program inputs (within a certain bound) can be generated offline. In contrast, Cloud9 handles arbitrarily complex inputs, even if enumeration is infeasible, and can handle system calls and thread schedules.


VeriSoft [Godefroid 1997] introduced stateless search for model checkers, in which a list of state transitions is used to reconstruct an execution state whenever needed. We extend this idea in the Cloud9 job model, which is in essence a form of stateless search. However, jobs are replayed from nodes on the frontier of the execution tree, instead of being replayed from the root. This is an important optimization: in practice, most programs have either a long linear path in the beginning (i.e., initialization code that does not depend on the symbolic input), or long linear executions between two consecutive state forks. Replaying from the root would represent a significant waste of resources for large programs.
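The saving can be illustrated with a toy cost model (illustrative only; segment lengths are made up): a path is a sequence of instruction segments between forks, and replay cost is the number of instructions re-executed from the chosen starting point:

```python
# Toy cost model for stateless replay (illustration, not Cloud9 code).
INIT = 100_000                    # long prologue independent of symbolic input
SEGMENTS = [INIT, 50, 70, 40]     # instructions between consecutive forks

def replay_cost_from_root(segments):
    """Re-execute everything from the program entry point."""
    return sum(segments)

def replay_cost_from_frontier(segments, kept_prefix=3):
    """Resume from a frontier node whose first `kept_prefix` segments
    are already materialized as a live state on some worker."""
    return sum(segments[kept_prefix:])

root = replay_cost_from_root(SEGMENTS)          # 100,160 instructions
frontier = replay_cost_from_frontier(SEGMENTS)  # 40 instructions
print(frontier < root)  # True: frontier replay skips the long prologue
```

The gap grows with the length of the shared prefix, which is exactly the common case described above (long initialization and long runs between forks).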

Other techniques can be used to improve the scalability of symbolic execution. For example, compositional test generation [Boonstoppel 2008, Godefroid 2007] automatically computes summaries that condense the result of exploration inside commonly used functions. S2E [Chipounov 2011] improves the scalability of symbolic execution by selectively executing symbolically only those parts of a system that are of interest to the tests. Cloud9 is complementary to these techniques and could be used to scale them further.

9. Conclusion

This paper presented Cloud9, an automated testing platform that employs parallelization to scale symbolic execution by harnessing the resources of commodity clusters. Cloud9 can automatically test real systems, such as memcached, that interact in complex ways with their environment. It includes a new symbolic environment model that supports all major aspects of the POSIX interface, and provides a systematic interface for writing symbolic tests. Further information on Cloud9 can be found at http://cloud9.epfl.ch.

Acknowledgments

We are indebted to Ayrat Khalimov, Cristian Cadar, Daniel Dunbar, Seungbeom Kim, Dawson Engler, and our EPFL colleagues for their help on this project. We thank Michael Hohmuth, our shepherd, and the anonymous EuroSys reviewers for their guidance on improving our paper. We are grateful to Google for supporting our work through a Focused Research Award.

References

[Amazon] Amazon. Amazon EC2. http://aws.amazon.com/ec2.

[Barnat 2007] Jiri Barnat, Lubos Brim, and Petr Rockai. Scalable multi-core LTL model-checking. In Intl. SPIN Workshop, 2007.

[Boonstoppel 2008] Peter Boonstoppel, Cristian Cadar, and Dawson R. Engler. RWset: Attacking path explosion in constraint-based test generation. In Intl. Conf. on Tools and Algorithms for the Construction and Analysis of Systems, 2008.

[Cadar 2006] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Dawson R. Engler. EXE: Automatically generating inputs of death. In Conf. on Computer and Communication Security, 2006.

[Cadar 2008] Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Symp. on Operating Systems Design and Implementation, 2008.

[Chipounov 2009] Vitaly Chipounov, Vlad Georgescu, Cristian Zamfir, and George Candea. Selective symbolic execution. In Workshop on Hot Topics in Dependable Systems, 2009.

[Chipounov 2011] Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. S2E: A platform for in-vivo multi-path analysis of software systems. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 2011.

[Ciortea 2009] Liviu Ciortea, Cristian Zamfir, Stefan Bucur, Vitaly Chipounov, and George Candea. Cloud9: A software testing service. In Workshop on Large Scale Distributed Systems and Middleware, 2009.

[Coreutils] Coreutils. http://www.gnu.org/software/coreutils/.

[Eucalyptus] Eucalyptus software. http://open.eucalyptus.com/, 2010.

[Godefroid 1997] Patrice Godefroid. Model checking for programming languages using VeriSoft. In Symp. on Principles of Programming Languages, 1997.

[Godefroid 2005] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In Conf. on Programming Language Design and Implementation, 2005.

[Godefroid 2007] Patrice Godefroid. Compositional dynamic test generation. In Symp. on Principles of Programming Languages, 2007. Extended abstract.

[Godefroid 2008] Patrice Godefroid, Michael Y. Levin, and David Molnar. Automated whitebox fuzz testing. In Network and Distributed System Security Symp., 2008.

[Grumberg 2006] Orna Grumberg, Tamir Heyman, and Assaf Schuster. A work-efficient distributed algorithm for reachability analysis. Formal Methods in System Design, 29(2), 2006.

[Holzmann 2007] G. J. Holzmann and D. Bosnacki. Multi-core model checking with SPIN. In Intl. Parallel and Distributed Processing Symp., 2007.

[Holzmann 2008] Gerard J. Holzmann, Rajeev Joshi, and Alex Groce. Tackling large verification problems with the Swarm tool. In Intl. SPIN Workshop, 2008.

[Kumar 2004] Rahul Kumar and Eric G. Mercer. Load balancing parallel explicit state model checking. In Intl. Workshop on Parallel and Distributed Methods in Verification, 2004.

[Lerda 1999] Flavio Lerda and Riccardo Sisto. Distributed-memory model checking with SPIN. In Intl. SPIN Workshop, 1999.

[Majumdar 2007] Rupak Majumdar and Koushik Sen. Hybrid concolic testing. In Intl. Conf. on Software Engineering, 2007.

[Marinescu 2009] Paul D. Marinescu and George Candea. LFI: A practical and general library-level fault injector. In Intl. Conf. on Dependable Systems and Networks, 2009.

[McConnell 2004] Steve McConnell. Code Complete. Microsoft Press, 2004.

[Misailovic 2007] Sasa Misailovic, Aleksandar Milicevic, Nemanja Petrovic, Sarfraz Khurshid, and Darko Marinov. Parallel test generation and execution with Korat. In Symp. on the Foundations of Software Eng., 2007.

[Musuvathi 2008] Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, Gérard Basler, Piramanayagam Arumuga Nainar, and Iulian Neamtiu. Finding and reproducing Heisenbugs in concurrent programs. In Symp. on Operating Systems Design and Implementation, 2008.

[RedHat] RedHat security. http://www.redhat.com/security/updates/classification, 2005.

[Staats 2010] Matt Staats and Corina Pasareanu. Parallel symbolic execution for structural test generation. In Intl. Symp. on Software Testing and Analysis, 2010.

[Stern 1997] Ulrich Stern and David L. Dill. Parallelizing the Murφ verifier. In Intl. Conf. on Computer Aided Verification, 1997.
