

  • Scalable State-Machine Replication

    Carlos Eduardo Bezerra∗‡, Fernando Pedone∗, Robbert van Renesse†

    ∗University of Lugano, Switzerland    †Cornell University, USA    ‡Universidade Federal do Rio Grande do Sul, Brazil

    Abstract—State machine replication (SMR) is a well-known technique for providing fault tolerance. SMR consists of sequencing client requests and executing them against replicas in the same order; thanks to deterministic execution, every replica reaches the same state after executing each request. However, SMR is not scalable: any replica added to the system executes all requests, so throughput does not increase with the number of replicas. Scalable SMR (S-SMR) addresses this issue in two ways: (i) by partitioning the application state, while allowing every command to access any combination of partitions, and (ii) by using a caching algorithm to reduce communication across partitions. We describe Eyrie, a Java library that implements S-SMR, and Volery, an application that implements Zookeeper’s API. We assess the performance of Volery and compare the results against Zookeeper. Our experiments show that Volery scales throughput with the number of partitions.

    I. INTRODUCTION

    Many current online services have stringent availability and performance requirements. High availability entails tolerating component failures and is typically accomplished with replication. For many online services, “high performance” essentially means the ability to serve an arbitrarily large load, something that can be achieved if the service can scale throughput with the number of replicas. State machine replication (SMR) [1], [2] is a well-known technique to provide fault tolerance without sacrificing strong consistency (i.e., linearizability). SMR regulates how client commands are propagated to and executed by the replicas: every non-faulty replica must receive and execute every command in the same order. Moreover, command execution must be deterministic. SMR provides configurable availability, by setting the number of replicas, but limited scalability: every replica added to the system must execute all requests; hence, throughput does not increase as replicas join the system.

    Distributed systems usually rely on state partitioning to scale (e.g., [3], [4]). If requests can be served simultaneously by multiple partitions, then augmenting the number of partitions results in an overall increase in system throughput. However, exploiting state partitioning in SMR is challenging. First, ensuring linearizability efficiently when state is partitioned is tricky. To see why, note that the system must be able to order multiple streams of commands simultaneously (e.g., one stream per partition), since totally ordering all commands cannot scale. But with independent streams of ordered commands, how to handle commands that address multiple partitions? Second, SMR hides from the service designer much of the complexity involved in replication; all the service designer must provide is a sequential implementation of each service command. If state is partitioned, then some commands may need data from multiple partitions. Should the service designer introduce additional logic in the implementation to handle such cases? Should the service be limited to commands that access a single partition?

    This paper presents Scalable State Machine Replication (S-SMR), an approach that achieves scalable throughput and strong consistency (i.e., linearizability) without constraining service commands or adding complexity to their implementation. S-SMR partitions the service state and relies on an atomic multicast primitive to consistently order commands within and across partitions. We show in the paper that simply ordering commands consistently across partitions is not enough to ensure strong consistency in partitioned state machine replication. S-SMR implements execution atomicity, a property that prevents command interleavings that violate strong consistency. To assess the performance of S-SMR, we developed Eyrie, a Java library that allows developers to implement partitioned-state services transparently, abstracting partitioning details, and Volery, a service that implements Zookeeper’s API [5]. All communication between partitions is handled internally by Eyrie, including remote object transfers. In the experiments we conducted with Volery, throughput scaled with the number of partitions, in some cases linearly. In some deployments, Volery reached over 250 thousand commands per second, largely outperforming Zookeeper, which served 45 thousand commands per second under the same workload.

    The paper makes the following contributions: (1) It introduces S-SMR and discusses several performance optimizations, including caching techniques. (2) It details Eyrie, a library to simplify the design of services based on S-SMR. (3) It describes Volery to demonstrate how Eyrie can be used to implement a service that provides Zookeeper’s API. (4) It presents a detailed experimental evaluation of Volery and compares its performance to Zookeeper.

    The remainder of this paper is organized as follows. Section II describes our system model. Section III presents state-machine replication and the motivation for this work. Section IV introduces S-SMR; we explain the technique in detail and argue for its correctness. Section V details the implementation of Eyrie and Volery. Section VI reports on the performance of the Volery service. Section VII surveys related work and Section VIII concludes the paper.

  • II. MODEL AND DEFINITIONS

    We consider a distributed system consisting of an unbounded set of client processes C = {c1, c2, ...} and a bounded set of server processes S = {s1, ..., sn}. Set S is divided into P disjoint groups of servers, S1, ..., SP. Processes are either correct, if they never fail, or faulty, otherwise. In either case, processes do not experience arbitrary behavior (i.e., no Byzantine failures). Processes communicate by message passing, using either one-to-one or one-to-many communication, as defined next. The system is asynchronous: there is no bound on message delay or on relative process speed.

    One-to-one communication uses primitives send(p, m) and receive(m), where m is a message and p is the process m is addressed to. If sender and receiver are correct, then every message sent is eventually received. One-to-many communication relies on atomic multicast, defined by the primitives multicast(γ, m) and deliver(m), where γ is a set of server groups. Let relation ≺ be defined such that m ≺ m′ iff there is a server that delivers m before m′. Atomic multicast ensures that (i) if a server delivers m, then all correct servers in γ deliver m (agreement); (ii) if a correct process multicasts m to groups in γ, then all correct servers in every group in γ deliver m (validity); and (iii) relation ≺ is acyclic (order).¹ The order property implies that if s and r deliver messages m and m′, then they deliver them in the same order. Atomic broadcast is a special case of atomic multicast in which there is a single group with all servers.
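    To make the model's primitives concrete, the sketch below shows one way they could be expressed in Java. The interface and type names here are assumptions made for illustration only; they do not correspond to any particular multicast library.

      import java.util.Set;
      import java.util.function.Consumer;

      // Illustrative rendering of the communication primitives assumed by the
      // model; Group, Message and the handler registration are placeholders.
      interface AtomicMulticast {
          // multicast(gamma, m): send message m to every server group in gamma.
          void multicast(Set<Group> gamma, Message m);

          // Register the deliver(m) handler; the implementation must guarantee
          // agreement, validity and acyclic delivery order (the relation ≺).
          void onDeliver(Consumer<Message> handler);
      }

      // Minimal placeholder types so the sketch is self-contained.
      record Group(int id) {}
      record Message(byte[] payload) {}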

    III. BACKGROUND AND MOTIVATION

    State-machine replication is a fundamental approach to implementing a fault-tolerant service by replicating servers and coordinating the execution of client commands against server replicas [1], [2]. The service is defined by a state machine, which consists of a set of state variables V = {v1, ..., vm} and a set of commands that may read and modify state variables, and produce a response for the command. Each command is implemented by a deterministic program. State-machine replication can be implemented with atomic broadcast: commands are atomically broadcast to all servers, and all correct servers deliver and execute the same sequence of commands.

    We are interested in implementations of state-machine replication that ensure linearizability. Linearizability is defined with respect to a sequential specification. The sequential specification of a service consists of a set of commands and a set of legal sequences of commands, which define the behavior of the service when it is accessed sequentially. In a legal sequence of commands, every response to the invocation of a command immediately follows its invocation, with no other invocation or response in between them. For example, a sequence of operations for a read-write variable v is legal if every read command returns the value of the most recent write command that precedes the read, if there is one, or the initial value otherwise. An execution E is linearizable if there is some permutation of the commands executed in E that respects (i) the service’s sequential specification and (ii) the real-time precedence of commands. Command C1 precedes command C2 if the response of C1 occurs before the invocation of C2.

    ¹ Solving atomic multicast requires additional assumptions [6], [7]. In the following, we simply assume the existence of an atomic multicast oracle.

    In classical state-machine replication, throughput does not scale with the number of replicas: each command must be ordered among replicas and executed and replied to by every (non-faulty) replica. Some simple optimizations to the traditional scheme can provide improved performance but not scalability. For example, although update commands must be ordered and executed by every replica, only one replica needs to respond to the client, saving resources at the other replicas. Commands that only read the state must be ordered with respect to other commands, but can be executed by a single replica, the replica that will respond to the client.

    This is a fundamental limitation: while some optimizations may increase throughput by adding servers, the improvements are limited since, fundamentally, the technique does not scale. In the next section, we describe an extension to SMR that, under certain workloads, allows performance to grow proportionally to the number of replicas.

    IV. SCALABLE STATE-MACHINE REPLICATION

    In this section, we introduce S-SMR, discuss performance optimizations, and argue for S-SMR’s correctness.

    A. General idea

    S-SMR divides the application state V (i.e., state variables) into P partitions P1, ..., PP, where for each Pi, Pi ⊆ V. Moreover, we require each variable v in V to be assigned to at least one partition and define part(v) as the partitions that hold v. Each partition Pi is replicated by servers in group Si. For brevity, we say that server s belongs to Pi with the meaning that s ∈ Si, and say that client c multicasts command C to partition Pi meaning that c multicasts C to group Si.

    To execute command C, the client multicasts C to all partitions that hold a variable read or updated by C. Consequently, the client must be able to determine the partitions accessed by C, denoted by part(C). Note that this assumption does not imply that the client must know all variables accessed by C, but it must know the partitions these variables belong to. If the client cannot accurately estimate which partitions are accessed by C, it must determine a superset of these partitions, in the worst case assuming all partitions. For performance, however, clients must strive to provide a close approximation to the command’s actually accessed partitions. We assume the existence of an oracle that tells the client which partitions should receive each command.

    Upon delivering command C, if server s does not contain all variables read by C, s must communicate with servers in other partitions to execute C. Essentially, s must retrieve every variable v read in C from a server that stores v (i.e., a server in a partition in part(v)). Moreover, s must retrieve a value of v that is consistent with the order in which C is executed, as we explain next. Operations that do not involve reading a variable from a remote partition are executed locally.

    In more detail, let op be an operation in the execution of command C. We distinguish between three operation types: read(v), an operation that reads the value of a state variable v; write(v, val), an operation that updates v with value val; and an operation that performs a deterministic computation.

    Server s in partition Pi executes op as follows (a code sketch of these rules follows the list):

    i) op is a read(v) operation. If Pi ∈ part(v), then s retrieves the value of v and sends it to every partition Pj that delivers C and does not hold v. If Pi ∉ part(v), then s waits for v to be received from a server in a partition in part(v).

    ii) op is a write(v, val) operation. If Pi ∈ part(v), s updates the value of v with val; if Pi ∉ part(v), s executes op, creating a local copy of v, which will be up-to-date at least until the end of C’s execution.

    iii) op is a computation operation. In this case, s executes op.
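    The sketch below renders rules (i)–(iii) in Java for a server in partition Pi. All class and method names (PartitionServer, sendToPartitions, awaitRemoteValue) are assumptions made for the example; they are not part of S-SMR's definition or of Eyrie.

      import java.util.Map;
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      // Illustrative sketch of rules (i)-(iii); names and helpers are
      // assumptions for this example, not Eyrie's API.
      final class PartitionServer {
          private final Set<String> localVars;                 // variables held by this partition
          private final Map<String, Object> values = new ConcurrentHashMap<>();

          PartitionServer(Set<String> localVars) { this.localVars = localVars; }

          // Rule (i): read(v).
          Object read(String v, String commandId, Set<Integer> otherPartitions) {
              if (localVars.contains(v)) {
                  // local variable: send its value to the partitions that
                  // deliver the command but do not hold v
                  sendToPartitions(otherPartitions, commandId, v, values.get(v));
                  return values.get(v);
              }
              Object remote = awaitRemoteValue(commandId, v);  // wait for a holder of v
              values.put(v, remote);                           // temporary local copy
              return remote;
          }

          // Rule (ii): write(v, val), applied locally whether or not v is held here.
          void write(String v, Object val) { values.put(v, val); }

          // Rule (iii): a deterministic computation is executed locally.
          void compute(Runnable op) { op.run(); }

          // Placeholders standing in for inter-partition communication.
          private void sendToPartitions(Set<Integer> parts, String cid, String v, Object val) { /* ... */ }
          private Object awaitRemoteValue(String cid, String v) { /* ... */ return null; }
      }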

    As we now show, the procedure above does not ensure linearizability. Consider the execution depicted in Figure 1 (a), where state variables x and y have an initial value of 10. Command Cx reads the value of x, Cy reads the value of y, and Cxy sets x and y to value 20. Consequently, Cx is multicast to partition Px, Cy is multicast to Py, and Cxy is multicast to both Px and Py. Servers in Py deliver Cy and then Cxy, while servers in Px deliver Cxy and then Cx, which is consistent with atomic order. In this execution, the only possible legal permutation for the commands is Cy, Cxy, and Cx, which violates the real-time precedence of the commands, since Cx precedes Cy in real time.

    Intuitively, the problem with the execution in Figure 1 (a) is that commands Cx and Cy execute “in between” the execution of Cxy at partitions Px and Py. In S-SMR, we avoid such cases by ensuring that the execution of every command is atomic. Command C is execution atomic if, for each server s that executes C, there exists at least one server r in every other partition in part(C) such that the execution of C at s finishes after r delivers C. More precisely, let delivery(C, s) and end(C, s) be, respectively, the time when s delivers command C and the time when s completes C’s execution. Execution atomicity ensures that, for every server s in partition P that executes C, there is a server r in every P′ ∈ part(C) such that delivery(C, r) < end(C, s). Intuitively, this condition guarantees that the executions of C at P and P′ overlap in time.

    Replicas can ensure execution atomicity by coordinating the execution of commands. After delivering command C, servers in each partition send a signal(C) message to servers in the other partitions in part(C). Before finishing the execution of C, each server must receive a signal(C) message from at least one server in every other partition that executes C. Moreover, if server s in partition P receives the value of a variable from server r in another partition P′, as part of the execution of C, then s does not need to receive a signal(C) message from servers in P′. This way, to tolerate f failures, each partition requires f + 1 servers; if all servers in a partition fail, service progress is not guaranteed.

    Figure 1 (b) shows an execution of S-SMR. In the example, servers in Px wait for a signal from Py, therefore delaying Cxy’s execution in Px and moving the execution of Cx ahead in time. Note that the outcome of each command execution is the same as in case (a), but the executions of Cx, Cy and Cxy, as seen by clients, now overlap in time with one another. Hence, there is no real-time precedence among them.

    B. Detailed algorithm

    In Algorithm 1, we show the basic operation of S-SMR. To submit a command C, the client queries an oracle to get the set dests (line 5), which is a superset of part(C) used by the client as the destination set for C (line 6).

    Upon delivering C, server s in partition P multicasts signal(C) to others, which is the set containing all other partitions involved in C (lines 10 and 11). It might happen that s receives signals concerning C from other partitions even before s starts executing C. For this reason, s must buffer signals and, when starting the execution of C, check whether signals have already been buffered. For the sake of simplicity, Algorithm 1 simply initializes such buffers as ∅ for all possible commands. In practice, buffers for C are created when the first message concerning C is delivered.
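    One simple way to realize such buffering is a lazily populated map keyed by command identifier; the sketch below is an assumption about how this could be coded, not Eyrie's actual data structure.

      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.ConcurrentMap;

      // Buffers signal(C) messages that may arrive before C starts executing.
      // Buffers are created lazily, on the first message concerning a command,
      // rather than pre-initialized for every command as in Algorithm 1.
      final class SignalBuffer {
          private final ConcurrentMap<String, Set<Integer>> rcvdSignals = new ConcurrentHashMap<>();

          // Called upon deliver(signal(C)) from partition p (lines 22-23).
          void onSignal(String commandId, int fromPartition) {
              rcvdSignals.computeIfAbsent(commandId, id -> ConcurrentHashMap.newKeySet())
                         .add(fromPartition);
              synchronized (this) { notifyAll(); }
          }

          // Called before replying to the client (line 20): block until a
          // signal from every other involved partition has been received.
          synchronized void awaitSignals(String commandId, Set<Integer> others) throws InterruptedException {
              while (!rcvdSignals.getOrDefault(commandId, Set.of()).containsAll(others)) {
                  wait();
              }
          }
      }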

    After multicasting signals, server s proceeds to the execution of C, which is a sequence of operations that might read or write variables in V. The main concern is with operations that read variables, as they may determine the outcome of the command execution. All other operations can be executed locally at s. If the operation reads variable v and v belongs to P, s’s partition, then s multicasts the value of v to the other partitions that delivered C (line 15). The command identifier C.id is sent along with v to make sure that the other partitions will use the appropriate value of v during C’s execution. If v belongs to some other partition P′, s waits until an up-to-date value of v has been delivered (line 17). Every other operation is executed with no interaction with other partitions (line 19).

    After executing all operations of C, s waits until a signal from every other partition has been received (line 20) and, only then, sends the reply back to the client (line 21). This ensures that C will be execution atomic.

    Figure 1. Atomic multicast and S-SMR: (a) atomic multicast does not ensure linearizability; (b) S-SMR achieves linearizability with signaling among partitions. (To simplify the figure, we show a single replica per partition.)

    C. Performance optimizations

    Algorithm 1 can be optimized in many ways. In this section, we briefly mention some of these optimizations and then detail caching.

    • Server s does not need to wait for the execution of command C to reach a read(v) operation before multicasting v to the other partitions in part(C). If s knows that v will be read by C, s can send v’s value to the other partitions as soon as s starts executing C.

    • The exchange of objects between partitions serves the purpose of signaling. Therefore, if server s sends variable v’s value to server r in another partition, r does not need to receive a signal message from s’s partition.

    • It is not necessary to exchange each variable more than once per command: since any change during the execution of the command is deterministic, such changes can be applied to the cached value.

    • Even though all replicas in all partitions in part(C) execute C, a reply from a replica in a single partition suffices for the client to finish the command.

    Server s in partition P can cache variables that belong to other partitions. There are different ways for s to maintain cached variables; here we define two techniques: conservative caching and speculative caching. In both cases, the basic operation is the following: when s executes a command that reads variable x from some other partition Px, after retrieving the value of x from a server in Px, s stores x’s value in its cache and uses the cached value in future read operations. If a command writes x, s updates (or creates) x’s local value. Server s will have a valid cache of x until (i) s discards the entry due to memory constraints, or (ii) some command not multicast to P changes the value of x. Since servers in Px deliver all commands that access x, these servers know when any possible cached value of x is stale. How servers use cached entries distinguishes conservative from speculative caching.

    Servers in Px can determine which of their variables have a stale value cached in other partitions. This can be done by checking whether there was any command that updated a variable x in Px and was not multicast to some other partition P that had a cache of x. Say servers in Px deliver command C, which reads x, and say the last command that updated the value of x was Cw. Since x ∈ Px, servers in Px delivered Cw. One way for servers in Px to determine which partitions need to update their cache of x is by checking which destinations of C did not receive Cw. This can be further optimized: even if servers in P did not deliver Cw, but delivered some other command Cr that reads x and was ordered by multicast after Cw, then P already received an up-to-date value of x (sent by servers in Px during the execution of Cr). If servers in P discarded the cache of x (e.g., due to limited memory), they will have to send a request for its value.

    Conservative caching: Once s has a cached value of x, before it executes a read(x) operation, it waits for a cache-validation message from a server in Px. The cache-validation message contains a set of pairs (var, val), where var is a state variable that belongs to Px and whose cache in P needs to be validated. If servers in Px determined that the cache is stale, val contains the new value of var; otherwise, it contains ⊥, telling s that its cached value is up to date. If s discarded its cached copy, it sends a request for x to Px. If it is possible to determine which variables are accessed by C before C’s execution, all such messages can be sent upon delivery of the command, reducing waiting time; messages concerning variables that could not be determined a priori are sent later, during the execution of C, as the variables become known.
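    A minimal sketch of the caching side of the conservative scheme, assuming validation messages arrive on a simple blocking queue; all class and method names are illustrative, not Eyrie's implementation.

      import java.util.Map;
      import java.util.Optional;
      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.LinkedBlockingQueue;

      // Conservative caching sketch: a read of a cached variable blocks until
      // the owning partition validates (or refreshes) the entry.
      final class ConservativeCache {
          private final Map<String, Object> cache = new ConcurrentHashMap<>();
          // cache-validation messages: (var, new value), or (var, empty) meaning "up to date" (the paper's ⊥)
          private final BlockingQueue<Map.Entry<String, Optional<Object>>> validations =
                  new LinkedBlockingQueue<>();

          // Called when a cache-validation pair (var, val) is delivered.
          void onValidation(String var, Optional<Object> val) {
              validations.add(Map.entry(var, val));
          }

          // read(x): consume validations until the one concerning x arrives.
          // (Simplification: validations for other variables are discarded here.)
          Object read(String x) throws InterruptedException {
              while (true) {
                  Map.Entry<String, Optional<Object>> v = validations.take();
                  if (!v.getKey().equals(x)) continue;
                  v.getValue().ifPresent(val -> cache.put(x, val)); // stale entry: refresh it
                  return cache.get(x);                              // empty: cached value was valid
              }
          }
      }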


  • Algorithm 1  Scalable State-Machine Replication (S-SMR)

     1: Initialization:
     2:   ∀C ∈ K : rcvd_signals(C) ← ∅
     3:   ∀C ∈ K : rcvd_variables(C) ← ∅

     4: Command C is submitted by a client as follows:
     5:   dests ← oracle(C)
     6:   multicast(dests, C)
     7:   wait for response from one server

     8: Command C is executed by a server in partition P as follows:
     9:   upon deliver(C)
    10:     others ← dests \ {P}
    11:     multicast(others, signal(C))
    12:     for each operation op in C do
    13:       if op is read(v) then
    14:         if v ∈ P then
    15:           multicast(others, {v, C.id})
    16:         else
    17:           wait until v ∈ rcvd_variables(C)
    18:           update v with the value in rcvd_variables(C)
    19:       execute op
    20:     wait until rcvd_signals(C) = others
    21:     send reply to client

    22:   upon deliver(signal(C)) from partition P′
    23:     rcvd_signals(C) ← rcvd_signals(C) ∪ {P′}

    24:   upon deliver({v, C.id})
    25:     rcvd_variables(C) ← rcvd_variables(C) ∪ {v}

    Algorithm variables:
    K: the set of all possible commands
    C.id: unique identifier of command C
    oracle(C): function that returns a superset of part(C)
    dests: set of partitions to which C is multicast
    others: set of partitions waiting for signals and variables from P; also, P waits for signals from all such partitions
    signal(C): a synchronization message that allows S-SMR to ensure that C is execution atomic
    rcvd_signals(C): a set containing all partitions that have already signaled P regarding the execution of C
    rcvd_variables(C): a set containing all variables that must be received from other partitions in order to execute C

    Speculative caching: It is possible to reduce execution time by speculatively assuming that cached values are up to date. Speculative caching requires servers to be able to roll back the execution of commands, in case the speculative assumption fails to hold. Many applications allow rolling back a command, such as databases, as long as no reply has been sent to the client for the command yet. The difference between speculative caching and conservative caching is that in the former, servers that keep cached values do not wait for a cache-validation message before reading a cached entry; instead, a read(x) operation returns the cached value immediately. If, after reading some variable x from the cache during the execution of command C, server s receives a message from a server in Px that invalidates the cached value, s rolls back the execution to some point before the read(x) operation and resumes the command execution, now with the up-to-date value of x. Server s can only reply to the client that issued C after every variable read from the cache has been validated.
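    The speculative variant might be organized as below; re-execution stands in for a proper rollback mechanism, and all names are assumptions made for the example (the paper does not prescribe an implementation).

      import java.util.HashSet;
      import java.util.Map;
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      // Speculative caching sketch: reads use the cache immediately; if an
      // invalidation for a variable the command read arrives before the reply
      // is sent, the command is simply re-executed (a stand-in for rollback).
      final class SpeculativeExecutor {
          private final Map<String, Object> cache = new ConcurrentHashMap<>();
          private final Set<String> invalidated = ConcurrentHashMap.newKeySet();

          interface SpeculativeCommand {
              Object run(Map<String, Object> cache);    // reads go straight to the cache
          }

          // The owner partition reports a stale entry together with its new value.
          void invalidate(String var, Object freshValue) {
              cache.put(var, freshValue);
              invalidated.add(var);
          }

          Object execute(SpeculativeCommand cmd, Set<String> cachedReads) {
              while (true) {
                  Set<String> before = new HashSet<>(invalidated);
                  before.retainAll(cachedReads);
                  Object result = cmd.run(cache);       // speculative execution
                  Set<String> after = new HashSet<>(invalidated);
                  after.retainAll(cachedReads);
                  if (after.equals(before)) {
                      // (Simplified: S-SMR also waits for every cached read to be
                      // positively validated before replying to the client.)
                      return result;
                  }
                  // speculation failed: loop re-executes with refreshed values
              }
          }
      }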

    D. Correctness

    In this proof, we denote the order given by atomic multicast with “≺”. Given any two messages m1 and m2, “m1 ≺ m2” means that both messages are delivered by the same group and m1 is delivered before m2, or there is some message m′ such that m1 ≺ m′ and m′ ≺ m2, which can be written as m1 ≺ m′ ≺ m2.

    We argue that, if every command in an execution E of S-SMR is execution atomic, then E is linearizable. Suppose, by means of contradiction, that there exist two commands x and y, where x finishes before y starts, but y ≺ x in the execution. There are two possibilities: (i) x and y access some variable v in common, or (ii) x and y access no variable in common.

    In case (i), at least one partition Pv (which contains v) delivers both x and y. As x finishes before y starts, Pv delivers x, then y. From the properties of atomic multicast, and since each partition is mapped to a multicast group, no partition delivers y, then x. Moreover, atomic multicast ensures acyclic order, so there are no commands z1, ..., zn such that their atomic order is y ≺ z1 ≺ · · · ≺ zn ≺ x. So we have reached a contradiction in this case.

    In case (ii), if there were no other command in E, then the execution of x and y could be done in any order, which would contradict the supposition that y ≺ x. Therefore, there are commands z0, ..., zn such that their atomic order is y ≺ z0 ≺ · · · ≺ zn ≺ x. As y ≺ z0, some partition Py delivers y, then z0. Also, since z0 ≺ z1, some partition P1 delivers z0, then z1, and so on: partition Pi delivers zi−1, then zi, where i ∈ {1, ..., n}. Finally, partition Px delivers zn, then x.

    We now claim that, for every Rx ∈ Px, Rx finishes executing x only after some R0 ∈ P0 delivered z0. We prove our claim by induction.

    Induction basis: As zn ≺ x, every Rx ∈ Px executes command x only after the execution of zn at Rx finished. As zn is execution atomic, for every Rx ∈ Px, there is a server Rn ∈ Pn that delivered zn before Rx finished executing zn, which was before Rx even started executing command x. Therefore, every Rx ∈ Px finishes executing x only after some Rn ∈ Pn delivered zn.

    Induction step: Assume that every Rx ∈ Px finishes executing x only after some Ri ∈ Pi delivered zi. As zi is execution atomic, there is some Ri−1 ∈ Pi−1 that delivered zi before Ri finished executing zi. As zi−1 ≺ zi and Ri−1 delivers both, Ri−1 delivers zi−1 before executing zi. Thus, Rx ∈ Px finishes executing x only after some Ri−1 has delivered zi−1, for every Rx ∈ Px. This proves our claim.

    Finally, for every R0 ∈ P0, there is a server Ry ∈ Py that delivers z0 before R0 finishes executing z0. Since every Ry delivers and executes y, then z0, and from the claim above, for every Rx ∈ Px, there is an Ry ∈ Py that delivers y before Rx finishes executing x. This contradicts the assumption that x precedes y in real time, i.e., that the reply to x is received before y is invoked by a client.

    V. IMPLEMENTATION

    In this section, we describe Eyrie, a library that implements S-SMR, and Volery, a service that provides Zookeeper’s API. Both Eyrie and Volery were implemented in Java.

    A. Eyrie

    One of the main goals of Eyrie is to make the implementation of services based on Scalable SMR as easy as possible. To use Eyrie, the user (i.e., the service designer) must extend two classes, PRObject and StateMachine. Class PartitioningOracle has a default implementation, but the user is encouraged to override its methods.

    1) The PRObject class: Eyrie supports partial replication (i.e., some objects may be replicated in some partitions, not all). Therefore, when executing a command, a replica might not have local access to some of the objects involved in the execution of the command. The user informs Eyrie which object classes are partially replicated by extending the PRObject class. Each object of such a class may be stored locally or remotely, but the application code is agnostic to that. All calls to methods of such objects are intercepted by Eyrie, transparently to the user.

    Eyrie uses AspectJ² to intercept method calls for all subclasses of PRObject. Internally, the aspect related to such method invocations communicates with the StateMachine instance in order to (i) determine if the object is stored locally or remotely and (ii) ensure that the object is up-to-date when each command is executed.

    Each replica has a local copy of all PRObject objects. When a remote object is received, replicas in the local partition PL must update their local copy of the object with an up-to-date value. For this purpose, the user must provide implementations for the methods getDiff(Partition p) and updateFromDiff(Object diff). The former is used by the remote partition PR, which owns the object, to calculate a delta between the old value currently held by PL and the newest value, held by PR. Such implementations may be as simple as returning the full object, which is then serialized and, upon deserialization in PL, completely overwrites the old copy of the object. However, it also allows the user to implement caching mechanisms. Since getDiff takes a partition as parameter, the user may keep track of what was the last value received by PL, and then return a (possibly small) diff, instead of the whole object, which is then applied to the object with the user-provided method updateFromDiff.

    ² http://eclipse.org/aspectj

    To avoid unnecessary communication, the user may optionally mark some methods of their PRObject subclasses as local, by annotating them with @LocalMethodCall. Calls to such methods are not intercepted by the library, sparing communication when the user sees fit. Although the command that contains a call to such a method still has to be delivered and executed by all partitions that hold objects accessed by the command, that particular local method does not require an up-to-date object. For example, say a command C accesses objects O1 and O2, in partitions P1 and P2, respectively. C completely overwrites objects O1 and O2, by calling O1.clear() and O2.clear(). Although C has to be delivered by both partitions to ensure linearizability, a write method that completely overwrites an object, regardless of its previous state, does not need an up-to-date version of the object before executing. Because of this, method clear() can be safely annotated as local, avoiding unnecessary communication between P1 and P2.
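    For illustration, a user-defined partially replicated object could look like the following sketch. PRObject, @LocalMethodCall, getDiff(Partition p) and updateFromDiff(Object diff) are the names given in the text; the map-based state, the class name and the return handling are assumptions.

      import java.util.HashMap;
      import java.util.Map;

      // Sketch of a partially replicated object; only the superclass, the
      // annotation and the two diff methods come from the paper's description.
      public class CachedNode extends PRObject {
          private Map<String, byte[]> entries = new HashMap<>();

          public void put(String key, byte[] value) { entries.put(key, value); }

          public byte[] get(String key) { return entries.get(key); }

          // Completely overwrites the object's state, so it does not need an
          // up-to-date copy before executing and can be marked as local.
          @LocalMethodCall
          public void clear() { entries = new HashMap<>(); }

          // Computes what partition p is missing. Returning the full state is
          // always correct; a smarter version could return only changed entries.
          public Object getDiff(Partition p) {
              return new HashMap<>(entries);
          }

          // Applies the received diff, bringing the local copy up to date.
          @SuppressWarnings("unchecked")
          public void updateFromDiff(Object diff) {
              this.entries = (Map<String, byte[]>) diff;
          }
      }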

    2) The StateMachine class: This class must be extended by the user’s application server class. To execute commands, the user must provide an implementation for the method executeCommand(Command c). The code for such a method is agnostic to the existence of partitions. In other words, it can be exactly the same as the code used to execute commands with classical state-machine replication (i.e., full replication). Eyrie is responsible for handling all communication between partitions transparently. To start the server, method runStateMachine() is called.

    StateMachine ensures linearizability by making sure that each command is execution atomic (as defined in Section IV-A). As soon as each command C is delivered, StateMachine sends signal(C) to all remote partitions that deliver C, in order to reduce waiting time. A command can only conclude its execution after it has received a signal from at least one server in every other partition that delivered the command—remote object updates received from other partitions count as signals for linearizability purposes.
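    A corresponding server might be sketched as follows; executeCommand(Command c) and runStateMachine() are the methods named in the text (their exact signatures, including the return type used here, are assumptions), and the command decoding reuses the hypothetical CachedNode object from the previous sketch.

      // Sketch of a user-defined server based on Eyrie's StateMachine.
      public class VoleryLikeServer extends StateMachine {
          private final CachedNode root = new CachedNode();   // partially replicated state

          // Written exactly as it would be for classical, fully replicated SMR:
          // Eyrie intercepts accesses to PRObject instances and performs remote
          // reads and signaling transparently.
          public Object executeCommand(Command c) {
              String op = (String) c.getNext();
              String key = (String) c.getNext();
              switch (op) {
                  case "put":   root.put(key, (byte[]) c.getNext()); return Boolean.TRUE;
                  case "get":   return root.get(key);
                  case "clear": root.clear(); return Boolean.TRUE;
                  default:      return null;
              }
          }

          public static void main(String[] args) {
              new VoleryLikeServer().runStateMachine();        // starts the replica
          }
      }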

    In order to reduce communication, it is sufficient that a single replica in each partition sends object updates and signal messages to other partitions. If the designated replica fails, the other replicas in the same partition will suspect the failure and one of the operational replicas will retransmit the information.

    3) The PartitioningOracle class: Clients multicast each command directly to the partitions affected by the command, i.e., those that contain objects accessed by the command. Although Eyrie encapsulates most details regarding partitioning, the user must provide an oracle that tells, for each command, which partitions are affected by the command. The set of partitions returned by the oracle needs to contain all partitions involved, but does not need to be minimal. In fact, the default implementation of the oracle simply returns all partitions for every command, which, although correct, is not efficient. For best performance, the partition set returned by the oracle should be as small as possible, which requires the user to extend PartitioningOracle and override its methods.

    Method getDestinations(Command c) is used by the oracle to tell which partitions should receive each command. It returns a list of Partition objects. The user can override this method so that it parses command c and returns a list containing all partitions involved in the execution of c. The getDestinations method can encapsulate any kind of implementation, including one that involves communicating with servers, so its execution does not necessarily need to be local to clients. If the set of partitions involved in the execution of a command cannot be determined a priori, the oracle can communicate with servers to determine such a set and then return it to the client, which then multicasts the command to the right partitions.

    Another important method in PartitioningOracle is getLocalObjects(Command c), which is used by servers before executing c. The method returns a list of objects in the partition of the server that will be accessed by c. This list does not need to be complete, but any kind of early knowledge about which objects need to be updated in other partitions helps decrease execution time, as the objects can be sent as soon as the server starts executing the command. The default implementation of this method returns an empty list, which means that objects are exchanged among partitions as their methods are invoked during execution. Depending on the application, the user may provide an implementation for this method.
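    An oracle specialized for a path-based service might look like the sketch below; getDestinations and getLocalObjects are the methods described in the text, while the class name, the routing logic and the way command items are inspected are assumptions. (For the sketch we consume the first two items of the command; a real oracle would inspect them without consuming them.)

      import java.util.ArrayList;
      import java.util.List;

      // Sketch of a user-provided oracle extending Eyrie's PartitioningOracle.
      public class PathOracle extends PartitioningOracle {
          private final List<Partition> all;            // e.g., obtained via getPartitionList()

          public PathOracle(List<Partition> all) { this.all = all; }

          // Superset of the partitions accessed by c; the smaller and more
          // accurate this set, the better S-SMR scales.
          public List<Partition> getDestinations(Command c) {
              String op = (String) c.getNext();
              String path = (String) c.getNext();
              if (op.equals("create") || op.equals("delete")) {
                  return all;                           // global command
              }
              List<Partition> dests = new ArrayList<>();
              dests.add(all.get(Math.floorMod(path.hashCode(), all.size())));
              return dests;                             // local command
          }

          // Early knowledge of local objects accessed by c, so that their
          // values can be pushed to other partitions right away.
          public List<Object> getLocalObjects(Command c) {
              return new ArrayList<>();                 // default-like behavior: discover during execution
          }
      }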

    4) Other classes: In the following, we briefly describe a few accessory classes provided by Eyrie.

    The Partition class has two relevant methods, getId() and getPartitionList(), which return, respectively, the partition’s unique identifier and the list of all partitions in the system. The oracle can use this information to map commands to partitions.

    When sending commands, the client must multicast a Command object, which is serialized and sent to the partitions determined by the client’s oracle. To the user, a command object is simply a container of objects, which are typically parameters for the command. The Command class offers methods addItems(Objects... objs), getNext(), hasNext(), and so on. How the server will process such parameters is application-dependent and determined by the user’s implementation of StateMachine.
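    As a usage sketch, a client could pack parameters into a Command and a server could consume them as follows; only addItems, getNext and hasNext come from the text, while the constructor and the submission path are assumed.

      // Client-side sketch: packing parameters into a Command and reading them
      // back on the server.
      public class CommandExample {
          public static Command buildSetData(String path, byte[] data) {
              Command c = new Command();
              c.addItems("setData", path, data);     // items are plain Java objects
              return c;                              // then multicast to the oracle's destinations
          }

          // Server side (e.g., inside executeCommand): items are consumed in order.
          public static void decode(Command c) {
              while (c.hasNext()) {
                  Object item = c.getNext();
                  System.out.println("parameter: " + item);
              }
          }
      }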

    Eyrie uses atomic multicast to disseminate commands from clients and to handle communication between partitions. Internally, it uses an implementation³ of Multi-Ring Paxos [8]. To map rings to partitions, each server in partition Pi is a learner in rings Ri and Rall (merging deliveries from the two rings is deterministic); if message m is addressed only to Pi, m is sent to Ri, otherwise it is sent to Rall (and discarded by non-addressee partitions).
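    The ring-selection rule amounts to a few lines of code; the sketch below uses placeholder ring identifiers and is not the URingPaxos API.

      import java.util.Set;

      // Sketch of the ring-selection rule described above.
      final class RingRouter {
          private final int ringAll;                      // identifier of ring R_all

          RingRouter(int ringAll) { this.ringAll = ringAll; }

          // A message addressed to a single partition Pi goes to ring Ri;
          // anything spanning partitions goes to R_all (and is discarded by
          // partitions that are not among its addressees).
          int ringFor(Set<Integer> addressedPartitions) {
              return addressedPartitions.size() == 1
                      ? addressedPartitions.iterator().next()
                      : ringAll;
          }
      }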

    B. Volery

    We implemented the Volery service on top of Eyrie, providing an API similar to that of Zookeeper [5]. ZooKeeper implements a hierarchical key-value store, where each value is stored in a znode, and each znode can have other znodes as children. The abstraction implemented by ZooKeeper resembles a file system, where each path is a unique string (i.e., a key) that identifies a znode in the hierarchy. We implemented the following Volery client API (a Java sketch of the interface follows the list):

    • create(String path, byte[] data): creates a znode with the given path, holding data as content, if there was no znode with that path previously and there is a znode with the parent path.

    • delete(String path): deletes the znode that has the given path, if there is one and it has no children.

    • exists(String path): returns True if there exists a znode with the given path, or False otherwise.

    • getChildren(String path): returns the list of znodes that have path as their parent.

    • getData(String path): returns the data held by the znode identified by path.

    • setData(String path, byte[] data): sets the contents of the znode identified by path to data.
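    In Java, the API above could be captured by an interface such as the following; the method names and parameters come from the list, while the interface itself and the return types not stated there (getChildren, getData) are assumptions.

      import java.util.List;

      // Java rendering of the Volery client API listed above.
      public interface VoleryClient {
          void create(String path, byte[] data);   // only if the path is new and the parent exists
          void delete(String path);                // only if the znode exists and has no children
          boolean exists(String path);
          List<String> getChildren(String path);
          byte[] getData(String path);
          void setData(String path, byte[] data);
      }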

    Zookeeper ensures a mix of linearizability (for write commands) and session consistency (for read commands). Every reply to a read command (e.g., getData) issued by a client is consistent with all write commands (e.g., create or setData) issued previously by the same client. With this consistency model, Zookeeper is able to scale for workloads composed mainly of read-only requests. Volery ensures linearizability for every execution, regardless of what kind of commands are issued. In order to be scalable, Volery makes use of partitioning, done with Eyrie.

    Volery’s znodes are distributed among partitions based on each znode’s path: a function f(path) returns the id of the partition responsible for holding the znode at path. Function f is used by Volery’s oracle to help clients determine which partitions must receive each command. The commands getData, setData, exists and getChildren are each multicast to a single partition and are thus called local commands. Commands create and delete are multicast to all partitions and are called global commands; they are multicast to all partitions to guarantee that every (correct) replica has a full copy of the znode hierarchy, even though only the partition that owns each given znode is sure to have its contents up to date.
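    The text does not specify f; a purely illustrative choice is to hash the path, as sketched below (note that this particular f scatters sibling znodes across partitions).

      // Hypothetical partitioning function f(path) -> partition id.
      final class ZnodePartitioner {
          private final int numPartitions;

          ZnodePartitioner(int numPartitions) { this.numPartitions = numPartitions; }

          int f(String path) {
              return Math.floorMod(path.hashCode(), numPartitions);
          }
      }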

    ³ https://github.com/sambenz/URingPaxos


  • VI. PERFORMANCE EVALUATION

    In this section, we assess the performance of Volery with on-disk and in-memory deployments, and with local and global commands.

    A. Environment setup and configuration parameters

    We ran all our experiments on a cluster that had two types of nodes: (a) HP SE1102 nodes, equipped with two Intel Xeon L5420 processors running at 2.5 GHz and with 8 GB of main memory, and (b) Dell SC1435 nodes, equipped with two AMD Opteron 2212 processors running at 2.0 GHz and with 4 GB of main memory. The HP nodes were connected to an HP ProCurve Switch 2910al-48G gigabit network switch, and the Dell nodes were connected to an HP ProCurve 2900-48G gigabit network switch. The switches were interconnected by a 20 Gbps link. All nodes ran CentOS Linux 6.3 with kernel 2.6.32 and the Oracle Java SE Runtime Environment 7. Before each experiment, we synchronized the clocks of the nodes using NTP, in order to obtain accurate measurements of the latency breakdown involving events in different servers.

    In all our experiments with Volery and Zookeeper, clients submit commands asynchronously, that is, each client can keep submitting commands even if replies to previous commands have not been received yet, up to a certain number of outstanding commands. Trying to issue new commands once this limit is reached makes the client block until some reply is received. Replies are processed by callback handlers registered by clients when submitting commands asynchronously. We allowed every client to have up to 25 outstanding commands at any time. By submitting commands asynchronously, the load on the service can be increased without instantiating new client processes. Local commands consisted of calls to setData, while global commands were invocations of create and delete. “Message size” and “command size”, in the next sections, refer to the size of the byte array passed to such commands.
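    The client behavior just described can be sketched with a counting semaphore that bounds the number of outstanding commands; the service stub and its submitAsync call are hypothetical.

      import java.util.concurrent.Semaphore;
      import java.util.function.Consumer;

      // Sketch of the asynchronous client used in the experiments: at most 25
      // commands may be outstanding; the next submission blocks until a reply
      // frees a slot.
      final class AsyncClient {
          private static final int MAX_OUTSTANDING = 25;
          private final Semaphore slots = new Semaphore(MAX_OUTSTANDING);
          private final VoleryStub service;

          AsyncClient(VoleryStub service) { this.service = service; }

          void setData(String path, byte[] data, Consumer<Object> onReply) throws InterruptedException {
              slots.acquire();                       // block if 25 commands are outstanding
              service.submitAsync("setData", path, data, reply -> {
                  try {
                      onReply.accept(reply);         // user-registered callback
                  } finally {
                      slots.release();               // free a slot for the next command
                  }
              });
          }

          // Placeholder for the service's asynchronous interface.
          interface VoleryStub {
              void submitAsync(String op, String path, byte[] data, Consumer<Object> callback);
          }
      }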

    We compared Volery with the original Zookeeper and with ZKsmr, an implementation of the Zookeeper API using traditional state-machine replication. For the Zookeeper experiments, we used an ensemble of 3 servers. For the other approaches, we used Multi-Ring Paxos for atomic multicast, with 3 acceptors per ring: ZKsmr had 3 replicas that used one Paxos ring to handle all communication, while Volery had 3 replicas per partition, with one Paxos ring per partition, plus one extra ring for commands that accessed multiple partitions. Since Zookeeper runs the service and the broadcast protocol (i.e., Zab [9]) on the same machines, each ZKsmr/Volery replica was colocated with a Paxos acceptor on the same node of the cluster. We used workloads with three different message sizes: 100, 1000 and 10000 bytes. Volery was run with 1, 2, 4 and 8 partitions. We conducted all experiments using disk for storage, then using memory (by means of a ramdisk). For on-disk experiments, we configured Multi-Ring Paxos with a ∆ [8] of 40 ms, a batching timeout of 50 ms and a batch size threshold of 250 kilobytes; for in-memory experiments, these parameters were 5 ms, 50 ms and 30 kilobytes, respectively.

    B. Experiments using on-disk storage

    In Figure 2, we show results for local commands only. Each Paxos acceptor wrote its vote synchronously to disk before accepting each proposal. Zookeeper also persisted data to disk. In Figure 2 (top left), we can see the maximum throughput for each replication scheme and message size, normalized by the throughput of Volery with a single partition. In all cases, the throughput of Volery scaled with the number of partitions and, for message sizes of 1000 and 10000 bytes, it scaled linearly (the ideal case). For small messages (100 bytes), Zookeeper has performance similar to that of Volery with a single partition. As messages increase in size, Zookeeper’s throughput improves with respect to Volery’s: with 1000-byte messages, Zookeeper’s throughput is similar to Volery’s throughput with two partitions. For large messages (10000 bytes), Zookeeper is outperformed by Volery with four partitions. Comparing S-SMR with traditional SMR, we can see that for small messages (100 bytes), ZKsmr performed better than Volery with one partition. This is due to the additional complexity added by Eyrie in order to ensure linearizability when data is partitioned. Such a difference in throughput is less significant with bigger commands (1000 and 10000 bytes).

    Figure 2 (bottom left) shows the latency values for the different implementations tested. Latency values correspond to 75% of the maximum throughput. Zookeeper has the lowest latency for 100- and 1000-byte command sizes. For 10000-byte commands, Volery had similar or lower latency than Zookeeper. Such lower latency of Volery with 10000-byte commands is due to a shorter time spent with batching: as message sizes increase, the size threshold of the batch (250 kilobytes for on-disk storage) is reached faster, resulting in lower latency.

    Figure 2 (right) shows the latency breakdown of commands executed by Volery. Batching is the time elapsed from the moment the client sends command C to the instant when C is proposed by the ring coordinator as part of a batch. Multicasting is the time from when the propose is executed until the batch that contains C is delivered by a server replica. Waiting represents the time between the delivery and the moment when C finally starts executing. Executing measures the delay from the start of the execution of command C until the client receives C’s response. We can see that more than half of the latency is due to multicasting, which includes saving Multi-Ring Paxos instances synchronously to disk. There is also a significant amount of time spent with batching, done to reduce the number of disk operations and allow higher throughput: each Paxos proposal is saved to disk synchronously, so increasing the number of commands per proposal (i.e., per batch) reduces the number of times the disk is accessed. This allows performance to improve, but increases latency.

    Figure 2. Results for Zookeeper, ZKsmr and Volery (with 1, 2, 4 and 8 partitions) using disk. Throughput was normalized by that of Volery with a single partition (absolute values in kilocommands per second are shown). Latencies reported correspond to 75% of the maximum throughput.

    Figure 3. Cumulative distribution function (CDF) of latency for different command sizes (on-disk storage).

    In Figure 3, we show the cumulative distribution functions (CDFs) of latency for all experiments where disk was used for storage. The results show that the latency distributions for ZKsmr and Volery with a single partition are similar, while latency had more variation for 2, 4 and 8 partitions. An important difference between deployments with a single and with multiple partitions is related to how Multi-Ring Paxos is used. In ZKsmr and in Volery with a single partition, there is only one Paxos ring, which orders all commands from all clients and delivers them to all replicas. When there are multiple partitions, each replica delivers messages from two rings: one ring that orders messages related to the replica’s partition only, and another ring that orders messages addressed to more than one partition—each replica deterministically merges deliveries from multiple rings. As the time necessary to perform such a deterministic merge is influenced by the level of synchrony of the rings, latency is expected to fluctuate more when merging is involved.

    C. Experiments using in-memory storage

    In Figure 4, we show the results for local commands when storing data in memory only. Volery’s throughput scales with the number of partitions (Figure 4 (top left)), especially for large messages, in which case the scalability is linear with the number of partitions (i.e., the ideal case). We can also see that latency values for Volery and ZKsmr are less than half of what they are with on-disk storage (Figure 4 (bottom left)), while Zookeeper’s latency decreased by an order of magnitude. These results suggest that further improvements should be achievable in the in-memory Volery configuration with additional optimizations and finer tuning of the atomic multicast parameters.

    Figure 4 (right) shows the latency breakdown. Even though no data is saved to disk, multicasting is still responsible for most of the latency, followed by batching. Differently from the experiments described in Section VI-B, batching here had a size threshold of 30 kilobytes, which helps to explain why batching time is roughly the same for different message sizes. In these experiments, although there are no disk writes, batching is still used because it reduces the number of Paxos proposals and the number of messages sent through the network, which allows higher throughput. Figure 5 shows the latency CDFs for the in-memory experiments, where we can see that Volery with multiple partitions (i.e., deployments where Multi-Ring Paxos uses multiple rings) tends to have more variation in latency.


    Figure 4. Results for Zookeeper, ZKsmr and Volery (with 1, 2, 4 and 8 partitions) using memory. Throughput was normalized by that of Volery with a single partition (absolute values in kilocommands per second are shown). Latencies reported correspond to 75% of the maximum throughput.

    Figure 5. Cumulative distribution function (CDF) of latency for different command sizes (in-memory storage).

    D. Experiments with global commands

    In this section, we analyze how Volery performs when the workload includes commands that are multicast to all partitions (global commands). This is the least favorable (non-faulty) scenario for S-SMR, as having commands multicast to all partitions effectively reduces scalability: if all commands go to all partitions, adding more partitions will not increase throughput.

    We ran experiments with different rates of global commands (i.e., create and delete operations): 0%, 1%, 5% and 10% of all commands. We chose such rates for two reasons: (i) it is obvious that high rates of global commands will prevent the system from scaling, and (ii) it is common for large-scale services to have a high rate of read requests (which are local commands in Volery). An example of such a service is Facebook’s TAO [3], which handles requests to a social graph; it allows, for instance, pages to be generated based on the user’s connections in the social network. In Facebook’s TAO, 99.8% of all requests are read-only [3].

We can see in Figure 6 (top left) that Volery scales throughput with the number of partitions in all configurations but the exceptional case of 10% global commands when augmenting the number of partitions from 4 to 8. Moreover, Volery with two partitions outperforms Zookeeper in all experiments. The major drawback of Volery under global commands is that, to ensure linearizability, partitions must exchange signals: as create and delete commands are multicast to all partitions, no server can send a reply to a client before receiving a signal from all other partitions when executing such a command. This explains the significant increase in latency shown in Figure 6 (bottom left) as global commands are added to the workload: as the number of partitions increases, so does the average latency. As we can see in Figure 6 (right), this extra latency comes from the servers waiting for signals from other partitions.
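A minimal sketch of this signal exchange is given below, assuming a hypothetical GlobalSignals helper and a sendSignal callback that delivers a signal to a given partition; it does not reflect Eyrie's actual interface. After executing a global command locally, the server signals every other partition and blocks until the corresponding signals have been received, and only then replies to the client.

// Sketch of the signal exchange for a global command (hypothetical names; not Eyrie's API).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.function.BiConsumer;

class GlobalSignals {
    private final int numPartitions;
    private final int myPartition;
    private final BiConsumer<Integer, String> sendSignal; // (destination partition, command id)
    private final Map<String, CountDownLatch> pending = new ConcurrentHashMap<>();

    GlobalSignals(int numPartitions, int myPartition, BiConsumer<Integer, String> sendSignal) {
        this.numPartitions = numPartitions;
        this.myPartition = myPartition;
        this.sendSignal = sendSignal;
    }

    // Called after a global command has been executed locally: signal the other
    // partitions and block until their signals arrive; only then reply to the client.
    void signalAndWait(String commandId) throws InterruptedException {
        CountDownLatch latch =
            pending.computeIfAbsent(commandId, id -> new CountDownLatch(numPartitions - 1));
        for (int p = 0; p < numPartitions; p++) {
            if (p != myPartition) {
                sendSignal.accept(p, commandId);
            }
        }
        latch.await();               // no reply is sent before all other partitions have signaled
        pending.remove(commandId);
    }

    // Invoked by the messaging layer whenever a signal for commandId is received.
    void onSignalReceived(String commandId) {
        pending.computeIfAbsent(commandId, id -> new CountDownLatch(numPartitions - 1))
               .countDown();
    }
}

The reply is withheld until every partition has confirmed reaching the command, which is precisely what adds latency as partitions are added.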

Figure 7 shows the latency CDFs for the workloads with global commands. For experiments with more than one partition, the rate of messages with high latency is much higher than the rate of global commands. This happens due to a "convoy effect": local commands may be delivered after global commands, having to wait for the latter to finish.

Figure 6. Throughput and latency versus the rate of create/delete commands (in-memory storage, 1000-byte commands). Throughput is shown in units of a thousand commands per second (kcps). Latencies shown correspond to 75% of the maximum throughput.

Figure 7. Cumulative distribution function (CDF) of latency for different rates of create/delete commands (in-memory storage, 1000-byte commands).

VII. RELATED WORK

State-machine replication is a well-known approach to replication and has been extensively studied (e.g., [1], [2], [10], [11], [12]). State-machine replication requires replicas to execute commands deterministically, which implies sequential execution. Even though increasing the performance of state-machine replication is non-trivial, different techniques have been proposed for achieving scalable systems, such as optimizing the propagation and ordering of commands (i.e., the underlying atomic broadcast algorithm). In [13], the authors propose to have clients send their requests to multiple clusters, where each cluster executes the ordering protocol only for the requests it received and then forwards this partial order to every server replica. The server replicas then deterministically merge all the partial orders received from the ordering clusters. In [14], Paxos [15] is used to order commands, but it is implemented in such a way that the task of ordering messages is evenly distributed among the replicas, as opposed to having a leader process that performs more work than the others and may eventually become a bottleneck.

State-machine replication seems at first to prevent multi-threaded execution, since multi-threading may lead to non-determinism. However, some works have proposed multi-threaded implementations of state-machine replication, circumventing the non-determinism caused by concurrency in some way. In [10], for instance, the authors propose organizing the replica in multiple modules that perform different tasks concurrently, such as receiving messages, batching, and dispatching commands to be executed. The execution of commands is still done sequentially, by a single thread, but the replica performs all other tasks in parallel. We also implemented this kind of parallelism in Eyrie.
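A generic sketch of this style of staged parallelism is shown below; the structure and names (StagedReplica, onMessage) are illustrative and do not correspond to Eyrie's actual classes. Message reception and batching run in dedicated threads, while a single executor thread applies commands sequentially, preserving determinism.

// Generic sketch of a staged replica pipeline (hypothetical names; not Eyrie's actual classes).
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class StagedReplica {
    private final BlockingQueue<byte[]> received = new LinkedBlockingQueue<>();
    private final BlockingQueue<List<byte[]>> batches = new LinkedBlockingQueue<>();
    private final Consumer<byte[]> applyCommand; // deterministic state update

    StagedReplica(Consumer<byte[]> applyCommand) {
        this.applyCommand = applyCommand;
    }

    // Network thread enqueues raw commands as they arrive.
    void onMessage(byte[] command) {
        received.add(command);
    }

    void start() {
        // Batching stage: groups received commands and hands batches to the executor.
        Thread batcher = new Thread(() -> {
            try {
                while (true) {
                    List<byte[]> batch = new ArrayList<>();
                    batch.add(received.take());
                    received.drainTo(batch, 63); // opportunistically batch up to 64 commands
                    batches.add(batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Execution stage: a single thread applies commands sequentially for determinism.
        Thread executor = new Thread(() -> {
            try {
                while (true) {
                    for (byte[] cmd : batches.take()) {
                        applyCommand.accept(cmd);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        batcher.start();
        executor.start();
    }
}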

Some works have proposed to parallelize the execution of commands in SMR. In [11], application semantics is used to determine which commands can be executed concurrently without compromising determinism (e.g., read-only commands can be executed in any order relative to one another). Upon delivery, commands are directed to a parallelizer thread that uses application-supplied rules to schedule multi-threaded execution. Another way of dealing with non-determinism is proposed in [12], where commands are speculatively executed concurrently. After a batch of commands is executed, replicas verify whether they reached a consistent state; if not, commands are rolled back and re-executed sequentially. Both [11] and [12] assume a Byzantine failure model and, in both cases, a single thread is responsible for receiving and scheduling commands to be executed. In the Byzantine failure model, command execution typically includes signature handling, which can result in expensive commands. Under benign failures, command execution is less expensive, and the thread responsible for command reception and scheduling may become a performance bottleneck.
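As a much simplified illustration of such scheduling (not the actual algorithm of [11]), the sketch below lets read-only commands run concurrently in a thread pool, while updates wait for outstanding reads and execute inline on the single delivery thread, so updates still respect the delivery order.

// Much simplified sketch of conflict-aware scheduling (hypothetical; not the scheduler of [11]).
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class Parallelizer {
    private final ExecutorService readers = Executors.newFixedThreadPool(8);
    private int activeReads = 0;

    private synchronized void readStarted() { activeReads++; }

    private synchronized void readFinished() {
        activeReads--;
        if (activeReads == 0) notifyAll();
    }

    private synchronized void awaitReads() throws InterruptedException {
        while (activeReads > 0) wait();
    }

    // Called by the single delivery thread, in delivery order.
    void dispatch(Runnable command, boolean readOnly) throws InterruptedException {
        if (readOnly) {
            readStarted();
            readers.execute(() -> {
                try { command.run(); } finally { readFinished(); }
            });
        } else {
            awaitReads();     // updates do not overlap with previously delivered reads
            command.run();    // executed inline, so updates keep the delivery order
        }
    }
}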

In [16], the authors propose to partition the service state and use atomic broadcast to totally order commands submitted by the clients. To ensure that linearizability holds for read-only commands that span multiple partitions, there is a single sequencer that ensures acyclic order; therefore, the approach cannot scale with the number of partitions, as the sequencer eventually becomes a bottleneck. Also, the approach can only handle single-partition update commands.

Many database replication schemes also aim at improving system throughput, although they commonly do not ensure strong consistency as we define it here (i.e., linearizability). Many works (e.g., [4], [17], [18], [19]) are based on the deferred-update replication scheme, in which replicas commit read-only transactions immediately, not necessarily synchronizing with each other. This provides a significant improvement in performance, but allows non-linearizable executions to take place. The consistency criteria usually ensured by database systems are serializability [20] and snapshot isolation [21]. These criteria can be considered weaker than linearizability, in the sense that they do not take into account the real-time precedence of commands issued by different clients. For some applications, this kind of consistency is good enough and allows the system to scale better, but services that require linearizability cannot be implemented with such techniques.

    VIII. CONCLUSION

This work introduces S-SMR, a scalable variant of the well-known state-machine replication technique. S-SMR differs from previous related work in that it allows throughput to scale with the number of partitions without weakening consistency. To evaluate S-SMR, we developed the Eyrie library and implemented Volery, a Zookeeper clone, with Eyrie. Our experiments demonstrate that in deployments with 8 partitions (the largest configuration we can deploy in our infrastructure) and under certain workloads, throughput improved by a factor of 8, i.e., ideal scalability. Moreover, Volery's throughput proved to be significantly higher than Zookeeper's.

    ACKNOWLEDGEMENTS

This work was supported in part by the Swiss National Science Foundation under grant number 146404. Robbert van Renesse is supported in part by grants from AFOSR, NSF, DARPA, ARPA-e, MDCN/iAd, Microsoft Corporation, Facebook Inc., and Amazon.com.

    REFERENCES

[1] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Communications of the ACM, vol. 21, no. 7, pp. 558–565, 1978.

[2] F. B. Schneider, “Implementing fault-tolerant services using the state machine approach: A tutorial,” ACM Computing Surveys, vol. 22, no. 4, pp. 299–319, 1990.

[3] N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. Li, et al., “TAO: Facebook's distributed data store for the social graph,” in USENIX ATC, 2013.

[4] D. Sciascia, F. Pedone, and F. Junqueira, “Scalable deferred update replication,” in DSN, 2012.

[5] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, “Zookeeper: Wait-free coordination for internet-scale systems,” in USENIX ATC, 2010.

[6] T. D. Chandra and S. Toueg, “Unreliable failure detectors for reliable distributed systems,” Journal of the ACM, vol. 43, no. 2, pp. 225–267, 1996.

[7] M. J. Fischer, N. A. Lynch, and M. S. Paterson, “Impossibility of distributed consensus with one faulty process,” Journal of the ACM, vol. 32, no. 2, pp. 374–382, 1985.

[8] P. Marandi, M. Primi, and F. Pedone, “Multi-Ring Paxos,” in DSN, 2012.

[9] F. Junqueira, B. Reed, and M. Serafini, “Zab: High-performance broadcast for primary-backup systems,” in DSN, 2011.

[10] N. Santos and A. Schiper, “Achieving high-throughput state machine replication in multi-core systems,” tech. rep., EPFL, 2011.

[11] R. Kotla and M. Dahlin, “High throughput Byzantine fault tolerance,” in DSN, 2004.

[12] M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin, “Eve: Execute-verify replication for multi-core servers,” in OSDI, 2012.

[13] M. Kapritsos and F. P. Junqueira, “Scalable agreement: Toward ordering as a service,” in HotDep, 2010.

[14] M. Biely, Z. Milosevic, N. Santos, and A. Schiper, “S-Paxos: Offloading the leader for high throughput state machine replication,” in SRDS, 2012.

[15] L. Lamport, “The part-time parliament,” ACM Transactions on Computer Systems, vol. 16, pp. 133–169, May 1998.

[16] P. Marandi, M. Primi, and F. Pedone, “High performance state-machine replication,” in DSN, 2011.

[17] A. de Sousa, R. C. Oliveira, F. Moura, and F. Pedone, “Partial replication in the database state machine,” in NCA, 2001.

[18] P. Chundi, D. Rosenkrantz, and S. Ravi, “Deferred updates and data placement in distributed databases,” in ICDE, 1996.

[19] T. Kobus, M. Kokocinski, and P. T. Wojciechowski, “Hybrid replication: State-machine-based and deferred-update replication schemes combined,” in ICDCS, 2013.

[20] P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[21] Y. Lin, B. Kemme, R. Jiménez-Peris, M. Patiño-Martínez, and J. E. Armendáriz-Iñigo, “Snapshot isolation and integrity constraints in replicated databases,” ACM Trans. Database Syst., vol. 34, no. 2, 2009.
