Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

12
J. Parallel Distrib. Comput. 111 (2018) 162–173 Contents lists available at ScienceDirect J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc Delta state replicated data types Paulo Sérgio Almeida 1 , Ali Shoker * , 1 , Carlos Baquero 1 Departamento de Informática, Universidade do Minho, Campus de Gualtar, 4710-057 Braga, Portugal highlights Definition of delta-state CRDTs and their relation to state CRDTs. Proofs of conditions to attain equivalence to state based CRDTs. Anti-entropy algorithm for basic and causally consistent convergence. Portfolio of delta state CRDTs including optimized sets, and recursive map. article info Article history: Received 19 April 2016 Received in revised form 13 June 2017 Accepted 13 August 2017 Available online 18 August 2017 Keywords: Distributed systems Eventual consistency State-based Delta Conflict-free replicated data types CRDT abstract Conflict-free Replicated Data Types (CRDTs) are distributed data types that make eventual consistency of a distributed object possible and non ad-hoc. Specifically, state-based CRDTs ensure convergence through disseminating the entire state, that may be large, and merging it to other replicas. We introduce Delta State Conflict-Free Replicated Data Types (δ-CRDT) that can achieve the best of both operation-based and state- based CRDTs: small messages with an incremental nature, as in operation-based CRDTs, disseminated over unreliable communication channels, as in traditional state-based CRDTs. This is achieved by defining δ-mutators to return a delta-state, typically with a much smaller size than the full state, that to be joined with both local and remote states. We introduce the δ-CRDT framework, and we explain it through establishing a correspondence to current state-based CRDTs. In addition, we present an anti- entropy algorithm for eventual convergence, and another one that ensures causal consistency. Finally, we introduce several δ-CRDT specifications of both well-known replicated datatypes and novel datatypes, including a generic map composition. © 2017 Elsevier Inc. All rights reserved. 1. Introduction Eventual consistency (EC) is a relaxed consistency model that is often adopted by large-scale distributed systems [15,18,34] where availability must be maintained, despite outages and par- titioning, whereas delayed consistency is acceptable. A typical approach in EC systems is to allow replicas of a distributed object to temporarily diverge, provided that they can eventually be rec- onciled into a common state. To avoid application-specific recon- ciliation methods, costly and error-prone, Conflict-Free Replicated Data Types (CRDTs) [32,33] were introduced, allowing the design of self-contained distributed data types that are always available and The work presented was partially supported by EU FP7 SyncFree project (609551), EU H2020 LightKone project (732505), and SMILES line in project TEC4Growth (NORTE-01-0145-FEDER-000020). * Corresponding author. E-mail addresses: [email protected] (P.S. Almeida), [email protected] (A. Shoker), [email protected] (C. Baquero). 1 HASLab/INESC TEC and Universidade do Minho, Portugal. eventually converge when all operations are reflected at all repli- cas. Though CRDTs are deployed in practice and support millions of users worldwide [9,21,30], more work is still required to improve their design and performance. CRDTs support two complementary designs: operation-based (or op-based) and state-based. In op-based designs [26,33], the execution of an operation is done in two phases: prepare and effect. The former is performed only on the local replica and looks at the operation and current state to produce a message that aims to represent the operation, which is then shipped to all replicas. Once received, the representation of the operation is applied remotely using effect. On the other hand, in a state-based design [5,33] an operation is only executed on the local replica state. A replica periodically propagates its local changes to other replicas through shipping its entire state. A received state is incorporated with the local state via a merge function that deterministically reconciles both states. To maintain convergence, merge is defined as a join: a least upper bound over a join-semilattice [5,33]. Op-based CRDTs have some advantages as they can allow for simpler implementations, concise replica state, and smaller mes- sages; however, they are subject to some limitations: First, they http://dx.doi.org/10.1016/j.jpdc.2017.08.003 0743-7315/© 2017 Elsevier Inc. All rights reserved.

Transcript of Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

Page 1: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

J. Parallel Distrib. Comput. 111 (2018) 162–173

Contents lists available at ScienceDirect

J. Parallel Distrib. Comput.

journal homepage: www.elsevier.com/locate/jpdc

Delta state replicated data types✩

Paulo Sérgio Almeida 1, Ali Shoker *,1, Carlos Baquero 1

Departamento de Informática, Universidade do Minho, Campus de Gualtar, 4710-057 Braga, Portugal

h i g h l i g h t s

• Definition of delta-state CRDTs and their relation to state CRDTs.• Proofs of conditions to attain equivalence to state based CRDTs.• Anti-entropy algorithm for basic and causally consistent convergence.• Portfolio of delta state CRDTs including optimized sets, and recursive map.

a r t i c l e i n f o

Article history:Received 19 April 2016Received in revised form 13 June 2017Accepted 13 August 2017Available online 18 August 2017

Keywords:Distributed systemsEventual consistencyState-basedDeltaConflict-free replicated data typesCRDT

a b s t r a c t

Conflict-free Replicated Data Types (CRDTs) are distributed data types that make eventual consistency ofa distributed object possible and non ad-hoc. Specifically, state-based CRDTs ensure convergence throughdisseminating the entire state, thatmay be large, andmerging it to other replicas.We introduceDelta StateConflict-Free Replicated Data Types (δ-CRDT) that can achieve the best of both operation-based and state-based CRDTs: small messages with an incremental nature, as in operation-based CRDTs, disseminatedover unreliable communication channels, as in traditional state-based CRDTs. This is achieved by definingδ-mutators to return a delta-state, typically with a much smaller size than the full state, that to bejoined with both local and remote states. We introduce the δ-CRDT framework, and we explain itthrough establishing a correspondence to current state-based CRDTs. In addition, we present an anti-entropy algorithm for eventual convergence, and another one that ensures causal consistency. Finally, weintroduce several δ-CRDT specifications of both well-known replicated datatypes and novel datatypes,including a generic map composition.

© 2017 Elsevier Inc. All rights reserved.

1. Introduction

Eventual consistency (EC) is a relaxed consistency model thatis often adopted by large-scale distributed systems [15,18,34]where availability must be maintained, despite outages and par-titioning, whereas delayed consistency is acceptable. A typicalapproach in EC systems is to allow replicas of a distributed objectto temporarily diverge, provided that they can eventually be rec-onciled into a common state. To avoid application-specific recon-ciliation methods, costly and error-prone, Conflict-Free ReplicatedData Types (CRDTs) [32,33]were introduced, allowing the design ofself-contained distributed data types that are always available and

✩ The work presented was partially supported by EU FP7 SyncFree project(609551), EU H2020 LightKone project (732505), and SMILES line in projectTEC4Growth (NORTE-01-0145-FEDER-000020).

* Corresponding author.E-mail addresses: [email protected] (P.S. Almeida), [email protected]

(A. Shoker), [email protected] (C. Baquero).1 HASLab/INESC TEC and Universidade do Minho, Portugal.

eventually converge when all operations are reflected at all repli-cas. Though CRDTs are deployed in practice and supportmillions ofusers worldwide [9,21,30], more work is still required to improvetheir design and performance.

CRDTs support two complementary designs: operation-based(or op-based) and state-based. In op-based designs [26,33], theexecution of an operation is done in two phases: prepare and effect.The former is performed only on the local replica and looks at theoperation and current state to produce a message that aims torepresent the operation, which is then shipped to all replicas. Oncereceived, the representation of the operation is applied remotelyusing effect. On the other hand, in a state-based design [5,33] anoperation is only executed on the local replica state. A replicaperiodically propagates its local changes to other replicas throughshipping its entire state. A received state is incorporated with thelocal state via a merge function that deterministically reconcilesboth states. To maintain convergence, merge is defined as a join:a least upper bound over a join-semilattice [5,33].

Op-based CRDTs have some advantages as they can allow forsimpler implementations, concise replica state, and smaller mes-sages; however, they are subject to some limitations: First, they

http://dx.doi.org/10.1016/j.jpdc.2017.08.0030743-7315/© 2017 Elsevier Inc. All rights reserved.

Page 2: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173 163

assume a message dissemination layer that guarantees reliableexactly-once causal broadcast; these guarantees are hard to main-tain since large logsmust be retained to prevent duplication even ifTCP is used [20]. Second,membershipmanagement is a hard task inop-based systems especially once the number of nodes gets largeror due to churn problems, since all nodes must be coordinated bythe middleware. Third, the op-based approach requires operationsto be executed individually (even when batched) on all nodes.

The alternative is to use state-based systems, which are freefrom these limitations. However, a major drawback in currentstate-based CRDTs is the communication overhead of shipping theentire state, which can get very large in size. For instance, thestate size of a counter CRDT (a vector of integer counters, oneper replica) increases with the number of replicas; whereas in agrow-only Set, the state size depends on the set size, that grows asmore operations are invoked. This communication overhead limitsthe use of state-based CRDTs to data-types with small state size(e.g., counters are reasonable while large sets are not). Recently,there has been a demand for CRDTs with large state sizes (e.g., inRIAK DT Maps [10] that can compose multiple CRDTs and that weformalize in Section 7.4.9).

In this paper, we rethink the way state-based CRDTs should bedesigned, having in mind the problematic shipping of the entirestate. Our aim is to ship a representation of the effect of recentupdate operations on the state, rather than the whole state, whilepreserving the idempotent nature of join. This ensures conver-gence over unreliable communication (on the contrary to op-based CRDTs that demand exactly-once delivery and are proneto message duplication). To achieve this, we develop in detailthe concept of Delta State-based CRDTs (δ-CRDT) that we initiallyintroduced in [2]. In this new (delta) framework, the state is stilla join-semilattice that now results from the join of multiple fine-grained states, i.e., deltas, generated by what we call δ-mutators.δ-mutators are new versions of the datatype mutators that returnthe effect of these mutators on the state. In this way, deltas canbe temporarily retained in a buffer to be shipped individually (orjoined in groups) instead of shipping the entire object. The changesto the local state are then incorporated at other replicas by joiningthe shipped deltas with their own states.

The use of ‘‘deltas’’ (i.e., incremental states) may look intuitivein state dissemination; however, this is not the case for state-basedCRDTs. The reason is that once a node receives an entire state,merging it locally is simple since there is no need to care aboutcausality, as both states are self-contained (including meta-data).The challenge in δ-CRDT is that individual deltas are now ‘‘statefragments’’ and usually must be causally merged to maintain thedesired semantics. This raises the following questions: is mergingdeltas semantically equivalent tomerging entire states in CRDTs? Ifnot, what are the sufficient conditions tomake this true in general?And underwhat constraints causal consistency ismaintained? Thispaper answers these questions and presents corresponding proofsand examples.

We address the challenge of designing a new δ-CRDT thatconserves the correctness properties and semantics of an existingCRDT by establishing a relation between the novel δ-mutatorswiththe original CRDT mutators. We prove that eventual consistency isguaranteed in δ-CRDT as long as all deltas produced by δ-mutatorsare delivered and joined at other replicas, and we present a cor-responding simple anti-entropy algorithm. We then show how toensure causal consistency using deltas through introducing theconcept of delta-interval and the causal delta-merging condition.Based on these, we then present an anti-entropy algorithm forδ-CRDT, where sending and then joining delta-intervals into an-other replica state produces the same effect as if the entire statehad been shipped and joined.

We illustrate our approach through a simple counter CRDT anda corresponding δ-CRDT specification. Later, we present a portfo-lio of several δ-CRDTs that adapt known CRDT designs and alsointroduce a generic kernel for the definition of CRDTs that keep acausal history of known events and a CRDT map that can composethem. All these δ-CRDT datatypes, and a few more, are availableonline in a reference C++ library [3]. Our experience shows that aδ-CRDT version can be devised for all CRDTs, but this requires somedesign effort that varies with the complexity of different CRDTs.This refactoring effort can be avoided for new datatypes by writingall mutations as delta-mutations, and only deriving the standardmutators if needed; these can be trivially obtained from the delta-mutators.

This paper is an extended version of [2], adding the followingmaterial: Proofs of conditions to attain equivalence to state basedCRDTs; Anti-entropy algorithm for basic convergence; Portfolioof delta state CRDTs including simple compositions and anony-mous replicated types (grow only sets, two phase sets, lexico-graphic pairs (Soundcloud [9]) last-writer-wins sets), named types(positive–negative counters, (Cassandra [16]) lexicographic coun-ters); Kernel for causal CRDTs, with a universal join function; Opti-mized causal CRDTs (remove-wins sets, (Riak) flags [6]); Recursivemap data type for causal CRDTs.

2. Systemmodel

Consider a distributed system with nodes containing localmemory, with no shared memory between them. Any node cansend messages to any other node. The network is asynchronous;there is no global clock, no bound on the time a message takes toarrive, and no bounds on relative processing speeds. The networkis unreliable: messages can be lost, duplicated or reordered (butare not corrupted). Some messages will, however, eventually getthrough: if a node sends infinitely many messages to anothernode, infinitely many of these will be delivered. In particular, thismeans that there can be arbitrarily long partitions, but these willeventually heal. Nodes have access to durable storage; they cancrash but will eventually recover with the content of the durablestorage just before the crash occurred. Durable state is writtenatomically at each state transition. Each node has access to itsglobally unique identifier in a set I.

2.1. Notation

We use mostly standard notation for sets and maps, includingset comprehension of the forms {f (x)|x ∈ S} or {x ∈ S|Pred(x)}. Amap is a set of (k, v) pairs,where each k is associatedwith a single v.Given amapm,m(k) returns the value associated with key k, whilem{k ↦→ v} denotes m updated by mapping k to v. The domain andrange of a map m is denoted by dom m and ran m, respectively,i.e., dom m = {k|(k, v) ∈ m} and ran m = {v|(k, v) ∈ m}. Weuse fstp and sndp to denote the first and second component of apair p, respectively. We use B, N, and Z, for the booleans, naturalnumbers, and integers, respectively; also I for some unspecifiedset of node identifiers. Most sets we use are partially ordered andhave a least element ⊥ (the bottom element). We use A ↪→ B for apartial function from A to B; given such a mapm, then dom m ⊆ Aand ranm ⊆ B, and for convenience we usem(k) when k ̸∈ dommand B has a bottom, to denote ⊥B; e.g., for some m : I ↪→ N, thenm(k) denotes 0 for any unmapped key k.

3. A background of state-based CRDTs

Conflict-Free Replicated Data Types [32,33] (CRDTs) are dis-tributed datatypes that allow different replicas of a distributedCRDT instance to diverge and ensures that, eventually, all repli-cas converge to the same state. State-based CRDTs achieve thisthrough propagating updates of the local state by disseminatingthe entire state across replicas. The received states are then

Page 3: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

164 P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173

Fig. 1. State-based counter CRDT; replica i.

merged to remote states, leading to convergence (i.e., consistentstates on all replicas).

A state-based CRDT consists of a triple (S,M,Q ), where S is ajoin-semilattice [17], Q is a set of query functions (which returnsome resultwithoutmodifying the state), andM is a set ofmutatorsthat performupdates; amutatorm ∈ M takes a stateX ∈ S as inputand returns a new state X ′

= m(X). A join-semilattice is a setwith apartial order ⊑ and a binary join operation ⊔ that returns the leastupper bound (LUB) of two elements in S; a join is designed to becommutative, associative, and idempotent. Mutators are definedin such a way to be inflations, i.e., for any mutator m and state X ,the following holds:

X ⊑ m(X).

In thisway, for each replica there is amonotonic sequence of states,defined under the lattice partial order, where each subsequentstate subsumes the previous state when joined elsewhere.

Both query and mutator operations are always available sincethey are performed using the local state without requiring inter-replica communication; however, as mutators are concurrentlyapplied at distinct replicas, replica states will likely diverge. Even-tual convergence is then obtained using an anti-entropy protocolthat periodically ships the entire local state to other replicas. Eachreplica merges the received state with its local state using thejoin operation in S. Given the mathematical properties of join, ifmutations stop being issued and anti-entropy proceeds, all replicaseventually converge to the same state, i.e. the least upper-boundof all states involved. State-based CRDTs are interesting as theydemand little guarantees from the dissemination layer, workingunder message loss, duplication, reordering, and temporary net-work partitioning, without impacting availability and eventualconvergence.

Fig. 1 represents a state-based increment-only counter. TheGCounter CRDT state is a map from replica identifiers to positiveintegers. Initially, the bottom state ⊥ is an empty map (unmappedkeys implicitly mapping to zero). A single mutator, i.e., inci, is de-fined that increments the value corresponding to the local replicai (returning the updated map). The query operation value returnsthe counter value by adding the integers in the map entries. Thejoin of two states is the point-wise maximum of the maps. Mu-tators, like inci, are in general parameterized by the replica id i, sothat their exact behavior can depend on it, while queries, like value,are typically replica agnostic and only depend on the CRDT state,regardless of in which replica they are invoked.

The main weakness of state-based CRDTs is the cost of dissem-ination of updates, as the full state is sent. In this simple exampleof counters, even though increments only update the value corre-sponding to the local replica i, thewholemapwill always be sent inmessages, even when the other map entries remained unchanged(e.g., if no messages have been received and merged).

It would be interesting to only ship the recent modificationincurred on the state, and possibly any received modificationsthat effectively changed it. This is, however, not possible with thecurrent model of state-based CRDTs as mutators always return a

full state. Approacheswhich simply ship operations (e.g., an ‘‘incre-ment n’’ message), like in operation-based CRDTs, require reliablecommunication (e.g., because increment is not idempotent). Incontrast, the modification that we introduce in the next sectionallows producing and encoding recentmutations in an incrementalway, while keeping the advantages of the state-based approach,namely the idempotent, associative, and commutative propertiesof join.

4. Delta-state CRDTs

We introduce Delta-State Conflict-Free Replicated Data Types, orδ-CRDT for short, as a new kind of state-based CRDTs, in whichdelta-mutators are defined to return a delta-state: a value in thesame join-semilatticewhich represents the updates induced by themutator on the current state.

Definition 4.1 (Delta-mutator). A delta-mutator mδ is a function,corresponding to an update operation, which takes a state X ina join-semilattice S as parameter and returns a delta-mutationmδ(X), also in S.

Definition 4.2 (Delta-group). A delta-group is inductively definedas either a delta-mutation or a join of several delta-groups.

Definition 4.3 (δ-CRDT). A δ-CRDT consists of a triple (S,Mδ,Q ),where S is a join-semilattice,Mδ is a set of delta-mutators, and Q aset of query functions, where the state transition at each replicais given by either joining the current state X ∈ S with a delta-mutation:

X ′= X ⊔ mδ(X),

or joining the current state with some received delta-group D:

X ′= X ⊔ D.

In a δ-CRDT, the effect of applying a mutation, represented bya delta-mutation δ = mδ(X), is decoupled from the resulting stateX ′

= X ⊔ δ, which allows shipping this δ rather than the entireresulting state X ′. All state transitions in a δ-CRDT, even uponapplying mutations locally, are the result of some join with thecurrent state. Unlike standard CRDT mutators, delta-mutators donot need to be inflations in order to inflate a state; this is howeverensured by joining their output, i.e., deltas, into the current state:X ⊑ X ⊔ mδ(X).

In principle, a delta could be shipped immediately to remotereplicas once applied locally. For efficiency reasons,multiple deltasreturned by applying several delta-mutators can be joined locallyinto a delta-group and retained in a buffer. The delta-group canthen be shipped to remote replicas to be joined with their localstates. Received delta-groups can optionally be joined into theirbuffered delta-group, allowing transitive propagation of deltas. Afull state can be seen as a special (extreme) case of a delta-group.

If the causal order of operations is not important and the in-tended aim is merely eventual convergence of states, then delta-groups can be shipped using an unreliable dissemination layerthat may drop, reorder, or duplicate messages. Delta-groups canalways be re-transmitted and re-joined, possibly out of order, orcan simply be subsumed by a less frequent sending of the full state,e.g., for performance reasons or when doing state transfers to newmembers.

4.1. Delta-state decomposition of standard CRDTS

A δ-CRDT(S,Mδ,Q ) is a delta-state decomposition of a state-based CRDT (S,M,Q ), if for every mutator m ∈ M , we have acorresponding mutatormδ

∈ Mδ such that, for every state X ∈ S:

m(X) = X ⊔ mδ(X).

Page 4: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173 165

Fig. 2. A δ-CRDT counter; replica i.

This equation states that applying a delta-mutator and joininginto the current state should produce the same state transition asapplying the corresponding mutator of the standard CRDT.

Given an existing state-based CRDT (which is always a trivialdecomposition of itself, i.e., m(X) = X ⊔ m(X), as mutators areinflations), itwill be useful to find a non-trivial decomposition suchthat delta-states returned by delta-mutators inMδ are smaller thanthe resulting state:

size(mδ(X)) ≪ size(m(X)).

In general, there are several possible delta-state decompo-sitions, with multiple possible delta-mutators that correspondto each standard mutator. In order to minimize the generateddelta-states (which will typically minimize their size) each delta-mutator chosen mδ should be minimal in following sense: for anyother alternative choice of delta-mutator mδ′

, for any X , mδ′

(X) ̸⊏

mδ(X). Intuitively, minimal delta-mutators do not leak into thedeltas they produce any redundant information that is alreadypresent in X . Moreover (although in theory not always necessarilythe case) for typical datatypes thatwehave comeacross in practice,for each mutator m there exists a corresponding minimum delta-mutator mδ⊥ , i.e., with mδ⊥ ⊑ mδ′

(under the standard pointwisefunction comparison), for any alternative delta-mutator. Aswewillsee in the concrete examples, typically minimum delta-mutatorsare found naturally, with no need for some special ‘‘search’’.

4.2. Example: δ-CRDT counter

Fig. 2 depicts a δ-CRDT specification of a counter datatype thatis a delta-state decomposition of the state-based counter in Fig. 1.The state, join and value query operation remain as before. Onlythemutator incδ is newly defined, which increments themap entrycorresponding to the local replica and only returns that entry,instead of the full map as inc in the state-based CRDT counter does.Thismaintains the original semantics of the counterwhile allowingthe smaller deltas returned by the delta-mutator to be sent, insteadof the full map. As before, the received payload (whether one ormore deltas) might not include entries for all keys in I, whichare assumed to have zero values. The decomposition is easy tounderstand in this example since the equation inci(X) = X⊔incδ

i (X)holds as m{i ↦→ m(i) + 1} = m ⊔ {i ↦→ m(i) + 1}. In otherwords, the single value for key i in the delta, corresponding to thelocal replica identifier, will overwrite the corresponding one in msince the former maps to a higher value (i.e., using max). Here itcan be noticed that: (1) a delta is just a state, that can be joinedpossibly several times without requiring exactly-once delivery,and without being a representation of the ‘‘increment’’ operation(as in operation-based CRDTs), which is itself non-idempotent; (2)joining deltas into a delta-group and disseminating delta-groups ata lower rate than the operation rate reduces data communicationoverhead, since multiple increments from a given source can becollapsed into a single state counter.

1 inputs:2 ni ∈ P(I), set of neighbors3 ti ∈ B, transitive mode4 choosei ∈ S × S → S,

state/delta5 durable state:6 Xi ∈ S, CRDT state, X0

i = ⊥

7 volatile state:8 Di ∈ S, delta-group,

D0i = ⊥

9 on operationi(mδ)

10 d = mδ(Xi)11 X ′

i = Xi ⊔ d12 D′

i = Di ⊔ d

13 on receivej,i(d)14 X ′

i = Xi ⊔ d15 if ti then16 D′

i = Di ⊔ d17 else18 D′

i = Di

19 periodically20 m = choosei(Xi,Di)21 for j ∈ ni do22 sendi,j(m)23 D′

i = ⊥

Algorithm 1: Basic anti-entropy algorithm for δ-CRDT.

5. State convergence

In the δ-CRDT execution model, and regardless of the anti-entropy algorithm used, a replica state always evolves by joiningthe current state with some delta: either the result of a delta-mutation, or some arbitrary delta-group (which itself can be ex-pressed as a join of delta-mutations). Without loss of generality,we assume S has a bottom ⊥ which is also the initial state. (Oth-erwise, a bottom can always be added, together with a specialinit delta-mutator, which returns the initial state.) Therefore, allstates can be expressed as joins of delta-mutations, which makesstate convergence in δ-CRDT easy to achieve: it is enough that alldelta-mutations generated in the system reach every replica, asexpressed by the following proposition.

Proposition 5.1 (δ-CRDT Convergence). Consider a set of replicasof a δ-CRDTobject, replica i evolving along a sequence of statesX0i = ⊥, X1

i , . . . , each replica performing delta-mutations of theform mδ

i,k(Xki ) at some subset of its sequence of states, and evolving

by joining the current state either with self-generated deltas or withdelta-groups received from others. If each delta-mutation mδ

i,k(Xki )

produced at each replica is joined (directly or as part of a delta-group)at least once with every other replica, all replica states become equal.

Proof. Trivial, given the associativity, commutativity, and idempo-tence of the join operation in any join-semilattice. □

This opens up the possibility of having anti-entropy algorithmsthat are only devoted to enforce convergence, without necessarilyproviding causal consistency (enforced in standard CRDTs); thus,making a trade-off between performance and consistency guaran-tees. For instance, in a counter (e.g., for the number of likes on asocial network), it may not be critical to have causal consistency,but merely not to lose increments and achieve convergence. Thesame reasoning can be applied to other situations, specially whencommutative data-types are involved, e.g., grow-only-sets.

5.1. Basic anti-entropy algorithm

A basic anti-entropy algorithm that ensures eventual conver-gence in δ-CRDT is presented in Algorithm 1. For the node corre-sponding to replica i, the durable state, which persists after a crash,is simply the δ-CRDT state Xi. The volatile state D stores a delta-group that is used to accumulate deltas before eventually sendingit to other replicas. The initial value for both Xi and Di is ⊥.

When an operation is performed, the corresponding delta-mutator mδ is applied to the current state of Xi, generating a

Page 5: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

166 P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173

delta d. This delta is joined both with Xi to produce a new state,and with D. In the same spirit of standard state based CRDTs, anode sends its messages in a periodic fashion, where the messagepayload is either the delta-groupDi or the full state Xi; this decisionis made by the function choosei which returns one of them. Tokeep the algorithm simple, a node simply broadcasts its messageswithout distinguishing between neighbors. After each send, thedelta-group is reset to ⊥.

Once a message is received, the payload d is joined into thecurrent δ-CRDT state. The basic algorithm operates in one of twomodes: (1) a transitive mode (when ti is true) in which d is alsojoined into D, allowing transitive propagation of delta-mutations,where deltas received at node i from some node j can later besent to some other node k; (2) a direct mode where a delta-groupis exclusively the join of local delta-mutations (j must send itsdeltas directly to k). The decisions ofwhether to send a delta-groupversus the full state (typically less periodically), andwhether to usethe transitive or direct mode are out of the scope of this paper.In general, decisions can be made considering many criteria likedelta-group size, state size, message loss distribution assumptions,and network topology.

6. Causal consistency

Traditional state-based CRDTs converge using joins of the fullstate, which implicitly ensures per-object causal consistency [12]:each state of some replica of an object reflects the causal past ofoperations on the object (either applied locally, or applied at otherreplicas and transitively joined).

Therefore, it is desirable to have δ-CRDT s offer the same causal-consistency guarantees that standard state-based CRDTs offer. Thisraises the question about how can delta propagation and mergingof δ-CRDT be constrained (and expressed in an anti-entropy algo-rithm) in such a manner to give the same results as if a standardstate-based CRDT was used. Towards this objective, it is usefulto define a particular kind of delta-group, which we call a delta-interval:

Definition 6.1 (Delta-interval). Given a replica i progressingalong the states X0

i , X1i , . . . , by joining delta dki (either local

delta-mutation or received delta-group) into Xki to obtain Xk+1

i , adelta-interval ∆

a,bi is a delta-group resulting from joining deltas

dai , . . . , db−1i :

∆a,bi =

⨆{dki |a ≤ k < b}.

The use of delta-intervals in anti-entropy algorithms will bea key ingredient towards achieving causal consistency. We nowdefine a restricted kind of anti-entropy algorithm for δ-CRDTs.

Definition 6.2 (Delta-interval-based Anti-entropy Algorithm). Agiven anti-entropy algorithm for δ-CRDTs is delta-interval-based,if all deltas sent to other replicas are delta-intervals.

Moreover, to achieve causal consistency the next conditionmust be satisfied:

Definition 6.3 (Causal delta-merging Condition). A delta-intervalbased anti-entropy algorithm is said to satisfy the causal delta-merging condition if the algorithm only joins ∆

a,bj from replica j

into state Xi of replica i that satisfy

Xi ⊒ Xaj .

This means that a delta-interval is only joined into states thatat least reflect (i.e., subsume) the state into which the first deltain the interval was previously joined. The causal delta-mergingcondition is important, since any delta-interval based anti-entropy

algorithm for a δ-CRDT that satisfies it can be used to obtainthe same outcome of a standard CRDT; this is formally stated inProposition 6.4.

Proposition 6.4 (CRDT and δ-CRDT Correspondence). Let (S,M,Q )be a standard state-based CRDT and (S,Mδ,Q ) a corresponding delta-state decomposition. Any δ-CRDTstate reachable by an execution Eδ

over (S,Mδ,Q ), by a delta-interval based anti-entropy algorithm Aδ

satisfying the causal delta-merging condition, is equal to a state re-sulting from an execution E over (S,M,Q ), having the correspondingdata-type operations, by an anti-entropy algorithm A for state-basedCRDTs.

Proof. By simulation, establishing a correspondence betweenan execution Eδ , and execution E of a standard CRDT of which(S,Mδ,Q ) is a decomposition, as follows: (1) the state (Xi,Di, . . .)of each node in Eδ containing CRDT state Xi, information aboutdelta-intervals Di and possibly other information, corresponds toonly Xi component (in the same join-semilattice); (2) for eachaction which is a delta-mutation mδ in Eδ , E executes the corre-sponding mutation m, satisfying m(X) = X ⊔ mδ(X); (3) wheneverEδ contains a send action of a delta-interval ∆

a,bi , execution E

contains a send action containing the full state Xbi ; (4) whenever

Eδ performs a join into some Xi of a delta-interval∆a,bj , execution E

delivers and joins the corresponding message containing the fullCRDT state Xb

j . By induction on the length of the trace, assumethat for each replica i, each node state Xi in E is equal to thecorresponding component in the node state in Eδ , up to the lastaction in the global trace. A send action does not change replicastate, preserving the correspondence. Replica states only changeeither byperformingdata-typeupdate operations or uponmessagedelivery by merging deltas/states respectively. If the next action isan update operation, the correspondence is preserved due to thedelta-state decomposition property m(X) = X ⊔ mδ(X). If the nextaction is a message delivery at replica i, with a merging of delta-interval/state from other replica j, because algorithm Aδ satisfiesthe causal merging-condition, it only joins into state Xk

i a delta-interval ∆a,b

j if Xki ⊒ Xa

j . In this case, the outcome will be

Xk+1i = Xk

i ⊔ ∆a,bj

= Xki ⊔

⨆{dlj|a ≤ l < b}

= Xki ⊔ Xa

j ⊔

⨆{dlj|a ≤ l < b}

= Xki ⊔ Xa

j ⊔ daj ⊔ da+1j ⊔ . . . ⊔ db−1

j

= Xki ⊔ Xa+1

j ⊔ da+1j ⊔ . . . ⊔ db−1

j

= . . .

= Xki ⊔ Xb−1

j ⊔ db−1j

= Xki ⊔ Xb

j

The resulting state Xk+1i in Eδ will be, therefore, the same as the

corresponding one in E where the full CRDT state from j has beenjoined, preserving the correspondence between Eδ and E. □

Corollary 6.5 (δ-CRDT causal consistency). Any δ-CRDT in whichstates are propagated and joined using a delta-interval-based anti-entropy algorithm satisfying the causal delta-merging condition en-sures causal consistency.

Proof. From Proposition 6.4 and causal consistency of state-basedCRDTs. □

This means that using any delta-interval-based anti-entropyalgorithm respecting the causal delta-merging condition ensuresthe same properties concerning causal consistency as a standard

Page 6: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173 167

1 inputs:2 ni ∈ P(I), set of neighbors3 durable state:4 Xi ∈ S, CRDT state, X0

i = ⊥

5 ci ∈ N, sequence number,c0i = 0

6 volatile state:7 Di ∈ N ↪→ S, deltas,

D0i = {}

8 Ai ∈ I ↪→ N, ack map,A0i = {}

9 on receivej,i(delta, d, n)10 if d ̸⊑ Xi then11 X ′

i = Xi ⊔ d12 D′

i = Di{ci ↦→ d}13 c ′

i = ci + 114 sendi,j(ack, n)15 on receivej,i(ack, n)16 A′

i = Ai{j ↦→ max(Ai(j), n)}

17 on operationi(mδ)

18 d = mδ(Xi)19 X ′

i = Xi ⊔ d20 D′

i = Di{ci ↦→ d}21 c ′

i = ci + 122 periodically // ship interval

or state23 j = random(ni)24 if Di = {} ∨ min dom

Di > Ai(j) then25 d = Xi26 else27 d =

⨆{Di(l)|Ai(j) ≤

l < ci}28 if Ai(j) < ci then29 sendi,j(delta, d, ci)30 periodically // garbage

collect deltas31 l = min{n|(_, n) ∈ Ai}

32 D′

i = {(n, d) ∈ Di|n ≥ l}

Algorithm 2: Anti-entropy algorithm ensuring causal consis-tency of δ-CRDT.

state-based CRDT. Next we present such an anti-entropy algo-rithm. Regardless of the choice of such an algorithm, as we statedin Section 5, if all delta-mutations are joined at every-replica, wehave convergence.

6.1. Anti-Entropy algorithm for causal consistency

Algorithm 2 is a delta-interval based anti-entropy algorithmwhich enforces the causal delta-merging condition. It can be usedwhenever the causal consistency guarantees of standard state-based CRDTs are needed. For simplicity, it excludes some opti-mizations that are important in practice, but easy to derive. Thealgorithm distinguishes neighbor nodes, and only sends to eachone appropriate delta-intervals that obey the delta-merging con-dition and can be joined at the receiving node.

Each node i keeps a contiguous sequence of deltas dli, . . . , dui

in a map D from integers to deltas, with l = min dom D andu = maxdomD. The sequence numbers of deltas are obtained fromthe counter ci that is incremented when a delta (whether a delta-mutation or delta-interval received) is joined with the currentstate. Each node i keeps an acknowledgments map A that stores,for each neighbor j, the largest index b for all delta-intervals ∆

a,bi

acknowledged by j (after j receives ∆a,bi from i and joins it into Xj).

Node i sends a delta-interval d = ∆a,bi with a (delta, d, b)

message; the receiving node j, after joining ∆a,bi into its replica

state, replies with an acknowledgment message (ack, b); if an ackfrom jwas successfully received by node i, it updates the entry of jin the acknowledgment map, using themax function. This handlespossible old duplicates and messages arriving out of order.

Like the δ-CRDT state, the counter ci is also kept in a durablestorage. This is essential to cope with potential crash and recoveryincidents. Otherwise, there would be the danger of receiving somedelayed ack, for a delta-interval sent before crashing, causing thenode to skip sending some deltas generated after recovery, thusviolating the delta-merging condition.

The algorithm for node i periodically picks a random neighborj. In principle, i sends the join of all deltas starting from the latestdelta acked by j. Exceptionally, i sends the entire state in two cases:(1) if the sequence of deltas Di is empty, or (2) if j is expectingfrom i a delta that was already removed from Di (e.g., after a crash

and recovery, when both deltas and the ack map, being volatilestate, are lost). A delta message is only sent if the counter ci hasadvanced past the next delta expected by node j, i.e., if Ai(j) < ci,to avoid sending the full state in local inactivity periods, when nolocal operations are being issued, all neighbor nodes have acked alldeltas, and garbage collection has been applied, making theDi mapempty. To garbage collect old deltas, the algorithm periodicallyremoves the deltas that have been acked by all neighbors.

Proposition 6.6. Algorithm 2 produces the same reachable states asa standard algorithm over a CRDT for which the δ-CRDT is a decompo-sition, ensuring causal consistency.

Proof. FromProposition 6.4 and Corollary 6.5, it is enough to provethat the algorithm satisfies the causal delta-merging condition.The algorithm explicitly keeps deltas dki tagged with increasing se-quence numbers (even after a crash), accordingwith the definition;node j only sends to i a delta-interval∆a,b

j if i has acked a; this ack issent only if i has already joined some delta-interval (possibly a fullstate) ∆

k,aj . Either k = 0 or, by the same reasoning, this ∆

k,aj could

only have been joined at i if some other interval ∆l,kj had already

been joined into i. This reasoning can be recursed until a delta-interval starting from zero is reached. Therefore, Xi ⊒

⨆{dkj |0 ≤

k < a} = ∆0,aj = Xa

j . □

7. Portfolio of δ-CRDTs

Having established the equivalence to classic state based CRDTswe now derive a series of specifications based on delta-mutators.Although we cover a significant number of CRDTs, the goal is notto provide an exhaustive survey, but instead to illustrate moreextensively the design of specifications with deltas. In our expe-rience the intellectual effort of designing a delta-based CRDT is notmuch higher than designing it with standard mutators. Since stan-dardmutators can be obtained from delta-mutators, by composingthese with join, having delta-mutators as basic building blocks canonly add flexibility to the designs.

First, we will cover simple CRDTs and CRDT compositions thatdo not require distinguished node identifiers for the mutation.Next, we cover CRDTs that require a unique identifier for eachreplica that is allowed to mutate the state, and make use of thisidentifier in one or more of the mutations. Then, we address theimportant class of what we denote by Causal CRDTs, presenting ageneric design inwhich the state lattice is formed by a dot store anda causal context.We define three such dot stores and correspondinglattices, which are then used to define several causal CRDTs. Weconclude the portfolio with aMap design, a causal CRDTwhich cancorrectly embed any causal CRDT, including the map itself.

All of the selected CRDTs have delta implementations availablein C++ [3], that complement the specifications. Most of the CausalCRDTs, including theMap, are also available in Erlang and deployedin production as part of Riak DT [6].

7.1. Simple lattice compositions

To obtain composite CRDTs, a basic ingredient is being ableto obtain states, which are join-semilattices, as composition ofjoin-semilattices. Two common useful cases are the product andlexicographic product. Other examples of lattice composition arepresented in [17,23].

7.1.1. PairIn Fig. 3 we show the standard pair composition. The bottom is

the pair of respective bottoms and the join is the coordinate-wisejoin of the components. This can be generalized to products ofmorethan two components.

Page 7: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

168 P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173

Fig. 3. Pair of join-semilattices.

Fig. 4. Lexicographic pair of join-semilattices.

Fig. 5. δ-CRDT grow-only set, replica i.

7.1.2. Lexicographic pairA variation of the pair composition is to establish a lexicographic

pair. In this construction, in Fig. 4, the first element takes priorityin establishing the outcome of the join, and a join of the secondcomponent is only performed on a tie. One important special caseis when the first component is a total order; it can be used, e.g., todefine an outcome based on the comparison of a time-stamp, aswill be shown later.

7.2. Anonymous δ-CRDTs

The simplest CRDTs are anonymous. This occurs when themutators do not make use of a globally unique replica identifier,having a uniform specification for all replicas. (Although for unifor-mity of notation we will keep parameterizing mutators by replicaidentifier.)

7.2.1. GsetA simple example is illustrated by a grow-only set, in Fig. 5. The

single delta mutator insertδi (e, s) does not even need to considerthe current state of the replica, available in s, and simply produces adeltawith a singleton set containing the element e to be added. Thisdelta {e} when joined to s produces the desired result: an inflatedset s ∪ {e} that includes element e. The join of grow-only sets istrivially obtained by unioning the sets.

7.2.2. 2PSetIn case one needs to remove elements, there are multiple ways

of addressing it. The simplestway is to include another (grow-only)set that gathers the removed elements. This is done in Fig. 6, whichshows a two-phase set, with state being a pair of sets. The namecomes from the fact that elements may go through two phases:the added phase and the removed phase. The shortcoming of thissimple design is that once removed, elements cannot be re-added.

Fig. 6. δ-CRDT two-phase set, replica i.

Fig. 7. δ-CRDT Add-Wins LWW Set, replica i.

If we look at the query function elements it is clear that thedata-type is presenting to the user the set difference betweenthe added elements and the removed elements (those stored inthe tombstone set t). Removing an element simply adds it to theremoved set. (A variant of 2PSet with guarded removes [36] onlydoes so if the element is already present in the added set.) The joinis simply a pairwise join.

7.2.3. Add-Wins last-writer-wins setThis construction, depicted in Fig. 7, manages a set of elements

of type E by tagging themwith timestamps from some total order—here we use natural numbers. Each time an element is added,it is tagged with a client supplied timestamp and the booleanTrue. Removed elements are similarly tagged, butwith the booleanFalse. Elements marked with True are considered to be in the set.When joining two such sets, those elements in common will haveto compete to define if they are in the set. By using lexicographicpairs, we obtain the behavior that elements with higher (morerecent) time-stamps will win, defining the presence according tothe boolean tag; if there is a tie in the time-stamp, adds will win,since we order False < True.

Notice that it is up to the client to ensure that supplied times-tamps always grow monotonically. Failure to do so is a commonsource of errors in timestamp based systems [24]. A dual construc-tion to the Add-Wins LWW Set is a Remove-Wins LWW Set, whereremove operations take priority on the event of a time-stamp tie.This construction has been widely deployed in production as partthe SoundCloud system [9].

7.3. Named δ-CRDTs

Another design strategy for conflict-free data-types is to en-sure that each replica only changes a specific part of the state. InSection 4, we defined a GCounter that, using a map from globallyunique replica identifiers to natural numbers, keeps track of howmany increment operations each replica did. This was the firstexample of a named CRDT, the construction covered in this section.The distinction from anonymous CRDTs is thatmutationsmake useof the replica identifier i.

Page 8: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173 169

Fig. 8. δ-CRDT positive–negative counter, replica i.

Fig. 9. δ-CRDT Lexicographic Counter, replica i.

7.3.1. PncounterBy composing, in a pair, two grow-only counters we obtain

a positive–negative counter that can track both increments anddecrements. Shown in Fig. 8, the increment and decrement op-erations will update the first and second components of the pair,respectively. As expected, the value is obtained by subtracting thedecrements from the increments.

7.3.2. Lexicographic counterWhile the PNCounter was one of the first CRDTs to be added

to a production database, in Riak 1.4 [7], the competing Cassandradatabase had its own counter implementations based on the LWWstrategy. Interestingly it proved to be difficult to avoid semanticanomalies in the behavior of those early counters, and since Cas-sandra 2.1, a new counterwas introduced [16].We capture itsmainproperties in the Fig. 9 specification of a LexCounter.

This counter is updated by either incrementing or decrementingthe second component of the lexicographic pair corresponding tothe replica issuing the mutation. Decrements also increment thefirst component, to ensure that the pair will be inflated, making it(and therefore, the just updated second component) win upon alexicographic join.

7.4. Causal δ-CRDTs

Wenow introduce a specific class of CRDTs, that wewill refer toas causal CRDTs. Initial designs [33] introduced data types such asobserved-remove sets and multi-value registers. While these madepossible sets which allow adding and removing elements multipletimes, and to model the design of the eventually consistent shop-ping cart, in Amazon Dynamo [18], they had sub-optimal scalabil-ity properties [12]. Later designs, such as in observed-remove setswithout tombstones [8], allow an efficient management of meta-data state and can be applied to a broad class of data-types.

We introduce the concept of dot store to be used together witha causal context to form the state (a join-semilattice) of a causalCRDT, presenting three such dot stores and lattices. These are thenused to obtain several related data-types, including flags, registers,sets, and maps.

Fig. 10. Causal context.

7.4.1. Causal contextA common property to causal CRDTs is that events can be

assigned unique identifiers. A simple mechanism is to create theseidentifiers by appending to a globally unique replica identifier areplica-unique integer. For instance, in replica i ∈ I we can createthe sequence (i, 1), (i, 2), . . . . Each of these pairs can be used to taga specific event, or client action, and if we collect these pairs in agrow-only set, we can remember which events are known to eachreplica. The pair is called a dot and the grow-only set of pairs canbe called a causal history, or alternatively a causal context, as we dohere.

As seen in Fig. 10, a causal context is a set of dots. We definetwo functions over causal contexts: maxi(c) gives the maximumsequence number for pairs in c from replica i, or 0 if there is nosuch dot; nexti(c) produces the next available sequence numberfor replica i given set of events in c.

7.4.2. Causal context compressionIn practice, a causal context can be efficiently compressedwith-

out any loss of information.When using an anti-entropy algorithmthat provides causal consistency, e.g., Algorithm 2, then for eachreplica state Xi that includes a causal context ci, and for any replicaidentifier j ∈ I, we have a contiguous sequence:

1 ≤ n ≤ maxj(ci) ⇒ (j, n) ∈ ci.

Thus, under causal consistency the causal context can always beencoded as a compact version vector [31] I ↪→ N that keeps themaximum sequence number for each replica.

Even under non-causal anti-entropy, such as in Algorithm 1,compression is still possible by keeping a version vector that en-codes the initial contiguous sequence of dots from each replica,together with a set for the non-contiguous dots. As anti-entropyproceeds, each dot is eventually encoded in the vector, and thusthe set remains typically small. Compression is less likely for thecausal context of delta-groups in transit or buffered to be sent, butthose contexts are only transient and smaller than those in theactual replica states. Moreover, the same techniques that encodecontiguous sequences of dots can also be used for transient contextcompression [28].

7.4.3. Dot storesTogether with a causal context, the state of a causal CRDT will

use some kind of dot store, which acts as a container for data-typespecific information. A dot store can be queried about the set ofevent identifiers (dots) corresponding to the relevant operations inthe container, by function dots, which takes a dot store and returnsa set of dots. In Fig. 11 we define three kinds of dot stores: a DotSetis simply a set of dots; the generic DotFun⟨V ⟩ is a map from dots tosome lattice V ; the generic DotMap⟨K , V ⟩ is a map from some setK into some dot store V .

7.4.4. Causal δ-CRDTsIn Fig. 12 we define the join-semilattice which serves as state

for Causal δ-CRDTs, where an element is a pair of dot store andcausal context. We define the join operation for each of the threekinds of dot stores. These lattices are a generalization of techniquesintroduced in [1,8]. To understand the meaning of a state (and the

Page 9: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

170 P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173

Fig. 11. Dot stores.

way join must behave), a dot present in a causal context but not inthe corresponding dot store, means that the dot was present in thedot store, some time in the past, but has been removedmeanwhile.Therefore, the causal context can track operations with removesemantics, while avoiding the need for individual tombstones.

When joining two replicas, a dot present in only one dot store,but included in the causal context of the other, will be discarded.This is clear for the simpler case of a DotSet, where the joinpreserves all dots in common, together with those not present inthe other causal context. The DotFun⟨V ⟩ case is analogous, but thecontainer is now a map from dots to some value, allowing thevalue for a given dot to evolve with time, independently at eachreplica. It assumes the value set is a join-semilattice, and appliesthe corresponding join of values for each dot in common.

In the more complex case of DotMap⟨K , V ⟩, a map from someK to some dot store V , the join, for each key present in eitherreplica, performs a join in the lattice Causal⟨V ⟩, by pairing the per-key value with the replica-wide causal context, and storing the

Fig. 13. δ-CRDT enable-wins flag, replica i.

resulting value (first component of the result) for that key, but onlywhen it is not ⊥V . This allows the disassociation of a compositeembedded value from a key, with no need for a per-key tombstone,by remembering in the causal context all dots from the compositevalue. Matching our notation, in a DotMap⟨K , V ⟩, any unmappedkey corresponds effectively to the bottom ⊥V .

7.4.5. Enable-Wins flagThe flags are simple, yet useful, data-types that were first intro-

duced in Riak 2.0 [6]. Fig. 13 presents an enable-wins flag. Enablingthe flag simply replaces all dots in the store by a new dot; thisis achieved by obtaining the dot through nexti(c), and making thedelta mutator return a store containing just the new dot, togetherwith a causal context containing both the new dot and all currentdots in the store; this will make all current dots to be removedfrom the store upon a join (as previously defined), while the newdot is added. Concurrent enabling can lead to the store containingseveral dots. Reads will consider the flag enabled if the store is notan empty set. Disabling is similar to enabling, in that all currentdots are removed from the store, but no new dot is added. It ispossible to construct a dual data-type with disable-wins semanticsand its code is also available [3].

7.4.6. Multi-Value registerA multi-value register supports read and write operations, with

traditional sequential semantics. Under concurrent writes, a joinmakes a subsequent read return all concurrently written values,and a subsequent write will overwrite all those values. This data-type captures the semantics of the Amazon shopping cart [18],and the usual operation of Riak (when not using CRDT data-types).

Fig. 12. Lattice for causal δ-CRDTs.

Page 10: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173 171

Fig. 14. δ-CRDT multi-value register, replica i.

Fig. 15. δ-CRDT add-wins set, replica i.

Initial implementations of these registers tagged each value witha full version vector [33]; here we introduce an optimized im-plementation that tags each value with a single dot, by using aDotFun⟨V ⟩ as dot store. In Fig. 14 we can see that the write deltamutator returns a causal context with all dots in the store, so thatthey are removed upon join, togetherwith a singlemapping from anewdot to the valuewritten; as usual, the newdot is also put in thecontext. A clear operation simply removes current dots, leaving theregister in the initial empty state. Reading simply returns all valuesmapped in the store.

7.4.7. Add-Wins setIn an add-wins set removals do not affect elements that have

been concurrently added. In this sense, under concurrent updates,an add will win over a remove of the same element. The imple-mentation, in Fig. 15, uses a map from elements to sets of dots asdot store. This data-type can be seen as a map from elements toenable-wins flags, but with a single common causal context, andkeeping only elements mapped to an enabled flag.

When an element is added, all dots in the corresponding entrywill be replaced by a singleton set containing a new dot. If a DotSetfor some element were to become empty, such as when removingthe element, the join for DotMap⟨E,DotSet⟩ will remove the entryfrom the resulting map. Concurrently created dots are preservedwhen joining. The clear delta mutator will put all dots from thedot store in the causal context, to be removed when joining. Asonly non-empty entries are kept in the map, the set of elementscorresponds to the map domain.

7.4.8. Remove-Wins setUnder concurrent adds and removes of the same element, a

remove-wins set will make removes win. To obtain this behavior,the implementation in Fig. 16 uses amap fromelements to a nestedmap frombooleans to sets of dots. For both adding and removing ofa given entry, the corresponding nestedmap is cleared (by the deltamutator inserting all corresponding dots into the causal context),and a new mapping from either True or False, respectively, to asingleton new dot is added.

When joining replicas, the nested map will collect the unionof the respective sets in the corresponding entry (for dots notseen by the other causal context). As before, only non-bottomentries are kept, for both outer map (non-empty maps) and nestedmap (non-empty DotSets). Therefore, an element is consideredto be in the set if it belongs to the outer map domain, and thecorresponding nested map does not contain a False entry; thus,concurrent removes will win over adds.

7.4.9. A map embedding causal δ-CRDTs.Maps are important composition tools for the construction of

complex CRDTs. Although grow-only maps are simple to conceiveand have been used in early state based designs [5], the creationof a map that allows removal of entries and supports recursivecomposition is not trivial. Riak 2.0 introduced a map design thatprovides a clear observed-remove semantics: a remove can be seenas an ‘‘undo’’ of all operations leading to the embedded value,putting it in the bottom state, but remembering those operations,to undo them in other replicaswhich observe it by a join. Key to thedesign is to enable removal of keys to affect (and remember) thedots in the associated nested CRDT, to allow joining with replicasthat have concurrently evolved from the before-removal point, orto ensure that re-creating entries previously removed does notintroduce anomalies.

In order to obtain the desired semantics it is not possible tosimplymap keys to causal CRDTs having their own causal contexts.Doing so would introduce anomalies when recreating keys, sinceold versions of the mappings in other replicas could be consideredmore recent than newer mappings, since the causal contexts ofthe re-created entries would start again at their bottom state. Thesolution is to have a common causal context to the whole map,to be used for all nested components, and never reset that singlecontext.

For an arbitrary set of keys K and a causal δ-CRDT Causal⟨V ⟩

that we want to embed (including, recursively, the mapwe are defining), the desired map can be achieved throughCausal⟨DotMap⟨K , V ⟩⟩, where a single causal context is sharedby all keys and corresponding nested CRDTs, as presented inFig. 17. This map can embed any causal CRDT as values. Forinstance we can define a map of type ORMap⟨S,AWSet⟨E⟩⟩,mapping strings S to add-wins sets of elements E; or define amore complex recursive structure that uses a map within a mapORMap⟨N,ORMap⟨S,MVReg⟨E⟩⟩⟩.

The map does not support a specific operation to add newentries: it starts as an empty map, which corresponds to any keyimplicitly mapped to bottom; then, any operation from the em-bedded type can be applied, through a higher-order apply, whichtakes a delta mutator oδ

i to be applied, the key k, and the map(m, c). This mutator fetches the value at key k fromm, pairs it withthe shared causal context c , obtaining a value from the embeddedtype, and invokes the operation over the pair; from the resultingpair, it extracts the value to create a new mapping for that key,which it pairs with the resulting causal context. Removing a keywill recursively remove the dots in the corresponding embeddedvalue, while the clear operationwill remove all dots from the store.This simplicity was achieved by encapsulating most complexity inthe join (and also the dots function) of the embedded type.

8. Related work

8.1. Eventually convergent data types

The design of replicated systems that are always available andeventually converge can be traced back to historical designs in[22,35], among others. More recently, replicated data types thatalways eventually converge, both by reliably broadcasting oper-ations (called operation-based) or gossiping and merging states(called state-based), have been formalized as CRDTs [5,26,32,33].These are also closely related to BloomL [14] and Cloud Types [11].State join-semilattices were used for deterministic parallel pro-gramming in LVars [25], where variables progress in the latticeorder by joining other values, and are only accessible by specialthreshold reads.

Page 11: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

172 P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173

Fig. 16. δ-CRDT remove-wins set, replica i.

Fig. 17. δ-CRDT map embedding causal δ-CRDTs, with observed removes, replica i.

8.2. Message size

A key feature of δ-CRDT is message size reduction and coa-lescing, using small-sized deltas. The general old idea of usingdifferences between things, called ‘‘deltas’’ in many contexts, canlead to many designs, depending on how exactly a delta is defined.The state-based deltas introduced for Computational CRDTs [29]require an extra delta-specific merge (in addition to the standardjoin) which does not ensure idempotence. In [19], an improvedsynchronization method for non-optimized OR-set CRDT [32] ispresented, where delta information is propagated; in that paperdeltas are a collection of items (related to update events betweensynchronizations), manipulated andmerged through a protocol, asopposed to normal states in the semilattice. No generic frameworkis defined (that could encompass other data types) and the protocolrequires several communication steps to compute the informationto exchange. Operation-based CRDTs [4,32,33] also support smallmessage sizes, and in particular, pure flavors [4] that restrict mes-sages to the operation name, and possible arguments. Though pureoperation-based CRDTs allow for compact states and are very fastat the source (since operations are broadcast without consultingthe local state), the model requires more systems guarantees thanδ-CRDT do, e.g., exactly-once reliable delivery and membershipinformation, and impose more complex integration of new repli-cas. The work in [13] shows a different trade-off among statedeltas and pure operations, by tagging operations and creatinga globally stable log of operations while allowing local transientlogs to preserve availability. While having other advantages, thecreation of this global log requires more coordination than ourgossip approach for causally consistent delta dissemination, andcan stall dissemination.

8.3. Encoding causal histories.

State-based CRDTs are always designed to be causally con-sistent [5,33]. Optimized implementations of sets, maps, andmulti-value registers can build on this assumption to keep themeta-data small [12]. In δ-CRDT, however, deltas and delta-groupsare normally not causally consistent, and thus the design of join,the meta-data state, as well as the anti-entropy algorithm usedmust ensure this. Without causal consistency, the causal contextin δ-CRDT cannot always be summarized with version vectors, andconsequently, techniques that allow for gaps are often used. A wellknown mechanism that allows for encoding of gaps is found inConcise Version Vectors [27]. Interval Version Vectors [28], lateron, introduced an encoding that optimizes sequences and allowsgaps, while preserving efficiency when gaps are absent.

9. Conclusion

We introduced the new concept of δ-CRDTs and devised delta-mutators over state-based datatypes which can detach the changesthat an operation induces on the state. This brings a significantperformance gain as it allows only shipping small states, i.e., deltas,instead of the entire state. The significant property in δ-CRDT isthat it preserves the crucial properties (idempotence, associativityand commutativity) of standard state-based CRDT. In addition, wehave shown how δ-CRDT can achieve causal consistency; and wepresented an anti-entropy algorithm that allows replacing classicalstate-based CRDTs by more efficient ones, while preserving theirproperties. As an application of our approach we designed severalnovel δ-CRDT specifications, including a general framework forcausal CRDTs and composition in maps.

Our approach is more relaxed than classical state-based CRDTs,and thus, can replace them without losing their power sinceδ-CRDT allows shipping delta-states as well as the entire state.Another interesting observation is that δ-CRDT can mimic thebehavior of operation-based CRDTs, by shipping individual deltason the fly but with weaker guarantees from the disseminationlayer.

References

[1] P.S. Almeida, C. Baquero, R. Gonçalves, N.M. Preguiça, V. Fonte, Scalable andaccurate causality tracking for eventually consistent stores, in: DistributedApplications and Interoperable Systems - 14th IFIP WG 6.1 InternationalConference, DAIS 2014, Berlin, Germany, June 3–5, 2014, Proceedings, 2014,pp. 67–81. http://dx.doi.org/10.1007/978-3-662-43352-2_6.

[2] P.S. Almeida, A. Shoker, C. Baquero, Efficient state-based CRDTs by delta-mutation, in: Networked Systems - Third International Conference, NETYS2015, Agadir, Morocco, May 13–15, 2015, 2015, pp. 62–76.

[3] C. Baquero, Delta-enabled-CRDTs (Retrieved 22.12.15). URLhttp://github.com/CBaquero/delta-enabled-crdts.

[4] C. Baquero, P.S. Almeida, A. Shoker, Making operation-based CRDTs operation-based, in: Distributed Applications and Interoperable Systems: 14th IFIP WG6.1 International Conference, DAIS 2014, Berlin, Germany, June 3–5, 2014Proceedings, Springer Berlin Heidelberg, 2014, pp. 126–140.

[5] C. Baquero, F. Moura, Using structural characteristics for autonomous opera-tion, Oper. Syst. Rev. 33 (4) (1999) 90–96. http://dx.doi.org/10.1145/334598.334614.

[6] Basho, Riak datatypes (Retrieved 22.12.15). URL http://github.com/basho.[7] Basho, Riak 1.4 (Retrieved 04.01.16). URL https://github.com/basho/riak/blob/

1.4/RELEASE-NOTES.md.[8] A. Bieniusa, M. Zawirski, N. Preguiça, M. Shapiro, C. Baquero, V. Balegas, S.

Duarte, An optimized conflict-free replicated set, Rapp. Rech. RR-8083, INRIA,Rocquencourt, France (Oct. 2012). URL http://hal.inria.fr/hal-00738680.

[9] Peter Bourgon, Consistency without Consensus: CRDTs in Production atSoundCloud (Retrieved 22.12.15). URL http://www.infoq.com/presentations/crdt-soundcloud.

[10] R. Brown, S. Cribbs, C. Meiklejohn, S. Elliott, Riak dt map: A composable,convergent replicated dictionary, in: Proceedings of the First Workshop onPrinciples and Practice of Eventual Consistency, PaPEC’14, ACM, New York,NY, USA, 2014, pp. 1:1–1:1. http://dx.doi.org/10.1145/2596631.2596633.

[11] S. Burckhardt, M. Fähndrich, D. Leijen, B.P. Wood, Cloud types for eventualconsistency, in: ECOOP 2012–Object-Oriented Programming, Springer, 2012,pp. 283–307.

[12] S. Burckhardt, A. Gotsman, H. Yang,M. Zawirski, Replicated data types: specifi-cation, verification, optimality, in: S. Jagannathan, P. Sewell (Eds.), POPL, ACM,2014, pp. 271–284.

Page 12: Delta state replicated data types - HASLab · 0 =⊥ ∈ = ∈ ↪→ ={} ∈ ↪→ ={},(delta ...

P.S. Almeida et al. / J. Parallel Distrib. Comput. 111 (2018) 162–173 173

[13] S. Burckhardt, D. Leijen, M. Fahndrich, Cloud types: Robust abstractionsfor replicated shared state, Tech. Rep. MSR-TR-2014-43, Microsoft Research(March 2014). URL http://research.microsoft.com/apps/pubs/default.aspx?id=211340.

[14] N. Conway, W.R. Marczak, P. Alvaro, J.M. Hellerstein, D. Maier, Logic andlattices for distributed programming, in: Proceedings of the Third ACM Sym-posium on Cloud Computing, ACM, 2012, p. 1.

[15] S. Cribbs, R. Brown, Data structures in Riak Riak Conference (RICON) (Oct2012).

[16] Datastax, What’s New in Cassandra 2.1: Better Implementation of Counters(Retrieved 04.01.16). URL http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters.

[17] B.A. Davey, H.A. Priestley, Introduction to Lattices and Order, second ed.,Cambridge University Press, 2002.

[18] G. DeCandia, D. Hastorun,M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S.Sivasubramanian, P. Vosshall, W. Vogels, Dynamo: Amazon’s highly availablekey-value store, in: Symp. on Op. Sys. Principles (SOSP), Operating SystemsReview, Assoc. for Computing Machinery, Stevenson, Washington, USA, 2007,pp. 205–220. http://dx.doi.org/10.1145/1294261.1294281.

[19] A. Deftu, J. Griebsch, A scalable conflict-free replicated set data type, in: Proc-eedings of the 2013 IEEE 33rd International Conference on Distributed Com-puting Systems, ICDCS’13, IEEE Computer Society,Washington, DC, USA, 2013,pp. 186–195. http://dx.doi.org/10.1109/ICDCS.2013.10.

[20] P. Helland, Idempotence is not a medical condition, Queue 10 (4) (2012)30:30–30:46. http://dx.doi.org/10.1145/2181796.2187821.

[21] Todd Hoff, How League of Legends Scaled Chat to 70 Million Players - It akes alot ofMinions (Retrieved 22.12.15). URL http://highscalability.com/blog/2014/10/13/how-league-of-legends-scaled-chat-to-70-million-players-it-t.html.

[22] P.R. Johnson, R.H. Thomas, The maintenance of duplicate databases, InternetRequest for Comments RFC677. Information Sciences Institute (Jan. 1975). URLhttps://tools.ietf.org/html/rfc677.

[23] B. Kemme, A. Schiper, G. Ramalingam, M. Shapiro, Dagstuhl seminar review:Consistency in distributed systems, SIGACT News 45 (1) (2014) 67–89. http://dx.doi.org/10.1145/2596583.2596601.

[24] K. Kingsbury, The troublewith timestamps. URL https://aphyr.com/posts/299-the-trouble-with-timestamps.

[25] L. Kuper, R.R. Newton, Lvars: lattice-based data structures for deterministicparallelism, in: Proceedings of the 2nd ACM SIGPLANWorkshop on FunctionalHigh-Performance Computing, ACM, 2013, pp. 71–84.

[26] M. Letia, N. Preguiça, M. Shapiro, CRDTs: Consistency without concurrencycontrol, Rapp. Rech. RR-6956, INRIA, Rocquencourt, France (Jun. 2009). URLhttp://hal.inria.fr/inria-00397981/.

[27] D. Malkhi, D. Terry, Concise version vectors in WinFS, Distrib. Comput. 20 (3)(2007) 209–219.

[28] M. Mukund, G. Shenoy R., S.P. Suresh, Optimized OR-sets without orderingconstraints, in: Proceedings Ot the International Conference on DistributedComputing and Networking, ACM, New York, NY, USA, 2014, pp. 227–241.

[29] D. Navalho, S. Duarte, N. Preguiça, M. Shapiro, Incremental stream processingusing computational conflict-free replicated data types, in: Proceedings ofthe 3rd International Workshop on Cloud Data and Platforms, ACM, 2013,pp. 31–36.

[30] Michael Owen, Using Erlang, Riak and the ORSWOT CRDT at bet365 forscalability and performance. URL http://www.erlang-factory.com/euc2015/michael-owen.

[31] D.S. Parker, G.J. Popek, G. Rudisin, A. Stoughton, B.J. Walker, E. Walton, J.M.Chow, D. Edwards, S. Kiser, C. Kline, Detection of mutual inconsistency indistributed systems, IEEE Trans. Softw. Eng. 9 (3) (1983) 240–247. http://dx.doi.org/10.1109/TSE.1983.236733.

[32] M. Shapiro, N. Preguiça, C. Baquero, M. Zawirski, A comprehensive study ofConvergent and Commutative Replicated Data Types, Rapp. Rech. 7506, INRIA,Rocquencourt, France (Jan. 2011). URL http://hal.archives-ouvertes.fr/inria-00555588/.

[33] M. Shapiro, N. Preguiça, C. Baquero, M. Zawirski, Conflict-free replicated datatypes, in: X. Défago, F. Petit, V. Villain (Eds.), Int. Symp. on Stabilization, Safety,and Security of Distributed Systems (SSS), in: Lecture Notes in Comp. Sc.,vol. 6976, Springer-Verlag, Grenoble, France, 2011, pp. 386–400.

[34] D.B. Terry, M.M. Theimer, K. Petersen, A.J. Demers, M.J. Spreitzer, C.H. Hauser,Managing update conflicts in Bayou, a weakly connected replicated storagesystem, in: Symp. on Op. Sys. Principles (SOSP), ACM SIGOPS, ACM Press,Copper Mountain, CO, USA, 1995, pp. 172–182. http://dx.doi.org/10.1145/224056.224070.

[35] G.T.J. Wuu, A.J. Bernstein, Efficient solutions to the replicated log and dictio-nary problems, in: Symp. on Principles of Dist. Comp. (PODC), Vancouver, BC,Canada, 1984, pp. 233–242. http://dx.doi.org/10.1145/800222.806750.

[36] P. Zeller, A. Bieniusa, A. Poetzsch-Heffter, Formal specification and verifica-tion of crdts, in: E. Ábrahám, C. Palamidessi (Eds.), Formal Techniques forDistributed Objects, Components, and Systems - 34th IFIP WG 6.1 Interna-tional Conference, FORTE 2014, Berlin, Germany, June 3–5, 2014. Proceedings,in: Lecture Notes in Computer Science, vol. 8461, Springer, 2014, pp. 33–48.

Paulo Sérgio Almeida is an assistant professor at theDepartment of Informatics at University of Minho, and aresearcher at HASLab/INESC TEC. He obtained a M.Sc. de-gree fromUniversity of Porto in 1994 and a Ph.D. degree inComputer Science from Imperial College London in 1998.His research activities have been in the area of distributedsystems, namely in causality tracking mechanisms, even-tually consistent non-relational databases, fault-tolerantdistributed aggregation algorithms, bloom filters, and dis-tributed algorithms in graphs. In recent years the mainfocus of research has been on Conflict-free ReplicatedData

Types.

Ali Shoker is a researcher at HASLab/INESC TEC and aninvited assistant professor at the Department of Infor-matics at the University of Minho, Portugal. He obtainedhis Ph.D. degree in Informatics and Telecommunicationfrom the University of Toulouse in Nov. 2012 and spentsome time at EPFL LPD lab, Switzerland, and INSA de Lyon,France. His research is focused on Distributed Systems; inparticular, large scale distributed data management andfault tolerance. More recently, his research is focused onConflict-free Replicated Data Types, Edge/Fog Computing,and Internet of Things.

Carlos Baquero is currently an assistant professor at theComputer Science Department in Universidade do Minho(Portugal). He obtained his M.Sc. and Ph.D. degrees fromUniversidade doMinho in 1994 and 2000. His research in-terests are focused on distributed systems, in particular incausality tracking, peer-to-peer systems, distributed dataaggregation, highly dynamic distributed systems, both ininternet P2P settings and in mobile and sensor networks.Recent research is focused onConflict-free ReplicatedDataTypes, Edge/Fog Computing, and Internet of Things.