
Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning

Guna Prasaad, University of Washington, [email protected]

Alvin Cheung, University of California, Berkeley, [email protected]

Dan Suciu, University of Washington, [email protected]

ABSTRACT

Research on transaction processing has made significant progress towards improving performance of main-memory multicore OLTP systems under low contention. However, these systems struggle on workloads with lots of conflicts. Partitioned databases (and variants) perform well on high contention workloads that are statically partitionable, but time-varying workloads often make them impractical. Towards addressing this, we propose Strife, a novel transaction processing scheme that clusters transactions together dynamically and executes most of them without any concurrency control. Strife executes transactions in batches, where each batch is partitioned into disjoint clusters without any cross-cluster conflicts and a small set of residuals. The clusters are then executed in parallel with no concurrency control, followed by the residuals, executed separately with concurrency control. Strife uses a fast dynamic clustering algorithm that exploits a combination of random sampling and a concurrent union-find data structure to partition the batch online, before executing it. Strife outperforms lock-based and optimistic protocols by up to 2× on high contention workloads. While Strife incurs about 50% overhead relative to partitioned systems in the statically partitionable case, it performs 2× better when such static partitioning is not possible and adapts to dynamically varying workloads.

ACM Reference Format:

Guna Prasaad, Alvin Cheung, Dan Suciu. 2020. Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20), June 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3318464.3389764


1 INTRODUCTION

As modern OLTP servers today ship with an increasing number of cores, users expect transaction processing performance to scale accordingly. In practice, however, achieving scalability is challenging, with the bottleneck often being the concurrency control needed to ensure a serializable execution of the transactions. Consequently, a new class of high performance concurrency control protocols [8, 18, 19, 37, 39, 44] has been developed and is shown to scale to even thousands of cores [39, 43]. Yet, the performance of these protocols degrades rapidly on workloads with high contention [1, 43].

One promising approach to execute workloads with high contention is to partition the data, and have each core execute serially the transactions that access data in one partition [1, 15, 24, 29]. The intuition is that, if many transactions access a common data item, then they should be executed serially by the core hosting that data item without any need for synchronization. In these systems, the database is partitioned statically using either user input, application semantics, or workload-driven techniques [6, 25, 26, 28]. Once the data is partitioned, the transactions are routed to the appropriate core where they are executed serially. For workloads where transactions tend to access data from a single partition [34], static data partitioning systems are known to scale up very well [15].

However, cross-partition transactions (i.e., those that access data from two or more partitions) do require synchronization among cores and can significantly degrade the performance of the system. Unfortunately, in some applications, a large fraction of the transactions are cross-partition as no good static partitioning exists, or the hot data items change dynamically over time. As an extreme example, in the 4chan social network, a post lives for only about 20-30 minutes, and its popularity can vary within seconds [2]. The various static partitioning schemes differ in how they synchronize cross-partition transactions: H-Store [13] acquires partition-level locks, Dora [1, 24] migrates transactions between cores, while Orthrus [29] delegates concurrency control to a dedicated set of threads. If the database is statically partitionable, then these approaches can tolerate a small fraction of cross-partition transactions; however, their performance degrades rapidly when this fraction increases, or when the workload is not statically partitionable, as discussed in [1].


In this paper, we propose Strife, a new batch-based OLTP system that partitions transactions dynamically. Strife groups transactions into batches, partitions them based on the data they access, then executes transactions in each partition serially on an exclusive core. Our key insight is to recognize that most OLTP workloads, even those that are not amenable to static partitioning, can be "almost" perfectly partitioned at a batch granularity. More precisely, Strife divides the batch into two parts. The first consists of transactions that can be grouped into disjoint clusters, such that no conflicts exist between transactions in different clusters, and the second consists of residual transactions, which include the remaining ones. Strife executes the batch in two phases: in the first phase it executes each cluster on a dedicated core, without any synchronization, and in the second phase it executes the residual transactions using some fine-grained serializable concurrency protocol.

As a result, Strife does not require a static partition of the database a priori; it discovers the partitions dynamically at runtime. If the workload is statically partitionable, Strife will rediscover these partitions, setting aside the cross-partition transactions as residuals. When the workload is not statically partitionable, Strife will discover a good way to partition the workload at runtime, based on hot data items. It easily adapts to time-varying workloads, since it computes the optimal partition for every batch.

The major challenge in Strife is the clustering algorithm, and this forms the main technical contribution of this paper. Optimal graph clustering is NP-hard in general, which rules out an exact solution. Several polynomial time approximation algorithms exist [9, 16, 21, 23], but their performance is still too low for our needs. To appreciate the challenge, consider a statically partitioned system that happens to execute a perfectly partitionable balanced workload. Its runtime is linear in the total number of data accesses and scales with the number of cores. To be competitive, the clustering algorithm must run in linear time, with a small constant, and must be perfectly parallelizable across cores. Our novel clustering algorithm satisfies both criteria: it performs clustering using only three passes over data items and exploits intra-batch parallelism (Sec. 3). To achieve this, we also extend the union-find data structure to support concurrent execution (Sec. 4).

We implemented a prototype of Strife and conducted an extensive empirical evaluation of the entire algorithm (Sec. 5). We start by evaluating Strife's clustering algorithm and show that, in addition to having high performance, it also finds high quality clusters on three different benchmarks. Next, we compare Strife with general purpose concurrency control techniques, and find that its throughput is at least 2× that of the nearest competitor. A more interesting comparison is with protocols designed specifically for high contention workloads. We find that, in the ideal case when the workload is statically partitionable and there are very few cross-partition transactions, Strife achieves about half of the throughput of a statically partitioned system (due to clustering overhead). However, when the workload is not statically partitioned, or there are many cross-partition transactions, then it can improve throughput by 2× as compared to other systems. We conclude by reporting a number of micro-benchmarks to evaluate the quality of the clustering algorithm and the impact of its parameters.

While many automated static partitioning techniques exist in the literature (surveyed in Sec. 6), to the best of our knowledge the only system that considers dynamic partitioning is LADS [42]. Like Strife, LADS uses dynamic graph-based partitioning for a batch of transactions. Unlike Strife, the partitioning algorithm in LADS depends on range-based partitioning. In addition, it does not eliminate the need for concurrency control; instead it breaks a transaction into record actions, and migrates different pieces to different cores, to reduce the number of locks required. This creates new challenges, such as cascading aborts, and it is unclear how much overhead it introduces. In contrast, Strife is not limited to range partitions, is free from cascading aborts, and completely eliminates concurrency control when possible.

In summary, we make the following contributions:

• We describe Strife, an OLTP system that relies on dynamic clustering of transactions, aiming to eliminate concurrency control for most transactions in a batch (Sec. 2).

• We describe a novel clustering algorithm that produces high quality clusters using only three scans over the set of transactions in the batch (Sec. 3).

• We then present a concurrent implementation of the union-find data structure that enables parallelizing the clustering of transactions (Sec. 4).

• We conduct an extensive experimental evaluation, comparing our system with both general-purpose concurrency control systems, and with systems designed for high contention, based on a static partition of the database (Sec. 5).

2 OVERVIEW

Strife executes transactions in batches, where it collects transactions for a fixed duration and executes them together. For each batch, Strife analyzes the conflicts among transactions based on their read-write sets¹ and partitions them with the goal of minimizing concurrency control. Strife executes batches in three phases: analysis, conflict-free, and residual. These phases are executed synchronously on all cores as shown in Fig. 1. During the analysis phase, the batch is partitioned into a number of conflict-free clusters and some residuals.

¹ Read-write sets can be obtained either through static analysis of the transaction code or via a reconnaissance query and conditional execution as in deterministic databases [36].


Fig. 1. Strife execution scheme: one batch is processed in three phases (analysis, conflict-free, residual), executed synchronously across cores C1-C4 over time.

Fig. 2. Access graph of a batch of transactions: transactions (B) T1-T7 connected to the data items (D) they access, where T1 accesses {w1, a, b}, T2 {w1, c, d}, T3 {w1, a, c, d}, T4 {w2, e, f}, T5 {w2, f, h}, T6 {w2, g, h}, and T7 {w2, c, f, g}.

Transactions in a conflict-free cluster are executed serially with no concurrency control, with several clusters executed in parallel on different cores in the conflict-free phase. After all conflict-free transactions are executed, the residuals are then executed in a separate residual phase across all cores using a conventional concurrency control protocol. Once the residual transactions are executed, Strife repeats the same process for the next batch.

The goal of the analysis phase is to partition the batch into clusters such that transactions from two different clusters are conflict-free. To do so, Strife first represents a batch of transactions using its data access graph. A data access graph is an undirected bipartite graph A = (B ∪ D, E), where B is the set of transactions, D is the set of data items (e.g., tuples or tables) accessed by transactions in B, and the edges E contain all pairs (T, d) where transaction T accesses data item d. Transactions T, T′ are in conflict if they access a common data item with at least one of them updating it.

Fig. 2 depicts the access graph of an example batch (colors explained later) derived from the TPC-C [5] benchmark. Each transaction updates a warehouse (w1 or w2) and a set of stocks (a-h). Transaction T1, for example, accesses warehouse w1 and stocks a and b. Transactions T1, T2 are in conflict because they both access w1; whereas T1, T5 are not.
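To make the conflict test concrete, the following is a minimal C++ sketch, not the paper's implementation, of checking whether two transactions conflict given their read-write sets (the type and field names ReadWriteSet, reads, and writes are illustrative):

    #include <cstdint>
    #include <unordered_set>

    using Key = uint64_t;                       // identifier of a data item (e.g., a tuple)

    struct ReadWriteSet {
        std::unordered_set<Key> reads;          // data items the transaction reads
        std::unordered_set<Key> writes;         // data items the transaction updates
    };

    // Two transactions conflict if they access a common data item and at
    // least one of the two accesses is an update.
    bool conflicts(const ReadWriteSet& a, const ReadWriteSet& b) {
        for (Key k : a.writes)
            if (b.writes.count(k) || b.reads.count(k)) return true;
        for (Key k : b.writes)
            if (a.reads.count(k)) return true;  // write by b, read by a
        return false;
    }

In Fig. 2, conflicts would return true for T1 and T2 (both update w1) and false for T1 and T5, which share no data items.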

A batch can be partitioned into many conflict-free clusters when transactions access disjoint sets of data items. Transactions T1-T3 and T4-T6 (Fig. 2) access different data items ({w1, a-d} and {w2, e-h}, respectively). Hence, the batch T1-T6 can be partitioned into 2 clusters {T1, T2, T3} and {T4, T5, T6} with no cross-cluster conflicts.

However, real workloads often contain outliers that access data items from multiple clusters (such as the red transaction T7 in Fig. 2). A conflict-free cluster for the batch T1-T7 thus will contain all transactions. To avoid this, we instead remove T7 as a residual and create conflict-free clusters only with respect to the remaining batch. This results in 2 conflict-free clusters {T1, T2, T3}, {T4, T5, T6} and a residual T7.

Formally, a clustering in Strife is an (n + 1)-partition of a batch of transactions B into {C1, C2, ..., Cn, R} such that, for any i ≠ j and transactions T ∈ Ci, T′ ∈ Cj, T and T′ do not conflict. Notice that there is no such requirement on R.

Ideally, we want to maximize n, the number of conflict-free clusters, to maximize parallelism, while minimizing the size of R. To elucidate the trade-offs, we describe two naive clusterings. The first is fallback clustering, where all transactions are in R and we set n = 0; this corresponds to running the entire batch using conventional concurrency control. The second is sequential clustering, where we place all transactions in C1, and set n = 1 and R = ∅; this corresponds to executing all transactions sequentially, on a single core. Obviously, neither of these is a good clustering. Hence, we want n to be at least as large as the number of cores, and constrain R to be no larger than a small fraction, α, of the batch.

In practice, a good clustering exists for most workloads, except in the extreme cases. In one extreme, when all transactions access a single highly contended data item, then no good clustering exists besides fallback and sequential, and Strife simply resorts to fallback clustering. Once data contention decreases, i.e., the number of contentious data items increases, a good clustering would try to place each hot item in a different conflict-free cluster. When contention further decreases such that all transactions access different data items, then any n-clustering of roughly equal sizes is adequate. Thus, we expect a good clustering to exist in all but the most extreme workloads. The challenge, however, is to find a good clustering efficiently.

After partitioning, conflict-free clusters are placed in a shared worklist. Multiple cores then obtain a cluster from the worklist, and execute all transactions in it serially one after another in the conflict-free phase. Once done with a cluster, a core obtains another from the worklist until done. The degree of parallelism in this phase is determined by the number and size of conflict-free clusters. More clusters result in parallel execution, thus reducing the total time to execute them. Once the worklist has been exhausted, a core waits for other cores to finish processing their clusters before moving to the residual phase. This is because the residuals may conflict with the conflict-free transactions that are being executed concurrently, and hence require data synchronization. A skew in cluster sizes may reduce concurrency since threads that complete early cannot advance to the next phase.
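As a rough illustration of the conflict-free phase described above, here is a minimal C++ sketch; the Worklist class below is a simple mutex-based stand-in for the low-overhead concurrent queue Strife actually uses, and the Txn and Cluster types are illustrative:

    #include <functional>
    #include <mutex>
    #include <queue>
    #include <vector>

    using Txn = std::function<void()>;          // a transaction body, run without CC
    using Cluster = std::vector<Txn>;           // one conflict-free cluster

    // Shared worklist of conflict-free clusters (simplified stand-in).
    class Worklist {
        std::mutex m_;
        std::queue<Cluster> q_;
    public:
        void push(Cluster c) {
            std::lock_guard<std::mutex> g(m_);
            q_.push(std::move(c));
        }
        bool try_pop(Cluster& out) {
            std::lock_guard<std::mutex> g(m_);
            if (q_.empty()) return false;
            out = std::move(q_.front());
            q_.pop();
            return true;
        }
    };

    // Conflict-free phase on one core: repeatedly grab a whole cluster and
    // execute its transactions serially, with no locks or validation.
    void conflict_free_worker(Worklist& worklist) {
        Cluster cluster;
        while (worklist.try_pop(cluster))
            for (Txn& t : cluster) t();
        // The thread then waits at a phase barrier until every core is done,
        // and only afterwards moves on to the residual phase (omitted here).
    }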


Algorithm 1: Strife Clustering Algorithm

Input: List<Txn, ReadWriteSet> batch
Output: Queue<Queue<Txn>> worklist, Queue<Txn> residuals

 1  set<Cluster> special;
 2  int count[k][k] = {0};
    // Spot Step (initially each data node is a cluster)
 3  i = 0;
 4  repeat
 5      Pick a random transaction T from batch;
 6      set<Cluster> C = { Find(r) | r ∈ T.nbrs };
 7      set<Cluster> S = { c | c ∈ C and c is special };
 8      if |S| == 0 then
 9          c = C.first;
10          foreach other ∈ C do
11              Union(c, other);
12          c.id = i;
13          c.count++;
14          c.is_special = true;
15          special.insert(c);
16          i++;
17  until k times;
    // Fuse Step
18  foreach T ∈ batch do
19      set<Cluster> C = { Find(r) | r ∈ T.nbrs };
20      set<Cluster> S = { c | c ∈ C and c is special };
21      if |S| ≤ 1 then
22          c = (|S| == 0) ? C.first : S.first;
23          foreach other ∈ C do
24              Union(c, other);
25          c.count++;
26      else
27          foreach (c1, c2) ∈ S × S do
28              count[c1.id][c2.id]++;
    // Merge Step
29  foreach (c1, c2) ∈ special × special do
30      n1 = count[c1.id][c2.id];
31      n2 = c1.count + c2.count + n1;
32      if n1 ≥ α × n2 then
33          Union(c1, c2);
    // Allocate Step
34  foreach T ∈ batch do
35      C = { Find(r) | r ∈ T.nbrs };
36      if |C| == 1 then
37          c = C.first;
38          if c.queue == ∅ then
39              c.queue = new Queue<Txn>();
40              worklist.Push(c.queue);
41          c.queue.Push(T);
42      else
43          residuals.Push(T);
44  return (worklist, residuals);

As such, optimal allocation of conflict-free clusters to cores can be modeled as a 1-D bin-packing problem. However, solving this requires estimating the cost of executing a transaction, and the quality of the solution critically depends on how good the estimate is. We instead rely on work-sharing parallelism powered by a low-overhead concurrent queue to balance load across cores dynamically.

After the conflict-free clusters are processed, transactions in the residual cluster are executed in the residual phase. As discussed earlier, concurrency control is needed to execute them concurrently across cores. Strife uses two-phase locking with a no-wait deadlock avoidance policy as it is shown to be highly scalable [43]. Under this policy, a transaction aborts immediately when it fails to acquire a lock. Strife retries transactions aborted this way until successful commit or logical abort. Finally, after the residuals are executed, Strife processes the next batch. We next describe the clustering algorithm in Strife.
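The following C++ sketch illustrates the retry loop described above. It is a simplification, not Strife's code: it models only exclusive locks and acquires them all up front (which the read-write sets known from the analysis phase make possible), whereas a no-wait 2PL implementation may also acquire locks as data is accessed; the key behavior shown is the immediate abort on a failed lock acquisition followed by a retry.

    #include <atomic>
    #include <functional>
    #include <vector>

    // Per-record metadata: a minimal exclusive try-lock, standing in for the
    // lock state that Strife co-locates with each record.
    struct RecordLock {
        std::atomic_flag locked = ATOMIC_FLAG_INIT;
        bool try_lock() { return !locked.test_and_set(std::memory_order_acquire); }
        void unlock()   { locked.clear(std::memory_order_release); }
    };

    struct ResidualTxn {
        std::vector<RecordLock*> locks;   // locks covering the read-write set
        std::function<bool()> body;       // returns false on a logical abort
    };

    // No-wait policy: abort immediately when any lock is unavailable, release
    // everything acquired so far, and retry until commit or logical abort.
    bool execute_residual(ResidualTxn& t) {
        for (;;) {
            size_t acquired = 0;
            bool all_locked = true;
            for (RecordLock* l : t.locks) {
                if (!l->try_lock()) { all_locked = false; break; }
                ++acquired;
            }
            if (all_locked) {
                bool committed = t.body();                       // may abort logically
                for (size_t i = 0; i < acquired; ++i) t.locks[i]->unlock();
                return committed;
            }
            for (size_t i = 0; i < acquired; ++i) t.locks[i]->unlock();
            // fall through: retry the whole transaction
        }
    }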

3 STRIFE CLUSTERING ALGORITHM

A batch of transactions in Strife, B, is partitioned into conflict-free clusters C1, C2, ..., Cn, and a set of residuals, R, of size at most α|B|. The inputs to the algorithm are the batch of transactions (denoted B) along with their read-write sets, a number k, typically a small factor times the core count, and the residual parameter α, a fraction between 0 and 1.

Our clustering problem can be viewed as partitioning the data access graph A = (B ∪ D, E) (see Fig. 2) with dual objectives: maximize the number of conflict-free clusters, and reduce the size of the residuals. Graph partitioning in general is NP-hard, and several PTIME approximations have been proposed in the literature [9, 10, 21, 23, 27]. These solutions, however, do not meet the latency budget of our use case. For instance, our preliminary study revealed that partitioning a batch from the TPC-C benchmark using METIS [21], a state-of-the-art graph partitioning library, is an order of magnitude slower than executing it directly with a concurrency protocol. Hence, our key challenge is to devise a clustering algorithm that touches each node in the graph as few times as possible.

In other words, we seek an approximation algorithm that runs in linear time, with a very small constant, and can be parallelized to incur the least runtime overhead. We next describe our algorithm (Sec. 3.1), illustrate it using an example (Sec. 3.2), and finally analyze its runtime (Sec. 3.3); the pseudo-code is listed in Alg. 1.

3.1 Description

Partitioning a batch of transactions in Strife is akin to partitioning the data nodes in A into data clusters. Once that is done, we can then allocate each transaction to a conflict-free cluster or the residuals depending on how its connected data nodes are clustered. Hence, our algorithm clusters the data nodes first, then allocates transactions to an appropriate cluster or the residuals.


The algorithm runs in five steps: (1) prepare, (2) spot, (3) fuse, (4) merge, and (5) allocate.

Initially each data node in the graph is a singleton cluster. The algorithm repeatedly merges clusters connected to a transaction, so that data nodes accessed by that transaction belong to the same data cluster. It uses different heuristics to prevent merging all data nodes into a single large data cluster. We describe each step in detail below.

3.1.1 Prepare Step. First, Strife creates the data access graph from the read-write sets of transactions in the batch. Strife does not explicitly create a graph data structure in memory; instead it uses the read-write sets along with some metadata to avoid overheads. We consider only data nodes that are updated by at least one transaction in the batch, since there is no need to coordinate concurrent access to read-only data items. For instance, the Items table in the TPC-C benchmark is only read and never updated; as a result Items is not part of our access graph. We do this by scanning and labeling records that are updated as active and ignoring all inactive records in subsequent steps (not shown in Alg. 1 for brevity).
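A minimal sketch of this labeling pass, with illustrative type names (Record and TxnRW are not the paper's actual classes):

    #include <vector>

    struct Record {
        bool active = false;    // set if some transaction in the batch updates it
        // ... clustering metadata (union-find parent, cluster info, etc.)
    };

    struct TxnRW {
        std::vector<Record*> reads;
        std::vector<Record*> writes;
    };

    // Prepare step: only data items updated by at least one transaction take
    // part in clustering; purely read-only items (e.g., the TPC-C Items table)
    // cannot cause conflicts and are ignored in the remaining steps.
    void prepare(std::vector<TxnRW>& batch) {
        for (TxnRW& t : batch)
            for (Record* r : t.writes)
                r->active = true;
    }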

3.1.2 Spot Step. The spot step (Lines 3-17) identifies hot data items in the batch so that we can avoid placing several of them in the same data cluster. Initially, all data nodes in the access graph are singleton data clusters. We say a transaction is connected to a data cluster c if it accesses any data node in c. In this step, we create a small number of clusters that are all labeled "special" using a sample-reject scheme. We begin by picking a transaction T from B uniformly at random. We merge the data clusters connected to T and label the resulting cluster "special."

We continue by repeatedly picking another transaction T′ at random; if T′ is connected to a special cluster, then we reject it, otherwise we merge all its data nodes into a new "special" cluster. We continue this for k trials, creating k or fewer special clusters. Recall that k is an input parameter to the algorithm, and is typically chosen to be a small factor times the number of cores. For instance, in our experiments we use k = 100 for 30 cores.

With a high probability, our scheme will find the "hot" data items (i.e., items accessed by many transactions) and create a special cluster for each of them. To understand the intuition, suppose d is a hot data item accessed by half of the transactions. Then, with high probability, one of the randomly chosen transactions will access d, and spot will create a special cluster for d; if another sampled transaction also accesses d, it will be rejected. Suppose now that there are n "hot" data items, d1, d2, ..., dn, and each is accessed by approximately |B|/n transactions. In this case, with high probability, spot will create a special cluster for each di, because the probability that none of the k sampled transactions accesses a data item di is (1 − 1/n)^k ≈ 1/e^{k/n}, which is very low when k ≫ n; for example, if k ≳ 3n, then 1/e^{k/n} ≤ 1/e^3 ≈ 0.05. This argument extends to the case of pairs of data items d1, d2 that have direct affinity, meaning that a large number of transactions access both; then, with high probability, some special cluster will include both.

3.1.3 Fuse Step. The spot step processed only a small number of transactions (up to k); the fuse step (Lines 18-28) processes all remaining transactions in the batch. For each transaction, we examine the data clusters connected to it. If at most one of them is labeled "special," we merge all of them together. To prevent creating a single cluster with all data items, fuse never merges two special clusters: if a transaction accesses two or more special clusters, then fuse skips it, and instead records the count of such transactions for each special cluster pair (Line 28). The number of special clusters remains the same throughout the fuse step, but each special cluster may have grown from its initial size in the spot step, and new non-special clusters may be created as well. The merge of two clusters is "special" if and only if at least one of them was special.

3.1.4 Merge Step. So far the special clusters created in spot have never been merged. In this step, we revisit this invariant and merge some of them to reduce the number of residuals. If two hot data items d1, d2 are accessed by many transactions, then the special clusters they belong to are good candidates for merging; we say that d1, d2 have direct affinity. Special clusters may also have to be merged due to indirect affinity, as illustrated by the following example.

Suppose there are two hot items, d1, d2, each in a separate special cluster, and a third item d3, such that a large number of transactions access both d1, d3 and another large number access d2, d3; then we say that d1, d2 have high indirect affinity. Depending on the order of processing, fuse will include d3 either in d1's special cluster or in d2's special cluster. In the first case, all transactions that access d2, d3 are skipped as residuals; in the second case the d1, d3 transactions are skipped. The merge step examines the benefit of merging the two special clusters of d1 and d2, thus allowing all these transactions to be executed in the conflict-free phase.

In such scenarios, we relax the criterion from the fuse step and merge special clusters. We use a new heuristic to prevent merging all data items into a single huge cluster in the merge step (Lines 29-33). If two special clusters ci and cj are merged, transactions connected to ci and cj can now be classified into a conflict-free cluster. We merge them only when the residuals would otherwise exceed the bound specified by parameter α, i.e., |R| ≥ α|B|. Recall that α is one of the input parameters of the algorithm: it chooses between executing transactions on more cores with concurrency control (if α is small) versus on fewer cores without concurrency control.

Empirically, we found α = 0.2 to give the best results in our experiments.

Let count[ci][cj] denote the number of transactions in B that access data items in both ci and cj. Note that the transactions accounted for in count[ci][cj] can access data items from clusters other than ci and cj as well. If the two clusters ci and cj are not merged, then all of count[ci][cj] will be residuals. So, we merge a cluster pair ci, cj when:

    count[ci][cj] ≥ α × (|ci| + |cj| + count[ci][cj])

Since |R| ≤ Σ_{i,j} count[ci][cj], this merge scheme results in residuals fewer than α|B| when most transactions access at most two special clusters.
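As a numeric illustration with invented figures (not taken from the paper): with α = 0.2, suppose special clusters ci and cj are connected to 600 and 500 transactions respectively and count[ci][cj] = 300. Since 300 ≥ 0.2 × (600 + 500 + 300) = 280, the two clusters are merged and those 300 transactions can run in the conflict-free phase; had only 100 transactions accessed both clusters, 100 < 0.2 × (600 + 500 + 100) = 240 and they would remain residuals.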

3.1.5 Allocate Step. In the allocate step (Lines 34-43), we examine the data clusters connected to each transaction and allot it to a conflict-free cluster or the residuals. If the number of clusters connected to a transaction T, denoted |C|, is more than one, then T is added to the residuals since it has conflicting data accesses with more than one data cluster. If |C| = 1, all data items accessed by T belong to one data cluster and hence T does not conflict with transactions not connected to that data cluster. So, T can be assigned to a conflict-free cluster. We first check if the data cluster already has a designated queue (Line 38). If so, we add T to that queue. Else, we create a new conflict-free queue and add T to it. This queue is added to the shared worklist. This way all transactions connected to a data cluster are added to the same conflict-free queue, thus ensuring a serializable execution.

3.2 Example

We illustrate in Fig. 3 how Strife clusters an example batch of new-order transactions from the TPC-C benchmark. We consider only stock and warehouse accesses for brevity. As shown in Fig. 3a, each transaction (middle column) updates a warehouse (right column) and a few stock tuples (left column). Each stock item is listed in the catalog of one warehouse, as labeled on the left of Fig. 3a.

At the start, each warehouse and stock item is a singleton data cluster. spot randomly samples one transaction, say the first from the top, and merges its data items into one special cluster: they are shown in orange. If the next transaction sampled accesses any of the orange items, it is ignored, otherwise its data items form a new special cluster. After several trials, Strife creates four special clusters as shown in Fig. 3b: orange, green, red, blue. All black data items continue to be singleton clusters.

Next, fuse processes all transactions, top to bottom. fuse merges each transaction's data items into one single cluster, unless doing so would merge two special clusters, in which case the transaction is skipped. For instance, the transaction accessing both a green and a red data item is skipped. In our example, after the fuse step, all data items will be included in one of the four special clusters as shown in Fig. 3c; in general, this is not the case: instead some new clusters may be formed that are not special, or some data items may be left in their singleton cluster (also non-special). Notice that most of the catalog stock items of w2 and w3 are also accessed by transactions of the other warehouse and, since we process transactions from top to bottom, most of them will be allotted to the green cluster (of w2).

At this stage, most of the w1, w2 and w4 transactions are connected to a single data cluster, while several w3 transactions access data from both the green and red clusters (notice that w3 itself is red). We say that the w2 and w3 warehouse tuples have an indirect affinity via their stock items. These two special clusters are candidates for merging, because many transactions access both red and green items. More precisely, merge checks that the number of transactions accessing both clusters is more than an α fraction of those accessing either of them, and then merges them (see Fig. 3d). This reduces the number of transactions in the residual cluster. Note that the clusters of w1 and w2 (orange and green) are not merged, because too few transactions access both (just two).

Finally, allocate iterates over all transactions, and for each, checks if it can be included in a conflict-free cluster or the residuals. As shown in Fig. 3d, only three transactions are residual: two accessing orange+red, the other accessing red+blue; the other transactions are included in one of the three conflict-free clusters: orange, red, or blue.

3.3 Runtime Analysis

We use the union-find [33] data structure to maintain data clusters in Strife. The amortized cost of a union-find operation is O(α(N)) ≈ O(1), where N is the number of data items and α the inverse Ackermann function; thus we assume that every union/find operation takes time O(1). The prepare, fuse and allocate steps iterate over the entire batch of transactions B, but do this in parallel using n cores; thus, their total runtime is O(|B|/n), where the constant under O(·) is small (about 3). spot and merge are executed sequentially, and their runtime is as follows: spot takes time O(k), since it samples using k trials, while merge takes time O(k²), since it examines up to k² pairs of special clusters. The total runtime of the clustering algorithm is O(|B|/n + k²). For comparison, if the database can be perfectly partitioned statically, and there are no cross-partition transactions, then a system using static partitioning can execute the entire batch in time O(|B|/n).

4 PARALLEL UNION-FIND

Strife heavily exploits the union-find data structure to partition the batch efficiently. The analysis is done in parallel on multiple cores to reduce end-to-end transaction latency. Specifically, the spot and merge steps are executed in a single thread, while the prepare, fuse, and allocate steps are parallelized across multiple cores.
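As a hedged sketch of the general idea only, and not Strife's exact design, a union-find can be made safe for concurrent use by storing parent pointers in atomics, linking roots with compare-and-swap, and compressing paths opportunistically:

    #include <atomic>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Sketch of a concurrent union-find: parents are atomic, roots are linked
    // with compare-and-swap, and find() compresses paths opportunistically.
    // Rank/size balancing and the per-cluster metadata used by Alg. 1
    // (count, id, is_special, queue) are omitted.
    class ConcurrentUnionFind {
        std::vector<std::atomic<uint32_t>> parent_;
    public:
        explicit ConcurrentUnionFind(uint32_t n) : parent_(n) {
            for (uint32_t i = 0; i < n; ++i)
                parent_[i].store(i, std::memory_order_relaxed);
        }

        uint32_t find(uint32_t x) {
            for (;;) {
                uint32_t p = parent_[x].load(std::memory_order_acquire);
                if (p == x) return x;
                uint32_t gp = parent_[p].load(std::memory_order_acquire);
                // Path halving: harmlessly retarget x to its grandparent.
                parent_[x].compare_exchange_weak(p, gp, std::memory_order_acq_rel);
                x = gp;
            }
        }

        // Merge the sets containing x and y; returns the surviving root.
        uint32_t unite(uint32_t x, uint32_t y) {
            for (;;) {
                uint32_t rx = find(x), ry = find(y);
                if (rx == ry) return rx;
                if (rx < ry) std::swap(rx, ry);      // always link the larger id
                uint32_t expected = rx;              // under the smaller one
                if (parent_[rx].compare_exchange_strong(
                        expected, ry, std::memory_order_acq_rel))
                    return ry;
                // rx was linked elsewhere concurrently: retry.
            }
        }
    };

In Alg. 1, Find(r) would return the current cluster of data item r and Union would merge two clusters; the per-cluster metadata would presumably be kept at (or keyed by) the root element.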


5 EVALUATION

In this section, we evaluate Strife by comparing it to other systems and on microbenchmarks. Specifically:

§5.2 First, we analyze how well Strife's clustering algorithm works on OLTP workloads.

§5.3 Then, we compare Strife with several concurrency control (CC) protocols using the shared-everything architecture.

§5.4 We next evaluate Strife against state-of-the-art CCs for handling high contention workloads: MOCC [39], Dora [24], Orthrus [29] and PartCC [13].

§5.5 We then evaluate Strife's adaptability when the hot data items dynamically vary over time.

§5.6 Finally, we study the impact of the residual parameter α, batch size, and contention using microbenchmarks.

5.1 Setup, Implementation and Workloads

Experimental Setup. We run all our experiments on a multi-socket Intel Xeon CPU E7-4890 v2 @ 2.80GHz machine with 2TB memory and two sockets, each having 15 physical cores. We implemented Strife and our baselines in C++. We use a default batch size of 10K transactions.

System Implementation. We have implemented a prototype of Strife that clusters transactions using the algorithm along with the optimizations described in Sec. 3. Strife, at its core, is a multi-threaded transaction manager that interacts with the storage layer of the database using the standard get-put interface. A table is implemented as a collection of in-memory pages that contain sequentially organized records. Some tables are indexed using a hashmap that maps primary keys to records. We implement the index using the libcuckoo [22] library, a thread-safe fast concurrent hashmap. Records contain metadata for clustering and concurrency control. We chose this design to avoid overheads due to multiple hash look-ups and focus primarily on concurrency control.

The analysis phase produces several conflict-free (CF) queues and a residual queue. All CF queues are stored in a multi-producer multi-consumer (MPMC) concurrent queue, called the worklist. The worklist is shared among all threads. In the conflict-free phase, threads obtain a queue from the worklist and execute the transactions in it serially one after another without any concurrency control. Once a thread is done with its queue, it obtains another from the worklist. When the worklist is empty, threads spin-wait for other threads to complete.

Once all threads are done with the conflict-free queues, they enter the residual phase. The residuals are stored in a shared MPMC queue. Threads dequeue and execute them using a two-phase locking concurrency protocol with a no-wait deadlock avoidance policy (i.e., a transaction is immediately aborted if it fails to grab a lock). This phase is similar to traditional concurrency control based OLTP systems.

Strife moves to the analysis phase of the next batch once the residual phase is complete. Threads could start analyzing the next batch while the residual phase of the previous batch is in progress. However, we did not implement this optimization to simplify the analysis and interpretability of results.

Workloads. We use the following benchmarks:

TPC-C [5]: We use a subset of full TPC-C consisting of a 50/50 mixture of new-order and payment transactions, with all payments accessing the customer through primary key lookups. These two transactions make up the vast majority of the benchmark and have been used previously [29] in evaluating high contention workloads to stress concurrency control. Refer to Sec. 5.6.4 for payment transactions that access customer records through the last name. In the TPC-C benchmark, 15% of payment transactions are remote and 1% of items are ordered from remote warehouses.

YCSB [4]: For YCSB, each transaction reads/updates 20 records from a single table with 20M keys and a payload size of 128 bytes. We use a workload that is statically partitionable and hence is the best case scenario for such protocols. Transactions are such that intra-partition key access is determined by a Zipfian distribution controlled by the θ parameter. Unless specified otherwise, we use θ = 0.99 to generate a highly contended workload.

HOT: HOT is a microbenchmark on a single-table database with 50M records, each having an 8-byte key and twenty 8-byte integer fields. Each transaction updates 10 records from the table. To add contention, we classify some records (default: 100) as hot and the remaining as cold. Every transaction updates one hot record and 9 cold records. We control partitionability of the workload by limiting remote partition accesses to at most 3. By default, hot records are evenly distributed across all partitions.
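As a hedged sketch of how a HOT transaction's key set might be generated under the description above (the function and parameter names are illustrative, and the cap of at most 3 remote-partition accesses is not modeled here):

    #include <cstdint>
    #include <random>
    #include <vector>

    // Draw the 10 keys updated by one HOT transaction: exactly one hot record
    // plus nine cold records, from a table of 50M records with num_hot hot keys.
    std::vector<uint64_t> make_hot_txn(std::mt19937_64& rng,
                                       uint64_t num_records = 50000000,
                                       uint64_t num_hot = 100) {
        std::uniform_int_distribution<uint64_t> hot(0, num_hot - 1);
        std::uniform_int_distribution<uint64_t> cold(num_hot, num_records - 1);
        std::vector<uint64_t> keys;
        keys.push_back(hot(rng));               // one hot record
        for (int i = 0; i < 9; ++i)
            keys.push_back(cold(rng));          // nine cold records
        return keys;
    }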

5.2 Clustering Algorithm

Fig. 5 summarizes the clustering solutions produced by Strife on a batch of 10K transactions, with α = 0.2 and different benchmark parameters. Spot Clusters refers to the number of clusters identified in the spot step, CF Clusters denotes the number of conflict-free clusters finally produced, and Residuals denotes the size of the residual cluster.

We present TPC-C results for different numbers of warehouses (denoted w), specifically 4, 15 and 30, in Fig. 5. The spot step correctly identifies 4, 15 and 30 special clusters, one corresponding to each warehouse. TPC-C has around 25% cross-warehouse transactions and hence some of them are classified as residuals.


Description / Parameter    TPC-C (#Warehouses)    YCSB, 30 Partitions (θ)              HOT (#Hot Records)
                           4     15    30         0.1    0.5    0.8    0.99   1.2      10     50     100    200
(a) Spot Clusters          4     15    30         100    98     78     48     30       10     43     63     84
(b) CF Clusters            4     15    30         100    98     78     30     30       10     43     63     84
(d) Residuals              636   191   94         0      0      298    0      0        350    371    330    250

Fig. 5. Clustering: TPC-C (Varying #Warehouses), YCSB (Varying θ), HOT (Varying #Hot Records)

We note that the residuals are much less than 25% because a cross-warehouse transaction is not necessarily a cross-cluster transaction in the batch. For example, a remote order for item i from warehouse w2 by a w1 transaction is cross-cluster only when some w2 transaction in the batch orders item i from w2, which is not always the case.

We vary the Zipfian parameter θ in YCSB with 30 partitions. spot identifies approximately 100 special clusters when θ = 0.1 or 0.5, since there are very few conflicting accesses, resulting in 100 CF clusters of almost equal size with no residuals. When θ = 0.8, spot identifies 78 special clusters and produces the same number of CF clusters. However, 30 of them contain the majority of transactions, each corresponding to a partition. The remaining 48 CF clusters are small: they contain transactions that access partition-local data from one of these 30 partitions. Since some of these transactions result in cross-edges with the main 30 CF clusters, we have a small fraction (3%) of residuals; merge does not combine them since it does not meet the merging threshold derived from α. When θ = 0.99, spot identifies 48 special clusters. But after fuse and merge, the final number of CF clusters is 30 since Strife merges some of them. When the contention is very high (θ = 1.2), spot easily identifies the 30 partitions and produces a conflict-free cluster for each.

We now vary the number of hot records in the HOT benchmark. In each case, spot identifies most but not all of the hot records in the batch since we sample the batch only 100 times. We found that more sampling attempts identify all hot records. The same number of CF clusters are produced since none of them are merged. As we will see in Sec. 5.3, different sizes for CF clusters do not impact system throughput heavily since Strife uses a shared worklist and dynamically balances load across cores. In summary, the experiments show that Strife identifies hot records and clusters conflicting transactions together while distributing hot records to as many clusters as possible.

5.3 CC protocols

We now compare Strife against 4 classes of CC protocols:

LockSorted sorts the read-write set and acquires locks in an ordered fashion before executing the transaction.

DlDetect is a two-phase-locking (2PL) variant which maintains a waits-for graph among transactions and uses it to detect and eliminate deadlocks by aborting transactions. To avoid scalability bottlenecks, each thread maintains a local partition of the graph and updates it only when waiting [43].

NoWait is a 2PL variant that aborts a transaction immediately when a lock cannot be acquired during transaction execution. We do not use a centralized lock table in any of our 2PL implementations since it is a known scalability bottleneck; instead we use metadata co-located with records in the database.

PreNoWait acquires all locks at the start of the transaction and releases them only after commit; it aborts the transaction immediately if any of the requested locks cannot be acquired.

Silo [37] is a representative OCC protocol and is considered state-of-the-art [1, 43, 44].

We implemented these protocols in the Strife code-base and corroborated them with [43] to make the comparisons fair.

Fig. 6 depicts the throughput as we vary the number of cores on the TPC-C (4 and 30 warehouses), YCSB, and HOT benchmarks. We present the execution profile for the HOT benchmark on 30 cores in Fig. 7; these protocols exhibit similar behavior on the others. Abort is time spent in aborted execution, Lock is time spent in acquiring, releasing and waiting for locks, Execute is transaction code execution, Index is time spent querying the index, Validation is time spent in validation for Silo, and Analysis is time spent in analysis (Strife).

Strife produces as many CF clusters as warehouses for TPC-C (Sec. 5.2). With 2 cores, Strife is significantly better than other protocols on TPC-C (4 warehouses) due to its very high contention, and at the same time caching benefits for clustering due to its small size. With more cores, throughput doubles from 2 to 4 cores since the conflict-free phase is 2× faster. Beyond 4 cores, the improvement in throughput is only due to increased parallelism in the analysis and residual phases, and the conflict-free phase remains the same. With 30 warehouses, Strife increases parallelism in all three phases to scale almost linearly. Strife throughput on YCSB (30 partitions) also follows a similar trend. Since YCSB is partitionable without any cross-partition transactions, no residuals are created. The throughput increase from 15 to 25 cores is gradual in both TPC-C (30 warehouses) and YCSB since the core count is insufficient to execute all 30 CF clusters in parallel. However, with 30 cores Strife exploits all parallelism, yielding almost a 2× improvement over 15 cores.

A batch in the HOT benchmark is clustered into approximately 60 clusters, each containing at least one hot record. The conflict-free phase exploits maximum parallelism, thus scaling linearly as we increase the number of cores since Strife


and releasing locks, so its performance is relatively stable compared to Dora even with partition-induced skews. When the distribution is very skewed, even the CC threads in Orthrus become a bottleneck and performance drops. MOCC is able to detect changing hot items dynamically and is not affected by partition-induced skews since it is oblivious to partitions. Strife spots hot records in a given batch by random sampling and distributes them dynamically across as many CF clusters as possible. This eliminates any partition-induced skews present in other architectures, and enables Strife to outperform all other protocols with stable throughput.

5.6 Microbenchmarks

5.6.1 Clustering Performance. Fig. 9(a) depicts Strife analysis throughput (transactions analyzed per second) as we vary the number of cores. The Strife algorithm scales almost linearly until 15 cores (within a single NUMA socket). Beyond 15 cores, there is a slight drop in scalability due to cross-socket communication, but throughput still increases. The analysis cost depends on the total number of data accesses in the batch. The depicted performance (YCSB) is for 20 read-write accesses, and analysis throughput is much better than most main-memory OLTP engines today. This shows that in the Strife execution pipeline, the analysis phase is unlikely to be the bottleneck.

Fig. 9(b) breaks down the time taken by each step during analysis for TPC-C, YCSB-Low (θ = 0.1), YCSB-High (θ = 0.99) and HOT. The prepare step scans through the read-write sets, and this is the most expensive since records are looked up for the first time and brought into cache. In subsequent steps, cached data is reused, and hence they are much faster. Analysis time also depends on the number of distinct data items accessed and the cache size. In YCSB-Low, data items are accessed almost uniformly, while YCSB-High accesses fewer distinct data items. So, analysis of a YCSB-Low batch is more expensive than that of YCSB-High. We do not show spot and merge as they are negligible compared to these three steps. Our implementation caches the current set of clusters connected to a transaction in a thread-local cache, which further improves fuse and allocate performance.

5.6.2 Residual Parameter. The residual parameter α plays a key role in Strife. It determines (1) the proportion of transactions executed with or without CC and (2) the number of CF clusters. A larger α allows Strife to classify more cross-cluster transactions as residuals, improving the number of CF clusters; but more transactions are executed with CC, reducing the benefits of CC-free execution. A smaller α, on the other hand, may reduce concurrency in the conflict-free phase, but fewer transactions are executed with CC.

Fig. 10(a) depicts the number of conflict-free clusters produced for different numbers of warehouses in TPC-C as we vary α. When α is zero, Strife clusters all transactions into a single cluster. However, as we increase α, the number of CF clusters increases and settles at the number of warehouses. Fig. 10(b) depicts the CF clusters produced for various values of θ (which controls contention in the workload) in YCSB with 30 partitions. A smaller θ implies lower contention and hence Strife results in 100 special clusters, which is the maximum value set in our experiments. Note that there is minimal impact of increasing α for the lower θ cases. For larger θ, each CF cluster corresponds to a YCSB partition and hence there is no impact.

5.6.3 Latency and Batch Size. Strife employs a batched execution strategy. Similar to streaming systems [3, 46], batch size poses an important trade-off between throughput and latency. With a larger batch size, the latency is higher, but there is more potential for amortization and hence throughput. When the batch is small, other overheads in the system become more prominent, but with a smaller latency.

We present throughput, end-to-end latency and analysis latency for TPC-C, YCSB and HOT as we vary batch size in Fig. 11 (in log scale). For small batch sizes (1K), synchronization overheads at phase boundaries affect the throughput. For larger batches, throughput is marginally higher due to better amortization and cache locality. Beyond a point, throughput does not increase since the data structures no longer fit in cache. End-to-end latency increases linearly with batch size. Although we can cluster a 1K batch in sub-millisecond latencies, the throughput is sub-optimal. In our experiments we chose a batch size of 10K since it provides a good trade-off between latency (< 5 ms) and throughput.

5.6.4 Cost of Reconnaissance Queries. When read-write sets are not known a priori, Strife uses the OLLP protocol [36]: it issues a read-only, low isolation level query called the reconnaissance query to acquire the entire read/write set as part of the prepare step. Additionally, index lookups are cached in the transaction context for later execution. A particularly common pattern of such transactions are those that perform a secondary index lookup [36]. In full TPC-C, some payment transactions obtain the customer record through the last name, which involves a secondary index lookup. Strife executes these transactions using the OLLP protocol. Fig. 12 depicts the throughput of Strife and other CC protocols on 30 warehouses as we increase the percentage of payment transactions that require a reconnaissance query in a 50/50 mixture of payment and new-order transactions. While there is a marginal decrease in performance for other CC protocols due to secondary index lookups, the decrease is more pronounced in Strife since it performs the lookup twice. At the same time, the additional cost incurred is a single lock-free secondary index lookup, so the impact of the reconnaissance query is minimal (at most 9%) and Strife still outperforms other CC protocols on high contention workloads.
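A hedged C++ sketch of this two-pass pattern for a payment that looks up a customer by last name; the types and functions below (SecondaryIndex, reconnoiter, ...) are illustrative, not Strife's API, and staleness handling is simplified to a single re-check:

    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct SecondaryIndex {
        std::unordered_map<std::string, uint64_t> by_last_name;
        std::optional<uint64_t> lookup(const std::string& name) const {
            auto it = by_last_name.find(name);
            if (it == by_last_name.end()) return std::nullopt;
            return it->second;
        }
    };

    struct PaymentTxn {
        std::string last_name;
        std::optional<uint64_t> cached_customer;  // filled by the reconnaissance pass
    };

    // Pass 1 (reconnaissance, read-only, low isolation): resolve the secondary
    // index once, cache the result, and report the keys for clustering.
    std::vector<uint64_t> reconnoiter(PaymentTxn& t, const SecondaryIndex& idx) {
        t.cached_customer = idx.lookup(t.last_name);
        std::vector<uint64_t> rw_set;
        if (t.cached_customer) rw_set.push_back(*t.cached_customer);
        return rw_set;  // the transaction's other keys would be added here too
    }

    // Pass 2 (actual execution, in its cluster or the residual phase): re-check
    // that the cached lookup is still valid; if not, the transaction has to be
    // re-analyzed with a fresh read-write set.
    bool execute(PaymentTxn& t, const SecondaryIndex& idx,
                 const std::function<void(uint64_t)>& apply_payment) {
        if (idx.lookup(t.last_name) != t.cached_customer) return false;  // stale
        if (t.cached_customer) apply_payment(*t.cached_customer);
        return true;
    }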


tracking [45] or reusing partial executions of aborted transactions [7] have addressed this partially. MOCC [39], the state-of-the-art for high contention, combines pessimistic locking with optimistic validation-based execution. There are no aborts due to conflicts in Strife.

Partitioned and Deterministic Databases. H-Store [13, 15] and Hyper [17] yield state-of-the-art performance on partitionable high contention workloads but are plagued by cross-partition transactions [43]. Appuswamy et al. [1] advocate for fine-grained partitioning to avoid skew induced by static partitions, which Strife eliminates as it dynamically adapts to the changing hot set. Calvin [31, 36] acquires locks in a deterministic order on a single thread, while Strife is free from such concurrency bottlenecks and mostly avoids locks.

Alternate Architectures. Orthrus [29] delegates CC to dedicated threads and uses message passing to synchronize access. Dora [1, 24] adopts thread-to-data assignment on a partitioned database; transactions flow from one thread to another acquiring/releasing locks. Both suffer on update-intensive workloads due to longer lock-hold times, worsened by cross-partition transactions and skews [1]. Strife does not use locks, adopts fine-grained partitioning, and avoids skews. Tang et al. [35] propose adaptive concurrency control (ACC) that clusters data and chooses an optimal CC for each cluster using a machine learning model. Both of these combine existing CC protocols to yield better performance on a mixed workload. This is unlike Strife, which optimizes for and performs better than each of these CC protocols under high contention.

Automatic Database Partitioning. Schism [6] partitions the co-access graph among data items and uses the partition to infer partition predicates statically; Horticulture [25] automatically partitions a database using local neighborhood search over the design space with an analytical cost model. While it seeks to reduce partition-induced skew and cross-partition transactions, similar to Strife, its solution space is limited to hash/range partitioning. SWORD [28] is an extension of Schism that uses hyper-graph compression to partition a database, along with incremental repartitioning. All these approaches are designed for offline computation and are not efficient enough for the dynamic clustering in Strife.

Transaction Decomposition. Callas [41] partitions a transaction into groups that are executed independently under different CC protocols. IC3 [40] tracks runtime dependencies, allowing transactions to be chopped into finer pieces than is possible statically. LADS [42] breaks a transaction into a set of record actions using their semantics. Overall, decomposing a transaction into smaller pieces can help reduce contention by reducing the effective lock hold time on hot data items. LADS adopts a graph-based partitioning strategy on batches of transactions similar to Strife, but there are several key differences: LADS does not entirely eliminate CC during execution, since it synchronizes on logical dependencies between record actions of a transaction. A user-initiated abort will lead to cascading aborts in LADS due to speculative execution of record actions. Further, the graph partitioning algorithm in LADS initially assumes a range-based partition and iteratively removes cross-edges by relocating records. This does not perform well on workloads not suitable for range-based partitioning, such as the HOT benchmark.

7 CONCLUSIONS

We presented Strife, a transaction processing protocol for high-contention workloads. Strife identifies hot records in the database at runtime, clusters transactions that access them together, and executes them without concurrency control. Strife achieves this with an algorithm that exploits random sampling and efficient union-find data structures to cluster thousands of transactions in milliseconds. Our experiments have shown that Strife can achieve substantial performance improvements over conventional CC protocols; while it incurs about 50% overhead over partitioned systems on a statically partitionable workload, Strife can improve performance by up to 2× when such static partitioning is not viable and when the workload changes over time.
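To make the clustering idea concrete, here is a simplified, single-threaded sketch (in Python) that groups a batch by union-find over accessed records. It is not the Strife algorithm itself: the actual implementation uses a concurrent union-find, seeds clusters by randomly sampling hot records, caps cluster sizes, and spills the remainder into residuals, none of which is shown here.

```python
class UnionFind:
    """Union-find with path halving; keys are record identifiers."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_batch(batch):
    """Group transactions whose accessed records overlap, transitively."""
    uf = UnionFind()
    for txn in batch:
        records = list(txn.read_set | txn.write_set)
        for r in records[1:]:
            uf.union(records[0], r)
    clusters = {}
    for txn in batch:
        records = txn.read_set | txn.write_set
        root = uf.find(next(iter(records))) if records else id(txn)
        clusters.setdefault(root, []).append(txn)
    return list(clusters.values())
```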

ACKNOWLEDGEMENTS

This work is supported in part by NSF grants IIS-1907997, AITF-1535565, IIS-1546083, and IIS-1651489; the Intel-NSF CAPA Center; DOE award DE-SC0016260; and gifts from Adobe and Google. We also thank Xiangyao Yu and the anonymous reviewers for their valuable comments and feedback.


REFERENCES

[1] Raja Appuswamy, Angelos C. Anadiotis, Danica Porobic, Mustafa K. Iman, and Anastasia Ailamaki. 2017. Analyzing the Impact of System Architecture on the Scalability of OLTP Engines for High-contention Workloads. Proc. VLDB Endow. 11, 2 (Oct. 2017), 121–134. https://doi.org/10.14778/3149193.3149194
[2] Michael S. Bernstein, Andrés Monroy-Hernández, Drew Harry, Paul André, Katrina Panovich, and Gregory G. Vargas. 2011. 4chan and /b/: An Analysis of Anonymity and Ephemerality in a Large Online Community. In ICWSM.
[3] Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (Dec. 2014), 401–412. https://doi.org/10.14778/2735496.2735503
[4] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10). ACM, New York, NY, USA, 143–154. https://doi.org/10.1145/1807128.1807152
[5] T. P. P. Council. 2018. TPC-C Standard Specification. http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5.11.0.pdf. [Online; accessed 19-Dec-2017].
[6] Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. 2010. Schism: A Workload-driven Approach to Database Replication and Partitioning. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 48–57. https://doi.org/10.14778/1920841.1920853
[7] Mohammad Dashti, Sachin Basil John, Amir Shaikhha, and Christoph Koch. 2017. Transaction Repair for Multi-Version Concurrency Control. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). ACM, New York, NY, USA, 235–250. https://doi.org/10.1145/3035918.3035919
[8] Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL Server's Memory-optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). ACM, New York, NY, USA, 1243–1254. https://doi.org/10.1145/2463676.2463710
[9] Michael T. Heath and Padma Raghavan. 1995. A Cartesian Parallel Nested Dissection Algorithm. SIAM J. Matrix Anal. Appl. 16, 1 (Jan. 1995), 235–253. https://doi.org/10.1137/S0895479892238270
[10] Bruce Hendrickson and Robert Leland. 1995. A Multilevel Algorithm for Partitioning Graphs. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (Supercomputing '95). ACM, New York, NY, USA, Article 28. https://doi.org/10.1145/224170.224228
[11] Takashi Horikawa. 2013. Latch-free Data Structures for DBMS: Design, Implementation, and Evaluation. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). ACM, New York, NY, USA, 409–420. https://doi.org/10.1145/2463676.2463720
[12] Ryan Johnson, Ippokratis Pandis, and Anastasia Ailamaki. 2009. Improving OLTP Scalability Using Speculative Lock Inheritance. Proc. VLDB Endow. 2, 1 (Aug. 2009), 479–489. https://doi.org/10.14778/1687627.1687682
[13] Evan P. C. Jones, Daniel J. Abadi, and Samuel Madden. 2010. Low Overhead Concurrency Control for Partitioned Main Memory Databases. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10). ACM, New York, NY, USA, 603–614. https://doi.org/10.1145/1807167.1807233
[14] Hyungsoo Jung, Hyuck Han, Alan D. Fekete, Gernot Heiser, and Heon Y. Yeom. 2013. A Scalable Lock Manager for Multicores. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). ACM, New York, NY, USA, 73–84. https://doi.org/10.1145/2463676.2465271
[15] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. 2008. H-store: A High-performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1496–1499. https://doi.org/10.14778/1454159.1454211
[16] George Karypis and Vipin Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput. 20, 1 (Dec. 1998), 359–392. https://doi.org/10.1137/S1064827595287997
[17] Alfons Kemper and Thomas Neumann. 2011. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering (ICDE '11). IEEE Computer Society, Washington, DC, USA, 195–206. https://doi.org/10.1109/ICDE.2011.5767867
[18] Kangnyeon Kim, Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2016. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1675–1687. https://doi.org/10.1145/2882903.2882905
[19] Hideaki Kimura. 2015. FOEDUS: OLTP Engine for a Thousand Cores and NVRAM. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 691–706. https://doi.org/10.1145/2723372.2746480
[20] H. T. Kung and John T. Robinson. 1981. On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst. 6, 2 (June 1981), 213–226. https://doi.org/10.1145/319566.319567
[21] Dominique LaSalle, Md Mostofa Ali Patwary, Nadathur Satish, Narayanan Sundaram, Pradeep Dubey, and George Karypis. 2015. Improving Graph Partitioning for Modern Graphs and Architectures. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA3 '15). ACM, New York, NY, USA, Article 14, 4 pages. https://doi.org/10.1145/2833179.2833188
[22] Xiaozhou Li, David G. Andersen, Michael Kaminsky, and Michael J. Freedman. 2014. Algorithmic Improvements for Fast Concurrent Cuckoo Hashing. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys '14). ACM, New York, NY, USA, Article 27, 14 pages. https://doi.org/10.1145/2592798.2592820
[23] Gary L. Miller, Shang-Hua Teng, William Thurston, and Stephen A. Vavasis. 1993. Automatic Mesh Partitioning. In Graph Theory and Sparse Matrix Computation, Alan George, John R. Gilbert, and Joseph W. H. Liu (Eds.). Springer New York, New York, NY, 57–84.
[24] Ippokratis Pandis, Ryan Johnson, Nikos Hardavellas, and Anastasia Ailamaki. 2010. Data-oriented Transaction Execution. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 928–939. https://doi.org/10.14778/1920841.1920959
[25] Andrew Pavlo, Carlo Curino, and Stanley Zdonik. 2012. Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems. In SIGMOD '12: Proceedings of the 2012 International Conference on Management of Data. 61–72. http://hstore.cs.brown.edu/papers/hstore-partitioning.pdf
[26] Andrew Pavlo, Evan P. C. Jones, and Stanley Zdonik. 2011. On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems. Proc. VLDB Endow. 5, 2 (Oct. 2011), 85–96. https://doi.org/10.14778/2078324.2078325
[27] Alex Pothen, Horst D. Simon, and Kan-Pu Liou. 1990. Partitioning Sparse Matrices with Eigenvectors of Graphs. SIAM J. Matrix Anal. Appl. 11, 3 (May 1990), 430–452. https://doi.org/10.1137/0611030
[28] Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. 2013. SWORD: Scalable Workload-aware Data Placement for Transactional Workloads. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 430–441. https://doi.org/10.1145/2452376.2452427
[29] Kun Ren, Jose M. Faleiro, and Daniel J. Abadi. 2016. Design Principles for Scaling Multi-core OLTP Under High Contention. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1583–1598. https://doi.org/10.1145/2882903.2882958
[30] Kun Ren, Alexander Thomson, and Daniel J. Abadi. 2013. Lightweight Locking for Main Memory Database Systems. In Proceedings of the 39th International Conference on Very Large Data Bases (PVLDB '13). VLDB Endowment, 145–156. http://dl.acm.org/citation.cfm?id=2448936.2448947
[31] Kun Ren, Alexander Thomson, and Daniel J. Abadi. 2014. An Evaluation of the Advantages and Disadvantages of Deterministic Database Systems. Proc. VLDB Endow. 7, 10 (June 2014), 821–832. https://doi.org/10.14778/2732951.2732955
[32] Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, and Kenneth A. Ross. 2014. Reducing Database Locking Contention Through Multi-version Concurrency. Proc. VLDB Endow. 7, 13 (Aug. 2014), 1331–1342. https://doi.org/10.14778/2733004.2733006
[33] Robert Sedgewick and Kevin Wayne. 2011. Algorithms (4th ed.). Addison-Wesley Professional.
[34] Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). IEEE Computer Society, Washington, DC, USA, 2–11. https://doi.org/10.1109/ICDE.2005.1
[35] Dixin Tang, Hao Jiang, and Aaron J. Elmore. 2017. Adaptive Concurrency Control: Despite the Looking Glass, One Concurrency Control Does Not Fit All. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings. http://cidrdb.org/cidr2017/papers/p63-tang-cidr17.pdf
[36] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/2213836.2213838
[37] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy Transactions in Multicore In-memory Databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 18–32. https://doi.org/10.1145/2517349.2522713
[38] Tianzheng Wang, Milind Chabbi, and Hideaki Kimura. 2016. Be My Guest: MCS Lock Now Welcomes Guests. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16). ACM, New York, NY, USA, Article 21, 12 pages. https://doi.org/10.1145/2851141.2851160
[39] Tianzheng Wang and Hideaki Kimura. 2016. Mostly-optimistic Concurrency Control for Highly Contended Dynamic Workloads on a Thousand Cores. Proc. VLDB Endow. 10, 2 (Oct. 2016), 49–60. https://doi.org/10.14778/3015274.3015276
[40] Zhaoguo Wang, Shuai Mu, Yang Cui, Han Yi, Haibo Chen, and Jinyang Li. 2016. Scaling Multicore Databases via Constrained Parallel Execution. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1643–1658. https://doi.org/10.1145/2882903.2882934
[41] Chao Xie, Chunzhi Su, Cody Littley, Lorenzo Alvisi, Manos Kapritsos, and Yang Wang. 2015. High-performance ACID via Modular Concurrency Control. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 279–294. https://doi.org/10.1145/2815400.2815430
[42] Chang Yao, Divyakant Agrawal, Gang Chen, Qian Lin, Beng Chin Ooi, Weng-Fai Wong, and Meihui Zhang. 2016. Exploiting Single-Threaded Model in Multi-Core In-Memory Systems. IEEE Trans. on Knowl. and Data Eng. 28, 10 (Oct. 2016), 2635–2650. https://doi.org/10.1109/TKDE.2016.2578319
[43] Xiangyao Yu, George Bezerra, Andrew Pavlo, Srinivas Devadas, and Michael Stonebraker. 2014. Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores. Proc. VLDB Endow. 8, 3 (Nov. 2014), 209–220. https://doi.org/10.14778/2735508.2735511
[44] Xiangyao Yu, Andrew Pavlo, Daniel Sanchez, and Srinivas Devadas. 2016. TicToc: Time Traveling Optimistic Concurrency Control. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1629–1642. https://doi.org/10.1145/2882903.2882935
[45] Yuan Yuan, Kaibo Wang, Rubao Lee, Xiaoning Ding, Jing Xing, Spyros Blanas, and Xiaodong Zhang. 2016. BCC: Reducing False Aborts in Optimistic Concurrency Control with Low Cost for In-memory Databases. Proc. VLDB Endow. 9, 6 (Jan. 2016), 504–515. https://doi.org/10.14778/2904121.2904126
[46] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 423–438. https://doi.org/10.1145/2517349.2522737
