
XQueue: Extreme Fine-grained Concurrent Lock-less Queue

Poornima Nookala¹, Peter Dinda², Kyle Hale³, Ioan Raicu⁴

¹,³,⁴ Illinois Institute of Technology {[email protected], [email protected], [email protected]}, ² Northwestern University {[email protected]}

Overview

A single general-purpose shared-memory machine today may have hundreds of hardware threads available. Operations that may run in tens of cycles on a single core using a single thread can take upwards of millions of cycles when multiple threads are competing for shared resources. The concurrent multiple-producer, multiple-consumer (MPMC) queue is a critical building block in numerous systems, such as the task scheduler of a runtime system (or operating system) that supports modern parallel programming models. We present the design, implementation, and evaluation of XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared toward realizing scalability to hundreds of concurrent threads. XQueue implements the MPMC interface using multiple queues in order to reduce contention, applies simple deterministic load-balancing techniques, and eliminates all use of locks and atomic operations with a lock-less design. Experimental results show that XQueue can deliver concurrent operations with 110 to 400 cycle latency at scales up to 384 hardware threads.

Concurrent queues are terrible!

• We analyzed the latency and throughput of mutex, spinlock, semaphore, and atomic fetch-and-add for an increment operation on the Mystic testbed (a minimal benchmark sketch in this spirit appears after this list).

• Mystic covers the latest many-core architectures from Intel, AMD, IBM, and ARM, with processors such as Haswell, Broadwell, Skylake, Phi, Opteron, Ryzen, Threadripper, Epyc, Power9, and ThunderX.

• The latency of a single atomic increment on a Skylake system with 192 cores and 384 hardware threads, when running on all threads concurrently, is 33,592 cycles, whereas on an Intel Xeon Phi Knights Landing with 64 cores and 256 hardware threads, latency reaches 3,868 cycles.

• The latency of an enqueue/dequeue operation on an SPSC queue is 30 to 70 cycles, depending on the architecture and clock frequency.

• The average throughput of enqueue/dequeue operations on an SPSC queue reaches 270 million operations per second on the Intel Skylake 192-core machine.

• The latency of an MPMC queue can reach millions of cycles under high contention, and throughput can drop to as low as 300,000 operations per second.

• These results provide ample motivation to investigate methods that exploit the full concurrency of many-core architectures without compromising the lowest latency that can be achieved.
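The following is a minimal sketch of the kind of contended-increment measurement described in this list. It assumes GCC/Clang __atomic builtins, the x86 __rdtsc() intrinsic for cycle counting, and an illustrative thread and iteration count chosen to mirror the 384-thread Skylake configuration; it is not the actual Mystic benchmark harness.

    /* Contended atomic increment micro-benchmark (sketch).
     * Build: gcc -O2 -pthread atomic_bench.c
     * Each thread hammers a single shared counter with an atomic
     * fetch-and-add and reports its average cycles per operation. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>                 /* __rdtsc(), x86 only */

    #define NTHREADS 384                   /* mirrors the skylake-192 machine */
    #define ITERS    100000

    static uint64_t counter;               /* single shared, contended counter */
    static uint64_t cycles_per_op[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        uint64_t start = __rdtsc();
        for (int i = 0; i < ITERS; i++)
            __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
        cycles_per_op[id] = (__rdtsc() - start) / ITERS;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        uint64_t sum = 0;
        for (int i = 0; i < NTHREADS; i++)
            sum += cycles_per_op[i];
        printf("avg cycles per atomic increment: %llu\n",
               (unsigned long long)(sum / NTHREADS));
        return 0;
    }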

Figure 1: SPSC Queue Latency

Figure 2: MPMC Queue Latency

Figure 3: SPSC Queue Throughput

Figure 4: MPMC Queue Throughput

Problem Statement

Having scalable and fast queue data structures under high concurrency is a critical missing link toward the realization of efficient parallel runtimes on many-core architectures. It is essential to analyze and optimize the basic data structures that form the backbone of parallel runtime systems so that they do not become a performance bottleneck.

XQueue - Design and Implementation

• The core idea of XQueue is to have two queues per core, one being the master queue and the other the auxiliary queue.

• There is one master queue and one auxiliary queue on each core, with one producer thread per queue and a common consumer thread for both queues.

• XQueue is implemented without using any locks, atomic operations, or barriers. The current implementation uses B-queue [1], a lock-free concurrent SPSC queue.

• XQueue employs load-balancing techniques in which items can be added to auxiliary queues depending on the topology defined (round-robin in Figure 5); a minimal code sketch follows below.

Figure 5: XQueue Architecture
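To make the design concrete, here is a minimal sketch of the per-core queue layout and round-robin topology described above. The toy Lamport-style ring buffer stands in for B-queue [1] and glosses over the memory-ordering and batching concerns that B-queue handles carefully. All names (spsc_*, xqueue_*, NCORES, QSIZE) and the spill-to-auxiliary-when-full policy are illustrative assumptions, not XQueue's actual interface.

    /* Per-core master/auxiliary queue layout (sketch). */
    #include <stdbool.h>
    #include <stddef.h>

    #define NCORES 4
    #define QSIZE  1024

    /* Toy single-producer/single-consumer ring buffer: the producer writes only
     * `tail`, the consumer writes only `head`, so no locks or atomic
     * read-modify-write operations are needed between them. */
    typedef struct {
        void *slot[QSIZE];
        volatile size_t head;          /* next slot to dequeue (consumer-owned) */
        volatile size_t tail;          /* next slot to enqueue (producer-owned) */
    } spsc_queue;

    static bool spsc_enqueue(spsc_queue *q, void *item)
    {
        size_t next = (q->tail + 1) % QSIZE;
        if (next == q->head)           /* full */
            return false;
        q->slot[q->tail] = item;
        q->tail = next;
        return true;
    }

    static bool spsc_dequeue(spsc_queue *q, void **item)
    {
        if (q->head == q->tail)        /* empty */
            return false;
        *item = q->slot[q->head];
        q->head = (q->head + 1) % QSIZE;
        return true;
    }

    /* Each core owns one master queue and one auxiliary queue, both drained by
     * that core's single consumer thread. */
    typedef struct {
        spsc_queue master;             /* fed by the producer pinned to this core */
        spsc_queue aux;                /* fed by one remote producer (round-robin topology) */
    } xqueue_percore;

    static xqueue_percore cores[NCORES];

    /* Producer on core `self` owns two queues: its own master queue and, under a
     * round-robin topology, the auxiliary queue of the next core. Every queue
     * therefore keeps exactly one producer and one consumer. */
    bool xqueue_enqueue(int self, void *item)
    {
        spsc_queue *master = &cores[self].master;
        spsc_queue *aux    = &cores[(self + 1) % NCORES].aux;
        /* Push locally first; offload to the neighbor's auxiliary queue
         * when the local master queue is full. */
        return spsc_enqueue(master, item) || spsc_enqueue(aux, item);
    }

    /* Consumer on core `self`: drain the master queue first, then the auxiliary. */
    bool xqueue_dequeue(int self, void **item)
    {
        return spsc_dequeue(&cores[self].master, item) ||
               spsc_dequeue(&cores[self].aux, item);
    }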

Microbenchmarks

• For evaluation purposes, we implemented two versions of XQueue, one with a lock-less SPSC queue and another with a mutex-based MPMC queue (a sketch of such a throughput measurement follows this list).

• The throughput achieved on skylake-192 with XQueue using the lock-less queue is 5 billion operations per second. For XQueue using the lock-based queue, the average throughput achieved is 135 million operations per second.

• The latency of queue operations on XQueue using the lock-less queue is 110 to 400 CPU cycles on average across all the different architectures with 384 threads of execution.
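As a usage example, the harness below measures aggregate enqueue/dequeue throughput over a fixed wall-clock window, reusing the illustrative xqueue_enqueue/xqueue_dequeue sketch from the design section. The thread layout, one-second window, atomic result counters, placeholder file names, and absence of explicit CPU pinning are assumptions made for brevity; this is not the harness that produced the numbers above.

    /* Throughput harness for the xqueue sketch (one producer and one consumer
     * per core index; compile and link together with the design sketch):
     *   gcc -O2 -pthread xqueue_sketch.c throughput_bench.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NCORES 4                   /* must match the design sketch */

    bool xqueue_enqueue(int core, void *item);
    bool xqueue_dequeue(int core, void **item);

    static atomic_bool stop;
    static _Atomic unsigned long long total_ops;   /* atomics only in the harness */

    static void *producer(void *arg)
    {
        int core = (int)(intptr_t)arg;
        unsigned long long ops = 0;
        while (!atomic_load(&stop))
            if (xqueue_enqueue(core, (void *)(uintptr_t)1))
                ops++;
        atomic_fetch_add(&total_ops, ops);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        int core = (int)(intptr_t)arg;
        unsigned long long ops = 0;
        void *item;
        while (!atomic_load(&stop))
            if (xqueue_dequeue(core, &item))
                ops++;
        atomic_fetch_add(&total_ops, ops);
        return NULL;
    }

    int main(void)
    {
        pthread_t prod[NCORES], cons[NCORES];
        for (intptr_t i = 0; i < NCORES; i++) {
            pthread_create(&prod[i], NULL, producer, (void *)i);
            pthread_create(&cons[i], NULL, consumer, (void *)i);
        }
        sleep(1);                      /* one-second measurement window */
        atomic_store(&stop, true);
        for (int i = 0; i < NCORES; i++) {
            pthread_join(prod[i], NULL);
            pthread_join(cons[i], NULL);
        }
        printf("total enqueue+dequeue ops/sec: %llu\n",
               (unsigned long long)atomic_load(&total_ops));
        return 0;
    }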

Figure 6: XQueue Throughput using lock-less queue

Figure 7: XQueue Throughput using lock-based queue

Figure 8: Latency of XQueue using lock-less queue

Latency < 400 cycles with 384 threads!

Conclusion

• XQueue is an extremely scalable, lock-less, concurrent, out-of-order MPMC queue that can scale up to hundreds of threads of execution.

• Future Work: XTask, a light-weight parallel runtime system that can achieve low latency and high throughput at extreme scale.

[1] Junchang Wang, Kai Zhang, Xinan Tang, and Bei Hua. B-queue: Efficient and practical queuing for fast core-to-core communication. International Journal of Parallel Programming, 41(1):137–159, Feb 2013.