
XQueue: Extreme Fine-grained Concurrent Lock-less Queue

Poornima Nookala¹, Peter Dinda², Kyle Hale³, Ioan Raicu⁴

¹,³,⁴ Illinois Institute of Technology {[email protected], [email protected], [email protected]}, ² Northwestern University {[email protected]}

Overview

A single general-purpose shared-memory machine today may have hundreds of hardware threads available. Operations that may run in tens of cycles on a single core using a single thread can take upwards of millions of cycles when multiple threads are competing for shared resources. The concurrent multiple-producer, multiple-consumer (MPMC) queue is a critical building block in numerous systems, such as the task scheduler of a runtime system (or operating system) that supports modern parallel programming models. We present the design, implementation, and evaluation of XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared toward realizing scalability to hundreds of concurrent threads. XQueue implements the MPMC interface using multiple queues in order to reduce contention, applies simple deterministic load-balancing techniques, and eliminates all use of locks and atomic operations with a lock-less design. Experimental results show that XQueue can deliver concurrent operations with 110 to 400 cycle latency at scales up to 384 hardware threads.

Concurrent queues are terrible!

• We analyzed the latency and throughput of mutex, spinlock, semaphore, and atomic fetch-and-add for an increment operation on the Mystic testbed (a minimal benchmark sketch in this spirit appears after this list).

• Mystic covers the latest many-core architectures from Intel, AMD, IBM, and ARM, with processors such as Haswell, Broadwell, Skylake, Phi, Opteron, Ryzen, Threadripper, Epyc, Power9, and ThunderX.

• The latency of a single atomic increment on a Skylake system with 192 cores and 384 hardware threads, when running on all threads concurrently, is 33,592 cycles, whereas on an Intel Xeon Phi Knights Landing with 64 cores and 256 hardware threads, latency reaches 3,868 cycles.

• The latency of an enqueue/dequeue operation on an SPSC queue is 30 to 70 cycles, depending on the architecture and clock frequency.

• The average throughput of enqueue/dequeue operations on an SPSC queue reaches 270 million operations per second on the Intel Skylake 192-core machine.

• The latency of an MPMC queue can reach millions of cycles under high contention, and throughput can drop to as low as 300,000 operations per second.

• These results provide ample motivation to investigate methods that exploit the full concurrency of many-core architectures without compromising the lowest latency that can be achieved.
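The following is a minimal sketch of the kind of contended-increment measurement described in this list. It assumes GCC/Clang __atomic builtins, the x86 __rdtsc() intrinsic for cycle counting, and an illustrative thread and iteration count chosen to mirror the 384-thread Skylake configuration; it is not the actual Mystic benchmark harness.

    /* Contended atomic increment micro-benchmark (sketch).
     * Build: gcc -O2 -pthread atomic_bench.c
     * Each thread hammers a single shared counter with an atomic
     * fetch-and-add and reports its average cycles per operation. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>                 /* __rdtsc(), x86 only */

    #define NTHREADS 384                   /* mirrors the skylake-192 machine */
    #define ITERS    100000

    static uint64_t counter;               /* single shared, contended counter */
    static uint64_t cycles_per_op[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        uint64_t start = __rdtsc();
        for (int i = 0; i < ITERS; i++)
            __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
        cycles_per_op[id] = (__rdtsc() - start) / ITERS;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        uint64_t sum = 0;
        for (int i = 0; i < NTHREADS; i++)
            sum += cycles_per_op[i];
        printf("avg cycles per atomic increment: %llu\n",
               (unsigned long long)(sum / NTHREADS));
        return 0;
    }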

Figure 1: SPSC Queue Latency

Figure 2: MPMC Queue Latency

Figure 3: SPSC Queue Throughput

Figure 4: MPMC Queue Throughput

Problem Statement

Having scalable and fast queue data structures under high concurrency is a critical missing link toward the realization of efficient parallel runtimes on many-core architectures. It is essential to analyze and optimize the basic data structures that form the backbone of parallel runtime systems so that they do not become a performance bottleneck.

XQueue - Design and Implementation

• The core idea of XQueue is to have two queues per core, one being the master queue and the other the auxiliary queue.

• There is one master queue and one auxiliary queue on each core, with one producer thread per queue and a common consumer thread for both queues.

• XQueue is implemented without using any locks, atomic operations, or barriers. The current implementation uses B-queue [1], a lock-free concurrent SPSC queue.

• XQueue employs load-balancing techniques in which items can be added to auxiliary queues depending on the topology defined (round-robin in Figure 5); a minimal code sketch follows below.

Figure 5: XQueue Architecture
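To make the design concrete, here is a minimal sketch of the per-core queue layout and round-robin topology described above. The toy Lamport-style ring buffer stands in for B-queue [1] and glosses over the memory-ordering and batching concerns that B-queue handles carefully. All names (spsc_*, xqueue_*, NCORES, QSIZE) and the spill-to-auxiliary-when-full policy are illustrative assumptions, not XQueue's actual interface.

    /* Per-core master/auxiliary queue layout (sketch). */
    #include <stdbool.h>
    #include <stddef.h>

    #define NCORES 4
    #define QSIZE  1024

    /* Toy single-producer/single-consumer ring buffer: the producer writes only
     * `tail`, the consumer writes only `head`, so no locks or atomic
     * read-modify-write operations are needed between them. */
    typedef struct {
        void *slot[QSIZE];
        volatile size_t head;          /* next slot to dequeue (consumer-owned) */
        volatile size_t tail;          /* next slot to enqueue (producer-owned) */
    } spsc_queue;

    static bool spsc_enqueue(spsc_queue *q, void *item)
    {
        size_t next = (q->tail + 1) % QSIZE;
        if (next == q->head)           /* full */
            return false;
        q->slot[q->tail] = item;
        q->tail = next;
        return true;
    }

    static bool spsc_dequeue(spsc_queue *q, void **item)
    {
        if (q->head == q->tail)        /* empty */
            return false;
        *item = q->slot[q->head];
        q->head = (q->head + 1) % QSIZE;
        return true;
    }

    /* Each core owns one master queue and one auxiliary queue, both drained by
     * that core's single consumer thread. */
    typedef struct {
        spsc_queue master;             /* fed by the producer pinned to this core */
        spsc_queue aux;                /* fed by one remote producer (round-robin topology) */
    } xqueue_percore;

    static xqueue_percore cores[NCORES];

    /* Producer on core `self` owns two queues: its own master queue and, under a
     * round-robin topology, the auxiliary queue of the next core. Every queue
     * therefore keeps exactly one producer and one consumer. */
    bool xqueue_enqueue(int self, void *item)
    {
        spsc_queue *master = &cores[self].master;
        spsc_queue *aux    = &cores[(self + 1) % NCORES].aux;
        /* Push locally first; offload to the neighbor's auxiliary queue
         * when the local master queue is full. */
        return spsc_enqueue(master, item) || spsc_enqueue(aux, item);
    }

    /* Consumer on core `self`: drain the master queue first, then the auxiliary. */
    bool xqueue_dequeue(int self, void **item)
    {
        return spsc_dequeue(&cores[self].master, item) ||
               spsc_dequeue(&cores[self].aux, item);
    }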

Microbenchmarks

• For evaluation purposes, we implemented two versions of XQueue, one with a lock-less SPSC queue and another with a mutex-based MPMC queue (a sketch of such a throughput measurement follows this list).

• The throughput achieved on skylake-192 with XQueue using the lock-less queue is 5 billion operations per second. For XQueue using the lock-based queue, the average throughput achieved is 135 million operations per second.

• The latency of queue operations on XQueue using the lock-less queue is 110 to 400 CPU cycles on average across all the different architectures with 384 threads of execution.
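As a usage example, the harness below measures aggregate enqueue/dequeue throughput over a fixed wall-clock window, reusing the illustrative xqueue_enqueue/xqueue_dequeue sketch from the design section. The thread layout, one-second window, atomic result counters, placeholder file names, and absence of explicit CPU pinning are assumptions made for brevity; this is not the harness that produced the numbers above.

    /* Throughput harness for the xqueue sketch (one producer and one consumer
     * per core index; compile and link together with the design sketch):
     *   gcc -O2 -pthread xqueue_sketch.c throughput_bench.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NCORES 4                   /* must match the design sketch */

    bool xqueue_enqueue(int core, void *item);
    bool xqueue_dequeue(int core, void **item);

    static atomic_bool stop;
    static _Atomic unsigned long long total_ops;   /* atomics only in the harness */

    static void *producer(void *arg)
    {
        int core = (int)(intptr_t)arg;
        unsigned long long ops = 0;
        while (!atomic_load(&stop))
            if (xqueue_enqueue(core, (void *)(uintptr_t)1))
                ops++;
        atomic_fetch_add(&total_ops, ops);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        int core = (int)(intptr_t)arg;
        unsigned long long ops = 0;
        void *item;
        while (!atomic_load(&stop))
            if (xqueue_dequeue(core, &item))
                ops++;
        atomic_fetch_add(&total_ops, ops);
        return NULL;
    }

    int main(void)
    {
        pthread_t prod[NCORES], cons[NCORES];
        for (intptr_t i = 0; i < NCORES; i++) {
            pthread_create(&prod[i], NULL, producer, (void *)i);
            pthread_create(&cons[i], NULL, consumer, (void *)i);
        }
        sleep(1);                      /* one-second measurement window */
        atomic_store(&stop, true);
        for (int i = 0; i < NCORES; i++) {
            pthread_join(prod[i], NULL);
            pthread_join(cons[i], NULL);
        }
        printf("total enqueue+dequeue ops/sec: %llu\n",
               (unsigned long long)atomic_load(&total_ops));
        return 0;
    }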

Figure 6: XQueue Throughput using lock-less queue

Figure 7: XQueue Throughput using lock-based queue

Figure 8: Latency of XQueue using lock-less queue

Latency < 400 cycles with 384 threads!

Conclusion

• XQueue is an extremely scalable, lock-less, concurrent, out-of-order MPMC queue that can scale up to hundreds of threads of execution.

• Future Work: XTask, a light-weight parallel runtime system that can achieve low latency and high throughput at extreme scale.

[1] Junchang Wang, Kai Zhang, Xinan Tang, and Bei Hua. B-queue: Efficient and practical queuing for fast core-to-core communication. International Journal of Parallel Programming, 41(1):137–159, Feb 2013.