XQueue: Extreme Fine-grained Concurrent Lock-less...
Transcript of XQueue: Extreme Fine-grained Concurrent Lock-less...
XQueue: Extreme Fine-grained Concurrent Lock-less QueuePoornima Nookala1, Peter Dinda2, Kyle Hale3, Ioan Raicu4
1,3,4Illinois Institute of Technology {[email protected], [email protected], [email protected]}, 2Northwestern University {[email protected]}
OverviewA single general purpose shared memory machine todaymay have hundreds of hardware threads available. Op-erations that may run in tens of cycles on a single coreusing a single thread can take upwards of millions ofcycles when multiple threads are competing for sharedresources. The concurrent multiple producer, multipleconsumer (MPMC) queue is a critical building block innumerous systems such as the task scheduler of a run-time system (or operating system) that supports mod-ern parallel programming models. We present the de-sign, implementation, and evaluation of XQueue, a novellock-less concurrent queuing system with relaxed order-ing semantics that is geared to realizing scalability tohundreds of concurrent threads. XQueue implementsthe MPMC interface using multiple queues in order toreduce contention, applies simple deterministic load bal-ancing techniques, and eliminates all use of locks andatomic operations with a lock-less design. Experimentalresults show that XQueue can deliver concurrent oper-ations with 110 to 400 cycle latency at scales up to 384hardware threads.
Concurrent queues areterrible!
• Analyzed latency and throughput of mutex, spinlock,semaphore and atomic fetch-and-add for an incrementoperation on Mystic testbed.
• Mystic covers latest many-core architectures from Intel,AMD, IBM and ARM with processors such as Haswell,Broadwell, Skylake, Phi, Opteron, Ryzen, Threadripper,Epyc, Power9, and ThunderX
• Latency of a single atomic increment on a Skylakesystem with 192-cores and 384 hardware threadswhen running on all threads concurrently is 33592cycles whereas on Intel Xeon Phi KnightsLanding with 64-cores and 256 hardware threads,latency reaches 3868 cycles.
• Latency of enqueue/dequeue operation on SPSCqueue takes 30 to 70 cycles depending on thearchitecture and clock frequency.
• Average throughput of enqueue/dequeue operationson SPSC queue reaches 270 million operationsper second on Intel Skylake 192-core machine.
• Latency of MPMC queue can reach up to millions ofcycles under high contention and throughput candrop up to as low as 300,000 operations persecond.
• These results provide enough motivation to investigatemethods to exploit full concurrency on many-corearchitectures while not compromising on the lowestlatency that can be achieved.
Figure 1: SPSC Queue Latency
Figure 2: MPMC Queue Latency
Figure 3: SPSC Queue Throughput
Figure 4: MPMC Queue Throughput
Problem Statement
Having scalable and fast queue datastructures under high concurrency isa critical missing link towards therealization of efficient parallel run-times on many-core architectures. Itis essential to analyze and optimize basicdata structures that form the barebones ofparallel runtime systems so they do not be-come the bottleneck for performance.
XQueue - Design and Implementation
• Core idea of XQueue is to have two queues percore, one being the master and other being the auxiliaryqueue.
• There is one master queue and one auxiliary queue oneach core with one producer thread per queue and acommon consumer thread for both queues.
• XQueue is implemented without using any types oflocks or atomic operations or barriers. Currentimplementation uses B-queue [1] which is a lock-freeconcurrent SPSC queue.
• XQueue employs load balancing techniques whereitems can be added to auxiliary queues depending onthe topology defined (round-robin in Figure 5). Figure 5: XQueue Architecture
Microbenchmarks
• For evaluation purposes, we implemented two versions ofXQueue, one with a lock-less SPSC queue and anotherone with mutex-based MPMC queue.
• The throughput achieved on skylake-192 withXQueue with lock-less queue is 5 billionoperations per second. For XQueue usinglock-based queue, the average throughput achieved is135 million operations per second.
• Latency of queue operations on XQueue usinglock-less queue is 110 to 400 CPU cycles onaverage on all the different architectures with 384threads of execution.
Figure 6: XQueue Throughput usinglock-less queue
Figure 7: XQueue Throughput usinglock-based queue
Figure 8: Latency of Xqueue using lock-less queue
latency<400cycleswith384threads!
Conclusion• XQueue is an extremely scalable lock-lesscon-current MPMC out of order queue that can scaleup to hundreds of threads of execution.
• Future Work: XTask, a light-weight parallel runtimesystem which can achieve low latency and highthroughput at extreme scale.
[1] Junchang Wang, Kai Zhang, Xinan Tang, and Bei Hua. B-queue: Efficient and practical queuing for fast core-to-core communication.International Journal of Parallel Programming, 41(1):137–159, Feb 2013.