
Parallel Prefix Sum with SIMD

Wangda Zhang
Columbia University

[email protected]

Yanbin Wang
Columbia University

[email protected]

Kenneth A. Ross∗

Columbia University

[email protected]

ABSTRACT

The prefix sum operation is a useful primitive with a broad range of applications. For database systems, it is a building block of many important operators including join, sort and filter queries. In this paper, we study different methods of computing prefix sums with SIMD instructions and multiple threads. For SIMD, we implement and compare horizontal and vertical computations, as well as a theoretically work-efficient balanced tree version using gather/scatter instructions. With multithreading, the memory bandwidth can become the bottleneck of prefix sum computations. We propose a new method that partitions data into cache-sized smaller partitions to achieve better data locality and reduce bandwidth demands from RAM. We also investigate four different ways of organizing the computation sub-procedures, which have different performance and usability characteristics. In the experiments we find that the most efficient prefix sum computation using our partitioning technique is up to 3x faster than two standard library implementations that already use SIMD and multithreading.

1. INTRODUCTION

Prefix sums are widely used in parallel and distributed database systems as building blocks for important database operators. For example, a common use case is to determine the new offsets of data items during a partitioning step, where prefix sums are computed from a previously constructed histogram or bitmap, and then used as the new index values [6]. Applications of this usage include radix sort on CPUs [30, 25] and GPUs [29, 20], radix hash joins [16, 3], as well as parallel filtering [32, 4]. In OLAP data cubes, prefix sums are precomputed so that they can be used to answer range sum queries at runtime [15, 10, 21]. Data mining algorithms such as K-means can be accelerated using parallel prefix sum computations in a preprocessing step [18].

∗This research was supported in part by a gift from Oracle Corporation.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and ADMS 2020. 11th International Workshop on Accelerating Analytics and Data Management Systems (ADMS'20), August 31, 2020, Tokyo, Japan.

In data compression using differential encoding, prefix sums are also used to reconstruct the original data, and prefix-sum computations can account for the majority of the running time [22, 23].

The prefix sum operation takes a binary associative operator ⊕ and an input array of n elements [a0, a1, . . . , an−1], and outputs the array containing the sums of prefixes of the input: [a0, (a0 ⊕ a1), . . . , (a0 ⊕ a1 ⊕ · · · ⊕ an−1)]. This definition is also known as an inclusive scan operation [5]. The output of an exclusive scan (also called pre-scan) would remove the last element from the above output array and insert an identity value at the beginning of the output. In this paper, we use addition as the binary operator ⊕ and compute inclusive scans by default.

The basic algorithm to compute prefix sums simply requires one sequential pass of additions while looping over all elements in the input array and writing out the running totals to the output array. Although this operation seems to be inherently sequential, there are actually parallel algorithms to compute prefix sums. To compute the prefix sums of n elements, Hillis and Steele presented a data-parallel algorithm that takes O(log n) time, assuming there are n processors available [14]. This algorithm performs O(n log n) additions, doing more work than the sequential version, which only needs O(n) additions. Work-efficient algorithms [19, 5] build a conceptual balanced binary tree to compute the prefix sums in two sweeps over the tree, using O(n) operations. In Section 3, we implement and compare SIMD versions of these data-parallel algorithms, as well as a vertical SIMD algorithm that is also work-efficient.
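For reference, the sequential computation is a single loop carrying a running total; a minimal sketch (with hypothetical names) is:

#include <cstddef>

// One-pass sequential inclusive scan: add each element to a running total
// and write the running total out.
void inclusive_scan_serial(const float* in, float* out, std::size_t n) {
    float running = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}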

In a shared-memory environment, we can speed up prefix sum computations using multiple threads on a multicore platform. Prefix sums can be computed locally within each thread, but because of the sequential dependencies, thread tm has to know the previous sums computed by threads t0 . . . tm−1 in order to compute the global prefix sum results. Thus, a two-pass algorithm is necessary for multithreaded execution [34]. There are multiple ways to organize the computation subprocedures, depending on (a) whether prefix sums are computed in the first or the second pass, and (b) how the work is partitioned. For example, to balance the work among threads, it is necessary to tune a dilation factor to the local configuration for optimal performance [34]. To understand the best multithreading strategy, we analyze and compare different multithreaded execution strategies (Section 2.1).

More importantly, with many concurrent threads accessing memory, the prefix sum can become a memory-bandwidth-bound computation. To fully exploit the potential of the hardware, we must take care to minimize memory accesses and improve cache behavior. To this end, we propose to partition data into cache-sized partitions so that during a two-pass execution, the second pass can maximize its usage of the cache instead of accessing memory again (Section 2.2). We also develop a low-overhead thread synchronization technique, since partitioning into smaller data fragments can potentially increase the synchronization overhead.

In summary, the main contributions of this work are:

• We study multithreaded prefix sum computations, and propose a novel algorithm that splits data into cache-sized partitions to achieve better locality and reduce memory bandwidth usage.

• We discuss data-parallel algorithms for implementing the prefix sum computation using SIMD instructions, in 3 different versions (horizontal, vertical, and tree).

• We experimentally compare our implementations with external libraries. We also provide a set of recommendations for choosing the right algorithm.

1.1 Related Work

Beyond databases, there are also many other uses of prefix sums in parallel computations and applications, including but not limited to various sorting algorithms (e.g., quicksort, mergesort, radix sort), list ranking, stream compaction, polynomial evaluation, sparse matrix-vector multiplication, tridiagonal matrix solvers, lexical analysis, fluid simulation, and building data structures (graphs, trees, etc.) in parallel [6, 5, 31].

Algorithms for parallel prefix sums have been studied since early work on the design of binary adders [19], and analyzed extensively in theoretical studies [8, 7]. More recently, message passing algorithms were proposed for distributed memory platforms [28]. A sequence of approximate prefix sums can be computed faster, in O(log log n) time [11]. For data structures, Fenwick trees can be used for efficient prefix sum computations with element updates [9]. Succinct indexable dictionaries have also been proposed to represent prefix sums compactly [26].

As an important primitive, the prefix sum operation has been implemented in multiple libraries and platforms. The C++ standard library provides the prefix sum (scan) in its algorithm library, and parallel implementations are provided by the GNU Parallel library [33] and Intel Parallel STL [2]. Parallel prefix sums can be implemented efficiently on GPUs with CUDA [12], and for the Message Passing Interface (MPI-Scan) [28]. Although compilers cannot typically auto-vectorize a loop of prefix sum computation because of data dependency issues [24], it is now possible to use OpenMP SIMD directives to instruct compilers to vectorize such loops. Since additions over floating point numbers are not associative, the result of parallel prefix sums of floating point values can differ slightly from the sequential computation.
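As a concrete example of the OpenMP SIMD directives mentioned above, the OpenMP 5.0 scan reduction expresses the loop-carried dependence explicitly so that the compiler may vectorize it. The sketch below assumes a compiler with OpenMP 5.0 scan support and uses hypothetical names:

// Inclusive scan written with the OpenMP 5.0 inscan reduction and scan directive.
void scan_omp(const float* in, float* out, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(inscan, +: sum)
    for (int i = 0; i < n; ++i) {
        sum += in[i];                      // input phase: fold in the next element
        #pragma omp scan inclusive(sum)
        out[i] = sum;                      // scan phase: read the inclusive prefix
    }
}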

2. THREAD-LEVEL PARALLELISM

We now describe multithreaded algorithms for computing prefix sums in parallel. Note that a prefix-sum implementation can be either in-place, where the prefix sums replace the input data elements, or out-of-place, where the prefix sums are written to a new output array. The following description assumes an in-place algorithm. Out-of-place versions can be implemented similarly and will be discussed in the experimental evaluation.

Figure 1: Multithreaded two-pass algorithms. (a) Prefix Sum + Increment; (b) Accumulate + Prefix Sum; (c) Prefix Sum + Increment (+1 partition); (d) Accumulate + Prefix Sum (+1 partition).

2.1 Two-Pass Algorithms

To compute prefix sums in parallel, we study four different two-pass algorithms. There are two ways of processing the two passes, depending on whether prefix sums are computed in the first or second pass. Figure 1(a) presents one implementation, using 4 threads (t0 . . . t3) as an example. To enable parallel processing, the data is divided into 4 equal-sized partitions. In the first pass, each thread computes a local prefix sum over its partition of data. After the prefix sum computation, the last element is the total sum of the partition, which will be used in the second pass to compute global prefix sums. These sums can be stored in a temporary buffer array sums = {s0, s1, s2, s3}, so that their own prefix sums are easy to compute. In the second pass, except for the first thread t0, each thread tm increments every element of its local prefix sum by ∑i<m si in order to obtain global prefix sums. The prefix sums of the first partition are already the final results, since no increment is needed.
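A minimal sketch of this Prefix Sum + Increment organization is shown below; it uses OpenMP threads and hypothetical names for brevity, whereas our actual implementation uses POSIX threads (Section 4.1).

#include <omp.h>
#include <cstddef>
#include <vector>

// Figure 1(a): pass 1 computes local prefix sums and partition totals,
// pass 2 increments each partition by the totals of the preceding partitions.
void prefix_sum_increment(float* a, std::size_t n, int m) {
    std::vector<float> sums(m, 0.0f);
    #pragma omp parallel num_threads(m)
    {
        int t = omp_get_thread_num();
        std::size_t lo = n * t / m, hi = n * (t + 1) / m;
        // Pass 1: local prefix sum over this thread's partition
        float run = 0.0f;
        for (std::size_t i = lo; i < hi; ++i) { run += a[i]; a[i] = run; }
        sums[t] = run;                         // total sum of the partition
        #pragma omp barrier
        // Pass 2: thread t0 is idle; every other thread adds the preceding totals
        if (t > 0) {
            float offset = 0.0f;
            for (int j = 0; j < t; ++j) offset += sums[j];
            for (std::size_t i = lo; i < hi; ++i) a[i] += offset;
        }
    }
}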

A potential inefficiency of this algorithm is that in the second pass, the first thread t0 is idle, so effectively the parallelism is only (m − 1) using m threads, which has an impact on performance especially when m is small. To fix this issue, the data can be divided into (m + 1) partitions, as shown in Figure 1(c) when m = 4. In the first pass, the threads only work on the first 4 partitions, and the prefix sums of the last partition are computed in the second pass. We schedule thread t0 to work on the last partition, instead of shifting every thread to the next partition, since this is important for our partitioning technique in Section 2.2.

Figure 1(b) demonstrates a different way of computing the prefix sums. In the first pass, only the total sum of each partition is accumulated in parallel. Then in the second pass, the global prefix sums can be directly computed in parallel, using ∑i<m si as an offset to the input. The benefit of computing only the total sum in the first pass is that there are no memory writes as in prefix sum computations. Therefore, this method can potentially require less memory bandwidth when the data size is large. In addition, we do not need the total sum of the last partition, so the last thread t3 can be idle in the first pass.
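The two pass bodies of this Accumulate + Prefix Sum organization are sketched below; the threading frame is the same as in the previous sketch, and the names are hypothetical.

#include <cstddef>

// Pass 1 of Figure 1(b): compute only the total sum of a partition (no writes).
float accumulate(const float* a, std::size_t lo, std::size_t hi) {
    float s = 0.0f;
    for (std::size_t i = lo; i < hi; ++i) s += a[i];
    return s;
}

// Pass 2 of Figure 1(b): compute the global prefix sums of a partition, seeded
// with the sum of the totals of all preceding partitions.
void prefix_sum_with_offset(float* a, std::size_t lo, std::size_t hi, float offset) {
    float run = offset;
    for (std::size_t i = lo; i < hi; ++i) { run += a[i]; a[i] = run; }
}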

Similarly, Figure 1(d) fixes the idle-thread inefficiency by using one more partition. In the first pass, the prefix sums of t0 and the total sums of t1 . . . t3 are computed in parallel. In the second pass, all threads compute prefix sums with an input offset computed from the sums array. Thread t0 again works on the last partition. Both partitioning schemes in Figures 1(c) and 1(d) ensure there are no idle threads and all threads work in parallel in either pass.

2.1.1 Load Balancing

For the algorithms shown in Figures 1(a) and 1(b), equal partitioning of the data should suffice, since each thread does the same work (except for the idle thread). For Figure 1(c), in the second pass, thread t0 computes prefix sums while the other threads simply do an increment. Although in scalar code both subprocedures require reads, adds and writes, the increment is easily vectorizable by compilers while the prefix sum cannot be automatically vectorized. (We will also implement explicit SIMD versions of these algorithms.) Similarly, in the first pass of Figure 1(d), thread t0 computes a prefix sum while the other threads accumulate total sums. These operations may proceed at different rates, and the difference is potentially magnified because prefix sums cannot be autovectorized and need to write back results, while accumulation can be autovectorized and does not write to memory. (The second pass has similar problems when cache-friendly partitioning is used, as we shall explain in Section 2.2.)

To compensate for thread t0 possibly taking longer, a dilation factor d can be used to reduce the size of the first (or last) partition, in order to balance the work done by every thread [34]. The range of the dilation factor is d ∈ [0, 1], indicating the ratio of the size of thread t0's partition to that of the other threads. When d = 0, the corresponding partition is nonexistent, so Figure 1(a) is a special case of Figure 1(c) and Figure 1(b) is a special case of Figure 1(d). When d = 1, the partitions have equal sizes. In our experiments we find that the dilation factor has to be carefully tuned to achieve the best performance, but in practice, standard library implementations typically just use equal-sized partitions by default, which is suboptimal in most cases. In fact, a poorly chosen dilation factor can cause an inefficiency worse than one idle thread, because many threads may become idle as they wait for thread t0 to complete its work.
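Concretely, with m threads, m + 1 partitions, and one partition scaled by d, the partition sizes satisfy n = m·s + d·s. A hypothetical helper that derives the regular partition size s from this relation is:

#include <cstddef>

// Size of each of the m regular partitions when one partition is dilated by d.
std::size_t regular_partition_size(std::size_t n, int m, double d) {
    return static_cast<std::size_t>(n / (m + d));   // s = n / (m + d)
}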

2.2 Cache-Friendly Partitioning

For big data, one problem with the two-pass algorithms is that after the first pass, most data will have been evicted from the cache, and it has to be accessed again from memory in the second pass. Since the performance difference between cache and memory accesses is quite large, this problem can have a large overall performance impact.

Figure 2: Cache-friendly partitioning

We therefore propose to partition the entire data in a cache-friendly way, so that for each thread, the data it processes resides in cache after the first pass.

Figure 2 demonstrates this idea, showing that the data are partitioned into cache-sized partitions and processed by multiple threads in parallel. Both passes over the cache-resident data are made (in parallel) before proceeding to the next partition. In this way, the second pass reads data from cache rather than from RAM. The entire data is processed sequentially in multiple iterations, but each iteration is computed in parallel by multiple threads. To compute global prefix sums, the first thread t0 needs to use the total sum from the previous partition (in the previous iteration) as an offset in the first pass.

If the partitioning uses the methods in Figures 1(a) or 1(b), then the first thread can use the last element of the previous partition as the offset, but this requires the first thread to wait for the second pass of the last partition to complete. A different way is to use the stored sums array and compute the sum of all of its elements, including the local sum of the last partition, as the offset. This way requires the local sum of the last partition (in the previous iteration) to have been computed, even if it is not strictly necessary to do so using Accumulate in Figure 1(b). However, there are two benefits of doing so: first, the data will reside in cache in the second pass, and second, we can save one synchronization after the second pass, as we shall explain shortly. If the partitioning is done as in Figures 1(c) or 1(d) using one more partition, then in the first pass, thread t0 can directly access the last element of its previous partition, since the same thread t0 has processed it.

The effect of partitioning is that during the second pass, accesses to the same data can be served from the cache instead of memory. As we shall demonstrate in the experiments, the size of a partition is best set to roughly half the size of the L2 cache (or the L1 cache if SIMD gather/scatter instructions are used in computing local prefix sums). As mentioned earlier, partitioning also changes the dilation factors used in the partitioning schemes of Figures 1(c) and 1(d), because the prefix sum computation in the second pass of the previous iteration will read from cache instead of memory, except for the last partition. For Figure 1(d), we need a second dilation factor for the last partition as well, to account for the difference.
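A minimal sketch of the cache-friendly loop structure (for the scheme of Figure 1(a)) is given below. Every thread runs this function; P is the per-thread partition size, the barrier is of the kind sketched in Section 2.2.1, the sums arrays are double-buffered as explained there, and carry holds the running global total used by thread t0. All names and the exact framing are assumptions, not our actual code.

#include <algorithm>
#include <cstddef>
#include <vector>

template <typename Barrier>
void partitioned_scan(float* a, std::size_t n, int t, int m, std::size_t P,
                      Barrier& bar, std::vector<float> sums[2], float& carry) {
    std::size_t block = static_cast<std::size_t>(m) * P;
    for (std::size_t base = 0, k = 0; base < n; base += block, ++k) {
        std::vector<float>& s = sums[k % 2];          // double-buffered totals
        std::size_t lo = std::min(base + t * P, n), hi = std::min(lo + P, n);
        // Pass 1: local prefix sum; t0 seeds with the carry so its results are global.
        float run = (t == 0) ? carry : 0.0f;
        for (std::size_t i = lo; i < hi; ++i) { run += a[i]; a[i] = run; }
        s[t] = run;
        bar.wait();                                    // one barrier per iteration
        if (t > 0) {
            // Pass 2: add the totals of partitions 0..t-1; the data is still cached.
            float off = 0.0f;
            for (int j = 0; j < t; ++j) off += s[j];
            for (std::size_t i = lo; i < hi; ++i) a[i] += off;
        } else {
            // t0 needs no increment; it advances the carry for the next iteration.
            carry = 0.0f;
            for (int j = 0; j < m; ++j) carry += s[j];
        }
    }
}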

2.2.1 Thread Scheduling and Synchronization

To ensure that the private cache of a physical core can be accessed during the second pass, we control the thread affinity so that threads t1, t2, . . . continue to process the same data in the second pass. Thread t0 works on a different partition, since the first partition does not need a second pass while the last partition does not need the first pass.

We need a barrier synchronization after the first pass, and all threads should wait for it before continuing to the second pass. Without the cache-friendly partitioning, the original two-pass algorithm only needs to synchronize twice: once between the first pass and the second pass, and once to join all the threads after the second pass.


Figure 3: Horizontal SIMD

#include <immintrin.h>

// Shift the 512-bit vector left by k 32-bit lanes, filling with zeros.
__m512i _mm512_slli_si512(__m512i x, int k) {
    const __m512i ZERO = _mm512_setzero_si512();
    return _mm512_alignr_epi32(x, ZERO, 16 - k);
}

__m512i PrefixSum(__m512i x) {
    x = _mm512_add_epi32(x, _mm512_slli_si512(x, 1));
    x = _mm512_add_epi32(x, _mm512_slli_si512(x, 2));
    x = _mm512_add_epi32(x, _mm512_slli_si512(x, 4));
    x = _mm512_add_epi32(x, _mm512_slli_si512(x, 8));
    return x; // local prefix sums
}

Listing 1: In-register prefix sums with Horizontal SIMD

With partitioning, more synchronization points are needed for this multi-iteration process. However, we only need one synchronization in every iteration, instead of two. The synchronization after the second pass is unnecessary, and the second pass of iteration k can be overlapped with the first pass of iteration (k + 1). In order to prevent the local sums in the sums array of iteration k from being overwritten by sums computed in the first pass of iteration (k + 1), we need two sums arrays to store the local sums. In iteration k, the sums are stored in sums[k mod 2]. Because of the synchronization between the two passes, this double buffering of sums is sufficient.

Since we still need one synchronization in every iteration, the overhead of our synchronization mechanism should be as low as possible. As reported by previous studies [13, 27], the latency of different software barrier implementations can differ by orders of magnitude, thus profoundly affecting the overall performance. In our implementation we use a hand-tuned reusable counting barrier, implemented as a spinlock with atomic operations. On our experimental platforms with 48 cores, we find its performance is much better than a Pthread barrier and slightly better than the OpenMP barrier. To scale to even more cores, it is potentially beneficial to use even faster barrier implementations (e.g., a tournament barrier [13]). Because of the sequential dependency of prefix sum computations, it is also possible to restrict synchronization to only adjacent threads, rather than using a centralized barrier [35].
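The following is a minimal sketch of such a reusable counting barrier built from C++ atomics (a generation counter plus a spin loop); the details here are an illustration, not our exact tuned code.

#include <atomic>

class CountingBarrier {
public:
    explicit CountingBarrier(int nthreads) : nthreads_(nthreads) {}

    void wait() {
        int gen = generation_.load(std::memory_order_acquire);
        if (count_.fetch_add(1, std::memory_order_acq_rel) == nthreads_ - 1) {
            // Last thread to arrive: reset the counter and release the waiters.
            count_.store(0, std::memory_order_relaxed);
            generation_.fetch_add(1, std::memory_order_release);
        } else {
            // Spin until the generation number changes.
            while (generation_.load(std::memory_order_acquire) == gen) { }
        }
    }

private:
    const int nthreads_;
    std::atomic<int> count_{0};
    std::atomic<int> generation_{0};
};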

3. DATA-LEVEL PARALLELISM

SIMD instructions can be used to speed up the single-thread prefix sum computation. In this paper, we use the AVX-512 extensions, which offer more parallelism than previous extensions. We assume 16 32-bit numbers can be processed in parallel using the 512-bit vectors.

3.1 Horizontal

Figure 3 shows how we can compute prefix sums in a horizontal way using SIMD instructions. The basic primitive is to compute the prefix sum of a vector of 16 elements in register, so that we can loop over the data 16 elements at a time. The last element of the 16 prefix sum results is then broadcast into another vector, so that it can be added to the next 16 elements to compute the global results.

As mentioned in Section 1, Hillis and Steele [14] presented a data-parallel algorithm to compute prefix sums, which can be used to compute the prefix sums in a register. For w elements in a SIMD register, this algorithm uses log(w) steps. So for 16 four-byte elements, we can compute the prefix sums in-register using 4 shift and 4 addition instructions. The pseudocode in Listing 1 shows this procedure.
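The outer loop that strings these in-register scans together (Figure 3) might look like the following sketch, which reuses PrefixSum() from Listing 1, keeps the running total in a carry vector, and leaves the tail of n mod 16 elements to a scalar loop (omitted); the function name is ours.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void horizontal_scan(int32_t* a, std::size_t n) {
    __m512i carry = _mm512_setzero_si512();
    const __m512i LAST = _mm512_set1_epi32(15);        // index of the last lane
    for (std::size_t i = 0; i + 16 <= n; i += 16) {
        __m512i x = _mm512_loadu_si512(a + i);
        x = PrefixSum(x);                               // in-register prefix sums
        x = _mm512_add_epi32(x, carry);                 // make them global
        _mm512_storeu_si512(a + i, x);
        carry = _mm512_permutexvar_epi32(LAST, x);      // broadcast the last element
    }
}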

Note that unlike previous SIMD extensions (e.g., AVX2), AVX-512 does not provide an _mm512_slli_si512 intrinsic to shift the whole 512-bit vector. Here, we implement the shift using the valign instruction. It is also possible to implement this shifting using permute instructions [2, 17], but more registers have to be used to hold the permutation destinations for different shifts, and it appears to be slightly slower (although valign and vperm take the same number of cycles).

Figure 4: Vertical SIMD

3.2 Vertical

We can also compute the prefix sums vertically, by dividing the data into 16 chunks of length k = n/16 and then using a vector to accumulate the running sums, as shown in Figure 4. While scanning over the data, we gather the data at indices (0 + j, k + j, 2k + j, . . . , 15k + j) (where j ∈ [0, k)), add them to the vector of running sums, and scatter the result back to the original index positions (assuming the indices fit into four bytes). After one pass of scanning over the data, SIMD lane i contains the total sum of elements from i ∗ k to (i + 1) ∗ k − 1. In a similar way to the multithreaded executions (Section 2), we need to update the results in the second pass.
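A sketch of this first pass with AVX-512 gather/scatter intrinsics is shown below (assuming n is a multiple of 16 and all indices fit in 32 bits; the function name is ours).

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Lane i walks chunk i of length k, keeping its running sum in one SIMD lane.
void vertical_pass1(int32_t* a, std::size_t n, __m512i* chunk_totals) {
    const int k = static_cast<int>(n / 16);
    __m512i idx = _mm512_mullo_epi32(_mm512_set1_epi32(k),
        _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15));
    __m512i sums = _mm512_setzero_si512();
    const __m512i ONE = _mm512_set1_epi32(1);
    for (int j = 0; j < k; ++j) {
        __m512i x = _mm512_i32gather_epi32(idx, a, 4);   // lane i reads a[i*k + j]
        sums = _mm512_add_epi32(sums, x);                // running sums per chunk
        _mm512_i32scatter_epi32(a, idx, sums, 4);        // write local prefix sums
        idx = _mm512_add_epi32(idx, ONE);
    }
    *chunk_totals = sums;  // lane i now holds the total of chunk i, used in pass 2
}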

In addition, we can also switch the two passes by first computing just the total sum of each chunk (without the scatter step) in the first pass, and then computing the global prefix sums vertically in the second pass.

No matter which pass the prefix sums are computed in, we need two passes for the vertical SIMD computation because of the data-parallel processing, even if it is executed in a single thread. For multithreaded execution, each of the two passes can run in parallel among the threads. Similar to the cache-friendly partitioning for multithreaded execution, we can also partition the data into cache-sized pieces so that the second pass of the vertical SIMD implementation can reuse the data from cache (even for a single thread).

Different from the horizontal SIMD version, this vertical algorithm is work-efficient, performing O(n) additions (including updating the indices for gather/scatter) without the O(log w) overhead. Using data-parallel execution, we essentially execute a two-pass sequential algorithm for each of the 16 chunks of data. In practice, the log w extra additions in the horizontal version are computed in register, which is much faster than waiting for memory accesses. Even so, if the gather/scatter instructions are fast, then the vertical algorithm can potentially be faster than the horizontal version.

Figure 5: Tree SIMD

3.3 Tree

As introduced in Section 1, a work-efficient algorithm using two sweeps over a balanced tree is presented in [6]. As a two-pass algorithm, the first pass performs an up-sweep (reduction) to build the tree, and the second pass performs a down-sweep to construct the prefix sum results. At each level of the tree, the computation can be done in a data-parallel way for the two children of every internal tree node. Figure 5 shows the conceptual tree built from the data; for details of the algorithm, refer to the original paper.
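For reference, a scalar sketch of the two sweeps (computing an exclusive scan over an array whose length is a power of two) is shown below; the SIMD variant discussed next replaces the inner loops with strided gathers and scatters.

#include <cstddef>
#include <cstdint>

void tree_exclusive_scan(int32_t* a, std::size_t n) {
    // Up-sweep (reduction): build partial sums bottom-up.
    for (std::size_t d = 1; d < n; d *= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];
    // Down-sweep: push prefixes back down the conceptual tree.
    a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d /= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            int32_t tmp = a[i - d];
            a[i - d] = a[i];
            a[i] += tmp;
        }
}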

To implement this algorithm, we can reuse the PrefixSum() procedure from Section 3.1 and mask off the unnecessary lanes in the first pass; the second pass can be done similarly with the order of instructions reversed. However, it does not make sense to artificially make SIMD lanes idle. Instead, we can use a loop of gather/scatter instructions to process the elements in a strided access pattern at every level of the tree. Because the tree is of height log n, the implementation needs multiple passes of gathers and scatters, so although the algorithm is work-efficient in terms of additions, it is quite inefficient in memory accesses. Cache-friendly partitioning helps alleviate this issue for the second pass, but overall this algorithm is not suitable for SIMD implementation.

4. EXPERIMENTS

4.1 Setup

We conducted experiments using the Amazon EC2 service on a dedicated m5.metal instance1, running in a non-virtualized environment without hypervisors. The instance has two Intel Xeon Platinum 8175M CPUs, based on the Skylake microarchitecture that supports AVX-512. Table 1 describes the platform specifications. The instance provides 300 GB of memory, and the measured memory bandwidth is 180 GB/s in total over the two NUMA nodes.

We implemented the methods in Sections 2 and 3 using Intel AVX-512 intrinsics and POSIX threads with atomic operations for synchronization. Our code2 was written in C++, compiled using the Intel compiler 19.1 with -O3 optimization (scalar code, such as the subprocedures Increment and Accumulate, can be autovectorized), and ran on a 64-bit Linux operating system.

1 https://aws.amazon.com/intel
2 http://www.cs.columbia.edu/~zwd/prefix-sum

Table 1: Hardware Specification

Microarchitecture: Skylake
Model: 8175M
Sockets: 2
NUMA nodes: 2
Cores per socket: 24
Threads per core: 2
Clock frequency: 2.5 GHz
L1D cache: 32 KB
L2 cache: 1 MB
L3 cache (shared): 33 MB

Figure 6: Single-thread throughput

For our experiments, we use 32-bit floating-point numbers as the data type, and generate random data so that every thread has 128 MB of data (i.e., 32 million floats) to work with. The algorithms compute prefix sums in place, except in Section 4.2.3 where we consider out-of-place algorithms.

Our handwritten implementations used in the experiments are summarized in Table 2. We compare with the following two external baselines (a usage sketch follows the list):

• GNU Library [1]. The libstdc++ parallel mode provides a multithreaded prefix sum implementation using OpenMP. The implementation executes a two-pass scalar computation (Accumulate + Prefix Sum) as discussed in Section 2.

• Intel Library [2]. Intel provides a Parallel STL library, which is an implementation of the C++ standard library algorithms with multithreading and vectorized execution (under the par_unseq execution policy). The multithreaded execution is enabled by Intel Threading Building Blocks (TBB), and we can force the Intel compiler to use 512-bit SIMD registers with the -qopt-zmm-usage=high flag (which is set to low on Skylake by default).
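For reference, the sketch below shows how these baselines are typically invoked; the exact build flags are assumptions based on the libraries' documentation.

#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<float> v(1 << 25, 1.0f);
    // Intel Parallel STL / oneDPL: C++17 scan with a parallel + vectorized policy.
    std::inclusive_scan(std::execution::par_unseq, v.begin(), v.end(), v.begin());
    // GNU libstdc++ parallel mode: compile with -D_GLIBCXX_PARALLEL -fopenmp and
    // call the standard algorithm, e.g. std::partial_sum(v.begin(), v.end(), v.begin());
    return 0;
}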

4.2 Results

4.2.1 Single-Thread Performance

Figure 6 presents the single-thread throughput results. The (horizontal) SIMD implementation (Section 3.1) using AVX-512 performs the best, processing nearly three billion floats per second. The single-thread Intel library implementation with vectorized execution is slightly slower. Our scalar implementation has the same performance as the scalar GNU library implementation, which is 3.5x slower than SIMD.


Table 2: Algorithm Descriptions

Scalar: Single-thread one-pass scalar implementation
SIMD: Single-thread one-pass horizontal SIMD (Section 3.1); also used for the multithreaded implementations
SIMD-V1/V2: Single-thread two-pass vertical SIMD (Section 3.2), computing prefix sums in Pass 1 / Pass 2
SIMD-T: Single-thread two-pass tree SIMD (Section 3.3)
Scalar1, SIMD1: Multithreaded two-pass algorithm, computing prefix sums in Pass 1 (Figure 1(c))
Scalar2, SIMD2: Multithreaded two-pass algorithm, computing prefix sums in Pass 2 (Figure 1(d))
Scalar*-P, SIMD*-P: Multithreaded two-pass algorithms with cache-friendly partitioning (Section 2.2)

Using cache-friendly partitioning (with the best partition sizes), the vertical SIMD implementations (Section 3.2) are still about 2x slower than horizontal SIMD. Computing prefix sums in the second pass is slightly better. The Tree implementation (Section 3.3) is the slowest even with partitioning, because of its non-sequential memory accesses.

4.2.2 Multithread Performance

We then increase the number of threads used up to the maximum number of (hyper-)threads supported on the platform. In addition to the external baselines, we compare our implementations with and without partitioning. The partition sizes and dilation factors used are the best ones, as we shall discuss in Section 4.2.4. Since the vertical SIMD and tree implementations are slow, we omit them from the multithreaded experiments.

Figure 7(b) presents the multithreaded throughput results using SIMD. The partitioned SIMD implementations are consistently the best methods across all numbers of threads. SIMD1-P has a throughput of 20.3 billion per second at 48 threads. At this point, it already saturates the memory bandwidth, so its performance does not improve when more than 48 threads are used. The highest throughput of SIMD2-P is 20.5 billion per second with 80 threads. Without partitioning, SIMD1 stabilizes at a throughput of 12.3 billion per second. Because the second pass has to access the data from memory (instead of cache) again without partitioning, it is 1.7x slower. SIMD2 has a throughput of 16.7 billion per second, 1.3x faster than SIMD1 because of less memory access in the first pass (no writes).

Figure 7(a) presents the scalar results. Interestingly, although the partitioned scalar implementation Scalar1-P is slow with fewer than 32 threads, it eventually becomes faster than the non-partitioned SIMD method and also reaches the limit of memory bandwidth at 48 threads. This result shows that the prefix sum converts from a CPU-bound computation to a memory-bound computation as more and more threads are used. In a compute-bound situation (e.g., fewer than 32 threads), SIMD can improve performance, while in a memory-bandwidth-bound situation, it is more important to optimize for cache locality and reduce memory accesses. For Scalar2-P, partitioning does not appear beneficial. One possible explanation is that the compiler (or hardware prefetcher) generates nontemporal prefetches (or loads) in the first accumulation pass, meaning that the data is not resident in the cache for the second pass. The non-partitioned scalar implementations are slower than SIMD, but with more threads used, they also become bandwidth-bound and reach the same performance as non-partitioned SIMD.

We also find that the library implementations are slower than our handwritten implementations. The Intel library implementation is faster than the GNU library with fewer than 16 threads, since it uses SIMD, but it appears not to scale well. Even compared with our non-partitioned implementations, both the GNU and Intel implementations have a lower throughput at the maximum system capacity.

4.2.3 Out-of-Place Performance

We now investigate out-of-place computations, where the output goes to a separate array from the input. Figure 8 shows that Scalar2 and SIMD2 perform well; the partitioned versions of those algorithms give about the same throughput, except for scalar code with few threads, where the partitioned algorithm performs better. The performance of the GNU library improves for out-of-place computations, but it still performs worse than the Scalar2/Scalar2-P algorithms. While there is little change in the performance of the Scalar1/Scalar1-P and SIMD1/SIMD1-P algorithms when shifting from in-place to out-of-place computations, the performance of Scalar2/Scalar2-P and SIMD2/SIMD2-P improves.

One source of potential improvement is that the out-of-place method reads from and writes to different memory regions, which may be on different memory banks that can operate in parallel. The in-place method will always be reading from and writing to a single memory bank at a time. There are therefore more opportunities for the system to balance the throughput from different memory banks and achieve higher bandwidth utilization. Support for this hypothesis comes from Figure 9, where we run the algorithms on a single node with a single memory bank. The SIMD versions SIMD1-P, SIMD2 and SIMD2-P now all perform similarly.

Nevertheless, the scalar performance in Figure 9 shows that there is still a small performance edge for Scalar2/Scalar2-P in the out-of-place algorithms on a single node. We are not certain of the explanation for this effect, but suspect that write-combining buffers may be helping the performance, since the second phase in Scalar2/Scalar2-P does sequential blind writes to the output array.

4.2.4 Effect of Partition Sizes

In Figure 10(a), we tune the partition sizes for our partitioned scalar and SIMD implementations with 48 threads. The x-axis is in log scale. By trying out possible partition sizes covering the cache sizes at different levels, we find that with 48 (and fewer) threads, the best partition size is 128K floats per thread, which takes 512 KB, i.e., half the size of the L2 cache. With 96 threads, the best choice of partition size is 64K floats per thread, as a result of two hyperthreads sharing the L2 cache. Large partition sizes make caching ineffective, while small partition sizes can increase the synchronization overhead.

Partitioning also helps with the vertical and tree implementations, which need two passes even with a single thread.


Figure 7: Multithread throughput. (a) Scalar; (b) SIMD.

Figure 8: Multithread throughput (out-of-place). (a) Scalar; (b) SIMD.

Figure 9: Throughput on a single node. (a) Scalar (in-place); (b) SIMD (in-place); (c) Scalar (out-of-place); (d) SIMD (out-of-place).


Figure 10: Effect of partition sizes. (a) Multithreaded implementations with partitioning; (b) single-thread implementations with partitioning.

In Figure 10(b), we find that by partitioning the input data into partitions of 8K floats (the L1 cache size), we can increase the throughput of SIMD-V1 and SIMD-V2 from 0.3 to about 1.6 billion per second, and the throughput of SIMD-T from 0.2 to 0.7 billion per second. This result shows that partitions fitting into the L1 cache are best for the performance of gather and scatter instructions. Even with this enhancement, the throughput is still much lower than that of the other algorithms.

4.2.5 Effect of Dilation Factors

Figure 11 demonstrates the effect of dilation factors. We compare equal partitioning (no dilation) with the best dilation factors found in our tuning process. In most situations shown, it is clear that tuning the dilation factor is required to achieve good performance.

The improvement from using one more partition (in every iteration) is largest when cache-friendly partitioning is not used. Improving thread utilization can indeed improve performance, especially with a smaller number of threads, when the performance is not memory-bound. With cache-friendly partitioning, the difference is fairly small, especially when SIMD is used.

As we have demonstrated, the default dilation of d = 1 used in standard library implementations is suboptimal, so users should tune this parameter for better performance. However, this parameter is not very convenient to tune: although it is nominally the ratio of the speeds of two subprocedures, the actual performance depends on multiple factors. For example, Figure 12 shows the throughput results with varying dilation factors for the Scalar1 implementation. Using different numbers of threads, the best dilation factor changes from 0.2 to 0.8, reflecting a changing balance of CPU and memory latencies as memory bandwidth becomes saturated.

From the above results, we observe that if we want to use one more partition (Figures 1(c) and 1(d)), then the dilation factors must be tuned, and sometimes it is hard to find a fixed best factor across all configurations. On the other hand, since we achieve almost the same high performance using the cache-friendly partitioning strategy without the extra partition in every iteration, it is more robust to use the partitioning schemes in Figures 1(a) and 1(b).

4.2.6 Effect of High-Bandwidth Memory

To study the effect of memory bandwidth, we also conducted experiments on a Xeon Phi machine based on the Knights Landing (KNL) architecture. Although it is not a mainstream machine for database query processing, it has Multi-Channel DRAM (MCDRAM) providing much higher bandwidth than regular RAM (the load bandwidth is 295 GB per second and the store bandwidth is 220 GB per second). Our experimental machine has 64 physical cores and 16 GB of MCDRAM. Figure 13 shows the results using 64 threads. It is clear that the cache-friendly partitioning (into L2-cache-sized units) helps every algorithm in regular memory (LBM). In the high-bandwidth memory (HBM), we observe a higher throughput for all the methods. However, partitioning does not improve performance at all, and it is better not to partition at a finer granularity (i.e., the optimal partition size per thread is simply 1/64 of the data size).

To explain these results, observe that there is an overhead to partitioning, including extra synchronization. When data is in RAM, the payoff (reduced memory traffic) is important because the system is memory-bound, and so the overhead is worth paying. When the data is in high-bandwidth memory, the system never becomes memory-bound. With bandwidth an abundant resource, it does not pay to reduce bandwidth needs by doing extra work.

5. CHOOSING THE RIGHT ALGORITHM

We now summarize our results and make recommendations for the efficient implementation of prefix sum algorithms/libraries.

Observation 1. It is hard to get configuration parameters like the dilation factor right. Dilation factors depend on a ratio of subalgorithm speeds that depends on many factors and is hard to calibrate in advance. The performance gains from not wasting one thread are small, particularly if many threads are available, and the risks of suboptimal performance when the wrong dilation factor is used are high. We should also remark that the partitioning approach needs a configuration parameter as well, to determine the partition sizes. However, the best partition size appears to be easily determined from the L2 cache size, which can be obtained from the system at runtime.

Observation 2. It is worthwhile to pursue bandwidth optimizations like partitioning if bandwidth is a bottleneck. Experiments on a machine with high-bandwidth memory show that partitioning helps when the data is in slow RAM, but hurts performance when the data is in fast RAM.

Observation 3. Over all experiments, the partitioning variant SIMD2-P has the most robust performance in all conditions, while SIMD1-P performed slightly better for in-place computations.


Figure 11: Effect of dilation factors. (a) Prefix Sum + Increment, no partitioning; (b) Prefix Sum + Increment; (c) Accumulate + Prefix Sum, no partitioning; (d) Accumulate + Prefix Sum.

Figure 12: Varying dilation factors with different numbers of threads

Figure 13: Throughput on Knights Landing

Scalar code often reached the same plateau as SIMD code with many threads, but the SIMD code was significantly faster at lower thread counts, where bandwidth was not saturated.

Observation 4. There are some subtle interactions between in-place/out-of-place choices and algorithm structure, particularly for scalar code. There are also compiler-driven effects, where simpler loop structures in scalar code can be automatically vectorized.

Observation 5. Tree-based algorithms are not competitive because of poor memory locality. The alternative vertical SIMD method seems reasonable in theory, but does not perform well because, on current machines, gather/scatter instructions are relatively slow.

6. CONCLUSIONS

In this paper we have implemented and compared different ways of computing prefix sums using SIMD and multithreading. We find that an efficient SIMD implementation is better able to exploit the sequential access pattern, even if it is not work-efficient in theory. Using AVX-512, we observe more than 3x speedup over scalar code in single-threaded execution. In a parallel environment, the memory bandwidth quickly becomes the bottleneck, and SIMD adds even more pressure on memory access. An efficient multithreaded implementation needs to be cache-friendly, with minimal overhead of thread synchronization. In the experiments we find that the most efficient prefix sum computation using our partitioning technique with better caching is up to 3x faster than standard library implementations that already use SIMD and multithreading.


7. REFERENCES

[1] Libstdc++ parallel mode. https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html.

[2] Parallel STL. https://github.com/oneapi-src/oneDPL.

[3] C. Barthels, I. Muller, T. Schneider, G. Alonso, and T. Hoefler. Distributed join algorithms on thousands of cores. Proceedings of the VLDB Endowment, 10(5):517–528, 2017.

[4] M. Billeter, O. Olsson, and U. Assarsson. Efficient stream compaction on wide SIMD many-core architectures. In Proceedings of the Conference on High Performance Graphics 2009, pages 159–166, 2009.

[5] G. E. Blelloch. Vector models for data-parallel computing, volume 2. MIT Press, Cambridge, 1990.

[6] G. E. Blelloch. Prefix sums and their applications. Synthesis of Parallel Algorithms, pages 35–60, 1993.

[7] S. Chaudhuri and J. Radhakrishnan. The complexity of parallel prefix problems on small domains. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science, pages 638–647. IEEE, 1992.

[8] R. Cole and U. Vishkin. Faster optimal parallel prefix sums and list ranking. Information and Computation, 81(3):334–352, 1989.

[9] P. M. Fenwick. A new data structure for cumulative frequency tables. Software: Practice and Experience, 24(3):327–336, 1994.

[10] S. Geffner, D. Agrawal, A. El Abbadi, and T. Smith. Relative prefix sums: An efficient approach for querying dynamic OLAP data cubes. In Proceedings of the 15th International Conference on Data Engineering, pages 328–335. IEEE, 1999.

[11] T. Goldberg and U. Zwick. Optimal deterministic approximate parallel prefix sums and their applications. In Proceedings of the Third Israel Symposium on the Theory of Computing and Systems, pages 220–228. IEEE, 1995.

[12] M. Harris, S. Sengupta, and J. D. Owens. Parallel prefix sum (scan) with CUDA. GPU Gems, 3(39):851–876, 2007.

[13] C. Hetland, G. Tziantzioulis, B. Suchy, M. Leonard, J. Han, J. Albers, N. Hardavellas, and P. Dinda. Paths to fast barrier synchronization on the node. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pages 109–120, 2019.

[14] W. D. Hillis and G. L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986.

[15] C.-T. Ho, R. Agrawal, N. Megiddo, and R. Srikant. Range queries in OLAP data cubes. ACM SIGMOD Record, 26(2):73–88, 1997.

[16] C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. Proceedings of the VLDB Endowment, 2(2):1378–1389, 2009.

[17] M. Klemm, A. Duran, X. Tian, H. Saito, D. Caballero, and X. Martorell. Extending OpenMP* with vector constructs for modern multicore SIMD architectures. In International Workshop on OpenMP, pages 59–72. Springer, 2012.

[18] K. J. Kohlhoff, V. S. Pande, and R. B. Altman. K-means for parallel architectures using all-prefix-sum sorting and updating steps. IEEE Transactions on Parallel and Distributed Systems, 24(8):1602–1612, 2012.

[19] R. E. Ladner and M. J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831–838, 1980.

[20] N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–10. IEEE, 2010.

[21] D. Lemire. Wavelet-based relative prefix sum methods for range sum queries in data cubes. In Proceedings of the 2002 Conference of the Centre for Advanced Studies on Collaborative Research, page 6. IBM Press, 2002.

[22] D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Software: Practice and Experience, 45(1):1–29, 2015.

[23] D. Lemire, L. Boytsov, and N. Kurz. SIMD compression and the intersection of sorted integers. Software: Practice and Experience, 46(6):723–749, 2016.

[24] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, D. A. Padua, et al. An evaluation of vectorizing compilers. In 2011 International Conference on Parallel Architectures and Compilation Techniques, pages 372–382. IEEE, 2011.

[25] O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1493–1508, 2015.

[26] R. Raman, V. Raman, and S. R. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4):43, 2007.

[27] A. Rodchenko, A. Nisbet, A. Pop, and M. Lujan. Effective barrier synchronization on Intel Xeon Phi coprocessors. In European Conference on Parallel Processing, pages 588–600. Springer, 2015.

[28] P. Sanders and J. L. Traff. Parallel prefix (scan) algorithms for MPI. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 49–57. Springer, 2006.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In 2009 IEEE International Symposium on Parallel & Distributed Processing, pages 1–10. IEEE, 2009.

[30] N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 351–362, 2010.

[31] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. 2007.

[32] S. Sengupta, A. Lefohn, and J. D. Owens. A work-efficient step-efficient prefix sum algorithm. 2006.

[33] J. Singler and B. Konsik. The GNU libstdc++ parallel mode: Software engineering considerations. In Proceedings of the 1st International Workshop on Multicore Software Engineering, pages 15–22, 2008.

[34] J. Singler, P. Sanders, and F. Putze. MCSTL: The multi-core standard template library. In European Conference on Parallel Processing, pages 682–694. Springer, 2007.

[35] S. Yan, G. Long, and Y. Zhang. StreamScan: Fast scan algorithms for GPUs without global barrier synchronization. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 229–238, 2013.