
Adaptive Sparse Matrix-Matrix Multiplication on the GPU

Martin Winter, Graz University of Technology, Austria
[email protected]

Daniel Mlakar, Graz University of Technology, Austria
[email protected]

Rhaleb Zayer, Max Planck Institute for Informatics, Germany
[email protected]

Hans-Peter Seidel, Max Planck Institute for Informatics, Germany
[email protected]

Markus Steinberger, Graz University of Technology, Austria
[email protected]

Abstract
In the ongoing efforts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM) has received considerably less attention than sparse matrix-vector multiplication (SpMV). While both are equally important, this disparity can be attributed mainly to the additional formidable challenges raised by SpGEMM.

In this paper, we present a dynamic approach for addressing SpGEMM on the GPU. Our approach works directly on the standard compressed sparse rows (CSR) data format. In comparison to previous SpGEMM implementations, our approach guarantees a homogeneous, load-balanced access pattern to the first input matrix and improves memory access to the second input matrix. It adaptively re-purposes GPU threads during execution and maximizes the time during which efficient on-chip scratchpad memory can be used. Adhering to a completely deterministic scheduling pattern guarantees bit-stable results during repetitive execution, a property missing from other approaches. Evaluation on an extensive sparse matrix benchmark suggests our approach is the fastest SpGEMM implementation for highly sparse matrices (80% of the set). When bit-stable results are sought, our approach is the fastest across the entire test set.

CCS Concepts  • Theory of computation → Massively parallel algorithms; • Computing methodologies → Linear algebra algorithms; • Software and its engineering → Scheduling;

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PPoPP'19, February 16–20, 2019, Washington, DC, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6225-2/19/02...$15.00
https://doi.org/10.1145/3293883.3295701

Keywords  SpGEMM, GPU, Sparse Matrix, Adaptive, ESC, bit-stable

1 Introduction
Generalized sparse matrix-matrix multiplication (SpGEMM) is one of the key kernels in scientific computing and data analytics, e.g., in algebraic multigrid solvers [5], Schur complement methods [25], betweenness centrality [6] and cycle detection [26]. Algorithmically, SpGEMM consists of building the output matrix C from the product of two sparse input matrices A and B, given by

C_{ij} = \sum_k A_{ik} \cdot B_{kj},    (1)

where k spans the colliding non-zeros of the i-th row of A and j-th column of B. In the sequential setting, efficient treatment of SpGEMM dates back to the pioneering work of Gustavson [18]. As the computing landscape shifts towards ubiquitous parallelism, there is a pressing need for equivalently efficient approaches on modern many-core processors.
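For reference, Gustavson's row-wise evaluation of Eq. 1 can be sketched in a few lines of host code. The following is a minimal illustration of our own (the Csr struct and all names are ours, not the paper's code), building one row of C at a time with a dense sparse accumulator (SPA) over the CSR arrays described below; column ids within a row are left unsorted:

```cuda
#include <vector>

// Standard CSR layout: entries sorted by row; values and column ids stored
// explicitly; row_ptr[i] marks where row i begins in the sorted arrays.
struct Csr {
    int rows, cols;
    std::vector<int> row_ptr;     // size rows + 1
    std::vector<int> col_ids;     // size nnz
    std::vector<double> vals;     // size nnz
};

Csr spgemm_gustavson(const Csr& A, const Csr& B) {
    Csr C{A.rows, B.cols, {0}, {}, {}};
    std::vector<double> acc(B.cols, 0.0);    // dense accumulator (SPA)
    std::vector<char> occupied(B.cols, 0);   // SPA occupancy flags
    std::vector<int> touched;                // columns touched in this row
    for (int i = 0; i < A.rows; ++i) {
        touched.clear();
        for (int a = A.row_ptr[i]; a < A.row_ptr[i + 1]; ++a) {
            int k = A.col_ids[a];            // colliding non-zero A_ik
            for (int b = B.row_ptr[k]; b < B.row_ptr[k + 1]; ++b) {
                int j = B.col_ids[b];
                if (!occupied[j]) { occupied[j] = 1; touched.push_back(j); }
                acc[j] += A.vals[a] * B.vals[b];   // C_ij += A_ik * B_kj
            }
        }
        for (int j : touched) {              // gather the non-zeros of row i
            C.col_ids.push_back(j);
            C.vals.push_back(acc[j]);
            acc[j] = 0.0; occupied[j] = 0;   // reset the SPA for the next row
        }
        C.row_ptr.push_back(static_cast<int>(C.col_ids.size()));
    }
    return C;
}
```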

Among existing many-core architectures, the graphics processing unit (GPU) is of particular interest. It has emerged as a viable and cost-effective co-processor in both single workstations and large supercomputing clusters. As the GPU is primarily designed for massively parallel, uniform execution and memory access, achieving good speedups on unstructured problems such as SpGEMM remains a challenge.

To provide context to the ensuing discussion, we assume matrices are given in the compressed sparse row (CSR) format, which is probably the most common format in use. Entries are sorted according to rows, and their values and column ids are explicitly stored. An additional row pointer array indicates the beginning of each row in the sorted arrays.
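As a small worked example (ours, not from the paper), a 3×4 matrix with six non-zeros is stored in CSR as follows:

```cuda
// CSR storage of the 3x4 example matrix
//   | 10  0  0 20 |
//   |  0 30  0  0 |
//   | 40 50 60  0 |
int    row_ptr[] = {0, 2, 3, 6};        // row i spans [row_ptr[i], row_ptr[i+1])
int    col_ids[] = {0, 3, 1, 0, 1, 2};  // sorted by row, then column
double vals[]    = {10, 20, 30, 40, 50, 60};
```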

Challenges  The challenges of SpGEMM on the GPU stem from multiple factors. First, the number of entries in each row of A and B may vary strongly. This disparity complicates load balancing, as threads may easily receive vastly different workloads, which is especially difficult to manage on single instruction, multiple data (SIMD) devices like GPUs.


Second, there is no way to predict the number of intermediate products (A_{ik} · B_{kj}) without inspecting the matrices. This makes it difficult to perform intermediate computations within efficient on-chip memory, as it may easily overflow.

Third, memory access patterns are paramount on the GPU. Due to the nature of SpGEMM, the sparsity pattern and content of both matrices A and B determine the memory access pattern throughout computations.

Strategies  Without loss of generality, the landscape of GPU SpGEMM is dominated by the following strategies:
• ESC: explicitly expand all temporary products, sort them and combine them to yield the final matrix [5, 7–9].
• Hashing: merge temporary products either in scratchpad memory or globally using hashing [3, 22, 23].
• Merging and Hybrid: choose a fitting method for each row [17, 19, 20].

Most of these approaches are designed with only one or two of the challenges discussed earlier in mind. The ESC strategy achieves excellent load balancing at the cost of high intermediate memory. In fact, in its original form all intermediate products go through slow global GPU memory. Similarly, hashing is notoriously slow in global memory. Operating (partially) in scratchpad memory can increase the performance of both approaches. Relying on smart global scheduling [7, 22], overflow of scratchpad memory can be avoided. However, this entails a complete matrix inspection (which can consume up to 30% of the runtime; cf. [22], Fig. 6). One major drawback of hashing is that the accumulation order depends on the hardware scheduler and thus might yield different floating point errors during each run.

Merge-based approaches and hybrids focus on preprocessing and row-based scheduling. This may lead to a large number of temporary matrices and significant preprocessing effort. Furthermore, memory access patterns may deteriorate by switching strategies. Again, for this kind of scheduling the matrices need to be inspected fully.

Contribution  We propose a new GPU SpGEMM algorithm that tackles the challenges outlined above in a comprehensive way and at the same time cuts down on overhead computations throughout all stages. To this end, we make the following contributions:
• an efficient global load balancing scheme based solely on the non-zeros of A. It avoids costly inspection of the involved matrices and achieves entirely uniform load balancing in practice.
• a dynamic work distribution which achieves perfect local load balancing within a block of threads throughout multiple iterations of ESC, mixing partial results with new input.
• an efficient local adaption of the ESC algorithm which operates within scratchpad memory and allows combining temporary results to a complete subset of C.
• chunk-based storage of partial results of C.
• an efficient, adaptive merge algorithm for partial results.
• an adaptive approach to handle long rows of B efficiently.

2 Related work
In the sequential treatment of the problem, the output matrix is filled one row at a time by means of a large bookkeeping array aka sparse accumulator (SPA) [10, 15, 18]. As the sparsity pattern of C is not known prior to execution, a preliminary pass for memory allocation counts the number of non-zeros of C and a secondary pass computes the entries.

Over the years many strategies have been proposed to adapt SpGEMM to modern parallel architectures. They vary mainly in the way the operations in Eq. 1 are partitioned among A, B, and C. The classification by Ballard et al. [4] offers an elegant dimensionality-based interpretation of strategies by mapping operations to the cube A×B×C. Within this classification, problems are amenable to solving the underlying hypergraph partitioning. However, the more elaborate the partition, the more preprocessing effort is needed.

The multi-threaded approach by Patwary et al. [24] reduces cache misses by blocking accesses to the SPA. It searches for adequate block partitioning of the columns of B and fills individual blocks of C independently. In the same spirit, Akbudak and Aykanat [1] rely on row-wise partitioning of A to exploit locality in accessing the rows of B.

The approach by Bell et al. [5], implemented in CUSP [8], breaks the processing into expansion, sorting, and compression (ESC). This translates into generating a list of intermediate products which are sorted by row and column index, and falls within the 3D category. The output is generated by contracting entries with the same indices. Optimizations take into account the sparsity patterns of A and B to improve the row-wise processing of C [9]. Another variant performs ESC locally before merging the results globally [7] at the cost of increased load balancing effort. While we also perform ESC locally, the major advantage of our approach is that we use dynamic scheduling to perform multiple ESC iterations before going to global memory, considerably reducing memory bandwidth, global sorting and compaction costs.

Adopting a partial 1D row-wise strategy, rows from B can directly be merged, even on the GPU [16]. The RMerge approach [17] optimizes this merge strategy by ensuring operations are completed in efficient memory. This constraint is enforced by splitting B into multiple matrices with limited row length and iteratively computing the product of these matrices from right to left. In the bhSparse approach [20] rows are grouped by the number of intermediate products and then a merge-based strategy is adaptively selected based on the number of intermediate products.


They also compare against a CPU implementation based on Intel MKL and show average speedups of 2.5/2.2 for single/double precision.

In a similar spirit, the approach by Kunchum et al. [19] selects different strategies depending on the row structure. SpGEMM can also be addressed using hash tables. An early approach [12] keeps a primary hash table in scratchpad memory and a secondary one in global memory. An implementation of this approach is used within cuSparse [23]. Deveci et al. [13] also use a dual hash table approach, whereas global hash tables are only used temporarily and are reclaimed. The BalancedHash approach [3] restricts itself to local hash tables and avoids overflows using "better size estimates" [2].

nsparse [22] follows these approaches and addresses the memory problem by grouping rows based on intermediate products and thus can construct hash tables with different sizes. Deveci et al. [14] build on their previous work [13]. In particular, they combine partitioning for hierarchical parallelism with the use of a two-level hash data structure.

One downside of hash-based approaches is their non-deterministic compaction order, leading to different floating point errors during each run.

3 Adaptive SpGEMM
Our adaptive chunk-based GPU SpGEMM approach (AC-SpGEMM) focuses on four major goals:
1. performing computations in local on-chip memory
2. coherent memory access
3. independence of row lengths
4. ensuring deterministic results

To achieve these goals, AC-SpGEMM follows a four-stage approach, as outlined in Figure 2. In the first stage, AC-SpGEMM prepares data for global load balancing. In the second stage, we perform chunk-based ESC, producing deterministically bit-stable results. With the help of our local work distribution, this stage performs multiple iterations of ESC, fully utilizing local memory resources while keeping temporary results in scratchpad memory. Merging of rows shared across chunks happens in the third stage. Finally, in the fourth stage, we allocate the output matrix C and fill it with data from the chunks.

Figure 1. Average non-zeros per row for the SuiteSparse matrix collection, min and max overlayed and clamped.

Figure 2. Our approach has four consecutive stages: global load balancing follows a strict non-zero splitting of A; our AC-ESC step performs the major part of the SpGEMM, computing entire chunks of C; merge combines the rows shared across chunks before the data is copied to C.

Before discussing the details of each stage, we motivate some design choices. Analysing the row length of matrices from the SuiteSparse matrix collection [11], it can be observed that the majority of matrices in common problem domains have average row lengths of less than 200 elements, cf. Figure 1. Considering register sizes of current GPUs and reasonably small thread block sizes, up to 4000 temporary elements can be held by each block. If there is reasonable overlap between rows, the resulting output can be stored in scratchpad memory for another iteration of ESC. Given 200 entries per row, ideally another 3800 temporary elements can be loaded and compacted. In the best case, these local load and compaction steps continue until yielding completed rows of C, without ever going through slow global memory.

3.1 Global Load Balancing
Previous SpGEMM approaches adopt one of two strategies for load balancing. (1) They bin the rows of A according to their lengths [20] and choose an appropriate algorithm for each category. This strategy may pull apart sequential rows in A, severing memory access patterns to A and C. (2) They analyze the number of temporary products that will be produced and distribute them uniformly [3, 7, 22]. This approach needs to analyse the entire temporary data and write identifiers to global memory. The relative cost of load balancing increases with the sparsity of the input matrices.


Algorithm 1: Global Load Balancing
    a ← row_ptr[tid]
    b ← row_ptr[tid + 1]
    block_a ← divup(a, NNZ_PER_BLOCK)
    block_b ← (b − 1) / NNZ_PER_BLOCK
    while block_a ≤ block_b do
        blockRowStarts[block_a] ← tid
        block_a ← block_a + 1

According to our analysis and the cost breakdown given by Nagasaka et al. [22], load balancing can consume up to 30% of the overall runtime for very sparse matrices.

To tackle the drawbacks of both strategies, we propose a simple global load balancing scheme and defer the fine-grained control to the next stage. Our global load balancing splits the non-zeros of A uniformly, assigning the same number to each thread block. In this way, memory requirements from A are static. However, the actual workload for each block varies based on the intermediate products generated.

While a static splitting of A's non-zeros does not require processing, the second stage still needs to associate each entry from A with a row. To provide this information, global load balancing checks A's row pointers in parallel and writes the required information into an auxiliary array, as outlined in Algorithm 1. The cost of this step is negligible compared to enumerating temporary products.
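Algorithm 1 translates almost directly into a short CUDA kernel. The following sketch is our reading of it (the launch shape of one thread per row and the NNZ_PER_BLOCK value are our assumptions): each thread marks every non-zero window that starts inside its row.

```cuda
// Each AC-ESC block later processes a fixed window of NNZ_PER_BLOCK
// non-zeros of A; this kernel stores, for every such window, the row in
// which the window starts (cf. Algorithm 1).
constexpr int NNZ_PER_BLOCK = 512;   // illustrative value (paper: 256/512)

__global__ void globalLoadBalancing(const int* __restrict__ row_ptr,
                                    int num_rows,
                                    int* __restrict__ blockRowStarts) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_rows) return;

    int a = row_ptr[tid];         // first non-zero of row tid
    int b = row_ptr[tid + 1];     // one past its last non-zero
    if (a == b) return;           // empty rows own no window start

    // First and last window whose start offset falls into [a, b).
    int block_a = (a + NNZ_PER_BLOCK - 1) / NNZ_PER_BLOCK;  // divup(a, ...)
    int block_b = (b - 1) / NNZ_PER_BLOCK;
    for (; block_a <= block_b; ++block_a)
        blockRowStarts[block_a] = tid;   // window starts lie in disjoint rows
}
```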

3.2 Adaptive Chunk-based ESC
Each thread block executing the second stage is assigned an equally sized portion of A and performs SpGEMM with B. Depending on the sparsity pattern, it can produce any number of output chunks, each representing a partial result of C. We propose an adaptation of ESC due to its desirable properties for SpGEMM. First, after the expansion of the temporary products, every thread performs identical work independent of which row the data comes from. Second, sorting can efficiently be implemented within a block, using radix sort [21]. Third, a stable sort algorithm always yields identical floating point results, free of the problematic scheduling-based effects encountered when using hashing.

The downside of ESC is that sorting intermediate data is more costly than sorting the output data. However, dealing with only a chunk of data at once and keeping all data local, the cost is significantly lower than sorting all temporary elements of the entire product A·B. While other local ESC approaches have been proposed before [7, 9, 20], they operate on individual rows or a fixed number of temporary products and then always go to global memory. Our approach completely ignores row boundaries and performs multiple iterations of ESC locally, dynamically splitting off completed rows. We only move to global memory after completion or when data cannot be compacted sufficiently.

3.2.1 Fetch A
The first step in AC-ESC fetches the non-zeros and column ids of A required by the thread block, using a coalesced read pattern. We store this data in scratchpad memory to be available throughout the entire stage. While coalesced loading of A is less important for denser matrices where the loads from B dominate, it is important for very sparse matrices where loads from B are similar in count. In addition, the row ids for all non-zeros in A are needed. As NNZ_PER_BLOCK is constant, we can reduce the bit length of these row ids by locally remapping them: using a dictionary, we use the index of the first non-zero in that row as local row id.
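A minimal sketch of this step follows; it is our simplification, not the paper's implementation. In particular, we derive row ids with a per-entry binary search over the row pointers, whereas the paper can start from the blockRowStarts array. The remapping to the row's first in-block index follows the dictionary idea above.

```cuda
constexpr int NNZ_PER_BLOCK = 512;   // as in the previous sketch

__global__ void fetchA(const int* __restrict__ row_ptr, int num_rows,
                       const int* __restrict__ col_ids_A,
                       const float* __restrict__ vals_A, int nnz_A) {
    __shared__ int   s_cols[NNZ_PER_BLOCK];
    __shared__ float s_vals[NNZ_PER_BLOCK];
    __shared__ int   s_rows[NNZ_PER_BLOCK];   // remapped local row ids

    int base = blockIdx.x * NNZ_PER_BLOCK;    // this block's non-zero window
    for (int i = threadIdx.x; i < NNZ_PER_BLOCK && base + i < nnz_A;
         i += blockDim.x) {
        s_cols[i] = col_ids_A[base + i];      // coalesced load
        s_vals[i] = vals_A[base + i];         // coalesced load

        // Largest row whose first non-zero lies at or before (base + i).
        int lo = 0, hi = num_rows - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (row_ptr[mid] <= base + i) lo = mid; else hi = mid - 1;
        }
        // Local row id: in-block index of the row's first non-zero,
        // bounding row ids to [0, NNZ_PER_BLOCK).
        s_rows[i] = max(row_ptr[lo] - base, 0);
    }
    __syncthreads();
    // ... AC-ESC iterations operate on s_cols / s_vals / s_rows ...
}
```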

3.2.2 Local Work Distribution
The most important component in AC-ESC is the local work distribution. As the number of intermediate products processed by a block of threads varies, we have to dynamically decide which elements to load to exactly fill up the available resources. While inspecting B for global load balancing is costly, inspecting B now comes with little additional cost as it needs to be accessed anyway. Thus, after loading each column index of A, we retrieve the number of elements to be fetched from B for every entry in A.

To determine which elements to load, we propose an expanding work distribution that dynamically supplies local ESC with data. The work distribution offers three methods: placework, size, and receivework, outlined in Algorithm 2. placework receives the number of temporary products that will be generated for each entry in A. A prefix sum over that data yields the expansion of temporary products up to a specific entry in A. size queries the overall sum, i.e., how many elements are still to be processed. receivework can be called with a desired number of elements to be drawn from the work distribution (Consume). It can deliver multiple elements (N) to each thread. We typically use 8.

To perform the work assignment, we determine in parallel the offset of the first temporary product of each row from A (lines 17-19). Marking this product (line 20) and performing a max prefix scan over the data (line 24) assigns the corresponding row id (A_res) to all Consume temporary products. To ensure data is loaded in a coalesced manner, we interleave the temporary products between threads, which is commonly known as moving from a blocked layout to a striped layout (line 25). Comparing the output id (c) and the cumulative sum up to the first element of the respective row (WDState[A_res[i]]), the local offset in the row can be computed.

Instead of using the local offset directly, we reverse the order using the offset from the next row WDState[A_res[i] + 1] and count down (line 29). In this way, if the work distribution splits a row in B, we take entries from the end of the row and thus simply act like the row is shorter in the next iteration of ESC.


Algorithm 2: Local Work Distribution
 1  ScratchPad WDState[NNZ_PER_BLOCK]
 2  Function placework(elements[NNZ_PER_THREAD])
 3      blockPrefixSumIncl(elements, elements)
 4      for i ← 0 to NNZ_PER_THREAD do
 5          WDState[tid · THREADS + i + 1] ← elements[i]
 6      WDState[0] ← 0
 7      syncThreads()

 8  Function size()
 9      return WDState[NNZ_PER_BLOCK]

10  Function receivework(N, Consume)
11      ScratchPad Offsets[N · THREADS]
12      A_res[N]
13      B_res[N]
14      clear(Offsets)
15      syncThreads()
16      for i ← 0 to NNZ_PER_THREAD do
17          a ← WDState[i · THREADS + tid]
18          a_n ← WDState[i · THREADS + tid + 1]
19          if a < Consume and a ≠ a_n then
20              Offsets[a] ← i · THREADS + tid
21      syncThreads()
22      for i ← 0 to N do
23          A_res[i] ← Offsets[N · tid + i]
24      blockMaxScanIncl(A_res, A_res)
25      blockedToStriped(A_res, A_res)
26      for i ← 0 to N do
27          c ← tid + i · THREADS
28          if c < Consume then
29              B_res[i] ← WDState[A_res[i] + 1] − c − 1
30          else
31              A_res[i] ← ∅
32              B_res[i] ← ∅
33      syncThreads()
34      for i ← 0 to NNZ_PER_THREAD do
35          j ← tid + i · THREADS + 1
36          WDState[j] ← max(0, WDState[j] − Consume)
37      syncThreads()
38      return A_res, B_res

Finally, we reduce all cumulative counts by the number of elements drawn from the work distribution (line 36) and return identifiers for all temporary products (A_res, B_res). The only state the work distribution needs to keep is WDState. As AC-ESC might run out of memory during chunk allocation, it needs to be able to continue after an allocation round trip to the host. Thus, the work distribution also needs to support restarts. In case of a restart, we simply store the number of already consumed elements to global memory. When the block continues, we initialize the work distribution as usual, but immediately reduce the workload accordingly, i.e., we execute line 36 with the stored count.

Figure 3. Example of our work distribution-driven local ESC: (a) After global load balancing over A's non-zeros (colors), we fill the work distribution with the row lengths each entry from A references in B. (b) Taking 10 elements from the work distribution, we expand, sort and compact. (c) A first chunk for row id 2 is split off to global; the remaining 3 elements are kept locally for another iteration. (d) 7 elements are drawn from the work distribution. (e) All elements are kept for another iteration. (f) 5 elements are needed. (g) A complete chunk for row 3 is produced.


3.2.3 Local ESC
Driven by the work distribution, we perform multiple iterations of local ESC, as indicated in Figure 3. Using the output from the work distribution, every thread loads its assigned element from B and multiplies it with the previously loaded value from A to complete the expansion. This yields coalesced memory access for elements of the same row in B. Of course, as different rows from B might be loaded, depending on the column ids of A, the exact memory access pattern still depends on the input data.

After the expansion, the intermediate products are moved into radix sort, using their row and column id for sorting. The runtime of radix sort is proportional to the bit length being sorted, and thus reducing the sort bit length is important for ESC [9]. While previous work followed a static approach to bit reduction, our approach is more aggressive and completely dynamic: as mentioned earlier, we bound the maximum range of row ids using a dictionary. Unfortunately, the same approach is not applicable to column ids, as it would require knowledge of all unique column ids. However, we can bound the range by tracking the minimum and maximum id for all entries we fetch from B and thus reduce the number of bits. We do the same for the row ids on top of the dictionary, further reducing their bit range. Thus, the sorting effort adapts to the input data present.

The reduction of the number of sorted bits not only reduces the sorting effort, but also the register requirements. Keeping in mind that the row ids in the worst case require log2(NNZ_PER_BLOCK) bits and the column id of B is limited by B's dimension, we choose a 32 bit or 64 bit integer. For example, for a block size of 256 threads and 2 NNZ_PER_THREAD, we need 9 bits for the row id; thus 32 bit integers are sufficient for matrices with up to 8.4 million columns (23 bits).
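The key bookkeeping can be made concrete as follows (a sketch with our own helper names): row ids are shifted above the rebased column ids, so the total key width is the sum of both dynamically determined bit lengths.

```cuda
#include <cstdint>

// Number of bits needed to represent values in [0, n); assumes n >= 1.
__host__ __device__ inline int bitsFor(uint32_t n) {
    int bits = 0;
    while ((1u << bits) < n) ++bits;
    return bits;
}

// Combined ESC sort key: (local row id, rebased column id).
// With NNZ_PER_BLOCK = 512 the row id needs log2(512) = 9 bits, so a
// 32-bit key leaves 23 bits for columns, i.e. up to ~8.4M distinct
// column ids after rebasing to the tracked minimum (cf. Section 3.2.3).
__host__ __device__ inline uint32_t makeSortKey(uint32_t local_row,
                                                uint32_t col,
                                                uint32_t col_min,
                                                int col_bits) {
    return (local_row << col_bits) | (col - col_min);
}
```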

In the compaction step, we are not only interested in combining the data, but also in the number of entries in every row and how to write the chunk to memory. We perform all three tasks within a single prefix scan using special operators and state. At first, we determine whether neighbouring elements have the same sorting key, i.e., should be combined, and whether their row id bits match, i.e., they are from the same row. Every thread can trivially perform this operation for all elements it holds in registers.

For cross-thread communication, we use scratchpad memory. We then encode both facts as individual bits in a 32 bit integer (row ends as the 17th bit in Algorithm 3; the end of a combine sequence as the first bit). We split the remaining 30 bits in half to count the elements in each row and the overall compacted elements. For each element that ends a combine sequence, we initialize each of the 15 bit counters to 1.

The scan uses the state information to decide whether to reset the accumulation and/or counting bits, as outlined in Algorithm 3.

Algorithm 3: Compaction Scan Operator
    // initial states for the scan operation
    // end row   0b0000 0000 0000 0011 0000 0000 0000 0011
    // end comp  0b0000 0000 0000 0010 0000 0000 0000 0011
    // none      0b0000 0000 0000 0000 0000 0000 0000 0000
    Function CombineScanOperator(a, b)
        if equalRow(a_key, b_key) then
            state ← a_state & 0xFFFE
        else
            state ← a_state & 0xFFFEFFFE
        if a_key = b_key then
            n_value ← a_value + b_value
        else
            n_value ← b_value
        n_key ← b_key
        n_state ← state + b_state
        return n

After completion of the scan, the state bits can still be queried to identify the compacted elements as well as row ends. Additionally, the other bits hold information about each element's position in the chunk as well as the local offset in the row. Using the row counts, we update the global counter for each row, which will serve later on for computing the row pointer array and memory requirements of C.
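The state word described above can be spelled out as constants and accessors; this is a sketch with our naming, where the constants mirror the initial states in Algorithm 3:

```cuda
#include <cstdint>

// State layout used by the compaction scan (cf. Algorithm 3):
//   bit  0      : element ends a combine sequence
//   bits 1..15  : running count of compacted elements (15 bits)
//   bit  16     : element ends a row
//   bits 17..31 : running count of elements in the current row (15 bits)
constexpr uint32_t END_COMBINE   = 1u << 0;
constexpr uint32_t END_ROW       = 1u << 16;
constexpr uint32_t INIT_END_COMP = 0x00020003u; // combine flag, counters = 1
constexpr uint32_t INIT_END_ROW  = 0x00030003u; // additionally sets row flag

__host__ __device__ inline uint32_t compactedCount(uint32_t state) {
    return (state >> 1) & 0x7FFFu;   // position in the compacted chunk
}
__host__ __device__ inline uint32_t rowCount(uint32_t state) {
    return (state >> 17) & 0x7FFFu;  // local offset within the row
}
```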

Previous approaches to local ESC assign the exact number of temporary elements that can be handled to a block [7] and go to global memory after ESC. The resulting data however may still need to be combined with temporary results from multiple other blocks. If we directly wrote our results to chunks in global memory, we would face the same issue. Thanks to the flexibility of our work distribution, we keep the results around for another iteration of ESC, combining the temporary results with new data from B.

By keeping the temporary results for the next iteration of ESC and reducing the number of elements drawn from the work distribution accordingly, we reduce the global memory traffic significantly. While avoiding additional chunks is reasonable, keeping elements from multiple output rows for the next iteration is not, as all but the last row are already completed. Conversely, it does not make sense to have a few elements of a new row at the end of a chunk, as these will require merging in a later stage. Thus, we only keep the last row for the next round and write other rows to a global chunk.

3.2.4 Chunk Management
When we decide to write a chunk of C to global memory, a thread of the block uses the compaction result to compute the chunk size and allocates a chunk from the pool. To this end, it increments an atomic counter by the chunk size. To write both the column ids and values to the chunk, we take a compacting round trip through scratchpad memory to ensure coalesced writes to global memory.


Figure 4. The chunk lists are constructed as per-row linked lists, while the shared rows tracker holds per-row chunk counts if merging is required for a certain row.

When a chunk is generated, we update the restart information for this thread block. If a complete chunk is written, information about how many elements have been drained from the work distribution is sufficient. If the last row is kept for the next iteration, we write the row id as restart information. The next time the block is launched, it can then completely ignore these rows when loading data from A.

In addition to the data needed for C, each chunk holds its starting row, element count and the number of elements in the first and last row. To perform chunk copy in parallel in the final stage, we keep an array of pointers to all chunks. Using a full pointer allows chunks to be placed in memory arbitrarily. Thus, expanding the chunk pool is as easy as adding another memory region to be used as pool.

Finally, we require information about which chunks have data for the same row to start merging. To identify all chunks that contribute to one output row, we construct a linked list of chunk pointers for every row, with a list head being available for every row, as shown in Figure 4. The list insertion uses an atomic exchange operation on the list heads, making the list order dependent on the hardware scheduler. To ensure bit-stable results, we further store a chunk identifier built from the block id and a per-block running chunk number, which yields a global ordering of chunks. To increase the efficiency of chunk merging, we use one bit of the list heads to indicate whether there is only a single chunk in the list. Upon adding a second chunk to the list, we insert the row id into an array holding all rows with two or more chunks. This array serves as the basis for identifying how many merge blocks need to be started in the next stage.
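The insertion can be sketched as follows (our simplification: we omit the single-chunk bit in the list heads and instead register shared rows via the per-row chunk counts of the shared rows tracker in Figure 4; all names are ours). Because the resulting list order is scheduler-dependent, the stored global chunk ids are what later allows a deterministic merge order.

```cuda
#include <cstdint>

constexpr uint32_t EMPTY_LIST = 0xFFFFFFFFu;  // list heads start at this sentinel

// Push a finished chunk onto its row's linked list. list_heads[row] points
// at the most recently inserted chunk; each chunk keeps the previous head
// as its 'next' link. The second chunk for a row registers the row for the
// merge stage exactly once.
__device__ void insertChunk(uint32_t row, uint32_t my_ref,
                            uint32_t* list_heads, uint32_t* next_links,
                            uint32_t* row_chunk_count,
                            uint32_t* shared_rows, uint32_t* num_shared_rows) {
    uint32_t old_head = atomicExch(&list_heads[row], my_ref);
    next_links[my_ref] = old_head;                 // link to previous head
    if (atomicAdd(&row_chunk_count[row], 1u) == 1u) {
        // This is the second chunk for the row: it will need merging.
        uint32_t slot = atomicAdd(num_shared_rows, 1u);
        shared_rows[slot] = row;
    }
}
```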

3.3 Chunk Merging
After AC-ESC, the merge stage combines rows shared between chunks to generate the final result. To guarantee a deterministic merge order, we perform an initial sort of the chunks based on their global chunk order. This sort is negligible compared to the merge itself. As any number of elements may need to be merged, one could launch a single block for each shared row. However, typically, a shared row is covered by two chunks, i.e., because global load balancing splits the entries of one row across two blocks. In this case, the number of entries in both chunks may be low, potentially wasting resources when using a complete block.

Similarly to AC-ESC, we want to work on multiple rows to fully use the available resources. To this end, we run a prefix scan over all shared rows, using the row count that we summed up atomically for all rows during AC-ESC. For shared rows, this count represents the number of remaining intermediate products.

Throughout the scan we combine row range identifiers if the sum of their respective elements does not overflow the number of elements we can handle in one block. This approach may not fill up blocks completely, but yields significantly better resource usage than a one-block-per-row strategy. When launching these Multi Merge blocks, we once again build on our work distribution and execute the remaining steps of our AC-ESC, creating new chunks for all merged rows. At the same time, we set the row count for each shared row to the correct value after merging.

The scope of Multi Merge is limited to rows that were split over two chunks. We therefore propose two additional algorithms to deal with rows yielding more than two chunks: Path Merge and Search Merge. The former is applicable up to a predefined number of chunks, while the latter can handle an arbitrary number. We decide which algorithm to use based on the length of each row's chunk list.

Search Merge uses binary search sampling in all chunk column ids to find overlapping ranges that can be handled at once. At first, we compute the minimum and maximum column id over all involved chunks. Then, we uniformly sample this range, according to ((max − min)/THREADS), assigning one thread to each sample. Using binary search, every thread finds the next higher column id in all chunks and computes the sum over all elements that lie below it across all chunks. The thread with the largest sum that still fits into the available resources delivers the data to be merged. Using AC-ESC on that data yields the first part of the merged row. Reducing the count of all samples by the number of consumed elements yields the next cut, and so on. In case the sampling is too coarse, we sub-sample the range.
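The sampling step of Search Merge might be sketched as follows (our simplification: chunk data is abstracted as plain arrays, and the block-wide selection of the largest fitting cut is left out):

```cuda
// Count how many of the n sorted column ids are strictly below 'sample'.
__device__ int countBelow(const int* col_ids, int n, int sample) {
    int lo = 0, hi = n;              // first index with col_ids[i] >= sample
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (col_ids[mid] < sample) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// One sample per thread, spread uniformly over [col_min, col_max]; returns
// the total number of elements below this thread's sample across all chunks.
__device__ int sampleCut(const int* const* chunk_cols, const int* chunk_sizes,
                         int num_chunks, int col_min, int col_max) {
    int sample = col_min
        + (int)(((long long)(col_max - col_min) * (threadIdx.x + 1)) / blockDim.x);
    int total = 0;
    for (int c = 0; c < num_chunks; ++c)
        total += countBelow(chunk_cols[c], chunk_sizes[c], sample);
    return total;   // a block-wide reduction then picks the best fitting cut
}
```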


Figure 5. Trend line of the SpGEMM performance for all tested methods over highly sparse matrices (average row length ≤ 42) from the SuiteSparse collection, in GFLOPS for float (top) and double (bottom). Line thickness indicates variance.

Path Merge avoids global memory binary search by placing samples uniformly over the entries of every chunk. For each sample we fetch the column id and sort the ids across the entire block, while carrying the sample number along with the sort. Next, we perform a custom scan over the sorted data to find the correspondences between samples from different chunks, i.e., identify possible paths through all chunks.

As every sample originally only holds the sample number from its own chunk, we set the other counts to zero, using only as many bits as necessary to represent each sample number. The max scan over these individual bit ranges delivers the combined merge path, i.e., the matching cut through each chunk. For each path, we compute the number of temporary elements from the combined sample locations and chunk sizes. Choosing the one that fits into memory, we run AC-ESC. The stored paths are again used for the next iteration.

All three approaches produce new chunks, and thus must support restarts. In Multi Merge, a restart simply starts from scratch, as it has only one iteration. On the other hand, both Search Merge and Path Merge store the last used sample, and therefore a restart therein equals sampling a reduced range.

3.4 Long Rows
Processing long rows that exceed the available local resources during AC-ESC would load and sort them without any change and write them to the chunk pool. To avoid these unnecessary computations, we identify long rows during Fetch A and immediately create a chunk that only points to the data in B and attaches the factor from A.

By removing the row from the work distribution, we avoid processing it during AC-ESC, but have to make slight modifications to this stage. If the long row has to be merged with values otherwise placed in the middle of a chunk, we break up chunks at the position of the long row to allow for merging afterwards in the merge stage.

3.5 Output Matrix and Chunk Copy
Once all chunks have been finalized, generating the final result is straightforward: a device-wide prefix sum over the row counts yields the row pointer array and C's memory requirement for allocation of the values and column id arrays. Then, in parallel, we iterate over all chunks and copy their data to the newly allocated C. Each chunk uses a complete block of threads to copy data in a coalesced fashion.
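The row pointer construction maps directly onto a library scan. A sketch using CUB follows (our choice of library; the paper does not name its scan implementation, and the function name is ours). It assumes d_row_counts holds num_rows + 1 entries with a trailing zero, so the exclusive scan also yields C's total non-zero count in d_row_ptr[num_rows]:

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

void buildRowPointer(const int* d_row_counts, int* d_row_ptr, int num_rows) {
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    // First call: query the required temporary storage size.
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes,
                                  d_row_counts, d_row_ptr, num_rows + 1);
    cudaMalloc(&d_temp, temp_bytes);
    // Second call: perform the device-wide exclusive prefix sum.
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes,
                                  d_row_counts, d_row_ptr, num_rows + 1);
    cudaFree(d_temp);
}
```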

4 Evaluation
To provide a realistic assessment of the performance of our approach, we benchmarked the entire SuiteSparse matrix collection [11], which contains more than 1800 unique matrices of non-trivial size (≥ 10^4 NNZ) from various application domains with different matrix characteristics. Very small matrices (< 10^4 NNZ) are excluded as they do not provide sufficient parallelism for execution on the GPU, and thus CPU implementations are typically faster. From about 10^4 NNZ upwards, our approach outperforms state-of-the-art CPU implementations [14] on a consumer grade CPU of similar cost (Intel Xeon E5-2630, 16 GB of memory). We compare our approach to cuSparse [23], bhSparse [20], RMerge [17], nsparse [22], and Kokkos [14]. All approaches work directly with CSR and were compiled with CUDA Toolkit 10.0. We compute A·A for square matrices and A·A^T for non-square ones, where we precompute A^T. As test platform we use an Intel i7 7700 CPU at 3.60 GHz and an NVIDIA Titan Xp (compute capability 6.1).

Our algorithm is written in CUDA and uses a block size of 256/512 non-zeros for global load balancing, sorts 8 elements per thread and keeps up to 4 elements per thread from one iteration to the next. We use conservative memory estimates for all helper data structures. For the initial chunk pool, we rely on a simplistic memory estimate S of C, using the average row length as a measure of row overlaps, i.e., we pretend matrices have the same number of uniformly distributed elements in each row. More precisely, for A of size n_A × m_A, the average row length is given by a = |A|/n_A, where |A| indicates the number of non-zeros, and the estimated probability for a collision is p_a = a/m_A. With b and p_b defined analogously for B, the memory estimate for the product AB is given as

S ≈ n_A \cdot b \cdot \frac{1 - (1 - p_b)^a}{p_b}.

We multiply this factor by 1.2 to account for the chunk metadata and divergences from the average row length, and apply a lower bound of 100 MB.
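The estimate can be written out as a small host-side helper (a sketch; the function and parameter names are ours, including the per-entry byte count, which the paper does not spell out):

```cuda
#include <algorithm>
#include <cmath>
#include <cstddef>

// Simplistic chunk pool estimate from Section 4: assume every row holds the
// average number of uniformly distributed non-zeros. Each of the a = |A|/n_A
// entries of a row of A pulls in a row of B with b = |B|/n_B entries, any
// output column colliding with probability p_b = b/m_B.
size_t estimateChunkPoolBytes(size_t nnzA, size_t rowsA,
                              size_t nnzB, size_t rowsB, size_t colsB,
                              size_t bytes_per_entry) {
    double a   = double(nnzA) / double(rowsA);
    double b   = double(nnzB) / double(rowsB);
    double p_b = b / double(colsB);
    // Expected non-zeros of C: S ≈ n_A * b * (1 - (1 - p_b)^a) / p_b.
    double S = double(rowsA) * b * (1.0 - std::pow(1.0 - p_b, a)) / p_b;
    double bytes = 1.2 * S * double(bytes_per_entry); // +20% for chunk metadata
    return std::max<size_t>(size_t(bytes), size_t(100) << 20); // 100 MB floor
}
```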


                 ------ ≈1500 highly sparse matrices (a ≤ 42) ------    --------- ≈300 denser matrices (42 < a) ---------
                 speedup of AC-SpGEMM        better    best             speedup of AC-SpGEMM        better    best
                 min      max      h. mean                              min      max      h. mean

float
  cuSparse†      0.78     614.17   4.01      0%        0%               0.55     674.51   1.66      6%        2%
  bhSparse       1.00     96.60    5.56      0%        0%               0.71     429.04   1.77      9%        0%
  RMerge         0.60     23.50    3.73      0%        0%               0.34     10.21    2.00      3%        0%
  nsparse†       0.26     76.80    2.29      3%        3%               0.19     33.02    0.76      66%       60%
  Kokkos†        0.48     767.71   6.27      1%        1%               0.34     148.27   1.17      45%       7%
  AC-SpGEMM      -        -        -         -         96%              -        -        -         -         31%

double
  cuSparse†      0.51     470.44   3.78      0%        0%               0.62     608.74   1.50      10%       3%
  bhSparse       1.00     88.60    5.12      0%        0%               0.71     387.04   1.56      17%       0%
  RMerge         0.45     24.90    3.70      0%        0%               0.40     9.50     1.81      5%        1%
  nsparse†       0.31     43.06    2.01      6%        5%               0.18     28.83    0.67      70%       61%
  Kokkos†        0.37     582.87   5.09      2%        1%               0.30     120.58   0.92      53%       10%
  AC-SpGEMM      -        -        -         -         94%              -        -        -         -         26%

Table 1. Relative speedup of AC-SpGEMM over competing approaches (min, max, harmonic mean), the percentage of matrices where an approach achieved better performance than AC-SpGEMM ("better"), and the percentage where it achieved the best performance overall ("best"). AC-SpGEMM dominates the performance for highly sparse matrices and still achieves the best performance for about 1/3 of denser matrices. † does not produce bit-stable results.

In case the estimate is not sufficient, the restart functionality will increase the chunk pool.

4.1 Runtime Overview
To better analyze the runtime performance, we split the evaluation into highly sparse and denser matrices, with a split factor of 42 non-zeros per row, classifying 80% of the matrices in SuiteSparse as highly sparse (see also Figure 1). Summary plots for SpGEMM on highly sparse matrices are shown in Figure 5, and relative speedups against the evaluated methods for all matrices in Table 1. For highly sparse matrices, our approach clearly dominates performance, achieving average speedups of at least 2× over other approaches. For denser matrices, nsparse achieves the best performance, with an average speedup of 1.32/1.49 over AC-SpGEMM, with AC-SpGEMM being on par with Kokkos.

Over the entire data set, our approach achieves average speedups of 3.27/3.05, 4.17/3.80, 3.30/3.21, 1.74/1.53, and 3.76/3.02 over cuSparse, bhSparse, RMerge, nsparse, and Kokkos, respectively. The trend plots and table already indicate the strengths and weaknesses of our approach. While ESC leads to deterministic, bit-stable results, the search and merge overhead increases as the number of non-zeros per row increases. Thus, even though our approach performs multiple iterations of ESC in efficient on-chip memory, it loses ground to hash-based approaches, like nsparse, for denser matrices. However, for very sparse matrices the overhead of ESC is not severe, and the advantages of local scheduling and single-run chunk generation shine. Compared to other bit-stable approaches, AC-SpGEMM overall clearly achieves the best performance.

Figure 6. Double precision performance for commonly benchmarked matrices and additional cases for which our approach is bested by others. Note that nsparse achieves 45 GFLOPS for the last matrix.

4.2 Runtime Details
A performance overview for commonly benchmarked matrices from various fields and selected matrices that are difficult for our approach is provided in Figure 6; matrix statistics are given in Table 2. The most difficult scenarios for our approach include TSC_OPF_1047, cant, and hood, which, in spite of having strongly different structures, all feature a large average row length, produce a high number of intermediate products and compact them significantly (up to a compaction factor of 150). The cost of ESC is simply too high in comparison to hashing, even while keeping hundreds of iterations in local memory. For TSC_OPF_1047, nsparse achieves the highest speedup over AC-SpGEMM: 5×. Another matrix with many temporary products but a lower compaction factor is landmark. Interestingly, due to its very special structure it is one of the few cases where RMerge takes the lead.


             --------------- A ---------------    --------- C ---------
             rows    cols    nnz     len   max    nnz     len    max     temp

language     0.40    0.40    1.22    3.0   11.5k  4.61    11.6   32.0k   5.5
scircuit     0.17    0.17    0.96    5.6   353    5.22    30.5   1.9k    8.7
stat96v2     0.03    0.96    2.85    98.1  3.2k   0.35    12.1   1.6k    8.7
poiss...     0.01    0.01    0.35    26.1  110    2.96    218.8  584     11.8
144          0.14    0.14    2.15    14.9  26     10.42   72.0   116     33.0
asia_osm     11.95   11.95   25.42   2.1   9      42.75   3.6    24      56.9
webb...      1.00    1.00    3.11    3.1   4.7k   51.11   51.1   12.4k   69.5
atmos...     1.49    1.49    10.32   6.9   7      36.49   24.5   25      71.6
filter3D     0.11    0.11    2.71    25.4  112    20.16   189.4  550     86.0
bibd_19_9    0.01    0.09    3.3     19.4k 19.4   0.03    171.0  171     119.7
TSOPF...     0.04    0.04    16.17   424.2 983    74.32   1.9k   3.3k    128.0
hugebu...    21.20   21.20   63.58   3.0   3      132.69  6.3    7       190.7
cant         0.06    0.06    4.01    64.2  78     17.44   279.3  375     269.5
landmark     0.07    0.00    1.15    16.0  16     101.82  1.4k   1.6k    549.2
hood         0.22    0.22    10.77   48.8  77     34.24   155.3  231     562.0
TSC_O...     0.01    0.01    2.02    247.8 1.5k   8.83    1.1k   3.5k    1352.4

Table 2. Matrix overview: values in millions except for row statistics (average length and maximum row length).

Figure 7. Relative runtime of the different steps of our approach: global load balancing (GLB), AC-ESC, Merge Assignment (MCC), Multi Merge (MM), Path Merge (PM), Search Merge (SM), and Chunk Copy (CC).

As the compaction factor gets lower, our approach is competitive, leading the performance for many commonly tested matrices in various domains, such as fluid dynamics (poisson3Da), linear programming (stat96v2), or graphs (webbase-1M). This performance edge is independent of the size of the matrix (compare hugebubbles-00020 and 144). Also, local dense areas (TSOPF_RS_b2383), specific structures (hugebubbles-00020), individual long rows (webbase-1M), non-square matrices (stat96v2), and very long rows (bibd_19_9) are handled efficiently by our approach. The best speedup is achieved by our approach if the matrix is highly sparse but has few long rows (language), where our efficient ESC and the special treatment of long rows go hand in hand, outperforming the best other approach by 20×.

The timing breakdown of our approach in Figure 7 shows that we operate under ideal conditions for many matrices, spending most time in AC-ESC within on-chip memory.

             helper    chunk     used      %        u/o     R    mpL

language     15.05     100.00    58.23     50.88%   1.10    0    98.73%
scircuit     13.09     100.00    63.15     55.18%   1.06    0    97.68%
stat96v2     29.91     122.11    6.68      5.47%    1.65    0    97.98%
poisson3Da   6.06      129.96    42.61     32.78%   1.26    0    91.02%
144          27.04     460.75    129.89    28.19%   1.09    0    98.38%
asia_osm     416.96    1091.05   497.10    45.56%   1.02    0    99.77%
webbase-1M   100.03    710.80    605.79    85.23%   1.04    3    98.05%
atmosmodl    130.69    1033.36   435.36    42.13%   1.04    0    99.70%
filter3D     38.06     983.23    271.19    27.58%   1.18    0    98.78%
bibd_19_9    1.06      100.00    19.20     16.78%   57.38   0    99.21%
TSOPF_...    605.60    3044.69   1593.56   52.34%   1.87    0    99.23%
hugebub...   928.54    1991.35   1545.96   77.63%   1.02    0    99.45%
cant         66.76     3072.00   276.51    9.00%    1.39    0    99.48%
landmark     228.97    6144.00   3333.51   54.26%   2.86    1    97.29%
hood         162.85    3072.00   508.58    16.56%   1.30    0    99.73%
TSC_O...     37.99     3072.00   310.41    10.10%   3.07    0    99.03%

Table 3. Overall memory consumption in MB for helper data structures, chunk pool, actual chunk pool used, and used chunk memory relative to the output matrix (u/o); as well as number of restarts (R) and lowest multiprocessor load (mpL).

Figure 8. Memory consumption. Due to our simplistic memory estimate, the effectively used memory is very small compared to the allocation.

For matrices with long rows (TSOPF_RS_b2383), a considerable amount of time is spent on merging. Although Multi Merge handles more shared rows than Search and Path Merge in all cases, its time is negligible, as every block handles many rows. Assigning the merge cases and copying the final matrix make up between 1-20% of the runtime; global load balancing is negligible, underlining the efficiency of our approach.

At the same time, the multiprocessor load is virtually perfect in all cases (cf. last column of Table 3), indicating that pairing our global and local load balancing with the GPU hardware scheduler achieves ideal workload distributions.

4.3 Memory Consumption
Our memory consumption is detailed in Table 3 and compared to others in Figure 8. nsparse requires hardly any additional memory.


We allocate similar amounts of memory as RMerge and bhSparse, of which we typically only use a fraction (% in Table 3). The fact that the used chunk memory is only slightly higher than the memory required for C (u/o in Table 3) shows that local iterations of ESC essentially produce completed chunks of the output matrix (note that our lower bound of 100 MB leads to the high value for bibd_19_9). This highlights the advantages of our local work distribution. Our chunk pool estimate is conservative in most cases and only few matrices require restarts, cf. Table 3.

Although we require three restarts for webbase-1M, we achieve a 4.4× speedup over the best other approach, indicating that restart is efficient. To evaluate the cost of restarts, we reduced the chunk pool size for webbase-1M, where we measured runtimes of 22.0, 23.6, 24.5, 26.6, 30.8, 39.7, and 48.6 ms for 0, 3, 5, 10, 21, 42, and 63 restarts, respectively. Even with 63 restarts we still beat nsparse by a factor of 2×. Additionally, a less conservative chunk estimate could significantly reduce memory with little impact on performance due to restarts.

4.4 Summary
Overall, it can be noted that AC-SpGEMM is the fastest approach when bit-stable results are required for virtually all tested matrices (RMerge is better in 1% of cases). AC-SpGEMM is the fastest approach for very sparse matrices. For matrices with many temporary products and large compaction factors, ESC strategies tend to fall behind hash-based approaches like nsparse, as the per-product cost is simply too high. Nevertheless, across the entire test suite, AC-SpGEMM takes the performance lead in 83% of all cases.

5 Conclusion
On massively parallel processors such as GPUs, any performance gains require bringing fine-grained parallelism forward. AC-SpGEMM achieves this goal by a comprehensive take on the problem. Our main contribution is a fully adaptive local work distribution, allowing for multiple iterations of local ESC and avoiding costly global memory round trips. Paired with a novel, adaptive chunk management approach, special case handling of long rows, and a series of optimizations, AC-SpGEMM forms a highly efficient, complete SpGEMM solution which achieves bit-stable results. Experimental results on 2000 matrices across various fields reveal dominating performance for highly sparse matrices. Only matrices with very large numbers of temporary products can be handled more efficiently using the most recent hash-based alternative, which is not bit-stable.

An obvious improvement for our approach is reducing the overallocation of chunk memory. Furthermore, extending the adaptive behaviour of our chunk-based approach to choose between alternative approaches (ESC, hashing, merging) depending on the load currently seen by the work distribution may lead to a further improvement of performance in those scenarios where other strategies shine.

Our approach is open source and can be downloaded from the ACM DL.

Acknowledgment
This research was supported by the German Research Foundation (DFG) grant STE 2565/1-1, and the Austrian Science Fund (FWF) grant I 3007. The GPU for this research was donated by NVIDIA Corporation.


References
[1] Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting Locality in Sparse Matrix-Matrix Multiplication on Many-Core Architectures. IEEE Transactions on Parallel and Distributed Systems 28, 8 (Aug 2017), 2258–2271. https://doi.org/10.1109/TPDS.2017.2656893
[2] Rasmus R. Amossen, Andrea Campagna, and Rasmus Pagh. 2010. Better Size Estimation for Sparse Matrix Products. In Proceedings of the 13th International Conference on Approximation and the 14th International Conference on Randomization and Combinatorial Optimization: Algorithms and Techniques (APPROX/RANDOM'10). Springer-Verlag, Barcelona, Spain, 406–419.
[3] Pham N. Q. Anh, Rui Fan, and Yonggang Wen. 2016. Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 36, 12 pages. https://doi.org/10.1145/2925426.2926273
[4] Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2016. Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication. ACM Trans. Parallel Comput. 3, 3, Article 18 (Dec. 2016), 34 pages. https://doi.org/10.1145/3015144
[5] Nathan Bell, Steven Dalton, and Luke N. Olson. 2012. Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods. SIAM Journal on Scientific Computing 34, 4 (2012), C123–C152. https://doi.org/10.1137/110838844
[6] Aydin Buluc and John R. Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, Miami, FL, USA, 1–11. https://doi.org/10.1109/IPDPS.2008.4536313
[7] Steven Dalton, Sean Baxter, Duane Merrill, Luke Olson, and Michael Garland. 2015. Optimizing Sparse Matrix Operations on GPUs Using Merge Path. In 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, India, 407–416.
[8] Steven Dalton, Nathan Bell, Luke Olson, and Michael Garland. 2014. Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computations. http://cusplibrary.github.io/ Version 0.5.0.
[9] Steven Dalton, Luke Olson, and Nathan Bell. 2015. Optimizing Sparse Matrix-Matrix Multiplication for the GPU. ACM Trans. Math. Softw. 41, 4, Article 25 (Oct. 2015), 20 pages. https://doi.org/10.1145/2699470
[10] Timothy A. Davis. 2017. SuiteSparse: A Suite of Sparse matrix packages. http://www.cise.ufl.edu/~davis/.
[11] Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1 (Dec. 2011), 1:1–1:25.
[12] Julien Demouth. 2012. Sparse matrix-matrix multiplication on the GPU. Technical Report. NVIDIA.
[13] Mehmet Deveci, Christian Trott, and Sivasankaran Rajamanickam. 2017. Performance-portable sparse matrix-matrix multiplication for many-core architectures. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, Lake Buena Vista, FL, USA, 693–702. https://doi.org/10.1109/IPDPSW.2017.8
[14] Mehmet Deveci, Christian Trott, and Sivasankaran Rajamanickam. 2018. Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures. Parallel Comput. 78 (Oct. 2018), 33–46. https://doi.org/10.1016/j.parco.2018.06.009
[15] John R. Gilbert, Cleve Moler, and Robert Schreiber. 1992. Sparse Matrices in MATLAB: Design and Implementation. SIAM J. Matrix Anal. Appl. 13, 1 (1992), 333–356. https://doi.org/10.1137/0613024
[16] Oded Green, Robert McColl, and David A. Bader. 2012. GPU Merge Path: A GPU Merging Algorithm. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). ACM, New York, NY, USA, 331–340.
[17] Felix Gremse, Andreas Höfter, Lars O. Schwen, Fabian Kiessling, and Uwe Naumann. 2015. GPU-accelerated sparse matrix-matrix multiplication by iterative row merging. SIAM Journal on Scientific Computing 37, 1 (2015), C54–C71.
[18] Fred G. Gustavson. 1978. Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition. ACM Trans. Math. Softw. 4, 3 (Sept. 1978), 250–269. https://doi.org/10.1145/355791.355796
[19] Rakshith Kunchum, Ankur Chaudhry, Aravind Sukumaran-Rajam, Qingpeng Niu, Israt Nisa, and P. Sadayappan. 2017. On Improving Performance of Sparse Matrix-matrix Multiplication on GPUs. In Proceedings of the International Conference on Supercomputing (ICS '17). ACM, New York, NY, USA, Article 14, 11 pages. https://doi.org/10.1145/3079079.3079106
[20] Weifeng Liu and Brian Vinter. 2015. A Framework for General Sparse Matrix-matrix Multiplication on GPUs and Heterogeneous Processors. J. Parallel Distrib. Comput. 85, C (Nov. 2015), 47–61. https://doi.org/10.1016/j.jpdc.2015.06.010
[21] Duane Merrill. 2015. CUB: CUDA Unbound, a library of warp-wide, block-wide, and device-wide GPU parallel primitives.
[22] Yusuke Nagasaka, Akira Nukada, and Satoshi Matsuoka. 2017. High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU. In 2017 46th International Conference on Parallel Processing (ICPP). IEEE, Bristol, UK, 101–110. https://doi.org/10.1109/ICPP.2017.19
[23] NVIDIA. 2019. The API reference guide for cuSPARSE, the CUDA sparse matrix library (v9.1 ed.). NVIDIA.
[24] Md. Mostofa A. Patwary, Nadathur R. Satish, Narayanan Sundaram, Jongsoo Park, Michael J. Anderson, Satya G. Vadlamudi, Dipankar Das, Sergey G. Pudov, Vadim O. Pirogov, and Pradeep Dubey. 2015. Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms. Springer International Publishing, Cham, 48–57.
[25] Ichitaro Yamazaki and Xiaoye S. Li. 2011. On Techniques to Improve Robustness and Scalability of a Parallel Hybrid Linear Solver. In Proceedings of the 9th International Conference on High Performance Computing for Computational Science (VECPAR'10). Springer-Verlag, Berlin, Heidelberg, 421–434. http://dl.acm.org/citation.cfm?id=1964238.1964281
[26] Raphael Yuster and Uri Zwick. 2004. Detecting Short Directed Cycles Using Rectangular Matrix Multiplication and Dynamic Programming. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '04). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 254–260. http://dl.acm.org/citation.cfm?id=982792.982828


A Artifact Description Appendix: Adaptive Sparse Matrix-Matrix Multiplication on the GPU

A.1 Abstract
The following appendix provides the information necessary to acquire the framework and rerun the experiments used to evaluate it.

A.2 Description

A.2.1 Check-list (artifact meta information)
• Data set: Matrix data set found in the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection)
• Hardware: Recent GPU hardware from NVIDIA, tested on an NVIDIA GTX 1080 Ti, NVIDIA GTX TITAN X Pascal and NVIDIA GTX TITAN Xp
• Output: The output is provided in the output stream as well as in a .csv file
• Experiment workflow: Run the runall.bat/.sh file to gather results for all matrices in a folder, or run individual matrices through the framework
• Experiment customization: The number of iterations used for the timing measurements can be altered. A CPU implementation can be enabled as well to confirm the results of the framework output
• Publicly available?: ACM DL

A.2.2 How software can be obtained (if available)
The framework is downloadable from ACM DL.

A.2.3 Hardware dependencies
Recent GPU hardware from NVIDIA, tested on an NVIDIA GTX 1080 Ti, NVIDIA GTX TITAN X Pascal and NVIDIA GTX TITAN Xp. The remaining system specifications should have a comparatively small impact on performance; performance was tested on an Intel Core i7-7700 paired with 32 GB RAM on Ubuntu 16.04 LTS.

A.2.4 Software dependencies
• CMake 3.2 or higher
• CUDA 9.1 / 9.2 / 10.0
• C++14 compliant compiler, tested on:
  – MSVC (Visual Studio 2017)
  – GCC 6 / GCC 7
• CUB v1.8.0

A.2.5 Datasets
The framework itself can parse Matrix Market Format (.mtx) files (matrix data set found in the SuiteSparse Matrix Collection, formerly the University of Florida Sparse Matrix Collection). Upon first parsing a matrix in this format, it also converts it into a binary format that is used for consecutive runs, which greatly reduces loading times. These files are stored with the .hicoo extension.
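The parse-once, cache-binary idea can be illustrated in a few lines. The Python sketch below uses SciPy's .npz container merely as a stand-in for the framework's .hicoo format; it is not the framework's own loader:

    import os
    import scipy.io
    import scipy.sparse

    def load_matrix(path):
        # Reuse a cached binary copy if one exists; otherwise parse the
        # (slow) Matrix Market text file once and write the cache.
        cache = path + ".npz"
        if os.path.exists(cache):
            return scipy.sparse.load_npz(cache)
        A = scipy.sparse.csr_matrix(scipy.io.mmread(path))
        scipy.sparse.save_npz(cache, A)
        return A

The first call pays the text-parsing cost; every subsequent run loads the binary cache directly, mirroring the loading-time reduction described above.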

A.3 Installation
On Linux, simply run the provided setup.sh script, which will clone CUB and set up and build the project. On Windows, download and extract CUB into the folder include/external, then create a build folder in the top directory and use CMake to set up the project to build.

A.4 Experiment workflow
The framework can be operated in one of two modes:
• Single Matrix
  – In this setup the framework can be run for a single matrix and optionally confirm the resulting output matrix by comparing it to a host-based solution
• Complete test run
  – A runall.bat/.sh file is provided that consecutively calls the framework with all matrices provided in a folder; this was used to run the large test cases

The output is provided in the output stream as well as in a separate .csv file which includes all important matrix stats (rows, columns, nnz, average nnz per row, etc.) as well as timing measurements. The script is set up such that each test run is done as a separate process, so that failed launches do not impede subsequent launches. The output of the script consists of timing measurements and, when Debug is enabled within the framework (the template instantiation must have this enabled), also detailed measurements as well as memory measurements.

For all matrices in the test set, the framework computes C = A · A (or C = A · Aᵀ if the input matrix is not square) and measures the time required for the multiplication procedure; only the input matrix A is provided directly on the device to the procedure, and the result matrix C is also returned as a device matrix. Conversion operators are provided to transfer matrices between host and device as well as to convert the COO format to CSR if required.
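As a CPU-side illustration of what the benchmark measures (the framework itself performs the multiplication on the device), the following Python sketch, assuming SciPy, forms C = A · A for square inputs and C = A · Aᵀ otherwise and averages the runtime over several iterations:

    import time
    import scipy.sparse

    def reference_run(A, iterations=10):
        # Mirror the benchmark on the CPU: square the matrix if it is
        # square, otherwise multiply by its transpose; average the time
        # over several iterations as the timing measurements do.
        A = scipy.sparse.csr_matrix(A)
        B = A if A.shape[0] == A.shape[1] else scipy.sparse.csr_matrix(A.T)
        start = time.perf_counter()
        for _ in range(iterations):
            C = A @ B
        elapsed = (time.perf_counter() - start) / iterations
        return C, elapsed

A result like this can also serve as the host-based solution against which the framework's output is optionally confirmed (cf. the CPU implementation mentioned in A.2.1).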

A.5 Evaluation and expected result
To replicate the results gathered in this paper, it suffices to run the runall.bat/.sh file on a folder containing matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) dataset. The following pages hold four plots (Figures 9–12) detailing the exact performance measurements for the complete test set compiled using the hardware described above.

A.6 Experiment customization
The number of iterations used for the timing measurements can be altered. A CPU implementation can be enabled as well to confirm the results of the framework output. Debug can be enabled to get more detailed timing/memory results.


[Figure 9. Marker plot for the complete test set using double precision for small matrices with average row length a < 42. X-axis: individual test matrices; y-axis: GFLOPS (double); series: AC-SpGEMM, cuSparse, bhSparse, RMerge, nsparse, Kokkos.]

[Figure 10. Marker plot for the complete test set using double precision for large matrices with average row length a ≥ 42. X-axis: individual test matrices; y-axis: GFLOPS (double); series: AC-SpGEMM, cuSparse, bhSparse, RMerge, nsparse, Kokkos.]

[Figure 11. Marker plot for the complete test set using single precision for small matrices with average row length a < 42. X-axis: individual test matrices; y-axis: GFLOPS (float); series: AC-SpGEMM, cuSparse, bhSparse, RMerge, nsparse, Kokkos.]

[Figure 12. Marker plot for the complete test set using single precision for large matrices with average row length a ≥ 42. X-axis: individual test matrices; y-axis: GFLOPS (float); series: AC-SpGEMM, cuSparse, bhSparse, RMerge, nsparse, Kokkos.]
