Adaptive Sparse Matrix-Matrix Multiplication on the GPU

Martin Winter, Graz University of Technology
Daniel Mlakar, Graz University of Technology
Rhaleb Zayer, Max Planck Institute for Informatics
Hans-Peter Seidel, Max Planck Institute for Informatics
Markus Steinberger, Graz University of Technology
Abstract
In the ongoing efforts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM) has received considerably less attention than sparse matrix-vector multiplication (SpMV). While both are equally important, this disparity can be attributed mainly to the additional formidable challenges raised by SpGEMM.

In this paper, we present a dynamic approach for addressing SpGEMM on the GPU. Our approach works directly on the standard compressed sparse rows (CSR) data format. In comparison to previous SpGEMM implementations, our approach guarantees a homogeneous, load-balanced access pattern to the first input matrix and improves memory access to the second input matrix. It adaptively re-purposes GPU threads during execution and maximizes the time efficient on-chip scratchpad memory can be used. Adhering to a completely deterministic scheduling pattern guarantees bit-stable results during repetitive execution, a property missing from other approaches. Evaluation on an extensive sparse matrix benchmark suggests that our approach is the fastest SpGEMM implementation for highly sparse matrices (80% of the set). When bit-stable results are sought, our approach is the fastest across the entire test set.
CCS Concepts • Theory of computation → Massively parallel algorithms; • Computing methodologies → Linear algebra algorithms; • Software and its engineering → Scheduling;
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
PPoPP'19, February 16–20, 2019, Washington, DC, USA
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6225-2/19/02. . . $15.00
https://doi.org/10.1145/3293883.3295701
Keywords SpGEMM, GPU, Sparse Matrix, Adaptive, ESC, bit-stable
1 Introduction
Generalized sparse matrix-matrix multiplication (SpGEMM) is one of the key kernels in scientific computing and data analytics, e.g., in algebraic multigrid solvers [5], Schur complement methods [25], betweenness centrality [6] and cycle detection [26]. Algorithmically, SpGEMM consists of building the output matrix C from the product of two sparse input matrices A and B, given by
C_{ij} = \sum_{k} A_{ik} \cdot B_{kj}, \qquad (1)
where k spans the colliding non-zeros of the i-th row of A and the j-th column of B. In the sequential setting, efficient treatment of SpGEMM dates back to the pioneering work of Gustavson [18]. As the computing landscape shifts towards ubiquitous parallelism, there is a pressing need for equivalently efficient approaches on modern many-core processors.
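Gustavson's method realizes Eq. 1 row by row: each non-zero A_ik scales row k of B into a sparse accumulator for row i of C. A minimal sequential reference sketch (our own illustration; Python dictionaries stand in for the classic SPA array):

```python
def spgemm_gustavson(A, B):
    """Row-wise SpGEMM. A and B map row -> {col: value}."""
    C = {}
    for i, row_a in A.items():
        acc = {}  # sparse accumulator for row i of C
        for k, a_ik in row_a.items():             # non-zeros A_ik
            for j, b_kj in B.get(k, {}).items():  # colliding non-zeros B_kj
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

# 2x2 example: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]]
A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {0: 4.0}, 1: {0: 5.0, 1: 6.0}}
C = spgemm_gustavson(A, B)
# C == {0: {0: 14.0, 1: 12.0}, 1: {0: 15.0, 1: 18.0}}
```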
Among existing many-core architectures, the graphics processing unit (GPU) is of particular interest. It has emerged as a viable and cost-effective co-processor in both single workstations and large supercomputing clusters. As the GPU is primarily designed for massively parallel, uniform execution and memory access, achieving good speedups on unstructured problems such as SpGEMM remains a challenge.
To provide context to the ensuing discussion, we assume matrices are given in the compressed sparse row (CSR) format, which is probably the most common format in use. Entries are sorted according to rows, and their values and column ids are explicitly stored. An additional row pointer array indicates the beginning of each row in the sorted arrays.
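As a concrete illustration (our own example, not from the paper), the CSR arrays for a small 3×3 matrix look as follows:

```python
# Dense matrix:
# [[10,  0,  0],
#  [ 0,  0, 20],
#  [30, 40,  0]]
values  = [10, 20, 30, 40]  # non-zeros, sorted by row
col_ids = [0, 2, 0, 1]      # column id of each non-zero
row_ptr = [0, 1, 2, 4]      # row i spans values[row_ptr[i]:row_ptr[i+1]]

# Iterating over row 2 via the row pointer array:
row2 = [(col_ids[k], values[k]) for k in range(row_ptr[2], row_ptr[3])]
# row2 == [(0, 30), (1, 40)]
```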
Challenges The challenges of SpGEMM on the GPU stem from multiple factors. First, the number of entries in each row of A and B may vary strongly. This disparity complicates load balancing, as threads may easily receive vastly different workloads, which is especially difficult to manage on single instruction, multiple data (SIMD) devices like GPUs.
Second, there is no way to predict the number of intermediate products (A_ik · B_kj) without inspecting the matrices. This makes it difficult to perform intermediate computations within efficient on-chip memory, as it may easily overflow.
Third, memory access patterns are paramount on the GPU. Due to the nature of SpGEMM, the sparsity pattern and content of both matrices A and B determine the memory access pattern throughout the computations.
Strategies Without loss of generality, the landscape of GPU SpGEMM is dominated by the following strategies:
• ESC: explicitly expand all temporary products, sort them and combine them to yield the final matrix [5, 7–9].
• Hashing: merge temporary products either in scratchpad memory or globally using hashing [3, 22, 23].
• Merging and Hybrid: choose a fitting method for each row [17, 19, 20].

Most of these approaches are designed with only one or two of the challenges discussed earlier in mind. The ESC strategy achieves excellent load balancing at the cost of high intermediate memory. In fact, in its original form all intermediate products go through slow global GPU memory. Similarly, hashing is notoriously slow in global memory. Operating (partially) in scratchpad memory can increase the performance of both approaches. Relying on smart global scheduling [7, 22], overflow of scratchpad memory can be avoided. However, this entails a complete matrix inspection (which can consume up to 30% of the runtime; cf. [22], Fig. 6). One major drawback of hashing is that the accumulation order depends on the hardware scheduler and thus might yield different floating point errors during each run.

Merge-based approaches and hybrids focus on preprocessing and row-based scheduling. This may lead to a large number of temporary matrices and significant preprocessing effort. Furthermore, memory access patterns may deteriorate by switching strategies. Again, for this kind of scheduling the matrices need to be inspected fully.
Contribution We propose a new GPU SpGEMM algorithm that tackles the challenges outlined above in a comprehensive way and at the same time cuts down on overhead computations throughout all stages. To this end, we make the following contributions:
• an efficient global load balancing scheme based solely on the non-zeros of A. It avoids costly inspection of the involved matrices and achieves entirely uniform load balancing in practice.
• a dynamic work distribution which achieves perfect local load balancing within a block of threads throughout multiple iterations of ESC, mixing partial results with new input.
• an efficient local adaption of the ESC algorithm which operates within scratchpad memory and allows combining temporary results to a complete subset of C.
• chunk-based storage of partial results of C.
• an efficient, adaptive merge algorithm for partial results.
• an adaptive approach to handle long rows of B efficiently.
2 Related work
In the sequential treatment of the problem, the output matrix is filled one row at a time by means of a large bookkeeping array, aka sparse accumulator (SPA) [10, 15, 18]. As the sparsity pattern of C is not known prior to execution, a preliminary pass for memory allocation counts the number of non-zeros of C and a secondary pass computes the entries.

Over the years, many strategies have been proposed to adapt SpGEMM to modern parallel architectures. They vary mainly in the way the operations in Eq. 1 are partitioned among A, B, and C. The classification by Ballard et al. [4] offers an elegant dimensionality-based interpretation of strategies by mapping operations to the cube A × B × C. Within this classification, problems are amenable to solving the underlying hypergraph partitioning. However, the more elaborate the partition, the more preprocessing effort is needed.

The multi-threaded approach by Patwary et al. [24] reduces cache misses by blocking accesses to the SPA. It searches for adequate block partitioning of the columns of B and fills individual blocks of C independently. In the same spirit, Akbudak and Aykanat [1] rely on row-wise partitioning of A to exploit locality in accessing the rows of B.
The approach by Bell et al. [5], implemented in CUSP [8], breaks the processing into expansion, sorting, and compression (ESC). This translates into generating a list of intermediate products which are sorted by row and column index and falls within the 3D category. The output is generated by contracting entries with the same indices. Optimizations take into account the sparsity patterns of A and B to improve the row-wise processing of C [9]. Another variant performs ESC locally before merging the results globally [7] at the cost of increased load balancing effort. While we also perform ESC locally, the major advantage of our approach is that we use dynamic scheduling to perform multiple ESC iterations before going to global memory, considerably reducing memory bandwidth, global sorting and compaction costs.

Adopting a partial 1D row-wise strategy, rows from B can directly be merged, even on the GPU [16]. The RMerge approach [17] optimizes this merge strategy by ensuring operations are completed in efficient memory. This constraint is enforced by splitting B into multiple matrices with limited row length and iteratively computing the product of these matrices from right to left. In the bhSparse approach [20], rows are grouped by the number of intermediate products and then a merge-based strategy is adaptively selected based on the number of intermediate products. They also compare against a CPU implementation based on Intel MKL and show average speedups of 2.5/2.2 for single/double precision. In a similar spirit, the approach by Kunchum et al. [19] selects different strategies depending on the row structure.

SpGEMM can also be addressed using hash tables. An early approach [12] keeps a primary hash table in scratchpad memory and a secondary one in global memory. An implementation of this approach is used within cuSparse [23]. Deveci et al. [13] also use a dual hash table approach, whereas global hash tables are only used temporarily and are reclaimed. The BalancedHash approach [3] restricts itself to local hash tables and avoids overflows using "better size estimates" [2]. nsparse [22] follows these approaches and addresses the memory problem by grouping rows based on intermediate products and can thus construct hash tables with different sizes. Deveci et al. [14] build on their previous work [13]. In particular, they combine partitioning for hierarchical parallelism with the use of a two-level hash data structure.

One downside of hash-based approaches is their non-deterministic compaction order, leading to different floating point errors during each run.
3 Adaptive SpGEMM
Our adaptive chunk-based GPU SpGEMM approach (AC-SpGEMM) focuses on four major goals:
1. performing computations in local on-chip memory
2. coherent memory access
3. independence of row lengths
4. ensuring deterministic results

To achieve these goals, AC-SpGEMM follows a four stage approach, as outlined in Figure 2. In the first stage, AC-SpGEMM prepares data for global load balancing. In the second stage, we perform chunk-based ESC, producing deterministically bit-stable results. With the help of our local work distribution, this stage performs multiple iterations of ESC, fully utilizing local memory resources while keeping temporary results in scratchpad memory. Merging of rows shared across chunks happens in the third stage. Finally, in the fourth stage, we allocate the output matrix C and fill it with data from the chunks.
Figure 1. Average non-zeros per row for the SuiteSparse matrix collection, min and max overlaid and clamped.
Figure 2. Our approach has four consecutive stages: global load balancing follows a strict non-zero splitting of A; our AC-ESC step performs the major part of the SpGEMM, computing entire chunks of C; merge combines the rows shared across chunks before the data is copied to C.
Before discussing the details of each stage, we motivate some design choices. Analysing the row length of matrices from the SuiteSparse matrix collection [11], it can be observed that the majority of matrices in common problem domains have average row lengths of less than 200 elements, cf. Figure 1. Considering register sizes of current GPUs and reasonably small thread block sizes, up to 4000 temporary elements can be held by each block. If there is reasonable overlap between rows, the resulting output can be stored in scratchpad memory for another iteration of ESC. Given 200 entries per row, ideally another 3800 temporary elements can be loaded and compacted. In the best case, these local load and compaction steps continue until yielding completed rows of C, without ever going through slow global memory.
3.1 Global Load Balancing
Previous SpGEMM approaches adopt one of two strategies for load balancing. (1) They bin the rows of A according to their lengths [20] and choose an appropriate algorithm for each category. This strategy may pull apart sequential rows in A, severing memory access patterns to A and C. (2) They analyze the number of temporary products that will be produced and distribute them uniformly [3, 7, 22]. This approach needs to analyse the entire temporary data and write identifiers to global memory. The relative cost of load balancing increases with the sparsity of the input matrices. According to our analysis and the cost breakdown given by Nagasaka et al. [22], load balancing can consume up to 30% of the overall runtime for very sparse matrices.

To tackle the drawbacks of both strategies, we propose a simple global load balancing scheme and defer the fine-grained control to the next stage. Our global load balancing splits the non-zeros of A uniformly, assigning the same number to each thread block. In this way, memory requirements from A are static. However, the actual workload for each block varies based on the intermediate products generated.

While a static splitting of A's non-zeros does not require processing, the second stage still needs to associate each entry from A with a row. To provide this information, global load balancing in parallel checks A's row pointers and writes the required information into an auxiliary array, as outlined in Algorithm 1. The cost of this step is negligible compared to enumerating temporary products.

Algorithm 1: Global Load Balancing
1  a ← row_ptr[tid]
2  b ← row_ptr[tid + 1]
3  blocka ← divup(a, NNZ_PER_BLOCK)
4  blockb ← (b − 1) / NNZ_PER_BLOCK
5  while blocka ≤ blockb do
6      blockRowStarts[blocka] ← tid
7      blocka ← blocka + 1
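A sequential model of Algorithm 1 may help (one loop iteration stands in for one GPU thread tid; the serial form is our own, names follow the pseudocode):

```python
def global_load_balance(row_ptr, nnz_per_block):
    """For each block of nnz_per_block non-zeros, record the row of A
    that contains the block's first non-zero."""
    num_rows = len(row_ptr) - 1
    num_blocks = -(-row_ptr[-1] // nnz_per_block)  # ceil division
    block_row_starts = [0] * num_blocks
    for tid in range(num_rows):                    # one GPU thread per row
        a, b = row_ptr[tid], row_ptr[tid + 1]
        block_a = -(-a // nnz_per_block)           # divup(a, NNZ_PER_BLOCK)
        block_b = (b - 1) // nnz_per_block
        # only the row containing a block's first non-zero writes it
        for blk in range(block_a, block_b + 1):
            block_row_starts[blk] = tid
    return block_row_starts

# rows with 2, 4 and 2 non-zeros, 4 non-zeros per block:
starts = global_load_balance([0, 2, 6, 8], nnz_per_block=4)
# starts == [0, 1]: block 0 starts in row 0, block 1 (non-zeros 4..7) in row 1
```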
3.2 Adaptive Chunk-based ESC
Each thread block executing the second stage is assigned an equally sized portion of A and performs SpGEMM with B. Depending on the sparsity pattern, it can produce any number of output chunks, each representing a partial result of C. We propose an adaption of ESC due to its desirable properties for performing SpGEMM. First, after the expansion of the temporary products, every thread performs identical work independent of which row the data comes from. Second, sort can efficiently be implemented within a block, using radix sort [21]. Third, a stable sort algorithm always yields identical floating point results, free of the problematic scheduling-based effects encountered when using hashing.

The downside of ESC is that sorting intermediate data is more costly than sorting the output data. However, dealing with only a chunk of data at once and keeping all data local, the cost is significantly lower than sorting all temporary elements of the entire product A · B. While other local ESC approaches have been proposed before [7, 9, 20], they operate on individual rows or a fixed number of temporary products and then always go to global memory. Our approach completely ignores row boundaries and performs multiple iterations of ESC locally, dynamically splitting off completed rows. We only move to global memory after completion or when data cannot be compacted sufficiently.
3.2.1 Fetch A
The first step in AC-ESC fetches the non-zeros and column ids of A required by the thread block, using a coalesced read pattern. We store this data in scratchpad memory to be available throughout the entire stage. While coalesced loading of A is less important for denser matrices where the loads from B dominate, it is important for very sparse matrices where loads from B are similar in count. In addition, the row ids for all non-zeros in A are needed. As NNZ_PER_BLOCK is constant, we can reduce the bit length of these row ids by locally remapping them: using a dictionary, we use the index of the first non-zero in that row as the local row id.
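The remapping itself can be sketched as follows (our own serial illustration; function name hypothetical):

```python
def remap_row_ids(global_row_ids):
    """Replace each global row id by the block-local index of the first
    non-zero belonging to that row. Since a block holds at most
    NNZ_PER_BLOCK entries, the remapped ids need only
    log2(NNZ_PER_BLOCK) bits."""
    first_index = {}   # the dictionary: global row id -> local row id
    local_ids = []
    for idx, row in enumerate(global_row_ids):
        first_index.setdefault(row, idx)
        local_ids.append(first_index[row])
    return local_ids, first_index

# global rows of the six non-zeros assigned to one block:
local_ids, dictionary = remap_row_ids([100, 100, 257, 257, 257, 900])
# local_ids == [0, 0, 2, 2, 2, 5]
```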
3.2.2 Local Work Distribution
The most important component in AC-ESC is the local work distribution. As the number of intermediate products processed by a block of threads varies, we have to dynamically decide which elements to load to exactly fill up the available resources. While inspecting B for global load balancing is costly, inspecting B now comes with little additional cost as it needs to be accessed anyway. Thus, after loading each column index of A, we retrieve the number of elements to be fetched from B for every entry in A.

To determine which elements to load, we propose an expanding work distribution that dynamically supplies local ESC with data. The work distribution offers three methods: placework, size and receivework, outlined in Algorithm 2. placework receives the number of temporary products that will be generated for each entry in A. A prefix sum over that data yields the expansion of temporary products up to a specific entry in A. size queries the overall sum, i.e., how many elements are still to be processed. receivework can be called with a desired number of elements to be drawn from the work distribution (Consume). It can deliver multiple elements (N) to each thread. We typically use 8.
To perform the work assignment, we determine in parallel the offset of the first temporary product of each row from A (lines 17-19). Marking this product (line 20) and performing a max prefix scan over the data (line 24) assigns the corresponding row id (A_res) to all Consume temporary products. To ensure data is loaded in a coalesced manner, we interleave the temporary products between threads, which is commonly known as moving from a blocked layout to a striped layout (line 25). Comparing the output id (c) and the cumulative sum up to the first element of the respective row (WDState[A_res[i]]), the local offset in the row can be computed.
Instead of using the local offset directly, we reverse the order using the offset from the next row, WDState[A_res[i] + 1], and count down (line 29). In this way, if the work distribution splits a row in B, we take entries from the end of the row and thus simply act like the row is shorter in the next iteration of ESC. Finally, we reduce all cumulative counts by the number of elements drawn from the work distribution (line 36) and return identifiers for all temporary products (A_res, B_res). The only state the work distribution needs to keep is WDState. As AC-ESC might run out of memory during chunk allocation, it needs to be able to continue after an allocation round trip to the host. Thus, the work distribution also needs to support restarts. In case of a restart, we simply store the number of already consumed elements to global memory. When the block continues, we initialize the work distribution as usual, but immediately reduce the workload accordingly, i.e., we execute line 36 with the stored count.

Algorithm 2: Local Work Distribution
1  ScratchPad WDState[NNZ_PER_BLOCK]
2  Function placework(elements[NNZ_PER_THREAD])
3      blockPrefixSumIncl(elements, elements)
4      for i ← 0 to NNZ_PER_THREAD do
5          WDState[tid · THREADS + i + 1] ← elements[i]
6      WDState[0] ← 0
7      syncThreads()
8  Function size()
9      return WDState[NNZ_PER_BLOCK]
10 Function receivework(N, Consume)
11     ScratchPad Offsets[N · THREADS]
12     A_res[N]
13     B_res[N]
14     clear(Offsets)
15     syncThreads()
16     for i ← 0 to NNZ_PER_THREAD do
17         a ← WDState[i · THREADS + tid]
18         an ← WDState[i · THREADS + tid + 1]
19         if a < Consume and a ≠ an then
20             Offsets[a] ← i · THREADS + tid
21     syncThreads()
22     for i ← 0 to N do
23         A_res[i] ← Offsets[N · tid + i]
24     blockMaxScanIncl(A_res, A_res)
25     blockedToStriped(A_res, A_res)
26     for i ← 0 to N do
27         c ← tid + i · THREADS
28         if c < Consume then
29             B_res[i] ← WDState[A_res[i] + 1] − c − 1
30         else
31             A_res[i] ← ∅
32             B_res[i] ← ∅
33     syncThreads()
34     for i ← 0 to NNZ_PER_THREAD do
35         j ← tid + i · THREADS + 1
36         WDState[j] ← max(0, WDState[j] − Consume)
37     syncThreads()
38     return A_res, B_res
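A heavily simplified serial model may clarify the interplay of placework, size and receivework (our own simplification: no thread blocks, no blocked-to-striped shuffle, one output id per loop step):

```python
class WorkDistribution:
    """Serial sketch of Algorithm 2's work distribution."""

    def __init__(self, elements):      # placework: per-A-entry row lengths
        self.state = [0]
        for e in elements:             # inclusive prefix sum -> WDState
            self.state.append(self.state[-1] + e)

    def size(self):                    # elements still to be processed
        return self.state[-1]

    def receivework(self, consume):
        """Return (entry-of-A, offset-into-row-of-B) pairs, drawing row
        entries from the back so a split row simply looks shorter in
        the next iteration."""
        out = []
        for c in range(min(consume, self.size())):
            # A entry whose prefix range contains output id c
            a = max(i for i in range(len(self.state) - 1)
                    if self.state[i] <= c)
            out.append((a, self.state[a + 1] - c - 1))  # count down
        # drain consumed elements (line 36 of Algorithm 2)
        self.state = [max(0, s - consume) for s in self.state]
        return out

wd = WorkDistribution([2, 3])  # two A entries referencing rows of length 2 and 3
first = wd.receivework(4)      # [(0, 1), (0, 0), (1, 2), (1, 1)]
rest = wd.receivework(4)       # [(1, 0)] — the split row's remaining front entry
```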
Figure 3. Example of our work distribution-driven local ESC: (a) After global load balancing over A's non-zeros (colors), we fill the work distribution with the row lengths each entry from A references in B. (b) Taking 10 elements from the work distribution, we expand, sort and compact. (c) A first chunk for row id 2 is split off to global memory; the remaining 3 elements are kept locally for another iteration. (d) 7 elements are drawn from the work distribution. (e) All elements are kept for another iteration. (f) 5 elements are needed. (g) A complete chunk for row 3 is produced.
3.2.3 Local ESC
Driven by the work distribution, we perform multiple iterations of local ESC, as indicated in Figure 3. Using the output from the work distribution, every thread loads its assigned element from B and multiplies it with the previously loaded value from A to complete the expansion. This yields coalesced memory access for elements of the same row in B. Of course, as different rows from B might be loaded, depending on the column ids of A, the exact memory access pattern still depends on the input data.
After the expansion, the intermediate products are moved into radix sort, using their row and column id for sorting. The runtime of radix sort is proportional to the bit length being sorted, and thus reducing the sort bit length is important for ESC [9]. While previous work followed a static approach to bit reduction, our approach is more aggressive and completely dynamic: As mentioned earlier, we bound the maximum range of row ids using a dictionary. Unfortunately, the same approach is not applicable to column ids, as it would require knowledge of all unique column ids. However, we can bound the range by tracking the minimum and maximum id for all entries we fetch from B and thus reduce the number of bits. We do the same for the row ids on top of the dictionary, further reducing their bit range. Thus, the sorting effort adapts to the input data present.

The reduction of the number of sorted bits not only reduces the sorting effort, but also the register requirements. Keeping in mind that the row ids in the worst case require log2(NNZ_PER_BLOCK) bits and the column id of B is limited by B's dimension, we choose a 32 bit or 64 bit integer. For example, for a block size of 256 threads and 2 NNZ_PER_THREAD, we need 9 bits; thus 32 bit integers are sufficient for matrices up to 8.4 million columns (23 bits).
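The dynamic bit reduction can be sketched as follows (our own illustration; it packs offset-reduced row and column ids into a single sort key):

```python
def pack_keys(row_ids, col_ids):
    """Pack (row, column) sort keys into the fewest bits that cover the
    ids actually present, by subtracting the observed minima."""
    row_base, col_base = min(row_ids), min(col_ids)
    row_bits = (max(row_ids) - row_base).bit_length()
    col_bits = (max(col_ids) - col_base).bit_length()
    keys = [((r - row_base) << col_bits) | (c - col_base)
            for r, c in zip(row_ids, col_ids)]
    return keys, row_bits + col_bits

keys, total_bits = pack_keys([5, 5, 6], [1000, 1003, 1001])
# column range 0..3 needs 2 bits, row range 0..1 needs 1 bit -> 3 bits total,
# instead of the full row and column id widths
```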
In the compaction step, we are not only interested in combining the data, but also in the number of entries in every row and how to write the chunk to memory. We perform all three tasks within a single prefix scan using special operators and state. At first, we determine whether neighbouring elements have the same sorting key, i.e., should be combined, and whether their row id bits match, i.e., they are from the same row. Every thread can trivially perform this operation for all elements it holds in registers.
For cross-thread communication, we use scratchpad memory. We then encode both facts as individual bits in a 32 bit integer (row ends as the 17th bit (orange) in Algorithm 3; the end of a combine sequence using the first bit (purple)). We split the remaining 30 bits in half to count the elements in each row (green) and the overall compacted elements (blue). For each element that ends a combine sequence, we initialize each of the 15 bit counters to 1.
The scan uses the state information to decide whether to reset the accumulation and/or counting bits, as outlined in Algorithm 3.
Algorithm 3: Compaction Scan Operator
// initial states for the scan operation:
// end row   0b0000 0000 0000 0011 0000 0000 0000 0011
// end comp  0b0000 0000 0000 0010 0000 0000 0000 0011
// none      0b0000 0000 0000 0000 0000 0000 0000 0000
1  Function CombineScanOperator(a, b)
2      if equalRow(a_key, b_key) then
3          state ← a_state & 0xFFFE
4      else
5          state ← a_state & 0xFFFEFFFE
6      if a_key = b_key then
7          n_value ← a_value + b_value
8      else
9          n_value ← b_value
10     n_key ← b_key
11     n_state ← state + b_state
12     return n
After completion of the scan, the state bits can still be queried to identify the compacted elements as well as row ends. Additionally, the other bits hold information about each element's position in the chunk as well as the local offset in the row. Using the row counts, we update the global counter for each row, which will serve later on for computing the row pointer array and the memory requirements of C.
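The net effect of the compaction scan can be modeled serially: one pass over the sorted intermediate products simultaneously combines duplicate keys, counts entries per row, and assigns each surviving element its position in the chunk (a functional model of the scan's outcome, not of the bit-packed operator itself):

```python
def compact(sorted_products):
    """sorted_products: list of (row, col, value), sorted by (row, col).
    Returns combined entries with chunk offsets, plus per-row counts."""
    combined = []       # (row, col, value, offset_in_chunk)
    row_counts = {}
    for row, col, val in sorted_products:
        if combined and combined[-1][0] == row and combined[-1][1] == col:
            r, c, v, off = combined[-1]
            combined[-1] = (r, c, v + val, off)   # same key: accumulate
        else:
            combined.append((row, col, val, len(combined)))
            row_counts[row] = row_counts.get(row, 0) + 1
    return combined, row_counts

out, counts = compact([(0, 1, 2.0), (0, 1, 3.0), (0, 4, 1.0), (2, 0, 7.0)])
# out == [(0, 1, 5.0, 0), (0, 4, 1.0, 1), (2, 0, 7.0, 2)]
# counts == {0: 2, 2: 1}
```

Because the input is processed in sorted order, the accumulation order is fixed, which mirrors how the deterministic scan yields bit-stable floating point results.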
Previous approaches to local ESC assign the exact number of temporary elements that can be handled to a block [7] and go to global memory after ESC. The resulting data, however, may still need to be combined with temporary results from multiple other blocks. If we directly wrote our results to chunks in global memory, we would face the same issue. Thanks to the flexibility of our work distribution, we keep the results around for another iteration of ESC, combining the temporary results with new data from B.

By keeping the temporary results for the next iteration of ESC and reducing the number of elements drawn from the work distribution accordingly, we reduce the global memory traffic significantly. While avoiding additional chunks is reasonable, keeping elements from multiple output rows for the next iteration is not, as all but the last row are already completed. Conversely, it does not make sense to have a few elements of a new row at the end of a chunk, as these will require merging in a later stage. Thus, we only keep the last row for the next round and write other rows to a global chunk.
3.2.4 Chunk Management
When we decide to write a chunk of C to global memory, a thread of the block uses the compaction result to compute the chunk size and allocates a chunk from the pool. To this end, it increments an atomic counter by the chunk size. To write both the column ids and values to the chunk, we take a compacting round trip through scratchpad memory to ensure coalesced writes to global memory.

Figure 4. The chunk lists are constructed as per-row linked lists, while the shared rows tracker holds per-row chunk counts if merging is required for a certain row.

When a chunk is generated, we update the restart information for this thread block. If a complete chunk is written, information about how many elements have been drained from the work distribution is sufficient. If the last row is kept for the next iteration, we write the row id as restart information. The next time the block is launched, it can then completely ignore these rows when loading data from A.

In addition to the data needed for C, each chunk holds its starting row, element count and the number of elements in the first and last row. To perform chunk copy in parallel in the final stage, we keep an array of pointers to all chunks. Using a full pointer allows chunks to be placed in memory arbitrarily. Thus, expanding the chunk pool is as easy as adding another memory region to be used as pool.
Finally, we require information about which chunks hold data for the same row to start merging. To identify all chunks that contribute to one output row, we construct a linked list of chunk pointers for every row, with a list head being available for every row, as shown in Figure 4. The list insertion uses an atomic exchange operation on the list heads, making the list order dependent on the hardware scheduler. To ensure bit-stable results, we further store a chunk identifier built from the block id and a per-block running chunk number, which yields a global ordering of chunks. To increase the efficiency of chunk merging, we use one bit of the list heads to indicate whether there is only a single chunk in the list. Upon adding a second chunk to the list, we insert the row id into an array holding all rows with two or more chunks. This array serves as the basis for identifying how many merge blocks need to be started in the next stage.
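A serial sketch of this chunk tracking (our own model; nested tuples stand in for the pointer-based list nodes, and the insertion order here is fixed, whereas on the GPU it depends on the scheduler and is later re-established via the global chunk order):

```python
def track_chunks(chunk_rows, num_rows):
    """chunk_rows[i] is the row of C that chunk i contributes to.
    Builds per-row linked lists of chunk ids and the shared-rows array."""
    heads = [None] * num_rows   # per-row list head
    counts = [0] * num_rows
    shared_rows = []            # rows covered by two or more chunks
    for chunk_id, row in enumerate(chunk_rows):
        heads[row] = (chunk_id, heads[row])   # exchange the list head
        counts[row] += 1
        if counts[row] == 2:                  # second chunk for this row
            shared_rows.append(row)
    return heads, shared_rows

heads, shared = track_chunks([0, 2, 0, 0, 3], num_rows=4)
# row 0 received chunks 0, 2 and 3 -> list (3, (2, (0, None))); shared == [0]
```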
3.3 Chunk Merging
After AC-ESC, the merge stage combines rows shared between chunks to generate the final result. To guarantee a deterministic merge order, we perform an initial sort of the chunks based on their global chunk order. This sort is negligible compared to the merge itself. As any number of elements may need to be merged, one could launch a single block for each shared row. However, a shared row is typically covered by two chunks, i.e., because global load balancing splits the entries of one row across two blocks. In this case, the number of entries in both chunks may be low, potentially wasting resources when using a complete block.

Similarly to AC-ESC, we want to work on multiple rows to fully use the available resources. To this end, we run a prefix scan over all shared rows, using the row count that we summed up atomically for all rows during AC-ESC. For shared rows, this count represents the number of remaining intermediate products.

Throughout the scan we combine row range identifiers, if the sum of their respective elements does not overflow the number of elements we can handle in one block. This approach may not fill up blocks completely, but yields significantly better resource usage than a one-block-per-row strategy. When launching these Multi Merge blocks, we once again build on our work distribution and execute the remaining steps of our AC-ESC, creating new chunks for all merged rows. At the same time, we set the row count for each shared row to the correct value after merging.

The scope of Multi Merge is limited to rows that were split over two chunks. We therefore propose two additional algorithms to deal with rows yielding more than two chunks: Path Merge and Search Merge. The former is applicable up to a predefined number of chunks, while the latter can handle an arbitrary number. We decide which algorithm to use based on the length of each row's chunk list.

Search Merge uses binary search sampling in all chunk column ids to find overlapping ranges that can be handled at once. At first, we compute the minimum and maximum column id over all involved chunks. Then, we uniformly sample this range with a stride of (max − min)/THREADS, assigning one thread to each sample. Using binary search, every thread finds the next higher column id in all chunks and computes the sum over all elements below it across all chunks. The thread with the largest sum that still fits into the available resources delivers the data to be merged. Using AC-ESC on that data yields the first part of the merged row. Reducing the count of all samples by the number of consumed elements yields the next cut, and so on. In case the sampling is too coarse, we sub-sample the range.
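The cut selection of Search Merge can be illustrated sequentially; on the GPU, one thread evaluates each sample and the binary searches run in parallel. `search_merge_cut` and its parameters are hypothetical names for this sketch, assuming each chunk's column ids are sorted:

```python
import bisect

def search_merge_cut(chunks, capacity, samples=32):
    """Uniformly sample the [min, max] column-id range of all chunks.
    For each sampled cut, binary-search how many elements across all
    sorted chunks lie at or below it; return the cut with the largest
    element count that still fits into `capacity`."""
    lo = min(c[0] for c in chunks)
    hi = max(c[-1] for c in chunks)
    best_cut, best_count = lo, 0
    for s in range(1, samples + 1):
        cut = lo + (hi - lo) * s // samples
        count = sum(bisect.bisect_right(c, cut) for c in chunks)
        if count <= capacity and count > best_count:
            best_cut, best_count = cut, count
    return best_cut, best_count
```

Repeating this after subtracting the consumed elements yields the next cut, mirroring the iteration described above.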
PPoPP’19, February 16–20, 2019, Washington, DC, USA Winter, Mlakar, Zayer, Seidel, and Steinberger
[Figure 5 plots: GFLOPS (float and double panels) over the number of temporary elements for AC-SpGEMM, cuSparse, bhSparse, RMerge, nsparse, and Kokkos.]

Figure 5. Trend line of the SpGEMM performance for all tested methods over highly sparse matrices (average row length ≤ 42) from the SuiteSparse collection. Line thickness indicates variance.
Path Merge avoids global memory binary search by placing samples uniformly over the entries of every chunk. For each sample we fetch the column id and sort them across the entire block, while carrying the sample number along with the sort. Next, we perform a custom scan over the sorted data to find the correspondences between samples from different chunks, i.e., identify possible paths through all chunks.

As every sample originally only holds the sample number from its own chunk, we set the other counts to zero, using only as many bits as necessary to represent each sample number. The max scan over these individual bit ranges delivers the combined merge path, i.e., the matching cut through each chunk. For each path, we compute the number of temporary elements from the combined sample locations and chunk sizes. Choosing the one that fits into memory, we run AC-ESC. The stored paths are again used for the next iteration.

All three approaches produce new chunks, and thus must support restarts. In Multi Merge, a restart simply starts from scratch, as it has only one iteration. On the other hand, both Search Merge and Path Merge store the last used sample, and therefore a restart therein equals sampling a reduced range.
3.4 Long Rows
Processing long rows that exceed the available local resources during AC-ESC would load and sort them without any change and write them to the chunk pool. To avoid these unnecessary computations, we identify long rows during Fetch A and immediately create a chunk that only points to the data in B and attach the factor from A.

By removing the row from the work distribution, we avoid processing it during AC-ESC, but have to make slight modifications to this stage. If the long row has to be merged with values otherwise placed in the middle of a chunk, we break up chunks at the position of the long row to allow for merging afterwards in the merge stage.
3.5 Output Matrix and Chunk Copy
Once all chunks have been finalized, generating the final result is straightforward: a device-wide prefix sum over the row counts yields the row pointer array and C's memory requirement for allocation of the values and column id arrays. Then, in parallel, we iterate over all chunks and copy their data to the newly allocated C. Each chunk uses a complete block of threads to copy data in a coalesced fashion.
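The row-pointer construction can be sketched as follows, using a sequential prefix sum as a stand-in for the device-wide scan:

```python
from itertools import accumulate

def row_pointers(row_counts):
    """Exclusive prefix sum over per-row non-zero counts yields the CSR
    row-pointer array; the last entry is nnz(C), i.e. the allocation
    size for the value and column-id arrays."""
    return [0] + list(accumulate(row_counts))

ptr = row_pointers([2, 0, 3, 1])
# ptr -> [0, 2, 2, 5, 6]; nnz(C) = ptr[-1] = 6
```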
4 Evaluation
To provide a realistic assessment of the performance of our approach, we benchmarked the entire SuiteSparse matrix collection [11], which contains more than 1800 unique matrices of non-trivial size (≥ 10^4 NNZ) from various application domains with different matrix characteristics. Very small matrices (< 10^4 NNZ) are excluded, as they do not provide sufficient parallelism for execution on the GPU and thus CPU implementations are typically faster. From about 10^4 NNZ upwards, our approach outperforms state-of-the-art CPU implementations [14] on a consumer-grade CPU of similar cost (Intel Xeon E5-2630 with 16 GB of memory). We compare our approach to cuSparse [23], bhSparse [20], RMerge [17], nsparse [22], and Kokkos [14]. All approaches work directly with CSR and were compiled with CUDA Toolkit 10.0. We compute A · A for square matrices and A · AT for non-square matrices, where we precompute AT. As test platform we use an Intel i7 7700 CPU at 3.60 GHz and an NVIDIA Titan Xp (compute capability 6.1).
Our algorithm is written in CUDA and uses a block size of 256/512 non-zeros for global load balancing, sorts 8 elements per thread, and keeps up to 4 elements per thread from one iteration to the next. We use conservative memory estimates for all helper data structures. For the initial chunk pool, we rely on a simplistic memory estimate S of C, using the average row length as a measure of row overlaps, i.e., we pretend matrices have the same number of uniformly distributed elements in each row. More precisely, for A of size nA × mA, the average row length is given by a = |A|/nA, where |A| indicates the number of non-zeros, and the estimated probability for a collision is pa = a/mA. For the product AB, with b and pb defined analogously for B, the memory estimate is given as S ≈ nA · b · (1 − (1 − pb)^a)/pb. We multiply this factor by 1.2 to account for the chunk metadata and for divergences from the average row length, and apply a lower bound of 100 MB. In case the estimate is not sufficient, the restart functionality will increase the chunk pool.

             |  ≈1500 highly sparse matrices (a ≤ 42)  |  ≈300 denser matrices (42 < a)
             |  min    max     h. mean  better  best   |  min    max     h. mean  better  best

float
  cuSparse†  |  0.78   614.17  4.01     0%      0%     |  0.55   674.51  1.66     6%      2%
  bhSparse   |  1.00   96.60   5.56     0%      0%     |  0.71   429.04  1.77     9%      0%
  RMerge     |  0.60   23.50   3.73     0%      0%     |  0.34   10.21   2.00     3%      0%
  nsparse†   |  0.26   76.80   2.29     3%      3%     |  0.19   33.02   0.76     66%     60%
  Kokkos†    |  0.48   767.71  6.27     1%      1%     |  0.34   148.27  1.17     45%     7%
  AC-SpGEMM  |  -      -       -        -       96%    |  -      -       -        -       31%

double
  cuSparse†  |  0.51   470.44  3.78     0%      0%     |  0.62   608.74  1.50     10%     3%
  bhSparse   |  1.00   88.60   5.12     0%      0%     |  0.71   387.04  1.56     17%     0%
  RMerge     |  0.45   24.90   3.70     0%      0%     |  0.40   9.50    1.81     5%      1%
  nsparse†   |  0.31   43.06   2.01     6%      5%     |  0.18   28.83   0.67     70%     61%
  Kokkos†    |  0.37   582.87  5.09     2%      1%     |  0.30   120.58  0.92     53%     10%
  AC-SpGEMM  |  -      -       -        -       94%    |  -      -       -        -       26%

Table 1. Relative speedup of AC-SpGEMM over competing approaches (min/max/h. mean) and percentage where the approaches achieved better performance than AC-SpGEMM / achieved the best performance. AC-SpGEMM dominates the performance for highly sparse matrices and still achieves the best performance for about 1/3 of denser matrices. † does not produce bit-stable results.
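Under the stated uniformity assumption, the chunk-pool estimate can be sketched as below; the function name and the 12-byte per-entry size are illustrative assumptions, not figures from the paper:

```python
def chunk_pool_estimate(nnz_A, rows_A, nnz_B, rows_B, cols_B,
                        safety=1.2, lower_bound=100 * 2**20):
    """Estimate S of nnz(C) for C = A*B under the uniform-row model:
    a = |A|/n_A, b = |B|/n_B, p_b = b/m_B,
    S ~= n_A * b * (1 - (1 - p_b)**a) / p_b.
    Returns a byte budget: S scaled by `safety`, floored at 100 MB."""
    a = nnz_A / rows_A
    b = nnz_B / rows_B
    p_b = b / cols_B
    S = rows_A * b * (1.0 - (1.0 - p_b) ** a) / p_b
    bytes_needed = safety * S * 12  # assumed bytes per stored entry
    return max(bytes_needed, lower_bound)
```

For small p_b, (1 − (1 − p_b)^a)/p_b approaches a, so S approaches the full count of intermediate products; larger p_b discounts for expected collisions.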
4.1 Runtime Overview
To better analyze the runtime performance, we split the evaluation into highly sparse and denser matrices, with a split factor of 42 non-zeros per row, classifying 80% of the matrices in SuiteSparse as highly sparse (see also Figure 1). Summary plots for SpGEMM on highly sparse matrices are shown in Figure 5, and relative speedups against the evaluated methods for all matrices in Table 1. For highly sparse matrices, our approach clearly dominates performance, achieving average speedups of at least 2× over other approaches. For denser matrices, nsparse achieves the best performance, with an average speedup of 1.32/1.49 (float/double) over AC-SpGEMM, with AC-SpGEMM being on par with Kokkos.

Over the entire data set, our approach achieves an average speedup of 3.27/3.05, 4.17/3.80, 3.30/3.21, 1.74/1.53, and 3.76/3.02 over cuSparse, bhSparse, RMerge, nsparse, and Kokkos, respectively. The trend plots and table already indicate the strengths and weaknesses of our approach. While ESC leads to deterministic, bit-stable results, the search and merge overhead increases as the number of non-zeros per row increases. Thus, even though our approach performs multiple iterations of ESC in efficient on-chip memory, it loses ground to hash-based approaches, like nsparse, for denser matrices. However, for very sparse matrices the overhead of ESC is not severe and the advantages of local scheduling and single-run chunk generation shine. Compared to other bit-stable approaches, AC-SpGEMM overall clearly achieves the best performance.
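Table 1 aggregates the per-matrix speedups with the harmonic mean ("h. mean"); as a sketch:

```python
def harmonic_mean_speedup(speedups):
    """Harmonic mean of per-matrix speedups: less sensitive to the few
    extreme maxima (600x and above in Table 1) than the arithmetic mean."""
    return len(speedups) / sum(1.0 / s for s in speedups)

hm = harmonic_mean_speedup([0.5, 2.0, 4.0, 8.0])
```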
[Figure 6 plot: per-matrix GFLOPS for AC-SpGEMM, cuSparse, RMerge, bhSparse, nsparse, and Kokkos.]

Figure 6. Double precision performance for commonly benchmarked matrices and additional cases for which our approach is bested by others. Note that nsparse achieves 45 GFLOPS for the last matrix.
4.2 Runtime Details
A performance overview for commonly benchmarked matrices from various fields and selected matrices that are difficult for our approach is provided in Figure 6; matrix statistics are given in Table 2. The most difficult scenarios for our approach include TSC_OPF_1047, cant, and hood, which, in spite of having strongly different structures, all feature a large average row length, produce a high number of intermediate products, and compact them significantly (up to a compaction factor of 150). The cost of ESC is simply too high in comparison to hashing, even while keeping hundreds of iterations in local memory. For TSC_OPF_1047, nsparse achieves the highest speedup over AC-SpGEMM: 5×. Another matrix with many temporary products but a lower compaction factor is landmark. Interestingly, due to its very special structure, it is one of the few cases where RMerge takes the lead.
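The compaction factor used here is the ratio of intermediate products to final non-zeros; plugging in the Table 2 values for the TSC_OPF row (1352.4M temporaries, 8.83M output non-zeros) gives roughly 153, in line with the "up to 150" figure:

```python
def compaction_factor(temp_products, nnz_C):
    """Intermediate products generated per final non-zero of C."""
    return temp_products / nnz_C

# Table 2, TSC_O... row: 1352.4M temporary products, 8.83M entries in C
cf = compaction_factor(1352.4e6, 8.83e6)
```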
             A                                         C
             rows    cols    nnz     len    max        nnz      len     max     temp

language     0.40    0.40    1.22    3.0    11.5k      4.61     11.6    32.0k   5.5
scircuit     0.17    0.17    0.96    5.6    353        5.22     30.5    1.9k    8.7
stat96v2     0.03    0.96    2.85    98.1   3.2k       0.35     12.1    1.6k    8.7
poiss...     0.01    0.01    0.35    26.1   110        2.96     218.8   584     11.8
144          0.14    0.14    2.15    14.9   26         10.42    72.0    116     33.0
asia_osm     11.95   11.95   25.42   2.1    9          42.75    3.6     24      56.9
webb...      1.00    1.00    3.11    3.1    4.7k       51.11    51.1    12.4k   69.5
atmos...     1.49    1.49    10.32   6.9    7          36.49    24.5    25      71.6
filter3D     0.11    0.11    2.71    25.4   112        20.16    189.4   550     86.0
bibd_19_9    0.01    0.09    3.3     19.4k  19.4k      0.03     171.0   171     119.7
TSOPF...     0.04    0.04    16.17   424.2  983        74.32    1.9k    3.3k    128.0
hugebu...    21.20   21.20   63.58   3.0    3          132.69   6.3     7       190.7
cant         0.06    0.06    4.01    64.2   78         17.44    279.3   375     269.5
landmark     0.07    0.00    1.15    16.0   16         101.82   1.4k    1.6k    549.2
hood         0.22    0.22    10.77   48.8   77         34.24    155.3   231     562.0
TSC_O...     0.01    0.01    2.02    247.8  1.5k       8.83     1.1k    3.5k    1352.4

Table 2. Matrix overview: values in millions except for row statistics (average length and maximum row length).
[Figure 7 plot: per-matrix relative runtime breakdown of the pipeline stages.]

Figure 7. Relative runtime of the different steps of our approach: global load balancing (GLB), AC-ESC, Merge Assignment (MCC), Multi Merge (MM), Path Merge (PM), Search Merge (SM), and Chunk Copy (CC).
As the compaction factor gets lower, our approach is competitive, leading the performance for many commonly tested matrices in various domains, such as fluid dynamics (poisson3Da), linear programming (stat96v2), or graphs (webbase-1M). This performance edge is independent of the size of the matrix (compare hugebubbles-00020 and 144). Also, local dense areas (TSOPF_RS_b2383), specific structures (hugebubbles-00020), individual long rows (webbase-1M), non-square matrices (stat96v2), and very long rows (bibd_19_9) are handled efficiently by our approach. The best speedup is achieved by our approach if the matrix is highly sparse but has few long rows (language), where our efficient ESC and the special treatment of long rows go hand in hand, outperforming the best other approach by 20×.
The timing breakdown of our approach in Figure 7 showsthat we operate under ideal conditions for many matrices,spending most time in AC-ESC within on-chip memory.
              helper    chunk     used      %        u/o     R    mpL

language      15.05     100.00    58.23     50.88%   1.10    0    98.73%
scircuit      13.09     100.00    63.15     55.18%   1.06    0    97.68%
stat96v2      29.91     122.11    6.68      5.47%    1.65    0    97.98%
poisson3Da    6.06      129.96    42.61     32.78%   1.26    0    91.02%
144           27.04     460.75    129.89    28.19%   1.09    0    98.38%
asia_osm      416.96    1091.05   497.10    45.56%   1.02    0    99.77%
webbase-1M    100.03    710.80    605.79    85.23%   1.04    3    98.05%
atmosmodl     130.69    1033.36   435.36    42.13%   1.04    0    99.70%
filter3D      38.06     983.23    271.19    27.58%   1.18    0    98.78%
bibd_19_9     1.06      100.00    19.20     16.78%   57.38   0    99.21%
TSOPF_...     605.60    3044.69   1593.56   52.34%   1.87    0    99.23%
hugebub...    928.54    1991.35   1545.96   77.63%   1.02    0    99.45%
cant          66.76     3072.00   276.51    9.00%    1.39    0    99.48%
landmark      228.97    6144.00   3333.51   54.26%   2.86    1    97.29%
hood          162.85    3072.00   508.58    16.56%   1.30    0    99.73%
TSC_O...      37.99     3072.00   310.41    10.10%   3.07    0    99.03%

Table 3. Overall memory consumption in MB for helper data structures, chunk pool, actual chunk pool used (also in %), and used chunk memory relative to the output matrix (u/o); as well as number of restarts (R) and lowest multiprocessor load (mpL).
[Figure 8 plot: per-matrix memory consumption in GB for AC-Helper, AC-Chunks, AC-Overalloc, RMerge, bhSparse, and nsparse.]

Figure 8. Memory consumptions. Due to our simplistic memory estimate, the effectively used memory is very small compared to the allocation.
For matrices with long rows (TSOPF_RS_b2383), a considerable amount of time is spent on Merge. Although Multi Merge handles more shared rows than Search and Path Merge in all cases, its time is negligible, as every block handles many rows. Assigning the merge cases and copying the final matrix make up between 1% and 20% of the runtime; global load balancing is negligible, underlining the efficiency of our approach.
At the same time, multiprocessor load is virtually perfect in all cases (cf. last column of Table 3), indicating that pairing our global and local load balancing with the GPU hardware scheduler achieves ideal workload distributions.
4.3 Memory Consumption
Our memory consumption is detailed in Table 3 and compared to others in Figure 8. nsparse requires hardly any additional memory.
We allocate similar amounts of memory as RMerge and bhSparse, of which we typically only use a fraction (% in Table 3). The fact that the used chunk memory is only slightly higher than the memory required for C (u/o in Table 3) shows that local iterations of ESC essentially produce completed chunks of the output matrix (note that our lower bound of 100 MB leads to the high value for bibd_19_9). This highlights the advantages of our local work distribution. Our chunk pool estimate is conservative in most cases and only few matrices require restarts, cf. Table 3.

Although we require three restarts for webbase-1M, we achieve a 4.4× speedup over the best other approach, indicating that restart is efficient. To evaluate the cost of restarts, we reduced the chunk pool size for webbase-1M, measuring runtimes of 22.0, 23.6, 24.5, 26.6, 30.8, 39.7, and 48.6 ms for 0, 3, 5, 10, 21, 42, and 63 restarts, respectively. Even with 63 restarts we still beat nsparse by a factor of 2×. Additionally, a less conservative chunk estimate could significantly reduce memory with little impact on performance due to restarts.
4.4 Summary
Overall, AC-SpGEMM is the fastest approach when bit-stable results are required for virtually all tested matrices (RMerge is better in 1% of cases), and it is the fastest approach for very sparse matrices. For matrices with many temporary products and large compaction factors, ESC strategies tend to fall behind hash-based approaches like nsparse, as the per-product cost is simply too high. Nevertheless, across the entire test suite, AC-SpGEMM takes the performance lead in 83% of all cases.
5 Conclusion
On massively parallel processors such as GPUs, any performance gains require bringing fine-grained parallelism forward. AC-SpGEMM achieves this goal by a comprehensive take on the problem. Our main contribution is a fully adaptive local work distribution, allowing for multiple iterations of local ESC and avoiding costly global memory round trips. Paired with a novel, adaptive chunk management approach, special-case handling of long rows, and a series of optimizations, AC-SpGEMM forms a highly efficient, complete SpGEMM solution which achieves bit-stable results. Experimental results on 2000 matrices across various fields reveal dominating performance for highly sparse matrices. Only matrices with very large numbers of temporary products can be handled more efficiently using the most recent hash-based alternative, which is not bit-stable.
An obvious improvement for our approach is reducing the overallocation of chunk memory. Furthermore, extending the adaptive behaviour of our chunk-based approach to choose between alternative approaches (ESC, hashing, merging) depending on the load currently seen by the work distribution may lead to a further improvement of performance in those scenarios where other strategies shine.
Our approach is open source and can be downloaded fromACM DL.
Acknowledgment
This research was supported by the German Research Foundation (DFG) grant STE 2565/1-1, and the Austrian Science Fund (FWF) grant I 3007. The GPU for this research was donated by NVIDIA Corporation.
References
[1] Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting Locality in Sparse Matrix-Matrix Multiplication on Many-Core Architectures. IEEE Transactions on Parallel and Distributed Systems 28, 8 (Aug 2017), 2258–2271. https://doi.org/10.1109/TPDS.2017.2656893
[2] Rasmus R. Amossen, Andrea Campagna, and Rasmus Pagh. 2010. Better Size Estimation for Sparse Matrix Products. In Proceedings of the 13th International Conference on Approximation and the 14th International Conference on Randomization, and Combinatorial Optimization: Algorithms and Techniques (APPROX/RANDOM'10). Springer-Verlag, Barcelona, Spain, 406–419.
[3] Pham N. Q. Anh, Rui Fan, and Yonggang Wen. 2016. Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 36, 12 pages. https://doi.org/10.1145/2925426.2926273
[4] Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2016. Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication. ACM Trans. Parallel Comput. 3, 3, Article 18 (Dec. 2016), 34 pages. https://doi.org/10.1145/3015144
[5] Nathan Bell, Steven Dalton, and Luke N. Olson. 2012. Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods. SIAM Journal on Scientific Computing 34, 4 (2012), C123–C152. https://doi.org/10.1137/110838844
[6] Aydin Buluc and John R. Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, Miami, FL, USA, 1–11. https://doi.org/10.1109/IPDPS.2008.4536313
[7] Steven Dalton, Sean Baxter, Duane Merrill, Luke Olson, and Michael Garland. 2015. Optimizing Sparse Matrix Operations on GPUs Using Merge Path. In 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, India, 407–416.
[8] Steven Dalton, Nathan Bell, Luke Olson, and Michael Garland. 2014. Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computations. http://cusplibrary.github.io/ Version 0.5.0.
[9] Steven Dalton, Luke Olson, and Nathan Bell. 2015. Optimizing Sparse Matrix-Matrix Multiplication for the GPU. ACM Trans. Math. Softw. 41, 4, Article 25 (Oct. 2015), 20 pages. https://doi.org/10.1145/2699470
[10] Timothy A. Davis. 2017. SuiteSparse: A Suite of Sparse matrix packages. http://www.cise.ufl.edu/~davis/
[11] Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1 (Dec. 2011), 1:1–1:25.
[12] Julien Demouth. 2012. Sparse matrix-matrix multiplication on the GPU. Technical Report. NVIDIA.
[13] Mehmet Deveci, Christian Trott, and Sivasankaran Rajamanickam. 2017. Performance-portable sparse matrix-matrix multiplication for many-core architectures. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, Lake Buena Vista, FL, USA, 693–702. https://doi.org/10.1109/IPDPSW.2017.8
[14] Mehmet Deveci, Christian Trott, and Sivasankaran Rajamanickam. 2018. Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures. Parallel Comput. 78 (Oct. 2018), 33–46. https://doi.org/10.1016/j.parco.2018.06.009
[15] John R. Gilbert, Cleve Moler, and Robert Schreiber. 1992. Sparse Matrices in MATLAB: Design and Implementation. SIAM J. Matrix Anal. Appl. 13, 1 (1992), 333–356. https://doi.org/10.1137/0613024
[16] Oded Green, Robert McColl, and David A. Bader. 2012. GPU Merge Path: A GPU Merging Algorithm. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). ACM, New York, NY, USA, 331–340.
[17] Felix Gremse, Andreas Höfter, Lars O. Schwen, Fabian Kiessling, and Uwe Naumann. 2015. GPU-accelerated sparse matrix-matrix multiplication by iterative row merging. SIAM Journal on Scientific Computing 37, 1 (2015), C54–C71.
[18] Fred G. Gustavson. 1978. Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition. ACM Trans. Math. Softw. 4, 3 (Sept. 1978), 250–269. https://doi.org/10.1145/355791.355796
[19] Rakshith Kunchum, Ankur Chaudhry, Aravind Sukumaran-Rajam, Qingpeng Niu, Israt Nisa, and P. Sadayappan. 2017. On Improving Performance of Sparse Matrix-matrix Multiplication on GPUs. In Proceedings of the International Conference on Supercomputing (ICS '17). ACM, New York, NY, USA, Article 14, 11 pages. https://doi.org/10.1145/3079079.3079106
[20] Weifeng Liu and Brian Vinter. 2015. A Framework for General Sparse Matrix-matrix Multiplication on GPUs and Heterogeneous Processors. J. Parallel Distrib. Comput. 85, C (Nov. 2015), 47–61. https://doi.org/10.1016/j.jpdc.2015.06.010
[21] Duane Merrill. 2015. CUB: CUDA Unbound, a library of warp-wide, block-wide, and device-wide GPU parallel primitives.
[22] Yusuke Nagasaka, Akira Nukada, and Satoshi Matsuoka. 2017. High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU. In 2017 46th International Conference on Parallel Processing (ICPP). IEEE, Bristol, UK, 101–110. https://doi.org/10.1109/ICPP.2017.19
[23] NVIDIA. 2019. The API reference guide for cuSPARSE, the CUDA sparse matrix library (v9.1 ed.). NVIDIA.
[24] Md. Mostofa A. Patwary, Nadathur R. Satish, Narayanan Sundaram, Jongsoo Park, Michael J. Anderson, Satya G. Vadlamudi, Dipankar Das, Sergey G. Pudov, Vadim O. Pirogov, and Pradeep Dubey. 2015. Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms. Springer International Publishing, Cham, 48–57.
[25] Ichitaro Yamazaki and Xiaoye S. Li. 2011. On Techniques to Improve Robustness and Scalability of a Parallel Hybrid Linear Solver. In Proceedings of the 9th International Conference on High Performance Computing for Computational Science (VECPAR'10). Springer-Verlag, Berlin, Heidelberg, 421–434. http://dl.acm.org/citation.cfm?id=1964238.1964281
[26] Raphael Yuster and Uri Zwick. 2004. Detecting Short Directed Cycles Using Rectangular Matrix Multiplication and Dynamic Programming. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '04). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 254–260. http://dl.acm.org/citation.cfm?id=982792.982828
A Artifact Description Appendix: Adaptive Sparse Matrix-Matrix Multiplication on the GPU

A.1 Abstract
The following appendix provides the necessary information to acquire the framework and rerun the experiments used to evaluate the framework.

A.2 Description
A.2.1 Check-list (artifact meta information)
• Data set: Matrix data set found on the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection)
• Hardware: Recent GPU hardware from NVIDIA, tested on NVIDIA GTX 1080 Ti, NVIDIA GTX TITAN X Pascal, and NVIDIA GTX TITAN Xp
• Output: The output is provided in the output stream as well as in a .csv file
• Experiment workflow: Run the runall.bat/.sh file to gather results for all matrices in a folder, or run individual matrices through the framework
• Experiment customization: The number of iterations used for the timing measurements can be altered. A CPU implementation can be enabled as well to confirm the results of the framework output
• Publicly available?: ACM DL
A.2.2 How software can be obtained (if available)
The framework is downloadable from ACM DL.
A.2.3 Hardware dependencies
Recent GPU hardware from NVIDIA, tested on NVIDIA GTX 1080 Ti, NVIDIA GTX TITAN X Pascal, and NVIDIA GTX TITAN Xp. The remaining system specifications should have a comparatively small impact on performance; performance was tested on an Intel Core i7-7700 paired with 32 GB RAM on Ubuntu 16.04 LTS.
A.2.4 Software dependencies
• CMake 3.2 or higher
• CUDA 9.1 / 9.2 / 10.0
• C++14 compliant compiler, tested on:
  – MSVC (Visual Studio 2017)
  – GCC 6 / GCC 7
• CUB v1.8.0
A.2.5 Datasets
The framework itself can parse Matrix Market Format (.mtx) files (matrix data set found on the SuiteSparse Matrix Collection, formerly the University of Florida Sparse Matrix Collection). Upon first parsing a matrix in this format, it also performs a conversion into a binary format that will be used for consecutive runs, which greatly reduces loading times. These files are stored with the .hicoo extension.
A.3 Installation
On Linux, simply run the provided setup.sh script, which will clone CUB and set up and build the project. On Windows, download and extract CUB into the folder include/external, then create a build folder in the top directory and use CMake to set up the project to build.
A.4 Experiment workflow
The framework can be operated in one of two modes:
• Single Matrix
  – In this setup the framework can be run for a single matrix and optionally confirm the resulting output matrix by comparing it to a host-based solution
• Complete testrun
  – A runall.bat/.sh file is provided that consecutively calls the framework with all matrices provided in a folder; this was used to run the large testcases
The output is provided in the output stream as well as in a separate .csv file which includes all important matrix stats (rows, columns, nnz, average nnz per row, etc.) as well as timing measurements. The script is set up such that each test run is done as a separate process, so that failed launches do not impede the launches after them. The output of the script consists of timing measurements and, when Debug is enabled within the framework (the template instantiation must have this enabled), also detailed measurements as well as memory measurements.

For all matrices in the test set, the framework computes C = A · A (or C = A · AT if the input matrix is not square) and measures the time required for the multiplication procedure. Only the input matrix A is provided directly on the device to the procedure, and the result matrix C is also returned as a device matrix. Conversion operators are provided to transfer matrices between host and device as well as convert the COO format to CSR if required.
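A minimal host-side reference for such a confirmation is a row-wise (Gustavson-style) SpGEMM on CSR data; this sketch is a stand-in, not the framework's CPU implementation:

```python
def spgemm_csr(a_ptr, a_col, a_val, b_ptr, b_col, b_val, n_rows):
    """Reference row-wise (Gustavson) SpGEMM on CSR inputs: each row of
    C accumulates scaled rows of B selected by the non-zeros of A."""
    c_ptr, c_col, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j, v = a_col[jj], a_val[jj]
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                k = b_col[kk]
                acc[k] = acc.get(k, 0.0) + v * b_val[kk]
        for k in sorted(acc):  # emit the row in column order
            c_col.append(k)
            c_val.append(acc[k])
        c_ptr.append(len(c_col))
    return c_ptr, c_col, c_val

# C = A * A for A = [[1, 2], [0, 3]] in CSR form
c_ptr, c_col, c_val = spgemm_csr([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0],
                                 [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], 2)
```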
A.5 Evaluation and expected result
To replicate the results gathered in this paper, it suffices to run the runall.bat/.sh file on a folder containing matrices found in the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) dataset. The next page holds two plots detailing the exact performance measurements for the complete test set compiled using the hardware described above.
A.6 Experiment customization
The number of iterations used for the timing measurements can be altered. A CPU implementation can be enabled as well to confirm the results of the framework output. Debug can be enabled to get more detailed timing/memory results.
[Two plots: exact performance measurements per matrix for the complete test set, compiled using the hardware described above.]
G25
G30
G24
G29
G22
G27
G26
G31
ch7-
7-b5
ri201
0bo
dyy4
TF15
bcss
tk11
bcss
tk12
lp_k
en_1
8Fr
anz6
nem
spm
m2
Tref
ethe
n_20
00ex
25da
no3m
ipbo
dyy5
hi20
10 cq9
freeF
lyin
gRob
ot_7
mcf
eus
road
s-48
lp_c
re_d
PGPg
iant
com
poTS
OPF_
RS_b
2383
_c1_
bul
evim
inus
road
sc-
21na
sa18
24bo
dyy6
sher
man
ACb
rose
n10
eris1
176
c-26 FA
wing
obst
clae
tors
ion1 TS
lp_c
re_b
jnlb
rng1
fom
e13
nem
sem
m2
chem
_mas
ter1
rdist
2m
insu
rfoex
18ra
il_20
209
fom
e21
pds-
40ch
7-8-
b2vt
2010
meg
1bc
sstm
34Ke
mel
mac
her
shut
tle_e
ddy
cvxq
p3ki
netic
Batc
hRea
ctor
_1wo
rldfre
eFly
ingR
obot
_8nu
g08-
3rd
Fran
z5ex
36lp
i_gos
hM
arag
al_3
flowe
r_8_
4m
sc01
050
c-23
lp_p
ilot
mod
2co
9m
k12-
b2br
atu3
dja
n99j
ac06
0ja
n99j
ac06
0sc
cavi
ty05
cavi
ty06
cavi
ty07
cavi
ty08
cavi
ty09
mar
k3ja
c060
mar
k3ja
c060
scCu
rlCur
l_0ut
m59
40IG
5-11
epb2
ex24
p010
wang
3wa
ng4
baye
r04
hvdc
1co
nd-m
atde
laun
ay_n
15c-
24m
ario
001
pds-
50sh
allo
w_wa
ter1
shal
low_
wate
r2fre
eFly
ingR
obot
_9fo
ldoc
baye
r01
vsp_
barth
5_1K
sep_
50in
_5Ko
ut14
5bit
psse
0D6
-6th
erm
alpg
p2on
eton
e2ba
xter
freeF
lyin
gRob
ot_1
0ro
bot2
4c1_
mat
5vi
scop
last
ic1di
xmaa
nlM
arag
al_4
mhd
3200
afre
eFly
ingR
obot
_11
g7ja
c040
g7ja
c040
scsh
yy16
1ex
15TS
OPF_
RS_b
162_
c1m
ark3
jac0
80m
ark3
jac0
80sc
msc
0144
0ex
10ex
14pd
s-60
jan9
9jac
080
jan9
9jac
080s
crd
ist3a
ex10
hsfre
eFly
ingR
obot
_12
cz20
468
stok
es64
stok
es64
snh
2010
dict
iona
ry28
ex23
c-29
cis-n
4c6-
b4n4
c6-b
4ps
se2
mod
el10
Fran
z7fre
eFly
ingR
obot
_13
n4c6
-b12
GL7d
12M
ISKn
owle
dgeM
apsp
msr
tlssc
fxm
1-2b
TSOP
F_RS
_b39
_c7
freeF
lyin
gRob
ot_1
4e1
8lp
_nug
20RF
devi
ceIll
_Sto
kes
freeF
lyin
gRob
ot_1
5Fr
anz8
adde
r_dc
op_0
1in
it_ad
der1
pds-
70ad
der_
dcop
_04
adde
r_dc
op_0
5lh
r04
lhr0
4cad
der_
dcop
_03
adde
r_dc
op_0
7ad
der_
dcop
_06
adde
r_dc
op_1
0ad
der_
dcop
_09
adde
r_dc
op_0
8ad
der_
dcop
_11
adde
r_dc
op_1
3ad
der_
dcop
_19
adde
r_dc
op_4
4ad
der_
dcop
_02
adde
r_dc
op_1
2ad
der_
dcop
_14
adde
r_dc
op_1
5ad
der_
dcop
_16
adde
r_dc
op_1
7ad
der_
dcop
_18
adde
r_dc
op_2
0ad
der_
dcop
_21
adde
r_dc
op_2
2ad
der_
dcop
_23
adde
r_dc
op_2
4ad
der_
dcop
_25
adde
r_dc
op_2
6ad
der_
dcop
_27
adde
r_dc
op_2
8ad
der_
dcop
_29
adde
r_dc
op_3
0ad
der_
dcop
_31
adde
r_dc
op_3
2ad
der_
dcop
_33
adde
r_dc
op_3
4ad
der_
dcop
_35
adde
r_dc
op_3
6ad
der_
dcop
_37
adde
r_dc
op_3
8ad
der_
dcop
_39
adde
r_dc
op_4
0ad
der_
dcop
_41
adde
r_dc
op_4
2ad
der_
dcop
_43
adde
r_dc
op_4
5ad
der_
dcop
_46
adde
r_dc
op_4
7ad
der_
dcop
_48
adde
r_dc
op_4
9ad
der_
dcop
_50
adde
r_dc
op_5
1ad
der_
dcop
_52
adde
r_dc
op_5
3ad
der_
dcop
_54
adde
r_dc
op_5
5ad
der_
dcop
_56
adde
r_dc
op_5
7ad
der_
dcop
_58
adde
r_dc
op_5
9ad
der_
dcop
_60
adde
r_dc
op_6
1ad
der_
dcop
_62
adde
r_dc
op_6
3ad
der_
dcop
_64
adde
r_dc
op_6
5ad
der_
dcop
_66
adde
r_dc
op_6
7ad
der_
dcop
_68
adde
r_dc
op_6
9fre
eFly
ingR
obot
_16
ch7-
9-b2
ex12
mk1
2-b4
lowT
hrus
t_2
sts4
098
G58
G59
mar
k3ja
c100
mar
k3ja
c100
scEX
3EX
4ro
ute
lpi_k
lein
3ja
n99j
ac10
0ja
n99j
ac10
0sc
ch8-
8-b2
nem
swrld r0
5af
t01
pds-
80dy
nam
icSoa
ringP
robl
em_3
oner
a_du
alAB
ACUS
_she
ll_hd
ABAC
US_s
hell_
ldAB
ACUS
_she
ll_m
dAB
ACUS
_she
ll_ud
rlfdu
alm
hd12
80a
OPF_
3754
lung
2se
ymou
rlce
gb33
06c-
35sp
aceS
huttl
eEnt
ry_2
rdist
1c-
25ca
ge10
adde
r_tra
ns_0
1ad
der_
trans
_02
lpi_c
eria
3d ex7
aft0
2pd
s10
msc
0451
5pd
s-90
ct20
10ha
ngGl
ider
_2ta
ndem
_dua
llp
i_cpl
ex1
mhd
4800
are
orie
ntat
ion_
2gr
aphi
csm
ark3
jac1
20m
ark3
jac1
20sc
me2
010
cegb
3024
raja
t12
bcirc
uit
bcss
tk14
pds-
100
crys
tm01
cont
-201
sd20
10m
k13-
b5c-
44ex
27ja
n99j
ac12
0ja
n99j
ac12
0sc
dela
unay
_n16
epb3
c-37
nasa
2146
meg
4TF
16dy
nam
icSoa
ringP
robl
em_4
nasa
4704
ak20
10c-
34bc
sstk
18ex
13g7
jac0
50sc
lp_p
ilot8
7dy
nam
icSoa
ringP
robl
em_5
fe_b
ody
cvxb
qp1
ncvx
bqp1
onet
one1
ex20
mar
k3ja
c140
mar
k3ja
c140
scex
37ra
efsk
y5 rel7
dyna
micS
oarin
gPro
blem
_6G6
3G6
4ca
vity
10ca
vity
11ca
vity
12ca
vity
13ca
vity
14ca
vity
15ga
ron1
t2da
hex
28cir
cuit_
2pw
tdy
nam
icSoa
ringP
robl
em_7
Chev
ron1
c-38
knes
er_1
0_4_
1dy
nam
icSoa
ringP
robl
em_8
ex35
mk1
2-b3
copt
er1
nv20
10lo
ck22
32fx
m4_
6wy
2010
m13
3-b3
shar
_te2
-b3
ford
2cz
4094
8st
orm
g2-1
25ph
otog
ram
met
ry2
IG5-
12ec
l32
spac
eShu
ttleE
ntry
_3 ex9
ncvx
qp5
car4
sgpf
5y6
lhr0
7lh
r07c
rgg_
n_2_
15_s
0Lin
ux_c
all_g
raph
wats
on_1
GL7d
24te
d_B
skirt
apac
he1
tand
em_v
txvs
p_da
ta_a
nd_s
eym
ourl
bcss
tk15
mri1
mpl
ate
raef
sky6
g7ja
c060
g7ja
c060
scnd
2010
finan
ce25
6vs
p_p0
291_
seym
ourl_
iiasa
ther
mal
1ch
ipco
ol0
chip
cool
1sh
ar_t
e2-b
1fx
m3_
16ca
-Con
dMat
OPF_
6000
rail_
7984
1as
-735
c-41
imag
e_in
terp
ncvx
qp3
Fran
z11
ex8
Fran
z10
Fran
z9ex
31Du
bcov
a1ut
2010
wiki
-Vot
ebc
sstk
13TS
OPF_
RS_b
162_
c3wa
vegu
ide3
Dco
nf5_
0-4x
4-10
conf
5_0-
4x4-
14co
nf5_
0-4x
4-18
conf
5_0-
4x4-
22co
nf5_
0-4x
4-26
conf
6_0-
4x4-
20co
nf6_
0-4x
4-30
ch7-
8-b3
fe_o
cean
TSOP
F_FS
_b9_
c1M
uum
t201
0co
ater
2TS
OPF_
RS_b
39_c
19n4
c6-b
5ai
rfoil_
2dbc
sstk
25Le
Gres
ley_
8793
6de
laun
ay_n
17m
d201
0th
erm
omec
h_TC
ther
mom
ech_
TKde
ltaX
oran
i678
ch7-
8-b5
lhr1
0lh
r10c
lhr1
1lh
r11c
wv20
10k1
_san
igbt
3nc
vxqp
7Ba
uman
nco
nd-m
at-2
003
kine
ticBa
tchR
eact
or_2
scag
r7-2
bte
stbi
gca
vity
16ca
vity
17ca
vity
18ca
vity
19ca
vity
20ca
vity
21ca
vity
22ca
vity
23ca
vity
24ca
vity
25ca
vity
26nj
2010
ma2
010
raja
t01
ky20
10co
nt-3
00la
ngua
gene
2010
ms2
010
n4c6
-b11
Mar
agal
_5id
2010
c-49
grid
gena
spac
eShu
ttleE
ntry
_4sc
fxm
1-2r
light
_in_t
issue
ia20
10c-
31ar
2010
reor
ient
atio
n_3
amaz
on03
02sc
2010
nm20
10ra
jat2
7wa
tson
_2TS
OPF_
RS_b
162_
c4wa
2010
whee
l_601
cit-H
epPh
OPF_
1000
0st
okes
128
ks20
10c-
36g7
jac0
80g7
jac0
80sc
c-48
Delo
r64K
lhr1
4lh
r14c
co20
10he
lm3d
01re
orie
ntat
ion_
4ol
esni
k0be
lgiu
m_o
smre
lat7
rela
t7b
la20
10AS
IC_1
00ks
or20
10Ch
evro
n2to
mog
raph
ic1TF
172D
_276
28_b
jtcai
benz
ene
ex19
cit-H
epTh
HEP-
th-n
ewki
netic
Batc
hRea
ctor
_3wi
2010
mac
_eco
n_fw
d500
TSOP
F_RS
_b39
_c30
finan
512
pfin
an51
2wa
then
100
s3rm
t3m
3m
n201
0cr
ystm
02in
2010
rgg_
n_2_
16_s
0c-
52re
orie
ntat
ion_
5Si
eber
tn20
10c-
33nm
os3
dbic1
ok20
10IG
5-13
g3rm
t3m
3lh
r17
lhr1
7cm
c2de
pial
2010
ch7-
8-b4
reor
ient
atio
n_6
scirc
uit
reor
ient
atio
n_7
s2rm
t3m
1s3
rmt3
m1
cage
11s1
rmt3
m1
bcss
tm36
reor
ient
atio
n_8
hcirc
uit
az20
10 nsir
man
_597
6to
rso2
lp_n
ug30
wath
en12
0c-
51ga
2010
mem
plus
Notre
Dam
e_ww
wai
rcra
ftst
och_
airc
raft
cond
-mat
-200
5in
let
nc20
10g7
jac1
00g7
jac1
00sc
162b
itm
i201
0ro
adNe
t-PA
dela
unay
_n18
af23
560
ther
mom
ech_
dMbr
ack2
va20
10ch
7-9-
b3sm
allw
orld
Andr
ews
n4c6
-b6
scta
p1-2
bny
2010
EAT_
SREA
T_RS
visc
opla
stic2
FEM
_3D_
ther
mal
1as
tro-p
hm
o201
0c-
60G_
n_pi
n_po
ut lpl1
coup
led
oh20
10Si
H4co
pter
2sc
sd8-
2bsc
sd8-
2cne
ther
land
s_os
mch
8-8-
b3po
isson
3Da
vsp_
c-60
_dat
a_ct
i_cs4
n4c6
-b10
road
Net-T
X Linhv
dc2
fe_t
ooth
Oreg
on-1
vibr
obox
il201
0da
rcy0
03m
ario
002
garo
n2g7
jac1
20g7
jac1
20sc
pa20
10sin
c12
SiNa t3dl
k3pl
ates
circu
it_3
c-50
grah
am1
scrs
8-2r
abta
ha1
crys
tm03
circu
it_1
c-61
Oreg
on-2
3D_2
8984
_Tet
rade
norm
alTr
efet
hen_
2000
0bTr
efet
hen_
2000
0lo
wThr
ust_
3g7
jac1
40g7
jac1
40sc
d_pr
etok
turo
n_m
rlfpr
imlh
r34
lhr3
4ce4
0r01
00n4
c6-b
7en
ron
debr
nem
eth0
2ne
met
h03
nem
eth0
4ne
met
h05
nem
eth0
6ne
met
h07
nem
eth0
8ne
met
h09
fl201
0GL
7d13
road
Net-C
Aki
netic
Batc
hRea
ctor
_4rg
g_n_
2_17
_s0
c-39
ted_
An4
c6-b
9Du
bcov
a2c-
422D
_540
19_h
ighK
fe_r
otor
helm
2d03
cras
hbas
ism
ajor
basis
g7ja
c160
g7ja
c160
scbc
sstk
17de
laun
ay_n
19c-
65c-
40n4
c6-b
8AS
IC_6
80ks
ch7-
9-b5
IG5-
14ki
netic
Batc
hRea
ctor
_5sc
205-
2r59
8ash
ar_t
e2-b
2TF
18ki
netic
Batc
hRea
ctor
_6da
wson
5Ha
mrle
3so
c-sig
n-Sl
ashd
ot08
1106
GL7d
23so
c-sig
n-Sl
ashd
ot09
0216
kine
ticBa
tchR
eact
or_7
g7ja
c180
g7ja
c180
scso
c-sig
n-Sl
ashd
ot09
0221
kine
ticBa
tchR
eact
or_8
tmt_
unsy
mt2
emvf
emki
m1
kine
ticBa
tchR
eact
or_9
ca20
10tw
oton
ech
7-9-
b4c-
63c-
55ec
olog
y2ec
olog
y1c-
32as
-22j
uly0
6ra
jat2
2g7
jac2
00g7
jac2
00sc
para
bolic
_fem
slpts
kca
-Ast
roPh
c-47
ckt1
1752
_dc_
1ck
t117
52_t
r_0
stru
ct3
cont
11_l
vsp_
scta
p1-2
b_an
d_se
ymou
rldb
lp-2
010
raja
t26
CurlC
url_1
2cub
es_s
pher
ec-
46c-
68am
azon
0312
3D_5
1448
_3D
ibm
_mat
rix_2
c-59
xeno
n1as
-cai
dach
8-8-
b4tx
2010
wave
c-30
ch8-
8-b5
italy
_osm
Chev
ron3
ASIC
_320
ksca
-Hep
Phvs
p_c-
30_d
ata_
data
amaz
on05
05sc
sd8-
2rvs
p_bu
mp2
_e18
_aa0
1_m
odel
1_cr
ew1
amaz
on06
01ap
ache
214
4lh
r71
lhr7
1cra
jat1
5pr
efer
entia
lAtta
chm
ent
Notre
Dam
e_ac
tors
cage
12tm
t_sy
mra
jat1
3sc
tap1
-2r
grea
t-brit
ain_
osm
coAu
thor
sCite
seer
tube
1tu
be2
visc
oroc
ksNA
CA00
15c-
69de
laun
ay_n
20rg
g_n_
2_18
_s0
soc-
Epin
ions
1th
erm
omec
h_dK
vsp_
finan
512_
scag
r7-2
c_rlf
ddd
c-56
huge
trace
-000
00co
Auth
orsD
BLP
c-70
qa8f
kqa
8fm
raja
t23
web-
Stan
ford
gas_
sens
orvs
p_vi
brob
ox_s
cagr
7-2c
_rlfd
ddam
azon
-200
8ca
idaR
oute
rLev
elvs
p_m
od2_
pgp2
_slp
tsk
Si5H
12c-
72st
omac
hbc
sstk
31ve
nkat
01ve
nkat
25ve
nkat
50cf
d1IG
5-15
mat
rix_9
c-54
emai
l-EuA
llem
ail-E
nron
huge
tric-
0000
0sp
arsin
em
14b
germ
any_
osm
176b
itc-
62c-
62gh
sas
ia_o
smCh
evro
n4lo
wThr
ust_
4re
l8sc
agr7
-2r
abta
ha2
Mar
agal
_6hu
getri
c-00
010
c-43
web-
Goog
leth
erm
al2
atm
osm
odd
atm
osm
odj
TF19
iChe
m_Ja
cobi
anhu
getri
c-00
020
vent
uriLe
vel3
web-
Notre
Dam
eRe
uter
s911
av41
092
webb
ase-
1Mlp
_fit2
pla
rgeb
asis
offs
hore
atm
osm
odl
atm
osm
odm
c-66
c-66
bco
nf5_
4-8x
8-05
conf
5_4-
8x8-
10co
nf5_
4-8x
8-15
conf
5_4-
8x8-
20co
nf6_
0-8x
8-20
conf
6_0-
8x8-
30co
nf6_
0-8x
8-80
H2O
c-71
wate
r_ta
nkc-
58cf
d2vs
p_be
fref_
fxm
_2_4
_air0
2de
laun
ay_n
21to
rso3
cop2
0k_A
mes
h_de
form
lowT
hrus
t_5
hang
Glid
er_3
poiss
on3D
bGL
7d14
Mar
agal
_7FE
M_3
D_th
erm
al2
filte
r3D
mem
chip
rgg_
n_2_
19_s
0cir
cuit_
4m
atrix
-new
_3lo
wThr
ust_
6cit
atio
nCite
seer
road
_cen
tral
raja
t31
lowT
hrus
t_7
soc-
sign-
epin
ions
cnr-2
000
lowT
hrus
t_8
xeno
n2c-
45Cu
rlCur
l_2lo
wThr
ust_
9au
tolo
wThr
ust_
10lo
wThr
ust_
11Du
bcov
a3lo
wThr
ust_
12lo
wThr
ust_
13hu
getra
ce-0
0010
adap
tive
IG5-
16Fr
eesc
ale1
soc-
Slas
hdot
0811
soc-
Slas
hdot
0902
c-67
c-67
bM
6De
lor2
95K
stor
mG2
_100
033
3SP
cage
13AS
365
huge
trace
-000
20m
ri2Cu
rlCur
l_3re
lat8
NLR
nsct
s3dk
t3m
2s4
dkt3
m2
dela
unay
_n22
road
_usa
case
9TS
OPF_
FS_b
9_c6 SiO
huge
bubb
les-
0000
0hu
gebu
bble
s-00
010
Delo
r338
Khu
gebu
bble
s-00
020
c-53
rgg_
n_2_
20_s
0en
gine
barri
er2-
1ba
rrier
2-2
barri
er2-
3ba
rrier
2-4
mon
o_50
0Hz
kkt_
powe
rwe
b-Be
rkSt
anba
rrier
2-10
barri
er2-
11ba
rrier
2-12
barri
er2-
9HT
C_33
6_91
29M
arag
al_8
para
-4pa
ra-1
0pa
ra-5
para
-6pa
ra-7
para
-8pa
ra-9 CO
kim
2Cu
rlCur
l_4St
ocF-
1465
dela
unay
_n23
Tran
spor
tfe
m_f
ilter
rgg_
n_2_
21_s
0
5
10
15
20
25
30
GFLO
PS (d
oubl
e)
SPGEMM double smallAC-SpGEMMcuSparsebhSparse
RMergensparseKokkos
Figure 9.Marker plot for the complete test set using double precision for small matrices with average row length a < 42
[Figure 10. Marker plot for the complete test set using double precision for large matrices with average row length a ≥ 42. Y-axis: GFLOPS (double), 10–50. Series: AC-SpGEMM, cuSparse, bhSparse, RMerge, nsparse, Kokkos. Per-matrix markers omitted from this transcript.]
[Figure 11. Marker plot for the complete test set using single precision for small matrices with average row length a < 42. Y-axis: GFLOPS (float), 0–40. Series: AC-SpGEMM, cuSparse, bhSparse, RMerge, nsparse, Kokkos. Per-matrix markers omitted from this transcript.]
[Figure 12. Marker plot for the complete test set using single precision for large matrices with average row length a ≥ 42. Y-axis: GFLOPS (float), 0–60. Compared implementations: AC-SpGEMM, cuSparse, bhSparse, RMerge, nsparse, Kokkos. Per-matrix tick labels omitted.]