Towards Efficient SpMV on Sunway Many-core Architectures

Changxi Liu
School of Computer Science and Engineering, Beihang University, China
[email protected]

Biwei Xie
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China
[email protected]

Xin Liu
National Research Centre of Parallel Computer Engineering and Technology, China
[email protected]

Wei Xue
Department of Computer Science and Technology, Tsinghua University, China
[email protected]

Hailong Yang
School of Computer Science and Engineering, Beihang University, China
[email protected]

Xu Liu
Department of Computer Science, College of William and Mary, USA
[email protected]

ABSTRACT
Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity of SpMV makes it challenging to achieve high performance, especially when porting to new architectures. In this paper, we present our work on designing and implementing efficient SpMV algorithms on Sunway, a novel architecture with many unique features. To fully exploit the Sunway architecture, we have designed a dual-side multi-level partition mechanism on both sparse matrices and hardware resources to improve locality and parallelism. On one hand, we partition sparse matrices into blocks, tiles, and slices at different granularities. On the other hand, we partition the cores in a Sunway processor into fleets, and further dedicate the cores in a fleet as computation and I/O cores. Moreover, we have optimized the communication between partitions to further improve performance. Our scheme is generally applicable to different SpMV formats and implementations. For evaluation, we have applied our techniques atop a popular SpMV format, CSR. Experimental results on 18 datasets show that our optimization yields up to 15.5× (12.3× on average) speedups.

CCS CONCEPTS
• Mathematics of computing → Mathematical software performance; Computations on matrices; • Theory of computation → Parallel algorithms;

KEYWORDS
Sunway Architecture, Sparse Matrices, SpMV, Parallelism, Locality


ACM Reference Format:
Changxi Liu, Biwei Xie, Xin Liu, Wei Xue, Hailong Yang, and Xu Liu. 2018. Towards Efficient SpMV on Sunway Many-core Architectures. In ICS '18: 2018 International Conference on Supercomputing, June 12–15, 2018, Beijing, China. ACM, New York, NY, USA, Article 4, 11 pages. https://doi.org/10.1145/3205289.3205313

1 INTRODUCTION

Sparse Matrix-Vector Multiplication (SpMV) is an indispensable kernel for many applications from various domains. In the domain of high performance computing (HPC), applications such as computational fluid dynamics (CFD) and molecular dynamics (MD) rely heavily on linear algebra algorithms, where SpMV plays an important role. Moreover, many machine learning algorithms such as support vector machines (SVM) and sparse convolutional neural networks (CNN) extensively invoke SpMV computation. Finally, graph algorithms such as PageRank and breadth-first search can be abstracted as SpMV problems.

SpMV is well known for its irregular computation and memory access patterns. Such irregularity stems from random memory references, which make it difficult to exploit data locality. From the compiler's and programmer's perspective, the intrinsic irregular pattern is unpredictable at compile time because it highly depends on the sparsity pattern of the input matrices. From the hardware perspective, such an irregular pattern incurs potential write conflicts, limiting instruction- and thread-level parallelism. Thus, it is challenging to implement SpMV algorithms efficiently.

It becomes even more challenging when porting SpMV algorithms to new architectures. In this paper, we target Sunway [12], an emerging architecture developed for clusters in the HPC domain. The Sunway TaihuLight supercomputer is powered by SW26010 many-core RISC processors based on the Sunway architecture [12], with 10,649,600 cores in total. Sunway TaihuLight achieves a peak performance of 125 PFlops and has ranked first on the TOP500 list since June 2016. Sunway TaihuLight has demonstrated its powerful computing capacity; two applications, Dynamic Model [31] and Earthquake Simulation [11], running on the whole system of Sunway TaihuLight, have received the ACM Gordon Bell Prize.


Figure 1: The architecture of the Sunway processor. (Figure omitted: four core groups, CG 0-3, connected via a NoC; each CG contains an MPE, a memory controller (MC) with its attached memory, and an 8×8 mesh of CPEs (CPE 0-63), each with its own LDM.)

Recent efforts on porting various computation kernels, such as DNN [10], BFS [17], SpTRSV [27], and Stencil [1], reveal the unique techniques required to optimize application performance on the Sunway architecture. However, the Sunway architecture still lacks basic computation libraries, such as SpMV. This paper is the first effort to study porting SpMV to the Sunway architecture.

As an unconventional architecture, Sunway differs significantly from existing architectures such as GPGPUs, Intel Xeon Phi, and general-purpose CPUs, so SpMV algorithms such as CSR [22], CSR5 [18], and Block Ellpack [8] that are designed for those architectures cannot adapt to the cache-less design of the Sunway architecture for high performance. We detail the architectural differences and highlight the challenges in the next section.

To address these challenges, such as load balancing under massive parallelism and efficient memory access under a cache-less design, we present a novel SpMV scheme for the Sunway architecture. Our technique is generally applicable to any SpMV format designed for existing platforms and bridges the gap between existing SpMV designs and the Sunway architecture. To fully exploit Sunway's new architectural features for the SpMV algorithm, we have designed a dual-side multi-level partition mechanism. From the hardware perspective, we partition the cores in a single Sunway processor (also known as a core group) into eight fleets, each of which consists of eight cores as a basic processing unit. The cores in each fleet are further partitioned into seven computation cores and one I/O core. The computation cores perform the SpMV computation, while the I/O core writes the results back to the main memory.

From the software perspective, we first partition the input sparse matrix into blocks, which are assigned to fleets. Each block is further decomposed into several tiles, which are processed one by one by the computation cores in a fleet. Moreover, we partition a tile into a few slices to benefit from the vectorization and register communication provided by the Sunway architecture. Intuitively, our partition technique naturally maps the SpMV algorithm to the Sunway architecture and thus benefits from both parallelism and locality.

While our technique is general to various SpMV formats, for evaluation we apply it to a popular SpMV format, CSR. We denote our new SpMV implementation as BT-CSR. We evaluate BT-CSR on Sunway using 18 matrices, covering both scale-free and HPC datasets. We compare BT-CSR with existing SpMV implementations: CSR, CSR5, and Block Ellpack [3]. Experimental results show that BT-CSR achieves the highest throughput and scalability, yielding speedups up to 15.5× (12.3× on average) over the baseline CSR algorithm.

The remainder of this paper is organized as follows. In Section 2, we describe the background and summarize the challenges of implementing SpMV on the Sunway architecture. Section 3 presents the design of our methodology: the dual-side multi-level partition mechanism. Section 4 gives the details of the SpMV implementation based on BT-CSR. Section 5 elaborates on the experimental setup and analyzes the experimental results. Section 6 describes related work, and Section 7 concludes this paper.

2 BACKGROUND AND CHALLENGES

In this section, we introduce the Sunway architecture, give an overview of the SpMV algorithm, and highlight the challenges in porting SpMV to Sunway.

2.1 Sunway SW26010 Many-Core Processor

The Sunway SW26010 many-core processor is the basic building block of the Sunway TaihuLight supercomputer. Figure 1 shows the Sunway architecture in detail.

The whole Sunway processor consists of four core groups (CG). Each CG has 765 GFlops double-precision peak performance and 34.1 GB/s theoretical memory bandwidth. One CG includes a DDR3 Memory Controller (MC), a Management Processing Element (MPE), and a Computing Processing Element (CPE) cluster with 64 CPEs connected through an 8×8 mesh. The MPE and CPEs run at the same 1.45 GHz frequency but have different architectures designed for different purposes. The MPE is designed for task management because it supports complete interrupt functions and out-of-order execution, similar to most mainstream processors. In contrast, a CPE is a simplified 64-bit RISC core designed for high computing throughput. Each CPE core has two pipelines, P0 and P1; P0 is used for floating-point and vector operations, while P1 is dedicated to memory-related operations. Moreover, Sunway introduces a new register communication mechanism by which CPEs in the same row or column of the mesh can communicate with each other within ten cycles, which is much cheaper than a memory access. This register communication mechanism exchanges data between CPEs without moving data across other costly layers in the memory hierarchy.

As for Sunway's memory hierarchy, each MPE has a 32KB L1 data cache and a 256KB L2 data/instruction cache, while each CPE has a 16KB L1 instruction cache and a 64KB scratchpad memory (SPM). The SPM can be configured as a programmable buffer or an automatic data cache. The programmable SPM, also named the local device memory (LDM), must be managed explicitly by software. Data movement between the LDM and memory is performed through direct memory access (DMA) to guarantee efficiency. The automatic data cache is based on global load/store (Gload/Gstore) operations, which are transparent to the programmer and invoked automatically. The difference between the two is that DMA is suitable for moving large data blocks, while Gload/Gstore is preferable for small and random data references.


In summary, we identify the following features that are critical for applications to fully exploit the computation capability of the Sunway architecture.

CPE - In order to take advantage of the massive parallelism of the CPEs, applications should be carefully parallelized.

Register communication - Since the register communication only supports data exchange in the same row/column of the mesh, adapting the communication pattern of the application is important to achieve efficient data communication across CPEs.

LDM - The LDM provides much higher bandwidth and lower access latency than the main memory. Therefore, a delicate mechanism is required to leverage the limited size of the LDM to speed up data accesses at runtime.

2.2 Sparse Matrix-vector Multiplication (SpMV)

Algorithm 1 scalar-SpMV with the CSR format.

1: for i = 0 to numRows − 1 do
2:   sum ← 0
3:   for j = row_ptr[i] to row_ptr[i + 1] − 1 do
4:     sum ← sum + vals[j] × x[col_idx[j]]
5:   end for
6:   y[i] ← y[i] + sum
7: end for

In this paper, we use matrix A, vector x, and vector y to describe the computation of SpMV (y = y + A × x). We illustrate the algorithmic procedure of SpMV and analyze the characteristics of the corresponding SpMV implementation with the widely used CSR format [22]. There are three vectors in CSR: vector vals stores the values of the non-zero elements; vector col_idx stores the column indices of the non-zero elements; and vector row_ptr stores the indices of the first non-zero element of each row in vectors vals and col_idx. Algorithm 1 shows the pseudo code of an SpMV implementation based on the CSR format. As shown in Algorithm 1, the intrinsic characteristics of SpMV, including poor locality, write conflicts, and load imbalance, raise challenges in achieving high performance on modern multi-core and many-core architectures.
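For concreteness, the following is a plain C rendering of Algorithm 1. The array names (vals, col_idx, row_ptr) follow the CSR description above; the tiny hard-coded matrix in main is only an illustrative example, not part of the paper's evaluation.

```c
/* Plain C rendering of Algorithm 1 (scalar CSR SpMV, y = y + A*x).
 * Array names follow the CSR description above. */
#include <stdio.h>

static void spmv_csr(int num_rows, const int *row_ptr, const int *col_idx,
                     const double *vals, const double *x, double *y) {
    for (int i = 0; i < num_rows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += vals[j] * x[col_idx[j]];   /* random access to x */
        y[i] += sum;
    }
}

int main(void) {
    /* 3x3 example: [[1 0 2], [0 3 0], [4 0 5]] in CSR form. */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col_idx[] = {0, 2, 1, 0, 2};
    double vals[]    = {1, 2, 3, 4, 5};
    double x[]       = {1, 1, 1};
    double y[]       = {0, 0, 0};
    spmv_csr(3, row_ptr, col_idx, vals, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);  /* prints y = [3 3 9] */
    return 0;
}
```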

Poor locality - SpMV is inherently memory bound due to the random memory references to vector x, which also lead to poor cache locality and low memory bandwidth utilization. The memory access pattern of SpMV is highly dependent on the sparsity of the matrix, which is unpredictable at compile time.

Write conflict - Writes to vector y from multiple threads or SIMD lanes may conflict at runtime, especially when multiple threads or lanes write to the same location of vector y simultaneously. Although write conflicts can be resolved using atomic operations, this depends on special hardware support; otherwise the overhead is unaffordable.

Load imbalance - Due to the irregularity of the sparse matrix, the number of non-zero elements in each row of the matrix can be imbalanced. Moreover, the distribution of non-zero elements can also vary from row to row. Thus, it is challenging to devise an efficient SpMV implementation due to the inherent load imbalance.

Figure 2: The multi-level partition of a sparse matrix. M and N are the number of rows and columns of the input matrix, respectively. (Figure omitted: matrix A is split into M/θ blocks of θ rows each; each block into N/δ tiles of size θ×δ; each tile into slices of size ω×δ; empty blocks, tiles, and slices are skipped.)

2.3 Challenges in Porting SpMV to Sunway

Given the unique characteristics of the Sunway architecture, existing SpMV algorithms are far from the bare-metal performance. We mainly study three state-of-the-art algorithms—CSR, Block Ellpack, and CSR5—and show in Section 5 that all of them, with naive porting efforts, achieve poor performance. We further identify the reasons for such poor performance as follows:
• The large number of cores in a Sunway processor requires fine-grained parallelism management in the SpMV algorithm. With a careless design, it is easy to introduce load imbalance across cores.
• As the Sunway architecture provides a unique shared-memory communication strategy via registers, using only the default communication through main memory without the register communication does not yield high performance on the Sunway architecture.
• The LDM in the Sunway processor requires manual effort for data placement. SpMV algorithms without explicit data management of the LDM can suffer from significant performance degradation. Moreover, the software-managed LDM incurs a new data coherence problem in the whole memory hierarchy. It introduces new overhead that requires careful control to guarantee program correctness as well as performance.

In the next section, we describe our approach that addresses all of these challenges raised by the Sunway architecture.

3 METHODOLOGY

In this section, we present our dual-side multi-level partitioning technique, specifically designed for SpMV running on the Sunway architecture. The high-level idea is to partition the computation of SpMV into three levels from both sides—the input matrix and the hardware resources. The computation at each level is naturally mapped to the corresponding level of hardware partitions to benefit from both parallelism and locality.


Figure 3: The multi-level partition of the hardware resources. (a) The partition of the cores. (b) The format for register communication. (Figure omitted: each fleet (Fleet 0-7) is one row of eight CPEs, seven computation cores plus one I/O core; slices of matrix A and the δ-sized segment of vector x reside in the computation cores' LDM, the θ-sized segment of vector y in the I/O core's LDM buffer, and results travel over register communication; the 32-byte message consists of Fin (4B), RowIdx (4B), and three 8-byte data fields.)

The input matrix is divided into three-level partitions—block, tile, and slice—which provide different granularities for task management. In the meantime, the many cores on the Sunway processor are separated into fleets, computation cores, and I/O cores. In the rest of this section, we elaborate on this partitioning technique.

3.1 Partitioning the Sparse Matrix

As shown in Figure 2, we partition the input matrix (M × N) with our multi-level strategy. We introduce three concepts to describe the data partitions at different levels: block, tile, and slice. First, we partition the original sparse matrix into blocks, each of which consists of θ rows of the input matrix; thus the size of a block is θ × N. A block is further divided into tiles, each of size θ × δ. Finally, a tile is divided into slices, each of size ω × δ. We defer the discussion of how to choose appropriate values for these parameters to Section 4.4. An empty block, tile, or slice means that all elements within that partition are zero, so it does not need to be processed. We elaborate on how we map these data partitions onto the Sunway processor in Section 3.3.
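As a quick illustration of this geometry, the sketch below counts how many blocks, tiles per block, and slices per tile a given M×N matrix yields for chosen θ, δ, and ω. The example dimensions are arbitrary, and rounding partial partitions up at the matrix edges is our assumption rather than something stated in the paper.

```c
/* Sketch of the three-level partition geometry from Section 3.1.
 * Partial partitions at the matrix edges are rounded up (our assumption). */
#include <stdio.h>

static int div_up(int a, int b) { return (a + b - 1) / b; }

int main(void) {
    int M = 100000, N = 100000;   /* matrix dimensions (arbitrary example)   */
    int theta = 8192;             /* rows per block (see Section 4.4)        */
    int delta = 256;              /* columns per tile (see Section 5.4)      */
    int omega = 3;                /* rows per slice (see Section 4.4)        */

    int blocks          = div_up(M, theta);     /* Level 1: mapped to fleets */
    int tiles_per_block = div_up(N, delta);     /* Level 2: mapped to cores  */
    int slices_per_tile = div_up(theta, omega); /* Level 3: vectorization    */

    printf("blocks=%d, tiles/block=%d, slices/tile=%d\n",
           blocks, tiles_per_block, slices_per_tile);
    return 0;
}
```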

3.2 Partitioning the Cores

Due to the high latency of memory accesses on the Sunway processor, writing a result to memory every time a non-zero element is processed degrades performance significantly. Fortunately, the Sunway processor supports the unique feature of register communication, based on which we can design a partitioning method with better memory efficiency. We divide all the cores of a Sunway processor into eight fleets, each of which consists of the eight CPE cores in the same row of the CPE mesh. The fleets are assigned different rows of the input matrix and are thus independent from each other when performing the SpMV computation. We further assign the cores in the same fleet two different roles: computation core and I/O core. The computation cores are responsible for the SpMV computation, whereas the I/O core buffers the intermediate results and writes the final results back to memory when the computation is done.
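A minimal sketch of this core partition follows: fleets correspond to rows of the 8×8 CPE mesh, and one core per fleet serves as the I/O core. The text does not specify which position in a row plays the I/O role, so picking the last core of each row below is purely an illustrative assumption.

```c
/* Sketch: mapping a CPE id (0-63) on the 8x8 mesh to a fleet and a role.
 * Fleets follow mesh rows; the last core of each row is (arbitrarily)
 * chosen here as the fleet's I/O core. */
#include <stdio.h>

typedef enum { COMPUTATION_CORE, IO_CORE } core_role_t;

static int fleet_of(int cpe_id) { return cpe_id / 8; }         /* mesh row   */
static core_role_t role_of(int cpe_id) {
    return (cpe_id % 8 == 7) ? IO_CORE : COMPUTATION_CORE;     /* assumption */
}

int main(void) {
    for (int id = 0; id < 64; id++)
        printf("CPE %2d -> fleet %d, %s\n", id, fleet_of(id),
               role_of(id) == IO_CORE ? "I/O core" : "computation core");
    return 0;
}
```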

In Figure 3(a), we show in detail how the hardware resources are partitioned. Each computation core processes its corresponding input slices and transfers the results to the I/O core through register communication. The I/O core is dedicated to writing the computation results to vector y in memory. The I/O core maintains a buffer that stores θ values of vector y, which are frequently accessed during the computation of the corresponding block.

We further design a data format to facilitate vectorization and the data transfer from the computation cores to the I/O core. The size of a message in the register communication on the Sunway processor is 32 bytes, which is also the width of a register for vectorization. We divide the register into two parts: one for the auxiliary information and the other for the results. As shown in Figure 3(b), the first 8 bytes are occupied by two variables, Fin and RowIdx. Fin denotes whether there are still tiles left in this block for processing. RowIdx indicates the index of the first row in the current slice.
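The 32-byte message can be modeled as a C struct. The field names and sizes (4-byte Fin, 4-byte RowIdx, three 8-byte results) follow Figure 3(b), but the struct itself is a sketch of the layout rather than the authors' definition.

```c
/* Sketch of the 32-byte register-communication message from Figure 3(b):
 * 4B Fin + 4B RowIdx + three 8B results (one slice, i.e. omega = 3 rows). */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t fin;      /* 0 = no tiles left in the block, otherwise keep going */
    uint32_t row_idx;  /* index of the first row covered by this slice         */
    double   data[3];  /* partial results for the omega rows of the slice      */
} reg_msg_t;

/* The message must match the 32-byte register width used for vectorization. */
static_assert(sizeof(reg_msg_t) == 32, "register message must be 32 bytes");

int main(void) {
    printf("message size = %zu bytes\n", sizeof(reg_msg_t));
    return 0;
}
```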

3.3 Mapping SpMV to Many-core Sunway

In our approach, the input matrix and the hardware resources are partitioned separately. We perform the SpMV computation by mapping the partitions to the Sunway processor at each level, as shown in Figure 4. At Level 1, the original sparse matrix is partitioned into (M/θ) blocks. These blocks are stored in a global block queue, which is shared by all the fleets. Thus, each fleet can fetch another block from this queue when its resources become available; when no block is left in the queue, the entire SpMV finishes. At Level 2, a fleet processes its blocks. First, the block is split into (N/δ) tiles, which form a tile queue. This tile queue is shared among the cores within the same fleet, including seven computation cores and one I/O core. Each computation core fetches a tile from the tile queue when it becomes available. At Level 3, the tile is further divided into (θ/ω) slices to facilitate vectorization and the data transfer from the computation cores to the I/O core. The slice is the basic unit processed by a computation core in our method. The work-sharing mechanism of the block and tile queues guarantees workload balance across fleets and cores.
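The block and tile queues behave like shared counters that cores advance to claim work; the sketch below models this idea with C11 atomics on a generic shared-memory machine. How the queues are actually arbitrated on the CPE mesh is not described in the text, so this is an assumption used only to illustrate the work-sharing scheme.

```c
/* Sketch of the work-sharing idea behind the block and tile queues:
 * each fleet (or core) claims the next unprocessed index from a shared
 * counter. C11 atomics stand in for whatever arbitration Sunway provides. */
#include <stdatomic.h>
#include <stdio.h>

typedef struct {
    atomic_int next;   /* index of the next block (or tile) to hand out */
    int        total;  /* total number of blocks (or tiles)             */
} work_queue_t;

/* Returns the next index to process, or -1 once the queue is exhausted. */
static int queue_fetch(work_queue_t *q) {
    int idx = atomic_fetch_add(&q->next, 1);
    return (idx < q->total) ? idx : -1;
}

int main(void) {
    work_queue_t block_queue = { .next = 0, .total = 5 };
    int b;
    while ((b = queue_fetch(&block_queue)) != -1)
        printf("fleet processes block %d\n", b);
    return 0;
}
```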


Figure 4: Mapping the computation of a sparse matrix to the hardware resources using the three-level partitions. (Figure omitted: Level 1 maps the M/θ blocks of the M×N matrix to the 8 fleets of the 64-core processor, which process blocks independently; Level 2 maps the N/δ tiles of a block to the 7 computation cores of a fleet, which process tiles independently; Level 3 splits a tile into θ/ω slices of size ω×δ, which the computation cores process with the VPU and reduce onto the I/O core slice by slice.)

4 IMPLEMENTATION DETAILS

To demonstrate the effectiveness of our dual-side multi-level partitioning approach, we apply it to CSR, one of the most popular SpMV formats. We refer to our customized format as BT-CSR. It is worth noting that our technique is generally applicable to other SpMV formats. In this section, we elaborate on the implementation details of BT-CSR, including the processing logic of the computation cores and the I/O core in each fleet and how they collaborate with each other. Naturally, our multi-level partitioning maps the computation to the hardware at each level. At the top level, the fleets take charge of processing blocks. Within each fleet, the computation cores and the I/O core form a logical pipeline, where the computation cores iterate through the lower-level partitions (tiles and slices) of the input matrix to perform the calculation and the I/O core buffers the intermediate results before writing them to memory.

4.1 Processing Logic of Computation Core

As aforementioned, each fleet is divided into computation cores and an I/O core, with the computation cores responsible for performing the SpMV computation and the I/O core for writing results back to memory. One critical factor that affects the performance of the computation procedure is the frequent data accesses to memory. There are two ways to reference memory on Sunway: Gload/Gstore and DMA. Gload/Gstore supports loading random data from memory, while DMA supports loading contiguous data in batches. Since our approach prefetches vector x, DMA is preferred over Gload/Gstore. Although the LDM has the potential to reduce data access latency, it requires careful management due to its limited size.

As discussed in Section 3, we divide the sparse matrix into blocks and further into tiles. Tiles are the basic task units that we assign to different computation cores. When a computation core processes a tile, it further divides the tile into slices for vectorization and register communication. Prefetching data into the LDM before computation can speed up memory accesses; however, a tile is too large to be loaded at once, whereas a slice is too small to be transferred efficiently. Therefore, we combine multiple slices into a batch to achieve efficient data transfer between the memory and the LDM. We pre-load the data from memory to the LDM in batches and make sure that there is always enough data in the LDM for processing. For example, batch n+1 is pre-loaded while we are processing batch n. Note that the computation core uses batches to pre-load data, but still uses the slice as its basic computation unit.
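The batching and pre-loading just described amounts to classic double buffering: compute on batch n from one LDM buffer while batch n+1 is being loaded into the other. The sketch below models that pattern; memcpy stands in for the asynchronous DMA transfer, and the buffer size, names, and dummy computation are our own assumptions, not the paper's implementation.

```c
/* Sketch of double-buffered batch prefetch: while batch n is processed from
 * one LDM-resident buffer, batch n+1 is loaded into the other buffer.
 * memcpy stands in for the asynchronous DMA transfer on Sunway. */
#include <stdio.h>
#include <string.h>

#define BATCH_BYTES 64                 /* illustrative LDM batch size */

static char ldm_buf[2][BATCH_BYTES];   /* two LDM-resident buffers    */

static long process_batch(const char *batch, int bytes) {
    long s = 0;                        /* dummy "computation" on the batch */
    for (int i = 0; i < bytes; i++) s += batch[i];
    return s;
}

static long run_tile(const char *tile, int num_batches) {
    long total = 0;
    memcpy(ldm_buf[0], tile, BATCH_BYTES);            /* prime: load batch 0 */
    for (int n = 0; n < num_batches; n++) {
        int cur = n & 1, nxt = cur ^ 1;
        if (n + 1 < num_batches)                      /* pre-load batch n+1  */
            memcpy(ldm_buf[nxt], tile + (size_t)(n + 1) * BATCH_BYTES,
                   BATCH_BYTES);
        total += process_batch(ldm_buf[cur], BATCH_BYTES);  /* compute batch n */
    }
    return total;
}

int main(void) {
    char tile[4 * BATCH_BYTES];
    memset(tile, 1, sizeof tile);
    printf("checksum = %ld\n", run_tile(tile, 4));    /* prints 256 */
    return 0;
}
```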

Algorithm 2 elaborates the implementation details. We first describe the notation used in the algorithm. bl_set, tl_set, and sl_batch_set denote the set of blocks in the current matrix, the set of tiles in the current block, and the set of slices in the current batch, respectively. bln, tln, and sln_batch denote the sizes of the block set, the tile set, and the batch set, respectively. The values of x_tile are pre-loaded from vector x according to which values will be used in the current tile. dt_batch and ci_batch store the values and column indices of the non-zero elements in the current batch. msg stores the message that will be transferred to the I/O core.

Each slice contains ω rows, which means there are ω results after the current slice is processed. We use vector Data, whose size is ω, to store these intermediate results. Every time the computation core finishes processing a non-zero element, it adds the result to the corresponding element of vector Data. When the computation core finishes processing a slice, vector Data, which holds the intermediate results, needs to be written back towards vector y (lines 21-25). We use the format shown in Figure 3(b) to store the results before transferring them to the I/O core (line 27). Note that Fin is the flag indicating whether the current computation core needs to fetch another block for processing, and RowIdx is the index of the starting row of the current slice. If Fin in msg is set to zero, all tiles in the current block have been processed, so no extra work needs to be done by the current computation core (lines 31-32).

A batch contains sln slices to process. After the computation core finishes processing a batch (lines 18-28), it triggers the pre-load of the next batch and moves its data pointer to the pre-loaded batch (lines 12-29). After the computation core finishes processing all the batches of a tile, it carries on to the next tile until no tiles are left for processing in the current block (lines 6-30). Then the computation core stays idle until it receives the restart notification from the I/O core. This notification is a special message sent from the I/O core to the computation cores in the same fleet, indicating that all non-zero elements in the current block have been processed; the computation core then proceeds to the next block until no blocks are left in the block queue. When all blocks have been processed, the computation core finishes its work (lines 1-35).

4.2 Processing Logic of I/O Core

On the Sunway processor, the latency of directly accessing memory is quite high, and the random memory accesses of SpMV exacerbate this performance penalty. To solve this problem, we leverage the I/O core to reduce the number of writes to memory.


Algorithm 2 Processing logic on the computation core.
 1: for bid = 1 → bln do
 2:   /* Iterate through all the blocks */
 3:   Fin ← RUN
 4:   tln ← bl_set(bid).tln
 5:   tl_set ← bl_set(bid).tl_set
 6:   for tid = 1 → tln do
 7:     /* Iterate through all the tiles in a block */
 8:     tl ← tl_set(tid)
 9:     x_tile ← tl.x
10:     sl_batch_set ← tl.sl_batch_set
11:     sln_batch ← tl.sln_batch
12:     for slid_batch = 1 → sln_batch do
13:       /* Iterate through all the batches in a tile */
14:       sl_batch ← sl_batch_set(slid_batch)
15:       dt_batch ← sl_batch.dt
16:       ci_batch ← sl_batch.ci
17:       sln ← sl_batch.sln
18:       for slid = 1 → sln do
19:         /* Iterate through all the slices in a batch */
20:         RowIdx ← sl_batch(slid).ri
21:         for ω_id = 1 → ω do
22:           /* Store intermediate results */
23:           Data(ω_id) ←
24:             dt_batch(slid)(ω_id) × x_tile(ci_batch(slid)(ω_id))
25:         end for
26:         msg ← {Fin, RowIdx, Data(1 : ω)}
27:         RegSend(msg, ResRegIndex)   /* Send message to I/O core */
28:       end for
29:     end for
30:   end for
31:   Fin ← EXIT
32:   msg ← {Fin, 0}
33:   RegSend(msg, ResRegIndex)   /* Send finish message to I/O core */
34:   RegRecv(rcf)   /* Receive synchronization message */
35: end for

The I/O core is dedicated to buffering the intermediate results received from the computation cores and writing the results back to vector y in memory when an entire block has been processed. We assign one core as the I/O core within each fleet and introduce vector ty to store the intermediate results. The size of vector ty is θ, since there are at most θ rows in each block.

Algorithm 3 shows the implementation details of the I/O procedure. We use vector ty to buffer the intermediate results that will finally be written back to vector y. In other words, we cache a segment of vector y in vector ty in the LDM. yb indicates the position of the first element of vector ty in vector y, and yn denotes the size of vector ty. Each time a computation core finishes a slice, the intermediate results are sent to the I/O core and then accumulated into ty. CCN is the number of computation cores in a fleet, and fnc is a counter that records the number of computation cores that have finished processing the current block. Each time the I/O core receives a message from a computation core, it first examines the Fin field (line 8). If Fin is non-zero, the message is a reduction request, which carries data that needs to be reduced into ty. Otherwise, it is a finish notification indicating that there are no tiles left for processing in the current block. A reduction request is raised by a computation core when it finishes processing a slice.

The format of the reduction request is shown in Figure 3(b). The RowIdx field of the reduction request indicates the row index of the slice that triggered the request. To handle a reduction request, the I/O core accumulates data1, data2, and data3 into vector ty at indices RowIdx, RowIdx+1, and RowIdx+2, respectively (line 16).

Algorithm 3 Processing logic on the I/O core.
 1: for bid = 1 → bln do
 2:   yb ← bl_set(bid).yb
 3:   yn ← bl_set(bid).yn
 4:   ty ← y(yb : yn)
 5:   while (1) do
 6:     RegRecv(RecvInfo)
 7:     {Fin, RowIdx, Data(1 : ω)} ← RecvInfo
 8:     if Fin == EXIT then
 9:       fnc ← fnc + 1
10:       if fnc == CCN then
11:         RegSend(SendInfo, 1 : CCN)
12:         fnc ← 0
13:         break
14:       end if
15:     end if
16:     ty(RowIdx : ω) += RecvInfo(1 : ω)
17:   end while
18:   y(yb : yn) ← ty
19: end for

If the Fin field of the message equals zero, the counter fnc is incremented by one to record the number of computation cores that have finished their work for the current block (line 9). When fnc equals CCN, the total number of computation cores in a fleet, all computation cores have finished their work. The I/O core then notifies the computation cores in the same fleet to process the next block and resets fnc to zero (lines 10-12). The fleet continues with the next block until all blocks are exhausted (lines 1-19).
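The heart of the I/O-core loop in Algorithm 3 is this reduction step: accumulate each incoming slice result into ty at rows RowIdx through RowIdx+2, and count finish notifications until all seven computation cores are done with the block. The sketch below isolates that step; the message struct reuses the layout sketched in Section 3.2, and all names are illustrative rather than the paper's code.

```c
/* Sketch of the I/O-core reduction step (Algorithm 3, lines 8-16): each
 * reduction request carries omega = 3 partial results that are added into
 * the LDM buffer ty at rows RowIdx .. RowIdx+2; finish notifications are
 * counted until all CCN computation cores are done with the block. */
#include <stdio.h>

#define OMEGA    3
#define CCN      7        /* computation cores per fleet */
#define EXIT_MSG 0

typedef struct { unsigned fin, row_idx; double data[OMEGA]; } reg_msg_t;

/* Returns 1 while the block is still in flight, 0 once all CCN computation
 * cores have sent their finish notification (time to write ty back to y). */
static int io_handle_msg(const reg_msg_t *msg, double *ty, int *fnc) {
    if (msg->fin == EXIT_MSG) {
        if (++(*fnc) == CCN) { *fnc = 0; return 0; }
        return 1;
    }
    for (int k = 0; k < OMEGA; k++)       /* reduction request */
        ty[msg->row_idx + k] += msg->data[k];
    return 1;
}

int main(void) {
    double ty[16] = {0};
    int fnc = 0;
    reg_msg_t m = { 1, 4, {1.0, 2.0, 3.0} };  /* one slice result, rows 4-6 */
    io_handle_msg(&m, ty, &fnc);
    printf("ty[4..6] = %g %g %g\n", ty[4], ty[5], ty[6]);
    return 0;
}
```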

4.3 Synchronization across Cores

We design a synchronization mechanism based on the Bulk Synchronous Parallel (BSP) model to enable synchronization between the computation cores and the I/O core of the same fleet, while the computation across different fleets is performed independently. Two situations trigger synchronization between the computation cores and the I/O core: the reduction request and the finish notification. On the Sunway processor, we use register communication to send messages from the computation cores to the I/O core. Even though the I/O core may receive multiple messages from different computation cores simultaneously, the message processing mechanism on Sunway guarantees that these messages are received reliably, which ensures the correctness of the computational results.

Each finish notification indicates that no tiles in the block remain to be processed by the sending computation core. After sending the message, the computation core stays idle until it receives the restart notification from the I/O core, indicating that the whole fleet is to process the next block. The I/O core uses the variable fnc to record the number of finish notifications it has received. When fnc reaches CCN, all the computation cores have finished their work. The I/O core then broadcasts the restart notification to each computation core, writes vector ty back to vector y in main memory, and restores its data structures. Restoring the data structures includes: 1) advancing the pointer of the task queue to the next block; and 2) resetting the range of vector ty on the I/O core.


4.4 Parameter Tuning

There are three parameters in our dual-side multi-level partitioning method: θ, δ, and ω. θ indicates the number of buffered elements of vector ty on the I/O core. δ determines the number of pre-loaded elements of vector x during the calculation procedure. ω is the number of rows in a slice and also the number of intermediate results that a reduction request carries. While a fleet processes a block, the intermediate results are buffered in vector ty in the LDM of the I/O core. The size of vector ty is simply the number of rows (θ) in a block. As the maximum size of vector ty is limited by the size of the LDM (64KB), the value of θ is also determined: for double-precision data, θ = 64KB/8B = 8192. The value of δ depends on the sparsity pattern of the input matrix; we discuss its impact on performance in Section 5.4. On the Sunway processor, the reduction request transferred through register communication can carry at most three double-precision values, so we set ω to three; each time a slice is processed, its intermediate results can be written back in a single message.
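The arithmetic behind these choices is spelled out below; the small program only reproduces the two derivations in this subsection (θ from the 64 KB LDM and ω from the 32-byte register message).

```c
/* Worked arithmetic for the parameter choices in Section 4.4. */
#include <stdio.h>

int main(void) {
    int ldm_bytes  = 64 * 1024;                /* LDM size per CPE            */
    int elem_bytes = 8;                        /* one double-precision value  */
    int theta      = ldm_bytes / elem_bytes;   /* rows of y buffered: 8192    */

    int msg_bytes  = 32;                       /* register message width      */
    int header     = 4 + 4;                    /* Fin + RowIdx                */
    int omega      = (msg_bytes - header) / elem_bytes;  /* 3 results/slice   */

    printf("theta = %d, omega = %d\n", theta, omega);
    return 0;
}
```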

5 EVALUATION

5.1 Experiment Setup

Our experiments are conducted on one CG of a Sunway SW26010 processor. The performance of SpMV on a CG is important for scientific applications, since many of them decompose the computation problem at the granularity of a CG and perform SpMV intensively within a CG. We use the native compilers swcc and sw5cc on Sunway for C and C++ programs, respectively. To evaluate our method BT-CSR, we select 18 representative sparse matrices from the University of Florida Sparse Matrix Collection [9], which are listed in Table 1. Note that the SW26010 provides two different methods to fetch data from memory: one is to access memory through Gload/Gstore instructions controlled by the compiler, and the other is to use DMA controlled by the programmer. In our implementations, we try to make memory accesses as efficient as possible with either Gload/Gstore instructions or DMA. All experiments use double precision. We run each experiment five times and report the average result; based on our observation, the variance of execution time across runs is quite small on the Sunway processor. For comparison, we also implement four other SpMV variants on the SW26010. We use CSR-MPE as our baseline for performance comparison.

• CSR-MPE, which implements SpMV on the MPE of the Sunway processor based on Algorithm 1.
• CSR-CPE, which is also based on Algorithm 1 but leverages the CPEs of a Sunway processor. The matrix data (values, column indices, and row pointers) and vector y are transferred from memory to the LDM using DMA. Since the access pattern of vector x is irregular, its accesses go through Gload/Gstore instructions.
• CSR5-CPE, which is implemented on Sunway with CPEs based on CSR5 [18]. CSR5 is the most cutting-edge SpMV implementation on various platforms. All the data required for computation is transferred from memory to the LDM using DMA, except vector x.
• Block-Ellpack-CPE, which is implemented on Sunway with CPEs based on Block Ellpack [8]. Block Ellpack has proved quite efficient on many-core architectures such as GPUs. All the data required for computation is transferred from memory to the LDM using DMA. We set the block size to 8.

Table 1: The datasets used for evaluation.

Matrix             row×col          nnz      nnz/row
dense              2K × 2K          4M       2K
crankseg_2         63K × 63K        14.1M    221
F1                 343K × 343K      26.8M    78
nd24k              72K × 72K        28.7M    398
pdb1HYS            36K × 36K        4.3M     119
cant               62K × 62K        4.0M     64
pwtk               218K × 218K      11.5M    52
ldoor              952K × 952K      42.5M    44
qcd5_4             49K × 49K        1.9M     39
cop20k_A           121K × 121K      2.6M     21
cage14             1.5M × 1.5M      27.1M    18
2cubes_sphere      101K × 101K      1.6M     16
atmosmodd          1.2M × 1.2M      8.8M     6
mac_econ_fwd500    206K × 206K      1.3M     6
scircuit           171K × 171K      959K     5
shallow_water1     82K × 82K        327K     4
webbase-1M         1M × 1M          3.1M     3
bcsstm38           8K × 8K          10K      1

5.2 Isolated SpMV Performance Analysis

Figure 5 presents the isolated SpMV performance of our approach and the four other SpMV implementations on the Sunway processor. It is clear that our approach gives the best performance across all datasets compared to the other implementations. Overall, our approach BT-CSR achieves a 12.3× speedup on average over the baseline when using 64 cores. It is also interesting to notice that, except for our approach, the other implementations actually experience performance degradation to a certain extent in quite a few cases compared to the baseline. The reason is that, although the CPEs provide much higher bandwidth than the MPE, the irregular access pattern of vector x forces these implementations to transfer data from/to memory with Gload/Gstore instructions instead of the more efficient DMA with the LDM. For the CSR5-CPE implementation, another problem that limits its performance on the Sunway processor is that the original CSR5 implementation heavily relies on SIMD instructions to boost its performance. However, to the best of our knowledge, the Sunway processor supports only a quite limited set of SIMD instructions, which makes CSR5 less appealing for performing SpMV efficiently on the Sunway processor.


Figure 5: Comparison of isolated SpMV performance (GFLOPS, sharing the same y-axis) between BT-CSR-CPE and four other SpMV implementations (CSR-MPE, CSR-CPE, CSR5-CPE, and Block-Ellpack-CPE) on the Sunway processor with the number of cores equal to 8, 16, 32, and 64 (along the x-axis), over the 18 datasets in panels (a) dense2 through (r) bcsstm38. We use CSR-MPE as our baseline for performance comparison, which performs SpMV on the MPE sequentially. It is worth noting that without thorough optimization, SpMV parallelized on the CPEs runs even slower than the serial version on the MPE. (Plots omitted.)

To address the irregular accesses of vector x, previous literature [2, 4, 8, 20, 30] has pointed out that blocking techniques are effective. However, as shown in Figure 5, Block-Ellpack does not behave well across all datasets. Based on the observation in [8], we set the block size of Block-Ellpack to 8. The performance of Block-Ellpack is much better than the baseline on datasets such as dense2 and nd24k. However, on the dataset webbase-1M, the performance of Block-Ellpack is even worse than the baseline. This is because the limited block size prevents efficient reuse of vector x, while setting the block size too large would transfer a lot of useless data into the LDM due to the sparsity of the matrix. Our approach avoids this problem of traditional blocking methods: vector y is shared by the entire fleet, and vector x is shared within a slice. As seen in Figure 5, BT-CSR achieves better performance on datasets such as webbase-1M and 2cubes_sphere, with 10.3× and 10× speedups over the baseline respectively, where Block-Ellpack performs poorly. Moreover, our experiments show that the overhead of preprocessing can be amortized over tens of iterations for most matrices.

Figure 5 also shows the scalability of each approach. As the number of cores increases, the number of fleets available to BT-CSR also increases, which enables BT-CSR to process more blocks simultaneously. It is clear in Figure 5 that the performance of BT-CSR increases as the number of cores increases, which demonstrates the good scalability of our approach across all datasets. For instance, with dense2, BT-CSR achieves 2.0×, 4.0×, and 7.6× speedups (compared to running with 8 cores) as the number of cores scales from 16 to 64. In contrast, Block-Ellpack does not scale well on datasets such as cop20k_A, 2cubes_sphere, and webbase-1M compared to BT-CSR. Similarly, CSR5 exhibits poor scalability on all datasets except scircuit and webbase-1M. Both CSR-MPE and CSR-CPE show similar scalability trends as CSR5 on most datasets.


Figure 6: The locality inefficiency of all SpMV implementations (CSR-CPE, CSR5-CPE, Block-Ellpack-CPE, and BT-CSR) across all datasets. Lower is better. (Plot omitted.)

5.3 Locality Analysis

To further understand the performance of SpMV, we analyze the locality of the memory accesses of all approaches. However, due to the unique design of the LDM and the limited support for hardware performance counters on Sunway, it is infeasible for us to report traditional locality metrics such as cache misses. Instead, we measure the number of memory accesses during the SpMV computation as the locality metric, since all cache misses eventually require memory accesses. We define the locality inefficiency as shown in Equation 1, which is the ratio of the actual number of memory accesses to the theoretical number of memory accesses for the SpMV computation. The theoretical number of memory accesses (memory_accesses_the) is easy to calculate: it equals the sum of the numbers of elements in matrix A, vector x, and vector y. For a specific dataset, memory_accesses_the is the same across all SpMV implementations. The actual number of memory accesses (memory_accesses_act) is difficult to measure directly due to the limited support for performance counters on Sunway. Therefore, we manually instrument all SpMV implementations to record memory_accesses_act on each dataset, which includes the accesses to matrix A, vectors x and y, as well as the auxiliary data structures used in each SpMV implementation.

Locality_Inefficiency = memory_accesses_act / memory_accesses_the    (1)
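A minimal model of this bookkeeping, assuming we simply count every DMA and Gload/Gstore transfer issued during the kernel: the actual count divided by the theoretical count of Equation 1 gives the locality inefficiency. The counter structure and the numbers below are our own, not the paper's instrumentation.

```c
/* Minimal model of the locality-inefficiency bookkeeping (Equation 1):
 * count the actual memory accesses during SpMV and divide by the
 * theoretical count (elements of A plus the lengths of x and y). */
#include <stdio.h>

typedef struct { long dma, gload_gstore; } access_counter_t;

static double locality_inefficiency(const access_counter_t *c,
                                    long nnz, long n_cols, long n_rows) {
    double actual      = (double)(c->dma + c->gload_gstore);
    double theoretical = (double)(nnz + n_cols + n_rows);   /* A, x, y */
    return actual / theoretical;
}

int main(void) {
    /* made-up counts for illustration only */
    access_counter_t c = { .dma = 5000000, .gload_gstore = 200000 };
    printf("inefficiency = %.2f\n",
           locality_inefficiency(&c, 4000000, 62000, 62000));
    return 0;
}
```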

Figure 6 shows the locality inefficiency of the four SpMV implementations on all datasets. BT-CSR achieves the lowest locality inefficiency across all datasets, which demonstrates that our approach is effective in reducing the memory traffic of SpMV on the Sunway processor. Note that the actual number of memory accesses is commonly larger than the theoretical number due to the auxiliary data structures used in each SpMV implementation. For instance, CSR5 requires additional data structures such as bitflag and tile_ptr to facilitate its methodology. We also notice that Block-Ellpack-CPE exhibits much worse locality inefficiency when dealing with datasets such as cop20k_A, 2cubes_sphere, mac_econ_fwd500, and scircuit. These datasets are so sparse that Block-Ellpack-CPE needs to pad many zeros for them, which leads to more useless memory traffic.

Figure 7: The performance sensitivity of SpMV using different values of δ (32 to 1024) and different numbers of fleets (1 to 8). Each cell in the heatmap is the harmonic mean of performance deviations from the optimal settings across all datasets, calculated by Equation 2. (Heatmap omitted.)

The fundamental reason for the better locality efficiency of our approach is its ability to reuse data within the LDM, which effectively reduces the number of accesses to memory. In addition, as we further break down the actual memory accesses into DMA and Gload/Gstore instructions through instrumentation, the results show that DMA requests dominate the memory accesses of BT-CSR, whereas Gload/Gstore instructions dominate the other three approaches. In sum, BT-CSR achieves the best locality by not only performing fewer accesses to memory, but also using efficient DMA to access memory.

5.4 Parameter Sensitivity Analysis

The value of δ is important for the performance of BT-CSR because δ determines the number of elements of vector x pre-loaded during the computation of a tile. If it is too small, there will be more tiles, which increases the accesses to vector x.


In contrast, if it is too large, there will be fewer tiles, which could lead to load imbalance among the cores within a fleet as well as more useless values of vector x being loaded.

To evaluate the performance sensitivity of SpMV to the setting of δ, we test all datasets by varying the value of δ from 32 to 1024 as well as the number of fleets from 1 to 8. We use γ to denote the number of fleets. Figure 7 shows the performance heatmap of SpMV using different values of δ and γ. Each cell in Figure 7 represents the harmonic mean of the performance deviations from the optimal settings across all datasets. The harmonic mean is calculated based on Equation 2. Harmonic_{γ,δ} is the harmonic mean of the performance deviations (Ratio_{γ,δ,α}) across all datasets under a specific setting of δ and γ, which is effective for measuring the performance impact of δ across datasets. We define DS as the set of all datasets and α as a specific dataset in DS. NumDS denotes the number of elements in DS. ∆ is the set of δ values, which is {32, 64, 128, 256, 512, 1024}. Γ is the set of γ values, which is {1, 2, 4, 8}.

Harmonic_{γ,δ} = NumDS / ( Σ_{α∈DS} 1 / Ratio_{γ,δ,α} )    (2)

Ratio_{γ,δ,α} represents the performance deviation on a specific dataset α under a specific setting of δ and γ. It is calculated based on Equation 3, where T_{γ,δ,α} denotes the corresponding execution time on dataset α.

Ratio_{γ,δ,α} = min_{i∈Γ, j∈∆} (T_{i,j,α}) / T_{γ,δ,α}    (3)
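Equations 2 and 3 can be evaluated directly from a table of execution times T[γ][δ][α]; the sketch below does exactly that for a toy timing array whose sizes and values are made up for illustration.

```c
/* Direct computation of Equations 2 and 3 from a table of execution times
 * T[gamma][delta][dataset]. The toy numbers below are made up. */
#include <stdio.h>

#define NG 2   /* |Gamma|: fleet counts tried */
#define ND 2   /* |Delta|: delta values tried */
#define NA 3   /* |DS|:    number of datasets */

static double ratio(double T[NG][ND][NA], int g, int d, int a) {
    double tmin = T[0][0][a];              /* Eq. 3: best time for dataset a */
    for (int i = 0; i < NG; i++)
        for (int j = 0; j < ND; j++)
            if (T[i][j][a] < tmin) tmin = T[i][j][a];
    return tmin / T[g][d][a];
}

static double harmonic(double T[NG][ND][NA], int g, int d) {
    double s = 0.0;                        /* Eq. 2: harmonic mean over DS   */
    for (int a = 0; a < NA; a++) s += 1.0 / ratio(T, g, d, a);
    return (double)NA / s;
}

int main(void) {
    double T[NG][ND][NA] = {
        { {1.0, 2.0, 4.0}, {1.2, 1.8, 3.5} },
        { {0.6, 1.1, 2.0}, {0.5, 1.0, 2.2} },
    };
    printf("Harmonic(2nd gamma, 2nd delta) = %.3f\n", harmonic(T, 1, 1));
    return 0;
}
```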

As Figure 7 shows, once γ is fixed, the performance difference across different settings of δ is small. This means the performance of SpMV using our approach is not very sensitive to the setting of δ. In our evaluation, we set δ to 256, with which BT-CSR achieves the best performance across all datasets.

6 RELATED WORK

Plenty of work has been published on SpMV optimization from diverse perspectives [6, 13, 15, 19, 28]. Many new SpMV formats, techniques, and auto-tuners have been proposed to fully exploit the underlying architectures.

CSR5 [18] can be applied to multiple platforms, with a delicate vectorization and tiling method based on segmented sum. BSR [5] introduces a blocking mechanism based on the classical CSR format. CVR [29] is a vectorization-oriented format aiming at better vectorization efficiency and memory locality. Liu et al. [20] introduce sorting and blocking based on ELLPACK and present a new format named ESB. Kourtis et al. [15] propose Compressed Sparse eXtended (CSX) to compress metadata by exploiting substructures within the matrix. Aydin et al. [6] introduce compressed sparse blocks (CSB), which can handle both Ax and A^T x efficiently. Yan et al. [30] present blocked compressed common coordinate (BCCOO), also known as yet another SpMV framework, which uses bit flags to compress the data and substantially reduce memory references. Ashari et al. [2] introduce a two-dimensional blocking mechanism and present a format named blocked row-column (BRC). Tang et al. [26] propose VHCC, which uses a 2D jagged partition mechanism for better locality and segmented sum for vectorization. Greathouse et al. [14] present CSR-adaptive, aiming at better load balance and memory reference efficiency. Merrill et al. [21] propose a merge-based parallel method, aiming at better SpMV performance on GPUs. Liu et al. [19] present a method for SpMV on CPU+GPU using speculative segmented sum. Buono et al. [7] optimize SpMV for scale-free matrices on POWER8 using a two-phase method.

As the formats show varying performance on sparse matrices with different sparsity patterns, there is also some work focusing on selecting the optimal format for an input matrix by analyzing its sparsity pattern. SMAT [16] extracts features from the input matrix and uses a decision tree to predict the optimal format. Sedaghati et al. [23] use features extracted from both the input matrix and the hardware platform for model training. Su et al. [25] present the clSpMV framework based on OpenCL and propose the Cocktail format, which combines multiple formats for automatic selection. Zhao et al. [32] apply deep learning to SpMV format selection by treating a sparse matrix as an image. Sedaghati et al. [24] propose a machine-learning-based decision model to automatically select the best format for a sparse matrix on GPU platforms.

To the best of our knowledge, this is the first work that proposes a dual-side multi-level partitioning mechanism for efficient SpMV implementation on the Sunway architecture. It leverages the unique features of the Sunway architecture at each level with a set of new techniques to efficiently map the computation of SpMV to the hardware resources.

7 CONCLUSIONS
This paper presents a novel SpMV scheme targeting the Sunway architecture. The novelty lies in the multiple levels of partitioning, designed according to the new Sunway architecture, on both the software and hardware sides, which enhances data locality and fully exploits the hardware parallelism. Our technique is generally applicable to any existing SpMV format and is able to efficiently map SpMV algorithms to the Sunway architecture. To demonstrate the effectiveness of our technique, we have applied it to one of the most popular SpMV formats, CSR, and developed its variant BT-CSR. We evaluate BT-CSR with 18 representative sparse matrices on the Sunway TaihuLight supercomputer. Although our approach is designed for Sunway, it is generally applicable to other emerging many-core architectures, especially those with a cache-less design. Experimental results show that BT-CSR can efficiently utilize Sunway's parallel resources with balanced workloads across cores. BT-CSR outperforms existing SpMV approaches running on the Sunway architecture, yielding speedups of up to 15.5× (12.3× on average) over the baseline CSR approach.

ACKNOWLEDGMENTS
The authors would like to thank all anonymous reviewers for their insightful comments and suggestions. This work is partially supported by the National Key R&D Program of China (Grant No. 2016YFB1000304, 2016YFA0602100, 2017YFA0604500 and 2016YFA0602200), the National Natural Science Foundation of China (Grant No. 61502019, 91530323, 41776010 and 61732002), and the National Science Foundation (NSF) under Grant No. 1618620. Hailong Yang is the corresponding author.



REFERENCES
[1] Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, Haohuan Fu, Fangfang Liu, Lin Gan, Ping Xu, and Wenjing Ma. 2017. 26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight. In 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017, Orlando, FL, USA, May 29 - June 2, 2017. 535–544. https://doi.org/10.1109/IPDPS.2017.9

[2] Arash Ashari, Naser Sedaghati, John Eisenlohr, and P. Sadayappan. 2014. An Efficient Two-dimensional Blocking Strategy for Sparse Matrix-vector Multiplication on GPUs. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS '14). ACM, New York, NY, USA, 273–282.

[3] Nathan Bell and Michael Garland. 2009. Implementing Sparse Matrix-vector Multiplication on Throughput-oriented Processors. In Proceedings of the ACM/IEEE Conference on High Performance Computing Networking, Storage and Analysis (SC '09). ACM, New York, NY, USA, Article 18, 11 pages.

[4] Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 18.

[5] Luc Buatois, Guillaume Caumon, and Bruno Levy. 2009. Concurrent number cruncher: a GPU implementation of a general sparse linear solver. International Journal of Parallel, Emergent and Distributed Systems 24, 3 (2009), 205–223.

[6] Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures. ACM, 233–244.

[7] Daniele Buono, Fabrizio Petrini, Fabio Checconi, Xing Liu, Xinyu Que, Chris Long, and Tai-Ching Tuan. 2016. Optimizing Sparse Matrix-Vector Multiplication for Large-Scale Data Analytics. In Proceedings of the 30th International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 37, 12 pages. https://doi.org/10.1145/2925426.2926278

[8] Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven Autotuning of Sparse Matrix-vector Multiply on GPUs. SIGPLAN Not. 45, 5 (Jan. 2010), 115–126. https://doi.org/10.1145/1837853.1693471

[9] Timothy A. Davis. 1997. The University of Florida sparse matrix collection. NA DIGEST (1997).

[10] J. Fang, H. Fu, W. Zhao, B. Chen, W. Zheng, and G. Yang. 2017. swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 615–624. https://doi.org/10.1109/IPDPS.2017.20

[11] Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang, and Xiaofei Chen. 2017. 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-meter Scenarios. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). ACM, New York, NY, USA, Article 2, 12 pages. https://doi.org/10.1145/3126908.3126910

[12] Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, Wei Zhao, Xunqiang Yin, Chaofeng Hou, Chenglong Zhang, Wei Ge, Jian Zhang, Yangang Wang, Chunbo Zhou, and Guangwen Yang. 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 7 (21 Jun 2016), 072001. https://doi.org/10.1007/s11432-016-5588-7

[13] Georgios Goumas, Kornilios Kourtis, Nikos Anastopoulos, Vasileios Karakasis, and Nectarios Koziris. 2009. Performance evaluation of the sparse matrix-vector multiplication on modern architectures. The Journal of Supercomputing 50, 1 (01 Oct 2009), 36–77. https://doi.org/10.1007/s11227-008-0251-8

[14] Joseph L. Greathouse and Mayank Daga. 2014. Efficient Sparse Matrix-vector Multiplication on GPUs Using the CSR Storage Format. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14). IEEE Press, Piscataway, NJ, USA, 769–780. https://doi.org/10.1109/SC.2014.68

[15] Kornilios Kourtis, Vasileios Karakasis, Georgios Goumas, and Nectarios Koziris. 2011. CSX: An Extended Compression Format for SpMV on Shared Memory Systems. SIGPLAN Not. 46, 8 (Feb. 2011), 247–256. https://doi.org/10.1145/2038037.1941587

[16] Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2013. SMAT: An Input Adaptive Auto-tuner for Sparse Matrix-vector Multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 117–126. https://doi.org/10.1145/2462156.2462181

[17] Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. In 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017, Orlando, FL, USA, May 29 - June 2, 2017. 635–645. https://doi.org/10.1109/IPDPS.2017.53

[18] Weifeng Liu and Brian Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 339–350. https://doi.org/10.1145/2751205.2751209

[19] Weifeng Liu and Brian Vinter. 2015. Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors. Parallel Comput. 49 (2015), 179–193. https://doi.org/10.1016/j.parco.2015.04.004

[20] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient Sparse Matrix-vector Multiplication on x86-based Many-core Processors. In Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13). ACM, New York, NY, USA, 273–282. https://doi.org/10.1145/2464996.2465013

[21] Duane Merrill and Michael Garland. 2016. Merge-based Parallel Sparse Matrix-vector Multiplication. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). IEEE, Piscataway, NJ, USA, Article 58, 12 pages. https://doi.org/10.1109/SC.2016.57

[22] Y. Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.

[23] Naser Sedaghati, Te Mu, Louis-Noel Pouchet, Srinivasan Parthasarathy, and P. Sadayappan. 2015. Automatic Selection of Sparse Matrix Representation on GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 99–108. https://doi.org/10.1145/2751205.2751244

[24] Naser Sedaghati, Te Mu, Louis-Noel Pouchet, Srinivasan Parthasarathy, and P. Sadayappan. 2015. Automatic Selection of Sparse Matrix Representation on GPUs. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 99–108. https://doi.org/10.1145/2751205.2751244

[25] Bor-Yiing Su and Kurt Keutzer. 2012. clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12). ACM, New York, NY, USA, 353–364. https://doi.org/10.1145/2304576.2304624

[26] Wai Teng Tang, Ruizhe Zhao, Mian Lu, Yun Liang, Huynh Phung Huynh, Xibai Li, and Rick Siow Mong Goh. 2015. Optimizing and Auto-tuning Scale-free Sparse Matrix-vector Multiplication on Intel Xeon Phi. In Proceedings of the 13th IEEE/ACM International Symposium on Code Generation and Optimization (CGO '15). IEEE Computer Society, Washington, DC, USA, 136–145. https://doi.org/10.1109/CGO.2015.7054194

[27] Xinliang Wang, Weifeng Liu, Wei Xue, and Li Wu. 2018. swSpTRSV: A Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 338–353. https://doi.org/10.1145/3178487.3178513

[28] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2007. Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms. In Proceedings of the 21st ACM/IEEE Conference on Supercomputing (SC '07). ACM, New York, NY, USA, Article 38, 12 pages. https://doi.org/10.1145/1362622.1362674

[29] Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, and Lixin Zhang. 2018. CVR: Efficient Vectorization of SpMV on x86 Processors. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO '18). ACM, New York, NY, USA, 149–162. https://doi.org/10.1145/3168818

[30] Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet Another SpMV Framework on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14). ACM, New York, NY, USA, 107–118. https://doi.org/10.1145/2555243.2555255

[31] Jian Zhang, Chunbao Zhou, Yangang Wang, Lili Ju, Qiang Du, Xuebin Chi, Dongsheng Xu, Dexun Chen, Yong Liu, and Zhao Liu. 2016. Extreme-scale phase field simulations of coarsening dynamics on the Sunway TaihuLight supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 4.

[32] Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. 2018. Bridging the Gap Between Deep Learning and Sparse Matrix Format Selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 94–108.