Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
NVIDIA Research
Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
Nathan Bell and Michael Garland
Overview
GPUs deliver high SpMV performance: 10+ GFLOP/s on unstructured matrices, 140+ GByte/s memory bandwidth
No one-size-fits-all approach: match the method to the matrix structure
Exploit structure when possible: fast methods for the regular portion, robust methods for the irregular portion
Characteristics of SpMV
Memory bound: the FLOP-to-byte ratio is very low
Generally irregular & unstructured, unlike dense matrix operations (BLAS)
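As a rough estimate (an assumption for illustration, not a figure from the slides): in double-precision CSR, each nonzero contributes one multiply-add (2 flops) but requires reading at least an 8-byte value and a 4-byte column index, before counting source-vector and row-pointer traffic:

$$\frac{\text{flops}}{\text{bytes}} \le \frac{2}{8 + 4} \approx 0.17$$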
Solving Sparse Linear Systems
Iterative methods (CG, GMRES, BiCGstab, etc.)
Require 100s or 1000s of SpMV operations
Finite-Element Methods
Discretized on structured or unstructured meshes, which determines the matrix sparsity structure
Objectives
Expose sufficient parallelism: develop 1000s of independent threads
Minimize execution path divergence: SIMD utilization
Minimize memory access divergence: memory coalescing
Sparse Matrix Formats
Structured ↔ Unstructured:
(DIA) Diagonal
(ELL) ELLPACK
(COO) Coordinate
(CSR) Compressed Sparse Row
(HYB) Hybrid
Compressed Sparse Row (CSR)
Rows laid out in sequence
Inconvenient for fine-grained parallelism
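For reference, a small CSR example (the Ap/Aj/Ax array names follow common convention and are assumed here rather than taken from the slide):

```cuda
// Example matrix (rows laid out in sequence):
//   [ 1 7 0 0 ]
//   [ 0 2 8 0 ]
//   [ 5 0 3 9 ]
//   [ 0 6 0 4 ]
const int    Ap[] = { 0, 2, 4, 7, 9 };              // row pointers: row i spans Ap[i]..Ap[i+1]
const int    Aj[] = { 0, 1, 1, 2, 0, 2, 3, 1, 3 };  // column index of each nonzero
const double Ax[] = { 1, 7, 2, 8, 5, 3, 9, 6, 4 };  // nonzero values, stored row by row
```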
CSR (scalar) kernel
One thread per row: poor memory coalescing, unaligned memory access
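A minimal sketch of this scalar kernel, assuming the conventional CSR arrays Ap (row pointers), Aj (column indices), Ax (values) and a grid-stride loop like the ELL kernel later in this deck; an illustration, not the authors' exact code:

```cuda
__global__ void csr_spmv_scalar(const int num_rows,
                                const int * Ap,    // row pointers (num_rows + 1 entries)
                                const int * Aj,    // column indices
                                const double * Ax, // nonzero values
                                const double * x,
                                double * y)
{
    const int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    const int grid_size = gridDim.x * blockDim.x;

    // one thread per row; neighboring threads read distant memory,
    // which is why this kernel coalesces poorly
    for (int row = thread_id; row < num_rows; row += grid_size)
    {
        double sum = 0.0;
        for (int jj = Ap[row]; jj < Ap[row + 1]; jj++)
            sum += Ax[jj] * x[Aj[jj]];
        y[row] = sum;
    }
}
```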
CSR (vector) kernel
One SIMD vector or warp per row: partial memory coalescing, unaligned memory access
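For comparison, a sketch of the warp-per-row (vector) variant. The original implementation reduced partial sums through shared memory; this version uses warp shuffles (available on newer GPUs), so treat it as an illustration rather than the authors' kernel:

```cuda
__global__ void csr_spmv_vector(const int num_rows,
                                const int * Ap,
                                const int * Aj,
                                const double * Ax,
                                const double * x,
                                double * y)
{
    const int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    const int warp_id   = thread_id / 32;   // one warp per row
    const int lane      = thread_id & 31;

    if (warp_id >= num_rows) return;

    const int row_start = Ap[warp_id];
    const int row_end   = Ap[warp_id + 1];

    // lanes stride across the row's nonzeros, so consecutive lanes touch
    // consecutive entries of Aj/Ax (partially coalesced)
    double sum = 0.0;
    for (int jj = row_start + lane; jj < row_end; jj += 32)
        sum += Ax[jj] * x[Aj[jj]];

    // reduce the 32 partial sums within the warp
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        y[warp_id] = sum;
}
```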
ELLPACK (ELL)
Storage for K nonzeros per row; rows with fewer than K nonzeros are padded
Inefficient when row length varies
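A small illustration of the padding; the -1 sentinel and column-major layout match the ELL kernel shown later in this deck, while the example matrix itself is made up here:

```cuda
// Example matrix with K = 3 nonzeros per padded row, stride = num_rows = 4:
//   [ 1 7 0 0 ]
//   [ 0 2 8 0 ]
//   [ 5 0 3 9 ]
//   [ 0 6 0 4 ]
// Entries are stored column-major (one "slot" of every row at a time);
// rows with fewer than K nonzeros are padded with col = -1, val = 0.
const int    Aj[] = { 0, 1, 0, 1,   1, 2, 2, 3,   -1, -1, 3, -1 };
const double Ax[] = { 1, 2, 5, 6,   7, 8, 3, 4,    0,  0, 9,  0 };
```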
Hybrid Format
ELL handles typical entries
COO handles exceptional entries; implemented with segmented reduction
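The COO portion is computed with a segmented reduction in the paper. As a simpler illustration of the one-thread-per-nonzero idea, here is a sketch that accumulates with atomics instead (the array names are assumptions, and atomicAdd on double requires a Pascal-or-newer GPU):

```cuda
__global__ void coo_spmv_atomic(const int num_nonzeros,
                                const int * row_indices,
                                const int * col_indices,
                                const double * values,
                                const double * x,
                                double * y)
{
    const int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    const int grid_size = gridDim.x * blockDim.x;

    // one thread per nonzero: insensitive to row length, but rows with
    // many entries serialize on the atomic update to y[row]
    for (int n = thread_id; n < num_nonzeros; n += grid_size)
        atomicAdd(&y[row_indices[n]], values[n] * x[col_indices[n]]);
}
```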
Exposing Parallelism
DIA, ELL & CSR (scalar): one thread per row
CSR (vector): one warp per row
COO: one thread per nonzero
(finer granularity, top to bottom)
Execution Divergence
Variable row lengths can be problematic: idle threads in CSR (scalar), idle processors in CSR (vector)
Robust strategies exist: COO is insensitive to row length
Memory Access Divergence
Uncoalesced memory access is very costly; sometimes mitigated by cache
Misaligned access is suboptimal; align the matrix format to the coalescing boundary
Access to the matrix representation:
DIA, ELL and COO are fully coalesced
CSR (vector) is partially coalesced
CSR (scalar) is seldom coalesced
Memory Bandwidth (AXPY)
Performance Results
GeForce GTX 285 (peak memory bandwidth: 159 GByte/s)
All results in double precision
Source vector accessed through the texture cache
Structured matrices: common stencils on regular grids
Unstructured matrices: wide variety of applications and sparsity patterns
Structured Matrices
Unstructured Matrices
Performance Comparison
| System  | Cores    | Clock (GHz) | Notes                    |
|---------|----------|-------------|--------------------------|
| GTX 285 | 240      | 1.5         | NVIDIA GeForce GTX 285   |
| Cell    | 8 (SPEs) | 3.2         | IBM QS20 Blade (half)    |
| Core i7 | 4        | 3.0         | Intel Core i7 (Nehalem)  |

Sources:
Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. N. Bell and M. Garland. Proc. Supercomputing '09, November 2009.
Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. S. Williams et al. Proc. Supercomputing 2007.
ELL kernel
```cuda
__global__ void ell_spmv(const int num_rows, const int num_cols,
                         const int num_cols_per_row, const int stride,
                         const int * Aj,    // column indices (padded entries hold -1)
                         const double * Ax, // nonzero values (padded entries hold 0)
                         const double * x, double * y)
{
    const int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    const int grid_size = gridDim.x * blockDim.x;

    for (int row = thread_id; row < num_rows; row += grid_size)
    {
        double sum = y[row];
        int offset = row;

        for (int n = 0; n < num_cols_per_row; n++)
        {
            const int col = Aj[offset];
            if (col != -1)
                sum += Ax[offset] * x[col];
            offset += stride;   // column-major layout: next slot for this row
        }

        y[row] = sum;
    }
}
```
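A possible host-side launch for the kernel above (block size and grid sizing are assumptions; the grid-stride loop tolerates any grid size):

```cuda
// launch enough threads to cover every row; the grid-stride loop in
// ell_spmv picks up any leftover rows if the grid is smaller
const int block_size = 256;
const int num_blocks = (num_rows + block_size - 1) / block_size;
ell_spmv<<<num_blocks, block_size>>>(num_rows, num_cols, num_cols_per_row,
                                     stride, Aj, Ax, x, y);
```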
```cpp
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>

int main(void)
{
    // create an empty sparse matrix structure (HYB format)
    cusp::hyb_matrix<int, double, cusp::device_memory> A;

    // load a matrix stored in MatrixMarket format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    // allocate storage for solution (x) and right hand side (b)
    cusp::array1d<double, cusp::device_memory> x(A.num_rows, 0);
    cusp::array1d<double, cusp::device_memory> b(A.num_rows, 1);

    // solve linear system with the Conjugate Gradient method
    cusp::krylov::cg(A, x, b);

    return 0;
}
```
http://cusp-library.googlecode.com
Extensions & Optimizations
Block formats (register blocking): block CSR, block ELL
Block vectors: solve multiple right-hand sides, block Krylov methods
Other optimizations: better CSR (vector)
Further Reading
Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. N. Bell and M. Garland. Proc. Supercomputing '09, November 2009.
Efficient Sparse Matrix-Vector Multiplication on CUDA. N. Bell and M. Garland. NVIDIA Tech Report NVR-2008-004, December 2008.
Optimizing Sparse Matrix-Vector Multiplication on GPUs. M. M. Baskaran and R. Bordawekar. IBM Research Report RC24704, IBM, April 2009.
Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs. J. W. Choi, A. Singh, and R. Vuduc. Proc. ACM SIGPLAN PPoPP, January 2010.