Optimizing the Performance of Sparse Matrix-Vector Multiplication
Eun-Jin Im, U.C. Berkeley
(Transcript of slides, 6/13/00)
Overview
- Motivation
- Optimization techniques: Register Blocking, Cache Blocking, Multiple Vectors
- Sparsity system
- Related Work
- Contribution
- Conclusion
Motivation : Usage
Sparse matrix-vector multiplication: y = Ax
Usage of this operation: iterative solvers, explicit methods, eigenvalue and singular value problems.
Applications in structural modeling, fluid dynamics, document retrieval (Latent Semantic Indexing), and many other simulation areas.
Motivation : Performance (1)
Matrix-vector multiplication (BLAS 2) is slower than matrix-matrix multiplication (BLAS 3). For example, on a 167 MHz UltraSPARC I:
- Vendor-optimized matrix-vector multiplication: 57 Mflops
- Vendor-optimized matrix-matrix multiplication: 185 Mflops
The reason: a lower ratio of floating point operations to memory operations.
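To make that ratio concrete (a rough count of our own, not from the slides): multiplying an n x n matrix by one vector performs 2n^2 flops while reading at least the n^2 matrix elements, whereas matrix-matrix multiplication performs 2n^3 flops over only about 3n^2 matrix entries, so its flop-to-memory ratio grows with n:

\[
\text{MV: } \frac{2n^2 \text{ flops}}{\approx n^2 \text{ loads}} = 2,
\qquad
\text{MM: } \frac{2n^3 \text{ flops}}{\approx 3n^2 \text{ loads}} = \frac{2n}{3}.
\]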
Motivation : Performance (2)
Sparse matrix operations are slower than dense matrix operations. For example, on a 167 MHz UltraSPARC I:
- Dense matrix-vector multiplication: naïve implementation 38 Mflops, vendor-optimized implementation 57 Mflops
- Sparse matrix-vector multiplication (naïve implementation): 5.7 - 25 Mflops
The reason: an indirect data structure, and thus inefficient memory accesses.
Motivation : Optimized libraries
Old approach: hand-optimized libraries (vendor-supplied BLAS, LAPACK)
New approach: automatic generation of libraries
- PHiPAC (dense linear algebra)
- ATLAS (dense linear algebra)
- FFTW (fast Fourier transform)
Our approach: automatic generation of libraries for sparse matrices. The additional dimension: the nonzero structure of sparse matrices.
Sparse Matrix Formats
There are a large number of sparse matrix formats.
- Point-entry: Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Sparse Diagonal (DIA), ...
- Block-entry: Block Coordinate (BCO), Block Sparse Row (BSR), Block Sparse Column (BSC), Block Diagonal (BDI), Variable Block Compressed Sparse Row (VBR), ...
Compressed Sparse Row Format
We internally use the CSR format because it is a relatively efficient format.
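For reference, a minimal sketch of the CSR multiply loop y = Ax (the array names row_ptr, col_idx, and val are illustrative, not the identifiers used by the Sparsity system):

    /* y = A*x with A in CSR: row_ptr[i]..row_ptr[i+1] delimits row i's
       nonzeros; col_idx[k] and val[k] give each nonzero's column and value. */
    void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                sum += val[k] * x[col_idx[k]];   /* indirect access to x */
            y[i] = sum;
        }
    }

The indirect access x[col_idx[k]] is exactly the inefficient memory access pattern cited above.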
Optimization Techniques
- Register Blocking
- Cache Blocking
- Multiple Vectors
Register Blocking
Blocked Compressed Sparse Row format. Advantages of the format:
- Better temporal locality in registers
- The multiplication loop can be unrolled for better performance
[Figure: a 4 x 6 example matrix stored in 2 x 2 Blocked CSR.
Block row pointers: 0 2 4; block column indices: 0 4 2 4.
Stored blocks (explicit zeros fill out each block):
  rows 0-1, cols 0-1: [A00 A01; A10 A11]
  rows 0-1, cols 4-5: [A04 0; 0 A15]
  rows 2-3, cols 2-3: [A22 0; A32 A33]
  rows 2-3, cols 4-5: [A25 0; A34 A35]]
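A sketch of the 2 x 2 blocked multiply with the block loop unrolled (our illustration of the technique; the names and layout are assumptions, not Sparsity's code):

    /* y = A*x with A in 2x2 BCSR: brow_ptr/bcol_idx index 2x2 blocks;
       val stores each block's 4 entries contiguously in row-major order. */
    void spmv_bcsr_2x2(int nbrows, const int *brow_ptr, const int *bcol_idx,
                       const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nbrows; i++) {
            double y0 = 0.0, y1 = 0.0;            /* stay in registers */
            for (int k = brow_ptr[i]; k < brow_ptr[i+1]; k++) {
                const double *b = val + 4*k;      /* the 2x2 block */
                double x0 = x[2*bcol_idx[k]], x1 = x[2*bcol_idx[k]+1];
                y0 += b[0]*x0 + b[1]*x1;          /* unrolled block multiply */
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*i]   = y0;
            y[2*i+1] = y1;
        }
    }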
Register Blocking : Fill Overhead
We use a uniform block size, adding fill overhead: blocking stores explicit zeros, so fill overhead = (stored entries) / (true nonzeros). In the slide's example, 7 nonzeros become 12 stored entries, giving fill overhead = 12/7 = 1.71. This increases both the space and the number of floating point operations.
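A sketch of how the fill overhead of r x c blocking can be computed exactly for a CSR matrix (an assumed helper for clarity; Sparsity estimates this quantity rather than computing it exactly, as slide 19 notes):

    #include <stdlib.h>

    /* Exact fill overhead of r x c blocking: every block containing at
       least one nonzero is stored in full, so
       overhead = (touched blocks * r * c) / nnz. */
    double fill_overhead(int nrows, int ncols, int r, int c,
                         const int *row_ptr, const int *col_idx)
    {
        int nbcols = (ncols + c - 1) / c;
        long nnz = row_ptr[nrows], stored = 0;
        char *touched = calloc(nbcols, 1);  /* block columns hit in this block row */
        for (int bi = 0; bi * r < nrows; bi++) {
            int rend = (bi + 1) * r < nrows ? (bi + 1) * r : nrows;
            for (int i = bi * r; i < rend; i++)
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                    touched[col_idx[k] / c] = 1;
            for (int b = 0; b < nbcols; b++)
                if (touched[b]) { stored += (long)r * c; touched[b] = 0; }
        }
        free(touched);
        return (double)stored / (double)nnz;
    }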
Register Blocking : Dense Matrix Profile
[Figure: dense matrix profile on an UltraSPARC I, the input to the performance model.]
Register Blocking : Selecting the Block Size
The hard part of the problem is picking the block size so that it minimizes the fill overhead and maximizes the raw performance.
Two approaches: exhaustive search, or using a model.
Register Blocking : Performance Model
Two components to the performance model:
- Multiplication performance of a dense matrix represented in sparse format
- Estimated fill overhead
Predicted performance for block size r x c:

  predicted performance = (dense r x c blocked performance) / (fill overhead)
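A sketch of the model-based selection this implies: evaluate the predicted performance for each candidate block size and keep the best. The 8 x 8 search bound and the names are our assumptions:

    #define RMAX 8
    #define CMAX 8

    /* dense_perf[r-1][c-1]: measured Mflops of a dense matrix stored in
       r x c blocked sparse format (the machine profile, gathered once per
       machine); estimate_fill: per-matrix fill-overhead estimator. */
    void choose_block_size(const double dense_perf[RMAX][CMAX],
                           double (*estimate_fill)(int r, int c),
                           int *best_r, int *best_c)
    {
        double best = 0.0;
        for (int r = 1; r <= RMAX; r++)
            for (int c = 1; c <= CMAX; c++) {
                double predicted = dense_perf[r-1][c-1] / estimate_fill(r, c);
                if (predicted > best) {
                    best = predicted;
                    *best_r = r;
                    *best_c = c;
                }
            }
    }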
Benchmark Matrices
- Matrix 1: dense matrix (1000 x 1000)
- Matrices 2-17: Finite Element Method matrices
- Matrices 18-39: matrices from structural engineering and device simulation
- Matrices 40-44: linear programming matrices
- Matrix 45: document retrieval matrix used for Latent Semantic Indexing
- Matrix 46: random matrix (10000 x 10000, 0.15% nonzero)
Register Blocking : Performance
The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).
Register Blocking : Performance
Speedup is generally best on the MIPS R10000, where the result is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38).
Register Blocking : Validation of the Performance Model
Comparison against exhaustive search (yellow bars; block sizes in the lower row) on a subset of the benchmark matrices. The exhaustive search does not produce much better results.
Register Blocking : Overhead
Pre-computation overhead: estimating the fill overhead (red bars) and reorganizing the matrix (yellow bars). The ratio gives the number of repetitions of the multiplication needed for the optimization to be beneficial.
Cache Blocking
Improves temporal locality of access to the source vector.
[Figure: the matrix, the source vector x, and the destination vector y as laid out in memory, with the matrix divided into cache-sized blocks.]
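A sketch of the idea, assuming the matrix has been pre-split into rectangular sub-blocks, each stored in CSR with block-local indices (the struct and names are illustrative, not Sparsity's):

    /* Each sub-block multiply touches only a cache-sized slice of x and y. */
    typedef struct {
        int row0, col0;          /* top-left corner of the block within A */
        int nrows;               /* rows in this block */
        int *row_ptr, *col_idx;  /* CSR arrays with block-local indices */
        double *val;
    } CacheBlock;

    /* y must be zeroed by the caller before accumulation. */
    void spmv_cache_blocked(int nblocks, const CacheBlock *blk,
                            const double *x, double *y)
    {
        for (int b = 0; b < nblocks; b++) {
            const CacheBlock *B = &blk[b];
            const double *xb = x + B->col0;   /* slice of x this block reuses */
            double *yb = y + B->row0;
            for (int i = 0; i < B->nrows; i++)
                for (int k = B->row_ptr[i]; k < B->row_ptr[i+1]; k++)
                    yb[i] += B->val[k] * xb[B->col_idx[k]];
        }
    }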
Cache Blocking : Performance
Speedup is generally better on MIPS: it has a larger cache and a larger miss penalty (26/589 ns for MIPS vs. 36/268 ns for the UltraSPARC). The exceptions are the document retrieval and random matrices.
Cache Blocking : Performance on the Document Retrieval Matrix
Document retrieval matrix: 10K x 256K with 37M nonzeros; SVD is applied to it for LSI (Latent Semantic Indexing). The nonzero elements are spread across the matrix, with no dense clusters. Performance peaks at a 16K x 16K cache block, with a speedup of 3.1.
Cache Blocking : When and How to Use Cache Blocking
From the experiments, the matrices for which cache blocking is most effective are large and "random". We developed a measure of the "randomness" of a matrix, and we perform a coarse-grained search to decide the cache block size, as sketched below.
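A sketch of such a coarse-grained search (the power-of-four grid and the time_spmv benchmark helper are our assumptions, not the slides'):

    /* Try a coarse grid of cache block sizes, timing the blocked
       multiply at each, and keep the fastest combination. */
    void choose_cache_block(int nrows, int ncols,
                            double (*time_spmv)(int br, int bc), /* assumed helper:
                               builds the blocked matrix and times y = A*x */
                            int *best_r, int *best_c)
    {
        double best = 1e300;
        *best_r = nrows;                      /* default: no blocking */
        *best_c = ncols;
        for (int br = 1024; br <= nrows; br *= 4)
            for (int bc = 1024; bc <= ncols; bc *= 4) {
                double t = time_spmv(br, bc);
                if (t < best) { best = t; *best_r = br; *best_c = bc; }
            }
    }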
Combination of Register and Cache Blocking : UltraSPARC
The combination is rarely beneficial, and often slower than either of the two optimizations alone.
Combination of Register and Cache Blocking : MIPS
Multiple Vector Multiplication
Better chance of optimization: BLAS 2 vs. BLAS 3.
[Figure: repetition of the single-vector case vs. the multiple-vector case.]
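A sketch of that contrast: in the loop below each matrix element is loaded once and used against nv vectors, which is what moves the operation toward BLAS 3 behavior (the column-major layout and names are our assumptions):

    /* Y = A*X for nv vectors at once; X is ncols x nv and Y is
       nrows x nv, both column-major. Y must be zeroed by the caller. */
    void spmv_csr_multivec(int nrows, int ncols, int nv,
                           const int *row_ptr, const int *col_idx,
                           const double *val, const double *X, double *Y)
    {
        for (int i = 0; i < nrows; i++)
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++) {
                double a = val[k];            /* loaded once...           */
                int j = col_idx[k];
                for (int v = 0; v < nv; v++)  /* ...reused for nv vectors */
                    Y[v*nrows + i] += a * X[v*ncols + j];
            }
    }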
Multiple Vector Multiplication : Performance
[Figures: register blocking performance and cache blocking performance.]
Multiple Vector Multiplication : Register Blocking Performance
The speedup is larger than for single-vector register blocking. Even the matrices that previously did not speed up improved (the middle group on the UltraSPARC).
Multiple Vector Multiplication : Cache Blocking Performance
There is noticeable speedup for the matrices that previously did not speed up (UltraSPARC). The block sizes are much smaller than those of single-vector cache blocking.
[Figures: UltraSPARC and MIPS results.]
Sparsity System : Purpose
- Guide the choice of optimization
- Automatically select optimization parameters such as block size and number of vectors
http://comix.cs.berkeley.edu/~ejim/sparsity
Sparsity System : Organization
[Diagram: the Sparsity machine profiler produces a machine performance profile; the Sparsity optimizer takes that profile, an example matrix, and the maximum number of vectors, and emits optimized code and drivers.]
Summary : Speedup of Sparsity on UltraSPARC
On the UltraSPARC, speedup is up to 3x for a single vector and 4.7x for multiple vectors.
[Figures: single-vector and multiple-vector results.]
Summary : Speedup of Sparsity on MIPS
On MIPS, speedup is up to 3x for a single vector and 6x for multiple vectors.
[Figures: single-vector and multiple-vector results.]
Summary : Overhead of Sparsity Optimization
The optimization pays off after a number of iterations equal to:

  number of iterations = overhead time / time saved per iteration

For example, if reorganizing the matrix costs 10 ms and each optimized multiplication saves 0.5 ms, the optimization is beneficial after 20 iterations (illustrative numbers). The BLAS Technical Forum includes a parameter in the matrix creation routine to indicate how many times the operation will be performed.
Related Work (1)
- Dense matrix optimization: loop transformations by compilers (M. Wolf, etc.); hand-optimized libraries (BLAS, LAPACK)
- Automatic generation of libraries: PHiPAC, ATLAS, and FFTW
- Sparse matrix standardization and libraries: BLAS Technical Forum; NIST Sparse BLAS, MV++, SparseLib++, TNT
- Hand optimization of sparse matrix-vector multiplication: S. Toledo, Oliker et al.
Related Work (2)
- Sparse matrix packages: SPARSKIT, PSPARSELIB, Aztec, BlockSolve95, Spark98
- Compiling sparse matrix code: sparse compiler (Bik), Bernoulli compiler (Kotlyar)
- On-demand code generation: NIST Sparse BLAS, sparse compiler
Contribution
- Thorough investigation of memory hierarchy optimization for sparse matrix-vector multiplication
- Performance study on benchmark matrices
- Development of a performance model to choose optimization parameters
- The Sparsity system for automatic tuning and code generation of sparse matrix-vector multiplication
Conclusion
Memory hierarchy optimization for sparse matrix-vector multiplication:
- Register blocking benefits matrices with dense local structure.
- Cache blocking benefits large matrices with random structure.
- Multiple-vector multiplication improves performance further because matrix elements are reused.
The choice of optimization depends on both the matrix structure and the machine architecture, and the automated system helps with this complicated and time-consuming process.