How to Avoid Global Synchronization by Domino Scheme
Lung-Sheng Chien, [email protected]
Outline
• What should be done in a sparse direct solver
- what we have now on the CPU
- what is low-hanging fruit on the GPU
• Domino scheme
• Conclusions
We have: Sparse direct solver
• Work done by Professor Timothy A. Davis
- UMFPACK: unsymmetric multifrontal sparse LU factorization package
- KLU: sparse LU factorization for circuit simulation
- CHOLMOD: supernodal sparse Cholesky factorization and update/downdate; it now supports the GPU
- SPQR: sparse QR factorization
Sencer Nuri Yeralan, Timothy A. Davis, Sparse QR Factorization on GPU Architectures, SIAM CSE 2013
S4204 - Multifrontal Sparse QR Factorization on the GPU, GTC2014
Low-hanging fruit
• Intercept each BLAS call and redirect it to the GPU by
- redesigning the code and porting it to cuBLAS, or
- using the drop-in library NVBLAS
• Pros: easy, not error-prone
• Cons: PCI bandwidth limits the performance: 12 GB/s for PCI gen 3 versus 280 GB/s for the GTX TITAN
Sencer Nuri Yeralan, Timothy A. Davis, Sparse QR Factorization on GPU Architectures, SIAM CSE 2013
Step back and rethink Cholesky
• Cholesky factorization ($A = LL^T$) does not need pivoting
• Symbolic analysis can easily find the zero fill-in and determine the sparsity pattern of the Cholesky factor $L$
• Re-form the original matrix $A$ with the sparsity pattern of the Cholesky factor $L$
• Perform Cholesky factorization (on the re-formed matrix this is the same as incomplete Cholesky factorization)
[Figure: $A = LL^T$; $A$ is re-formed as a matrix $B$ with the sparsity pattern of $L$, the fill-in positions being stored as explicit zeros, and ic0 is applied to $B$]
ic0 = incomplete Cholesky factorization without zero fill-in
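The claim that ic0 on the re-formed matrix reproduces the full Cholesky factor can be checked on a small case. The sketch below is an illustration only, not the talk's implementation: it uses dense storage, pure Python, and a hypothetical 4x4 arrow matrix; the function names are made up for this example.

```python
import math

def cholesky_dense(A):
    """Plain dense Cholesky; returns lower-triangular L with A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L

def original_pattern(A):
    """Column indices of the nonzeros in the lower triangle of A, per row."""
    return [{j for j in range(i + 1) if A[i][j] != 0.0} for i in range(len(A))]

def symbolic_pattern(A):
    """Pattern of L: the nonzeros of A plus the zero fill-in."""
    n = len(A)
    pat = original_pattern(A)
    for k in range(n):
        below = [i for i in range(k + 1, n) if k in pat[i]]
        # Eliminating column k couples every pair of rows below it.
        for i in below:
            for j in below:
                if j <= i:
                    pat[i].add(j)
    return pat

def ic0(B, pat):
    """Incomplete Cholesky restricted to `pat`; no new fill-in is created."""
    n = len(B)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in sorted(pat[i]):
            s = B[i][j] - sum(L[i][k] * L[j][k]
                              for k in pat[i] & pat[j] if k < j)
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

# Hypothetical SPD "arrow" matrix: the dense first column causes fill-in.
A = [[4.0, 1.0, 1.0, 1.0],
     [1.0, 3.0, 0.0, 0.0],
     [1.0, 0.0, 3.0, 0.0],
     [1.0, 0.0, 0.0, 3.0]]

L_full = cholesky_dense(A)
pat_L = symbolic_pattern(A)            # 10 entries: 7 from A plus 3 fill-ins
L_filled = ic0(A, pat_L)               # ic0 on the pattern of L
L_plain = ic0(A, original_pattern(A))  # ic0 on the pattern of A only
```

With the filled pattern, `L_filled` agrees with `L_full` up to rounding; with the original pattern, `L_plain` drops the fill-in entries and is only an approximation.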
Benchmark: spd matrices
• From the Florida Matrix Collection, except the Laplacian operators
• The Laplacian operator is the standard finite-difference 3-pt, 5-pt and 7-pt stencil with Dirichlet boundary condition
• nnzA is the number of nonzeros of A
• nnzL is the number of nonzeros of L after symamd
• levels is the number of levels in incomplete Cholesky factorization
matrix name                    m     nnzA      nnzL  nnzA/m  nnzL/m   levels
1138_bus.mtx                1138     2596      3264     2.3     2.9       40
aft01.mtx                   8205    66886    304194     8.2    37.1      699
bcsstk13.mtx                2003    42943    289870    21.4   144.7      673
bcsstk34.mtx                 588    11003     55120    18.7    93.7      364
bodyy4.mtx                 17546    69742    584769     4.0    33.3      708
ex9.mtx                     3363    51417    116942    15.3    34.8      662
LF10000.mtx                19998    59990     69988     3.0     3.5    19997
minsurfo.mtx               40806   122214   1038156     3.0    25.4     1177
nos6.mtx                     675     1965      6200     2.9     9.2      101
s1rmt3m1.mtx                5489   112505    543103    20.5    98.9     1133
mhd4800b.mtx                4800    16160     16160     3.4     3.4     1198
Chem97ZtZ.mtx               2541     4951      4951     1.9     1.9        3
bcsstk26.mtx                1922    16129     42243     8.4    22.0      309
plat1919.mtx                1919    17159     66803     8.9    34.8      459
nasa1824.mtx                1824    20516     69926    11.2    38.3      299
ex33.mtx                    1733    11961     33167     6.9    19.1      200
Muu.mtx                     7102    88618    198606    12.5    28.0      323
sts4098.mtx                 4098    38227    144225     9.3    35.2      371
bodyy5.mtx                 18589    73935    627327     4.0    33.7      766
gyro_m.mtx                 17361   178896    413709    10.3    23.8      489
Pres_Poisson.mtx           14822   365313   2447857    24.6   165.2     2758
Kuu.mtx                     7102   173651    385828    24.5    54.3      612
shallow_water1.mtx         81920   204800   2448970     2.5    29.9     1428
finan512.mtx               74752   335872   2028289     4.5    27.1     1681
cant.mtx                   62451  2034917  27456636    32.6   439.7    18072
lap1D_3pt_n1000000.mtx   1000000  1999999   1999999     2.0     2.0  1000000
lap2D_5pt_n100.mtx         10000    29800     25116     3.0     2.5       36
lap3D_7pt_n20.mtx           8000    30800     63281     3.9     7.9      141
http://www.cise.ufl.edu/research/sparse/matrices/
[Figure: sparsity patterns for lap3D_7pt_n20: original A, symamd(A), and L = chol(symamd(A))]
Domino-Cholesky versus CHOLMOD simplicial
• Domino-Cholesky is 1.33x faster than CHOLMOD simplicial on average
• Domino-Cholesky loses when
1) the zero fill-in is uneven (load imbalance),
2) there are too many levels (execution is nearly sequential), or
3) the whole matrix fits into the CPU cache
GPU: K40
CPU: i7-3930K CPU @ 3.20GHz
SuiteSparse-3.6.0/CHOLMOD/Demo/cholmod_l_demo.c
Domino-Cholesky versus CHOLMOD supernodal
• Domino-Cholesky is 4.7x faster than CHOLMOD supernodal on average
• Domino-Cholesky wins if there are only a few big supernodes or the factorization is not sequential
GPU: K40
CPU: i7-3930K CPU @ 3.20GHz
SuiteSparse-3.6.0/CHOLMOD/Demo/cholmod_l_demo.c
Outline
• What should be done in a sparse direct solver
• Domino scheme
- review the scheduler of triangular solve
- how to trace the DAG with a single kernel
• Conclusions
Triangular solve: global sync
• The cuSPARSE library performs an analysis phase, then a solve phase
• Maxim Naumov shows how to extract parallelism and how to maintain utilization in his GTC2012 talk, "On the Parallel Solution of Sparse Triangular Linear System"
• A global barrier separates each pair of consecutive levels
• Each level runs in parallel, using atomic operations
Maxim Naumov, On the Parallel Solution of Sparse Triangular Linear System, GTC2012
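The analysis phase of level scheduling can be sketched as follows. This is a minimal illustration, not the cuSPARSE implementation: it assumes a lower-triangular matrix in CSR form, and the function names and the 5x5 pattern are hypothetical.

```python
def compute_levels(row_ptr, col_idx):
    """Level of each row of a lower-triangular CSR matrix.

    level[i] = 1 + max(level[j]) over the off-diagonal entries (i, j), j < i.
    Rows without off-diagonal entries are at level 1.
    """
    n = len(row_ptr) - 1
    level = [1] * n
    for i in range(n):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level

def group_by_level(level):
    """Rows of the same level can be solved in parallel."""
    buckets = {}
    for row, lv in enumerate(level):
        buckets.setdefault(lv, []).append(row)
    return [buckets[lv] for lv in sorted(buckets)]

# Hypothetical 5x5 pattern: row 2 needs row 0; row 3 needs rows 1 and 2.
row_ptr = [0, 1, 2, 4, 7, 8]
col_idx = [0, 1, 0, 2, 1, 2, 3, 4]
levels = compute_levels(row_ptr, col_idx)   # [1, 1, 2, 3, 1]
schedule = group_by_level(levels)           # [[0, 1, 4], [2], [3]]

# A bidiagonal factor is a pure chain, so it has as many levels as rows.
n = 6
chain_ptr = [0] + [2 * i + 1 for i in range(n)]
chain_idx = [0] + [c for i in range(1, n) for c in (i - 1, i)]
chain_levels = compute_levels(chain_ptr, chain_idx)
```

The chain case matches the lap1D_3pt_n1000000 row of the benchmark table, where levels equals m: a level scheduler finds no parallelism at all there.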
Drawback of global synchronization
Critical path = max{level 1} + max{level 2} + max{level 3} = row 2 + row 4 + row 9
Avoid global synchronization
Critical path = row 1 + row 4 + row 9
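The two critical-path formulas can be compared on a toy DAG. The row costs and the dependence graph below are hypothetical, chosen so that row 2 is the slowest row of level 1 while the longest dependence chain is row 1, row 4, row 9, as on the slides; unlimited parallel resources are assumed.

```python
def levels_of(parents):
    """Level of each row; parents always have smaller row ids."""
    level = {}
    for row in sorted(parents):
        level[row] = 1 + max((level[p] for p in parents[row]), default=0)
    return level

def time_with_barriers(cost, parents):
    """Global sync: every level lasts as long as its slowest row."""
    worst = {}
    for row, lv in levels_of(parents).items():
        worst[lv] = max(worst.get(lv, 0), cost[row])
    return sum(worst.values())

def time_without_barriers(cost, parents):
    """No global sync: a row starts as soon as its own parents finish."""
    finish = {}
    for row in sorted(parents):
        start = max((finish[p] for p in parents[row]), default=0)
        finish[row] = start + cost[row]
    return max(finish.values())

# Hypothetical costs (arbitrary time units) and dependencies.
cost = {1: 2, 2: 5, 3: 1, 4: 4, 5: 1, 6: 2, 7: 1, 8: 1, 9: 3}
parents = {1: [], 2: [], 3: [], 4: [1], 5: [2], 6: [3],
           7: [5], 8: [6], 9: [4]}

t_sync = time_with_barriers(cost, parents)       # row 2 + row 4 + row 9 = 12
t_domino = time_without_barriers(cost, parents)  # row 1 + row 4 + row 9 = 9
```

The gain comes entirely from row 4 no longer waiting for row 2, which it does not depend on.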
Domino scheme
• Motivation
Jonathan Hogg proposed a new trsv (dense triangular solve) in 2012
• Applications
- csrsv (sparse triangular solve)
- csric0 (sparse incomplete Cholesky factorization)
- csrilu0 (sparse incomplete LU)
- sparse Cholesky
- sparse QR by Householder reflection or Givens rotation
• Idea
Each row is a domino. A domino waits until its parents are done, and triggers the next dominos when it is done
• Requirements
- to avoid deadlock, the parents of a domino must either be done or be running (a logical CTA id is required)
- the whole matrix must fit into GPU memory
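The idea and the deadlock requirement can be mimicked on the CPU. The sketch below is an analogy under stated assumptions, not the GPU kernel: Python threads stand in for CTAs, a `threading.Event` per row stands in for the domino's flag, and a shared ticket counter stands in for the logical CTA id. The function name and the test matrix are hypothetical.

```python
import threading

def domino_lower_solve(row_ptr, col_idx, val, b, num_workers=4):
    """Solve L x = b for lower-triangular L in CSR form.

    Assumes the diagonal is stored as the last entry of each row.
    """
    n = len(b)
    x = [0.0] * n
    done = [threading.Event() for _ in range(n)]  # one flag per domino
    ticket = [0]
    ticket_lock = threading.Lock()

    def worker():
        while True:
            with ticket_lock:         # logical id: rows are handed out in order
                i = ticket[0]
                ticket[0] += 1
            if i >= n:
                return
            s = b[i]
            diag = row_ptr[i + 1] - 1
            for p in range(row_ptr[i], diag):
                j = col_idx[p]
                done[j].wait()        # wait until parent j is done
                s -= val[p] * x[j]
            x[i] = s / val[diag]
            done[i].set()             # trigger the children

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

# Hypothetical 5x5 lower-triangular system.
row_ptr = [0, 1, 2, 4, 7, 8]
col_idx = [0, 1, 0, 2, 1, 2, 3, 4]
val = [2.0, 4.0, 1.0, 2.0, 1.0, 1.0, 4.0, 5.0]
b = [2.0, 4.0, 5.0, 11.0, 10.0]
x = domino_lower_solve(row_ptr, col_idx, val, b)   # [1.0, 1.0, 2.0, 2.0, 2.0]
```

Because rows are handed out in increasing order and every parent has a smaller row index, a waiting row's parents have always been picked up already, so the lowest unfinished row can always make progress. Each row also accumulates its sum in fixed CSR order, so the result does not depend on thread timing.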
Jonathan’s trsv
[Figure: timeline of CTA 0 to CTA 3, each performing matrix-vector multiplies (mv) followed by a diagonal solve (S)]
• The DAG (implicit) is described by two constraints
- The work associated with a given off-diagonal block cannot begin until the solve for the diagonal block in its column has completed
- The diagonal solve in a given row cannot commence until all matrix-vector multiplies for blocks in that row have completed
• One semaphore (lock) is used to schedule the DAG
- the semaphore is an integer indicating which row is done
• One semaphore is used to query the logical CTA id to avoid deadlock
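The two constraints can be seen in a blocked forward substitution. The sketch below is a sequential stand-in, not Hogg's CUDA kernel: the integer `rows_done` plays the role of the scheduling semaphore, and the block size, function name and test matrix are assumptions made for illustration.

```python
def blocked_trsv(L, b, bs):
    """Solve L x = b for a dense lower-triangular L using bs x bs blocks."""
    n = len(b)
    x = list(b)
    nblk = (n + bs - 1) // bs
    rows_done = 0  # number of block rows whose diagonal solve has completed
    for I in range(nblk):
        r0, r1 = I * bs, min((I + 1) * bs, n)
        # "mv": off-diagonal block (I, J) needs the diagonal solve of block J.
        for J in range(I):
            assert J < rows_done
            c0, c1 = J * bs, (J + 1) * bs
            for i in range(r0, r1):
                for j in range(c0, c1):
                    x[i] -= L[i][j] * x[j]
        # "S": the diagonal solve runs after all mv's of this block row.
        for i in range(r0, r1):
            for j in range(r0, i):
                x[i] -= L[i][j] * x[j]
            x[i] /= L[i][i]
        rows_done = I + 1  # publish: block row I is done
    return x

# Hypothetical 5x5 system whose exact solution is all ones.
n = 5
L = [[2.0 if i == j else (1.0 if j < i else 0.0) for j in range(n)]
     for i in range(n)]
b = [2.0 + i for i in range(n)]
x = blocked_trsv(L, b, bs=2)   # [1.0, 1.0, 1.0, 1.0, 1.0]
```

In a parallel version each CTA would own one block row and wait on `rows_done` before each mv instead of asserting; here the loop order alone guarantees the constraint.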
Extend to sparse linear algebra
D1, D2 and D3 run in parallel and other rows are waiting
Once D1 is done, it will trigger its children, D4 and D5. Other rows are still waiting
Each domino needs a semaphore (lock)
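One way to picture the per-domino semaphore is a counter of unfinished parents. The sketch below is an assumption for illustration (the slides do not say how the semaphore is encoded); the six-row graph is hypothetical and follows the D1 to D5 example, with an extra row D6 that has two parents.

```python
def make_tracker(parents):
    """Return the initially runnable rows and a `finish` callback."""
    children = {r: [] for r in parents}
    pending = {r: len(ps) for r, ps in parents.items()}
    for r, ps in parents.items():
        for p in ps:
            children[p].append(r)

    def finish(row):
        """Mark `row` done; return the children that become runnable."""
        ready = []
        for c in children[row]:
            pending[c] -= 1       # one fewer parent to wait for
            if pending[c] == 0:
                ready.append(c)
        return ready

    roots = sorted(r for r, k in pending.items() if k == 0)
    return roots, finish

# D1, D2, D3 have no parents; D4 and D5 wait for D1; D6 waits for D2 and D4.
parents = {1: [], 2: [], 3: [], 4: [1], 5: [1], 6: [2, 4]}
roots, finish = make_tracker(parents)   # roots == [1, 2, 3]
after_d1 = finish(1)                    # [4, 5]
after_d2 = finish(2)                    # [] since D6 still waits for D4
after_d4 = finish(4)                    # [6]
```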
Pros and cons of Domino scheme
• Pros
- hides the global barrier inside the kernel
- overlaps the write phase and the read phase of different dominos
- can run without levels
- reproducibility (determinism)
- quick analysis phase for csrsv/csric0/csrilu0
• Cons
- load imbalance
- long latency of the L2 cache
- cannot adjust computational resources during execution
- low utilization due to insufficient parallelism
Conclusions
• The domino scheme can trace a dependence graph; it works for dense and sparse linear algebra
• Global synchronization is painful; there are several workarounds
- system level (across multiple nodes): overlap MPI transfer and computation
- system level (single node): overlap PCI transfer and computation
  S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMOD
- GPU level: domino scheme
• The domino scheme still suffers from
- long latency of the L2 cache (hardware issue)
- not enough parallelism (a characteristic of the matrix)
- load imbalance (big trouble)
• The penalty of domino is significant on csrsv
• Batched domino is much faster (9x) because
- there is more work to do at each level
- the penalty of domino becomes small
• New domino-based functions in r6.0
- csrsv2
- csric02
- csrilu02
- bsrsv2
- bsric02
- bsrilu02
Feature                                      domino   non-domino
Reproducibility                              yes      no
Quick analysis phase                         yes      no
Memory efficient                             yes      no
Report structural zero or numerical zero     yes      no
Buffer provided by users                     yes      no
Disable level scheduling                     yes      no
Sparse direct solver: QR
• Symbolic analysis finds the zero fill-in and the dependence graph; left-looking Householder QR is then done by the domino scheme
• QR has 9x more zero fill-in than Cholesky and is 14x slower than Cholesky
GPU: K40
Future work for sparse direct solver
• Sparse QR, including Householder and Givens rotation
- symbolic analysis on the GPU (now on the CPU)
- improve numerical factorization
• Sparse LU with partial pivoting
Thank you !
• [1] UMFPACK, CHOLMOD, KLU, http://www.cise.ufl.edu/research/sparse/
• [2] Maxim Naumov, On the Parallel Solution of Sparse Triangular Linear System, GTC2012
• [3] J. D. Hogg, A Fast Dense Triangular Solve in CUDA, SIAM Journal on Scientific Computing, 2013, Vol. 35, No. 3, pp. C303-C322
• [4] Maxim Naumov, What Does It Take to Accelerate SPICE on the GPU?, GTC2013
Related talks GTC2014
• S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMOD, Wednesday, 03/26, 14:00 - 14:50, Room LL20D
• S4243 - A GPU Sparse Direct Solver for AX=B, Wednesday, 03/26, 09:00 - 09:25, Room LL20D
• S4524 - Sparse LU Factorization on GPUs for Accelerating SPICE Simulation, Wednesday, 03/26, 09:30 - 09:55, Room LL20D
Domino-csrsv versus MKL
• Domino-csrsv is 4.2x faster than MKL on average
GPU: K40
CPU: i7-3930K CPU @ 3.20GHz
Benchmark: unsymmetric matrices
• From the Florida Matrix Collection, except dense2.mtx
• dense2.mtx is a 2000x2000 dense matrix
• nnz is the number of nonzeros of A
• levels is the number of levels in incomplete LU factorization
• No zero fill-in
matrix name                 m       nnz   nnz/m  levels
cage14.mtx            1505785  27130349    18.0     100
shermanACb.mtx          18510    145149     7.8      34
Chebyshev4.mtx          68121   5377761    78.9   17424
dense2.mtx               2000   4000000  2000.0    2000
webbase-1M.mtx        1000005   3105536     3.1     512
cage12.mtx             130228   2032536    15.6      66
lung2.mtx              109460    492564     4.5     479
cryg10000.mtx           10000     49699     5.0     198
ASIC_680ks.mtx         682712   2329176     3.4      49
FEM_3D_thermal2.mtx    147900   3489300    23.6    3757
thermomech_dK.mtx      204316   2846228    13.9     644
ASIC_320ks.mtx         321671   1827807     5.7      35
cage13.mtx             445315   7479343    16.8      83
atmosmodd.mtx         1270432   8814880     6.9     352
http://www.cise.ufl.edu/research/sparse/matrices/
domino-ilu0 versus MKL
• Domino-ilu0 is 5.4x faster than MKL on average (excluding dense2.mtx)
GPU: K40
CPU: i7-3930K CPU @ 3.20GHz
Batched domino-ilu0
• Same sparsity pattern but different values
• Batch size is 32
• Speedup = (32 * single ilu0) / (batched ilu0)
• Batched domino-ilu0 is 9x faster than 32 single domino-ilu0 runs on average
GPU: K40
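The speedup definition above is a ratio of run times. A minimal sketch follows; the timings are hypothetical placeholders chosen only to show the arithmetic, not measurements from the talk.

```python
def batched_speedup(single_time, batched_time, batch_size=32):
    """Speedup = (batch_size * time of one ilu0) / (time of one batched ilu0)."""
    return batch_size * single_time / batched_time

# Hypothetical timings in milliseconds.
single_ms = 2.25    # one domino-ilu0 on one matrix
batched_ms = 8.0    # one batched domino-ilu0 on 32 matrices
speedup = batched_speedup(single_ms, batched_ms)   # 32 * 2.25 / 8.0 = 9.0
```

A speedup of 1 would mean the batched call costs exactly as much as 32 separate calls.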