How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv...

26
How to Avoid Global Synchronization by Domino Scheme Lung-Sheng Chien, NVIDIA [email protected]

Transcript of How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv...

Page 1: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

How to Avoid Global Synchronization by Domino Scheme

Lung-Sheng Chien, [email protected]

Page 2: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Outline

�What should be done in sparse direct solver

- what we have now on CPU

- what is low hanging fruit on GPU

� Domino scheme

� Conclusions

Page 3: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

We have: Sparse direct solver

� Work done by professor Timothy A. Davis

- UMFPACK: unsymmetric multifrontal sparse LU factorization package

- KLU: sparse LU factorization, for circuit simulation

- CHOLMOD: supernodal sparse Cholesky factorization and update/downdate.

Now it supports GPU

- SPQR: Sparse QR

Sencer Nuri Yeralan, Timothy A. Davis, Sparse QR Factorization on GPU

Architectures, SIAM CSE 2013

S4204 - Multifrontal Sparse QR Factorization on the GPU, GTC2014

Page 4: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Low hanging fruit

� Intercept each BLAS call and re-direct it to

GPU by

- re-designing the code and porting

cuBLAS, or

- using drop-in library, nvBLAS

� Pros: easy, not error-prone

� Cons: PCI bandwidth limits the

performance12GB/s of PCI gen 3 versus 280GB/s of GTX TITAN

Sencer Nuri Yeralan, Timothy A. Davis, Sparse QR Factorization on GPU Architectures, SIAM CSE 2013

Page 5: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Step back and rethink Cholesky

� Cholesky factorization (� = ���) does not need pivoting

� Symbolic analysis can easily find out zero fill-in and figure out sparsity pattern of

Cholesky factor �

� Reform original matrix � with sparsity pattern of Cholesky factor �

� Perform Cholesky factorization (the same as incomplete Cholesky factorization)

=

� ��

������ B = ��0(�)

� = � + ���(�)

0

0

ic0 = incomplete Cholesky factorization without zero fill-in

Page 6: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Benchmark: spd matrices

� From Florida Matrix Collection

except Laplacian operator

� Laplacian operator is standard

Finite Difference 3-pt, 5-pt and 7-

pt with Dirichlet boundary

condition

� nnzA is # of nonzeros of A

� nnzL is # of nonzero of L after

symamd

� levels is # of levels in incomplete

Cholesky factorization

matrix name m nnzA nnzL nnzA/m nnzL/m levels

1138_bus.mtx 1138 2596 3264 2.3 2.9 40

aft01.mtx 8205 66886 304194 8.2 37.1 699

bcsstk13.mtx 2003 42943 289870 21.4 144.7 673

bcsstk34.mtx 588 11003 55120 18.7 93.7 364

bodyy4.mtx 17546 69742 584769 4.0 33.3 708

ex9.mtx 3363 51417 116942 15.3 34.8 662

LF10000.mtx 19998 59990 69988 3.0 3.5 19997

minsurfo.mtx 40806 122214 1038156 3.0 25.4 1177

nos6.mtx 675 1965 6200 2.9 9.2 101

s1rmt3m1.mtx 5489 112505 543103 20.5 98.9 1133

mhd4800b.mtx 4800 16160 16160 3.4 3.4 1198

Chem97ZtZ.mtx 2541 4951 4951 1.9 1.9 3

bcsstk26.mtx 1922 16129 42243 8.4 22.0 309

plat1919.mtx 1919 17159 66803 8.9 34.8 459

nasa1824.mtx 1824 20516 69926 11.2 38.3 299

ex33.mtx 1733 11961 33167 6.9 19.1 200

Muu.mtx 7102 88618 198606 12.5 28.0 323

sts4098.mtx 4098 38227 144225 9.3 35.2 371

bodyy5.mtx 18589 73935 627327 4.0 33.7 766

gyro_m.mtx 17361 178896 413709 10.3 23.8 489

Pres_Poisson.mtx 14822 365313 2447857 24.6 165.2 2758

Kuu.mtx 7102 173651 385828 24.5 54.3 612

shallow_water1.mtx 81920 204800 2448970 2.5 29.9 1428

finan512.mtx 74752 335872 2028289 4.5 27.1 1681

cant.mtx 62451 2034917 27456636 32.6 439.7 18072

lap1D_3pt_n1000000.mtx 1000000 1999999 1999999 2.0 2.0 1000000

lap2D_5pt_n100.mtx 10000 29800 25116 3.0 2.5 36

lap3D_7pt_n20.mtx 8000 30800 63281 3.9 7.9 141

http://www.cise.ufl.edu/research/sparse/matrices/

Page 7: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

lap3D_7pt_n20

Original A symamd(A)

L=chol(symamd(A))

Page 8: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

domino-Cholesky versus CHOLMOD simplicial

� Domino-Cholesky is 1.33x faster than CHOLMOD simplicial in average

� Domino-Cholesky loses when

1) zero fill-in unevenly (load imbalance),

2) too many levels (more like sequential),

3) whole matrix can fit into CPU cache

GPU: K40

CPU: i7-3930K CPU @ 3.20GHz

SuiteSparse-3.6.0

/CHOLMOD/Demo/cholmod_l_demo.c

1.33x

Page 9: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

domino-Cholesky versus CHOLMOD Supernodal

� Domino-Cholesky is 4.7x faster than CHOLMOD supernodal in average

� Domino-Cholesky wins if there are ONLY few big supernodes or it is NOT sequential

GPU: K40

CPU: i7-3930K CPU @ 3.20GHz

SuiteSparse-3.6.0

/CHOLMOD/Demo/cholmod_l_demo.c

Page 10: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Outline

�What should be done in sparse direct solver

� Domino scheme

- review scheduler of triangular solve

- how to trace DAG by a single kernel

� Conclusions

Page 11: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Triangular solve: global sync

� CUSPARSE library performs an

analysis phase, then a solve

phase

� Maxim Naumov shows how to

extract parallelism and how to

keep utilization in GTC2012,

“On the Parallel Solution of

Sparse Triangular Linear System”

� Global barrier between

consecutive two levels

� Each level running in parallel by

atomic operations

Maxim Naumov, On the Parallel Solution of Sparse Triangular Linear System, GTC2012

Page 12: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Drawback of global synchronization

Critical path = max{level 1} + max{level 2} + max{level 3} = row 2 + row 4 + row 9

Page 13: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Avoid global synchronization

Critical path = row 1 + row 4 + row 9

Page 14: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Domino scheme� Motivation

Jonathan Hogg proposed a new trsv (dense triangular solve) in 2012

� Applications

- csrsv (sparse triangular solve)

- csric0 (sparse incomplete Cholesky factorization)

- csrilu0 (sparse incomplete LU)

- sparse Cholesky

- sparse QR by householder reflection or Givens rotation

� Idea

Each row is a domino. A domino waits until parents are done, and triggers next

dominos when it is done

� Requirements

- to avoid deadlock, parents of a domino must either be done or be running

(a logical CTA id is required)

- whole matrix must fit into GPU

Page 15: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Jonathan’s trsv

CTA 0

CTA 1

CTA 2

CTA 3

CTA 0

CTA 1

CTA 2

CTA 3

S

mv

mv

mv

S

mv

mv

S

mvS

time

� DAG (implicit) is described by two constraints

- The work associated with a given off-diagonal block cannot begin until the solve for

the diagonal block in the column has completed

- The diagonal solve in a given row cannot commence until all matrix-vector multiplies

for blocks in that row have completed

� One semaphore (lock) is used to schedule the DAG

- the semaphore is an integer indicating which row is done

� One semaphore is used to query logical CTA id to avoid deadlock

Page 16: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Extend to sparse linear algebra

D1, D2 and D3 run in parallel and other rows are waiting

Once D1 is done, it will trigger its children, D4 and D5. Other rows are still waiting

Each domino needs a semaphore (lock)

Page 17: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Pros and cons of Domino scheme

� Pros

- hide global barrier inside the kernel

- overlap write phase and read phase of different dominos

- can run without levels

- reproducibility (determinism)

- quick analysis phase for csrsv/csric0/csrilu0

� Cons

- load imbalance

- long latency of L2 cache

- cannot adjust computational resources during the execution

- low utilization due to insufficient parallelism

Page 18: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Conclusions

� Domino scheme can trace dependence graph, it works for dense/sparse linear algebra

� Global synchronization is painful, there are several workarounds

- system level (across multiple nodes)

overlap MPI transfer and computation

- system level (single node)

overlap PCI transfer and computation

S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMOD

- GPU level

domino-scheme

� Domino scheme still suffers

- long latency of L2 cache (hardware issue)

- not enough parallelism (characteristic of a matrix)

- load imbalance (big trouble)

Page 19: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

� Penalty of domino is significant on csrsv

� Batched domino is much faster (9x) because

- more work to do for each level

- penalty of domino becomes small

� New domino-based functions in r6.0

- csrsv2

- csric02

- csrilu02

- bsrsv2

- bsric02

- bsrilu02

domino Non-domino

Reproducibility v x

Quick analysis phase v x

Memory efficient v x

Report structural zero or

numerical zero

v x

Buffer provided by users v x

Disable level scheduling v x

Page 20: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Sparse direct solver: QR � Symbolic analysis to find zero fill-in and dependence graph, then do left-looking

householder QR by domino scheme

� QR has 9x more zero fill-in than Cholesky and is 14x slower than Cholesky

GPU: K40

161 37

Page 21: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Future work for sparse direct solver

� Sparse QR, including householder and Givens rotation

- symbolic analysis on GPU (now on CPU)

- improve numerical factorization

� Sparse LU with partial pivoting

Page 22: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Thank you !

� [1] umfpack, cholmod, KLU, http://www.cise.ufl.edu/research/sparse/

� [2] Maxim Naumov, On the Parallel Solution of Sparse Triangular Linear System, GTC2012

� [3] J. D. Hogg, A Fast Dense Triangular Solve in CUDA,

SIAM Journal on Scientific Computing 2013, Vol. 35, No. 3, pp. C303-C322

� [4] Maxim Naumov, What Does It Take to Accelerate SPICE on the GPU?, GTC2013

� S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMODWednesday, 03/26, 14:00 - 14:50, Room LL20D

� S4243 - A GPU Sparse Direct Solver for AX=BWednesday, 03/26, 09:00 - 09:25, Room LL20D

� S4524 - Sparse LU Factorization on GPUs for Accelerating SPICE SimulationWednesday, 03/26, 09:30 - 09:55, Room LL20D

Related talks GTC2014

Page 23: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Domino-csrsv versus MKL

� Domino-csrsv is 4.2x faster than MKL in average

GPU: K40CPU: i7-3930K CPU @ 3.20GHz

4.2x

Page 24: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Benchmark: unsymmetric matrices

� From Florida Matrix Collection

except dense2.mtx

� dense2.mtx is 2000x2000 dense

matrix

� nnz is # of nonzeros of A

� levels is # of levels in incomplete

LU factorization

� No zero fill-in

matrix name m nnz nnz/m levels

cage14.mtx 1505785 27130349 18.0 100

shermanACb.mtx 18510 145149 7.8 34

Chebyshev4.mtx 68121 5377761 78.9 17424

dense2.mtx 2000 4000000 2000.0 2000

webbase-1M.mtx 1000005 3105536 3.1 512

cage12.mtx 130228 2032536 15.6 66

lung2.mtx 109460 492564 4.5 479

cryg10000.mtx 10000 49699 5.0 198

ASIC_680ks.mtx 682712 2329176 3.4 49

FEM_3D_thermal2.mtx 147900 3489300 23.6 3757

thermomech_dK.mtx 204316 2846228 13.9 644

ASIC_320ks.mtx 321671 1827807 5.7 35

cage13.mtx 445315 7479343 16.8 83

atmosmodd.mtx 1270432 8814880 6.9 352

http://www.cise.ufl.edu/research/sparse/matrices/

Page 25: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

domino-ilu0 versus MKL

� Domino-ilu0 is 5.4x faster than MKL in average (exclude dense2.mtx)

GPU: K40CPU: i7-3930K CPU @ 3.20GHz

5.4x

Page 26: How to Avoid Global Synchronization by Domino Scheme · Penalty of domino is significant on csrsv Batched domino is much faster (9x) because-more work to do for each level-penalty

Batched domino-ilu0

� Same sparsity pattern but

different value

� Batch size is 32

� Speedup =

(32 * single ilu0)/(batched ilu0)

� Batched domino-ilu0 is 9x faster

than 32 domino-ilu0 in average

GPU: K40