Exploiting Multiple GPUs in Sparse QR: Regular Numerics with … · 2015. 4. 28. · GPU kernels...
Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
Tim Davis (Texas A&M University)
with Sanjay Ranka, Mohamed Gadou (University of Florida)
Nuri Yeralan (Microsoft)
NVIDIA GTC 2015, March 2015
Outline
Combinatorial scientific computing: math + CS + applications
Multifrontal methods for factorizing sparse matrices
fill-in creates cliques in the graph
cliques connected via the elimination tree
Sparse LU: square cliques assembled via addition
Sparse LU: rectangular, via addition
Sparse QR: rectangular, via concatenation
GPU kernel design for sparse QR
Bucket scheduler for single-GPU QR
Extended for multi-GPU QR
Performance results
Combinatorial Scientific Computing: math + CS + applications
Cliques in the graph: the multifrontal method
Cliques + elimination tree = sequence of frontal matrices
Dense factorization within a front; assemble data into parent
regular + irregular computation
UMFPACK: unsymmetric multifrontal method
Frontal matrices become rectangular
Assemble data into ancestors, not just parents
UMFPACK: unsymmetric multifrontal method
Key results / impact
high-performance via dense matrix kernels within each front
symbolic preordering and analysis, followed by revised local pivot search with approximate unsymmetric degree update
widely used
sparse LU and x=A\b in MATLAB
Mathematica
IBM circuit simulation application
finite-element solvers: NASTRAN, FEniCS, ...
Berkeley Design Automation: circuit simulation
CVXOPT
...
SuiteSparseQR: multifrontal sparse QR factorization
Key results / impact
rectangular fronts like UMFPACK, but simpler frontal matrix assembly (concatenation, not summation) (Duff, Puglisi)
rank approximation (Heath, extended to multifrontal case)
multicore parallelism
on multicore CPU (70 Gflop theoretical peak): up to 30 Gflops
sparse qr in MATLAB, and x=A\b
today’s talk: GPU algorithm
novel “Bucket QR” scheduler and custom GPU kernels
up to 150 GFlops on one Kepler K20c, 286 GFlops on 4 Tesla C2070s
up to 28x speedup vs CPU algorithm (10x typical for large problems)
SuiteSparseQR
A column elimination tree and its supernodes
SuiteSparseQR
Frontal matrix assembly
SuiteSparseQR
concatenation, resulting in a staircase matrix
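Assembly by concatenation can be illustrated with a small sketch. It is written in Python so that it runs standalone, and it is not SuiteSparseQR's actual data structure: the names `leftmost` and `assemble_by_concatenation`, the dense row lists, and the example blocks are all hypothetical. Rows from several blocks are stacked without adding any entries together, then ordered by leftmost nonzero column, which gives the staircase shape.

```python
def leftmost(row):
    """Index of the first nonzero in a row (len(row) if the row is all zero)."""
    return next((j for j, x in enumerate(row) if x != 0), len(row))


def assemble_by_concatenation(blocks):
    """Stack the rows of every block into one front; entries are never summed."""
    rows = [list(row) for block in blocks for row in block]
    # Ordering by leftmost nonzero column gives the staircase structure.
    rows.sort(key=leftmost)
    return rows


# Hypothetical data: two blocks that share the same three front columns.
child = [[0.0, 2.0, 3.0],
         [0.0, 0.0, 4.0]]
rows_of_A = [[1.0, 5.0, 6.0],
             [0.0, 7.0, 8.0]]

front = assemble_by_concatenation([child, rows_of_A])
staircase = [leftmost(r) for r in front]
```

The front has as many rows as the blocks combined, which is the contrast with assembly by summation, where overlapping entries would be added into a shared position.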
SuiteSparseQR
Multifrontal factorization and assembly
Prior methods
one front at a time on the GPU
assembly on CPU
panel factorization on the CPU, applied on GPU
Our multifrontal QR
many fronts on the GPU (entire subtree)
assembly on GPU: data concatenation, not summation
entire dense QR of each front on the GPU
Consider a subtree of frontal matrices on the GPU
Expanded to show GPU kernel launches
Bucket QR factorization
GPU kernels for Bucket QR
Bucket QR requires two kernels on the GPU
QR factorization of a t-by-1 tile, t = 1, 2, or 3
creates a block Householder
details on next slides
Apply a block Householder:
A = A - V (T' (V' A))
A is t-by-s, where s can be large
thread-block iterates 2 column blocks at a time
(details omitted)
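The block-apply formula can be sketched in a few lines of Python. This is not the GPU kernel: `apply_block_householder` is a hypothetical name, matrices are plain lists of rows, and dimensions are counted in entries rather than tiles. The parenthesization on the slide is followed literally, so the small products V'A and T'(V'A) are formed before A is touched.

```python
def apply_block_householder(V, T, A):
    """Return A - V*(T'*(V'*A)) for V (m-by-r), T (r-by-r), A (m-by-s)."""
    m, r, s = len(V), len(T), len(A[0])
    # C = V' * A is only r-by-s, so the two small products come first.
    C = [[sum(V[i][a] * A[i][j] for i in range(m)) for j in range(s)]
         for a in range(r)]
    # C = T' * C
    C = [[sum(T[b][a] * C[b][j] for b in range(r)) for j in range(s)]
         for a in range(r)]
    # Each column j of the result depends only on column j of A.
    return [[A[i][j] - sum(V[i][a] * C[a][j] for a in range(r))
             for j in range(s)] for i in range(m)]


# One reflector H = I - tau*v*v' with v = [1, 1]' and tau = 2/(v'*v) = 1.
V = [[1.0], [1.0]]
T = [[1.0]]
A = [[1.0, 2.0], [3.0, 4.0]]
result = apply_block_householder(V, T, A)
```

Because every column of A is updated independently, the columns can be processed in blocks, which is what allows s to be large.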
GPU kernel: Block Householder QR
Block Householder: QR = A, where Q = (I - V*T'*V')', and R is upper triangular
% house (x) is assumed to return tau and v such that (I - tau*v*v')*x is zero below its first entry
[m, n] = size (A) ;
V = zeros (m, n) ; Tau = zeros (1, n) ;
for k = 1:n
    % reflector for column k, applied to the trailing columns
    [tau, v] = house (A (k:m,k)) ;
    A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n))) ;
    V (k:m,k) = v ; Tau (k) = tau ;
end
% build the upper triangular T, one column per reflector
T = zeros (n) ;
for k = 1:n
    tau = Tau (k) ; v = V (k:m,k) ;
    z = - tau * v' * V (k:m,1:k-1) ;
    T (1:k-1,k) = T (1:k-1,1:k-1) * z' ;
    T (k,k) = tau ;
end
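A runnable counterpart of the two loops above, as a minimal Python sketch on lists of rows. The slides do not define `house`; the version here assumes a LAPACK-style convention (v[0] = 1, sign chosen to avoid cancellation). `block_householder_qr` and `apply_q` are hypothetical names, and `apply_q` exists only to check that Q*R reproduces the input with Q = I - V*T*V'.

```python
import math


def house(x):
    """Return (tau, v), v[0] = 1, so that (I - tau*v*v') x = [beta, 0, ..., 0]."""
    sigma = sum(xi * xi for xi in x[1:])
    if sigma == 0.0:
        return 0.0, [1.0] + [0.0] * (len(x) - 1)   # nothing to eliminate
    norm = math.sqrt(x[0] * x[0] + sigma)
    beta = -norm if x[0] >= 0 else norm            # avoids cancellation in x[0] - beta
    v = [1.0] + [xi / (x[0] - beta) for xi in x[1:]]
    return (beta - x[0]) / beta, v


def block_householder_qr(A):
    """Overwrite A (m-by-n, m >= n) with R; return V and the triangular T."""
    m, n = len(A), len(A[0])
    V = [[0.0] * n for _ in range(m)]
    Tau = [0.0] * n
    for k in range(n):
        tau, v = house([A[i][k] for i in range(k, m)])
        for j in range(k, n):
            w = tau * sum(v[i - k] * A[i][j] for i in range(k, m))
            for i in range(k, m):
                A[i][j] -= v[i - k] * w
        for i in range(k, m):
            V[i][k] = v[i - k]
        Tau[k] = tau
    T = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # z = -tau * v' * V(k:m, 1:k-1), then T(1:k-1,k) = T(1:k-1,1:k-1) * z'
        z = [-Tau[k] * sum(V[i][k] * V[i][j] for i in range(k, m)) for j in range(k)]
        for i in range(k):
            T[i][k] = sum(T[i][j] * z[j] for j in range(k))
        T[k][k] = Tau[k]
    return V, T


def apply_q(V, T, R):
    """Return Q*R with Q = I - V*T*V' (used only to verify the factorization)."""
    m, n, s = len(V), len(T), len(R[0])
    W = [[sum(V[i][a] * R[i][j] for i in range(m)) for j in range(s)] for a in range(n)]
    TW = [[sum(T[a][b] * W[b][j] for b in range(n)) for j in range(s)] for a in range(n)]
    return [[R[i][j] - sum(V[i][a] * TW[a][j] for a in range(n)) for j in range(s)]
            for i in range(m)]


A0 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
R = [row[:] for row in A0]
V, T = block_householder_qr(R)
QR = apply_q(V, T, R)
```

After the call, `R` is upper triangular up to rounding and `QR` matches `A0`.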
GPU kernel: Block Householder QR
Householder update
A(k:m,k:n) = A(k:m,k:n) - ...
v*(tau*(v'*A(k:m,k:n)))
Construct T
z = - tau * v' * V (k:m,1:k-1)
T (1:k-1,k) = T(1:k-1,1:k-1)*z'
T (k,k) = tau
GPU kernel: Block Householder QR
Towards a GPU kernel: overwrite tril(A,-1) with V, and fold in construction of T.
[m, n] = size (A) ;
T = zeros (n) ; V1 = zeros (1, n) ;
for k = 1:n
    [tau, v] = house (A (k:m,k)) ;
    A (k:m,k:n) = A (k:m,k:n) - ...
        v * (tau * (v' * A (k:m,k:n))) ;
    V1 (k) = v (1) ;             % v(1) is kept aside, since A(k,k) holds R(k,k)
    A (k+1:m,k) = v (2:end) ;    % rest of v overwrites the zeroed subdiagonal
    z = - tau * v' * A (k:m,1:k-1) ;
    T (1:k-1,k) = T (1:k-1,1:k-1) * z' ;
    T (k,k) = tau ;
end
GPU kernel: Block Householder QR
The GPU kernel updates A and constructs T in parallel:
A (k:m,k:n) = A (k:m,k:n) - ...
    v * (tau * (v' * A (k:m,k:n)))
z = - tau * (v' * A (k:m,1:k-1))
T (1:k-1,k) = T (1:k-1,1:k-1) * z'
T (k,k) = tau
becomes
z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau
z is computed once over all columns: z (1:k-1) feeds T and z (k:n) feeds the update of A.
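The equivalence of the two forms can be checked on a small case. The Python sketch below performs a single step k both ways and compares the results; `step_separate`, `step_fused`, and the numbers are hypothetical, and indices are 0-based. The identity is purely algebraic, so v and tau need not come from a real Householder computation here.

```python
def step_separate(A, T, k, tau, v):
    """Original form: update A(k:m,k:n), then form z from columns 0..k-1."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]
    T = [row[:] for row in T]
    for j in range(k, n):
        w = tau * sum(v[i - k] * A[i][j] for i in range(k, m))
        for i in range(k, m):
            A[i][j] -= v[i - k] * w
    z = [-tau * sum(v[i - k] * A[i][j] for i in range(k, m)) for j in range(k)]
    for i in range(k):
        T[i][k] = sum(T[i][j] * z[j] for j in range(k))
    T[k][k] = tau
    return A, T


def step_fused(A, T, k, tau, v):
    """Fused form: one z over every column feeds both A and T."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]
    T = [row[:] for row in T]
    z = [-tau * sum(v[i - k] * A[i][j] for i in range(k, m)) for j in range(n)]
    for j in range(k, n):
        for i in range(k, m):
            A[i][j] += v[i - k] * z[j]
    for i in range(k):
        T[i][k] = sum(T[i][j] * z[j] for j in range(k))
    T[k][k] = tau
    return A, T


# State before step k = 1: column 0 already holds a reflector below the diagonal.
A0 = [[4.0, 1.0, 2.0], [0.25, 3.0, 5.0], [0.5, 6.0, 7.0]]
T0 = [[1.2, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
A1, T1 = step_separate(A0, T0, 1, 1.5, [1.0, 0.5])
A2, T2 = step_fused(A0, T0, 1, 1.5, [1.0, 0.5])
```

Both calls return the same A and T; the fused form replaces two separate products with v' by one, which suits a single parallel reduction.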
GPU kernel: Block Householder QR
z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau
GPU kernel: Block Householder QR
The GPU kernel: thread-level parallelism
z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau
A is 96-by-32, in registers during factorization, and finally in global memory at the end (V and R)
each thread owns an 8-by-1 “bitty block” of A
v is 96-by-1, in shared memory
z is 32-by-1, in shared memory. Requires 12-by-32 shared space for the v'*A(k:m,:) reduction
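The sizes on this slide fit together as simple arithmetic, worked through below in Python. The tile and bitty-block sizes come from the slide; the 32-thread warp width is an assumption (the usual NVIDIA value), and the variable names are made up for the example.

```python
TILE_ROWS, TILE_COLS = 96, 32   # size of A held by one thread block (from the slide)
BITTY_ROWS = 8                  # each thread owns an 8-by-1 bitty block
WARP_SIZE = 32                  # assumption: standard NVIDIA warp width

row_groups = TILE_ROWS // BITTY_ROWS   # bitty blocks stacked in one column of A
threads = row_groups * TILE_COLS       # one thread per bitty block
warps = threads // WARP_SIZE

# Shared-memory entries named on the slide: v, z, and the reduction space.
shared_entries = TILE_ROWS + TILE_COLS + row_groups * TILE_COLS

print(row_groups, threads, warps, shared_entries)
```

The 12-by-32 reduction space is therefore one slot per thread: 12 row groups by 32 columns.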
GPU kernel: Block Householder QR
Putting it all together. At the kth step:
threads that own column k write that column of A to shared memory
thread zero computes Householder coefficients
z = -tau * v' * A (k:m,:)
each thread computes 8-by-1 dot product in parallel
writes scalar result in 12-by-32 reduction space
warp zero sums reduction space to get z
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
only done by threads that own columns k:n
threads that own column k+1 compute norm of that column of A, for next Householder coef, saving result in 12-by-1 reduction space
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
only done by threads 1:k-1
thread zero sums up reduction space for norm of column k+1
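The two-stage computation of z can be mimicked serially. In the Python sketch below, each "thread" (g, j) reduces its own 8-by-1 bitty block into a 12-by-32 reduction space, and a second pass sums each column of that space. This is only an emulation of the data flow, not CUDA code; the function names and test data are hypothetical, and v is assumed to be zero-padded in the rows above k so that every thread can run the same loop.

```python
M, N, BITTY = 96, 32, 8
GROUPS = M // BITTY   # 12 row groups


def z_by_reduction(A, v, tau):
    """Two-stage z = -tau * v' * A, mimicking the per-thread partial sums."""
    # Stage 1: "thread" (g, j) reduces its own 8-by-1 bitty block of column j.
    red = [[sum(v[i] * A[i][j] for i in range(g * BITTY, (g + 1) * BITTY))
            for j in range(N)] for g in range(GROUPS)]
    # Stage 2: one lane per column sums the 12 partial results.
    return [-tau * sum(red[g][j] for g in range(GROUPS)) for j in range(N)]


def z_direct(A, v, tau):
    """Reference: a single dot product per column."""
    return [-tau * sum(v[i] * A[i][j] for i in range(M)) for j in range(N)]


# Hypothetical integer-valued data, so both summation orders are exact.
k = 5   # 0-based step; rows above k do not take part, so v is zero there
A = [[float((7 * i + 3 * j) % 11 - 5) for j in range(N)] for i in range(M)]
v = [0.0 if i < k else float(i % 5 - 2) for i in range(M)]
z = z_by_reduction(A, v, 0.5)
```

With general floating-point data the two summation orders can differ in the last bits, which is why the example uses integer-valued entries for an exact comparison.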
GPU kernel: Block Householder QR
z = -tau * v' * A (k:m,:)
each thread computes 8-by-1 dot product in parallel
writes scalar result in 12-by-32 reduction space
warp zero sums reduction space to get z
GPU kernel: Block Householder QR
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
only done by threads that own columns k:n
threads that own column k+1 compute norm of that column of A, for next Householder coef, saving result in 12-by-1 reduction space
GPU kernel: Block Householder QR
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
only done by threads 1:k-1
Single GPU performance results
Putting it all together ... Performance results

| | Fermi | K20 | K40 |
|---|---|---|---|
| GPU kernels: apply block Householder | 183 Gflops | 260 Gflops | 360 Gflops |
| GPU kernels: factorize 3 tiles | 27 Gflops | 20 Gflops | |
| dense QR for large front | 107 Gflops | 120 Gflops | |
| dense QR for large front ('bare metal' flops) | 154 Gflops | 172 Gflops | |
| sparse QR on GPU | 80 Gflops | 150 Gflops | |
| peak speedup over CPU | 11x | 20x | |
| typical speedup over CPU | 5x | 10x | |

GPU Kernel pitfalls
What we’ll do differently in our kernel design
Householder block-apply using too much shared memory
uberkernel approach
each thread block determines what to do from a task list (QR, apply, assemble)
pros: single large kernel launch, lots of parallelism
con: all tasks use same thread geometry
con: QR of panel needs higher occupancy to hide scalar √(d)
to do: block apply kernel needs to stage A = A - WYA by not keeping W and Y in shared.
split the uberkernel so QR panel can have higher occupancy
Single GPU performance on many matrices
Multiple GPUs: decomposing the tree
Multiple GPUs: Performance Results
Results on NVIDIA Tesla C2070 GPUs

| problem | CPU (GFlop) | 1 GPU (GFlop) | 2 GPU (GFlop) | 4 GPU (GFlop) | speedup vs CPU (4 GPUs) | speedup vs 1 GPU (4 GPUs) |
|---|---|---|---|---|---|---|
| 1500:2D | 6.1 | 16.0 | 27.1 | 38.4 | 6.3 | 2.4 |
| 2000:2D | 6.9 | 21.0 | 37.8 | 56.7 | 8.2 | 2.7 |
| 3000:2D | 7.8 | 25.8 | 44.8 | 73.7 | 9.4 | 2.9 |
| lp_nug20 | 23.9 | 74.3 | 86.4 | 66.1 | 2.8 | 0.9 |
| ch7-8-b3 | 25.3 | 104.0 | 111.3 | 173.7 | 6.9 | 1.7 |
| ch7-8-b3:8 | 10.0 | 88.0 | 160.4 | 286.2 | 28.6 | 3.3 |

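The two speedup columns follow directly from the measured rates: both compare the 4-GPU rate, once against the CPU and once against a single GPU. The short Python check below recomputes them from the numbers in the table. The parallel-efficiency figure (4-GPU speedup over 1 GPU, divided by 4) is an extra derived quantity that does not appear on the slide, and the name `speedups` is hypothetical.

```python
# Gflop rates copied from the table: (CPU, 1 GPU, 2 GPUs, 4 GPUs).
rates = {
    "1500:2D":    (6.1, 16.0, 27.1, 38.4),
    "2000:2D":    (6.9, 21.0, 37.8, 56.7),
    "3000:2D":    (7.8, 25.8, 44.8, 73.7),
    "lp_nug20":   (23.9, 74.3, 86.4, 66.1),
    "ch7-8-b3":   (25.3, 104.0, 111.3, 173.7),
    "ch7-8-b3:8": (10.0, 88.0, 160.4, 286.2),
}


def speedups(cpu, g1, g2, g4):
    """4-GPU speedup over the CPU and over 1 GPU, plus 4-GPU parallel efficiency."""
    return g4 / cpu, g4 / g1, (g4 / g1) / 4


summary = {name: speedups(*r) for name, r in rates.items()}
for name, (vs_cpu, vs_gpu, eff) in summary.items():
    print(f"{name:12s} {vs_cpu:5.1f}x vs CPU  {vs_gpu:4.1f}x vs 1 GPU  "
          f"efficiency {eff:4.0%}")
```

For lp_nug20 the 4-GPU rate falls below the 1-GPU rate, which is why its last column is under 1.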
Multiple GPUs: for large fronts
Multiple GPUs: bucket scheduler on the large scale
Acknowledgements:
National Science Foundation
NVIDIA
Texas A&M University
Summary: Sparse QR on GPUs
Fronts live and die on the GPU, which reduces CPU-GPU traffic
Bucket scheduler: extends the Communication-Avoiding QR method
Single GPU: 5x to 20x speedup over the CPU
Multi-GPU prototype: over 3x speedup on 4 GPUs versus 1 GPU
Code: SuiteSparse.com and developer.nvidia.com/cholmod
SuiteSparse logo, and music to art via math: NotesArtStudio.com