QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under...
QR Decomposition QR Algorithms Block Householder QR
QR Decomposition on GPUs
Andrew Kerr* (1), Dan Campbell (1), Mark Richards (2)
(1) Georgia Tech Research Institute
(2) School of Electrical and Computer Engineering, Georgia Institute of Technology
March 8, 2009
GPGPU ’09
This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of the authors.
Kerr, Campbell, Richards QR Decomposition on GPUs
Outline
1 QR Decomposition
2 QR Algorithms
   Algorithms
   Householder Reflections
   GPU Implementation
3 Block Householder QR
   Problem
   Algorithm
   Implementation
   Performance
QR Decomposition
QR Decomposition
Matrix factorization: A = QR
$Q^T Q = I$: Q is orthogonal (unitary in the complex case)
R is upper triangular
Cost: $O(N^3)$ operations
Applications of QR
QR decomposition is used to compute:
least squares
other matrix factorizations (Toeplitz, SVD)
orthogonal basis for a collection of vectors
matrix eigendecomposition
QR on GPUs
Challenges of QR decomposition on GPUs
parallel computations require fine-grain synchronization and communication
divergent control flow
low compute intensity
GPUs lack large caches and have high memory latencies
QR Algorithms
Modified Gram-Schmidt
computes A = Q1R1 directly by solving normal equations
parallel blocked QR via MGS is numerically unstable
QR via Orthogonal Transformations
orthogonal transformations triangularize A
Givens - compute rotation for each element below main diagonal to place zeros
Householder - compute reflection for each column to place zeros
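As a small illustration of the Givens approach listed above, the sketch below computes one rotation and uses it to zero a single entry below the diagonal. It is plain Python for readability, not GPU code, and the helper names `givens` and `apply_givens` are hypothetical, not taken from the slides.

```python
import math

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to rotate
    return a / r, b / r

def apply_givens(A, i, k, c, s):
    """Rotate rows i and k of A (a list of row lists) in place."""
    for j in range(len(A[0])):
        top, bottom = A[i][j], A[k][j]
        A[i][j] = c * top + s * bottom
        A[k][j] = -s * top + c * bottom

A = [[3.0, 1.0],
     [4.0, 2.0]]
# Zero the entry below the diagonal in column 0.
c, s = givens(A[0][0], A[1][0])
apply_givens(A, 0, 1, c, s)
```

One rotation is needed per sub-diagonal element, whereas a Householder reflection clears a whole column at once.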
Parallel QR via Householder Reflections
Approach
basic linear algebra procedures are parallel and perform well on GPUs
large problem sizes amortize kernel launch overhead
Constraints
consists mostly of matrix-vector operations
Householder Reflections
Compute a Householder reflection P from a vector x
$v = x \pm \|x\| e_1$
$P = I - \frac{2}{v^T v} v v^T$
such that
$P x = \mp \|x\| e_1$
PA may be computed without explicitly forming P
$PA = \left(I - \frac{2}{v^T v} v v^T\right) A = A - \frac{2}{v^T v} v (v^T A)$
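The identity above can be sketched in a few lines of plain Python: form v from x, then apply the reflection as a rank-1 update without building P. The function names `house` and `apply_reflection` are illustrative, and the sign in v is chosen to match the sign of x(1), which is a common choice to avoid cancellation and an assumption here rather than something the slides specify.

```python
import math

def house(x):
    """Householder vector v = x + sign(x[0]) * ||x|| * e1."""
    norm = math.sqrt(sum(t * t for t in x))
    v = list(x)
    v[0] += math.copysign(norm, x[0])
    return v

def apply_reflection(v, A):
    """Return P A = A - (2 / v^T v) v (v^T A) without forming P.

    A is a list of rows with len(A) == len(v).
    """
    vtv = sum(t * t for t in v)
    if vtv == 0.0:
        return [row[:] for row in A]
    cols = len(A[0])
    # w = v^T A: one dot product per column of A
    w = [sum(v[i] * A[i][j] for i in range(len(v))) for j in range(cols)]
    scale = 2.0 / vtv
    return [[A[i][j] - scale * v[i] * w[j] for j in range(cols)]
            for i in range(len(v))]

x = [3.0, 4.0]
v = house(x)
Px = apply_reflection(v, [[t] for t in x])  # x treated as a one-column matrix
```

With the plus sign in v, the result is Px = -||x|| e1, so every entry of x below the first is zeroed.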
Householder Reflections
A
$x_1 = A(:, 1)$
$v_1 = \begin{bmatrix} x_1(1) - \|x_1\|_2 \\ x_1(2{:}) \end{bmatrix}$
Householder Reflections
P1
$P = I_5 - \frac{2}{v^T v} v v^T$
$P_1 = \begin{bmatrix} I_0 & 0 \\ 0 & P \end{bmatrix}$
Householder Reflections
P1A
$x_2 = A(2{:}, 2)$
$v_2 = \begin{bmatrix} x_2(1) - \|x_2\|_2 \\ x_2(2{:}) \end{bmatrix}$
Householder Reflections
P2
$P = I_4 - \frac{2}{v^T v} v v^T$
$P_2 = \begin{bmatrix} I_1 & 0 \\ 0 & P \end{bmatrix}$
Householder Reflections
P2P1A
$x_3 = A(3{:}, 3)$
$v_3 = \begin{bmatrix} x_3(1) - \|x_3\|_2 \\ x_3(2{:}) \end{bmatrix}$
Householder Reflections
P3
$P = I_3 - \frac{2}{v^T v} v v^T$
$P_3 = \begin{bmatrix} I_2 & 0 \\ 0 & P \end{bmatrix}$
Householder Reflections
P3P2P1A
$x_4 = A(4{:}, 4)$
$v_4 = \begin{bmatrix} x_4(1) - \|x_4\|_2 \\ x_4(2{:}) \end{bmatrix}$
Householder Reflections
P4
$P = I_2 - \frac{2}{v^T v} v v^T$
$P_4 = \begin{bmatrix} I_3 & 0 \\ 0 & P \end{bmatrix}$
Householder Reflections
$P_4 P_3 P_2 P_1 A$
Accumulate Q
$Q = P_1^T P_2^T P_3^T P_4^T$
Triangularize A in place
$A = (P_1^T P_2^T P_3^T P_4^T)(P_4 P_3 P_2 P_1 A)$
$A = QR$
Q orthogonal, R upper triangular
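The column-by-column procedure walked through above can be put together as a short reference routine. This is a minimal sketch in plain Python, assuming a dense real matrix stored as a list of rows with at least as many rows as columns. It is a CPU illustration, not the authors' GPU implementation, and `householder_qr` is a hypothetical name.

```python
import math

def householder_qr(A):
    """Return (Q, R) with A = Q R, for A given as m rows of n floats, m >= n."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # Householder vector from the trailing part of column k
        x = [R[i][k] for i in range(k, m)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            continue  # column is already zero below the diagonal
        v = x[:]
        v[0] += math.copysign(norm, x[0])
        beta = 2.0 / sum(t * t for t in v)
        # R <- P_k R, touching rows k..m-1 and columns k..n-1 only
        for j in range(k, n):
            w = beta * sum(v[i - k] * R[i][j] for i in range(k, m))
            for i in range(k, m):
                R[i][j] -= w * v[i - k]
        # Q <- Q P_k, touching columns k..m-1 of every row
        for i in range(m):
            w = beta * sum(Q[i][j] * v[j - k] for j in range(k, m))
            for j in range(k, m):
                Q[i][j] -= w * v[j - k]
    return Q, R

Q, R = householder_qr([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Each reflection is symmetric, so accumulating Q as Q P_1 P_2 ... matches the product of transposes shown above. Every inner update is a matrix-vector style operation, which is the bottleneck the later slides address.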
Implementation on GPUs
Initial strategy:
Householder QR
matrix dimensions constrained to multiples of 32
Use CUBLAS to compute Householder reflections and vector outer products
Write a custom CUDA kernel to outperform CUBLAS's cublasSgemv
128-byte aligned accesses to global memory
CUDA grid and block sizes avoid guard conditionals
Matrix-vector multiply CUDA kernel
Performance of matrix-vector product
[Figure: Matrix-vector product. x-axis: matrix order (m), 1000 to 8000; y-axis: GFLOP/s, 10 to 70. Series: gtSgemv, cublasSgemv, and theoretical peak, each for GTX280 and 9800 GX2.]
TeraFLOP-capable GPU achieves 70 GFLOP/s
All computations in Householder algorithm are bandwidth limited
norm, vector outer product
Custom kernel does significantly better for problem sizes of interest
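A back-of-the-envelope calculation shows why a bandwidth-limited matrix-vector product lands near 70 GFLOP/s. The sketch assumes single-precision data, that each matrix element is read once from global memory, and a peak memory bandwidth of about 141.7 GB/s for the GTX 280. That bandwidth is the commonly quoted vendor figure and does not appear in the slides.

```python
def gemv_bandwidth_bound_gflops(bandwidth_gb_per_s, bytes_per_element=4):
    """Upper bound on GFLOP/s for y = A x when reading A dominates memory traffic."""
    # Each matrix element contributes one multiply and one add (2 flops)
    # and costs bytes_per_element bytes of memory traffic.
    flops_per_element = 2.0
    return bandwidth_gb_per_s * flops_per_element / bytes_per_element

GTX280_BANDWIDTH_GB_S = 141.7  # assumed peak bandwidth, not taken from the slides
bound = gemv_bandwidth_bound_gflops(GTX280_BANDWIDTH_GB_S)
```

Under these assumptions the bound is about 71 GFLOP/s, in line with the 70 GFLOP/s reported above, and far below the arithmetic peak of the device.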
Problem
Problem with Householder QR
Matrix-vector operations are:
memory bound
inefficient
large
numerous
Solution
Solution
reduce the problem size for matrix-vector operations
apply reflections in rank-r updates to identity for r > 1
let high performance of matrix-matrix product offset costs of increased FLOP count
A More Efficient Approach
Block Householder Representation
$P = P_1 \cdots P_r$, where $P_j$ is a rank-1 update to I
P may be written as $P = I + W Y^T$
W and Y are m-by-r
$A \leftarrow P^T A = (I + Y W^T) A = A + Y W^T A$
$Q \leftarrow Q P = Q (I + W Y^T) = Q + Q W Y^T$
matrix-matrix multiply is efficient on GPUs
Block Householder QR Algorithm
$A = \begin{bmatrix} A_1 & A_2 & A_3 \end{bmatrix}$
1.) Input matrix is partitioned into blocks $A_1, A_2, \ldots, A_p$, each with r columns.
Block Householder QR Algorithm
$A = \begin{bmatrix} A_1 & A_2 & A_3 \end{bmatrix}$
2.) A Householder reflection is computed from the first column
Block Householder QR Algorithm
$A = \begin{bmatrix} P_1 A_1 & A_2 & A_3 \end{bmatrix}$
3.) and applied to the remaining columns in $A_1$.
Block Householder QR Algorithm
$A = \begin{bmatrix} P_1 A_1 & A_2 & A_3 \end{bmatrix}$
4.) A Householder reflection is computed from the second column
Block Householder QR Algorithm
$A = \begin{bmatrix} P_2 P_1 A_1 & A_2 & A_3 \end{bmatrix}$
5.) and applied to the remaining columns in $A_1$.
Block Householder QR Algorithm
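$Y W^T \begin{bmatrix} A_2 & A_3 \end{bmatrix}$
6.) After r reflections are applied to block $A_1$, W is computed from Y. Then the matrix $\begin{bmatrix} A_2 & A_3 & \cdots & A_p \end{bmatrix}$ and Q are updated according to
$A_{[2 \cdots p]} \leftarrow A_{[2 \cdots p]} + Y W^T A_{[2 \cdots p]}$
$Q \leftarrow Q + Q W Y^T$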
Block Householder QR Algorithm
$\begin{bmatrix} P_2 P_1 A_1 & P^T A_2 & P^T A_3 \end{bmatrix}$
7.) Applying the block Householder update $I + Y W^T$ to A is equivalent to performing the first r Householder reflections according to the original algorithm.
Problem sizes for matrix-vector product are much smaller.
Q is updated strictly with matrix-matrix products.
Block Householder QR Algorithm
$R = \begin{bmatrix} A_1 & A_2 & A_3 \end{bmatrix}$
8.) Repeat with the next block until all of A is triangularized.
Block Householder QR Algorithm
Block Householder QR Algorithm
Q ← I
partition A into [A_1 A_2 ··· A_{n/r}]
for k = 1 to n/r do
    for j = 1 to r do
        s = j + (k − 1)·r
        v = house(A_k(s:, j))
        V(:, j) = v
        β(j) = 2 / (v^T v)
        A_k ← (I − β v v^T) A_k
    end for
    compute W from V and β
    [A_{k+1} ··· A_{n/r}] ← P^T A = (I + Y W^T) [A_{k+1} ··· A_{n/r}]
    Q ← QP = Q(I + W Y^T)
end for
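The pseudocode above can be turned into a small reference routine. The following is a sketch in plain Python under several assumptions: the matrix is a list of rows, the column count is a multiple of the block size r, Householder vectors are zero-padded to full length m, and the sign in each vector follows the sign of the pivot entry. The helper names are hypothetical, and the matrix-matrix products that the slides delegate to CUBLAS appear here as a naive `matmul`.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def blocked_householder_qr(A, r):
    """Return (Q, R) with A = Q R; A is m-by-n, n a multiple of r, m >= n."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for k0 in range(0, n, r):
        V, beta = [], []
        for s in range(k0, k0 + r):
            # Householder vector for column s, zero-padded to length m
            x = [R[i][s] if i >= s else 0.0 for i in range(m)]
            norm = math.sqrt(sum(t * t for t in x))
            v = x[:]
            v[s] += math.copysign(norm, x[s])
            vtv = sum(t * t for t in v)
            b = 2.0 / vtv if vtv > 0.0 else 0.0
            # apply the reflection to the remaining columns of this block only
            for j in range(s, k0 + r):
                w = b * sum(v[i] * R[i][j] for i in range(s, m))
                for i in range(s, m):
                    R[i][j] -= w * v[i]
            V.append(v)
            beta.append(b)
        # accumulate W and Y column by column so that P_1 ... P_r = I + W Y^T
        Wc, Yc = [[-beta[0] * t for t in V[0]]], [V[0]]
        for j in range(1, r):
            z = V[j][:]
            for wc, yc in zip(Wc, Yc):
                dot = sum(a * c for a, c in zip(yc, V[j]))
                z = [zi + dot * wi for zi, wi in zip(z, wc)]
            Wc.append([-beta[j] * t for t in z])
            Yc.append(V[j])
        W, Y = transpose(Wc), transpose(Yc)  # both m-by-r
        # trailing blocks: A <- (I + Y W^T) A using matrix-matrix products
        if k0 + r < n:
            T = [row[k0 + r:] for row in R]
            T = add(T, matmul(Y, matmul(transpose(W), T)))
            for i in range(m):
                R[i][k0 + r:] = T[i]
        # Q <- Q (I + W Y^T)
        Q = add(Q, matmul(matmul(Q, W), transpose(Y)))
    return Q, R
```

Only the inner loop over one block performs matrix-vector style work, on r columns at a time. The trailing-matrix and Q updates are pure matrix-matrix products, which is where a GPU implementation would spend most of its arithmetic.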
Computing $P = I + W Y^T$
Compute W and Y from V and β for block k
Y = V(:, 1)
W = −β(1)·V(:, 1)
for j = 2 to r do
    z = −β(j)·(I + W Y^T)·V(:, j)
    W = [W z]
    Y = [Y V(:, j)]
end for
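The recurrence above can be checked numerically on a small case. This sketch in plain Python stores W and Y as lists of columns and never forms the m-by-m matrix I + W Y^T. The function name `accumulate_wy` and the sample vectors are hypothetical.

```python
def accumulate_wy(V, beta):
    """Build W and Y from r Householder vectors.

    V is a list of r vectors of length m, and beta[j] = 2 / (v_j^T v_j).
    Returns W and Y as lists of r columns with P_1 ... P_r = I + W Y^T.
    """
    W = [[-beta[0] * t for t in V[0]]]
    Y = [V[0][:]]
    for j in range(1, len(V)):
        v = V[j]
        # z = -beta_j (I + W Y^T) v = -beta_j (v + sum_k W_k (Y_k . v))
        z = v[:]
        for w_col, y_col in zip(W, Y):
            dot = sum(a * b for a, b in zip(y_col, v))
            z = [zi + dot * wi for zi, wi in zip(z, w_col)]
        W.append([-beta[j] * t for t in z])
        Y.append(v[:])
    return W, Y

V = [[1.0, 2.0, 2.0], [0.0, 1.0, 1.0]]
beta = [2.0 / sum(t * t for t in v) for v in V]
W, Y = accumulate_wy(V, beta)
```

Each new column z needs one product with the current W Y^T, so the cost of building W grows with r. The slides' breakdown of runtime lists this step separately as "WY Computation".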
GPU Implementation
Improved strategy for GPU Implementation
blocked Householder algorithm using CUBLAS and custom matrix-vector kernel
matrix sizes multiples of 32
more rows than columns
block size of 32 columns
efficient matrix-matrix product using CUBLAS
pad loads and stores to ensure alignment to maximize memory bandwidth
Experimental Configuration
Test Platform
GeForce GTX280 - 240 stream processors
Intel Core2 Xeon - 2.83 GHz, 6 MB L2 cache per pair of cores
Intel Math Kernel Library 10 (MKL) - sgeqrf(), sormqr()
Linux x86-64 - CUDA 2.0
timing excludes transfer between system and GPU memory
Experimental Configuration
Input Test Data
single-precision, real-valued data
A′ initialized to lower-triangular matrix
diagonal elements initialized to 1
random values $|a| \le 1$ below diagonal
random Givens rotations applied to A′ to conceal structure
Result Verification
All results satisfy:
$\|A - QR\| \le m \cdot 2^{-23} \cdot \|A\|$
$\|Q^T Q - I\| \le m \cdot 2^{-23}$
Q is explicitly formed
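The two acceptance criteria above can be written as a short check. The slides do not say which matrix norm is used, so this sketch assumes the Frobenius norm. The function names are hypothetical, and the sample factors are a hand-built 2-by-2 example, not data from the experiments.

```python
import math

EPS_SINGLE = 2.0 ** -23  # single-precision machine epsilon, as in the bounds above

def fro(M):
    """Frobenius norm of a matrix stored as a list of rows (assumed norm)."""
    return math.sqrt(sum(t * t for row in M for t in row))

def verify_qr(A, Q, R):
    """Return (factorization_ok, orthogonality_ok) for A ~ Q R with explicit Q."""
    m, n = len(A), len(A[0])
    resid = [[A[i][j] - sum(Q[i][k] * R[k][j] for k in range(len(R)))
              for j in range(n)] for i in range(m)]
    cols = len(Q[0])
    qtq_minus_i = [[sum(Q[k][i] * Q[k][j] for k in range(m)) - (1.0 if i == j else 0.0)
                    for j in range(cols)] for i in range(cols)]
    ok_factor = fro(resid) <= m * EPS_SINGLE * fro(A)
    ok_orth = fro(qtq_minus_i) <= m * EPS_SINGLE
    return ok_factor, ok_orth

A = [[3.0, 1.0], [4.0, 2.0]]
Q = [[0.6, -0.8], [0.8, 0.6]]
R = [[5.0, 2.2], [0.0, 0.4]]
result = verify_qr(A, Q, R)
```

Both bounds scale with the number of rows m, which allows rounding error to grow with problem size while still rejecting a factorization that is wrong in any visible digit.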
Performance
[Figure: blocked Householder QR performance. x-axis: matrix rows, 0 to 9000; y-axis: GFLOP/s, 0 to 150. Series: GTX 280 and 9800 GX2.]
Peak performance: 143 GFLOP/s at 8192 × 4096
Runtime Distribution
Table: Runtime in seconds for phases of blocked Householder QR on GPUs
Operation              GeForce GTX 280   GeForce GTX 280
Problem size           6656 × 3328       8192 × 4096
Householder            0.326             0.565
A_k ← P · A_k          0.952             1.45
WY Computation         1.25              1.86
A ← (I + W Y^T)^T A    0.534             0.971
Q ← Q (I + W Y^T)      1.36              2.79
Total (seconds)        4.43              7.629
GFLOP/s                129               143
Performance of QP and $P^T A$
Average performance of QP and $P^T A$
[Figure: x-axis: matrix rows, 1000 to 8000; y-axis: GFLOP/s, ticks from 50 to 300. Series: QP and PHA, each for GTX280 and 9800GX2.]
Peak performance
$Q \leftarrow QP$: 334 GFLOP/s
$A \leftarrow P^T A$: 237 GFLOP/s
Speedup
[Figure: Speedup. x-axis: matrix rows, 0 to 10000; y-axis: speedup, 0 to 5. Series: MKL with 1 thread, 2 threads, and 4 threads.]
Peak speedup (1 thread): 4.9× at 8192 × 4096
Future Work
Attempt custom matrix-matrix product kernel to achieve higher performance
Extend to complex-valued data
Support arbitrarily sized matrices
GPU VSIPL http://gpu-vsipl.gtri.gatech.edu
Conclusions
Dense, block-oriented algorithms with large problem sizes do well on GPUs
GPUs can efficiently compute QR decomposition
143 GFLOP/s - 4.9x speedup
Enhancements to CUBLAS are still possible
Questions
Questions?