QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under...

43
QR Decomposition QR Algorithms Block Householder QR QR Decomposition on GPUs Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of Technology March 8, 2009 GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of the authors. Kerr, Campbell, Richards QR Decomposition on GPUs

Transcript of QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under...

Page 1: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR

QR Decomposition on GPUs

Andrew Kerr*1 Dan Campbell1 Mark Richards2

1Georgia Tech Research Institute

2School of Electrical and Computer Engineering

Georgia Institute of Technology

March 8, 2009

GPGPU ’09

This work was supported in part byDARPA and AFRL under contractsFA8750-06-1-0012 andFA8650-07-C-7724. The opinionsexpressed are those of the authors.

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 2: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR

Outline

1 QR Decomposition

2 QR AlgorithmsAlgorithmsHouseholder ReflectionsGPU Implementation

3 Block Householder QRProblemAlgorithmImplementationPerformance

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 3: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR

QR Decomposition

QR Decomposition

Matrix factorization: A = QR

QTQ = I, Q is unitary

R is upper triangular

O(N3)

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 4: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR

Applications of QR

QR decomposition used to compute

least squares

other matrix factorizations (Toeplitz, SVD)

orthogonal basis for a collection of vectors

matrix eigendecomposition

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 5: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR

QR on GPUs

Challenges of QR decomposition on GPUs

parallel computations require fine-grain synchronization andcommunication

divergent control flow

low compute intensity

GPUs lack large caches and have high memory latencies

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 6: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

QR Algorithms

Modified Gram-Schmidt

computes A = Q1R1 directly by solving normal equations

parallel blocked QR via MGS unstable

QR via Orthogonal Transformations

orthogonal transformations triangularize A

Givens - compute rotation for each element below main diagonal toplace zeros

Householder - compute reflection for each column to place zeros

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 7: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Parallel QR via Householder Reflections

Approach

basic linear algebra procedures parallel and perform well on GPUs

large problem sizes minimize kernel launch overhead

Constraints

consists mostly of matrix-vector operations

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 8: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

Compute a Householder reflection P from a vector x

v = x± ||x||e1

P = I − 2vT v

vvT

such that

Px = ∓||x||e1

PA may be computed without explicitly forming P

PA = (I − 2vT v

vvT )A

= A− 2vT v

v(vTA)

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 9: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

A

x1 = A(:, 1)

v1 =[x1(1)− ||x1||2

x1(2 :)

]

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 10: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

P1

P = I5 − 2vT v

vvT

P1 =[I0

P

]

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 11: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

P1A

x2 = A(2 :, 2)

v2 =[x2(1)− ||x2||2

x2(2 :)

]

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 12: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

P2

P = I4 − 2vT v

vvT

P2 =[I1

P

]

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 13: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

P2P1A

x3 = A(3 :, 3)

v3 =[x3(1)− ||x3||2

x3(2 :)

]

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 14: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

P2

P = I3 − 2vT v

vvT

P3 =[I2

P

]

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 15: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

P3P2P1A

x4 = A(4 :, 4)

v4 =[x4(1)− ||x4||2

x4(2 :)

]

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 16: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

P2

P = I2 − 2vT v

vvT

P4 =[I3

P

]

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 17: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Householder Reflections

P4P3P2P1AAccumulate Q

Q = PT1 P

T2 P

T3 P

T4

Triangularize A in place

A = (PT1 P

T2 P

T3 P

T4 )(P4P3P2P1A)

A = QRQ orthogonal, R upper triangular

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 18: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Implementation on GPUs

Initial strategy:

Householder QR

matrix dimensions constrained to multiples of 32

Use CUBLAS to compute Householder reflections and vector outerproducts

Write kernel with CUDA to do better than CUBLAS’s cublasSgemv

128-byte aligned accesses to global memoryCUDA grid block sizes avoid guard conditionals

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 19: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Matrix-vector multiply CUDA kernel

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 20: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

Performance of matrix-vector product

1000 2000 3000 4000 5000 6000 7000 8000

10

20

30

40

50

60

70Matrix−vector product

Matrix order (m)

GF

LOP

/s

gtSgemv GTX280cublasSgemv GTX280Theoretical GTX280gtSgemv 9800cublasSgemv 9800Theoretical 9800GX2

TeraFLOP-capable GPU achieves 70GFLOP/s

All computations in Householderalgorithm bandwidth limited

norm, vector outer product

Custom kernel does significantlybetter for problem sizes of interest

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 21: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Problem

Problem with Householder QR

Matrix-vector operations are:

memory bound

inefficient

large

numerous

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 22: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Solution

Solution

reduce the problem size for matrix-vector operations

apply reflections in rank-r updates to identity for r > 1let high performance of matrix-matrix product offset costs ofincreased FLOP count

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 23: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

A More Efficient Approach

Block Householder Representation

P = P1 · · ·Pr, where Pj is a rank-1 update to I

P may be written as P = I + YWT

W and Y are m-by-r

A← PTA = (I + YWT )A = A+ YWTA

Q← QP = Q(I +WY T ) = Q+QWY T

matrix-matrix multiply is efficient on GPUs

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 24: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

A =[A1 A2 A3

]

1.) Input matrix is partitioned intoblocks A1, A2, . . . Ap, each with rcolumns.

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 25: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

A =[A1 A2 A3

]

2.) A Householder reflection iscomputed from the first column

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 26: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

A =[P1A1 A2 A3

]

3.) and applied to the remainingcolumns in A1.

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 27: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

A =[P1A1 A2 A3

]

4.) A Householder reflection iscomputed from the second column

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 28: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

A =[P2P1A1 A2 A3

]

5.) and applied to the remainingcolumns in A1.

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 29: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

Y WT[A2 A3

] 6.) After r reflections are applied toblock A1, W is computed from Y .

Then, matrix[A2 A3 · · ·Ap

]and Q

are updated according to

A[2···p] ← A[2···p] + YWTA[2···p]

Q← Q+QWY T

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 30: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

[P2P1A1 PTA2 PTA3

] 7.) Applying the block Householderupdate I + YWT to A is equivalentto performing the first rHouseholder reflections according tothe original algorithm.

Problem sizes for matrix-vectorproduct are much smaller.

Q is updated strictly withmatrix-matrix products.

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 31: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

R =[A1 A2 A3

]

8.) Repeat with the next block untilall of A is triangularized.

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 32: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Block Householder QR Algorithm

Block Householder QR Algorithm

Q← Ipartition A into

[A1 A2 · · ·An/r

]for k = 1 to n/r do

for j = 1 to r dos = j + (k − 1) · rv = house(Ak(s : , j))V ( : , j) = vβ(j) = 2

vT vAk ← (I − βvvT )Ak

end forcompute W from V and β[Ak+1 · · ·An/r

]← PTA = (I + YWT )

[Ak+1 · · ·An/r

]Q← QP = Q(I +WY T )

end for

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 33: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Computing P = I + WY T

Compute W and Y from V and β for block k

Y = V ( : , 1)W = −β(1) · V ( : , 1)for j = 2 to r doz = −β(j) · (I +WY T ) · V ( : , j)W =

[W z

]Y =

[Y V ( : , j)

]end for

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 34: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

GPU Implementation

Improved strategy for GPU Implementation

blocked Householder algorithm using CUBLAS and custommatrix-vector kernel

matrix sizes multiples of 32

more rows than columns

block size of 32 columns

efficient matrix-matrix product using CUBLAS

pad loads and stores to ensure alignment to maximize memorybandwidth

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 35: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Experimental Configuration

Test Platform

GeForce GTX280 - 240 stream processors

Intel Core2 Xeon - 2.83 GHz, 6 MB L2 cache per pair of cores

Intel Math Kernel Library 10 (MKL) - sgeqrf(), sormqr()

Linux x86-64 - CUDA 2.0

timing excludes transfer between system and GPU memory

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 36: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Experimental Configuration

Input Test Data

single-precision, real-valued data

A′ initialized to lower-triangular matrix

diagonal elements initialized to 1random values |a| ≤ 1 below diagonal

random Givens rotations applied to A′ to conceal structure

Result Verification

All results satisfy:

||A−QR|| ≤ m · 2−23 · ||A||||QTQ− I|| ≤ m · 2−23

Q is explicitly formed

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 37: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Performance

0 1000 2000 3000 4000 5000 6000 7000 8000 90000

50

100

150

Matrix rows

GF

LOP

/s

GTX 2809800 GX2

Peak performance

8192× 4096143 GFLOP/s

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 38: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Runtime Distribution

Table: Runtime in seconds for phases of blocked Householder QR on GPUs

Operation GeForce GTX 280 GeForce GTX 280

Problem size 6656× 3328 8192× 4096

Householder 0.326 0.565Ak ← P ·Ak 0.952 1.45WY Computation 1.25 1.86A← (I + WY T )T A 0.534 0.971Q← Q(I + WY T ) 1.36 2.79

Total (seconds) 4.43 7.629

GFLOP/s 129 GFLOP/s 143 GFLOP/s

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 39: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Performance of QP and P TA

Average performance of QP and PTA

1000 2000 3000 4000 5000 6000 7000 8000

50

100

150

200

250

300

Matrix rows

GF

LOP

s/s

QP − GTX280

PHA − GTX280QP − 9800GX2

PHA − 9800GX2

Peak performance

Q← QP :334 GFLOP/s

A← PTA:237 GFLOP/s

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 40: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Speedup

0 2000 4000 6000 8000 100000

1

2

3

4

5

Spe

edup

Matrix rows

MKL − 1 threadMKL − 2 threadsMKL − 4 threads

Peak speedup (1 thread)

8192× 40964.9× speedup

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 41: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Future Work

Attempt custom matrix-matrix product kernel to achieve higherperformance

Extend to complex-valued data

Support arbitrarily sized matrices

GPU VSIPL http://gpu-vsipl.gtri.gatech.edu

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 42: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Conclusions

Dense, block-oriented algorithms with large problem sizes do well onGPUs

GPUs can efficiently compute QR decomposition

143 GFLOP/s - 4.9x speedup

Enhancements to CUBLAS are still possible

Kerr, Campbell, Richards QR Decomposition on GPUs

Page 43: QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-7724. The opinions expressed are those of

QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Questions

Questions?

Kerr, Campbell, Richards QR Decomposition on GPUs