QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under...

QR Decomposition QR Algorithms Block Householder QR

QR Decomposition on GPUs

Andrew Kerr*1 Dan Campbell1 Mark Richards2

1Georgia Tech Research Institute

2School of Electrical and Computer Engineering

Georgia Institute of Technology

March 8, 2009

GPGPU ’09

This work was supported in part byDARPA and AFRL under contractsFA8750-06-1-0012 andFA8650-07-C-7724. The opinionsexpressed are those of the authors.

Kerr, Campbell, Richards QR Decomposition on GPUs


Outline

1 QR Decomposition

2 QR AlgorithmsAlgorithmsHouseholder ReflectionsGPU Implementation

3 Block Householder QRProblemAlgorithmImplementationPerformance



QR Decomposition

QR Decomposition

Matrix factorization: A = QR

QTQ = I, Q is unitary

R is upper triangular

O(N3)



Applications of QR

QR decomposition used to compute

least squares

other matrix factorizations (Toeplitz, SVD)

orthogonal basis for a collection of vectors

matrix eigendecomposition



QR on GPUs

Challenges of QR decomposition on GPUs

parallel computations require fine-grain synchronization andcommunication

divergent control flow

low compute intensity

GPUs lack large caches and have high memory latencies


QR Decomposition QR Algorithms Block Householder QR Algorithms Householder Reflections GPU Implementation

QR Algorithms

Modified Gram-Schmidt

computes A = Q1R1 directly by solving normal equations

parallel blocked QR via MGS unstable

QR via Orthogonal Transformations

orthogonal transformations triangularize A

Givens - compute rotation for each element below main diagonal toplace zeros

Householder - compute reflection for each column to place zeros



Parallel QR via Householder Reflections

Approach

basic linear algebra procedures parallel and perform well on GPUs

large problem sizes minimize kernel launch overhead

Constraints

consists mostly of matrix-vector operations



Householder Reflections

Compute a Householder reflection P from a vector x

v = x± ||x||e1

P = I − 2vT v

vvT

such that

Px = ∓||x||e1

PA may be computed without explicitly forming P

PA = (I − 2vT v

vvT )A

= A− 2vT v

v(vTA)




A

x1 = A(:, 1)

v1 =[x1(1)− ||x1||2

x1(2 :)

]




P1

P = I5 − 2vT v

vvT

P1 =[I0

P

]




P1A

x2 = A(2 :, 2)

v2 =[x2(1)− ||x2||2

x2(2 :)

]




P2

P = I4 − 2vT v

vvT

P2 =[I1

P

]




P2P1A

x3 = A(3 :, 3)

v3 =[x3(1)− ||x3||2

x3(2 :)

]




P2

P = I3 − 2vT v

vvT

P3 =[I2

P

]




P3P2P1A

x4 = A(4 :, 4)

v4 =[x4(1)− ||x4||2

x4(2 :)

]




P2

P = I2 − 2vT v

vvT

P4 =[I3

P

]




P4P3P2P1AAccumulate Q

Q = PT1 P

T2 P

T3 P

T4

Triangularize A in place

A = (PT1 P

T2 P

T3 P

T4 )(P4P3P2P1A)

A = QRQ orthogonal, R upper triangular



Implementation on GPUs

Initial strategy:

Householder QR

matrix dimensions constrained to multiples of 32

Use CUBLAS to compute Householder reflections and vector outerproducts

Write kernel with CUDA to do better than CUBLAS’s cublasSgemv

128-byte aligned accesses to global memoryCUDA grid block sizes avoid guard conditionals



Matrix-vector multiply CUDA kernel



Performance of matrix-vector product

1000 2000 3000 4000 5000 6000 7000 8000

10

20

30

40

50

60

70Matrix−vector product

Matrix order (m)

GF

LOP

/s

gtSgemv GTX280cublasSgemv GTX280Theoretical GTX280gtSgemv 9800cublasSgemv 9800Theoretical 9800GX2

TeraFLOP-capable GPU achieves 70GFLOP/s

All computations in Householderalgorithm bandwidth limited

norm, vector outer product

Custom kernel does significantlybetter for problem sizes of interest


QR Decomposition QR Algorithms Block Householder QR Problem Algorithm Implementation Performance

Problem

Problem with Householder QR

Matrix-vector operations are:

memory bound

inefficient

large

numerous



Solution

Solution

reduce the problem size for matrix-vector operations

apply reflections in rank-r updates to identity for r > 1let high performance of matrix-matrix product offset costs ofincreased FLOP count



A More Efficient Approach

Block Householder Representation

P = P1 · · ·Pr, where Pj is a rank-1 update to I

P may be written as P = I + YWT

W and Y are m-by-r

A← PTA = (I + YWT )A = A+ YWTA

Q← QP = Q(I +WY T ) = Q+QWY T

matrix-matrix multiply is efficient on GPUs



Block Householder QR Algorithm

A =[A1 A2 A3

]

1.) Input matrix is partitioned intoblocks A1, A2, . . . Ap, each with rcolumns.




A =[A1 A2 A3

]

2.) A Householder reflection iscomputed from the first column




A =[P1A1 A2 A3

]

3.) and applied to the remainingcolumns in A1.




A =[P1A1 A2 A3

]

4.) A Householder reflection iscomputed from the second column




A =[P2P1A1 A2 A3

]

5.) and applied to the remainingcolumns in A1.




Y WT[A2 A3

] 6.) After r reflections are applied toblock A1, W is computed from Y .

Then, matrix[A2 A3 · · ·Ap

]and Q

are updated according to

A[2···p] ← A[2···p] + YWTA[2···p]

Q← Q+QWY T




[P2P1A1 PTA2 PTA3

] 7.) Applying the block Householderupdate I + YWT to A is equivalentto performing the first rHouseholder reflections according tothe original algorithm.

Problem sizes for matrix-vectorproduct are much smaller.

Q is updated strictly withmatrix-matrix products.




R =[A1 A2 A3

]

8.) Repeat with the next block untilall of A is triangularized.





Q← Ipartition A into

[A1 A2 · · ·An/r

]for k = 1 to n/r do

for j = 1 to r dos = j + (k − 1) · rv = house(Ak(s : , j))V ( : , j) = vβ(j) = 2

vT vAk ← (I − βvvT )Ak

end forcompute W from V and β[Ak+1 · · ·An/r

]← PTA = (I + YWT )

[Ak+1 · · ·An/r

]Q← QP = Q(I +WY T )

end for



Computing P = I + WY T

Compute W and Y from V and β for block k

Y = V ( : , 1)W = −β(1) · V ( : , 1)for j = 2 to r doz = −β(j) · (I +WY T ) · V ( : , j)W =

[W z

]Y =

[Y V ( : , j)

]end for



GPU Implementation

Improved strategy for GPU Implementation

blocked Householder algorithm using CUBLAS and custommatrix-vector kernel

matrix sizes multiples of 32

more rows than columns

block size of 32 columns

efficient matrix-matrix product using CUBLAS

pad loads and stores to ensure alignment to maximize memorybandwidth



Experimental Configuration

Test Platform

GeForce GTX280 - 240 stream processors

Intel Core2 Xeon - 2.83 GHz, 6 MB L2 cache per pair of cores

Intel Math Kernel Library 10 (MKL) - sgeqrf(), sormqr()

Linux x86-64 - CUDA 2.0

timing excludes transfer between system and GPU memory



Experimental Configuration

Input Test Data

single-precision, real-valued data

A′ initialized to lower-triangular matrix

diagonal elements initialized to 1random values |a| ≤ 1 below diagonal

random Givens rotations applied to A′ to conceal structure

Result Verification

All results satisfy:

||A−QR|| ≤ m · 2−23 · ||A||||QTQ− I|| ≤ m · 2−23

Q is explicitly formed



Performance

0 1000 2000 3000 4000 5000 6000 7000 8000 90000

50

100

150

Matrix rows

GF

LOP

/s

GTX 2809800 GX2

Peak performance

8192× 4096143 GFLOP/s



Runtime Distribution

Table: Runtime in seconds for phases of blocked Householder QR on GPUs

Operation GeForce GTX 280 GeForce GTX 280

Problem size 6656× 3328 8192× 4096

Householder 0.326 0.565Ak ← P ·Ak 0.952 1.45WY Computation 1.25 1.86A← (I + WY T )T A 0.534 0.971Q← Q(I + WY T ) 1.36 2.79

Total (seconds) 4.43 7.629

GFLOP/s 129 GFLOP/s 143 GFLOP/s



Performance of QP and P TA

Average performance of QP and PTA

1000 2000 3000 4000 5000 6000 7000 8000

50

100

150

200

250

300

Matrix rows

GF

LOP

s/s

QP − GTX280

PHA − GTX280QP − 9800GX2

PHA − 9800GX2

Peak performance

Q← QP :334 GFLOP/s

A← PTA:237 GFLOP/s



Speedup

0 2000 4000 6000 8000 100000

1

2

3

4

5

Spe

edup

Matrix rows

MKL − 1 threadMKL − 2 threadsMKL − 4 threads

Peak speedup (1 thread)

8192× 40964.9× speedup



Future Work

Attempt custom matrix-matrix product kernel to achieve higherperformance

Extend to complex-valued data

Support arbitrarily sized matrices

GPU VSIPL http://gpu-vsipl.gtri.gatech.edu



Conclusions

Dense, block-oriented algorithms with large problem sizes do well onGPUs

GPUs can efficiently compute QR decomposition

143 GFLOP/s - 4.9x speedup

Enhancements to CUBLAS are still possible



Questions

Questions?


QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under...

Documents

Transcript of QR Decomposition on GPUs · GPGPU ’09 This work was supported in part by DARPA and AFRL under...