Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Transcript of Evaluation and Tuning of Level 3 CUBLAS for Graphics ...

Page 1: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...

Introduction · Level 3 CUBLAS Evaluation and Tuning · Conclusions and Future Work

Evaluation and Tuning of Level 3 CUBLAS for Graphics Processors

Sergio Barrachina, Maribel Castillo, Francisco Igual, Rafael Mayo, Enrique S. Quintana-Ortí

Universitat Jaume I, Spain

9th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing

Miami, April 2008


Page 2: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Motivation

Current graphics processors (GPUs) are gaining wide appeal as HPC accelerators


Page 3: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Motivation (II)

GPUs are being transformed into general-purpose devices of wide appeal (GPGPU):

1. High performance
2. Low cost
3. Programmability

Applied to many problems: physical simulations, real-time image processing, linear algebra, signal processing, ...


Page 4: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Outline

1. Introduction

2. Level 3 CUBLAS Evaluation and Tuning
   - GEMM. Padding
   - SYRK and TRSM
   - Partitioning for Larger Matrices
   - Hybrid Computation

3. Conclusions and Future Work


Page 5: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


CUDA Hardware

A CUDA-enabled device is seen as a coprocessor to the CPU, capable of executing a very high number of threads in parallel

Example: the nVIDIA G80 as a set of SIMD multiprocessors with on-chip shared memory

- Up to 128 Streaming Processors (SP), grouped in clusters
- SPs are SIMD processors
- Small and fast Shared Memory, shared per SP cluster
- Local 32-bit registers per processor


Page 6: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


CUDA Software

The CUDA API provides a simple framework for writing C programs for execution on the GPU

Consists of:

- A minimal set of extensions to the C language
- A runtime library of routines for controlling the transfers between video and main memory, run-time configuration, execution of device-specific functions, handling multiple GPUs, ...

CUDA libraries

On top of CUDA, nVIDIA provides two optimized libraries: CUFFT and CUBLAS


Page 7: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


CUBLAS Example

int main(void) {
    ...
    float *h_vector, *d_vector;

    h_vector = (float *) malloc(M * sizeof(float));
    ...  /* Initialize vector of M floats */
    cublasAlloc(M, sizeof(float), (void **) &d_vector);
    cublasSetVector(M, sizeof(float), h_vector, 1, d_vector, 1);
    cublasSscal(M, ALPHA, d_vector, 1);
    cublasGetVector(M, sizeof(float), d_vector, 1, h_vector, 1);
    cublasFree(d_vector);
    ...
}

A typical CUDA (and CUBLAS) program has 3 phases:

1. Allocation and transfer of data to the GPU
2. Execution of the BLAS kernel
3. Transfer of results back to main memory



Page 9: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Level 3 CUBLAS Evaluation

BLAS are the building blocks for more complex linear algebra operations (e.g., the solution of dense linear systems)

There exist optimized versions of the BLAS tuned for most current processors: GotoBLAS, Intel MKL, AMD ACML, ...

Goals

Evaluate the performance of the Level 3 BLAS routines in CUBLAS

Tune the implementations in CUBLAS to attain higher performance


Page 10: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Experimental Setup

Intel Core 2 Duo with an nVIDIA 8800 Ultra board:

                    CPU                   GPU
Processor           Intel Core 2 Duo      nVIDIA 8800 Ultra
Codename            Crusoe E6320          G80
Clock frequency     1.86 GHz              575 MHz
Memory speed        2 × 333 MHz           2 × 900 MHz
Bus width           64 bits               384 bits
Max. bandwidth      5.3 GB/s              86.4 GB/s
Memory              1024 MB DDR2          768 MB GDDR3
Bus                 PCI Express x16 (4 GB/s)

CUDA and CUBLAS 1.0 and single precision arithmetic used in the experiments


Page 11: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Evaluated BLAS routines

GEMM:

C := β · C + α · op(A) · op(B)

SYRK:

C := β · C + α · A · Aᵀ  or  C := β · C + α · Aᵀ · A

TRSM:

op(A) · X = α · B  or  X · op(A) = α · B

where op(X) = X or Xᵀ

SYMM and TRMM are similar

Page 12: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


GEMM Evaluation

[Figure: SGEMM performance (GFLOPS) vs. matrix dimension (m = n = k), for the four cases A·B, Aᵀ·B, A·Bᵀ, and Aᵀ·Bᵀ]



Page 15: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


GEMM Evaluation

Main remarks

Peak performance is ∼116 Gflops for GEMM

Performance attained for large matrices is much better than that for small/medium problems

Stream-oriented architecture

The peak observed for m = 4000 is also observed for all dimensions that are multiples of 32

Our proposal: padding to improve performance


Page 16: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


GEMM Tuning: Padding

[Figure: SGEMM performance (GFLOPS) vs. matrix dimension (m = n = k), with and without padding]


Page 17: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


SYRK Evaluation

[Figure: SYRK detailed performance (GFLOPS) vs. matrix dimension (m = k), for Aᵀ·A and A·Aᵀ with upper and lower triangular results]



Page 20: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


SYRK Evaluation

Main remarks

Peak performance is ∼40 Gflops for SYRK ⇒ suboptimal implementation

As for GEMM, results attained for big matrices are better

Our proposal: padding and building on top of GEMM


Page 21: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Building SYRK on top of GEMM

Both SYRK and TRSM yield poor performance compared with that of GEMM

Assuming the partitioning for SYRK:

[Diagram: lower triangular C, partitioned into 3 × 3 blocks C00 ... C22, updated as C += A · Aᵀ with A partitioned conformally into row blocks A0, A1, A2]

If the first block of columns of C has been computed, we can proceed by:

C11 := β · C11 + α · A1 · A1ᵀ

C21 := β · C21 + α · A2 · A1ᵀ


Page 22: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Building SYRK on top of GEMM

[Figure: SSYRK performance (GFLOPS) vs. matrix dimension (m = k): CUBLAS vs. blocked versions without and with padding]


Page 23: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Building TRSM on top of GEMM

[Figure: STRSM (column-oriented version) performance (GFLOPS) vs. matrix dimension (m = n): CUBLAS vs. blocked versions without and with padding]


Page 24: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Partitioning for Larger Matrices

Transfer times between GPU and CPU are an important bottleneck

Our blocked version of GEMM makes it possible to overlap:

- the partial multiplication of Ap and Bp, and
- the transfer of the next pair of blocks Ap+1 and Bp+1,

where Ap and Bp are blocks of columns of A and B, respectively

Neither CUDA 1.0 nor the G80 allows overlapping communication and computation on the GPU (this is supported by more recent systems!)

An orthogonal benefit: it enables computation with larger matrices that do not fit in GPU memory


Page 25: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Hybrid Computation

While the GPU is performing a calculation, the CPU can also be executing part of the computation

We implement a hybrid GEMM following the decomposition:

[Diagram: C = [C1 | C2] += A · [B1 | B2], where C1 and B1 hold the first N′ columns (computed on the CPU) and C2 and B2 the remaining N″ columns (computed on the GPU)]

Values for N′ and N″ must be selected carefully to balance the computation load

The same approach is applied to SYRK and TRSM, with similar results


Page 26: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Hybrid Computation

[Figure: SGEMM performance (GFLOPS) vs. matrix dimension (m = n = k): hybrid SGEMM with padding vs. CUBLAS with and without padding]


Page 27: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Conclusions and Future Work

Although CUBLAS is a vendor-specific library, it is not highly optimized. In particular, not all routines are equally optimized (GEMM is the best)

Applying simple ideas, it is possible to attain higher and more predictable performance without modifying CUBLAS

The source code of CUBLAS has recently been released!!

It is possible to apply our improvements directly to CUBLAS, and even to apply more fine-grained optimizations


Page 28: Evaluation and Tuning of Level 3 CUBLAS for Graphics ...


Thanks for your interest!

Enrique S. Quintana-Ortí

More information

[email protected]

http://www3.uji.es/~figual/publications.html
