NVIDIA CUDA Libraries - Oak Ridge Leadership Computing Facility (transcript)
CUDA Libraries and Middleware: the software stack
- Applications
- 3rd Party Libraries & Middleware
- CUDA-C APIs
- NVIDIA Libraries
Design goals:
- Basic building blocks
- Easy to use
- Fast
- Programming language integration
- Hybrid computation
- Multi-node and multi-GPU
- Domain-specific languages

Library coverage:
- CUBLAS: dense linear algebra
- CUSPARSE: sparse linear algebra
- CUFFT: discrete Fourier transforms
- CURAND: random number generation
- NPP: signal and image processing
- Thrust: scan, sort, reduce, transform
- math.h: floating point
- system calls: printf, malloc, assert
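As a sketch of how little host code these building blocks require, the following minimal Thrust example (an illustration added here, not from the slides; assumes a CUDA-capable GPU and compilation with nvcc) sorts and reduces a vector entirely on the device:

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>
#include <cstdlib>

int main(void)
{
    // Generate random data on the host.
    thrust::host_vector<int> h_vec(1 << 20);
    for (size_t i = 0; i < h_vec.size(); ++i)
        h_vec[i] = rand() % 1000;

    // Copy to the GPU; the transfer is implicit in the constructor.
    thrust::device_vector<int> d_vec = h_vec;

    // Sort and reduce run entirely on the device.
    thrust::sort(d_vec.begin(), d_vec.end());
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);

    printf("sum = %d\n", sum);
    return 0;
}
```

The STL-like iterator interface is what makes Thrust the "STL of CUDA" referenced later in the talk: the same call sites work on host or device vectors.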
NVIDIA CUDA Libraries
[Figure: speedup-over-CPU charts — CUSPARSE: 4x-10x speedups over CPU on single precision (axis 0x-8x); CUFFT: axis 0x-15x across transform sizes log2(size) = 1 to 25; Thrust: axis 0x-15x for reduce, transform, scan, sort; CUBLAS: axis 0x-6x for SGEMM, SSYMM, SSYRK, STRMM, STRSM]
• CUDA 4.1 on Tesla M2090, ECC on • MKL 10.2.3, TYAN FT72-B7015 Xeon x5680 Six-Core @ 3.33 GHz
NOTE: ROUTINES ASSUME INPUT AND OUTPUT DATA ARE IN GPU MEMORY
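The note above is worth seeing in code: the library routines do no staging for you, so the caller copies operands to the device before the call and copies results back afterward. A minimal sketch (added for illustration; uses the handle-based cublas_v2 API, error checking omitted):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main(void)
{
    const int n = 256;                        // square matrices, n x n
    const size_t bytes = n * n * sizeof(float);

    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes),
          *hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    // Host -> device copies before the call ...
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; all operands live in GPU memory.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    // ... and a device -> host copy to read the result back.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);             // 1*2 summed over k=256 gives 512

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Keeping data resident on the GPU across many such calls is what the quoted speedups assume; round-tripping over PCIe for every call would erase them.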
Challenge #1: MATCH AN ESTABLISHED CPU ECOSYSTEM
Timeline, 2007-2011 — library features by toolkit release:
- CUDA Toolkit 1.x: single precision; cuBLAS, cuFFT, math.h
- CUDA Toolkit 2.x: double precision support in all libraries
- CUDA Toolkit 3.x: cuSPARSE, cuRAND, printf(), malloc()
- CUDA Toolkit 4.x: Thrust, NPP, assert()
CUDA ecosystem today spans many domains
" Languages (FORTRAN, python) " Math, Numerics, Statistics " Dense & Sparse Linear Algebra " Algorithms (sort, etc.) " Image Processing " Computer Vision " Signal Processing " Finance
http://developer.nvidia.com/gpu-accelerated-libraries
!"
#!"
$!"
%!"
&!"
'!!"
'#!"
'$!"
'%!"
'&!"
!" '%" (#" $&" %$" &!" )%" ''#" '#&"
*+,-
./"
/012"3454546"
/07892".:2;0<0=7">99"/012<"#5#5#"?="'#&5'#&5'#&"
@A++B"$C'"
@A++B"$C!"
DE,"
Challenge #2: PREDICTABLE ACCELERATION
Consistently faster than MKL
>3x faster than 4.0 on average
• cuFFT 4.1 and cuFFT 4.0 on Tesla M2090, ECC on • MKL 10.2.3, TYAN FT72-B7015 Xeon x5680 Six-Core @ 3.33 GHz Performance may vary based on OS version and motherboard configuration
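The cuFFT numbers above come from the plan-then-execute usage pattern, where a plan is created once and reused across many transforms. A minimal sketch of a single-precision 1D complex-to-complex transform (added for illustration; assumes a CUDA-capable GPU, links with -lcufft):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main(void)
{
    const int n = 1024;
    cufftComplex *d_data;
    cudaMalloc(&d_data, n * sizeof(cufftComplex));

    // Build a constant input signal on the host and copy it over;
    // as noted above, cuFFT expects its data in GPU memory.
    cufftComplex *h_data = (cufftComplex*)malloc(n * sizeof(cufftComplex));
    for (int i = 0; i < n; ++i) { h_data[i].x = 1.0f; h_data[i].y = 0.0f; }
    cudaMemcpy(d_data, h_data, n * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    // Plans are created once and amortized over many executions.
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1 /* batch */);

    // In-place, unnormalized forward transform on the device.
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cudaMemcpy(h_data, d_data, n * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);
    printf("DC bin = (%f, %f)\n", h_data[0].x, h_data[0].y);

    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

The API deliberately mirrors FFTW's plan/execute model, which is why the closing summary maps FFTW to CUFFT.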
Challenge #2: PREDICTABLE ACCELERATION
[Figure: CUBLAS-Zgemm vs. MKL-Zgemm — GFLOPS (0-450) vs. matrix size NxN from 0 to 2048]
• cuBLAS 4.1 on Tesla M2090, ECC on • MKL 10.2.3, TYAN FT72-B7015 Xeon x5680 Six-Core @ 3.33 GHz Performance may vary based on OS version and motherboard configuration
Challenge #2: PREDICTABLE ACCELERATION
[Figure: batches of small matrices — GFLOPS (0-200) vs. matrix dimension (NxN) from 0 to 64; series: cuBLAS 100 matrices, cuBLAS 10,000 matrices, MKL 10,000 matrices]
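The small-matrix throughput shown here comes from batching: submitting many independent GEMMs in one library call so the GPU stays saturated. A sketch of the pattern using cublasSgemmBatched, introduced in CUDA 4.1 (added for illustration; cleanup and error checks trimmed for brevity):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdlib>

int main(void)
{
    const int n = 16, batch = 1000;           // many tiny n x n problems
    const size_t bytes = n * n * sizeof(float);

    // Allocate one device buffer per matrix, collecting the pointers.
    float **hA = (float**)malloc(batch * sizeof(float*));
    float **hB = (float**)malloc(batch * sizeof(float*));
    float **hC = (float**)malloc(batch * sizeof(float*));
    for (int i = 0; i < batch; ++i) {
        cudaMalloc(&hA[i], bytes);
        cudaMalloc(&hB[i], bytes);
        cudaMalloc(&hC[i], bytes);
    }

    // The pointer arrays themselves must also live in GPU memory.
    float **dA, **dB, **dC;
    cudaMalloc(&dA, batch * sizeof(float*));
    cudaMalloc(&dB, batch * sizeof(float*));
    cudaMalloc(&dC, batch * sizeof(float*));
    cudaMemcpy(dA, hA, batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC, batch * sizeof(float*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // One launch computes all `batch` products C_i = A_i * B_i.
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, (const float**)dA, n,
                       (const float**)dB, n, &beta, dC, n, batch);

    cublasDestroy(handle);
    return 0;
}
```

Launching 10,000 16x16 GEMMs one at a time would leave the GPU mostly idle; the single batched call is what closes the gap with (and passes) MKL in the chart above.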
" CUDA Libraries accelerate basic building block algorithms " BLAS ! CUBLAS, FFTW ! CUFFT, STL ! Thrust " Enables a wide range of 3rd party libraries and middleware
" CUDA Libraries coverage and performance constantly improving " Automatic performance improvements with new CUDA releases and new
GPU architectures