GPU Computing Workshop CSU 2013 Background and motivationgbdurham/gpu_workshop/gpu_slides01.pdf ·...
-
Upload
trankhuong -
Category
Documents
-
view
213 -
download
0
Transcript of GPU Computing Workshop CSU 2013 Background and motivationgbdurham/gpu_workshop/gpu_slides01.pdf ·...
2
Outline
(1) Background and motivation
(2) Configuring and using an Amazon EC2 GPU server
(3) Getting started with CUDA
(4) Advanced topics
(5) Higher level APIs and applications
—
3
Background and motivation
What is GPU computing and why should we be interested in it?
• GPU can be thought of as a “math coprocessor”, optimized for performance on specialized
computations
– motivated by the needs of graphics applications
– but, also useful for general computations...
• CPU’s are optimized for serial computations
– hardware implications (lots of silicone devoted to cache and control logic ...)
– inherent limits to this approach (diminishing returns...)
– therefore moving toward mainstream use of multicore CPUs (phones,...)
• GPUs are optimized for total throughput on highly parallel calculations
– a typical GPU may have 100s, or even 1000s of cores (much higher density of com-
putational units)
—
5
Figure — Floating-Point Operations per Second for the CPU and GPU
Source: Nvidia C Programming Guide
—
7
Discussion
• If you want performance now or in the future, you have to go parallel
– CPUs (several cores)
– Clusters (network latency)
– GPU’s (thousands of cores)
• How much faster?
– From an academic perspective... (10 years ahead of the curve...)
• Research implications
– new applications
– new algorithms
• Alternative hardware
– FPGA (field programmable gate arrays)
– Playstation (cell)
– Intel
– ???
• Supercomputing
—
8
Discussion — continued
• Why not just develop hardware specialized for the needs of scientific computing?
• Amdahl’s Law
—
9
Hardware
Understanding the hardware is critical to making efficient use of it.
• For future reference, we refer to
– host: CPU and system memory
– device: GPU and GPU memory
• A GPU is comprised of
– global memory (accessible from host and all GPU cores)
– SMs (stream multiprocessors)
• Each SM has
– registers
– thread specific memory
– “shared” memory (shared between threads)
– processing units
• SIMT architecture
– Each SM operates on groups of threads in SIMT fashion.
– Different SM’s can work independently of each other.
—
10
Hardware — continued
• A function run on the device is referred to as a kernel and operates on blocks of threads.
• Each thread block executes on a single SM and can access a common block of shared
memory (threads also have private memory).
• Communication and synchronization
– Threads in the same block can communicate using shared memory.
– Threads in different blocks can only communicate via global memory.
– In either case, synchronization barriers must be used (e.g., to ensure that all threads
have written data before reads are attempted by different threads).
– Synchronization is always costly. But especially when it involves synchronizing across
SMs.
• Memory on the SM itself is nonpersistent (memory associated with a thread block is no
longer accessible after a kernel is finished executing).
• Memory hierarchy:
– Reads/writes to memory on the SM (including shared memory) are fast.
– Reads/writes to global memory are costly.
– Transfers between host and device are yet more costly.
—
11
Hardware — continued
• An SM can have several blocks resident at one time, subject to the following constraints:
– 1024 threads maximum
– resource constraints
∗ each thread has resource requirements( registers, memory, etc)
∗ resources are very limited
– a single block is never split across SMs
• The collection of blocks is referred to as the “grid”. The number of blocks (gridsize) is
user determined.
– There can be more blocks than it is possible to have resident on the available SMs
at one time.
—
12
Hardware — continued
• In practice, the SM operates on groups of threads (“warps”*) in parallel
– warp size is hardware determined (32 threads on recent hardware)
– A block can consist of several warps (blocksize is user determined).
– Processing switches from one warp to the other whenever the execution is waiting
on data.
• Having lots of threads resident on the SM is good (latency hiding)
– referred to as occupancy.
– implications for block size and software design (how much work is done by a single
thread and how work is organized)
• From the user perspective, this is all transparent
– the user selects the grid size and block size
– the compiler takes care of all scheduling.
* In recognition of weaving, the first massively parallel threaded technology (Nvidia User’s Guide).
—
15
Links
• https://developer.nvidia.com/category/zone/cuda-zone
• https://developer.nvidia.com/cuda-downloads
—
16
Nvidia GPUs
For details on all available GPUs see https://developer.nvidia.com/cuda-gpus.
• Important specs are
– Number of cores
– Amount of memory
– ”Compute capability” (see Users Guide for details)
—
17
Nvidia GPUs — continued
Tesla (scientific computing)
*** Tesla K20 (about \$3000)
Tesla K10 (optimized for single precision)
Tesla C2050 C2070 C2075 (previous generation)
Quadro
==========
GTX (consumer cards)
=============
**Titan (about \$1000)
*780 *770 *760 (double precision crippled; \$650 - 400 - 250)
690 680 670 GTX 690 is basically 2 580’s on one card
590 580 570 GTX 590 is basically 2 580’s on one card
The 500 series is better than 600 and 700 series for double precision.
Double precision speed as fraction of single precision:
Titan: 1/3
600 and 700 series: 1/24
500 series: 1/8
Mobile
==============
anandtech.com is a good source of information and benchmarks
—
18
Remarks
• The Amazon EC2 GPU servers have 2 Tesla C2050 cards.
• If you are building your own GPU server, the more powerful cards draw lots of power. Get
a BIG power supply!
—
19
Documentation
Installed at /usr/local/cuda/doc/pdf. Also available online.
** CUDA_C_Best_Practices_Guide.pdf
* CUDA_Compiler_Driver_NVCC.pdf
*** CUDA_C_Programming_Guide.pdf
* CUDA_CUBLAS_Users_Guide.pdf
CUDA_CUFFT_Users_Guide.pdf
CUDA_CUSPARSE_Users_Guide.pdf
CUDA_Debugger_API.pdf
CUDA_Developer_Guide_for_Optimus_Platforms.pdf
CUDA_Dynamic_Parallelism_Programming_Guide.pdf
CUDA_GDB.pdf
* CUDA_Getting_Started_Guide_For_Linux.pdf
CUDA_Getting_Started_Guide_For_Mac_OS_X.pdf
CUDA_Getting_Started_Guide_For_Microsoft_Windows.pdf
CUDA_Memcheck.pdf
CUDA_Profiler_Users_Guide.pdf
CUDA_Samples_Guide_To_New_Features.pdf
* CUDA_Samples.pdf
CUDA_Samples_Release_Notes.pdf
* CUDA_Toolkit_Reference_Manual.pdf
CUDA_Toolkit_Release_Notes.pdf
CUDA_VideoDecoder_Library.pdf
cuobjdump.pdf
CUPTI_User_Guide.pdf
* CURAND_Library.pdf
Floating_Point_on_NVIDIA_GPU_White_Paper.pdf
* Getting_Started_With_CUDA_Samples.pdf
GPUDirect_RDMA.pdf
Kepler_Compatibility_Guide.pdf
Kepler_Tuning_Guide.pdf
NPP_Library.pdf
Nsight_Eclipse_Edition_Getting_Started.pdf
Preconditioned_Iterative_Methods_White_Paper.pdf
ptx_isa_3.1.pdf
qwcode.highlight.css
* Thrust_Quick_Start_Guide.pdf
Using_Inline_PTX_Assembly_In_CUDA.pdf
—
20
Samples
Installed at /usr/local/cuda/samples. Also available online.
0_Simple
asyncAPI cdpSimplePrint cdpSimpleQuicksort clock
cppIntegration cudaOpenMP inlinePTX matrixMul matrixMulCUBLAS
matrixMulDrv matrixMulDynlinkJIT simpleAssert simpleAtomicIntrinsics
simpleCallback simpleCubemapTexture simpleIPC simpleLayeredTexture
simpleMPI simpleMultiCopy simpleMultiGPU simpleP2P
simplePitchLinearTexture simplePrintf simpleSeparateCompilation simpleStreams
simpleSurfaceWrite simpleTemplates simpleTexture simpleTextureDrv
simpleVoteIntrinsics simpleZeroCopy template template_runtime
vectorAdd vectorAddDrv
1_Utilities
bandwidthTest deviceQuery deviceQueryDrv
2_Graphics
bindlessTexture Mandelbrot marchingCubes simpleGL simpleTexture3D volumeFiltering
volumeRender
3_Imaging
bicubicTexture bilateralFilter boxFilter convolutionFFT2D
convolutionSeparable convolutionTexture dct8x8 dwtHaar1D
dxtc histogram HSOpticalFlow imageDenoising
postProcessGL recursiveGaussian SobelFilter stereoDisparity
4_Finance
binomialOptions BlackScholes MonteCarloMultiGPU quasirandomGenerator SobolQRNG
—
21
Samples — continued
5_Simulations
fluidsGL nbody oceanFFT particles smokeParticles
6_Advanced
alignedTypes cdpAdvancedQuicksort cdpLUDecomposition cdpQuadtree
concurrentKernels eigenvalues fastWalshTransform FDTD3d
FunctionPointers interval lineOfSight mergeSort
newdelete ptxjit radixSortThrust reduction
scalarProd scan segmentationTreeThrust shfl_scan
simpleHyperQ sortingNetworks threadFenceReduction threadMigration
transpose
7_CUDALibraries
batchCUBLAS boxFilterNPP common conjugateGradient
conjugateGradientPrecond freeImageInteropNPP grabcutNPP histEqualizationNPP
imageSegmentationNPP MC_EstimatePiInlineP MC_EstimatePiInlineQ MC_EstimatePiP
MC_EstimatePiQ MC_SingleAsianOptionP MersenneTwisterGP11213 randomFog
simpleCUBLAS simpleCUFFT simpleDevLibCUBLAS
—
22
Other resources
• “Cuda By Example” (an excellent book for learning about CUDA at the introductory level)
• “CUDA Application Design and Development”
• “Programming Massively Parallel Processors: A Hands On Approach”
• ”GPU Gems,” Volumes 1-3 (These are available online at Nvidia.com).
– Volume 1 is mostly outdated.
– Volume 2, see Part IV: “General-Purpose Computation on GPUS: A Primer” (chap-
ters 29-36)
– Volume 3, Part VI: ”GPU Computing”, especially chapter 37 (random number gen-
eration) and chapter 39 (prefix scan).
See workshop home page for more links.
—