CUDA Tricks

Transcript of "CUDA Tricks" (CIS 565, Penn Engineering, 2010-11-19)

Page 1:

CUDA Tricks

Presented by Damodaran Ramani

Page 2:

Synopsis

Scan Algorithm

Applications

Specialized Libraries

CUDPP: CUDA Data Parallel Primitives Library

Thrust: a Template Library for CUDA Applications

CUDA FFT and BLAS libraries for the GPU

Page 3:

References

"Scan Primitives for GPU Computing." Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens.

Presentation on scan primitives by Gary J. Katz, based on the article "Parallel Prefix Sum (Scan) with CUDA" by Harris, Sengupta, and Owens (GPU Gems 3, Chapter 39).

Page 4:

Introduction

GPUs are massively parallel processors.

The programmable parts of the graphics pipeline operate on primitives (vertices, fragments).

These primitive programs spawn a thread for each primitive to keep the parallel processors full.

Page 5:

Stream programming model (particle systems, image processing, grid-based fluid simulations, and dense matrix algebra)

A fragment program operating on n fragments performs O(n) memory accesses.

Problems arise when the access requirements are complex (e.g., prefix sum: O(n²)).

Page 6:

Prefix-Sum Example

in:  3 1 7 0  4  1  6  3
out: 0 3 4 11 11 14 16 22

Page 7:

Trivial Sequential Implementation

void scan(int* in, int* out, int n)
{
    out[0] = 0;                       // exclusive scan: identity comes first
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}

Page 8:

Scan: An Efficient Parallel Primitive

Interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs.

Why CUDA? (general load-store memory architecture, on-chip shared memory, thread synchronization)

Page 9:

Threads & Blocks GeForce 8800 GTX ( 16 multiprocessors, 8 processors each) CUDA structures GPU programs into parallel thread blocks of up

to 512 SIMD-parallel threadsto 512 SIMD parallel threads. Programmers specify the number of thread blocks and threads

per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPUparallel multiprocessors on the GPU.

Within a thread block, threads can communicatethrough shared memory and cooperate through sync.B l th d ithi th bl k t i Because only threads within the same block can cooperate via shared memory and thread synchronization,programmers must partition computation into multiple blocks.(complex programming large performance benefits)programming, large performance benefits)
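To make the block/thread structure concrete, here is a minimal sketch (not from the original slides) of a per-block sum that uses shared memory and __syncthreads(); the kernel name and the fixed 512-thread block size are illustrative assumptions.

__global__ void blockSum(const int* in, int* blockSums, int n)
{
    __shared__ int tile[512];                      // one tile per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();                               // tile fully loaded
    for (int s = blockDim.x / 2; s > 0; s >>= 1) { // tree reduction in shared memory
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];           // one partial sum per block
}
// Launch: blockSum<<<numBlocks, 512>>>(d_in, d_blockSums, n);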

Page 10:

The Scan Operator

Definition: The scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements

[a0, a1, ..., a(n-1)]

and returns the array

[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ a(n-2))]

Types: inclusive, exclusive, forward, backward. (For input [3, 1, 7, 0] under +, the exclusive scan is [0, 3, 4, 11] and the inclusive scan is [3, 4, 11, 11].)

Page 11:

Parallel Scan

for (d = 1; d <= log₂n; d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] = x[in][k]

(The input and output buffers are swapped after each pass.)

Complexity: O(n log₂n)
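A minimal single-block CUDA sketch of this naive scan, after GPU Gems 3, Chapter 39 (cited in the references); the double-buffered shared-memory layout and the restriction that n is a power of two with n <= blockDim.x are assumptions of this example.

__global__ void naiveScan(const int* in, int* out, int n)
{
    extern __shared__ int temp[];               // double buffer: 2*n ints
    int k = threadIdx.x;
    int pin = 0, pout = 1;
    temp[pout*n + k] = (k > 0) ? in[k-1] : 0;   // shift right by one: exclusive scan
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout; pin = 1 - pin;         // swap read/write buffers
        if (k >= offset)
            temp[pout*n + k] = temp[pin*n + k - offset] + temp[pin*n + k];
        else
            temp[pout*n + k] = temp[pin*n + k];
        __syncthreads();
    }
    out[k] = temp[pout*n + k];
}
// Launch: naiveScan<<<1, n, 2*n*sizeof(int)>>>(d_in, d_out, n);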

Page 12:

A Work-Efficient Parallel Scan

The goal is a parallel scan that is O(n) instead of O(n log₂n).

Solution: balanced trees. Build a binary tree on the input data and sweep it to and from the root.

A binary tree with n leaves has log₂n levels, and each level d has 2^d nodes.

One add is performed per node; a tree with n leaves has n-1 internal nodes, so a single traversal of the tree performs O(n) adds.

Page 13:

O(n) Unsegmented Scan

Reduce/Up-Sweep

for (d = 0; d < log₂n; d++)
    for all k = 0; k < n; k += 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down-Sweep

x[n-1] = 0;
for (d = log₂n - 1; d >= 0; d--)
    for all k = 0; k < n; k += 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
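As a concrete illustration, here is a single-block CUDA sketch of the up-sweep/down-sweep scan, again following GPU Gems 3, Chapter 39; it assumes n is a power of two and the block has n/2 threads.

__global__ void workEfficientScan(const int* in, int* out, int n)
{
    extern __shared__ int temp[];            // n ints
    int tid = threadIdx.x;
    temp[2*tid]   = in[2*tid];               // each thread loads two elements
    temp[2*tid+1] = in[2*tid+1];
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {   // up-sweep (reduce)
        __syncthreads();
        if (tid < d) {
            int ai = offset*(2*tid+1) - 1;
            int bi = offset*(2*tid+2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    if (tid == 0) temp[n-1] = 0;             // clear the root
    for (int d = 1; d < n; d *= 2) {         // down-sweep
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset*(2*tid+1) - 1;
            int bi = offset*(2*tid+2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];             // pass parent value to left child
            temp[bi] += t;                   // right child: parent + left subtree
        }
    }
    __syncthreads();
    out[2*tid]   = temp[2*tid];
    out[2*tid+1] = temp[2*tid+1];
}
// Launch: workEfficientScan<<<1, n/2, n*sizeof(int)>>>(d_in, d_out, n);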

Page 14:

Tree analogy

The array evolves as follows (an exclusive sum scan of [x0 ... x7]):

After the up-sweep: [x0, ∑(x0..x1), x2, ∑(x0..x3), x4, ∑(x4..x5), x6, ∑(x0..x7)]
Clear the root:     [x0, ∑(x0..x1), x2, ∑(x0..x3), x4, ∑(x4..x5), x6, 0]
Down-sweep step 1:  [x0, ∑(x0..x1), x2, 0, x4, ∑(x4..x5), x6, ∑(x0..x3)]
Down-sweep step 2:  [x0, 0, x2, ∑(x0..x1), x4, ∑(x0..x3), x6, ∑(x0..x5)]
Down-sweep step 3:  [0, x0, ∑(x0..x1), ∑(x0..x2), ∑(x0..x3), ∑(x0..x4), ∑(x0..x5), ∑(x0..x6)]

Page 15:

O(n) Segmented Scan

Up-Sweep

Page 16:

Down-Sweep

Page 17:

Features of segmented scan

About 3 times slower than unsegmented scan.

Useful for building a broad variety of applications which are not possible with unsegmented scan.

Page 18:

Primitives built on scan

Enumerate
enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]
Exclusive scan of the input vector (a minimal reference sketch follows this list).

Distribute (copy)
distribute([a b c][d e]) = [a a a][d d]
Inclusive scan of the input vector.

Split and split-and-segment
Split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right.
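To pin down the semantics of enumerate, here is a minimal sequential reference (an illustrative sketch, not from the slides); the parallel version replaces the loop with an exclusive scan over the 0/1 flags.

// Hypothetical reference implementation of enumerate().
void enumerateRef(const bool* flags, int* out, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        out[i] = count;        // exclusive count of true flags so far
        if (flags[i]) count++;
    }
}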

Page 19:

Applications

Quicksort
Sparse Matrix-Vector Multiply
Tridiagonal Matrix Solvers and Fluid Simulation
Radix Sort
Stream Compaction
Summed-Area Tables

Page 20:

Quicksort

Page 21:

Sparse Matrix-Vector Multiplication

Page 22:

Stream Compaction

Definition: Extracts the elements of interest from an array and places them contiguously in a new array.

Uses:
Collision Detection
Sparse Matrix Compression

Example:
input:  A B A D D E C F B
output: A B A C B

Page 23:

Stream Compaction

Input: A B A D D E C F B (we want to preserve the gray elements: A, B, A, C, B)

Set a '1' in each gray input: 1 1 1 0 0 0 1 0 1

Scan:                         0 1 2 3 3 3 3 4 4

Scatter the gray inputs to the output, using the scan result as the scatter address:

Output:  A B A C B
Address: 0 1 2 3 4
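A minimal CUDA sketch of the scatter step (an illustrative example; the flag array and a precomputed exclusive scan of it, e.g. from CUDPP, are assumed as inputs):

__global__ void scatterCompact(const int* in, const int* flags,
                               const int* scanned, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[scanned[i]] = in[i];   // scan result = output address
}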

Page 24:

Radix Sort Using Scan

Input array:                            100 111 010 110 011 101 001 000

b = least significant bit:                0   1   0   0   1   1   1   0

e = insert a 1 for all false sort keys:   1   0   1   1   0   0   0   1

f = scan the 1s:                          0   1   1   2   3   3   3   3

Total Falses = e[n-1] + f[n-1] = 1 + 3 = 4

t = index - f + Total Falses:
    0-0+4  1-1+4  2-1+4  3-2+4  4-3+4  5-3+4  6-3+4  7-3+4
    = 4    = 4    = 5    = 5    = 5    = 6    = 7    = 8

d = b ? t : f:                            0   4   1   2   5   6   7   3

Scatter the input using d as the scatter address:
                                        100 010 110 000 111 011 101 001
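A minimal CUDA sketch of this split step (illustrative; e_scanned is assumed to hold the exclusive scan of the e flags, and totalFalses is computed as above):

__global__ void splitStep(const unsigned int* in, const int* e_scanned,
                          unsigned int* out, int n, int bit, int totalFalses)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int b = (in[i] >> bit) & 1;        // current sort bit
    int f = e_scanned[i];              // destination if the bit is 0
    int t = i - f + totalFalses;       // destination if the bit is 1
    out[b ? t : f] = in[i];
}
// Repeating this for each bit, least to most significant, yields radix sort.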

Page 25:

Specialized Libraries

CUDPP: CUDA Data Parallel Primitives Library

CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort, and parallel reduction.

Page 26:

CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(
    CUDPPHandle sparseMatrixHandle,
    void* d_y,
    const void* d_x)

Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.

Page 27:

CUDPPScanConfig config;
config.direction = CUDPP_SCAN_FORWARD;
config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
config.op = CUDPP_ADD;
config.datatype = CUDPP_FLOAT;
config.maxNumElements = numElements;
config.maxNumRows = 1;
config.rowPitch = 0;

cudppInitializeScan(&config);
cudppScan(d_odata, d_idata, numElements, &config);

Page 28:

CUFFT

No. of elements<8192 slower than fftw >8192 5x speedup over threaded fftw >8192, 5x speedup over threaded fftwand 10x over serial fftw.

Page 29:

CUBLAS: CUDA Basic Linear Algebra Subroutines.
SAXPY, conjugate gradient, linear solvers.
Application: 3D reconstruction of planetary nebulae.

http://graphics.tu-bs.de/publications/Fernandez08TechReport.pdf
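A minimal SAXPY sketch against the legacy CUBLAS API documented in the CUBLAS 2.0 manual linked under Useful Links (illustrative; h_x and h_y are assumed host arrays of n floats):

#include <cublas.h>

cublasInit();
float *d_x, *d_y;
cublasAlloc(n, sizeof(float), (void**)&d_x);
cublasAlloc(n, sizeof(float), (void**)&d_y);
cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);
cublasSaxpy(n, 2.0f, d_x, 1, d_y, 1);               // y = 2*x + y
cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // copy result back
cublasFree(d_x); cublasFree(d_y);
cublasShutdown();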

Pages 30-32:

The GPU variant is 100 times faster than the CPU version.

Matrix size is limited by graphics card memory and texture size.

Although taking advantage of sparse matrices would help reduce memory consumption, sparse matrix storage is not implemented by CUBLAS.

Page 33:

Useful Links

http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf
http://gpgpu.org/developer/cudpp
http://gpgpu.org/2009/05/31/thrust