
Basics of CUDA Programming

Weijun Xiao

Department of Electrical and Computer Engineering

University of Minnesota

Outline

• What’s GPU computing?
• CUDA programming model
• Basic memory management
• Basic kernels and execution
• CPU and GPU coordination
• CUDA debugging and profiling
• Conclusions

What is a GPU?

• Graphics Processing Unit

(Figure: a GPU turns a logical representation of visual information into an output signal.)


Performance Gap between GPUs and CPUs


GPU = Fast Parallel Machine

• GPU speed has been increasing at a faster pace than Moore’s Law.
• This is a consequence of the data-parallel, streaming nature of the GPU.
• The gaming market stimulates the development of GPUs.
• GPUs are cheap! Put enough together, and you can get a supercomputer.

So can we use the GPU for general-purpose computing?


Sure, thousands of applications

• Large matrix/vector operations (BLAS)
• Protein folding (molecular dynamics)
• FFT (signal processing)
• VMD (Visual Molecular Dynamics)
• Speech recognition (hidden Markov models, neural nets)
• Databases
• Sort/search
• Storage
• MRI
• …

Why are We Interested in GPUs?

• High-performance computing
• High parallelism
• Low cost
• GPUs are programmable
• GPGPU

Growth and Development of GPUs

• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Before CUDA, GPUs were programmed through graphics APIs
  – A GPU in every PC and workstation: massive volume and potential impact

(Chart: GFLOPS over time for successive NVIDIA GPUs, with the following legend.)

G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800

GeForce 8800

(Block diagram: the host feeds an input assembler and thread execution manager, which dispatch threads to the multiprocessors; each multiprocessor has a parallel data cache and texture units, with load/store paths to global memory.)

16 highly threaded multiprocessors, 128 cores, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU

Tesla C2050

14 multiprocessors, 448 cores, 1.03 TFLOPS single precision / 515 GFLOPS double precision, 3 GB GDDR5 DRAM with ECC, 144 GB/s memory bandwidth, PCIe 2.0 x16 (8 GB/s bandwidth to CPU)


GPU Languages

• Assembly

• Cg (NVIDIA)

- C for Graphics

• GLSL (OpenGL)

- OpenGL Shading Language

• HLSL (Microsoft)

- High-level Shading language

• Brook C/C++ (AMD)

• CUDA (NVIDIA)

• OpenCL

How Did GPGPU Work before CUDA?

• Follow the graphics pipeline
• Pretend to be graphics
• Take advantage of the massive parallelism of the GPU
• Disguise data as textures or geometry
• Disguise the algorithm as render passes
• Fool the graphics pipeline into doing computation

CUDA Programming Model

• Compute Unified Device Architecture
• Simple and general-purpose programming model
• Standalone driver to load computation programs onto the GPU
• Graphics-free API
• Data sharing with OpenGL buffer objects
• Easy to use, with a low learning curve

CUDA: C with no shader limitations!

• Integrated host+device C application program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

Serial code (host)
        . . .
Parallel kernel (device)
        KernelA<<< nBlk, nTid >>>(args);
Serial code (host)
        . . .
Parallel kernel (device)
        KernelB<<< nBlk, nTid >>>(args);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

CUDA Devices and Threads

• A compute device
  – Is a coprocessor to the CPU (host)
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
  – Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels, which run on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – A GPU needs 1000s of threads for full efficiency
    • A multi-core CPU needs only a few

Extended C

• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image)
{
    __shared__ float region[M];
    ...

    region[threadIdx.x] = image[i];

    __syncthreads();
    ...

    image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc( (void**)&myimage, bytes );

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

Compiling a CUDA Program

• NVCC splits a C/C++ CUDA application into CPU code and PTX code
  – The virtual PTX code is then compiled by a PTX-to-target compiler into physical target code for a particular GPU (e.g., G80)
• Parallel Thread eXecution (PTX)
  – Virtual machine and ISA
  – Programming model
  – Execution resources and state

C source:
    float4 me = gx[gtid];
    me.x += me.y * me.z;

Generated PTX:
    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32 $f1, $f5, $f3, $f1;

Arrays of Parallel Threads

• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions

(Figure: threads 0-7, each running:)
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

Thread Blocks: Scalable Cooperation

• Divide the monolithic thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
  – Threads in different blocks cannot cooperate

(Figure: Thread Block 0 through Thread Block N - 1, each with threads 0-7 running:)
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
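To make the within-block cooperation concrete, here is a minimal sketch (not from the original slides; BLOCK_SIZE and reverse_tile are illustrative names): each block stages its tile in shared memory, synchronizes, and only then reads a neighbor’s element.

    #define BLOCK_SIZE 256

    // Illustrative only: each block reverses its own BLOCK_SIZE-element tile.
    __global__ void reverse_tile( int *d_data )
    {
        __shared__ int tile[BLOCK_SIZE];              // visible to this block only
        int gid = blockIdx.x*blockDim.x + threadIdx.x;

        tile[threadIdx.x] = d_data[gid];              // stage into shared memory
        __syncthreads();                              // barrier: whole block waits here

        // reading another thread's element is safe only after the barrier
        d_data[gid] = tile[blockDim.x - 1 - threadIdx.x];
    }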

Block IDs and Thread IDs

• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …

(Figure 3.2, an example of CUDA thread organization; courtesy NVIDIA: the host launches Kernel 1 on Grid 1, a 2x2 array of blocks, and Kernel 2 on Grid 2; Block (1, 1) contains a 4x2x2 array of threads, Thread (0,0,0) through Thread (3,1,1).)

CUDA Memory Model

• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long access latency

(Figure: the host reads and writes global memory on the device; within the grid, each block has its own shared memory, and each thread has its own registers.)


Basic Memory Management

Memory Spaces

• CPU and GPU have separate memory spaces
  – Data is moved across the PCIe bus
  – Use functions to allocate/set/copy memory on the GPU
    • Very similar to the corresponding C functions
• Pointers are just addresses
  – You can’t tell from the pointer value whether the address is on the CPU or the GPU
  – Must exercise care when dereferencing (see the sketch below):
    • Dereferencing a CPU pointer on the GPU will likely crash, and vice versa
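A minimal sketch of the two address spaces (illustrative; h_buf and d_buf are made-up names):

    int h_buf[16];          // host (CPU) memory
    int *d_buf = 0;         // will hold a device (GPU) address
    cudaMalloc( (void**)&d_buf, sizeof(h_buf) );

    // h_buf and d_buf are both plain int* values; nothing in the type
    // system says which space each belongs to. *d_buf on the host would
    // likely crash; pass d_buf to a kernel instead, and move data with:
    cudaMemcpy( d_buf, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice );
    cudaFree( d_buf );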


GPU Memory Allocation / Release

• Host (CPU) manages device (GPU) memory:– cudaMalloc (void ** pointer, size_t nbytes)– cudaMemset (void * pointer, int value, size_t count)– cudaFree (void* pointer)

int n = 1024;

int nbytes = 1024*sizeof(int);

int * d_a = 0;

cudaMalloc( (void**)&d_a, nbytes );

cudaMemset( d_a, 0, nbytes);

cudaFree(d_a);

Data Copies

• cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
  – Returns after the copy is complete
  – Blocks the CPU thread until all bytes have been copied
  – Doesn’t start copying until previous CUDA calls complete
• enum cudaMemcpyKind
  – cudaMemcpyHostToDevice
  – cudaMemcpyDeviceToHost
  – cudaMemcpyDeviceToDevice
• Non-blocking memcopies are provided as well (see the sketch below)
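A minimal sketch of one such non-blocking copy, assuming pinned host memory and a stream (all names here are illustrative):

    float *h_pinned, *d_buf;
    cudaStream_t stream;
    size_t nbytes = 1024 * sizeof(float);

    cudaMallocHost( (void**)&h_pinned, nbytes );  // pinned host memory, required for true async
    cudaMalloc( (void**)&d_buf, nbytes );
    cudaStreamCreate( &stream );

    cudaMemcpyAsync( d_buf, h_pinned, nbytes, cudaMemcpyHostToDevice, stream );
    // ... CPU work here can overlap with the copy ...
    cudaStreamSynchronize( stream );              // wait for the copy to finish

    cudaStreamDestroy( stream );
    cudaFree( d_buf );
    cudaFreeHost( h_pinned );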

Code Walkthrough 1

• Allocate CPU memory for n integers
• Allocate GPU memory for n integers
• Initialize GPU memory to 0s
• Copy from GPU to CPU
• Print the values

Code Walkthrough 1 (complete program)

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}


Basic Kernels and Execution on GPU

CUDA Function Declarations

                                        Executed on:   Only callable from:
  __device__ float DeviceFunc()         device         device
  __global__ void  KernelFunc()         device         host
  __host__   float HostFunc()           host           host

• __global__ defines a kernel function
  – Must return void
• __device__ and __host__ can be used together (a small sketch follows)
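A small sketch of combining the qualifiers (illustrative; square and apply_square are made-up names): the same function compiles for both host and device.

    // Compiled twice: callable from host code and from device code.
    __host__ __device__ float square( float x )
    {
        return x * x;
    }

    __global__ void apply_square( float *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = square( a[idx] );    // device-side call
    }

    // Host-side call of the very same function:
    // float y = square(3.0f);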

CUDA Function Declarations (cont.)

• __device__ functions cannot have their address taken
• For functions executed on the device:
  – Can only access GPU memory
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments

Code Walkthrough 2

• Build on Walkthrough 1
• Write a kernel to initialize integers
• Copy the result back to the CPU
• Print the values

Kernel Code (executed on GPU)

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

Launching Kernels on the GPU

• Launch parameters:
  – Grid dimensions (up to 2D), dim3 type
  – Thread-block dimensions (up to 3D), dim3 type
  – Shared memory: number of bytes per block
    • For extern smem variables declared without size
    • Optional, 0 by default
  – Stream ID
    • Optional, 0 by default

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(...);

kernel<<<32, 512>>>(...);
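When the element count is not a multiple of the block size, a common pattern (a sketch, not from the slides; kernel_n and n are illustrative names) is to round the grid size up and guard the out-of-range threads:

    __global__ void kernel_n( int *a, int n )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        if (idx < n)                  // guard: the last block may have extra threads
            a[idx] = 7;
    }

    // launch: ceiling division gives enough blocks to cover all n elements
    int n = 1000;
    dim3 block(256);
    dim3 grid( (n + block.x - 1) / block.x );   // 4 blocks for n = 1000
    kernel_n<<<grid, block>>>( d_a, n );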

Code Walkthrough 2 (complete program)

#include <stdio.h>

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    grid.x = dimx / block.x;

    kernel<<<grid, block>>>( d_a );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

Code Walkthrough 3

• Build on Walkthrough 2
• Write a kernel to increment n×m integers
• Copy the result back to the CPU
• Print the values

Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

Code Walkthrough 3 (complete program)

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Blocks must be independent

• Any possible interleaving of blocks should be valid
  – Presumed to run to completion without pre-emption
  – Can run in any order
  – Can run concurrently OR sequentially
• Blocks may coordinate but not synchronize
  – Shared queue pointer: OK (see the sketch below)
  – Shared lock: BAD … can easily deadlock
• The independence requirement gives scalability
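A sketch of the “shared queue pointer” pattern (illustrative; not from the slides): blocks coordinate through an atomic counter but never wait on one another.

    __device__ int next_item = 0;        // global work-queue pointer

    __global__ void process_queue( float *items, int n_items )
    {
        __shared__ int my_item;
        while (1) {
            if (threadIdx.x == 0)
                my_item = atomicAdd( &next_item, 1 );   // grab the next work item
            __syncthreads();
            if (my_item >= n_items) break;              // queue drained: no waiting on other blocks
            // ... process items[my_item] with the whole block ...
            __syncthreads();                            // done with my_item before thread 0 overwrites it
        }
    }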

Blocks must be independent (cont.)

• Thread blocks can run in any order
  – Concurrently or sequentially
  – This facilitates scaling the same code across many devices

(Figure: the same grid of blocks maps onto devices with different numbers of multiprocessors, illustrating scalability.)


Coordinating CPU and GPU Execution

Synchronizing GPU and CPU

• All kernel launches are asynchronous
  – Control returns to the CPU immediately
  – The kernel starts executing once all previous CUDA calls have completed
• Memcopies are synchronous
  – Control returns to the CPU once the copy is complete
  – The copy starts once all previous CUDA calls have completed
• cudaThreadSynchronize()
  – Blocks until all previous CUDA calls complete
• Asynchronous CUDA calls provide:
  – Non-blocking memcopies
  – The ability to overlap memcopies and kernel execution
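A minimal sketch of why this matters when timing a kernel from the host (grid, block, and d_a as in the walkthroughs): without the synchronize, the host timer stops before the kernel has finished.

    #include <time.h>

    clock_t t0 = clock();
    kernel<<<grid, block>>>( d_a );    // returns immediately: launch is asynchronous
    cudaThreadSynchronize();           // block the CPU until the kernel completes
    clock_t t1 = clock();
    printf("kernel time: %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);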

CUDA Error Reporting to the CPU

• All CUDA calls return an error code:
  – Except kernel launches
  – cudaError_t type
• cudaError_t cudaGetLastError(void)
  – Returns the code for the last error ("no error" has a code)
• const char* cudaGetErrorString(cudaError_t code)
  – Returns a null-terminated character string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
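A common convenience wrapper, shown as a sketch (CUDA_CHECK is not part of the CUDA API): it checks every call and reports where it failed.

    #include <stdio.h>
    #include <stdlib.h>

    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                        __FILE__, __LINE__, cudaGetErrorString(err));   \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    // Usage:
    // CUDA_CHECK( cudaMalloc((void**)&d_a, nbytes) );
    // kernel<<<grid, block>>>(d_a);
    // CUDA_CHECK( cudaGetLastError() );   // catches kernel launch errors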

Device Management

• The CPU can query and select GPU devices
  – cudaGetDeviceCount( int* count )
  – cudaSetDevice( int device )
  – cudaGetDevice( int* current_device )
  – cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
  – cudaChooseDevice( int* device, cudaDeviceProp* prop )
• Multi-GPU setup:
  – Device 0 is used by default
  – One CPU thread can control one GPU
    • Multiple CPU threads can control the same GPU; calls are serialized by the driver
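A short sketch of enumerating devices (the cudaDeviceProp fields used here are name, major, minor, and multiProcessorCount):

    int count = 0;
    cudaGetDeviceCount( &count );

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties( &prop, dev );
        printf("device %d: %s, compute capability %d.%d, %d multiprocessors\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }

    cudaSetDevice(0);   // select device 0 for subsequent CUDA calls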


CUDA Debugging and Profiling

What’s cuda-gdb?

• All-in-one debugging tool
• Debugs both host and CUDA code
• Extension of Linux gdb
• 32/64-bit Linux
• 4.0 release

Debug Compilation

• Compile with -g -G:
  nvcc -g -G foo.cu -o foo
• For Fermi, also pass:
  -gencode arch=compute_20,code=sm_20
• Add these flags to your Makefile
• CUDA-GDB error: undefined reference to '$gpu_registers' (2.2 beta or earlier)
  – Workaround: rebuild ptxvars.cu with debug info:
    nvcc "/usr/local/cuda/bin/ptxvars.cu" -g -G --host-compilation=c -c -define-always-macro _DEVICE_LAUNCH_PARAMETERS_H__ -Xptxas -fext

Extension to GDB

• Debug both host and GPU code seamlessly
• GPU memory is treated as an extension of host memory
• GPU threads/blocks are treated as extensions of host threads
• Breakpoints at any host and/or device function symbol or source-file line number
• Single-step individual warps

Debug commands

• thread <<<(BX,BY),(TX,TY,TZ)>>>
  – thread <<<170>>>
  – thread <<<2,(10,10)>>>
• cuda block (n,m) thread (x,y,z)
• info cuda state (being replaced by info cuda devices, kernels, system, warp, sm, …)


Debugging Commands (cont.)

• break

• print

• continue

• next

• step

• quit

• set args

• GDB quick reference

http://users.ece.utexas.edu/~adnan/gdb-refcard.pdf


Example Code

• 8-bit bit reverse
  – 00011101 -> 10111000
  – 10010111 -> 11101001

Algorithms

Loop version:

    r = 0;
    for (int i = 0; i < 8; i++) {
        r = r << 1;
        if (x % 2) r += 1;
        x = x >> 1;
    }

Bit-trick version (swap nibbles, then bit pairs, then single bits):

    x = (((0xf0 & x) >> 4) | ((0x0f & x) << 4));
    x = (((0xcc & x) >> 2) | ((0x33 & x) << 2));
    x = (((0xaa & x) >> 1) | ((0x55 & x) << 1));

Code

 1  #include <stdio.h>
 2  #include <stdlib.h>
 3
 4  // Simple 8-bit bit reversal compute test
 5
 6  #define N 256
 7
 8  __global__ void bitreverse(unsigned int *data)
 9  {
10      unsigned int *idata = data;
11
12      unsigned int x = idata[threadIdx.x];
13
14      x = ((0xf0f0f0f0 & x) >> 4) | ((0x0f0f0f0f & x) << 4);
15      x = ((0xcccccccc & x) >> 2) | ((0x33333333 & x) << 2);
16      x = ((0xaaaaaaaa & x) >> 1) | ((0x55555555 & x) << 1);
17
18      idata[threadIdx.x] = x;
19  }
20
21  int main(void)
22  {
23      unsigned int *d = NULL; int i;
24      unsigned int idata[N], odata[N];
25
26      for (i = 0; i < N; i++)
27          idata[i] = (unsigned int)i;
28
29      cudaMalloc((void**)&d, sizeof(int)*N);
30      cudaMemcpy(d, idata, sizeof(int)*N,
31                 cudaMemcpyHostToDevice);
32
33      bitreverse<<<1, N>>>(d);
34
35      cudaMemcpy(odata, d, sizeof(int)*N,
36                 cudaMemcpyDeviceToHost);
37
38      for (i = 0; i < N; i++)
39          printf("%u -> %u\n", idata[i], odata[i]);
40
41      cudaFree((void*)d);
42      return 0;
43  }

cuda-gdb Supported Platforms

• Host platform
  – X11 cannot be running on the GPU used for debugging
    • One GPU: disable X11
    • Two or more GPUs: debug on a GPU that is not driving X11
• GPU requirements
  – All CUDA-enabled GPUs except the 8800 GTS, 8800 GTX, 8800 Ultra, FX 4600, and FX 5600

Debugging the Example Code

• Step 1: compile with debug info
  nvcc -g -G bitreverse.cu -o bitreverse
• Step 2: start the debugger
  cuda-gdb ./bitreverse
• Step 3: set breakpoints
  (cuda-gdb) break main
  (cuda-gdb) break bitreverse
  (cuda-gdb) break 18
• Step 4: run the CUDA application
  (cuda-gdb) run
• Step 5: continue and watch variables
  (cuda-gdb) continue
  (cuda-gdb) thread
  (cuda-gdb) print x

Profiling Tools

• CUDA memcheck
• Occupancy Calculator
• Visual Profiler


CUDA Visual Profiler


CUDA Counters


Profiler Counters for Fermi

• branch, divergent branch

• instruction issued, instruction executed

• sm cta launched

• gld request, gst request

• local load, local store

• shared load, shared store

• warps launched, threads launched

• l1 global load hit, l1 global load miss

• l1 local load hit, l1 local load miss

• l1 local store hit, l1 local store miss

• l1 shared bank conflicts

• uncached global load transaction

• global store transaction

• l2 read requests, l2 write requests

• l2 read misses, l2 write misses

• dram reads, dram writes

• tex cache requests, tex cache misses

Memory Throughput

• Compute capability < 2.0:
  – Global read throughput  = (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime
  – Global write throughput = (((gst_32*32) + (gst_64*64) + (gst_128*128)) * TPC) / gputime
• Compute capability >= 2.0:
  – Global read throughput  = (dram reads * 32) / gputime
  – Global write throughput = (dram writes * 32) / gputime
• Gmem overall throughput = read throughput + write throughput
• Tesla C2050: theoretical bandwidth 144 GB/s
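A worked example with hypothetical counter values (not measured data), taking each DRAM access as 32 bytes and gputime in microseconds:

    dram reads  = 4,000,000, dram writes = 2,000,000, gputime = 2,000 µs

    read  = (4,000,000 * 32 B) / 2,000 µs = 128 MB / 0.002 s = 64 GB/s
    write = (2,000,000 * 32 B) / 2,000 µs =  64 MB / 0.002 s = 32 GB/s
    overall = 64 + 32 = 96 GB/s  (about 2/3 of the C2050's 144 GB/s peak)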

Conclusions

• The GPU as an accelerator for HPC
• The CUDA programming model
• CUDA threads and kernels
• CUDA example code
• CUDA debugging and profiling