002 - Introduction to CUDA Programming_1


Transcript of 002 - Introduction to CUDA Programming_1

Page 1: 002 - Introduction to CUDA Programming_1

Introduction to CUDA Programming
CUDA Programming Introduction

Andreas Moshovos, Winter 2009

Some slides/material from:
UIUC course by Wen-Mei Hwu and David Kirk
UCSB course by Andrea Di Blas
Universitat Jena by Waqar Saleem
NVIDIA by Simon Green

Page 2: 002 - Introduction to CUDA Programming_1

Programmer’s view

• GPU as a co-processor

[Figure: the CPU with its memory and the GPU with its own GPU memory (1 GB on our systems). Approximate bandwidths: CPU to memory 6.4 GB/s – 31.92 GB/s (8 B per transfer); CPU to GPU 3 GB/s – 8 GB/s; GPU to GPU memory 141 GB/s.]

Page 3: 002 - Introduction to CUDA Programming_1

Target Applications

int a[N]; // N is large

for all elements of a compute: a[i] = a[i] * fade

• Lots of independent computations
– CUDA threads need not be independent (see the sketch below)
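As a hedged sketch of how this loop maps to CUDA, one thread per element (the kernel name fadeKernel and the integer fade factor are illustrative, not from the slides):

__global__ void fadeKernel (int *a, int fade, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < N)                                       // guard against extra threads
        a[i] = a[i] * fade;
}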

Page 4: 002 - Introduction to CUDA Programming_1

Programmer’s View of the GPU

• GPU: a compute device that:
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel

• Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads

Page 5: 002 - Introduction to CUDA Programming_1

Why are threads useful?

• Concurrency:
– Do multiple things in parallel
– Put in more hardware → get higher performance

• Needs more functional units

Page 6: 002 - Introduction to CUDA Programming_1

Why are threads useful #2 – Tolerating stalls

• Often a thread stalls, e.g., on a memory access
– Multiplex the same functional unit across threads
– Get more performance at a fraction of the cost

Page 7: 002 - Introduction to CUDA Programming_1

GPU vs. CPU Threads

• GPU threads are extremely lightweight
– Very little creation overhead
– On the order of microseconds

• GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few

Page 8: 002 - Introduction to CUDA Programming_1

Execution Timeline

[Timeline: the CPU/Host drives the GPU/Device over time:]

1. Copy to GPU mem
2. Launch GPU kernel
2’. Synchronize with GPU
3. Copy from GPU mem

Page 9: 002 - Introduction to CUDA Programming_1

GBT: Grids of Blocks of Threads

Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds

Page 10: 002 - Introduction to CUDA Programming_1

Grids of Blocks of Threads: Dimension Limits

• Grid of blocks: 1D or 2D
– Max x: 65535
– Max y: 65535

• Block of threads: 1D, 2D, or 3D
– Max number of threads: 512
– Max x: 512
– Max y: 512
– Max z: 64

• Limits apply to Compute Capability 1.0, 1.1, 1.2, and 1.3

– GTX280 = 1.3

Page 11: 002 - Introduction to CUDA Programming_1

Block and Thread IDs

• Threads and blocks have IDs
– So each thread can decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D

• Simplifies memory addressing when processing multidimensional data

[Figure: a device running Grid 1, a 3×2 arrangement of blocks (0,0) through (2,1); Block (1,1) is expanded into a 5×3 arrangement of threads (0,0) through (4,2).]

• IDs and dimensions are easily accessible through predefined “variables”, e.g., blockDim.x and threadIdx.x
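As a hedged sketch (the kernel name and data layout are illustrative, not from the slides), the 2D IDs map naturally onto a row-major matrix:

__global__ void scale2D (float *m, float s, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x direction covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y direction covers rows
    if (col < width && row < height)                   // guard the edges
        m[row * width + col] *= s;                     // row-major addressing
}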

Page 12: 002 - Introduction to CUDA Programming_1

Thread Batching

• A kernel is executed as a grid of thread blocks
– All threads share data memory space

• A thread block: threads that can cooperate with each other by:
– Synchronizing their execution
• For hazard-free shared memory accesses
– Efficiently sharing data through a low-latency shared memory

• Two threads from two different blocks cannot cooperate

[Figure: the host launches Kernel 1 as Grid 1 on the device (a 3×2 arrangement of blocks) and Kernel 2 as Grid 2; Block (1,1) is expanded into a 5×3 arrangement of threads.]

Page 13: 002 - Introduction to CUDA Programming_1

Thread Coordination Overview

• Race-free access to data

Page 14: 002 - Introduction to CUDA Programming_1

Programmer’s view: Memory Model

Page 15: 002 - Introduction to CUDA Programming_1

Programmer’s View: Memory Detail – Thread and Host

• Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory

• The host can R/W global, constant, and texture memories

Page 16: 002 - Introduction to CUDA Programming_1

Memory Model: Global, Constant, and Texture Memories

• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads

• Texture and constant memories
– Constants initialized by the host
– Contents visible to all threads

Page 17: 002 - Introduction to CUDA Programming_1

Memory Model Summary

Memory     Location   Cached   Access   Scope
Local      off-chip   No       R/W      thread
Shared     on-chip    N/A      R/W      all threads in a block
Global     off-chip   No       R/W      all threads + host
Constant   off-chip   Yes      RO       all threads + host
Texture    off-chip   Yes      RO       all threads + host

Page 18: 002 - Introduction to CUDA Programming_1

Execution Model: Ordering

• Execution order is undefined

• Do not assume or rely on:
– Block 0 executing before block 1
– Thread 10 executing before thread 20
– Any other ordering, even if you can observe it

Page 19: 002 - Introduction to CUDA Programming_1

Execution Model Summary

• Grid of blocks of threads
– 1D/2D grid of blocks
– 1D/2D/3D blocks of threads
• All blocks are identical: same structure and # of threads
• Block execution order is undefined
• Threads in the same block:
– Can synchronize and share data fast (shared memory)
• Threads from different blocks:
– Cannot cooperate
– Communicate through global memory
• Threads and blocks have IDs
– Simplifies data indexing
– Can be 1D, 2D, or 3D (threads)
• Blocks do not migrate: they execute on the same processor
• Several blocks may run on the same processor

Page 20: 002 - Introduction to CUDA Programming_1

CUDA Software Architecture

[Layers: CUDA libraries (e.g., fft()), the CUDA runtime API (cuda…() calls), and the CUDA driver API (cu…() calls).]

Page 21: 002 - Introduction to CUDA Programming_1

Reasoning about CUDA call ordering

• GPU communication happens via cuda…() calls and kernel invocations
– e.g., cudaMalloc, cudaMemcpy

• Asynchronous from the CPU’s perspective
– The CPU places a request in a “CUDA” queue
– Requests are handled in-order

• Streams allow for multiple queues
– More on this much later on

Page 22: 002 - Introduction to CUDA Programming_1

CUDA API: Example

int a[N];
for (i = 0; i < N; i++)
    a[i] = a[i] + x;

1. Allocate CPU data structure
2. Initialize data on CPU
3. Allocate GPU data structure
4. Copy data from CPU to GPU
5. Define execution configuration
6. Run kernel
7. CPU synchronizes with GPU
8. Copy data from GPU to CPU
9. De-allocate GPU and CPU memory

Page 23: 002 - Introduction to CUDA Programming_1

1. Allocate CPU Data

float *ha;

int main (int argc, char *argv[])

{

int N = atoi (argv[1]);

ha = (float *) malloc (sizeof (float) * N);

...

}

• Pinned memory allocation results in faster CPU to/from GPU copies
– cudaMallocHost (…)
– More on this later (a minimal sketch follows below)
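A minimal sketch of the pinned variant, assuming the same ha and N as above (cudaMallocHost and cudaFreeHost are the relevant runtime calls):

float *ha;
cudaMallocHost ((void **) &ha, sizeof (float) * N);   // pinned (page-locked) host memory
/* ... initialize ha, use it as the source/target of cudaMemcpy ... */
cudaFreeHost (ha);                                     // release the pinned allocation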

Page 24: 002 - Introduction to CUDA Programming_1

2. Initialize CPU Data

float *ha;

int i;

for (i = 0; i < N; i++)

ha[i] = i;

Page 25: 002 - Introduction to CUDA Programming_1

3. Allocate GPU Data

float *da;

cudaMalloc ((void **) &da, sizeof (float) * N);

• Notice: no assignment to da
– NOT: da = cudaMalloc (…)

• The assignment is done internally:
– That’s why we pass &da

• Allocates space in global memory

Page 26: 002 - Introduction to CUDA Programming_1

GPU Memory Allocation

• The host manages GPU memory allocation:
– cudaMalloc (void **ptr, size_t nbytes)
– Must explicitly cast to (void **)
• cudaMalloc ((void **) &da, sizeof (float) * N);
– cudaFree (void *ptr);
• cudaFree (da);
– cudaMemset (void *ptr, int value, size_t nbytes);
• cudaMemset (da, 0, N * sizeof (float));

• Check the CUDA Reference Manual

Page 27: 002 - Introduction to CUDA Programming_1

4. Copy Initialized CPU data to GPU

float *da;

float *ha;

cudaMemcpy ((void *) da, // DESTINATION

(void *) ha, // SOURCE

sizeof (float) * N, // #bytes

cudaMemcpyHostToDevice); // DIRECTION

Page 28: 002 - Introduction to CUDA Programming_1

Host/Device Data Transfers

• The host initiates all transfers:
cudaMemcpy (void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)

• Asynchronous from the CPU’s perspective
– The CPU thread continues

• In-order processing with other CUDA requests

• enum cudaMemcpyKind
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice

Page 29: 002 - Introduction to CUDA Programming_1

5. Define Execution Configuration

• How many blocks and threads/block

int threads_block = 64;

int blocks = N / threads_block;

if (N % threads_block != 0) blocks += 1;

• Alternatively:

blocks = (N + threads_block - 1) / threads_block;
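For example, assuming N = 200 and threads_block = 64: 200 / 64 = 3 with a remainder of 8, so the first form gives blocks = 3 + 1 = 4, and the ceiling form gives (200 + 64 - 1) / 64 = 263 / 64 = 4 as well.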

Page 30: 002 - Introduction to CUDA Programming_1

6. Launch Kernel & 7. CPU/GPU Synchronization

• Instructs the GPU to launch blocks × threads_block threads:

darradd <<<blocks, threads_block>>> (da, 10.0f, N);

cudaThreadSynchronize ();

• darradd: the kernel name
• <<<…>>>: the execution configuration
– More on this soon

• (da, x, N): the arguments
– 256 – 8 byte limit / no variable arguments

Page 31: 002 - Introduction to CUDA Programming_1

CPU/GPU Synchronization

• The CPU does not block on cuda…() calls
– Kernels/requests are queued and processed in-order
– Control returns to the CPU immediately

• Good if there is other work to be done
– e.g., preparing for the next kernel invocation

• Eventually, the CPU must know when the GPU is done
• Then it can safely copy the GPU results

• cudaThreadSynchronize ()
– Blocks the CPU until all preceding cuda…() and kernel requests have completed

Page 32: 002 - Introduction to CUDA Programming_1

8. Copy data from GPU to CPU & 9. DeAllocate Memory

float *da;

float *ha;

cudaMemcpy ((void *) ha, // DESTINATION

(void *) da, // SOURCE

sizeof (float) * N, // #bytes

cudaMemcpyDeviceToHost); // DIRECTION

cudaFree (da);

// display or process results here

free (ha);

Page 33: 002 - Introduction to CUDA Programming_1

The GPU Kernel

__global__ void darradd (float *da, float x, int N)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i < N) da[i] = da[i] + x;

}

• blockIdx: unique block ID
– Numerically ascending: 0, 1, …

• threadIdx: unique per-block index
– 0, 1, …, per block

• blockDim: dimensions of the block
– blockDim.x, blockDim.y, blockDim.z
– Unused dimensions default to 1

Page 34: 002 - Introduction to CUDA Programming_1

Array Index Calculation Example

int i = blockIdx.x * blockDim.x + threadIdx.x;

[Figure, assuming blockDim.x = 64: blockIdx.x = 0 has threadIdx.x = 0..63 and covers a[0]..a[63] (i = 0..63); blockIdx.x = 1 covers a[64]..a[127] (i = 64..127); blockIdx.x = 2 starts at a[128] (i = 128); and so on through a[255], a[256], …]

Page 35: 002 - Introduction to CUDA Programming_1

Generic Unique Thread and Block Index Calculations #1

• 1D Grid / 1D Blocks:

UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;

• 1D Grid / 2D Blocks:

UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;

• 1D Grid / 3D Blocks:

UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;

• Source: http://forums.nvidia.com/lofiversion/index.php?t82040.html

Page 36: 002 - Introduction to CUDA Programming_1

Generic Unique Thread and Block Index Calculations #2

• 2D Grid / 1D Blocks:

UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.x + threadIdx.x;

• 2D Grid / 2D Blocks:

UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;

• 2D Grid / 3D Blocks:

UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;

• UniqueThreadIndex means unique per grid.
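As a hedged illustration (the kernel name is made up), the 2D grid / 2D blocks formulas can be used verbatim inside a kernel:

__global__ void touchAll (float *a, int total)
{
    int ubi = blockIdx.y * gridDim.x + blockIdx.x;       // UniqueBlockIndex
    int uti = ubi * blockDim.y * blockDim.x
              + threadIdx.y * blockDim.x + threadIdx.x;  // UniqueThreadIndex
    if (uti < total)
        a[uti] += 1.0f;                                  // every element touched exactly once
}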

Page 37: 002 - Introduction to CUDA Programming_1

CUDA Function Declarations

• __global__ defines a kernel function
– Must return void
– Can only call __device__ functions

• __device__ and __host__ can be used together

                                  Executed on the:   Only callable from the:
__device__ float DeviceFunc()     device             device
__global__ void KernelFunc()      device             host
__host__ float HostFunc()         host               host
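A hedged sketch of combining the qualifiers (the function is illustrative): declaring a function __host__ __device__ compiles it for both sides, so it can be called from host code and from kernels.

__host__ __device__ float axpy (float a, float x, float y)
{
    return a * x + y;    // same source compiled for both CPU and GPU
}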

Page 38: 002 - Introduction to CUDA Programming_1

__device__ Example

• Add x to a[i] multiple times

__device__ float addmany (float a, float b, int count)

{

while (count--) a += b;

return a;

}

__global__ void darradd (float *da, float x, int N)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i < N) da[i] = addmany (da[i], x, 10);

}

Page 39: 002 - Introduction to CUDA Programming_1

Kernel and Device Function Restrictions

• __device__ functions cannot have their address taken
– e.g., f = &addmany; (*f)(…);

• For functions executed on the device:
– No recursion

darradd (…)
{
    darradd (…)
}

– No static variable declarations inside the function

darradd (…)
{
    static int canthavethis;
}

– No variable number of arguments
• e.g., something like printf (…)

Page 40: 002 - Introduction to CUDA Programming_1

Predefined Vector Datatypes

• Can be used both in host and in device code
– [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]

• Structures accessed with .x, .y, .z, .w fields

• Default constructors, “make_TYPE (…)”:
– float4 f4 = make_float4 (1.0f, 10.0f, 1.2f, 0.5f);

• dim3
– Type built on uint3
– Used to specify dimensions
– Default value is (1, 1, 1)

Page 41: 002 - Introduction to CUDA Programming_1

Execution Configuration

• Must specify when calling a __global__ function:

<<< Dg, Db [, Ns [, S]] >>>

• where:
– dim3 Dg: grid dimensions in blocks
– dim3 Db: block dimensions in threads
– size_t Ns: additional shared memory bytes to allocate per block
• Optional, defaults to 0
• More on this much later on
– cudaStream_t S: the requested stream (queue)
• Optional, defaults to 0
• Compute capability >= 1.1
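A hedged example of a full configuration (akernel, da, and the sizes are illustrative, not from the slides):

dim3 Dg (128);                        // 128 blocks, 1D grid
dim3 Db (64);                         // 64 threads per block
size_t Ns = 64 * sizeof (float);      // extra shared memory bytes per block
akernel <<< Dg, Db, Ns >>> (da, N);   // stream argument omitted, defaults to stream 0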

Page 42: 002 - Introduction to CUDA Programming_1

Built-in Variables

• dim3 gridDim
– Number of blocks per grid, in 2D (gridDim.z is always 1)

• uint3 blockIdx
– Block ID, in 2D (blockIdx.z is always 0)

• dim3 blockDim
– Number of threads per block, in 3D

• uint3 threadIdx
– Thread ID within the block, in 3D

Page 43: 002 - Introduction to CUDA Programming_1

Execution Configuration Examples

• 1D grid / 1D blocks

dim3 gd(1024);

dim3 bd(64);

akernel<<<gd, bd>>>(...);

gridDim.x = 1024, gridDim.y = 1,

blockDim.x = 64, blockDim.y = 1, blockDim.z = 1

• 2D grid / 3D blocks

dim3 gd(4, 128);

dim3 bd(64, 16, 4);

akernel<<<gd, bd>>>(...);

gridDim.x = 4, gridDim.y = 128,

blockDim.x = 64, blockDim.y = 16, blockDim.z = 4

Page 44: 002 - Introduction to CUDA Programming_1

Error Handling

• Most cuda…() functions return a cudaError_t
– If cudaSuccess: the request completed without a problem

• cudaGetLastError():
– Returns the last error to the CPU
– Use with cudaThreadSynchronize():

cudaError_t code;

cudaThreadSynchronize ();

code = cudaGetLastError ();

• const char *cudaGetErrorString(cudaError_t code);
– Returns a human-readable description of the error code

Page 45: 002 - Introduction to CUDA Programming_1

Error Handling Utility Function

void

cudaDie (const char *msg)

{

cudaError_t err;

cudaThreadSynchronize ();

err = cudaGetLastError();

if (err == cudaSuccess) return;

fprintf (stderr, "CUDA error: %s: %s.\n", msg, cudaGetErrorString (err));

exit(EXIT_FAILURE);

}

• Adapted from: http://www.ddj.com/hpc-high-performance-computing/207603131

Page 46: 002 - Introduction to CUDA Programming_1

Error Handling Macros

• CUDA_SAFE_CALL (some cuda call)
CUDA_SAFE_CALL (cudaMemcpy (a_h, a_d, arr_size, cudaMemcpyDeviceToHost));

• Must #define _DEBUG
– No checking code is emitted when undefined: performance

• Use make dbg=1 under NVIDIA_CUDA_SDK (a rough sketch of the macro follows below)
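The macro itself comes from the SDK's cutil.h; as a rough sketch only (not the SDK's exact definition, and assuming stdio.h and stdlib.h are included), the debug build behaves roughly like:

#ifdef _DEBUG
#define CUDA_SAFE_CALL(call) do {                                           \
    cudaError_t err = (call);                                               \
    if (err != cudaSuccess) {                                               \
        fprintf (stderr, "CUDA error: %s.\n", cudaGetErrorString (err));    \
        exit (EXIT_FAILURE);                                                \
    }                                                                       \
} while (0)
#else
#define CUDA_SAFE_CALL(call) (call)   /* no checking in release builds */
#endif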

Page 47: 002 - Introduction to CUDA Programming_1

Measuring Time -- gettimeofday

• Unix-based:

#include <sys/time.h>
#include <time.h>

struct timeval start, end;

gettimeofday (&start, NULL);
WHAT WE ARE INTERESTED IN
gettimeofday (&end, NULL);

timeCpu = (float) (end.tv_sec - start.tv_sec);
if (end.tv_usec < start.tv_usec) {
    timeCpu -= 1.0;
    timeCpu += (double) (1000000.0 + end.tv_usec - start.tv_usec) / 1000000.0;
} else
    timeCpu += (double) (end.tv_usec - start.tv_usec) / 1000000.0;

Page 48: 002 - Introduction to CUDA Programming_1

Using CUDA clock ()

• clock_t clock ();
• Can be used in device code
• Returns a per-multiprocessor counter that is incremented every clock cycle
• Sample at the beginning and at the end
• Upper bound, since threads are time-sliced

uint start = clock();
... compute (less than 3 sec) ...
uint end = clock();
if (end > start)
    time = end - start;
else
    time = end + (0xffffffff - start);

• Look at the clock example under projects in the SDK
• Using it takes some effort (a sketch follows below)
– Every thread measures start and end
– Then must find the min start and the max end
– Accurate
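A hedged sketch of the per-block measurement (the kernel and array names are illustrative; the work being measured replaces the comment):

__global__ void timedKernel (clock_t *dstart, clock_t *dend)
{
    clock_t start = clock ();            // sample the per-multiprocessor counter
    /* ... work being measured ... */
    if (threadIdx.x == 0) {              // record one start/end pair per block
        dstart[blockIdx.x] = start;
        dend[blockIdx.x]   = clock ();
    }
}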

Page 49: 002 - Introduction to CUDA Programming_1

Using cutTimer…() library calls

#include <cuda.h>

#include <cutil.h>

unsigned int htimer;

cutCreateTimer (&htimer);

cudaThreadSynchronize ();

cutStartTimer(htimer);

WHAT WE ARE INTERESTED IN

cudaThreadSynchronize ();

cutStopTimer(htimer);

printf ("time: %f\n", cutGetTimerValue(htimer));

Page 50: 002 - Introduction to CUDA Programming_1

Code Overview: Host side

#include <cuda.h>

#include <cutil.h>

unsigned int htimer;

float *ha, *da;

int main (int argc, char *argv[]) {

int N = atoi (argv[1]);

ha = (float *) malloc (sizeof (float) * N);

for (int i = 0; i < N; i++) ha[i] = i;

cutCreateTimer (&htimer);

cudaMalloc ((void **) &da, sizeof (float) * N);

cudaMemcpy ((void *) da, (void *) ha, sizeof (float) * N, cudaMemcpyHostToDevice);

cudaThreadSynchronize ();

cutStartTimer(htimer);

int threads_block = 64;
int blocks = (N + threads_block - 1) / threads_block;

darradd <<<blocks, threads_block>>> (da, 10.0f, N);

cudaThreadSynchronize ();

cutStopTimer(htimer);

cudaMemcpy ((void *) ha, (void *) da, sizeof (float) * N, cudaMemcpyDeviceToHost);

cudaFree (da);

free (ha);

printf ("processing time: %f\n", cutGetTimerValue(htimer));

}

Page 51: 002 - Introduction to CUDA Programming_1

Code Overview: Device Side

__device__ float addmany (float a, float b, int count)

{

while (count--) a += b;

return a;

}

__global__ void darradd (float *da, float x, int N)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i < N) da[i] = addmany (da[i], x, 10);

}

Page 52: 002 - Introduction to CUDA Programming_1

Variable Declarations – Will revisit next time

• __device__
– Stored in device memory (large, high latency, no cache)
– Allocated with cudaMalloc (the __device__ qualifier is implied)
– Accessible by all threads
– Lifetime: application

• __constant__
– Same as __device__, but cached and read-only by the GPU
– Written by the CPU via the cudaMemcpyToSymbol(...) call
– Lifetime: application

• __shared__
– Stored in on-chip shared memory (very low latency)
– Accessible by all threads in the same thread block
– Lifetime: kernel launch

• Unqualified variables:
– Scalars and built-in vector types are stored in registers
– Arrays of more than 4 elements or with run-time indices are stored in device memory

(A short sketch of these qualifiers in use follows below.)
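As a hedged sketch (variable and kernel names are illustrative, and a 64-thread block is assumed); cudaMemcpyToSymbol is the host call used to fill __constant__ memory:

__device__   float dbuffer[64];   // device memory; visible to all threads; lives for the application
__constant__ float dcoeffs[64];   // cached, read-only on the GPU; written by the host

// Host side (illustrative): cudaMemcpyToSymbol (dcoeffs, hcoeffs, sizeof (float) * 64);

__global__ void useQualifiers (float *out)
{
    __shared__ float tile[64];    // on-chip; shared by the threads of this block; lives for the kernel launch
    tile[threadIdx.x] = dbuffer[threadIdx.x] * dcoeffs[threadIdx.x];   // assumes blockDim.x = 64
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}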

Page 53: 002 - Introduction to CUDA Programming_1

Measurement Methodology

• You will not get exactly the same time measurements every time
– Other processes running / external events (e.g., network activity)
– Cannot control it: “non-determinism”

• Must take sufficient samples
– Say 10 or more
– There is theory on what the number of samples must be

• Measure the average
• Will discuss this next time or will provide a handout online

Page 54: 002 - Introduction to CUDA Programming_1

Handling Large Input Data Sets – 1D Example

• Recall gridDim.[xy] <= 65535
• Host calls the kernel multiple times:

float *dac = da;       // starting offset for current kernel
while (n_blocks) {
    int bn = n_blocks;
    int elems;         // array elements processed in this kernel
    if (bn > 65535)
        bn = 65535;
    elems = bn * block_size;
    darradd <<<bn, block_size>>> (dac, 10.0f, elems);
    n_blocks -= bn;
    dac += elems;
}