CUDA Lecture 4 CUDA Programming Basics


Page 1: CUDA Lecture 4 CUDA Programming Basics

Prepared 6/22/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 4: CUDA Programming Basics

Page 2: CUDA Lecture 4 CUDA Programming Basics

Things we need to consider:
- Control
- Synchronization
- Communication

Parallel programming languages offer different ways of dealing with the above.

CUDA Programming Basics – Slide 2

Parallel Programming Basics

Page 3: CUDA Lecture 4 CUDA Programming Basics

- CUDA programming model: basic concepts and data types
- CUDA application programming interface: basics
- Simple examples to illustrate basic concepts and functionalities
- Performance features will be covered later

CUDA Programming Basics – Slide 3

Overview

Page 4: CUDA Lecture 4 CUDA Programming Basics

- Basic kernels and execution on the GPU
- Basic memory management
- Coordinating CPU and GPU execution

See the programming guide for the full API

CUDA Programming Basics – Slide 4

Outline of CUDA Basics

Page 5: CUDA Lecture 4 CUDA Programming Basics

Integrated host + device application program in C:
- Serial or modestly parallel parts in host C code
- Highly parallel parts in device SPMD kernel C code

Programming model:
- Parallel code (a kernel) is launched and executed on a device by many threads
- Launches are hierarchical: threads are grouped into blocks, and blocks are grouped into grids
- Familiar serial code is written for a thread
- Each thread is free to execute a unique code path
- Built-in thread and block ID variables

CUDA Programming Basics – Slide 5

CUDA – C with no shader limitations!

Page 6: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 6

CUDA – C with no shader limitations!

Serial Code (host)
    . . .
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);

Serial Code (host)
    . . .
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);

Page 7: CUDA Lecture 4 CUDA Programming Basics

A compute device:
- Is a coprocessor to the CPU (host)
- Has its own DRAM (device memory)
- Runs many threads in parallel
- Is typically a GPU but can also be another type of parallel processing device

Data-parallel portions of an application are expressed as device kernels which run on many threads.

CUDA Programming Basics – Slide 7

CUDA Devices and Threads

Page 8: CUDA Lecture 4 CUDA Programming Basics

Differences between GPU and CPU threads:
- GPU threads are extremely lightweight, with very little creation overhead
- The GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few

CUDA Programming Basics – Slide 8

CUDA Devices and Threads

Page 9: CUDA Lecture 4 CUDA Programming Basics

The future of GPUs is programmable processing

So – build the architecture around the processor

CUDA Programming Basics – Slide 9

G80 – Graphics Mode

[Figure: G80 in graphics mode – the Host feeds an Input Assembler; vertex, geometry and pixel thread issue units (with Setup/Rstr/ZCull) dispatch work to an array of thread processors, each a pair of SPs with L1 cache and texture fetch (TF) units, backed by L2 caches and framebuffer (FB) partitions.]

Page 10: CUDA Lecture 4 CUDA Programming Basics

Processors execute computing threads: a new operating mode/hardware interface for computing.

CUDA Programming Basics – Slide 10

G80 CUDA Mode – A Device Example

[Figure: G80 in CUDA mode – the Host and Input Assembler feed a Thread Execution Manager; an array of multiprocessors, each with a parallel data cache and texture unit, connects through load/store units to Global Memory.]

Page 11: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 11

High Level View

[Figure: several SMs, each with its own shared memory (SMEM), attached to Global Memory on the GPU and connected over PCIe to the CPU and chipset.]

Page 12: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 12

Blocks of Threads Run on a SM

[Figure: a thread runs on a streaming processor with its own registers and local memory; a thread block runs on a streaming multiprocessor (SM) with per-block shared memory.]

Page 13: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 13

Whole Grid Runs on the GPU – many blocks of threads

[Figure: the many blocks of a grid are distributed across the SMs, all sharing Global Memory.]

Page 14: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 14

Extended C:
- Type qualifiers: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, execution management
- Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>((float *)myimage);

Page 15: CUDA Lecture 4 CUDA Programming Basics

Mark Murphy, “NVIDIA’s Experience with Open64,”www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

CUDA Programming Basics – Slide 15

Extended C

[Figure: compilation flow – the integrated source (foo.cu) is processed by cudacc (EDG C/C++ frontend, Open64 Global Optimizer); the CPU host code (foo.cpp) goes to gcc / cl, while the GPU assembly (foo.s) goes through OCG to G80 SASS (foo.sass).]

Page 16: CUDA Lecture 4 CUDA Programming Basics

A CUDA kernel is executed by an array of threads:
- All threads run the same code (SPMD)
- Each thread has an ID that it uses to compute memory addresses and make control decisions

CUDA Programming Basics – Slide 16

Arrays of Parallel Threads

[Figure: threads 0–7, each with its own threadID, all executing
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
]

Page 17: CUDA Lecture 4 CUDA Programming Basics

Divide the monolithic thread array into multiple blocks:
- Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
- Threads in different blocks cannot cooperate

CUDA Programming Basics – Slide 17

Thread Blocks: Scalable Cooperation

[Figure: the thread array split into Thread Block 0, Thread Block 1, …, Thread Block N-1; within each block, threads 0–7 execute the same
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
code, each with its own threadID.]

Page 18: CUDA Lecture 4 CUDA Programming Basics

Threads launched for a parallel section are partitioned into thread blocks; a grid is all the blocks for a given launch.

A thread block is a group of threads that can:
- Synchronize their execution
- Communicate via shared memory

CUDA Programming Basics – Slide 18

Thread Hierarchy

Page 19: CUDA Lecture 4 CUDA Programming Basics

Any possible interleaving of blocks should be valid:
- Presumed to run to completion without preemption
- Can run in any order
- Can run concurrently OR sequentially

Blocks may coordinate but not synchronize:
- Shared queue pointer: OK (see the sketch below)
- Shared lock: BAD … can easily deadlock

The independence requirement gives scalability.
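As a hedged illustration of the "shared queue pointer" pattern (not from the original slides; the kernel and variable names are made up), a global counter advanced with atomicAdd lets blocks grab work items in whatever order they happen to run:

__device__ unsigned int queue_head = 0;   // hypothetical global work-queue pointer

__global__ void process_queue(const float *work, float *out, unsigned int num_items)
{
    while (true) {
        // Grab the next unclaimed item atomically; no assumption is made
        // about which block (or which thread) runs first.
        unsigned int item = atomicAdd(&queue_head, 1u);
        if (item >= num_items) break;
        out[item] = 2.0f * work[item];    // placeholder per-item work
    }
}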

CUDA Programming Basics – Slide 19

Blocks Must Be Independent

Page 20: CUDA Lecture 4 CUDA Programming Basics

A CUDA program has two pieces:
- Host code on the CPU which interfaces to the GPU
- Kernel code which runs on the GPU

At the host level, there is a choice of two APIs (Application Programming Interfaces):
- Runtime: simpler, more convenient
- Driver: much more verbose, more flexible, closer to OpenCL

We will only use the Runtime API in this course.

CUDA Programming Basics – Slide 20

Basics of CUDA Programming

Page 21: CUDA Lecture 4 CUDA Programming Basics

At the host code level, there are library routines for:
- memory allocation on the graphics card
- data transfer to/from device memory: constants, texture arrays (useful for lookup tables), ordinary data
- error-checking
- timing

There is also a special syntax for launching multiple copies of the kernel process on the GPU.
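The slides do not show the error-checking or timing routines themselves; a minimal hedged sketch of host code using the standard runtime calls (cudaGetLastError, cudaEvent*) might look like the following, with KernelA standing in for whichever kernel is being timed:

// Check for errors from a previous kernel launch or API call
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));

// Time a kernel launch with events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
KernelA<<< nBlk, nTid >>>(args);          // kernel launch being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel took %f ms\n", ms);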

CUDA Programming Basics – Slide 21

Basics of CUDA Programming

Page 22: CUDA Lecture 4 CUDA Programming Basics

Each thread uses IDs to decide what data to work on:
- Block ID: 1-D or 2-D, unique within a grid
- Thread ID: 1-D, 2-D or 3-D, unique within a block

Dimensions are set at launch and can be unique for each grid.

CUDA Programming Basics – Slide 22

Block IDs and Thread IDs

[Figure 3.2: An Example of CUDA Thread Organization – the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Grid 1 contains Blocks (0,0), (1,0), (0,1) and (1,1); Block (1,1) is expanded to show its 4×2×2 threads, Thread (0,0,0) through Thread (3,1,1). Courtesy: NVIDIA.]

Page 23: CUDA Lecture 4 CUDA Programming Basics

Built-in variables: threadIdx, blockIdx, blockDim, gridDim.

This simplifies memory addressing when processing multidimensional data:
- Image processing
- Solving PDEs on volumes
- …

CUDA Programming Basics – Slide 23

Block IDs and Thread IDs

[Figure 3.2 repeated: An Example of CUDA Thread Organization (see the previous slide).]

Page 24: CUDA Lecture 4 CUDA Programming Basics

In its simplest form the launch of a kernel looks like:

kernel_routine<<<gridDim, blockDim>>>(args);

where
- gridDim is the number of copies of the kernel (the "grid" size)
- blockDim is the number of threads within each copy (the "block" size)
- args is a limited number of arguments, usually mainly pointers to arrays in graphics memory, and some constants which get copied by value

The more general form allows gridDim and blockDim to be 2-D or 3-D to simplify application programs.

CUDA Programming Basics – Slide 24

Basics of CUDA Programming

Page 25: CUDA Lecture 4 CUDA Programming Basics

At the lower level, when one copy of the kernel is started on an SM it is executed by a number of threads, each of which knows about:
- some variables passed as arguments
- pointers to arrays in device memory (also arguments)
- global constants in device memory
- shared memory and private registers/local variables
- some special variables:
    gridDim   – size (or dimensions) of the grid of blocks
    blockIdx  – index (or 2-D/3-D indices) of the block
    blockDim  – size (or dimensions) of each block
    threadIdx – index (or 2-D/3-D indices) of the thread

CUDA Programming Basics – Slide 25

Basics of CUDA Programming

Page 26: CUDA Lecture 4 CUDA Programming Basics

Suppose we have 1000 blocks, and each one has 128 threads – how does it get executed?

On current Tesla hardware, would probably get 8 blocks running at the same time on each SM, and each block has 4 warps => 32 warps running on each SM

Each clock tick, the SM warp scheduler decides which warp to execute next, choosing from those not waiting for:
- data coming from device memory (memory latency)
- completion of earlier instructions (pipeline delay)

Programmer doesn’t have to worry about this level of detail, just make sure there are lots of threads / warps

CUDA Programming Basics – Slide 26

Basics of CUDA Programming

Page 27: CUDA Lecture 4 CUDA Programming Basics

In the simplest case, we have a 1-D grid of blocks, and a 1-D set of threads within each block.

If we want to use a 2-D set of threads, then blockDim.x, blockDim.y give the dimensions, and threadIdx.x, threadIdx.y give the thread indices.

To launch the kernel we would use something like

dim3 nthreads(16,4);
my_new_kernel<<<nblocks,nthreads>>>(d_x);

where dim3 is a special CUDA datatype with 3 components .x, .y, .z, each initialized to 1.

CUDA Programming Basics – Slide 27

Basics of CUDA Programming

Page 28: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 28

For Example

[Figure 3.2 again: the host launches kernels on grids of blocks; Block (1,1) of Grid 1 is expanded into its 4×2×2 threads, Thread (0,0,0) through Thread (3,1,1). Courtesy: NVIDIA.]

Launch with

dim3 dimGrid(2, 2);
dim3 dimBlock(4, 2, 2);
kernelFunc<<<dimGrid, dimBlock>>>(…);

Zoomed in on the block with blockIdx.x = blockIdx.y = 1, blockDim.x = 4, blockDim.y = blockDim.z = 2.

Each thread in the block has coordinates (threadIdx.x, threadIdx.y, threadIdx.z).

Page 29: CUDA Lecture 4 CUDA Programming Basics

A similar approach is used for 3-D threads and/or 2-D grids. This can be very useful in 2-D / 3-D finite difference applications.

How do 2-D / 3-D threads get divided into warps? A 1-D thread ID is defined by

threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y

and this is then broken up into warps of size 32.
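As an illustrative sketch (not from the slides; the kernel name is made up), a kernel can compute its own linear thread ID and warp index from the built-in variables, assuming a 1-D grid:

__global__ void show_warp_layout(int *linear_id, int *warp_id)
{
    // Linearize the 3-D thread index exactly as described above
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;

    int block_size = blockDim.x * blockDim.y * blockDim.z;
    int slot = blockIdx.x * block_size + tid;   // one output slot per thread

    linear_id[slot] = tid;
    warp_id[slot]   = tid / 32;   // threads sharing this value execute in the same warp
}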

CUDA Programming Basics – Slide 29

Basics of CUDA Programming

Page 30: CUDA Lecture 4 CUDA Programming Basics

[Figure: CUDA memory model – the Grid contains Blocks (0,0) and (1,0); each block has its own Shared Memory, and each thread its own Registers; all blocks access Global Memory, which the Host can also read and write.]

Global memory:
- Main means of communicating R/W data between host and device
- Contents visible to all threads
- Long latency access

We will focus on global memory for now; constant and texture memory will come later.

CUDA Programming Basics – Slide 30

CUDA Memory Model Overview


Page 31: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 31

Memory Model

[Figure: sequential kernels – Kernel 0, then Kernel 1, … all access the same per-device Global Memory.]

Page 32: CUDA Lecture 4 CUDA Programming Basics

The API is an extension to the ANSI C programming language – low learning curve.

The hardware is designed to enable a lightweight runtime and driver – high performance.

CUDA Programming Basics – Slide 32

CUDA API Highlights: Easy and Lightweight

Page 33: CUDA Lecture 4 CUDA Programming Basics

CPU and GPU have separate memory spaces:
- Data is moved across the PCIe bus
- Use functions to allocate/set/copy memory on the GPU, very similar to the corresponding C functions

Pointers are just addresses:
- Can't tell from the pointer value whether the address is on the CPU or GPU
- Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice-versa

CUDA Programming Basics – Slide 33

Memory Spaces

Page 34: CUDA Lecture 4 CUDA Programming Basics

[Figure: the memory model diagram again – per-thread Registers, per-block Shared Memory, and device-wide Global Memory accessible from the Host.]

cudaMalloc()
- Allocates an object in the device global memory
- Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object

cudaFree()
- Frees objects from device global memory
- Takes a pointer to the freed object

CUDA Programming Basics – Slide 34

CUDA Device Memory Allocation


Page 35: CUDA Lecture 4 CUDA Programming Basics

Code example:
- Allocate a 64-by-64 single precision float array
- Attach the allocated storage to Md
- "d" is often used to indicate a device data structure

CUDA Programming Basics – Slide 35

CUDA Device Memory Allocation

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaMemset(Md, 0, size);
cudaFree(Md);

Page 36: CUDA Lecture 4 CUDA Programming Basics

[Figure: the memory model diagram again – host–device transfers move data between Host memory and device Global Memory.]

cudaMemcpy()
- Memory data transfer
- Requires four parameters: pointer to destination, pointer to source, number of bytes copied, and the type of transfer (host to host, host to device, device to host, device to device)
- Asynchronous variants (cudaMemcpyAsync) are also available

CUDA Programming Basics – Slide 36

CUDA Host-Device Data Transfer


Page 37: CUDA Lecture 4 CUDA Programming Basics

cudaMemcpy():
- Returns after the copy is complete
- Blocks the CPU thread until all bytes have been copied
- Doesn't start copying until previous CUDA calls complete
- Non-blocking copies are also available

CUDA Programming Basics – Slide 37

Memory Model

[Figure: cudaMemcpy() moves data between Host memory and Device 0 memory / Device 1 memory.]

Page 38: CUDA Lecture 4 CUDA Programming Basics

Code example:
- Transfer a 64-by-64 single precision float array
- M is in host memory and Md is in device memory
- cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost and cudaMemcpyDeviceToDevice are symbolic constants

CUDA Programming Basics – Slide 38

CUDA Host-Device Data Transfer

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

Page 39: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 39

First Simple CUDA Example

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);
    int *d_a=0, *h_a=0;                 // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc((void**)&d_a, num_bytes);
    if (0==h_a || 0==d_a) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset(d_a, 0, num_bytes);
    cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);

    for (int i=0; i<dimx; i++)
        printf("%d ", h_a[i]);
    printf("\n");

    free(h_a);
    cudaFree(d_a);
    return 0;
}

Page 40: CUDA Lecture 4 CUDA Programming Basics

C/C++ with some restrictions:
- Can only access GPU memory
- No variable number of arguments
- No static variables
- No recursion
- No dynamic polymorphism

Must be declared with a qualifier:
- __global__ : launched by the CPU, cannot be called from the GPU
- __device__ : called from other GPU functions, cannot be called by the CPU
- __host__ : can be called by the CPU

CUDA Programming Basics – Slide 40

Code Executed on GPU

Page 41: CUDA Lecture 4 CUDA Programming Basics

__global__ defines a kernel function; it must return void.

__device__ and __host__ can be used together (sample use: overloading operators); a hedged sketch follows below.
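For illustration only (not from the slides; the function and kernel names are invented), a function qualified with both __host__ and __device__ is compiled for both sides, so the same helper can be called from main() and from a kernel:

__host__ __device__ float clampf(float x, float lo, float hi)
{
    // Same source compiled for both the CPU and the GPU
    return (x < lo) ? lo : ((x > hi) ? hi : x);
}

__global__ void clamp_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clampf(data[i], 0.0f, 1.0f);   // device-side call
}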

CUDA Programming Basics – Slide 41

CUDA Function Declarations

                                    Executed on the    Only callable from the
__device__ float DeviceFunc()       Device             Device
__global__ void  KernelFunc()       Device             Host
__host__   float HostFunc()         Host               Host

Page 42: CUDA Lecture 4 CUDA Programming Basics

__device__ int reduction_lock = 0;

The __device__ prefix tells nvcc this is a global variable in GPU (device) memory, not CPU memory.
- The variable can be read and modified by any kernel
- Its lifetime is the lifetime of the whole application
- Arrays of fixed size can also be declared this way
- It can be read/written by host code using the special routines cudaMemcpyToSymbol and cudaMemcpyFromSymbol, or with standard cudaMemcpy in combination with cudaGetSymbolAddress (a hedged sketch follows)
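A minimal sketch of host-side access to such a variable, assuming the reduction_lock declaration above (the surrounding host code is illustrative only):

#include <stdio.h>

__device__ int reduction_lock = 0;   // as declared on this slide

int main()
{
    int zero = 0, value = 0;

    // Reset the device variable from the host
    cudaMemcpyToSymbol(reduction_lock, &zero, sizeof(int));

    // ... launch kernels that read/modify reduction_lock ...

    // Read it back on the host
    cudaMemcpyFromSymbol(&value, reduction_lock, sizeof(int));
    printf("reduction_lock = %d\n", value);
    return 0;
}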

CUDA Programming Basics – Slide 42

CUDA Function Declarations

Page 43: CUDA Lecture 4 CUDA Programming Basics

__device__ functions cannot have their address taken.

For functions executed on the device:
- No recursion
- No static variable declarations inside the function
- No variable number of arguments

CUDA Programming Basics – Slide 43

CUDA Function Declarations

Page 44: CUDA Lecture 4 CUDA Programming Basics

As seen, a kernel function must be called with an execution configuration:

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.

CUDA Programming Basics – Slide 44

Calling a Kernel Function – Thread Creation

__global__ void KernelFunc(…);
dim3 DimGrid(100, 50);              // 5000 thread blocks
dim3 DimBlock(4, 8, 8);             // 256 threads per block
size_t SharedMemBytes = 64;         // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(…);

Page 45: CUDA Lecture 4 CUDA Programming Basics

The kernel code looks fairly normal once you get used to two things.

Code is written from the point of view of a single thread:
- quite different to OpenMP multithreading
- similar to MPI, where you use the MPI "rank" to identify the MPI process
- all local variables are private to that thread

You need to think about where each variable lives:
- any operation involving data in device memory forces its transfer to/from registers in the GPU
- there is no cache on old hardware, so a second operation with the same data will force a second transfer
- it is often better to copy the value into a local register variable (see the sketch below)
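A hedged sketch of that last point (the kernel and variable names are invented for the example): reading a device-memory value once into a local variable avoids repeated global-memory traffic when it is reused.

__global__ void saxpy_like(float *x, const float *coef, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Copy the global value into a register once ...
    float c = coef[0];

    // ... then reuse the register instead of re-reading device memory
    x[i] = c * x[i] + c;
}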

CUDA Programming Basics – Slide 45

Basics of CUDA Programming

Page 46: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 46

Next CUDA Example: Vector Addition

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    int N   = 16;   // total number of elements in the vector/array
    int TPB = 4;    // number of threads per block

    // allocate and initialize host (CPU) memory
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
    h_A = (float*)malloc(N * sizeof(float));
    h_B = (float*)malloc(N * sizeof(float));
    h_C = (float*)malloc(N * sizeof(float));
    // ... assign values to h_A and h_B ...

    // allocate device (GPU) memory
    cudaMalloc((void**)&d_A, N * sizeof(float));
    cudaMalloc((void**)&d_B, N * sizeof(float));
    cudaMalloc((void**)&d_C, N * sizeof(float));

    // copy host memory to device
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // Run a grid of N/TPB blocks of TPB threads each
    vecAdd<<< N/TPB, TPB >>>(d_A, d_B, d_C);

    // copy result back to host memory
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // do something with the result ...

    // free device (GPU) memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

Page 47: CUDA Lecture 4 CUDA Programming Basics

The __global__ identifier says it's a kernel function.

Each thread sets one element of the C[] array. Within each block of threads, threadIdx.x ranges from 0 to blockDim.x-1, so each thread has a unique value for i (see the bounds-checked variant sketched below).
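A common refinement, shown here as an illustrative sketch rather than part of the original slides: when N is not an exact multiple of the block size, launch one extra block and guard the store with a bounds check so the surplus threads do nothing.

__global__ void vecAddGuarded(float* A, float* B, float* C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)                 // guard against the last, partially-filled block
        C[i] = A[i] + B[i];
}

// Host-side launch: round the number of blocks up
int TPB = 4;
vecAddGuarded<<< (N + TPB - 1) / TPB, TPB >>>(d_A, d_B, d_C, N);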

CUDA Programming Basics – Slide 47

Next CUDA Example: Vector Addition

Page 48: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 48

Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] = 7;
}

Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] = blockIdx.x;
}

Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    a[idx] = threadIdx.x;
}

Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

Page 49: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 49

Next CUDA Example: Kernel with 2-D Addressing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;
    a[idx] = a[idx]+1;
}

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);
    int *d_a=0, *h_a=0;                 // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc((void**)&d_a, num_bytes);
    if (0==h_a || 0==d_a) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset(d_a, 0, num_bytes);

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;

    kernel<<<grid, block>>>(d_a, dimx, dimy);

    cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);

    for (int row=0; row<dimy; row++) {
        for (int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col]);
        printf("\n");
    }

    free(h_a);
    cudaFree(d_a);
    return 0;
}

Page 50: CUDA Lecture 4 CUDA Programming Basics

A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
- Leave shared memory usage until later
- Local, register usage
- Thread ID usage
- Memory data transfer API between host and device
- Assume square matrices for simplicity

CUDA Programming Basics – Slide 50

A Simple Running Example: Matrix Multiplication

Page 51: CUDA Lecture 4 CUDA Programming Basics

P = M × N, each of size WIDTH-by-WIDTH.

Without tiling:
- One thread calculates one element of P
- M and N are loaded WIDTH times from global memory

CUDA Programming Basics – Slide 51

Programming Model: Square Matrix Multiplication Example

[Figure: matrices M, N and P, each WIDTH × WIDTH; one element of P is the dot product of a row of M and a column of N.]

Page 52: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 52

Memory Layout of a Matrix in C

[Figure: in C the 4 × 4 matrix M is stored row-major in a linear array: M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3.]

Page 53: CUDA Lecture 4 CUDA Programming Basics

Memory Layout of a Matrix in the Textbook

[Figure: the textbook labels elements Mx,y (column index first), so the same linear storage reads M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3.]

Page 54: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 54

Step 1: Matrix Multiplication – A Simple Host Version in C

[Figure: the triple loop over i, j, k – element P(i, j) accumulates products of row i of M and column j of N, all of size WIDTH × WIDTH.]

Page 55: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 55

Step 1: Matrix Multiplication – A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

Page 56: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 56

Step 2: Input Matrix Data Transfer (Host-Side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // allocate P on the device
    cudaMalloc((void**)&Pd, size);

Page 57: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 57

Step 3: Output Matrix Data Transfer (Host-Side Code)

    // kernel invocation code – to be shown later (Step 5)
    …

    // read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}

Page 58: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 58

Step 4: Kernel Function (Overview)

[Figure: each thread uses (threadIdx.x, threadIdx.y) to pick one element of Pd, stepping with index k through a row of Md and a column of Nd, all of size WIDTH × WIDTH.]

Page 59: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 59

Step 4: Kernel Function

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

Page 60: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 60

Step 5: Kernel Invocation (Host-Side Code)

    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Page 61: CUDA Lecture 4 CUDA Programming Basics

One block of threads computes the matrix Pd; each thread computes one element of Pd.

Each thread:
- Loads a row of matrix Md
- Loads a column of matrix Nd
- Performs one multiply and one addition for each pair of Md and Nd elements
- Compute to off-chip memory access ratio is close to 1:1 (not very high)

The size of the matrix is limited by the number of threads allowed in a thread block.

CUDA Programming Basics – Slide 61

Only One Thread Block Used

Page 62: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 62

Only One Thread Block Used

[Figure: Grid 1 with a single Block 1 computing Pd from Md and Nd; Thread (2, 2) is highlighted, combining the Md row (3 2 5 4) with an Nd column (2 4 2 6) to produce the Pd element 48.]

Page 63: CUDA Lecture 4 CUDA Programming Basics

Have each 2-D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix:
- Each block has (TILE_WIDTH)² threads
- Generate a 2-D grid of (WIDTH / TILE_WIDTH)² blocks

You still need to put a loop around the kernel call for cases where WIDTH / TILE_WIDTH is greater than the maximum grid size (64K). A sketch of the launch configuration follows.
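As an illustrative sketch (the tile size and the ceiling-division idiom are assumptions, not from the slides), the host sets up the tiled launch like this:

int TILE_WIDTH = 16;                               // tile/block edge length (assumed)
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);             // (TILE_WIDTH)^2 threads per block
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH);   // round up if Width is not a multiple of the tile

MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);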

CUDA Programming Basics – Slide 63

Handling Square Matrices with Arbitrary Size

Page 64: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 64

Matrix Multiplication Using Multiple Blocks

[Figure: Pd divided into TILE_WIDTH × TILE_WIDTH tiles; block indices (bx, by) select a tile and thread indices (tx, ty) select an element within it.]

- Break up Pd into tiles
- Each block calculates one tile; each thread calculates one element
- Block size equals tile size

Page 65: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 65

A Small Example: Multiplication

[Figure: a 4 × 4 example with TILE_WIDTH = 2 – Pd is split into four 2 × 2 tiles handled by Blocks (0,0), (1,0), (0,1) and (1,1); each tile is formed from the corresponding rows of Md and columns of Nd.]

Page 66: CUDA Lecture 4 CUDA Programming Basics

CUDA Programming Basics – Slide 66

Revised Matrix Multiplication Kernel Using Multiple Blocks

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and N
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

    Pd[Row*Width+Col] = Pvalue;
}

Page 67: CUDA Lecture 4 CUDA Programming Basics

All threads in a block execute the same kernel program (SPMD).

The programmer declares the block:
- Block size: 1 to 512 concurrent threads
- Block shape: 1-D, 2-D, or 3-D
- Block dimensions in threads

Threads have thread id numbers within the block; the thread program uses the thread id to select work and address shared data.

CUDA Tools and Threads – Slide 67

CUDA Thread Block

[Figure: a thread block with thread ids 0, 1, 2, 3, …, m all running the same thread program. Courtesy: John Nickolls, NVIDIA.]

Page 68: CUDA Lecture 4 CUDA Programming Basics

Threads in the same block share data and synchronize while doing their share of the work.

Threads in different blocks cannot cooperate; each block can execute in any order relative to other blocks!

CUDA Tools and Threads – Slide 68

CUDA Thread Block

[Figure repeated: a thread block with thread ids 0 … m running the same thread program. Courtesy: John Nickolls, NVIDIA.]

Page 69: CUDA Lecture 4 CUDA Programming Basics

Hardware is free to assign blocks to any processor at any time:
- A kernel scales across any number of parallel processors
- Each block can execute in any order relative to other blocks

CUDA Tools and Threads – Slide 69

Transparent Scalability

[Figure: the same kernel grid of Blocks 0–7 executed on a 2-SM device as four waves of two blocks, or on a 4-SM device as two waves of four blocks.]

Page 70: CUDA Lecture 4 CUDA Programming Basics

Processors execute computing threads: a new operating mode/hardware interface for computing.

CUDA Tools and Threads – Slide 70

G80 CUDA Mode – A Review

[Figure: the G80 CUDA-mode block diagram from Slide 10, repeated for review – Host, Input Assembler, Thread Execution Manager, multiprocessors with parallel data caches and texture units, load/store paths to Global Memory.]

Page 71: CUDA Lecture 4 CUDA Programming Basics

Threads are assigned to streaming multiprocessors (SMs) at block granularity:
- Up to 8 blocks per SM, as resources allow
- Each SM in the G80 can take up to 768 threads: could be 256 (threads/block) × 3 blocks, or 128 (threads/block) × 6 blocks, etc.

Threads run concurrently:
- Each SM maintains thread/block id numbers
- Each SM manages/schedules thread execution

CUDA Tools and Threads – Slide 71

G80 Example: Executing Thread Blocks

Page 72: CUDA Lecture 4 CUDA Programming Basics

CUDA Tools and Threads – Slide 72

G80 Example: Executing Thread Blocks

[Figure: blocks distributed to SM 0 and SM 1; each SM has an MT issue unit, SPs and Shared Memory, and holds the threads t0 t1 t2 … tm of its assigned blocks – flexible resource allocation.]

Page 73: CUDA Lecture 4 CUDA Programming Basics

Each block is executed as 32-thread warps:
- An implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM

If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 × 3 = 24 warps.

CUDA Tools and Threads – Slide 73

G80 Example: Thread Scheduling

Page 74: CUDA Lecture 4 CUDA Programming Basics

CUDA Tools and Threads – Slide 74

G80 Example: Thread Scheduling

[Figure: a Streaming Multiprocessor (instruction L1, instruction fetch/dispatch, SPs, SFUs, Shared Memory) holding the warps of Blocks 1 and 2, each warp consisting of threads t0 … t31.]

Page 75: CUDA Lecture 4 CUDA Programming Basics

Each SM implements zero-overhead warp scheduling:
- At any time, only one of the warps is executed by an SM
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution using a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected

CUDA Tools and Threads – Slide 75

G80 Example: Thread Scheduling

[Figure: warp-scheduling timeline – warps from thread blocks TB1, TB2 and TB3 (TB = Thread Block, W = Warp) are interleaved over time; whenever a warp stalls (e.g. TB1 W1, TB2 W1, TB3 W2), the scheduler switches to another eligible warp.]

Page 76: CUDA Lecture 4 CUDA Programming Basics

For matrix multiplication using multiple blocks, should I use 8 × 8, 16 × 16 or 32 × 32 blocks?

For 8 × 8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!

For 16 × 16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.

For 32 × 32, we have 1024 threads per Block. Not even one can fit into an SM!

CUDA Tools and Threads – Slide 76

G80 Block Granularity Considerations

Page 77: CUDA Lecture 4 CUDA Programming Basics

The API is an extension to the C programming language.

It consists of:
- Language extensions to target portions of the code for execution on the device
- A runtime library split into:
  - A common component providing built-in vector types and a subset of the C runtime library in both host and device code
  - A host component to control and access one or more devices from the host
  - A device component providing device-specific functions

CUDA Tools and Threads – Slide 77

Application Programming Interface

Page 78: CUDA Lecture 4 CUDA Programming Basics

dim3 gridDim;   – dimensions of the grid in blocks (gridDim.z unused)
dim3 blockDim;  – dimensions of the block in threads
dim3 blockIdx;  – block index within the grid
dim3 threadIdx; – thread index within the block

CUDA Tools and Threads – Slide 78

Language Extensions: Built-in Variables

Page 79: CUDA Lecture 4 CUDA Programming Basics

pow, sqrt, cbrt, hypot; exp, exp2, expm1; log, log2, log10, log1p; sin, cos, tan, asin, acos, atan, atan2; sinh, cosh, tanh, asinh, acosh, atanh; ceil, floor, trunc, round; etc.

When executed on the host, a given function uses the C runtime implementation if available.

These functions are only supported for scalar types, not vector types.

CUDA Tools and Threads – Slide 79

Common Runtime Component: Mathematical Functions

Page 80: CUDA Lecture 4 CUDA Programming Basics

Some mathematical functions (e.g. sinf(x)) have a less accurate but faster device-only version (e.g. __sinf(x)): __powf; __logf, __log2f, __log10f; __expf; __sinf, __cosf, __tanf. A hedged usage sketch follows.
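For illustration (the kernel name is invented), the intrinsics trade accuracy for speed; compiling with nvcc's -use_fast_math maps the standard single-precision calls onto them automatically.

__global__ void phase(float *out, const float *theta, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Accurate version: sinf(theta[i]) * expf(-theta[i])
        // Faster, less accurate device-only intrinsics:
        out[i] = __sinf(theta[i]) * __expf(-theta[i]);
    }
}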

CUDA Tools and Threads – Slide 80

Common Runtime Component: Mathematical Functions

Page 81: CUDA Lecture 4 CUDA Programming Basics

Provides functions to deal with:
- Device management (including multi-device systems)
- Memory management
- Error handling

Initializes the first time a runtime function is called.

A host thread can invoke device code on only one device; multiple host threads are required to run on multiple devices.

CUDA Tools and Threads – Slide 81

Host Runtime Component
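As a hedged illustration of the device-management calls mentioned above (not from the slides), host code can enumerate and select a device like this:

int count = 0;
cudaGetDeviceCount(&count);              // how many CUDA devices are present
cudaSetDevice(0);                        // make device 0 current for this host thread

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);       // query its properties
printf("Using %s (%d multiprocessors)\n", prop.name, prop.multiProcessorCount);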

Page 82: CUDA Lecture 4 CUDA Programming Basics

void __syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block

A hedged usage sketch follows below.
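An illustrative sketch (not from the slides; the kernel name and sizes are assumptions) of the typical pattern: stage data in shared memory, barrier, then read what other threads wrote – here each thread reads its neighbour's value, which would be a RAW hazard without the barrier.

__global__ void shift_left(const int *in, int *out)
{
    // Assumes the array length is a multiple of blockDim.x and blockDim.x <= 256
    __shared__ int buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    buf[threadIdx.x] = in[i];              // every thread writes one slot
    __syncthreads();                       // barrier: all writes now visible to the block

    int next = (threadIdx.x + 1) % blockDim.x;
    out[i] = buf[next];                    // reading a neighbour's slot is now safe
}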

CUDA Tools and Threads – Slide 82

Device Runtime Component:Synchronization Function

Page 83: CUDA Lecture 4 CUDA Programming Basics

Memory allocation:
    cudaMalloc((void **)&xd, nbytes);

Data copying:
    cudaMemcpy(xh, xd, nbytes, cudaMemcpyDeviceToHost);

Reminder: using d (h) to distinguish an array on the device (host) is not mandatory, just helpful labeling.

A kernel routine is declared with the __global__ prefix, and is written from the point of view of a single thread.

CUDA Programming Basics – Slide 83

Final Thoughts

Page 84: CUDA Lecture 4 CUDA Programming Basics

Reading: Chapters 3 and 4, “Programming Massively Parallel Processors” by Kirk and Hwu.

Based on original material from:
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan

Revision history: last updated 6/22/2011.

CUDA Programming Basics – Slide 84

End Credits