ME964 High Performance Computing for Engineering Applications

ME964High Performance Computing for Engineering Applications

CUDA API

Sept. 18, 2008

Before we get started…

Last Time The CUDA API Start discussing CUDA programming model

Today HW2 is due on Friday, 11:59 PM HW3 available on the class website (due Sept.25) A look at a matrix multiplication example The CUDA execution model

2

split personality

n.

Two distinct personalities in the same entity, each of which prevails at a particular time.

3

Going back to the G80 HW…

HK-UIUC

__global__ void KernelFunc(...); // declaration

dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Calling a Kernel Function, and the Concept of Execution Configuration

A kernel function must be called with an execution configuration:

4

Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit sync needed for blocking

HK-UIUC

Last slide of previous lecture

G80 Thread Computing Pipeline

Processors execute computing threads Alternative operating mode specifically for computing

Load/store

Global Memory

Thread Execution Manager

Input Assembler

Host

Texture Texture Texture Texture Texture Texture Texture TextureTexture

Parallel DataCache

Parallel DataCache

Parallel DataCache

Parallel DataCache

Parallel DataCache

Parallel DataCache

Parallel DataCache

Parallel DataCache

Load/store Load/store Load/store Load/store Load/store

The future of GPUs is programmable processing So – build the architecture around the processor

L2

FB

SP SP

L1

TF

Th

rea

d P

roc

es

so

r

Vtx Thread Issue

Setup / Rstr / ZCull

Geom Thread Issue Pixel Thread Issue

Input Assembler

Host

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

L2

FB

L2

FB

L2

FB

L2

FB

L2

FB

5HK-UIUC

Simple Example:Matrix Multiplication

A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs

Leave shared memory usage until later

Local variable and register usage

Thread ID usage

Memory data transfer API between host and device

6HK-UIUC

Square Matrix Multiplication Example

P = M * N of size WIDTH x WIDTH Software Design Decisions:

One thread handles one element of P Each thread will access entries in M and

N WIDTH times from global memory

M

N

P

WID

TH

WID

TH

WIDTH WIDTH

7HK-UIUC

Step 1: Matrix Data Transfers(Host Side)

// Allocate the device memory where we will copy M to// Although not shown, you do the same for N and P matricesMatrix Md;#define WIDTH 16Md.width = WIDTH;Md.height = WIDTH;Md.pitch = WIDTH;int size = WIDTH * WIDTH * sizeof(float);cudaMalloc((void**)&Md.elements, size);

// Copy M from the host to the devicecudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

...//do your work here… (see next slides)

// Read the result from the device to the host into PcudaMemcpy(P.elements, Pd.elements, size, cudaMemcpyDeviceToHost);// Free device memorycudaFree(Md.elements);

8HK-UIUC

Step 2: Matrix MultiplicationA Simple Host Code in C

// Matrix multiplication on the (CPU) host in double precision;

void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P){ for (int i = 0; i < M.height; ++i) { for (int j = 0; j < N.width; ++j) { double sum = 0; for (int k = 0; k < M.width; ++k) { double a = M.elements[i * M.width + k]; //you’ll see a lot of this… double b = N.elements[k * N.width + j]; // and of this as well… sum += a * b; } P.elements[i * N.width + j] = sum; } }}

9HK-UIUC

Step 3: Matrix Multiplication, Host-side. Main Program Code

int main(void) { // Allocate and initialize the matrices. // The last argument in AllocateMatrix: should an initialization with // random numbers be done? Yes: 1. No: 0 (everything is set to zero) Matrix M = AllocateMatrix(WIDTH, WIDTH, 1); Matrix N = AllocateMatrix(WIDTH, WIDTH, 1); Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);

// M * N on the device MatrixMulOnDevice(M, N, P);

// Free matrices FreeMatrix(M); FreeMatrix(N); FreeMatrix(P);

return 0;} 10HK-UIUC

Step 3: Matrix MultiplicationHost-side code

11

// Setup the execution configuration

dim3 dimGrid(1, 1); dim3 dimBlock(WIDTH, WIDTH);

// Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

// Read P from the device CopyFromDeviceMatrix(P, Pd);

// Free device matrices FreeDeviceMatrix(Md); FreeDeviceMatrix(Nd); FreeDeviceMatrix(Pd);}

Continue here…

// Matrix multiplication on the devicevoid MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P){ // Load M and N to the device Matrix Md = AllocateDeviceMatrix(M); CopyToDeviceMatrix(Md, M); Matrix Nd = AllocateDeviceMatrix(N); CopyToDeviceMatrix(Nd, N);

// Allocate P on the device Matrix Pd = AllocateDeviceMatrix(P); CopyToDeviceMatrix(Pd, P); // clear memory

HK-UIUC

Multiply Using One Thread Block

One Block of threads computes matrix P Each thread computes one element of P

Each thread Loads a row of matrix M Loads a column of matrix N Perform one multiply and addition for each

pair of M and N elements Compute to off-chip memory access ratio

close to 1:1 Not that good…

Size of matrix limited by the number of threads allowed in a thread block

Grid 1

Block 1

3 2 5 4

2

4

2

6

48

Thread(2, 2)

BLOCK_SIZE

MP

N

12HK-UIUC

Step 4: Matrix Multiplication- Device-side Kernel Function

// Matrix multiplication kernel – thread specification__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P){ // 2D Thread ID int tx = threadIdx.x; int ty = threadIdx.y;

// Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0;

for (int k = 0; k < M.width; ++k) { float Melement = M.elements[ty * M.pitch + k]; float Nelement = Nd.elements[k * N.pitch + tx]; Pvalue += Melement * Nelement; }

// Write the matrix to device memory; // each thread writes one element P.elements[ty * P.pitch + tx] = Pvalue;}

13

M

N

P

WID

TH

WID

TH

WIDTH WIDTH

tx

ty

HK-UIUC

Step 5: Some Loose Ends

// Free a device matrix.void FreeDeviceMatrix(Matrix M) { cudaFree(M.elements);}

void FreeMatrix(Matrix M) { free(M.elements);}

14

// Allocate a device matrix of same size as M.Matrix AllocateDeviceMatrix(const Matrix M){ Matrix Mdevice = M; int size = M.width * M.height * sizeof(float); cudaMalloc((void**)&Mdevice.elements, size); return Mdevice;}

// Copy a host matrix to a device matrix.void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost){ int size = Mhost.width * Mhost.height * sizeof(float); cudaMemcpy(Mdevice.elements, Mhost.elements, size,

cudaMemcpyHostToDevice);}

// Copy a device matrix to a host matrix.void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice){ int size = Mdevice.width * Mdevice.height * sizeof(float); cudaMemcpy(Mhost.elements, Mdevice.elements, size,

cudaMemcpyDeviceToHost);} HK-UIUC

Step 6: Handling Arbitrary Sized Square Matrices

Have each 2D thread block to compute a (BLOCK_WIDTH)2 sub-matrix (tile) of the result matrix Each has (BLOCK_WIDTH)2 threads

Generate a 2D Grid of (WIDTH/BLOCK_WIDTH)2 blocks

M

N

P

WID

TH

WID

TH

WIDTH WIDTH

ty

tx

by

bx

NOTE: You still need to put a loop around the kernel call for cases where WIDTH is really large

15HK-UIUC

The Common Pattern to CUDA Programming

Phase 1: Allocate memory on the device and copy to the device the data required to carry out computation on the GPU

Phase 2: Let the GPU crunch the numbers for you based on the kernel that you define

Phase 3: Bring back the results from the GPU. Free memory on the device (clean up…). You’re done.

16

A Common Programming Pattern (Cntd.)BRINGING THE SHARED MEMORY INTO THE PICTURE

Local and global memory reside in device memory (DRAM) - much slower access than shared memory

An advantageous way of performing computation on the device is to block data to take advantage of fast shared memory:

Partition data into data subsets that fit into shared memory

Handle each data subset with one thread block by: Loading the subset from global memory to shared memory, using

multiple threads to exploit memory-level parallelism Performing the computation on the subset from shared memory; each

thread can efficiently multi-pass over any data element Copying results from shared memory back to global memory

17HK-UIUC

How about the memory not in registers or shared?

The local, global, constant, and texture spaces are regions of device memory

Beyond registers and shared memory, each multiprocessor has:

Read/Write global memory There is a lot of this…

A read-only constant cache To speed up access to the constant

memory space

A read-only texture cache To speed up access to the texture

memory space

Device

Stream Multiprocessor N

Stream Multiprocessor 2

Stream Multiprocessor 1

Device memory

Shared Memory

InstructionUnit

Processor 1

Registers

…Processor 2

Registers

Processor M

Registers

ConstantCache

TextureCache

Global, constant, texture memories

18

A Common Programming Pattern (cont.)

Texture and Constant memory also reside in device memory (DRAM) - much slower access than shared memory But… cached! Highly efficient access for read-only data

Carefully divide data according to access patterns R/O no structure constant memory R/O array structured texture memory R/W shared within Block shared memory R/W registers spill to local memory R/W inputs/results global memory

19HK-UIUC

G80 Hardware Implementation:A Set of SIMD Multiprocessors

The device is a set of Stream Multiprocessors (14 on 8800 GT, but that’s totally irrelevant)

Each multiprocessor has a collection of 32-bit Scalar Processors with a Single Instruction Multiple Data architecture – shared instruction unit

At each clock cycle, a Stream Multiprocessor executes the same instruction on a group of threads called a warp

The number of threads in a warp is the warp size

Device

Multiprocessor N

Multiprocessor 2

Multiprocessor 1

InstructionUnit

Scalar Processor 1

…Scalar Processor 2

Scalar Processor M

20HK-UIUC

Hardware Implementation: Execution Model (review)

Each block of a grid is split into warps, each gets executed by one Stream Multiprocessor (SM) The device processes only one grid at a time

Each block is executed by one Stream Multiprocessor The shared memory space resides in the on-chip shared memory

A Stream Multiprocessor can execute multiple blocks concurrently Shared memory and registers are partitioned among the threads of all

concurrent blocks So, decreasing shared memory usage (per block) and register usage (per

thread) increases number of blocks that can run concurrently, which is good Currently, up to eight blocks can be processed concurrently (this is bound to

change, as the HW changes…)

21

ME964 High Performance Computing for Engineering Applications

Documents

Transcript of ME964 High Performance Computing for Engineering Applications