ME964 High Performance Computing for Engineering Applications
description
Transcript of ME964 High Performance Computing for Engineering Applications
ME964High Performance Computing for Engineering Applications
CUDA API
Sept. 18, 2008
Before we get started…
Last Time The CUDA API Start discussing CUDA programming model
Today HW2 is due on Friday, 11:59 PM HW3 available on the class website (due Sept.25) A look at a matrix multiplication example The CUDA execution model
2
split personality
n.
Two distinct personalities in the same entity, each of which prevails at a particular time.
3
Going back to the G80 HW…
HK-UIUC
__global__ void KernelFunc(...); // declaration
dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);
Calling a Kernel Function, and the Concept of Execution Configuration
A kernel function must be called with an execution configuration:
4
Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit sync needed for blocking
HK-UIUC
Last slide of previous lecture
G80 Thread Computing Pipeline
Processors execute computing threads Alternative operating mode specifically for computing
Load/store
Global Memory
Thread Execution Manager
Input Assembler
Host
Texture Texture Texture Texture Texture Texture Texture TextureTexture
Parallel DataCache
Parallel DataCache
Parallel DataCache
Parallel DataCache
Parallel DataCache
Parallel DataCache
Parallel DataCache
Parallel DataCache
Load/store Load/store Load/store Load/store Load/store
The future of GPUs is programmable processing So – build the architecture around the processor
L2
FB
SP SP
L1
TF
Th
rea
d P
roc
es
so
r
Vtx Thread Issue
Setup / Rstr / ZCull
Geom Thread Issue Pixel Thread Issue
Input Assembler
Host
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
L2
FB
L2
FB
L2
FB
L2
FB
L2
FB
5HK-UIUC
Simple Example:Matrix Multiplication
A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
Leave shared memory usage until later
Local variable and register usage
Thread ID usage
Memory data transfer API between host and device
6HK-UIUC
Square Matrix Multiplication Example
P = M * N of size WIDTH x WIDTH Software Design Decisions:
One thread handles one element of P Each thread will access entries in M and
N WIDTH times from global memory
M
N
P
WID
TH
WID
TH
WIDTH WIDTH
7HK-UIUC
Step 1: Matrix Data Transfers(Host Side)
// Allocate the device memory where we will copy M to// Although not shown, you do the same for N and P matricesMatrix Md;#define WIDTH 16Md.width = WIDTH;Md.height = WIDTH;Md.pitch = WIDTH;int size = WIDTH * WIDTH * sizeof(float);cudaMalloc((void**)&Md.elements, size);
// Copy M from the host to the devicecudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);
...//do your work here… (see next slides)
// Read the result from the device to the host into PcudaMemcpy(P.elements, Pd.elements, size, cudaMemcpyDeviceToHost);// Free device memorycudaFree(Md.elements);
8HK-UIUC
Step 2: Matrix MultiplicationA Simple Host Code in C
// Matrix multiplication on the (CPU) host in double precision;
void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P){ for (int i = 0; i < M.height; ++i) { for (int j = 0; j < N.width; ++j) { double sum = 0; for (int k = 0; k < M.width; ++k) { double a = M.elements[i * M.width + k]; //you’ll see a lot of this… double b = N.elements[k * N.width + j]; // and of this as well… sum += a * b; } P.elements[i * N.width + j] = sum; } }}
9HK-UIUC
Step 3: Matrix Multiplication, Host-side. Main Program Code
int main(void) { // Allocate and initialize the matrices. // The last argument in AllocateMatrix: should an initialization with // random numbers be done? Yes: 1. No: 0 (everything is set to zero) Matrix M = AllocateMatrix(WIDTH, WIDTH, 1); Matrix N = AllocateMatrix(WIDTH, WIDTH, 1); Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);
// M * N on the device MatrixMulOnDevice(M, N, P);
// Free matrices FreeMatrix(M); FreeMatrix(N); FreeMatrix(P);
return 0;} 10HK-UIUC
Step 3: Matrix MultiplicationHost-side code
11
// Setup the execution configuration
dim3 dimGrid(1, 1); dim3 dimBlock(WIDTH, WIDTH);
// Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);
// Read P from the device CopyFromDeviceMatrix(P, Pd);
// Free device matrices FreeDeviceMatrix(Md); FreeDeviceMatrix(Nd); FreeDeviceMatrix(Pd);}
Continue here…
// Matrix multiplication on the devicevoid MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P){ // Load M and N to the device Matrix Md = AllocateDeviceMatrix(M); CopyToDeviceMatrix(Md, M); Matrix Nd = AllocateDeviceMatrix(N); CopyToDeviceMatrix(Nd, N);
// Allocate P on the device Matrix Pd = AllocateDeviceMatrix(P); CopyToDeviceMatrix(Pd, P); // clear memory
HK-UIUC
Multiply Using One Thread Block
One Block of threads computes matrix P Each thread computes one element of P
Each thread Loads a row of matrix M Loads a column of matrix N Perform one multiply and addition for each
pair of M and N elements Compute to off-chip memory access ratio
close to 1:1 Not that good…
Size of matrix limited by the number of threads allowed in a thread block
Grid 1
Block 1
3 2 5 4
2
4
2
6
48
Thread(2, 2)
BLOCK_SIZE
MP
N
12HK-UIUC
Step 4: Matrix Multiplication- Device-side Kernel Function
// Matrix multiplication kernel – thread specification__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P){ // 2D Thread ID int tx = threadIdx.x; int ty = threadIdx.y;
// Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0;
for (int k = 0; k < M.width; ++k) { float Melement = M.elements[ty * M.pitch + k]; float Nelement = Nd.elements[k * N.pitch + tx]; Pvalue += Melement * Nelement; }
// Write the matrix to device memory; // each thread writes one element P.elements[ty * P.pitch + tx] = Pvalue;}
13
M
N
P
WID
TH
WID
TH
WIDTH WIDTH
tx
ty
HK-UIUC
Step 5: Some Loose Ends
// Free a device matrix.void FreeDeviceMatrix(Matrix M) { cudaFree(M.elements);}
void FreeMatrix(Matrix M) { free(M.elements);}
14
// Allocate a device matrix of same size as M.Matrix AllocateDeviceMatrix(const Matrix M){ Matrix Mdevice = M; int size = M.width * M.height * sizeof(float); cudaMalloc((void**)&Mdevice.elements, size); return Mdevice;}
// Copy a host matrix to a device matrix.void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost){ int size = Mhost.width * Mhost.height * sizeof(float); cudaMemcpy(Mdevice.elements, Mhost.elements, size,
cudaMemcpyHostToDevice);}
// Copy a device matrix to a host matrix.void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice){ int size = Mdevice.width * Mdevice.height * sizeof(float); cudaMemcpy(Mhost.elements, Mdevice.elements, size,
cudaMemcpyDeviceToHost);} HK-UIUC
Step 6: Handling Arbitrary Sized Square Matrices
Have each 2D thread block to compute a (BLOCK_WIDTH)2 sub-matrix (tile) of the result matrix Each has (BLOCK_WIDTH)2 threads
Generate a 2D Grid of (WIDTH/BLOCK_WIDTH)2 blocks
M
N
P
WID
TH
WID
TH
WIDTH WIDTH
ty
tx
by
bx
NOTE: You still need to put a loop around the kernel call for cases where WIDTH is really large
15HK-UIUC
The Common Pattern to CUDA Programming
Phase 1: Allocate memory on the device and copy to the device the data required to carry out computation on the GPU
Phase 2: Let the GPU crunch the numbers for you based on the kernel that you define
Phase 3: Bring back the results from the GPU. Free memory on the device (clean up…). You’re done.
16
A Common Programming Pattern (Cntd.)BRINGING THE SHARED MEMORY INTO THE PICTURE
Local and global memory reside in device memory (DRAM) - much slower access than shared memory
An advantageous way of performing computation on the device is to block data to take advantage of fast shared memory:
Partition data into data subsets that fit into shared memory
Handle each data subset with one thread block by: Loading the subset from global memory to shared memory, using
multiple threads to exploit memory-level parallelism Performing the computation on the subset from shared memory; each
thread can efficiently multi-pass over any data element Copying results from shared memory back to global memory
17HK-UIUC
How about the memory not in registers or shared?
The local, global, constant, and texture spaces are regions of device memory
Beyond registers and shared memory, each multiprocessor has:
Read/Write global memory There is a lot of this…
A read-only constant cache To speed up access to the constant
memory space
A read-only texture cache To speed up access to the texture
memory space
Device
Stream Multiprocessor N
Stream Multiprocessor 2
Stream Multiprocessor 1
Device memory
Shared Memory
InstructionUnit
Processor 1
Registers
…Processor 2
Registers
Processor M
Registers
ConstantCache
TextureCache
Global, constant, texture memories
18
A Common Programming Pattern (cont.)
Texture and Constant memory also reside in device memory (DRAM) - much slower access than shared memory But… cached! Highly efficient access for read-only data
Carefully divide data according to access patterns R/O no structure constant memory R/O array structured texture memory R/W shared within Block shared memory R/W registers spill to local memory R/W inputs/results global memory
19HK-UIUC
G80 Hardware Implementation:A Set of SIMD Multiprocessors
The device is a set of Stream Multiprocessors (14 on 8800 GT, but that’s totally irrelevant)
Each multiprocessor has a collection of 32-bit Scalar Processors with a Single Instruction Multiple Data architecture – shared instruction unit
At each clock cycle, a Stream Multiprocessor executes the same instruction on a group of threads called a warp
The number of threads in a warp is the warp size
Device
Multiprocessor N
Multiprocessor 2
Multiprocessor 1
InstructionUnit
Scalar Processor 1
…Scalar Processor 2
Scalar Processor M
20HK-UIUC
Hardware Implementation: Execution Model (review)
Each block of a grid is split into warps, each gets executed by one Stream Multiprocessor (SM) The device processes only one grid at a time
Each block is executed by one Stream Multiprocessor The shared memory space resides in the on-chip shared memory
A Stream Multiprocessor can execute multiple blocks concurrently Shared memory and registers are partitioned among the threads of all
concurrent blocks So, decreasing shared memory usage (per block) and register usage (per
thread) increases number of blocks that can run concurrently, which is good Currently, up to eight blocks can be processed concurrently (this is bound to
change, as the HW changes…)
21