Transcript of ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 23, 2013, MatrixMult.ppt
Matrix Multiplication Performance Improvements
These notes will introduce:
• Basic matrix multiplication on a 2-D grid/block (review)
• Limitations
• Block matrix multiplication
• Memory access patterns in matrix multiplication
• Coalescing memory accesses
• Transposing an array
• Using shared memory tiles
Matrix Multiplication
Matrix multiplication is an important operation in HPC and appears in many applications
C = A * B
where A, B, and C are matrices (two-dimensional arrays).

A restricted case is when B has only one column -- matrix-vector multiplication, which appears in the solution of systems of linear equations and in partial differential equations.
Matrix multiplication, C = A x B
Implementing Matrix Multiplication: Sequential Code

Assume the matrices are square (N x N matrices).

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        c[i][j] = 0;
        for (k = 0; k < N; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

Requires N^3 multiplications and N^3 additions, so the sequential time complexity is O(N^3). Very easy to parallelize.
CUDA Kernel for multiplying two arrays
__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
    int k, sum = 0;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N) {
        for (k = 0; k < N; k++)
            sum += gpu_a[row * N + k] * gpu_b[k * N + col];
        gpu_c[row * N + col] = sum;
    }
}

In this example, one thread computes one element of C, and the number of threads must be equal to or greater than the number of elements.
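For example (a minimal sketch; the 32 x 32 block size and the rounding are my assumptions, not shown on this slide), a launch that guarantees at least one thread per element rounds the grid dimensions up and relies on the kernel's bounds test:

dim3 Block(32, 32);                                    // threads per block
dim3 Grid((N + Block.x - 1) / Block.x,                 // round up so that
          (N + Block.y - 1) / Block.y);                // threads >= elements
gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);  // surplus threads fail the if-test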
Sequential version with flattened arrays, for comparison
void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
    int row, col, k, sum;
    for (row = 0; row < N; row++)          // row of a
        for (col = 0; col < N; col++) {    // column of b
            sum = 0;
            for (k = 0; k < N; k++)
                sum += cpu_a[row * N + k] * cpu_b[k * N + col];
            cpu_c[row * N + col] = sum;
        }
}
Matrix mapped onto 2-D grid and 2-D blocks

[Figure: an N x N array mapped onto a grid of 2-D blocks, one array element per thread. A thread's global indices into the array are:
    column: blockIdx.x * blockDim.x + threadIdx.x
    row:    blockIdx.y * blockDim.y + threadIdx.y]

Basically, the array is divided into "tiles" and one tile is mapped onto one block.
Complete Program (spread over several slides)

// Matrix multiplication program MatrixMult.cu, Barry Wilkinson, Dec. 28, 2010.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
    …
}

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
    …
}

int main(int argc, char *argv[]) {
    int i, j;                               // loop counters
    int Grid_Dim_x = 1, Grid_Dim_y = 1;     // grid structure values
    int Block_Dim_x = 1, Block_Dim_y = 1;   // block structure values
    int noThreads_x, noThreads_y;           // number of threads available in device, each dimension
    int noThreads_block;                    // number of threads in a block
    int N = 10;                             // size of array in each dimension
    int *a, *b, *c, *d;
    int *dev_a, *dev_b, *dev_c;
    int size;                               // number of bytes in arrays
    cudaEvent_t start, stop;                // using CUDA events to measure time,
    float elapsed_time_ms;                  // which is applicable for asynchronous code also

/* -------------------- ENTER INPUT PARAMETERS AND ALLOCATE DATA ----------------------- */
    …                                       // keyboard input

    dim3 Grid(Grid_Dim_x, Grid_Dim_y);      // grid structure
    dim3 Block(Block_Dim_x, Block_Dim_y);   // block structure, threads/block limited by specific device
    size = N * N * sizeof(int);             // number of bytes in total in arrays

    a = (int*) malloc(size);                // dynamically allocated memory for arrays on host
    b = (int*) malloc(size);
    c = (int*) malloc(size);                // results from GPU
    d = (int*) malloc(size);                // results from CPU
    …                                       // load arrays with some numbers
/* ------------- COMPUTATION DONE ON GPU ---------------------------- */

    cudaMalloc((void**)&dev_a, size);       // allocate memory on device
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    cudaEventRecord(start, 0);              // here start time, after memcpy

    gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaEventRecord(stop, 0);               // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);

Where you measure time will make a big difference.
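For comparison, a sketch that times only the kernel, excluding both copies (same events and variables as above; the synchronization is needed because kernel launches are asynchronous):

    cudaEventRecord(start, 0);              // start after the host-to-device copies
    gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);
    cudaEventRecord(stop, 0);               // stop before the device-to-host copy
    cudaEventSynchronize(stop);             // wait until the kernel has finished
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);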
/* ------------- COMPUTATION DONE ON HOST CPU ---------------------------- */

    cudaEventRecord(start, 0);              // use same timing*

    cpu_matrixmult(a, b, d, N);             // do calculation on host

    cudaEventRecord(stop, 0);               // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);  // exe. time

/* ------------------- check device creates correct results ----------------- */
    …
/* --------------------- repeat program ---------------------------------------- */
    …                                       // while loop to repeat calc with different parameters
/* -------------- clean up --------------------------------------- */
    free(a); free(b); free(c); free(d);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}

* I have found problems if the CPU timing is done before the GPU timing. Anyone else?
Some Preliminaries: Effects of First Launch

The program is written so that it can repeat with different parameters without stopping, to eliminate the effect of the first kernel launch.

Repetition might also take advantage of caching, but this seems less significant than the first-launch effect.
Some results

32 x 32 array of random numbers 0-9; 1 block of 32 x 32 threads.
Speedup = 1.65 (first run).
Check that the CPU and GPU give the same answers.
Some results

32 x 32 array; 1 block of 32 x 32 threads.
Speedup = 2.12 (second run).
Some results

32 x 32 array; 1 block of 32 x 32 threads.
Speedup = 2.16 (third run). Subsequent runs vary between 2.12 and 2.18.
Some results

256 x 256 array; 8 x 8 blocks of 32 x 32 threads.
Speedup = 151.86.
Some results

1024 x 1024 array; 32 x 32 blocks of 32 x 32 threads.
Speedup = 860.9.
Some results

2048 x 2048 array; 64 x 64 blocks of 32 x 32 threads.
Speedup = ?? The GPU appears to freeze. Why?

2^11 x 2^11 threads = 2^22 threads.
Memory needed per array = 2^22 integers = 2^24 bytes = 16 Mbytes.
The maximum number of threads on the GPU appears to be 2^16 x 2^16 = 2^32 threads, and the server has 4 Gbytes of main memory.
Different Array Sizes

Block size 32 x 32; number of blocks to suit the array size.

Array size      Speedup*    Speedup**
32 x 32         2.18        1.26
256 x 256       151.86      110.63
1024 x 1024     860.9       733.44
2048 x 2048     ??
4096 x 4096

* These results include the time of the memory copy back from the device but not the memory copy to the device.
** These results include the time of the memory copy back from the device and the memory copy to the device.
GPU Limitations

The previous program has limitations:

• The number of threads must be equal to or greater than the number of array elements (the code will not work if the number of threads < the number of array elements).
• The number of threads/block and blocks/grid has GPU limits, which limit the size of arrays that can be processed.
• Keyboard input must be checked for invalid grid and block values.
• There are memory bandwidth issues.
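One way to guard the keyboard input against invalid values is to query the device limits at run time; a minimal sketch using the standard CUDA device-query API:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                              // properties of device 0
printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("Max block dims: %d x %d\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1]);
printf("Max grid dims: %d x %d\n", prop.maxGridSize[0], prop.maxGridSize[1]);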
Compute capability 1.x

Maximum number of threads per block = 512
Maximum size of x- and y-dimensions of a thread block = 512
Maximum size of each dimension of a grid = 65535

A maximum of 512 threads per block means a square 2-D block cannot be greater than 16 x 16 (256 threads).

So the maximum square array size is 16 x 65535 (about 2^4 x 2^16 = 2^20), i.e. an array of about 2^20 x 2^20.

Is this a problem?
Compute capability 2.x (coit-grid06.uncc.edu)

Maximum number of threads per block = 1024
Maximum size of x- and y-dimensions of a thread block = 1024
Maximum size of each dimension of a grid = 65535

A maximum of 1024 threads per block means a square 2-D block cannot be greater than 32 x 32. Now all 1024 threads are allocated.

So the maximum square array size is 32 x 65535 (about 2^5 x 2^16 = 2^21), i.e. an array of about 2^21 x 2^21.

Is this a problem?
Increasing the size of arrays beyond the thread limitation of the GPU
-- using fewer threads than array elements*

Actually this one is easy, and can draw upon the regular technique of sub-matrix multiplication:

[Figure: a matrix divided into sub-matrices.]

* Would you want to use fewer threads than elements? Not in the textbooks (this is not their "tiling").
To demonstrate that sub-matrix multiplication produces the correct final answer, consider simple 2 x 2 sub-matrices:
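A minimal worked instance (my notation, not from the slide): partition two 4 x 4 matrices A and B into four 2 x 2 sub-matrices each. Block multiplication then reads

\[
\begin{pmatrix} A_{0,0} & A_{0,1} \\ A_{1,0} & A_{1,1} \end{pmatrix}
\begin{pmatrix} B_{0,0} & B_{0,1} \\ B_{1,0} & B_{1,1} \end{pmatrix}
=
\begin{pmatrix}
A_{0,0}B_{0,0} + A_{0,1}B_{1,0} & A_{0,0}B_{0,1} + A_{0,1}B_{1,1} \\
A_{1,0}B_{0,0} + A_{1,1}B_{1,0} & A_{1,0}B_{0,1} + A_{1,1}B_{1,1}
\end{pmatrix}
\]

Expanding any element of a sum of sub-matrix products gives exactly the dot product of the corresponding full row of A with the full column of B, so the block result equals the ordinary product.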
Suppose the N x N matrices are divided into s^2 sub-matrices. Each sub-matrix has N/s x N/s elements.
Pseudo Code for Block Multiplication

for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        Cp,q = 0;                          // clear elements of submatrix
        for (r = 0; r < s; r++)            // submatrix multiplication
            Cp,q = Cp,q + Ap,r * Br,q;     // add to accumulating submatrix
    }

Cp,q means the submatrix at row p and column q of matrix C. Cp,q = Cp,q + Ap,r * Br,q means multiply submatrices Ap,r and Br,q using matrix multiplication and add to submatrix Cp,q using matrix addition.
Code for Block Multiplication

N = …;                                 // no of elements in cols/rows of matrices
s = …;                                 // no of sub-matrices in each dimension
ss = N/s;                              // no of elements in sub-matrix cols/rows

for (i = 0; i < N; i += ss)            // go through sub-matrices of A
    for (j = 0; j < N; j += ss) {      // and sub-matrices of B

        for (p = i; p < i + ss; p++)   // clear elements of sub-matrix of C
            for (q = j; q < j + ss; q++)
                C[p][q] = 0;

        for (r = 0; r < N; r += ss)            // step through sub-matrix rows/columns
            for (p = i; p < i + ss; p++)       // sub-matrix multiplication,
                for (q = j; q < j + ss; q++)   // accumulating into the C sub-matrix
                    for (k = r; k < r + ss; k++)
                        C[p][q] += A[p][k] * B[k][q];
    }

Not tested yet. Mistakes?

Each thread does one value of i and j, so s x s threads. N/s must be an integer.
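A hedged sketch of a GPU version (my own and, like the slide's code, untested): each thread computes one ss x ss sub-matrix of C, i.e. one (i, j) pair, so fewer threads than elements are used.

// Each thread computes one ss x ss sub-matrix of C (one (i, j) pair).
// Launch with (N/ss) x (N/ss) threads in total across a 2-D grid.
__global__ void gpu_blockmatrixmult(int *a, int *b, int *c, int N, int ss) {
    int i = (blockIdx.y * blockDim.y + threadIdx.y) * ss;   // sub-matrix row origin
    int j = (blockIdx.x * blockDim.x + threadIdx.x) * ss;   // sub-matrix column origin
    if (i < N && j < N)
        for (int p = i; p < i + ss; p++)
            for (int q = j; q < j + ss; q++) {
                int sum = 0;
                for (int k = 0; k < N; k++)     // full dot product for element (p, q)
                    sum += a[p * N + k] * b[k * N + q];
                c[p * N + q] = sum;
            }
}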
Effects of memory access in matrix multiplication

One thread is responsible for computing one result, Cij, and needs to access one row of A and one column of B:

[Figure: a thread reading a row of A and a column of B.]

Each thread accesses one row of A and one column of B. There are N^2 row/column combinations, hence N^2 threads.
Seen another way, in the first time period each thread accesses the first element of a row of A:

[Figure: Thread 0, …, Thread i, …, Thread N-1, each accessing a different row.]

Consider those threads that access different rows. Given the row-major order in which A is stored, the locations those threads access are not consecutive.

Bad -- cannot do memory coalescing.

Question: how many threads access the same location?
Next, each thread accesses the first element of a column of B:

[Figure: Thread 0, …, Thread i, …, Thread N-1, each accessing a different column.]

Consider those threads that access different columns. Given the row-major order in which B is stored, the locations those threads access are consecutive.

Good -- can do memory coalescing.

Question: how many threads access the same location?
How can we get better memory accesses and memory coalescing?

1. Transpose one array

Copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly.

(Not mentioned in the course textbook or the other NVIDIA book, although it appears the obvious way -- see next for whether it works!)
Sequential code for a transpose using the same array:

for (i = 0; i < N; i++)
    for (j = 0; j < i; j++) {
        temp = B[i][j];
        B[i][j] = B[j][i];
        B[j][i] = temp;
    }

(In my code, I use separate arrays.)

This could be done on the host prior to copying to the device. What would the code look like on the device?
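As a hedged sketch (mine, not the course code), a naive out-of-place transpose on the device, one element per thread:

// Naive out-of-place transpose, one element per thread. No attempt is
// made to coalesce the writes; a shared-memory tile would fix that.
__global__ void gpu_transpose(int *in, int *out, int N) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (row < N && col < N)
        out[col * N + row] = in[row * N + col];   // element (row, col) -> (col, row)
}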
/* ------ COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY ----- */

    transposeArray(a, a_T, N);              // transpose array

    cudaEventRecord(start, 0);              // time measured from here: after the transpose,
                                            // before the host-device copy
//  cudaEventSynchronize(start);            // needed?

    cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);  // copy transposed A
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);    // copy B

    gpu_matrixmult_T<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

    cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);

    cudaEventRecord(stop, 0);               // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms2, start, stop);

    printf("Time to calculate results on GPU with transposed array: %f ms.\n", elapsed_time_ms2);  // print out execution time
Some results

8 x 8 array; 1 block of 8 x 8 threads.
Speedup = 1.62 over not transposing the array.
Some results

32 x 32 array; 1 block of 32 x 32 threads.
Speedup = 1.17 over not transposing the array.
Some results

256 x 256 array; 8 x 8 blocks of 32 x 32 threads.
Speedup = 0.89!! over not transposing the array.
Some results

1024 x 1024 array; 32 x 32 blocks of 32 x 32 threads.
Speedup = 0.93!! over not transposing the array.
2. Using shared memory with "tiling"

Based on Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.

[Figure: copies of tiles (As, Bs) made in shared memory.]

Note: this is not block matrix multiplication; the fundamental algorithm is just divided into phases.
Developing Code

1. Declaring shared memory:

__global__ void gpu_matrixmult(int* Md, int* Nd, int* Pd, int Width) {

    __shared__ int Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ int Nds[TILE_WIDTH][TILE_WIDTH];

This needs static memory allocation, so the tile size must be fixed at compile time. It is convenient to choose the same size as the kernel block, say 32 x 32.
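For instance (an assumption; the definition is not shown on the slide), the tile size could be fixed with a preprocessor constant:

#define TILE_WIDTH 32    // tile size fixed at compile time, matching the 32 x 32 block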
2. Indexing into shared memory:

For convenience, declare tile (block) and thread indices, and indices into the final C array:

int bx = blockIdx.x;    // tile (block) indices
int by = blockIdx.y;
int tx = threadIdx.x;   // thread indices
int ty = threadIdx.y;

To access (later):

Mds[ty][tx] = …    // element associated with thread
Nds[ty][tx] = …
3. Global address:

For convenience, declare row and column:

int Row = by * TILE_WIDTH + ty;    // global indices
int Col = bx * TILE_WIDTH + tx;

Note: these are the same as the usual global thread ID and result index in normal matrix multiplication:

int row = threadIdx.y + blockDim.y * blockIdx.y;
int col = threadIdx.x + blockDim.x * blockIdx.x;
4. Loading shared memory:

Done using SIMT thread collaboration (very tricky):

for (int m = 0; m < N/TILE_WIDTH; m++) {    // for each tile along the row or column

    // All elements on the tile row transferred, one per thread (indexed by thread ID)
    Mds[ty][tx] = Md[Row*N + (m*TILE_WIDTH + tx)];

    // All elements on the tile column transferred, one per thread (indexed by thread ID)
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*N + Col];

    __syncthreads();                        // wait for all threads in the block

    // do matrix multiplication operations on pair of tiles …

The book says this achieves memory coalescing, although it does not look like it does in both cases.
Example (3 x 3 tiles)

Based upon Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.

[Figure: global arrays A and B each divided into 3 x 3 tiles, indexed by by/bx, with tile number m = 0, 1, 2 and element offset k within a tile.]

Global memory address in A: Row*N + (m*TILE_WIDTH + tx)
Global memory address in B: (m*TILE_WIDTH + ty)*N + Col

where m = tile number (0, 1, and 2), Row = by * TILE_WIDTH + ty, and Col = bx * TILE_WIDTH + tx.
5. Matrix multiplication in shared memory:

int Pvalue = 0;

for (int m = 0; m < N/TILE_WIDTH; m++) {

    …    // copy tiles to shared memory (step 4)

    for (int k = 0; k < TILE_WIDTH; k++)       // multiply tiles,
        Pvalue += Mds[ty][k] * Nds[k][tx];     // accumulating values
}

Pd[Row * N + Col] = Pvalue;    // copy back to global memory
}
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {

    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;    // thread ID

    int Row = by * TILE_WIDTH + ty;    // identify row, column of Pd element to work on
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    for (int m = 0; m < Width/TILE_WIDTH; m++) {    // loop over Md, Nd tiles to compute Pd element
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];    // load Md, Nd tiles into shared memory
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; k++)
            Pvalue += Mds[ty][k] * Nds[k][tx];
    }
    Pd[Row][Col] = Pvalue;    // see note below: will not compile as is
}

Code given in the book Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.11, page 110.

A mistake in the book: Pd is a pointer and the size of the array is unknown, so this will not compile as is. A flattened index is needed.

Note that copying to shared memory is collaborative between threads, each thread doing one element of each array. It is in the wrong place on page 110, although this does not affect the final answer!
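As a one-line correction (my suggestion, using the flattened index the note calls for):

    Pd[Row * Width + Col] = Pvalue;    // flattened 1-D index into Pd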
Some results

32 x 32 array; 1 block of 32 x 32 threads. (Results of the second run.)

Speedup, GPU to CPU: 2.12
Speedup, GPU shared memory to CPU: 2.73
Speedup, GPU shared memory to GPU without shared memory: 1.28
Some results

256 x 256 array; 8 x 8 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 153
Speedup, GPU shared memory to CPU: 217
Speedup, GPU shared memory to GPU without shared memory: 1.41
Some results

1024 x 1024 array; 32 x 32 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 864
Speedup, GPU shared memory to CPU: 2214 !!!
Speedup, GPU shared memory to GPU without shared memory: 2.56
Some results

2048 x 2048 array; 64 x 64 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 989
Speedup, GPU shared memory to CPU: 2962 !!
Speedup, GPU shared memory to GPU without shared memory: 2.99
Different Array Sizes

Block size 32 x 32; number of blocks to suit the array size. Speedups:

Array size      GPU to CPU   GPU using shared    GPU using shared memory to
                             memory to CPU       GPU not using shared memory
32 x 32         2.12         2.73                1.28
256 x 256       153          217                 1.41
1024 x 1024     864          2214                2.56
2048 x 2048     989          2962                2.99
4096 x 4096
Bandwidth improvements by using shared memory

Using 32 x 32 tiles reduces the number of global memory accesses by a factor of 32 (two global transfers per thread for each tile, instead of 2 x 32).

According to the PMPP book, page 90, using 16 x 16 tiles:

"it allows the 86.4 GB/s global bandwidth to serve a much larger floating-point computation rate … Can now support 86.4/4 x 16 = 345.6 gigaflops, very close to the peak floating point performance of the G80 … effectively removes global memory bandwidth as the major limiting factor of matrix multiplication performance."
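Spelling out the arithmetic in that quote: each float is 4 bytes, and a 16 x 16 tile lets each loaded value be reused 16 times, so

\[
\frac{86.4\ \text{GB/s}}{4\ \text{bytes/float}} \times 16 = 345.6\ \text{Gflops}.
\]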
Conclusions

Using the shared-memory algorithm can make a significant difference: up to 3 times as fast on the GPU, compared with not using it, in the presented tests.

Speedup of almost 3000 over the CPU! (Note, though, that the CPU is an old processor.)
This topic may be explored further in an assignment
We need better tools. NVIDIA offers a debugging tool called Parallel Nsight for Visual Studio/Windows:
http://parallelnsight.nvidia.com/
Further Reading

Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk and Wen-mei W. Hwu, Morgan Kaufmann, 2010.

This book covers only NVIDIA GPUs and CUDA programming, despite its title.

Questions
Things in the PMPP book (Ch. 6) not covered yet:
• Dynamic partitioning of SM resources
• Data prefetching
• Instruction usage
• Thread granularity

Also note that page 108 says memory coalescing is not needed for shared memory!