ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 23, 2013 MatrixMult
ITCS 5/4145 Parallel computing, B. Wilkinson, April 11, 2013. CUDAMultiDimBlocks.ppt
CUDA Grids, Blocks, and Threads
These notes will introduce:
• One-dimensional and multidimensional grids and blocks
• How the grid and block structures are defined in CUDA
• Predefined CUDA variables
• Adding vectors using one-dimensional structures
• Adding/multiplying arrays using 2-dimensional structures
Grids, Blocks, and Threads
NVIDIA GPUs consist of an array of execution cores, each of which can support a large number of threads, many more than the number of cores.
Threads are grouped into “blocks”. Blocks can be 1, 2, or 3 dimensional.
Each kernel call uses a “grid” of blocks. Grids can be 1, 2, or 3 dimensional (3-D available for recent GPUs).
The programmer needs to specify the grid/block organization on each kernel call (which can be different each time), within limits set by the GPU.
CUDA SIMT Thread Structure
Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU.
Linked to internal organization.
Threads in one block execute together.
[Figure: a grid of blocks and a block of threads. Grid: can be 1, 2, or 3 dimensions (compute capability >= 2, see next). Block: can be 1, 2, or 3 dimensions. Source: CUDA C Programming Guide, v3.2, 2010, NVIDIA.]
Device characteristics -- some limitations
NVIDIA defines “compute capabilities”, 1.0, 1.1, …, each with its own limits and supported features.

Compute capability                           1.0 (min)   2.x*    3.0/3.5
Grid:
  Max dimensionality                         2           3       3
  Max size of each dimension (x, y, z),      65535       65535   2^31 – 1 (2,147,483,647) in x;
  i.e. no. of blocks in each dimension                           65535 in y and z
Blocks:
  Max dimensionality                         3           3       3
  Max sizes of x- and y-dimension            512         1024    1024
  Max size of z-dimension                    64          64      64
  Max number of threads per block overall    512         1024    1024

* coit-grid06 and coit-grid07 have C2050s (compute capability 2.0). coit-grid08.uncc.edu has a K20 (compute capability 3.5), the most recent compute capability as of April 2013.
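The block limits above interact: a block can satisfy every per-dimension limit and still exceed the overall thread limit. The following is a minimal host-side C sketch of that check, not part of the original notes. The names `BlockLimits`, `CC_2X`, and `block_fits` are hypothetical, and the values are copied from the compute capability 2.x column.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of the per-block limits listed in the table. */
typedef struct {
    unsigned max_block_xy;           /* max size of a block in x and in y */
    unsigned max_block_z;            /* max size of a block in z */
    unsigned max_threads_per_block;  /* max threads per block overall */
} BlockLimits;

/* Values copied from the compute capability 2.x column. */
static const BlockLimits CC_2X = { 1024, 64, 1024 };

/* Returns true if a block of bx * by * bz threads respects all limits. */
static bool block_fits(BlockLimits lim, unsigned bx, unsigned by, unsigned bz) {
    assert(bx > 0 && by > 0 && bz > 0);
    if (bx > lim.max_block_xy || by > lim.max_block_xy) return false;
    if (bz > lim.max_block_z) return false;
    /* Per-dimension limits are not sufficient: the product is capped too. */
    return (unsigned long long)bx * by * bz <= lim.max_threads_per_block;
}
```

Under these assumptions a 32 x 32 block fits (1024 threads), while 64 x 32 does not, even though 64 and 32 are each below the per-dimension limit.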
Defining Grid/Block Structure
Need to provide each kernel call with values for:
• Number of blocks in each dimension
• Threads per block in each dimension
myKernel<<< B, T >>>(arg1, … );
B – a structure that defines the number of blocks in the grid in each dimension (1-D, 2-D, or 3-D).
T – a structure that defines the number of threads in a block in each dimension (1-D, 2-D, or 3-D).
1-D grid and/or 1-D blocks
If a 1-D structure is wanted, an integer can be used for B and T in:
myKernel<<< B, T >>>(arg1, … );
B – An integer defines a 1-D grid of that size
T – An integer defines a 1-D block of that size
Example
myKernel<<< 1, 100 >>>(arg1, … );
CUDA Built-in Variables for a 1-D grid and 1-D block
threadIdx.x -- “thread index” within block in “x” dimension
blockIdx.x -- “block index” within grid in “x” dimension
blockDim.x -- “block dimension” in “x” dimension (i.e. number of threads in block in x dimension)
Full global thread ID in x dimension can be computed by:
x = blockIdx.x * blockDim.x + threadIdx.x;
Example -- x direction: a 1-D grid and 1-D block
4 blocks, each having 8 threads (gridDim = 4 x 1, blockDim = 8 x 1)
[Figure: four blocks, blockIdx.x = 0 to 3, each holding threads threadIdx.x = 0 to 7. Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.]
Global thread ID = blockIdx.x * blockDim.x + threadIdx.x = 3 * 8 + 2 = 26, i.e. thread 2 of block 3 is thread 26 with linear global addressing.
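The global-ID formula can be tried out on the host. This is a small C sketch that emulates the 4-block, 8-thread example by looping over block and thread indices. The function names are hypothetical, and the loops only stand in for what the GPU does in parallel.

```c
#include <assert.h>

/* Host-side stand-in for: blockIdx.x * blockDim.x + threadIdx.x */
static int global_id_1d(int block_idx, int block_dim, int thread_idx) {
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

/* Writes the global ID of every thread into ids[], block by block,
   and returns the total number of threads visited. */
static int enumerate_ids(int grid_dim, int block_dim, int *ids) {
    int n = 0;
    for (int b = 0; b < grid_dim; b++)
        for (int t = 0; t < block_dim; t++)
            ids[n++] = global_id_1d(b, block_dim, t);
    return n;
}
```

With grid_dim = 4 and block_dim = 8 the IDs run from 0 to 31 with no gaps or repeats, and global_id_1d(3, 8, 2) gives 26 as in the example.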
Code example with a 1-D grid and blocks: vector addition
#define N 2048   // size of vectors
#define T 256    // number of threads per block

__global__ void vecAdd(int *a, int *b, int *c) {
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   c[i] = a[i] + b[i];
}

int main (int argc, char **argv ) {
   …
   vecAdd<<<N/T, T>>>(devA, devB, devC);   // assumes N/T is an integer
   …
   return (0);
}
N/T is the number of blocks needed to map each vector across the grid, with one element of each vector per thread.
Note: __global__ is a CUDA function qualifier (__ is two underscores). A __global__ function must return void.
If N/T is not necessarily an integer:
#define N 2000   // size of vectors
#define T 256    // number of threads per block

__global__ void vecAdd(int *a, int *b, int *c) {
   int i = blockIdx.x*blockDim.x + threadIdx.x;
   if (i < N) c[i] = a[i] + b[i];   // allows for more threads than vector elements; some threads are unused
}

int main (int argc, char **argv ) {
   int blocks = (N + T - 1) / T;   // efficient way of rounding up to the next integer
   …
   vecAdd<<<blocks, T>>>(devA, devB, devC);
   …
   return (0);
}
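The rounding expression (N + T - 1) / T can be checked in isolation. Here is a minimal C sketch of it, together with the number of threads that end up doing nothing because they fail the `if (i < N)` test. The function names are hypothetical.

```c
#include <assert.h>

/* Smallest number of blocks such that blocks * t >= n,
   i.e. the integer ceiling of n / t. */
static int blocks_needed(int n, int t) {
    assert(n >= 0 && t > 0);
    return (n + t - 1) / t;
}

/* Threads launched beyond the n elements; in the kernel these
   fail the (i < N) guard and do no work. */
static int idle_threads(int n, int t) {
    return blocks_needed(n, t) * t - n;
}
```

For example, blocks_needed(2000, 256) is 8, since plain integer division 2000 / 256 would give 7 and leave the last 208 elements without a thread.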
Questions
How many threads are created?
How many threads are unused?
What is the maximum number of threads that can be created in a GPU on coit-grid06/7?
On coit-grid08?
Higher dimensional grids/blocks
A 1-D grid and 1-D block are suitable for processing one-dimensional data.
Higher-dimensional grids and blocks are convenient for higher-dimensional data.
Processing 2-D arrays might use a two-dimensional grid and two-dimensional blocks.
Higher dimensions might also be needed because of the limit on the size of a block in each dimension.
CUDA provides built-in variables and structures to define the number of blocks in the grid in each dimension and the number of threads in a block in each dimension.
Built-in CUDA data types and structures
CUDA Vector Types/Structures
uint3 and dim3 can be considered essentially as CUDA-defined structures of unsigned integers x, y, z, i.e.
struct uint3 { unsigned int x, y, z; };
struct dim3 { unsigned int x, y, z; };
Used to define the grid of blocks and threads, see next.
For dim3, unassigned structure components are automatically set to 1.
There are other CUDA vector types.
Built-in Variables for Grid/Block Sizes
dim3 gridDim -- Grid dimensions, x, y, z.
Number of blocks in grid = gridDim.x * gridDim.y * gridDim.z
dim3 blockDim -- Size of block dimensions x, y, and z.
Number of threads in a block = blockDim.x * blockDim.y * blockDim.z
Example: Initializing Values
To set values in each dimension, use for example:
dim3 grid(16, 16);    // Grid -- 16 x 16 blocks
dim3 block(32, 32);   // Block -- 32 x 32 threads
…
myKernel<<<grid, block>>>(...);
which sets, when the kernel is called:
gridDim.x = 16
gridDim.y = 16
gridDim.z = 1
blockDim.x = 32
blockDim.y = 32
blockDim.z = 1
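To see how many blocks and threads such a configuration produces, the dim3 behaviour can be imitated in plain C. This is only a host-side sketch: `Dim3`, `dim3_2d`, and `dim3_count` are hypothetical names, and the default of 1 for z is written out by hand because C has no constructors.

```c
#include <assert.h>

/* Hypothetical host-side stand-in for CUDA's dim3. */
typedef struct { unsigned x, y, z; } Dim3;

/* Imitates dim3 d(x, y): the unassigned z component becomes 1. */
static Dim3 dim3_2d(unsigned x, unsigned y) {
    assert(x > 0 && y > 0);
    Dim3 d = { x, y, 1 };
    return d;
}

/* Product of the three components: number of blocks in a grid,
   or number of threads in a block. */
static unsigned long long dim3_count(Dim3 d) {
    return (unsigned long long)d.x * d.y * d.z;
}
```

For the 16 x 16 grid of 32 x 32 blocks above, dim3_count gives 256 blocks of 1024 threads each, so 262144 threads in total.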
CUDA Built-in Variables for Grid/Block Indices
uint3 blockIdx -- block index within grid:
blockIdx.x, blockIdx.y, blockIdx.z
uint3 threadIdx -- thread index within block:
threadIdx.x, threadIdx.y, threadIdx.z
(blockIdx and threadIdx are CUDA structures.)
2-D: Full global thread ID in x and y dimensions can be computed by:
x = blockIdx.x * blockDim.x + threadIdx.x;
y = blockIdx.y * blockDim.y + threadIdx.y;
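2-D Grids and 2-D blocks
[Figure: a thread inside a 2-D grid of 2-D blocks, located within its block by threadIdx.x and threadIdx.y. Its global coordinates are:]
x: blockIdx.x * blockDim.x + threadIdx.x
y: blockIdx.y * blockDim.y + threadIdx.y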
Flattening arrays onto linear memory
Generally memory is allocated dynamically on the device (GPU), and we cannot use two-dimensional indices (e.g. a[row][column]) to access an array as we might otherwise. (Why?)
We will need to know how the array is laid out in memory and then compute the distance from the beginning of the array.
C uses row-major order --- rows are stored one after the other in memory, i.e. row 0 then row 1 etc.
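Flattening an array
[Figure: a 2-D array with N columns, indices running from 0 to N-1. Element a[row][column] lies row * N elements (row * number of columns) plus column elements from the start of the array.]
a[row][column] = a[offset]
offset = column + row * N
where N is the number of columns in the array.
Note: Another way to flatten an array is:
offset = row + column * N
where N is then the number of rows (column-major order). We will come back to this later as it does have very significant consequences on performance.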
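The two flattening schemes differ only in which index is multiplied by the array dimension. A minimal C sketch of both, with hypothetical function names, makes the difference concrete:

```c
#include <assert.h>

/* Row-major (as used by C): rows are stored one after another,
   so n_cols is the length of one row. */
static int offset_row_major(int row, int col, int n_cols) {
    assert(col >= 0 && col < n_cols);
    return col + row * n_cols;
}

/* Column-major: columns are stored one after another,
   so n_rows is the length of one column. */
static int offset_col_major(int row, int col, int n_rows) {
    assert(row >= 0 && row < n_rows);
    return row + col * n_rows;
}
```

For a 4 x 4 array, element (row 1, column 2) sits at offset 6 in row-major order and at offset 9 in column-major order, so neighbouring threads touch different memory locations depending on the scheme chosen.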
Using CUDA variables
int col = blockIdx.x*blockDim.x+threadIdx.x;
int row = blockIdx.y*blockDim.y+threadIdx.y;
int index = col + row * N;
a[index] = …
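The same three lines can be exercised on the host by passing the built-in values in as ordinary arguments. In this sketch `Idx2` is a hypothetical stand-in for the .x and .y fields of blockIdx, blockDim, and threadIdx, and an N x N row-major array is assumed.

```c
#include <assert.h>

/* Hypothetical stand-in for the .x/.y fields of the CUDA built-ins. */
typedef struct { int x, y; } Idx2;

/* Flattened row-major index of the element handled by one thread
   of an n x n array. */
static int flat_index(Idx2 block_idx, Idx2 block_dim, Idx2 thread_idx, int n) {
    int col = block_idx.x * block_dim.x + thread_idx.x;
    int row = block_idx.y * block_dim.y + thread_idx.y;
    assert(col < n && row < n);   /* thread must lie inside the array */
    return col + row * n;
}
```

For example, with 16 x 16 blocks, thread (3, 5) of block (1, 2) has col = 19 and row = 37, which for N = 2048 is index 19 + 37 * 2048 = 75795.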
Example using 2-D grid and 2-D blocks: adding two arrays
Corresponding elements of each array are added together to form the element of a third array.
CUDA version using 2-D grid and 2-D blocks: adding two arrays
#define N 2048   // size of arrays

__global__ void addMatrix (int *a, int *b, int *c) {
   int col = blockIdx.x*blockDim.x+threadIdx.x;
   int row = blockIdx.y*blockDim.y+threadIdx.y;
   int index = col + row * N;
   if (col < N && row < N) c[index] = a[index] + b[index];
}

int main() {
   ...
   dim3 block (16,16);
   dim3 grid (N/block.x, N/block.y);   // assumes N is a multiple of the block size
   addMatrix<<<grid, block>>>(devA, devB, devC);
   …
}
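Without a GPU, the effect of the launch can be imitated by looping over every block and thread index on the host. The sketch below is such an emulation, not the author's program. It assumes square blocks and, unlike the code above, rounds the grid size up so that N need not be a multiple of the block size. All names are hypothetical.

```c
#include <assert.h>

/* Kernel body for one emulated thread: (bx, by) is the block index,
   (tx, ty) the thread index, bdim the block size in x and y. */
static void add_matrix_thread(const int *a, const int *b, int *c, int n,
                              int bx, int by, int tx, int ty, int bdim) {
    int col = bx * bdim + tx;
    int row = by * bdim + ty;
    int index = col + row * n;
    if (col < n && row < n) c[index] = a[index] + b[index];
}

/* Emulates addMatrix<<<grid, block>>> sequentially and returns the
   number of threads that would have been launched. */
static int add_matrix_emulated(const int *a, const int *b, int *c,
                               int n, int bdim) {
    assert(n > 0 && bdim > 0);
    int gdim = (n + bdim - 1) / bdim;   /* round up to cover all n */
    int threads = 0;
    for (int by = 0; by < gdim; by++)
        for (int bx = 0; bx < gdim; bx++)
            for (int ty = 0; ty < bdim; ty++)
                for (int tx = 0; tx < bdim; tx++) {
                    add_matrix_thread(a, b, c, n, bx, by, tx, ty, bdim);
                    threads++;
                }
    return threads;
}
```

With n = 5 and 2 x 2 blocks the grid is 3 x 3, so 36 threads are emulated for 25 elements. The guard `col < n && row < n` is what keeps the 11 surplus threads from writing outside the arrays.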
Matrix Multiplication
Matrix multiplication is an important operation in HPC and appears in many applications.
C = A * B
where A, B, and C are matrices (two-dimensional arrays).
A restricted case is when B has only one column, giving matrix-vector multiplication, which appears in the representation of linear equations and partial differential equations.
Matrix multiplication, C = A x B
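For reference, the element-wise definition behind C = A x B can be written out as follows, assuming square N x N matrices indexed from 0 (the same assumption these notes make for the code):

```latex
c_{i,j} = \sum_{k=0}^{N-1} a_{i,k}\, b_{k,j}, \qquad 0 \le i, j < N
```

Each element of C is the dot product of row i of A with column j of B, which is why every element can be computed independently of the others.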
Implementing Matrix Multiplication: Sequential Code
Assume the matrices are square (N x N matrices).
for (i = 0; i < N; i++)
   for (j = 0; j < N; j++) {
      c[i][j] = 0;
      for (k = 0; k < N; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }
Requires N^3 multiplications and N^3 additions, giving a sequential time complexity of O(N^3). Very easy to parallelize.
CUDA Kernel for multiplying two arrays
__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
   int k, sum = 0;
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   if (col < N && row < N) {
      for (k = 0; k < N; k++)
         sum += gpu_a[row * N + k] * gpu_b[k * N + col];
      gpu_c[row * N + col] = sum;
   }
}
In this example, one thread computes one element of C, so the number of threads must be equal to or greater than the number of elements.
Sequential version with flattened arrays for comparison
void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
   int row, col, k, sum;
   for (row = 0; row < N; row++)         // row of a
      for (col = 0; col < N; col++) {    // column of b
         sum = 0;
         for (k = 0; k < N; k++)
            sum += cpu_a[row * N + k] * cpu_b[k * N + col];
         cpu_c[row * N + col] = sum;
      }
}
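As a quick check of the flattened indexing, the same loop nest can be run on a small case by hand. The sketch below repeats the loops in a hypothetical function `matmul_flat` that also counts the multiply-add steps, to illustrate the N^3 operation count.

```c
#include <assert.h>

/* Same loop nest as the flattened sequential version; additionally
   returns the number of multiply-add steps performed (n * n * n). */
static long matmul_flat(const int *a, const int *b, int *c, int n) {
    assert(n > 0);
    long steps = 0;
    for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++) {
            int sum = 0;
            for (int k = 0; k < n; k++) {
                sum += a[row * n + k] * b[k * n + col];
                steps++;
            }
            c[row * n + col] = sum;
        }
    return steps;
}
```

For the 2 x 2 matrices {1, 2, 3, 4} and {5, 6, 7, 8} (stored row by row) the result is {19, 22, 43, 50}, obtained in 2^3 = 8 multiply-add steps.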
Matrix mapped on 2-D Grids and 2-D blocks
[Figure: an array laid over a grid of blocks. The thread at threadIdx.x, threadIdx.y within a block handles element A[row][column].]
column = blockIdx.x * blockDim.x + threadIdx.x
row = blockIdx.y * blockDim.y + threadIdx.y
Arrays are mapped onto the structure, one element per thread.
Basically the array is divided into “tiles” and one tile is mapped onto one block.
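The tile mapping can also be read in the opposite direction: given an array element, which block and which thread within that block handle it? A minimal C sketch of that inverse mapping follows. `Owner` and `owner_of` are hypothetical names, and the division and remainder simply undo the index formulas above.

```c
#include <assert.h>

/* Hypothetical record of which block and thread handle one element. */
typedef struct { int block_x, block_y, thread_x, thread_y; } Owner;

/* Inverts column = blockIdx.x * blockDim.x + threadIdx.x (same for row). */
static Owner owner_of(int row, int col, int bdim_x, int bdim_y) {
    assert(row >= 0 && col >= 0 && bdim_x > 0 && bdim_y > 0);
    Owner o;
    o.block_x  = col / bdim_x;   /* which tile across */
    o.thread_x = col % bdim_x;   /* position inside the tile */
    o.block_y  = row / bdim_y;   /* which tile down */
    o.thread_y = row % bdim_y;
    return o;
}
```

With 16 x 16 blocks, element A[37][19] belongs to block (1, 2) and, inside that tile, to thread (3, 5).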
Complete Program (several slides)
// Matrix multiplication program MatrixMult.cu, Barry Wilkinson, Dec. 28, 2010.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
   …
}

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
   …
}

int main(int argc, char *argv[]) {
   int i, j;                              // loop counters
   int Grid_Dim_x=1, Grid_Dim_y=1;        // Grid structure values
   int Block_Dim_x=1, Block_Dim_y=1;      // Block structure values
   int noThreads_x, noThreads_y;          // number of threads available in device, each dimension
   int noThreads_block;                   // number of threads in a block
   int N = 10;                            // size of array in each dimension
   int *a,*b,*c,*d;
   int *dev_a, *dev_b, *dev_c;
   int size;                              // number of bytes in arrays
   cudaEvent_t start, stop;               // using cuda events to measure time
   float elapsed_time_ms;                 // which is applicable for asynchronous code also

   /* --------------------ENTER INPUT PARAMETERS AND ALLOCATE DATA -----------------------*/
   …                                      // keyboard input

   dim3 Grid(Grid_Dim_x, Grid_Dim_y);     // Grid structure
   dim3 Block(Block_Dim_x,Block_Dim_y);   // Block structure, threads/block limited by specific device
   size = N * N * sizeof(int);            // number of bytes in total in arrays

   a = (int*) malloc(size);               // dynamically allocated memory for arrays on host
   b = (int*) malloc(size);
   c = (int*) malloc(size);               // results from GPU
   d = (int*) malloc(size);               // results from CPU
   …                                      // load arrays with some numbers
   /* ------------- COMPUTATION DONE ON GPU ----------------------------*/

   cudaMalloc((void**)&dev_a, size);      // allocate memory on device
   cudaMalloc((void**)&dev_b, size);
   cudaMalloc((void**)&dev_c, size);

   cudaMemcpy(dev_a, a , size ,cudaMemcpyHostToDevice);
   cudaMemcpy(dev_b, b , size ,cudaMemcpyHostToDevice);

   cudaEventRecord(start, 0);             // here start time, after memcpy

   gpu_matrixmult<<<Grid,Block>>>(dev_a,dev_b,dev_c,N);
   cudaMemcpy(c, dev_c, size , cudaMemcpyDeviceToHost);

   cudaEventRecord(stop, 0);              // measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms, start, stop );

   printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);

Where you measure time will make a big difference.
   /* ------------- COMPUTATION DONE ON HOST CPU ----------------------------*/

   cudaEventRecord(start, 0);             // use same timing*

   cpu_matrixmult(a,b,d,N);               // do calculation on host

   cudaEventRecord(stop, 0);              // measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms, start, stop );

   printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);   // exe. time

   /* ------------------- check device creates correct results -----------------*/
   …
   /* --------------------- repeat program ----------------------------------------*/
   …                                      // while loop to repeat calc with different parameters
   /* -------------- clean up ---------------------------------------*/

   free(a); free(b); free(c); free(d);
   cudaFree(dev_a);
   cudaFree(dev_b);
   cudaFree(dev_c);
   cudaEventDestroy(start);
   cudaEventDestroy(stop);
   return 0;
}
Some Preliminaries: Effects of First Launch
The program is written so that it can be repeated with different parameters without stopping, to eliminate the effect of the first kernel launch.
Repeated runs might also take advantage of caching, although that seems less significant than the first-launch effect.
Some results
Random numbers 0-9, 32 x 32 array, 1 block of 32 x 32 threads.
First run: speedup = 1.65.
Check that both CPU and GPU give the same answers.
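One simple way to make that check is an element-by-element comparison of the two result arrays. The notes do not show how the original program does it, so the following C sketch is only an assumption, and `first_mismatch` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first element at which the two result
   arrays differ, or -1 if all count elements are identical. */
static long first_mismatch(const int *gpu_result, const int *cpu_result,
                           size_t count) {
    assert(gpu_result != NULL && cpu_result != NULL);
    for (size_t i = 0; i < count; i++)
        if (gpu_result[i] != cpu_result[i]) return (long)i;
    return -1;
}
```

In the program above this could be called as first_mismatch(c, d, N * N) after both computations. Because the arrays hold integers, an exact comparison is appropriate here; floating-point results would need a tolerance instead.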
Some results
32 x 32 array, 1 block of 32 x 32 threads.
Second run: speedup = 2.12.
Some results
32 x 32 array, 1 block of 32 x 32 threads.
Third run: speedup = 2.16.
Subsequent runs vary between 2.12 and 2.18.
Some results
256 x 256 array, 8 x 8 blocks of 32 x 32 threads.
Speedup = 151.86.
Some results
1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.
Speedup = 860.9.
Questions