CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists...
Transcript of CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists...
![Page 1: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/1.jpg)
CSE 599 IAccelerated Computing -
Programming GPUSParallel Patterns: Graph Search
![Page 2: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/2.jpg)
Objective
● Study graph search as a prototypical graph-based algorithm● Learn techniques to mitigate the memory-bandwidth-centric nature of
graph-based algorithms● Introduce work queues and see how they fit into a massively parallel
programming framework
![Page 3: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/3.jpg)
Data Parallelism / Data-Dependent Execution
StencilHistogram
SpMV
Prefix Scan
Data Parallel
NotData
Parallel
Data-DependentData-Independent
MergeGraph Search
![Page 4: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/4.jpg)
Massive Graph Applications
● Social media connection graphs● Driving directions● Telecommunication networks● Manufacturing process dependencies● Computation graph● 3D Meshes● Graphical models
Massive graphs tend to be sparse!
![Page 5: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/5.jpg)
Graph Review
0
1
2
7
6
5
4
3
8
Graph
0 1 2 3 4 5 6 7 8
0 1 1
1 1 1
2 1 1 1
3 1 1
4 1 1
5 1
6 1
7 1 1
8
Adjacency Matrix
![Page 6: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/6.jpg)
Column Indices
Graph Review
0
1
2
7
6
5
4
3
8
Graph
Adjacency Matrix (CSR)
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Values
1
2
3
4
5
6
7
4
8
5
8
6
8
0
6
0
2
4
7
9
11
12
13
15
15
Row PointersDestinations
![Page 7: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/7.jpg)
Breadth-First Search
Problem:
Given a source node S, find the number of steps required to reach each node N in the graph
Given this labelling of the graph, one can easily find a shortest path from S to a destination T
![Page 8: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/8.jpg)
Breadth First Search
0
Source = 0Round 0
0:
1:
2:
3:
4:
5:
6:
7:
8:
![Page 9: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/9.jpg)
Breadth First Search
0
1
1
Source = 0Round 1
0:
1:
2:
3:
4:
5:
6:
7:
8:
![Page 10: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/10.jpg)
4:
Breadth First Search
0
1
1
2
2
2
2
2
Source = 0Round 2
0:
1:
2:
3:
5:
6:
7:
8:
![Page 11: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/11.jpg)
Breadth First Search
0
1
1
2
2
2
2
2
3
Source = 0Round 3
0:
1:
2:
3:
4:
5:
6:
7:
8:
![Page 12: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/12.jpg)
Breadth First Search
0
1
1
2
2
2
2
2
3
Source = 0
0:
1:
2:
3:
4:
5:
6:
7:
8:
![Page 13: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/13.jpg)
2:
Breadth First Search
0
Source = 2Round 0
0:
1:
3:
4:
5:
6:
7:
8:
![Page 14: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/14.jpg)
2:
Breadth First Search
0
1
1
1
Source = 2Round 1
0:
1:
3:
4:
5:
6:
7:
8:
![Page 15: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/15.jpg)
2:
Breadth First Search
2
0
1
1
1
2
Source = 2Round 2
0:
1:
3:
4:
5:
6:
7:
8:
![Page 16: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/16.jpg)
2:
Breadth First Search
2
3
0
1
1
1
2
Source = 2Round 3
0:
1:
3:
4:
5:
6:
7:
8:
![Page 17: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/17.jpg)
4:
2:
Breadth First Search
2
3
0
1
1
1
4
4
2
Source = 2Round 4
0:
1:
3:
5:
6:
7:
8:
![Page 18: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/18.jpg)
2:
Breadth First Search
2
3
0
1
1
1
4
4
2
Source = 2
0:
1:
3:
4:
5:
6:
7:
8:
![Page 19: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/19.jpg)
Sequential BFS
Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language - provided data structures (e.g. queue)
We’ll look at a C-style implementation to ease the translation into CUDA
![Page 20: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/20.jpg)
Sequential BFS
void BFS_sequential(int source, const int * rowPointers, const int * destinations, int * distances) {
int frontier[2][MAX_FRONTIER_SIZE];int * currentFrontier = &frontier[0];int currentFrontierSize = 0;int * previousFrontier = &frontier[1];int previousFrontierSize = 0;
insertIntoFrontier(source, previousFrontier, &previousFrontierSize);distances[source] = 0;
while (previousFrontierSize > 0) {// visit all vertices on the previous frontierfor (int f = 0; f < previousFrontierSize; f++) {
const int currentVertex = previousFrontier[f];// check all outgoing edgesfor (int i = rowPointers[currentVertex]; i < rowPointers[currentVertex+1]; ++i) {
if (distances[destinations[i]] == -1) {// this vertex has not been visited yetinsertIntoFrontier(destinations[i], currentFrontier, ¤tFrontierSize);distances[destinations[i]] = distances[currentVertex] + 1;
}}
}swap(currentFrontier, previousFrontier);previousFrontierSize = currentFrontierSize;currentFrontierSize = 0;
}
}
In practice, we’d want to check for and handle overflow here
![Page 21: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/21.jpg)
Sequential BFS
void insertIntoFrontier(int vertex, int * frontier, int * frontierSize) {
frontier[*frontierSize] = vertex;++(*frontierSize);
}
![Page 22: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/22.jpg)
One Parallelization Approach
● Assign one thread per vertex● For each iteration, check all incoming edges to see if the source vertex was
just visited in the last iteration; if so, mark as visited in this iteration
● Not very work efficient; O(VL) for V = number of vertices, L = length of longest path
● Difficult to detect stopping criterion
![Page 23: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/23.jpg)
A More Work-Efficient Approach
● Parallelize each individual iteration of the while loop in the sequential BFS code
● Assign a section of the vertices in the previous frontier to each thread● Introduce a synchronization point at the end of each iteration
![Page 24: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/24.jpg)
Parallel BFS Host Codevoid BFS_host(int source, const int * rowPointers, const int * destinations, int * distances) {
int dFrontier[2][MAX_FRONTIER_SIZE];int * dCurrentFrontierSize;int * dPreviousFrontierSize; int hPreviousFrontierSize;int * dVisited;
int * dCurrentFrontier = &frontier[0];int * dPreviousFrontier = &frontier[1];
// allocate device memory, copy memory from device to host, initialize frontier sizes, etc....
hPreviousFrontierSize = 1;
while (hPreviousFrontierSize > 0) {int numBlocks = (hPreviousFrontierSize-1) / BLOCK_SIZE + 1;
BFS_Bqueue_kernel<<<numBlocks, BLOCK_SIZE>>>(dPreviousFrontier, dPreviousFrontierSize,dCurrentFrontier, dCurrentFrontierSize,dRowPointers, dDestinations, dDistances, dVisited);
swap(dCurrentFrontier,dPreviousFrontier);cudaMemcpy(dPreviousFrontierSize, dCurrentFrontierSize, sizeof(int), cudaMemcpyDeviceToDevice);cudaMemset(dCurrentFrontierSize, 0, sizeof(int));
cudaMemcpy(&hPreviousFrontierSize, dPreviousFrontierSize, sizeof(int), cudaMemcpyDeviceToHost);
}
}
![Page 25: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/25.jpg)
Output Interference in BFS
1
1
0
● We’ll use flags to mark whether or not a vertex has been visited● From a correctness perspective, output interference on flags can be ignored● However, this will lead to additional work
A
B
C
Previous queue: A B
Current queue:
Without synchronization:
alreadyVisited? 0
alreadyVisited? 0
1
11
C C
1
1
0
A
B
C
Previous queue: A B
Current queue:
With atomicExch:
alreadyVisited? 1
alreadyVisited? 0
1
11
C
![Page 26: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/26.jpg)
Basic Parallel BFS Kernel__global__ void BFS_Bqueue_kernel(const int * previousFrontier, const int * previousFrontierSize, int * currentFrontier, int * currentFrontierSize, const int * rowPointers, const int * destinations, int * distances, int * visited) {
const int t = threadIdx.x + blockDim.x * blockIdx.x;if (t < *previousFrontierSize) {
const int vertex = previousFrontier[t];for (int i = rowPointers[vertex]; i < rowPointers[vertex+1]; ++i) {
// check visitation atomically, avoiding redundant expansionconst int alreadyVisited = atomicExch(&(visited[destinations[i]]),1);if (!alreadyVisited) {
// we’re visiting a new vertex: get a spot in line atomicallyconst int queueIndex = atomicAdd(currentFrontierSize, 1);
// place the vertex in line linecurrentFrontier[queueIndex] = destinations[i];
}}
}__syncthreads();
}
![Page 27: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/27.jpg)
Output Interference in Frontier Queue
● There is also output interference in placing vertices in the queue● Synchronization is strictly required here for correct output● This is a bottleneck of the basic kernel
0
...Block 0 Block 1 Block N
12
![Page 28: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/28.jpg)
Privatization of the Frontier Queue
● We can make a private, block-level copy of the frontier queue● Once complete, the private queues are combined to form the global queue
0
...
0 0
0
Block 0 Block NBlock 1
1 1 1222
2
3 34
310
![Page 29: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/29.jpg)
Parallel BFS Kernel with Privatization__global__ void BFS_Bqueue_kernel(const int * previousFrontier, const int * previousFrontierSize, int * currentFrontier, int * currentFrontierSize, const int * rowPointers, const int * destinations, int * distances, int * visited) {
__shared__ int sharedCurrentFrontier[BLOCK_QUEUE_SIZE];__shared__ int sharedCurrentFrontierSize, blockGlobalQueueIndex;
if (threadIdx.x == 0) sharedCurrentFrontierSize = 0;__syncthreads();
const int t = threadIdx.x + blockDim.x * blockIdx.x;if (t < *previousFrontierSize) {
const int vertex = previousFrontier[t];for (int i = rowPointers[vertex]; i < rowPointers[vertex+1]; ++i) {
const int alreadyVisited = atomicExch(&(visited[destinations[i]]),1);if (!alreadyVisited) {
distances[destinations[i]] = distances[i] + 1;const int sharedQueueIndex = atomicAdd(&sharedCurrentFrontierSize,1);if (sharedQueueIndex < BLOCK_QUEUE_SIZE) { // there is space in the local queue sharedCurrentFrontier[sharedQueueIndex] = destinations[i];} else { // go directly to the global queue
sharedCurrentFrontierSize = BLOCK_QUEUE_SIZE;const int globalQueueIndex = atomicAdd(currentFrontierSize, 1);currentFrontier[globalQueueIndex] = destinations[i];
}}
}}__syncthreads();
if (threadIdx.x == 0) blockGlobalQueueIndex = atomicAdd(currentFrontierSize, sharedCurrentFrontierSize);__syncthreads();
for (int i = threadIdx.x; i < sharedCurrentFrontierSize; i += blockDim.x) {currentFrontier[blockGlobalQueueIndex + i] = sharedCurrentFrontier[i];
}}
![Page 30: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/30.jpg)
Remaining Issues
● Irregular global memory access○ Access patterns depend on graph structure and is unpredictable
● Kernel launch overhead ○ There is little parallel work in iterations with narrow frontiers
● Block-level queue length counter contention○ Better than before, but there will still be many serialized atomic operations
● Load imbalance○ Vertices can have vastly different numbers of outgoing edges
![Page 31: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/31.jpg)
BFS Results in Highly Irregular Memory Access
0 6
0 2 4 7 9 10 13 16 17 20 21
2 5 5 6 0 4 5 2 5 7 1 2 4 4 8
1 1 0 0 0 0 1 1 0 0 0visited
destinations
rowPointers
previousFrontier
...
...
...
![Page 32: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/32.jpg)
Global Memory Bandwidth Limitations Got You Down?
Is the usage pattern known, local, and not
more than ~48KB?
Is read-only memory OK?
n
Shared memory!y
Is it less than ~64KB? y
Hope you’ve been nice to your
L2 cache...
n
Constant memory!
y
Texture memory!n
![Page 33: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/33.jpg)
Texture Memory
● Texture memory is another form of global memory● Like constant memory, it is aggressively cached for read-only access● Originally developed and optimized for storing and reading textures for
graphics applications○ Has hardware-level support for 1-,2-, or3-D layouts and interpolated reads○ The texture cache is also spatial layout-aware
● Can be useful for irregular access patterns with un-coalesced reads
![Page 34: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/34.jpg)
Using Texture Memory
Declaration:
texture<int, 1, cudaReadModeElementType> rowPointersTexture;
Host side:
int * hRowPointers;int rowPointersLength;cudaArray * texArray = 0;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<int>();
cudaMallocArray(&texArray, &channelDesc, rowPointersLength);
cudaMemcpyToArray(texArray, 0, 0, hRowPointers, rowPointersLength*sizeof(int), cudaMemcpyHostToDevice);
cudaBindTextureToArray(rowPointersTexture, texArray);
Device side:
for (int i = tex1D(rowPointersTexture,vertex); i < tex1D(rowPointersTexture,vertex+1); ++i)...
![Page 35: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/35.jpg)
Remaining Issues
● Irregular global memory access○ Access patterns depend on graph structure and is unpredictable
● Kernel launch overhead ○ There is little parallel work in iterations with narrow frontiers
● Block-level queue length counter contention○ Better than before, but there will still be many serialized atomic operations
● Load imbalance○ Vertices can have vastly different numbers of outgoing edges
✔
![Page 36: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/36.jpg)
Kernel Launch Overhead
For some iterations of BFS (especially near the beginning), the frontier can be quite small
The benefits of parallelism only outweigh the kernel launch overhead when the frontier becomes large enough
Some options:
1. Use the CPU if the frontier size dips below some threshold2. Create a single-block variant of the BFS kernel that iterates until its
block-level queue is full before returning to the host
![Page 37: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/37.jpg)
Using the CPU for Small Frontiers
// is the most up-to-date frontier information on host or device?bool currentDataOnDevice = false;
while (hPreviousFrontierSize > 0) {int numBlocks = (hPreviousFrontierSize-1) / BLOCK_SIZE + 1;
if (numBlocks < NUM_BLOCKS_THRESHOLD) {if (currentDataOnDevice) {
// copy data to host ...
}BFS_iterate_sequential(hPreviousFrontier, hPreviousFrontierSize,
hCurrentFrontier, hCurrentFrontierSize, rowPointers, destinations, distances);
currentDataOnDevice = false;
} else {if (!currentDataOnDevice) {
// copy data to device...
}BFS_Bqueue_kernel<<<numBlocks, BLOCK_SIZE>>>(dPreviousFrontier, dPreviousFrontierSize,
dCurrentFrontier, dCurrentFrontierSize,dRowPointers, dDestinations, dDistances, dVisited);
currentDataOnDevice = true;
}
...
![Page 38: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/38.jpg)
Remaining Issues
● Irregular global memory access○ Access patterns depend on graph structure and is unpredictable
● Kernel launch overhead ○ There is little parallel work in iterations with narrow frontiers
● Block-level queue length counter contention○ Better than before, but there will still be many serialized atomic operations
● Load imbalance○ Vertices can have vastly different numbers of outgoing edges
✔✔
![Page 39: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/39.jpg)
Block-Level Queue Contention
While the block-level queues reduced contention for global memory, the block-level counter is now the bottleneck
We can extend the hierarchy by further splitting the block-level queue
![Page 40: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/40.jpg)
Three-Level Queue Hierarchy
Global queue:
Block 0
Block queue:
Sub-queue 0: Sub-queue 1:
Sub-queue 2: Sub-queue 3:
Block 1
Block queue:
Sub-queue 0: Sub-queue 1:
Sub-queue 2: Sub-queue 3:
![Page 41: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/41.jpg)
Sub-Queue Assignment
Sub-queue 0: Sub-queue 1: Sub-queue 2: Sub-queue 3:
0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127...threadidx.x:
subQueueIndex = threadIdx.x / (blockDim.x / NUM_SUB_QUEUES);
This mapping means threads in the same warp will be using the same queue!
![Page 42: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/42.jpg)
Sub-Queue Assignment
Sub-queue 0: Sub-queue 1: Sub-queue 2: Sub-queue 3:
0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127...threadidx.x:
subQueueIndex = threadIdx.x % NUM_SUB_QUEUES;
Much better!
threadIdx.x & (NUM_SUB_QUEUES-1);
Shortcut if the number of queues is a power of 2
![Page 43: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/43.jpg)
Remaining Issues
● Irregular global memory access○ Access patterns depend on graph structure and is unpredictable
● Kernel launch overhead ○ There is little parallel work in iterations with narrow frontiers
● Block-level queue length counter contention○ Better than before, but there will still be many serialized atomic operations
● Load imbalance○ Vertices can have vastly different numbers of outgoing edges
✔✔✔
![Page 44: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/44.jpg)
Load Imbalance
Load imbalance is caused by a data dependency and is thus tricky to avoid
Two potential strategies:
1. Delay the assignment of work to threads until after the total amount of work to be done is known
2. Spawn new threads when needed to account for additional work
Tune in to the next lecture on CUDA dynamic parallelism!
This could result in additional code complexity, higher register usage, and / or more synchronization
![Page 45: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/45.jpg)
Conclusion / Takeaways
● Graphs can be processed in parallel!● Texture memory can help with large, read-only memory w/ irregular access● Work queues can be used to track tasks of varying size● Privatization (and multi-level privatization hierarchies) can be used to
reduce contention for work queue insertion
![Page 46: CSE 599 I Accelerated Computing - Programming GPUStws10/cse599i/CSE... · Most computer scientists are familiar with C++ / Java / Python / etc - style BFS implementations using language](https://reader034.fdocuments.in/reader034/viewer/2022042916/5f56a63597d22f61e028c425/html5/thumbnails/46.jpg)
Sources
https://www.wikipedia.org/
Cheng, John, Max Grossman, and Ty McKercher. Professional Cuda C Programming. John Wiley & Sons, 2014.
Hwu, Wen-mei, and David Kirk. "Programming massively parallel processors." Special Edition 92 (2009).
Wilt, Nicholas. The cuda handbook: A comprehensive guide to gpu programming. Pearson Education, 2013.