Eikonal equation solver in CUDA
Daniel Ganellari
University of Graz
January 28, 2016
Daniel Ganellari (IMSC) Eikonal solver in CUDA January 28, 2016 1 / 18
Overview
1 Introduction to the main Algorithm
  Parallel prefix sum (SCAN) with CUDA
  Device data preparation for coalesced accessing
  Eikonal CUDA Algorithm
2 Results and Profiling
3 Conclusions and Outlook
4 References
Definition and Citation
Blelloch [1]: "All-prefix-sums is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm."
Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements
[a0, a1, ..., an−1]
and returns the array
[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−1)] (inclusive scan)
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−2)] (exclusive scan)
where I is the neutral element w.r.t. ⊕.
Example: If ⊕ is addition, then the exclusive scan of the array [3 1 7 0 4 1 6 3] returns [0 3 4 11 11 15 16 22].
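The definition above can be made concrete with a short host-side C++ sketch of the sequential exclusive scan (the function name is an illustrative choice), reproducing the addition example:

```cpp
#include <cassert>
#include <vector>

// Exclusive sum scan: out[j] = a[0] + ... + a[j-1], with out[0] = I = 0,
// the neutral element of +.
std::vector<int> exclusive_scan(const std::vector<int>& a) {
    std::vector<int> out(a.size());
    int sum = 0;                    // running prefix, starts at the identity
    for (std::size_t j = 0; j < a.size(); ++j) {
        out[j] = sum;               // write the sum of all earlier elements
        sum += a[j];                // then fold in the current element
    }
    return out;
}
```

Running it on [3 1 7 0 4 1 6 3] yields [0 3 4 11 11 15 16 22], matching the example.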
Sequential and Parallel Scan
In general, all-prefix-sums can be used to convert some sequential computations into equivalent, but parallel, computations as shown in Table 1.
Sequential:
    out[0] = 0;
    for j from 1 to n do
        out[j] = out[j-1] + f(in[j-1]);

Parallel:
    forall j in parallel do
        temp[j] = f(in[j]);
    all_prefix_sums(out, temp);

Table 1: A sequential computation and its parallel equivalent.
There are many uses for all-prefix-sums: sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, and building histograms and data structures (graphs, trees, etc.) in parallel. For example applications, we refer the reader to the survey by Blelloch [1].
A Work-Efficient Parallel Scan
The up-sweep (reduce) phase of a work-efficient sum scan algorithm (after Blelloch [1]).
Steps: O(log n)   Work: O(n)
A Work-Efficient Parallel Scan
The down-sweep phase of a work-efficient parallel sum scan algorithm.
Steps: O(log n)   Work: O(n)
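The two phases can be sketched sequentially in host-side C++ (the CUDA kernel runs the inner loops in parallel across threads; the function name is an illustrative choice). The up-sweep builds partial sums in a balanced tree; the down-sweep clears the root and pushes the sums back down, leaving an exclusive scan in place:

```cpp
#include <cassert>
#include <vector>

// In-place work-efficient exclusive sum scan after Blelloch.
// Precondition: a.size() is a power of two.
void blelloch_scan(std::vector<int>& a) {
    const std::size_t n = a.size();
    // Up-sweep (reduce): each level d combines pairs of subtrees.
    for (std::size_t d = 1; d < n; d <<= 1)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];           // right child accumulates left subtree
    // Down-sweep: clear the root, then swap-and-add down the tree.
    a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d >>= 1)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];            // left child gets the partial prefix
            a[i] += t;                  // right child adds the left subtree sum
        }
}
```

On [3 1 7 0 4 1 6 3] this produces [0 3 4 11 11 15 16 22], the same result as the sequential scan; both loops touch O(n) elements in total, over log n levels.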
A Work-Efficient Parallel Scan
Parallel Scan Specifications
The algorithm scans an array inside a single thread block.
This is fine for small arrays, up to twice the maximum number of threads in a block (since each thread loads and processes two elements).
The array size must be a power of two.
How to extend the algorithm to scan large arrays of arbitrary size?
Divide the large array into blocks that each can be scanned by a single thread block.
Scan the blocks and write the total sum of each block to another array of block sums.
Then scan the block sums, generating an array of block increments that are added to all elements in their respective blocks.
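The three steps above can be sketched in host-side C++ (serial stand-in for the multi-kernel CUDA scheme; function and parameter names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Exclusive sum scan of an arbitrarily sized array via block decomposition:
// 1) scan each block independently, 2) scan the per-block totals,
// 3) add each block's increment to all of its elements.
std::vector<int> large_exclusive_scan(const std::vector<int>& in,
                                      std::size_t block) {
    const std::size_t n = in.size();
    std::vector<int> out(n), block_sums;
    // Step 1: independent exclusive scan inside each block (one thread
    // block each on the GPU); record the block's total.
    for (std::size_t b = 0; b < n; b += block) {
        int sum = 0;
        for (std::size_t j = b; j < std::min(b + block, n); ++j) {
            out[j] = sum;
            sum += in[j];
        }
        block_sums.push_back(sum);
    }
    // Step 2: exclusive scan of the block sums gives each block's increment.
    int sum = 0;
    for (int& s : block_sums) { int t = s; s = sum; sum += t; }
    // Step 3: uniformly add the increment to every element of its block.
    for (std::size_t b = 0; b < n; b += block)
        for (std::size_t j = b; j < std::min(b + block, n); ++j)
            out[j] += block_sums[b / block];
    return out;
}
```

With block size 4, scanning [3 1 7 0 4 1 6 3] again gives [0 3 4 11 11 15 16 22]: the block sums [11 14] scan to increments [0 11], which lift the second block's local scan [0 4 5 11] to [11 15 16 22].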
Arrays of Arbitrary Size
Algorithm for performing a sum scan on a large array of values.
Compaction
Stream compaction requires two steps: a scan and a scatter.
When to use compaction?
A large number of elements to compact
The computation on each surviving element is expensive
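The scan-and-scatter pair can be sketched in host-side C++ (serial stand-in for the parallel kernels; the predicate "keep even values" and the function name are illustrative choices, not from the slides):

```cpp
#include <cassert>
#include <vector>

// Stream compaction = scan + scatter: keep only elements passing a predicate.
std::vector<int> compact(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> flag(n), addr(n);
    // Flag survivors (illustrative predicate: keep even values).
    for (std::size_t j = 0; j < n; ++j) flag[j] = (in[j] % 2 == 0);
    // Scan: an exclusive scan of the flags gives each survivor its
    // contiguous output address.
    int sum = 0;
    for (std::size_t j = 0; j < n; ++j) { addr[j] = sum; sum += flag[j]; }
    // Scatter: each survivor writes itself to its computed address.
    std::vector<int> out(sum);
    for (std::size_t j = 0; j < n; ++j)
        if (flag[j]) out[addr[j]] = in[j];
    return out;
}
```

On the GPU both loops become one-thread-per-element kernels; the scan supplies the only coordination the scatter needs.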
Throughput and Coalesced Optimization
The main idea here is to copy all the neighboring elements (tetrahedra) for each active point using SCAN, in order to have a lot of data to process in a coalesced way.
Compacted active list (node ids):    3  5  2  8
Neighbor count per node:             2  3  4  2
Exclusive scan → start addresses:    0  2  5  9
Flat neighbor array:                 35 27 | 57 63 48 | 127 264 158 99 | 27 39
Table: Data preparation for coalesced access and throughput optimization (similar to CRS for sparse matrices).
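Building the offsets for this CRS-like layout is exactly an exclusive scan over the neighbor counts; a host-side C++ sketch (struct and function names are illustrative), using the numbers from the table:

```cpp
#include <cassert>
#include <vector>

// CRS-style layout: offsets[j]..offsets[j+1] delimits node j's neighbor
// range inside one flat array, enabling coalesced access on the GPU.
struct ActiveList {
    std::vector<int> nodes;    // compacted active node ids
    std::vector<int> offsets;  // counts.size() + 1 entries
};

ActiveList build_offsets(const std::vector<int>& nodes,
                         const std::vector<int>& counts) {
    ActiveList a{nodes, std::vector<int>(counts.size() + 1)};
    int sum = 0;
    for (std::size_t j = 0; j < counts.size(); ++j) {
        a.offsets[j] = sum;        // exclusive scan of the counts
        sum += counts[j];
    }
    a.offsets[counts.size()] = sum; // total neighbors closes the last range
    return a;
}
```

For nodes [3 5 2 8] with counts [2 3 4 2] this yields offsets [0 2 5 9 11], matching the table; node 2's neighbors then sit contiguously at positions 5..8 of the flat array.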
Results
Hardware Specifications
CUDA implementation tested on a GeForce GTX 970.
OpenMP implementation tested on an Intel Core i7-4700MQ CPU @ 2.40 GHz × 4 + HT.

Mesh                # Tetrahedra    CUDA     OpenMP
slab (structured)   240,000         0.657    0.354
TBunnyC2            266,846         0.913    1.673
TBunnyC             3,073,529       8.347    18.0039

Table: Result comparison between the CUDA and OpenMP implementations.
Multiprocessor Utilization
The kernel’s blocks are distributed across the GPU’s multiprocessors for execution. Depending on the number of blocks and the execution duration of each block, some multiprocessors may be more highly utilized than others during execution of the kernel.
Conclusions and Outlook
The scan operation is a simple and powerful parallel primitive with a broad range of applications.
Multiprocessor utilization shows good results, meaning quite good occupancy is achieved.
Right now we have a factor-of-2 speedup w.r.t. the OpenMP implementation.
Employing domain decomposition, we expect at least a factor-of-4 speedup compared to the OpenMP implementation.
Measuring and improving memory bandwidth by assuring
  - sufficient occupancy
  - coalesced global memory access
proved a very effective procedure during performance improvement.
References
[1] Guy E. Blelloch (1990). Prefix Sums and Their Applications. In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann.
[2] Mark Harris (2007). Parallel Prefix Sum (Scan) with CUDA. http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf
[3] Udacity. Intro to Parallel Programming. https://www.udacity.com/