Eikonal equation solver in CUDA
Daniel Ganellari
University of Graz
January 28, 2016
Daniel Ganellari (IMSC) Eikonal solver in CUDA January 28, 2016 1 / 18
Overview
1 Introduction to the main Algorithm
  Parallel prefix sum (SCAN) with CUDA
  Device data preparation for coalesced accessing
  Eikonal CUDA Algorithm
2 Results and Profiling
3 Conclusions and Outlook
4 References
Definition and Citation
Blelloch [1]: "All-prefix-sums is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm."
Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements
[a0, a1, ..., an−1]
and returns the array
[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−1)] (inclusive scan)
[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−2)] (exclusive scan)
where I is the neutral element w.r.t. ⊕.
Example: If ⊕ is addition, then the exclusive scan of the array [3 1 7 0 4 1 6 3] returns [0 3 4 11 11 15 16 22].
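The definition above can be made concrete with a short host-side C++ sketch of the sequential exclusive scan (the function name is an illustrative choice), reproducing the addition example:

```cpp
#include <cassert>
#include <vector>

// Exclusive sum scan: out[j] = a[0] + ... + a[j-1], with out[0] = I = 0,
// the neutral element of +.
std::vector<int> exclusive_scan(const std::vector<int>& a) {
    std::vector<int> out(a.size());
    int sum = 0;                    // running prefix, starts at the identity
    for (std::size_t j = 0; j < a.size(); ++j) {
        out[j] = sum;               // write the sum of all earlier elements
        sum += a[j];                // then fold in the current element
    }
    return out;
}
```

Running it on [3 1 7 0 4 1 6 3] yields [0 3 4 11 11 15 16 22], matching the example.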
Sequential and Parallel Scan
In general, all-prefix-sums can be used to convert some sequential computations into equivalent, but parallel, computations as shown in Table 1.
Sequential:
    out[0] = 0;
    for j from 1 to n do
        out[j] = out[j-1] + f(in[j-1]);

Parallel:
    forall j in parallel do
        temp[j] = f(in[j]);
    all_prefix_sums(out, temp);

Table 1: A sequential computation and its parallel equivalent.
There are many uses for all-prefix-sums: sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, and building histograms and data structures (graphs, trees, etc.) in parallel. For example applications, we refer the reader to the survey by Blelloch [1].
A Work-Efficient Parallel Scan
The up-sweep (reduce) phase of a work-efficient sum scan algorithm (after Blelloch [1]).
Steps: O(log n)   Work: O(n)
A Work-Efficient Parallel Scan
The down-sweep phase of a work-efficient parallel sum scan algorithm.
Steps: O(log n)   Work: O(n)
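The two phases can be sketched sequentially in host-side C++ (the CUDA kernel runs the inner loops in parallel across threads; the function name is an illustrative choice). The up-sweep builds partial sums in a balanced tree; the down-sweep clears the root and pushes the sums back down, leaving an exclusive scan in place:

```cpp
#include <cassert>
#include <vector>

// In-place work-efficient exclusive sum scan after Blelloch.
// Precondition: a.size() is a power of two.
void blelloch_scan(std::vector<int>& a) {
    const std::size_t n = a.size();
    // Up-sweep (reduce): each level d combines pairs of subtrees.
    for (std::size_t d = 1; d < n; d <<= 1)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];           // right child accumulates left subtree
    // Down-sweep: clear the root, then swap-and-add down the tree.
    a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d >>= 1)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];            // left child gets the partial prefix
            a[i] += t;                  // right child adds the left subtree sum
        }
}
```

On [3 1 7 0 4 1 6 3] this produces [0 3 4 11 11 15 16 22], the same result as the sequential scan; both loops touch O(n) elements in total, over log n levels.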
A Work-Efficient Parallel Scan
Parallel Scan Specifications
The algorithm scans an array inside a single thread block.
This is fine for small arrays, up to twice the maximum number of threads in a block (since each thread loads and processes two elements).
The array size must be a power of two.
How to extend the algorithm to scan large arrays of arbitrary size?
Divide the large array into blocks that each can be scanned by a single thread block.
Scan the blocks and write the total sum of each block to another array of block sums.
Then scan the block sums, generating an array of block increments that are added to all elements in their respective blocks.
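The three steps above can be sketched in host-side C++ (serial stand-in for the multi-kernel CUDA scheme; function and parameter names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Exclusive sum scan of an arbitrarily sized array via block decomposition:
// 1) scan each block independently, 2) scan the per-block totals,
// 3) add each block's increment to all of its elements.
std::vector<int> large_exclusive_scan(const std::vector<int>& in,
                                      std::size_t block) {
    const std::size_t n = in.size();
    std::vector<int> out(n), block_sums;
    // Step 1: independent exclusive scan inside each block (one thread
    // block each on the GPU); record the block's total.
    for (std::size_t b = 0; b < n; b += block) {
        int sum = 0;
        for (std::size_t j = b; j < std::min(b + block, n); ++j) {
            out[j] = sum;
            sum += in[j];
        }
        block_sums.push_back(sum);
    }
    // Step 2: exclusive scan of the block sums gives each block's increment.
    int sum = 0;
    for (int& s : block_sums) { int t = s; s = sum; sum += t; }
    // Step 3: uniformly add the increment to every element of its block.
    for (std::size_t b = 0; b < n; b += block)
        for (std::size_t j = b; j < std::min(b + block, n); ++j)
            out[j] += block_sums[b / block];
    return out;
}
```

With block size 4, scanning [3 1 7 0 4 1 6 3] again gives [0 3 4 11 11 15 16 22]: the block sums [11 14] scan to increments [0 11], which lift the second block's local scan [0 4 5 11] to [11 15 16 22].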
Arrays of Arbitrary Size
Algorithm for performing a sum scan on a large array of values.
Compaction
Stream compaction requires two steps: a scan and a scatter.
When to use compaction?
A large number of elements to compact
The computation on each surviving element is expensive
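The scan-and-scatter pair can be sketched in host-side C++ (serial stand-in for the parallel kernels; the predicate "keep even values" and the function name are illustrative choices, not from the slides):

```cpp
#include <cassert>
#include <vector>

// Stream compaction = scan + scatter: keep only elements passing a predicate.
std::vector<int> compact(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> flag(n), addr(n);
    // Flag survivors (illustrative predicate: keep even values).
    for (std::size_t j = 0; j < n; ++j) flag[j] = (in[j] % 2 == 0);
    // Scan: an exclusive scan of the flags gives each survivor its
    // contiguous output address.
    int sum = 0;
    for (std::size_t j = 0; j < n; ++j) { addr[j] = sum; sum += flag[j]; }
    // Scatter: each survivor writes itself to its computed address.
    std::vector<int> out(sum);
    for (std::size_t j = 0; j < n; ++j)
        if (flag[j]) out[addr[j]] = in[j];
    return out;
}
```

On the GPU both loops become one-thread-per-element kernels; the scan supplies the only coordination the scatter needs.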
Throughput and Coalesced Optimization
The main idea here is to copy all the neighboring elements (tetrahedra) for each active point using SCAN, in order to have a lot of data to process in a coalesced way.
Compacted active list (node ids):    3  5  2  8
Neighbor count per node:             2  3  4  2
Exclusive scan → start addresses:    0  2  5  9
Flat neighbor array:                 35 27 | 57 63 48 | 127 264 158 99 | 27 39
Table: Data preparation for coalesced access and throughput optimization (similar to CRS for sparse matrices).
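Building the offsets for this CRS-like layout is exactly an exclusive scan over the neighbor counts; a host-side C++ sketch (struct and function names are illustrative), using the numbers from the table:

```cpp
#include <cassert>
#include <vector>

// CRS-style layout: offsets[j]..offsets[j+1] delimits node j's neighbor
// range inside one flat array, enabling coalesced access on the GPU.
struct ActiveList {
    std::vector<int> nodes;    // compacted active node ids
    std::vector<int> offsets;  // counts.size() + 1 entries
};

ActiveList build_offsets(const std::vector<int>& nodes,
                         const std::vector<int>& counts) {
    ActiveList a{nodes, std::vector<int>(counts.size() + 1)};
    int sum = 0;
    for (std::size_t j = 0; j < counts.size(); ++j) {
        a.offsets[j] = sum;        // exclusive scan of the counts
        sum += counts[j];
    }
    a.offsets[counts.size()] = sum; // total neighbors closes the last range
    return a;
}
```

For nodes [3 5 2 8] with counts [2 3 4 2] this yields offsets [0 2 5 9 11], matching the table; node 2's neighbors then sit contiguously at positions 5..8 of the flat array.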
Results
Hardware Specifications
CUDA implementation tested on a GeForce GTX 970.
OpenMP implementation tested on an Intel Core i7-4700MQ CPU @ 2.40 GHz × 4 + HT.

Mesh                # Tetrahedra    CUDA     OpenMP
slab (structured)   240,000         0.657    0.354
TBunnyC2            266,846         0.913    1.673
TBunnyC             3,073,529       8.347    18.0039

Table: Result comparison between the CUDA and OpenMP implementations.
Multiprocessor Utilization
The kernel’s blocks are distributed across the GPU’s multiprocessors for execution. Depending on the number of blocks and the execution duration of each block, some multiprocessors may be more highly utilized than others during execution of the kernel.
Conclusions and Outlook
The scan operation is a simple and powerful parallel primitive with a broad range of applications.
Multiprocessor utilization shows good results, meaning quite good occupancy is achieved.
Right now we have a factor-of-2 speedup w.r.t. the OpenMP implementation.
Employing domain decomposition, we expect at least a factor-of-4 speedup compared to the OpenMP implementation.
Measuring and improving memory bandwidth by assuring
  - sufficient occupancy
  - coalesced global memory access
proved a very effective procedure during performance improvement.
References
[1] Guy E. Blelloch (1990). Prefix Sums and Their Applications. In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann.
[2] Mark Harris (2007). Parallel Prefix Sum (Scan) with CUDA. http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf
[3] Udacity. Intro to Parallel Programming. https://www.udacity.com/