Eikonal equation solver in CUDA

Daniel Ganellari

University of Graz

[email protected]

January 28, 2016

Daniel Ganellari (IMSC) Eikonal solver in CUDA January 28, 2016 1 / 18

Overview

1 Introduction to the main Algorithm
    Parallel prefix sum (SCAN) with CUDA
    Device data preparation for coalesced accessing
    Eikonal CUDA Algorithm

2 Results and Profiling

3 Conclusions and Outlook

4 References


Definition and Citation

Blelloch [1]: All-prefix-sums is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm.

Definition: The all-prefix-sums operation takes a binary associative operator ⊕ and an array of n elements

[a0, a1, ..., an−1]

and returns the array:

[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−1)] (inclusive scan)

[I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an−2)] (exclusive scan)

where I is the neutral element with respect to ⊕.

Example: If ⊕ is addition, then the exclusive scan operation on the array [3 1 7 0 4 1 6 3] returns [0 3 4 11 11 15 16 22].
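The definition can be checked with a small sequential reference model. This is a minimal sketch in Python, not the CUDA code from the talk; the function names `inclusive_scan` and `exclusive_scan` are illustrative choices, and the operator defaults to addition.

```python
from operator import add


def inclusive_scan(values, op=add):
    """Return [a0, a0 op a1, ..., a0 op ... op a(n-1)]."""
    out = []
    for i, v in enumerate(values):
        # Combine the running result with the next element.
        out.append(v if i == 0 else op(out[-1], v))
    return out


def exclusive_scan(values, identity=0, op=add):
    """Return [I, a0, a0 op a1, ..., a0 op ... op a(n-2)]."""
    out = []
    running = identity
    for v in values:
        out.append(running)       # element i sees only elements 0..i-1
        running = op(running, v)
    return out


data = [3, 1, 7, 0, 4, 1, 6, 3]
print(inclusive_scan(data))  # [3, 4, 11, 11, 15, 16, 22, 25]
print(exclusive_scan(data))  # [0, 3, 4, 11, 11, 15, 16, 22]
```

Passing a different associative operator together with its neutral element (for example multiplication with identity 1) gives the corresponding scan.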


Sequential and Parallel Scan

In general, all-prefix-sums can be used to convert some sequential computations into equivalent, but parallel, computations as shown in Table 1.

Sequential                            | Parallel
                                      |
out[0] = 0;                           | forall j in parallel do
forall j from 1 to n do               |     temp[j] = f(in[j]);
    out[j] = out[j-1] + f(in[j-1]);   | all_prefix_sums(out, temp);

Table 1: A sequential computation and its parallel equivalent.
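The equivalence in the table can be illustrated with a short Python sketch. The per-element function `f` is a hypothetical placeholder (squaring), and the code assumes one output per input element with the all-prefix-sums step being an exclusive sum scan.

```python
from itertools import accumulate


def f(x):
    # Hypothetical per-element function; any pure function works.
    return x * x


def sequential(inp):
    # Sequential column: each output depends on the previous output.
    out = [0] * len(inp)
    for j in range(1, len(inp)):
        out[j] = out[j - 1] + f(inp[j - 1])
    return out


def parallel_style(inp):
    # Step 1: independent map, one "thread" per element.
    temp = [f(x) for x in inp]
    # Step 2: exclusive all-prefix-sums over temp; this is the only
    # step that couples elements, and it can itself run in parallel.
    return list(accumulate(temp, initial=0))[:-1]


data = [3, 1, 7, 0]
print(sequential(data))      # [0, 9, 10, 59]
print(parallel_style(data))  # [0, 9, 10, 59]
```

The loop-carried dependency of the sequential version is moved entirely into the scan, which is why an efficient parallel scan makes the whole computation parallel.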

There are many uses for all-prefix-sums: sorting, lexical analysis, string comparison, polynomial evaluation, stream compaction, and building histograms and data structures (graphs, trees, etc.) in parallel. For example applications, we refer the reader to the survey by Blelloch [1].


A Work-Efficient Parallel Scan

The up-sweep (reduce) phase of a work-efficient sum scan algorithm (after Blelloch [1]).

Steps: log(n) Work: O(n)


A Work-Efficient Parallel Scan

The down-sweep phase of a work-efficient parallel sum scan algorithm.

Steps: log(n) Work: O(n)
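The two phases can be sketched sequentially as follows. This is a Python reference model of a Blelloch-style exclusive sum scan, assuming a power-of-two input length; the function name is illustrative, and each inner loop stands in for what would be independent threads on the GPU.

```python
def blelloch_exclusive_scan(values):
    """Exclusive sum scan of a power-of-two-length list."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    x = list(values)

    # Up-sweep (reduce): build partial sums in place, log2(n) steps.
    stride = 1
    while stride < n:
        # Iterations of this inner loop are independent of each other.
        for i in range(2 * stride - 1, n, 2 * stride):
            x[i] += x[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push partial sums back down.
    x[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = x[i - stride]
            x[i - stride] = x[i]   # left child takes the parent's value
            x[i] += left           # right child takes parent + old left
        stride //= 2
    return x


print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Each phase performs n − 1 additions in total, spread over log2(n) steps, which matches the step and work counts stated above.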


A Work-Efficient Parallel Scan

Parallel Scan Specifications

The algorithm scans an array inside a single thread block.

This is fine for small arrays, up to twice the maximum number of threads in a block (since each thread loads and processes two elements).

The array size must be a power of two.

How to extend the algorithm to scan large arrays of arbitrary size?

Divide the large array into blocks that each can be scanned by a single thread block.

Scan the blocks and write the total sum of each block to another array of block sums.

Then scan the block sums, generating an array of block increments that are added to all elements in their respective blocks.
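The three steps above can be sketched as a sequential Python model. The names `scan_large` and `block_size` are illustrative, and the helper `exclusive_scan` is a stand-in for the per-block scan kernel; it accepts any length, whereas a real power-of-two kernel would need the last block padded.

```python
from itertools import accumulate


def exclusive_scan(values):
    # Stand-in for the single-thread-block scan kernel.
    return list(accumulate(values, initial=0))[:-1]


def scan_large(values, block_size=4):
    # 1. Split the input into chunks that one thread block could scan.
    blocks = [values[i:i + block_size]
              for i in range(0, len(values), block_size)]
    # 2. Scan each block independently and record each block's total.
    scanned = [exclusive_scan(b) for b in blocks]
    block_sums = [sum(b) for b in blocks]
    # 3. Scan the block totals to get one increment per block.
    increments = exclusive_scan(block_sums)
    # 4. Add each block's increment to all of its elements.
    return [v + inc for blk, inc in zip(scanned, increments) for v in blk]


print(scan_large(list(range(1, 11)), block_size=4))
# [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
```

If the array of block sums is itself too large for one thread block, the same procedure can be applied to it recursively.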


Arrays of Arbitrary Size

Algorithm for performing a sum scan on a large array of values.


Scan Performance


Compaction

Stream compaction requires two steps, a scan and a scatter.

When to use compaction?

Large number of elements to compact

The computation on each surviving element is expensive
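A minimal sketch of scan-plus-scatter compaction, written as a sequential Python model. The function name `compact` and the predicate `keep` are illustrative; on the GPU, the flag evaluation and the scatter would each run with one thread per element.

```python
from itertools import accumulate


def compact(values, keep):
    # Evaluate the predicate; 1 marks a surviving element.
    flags = [1 if keep(v) else 0 for v in values]
    # Scan: the exclusive sum of the flags gives each survivor its
    # position in the output array.
    addresses = list(accumulate(flags, initial=0))[:-1]
    # Scatter: every survivor writes itself to its address.
    out = [None] * sum(flags)
    for v, flag, addr in zip(values, flags, addresses):
        if flag:
            out[addr] = v
    return out


# Keep only the nonzero entries.
print(compact([3, 0, 5, 0, 0, 2, 8, 0], lambda v: v != 0))  # [3, 5, 2, 8]
```

Because no two survivors receive the same address, the scatter needs no synchronization between threads.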


Throughput and Coalesced Optimization

The main idea here is to use SCAN to copy all the neighboring elements (tetrahedra) of each active point into one contiguous array, so that there is a lot of data to process and it can be accessed in a coalesced way.

Compacted active list nodes:         3   5   2   8
Total neighbors for each node:       2   3   4   2
Exclusive scan (start addresses):    0   2   5   9

Gathered neighbor elements:   35 27 | 57 63 48 | 127 264 158 99 | 27 39
Elements per node:              2   |    3     |       4        |   2

Table: Data preparation for coalesced and throughput optimization (similar to compressed row storage, CRS, for sparse matrices)
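The layout in the table can be reproduced with a short Python sketch. The node-to-tetrahedra map `neighbors` is hypothetical, chosen only so that the result matches the numbers in the table; the real solver would read it from the mesh connectivity.

```python
from itertools import accumulate

# Hypothetical node -> neighboring-tetrahedra map (illustration only).
neighbors = {
    3: [35, 27],
    5: [57, 63, 48],
    2: [127, 264, 158, 99],
    8: [27, 39],
}
active = [3, 5, 2, 8]  # compacted active list

# Number of neighboring elements per active node.
counts = [len(neighbors[n]) for n in active]
# Exclusive scan of the counts gives each node's start address.
offsets = list(accumulate(counts, initial=0))[:-1]

# Gather: node k copies its neighbors to flat[offsets[k] + j].
flat = [None] * sum(counts)
for k, node in enumerate(active):
    for j, tet in enumerate(neighbors[node]):
        flat[offsets[k] + j] = tet

print(counts)   # [2, 3, 4, 2]
print(offsets)  # [0, 2, 5, 9]
print(flat)     # [35, 27, 57, 63, 48, 127, 264, 158, 99, 27, 39]
```

With the work items stored contiguously like this, consecutive threads can read consecutive entries of `flat`, which is the access pattern that coalescing rewards.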


Eikonal CUDA Update Scheme


Eikonal CUDA Update Scheme


Results

Hardware Specifications

CUDA implementation tested on GeForce GTX 970

OpenMP implementation tested on Intel Core i7-4700MQ CPU @ 2.40GHz x 4 + HT

Meshes              # Tetrahedra   CUDA     OpenMP

slab (structured)   240,000        0.657    0.354
TBunnyC2            266,846        0.913    1.673
TBunnyC             3,073,529      8.347    18.0039

Table: Result comparison between CUDA and OpenMP implementations
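The ratios behind the comparison can be worked out directly. This sketch assumes the table values are run times (lower is better), which the slides do not state explicitly, so that speedup is the OpenMP value divided by the CUDA value.

```python
# (mesh, CUDA value, OpenMP value) copied from the table above.
results = [
    ("slab (structured)", 0.657, 0.354),
    ("TBunnyC2", 0.913, 1.673),
    ("TBunnyC", 8.347, 18.0039),
]

# Assumption: the values are run times, so speedup = OpenMP / CUDA.
speedups = {mesh: openmp / cuda for mesh, cuda, openmp in results}

for mesh, s in speedups.items():
    print(f"{mesh}: {s:.2f}x")
# slab (structured): 0.54x
# TBunnyC2: 1.83x
# TBunnyC: 2.16x
```

Under that assumption, the CUDA version is roughly twice as fast on the two TBunnyC meshes and slower on the structured slab.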


Multiprocessor Utilization

The kernel’s blocks are distributed across the GPU’s multiprocessors for execution. Depending on the number of blocks and the execution duration of each block, some multiprocessors may be more highly utilized than others during execution of the kernel.


Conclusions and Outlook

The scan operation is a simple and powerful parallel primitive with a broad range of applications.

Multiprocessor utilization is good, which indicates that reasonably good occupancy is achieved.

Right now we have a factor of 2 speedup w.r.t. the OpenMP implementation.

Employing domain decomposition, we expect to achieve at least a factor of 4 speedup compared to the OpenMP implementation.

Measuring and improving memory bandwidth by ensuring

    sufficient occupancy
    coalesced global memory access

proved to be a very good procedure for performance improvement.


References

[1] Guy E. Blelloch (1990). Prefix Sums and Their Applications. In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990.

[2] Mark Harris (2007). Parallel Prefix Sum (Scan) with CUDA. http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf

[3] Udacity. Intro to Parallel Programming. https://www.udacity.com/


The End
