Developing applications able to exploit the dazzling performance of
GPUs (Graphics Processing Units) is not a trivial task, and it becomes
even harder when they have irregular data access patterns or control
flows. Several approaches, such as OpenMP and OpenACC, have been
proposed to simplify GPU programming. However, they suffer a
performance gap with respect to native programming models, since their
compilers lack comprehensive knowledge about how to transform the code
and what to optimize. This thesis targets directive-based programming
models to enhance their capabilities for GPU programming.
My contributions are in three directions:
➢ Developed a task-based programming model (OmpSs+OpenMP), along with its compiler and runtime support.
➢ Code transformation of nested parallelism for irregular applications such as sparse matrix operations, graph algorithms and graphics algorithms.
➢ Compiler optimization for loop scheduling on GPUs.
Thesis Software Contribution:
➢ Mercurium compiler, Clang frontend, NVIDIA's PGI compiler
Guray Ozen, Eduard Ayguadé, Jesús Labarta
Universitat Politècnica de Catalunya, Barcelona, Spain
Compiler and Runtime Based Parallelization & Optimization for GPUs
Task Based GPU Offload Model (OmpSs+OpenMP)
MACC = Mercurium ACCelerator model
Introduces a new dialect of the programming model
An asynchronous, task-based GPU offload model
A combination of OmpSs + OpenMP that incorporates the advantages of both models
Makes accelerators easy to use
Its aim is to let the compiler offload and parallelize code
Developed on top of the OmpSs model, in the Mercurium compiler
Source-to-source code generation for C/C++
Targets NVIDIA GPUs (and OpenCL as well, though that is outside the thesis)
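As a rough illustration, a MACC-style offload could look like the following minimal sketch. It assumes OmpSs-like task and dependence clauses (task, in/inout) combined with OpenMP accelerator directives; the exact MACC clause names and syntax may differ.

/* Minimal sketch of a MACC-style asynchronous task offload, assuming
 * OmpSs-like dependence clauses combined with OpenMP accelerator
 * directives; the exact MACC syntax may differ. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target device(0)              /* offload to the first GPU */
    #pragma omp task in(x[0:n]) inout(y[0:n]) /* asynchronous task + deps */
    #pragma omp teams distribute parallel for /* compiler parallelizes loop */
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    /* The task runs asynchronously; the caller synchronizes later,
     * e.g. with #pragma omp taskwait. */
}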
❑ Eager (naïve) dynamic parallelism causes a slowdown due to kernel launch overhead.
❑ Our lazy method (see Lazy Nested Parallelism below) yields the best performance.
❑ Dynamic loop scheduling (see below) can reach maximum performance with a small grid size.
❑ It can even increase performance, since it uses a small number of grids.
Publications
[1] Guray Ozen, Eduard Ayguadé, Jesús Labarta: POSTER: Collective Dynamic Parallelism for Directive Based GPU Programming Languages and Compilers. PACT 2016.
[2] Guray Ozen, Sergi Mateo, Eduard Ayguadé, Jesús Labarta, James Beyer: Multiple Target Task Sharing Support for the OpenMP Accelerator Model. IWOMP 2016.
[3] Samuel F. Antão, Alexey Bataev, Arpith C. Jacob, Gheorghe-Teodor Bercea, Alexandre E. Eichenberger, Georgios Rokos, Matt Martineau, Tian Jin, Guray Ozen, Zehra Sura, Tong Chen, Hyojin Sung, Carlo Bertolli, Kevin O'Brien: Offloading Support for OpenMP in Clang and LLVM. LLVM-HPC@SC 2016.
[4] Guray Ozen, Eduard Ayguadé, Jesús Labarta: Exploring Dynamic Parallelism in OpenMP. WACCPD@SC 2015.
[5] Guray Ozen, Eduard Ayguadé, Jesús Labarta: On the Roles of the Programmer, the Compiler and the Runtime System When Programming Accelerators in OpenMP. IWOMP 2014.
Dynamic Loop Scheduling
❑ Problem: Many GPU applications suffer from excessive kernel size.
➢ Solution: We developed dynamic loop scheduling, which
➢ finds the right kernel size, and
➢ increases performance.
[Figure: Dynamic Loop Scheduling vs. Static Cyclic Scheduling]
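To make the idea concrete, below is a hedged sketch of one way such a scheme can be realized in CUDA: a small, fixed grid of persistent thread blocks claims loop chunks from a global counter, so the launched grid stays small regardless of the iteration count. This is an illustrative reconstruction, not the code the thesis compiler emits; g_next and the chunking policy are assumptions.

/* Illustrative CUDA sketch of dynamic loop scheduling with persistent
 * thread blocks; not the thesis compiler's generated code. g_next must
 * be zeroed before each launch (e.g. via cudaMemcpyToSymbol). */
__device__ unsigned int g_next;          /* next unclaimed loop index */

__global__ void dyn_loop(float *a, unsigned int n)
{
    __shared__ unsigned int base;
    for (;;) {
        if (threadIdx.x == 0)
            base = atomicAdd(&g_next, blockDim.x); /* block claims a chunk */
        __syncthreads();
        if (base >= n) break;                      /* no iterations left */
        unsigned int i = base + threadIdx.x;
        if (i < n)
            a[i] = 2.0f * a[i];                    /* placeholder loop body */
        __syncthreads();            /* keep base stable until all finish */
    }
}
/* Launched with a small grid, e.g. dyn_loop<<<num_sms, 256>>>(d_a, n),
 * instead of one thread per iteration. */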
Lazy Nested Parallelism
❑ Problem: Modern GPUs increasingly have to run irregular codes: graph algorithms, sparse matrix applications, irregular data access patterns, etc.
➢ Solution: We developed an efficient lazy nested parallelism scheme for compilers (sketched below).
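As a hedged illustration of the idea only (the thesis's actual transformation may differ), the sketch below avoids eager CUDA dynamic parallelism for a sparse matrix-vector product: instead of each thread launching a child kernel for its row, threads first deposit their nested loops into a block-shared pool, which the whole block then drains cooperatively. All names and sizes are illustrative assumptions.

/* Hedged sketch of lazy (collective) nested parallelism for SpMV in
 * CUDA; an illustration of the idea, not the thesis's generated code.
 * Assumes blockDim.x <= POOL_CAP and that y is zero-initialized. */
#define POOL_CAP 256

struct InnerWork { int row, begin, end; };   /* one nested (inner) loop */

__global__ void spmv_lazy(const int *rowptr, const int *col,
                          const float *val, const float *x,
                          float *y, int nrows)
{
    __shared__ InnerWork pool[POOL_CAP];
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    /* Phase 1: instead of eagerly launching a child kernel per row,
     * each thread registers its nested loop in the shared pool. */
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nrows) {
        int slot = atomicAdd(&count, 1);
        pool[slot].row   = r;
        pool[slot].begin = rowptr[r];
        pool[slot].end   = rowptr[r + 1];
    }
    __syncthreads();

    /* Phase 2: the block drains the pooled inner loops cooperatively,
     * all threads striding over one row at a time. */
    for (int w = 0; w < count; ++w) {
        float sum = 0.0f;
        for (int k = pool[w].begin + threadIdx.x; k < pool[w].end;
             k += blockDim.x)
            sum += val[k] * x[col[k]];
        atomicAdd(&y[pool[w].row], sum);     /* accumulate partial sums */
    }
}
/* Launch: spmv_lazy<<<(nrows + 255) / 256, 256>>>(rowptr, col, val, x, y, nrows); */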
Multi-target Task Share
❑ Problem: Heterogeneity is everywhere. How can the entire system be exploited?
➢ Solution: We developed multiple-target task sharing: a task may be mapped to any available device, while if_device clauses select the parallelization strategy per device type, as in the example below.

for (int i = 0; i < N; i += BS) {
    #pragma omp target device(any) map(…) nowait
    #pragma omp if_device(NVIDIA, cc35) teams distribute parallel for
    #pragma omp if_device(host) parallel for
    for (int j = i; j < i + BS; ++j)
        <…COMPUTATION…>;
}