
Compiler and Runtime Based Parallelization & Optimization for GPUs

Guray Ozen, Eduard Ayguadé, Jesús Labarta
Universitat Politècnica de Catalunya, Barcelona, Spain

Developing applications able to exploit the dazzling performance of GPUs (Graphics Processing Units) is not a trivial task, and it becomes even harder when they have irregular data access patterns or control flows. Several approaches, such as OpenMP and OpenACC, have been proposed to simplify GPU programming. However, they still show a performance gap with respect to native programming models, because their compilers lack comprehensive knowledge of how to transform the code and what to optimize. This thesis targets directive-based programming models and enhances their capabilities for GPU programming.

My contributions are in three directions:

➢ A task-based programming model (OmpSs+OpenMP), developed along with its compiler and runtime support.
➢ Code transformations for nested parallelism in irregular applications such as sparse matrix operations, graph algorithms, and graphics algorithms.
➢ Compiler optimization for loop scheduling on GPUs.

Thesis Software Contribution:

➢ Mercurium compiler, Clang frontend, NVIDIA's PGI compiler

Task Based GPU Offload Model (OmpSs+OpenMP)

MACC = Mercurium ACCelerator model
➢ Introduces a new dialect of the directive-based programming model
➢ Asynchronous, task-based GPU offload model
➢ Combines OmpSs and OpenMP, incorporating the advantages of both models
➢ Easy-to-use accelerator programming: its aim is to let the compiler offload and parallelize the code (see the sketch below)
➢ Developed on top of the OmpSs model in the Mercurium compiler
➢ Source-to-source code generation for C/C++
➢ Targets NVIDIA GPUs (OpenCL as well, but not in this thesis)
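
As a concrete illustration, the sketch below shows what a MACC-style task offload could look like at the source level. It is a minimal sketch, not code from the thesis: the saxpy routine, the clause spellings, and the array-section syntax are assumptions based on OmpSs/OpenMP conventions.

void saxpy(int n, float a, const float *x, float *y) {
    /* The task runs asynchronously; copy_deps asks the runtime to move
       the data named in the dependence clauses to and from the device. */
    #pragma omp target device(acc) copy_deps
    #pragma omp task in(x[0:n-1]) inout(y[0:n-1])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];

    #pragma omp taskwait  /* results are only guaranteed after the wait */
}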

Key results:

❑ Eager (naïve) dynamic parallelism causes slowdown due to kernel launch overhead.
❑ Our method yields the best performance.
❑ Dynamic loop scheduling can reach maximum performance with a small grid size.
❑ It can even increase performance, since it uses a small number of grids.
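
To make the first bullet concrete: in eager dynamic parallelism, every parent thread immediately launches its own child kernel. The hedged CUDA sketch below (SpMV-like, all names hypothetical; assumes y is pre-zeroed, compute capability 3.5+, and relocatable device code) pays one kernel-launch overhead per matrix row:

__global__ void child_row(const float *val, const int *col, const float *x,
                          float *y, int row, int begin, int end) {
    float sum = 0.0f;
    /* each child thread handles a strided share of the row */
    for (int k = begin + threadIdx.x; k < end; k += blockDim.x)
        sum += val[k] * x[col[k]];
    atomicAdd(&y[row], sum);  /* short reduction; a shared-memory
                                 reduction would be faster */
}

__global__ void spmv_eager(const int *rowptr, const float *val,
                           const int *col, const float *x, float *y,
                           int nrows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows)  /* eager: one child launch per row, however small */
        child_row<<<1, 128>>>(val, col, x, y, row,
                              rowptr[row], rowptr[row + 1]);
}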

Publications

[1] Guray Ozen, Eduard Ayguadé, Jesús Labarta: POSTER: Collective Dynamic Parallelism for Directive Based GPU Programming Languages and Compilers. PACT 2016.
[2] Guray Ozen, Sergi Mateo, Eduard Ayguadé, Jesús Labarta, James Beyer: Multiple Target Task Sharing Support for the OpenMP Accelerator Model. IWOMP 2016.
[3] Samuel F. Antão, Alexey Bataev, Arpith C. Jacob, Gheorghe-Teodor Bercea, Alexandre E. Eichenberger, Georgios Rokos, Matt Martineau, Tian Jin, Guray Ozen, Zehra Sura, Tong Chen, Hyojin Sung, Carlo Bertolli, Kevin O'Brien: Offloading Support for OpenMP in Clang and LLVM. LLVM-HPC@SC 2016.
[4] Guray Ozen, Eduard Ayguadé, Jesús Labarta: Exploring Dynamic Parallelism in OpenMP. WACCPD@SC 2015.
[5] Guray Ozen, Eduard Ayguadé, Jesús Labarta: On the Roles of the Programmer, the Compiler and the Runtime System When Programming Accelerators in OpenMP. IWOMP 2014.

Dynamic Loop Scheduling

❑ Problem: Many GPU applications suffer from excessive kernel (grid) sizes.
➢ Solution: We developed dynamic loop scheduling, which finds the right kernel size and increases performance.

(Figure: performance of Static Cyclic Scheduling vs. Dynamic Loop Scheduling)
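
The two schedules can be sketched as follows. This is a minimal hand-written CUDA illustration under assumed names, not the code the thesis compiler generates: static cyclic scheduling strides each thread over the iteration space at a fixed interval, while dynamic scheduling lets a small grid of blocks pull chunks of iterations from a shared counter, so the grid stays small regardless of the loop's trip count.

/* Static cyclic schedule: a classic grid-stride loop. */
__global__ void static_cyclic(float *y, const float *x, float a, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] += a * x[i];
}

/* Dynamic schedule: blocks claim the next chunk from a global counter
   (*next must be initialized to 0 before launch). */
__global__ void dynamic_sched(float *y, const float *x, float a, int n,
                              int chunk, int *next) {
    __shared__ int base;
    for (;;) {
        if (threadIdx.x == 0)
            base = atomicAdd(next, chunk);  /* claim the next chunk */
        __syncthreads();
        if (base >= n) return;              /* no work left */
        int end = min(base + chunk, n);
        for (int i = base + threadIdx.x; i < end; i += blockDim.x)
            y[i] += a * x[i];
        __syncthreads();  /* all done before base is overwritten */
    }
}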

Lazy Nested Parallelism

❑ Problem: Modern GPUs are increasingly asked to run irregular codes: graph algorithms, sparse matrix applications, irregular data access patterns, etc.
➢ Solution: We developed an efficient lazy nested parallelism scheme for compilers.
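
At the directive level, the nested parallelism in question looks like the SpMV-style sketch below (a hedged illustration; function and variable names are hypothetical). An eager translation would launch one child kernel per outer iteration, whereas a lazy, collective translation defers and aggregates the inner regions before launching:

void spmv(int nrows, int ncols, int nnz, const int *rowptr, const int *col,
          const float *val, const float *x, float *y) {
    #pragma omp target teams distribute \
            map(to: rowptr[0:nrows+1], col[0:nnz], val[0:nnz], x[0:ncols]) \
            map(from: y[0:nrows])
    for (int row = 0; row < nrows; ++row) {
        float sum = 0.0f;
        /* Inner loop width varies per row: this is the nested region. */
        #pragma omp parallel for reduction(+:sum)
        for (int k = rowptr[row]; k < rowptr[row + 1]; ++k)
            sum += val[k] * x[col[k]];
        y[row] = sum;
    }
}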

Multi-target Task Sharing

❑ Problem: Heterogeneity is everywhere. How can the whole system be exploited?
➢ Solution: We developed multiple target task sharing: device(any) lets the runtime place each task on any available device, and the if_device clauses select the parallelization directives matching the device actually chosen.

for (int i = 0; i < N; i += BS) {
  #pragma omp target device(any) map(…) nowait
  #pragma omp if_device(NVIDIA, cc35) teams distribute parallel for \
              if_device(host) parallel for
  for (int j = i; j < i + BS; ++j)
    <…COMPUTATION…>;
}