Developing applications able to exploit the dazzling performance of
GPUs (Graphics Processing Units) is not a trivial task, and it becomes
even harder when they have irregular data access patterns or control
flows. Several approaches, such as OpenMP and OpenACC, have been
proposed to simplify GPU programming. However, they suffer a
performance gap with respect to native programming models, since their
compilers lack comprehensive knowledge about how to transform the code
and what to optimize. This thesis targets directive-based programming
models to enhance their capabilities for GPU programming.
My contributions are in three directions:
➢ Developed a task-based programming model (OmpSs+OpenMP), along with its compiler and runtime support.
➢ Code transformation of nested parallelism for irregular applications such as sparse matrix operations, graph algorithms and graphics algorithms.
➢ Compiler optimization for loop scheduling on GPUs.
Thesis Software Contribution:
➢ Mercurium compiler, Clang frontend, NVIDIA's PGI compiler
Guray Ozen, Eduard Ayguadé, Jesús Labarta
Universitat Politècnica de Catalunya, Barcelona, Spain
Compiler and Runtime Based Parallelization & Optimization for GPUs
Task Based GPU Offload Model (OmpSs+OpenMP)
MACC = Mercurium ACCelerator model
Introduces a new dialect of the programming model
An asynchronous, task-based GPU offload model
A combination of OmpSs + OpenMP that incorporates the advantages of both models
Makes accelerators easy to use
Its aim is to let the compiler offload and parallelize code
Developed on top of the OmpSs model, in the Mercurium compiler
Source-to-source code generation for C/C++
Targets NVIDIA GPUs (and OpenCL as well, though that is outside the thesis)
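As a rough illustration, a MACC-style offload could look like the following minimal sketch. It assumes OmpSs-like task and dependence clauses (task, in/inout) combined with OpenMP accelerator directives; the exact MACC clause names and syntax may differ.

/* Minimal sketch of a MACC-style asynchronous task offload, assuming
 * OmpSs-like dependence clauses combined with OpenMP accelerator
 * directives; the exact MACC syntax may differ. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma omp target device(0)              /* offload to the first GPU */
    #pragma omp task in(x[0:n]) inout(y[0:n]) /* asynchronous task + deps */
    #pragma omp teams distribute parallel for /* compiler parallelizes loop */
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    /* The task runs asynchronously; the caller synchronizes later,
     * e.g. with #pragma omp taskwait. */
}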
❑ Eager (naïve) dynamic parallelism causes a slowdown due to kernel launch overhead.
❑ Our lazy method (see Lazy Nested Parallelism below) yields the best performance.
❑ Dynamic loop scheduling (see below) can reach maximum performance with a small grid size.
❑ It can even increase performance, since it uses a small number of grids.
Publications
[1] Guray Ozen, Eduard Ayguadé, Jesús Labarta: POSTER: Collective Dynamic Parallelism for Directive Based GPU Programming Languages and Compilers. PACT 2016.
[2] Guray Ozen, Sergi Mateo, Eduard Ayguadé, Jesús Labarta, James Beyer: Multiple Target Task Sharing Support for the OpenMP Accelerator Model. IWOMP 2016.
[3] Samuel F. Antão, Alexey Bataev, Arpith C. Jacob, Gheorghe-Teodor Bercea, Alexandre E. Eichenberger, Georgios Rokos, Matt Martineau, Tian Jin, Guray Ozen, Zehra Sura, Tong Chen, Hyojin Sung, Carlo Bertolli, Kevin O'Brien: Offloading Support for OpenMP in Clang and LLVM. LLVM-HPC@SC 2016.
[4] Guray Ozen, Eduard Ayguadé, Jesús Labarta: Exploring Dynamic Parallelism in OpenMP. WACCPD@SC 2015.
[5] Guray Ozen, Eduard Ayguadé, Jesús Labarta: On the Roles of the Programmer, the Compiler and the Runtime System When Programming Accelerators in OpenMP. IWOMP 2014.
Dynamic Loop Scheduling
❑ Problem: Many GPU applications suffer from excessive kernel size.
➢ Solution: We developed dynamic loop scheduling, which
➢ finds the right kernel size, and
➢ increases performance.
[Figure: Dynamic Loop Scheduling vs. Static Cyclic Scheduling]
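To make the idea concrete, below is a hedged sketch of one way such a scheme can be realized in CUDA: a small, fixed grid of persistent thread blocks claims loop chunks from a global counter, so the launched grid stays small regardless of the iteration count. This is an illustrative reconstruction, not the code the thesis compiler emits; g_next and the chunking policy are assumptions.

/* Illustrative CUDA sketch of dynamic loop scheduling with persistent
 * thread blocks; not the thesis compiler's generated code. g_next must
 * be zeroed before each launch (e.g. via cudaMemcpyToSymbol). */
__device__ unsigned int g_next;          /* next unclaimed loop index */

__global__ void dyn_loop(float *a, unsigned int n)
{
    __shared__ unsigned int base;
    for (;;) {
        if (threadIdx.x == 0)
            base = atomicAdd(&g_next, blockDim.x); /* block claims a chunk */
        __syncthreads();
        if (base >= n) break;                      /* no iterations left */
        unsigned int i = base + threadIdx.x;
        if (i < n)
            a[i] = 2.0f * a[i];                    /* placeholder loop body */
        __syncthreads();            /* keep base stable until all finish */
    }
}
/* Launched with a small grid, e.g. dyn_loop<<<num_sms, 256>>>(d_a, n),
 * instead of one thread per iteration. */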
Lazy Nested Parallelism
❑ Problem: Modern GPUs increasingly have to run irregular codes: graph algorithms, sparse matrix applications, irregular data access patterns, etc.
➢ Solution: We developed an efficient lazy nested parallelism scheme for compilers (sketched below).
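As a hedged illustration of the idea only (the thesis's actual transformation may differ), the sketch below avoids eager CUDA dynamic parallelism for a sparse matrix-vector product: instead of each thread launching a child kernel for its row, threads first deposit their nested loops into a block-shared pool, which the whole block then drains cooperatively. All names and sizes are illustrative assumptions.

/* Hedged sketch of lazy (collective) nested parallelism for SpMV in
 * CUDA; an illustration of the idea, not the thesis's generated code.
 * Assumes blockDim.x <= POOL_CAP and that y is zero-initialized. */
#define POOL_CAP 256

struct InnerWork { int row, begin, end; };   /* one nested (inner) loop */

__global__ void spmv_lazy(const int *rowptr, const int *col,
                          const float *val, const float *x,
                          float *y, int nrows)
{
    __shared__ InnerWork pool[POOL_CAP];
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    /* Phase 1: instead of eagerly launching a child kernel per row,
     * each thread registers its nested loop in the shared pool. */
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nrows) {
        int slot = atomicAdd(&count, 1);
        pool[slot].row   = r;
        pool[slot].begin = rowptr[r];
        pool[slot].end   = rowptr[r + 1];
    }
    __syncthreads();

    /* Phase 2: the block drains the pooled inner loops cooperatively,
     * all threads striding over one row at a time. */
    for (int w = 0; w < count; ++w) {
        float sum = 0.0f;
        for (int k = pool[w].begin + threadIdx.x; k < pool[w].end;
             k += blockDim.x)
            sum += val[k] * x[col[k]];
        atomicAdd(&y[pool[w].row], sum);     /* accumulate partial sums */
    }
}
/* Launch: spmv_lazy<<<(nrows + 255) / 256, 256>>>(rowptr, col, val, x, y, nrows); */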
Multi-target Task Share
❑ Problem: Heterogeneity is everywhere. How can the entire system be exploited?
➢ Solution: We developed multiple-target task sharing: a task may be mapped to any available device, while if_device clauses select the parallelization strategy per device type, as in the example below.

for (int i = 0; i < N; i += BS) {
    #pragma omp target device(any) map(…) nowait
    #pragma omp if_device(NVIDIA, cc35) teams distribute parallel for
    #pragma omp if_device(host) parallel for
    for (int j = i; j < i + BS; ++j)
        <…COMPUTATION…>;
}