
OpenMP in a Heterogeneous World

Ayodunni Aribuki
Advisor: Dr. Barbara Chapman

HPCTools Group
University of Houston

2

Top 10 Supercomputers (June 2011)

3

Why OpenMP?

• Shared memory parallel programming model
– Extends C, C++, Fortran

• Directives-based
– Single code for sequential and parallel versions

• Incremental parallelism
– Little code modification

• High-level
– Leaves multithreading details to the compiler and runtime

• Widely supported by major compilers
– Open64, Intel, GNU, IBM, Microsoft, …
– Portable

www.openmp.org

4

OpenMP Example

#pragma omp parallel
{
    int i;
#pragma omp for
    for (i = 0; i < 100; i++) {
        // do stuff
    }
    // do more stuff
}

[Diagram: fork-join execution of the loop above. At the parallel region the master thread forks a team; the 100 iterations are split across four threads (0-24, 25-49, 50-74, 75-99); an implicit barrier ends the for construct; each thread then runs "more stuff" before the join.]

5

Present/Future Architectures & the Challenges They Pose

[Diagram: a four-node cluster (Node 0-3), each node with its own memory; one node also has an attached accelerator with separate memory. Challenges labeled: many more CPUs (scalability), location, heterogeneity.]

6

Heterogeneous Embedded Platform

7

Heterogeneous High-Performance Systems

Each node has multiple CPU cores, and some of the nodes are equipped with additional computational accelerators, such as GPUs.

www.olcf.ornl.gov/wp-content/uploads/.../Exascale-ASCR-Analysis.pdf

8

Programming Heterogeneous Multicore: Issues

• Must map data/computations to specific devices

• Usually involves a substantial rewrite of the code

• Verbose code
– Move data to/from device x
– Launch kernel on device
– Wait until y is ready/done

• Portability becomes an issue
– Multiple versions of the same code
– Hard to maintain

Always hardware-specific!

9

Programming Models? Today’s Scenario

// Run one OpenMP thread per device per MPI node
#pragma omp parallel num_threads(devCount)
if (initDevice()) {
    // Block and grid dimensions
    dim3 dimBlock(12, 12);
    kernel<<<1, dimBlock>>>();
    cudaThreadExit();
} else {
    printf("Device error on %s\n", processor_name);
}
MPI_Finalize();
return 0;
}

www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf

10

OpenMP in the Heterogeneous World

• All threads are equal
– No vocabulary for heterogeneity or separate devices

• All threads must have access to the memory
– Distributed memories are common in embedded systems
– Memories may not be coherent

• Implementations rely on the OS and threading libraries
– Memory allocation, synchronization (e.g. Linux, Pthreads)

11

Extending OpenMP Example

#pragma omp parallel for target(dsp)
for (j = 0; j < m; j++)
    for (i = 0; i < n; i++)
        c(i,j) = a(i,j) + b(i,j);

[Diagram: the general-purpose processor cores hold the application data in main memory; a hardware accelerator (HWA) has its own device cores and its own copy of the application data. Execution proceeds as: upload remote data, remote procedure call, download remote data.]

12

Heterogeneous OpenMP Solution Stack

OpenMP Parallel Computing Solution Stack:
• User layer: OpenMP Application
• Prog. layer (OpenMP API): Directives, Compiler; OpenMP library; Environment variables
• System layer: Runtime library; OS/system support for shared memory; Core 1, Core 2, …, Core n; MCAPI, MRAPI, MTAPI

• Language extensions

• Efficient code generation


• Target Portable Runtime Interface

13

Summarizing My Research

• OpenMP on heterogeneous architectures
– Expressing heterogeneity
– Generating efficient code for GPUs/DSPs

• Managing memories
– Distributed
– Explicitly managed

• Enabling portable implementations

14

Backup

15

MCA: Generic Multicore Programming

• Solves the portability issue in embedded multicore programming

• Defines and promotes open specifications for:
– Communication: MCAPI
– Resource Management: MRAPI
– Task Management: MTAPI

(www.multicore-association.org)

16

Heterogeneous Platform: CPU + Nvidia GPU