The Challenge of Portable Parallel Programming at Scale

Prepared by Dr. Barbara Chapman, University of Houston
SimRace, Rueil-Malmaison, France, December 8, 2015
http://www.cs.uh.edu/~hpctools

Transcript of "The Challenge of Portable Parallel Programming at Scale"
(source: projet.ifpen.fr/Projet/upload/docs/application/pdf/2015-12/1_chapman_simrace.pdf)

Page 1: Title slide.

Page 2: On-Going Architectural Changes

- HPC system nodes continue to grow
- Thermal and power constraints are now key in design decisions
- Massive increase in intra-node concurrency
- Trend toward heterogeneity
- Deeper, more complex memory hierarchies

[Chart: "Top500: Av. Core Count" — average number of cores by year, with a
second data series whose label is not recoverable from the extraction]

  Year | Avg. core count | Second series
  2011 |        15,559.8 |           0
  2012 |        26,855.9 |     1,334.5
  2013 |        38,692.8 |     8,263.0
  2014 |        43,298.7 |     9,592.3
  2015 |        50,463.8 |    13,780.4

Page 3: Intel "Sea of Blocks" Compute Model (NLNI)

[Block diagram, (c) 2014 Intel: a CE host processor (full x86 with TLBs, SSE,
...) with iL1/dL1/sL1 caches, an asynchronous offload engine, a tweaked
decoder, a bus gasket, a bridge, and a unified L2 (uL2); an intra-accelerator
network connecting eight accelerator units (AU), each with its own iL1/sL1/dL1;
a shared sL2; a standard x86 on-die fabric and memory map with memory
controller (MC); an IPM bus to external DRAM & NVM; and a special I/O fabric.]

Page 4: 10+ Levels of Memory, O(100M) Cores

[Diagram, (c) 2014 Intel: memory/compute hierarchy from the ALU and register
file (RF) through L1$/L1S, L2$/L2S, LL$/LLS, IPM, DDR, and NVM down to the disk
pool. Approximate scale at each level:
  O(10)    cores per block
  O(100)   blocks w/ shared L2 per die
  O(1)     dies w/ shared LL$/SPAD per socket
  O(1)     sockets w/ IPM per board
  O(10)    boards w/ limited DDR+NVM per chassis
  O(1,000) chassis w/ large DDR+NVM per exa-machine
  plus machines + disk arrays]

Page 5: Integration of Accelerators: CAPI and APU

- IBM's Coherent Accelerator Processor Interface (CAPI) integrates accelerators
  into the system architecture with a standardized protocol
- AMD's Heterogeneous System Architecture (HSA)-based APU also integrates
  accelerators
- Nvidia's high-speed GPU interconnect (NVLink)

[Diagrams: a Power8 CPU with a CAPP on the coherence bus, attached over PCIe to
an accelerator with a PSL; an x86/ARM64/POWER CPU and a GPU, each with its own
L2, sharing global memory with hardware coherence over NVLink.]

Page 6: (no text content)

Page 7: HPC Applications: Requirements

Growing complexity in applications
- Multidisciplinary, increasing amounts of data
- Very dense connectivity in social networks, etc.
- How do we minimize communication in apps?

Performance
- Must exploit features of emerging machines at all levels
- APIs must facilitate expression of concurrency, help save power, use memory
  efficiently, exploit heterogeneity, and minimize synchronization

Performance portability
- Implies not just that APIs are widely supported
- But also that the same code runs well everywhere
- Very hard to accomplish; performance is less predictable in a dynamic
  execution environment

Page 8: Developing Future HPC Applications

Productivity
- Outside of HPC this is paramount
- How important is it in HPC? Application development costs are high

Any new HPC programming languages out there?
- DSLs will appear
- New approaches may yet emerge, esp. task-based ones such as Legion
- Need a reasonable migration path for existing code, along with
  interoperability to avoid unneeded rewrites
- The role of the application developer in detecting/overcoming errors and in
  efficient energy consumption is debated

Both MPI and OpenMP are being extended
- Exploring how they might work with proposed exascale runtimes
- Directive features may ultimately be integrated into the base languages

Page 9: US National Strategic Computing Initiative (NSCI), Summer 2015

- Exaflops of computing power for exabytes of data
- Maintain US leadership in HPC capabilities
- Improve HPC application developer productivity
- Make HPC readily available
- Establish hardware technology for future HPC systems
- Increase capacity and capability of the HPC ecosystem
- DOD, DOE, NSF lead the effort
  - DOE: capable exascale computing program
  - NSF: scientific discovery, ecosystem, workforce
  - DOD: data-analytic computing

Page 10: A Layered Programming Approach

Applications (Computational Chemistry, Climate Research, Astrophysics)
- New kinds of info: DSLs and other means for application scientists to provide
  information
- Familiar: adapted versions of today's portable parallel programming APIs
  (MPI, OpenMP, PGAS, Charm++)
- Custom: maybe some non-portable low-level APIs (threads, CUDA, Verilog)
- Very low-level: machine code, device-level interoperability standards, a
  powerful runtime
Heterogeneous Hardware

Page 11: A Layered Programming Approach (repeat of the previous slide)

Page 12: PGAS Programming Model

Partitioned Global Address Space. Characteristics:
- Global view of data
- Data affinity is part of the memory model
- Single-sided remote access

[Diagram: processes, each with its own CPU and memory partition, together
forming one partitioned global address space.]

Advantages:
- Representative of modern NUMA architectures
- Works well for distributed interconnects with RMA support
- Productivity and performance

Language extensions: UPC, Coarray Fortran (CAF), Titanium
Libraries: OpenSHMEM, Global Arrays, GASPI
Research: X10, Chapel, Fortress, CAF 2.0

Page 13: OpenSHMEM: PGAS Library

- Partitioned Global Address Space
  - Many processes running the same code, operating on different data
  - Data is private or remotely accessible
  - rDMA network support: IB, Cray Aries, BG/Q, ...
- Implements PGAS as a library
- Standardizes SPMD one-sided communications
  - Point-to-point & strided transfers
  - Broadcasts, concatenations, reductions
  - Synchronizations
  - Remote atomic memory operations, waits, distributed locks
- Our work: reference implementation, exploring new features

Barbara Chapman, Tony Curtis, Swaroop Pophale, Stephen Poole, Jeff Kuehn, Chuck
Koelbel, Lauren Smith. Introducing OpenSHMEM: SHMEM for the PGAS Community. In
PGAS'10: Proc. Fourth Conference on Partitioned Global Address Space
Programming Models, ACM, 2010.

Page 14: OpenMP 4.0

- Released July 2013
  - http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
  - http://www.openmp.org/mp-documents/OpenMP_Examples_4.0.1.pdf
- Main changes from 3.1:
  - Accelerator extensions
  - SIMD extensions
  - Places and thread affinity
  - Taskgroup and dependent tasks
  - Error handling (cancellation)
  - User-defined reductions
- OpenMP 4.5 enhances these features
  - Released November 2015

#pragma omp parallel
#pragma omp for schedule(dynamic)
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
} /* implicit barrier here */

Page 15: Use of OpenMP

- Moderate-size scientific and technical applications
  - Initially, Fortran binding only
- General-purpose multicore programming
  - Tasks, C and C++ bindings
- Embedded systems
  - Tasks, kernel offloads
- Large-scale parallel computations
  - Usually in conjunction with MPI
- Entry-level parallel programmers

Many application developers think that using directives means they don't have
to restructure code. That is not true.

Page 16: The OpenMP ARB 2015

- OpenMP is maintained by the OpenMP Architecture Review Board (the ARB), which
  - Interprets OpenMP
  - Writes new specifications to keep OpenMP relevant
  - Works to increase the impact of OpenMP
- Members are organizations, not individuals
- Current members
  - Permanent: AMD, ARM, Cray, Fujitsu, HP, IBM, Intel, Micron, NEC, Nvidia,
    Oracle, Red Hat, Texas Instruments
  - Auxiliary: ANL, ASC/LLNL, BSC, cOMPunity, EPCC, LANL, LBNL, NASA, ORNL,
    RWTH Aachen, SNL, TACC, University of Houston
- IWOMP and OpenMPCon: http://www.iwomp.org

www.openmp.org: "High-level directive-based multi-language parallelism that is
performant, productive and portable"

Page 17: OpenMP 4.0 Affinity

- OpenMP places and thread-affinity policies
- OMP_PLACES to describe hardware regions
- affinity(spread|compact|true|false)
  - SPREAD: spread threads evenly among the places (e.g. spread 8)
  - COMPACT: collocate OpenMP threads with the master thread (e.g. compact 4)

[Diagram: a machine with two chips, eight cores (core 0-7), two hardware
threads per core (t0-t15), and eight places p0-p7, illustrating how the spread
and compact policies distribute OpenMP threads over the places.]

Page 18: The Tasking Concept in OpenMP

[Diagram: one thread generates tasks; a team of threads executes them.]

Dependences between tasks are now also supported.

Page 19: OpenMP for Accelerators

#pragma omp target data device(gpu0) map(to: n, m, omega, ax, ay, b, \
        f[0:n][0:m]) map(tofrom: u[0:n][0:m]) map(alloc: uold[0:n][0:m])
while ((k <= mits) && (error > tol)) {
    // a loop copying u[][] to uold[][] is omitted here
    ...
    #pragma omp target device(gpu0)
    #pragma omp parallel for private(resid, j, i) reduction(+:error)
    for (i = 1; i < (n-1); i++)
        for (j = 1; j < (m-1); j++) {
            resid = (ax * (uold[i-1][j] + uold[i+1][j])
                   + ay * (uold[i][j-1] + uold[i][j+1])
                   + b * uold[i][j] - f[i][j]) / b;
            u[i][j] = uold[i][j] - omega * resid;
            error = error + resid * resid;
        }
    // rest of the code omitted ...
}

Early Experiences With The OpenMP Accelerator Model; Chunhua Liao, Yonghong
Yan, Bronis R. de Supinski, Daniel J. Quinlan and Barbara Chapman;
International Workshop on OpenMP (IWOMP) 2013, September 2013

Page 20: New and Coming Soon... OpenMP 4.5

- Device construct enhancements
  - More control and flexibility in data movement between host and devices;
    multiple device types
  - Asynchronous support with nowait and depend
  - "Deep copy" for pointer-based structures/objects
- Loop parallelism enhancements
  - Extended ordered clause to support do-across (e.g. wavefront) parallelism
    for loop nests
  - New taskloop construct for asynchronous loop parallelism with control over
    task grain size
- Array reductions for C and C++, and more...

OpenMP 5.0 discussions are already under way
- Interoperability and composability
- Locality and affinity
- General error model
- Transactional memory
- Additional looping constructs
- Enhanced tasking support

4.5 specification at www.openmp.org

Page 21: OpenACC Programming Model

- Announced at Supercomputing 2011
  - Initial work by NVIDIA, Cray, PGI, CAPS
- Directive-based programming for accelerators
  - For Fortran, C, C++
  - Loop-based computations
- Compilers: PGI, Cray, CAPS, OpenARC, OpenUH, GCC (4.9)

Page 22: OpenACC Features

- High-level directive-based programming API for accelerators such as GPUs,
  APUs, Intel's Xeon Phi, FPGAs, and even DSP chips
- Data directives: copy, copyin, copyout, etc.
- Data synchronization directive: update
- Compute directives
  - parallel: more control to the user
  - kernels: more freedom to the compiler
- Three levels of parallelism: gang, worker, and vector
- Commercial OpenACC compilers: PGI, Cray, PathScale
- Open-source OpenACC compilers: GCC 5.0, OpenARC, OpenUH, RoseACC, etc.

Page 23: OpenACC Compiler Translation

- Need to achieve coalesced memory access on GPUs

Compiling a High-level Directive-Based Programming Model for GPGPUs; Xiaonan
Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and Barbara
Chapman; 26th International Workshop on Languages and Compilers for Parallel
Computing (LCPC 2013)

Page 24: Performance Portability Across Compilers?

- The same OpenACC thread settings do not guarantee the best performance for
  both the OpenUH and PGI compilers (PGI 15.7)

[Figures: OpenUH performance for micro-RTM; PGI performance for micro-RTM]

Page 25: Research into Future Runtimes

- Roles and responsibilities of the runtime system?
- Relationship between the runtime, the OS, and the programming models?
- How is information exchanged?
- Does the runtime really have to be dynamic?

Page 26: What to Model?

Cost models: parallel model, processor model, cache model
- Loop overhead, parallel overhead, reduction cost
- Machine cost, computational resource cost, dependency latency cost, register
  spill cost, operation cost, issue cost
- Cache cost, mem_ref cost, TLB cost

[Chart: "HT3 BW vs Threads on 2 Istanbuls" — bandwidth (MB/s) versus thread
configuration (# of remote + # of local threads), for configurations from 1+2
through 5+6; measured values range from roughly 2,100 to 8,900 MB/s.]

Page 27: Open Community Runtime (OCR)

- Data abstraction: data-blocks (DBs)
  - Explicitly created and destroyed
  - The only available "global" memory
  - Data-blocks can move
- Compute abstraction: event-driven tasks (EDTs)
  - EDTs are fine-grained computation units
  - Execute only after all input data-blocks are ready
  - EDTs can create new EDTs and DBs
- Synchronization abstraction: events
  - Events help to set up EDT dependences
  - A "producer" EDT delivers a data-block to an event
  - The data-block is "consumed" by dependent EDTs
- Globally visible namespace: GUIDs

[Diagram: EDTs connected through events and data-blocks, all named by GUIDs.]

Page 28: HPX

- Prototype exascale software stack
- HPX is a low-level programming interface and runtime system
  - MPI, OpenMP over HPX
- Execution model
  - Dynamic adaptive resource management
  - Message-driven computation
  - Efficient synchronization
  - Global name space
  - Lightweight threads
  - Task scheduling; thread migration for load balancing and throughput

[Diagram: XPRESS migration stack — an MPI/OpenMP application running either on
the legacy stack (OpenMP compiler, MPI) or, via a thin OpenMP runtime glue
layer, on HPX within OpenX.]

Page 29: OpenMP over HPX (On-Going Work)

OpenMP translation:
- Maps OpenMP task and data parallelism onto HPX
- Exploits data-flow execution capabilities at scale
- Big increase in throughput for fine-grained tasks
- No direct interface to OS threads; no tied tasks
- Threadprivate is tricky and slow; places are not supported
- OpenMP task dependencies via futures
- HPX locks are faster than OS locks

[Chart: "LU Run Time on 40 Threads, size = 8192" — run time (seconds, 0-90)
versus block size (512 down to 64) for the Intel and HPX runtimes.]

Page 30: Auto-Generation of Tasks

What is the "right" way for an application developer to express tasks? An old
example:
- The compiler translates OpenMP into a collection of tasks and a task graph
- It analyzes data usage per task

What is the "right" size of a task? It might need to be adjusted at run time.
- Runtime trade-off between load balance and co-mapping of tasks that use the
  same data
- In-situ mappings execute a task where its data is

T.-H. Weng, B. Chapman: Implementing OpenMP Using Dataflow Execution Model for
Data Locality and Efficient Parallel Execution. Proc. HIPS-7, 2002

Page 31: Productive Application Development and Deployment Environment

A model to adjust the number of threads, the schedule, and other policies to
optimize both energy and execution performance:
- Restructuring / refactoring
- High-accuracy information on execution behavior
- Models to guide parameter choices
- Dynamic adjustment adapts to the current state of the system and of the
  application / data
- Open interactions across the software stack

Page 32: Where are Directives Headed?

- Directives are a convenient means to express parallelism
  - They enable users to retain their Fortran / C / C++ ... code
- OpenMP continues to evolve to meet new needs
  - Strong HPC representation; prescriptive model
  - Data locality, affinity; reduced reliance on barriers
- OpenACC is a more focused effort, with faster progress
  - Will it be subsumed by OpenMP?
- Both require (node) code restructuring
- Is there a subset of OpenMP (and a programming style) that will work
  particularly well for exascale nodes?
  - Or do we need a new set of directives to generate tasks?

Page 33:

Life is Short

Page 34:

Questions?