Barbara Chapman, University of Houston

This work is supported by the National Science Foundation under grants CCF-0444468 and CCF-0702775, and by the Department of Energy under grant DE-FC02-06ER25759.
Talk Outline
• What is OpenMP?
• Performance aspects of OpenMP programming
• How might OpenMP better support multicore?
• Further down the road: OpenMP Futures
Open specifications for Multi Processing:
• Compiler directives, library routines, and environment variables for specifying shared-memory parallelism
• Widely available
An API for writing multithreaded applications using Fortran, C, and C++.
OpenMP Programming Model
Fork-join parallelism: the master thread spawns a team of threads as needed.
Parallelism is added incrementally until the desired performance is achieved; i.e., the sequential program evolves into a parallel program.
Parallel Regions
[Figure: the master thread forks teams of threads for parallel regions; parallel regions may be nested]
The OpenMP Shared Memory API
Directive-based multithreaded programming:
• The user makes strategic decisions; the compiler figures out the details
• Threads communicate by sharing variables
• Synchronization is used to order accesses and prevent data conflicts
• Structured programming reduces the likelihood of bugs
Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp).
#pragma omp parallel
#pragma omp for schedule(dynamic)
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
} /* implicit barrier here */
History
OpenMP Architecture Review Board:
• AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, The Portland Group, Inc., SGI, Sun, Microsoft
• ASC/LLNL, cOMPunity, EPCC, NASA, RWTH Aachen University
Spec release dates:
• Fortran v1.0 (Oct '97), C/C++ v1.0 (Oct '98)
• Fortran v2.0 (Nov '00), C/C++ v2.0 (Mar '02)
• C/C++ and Fortran v2.5 (May '05)
• C/C++ and Fortran v3.0 (May '08)
OpenMP 3.0: Tasks

struct node {
    struct node *left;
    struct node *right;
};
extern void process(struct node *);

void postorder_traverse(struct node *p) {
    if (p->left) {
        #pragma omp task   // p is firstprivate by default
        postorder_traverse(p->left);
    }
    if (p->right) {
        #pragma omp task   // p is firstprivate by default
        postorder_traverse(p->right);
    }
    #pragma omp taskwait
    process(p);
}

The task directive packages the work and adds an explicit task to a task pool; taskwait waits for all child tasks to complete.
OpenMP 3.0: User-level Library and Environment
• Busy waiting may consume valuable resources and interfere with the work of other threads; OpenMP 3.0's OMP_WAIT_POLICY gives the user more control over the way idle threads are handled
• Enhanced support for nested parallelism: library routines to determine the depth of nesting and the IDs of parent/grandparent threads
• Different regions may have different defaults
  ○ E.g. omp_set_num_threads() inside a parallel region
Some Remarks
• Flexible parallel programming: not all the code in an OpenMP program must be parallelized
• Low-level programming using thread IDs
• Can be combined with MPI
• Dynamic adjustment of the schedule (for load balancing) and of the number of threads
• With some care, a single source for sequential and parallel code is possible
Cart3D OpenMP Scaling
• The OpenMP version used the same domain decomposition strategy as MPI for data locality, avoiding false sharing and fine-grained remote data access
• The OpenMP version slightly outperformed the MPI version on the SGI Altix 3700BX2, with both close to linear scaling
• 4.7 M cell mesh Space Shuttle Launch Vehicle example (M = 2.6, flow angles 2.09° and 0.8°)
Hybrid MPI+OpenMP Programming Model
• Seems to match the structure of large-scale platforms
• Greater interest with the widespread introduction of multicore chips
• Individual cores may be single-threaded or multithreaded
• Threads on cores share resources (L2 cache, memory bandwidth), not necessarily in a uniform way
• Systems may be heterogeneous
• Remarkably poorly understood
OVERFLOW2 - DLRF6 Case
• MPI+OpenMP version: numerically explicit scheme + implicit scheme
• The hybrid version outperformed the pure MPI version on the IBM p575+, but the same hybrid code did not give the same benefits on the SGI Altix
• 36 M grid points, 23 zones, DLRF6 benchmark configuration
Courtesy of Dennis Jespersen, NASA Ames
Hybrid MPI+OpenMP
• Fewer processes, so usually less data exchanged; the communication/computation ratio changes
• Multilevel parallelism: may be able to exploit additional finer-grain parallelism using OpenMP
• Alleviates MPI load balancing problems when computations are expressed in OpenMP
• May require careful tuning of the OpenMP code
OpenMP Performance Problems
• Overheads of OpenMP compilation: these differ between constructs and implementations; translation also impacts traditional optimization
• Overheads of runtime library routines: some are called frequently
• Algorithmic overheads, if additional work is needed to enable parallelization
• Too much synchronization, possibly caused by poor load balance
• Poor cache utilization and false sharing
  ○ Poor memory usage has a variety of causes
OpenMP Parallel Computing Solution Stack
• User layer: end user, application
• Programming layer (OpenMP API): directives, compiler, OpenMP library, environment variables
• System layer: runtime library; OS/system support for shared memory
OpenMP Implementation: OpenUH
• Frontends: parse OpenMP pragmas
• OMP_PRELOWER: preprocessing, semantic checking
• LOWER_MP: generation of microtasks for parallel regions, insertion of runtime calls, variable handling, ...
• Runtime library: support for thread manipulation, implements user-level routines, monitoring environment
OpenMP Code Translation

Original code:

int main(void)
{
    int a, b, c;
#pragma omp parallel private(c)
    do_sth(a, b, c);
    return 0;
}

Translated code:

_INT32 main()
{
    int a, b, c;
    /* microtask */
    void __ompregion_main1()
    {
        _INT32 __mplocal_c;
        /* shared variables are kept intact;
           accesses to the private variable
           are substituted */
        do_sth(a, b, __mplocal_c);
    }
    ... /* OpenMP runtime calls */
    __ompc_fork(&__ompregion_main1);
}
Overheads of OpenMP Directives
[Chart: EPCC microbenchmark overheads, in cycles, on the SGI Altix 3600 for the PARALLEL, FOR, PARALLEL FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, ATOMIC and REDUCTION constructs, for 1 to 256 threads]
Synchronization Matters
It is really important to minimize the cost of synchronization and wait states: global barriers, critical regions, and locks. For example:
• Reduce barrier usage with the nowait clause
• Choose carefully between master and single
• Avoid the ordered construct
• Avoid large critical regions
• Use an appropriate scheduler to avoid long waits
• Consider fine-grain locks
  ○ It's hard to get locks right
[Chart: performance after the offending critical region was rewritten]
Courtesy of R. Morgan, NASA Ames
OpenMP: Best Practices
Carefully choose the most appropriate loop schedule and chunk size.
[Charts: speedup of the Smith-Waterman sequence alignment algorithm vs. number of threads (2-128), for three problem sizes (100, 600, 1000) against ideal scaling, comparing plain #pragma omp for with #pragma omp for schedule(dynamic, 1); the dynamic schedule reaches 128 threads with 80% efficiency]
OpenMP: Best Practices
Overlap I/O and computation: OpenMP does not have parallel I/O. Here, one thread fetches data while the others compute; threads doing the reading and writing join the computation when they are done.

#pragma omp parallel
{
#pragma omp single
    { ReadFromFile(0, ...); }
    for (i = 0; i < N; i++) {
#pragma omp single nowait
        { ReadFromFile(i+1, ....); }
#pragma omp for schedule(dynamic)
        for (j = 0; j < ProcessingNum; j++)
            ProcessChunkOfData();
#pragma omp single nowait
        { WriteResultsToFile(i); }
    }
}
OpenMP: Best Practices
Minimize the number of times parallel regions are entered/exited.

Before:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
#pragma omp parallel for private(k)
        for (k = 0; k < n; k++)
        { ....... }

After:

#pragma omp parallel private(i,j,k)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for
            for (k = 0; k < n; k++)
            { ....... }
}
OpenMP: Best Practices
Tune for cache and avoid false sharing, one source of memory inefficiencies. The problem arises when threads access the same cache line:

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i] += i;

Padding each element out to its own cache line avoids it:

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i][0] += i;
MPI and/or OpenMP
[Chart: speedup vs. number of processors on the IBM p690]
Courtesy of Behrens and O. Haan, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, and L. Kornblueh, MPI für Meteorologie, Hamburg
OpenMP: Best Practices
Privatize variables where possible; private variables are stored in a thread's local stack.

Shared array indexed by thread:

double a[MaxThreads][N][N];
#pragma omp parallel for
for (i = 0; i < MaxThreads; i++) {
    for (int j ...)
        for (int k ...)
            a[i][j][k] = ...
}

Privatized:

double a[N][N];
#pragma omp parallel private(a)
{
    for (int j ...)
        for (int k ...)
            a[j][k] = ...
}
Example: Hybrid CFD Code (MPI x OpenMP)
A single procedure is responsible for 20% of the total time in the OpenMP version (1x8) and is 9 times slower than in the MPI version (8x1). Why?
• Privatizing several arrays improved the performance of the whole program by 30% and resulted in a speedup of 10 for the procedure
• Now this procedure takes only 5% of the total time
• Processor stalls are reduced significantly
Stall Cycle Breakdown for Non-Privatized (NP) and Privatized (P) Versions of diff_coeff
[Chart: cycles (0 to 5e10) spent in D-cache stalls, branch misprediction, instruction miss stalls, FLP units, and front-end flushes, for NP, P, and NP-P]
OpenMP: Best Practices
Data placement on NUMA architectures: use the first-touch policy or system commands to place data appropriately.
[Figure: quartet of four dual-core Opterons]
Avoid thread migration, which affects data locality. Bind threads to cores:
• Linux: numactl --cpubind=0 foobar, or taskset -c 0,1 foobar
• SGI Altix: dplace -x2 foobar
OpenMP: Best Practices
Corollary: avoid nested parallel regions.
GenIDLEST Hybrid 1x8 vs. 8x1
• Pure MPI is 16% faster than pure OpenMP, but OpenMP uses 30% less memory
• The OpenMP code will improve further if we merge more parallel regions and reduce synchronization
• Reduced communication and a smaller memory footprint may be crucial benefits in the future
• Less communication with OpenMP: direct memory copies replace send/recv buffers
Many Cores Coming, Ready or Not
An Intel prediction of what technology might support:
• 2010: 16-64 cores, 200 GF-1 TF
• 2013: 64-256 cores, 500 GF-4 TF
• 2016: 256-1024 cores, 2 TF-20 TF
[Figure: Niagara 2]
A Challenge: Dealing with Locality
• OpenMP does not permit explicit control over data locality: a thread fetches the data it needs into its local cache
• What do we do now?
  ○ Implicit means of data layout ("first touch") are popular
  ○ Privatize and optimize cache usage
• There are a variety of suggestions for extensions; the simplest is a "next touch" directive
Ideas for Locality Support
• Control thread placement as well as data locality
• Data placement techniques: more system support, a next-touch directive
• Make nested parallelism really work: describe the structure of nesting and the number of threads in advance as a tree; map the entire tree to system resources; this permits thread binding
• Thread binding techniques: via system calls or the command line; programmer hints to "spread out" or "keep close together"
Subteams of Threads?

for (j = 0; j < ProcessingNum; j++) {
    #pragma omp for on threads(m:n:k)
    for (k = 0; k < M; k++) {   // on threads in subteam
        ...
        Process_data();
    }   // barrier involves subteam only
}

• Increases the expressivity of single-level parallelism
• Low overhead because of static partitioning
• Facilitates thread-core mapping for better data locality and less resource contention
Hybrid MPI/OpenMP
Flexible overlapping of computation and communication typically requires explicit OpenMP code based on thread IDs, and needs MPI_THREAD_MULTIPLE.

!$OMP PARALLEL
    if (thread_id .eq. id1) then
        call mpi_routine1()
    else if (thread_id .eq. id2) then
        call mpi_routine2()
    else
        call do_compute()
    endif
!$OMP END PARALLEL
Hybrid MPI/OpenMP: Subteams
• Subteams facilitate the overlapping of computation and communication
• Directives are applied to some of the threads in a team
• The barrier applies to the subteam only

#pragma omp parallel
{
#pragma omp for onthreads( team1 )
    for (...) {
        // this team of threads communicates
        MPI_Send/Recv ...
    } /* barrier here only involves team1 */
#pragma omp for onthreads( team2 )
    for (...) {
        ... /* a team of threads computes */
    } /* barrier here is only for team2 */
    ...
#pragma omp for
    for (...) {
        /* work based on halo information */
    }
} /* end omp parallel */

Courtesy of Rolf Rabenseifner et al.
BT-MZ Performance with Subteams
• Platform: Columbia@NASA
• Subteam: a subset of an existing team
ClearSpeed Accelerator: CSX600
• Processor core: 40.32 64-bit GFLOPS, 10 W typical, 210 MHz, 96 PEs (6 Kbytes each), 8 redundant PEs
• SoC details: integrated DDR2 memory controller with ECC support, 128 Kbytes of SRAM
• Design details: IBM 130 nm process, 128 million transistors (47% logic, 68% memory)
• Sampled Q3 2005
Copyright © 2007-8 ClearSpeed Technology plc. All rights reserved.
What is the Programming Model?
Heterogeneous programming is currently very low-level. How are we going to program such systems in the future?
If OpenMP is to be used to program a board with devices such as accelerators or GPGPUs, some problems must be solved:
• How do we identify code that should run on accelerators?
• How do we share data between host cores and other devices?
• What if a device is not available?
• How is this compiled? Debugged?
Existing Efforts
• IBM OpenMP for Cell/Cyclops
• Intel EXOCHI
• CAPS HMPP
• Streaming OpenMP (ACOTES)
• OpenMP on ClearSpeed
Implicit Levels of Parallelism
[Figure: MPI across a network of nodes; OpenMP over cores sharing memory within a node; HMPP over hardware accelerators (HWA1, HWA2), each with its own local memory]
Courtesy of CAPS, SA
OpenMP: Where should code run?

#pragma omp parallel private(j,k) ontarget(acc1, acc2) in(b) out(a)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for
            for (k = 0; k < n; k++)
            { ....... }
}

#pragma omp parallel private(j,k)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for ontarget(acc1, acc2) in(b) out(a)
            for (k = 0; k < n; k++)
            { ....... }
}
Multiple Devices
Use #D accelerators in parallel:

#pragma omp parallel for private(j)
for (jj = 0; jj < #D; jj++) {
    for (j = jj*(n/#D); j < jj*(n/#D) + (n/#D); j++) {
#pragma hmpp tospeedup1 callsite
        simplefunc1(n, t1[j], t2, t3[j], alpha);
    }
#pragma hmpp tospeedup1 release
}

Courtesy of CAPS
Summary
• A good deal to learn about how to get performance in OpenMP
• Code modification may be needed
• Future versions of OpenMP are likely to provide more support for larger numbers of cores, and maybe more
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11387
Reference Material on OpenMP
• OpenMP homepage, www.openmp.org: the primary source of information about OpenMP and its development
• OpenMP User's Group (cOMPunity) homepage, www.compunity.org
Books:
• Using OpenMP, Barbara Chapman, Gabriele Jost and Ruud van der Pas, Cambridge, MA: The MIT Press, 2007, ISBN: 978-0-262-53302-7
• Parallel Programming in OpenMP, Rohit Chandra et al., San Francisco, CA: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1558606718
Search: www.google.com: OpenMP