Page 1

Barbara Chapman
University of Houston

This work is supported by the National Science Foundation under grants CCF-0444468 and CCF-0702775, and by the Department of Energy under grant DE-FC02-06ER25759.

Page 2

Talk Outline

What is OpenMP?
Performance aspects of OpenMP programming
How might OpenMP better support multicore?
Further down the road: OpenMP Futures

Page 3

Open specifications for Multi Processing

• Compiler directives, library routines, and environment variables for specifying shared-memory parallelism
• Widely available

An API for writing multithreaded applications in Fortran, C, and C++

Page 4

OpenMP Programming Model

Fork-join parallelism: the master thread spawns a team of threads as needed.

Parallelism is added incrementally until the desired performance is achieved, i.e. the sequential program evolves into a parallel program.

[Figure: the master thread forks and joins a sequence of parallel regions, including a nested parallel region.]
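To make the model concrete, here is a minimal fork-join sketch (mine, not from the slides): the master thread forks a team at the parallel directive and joins it at the implicit barrier that ends the region.

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("before: master thread only\n");
#pragma omp parallel                /* fork: master spawns a team */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                               /* join: implicit barrier, team disbands */
    printf("after: master thread only\n");
    return 0;
}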

Page 5

The OpenMP Shared Memory API

Directive-based multithreaded programming:
The user makes strategic decisions; the compiler figures out the details.
Threads communicate by sharing variables.
Synchronization orders accesses and prevents data conflicts.
Structured programming reduces the likelihood of bugs.

Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp).

#pragma omp parallel
#pragma omp for schedule(dynamic)
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
} /* implicit barrier here */

Page 6

History

OpenMP Architecture Review Board: AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, The Portland Group, Inc., SGI, Sun, Microsoft; ASC/LLNL, cOMPunity, EPCC, NASA, RWTH Aachen University

Spec release dates:
Fortran v1.0 (Oct '97), C/C++ v1.0 (Oct '98)
Fortran v2.0 (Nov '00), C/C++ v2.0 (Mar '02)
C/C++ and Fortran v2.5 (May '05)
C/C++ and Fortran v3.0 (May '08)

Page 7

OpenMP 3.0: Tasks

struct node {
    struct node *left;
    struct node *right;
};
extern void process(struct node *);

void postorder_traverse(struct node *p) {
    if (p->left) {
#pragma omp task                     /* p is firstprivate by default;          */
        postorder_traverse(p->left); /* packages and adds an explicit task
                                        to a task pool                         */
    }
    if (p->right) {
#pragma omp task                     /* p is firstprivate by default */
        postorder_traverse(p->right);
    }
#pragma omp taskwait                 /* wait for all child tasks to complete */
    process(p);
}

Page 8

OpenMP 3.0: User-level Library and Environment

Busy waiting may consume valuable resources and interfere with the work of other threads. OpenMP 3.0 gives the user more control over how idle threads are handled (via OMP_WAIT_POLICY).

Enhanced support for nested parallelism:
Library routines to determine the depth of nesting and the IDs of parent/grandparent threads (illustrated below).
Different regions may have different defaults, e.g. omp_set_num_threads() inside a parallel region.
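A small sketch (not from the slide) of the 3.0 nesting queries just mentioned; omp_get_level() and omp_get_ancestor_thread_num() are standard 3.0 routines, and the wait policy itself is set outside the program, e.g. OMP_WAIT_POLICY=passive in the environment.

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);                      /* allow nested parallel regions */
#pragma omp parallel num_threads(2)
    {
#pragma omp parallel num_threads(3)         /* inner region: its own team size */
        {
            int level  = omp_get_level();                        /* depth = 2 */
            int parent = omp_get_ancestor_thread_num(level - 1); /* parent id */
            printf("level %d, parent %d, me %d\n",
                   level, parent, omp_get_thread_num());
        }
    }
    return 0;
}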

Page 9

Some Remarks

Flexible parallel programming:
Not all the code in an OpenMP program must be parallelized.
Low-level programming using thread ids is possible.
Can be combined with MPI.

Dynamic adjustment: of the schedule, for load balancing, and of the number of threads.

With some care, a single source for sequential and parallel code is possible (see the sketch below).
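One common way to get that single source (a sketch, not from the slides) is to guard runtime calls with the standard _OPENMP macro, which compilers define only when the OpenMP flag is enabled:

#ifdef _OPENMP
#include <omp.h>
#endif

/* Thread id under OpenMP; 0 in a purely sequential build. */
static int my_thread_id(void) {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}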

Page 10

Cart3D OpenMP Scaling

• The OpenMP version used the same domain decomposition strategy as MPI for data locality, avoiding false sharing and fine-grained remote data access.
• The OpenMP version slightly outperformed the MPI version on the SGI Altix 3700BX2, with both close to linear scaling.

4.7 M cell mesh, Space Shuttle Launch Vehicle example (M = 2.6, α = 2.09°, β = 0.8°)

Page 11

Hybrid MPI+OpenMP Programming Model

Seems to match the structure of large-scale platforms.

Interest has grown with the widespread introduction of multicore chips:
Individual cores may be single threaded or multithreaded.
Threads on cores share resources (L2 cache, memory bandwidth), not necessarily in a uniform way.
Systems may be heterogeneous.

Remarkably poorly understood.

Page 12

OVERFLOW2 - DLRF6 Case

MPI+OpenMP version: numerically explicit scheme + implicit scheme.
The hybrid version outperformed the pure MPI version on the IBM p575+, but the same hybrid code did not give the same benefits on the SGI Altix.

36 M grid points, 23 zones, DLRF6 benchmark configuration

Courtesy of Dennis Jespersen, NASA Ames

Page 13

Hybrid MPI+OpenMP

Fewer processes, so usually less data is exchanged; the communication/computation ratio changes.

Multilevel parallelism: additional finer-grain parallelism may be exploitable using OpenMP.

Alleviates MPI load-balancing problems when computations are expressed in OpenMP.

May require careful tuning of the OpenMP code.

Page 14

OpenMP Performance Problems

Overheads of OpenMP compilation: these differ between constructs and implementations; the translation can also impact traditional optimization.
Overheads of runtime library routines: some are called frequently.
Algorithmic overheads: additional work may be needed to enable parallelization.
Too much synchronization: possibly caused by poor load balance.
Poor cache utilization and false sharing: poor memory usage has a variety of causes.

Page 15

OpenMP Parallel Computing Solution Stack

[Figure: a layered stack. User layer: End User, Application. Prog. layer (OpenMP API): directives/compiler, OpenMP library, environment variables. System layer: runtime library, OS/system support for shared memory.]

Page 16

OpenMP Implementation: OpenUH

Frontends: parse OpenMP pragmas.
OMP_PRELOWER: preprocessing, semantic checking.
LOWER_MP: generation of microtasks for parallel regions, insertion of runtime calls, variable handling, ...
Runtime library: support for thread manipulation; implements user-level routines; monitoring environment.

OpenMP code translation:

int main(void)
{
    int a, b, c;
#pragma omp parallel private(c)
    do_sth(a, b, c);
    return 0;
}

becomes:

_INT32 main()
{
    int a, b, c;
    /* microtask */
    void __ompregion_main1()
    {
        _INT32 __mplocal_c;
        /* shared variables are kept intact; accesses to the
           private variable are substituted */
        do_sth(a, b, __mplocal_c);
    }
    ...
    /* OpenMP runtime calls */
    __ompc_fork(&__ompregion_main1);
    ...
}

Page 17

Overheads of OpenMP Directives

[Chart: OpenMP overheads measured with the EPCC microbenchmarks on an SGI Altix 3600. Y-axis: overhead in cycles, 0 to 1,400,000; x-axis: number of threads, 1 to 256. Constructs measured: PARALLEL, FOR, PARALLEL FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, ATOMIC, REDUCTION.]

Page 18

Synchronization Matters

It is really important to minimize the cost of synchronization and wait states: global barriers, critical regions, and locks. For example:
Reduce barrier usage with the nowait clause (see the sketch below).
Choose carefully between master and single.
Avoid the ordered construct.
Avoid large critical regions.
Use an appropriate scheduler to avoid long waits.
Consider fine-grain locks (though it's hard to get locks right).
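For instance, a sketch of the nowait idea (my example; a, b, f, g, and n are hypothetical, and the two loops are assumed to touch independent data): dropping the implied barrier after the first loop lets each thread move straight on to the second.

#pragma omp parallel
{
#pragma omp for nowait        /* no barrier: threads proceed immediately */
    for (int i = 0; i < n; i++)
        a[i] = f(i);

#pragma omp for               /* implicit barrier kept at the end */
    for (int i = 0; i < n; i++)
        b[i] = g(i);          /* safe only because b[] does not depend on a[] */
}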

Page 19

Cascade

The offending critical region was rewritten.

Courtesy of R. Morgan, NASA Ames

Page 20

OpenMP: best practices

Carefully choose the most appropriate loop schedule and chunk size (a sketch follows).

Smith-Waterman Sequence Alignment Algorithm
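As a sketch of the trade-off (not the Smith-Waterman code itself; work_on_row and n are hypothetical): when iterations do uneven amounts of work, a small dynamic chunk balances the load at the cost of scheduling overhead, while the default static schedule is cheap but can leave threads idle.

/* Row i costs O(i): the work grows with the index. */
#pragma omp parallel for schedule(static)     /* contiguous blocks: one thread
                                                 gets all the long rows       */
for (int i = 0; i < n; i++)
    work_on_row(i);

#pragma omp parallel for schedule(dynamic, 1) /* threads grab rows one at a
                                                 time: balanced, more overhead */
for (int i = 0; i < n; i++)
    work_on_row(i);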

Page 21

OpenMP: best practices

[Two speedup plots for the Smith-Waterman Sequence Alignment Algorithm: speedup (log scale, 1 to 100) versus threads (2 to 128) for problem sizes 100, 600, and 1000, plus the ideal curve. Left: #pragma omp for with the default static schedule. Right: #pragma omp for schedule(dynamic, 1), which reaches 128 threads with 80% efficiency.]

Page 22

OpenMP: best practices

#pragma omp parallel
{
#pragma omp single
    { ReadFromFile(0, ...); }

    for (i = 0; i < N; i++) {
#pragma omp single nowait
        { ReadFromFile(i + 1, ...); }    /* prefetch the next chunk */

#pragma omp for schedule(dynamic)
        for (j = 0; j < ProcessingNum; j++)
            ProcessChunkOfData();        /* compute on the current chunk */

#pragma omp single nowait
        { WriteResultsToFile(i); }
    }
}

Overlap I/O and computation: OpenMP does not have parallel I/O. Here, one thread fetches data while the others compute; the threads doing the reading and writing join the computation when they are done.

Page 23

OpenMP: best practices

Minimize the number of times parallel regions are entered/exited; repeated entry is one source of memory inefficiencies.

/* Region entered n*n times: */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
#pragma omp parallel for private(k)
        for (k = 0; k < n; k++) { ... }

/* Region entered once: */
#pragma omp parallel private(i,j,k)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for
            for (k = 0; k < n; k++) { ... }
}

Page 24

OpenMP: best practices

Tune for cache. Avoid false sharing, a problem when threads access the same cache line.

/* False sharing: neighboring threads write adjacent elements
   that share a cache line. */
int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < N; i++)
    a[i] += i;

/* Padded: each thread's element sits on its own cache line. */
int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < N; i++)
    a[i][0] += i;

Page 25

MPI and/or OpenMP

[Speedup chart on an IBM p690.]

Behrens, O. Haan, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen; L. Kornblueh, MPI für Meteorologie, Hamburg

Page 26

OpenMP: best practices

Privatize variables where possible. Private variables are stored in a thread's local stack.

/* Shared 3-D array, one slice per thread: */
double a[MaxThreads][N][N];
#pragma omp parallel for
for (i = 0; i < MaxThreads; i++) {
    for (int j ...)
        for (int k ...)
            a[i][j][k] = ...;
}

/* Privatized: each thread gets its own a[N][N] on its stack: */
double a[N][N];
#pragma omp parallel private(a)
{
    for (int j ...)
        for (int k ...)
            a[j][k] = ...;
}

Page 27

Example: Hybrid CFD Code (MPI x OpenMP)

Comparing the OpenMP version (1x8) with the MPI version (8x1): a single procedure is responsible for 20% of the total time in the OpenMP version and is 9 times slower than the MPI version. Why?

Page 28

OpenMP: best practices

OpenMP Privatized Version
• Privatizing several arrays improved the performance of the whole program by 30% and resulted in a speedup of 10 for the procedure.
• This procedure now takes only 5% of the total time.
• Processor stalls are reduced significantly.

[Chart: stall cycle breakdown for the non-privatized (NP) and privatized (P) versions of diff_coeff. Y-axis: cycles, 0 to 5.00E+10. Categories: D-cache stalls, branch misprediction, instruction miss stalls, FLP units, front-end flushes. Series: NP, P, NP-P.]

Page 29

OpenMP: best practices

Data placement on NUMA architectures: use the first-touch policy or system commands to place data appropriately (a sketch follows).

Quartet of four dual-core Opterons
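A sketch of the first-touch idea (my example; a[], compute, and n are hypothetical): on typical NUMA Linux systems a page is placed on the node of the thread that first writes it, so initializing data with the same loop schedule that later computes on it keeps pages local.

/* Initialize with the same static schedule as the compute loop, so each
   page is first touched, and therefore placed, near the thread that uses it. */
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++)
    a[i] = 0.0;                   /* first touch decides page placement */

#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++)
    a[i] = compute(i);            /* accesses are now mostly node-local */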

Page 30

OpenMP: best practices

Avoid thread migration: it affects data locality. Bind threads to cores.

Linux:
    numactl --cpubind=0 foobar
    taskset -c 0,1 foobar
SGI Altix:
    dplace -x2 foobar

Corollary: avoid nested parallel regions.
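Threads can also be pinned from inside the program, complementing the commands above; a Linux-specific sketch (mine, using the glibc pthread_setaffinity_np call and assuming one OpenMP thread per pthread with thread i bound to core i):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <omp.h>

int main(void) {
#pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);  /* pin thread i to core i */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        /* ... computation now stays put, preserving cache and NUMA locality ... */
    }
    return 0;
}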

Page 31

GenIDLEST Hybrid 1x8 vs. 8x1

• Pure MPI is 16% faster than pure OpenMP, but OpenMP uses 30% less memory.
• The OpenMP code will improve further if we merge more parallel regions and reduce synchronization.
• Reduced communication and a smaller memory footprint may be crucial benefits in the future.

Less communication with OpenMP: direct memory copies replace send/recv buffers.

Page 32

Many Cores Coming, Ready or Not

An Intel prediction of what technology might support:
2010: 16 to 64 cores, 200 GF to 1 TF
2013: 64 to 256 cores, 500 GF to 4 TF
2016: 256 to 1024 cores, 2 TF to 20 TF

Niagara 2

Page 33

A Challenge: Dealing with Locality

OpenMP does not permit explicit control over data locality; a thread fetches the data it needs into its local cache. What do we do now?
Implicit means of data layout ("first touch") are popular.
Privatize and optimize cache usage.
There are a variety of suggestions for extensions; the simplest is a "next touch" directive.

Page 34

Ideas for Locality Support

Control thread placement as well as data locality.
Data placement techniques: more system support, a next-touch directive.
Make nested parallelism really work: describe the structure of the nesting and the number of threads in advance as a tree; map the entire tree to system resources; this permits thread binding.
Thread binding techniques: via system calls or the command line, or programmer hints to "spread out" or "keep close together".

Page 35

Subteams of Threads?

for (j = 0; j < ProcessingNum; j++) {
#pragma omp for onthreads(m:n:k)
    for (k = 0; k < M; k++) {    /* runs on the threads in the subteam */
        ...
        Process_data();
    }  /* barrier involves the subteam only */
}

Increases the expressivity of single-level parallelism.
Low overhead because of the static partition.
Facilitates thread-core mapping for better data locality and less resource contention.

Page 36

Hybrid MPI/OpenMP

Flexible overlapping of computation and communication typically requires explicit OpenMP code based on thread ids, and needs MPI_THREAD_MULTIPLE (requested at MPI initialization; see the sketch below).

!$OMP PARALLEL
      if (thread_id .eq. id1) then
         call mpi_routine1()
      else if (thread_id .eq. id2) then
         call mpi_routine2()
      else
         call do_compute()
      endif
!$OMP END PARALLEL
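On the C side (a sketch, not from the slide), the program must request the multiple-threads level at startup and check what the MPI library actually granted:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI grants only thread level %d\n", provided);
    /* ... parallel region in which designated threads make MPI calls ... */
    MPI_Finalize();
    return 0;
}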

Page 37

Hybrid MPI/OpenMP: Subteams

Subteams facilitate the overlapping of computation and communication:
Directives are applied to some of the threads in a team.
The barrier applies to the subteam only.

#pragma omp parallel
{
#pragma omp for onthreads(team1)
    for (...)                     /* this subteam communicates */
        MPI_Send/Recv...;
    /* barrier here only involves team1 */

#pragma omp for onthreads(team2)
    for (...) {
        ...                       /* this subteam computes */
    } /* barrier here is only for team2 */
    ...
#pragma omp for
    for (...) {
        /* work based on halo information */
    }
} /* end omp parallel */

Courtesy of Rolf Rabenseifner et al.

Page 38

BT-MZ Performance with Subteams

Platform: Columbia@NASA

Subteam: subset of existing team

Page 39

ClearSpeed Accelerator: CSX600

• Processor core: 40.32 64-bit GFLOPS; 10 W typical; 210 MHz; 96 PEs with 6 KB each; 8 redundant PEs
• SoC details: integrated DDR2 memory controller with ECC support; 128 KB of SRAM
• Design details: IBM 130 nm process; 128 million transistors (47% logic, 68% memory)
• Sampled Q3 2005

Copyright © 2007-8 ClearSpeed Technology plc. All rights reserved.

Page 40

What is the Programming Model?

Heterogeneous programming is currently very low-level. How are we going to program such systems in the future?

If OpenMP is to be used to program a board with devices such as accelerators or GPGPUs, some problems must be solved:
How do we identify code that should run on accelerators?
How do we share data between host cores and other devices?
What if a device is not available? How is this compiled? Debugged?

Page 41

Existing Efforts

IBM OpenMP for Cell/Cyclops
Intel EXOCHI
CAPS HMPP
Streaming OpenMP (ACOTES)
OpenMP on ClearSpeed

Page 42

Implicit Levels of Parallelism

[Figure: nodes connected by a network (MPI level); cores on each node sharing memory (OpenMP level); hardware accelerators HWA1 and HWA2 with local memories attached to the cores (HMPP level).]

Courtesy of CAPS, SA

Page 43

OpenMP: Where should code run? Minimize.

/* Proposed (non-standard) extension: an ontarget clause naming accelerator
   devices, with in/out data-movement clauses. */

/* Variant 1: the whole parallel region runs on the accelerators. */
#pragma omp parallel private(j,k) ontarget(acc1, acc2) in(b) out(a)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for
            for (k = 0; k < n; k++) { ... }
}

/* Variant 2: only the innermost loop runs on the accelerators. */
#pragma omp parallel private(j,k)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for ontarget(acc1, acc2) in(b) out(a)
            for (k = 0; k < n; k++) { ... }
}

Page 44

Multiple Devices

Use #D accelerators in parallel (#D denotes the number of devices):

#pragma omp parallel for private(j)
for (jj = 0; jj < #D; jj++) {
    for (j = jj*(n/#D); j < jj*(n/#D) + (n/#D); j++) {
#pragma hmpp tospeedup1 callsite
        simplefunc1(n, t1[j], t2, t3[j], alpha);
    }
#pragma hmpp tospeedup1 release
}

Courtesy of CAPS

Page 45

Summary

A good deal to learn about how to get performance in OpenMP; code modification may be needed.

Future versions of OpenMP are likely to provide more support for larger numbers of cores, and maybe more.

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11387

Page 46

Reference Material on OpenMP

OpenMP Homepage, www.openmp.org: the primary source of information about OpenMP and its development.

OpenMP User's Group (cOMPunity) Homepage, www.compunity.org

Books:
Using OpenMP. Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Cambridge, MA: The MIT Press, 2007. ISBN 978-0-262-53302-7.
Parallel Programming in OpenMP. Rohit Chandra et al. San Francisco: Morgan Kaufmann, 2000. ISBN 1-55860-671-8.

Search: www.google.com: OpenMP