Barbara Chapman, University of Houston

This work is supported by the National Science Foundation under grants CCF-0444468 and CCF-0702775, and by the Department of Energy under grant DE-FC02-06ER25759.
Talk Outline
• What is OpenMP?
• Performance aspects of OpenMP programming
• How might OpenMP better support multicore?
• Further down the road: OpenMP Futures
Open specifications for Multi Processing:
• Compiler directives, library routines, and environment variables for specifying shared-memory parallelism
• Widely available
An API for writing multithreaded applications using Fortran, C, and C++.
OpenMP Programming Model
Fork-join parallelism: the master thread spawns a team of threads as needed.
Parallelism is added incrementally until the desired performance is achieved; i.e., the sequential program evolves into a parallel program.
Parallel Regions
[Figure: the master thread forks teams of threads for parallel regions; parallel regions may be nested]
The OpenMP Shared Memory API
Directive-based multithreaded programming:
• The user makes strategic decisions; the compiler figures out the details
• Threads communicate by sharing variables
• Synchronization is used to order accesses and prevent data conflicts
• Structured programming reduces the likelihood of bugs
Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp).
#pragma omp parallel
#pragma omp for schedule(dynamic)
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
} /* implicit barrier here */
History
OpenMP Architecture Review Board:
• AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, The Portland Group, Inc., SGI, Sun, Microsoft
• ASC/LLNL, cOMPunity, EPCC, NASA, RWTH Aachen University
Spec release dates:
• Fortran v1.0 (Oct '97), C/C++ v1.0 (Oct '98)
• Fortran v2.0 (Nov '00), C/C++ v2.0 (Mar '02)
• C/C++ and Fortran v2.5 (May '05)
• C/C++ and Fortran v3.0 (May '08)
OpenMP 3.0: Tasks

struct node {
    struct node *left;
    struct node *right;
};
extern void process(struct node *);

void postorder_traverse(struct node *p) {
    if (p->left) {
        #pragma omp task   // p is firstprivate by default
        postorder_traverse(p->left);
    }
    if (p->right) {
        #pragma omp task   // p is firstprivate by default
        postorder_traverse(p->right);
    }
    #pragma omp taskwait
    process(p);
}

The task directive packages the work and adds an explicit task to a task pool; taskwait waits for all child tasks to complete.
OpenMP 3.0: User-level Library and Environment
• Busy waiting may consume valuable resources and interfere with the work of other threads; OpenMP 3.0's OMP_WAIT_POLICY gives the user more control over the way idle threads are handled
• Enhanced support for nested parallelism: library routines to determine the depth of nesting and the IDs of parent/grandparent threads
• Different regions may have different defaults
  ○ E.g. omp_set_num_threads() inside a parallel region
Some Remarks
• Flexible parallel programming: not all the code in an OpenMP program must be parallelized
• Low-level programming using thread IDs
• Can be combined with MPI
• Dynamic adjustment of the schedule (for load balancing) and of the number of threads
• With some care, a single source for sequential and parallel code is possible
Cart3D OpenMP Scaling
• The OpenMP version used the same domain decomposition strategy as MPI for data locality, avoiding false sharing and fine-grained remote data access
• The OpenMP version slightly outperformed the MPI version on the SGI Altix 3700BX2, with both close to linear scaling
• 4.7 M cell mesh Space Shuttle Launch Vehicle example (M = 2.6, flow angles 2.09° and 0.8°)
Hybrid MPI+OpenMP Programming Model
• Seems to match the structure of large-scale platforms
• Greater interest with the widespread introduction of multicore chips
• Individual cores may be single-threaded or multithreaded
• Threads on cores share resources (L2 cache, memory bandwidth), not necessarily in a uniform way
• Systems may be heterogeneous
• Remarkably poorly understood
OVERFLOW2 - DLRF6 Case
• MPI+OpenMP version: numerically explicit scheme + implicit scheme
• The hybrid version outperformed the pure MPI version on the IBM p575+, but the same hybrid code did not give the same benefits on the SGI Altix
• 36 M grid points, 23 zones, DLRF6 benchmark configuration
Courtesy of Dennis Jespersen, NASA Ames
Hybrid MPI+OpenMP
• Fewer processes, so usually less data exchanged; the communication/computation ratio changes
• Multilevel parallelism: may be able to exploit additional finer-grain parallelism using OpenMP
• Alleviates MPI load balancing problems when computations are expressed in OpenMP
• May require careful tuning of the OpenMP code
OpenMP Performance Problems
• Overheads of OpenMP compilation: these differ between constructs and implementations; translation also impacts traditional optimization
• Overheads of runtime library routines: some are called frequently
• Algorithmic overheads, if additional work is needed to enable parallelization
• Too much synchronization, possibly caused by poor load balance
• Poor cache utilization and false sharing
  ○ Poor memory usage has a variety of causes
OpenMP Parallel Computing Solution Stack
• User layer: end user, application
• Programming layer (OpenMP API): directives, compiler, OpenMP library, environment variables
• System layer: runtime library; OS/system support for shared memory
OpenMP Implementation: OpenUH
• Frontends: parse OpenMP pragmas
• OMP_PRELOWER: preprocessing, semantic checking
• LOWER_MP: generation of microtasks for parallel regions, insertion of runtime calls, variable handling, ...
• Runtime library: support for thread manipulation, implements user-level routines, monitoring environment
OpenMP Code Translation

Original code:

int main(void)
{
    int a, b, c;
#pragma omp parallel private(c)
    do_sth(a, b, c);
    return 0;
}

Translated code:

_INT32 main()
{
    int a, b, c;
    /* microtask */
    void __ompregion_main1()
    {
        _INT32 __mplocal_c;
        /* shared variables are kept intact;
           accesses to the private variable
           are substituted */
        do_sth(a, b, __mplocal_c);
    }
    ... /* OpenMP runtime calls */
    __ompc_fork(&__ompregion_main1);
}
Overheads of OpenMP Directives
[Chart: EPCC microbenchmark overheads, in cycles, on the SGI Altix 3600 for the PARALLEL, FOR, PARALLEL FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, ATOMIC and REDUCTION constructs, for 1 to 256 threads]
Synchronization Matters
It is really important to minimize the cost of synchronization and wait states: global barriers, critical regions, and locks. For example:
• Reduce barrier usage with the nowait clause
• Choose carefully between master and single
• Avoid the ordered construct
• Avoid large critical regions
• Use an appropriate scheduler to avoid long waits
• Consider fine-grain locks
  ○ It's hard to get locks right
[Chart: performance after the offending critical region was rewritten]
Courtesy of R. Morgan, NASA Ames
OpenMP: Best Practices
Carefully choose the most appropriate loop schedule and chunk size.
[Charts: speedup of the Smith-Waterman sequence alignment algorithm vs. number of threads (2-128), for three problem sizes (100, 600, 1000) against ideal scaling, comparing plain #pragma omp for with #pragma omp for schedule(dynamic, 1); the dynamic schedule reaches 128 threads with 80% efficiency]
OpenMP: Best Practices
Overlap I/O and computation: OpenMP does not have parallel I/O. Here, one thread fetches data while the others compute; threads doing the reading and writing join the computation when they are done.

#pragma omp parallel
{
#pragma omp single
    { ReadFromFile(0, ...); }
    for (i = 0; i < N; i++) {
#pragma omp single nowait
        { ReadFromFile(i+1, ....); }
#pragma omp for schedule(dynamic)
        for (j = 0; j < ProcessingNum; j++)
            ProcessChunkOfData();
#pragma omp single nowait
        { WriteResultsToFile(i); }
    }
}
OpenMP: Best Practices
Minimize the number of times parallel regions are entered/exited.

Before:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
#pragma omp parallel for private(k)
        for (k = 0; k < n; k++)
        { ....... }

After:

#pragma omp parallel private(i,j,k)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for
            for (k = 0; k < n; k++)
            { ....... }
}
OpenMP: Best Practices
Tune for cache and avoid false sharing, one source of memory inefficiencies. The problem arises when threads access the same cache line:

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i] += i;

Padding each element out to its own cache line avoids it:

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
    a[i][0] += i;
MPI and/or OpenMP
[Chart: speedup vs. number of processors on the IBM p690]
Courtesy of Behrens and O. Haan, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen, and L. Kornblueh, MPI für Meteorologie, Hamburg
OpenMP: Best Practices
Privatize variables where possible; private variables are stored in a thread's local stack.

Shared array indexed by thread:

double a[MaxThreads][N][N];
#pragma omp parallel for
for (i = 0; i < MaxThreads; i++) {
    for (int j ...)
        for (int k ...)
            a[i][j][k] = ...
}

Privatized:

double a[N][N];
#pragma omp parallel private(a)
{
    for (int j ...)
        for (int k ...)
            a[j][k] = ...
}
Example: Hybrid CFD Code (MPI x OpenMP)
A single procedure is responsible for 20% of the total time in the OpenMP version (1x8) and is 9 times slower than in the MPI version (8x1). Why?
• Privatizing several arrays improved the performance of the whole program by 30% and resulted in a speedup of 10 for the procedure
• Now this procedure takes only 5% of the total time
• Processor stalls are reduced significantly
Stall Cycle Breakdown for Non-Privatized (NP) and Privatized (P) Versions of diff_coeff
[Chart: cycles (0 to 5e10) spent in D-cache stalls, branch misprediction, instruction miss stalls, FLP units, and front-end flushes, for NP, P, and NP-P]
OpenMP: Best Practices
Data placement on NUMA architectures: use the first-touch policy or system commands to place data appropriately.
[Figure: quartet of four dual-core Opterons]
Avoid thread migration, which affects data locality. Bind threads to cores:
• Linux: numactl --cpubind=0 foobar, or taskset -c 0,1 foobar
• SGI Altix: dplace -x2 foobar
OpenMP: Best Practices
Corollary: avoid nested parallel regions.
GenIDLEST Hybrid 1x8 vs. 8x1
• Pure MPI is 16% faster than pure OpenMP, but OpenMP uses 30% less memory
• The OpenMP code will improve further if we merge more parallel regions and reduce synchronization
• Reduced communication and a smaller memory footprint may be crucial benefits in the future
• Less communication with OpenMP: direct memory copies replace send/recv buffers
Many Cores Coming, Ready or Not
An Intel prediction of what technology might support:
• 2010: 16-64 cores, 200 GF-1 TF
• 2013: 64-256 cores, 500 GF-4 TF
• 2016: 256-1024 cores, 2 TF-20 TF
[Figure: Niagara 2]
A Challenge: Dealing with Locality
• OpenMP does not permit explicit control over data locality: a thread fetches the data it needs into its local cache
• What do we do now?
  ○ Implicit means of data layout ("first touch") are popular
  ○ Privatize and optimize cache usage
• There are a variety of suggestions for extensions; the simplest is a "next touch" directive
Ideas for Locality Support
• Control thread placement as well as data locality
• Data placement techniques: more system support, a next-touch directive
• Make nested parallelism really work: describe the structure of nesting and the number of threads in advance as a tree; map the entire tree to system resources; this permits thread binding
• Thread binding techniques: via system calls or the command line; programmer hints to "spread out" or "keep close together"
Subteams of Threads?

for (j = 0; j < ProcessingNum; j++) {
    #pragma omp for on threads(m:n:k)
    for (k = 0; k < M; k++) {   // on threads in subteam
        ...
        Process_data();
    }   // barrier involves subteam only
}

• Increases the expressivity of single-level parallelism
• Low overhead because of static partitioning
• Facilitates thread-core mapping for better data locality and less resource contention
Hybrid MPI/OpenMP
Flexible overlapping of computation and communication typically requires explicit OpenMP code based on thread IDs, and needs MPI_THREAD_MULTIPLE.

!$OMP PARALLEL
    if (thread_id .eq. id1) then
        call mpi_routine1()
    else if (thread_id .eq. id2) then
        call mpi_routine2()
    else
        call do_compute()
    endif
!$OMP END PARALLEL
Hybrid MPI/OpenMP: Subteams
• Subteams facilitate the overlapping of computation and communication
• Directives are applied to some of the threads in a team
• The barrier applies to the subteam only

#pragma omp parallel
{
#pragma omp for onthreads( team1 )
    for (...) {
        // this team of threads communicates
        MPI_Send/Recv ...
    } /* barrier here only involves team1 */
#pragma omp for onthreads( team2 )
    for (...) {
        ... /* a team of threads computes */
    } /* barrier here is only for team2 */
    ...
#pragma omp for
    for (...) {
        /* work based on halo information */
    }
} /* end omp parallel */

Courtesy of Rolf Rabenseifner et al.
BT-MZ Performance with Subteams
• Platform: Columbia@NASA
• Subteam: a subset of an existing team
ClearSpeed Accelerator: CSX600
• Processor core: 40.32 64-bit GFLOPS, 10 W typical, 210 MHz, 96 PEs (6 Kbytes each), 8 redundant PEs
• SoC details: integrated DDR2 memory controller with ECC support, 128 Kbytes of SRAM
• Design details: IBM 130 nm process, 128 million transistors (47% logic, 68% memory)
• Sampled Q3 2005
Copyright © 2007-8 ClearSpeed Technology plc. All rights reserved.
What is the Programming Model?
Heterogeneous programming is currently very low-level. How are we going to program such systems in the future?
If OpenMP is to be used to program a board with devices such as accelerators or GPGPUs, some problems must be solved:
• How do we identify code that should run on accelerators?
• How do we share data between host cores and other devices?
• What if a device is not available?
• How is this compiled? Debugged?
Existing Efforts
• IBM OpenMP for Cell/Cyclops
• Intel EXOCHI
• CAPS HMPP
• Streaming OpenMP (ACOTES)
• OpenMP on ClearSpeed
Implicit Levels of Parallelism
[Figure: MPI across a network of nodes; OpenMP over cores sharing memory within a node; HMPP over hardware accelerators (HWA1, HWA2), each with its own local memory]
Courtesy of CAPS, SA
OpenMP: Where should code run?

#pragma omp parallel private(j,k) ontarget(acc1, acc2) in(b) out(a)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for
            for (k = 0; k < n; k++)
            { ....... }
}

#pragma omp parallel private(j,k)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
#pragma omp for ontarget(acc1, acc2) in(b) out(a)
            for (k = 0; k < n; k++)
            { ....... }
}
Multiple Devices
Use #D accelerators in parallel:

#pragma omp parallel for private(j)
for (jj = 0; jj < #D; jj++) {
    for (j = jj*(n/#D); j < jj*(n/#D) + (n/#D); j++) {
#pragma hmpp tospeedup1 callsite
        simplefunc1(n, t1[j], t2, t3[j], alpha);
    }
#pragma hmpp tospeedup1 release
}

Courtesy of CAPS
Summary
• A good deal to learn about how to get performance in OpenMP
• Code modification may be needed
• Future versions of OpenMP are likely to provide more support for larger numbers of cores, and maybe more
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11387
Reference Material on OpenMP
• OpenMP homepage, www.openmp.org: the primary source of information about OpenMP and its development
• OpenMP User's Group (cOMPunity) homepage, www.compunity.org
Books:
• Using OpenMP, Barbara Chapman, Gabriele Jost and Ruud van der Pas, Cambridge, MA: The MIT Press, 2007, ISBN: 978-0-262-53302-7
• Parallel Programming in OpenMP, Rohit Chandra et al., San Francisco, CA: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1558606718
Search: www.google.com: OpenMP