NUMERICAL PARALLEL COMPUTING
Lecture 3: Programming multicore processors with OpenMP
http://people.inf.ethz.ch/iyves/pnc12/

Peter Arbenz (Computer Science Dept, ETH Zurich, E-mail: [email protected])
Andreas Adelmann (Paul Scherrer Institut, Villigen, E-mail: [email protected])

Parallel Numerical Computing. Lecture 3, Mar 9, 2012
Review of last week
- Moore's law
- Flynn's taxonomy of parallel computers: SISD, SIMD, MIMD
- Some terminology: work, speedup, efficiency, scalability
- Amdahl's and Gustafson's laws
- SIMD programming

Today: shared memory MIMD programming (Part 1)
MIMD: Multiple Instruction stream - Multiple Data stream
Each processor (core) can execute its own instruction stream on its own data, independently of the other processors. Each processor is a full-fledged CPU with both a control unit and an ALU. MIMD systems are asynchronous.
Shared memory machines (multiprocessors)
- autonomous processors connected to the memory system via an interconnection network
- single address space accessible by all processors
- (implicit) communication by means of shared data
- data dependencies / race conditions possible
Fig.2.3 in Pacheco (2011)
Distributed memory machines
- distributed memory machines (multicomputers)
- each processor has its own local/private memory
- processor/memory pairs communicate via an interconnection network
- all data are local to some processor
- (explicit) communication by message passing or some other means to access the memory of a remote processor
Fig.2.4 in Pacheco (2011)
Typical architecture of a multicore processor
Multiple cores share multiple caches that are arranged in a tree-like structure.

3-level example:
- L1 cache is in-core,
- 2 cores share an L2 cache,
- all cores have access to all of the L3 cache and to memory.

UMA: uniform memory access. Each processor has a direct connection to (a block of) memory.
NUMA: non-uniform memory access
- Processors can access each other's memory through special hardware built into the processors.
- A processor's own memory is faster to access than remote memory.
Interconnection networks
The most widely used interconnects on shared memory machines are:
- bus (slow / cheap / not scalable)
- crossbar switch
Fig.2.7(a) in Pacheco (2011)
Execution of parallel programs

- Multitasking (time sharing)
In operating systems that support multitasking, several threads or processes are executed on the same processor in time slices (time sharing). In this way, latency due to, e.g., I/O operations can be hidden. This form of executing multiple tasks at the same time is called concurrency: multiple tasks are in progress at the same time, but only one of them has access to the compute resources at any given instant. No simultaneous parallel execution is taking place.
- Multiprocessing
Using multiple physical processors admits the parallel execution of multiple tasks. The parallel hardware may cause overhead, though.
- Simultaneous Multithreading (SMT)
In simultaneous multithreading (SMT), or hyperthreading, multiple flows of control run concurrently on a processor (or a core). The processor switches among these so-called threads of control by means of dedicated hardware. If multiple logical processors are executed on one physical processor, the hardware resources can be employed better and task execution may be sped up. (With two logical processors, performance improvements of up to 30% have been observed.)
Multicore programming
- Multicore processors are programmed with multithreaded programs.
- Although many programs use multithreading, there are some notable differences between multicore programming and SMT.
- SMT is mainly used by the OS to hide I/O overhead. On multicore processors the work is actually distributed over the different cores.
- Cores have individual caches, so false sharing may occur: two cores may work on different data items that are stored in the same cache line. Although there is no data dependence, the cache line in the other core is marked invalid when the first core writes its data item. (Massive) performance degradation is possible.
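A common cure for false sharing is to pad per-thread data out to a full cache line, so that no two threads ever write into the same line. The following is a minimal sketch, not from the lecture: it assumes a 64-byte cache line and at most 64 threads, and the function name `padded_sum` is made up for illustration. The pragmas are ignored when compiled without -fopenmp, so the code also runs serially.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 64
#define CACHE_LINE  64   /* assumed cache line size in bytes */

/* Each accumulator occupies its own cache line. */
struct padded { double value; char pad[CACHE_LINE - sizeof(double)]; };

double padded_sum(const double *x, int n)
{
    struct padded partial[MAX_THREADS];
    double sum = 0.0;
    int t;
    for (t = 0; t < MAX_THREADS; t++) partial[t].value = 0.0;

    #pragma omp parallel
    {
        int id = 0, i;
#ifdef _OPENMP
        id = omp_get_thread_num();  /* assumes fewer than MAX_THREADS threads */
#endif
        #pragma omp for
        for (i = 0; i < n; i++)
            partial[id].value += x[i];  /* each thread writes only its own line */
    }
    for (t = 0; t < MAX_THREADS; t++) sum += partial[t].value;
    return sum;
}
```

Without the padding, the `partial` entries of different threads would sit in the same cache line and every update would invalidate the other cores' copies.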
- Thread priorities.
If a multithreaded program is executed on a single-processor machine, always the thread with the highest priority runs. On multicore processors, threads with different priorities can execute simultaneously. This may lead to different results! Programming multicore machines therefore requires techniques, methods, and designs from parallel programming.
Parallel programming models
The design of a parallel program is always based on an abstract view of the parallel system on which the software shall be executed. This abstract view is called the parallel programming model. It does not only describe the underlying hardware; it describes the whole system as it is presented to a software developer:
- system software (operating system)
- parallel programming language
- parallel library
- compiler
- run time system
Level of parallelism: on which level of the program do we have parallelism?
- Instruction level parallelism: the compiler can detect independent instructions and distribute them over the different functional units of a superscalar processor.
- Data or loop level parallelism: data structures, e.g., arrays, are partitioned into portions, and the same operation is applied to all elements of a portion. SIMD.
- Function level parallelism: functions in a program, e.g., in recursive calls, can be invoked in parallel, provided there are no dependences.
Explicit vs. implicit parallelism: how is parallelism declared in the program?
- Implicit parallelism: parallelizing compilers can detect regions/statements in the code that can be executed concurrently/in parallel. Parallelizing compilers are of limited success.
- Explicit parallelism with implicit partitioning: the programmer indicates to the compiler where there is potential to parallelize; the partitioning is done implicitly. OpenMP.
- Explicit partitioning: the programmer also indicates how to partition, but does not indicate where to execute the parts.
- Explicit communication and synchronization: MPI.
There are two flavors of explicit parallel programming:
- Thread programming: a thread is a sequence of statements that can be executed in parallel with other sequences of statements (threads). Each thread has its own resources (program counter, status information, etc.), but the threads use a common address space. Suited for multicore processors.
- Message passing programming: processes are used for the various pieces of the parallel program; they run on physical or logical processors. Each of the processes has its own (private) address space.
Thread programming
Programming multicore processors is tightly connected to parallel programming with a shared address space and to thread programming. There are a number of environments for thread programming:
- Pthreads (POSIX threads)
- Java threads
- OpenMP
- Intel TBB (Threading Building Blocks)
In this lecture we deal with OpenMP, which is the most commonly used of these in HPC.
Processes vs. threads

Processes
- A process is an instance of a program that is executing more or less autonomously on a physical processor.
- A process contains all the information needed to execute the program:
  - process ID
  - program code
  - actual value of the program counter
  - actual content of the registers
  - data on the run time stack
  - global data
  - data on the heap

  Each process has its own address space.
- This information changes dynamically during process execution.
- If the compute resources are assigned to another process, the status of the present (to be suspended) process has to be saved, so that the execution of the suspended process can be resumed at some later time.
- This (time consuming) procedure is called a context switch.
- It is the basis of multitasking, where processes are given time slices in a round robin fashion.
- In contrast to single-processor systems, on multiprocessor systems the processes can actually run in parallel.
Threads
- The thread model is an extension of the process model.
- Each process consists of multiple independent instruction streams (called threads) that are assigned compute resources by some scheduling procedure.
- The threads of a process share the address space of this process. Global variables and all dynamically allocated data objects are accessible by all threads of a process.
- Each thread has its own run time stack, registers, and program counter.
- Threads can communicate by reading/writing variables in the common address space. (This may require synchronization.)
- We consider system threads here, not user threads.
Synchronization
- Threads communicate through shared variables. Uncoordinated access to these variables can lead to undesired effects.
- If, e.g., two threads T1 and T2 increment a variable by 1 or 2, the result of the parallel program depends on the order in which the incremented variable is accessed. This is called a race condition.
- To prevent unexpected results, the access to shared variables must be synchronized.
- A barrier is set to synchronize all threads: all threads wait at the barrier until all of them have arrived there.
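The barrier idea can be sketched with a small two-phase computation (this example is not from the lecture; the function name `two_phases` and the array size are chosen for illustration): phase 1 produces an array, the barrier guarantees it is complete before phase 2 reads it. Without OpenMP support the pragmas are ignored and the code runs serially with the same result.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define N 256

void two_phases(double a[N], double b[N])
{
    #pragma omp parallel
    {
        int i;
        #pragma omp for nowait          /* phase 1: produce a[] */
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;

        #pragma omp barrier             /* all of a[] is written before
                                           any thread continues */

        #pragma omp for                 /* phase 2: consume a[] */
        for (i = 0; i < N; i++)
            b[i] = a[i] + a[N - 1 - i]; /* reads entries other threads wrote */
    }
}
```

Note that worksharing constructs carry an implicit barrier at their end; the `nowait` clause removes it here so that the explicit `#pragma omp barrier` is what actually synchronizes the phases.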
- Mutual exclusion ensures that only one of multiple threads can access a critical section of the code (e.g., to increment a variable). Mutual exclusion serializes the access to the critical section.
- Synchronization can be very time consuming. If it is not done right, much waiting time is spent (e.g., if the load is not balanced well among the processors).
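In OpenMP, mutual exclusion is expressed with the critical directive. The sketch below (illustrative, not from the lecture; `count_critical` is a made-up name) shows the shared-counter increment mentioned above: without the critical section this would be exactly the race condition described earlier.

```c
long count_critical(int n)
{
    long counter = 0;   /* shared among all threads */
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp critical
        counter = counter + 1;  /* only one thread at a time executes this */
    }
    return counter;
}
```

For a single scalar update like this, `#pragma omp atomic` is the cheaper alternative; `critical` is the general mechanism for protecting arbitrary blocks of code, at the cost of serializing them.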
Parallel Programming Concepts
- The new generation of multicore processors gets its performance from the multitude of cores on a chip.
- Unlike in the situation until recently, the programmer has to take action to be able to exploit the improved performance.
- Techniques of parallel programming have been known for years in scientific computing and elsewhere.
- What is new: parallel programming has reached the mainstream.
- The essential step in programming multicore processors is providing multiple streams of execution that can be executed simultaneously (concurrently) on the multiple cores.
- Let us introduce a few concepts and notions.
Design of parallel programs: 1. Partitioning
- Basic idea of parallel programming: generate multiple instruction streams that can be executed in parallel. If we have a sequential code available, we may parallelize it.
- In order to generate independent instruction streams, we partition the problem that we want to solve into tasks. Tasks are the smallest units of parallelism.
- The size of a task is called its granularity (fine/coarse grain).
Design of parallel programs: 2. Communication
- Tasks may depend on each other in one way or another. They may access the same data concurrently (data dependence), or they may need to wait for another task to finish (flow dependence) because its results are needed.
- Both kinds of dependences may be translated into communication in a message passing environment.
Design of parallel programs: 3. Scheduling
- The tasks are mapped (aggregated) onto threads or processes that are executed on the physical compute resources, which can be the processors of a parallel machine or the cores of a multicore processor.
- The assignment of tasks to processes or threads is called scheduling. In static scheduling the assignment takes place before the actual computation; in dynamic scheduling the assignment takes place during program execution.
Design of parallel programs: 4. Mapping
- The processes or threads are mapped onto the processors/cores. This mapping may be done explicitly in the program or (mostly) by the operating system.
- The mapping should be done in a way that minimizes communication and balances the work load.
OpenMP
OpenMP is an application programming interface that provides a parallel programming model for shared memory and distributed shared memory multiprocessors. It extends programming languages (C/C++ and Fortran) by

- a set of compiler directives to express shared memory parallelism (compiler directives are called pragmas in C), and
- runtime library routines and environment variables that are used to examine and modify execution parameters.

There is a standard include file omp.h for C/C++ OpenMP programs. OpenMP is becoming the de facto standard for parallelizing applications for shared memory multiprocessors. OpenMP is independent of the underlying hardware or operating system.
OpenMP References
There are a number of good books on OpenMP:

- B. Chapman, G. Jost, R. van der Pas: Using OpenMP. MIT Press, 2008. Easy to read. Examples in both C and Fortran (but not C++).
- R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald: Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, 2001. Easy to read. Much of the material is in Fortran.
- S. Hoffmann and R. Lienhart: OpenMP. Springer, Berlin, 2008. Easy to read. C and C++. In German. Available online via NEBIS.

There is an OpenMP organization with most of the major computer manufacturers and the U.S. DOE ASCI program on its board; see http://www.openmp.org.
Execution model: fork-join parallelism
- OpenMP is based on the fork-join execution model.
- At the start of an OpenMP program, a single thread (the master thread) is executing.
- If the master thread arrives at a compiler directive

  #pragma omp parallel

  that indicates the beginning of a parallel section of the code, then it forks the execution into a certain number of threads.
- At the end of the parallel section the execution joins again into the single master thread.
- At the end of a parallel section there is an implicit barrier: the program cannot proceed before all threads have reached the barrier.
- The master thread spawns a team of threads as needed.
- From a programmer's perspective, parallelism is added incrementally: the sequential program evolves into a parallel program (in contrast to MPI distributed memory programming).
- However, parallelism is limited with respect to the number of processors.
- This is (at least in theory) dynamic thread generation: the number of threads may vary from one parallel region to another. The number of threads to generate can be determined by functions from the runtime library or by environment variables:

  setenv OMP_NUM_THREADS 4

  (In static thread generation, the number of threads is fixed a priori from the start.)
- In practice, a number of threads may be initiated at the start of the program, but the (slave) threads become active only at the beginning of a parallel region.
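Besides the environment variable, the thread count can be set from the program itself via the runtime library. A hedged sketch (the function name `report_threads` is invented; the `#ifdef _OPENMP` guards keep it compilable without OpenMP support, in which case it simply reports one thread):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

int report_threads(void)
{
    int nthreads = 1;   /* serial fallback */
#ifdef _OPENMP
    omp_set_num_threads(4);   /* overrides OMP_NUM_THREADS for later regions */
#endif
    #pragma omp parallel
    {
        #pragma omp single
        {
#ifdef _OPENMP
            nthreads = omp_get_num_threads();  /* team size of this region */
#endif
        }
    }
    return nthreads;
}
```

`omp_set_num_threads` takes precedence over the OMP_NUM_THREADS environment variable for subsequent parallel regions; a `num_threads(...)` clause on the directive itself would override both.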
OpenMP hello world
#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
  {
    printf("Hello world\n");
  }
  return 0;
}
gcc compiler
OpenMP programs can be compiled with the GNU compiler:

gcc -o hello hello.c -fopenmp -lgomp

When -fopenmp is used, the compiler generates parallel code based on the OpenMP directives encountered. A macro _OPENMP is defined that can be checked with the preprocessor directive #ifdef. -lgomp links the libraries of the GNU OpenMP project.
A more complicated Hello world demo
The following little program calls functions from the OpenMP runtime library to get the number of threads and the ID of the current thread.
/* Modified example 2.1 from Hoffmann-Lienhart */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
int main(int argc, char* argv[])
{
#ifdef _OPENMP
printf("Number of processors: %d\n",
omp_get_num_procs ());
#pragma omp parallel
{
printf("Thread %d of %d says \"Hello World!\"\n",
omp_get_thread_num (),
omp_get_num_threads ());
}
#else
printf("OpenMP not supported.\n");
#endif
printf("Job completed.\n");
return 0;
}
It is important to note that the order in which the threads actually execute is not always the same: it depends on the run! This is not important here, of course. But if it were, we would have a race condition.
OpenMP parallel control structures
In the OpenMP fork/join model, the parallel control structures are those that fork (i.e., start) new threads. There are just two of these:

1. The parallel directive is used to create multiple threads of execution that execute concurrently. The parallel directive applies to a structured block, i.e., a block of code with one entry point at the top and one exit point at the bottom of the block. (Exception: exit().)
2. Further constructs are needed to divide work among an existing set of parallel threads. The for directive, e.g., is used to express loop-level parallelism.
An example for loop-level parallelism (back to axpy)

1. The sequential program:

for (i=0; i< N; i++){
  y[i] = alpha*x[i] + y[i];
}

2. An OpenMP parallel region (as written, every thread redundantly executes the whole loop):

#pragma omp parallel
{
  for (i=0; i< N; i++){
    y[i] = alpha*x[i] + y[i];
  }
}
3. An OpenMP parallel region with manual work distribution (assumes Nthrds divides N):

#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id*N/Nthrds;
  iend = (id+1)*N/Nthrds;
  for (i=istart; i< iend; i++){
    y[i] = alpha*x[i] + y[i];
  }
}
4. An OpenMP parallel region combined with a for directive:

#pragma omp parallel
#pragma omp for
for (i=0; i< N; i++){
  y[i] = alpha*x[i] + y[i];
}

5. An OpenMP parallel region combined with a for directive and a schedule clause:

#pragma omp parallel for schedule(static)
for (i=0; i< N; i++){
  y[i] = alpha*x[i] + y[i];
}
For-construct with the schedule clause
The schedule clause affects how loop iterations are mapped onto threads:

- schedule(static [,chunk]): deal out blocks of iterations of size "chunk" to each thread.
- schedule(dynamic [,chunk]): each thread grabs "chunk" iterations off a queue until all iterations have been handled.
- schedule(guided [,chunk]): threads dynamically grab blocks of iterations; the size of the block starts large and shrinks down to size "chunk" as the calculation proceeds (guided self-scheduling).
- schedule(runtime): schedule and chunk size are defined by the OMP_SCHEDULE environment variable.
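Dynamic scheduling pays off when iterations have uneven cost. A small illustrative sketch (not from the lecture; `triangular` is an invented name): iteration i does O(i) work, so static scheduling would give the thread with the last block far more work, while dynamic with a small chunk rebalances on the fly. The numerical result is independent of the schedule.

```c
double triangular(int n)
{
    double total = 0.0;
    int i;
    /* uneven workload: iteration i costs O(i) inner-loop work */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (i = 0; i < n; i++) {
        double row = 0.0;
        int j;
        for (j = 0; j <= i; j++)
            row += 1.0;
        total += row;     /* reduction combines per-thread partial sums */
    }
    return total;
}
```

With `schedule(static)` the same loop is still correct, just potentially load-imbalanced; only the timing changes, never the result.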
Timings with varying chunk sizes in the for
#pragma omp parallel for schedule(static,chunk_size)
for (i=0; i< N; i++){
y[i] = alpha*x[i] + y[i];
}
chunk size   p = 1      2      4      6      8     12     16
N/p           1674    854    449    317    239    176     59
100           1694   1089    601    405    317    239    166
4             1934   2139   1606   1294    850    742    483
1             2593   2993   3159   2553   2334   2329   2129

Table 1: Some execution times in µsec for the saxpy operation with varying chunk size and processor number p
Here, chunks of size chunk_size are assigned cyclically to the processors. In Table 1, timings on the HP Superdome are summarized for N = 1000000.
How to measure time (walltime.c)
double t0, tw;
...
t0 = walltime(&tw);
for (ict=0; ict<CT; ict++){
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i< N; i++){
sum = sum + x[i]*y[i];
}
}
tw = walltime(&t0)/(double)CT;
printf("elapsed time: %10.4f musec\n",tw);
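The walltime() helper used above is not shown on the slide. A portable alternative sketch (my own, under the assumption that OpenMP may or may not be enabled) wraps omp_get_wtime, which returns elapsed wall-clock seconds, with a serial fallback:

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns wall-clock time in seconds; only differences are meaningful. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();          /* OpenMP wall-clock timer */
#else
    return (double)clock() / CLOCKS_PER_SEC;  /* crude serial fallback */
#endif
}
```

Usage mirrors the slide: t0 = wall_seconds(); ... ; elapsed = wall_seconds() - t0;. Note that the fallback measures CPU time rather than wall time, which only coincides for single-threaded runs.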