NUMERICAL PARALLEL COMPUTING
Lecture 3: Programming multicore processors with OpenMP
http://people.inf.ethz.ch/iyves/pnc12/

Peter Arbenz (Computer Science Dept, ETH Zurich, E-mail: [email protected])
Andreas Adelmann (Paul Scherrer Institut, Villigen, E-mail: [email protected])

Parallel Numerical Computing. Lecture 3, Mar 9, 2012
Review of last week
- Moore's law
- Flynn's taxonomy of parallel computers: SISD, SIMD, MIMD
- Some terminology: work, speedup, efficiency, scalability
- Amdahl's and Gustafson's laws
- SIMD programming

Today: shared memory MIMD programming (Part 1)
MIMD: Multiple Instruction stream - Multiple Data stream
Each processor (core) can execute its own instruction stream on its own data, independently of the other processors. Each processor is a full-fledged CPU with both a control unit and an ALU. MIMD systems are asynchronous.
Shared memory machines (multiprocessors)
- autonomous processors connected to the memory system via an interconnection network
- single address space accessible by all processors
- (implicit) communication by means of shared data
- data dependencies / race conditions possible
Fig.2.3 in Pacheco (2011)
Distributed memory machines
- distributed memory machines (multicomputers)
- each processor has its own local/private memory
- processor/memory pairs communicate via an interconnection network
- all data are local to some processor
- (explicit) communication by message passing or some other means to access the memory of a remote processor
Fig.2.4 in Pacheco (2011)
Typical architecture of a multicore processor
Multiple cores share multiple caches that are arranged in a tree-like structure.

3-level example:
- L1 cache is in-core,
- 2 cores share an L2 cache,
- all cores have access to all of the L3 cache and to memory.

UMA: uniform memory access. Each processor has a direct connection to (a block of) memory.
NUMA: non-uniform memory access
- Processors can access each other's memory through special hardware built into the processors.
- A processor's own memory is faster to access than remote memory.
Interconnection networks
The most widely used interconnects on shared memory machines are:
- bus (slow / cheap / not scalable)
- crossbar switch
Fig.2.7(a) in Pacheco (2011)
Execution of parallel programs

- Multitasking (time sharing)
In operating systems that support multitasking, several threads or processes are executed on the same processor in time slices (time sharing). In this way, latency due to, e.g., I/O operations can be hidden. This form of executing multiple tasks at the same time is called concurrency: multiple tasks are in progress at the same time, but only one of them has access to the compute resources at any given instant. No simultaneous parallel execution is taking place.
- Multiprocessing
Using multiple physical processors admits the parallel execution of multiple tasks. The parallel hardware may cause overhead, though.
- Simultaneous Multithreading (SMT)
In simultaneous multithreading (SMT), or hyperthreading, multiple flows of control run concurrently on a processor (or a core). The processor switches among these so-called threads of control by means of dedicated hardware. If multiple logical processors are executed on one physical processor, the hardware resources can be employed better and task execution may be sped up. (With two logical processors, performance improvements of up to 30% have been observed.)
Multicore programming
- Multicore processors are programmed with multithreaded programs.
- Although many programs use multithreading, there are some notable differences between multicore programming and SMT.
- SMT is mainly used by the OS to hide I/O overhead. On multicore processors the work is actually distributed over the different cores.
- Cores have individual caches, so false sharing may occur: two cores may work on different data items that are stored in the same cache line. Although there is no data dependence, the cache line in the other core is marked invalid when the first core writes its data item. (Massive) performance degradation is possible.
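A common cure for false sharing is to pad per-thread data out to a full cache line, so that no two threads ever write into the same line. The following is a minimal sketch, not from the lecture: it assumes a 64-byte cache line and at most 64 threads, and the function name `padded_sum` is made up for illustration. The pragmas are ignored when compiled without -fopenmp, so the code also runs serially.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 64
#define CACHE_LINE  64   /* assumed cache line size in bytes */

/* Each accumulator occupies its own cache line. */
struct padded { double value; char pad[CACHE_LINE - sizeof(double)]; };

double padded_sum(const double *x, int n)
{
    struct padded partial[MAX_THREADS];
    double sum = 0.0;
    int t;
    for (t = 0; t < MAX_THREADS; t++) partial[t].value = 0.0;

    #pragma omp parallel
    {
        int id = 0, i;
#ifdef _OPENMP
        id = omp_get_thread_num();  /* assumes fewer than MAX_THREADS threads */
#endif
        #pragma omp for
        for (i = 0; i < n; i++)
            partial[id].value += x[i];  /* each thread writes only its own line */
    }
    for (t = 0; t < MAX_THREADS; t++) sum += partial[t].value;
    return sum;
}
```

Without the padding, the `partial` entries of different threads would sit in the same cache line and every update would invalidate the other cores' copies.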
- Thread priorities.
If a multithreaded program is executed on a single-processor machine, always the thread with the highest priority runs. On multicore processors, threads with different priorities can execute simultaneously. This may lead to different results! Programming multicore machines therefore requires techniques, methods, and designs from parallel programming.
Parallel programming models
The design of a parallel program is always based on an abstract view of the parallel system on which the software shall be executed. This abstract view is called the parallel programming model. It does not only describe the underlying hardware; it describes the whole system as it is presented to a software developer:
- system software (operating system)
- parallel programming language
- parallel library
- compiler
- run time system
Level of parallelism: on which level of the program do we have parallelism?
- Instruction level parallelism: the compiler can detect independent instructions and distribute them over the different functional units of a superscalar processor.
- Data or loop level parallelism: data structures, e.g., arrays, are partitioned into portions, and the same operation is applied to all elements of a portion. SIMD.
- Function level parallelism: functions in a program, e.g., in recursive calls, can be invoked in parallel, provided there are no dependences.
Explicit vs. implicit parallelism: how is parallelism declared in the program?
- Implicit parallelism: parallelizing compilers can detect regions/statements in the code that can be executed concurrently/in parallel. Parallelizing compilers are of limited success.
- Explicit parallelism with implicit partitioning: the programmer indicates to the compiler where there is potential to parallelize; the partitioning is done implicitly. OpenMP.
- Explicit partitioning: the programmer also indicates how to partition, but does not indicate where to execute the parts.
- Explicit communication and synchronization: MPI.
There are two flavors of explicit parallel programming:
- Thread programming: a thread is a sequence of statements that can be executed in parallel with other sequences of statements (threads). Each thread has its own resources (program counter, status information, etc.), but the threads use a common address space. Suited for multicore processors.
- Message passing programming: processes are used for the various pieces of the parallel program; they run on physical or logical processors. Each of the processes has its own (private) address space.
Thread programming
Programming multicore processors is tightly connected to parallel programming with a shared address space and to thread programming. There are a number of environments for thread programming:
- Pthreads (POSIX threads)
- Java threads
- OpenMP
- Intel TBB (Threading Building Blocks)
In this lecture we deal with OpenMP, which is the most commonly used of these in HPC.
Processes vs. threads

Processes
- A process is an instance of a program that is executing more or less autonomously on a physical processor.
- A process contains all the information needed to execute the program:
  - process ID
  - program code
  - actual value of the program counter
  - actual content of the registers
  - data on the run time stack
  - global data
  - data on the heap

  Each process has its own address space.
- This information changes dynamically during process execution.
- If the compute resources are assigned to another process, the status of the present (to be suspended) process has to be saved, so that the execution of the suspended process can be resumed at some later time.
- This (time consuming) procedure is called a context switch.
- It is the basis of multitasking, where processes are given time slices in a round robin fashion.
- In contrast to single-processor systems, on multiprocessor systems the processes can actually run in parallel.
Threads
- The thread model is an extension of the process model.
- Each process consists of multiple independent instruction streams (called threads) that are assigned compute resources by some scheduling procedure.
- The threads of a process share the address space of this process. Global variables and all dynamically allocated data objects are accessible by all threads of a process.
- Each thread has its own run time stack, registers, and program counter.
- Threads can communicate by reading/writing variables in the common address space. (This may require synchronization.)
- We consider system threads here, not user threads.
Synchronization
- Threads communicate through shared variables. Uncoordinated access to these variables can lead to undesired effects.
- If, e.g., two threads T1 and T2 increment a variable by 1 or 2, the result of the parallel program depends on the order in which the incremented variable is accessed. This is called a race condition.
- To prevent unexpected results, the access to shared variables must be synchronized.
- A barrier is set to synchronize all threads: all threads wait at the barrier until all of them have arrived there.
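The barrier idea can be sketched with a small two-phase computation (this example is not from the lecture; the function name `two_phases` and the array size are chosen for illustration): phase 1 produces an array, the barrier guarantees it is complete before phase 2 reads it. Without OpenMP support the pragmas are ignored and the code runs serially with the same result.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define N 256

void two_phases(double a[N], double b[N])
{
    #pragma omp parallel
    {
        int i;
        #pragma omp for nowait          /* phase 1: produce a[] */
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;

        #pragma omp barrier             /* all of a[] is written before
                                           any thread continues */

        #pragma omp for                 /* phase 2: consume a[] */
        for (i = 0; i < N; i++)
            b[i] = a[i] + a[N - 1 - i]; /* reads entries other threads wrote */
    }
}
```

Note that worksharing constructs carry an implicit barrier at their end; the `nowait` clause removes it here so that the explicit `#pragma omp barrier` is what actually synchronizes the phases.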
- Mutual exclusion ensures that only one of multiple threads can access a critical section of the code (e.g., to increment a variable). Mutual exclusion serializes the access to the critical section.
- Synchronization can be very time consuming. If it is not done right, much waiting time is spent (e.g., if the load is not balanced well among the processors).
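In OpenMP, mutual exclusion is expressed with the critical directive. The sketch below (illustrative, not from the lecture; `count_critical` is a made-up name) shows the shared-counter increment mentioned above: without the critical section this would be exactly the race condition described earlier.

```c
long count_critical(int n)
{
    long counter = 0;   /* shared among all threads */
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp critical
        counter = counter + 1;  /* only one thread at a time executes this */
    }
    return counter;
}
```

For a single scalar update like this, `#pragma omp atomic` is the cheaper alternative; `critical` is the general mechanism for protecting arbitrary blocks of code, at the cost of serializing them.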
Parallel Programming Concepts
- The new generation of multicore processors gets its performance from the multitude of cores on a chip.
- Unlike in the situation until recently, the programmer has to take action to be able to exploit the improved performance.
- Techniques of parallel programming have been known for years in scientific computing and elsewhere.
- What is new: parallel programming has reached the mainstream.
- The essential step in programming multicore processors is providing multiple streams of execution that can be executed simultaneously (concurrently) on the multiple cores.
- Let us introduce a few concepts and notions.
Design of parallel programs: 1. Partitioning
- Basic idea of parallel programming: generate multiple instruction streams that can be executed in parallel. If we have a sequential code available, we may parallelize it.
- In order to generate independent instruction streams, we partition the problem that we want to solve into tasks. Tasks are the smallest units of parallelism.
- The size of a task is called its granularity (fine/coarse grain).
Design of parallel programs: 2. Communication
- Tasks may depend on each other in one way or another. They may access the same data concurrently (data dependence), or they may need to wait for another task to finish (flow dependence) because its results are needed.
- Both kinds of dependences may be translated into communication in a message passing environment.
Design of parallel programs: 3. Scheduling
- The tasks are mapped (aggregated) onto threads or processes that are executed on the physical compute resources, which can be the processors of a parallel machine or the cores of a multicore processor.
- The assignment of tasks to processes or threads is called scheduling. In static scheduling the assignment takes place before the actual computation; in dynamic scheduling the assignment takes place during program execution.
Design of parallel programs: 4. Mapping
- The processes or threads are mapped onto the processors/cores. This mapping may be done explicitly in the program or (mostly) by the operating system.
- The mapping should be done in a way that minimizes communication and balances the work load.
OpenMP
OpenMP is an application programming interface that provides a parallel programming model for shared memory and distributed shared memory multiprocessors. It extends programming languages (C/C++ and Fortran) by

- a set of compiler directives to express shared memory parallelism (compiler directives are called pragmas in C), and
- runtime library routines and environment variables that are used to examine and modify execution parameters.

There is a standard include file omp.h for C/C++ OpenMP programs. OpenMP is becoming the de facto standard for parallelizing applications for shared memory multiprocessors. OpenMP is independent of the underlying hardware or operating system.
OpenMP References
There are a number of good books on OpenMP:

- B. Chapman, G. Jost, R. van der Pas: Using OpenMP. MIT Press, 2008. Easy to read. Examples in both C and Fortran (but not C++).
- R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald: Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, 2001. Easy to read. Much of the material is in Fortran.
- S. Hoffmann and R. Lienhart: OpenMP. Springer, Berlin, 2008. Easy to read. C and C++. In German. Available online via NEBIS.

There is an OpenMP organization with most of the major computer manufacturers and the U.S. DOE ASCI program on its board; see http://www.openmp.org.
Execution model: fork-join parallelism
- OpenMP is based on the fork-join execution model.
- At the start of an OpenMP program, a single thread (the master thread) is executing.
- If the master thread arrives at a compiler directive

  #pragma omp parallel

  that indicates the beginning of a parallel section of the code, then it forks the execution into a certain number of threads.
- At the end of the parallel section the execution joins again into the single master thread.
- At the end of a parallel section there is an implicit barrier: the program cannot proceed before all threads have reached the barrier.
- The master thread spawns a team of threads as needed.
- From a programmer's perspective, parallelism is added incrementally: the sequential program evolves into a parallel program (in contrast to MPI distributed memory programming).
- However, parallelism is limited with respect to the number of processors.
- This is (at least in theory) dynamic thread generation: the number of threads may vary from one parallel region to another. The number of threads to generate can be determined by functions from the runtime library or by environment variables:

  setenv OMP_NUM_THREADS 4

  (In static thread generation, the number of threads is fixed a priori from the start.)
- In practice, a number of threads may be initiated at the start of the program, but the (slave) threads become active only at the beginning of a parallel region.
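Besides the environment variable, the thread count can be set from the program itself via the runtime library. A hedged sketch (the function name `report_threads` is invented; the `#ifdef _OPENMP` guards keep it compilable without OpenMP support, in which case it simply reports one thread):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

int report_threads(void)
{
    int nthreads = 1;   /* serial fallback */
#ifdef _OPENMP
    omp_set_num_threads(4);   /* overrides OMP_NUM_THREADS for later regions */
#endif
    #pragma omp parallel
    {
        #pragma omp single
        {
#ifdef _OPENMP
            nthreads = omp_get_num_threads();  /* team size of this region */
#endif
        }
    }
    return nthreads;
}
```

`omp_set_num_threads` takes precedence over the OMP_NUM_THREADS environment variable for subsequent parallel regions; a `num_threads(...)` clause on the directive itself would override both.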
OpenMP hello world
#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
  {
    printf("Hello world\n");
  }
  return 0;
}
gcc compiler
OpenMP programs can be compiled with the GNU compiler:

gcc -o hello hello.c -fopenmp -lgomp

When -fopenmp is used, the compiler generates parallel code based on the OpenMP directives encountered. A macro _OPENMP is defined that can be checked with the preprocessor directive #ifdef. -lgomp links the libraries of the GNU OpenMP project.
A more complicated Hello world demo
The following little program calls functions from the OpenMP runtime library to get the number of threads and the ID of the current thread.
/* Modified example 2.1 from Hoffmann-Lienhart */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
int main(int argc, char* argv[])
{
#ifdef _OPENMP
printf("Number of processors: %d\n",
omp_get_num_procs ());
#pragma omp parallel
{
printf("Thread %d of %d says \"Hello World!\"\n",
omp_get_thread_num (),
omp_get_num_threads ());
}
#else
printf("OpenMP not supported.\n");
#endif
printf("Job completed.\n");
return 0;
}
It is important to note that the order in which the threads actually execute is not always the same: it depends on the run! This is not important here, of course. But if it were, we would have a race condition.
OpenMP parallel control structures
In the OpenMP fork/join model, the parallel control structures are those that fork (i.e., start) new threads. There are just two of these:

1. The parallel directive is used to create multiple threads of execution that execute concurrently. The parallel directive applies to a structured block, i.e., a block of code with one entry point at the top and one exit point at the bottom of the block. (Exception: exit().)
2. Further constructs are needed to divide work among an existing set of parallel threads. The for directive, e.g., is used to express loop-level parallelism.
An example for loop-level parallelism (back to axpy)

1. The sequential program:

for (i=0; i< N; i++){
  y[i] = alpha*x[i] + y[i];
}

2. An OpenMP parallel region (as written, every thread redundantly executes the whole loop):

#pragma omp parallel
{
  for (i=0; i< N; i++){
    y[i] = alpha*x[i] + y[i];
  }
}
3. An OpenMP parallel region with manual work distribution (assumes Nthrds divides N):

#pragma omp parallel
{
  int id, i, Nthrds, istart, iend;
  id = omp_get_thread_num();
  Nthrds = omp_get_num_threads();
  istart = id*N/Nthrds;
  iend = (id+1)*N/Nthrds;
  for (i=istart; i< iend; i++){
    y[i] = alpha*x[i] + y[i];
  }
}
4. An OpenMP parallel region combined with a for directive:

#pragma omp parallel
#pragma omp for
for (i=0; i< N; i++){
  y[i] = alpha*x[i] + y[i];
}

5. An OpenMP parallel region combined with a for directive and a schedule clause:

#pragma omp parallel for schedule(static)
for (i=0; i< N; i++){
  y[i] = alpha*x[i] + y[i];
}
For-construct with the schedule clause
The schedule clause affects how loop iterations are mapped onto threads:

- schedule(static [,chunk]): deal out blocks of iterations of size "chunk" to each thread.
- schedule(dynamic [,chunk]): each thread grabs "chunk" iterations off a queue until all iterations have been handled.
- schedule(guided [,chunk]): threads dynamically grab blocks of iterations; the size of the block starts large and shrinks down to size "chunk" as the calculation proceeds (guided self-scheduling).
- schedule(runtime): schedule and chunk size are defined by the OMP_SCHEDULE environment variable.
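Dynamic scheduling pays off when iterations have uneven cost. A small illustrative sketch (not from the lecture; `triangular` is an invented name): iteration i does O(i) work, so static scheduling would give the thread with the last block far more work, while dynamic with a small chunk rebalances on the fly. The numerical result is independent of the schedule.

```c
double triangular(int n)
{
    double total = 0.0;
    int i;
    /* uneven workload: iteration i costs O(i) inner-loop work */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (i = 0; i < n; i++) {
        double row = 0.0;
        int j;
        for (j = 0; j <= i; j++)
            row += 1.0;
        total += row;     /* reduction combines per-thread partial sums */
    }
    return total;
}
```

With `schedule(static)` the same loop is still correct, just potentially load-imbalanced; only the timing changes, never the result.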
Timings with varying chunk sizes in the for
#pragma omp parallel for schedule(static,chunk_size)
for (i=0; i< N; i++){
y[i] = alpha*x[i] + y[i];
}
chunk size   p = 1      2      4      6      8     12     16
N/p           1674    854    449    317    239    176     59
100           1694   1089    601    405    317    239    166
4             1934   2139   1606   1294    850    742    483
1             2593   2993   3159   2553   2334   2329   2129

Table 1: Some execution times in µsec for the saxpy operation with varying chunk size and processor number p
Here, chunks of size chunk_size are assigned cyclically to the processors. In Table 1, timings on the HP Superdome are summarized for N = 1000000.
How to measure time (walltime.c)
double t0, tw;
...
t0 = walltime(&tw);
for (ict=0; ict<CT; ict++){
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i< N; i++){
sum = sum + x[i]*y[i];
}
}
tw = walltime(&t0)/(double)CT;
printf("elapsed time: %10.4f musec\n",tw);
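The walltime() helper used above is not shown on the slide. A portable alternative sketch (my own, under the assumption that OpenMP may or may not be enabled) wraps omp_get_wtime, which returns elapsed wall-clock seconds, with a serial fallback:

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns wall-clock time in seconds; only differences are meaningful. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();          /* OpenMP wall-clock timer */
#else
    return (double)clock() / CLOCKS_PER_SEC;  /* crude serial fallback */
#endif
}
```

Usage mirrors the slide: t0 = wall_seconds(); ... ; elapsed = wall_seconds() - t0;. Note that the fallback measures CPU time rather than wall time, which only coincides for single-threaded runs.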