Parallel Processing with OpenMP
description
Transcript of Parallel Processing with OpenMP
![Page 1: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/1.jpg)
Parallel Processing with OpenMP
Doug Sondak
Boston University
Scientific Computing and Visualization
Office of Information Technology
![Page 2: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/2.jpg)
Outline
• Introduction• Basics• Data Dependencies• A Few More Basics• Caveats & Compilation• Coarse-Grained Parallelization• Thread Control Diredtives• Some Additional Functions• Some Additional Clauses• Nested Paralleliism• Locks
![Page 3: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/3.jpg)
Introduction
![Page 4: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/4.jpg)
Introduction
• Types of parallel machines– distributed memory
• each processor has its own memory address space• variable values are independent
x = 2 on one processor, x = 3 on a different processor
• examples: linux clusters, Blue Gene/L
– shared memory• also called Symmetric Multiprocessing (SMP)• single address space for all processors
– If one processor sets x = 2 , x will also equal 2 on other processors (unless specified otherwise)
• examples: IBM p-series, SGI Altix
![Page 5: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/5.jpg)
Shared vs. Distributed Memory
CPU 0 CPU 1 CPU 2 CPU 3 CPU 0 CPU 1 CPU 2 CPU 3
MEM 0 MEM 1 MEM 2 MEM 3 MEM
shareddistributed
![Page 6: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/6.jpg)
Shared vs. Distributed Memory (cont’d)
• Multiple processes– Each processor (typically) performs
independent task
• Multiple threads– A process spawns additional tasks (threads)
with same memory address space
![Page 7: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/7.jpg)
What is OpenMP?
• Application Programming Interface (API) for multi-threaded parallelization consisting of– Source code directives– Functions– Environment variables
![Page 8: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/8.jpg)
What is OpenMP? (cont’d)
• Advantages– Easy to use– Incremental parallelization– Flexible
• Loop-level or coarse-grain
– Portable• Since there’s a standard, will work on any SMP machine
• Disadvantage– Shared-memory systems only
![Page 9: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/9.jpg)
Basics
![Page 10: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/10.jpg)
Basics
• Goal – distribute work among threads• Two methods will be discussed here
– Loop-level• Specified loops are parallelized• This is approach taken by automatic parallelization tools
– Parallel regions• Sometimes called “coarse-grained”• Don’t know good term; good way to start argument with
semantically precise people• Usually used in message-passing (MPI)
![Page 11: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/11.jpg)
Basics (cont’d)
loop
loop
Loop-level Parallel regions
serial
serial
serial
![Page 12: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/12.jpg)
parallel do & parallel for
• parallel do (Fortran) and parallel for (C) directives parallelize subsequent loop
#pragma omp parallel forfor(i = 1; i <= maxi; i++){ a[i] = b[i] = c[i];}
!$omp parallel dodo i = 1, maxi a(i) = b(i) + c(i)enddo
Use “c$” for fixed-format Fortran
![Page 13: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/13.jpg)
parallel do & parallel for (cont’d)
• Suppose maxi = 1000
Thread 0 gets i = 1 to 250
Thread 1 gets i = 251 to 500 etc.
• Barrier implied at end of loop
![Page 14: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/14.jpg)
workshare
• For Fortran 90/95 array syntax, the parallel workshare directive is analogous to parallel do
• Previous example would be:
• Also works for forall and where statements
!$omp parallel worksharea = b + c!$omp end parallel workshare
![Page 15: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/15.jpg)
Shared vs. Private
• In parallel region, default behavior is that all variables are shared except loop index– All threads read and write the same memory
location for each variable– This is ok if threads are accessing different
elements of an array– Problem if threads write same scalar or array
element– Loop index is private, so each thread has its
own copy
![Page 16: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/16.jpg)
Shared vs. Private (cont’d)
• Here’s an example where a shared variable, i2, could cause a problem
ifirst = 10do i = 1, imax i2 = 2*i j(i) = ifirst + i2enddo
ifirst = 10;for(i = 1; i <= imax; i++){ i2 = 2*i; j[i] = ifirst + i2;}
![Page 17: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/17.jpg)
Shared vs. Private (3)
• OpenMP clauses modify the behavior of directives
• Private clause creates separate memory location for specified variable for each thread
![Page 18: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/18.jpg)
Shared vs. Private (4)
ifirst = 10!$omp parallel do private(i2)do i = 1, imax i2 = 2*i j(i) = ifirst + i2enddo
ifirst = 10;#pragma omp parallel for private(i2)for(i = 1; i <= imax; i++){ i2 = 2*i; j[i] = ifirst + i2;}
![Page 19: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/19.jpg)
Shared vs. Private (5)
• Look at memory just before and after parallel do/parallel for statement– Let’s look at two threads
ifirst = 10…
i2thread 0 = ???
spawn additional thread
ifirst = 10…
i2thread 0 = ???
i2thread 1 = ???
…
before after
![Page 20: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/20.jpg)
Shared vs. Private (6)
• Both threads access the same shared value of ifirst
• They have their own private copies of i2
ifirst = 10…i2thread 0 = ???
i2thread 1 = ???
…
Thread 0 Thread 1
![Page 21: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/21.jpg)
Data Dependencies
![Page 22: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/22.jpg)
Data Dependencies
• Data on one thread can be dependent on data on another thread
• This can result in wrong answers– thread 0 may require a variable that is
calculated on thread 1– answer depends on timing – When thread 0
does the calculation, has thread 1 calculated it’s value yet?
![Page 23: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/23.jpg)
Data Dependencies (cont’d)
• Example – Fibonacci Sequence 0, 1, 1, 2, 3, 5, 8, 13, …
a(1) = 0a(2) = 1do i = 3, 100 a(i) = a(i-1) + a(i-2)enddo
a[1] = 0;a[2] = 1;for(i = 3; i <= 100; i++){ a[i] = a[i-1] + a[i-2];}
![Page 24: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/24.jpg)
Data Dependencies (3)
• parallelize on 2 threads– thread 0 gets i = 3 to 51– thread 1 gets i = 52 to 100 – look carefully at calculation for i = 52 on
thread 1• what will be values of for i -1 and i - 2 ?
![Page 25: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/25.jpg)
Data Dependencies (4)
• A test for dependency:– if serial loop is executed in reverse order, will
it give the same result?– if so, it’s ok– you can test this on your serial code
![Page 26: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/26.jpg)
Data Dependencies (5)
• What about subprogram calls?
• Does the subprogram write x or y to memory?– If so, they need to be private
• Variables local to subprogram are local to each thread
• Be careful with global variables and common blocks
do i = 1, 100 call mycalc(i,x,y)enddo
for(i = 0; i < 100; i++){ mycalc(i,x,y);}
![Page 27: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/27.jpg)
A Few More Basics
![Page 28: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/28.jpg)
More clauses
• Can make private default rather than shared– Fortran only– handy if most of the variables are private– can use continuation characters for long lines
ifirst = 10!$omp parallel do &!$omp default(private) &!$omp shared(ifirst,imax,j)do i = 1, imax i2 = 2*i j(i) = ifirst + i2enddo
![Page 29: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/29.jpg)
More clauses (cont’d)
• Can use default none– declare all variables (except loop variables)
as shared or private– If you don’t declare any variables, you get a
handy list of all variables in loop
![Page 30: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/30.jpg)
More clauses (3)
ifirst = 10!$omp parallel do &!$omp default(none) &!$omp shared(ifirst,imax,j) &!$omp private(i2)do i = 1, imax i2 = 2*i j(i) = ifirst + i2enddo
ifirst = 10;#pragma omp parallel for \ default(none) \ shared(ifirst,imax,j) \ private(i2)for(i = 0; i < imax; i++){ i2 = 2*i; j[i] = ifirst + i2;}
![Page 31: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/31.jpg)
Firstprivate
• Suppose we need a running index total for each index value on each thread
• if iper were declared private, the initial value would not be carried into the loop
iper = 0do i = 1, imax iper = iper + 1 j(i) = iperenddo
iper = 0;for(i = 0; i < imax; i++){ iper = iper + 1; j[i] = iper;}
![Page 32: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/32.jpg)
Firstprivate (cont’d)
• Solution – firstprivate clause
• Creates private memory location for each thread
• Copies value from master thread (thread 0) to each memory locationiper = 0!$omp parallel do &!$omp firstprivate(iper) do i = 1, imax iper = iper + 1 j(i) = iperenddo
iper = 0;#pragma omp parallel for \ firstprivate(iper)for(i = 0; i < imax; i++){ iper = iper + 1; j[i] = iper;}
![Page 33: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/33.jpg)
Lastprivate
• saves value corresponding to the last loop index– "last" in the serial sense
!$omp parallel do lastprivate(i) do i = 1, maxi-1 a(i) = b(i)enddoa(i) = b(1)
#pragma omp parallel for \ lastprivate(i)for(i = 0; i < maxi-1; i++){ a[i] = b[i];}a(i) = b(1);
![Page 34: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/34.jpg)
Reduction
• following example won’t parallelize correctly with parallel do/parallel for– different threads may try to write to sum1
simultaneously sum1 = 0.0 do i = 1, maxi sum1 = sum1 + a(i)enddo
sum1 = 0.0;for(i = 0; i < imaxi; i++){ sum1 = sum1 + a[i];}
![Page 35: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/35.jpg)
Reduction (cont’d)
• Solution? – Reduction clause
• each thread performs its own reduction (sum, in this case)
• results from all threads are automatically reduced (summed) at the end of the loop
sum1 = 0.0!$omp parallel do & !$reduction(+:sum1)do i = 1, maxi sum1 = sum1 + a(i)enddo
sum1 = 0;#pragma omp parallel for \ reduction(+:sum1)for(i = 0; i < imaxi; i++){ sum1 = sum1 + a[i];}
![Page 36: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/36.jpg)
Reduction (3)
• Fortran operators/intrinsics: MAX, MIN, IAND, IOR, IEOR, +, *, -, .AND., .OR., .EQV., .NEQV.
• C operators: +, *, -, /, &, ^, |, &&, ||
• roundoff error may be different than serial case
![Page 37: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/37.jpg)
Ordered
• Suppose you want to write values in a loop:
• If loop were parallelized, could write out of order• ordered directive forces serial order
do i = 1, nproc call do_lots_of_work(result(i)) write(21,101) i, result(i)enddo
for(i = 0; i < nproc; i++){ do_lots_of_work(result[i]); fprintf(fid,”%d %f\n,”i,result[i]”);}
![Page 38: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/38.jpg)
Ordered (cont’d)
!$omp parallel dodo i = 1, nproc call do_lots_of_work(result(i)) !$omp ordered write(21,101) i, result(i) !$omp end orderedenddo
#pragma omp parallel forfor(i = 0; i < nproc; i++){ do_lots_of_work(result[i]); #pragma omp ordered fprintf(fid,”%d %f\n,”i,result[i]”); #pragma omp end ordered}
![Page 39: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/39.jpg)
Ordered (3)
• since do_lots_of_work takes a lot of time, most parallel benefit will be realized
• ordered is helpful for debugging– Is result same with and without ordered
directive?
![Page 40: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/40.jpg)
Conditional Compilation
• Fortran: if compiled without OpenMP, directives are treated as comments– Great for portability
• !$ (c$ for fixed format) can be used for conditional compilation for any source lines !$ print*, ‘number of procs =', nprocs
• C or C++: conditional compilation can be performed with the _OPENMP macro name.
#ifdef _OPENMP
… do stuff …
#endif
![Page 41: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/41.jpg)
Basic OpenMP Functions
• omp_get_thread_num() – returns ID of current thread
• omp_set_num_threads(nthreads)– subroutine in Fortran – sets number of threads in next parallel region to
nthreads – alternative to OMP_NUM_THREADS environment
variable – overrides OMP_NUM_THREADS
• omp_get_num_threads() – returns number of threads in current parallel region
![Page 42: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/42.jpg)
Caveats and Compilation
![Page 43: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/43.jpg)
Some Hints
• OpenMP will do what you tell it to do– If you try parallelize a loop with a dependency, it will go
ahead and do it!
• Do not parallelize small loops– Overhead will be greater than speedup– How small is “small”?
• Answer depends on processor speed and other system-dependent parameters
• Try it!
• Maximize number of operations performed in parallel– parallelize outer loops where possible
![Page 44: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/44.jpg)
Some Hints (cont’d)
• Always profile your code before parallelizing!– tells where most time is being used
• Portland Group profiler:– compile with –Mprof=func flag– run code
• file with profile data will appear in working directory
– run profiler:• pgprof –exe executable-name• GUI will pop up with profiling information
![Page 45: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/45.jpg)
Compile and Run
• Portland Group compilers (katana):– Compile with -mp flag– Can use up to 8 threads
• AIX compilers (twister, etc.): – Use an _r suffix on the compiler name, e.g., cc_r,
xlf90_r – Compile with -qsmp=omp flag
• Intel compilers (skate, cootie):– Compile with -openmp and –fpp flags– Can only use 2 threads
![Page 46: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/46.jpg)
Compile and Run (cont’d)
• GNU compilers (all machines):– Compile with -fopenmp flag
• Blue Gene does not accomodate OpenMP• environment variable OMP_NUM_THREADS
sets number of threads
setenv OMP_NUM_THREADS 4 • for C, include omp.h
![Page 47: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/47.jpg)
Exercise
• Copy files from /scratch/sondak/gt– both C and Fortran versions are available for your
programming pleasure
• compile code with pgcc or pgf90– include profiling flag– call executable “gt”
• Run code batch– qsub rungt– qstat –u username
![Page 48: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/48.jpg)
Exercise (cont’d)
• When run completes, look at profile– pgprof –exe gt– Where is most time being used?– How can we tackle this with OpenMP?
• Recompile without profiler flag• Re-run and note timing from batch output file
– this is the base timing
• We will decide on an OpenMP strategy• Add OpenMP directive(s)• Re-run on 1, 2, and 4 procs. and check new
timings
![Page 49: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/49.jpg)
Parallel
• parallel and do/for can be separated into two directives.
is the same as
!$omp parallel dodo i = 1, maxi a(i) = b(i)enddo
#pragma omp parallel forfor(i=0; i<maxi; i++){ a[i] = b[i];}
!$omp parallel!$omp dodo i = 1, maxi a(i) = b(i)enddo!$omp end parallel
#pragma omp parallel#pragma omp forfor(i=0; i<maxi; i++){ a[i] = b[i];}#pragma omp end parallel
![Page 50: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/50.jpg)
Parallel (cont’d)
• Note that an end parallel directive is required. • Everything within the parallel region will be run in
parallel (surprise!). • The do/for directive indicates that the loop
indices will be distributed among threads rather than duplicating every index on every thread.
![Page 51: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/51.jpg)
Parallel (3)
• Multiple loops in parallel region:
• parallel directive has a significant overhead associated with it.
• The above example has the potential to be faster than using two parallel do/parallel for directives.
!$omp parallel!$omp dodo i = 1, maxi a(i) = b(i)enddo!$omp dodo i = 1, maxi c(i) = a(2)enddo!$omp end parallel
#pragma omp parallel#pragma omp forfor(i=0; i<maxi; i++){ a[i] = b[i];}#pragma omp forfor(i=0; i<maxi; i++){ c[i] = a[2];}#pragma omp end parallel
![Page 52: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/52.jpg)
Coarse-Grained
• OpenMP is not restricted to loop-level (fine-grained) parallelism.
• The !$omp parallel or #pragma omp parallel directive duplicates subsequent code on all threads until a !$omp end parallel or #pragma omp end parallel directive is encountered.
• Allows parallelization similar to “MPI paradigm.”
![Page 53: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/53.jpg)
Coarse-Grained (cont’d)!$omp parallel &!$omp private(myid,istart,iend,nthreads,nper) nthreads = omp_get_num_threads()nper = imax/nthreadsmyid = omp_get_thread_num()istart = myid*nper + 1iend = istart + nper – 1call do_work(istart,iend)do i = istart, iend a(i) = b(i)*c(i) + ...enddo!$omp end parallel
#pragma omp parallel \#pragma omp private(myid,istart,iend,nthreads,nper)nthreads = OMP_GET_NUM_THREADS();nper = imax/nthreads;myid = OMP_GET_THREAD_NUM();istart = myid*nper;iend = istart + nper – 1;do_work(istart,iend);for(i=istart; i<=iend; i++){ a[i] = b[i]*c[i] + ...}#pragma omp end parallel
![Page 54: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/54.jpg)
Thread Control Directives
![Page 55: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/55.jpg)
• barrier synchronizes threads
• Here barrier assures that a(1) or a[0] is available before computing myval
Barrier
$omp parallel private(myid,istart,iend)call myrange(myid,istart,iend)do i = istart, iend a(i) = a(i) - b(i)enddo!$omp barriermyval(myid+1) = a(istart) + a(1)
#pragma omp parallel private(myid,istart,iend)myrange(myid,istart,iend);for(i=istart; i<=iend; i++){ a[i] = a[i] – b[i];}#pragma omp barriermyval[myid] = a[istart] + a[0]
![Page 56: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/56.jpg)
• if you want part of code to be executed only on master thread, use master directive
• “non-master” threads will skip over master region and continue
Master
![Page 57: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/57.jpg)
Master Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend) call myrange(myid,istart,iend) do i = istart, iend a(i) = a(i) - b(i) enddo !$OMP BARRIER !$OMP MASTER write(21) a !$OMP END MASTER call do_work(istart,iend) !$OMP END PARALLEL
![Page 58: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/58.jpg)
Master Example - C
#pragma omp parallel private(myid,istart,iend) myrange(myid,istart,iend); for(i=istart; i<=iend; i++){ a[i] = a[i] – b[i]; } #pragma omp barrier #pragma omp master fwrite(fid,sizeof(float),iend-istart+1,a); #pragma omp end master do_work(istart,iend); #pragma omp end parallel
![Page 59: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/59.jpg)
• want part of code to be executed only by a single thread
• don’t care whether or not it’s the master thread
• use single directive
• Unlike the end master directive, end single implies a barrier.
Single
![Page 60: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/60.jpg)
Single Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend) call myrange(myid,istart,iend) do i = istart, iend a(i) = a(i) - b(i) enddo !$OMP BARRIER !$OMP SINGLE write(21) a !$OMP END SINGLE call do_work(istart,iend) !$OMP END PARALLEL
![Page 61: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/61.jpg)
Single Example - C
#pragma omp parallel private(myid,istart,iend) myrange(myid,istart,iend); for(i=istart; i<=iend; i++){ a[i] = a[i] – b[i]; } #pragma omp barrier #pragma omp single fwrite(fid,sizeof(float),nvals,a); #pragma omp end single do_work(istart,iend); #pragma omp end parallel
![Page 62: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/62.jpg)
• have section of code that:1. must be executed by every thread
2. threads may execute in any order
3. threads must not execute simultaneously
• use critical directive
• no implied barrier
Critical
![Page 63: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/63.jpg)
Critical Example - Fortran
!$OMP PARALLEL PRIVATE(myid,istart,iend) call myrange(myid,istart,iend) do i = istart, iend a(i) = a(i) - b(i) enddo !$OMP CRITICAL call mycrit(myid,a) !$OMP END CRITICAL call do_work(istart,iend) !$OMP END PARALLEL
![Page 64: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/64.jpg)
Critical Example - C
#pragma omp parallel private(myid,istart,iend) myrange(myid,istart,iend); for(i=istart; i<=iend; i++){ a[i] = a[i] – b[i]; } #pragma omp critical mycrit(myid,a); #pragma omp end critical do_work(istart,iend); #pragma omp end parallel
![Page 65: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/65.jpg)
• In parallel region, have several independent tasks
• Divide code into sections• Each section is executed on a different thread.• Implied barrier• Fixes number of threads• Note that sections directive delineates region
of code which includes section directives.
Sections
![Page 66: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/66.jpg)
Sections Example - Fortran
!$OMP PARALLEL !$OMP SECTIONS !$OMP SECTION call init_field(a) !$OMP SECTION call check_grid(x) !$OMP END SECTIONS !$OMP END PARALLEL
![Page 67: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/67.jpg)
Sections Example - C
#pragma omp parallel #pragma omp sections #pragma omp section init_field(a); #pragma omp section check_grid(x); #pragma omp end sections #pragma omp end parallel
![Page 68: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/68.jpg)
• Analogous to do and parallel do or for and parallel for, along with sections there exists a parallel sections directive.
Parallel Sections
![Page 69: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/69.jpg)
Parallel Sections Example - Fortran
!$OMP PARALLEL SECTIONS !$OMP SECTION call init_field(a) !$OMP SECTION call check_grid(x) !$OMP END PARALLEL SECTIONS
![Page 70: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/70.jpg)
Parallel Sections Example - C
#pragma omp parallel sections #pragma omp section init_field(a); #pragma omp section check_grid(x); #pragma omp end parallel sections
![Page 71: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/71.jpg)
• Only needed for Fortran– Usually legacy code
• common block variables are inherently shared, may want them to be private
• threadprivate gives each thread its own private copy of variables in specified common block.
– Keep memory use in mind
• Place threadprivate after common block declaration.
Threadprivate
![Page 72: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/72.jpg)
common /work1/ work(1000)
!$omp threadprivate(/work1/)
Threadprivate (cont’d)
![Page 73: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/73.jpg)
• threadprivate renders all values in common block private
• may want to use some current values of variables on all threads
• copyin initializes each thread’s copy of specified common-block variables with values form the master thread
Copyin
![Page 74: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/74.jpg)
Copyin Example - Fortran
real, parameter :: nvals = 501 real, dimension(nvals) :: x, y, xmet, ymet common /work1/ work(nvals), istart, iend !$OMP THREADPRIVATE(/work1/) istart = 2; iend = 500 call read_coords(x,y) !$OMP PARALLEL SECTIONS COPYIN(istart,iend) !$OMP SECTION call metric(x) xmet = work !$OMP SECTION call metric(y) ymet = work !$OMP END PARALLEL SECTIONS
subroutine metric(c) real, dimension(:) :: c common /work1/ work(nvals), istart, iend !$OMP THREADPRIVATE(/work1/) do i = istart, iend work(i) = 0.5*(c(i+1)-c(i-1)) enddo end subroutine metric
![Page 75: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/75.jpg)
• Similar to critical– Allows greater optimization than critical
• Applies to single line following the atomic directive
Atomic
![Page 76: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/76.jpg)
Atomic Example - Fortran
do i = 1, 10 !$OMP ATOMIC x(j(i)) = x(j(i)) + 1.0 enddo
![Page 77: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/77.jpg)
Atomic Example - C
for(i = 0; i<10; i++){ #pragma omp atomic x[j[i]]= x[j[i]] + 1.0; }
![Page 78: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/78.jpg)
• Different threads can have different values for the same shared variable
– example – value in register
• flush assures that the calling thread has a consistent view of memory
• upcoming example is from OpenMP spec
Flush
![Page 79: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/79.jpg)
Flush Example - Fortraninteger isync(num_threads) !$omp parallel default(private) shared(isync)iam = omp_get_thread_num() isync(iam) = 0 !$omp barriercall work()isync(iam) = 1 ! let other threads know I finished work !$omp flush(isync)do while(isync(neigh) == 0) !$omp flush(isync) enddo$omp end parallel
![Page 80: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/80.jpg)
Flush Example - C
int isync[num_threads]; #pragma omp parallel default(private) shared(isync) iam = omp_get_thread_num(); isync[iam] = 0; #pragma omp barrier work(); isync[iam] = 1; /* let other threads know I finished work */ #pragma omp flush(isync) while(isync[neigh] == 0) #pragma omp flush(isync) #pragma omp end parallel
![Page 81: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/81.jpg)
Some Additional Functions
![Page 82: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/82.jpg)
• On some systems the number of threads available to you may be adjusted dynamically.
• The number of threads requested can be construed as the maximum number of threads available.
• Dynamic threading can be turned on or off:
Dynamic Thread Assignment
logical mydyn = .false.call omp_set_dynamic(mydyn)
int mydyn = 0;omp_set_dynamic(mydyn);
![Page 83: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/83.jpg)
• Dynamic threading can also be turned on or off by setting the environment variable omp_dynamic to true or false
– call to omp_set_dynamic overrides the environment variable
• The function omp_get_dynamic() returns a value of “true” or “false,” indicating whether dynamic threading is turned on or off.
Dynamic Thread Assignment (cont’d)
![Page 84: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/84.jpg)
• integer function• returns maximum number of threads available
in current parallel region• same result as omp_get_num_threads if
dynamic threading is turned off
OMP_GET_MAX_THREADS
![Page 85: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/85.jpg)
• integer function• returns maximum number of processors in the
system– indicates amount of hardware, not number of
available processors
• could be used to make sure enough processors are available for specified number of sections
OMP_GET_NUM_PROCS
![Page 86: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/86.jpg)
• logical function• tells whether or not function was called from a
parallel region• useful debugging device when using
“orphaned” directives– example of orphaned directive: omp parallel in
“main,” omp do or omp for in routine called from main
OMP_IN_PARALLEL
![Page 87: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/87.jpg)
Some Additional Clauses
![Page 88: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/88.jpg)
Schedule
• schedule refers to the way in which loop indices are distributed among threads
• default is static, which we had seen in an earlier example– each thread is assigned a contiguous range of
indices in order of thread number• called round robin
– number of indices assigned to each thread is as equal as possible
![Page 89: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/89.jpg)
Schedule (cont’d)
• static example– loop runs from 1 to 51, 4 threads
thread indices no. indices
0 1-13 13
1 14-26 13
2 27-39 13
3 40-51 12
![Page 90: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/90.jpg)
Schedule (3)
• number of indices doled out at a time to each thread is called the chunk size
• can be modified with the SCHEDULE clause
!$omp do schedule(static,5)
#pragma omp for schedule(static,5)
![Page 91: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/91.jpg)
Schedule (4)
• example – same as previous example (loop runs from 1-51, 4 threads), but set chunk size to 5
thread chunk 1 indices
chunk 2 indices
chunk 3 indices
no. indices
0 1-5 21-25 41-45 15
1 6-10 26-30 46-50 15
2 11-15 31-35 51 11
3 16-20 36-40 - 10
![Page 92: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/92.jpg)
Dynamic Schedule
• schedule(dynamic) clause assigns chunks to threads dynamically as the threads become available for more work
• default chunk size is 1• higher overhead than STATIC
!$omp do schedule(dynamic,5)
#pragma omp for schedule(dynamic,5)
![Page 93: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/93.jpg)
Guided Schedule
• schedule(guided) clause assigns chunks automatically, exponentially decreasing chunk size with each assignment
• specified chunk size is the minimum chunk size except for the last chunk, which can be of any size
• default chunk size is 1
![Page 94: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/94.jpg)
Runtime Schedule
• schedule can be specified through omp_schedule environment variable
setenv omp_schedule “dynamic,5”
• the schedule(runtime) clause tells it to set the schedule using the environment variable
!$omp do schedule(runtime)
#pragma omp for schedule(runtime)
![Page 95: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/95.jpg)
Nested Parallelism
![Page 96: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/96.jpg)
• parallel construct within parallel construct:
Nested Parallelism
!$omp parallel do do j = 1, jmax !$omp parallel do do i = 1, i,max call do_work(i,j) enddo enddo
#pragma omp parallel for for(j=0; j<jmax; j++){ #pragma omp parallel for for(i=0; i<imax; i++){ do_work(i,j); } }
![Page 97: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/97.jpg)
• OpenMP standard allows but does not require nested parallelism
• if nested parallelism is not implemented, previous examples are legal, but the inner loop will be serial
• logical function omp_get_nested() returns .true. or .false. (1 or 0) to indicate whether or not nested parallelism is enabled in the current region
Nested Parallelism (cont’d)
![Page 98: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/98.jpg)
• omp_set_nested(nest) enables/disables nested parallelism if argument is true/false
– subroutine in Fortran (.true./.false.)– function in C (1/0)
• nested parallelism can also be turned on or off by setting the environment variable omp_nested to true or false
– calls to omp_set_nested override the environment variable
Nested Parallelism (3)
![Page 99: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/99.jpg)
Locks
![Page 100: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/100.jpg)
• Locks offer additional control of threads• a thread can take/release ownership of a lock
over a specified region of code• when a lock is owned by a thread, other
threads cannot execute the locked region• can be used to perform useful work by one or
more threads while another thread is engaged in a serial task
Locks
![Page 101: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/101.jpg)
• Lock routines each have a single argument– argument type is an integer (Fortran) or a pointer to
an integer (C) long enough to hold an address– OpenMP pre-defines required types
integer(omp_lock_kind) :: mylock
omp_nest_lock_t *mylock;
Lock Routines
![Page 102: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/102.jpg)
• omp_init_lock(mylock)– initializes mylock– must be called before mylock is used– subroutine in Fortran
• omp_set_lock(mylock)– gives ownership of mylock to calling thread– other threads cannot execute code following call
until mylock is released– subroutine in Fortran
Lock Routines (cont’d)
![Page 103: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/103.jpg)
• omp_test_lock(mylock)– logical function– if mylock is presently owned by another
thread, returns false– if mylock is available, returns true and sets
lock, i.e., gives ownership to calling thread
– be careful: omp_test_lock(mylock)does more than just test the lock
Lock Routines (3)
![Page 104: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/104.jpg)
• omp_unset_lock(mylock)– releases ownership of mylock– subroutine in Fortran
• omp_destroy_lock(mylock)– call when you’re done with the lock– complement to omp_init_lock(mylock)– subroutine in Fortran
Lock Routines (4)
![Page 105: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/105.jpg)
• an independent task takes a significant amount of time, and it must be performed serially
• another task, independent of the first, may be performed in parallel
• improve parallel efficiency by doing both tasks at the same time
• one thread locks serial task• other threads work on parallel task until
serial task is complete• lock example
Lock Example
![Page 106: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/106.jpg)
Lock Example
call omp_init_lock(mylock) !$omp parallel if(omp_test_lock(mylock)) then call long_serial_task call omp_unset_lock(mylock) else dowhile(.not. omp_test_lock(mylock)) call short_parallel_task enddo call omp_unset_lock(mylock) endif !$omp end parallel call omp_destroy_lock(mylock)
![Page 107: Parallel Processing with OpenMP](https://reader035.fdocuments.in/reader035/viewer/2022062409/56814d24550346895dba571a/html5/thumbnails/107.jpg)
Lock Example
omp_init_lock(mylock); #pragma omp parallel if(omp_test_lock(mylock)){ long_serial_task(); omp_unset_lock(mylock); }else{ while(! omp_test_lock(mylock)){ short_parallel_task(); } omp_unset_lock(mylock); } #pragma omp end parallel omp_destroy_lock(mylock);