ITCS4145/5145, Parallel Programming B. Wilkinson Feb 11, 2013. slides 8b-1.ppt
Programming with Shared Memory
Introduction to OpenMP
Part 1
OpenMP
Thread-based shared memory programming model.
Accepted standard developed in the late 1990s by a group of industry specialists.
Higher-level than using thread APIs such as Pthreads or Java threads.
Write programs in C/C++ (or Fortran!) and use OpenMP compiler directives to specify parallelism.
OpenMP also has a few supporting library routines and environment variables.
Several compilers can compile OpenMP programs, including recent Linux C compilers.
OpenMP thread model
[Figure: the master thread runs alone, multiple threads run in each parallel region, and the threads synchronize at the end of each region before the master thread continues alone.]
Initially, the program is executed by a single thread, the master thread.
The parallel directive uses a team of threads, with the subsequent block of code executed by multiple threads in parallel.
The exact number of threads is determined in one of several ways, see later.
Other directives within a parallel construct specify parallel for loops and different blocks of code for threads.
Code outside a parallel region is executed by the master thread only.
Number of threads in a team
Established in one of three ways, either:
1. num_threads clause after the parallel directive
e.g. #pragma omp parallel num_threads(5)
or
2. omp_set_num_threads() library routine being previously called
e.g. omp_set_num_threads(6);
or
3. Environment variable OMP_NUM_THREADS is defined
e.g.
$ export OMP_NUM_THREADS=8
$ ./hello
These take precedence in the order given. If none of the above is used, the number of threads is system dependent. The number of threads available can be altered dynamically to achieve best use of system resources.
Finding number of threads and thread ID during program execution
• omp_get_num_threads(): returns the total number of threads in the current team
• omp_get_thread_num(): returns the thread number (ID), an integer from 0 to omp_get_num_threads() - 1, where thread 0 is the master thread
The names of these two functions are similar and easy to confuse.
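A minimal sketch of how the two routines fit together with the num_threads clause. The helper name team_size_with_clause is hypothetical, and the serial fallbacks under #else are an assumption added so the file also builds when the compiler is run without OpenMP support. The runtime may grant fewer threads than requested, so the sketch reports the team size it actually observed.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption: used only when OpenMP is not enabled). */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Hypothetical helper: asks for `requested` threads with the num_threads
   clause and returns the team size that was actually granted. */
int team_size_with_clause(int requested)
{
    int granted = 0;                      /* declared outside the region: shared */

    #pragma omp parallel num_threads(requested)
    {
        int tid = omp_get_thread_num();   /* this thread's ID: 0 .. team size - 1 */
        int n   = omp_get_num_threads();  /* team size: same value in every thread */

        printf("thread %d of %d\n", tid, n);
        if (tid == 0)
            granted = n;                  /* only the master thread writes */
    }
    return granted;
}
```

Calling team_size_with_clause(4) from main prints one line per thread and returns a value between 1 and 4, depending on what the runtime grants.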
OpenMP Parallel Directive
#pragma omp parallel
structured_block
#pragma: C “pragmatic” directive that instructs the compiler to use OpenMP features.
omp: all OpenMP directives have omp.
parallel: the OpenMP parallel directive.
structured_block: a single statement, or a compound statement created with { ...}, with a single entry point and a single exit point.
Creates multiple threads, each one executing the specified structured_block.
Implicit barrier at end of construct.
Hello world example
#pragma omp parallel
{
  printf("Hello World from thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
}
Output from an 8-processor/core machine:
Hello World from thread 0 of 8
Hello World from thread 4 of 8
Hello World from thread 3 of 8
Hello World from thread 2 of 8
Hello World from thread 7 of 8
Hello World from thread 1 of 8
Hello World from thread 6 of 8
Hello World from thread 5 of 8
VERY IMPORTANT: The opening brace must be on a new line (tabs, spaces ok)
Global “shared” variables/data
Any variable declared outside a parallel construct is accessible by all threads unless otherwise specified:
int main (int argc, char *argv[]) {
  int x; // accessible by all threads
  #pragma omp parallel
  {
    … // each thread sees the same x
  }
}
Private variables
Separate copies of variables for each thread. They can be declared within each parallel region, but OpenMP also provides the private clause.
int tid;
…
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num();
printf("Hello World from thread = %d\n", tid);
}
Each thread has a local variable tid
Also a shared clause available for shared variables.
Another example of shared and private data
int main (int argc, char *argv[])
{
int x;
int tid;
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num();
if (tid == 0) x = 42;
printf ("Thread %d, x = %d\n", tid, x);
}
}
x is shared by all threads
tid is private – each thread has its own copy
Variables declared outside the parallel construct are shared unless otherwise specified
Output
$ ./data
Thread 3, x = 0
Thread 2, x = 0
Thread 1, x = 0
Thread 0, x = 42
Thread 4, x = 42
Thread 5, x = 42
Thread 6, x = 42
Thread 7, x = 42
tid has a separate value for each thread
Why does x change?
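In the output above, x differs between threads because thread 0 writes the shared x while the other threads read it with no synchronization, so a thread that prints before the write sees the old (uninitialized) value. A minimal sketch of one way to make the result deterministic, assuming every thread should see 42, is to place a barrier after the write. The helper name all_threads_see_42, the team size of 4, and the mismatch counter are hypothetical additions for illustration; the #else fallback is an assumption for builds without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback (assumption: used only when OpenMP is not enabled). */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical helper: returns 1 if every thread read x == 42 after the barrier. */
int all_threads_see_42(void)
{
    int x = 0;            /* shared: declared outside the parallel region */
    int mismatches = 0;   /* shared counter of threads that saw another value */

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();   /* private: declared inside the region */

        if (tid == 0)
            x = 42;                       /* single writer */

        /* No thread passes this point until all have arrived, so the
           write by thread 0 is complete and visible to every reader. */
        #pragma omp barrier

        printf("Thread %d, x = %d\n", tid, x);
        if (x != 42) {
            #pragma omp critical
            mismatches++;
        }
    }
    return mismatches == 0;
}
```

With the barrier in place, every line of output reports x = 42, although the order of the lines still varies from run to run.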
Another Example: Shared versus Private
int a[100];
int tid, n;
#pragma omp parallel private(tid, n)
{
  tid = omp_get_thread_num();
  n = omp_get_num_threads();
  a[tid] = 10*n;
}
OR, with the optional shared clause:
#pragma omp parallel private(tid, n) shared(a)
...
tid and n are private
a[ ] is shared
Variations of private variables
private clause – creates private copies of variables for each thread
firstprivate clause - as private clause but initializes each copy to the values given immediately prior to parallel construct.
lastprivate clause – as private but “the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable’s original object.”
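The difference between these clauses can be sketched with a small loop. This is an illustration only: the helper name last_scaled_value and its parameters are hypothetical, and it uses the combined parallel for form so that lastprivate has a loop to refer to.

```c
/* Hypothetical helper: returns the value computed by the sequentially
   last iteration (i == n - 1), which is i * scale. Intended for n >= 1. */
int last_scaled_value(int n, int scale)
{
    int last = -1;   /* original object: receives the lastprivate value */
    int i;

    /* firstprivate(scale): each thread's copy starts with the caller's value.
       lastprivate(last):   after the loop, `last` holds the value assigned
                            in the sequentially last iteration. */
    #pragma omp parallel for firstprivate(scale) lastprivate(last)
    for (i = 0; i < n; i++) {
        last = i * scale;
    }
    return last;
}
```

For example, last_scaled_value(10, 3) returns 27 regardless of which thread happened to run iteration 9. With a plain private(last) instead, the value of last after the loop would not be defined.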
Work-Sharing
Specifying work inside a parallel region
Four constructs in this classification:
sections – section
for
single
master
For sections, for and single, there is an implicit barrier at the end of the construct unless a nowait clause is included, which overrides the barrier. The master construct has no implied barrier.
Note: These constructs do not start a new team of threads. That is done by an enclosing parallel construct.
Sections
The construct:
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    structured_block
    #pragma omp section
    structured_block
    …
  }
}
causes the structured blocks to be shared among the threads in the team. The first section directive is optional.
The blocks are executed by available threads. The parallel directive is the enclosing directive.
Example
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  #pragma omp sections nowait
  {
    #pragma omp section /* one thread does this */
    {
      printf("Thread %d doing section 1\n",tid);
      for (i=0; i<N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
      }
    }
    #pragma omp section /* another thread does this */
    {
      printf("Thread %d doing section 2\n",tid);
      for (i=0; i<N; i++) {
        d[i] = a[i] * b[i];
        printf("Thread %d: d[%d]= %f\n",tid,i,d[i]);
      }
    }
  } /* end of sections */
} /* end of parallel section */
Another sections example
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
tid = omp_get_thread_num();
#pragma omp sections nowait
{
#pragma omp section
{
printf("Thread %d doing section 1\n",tid);
for (i=0; i<N; i++) {
c[i] = a[i] + b[i];
printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
}
Threads do not wait after finishing section
One thread does this
Sections example continued
#pragma omp section
{
printf("Thread %d doing section 2\n",tid);
for (i=0; i<N; i++) {
d[i] = a[i] * b[i];
printf("Thread %d: d[%d]= %f\n",tid,i,d[i]);
}
}
} /* end of sections */
printf ("Thread %d done\n", tid);
} /* end of parallel section */
Another thread does this
Output
Thread 0 doing section 1
Thread 0: c[0]= 5.000000
Thread 0: c[1]= 7.000000
Thread 0: c[2]= 9.000000
Thread 0: c[3]= 11.000000
Thread 0: c[4]= 13.000000
Thread 3 done
Thread 2 done
Thread 1 doing section 2
Thread 1: d[0]= 0.000000
Thread 1: d[1]= 6.000000
Thread 1: d[2]= 14.000000
Thread 1: d[3]= 24.000000
Thread 0 done
Thread 1: d[4]= 36.000000
Thread 1 done
Threads do not wait (i.e. no barrier)
Output if the nowait clause is removed
Thread 0 doing section 1
Thread 0: c[0]= 5.000000
Thread 0: c[1]= 7.000000
Thread 0: c[2]= 9.000000
Thread 0: c[3]= 11.000000
Thread 0: c[4]= 13.000000
Thread 3 doing section 2
Thread 3: d[0]= 0.000000
Thread 3: d[1]= 6.000000
Thread 3: d[2]= 14.000000
Thread 3: d[3]= 24.000000
Thread 3: d[4]= 36.000000
Thread 3 done
Thread 1 done
Thread 2 done
Thread 0 done
If we remove the nowait, then there is a barrier at the end of the sections construct. Threads wait there until all sections are done.
Barrier here
Combining parallel and section constructs
If a parallel directive is followed by a single “sections” directive, they can be combined into:
#pragma omp parallel sections
{
  #pragma omp section
  structured_block
  #pragma omp section
  structured_block
  …
}
with similar effect. (However, a nowait clause is not allowed.)
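A minimal sketch of the combined form, reusing the add/multiply work from the earlier sections example. The function name add_and_multiply and the fixed array length N = 5 are assumptions made for illustration.

```c
#define N 5   /* assumed array length for this sketch */

/* Hypothetical helper: one section fills c with sums, the other fills d
   with products. The two sections touch different output arrays, so they
   can safely run on different threads at the same time. */
void add_and_multiply(const float a[N], const float b[N],
                      float c[N], float d[N])
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }
        #pragma omp section
        {
            for (int i = 0; i < N; i++)
                d[i] = a[i] * b[i];
        }
    } /* implicit barrier: both sections are complete here */
}
```

With a[i] = i and b[i] = i + 5, this yields c = 5, 7, 9, 11, 13 and d = 0, 6, 14, 24, 36, the same values shown in the output above.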
Parallel For Loop
#pragma omp parallel
{
  …
  #pragma omp for
  for ( i = 0; i < n; i++ ) {
    … // for loop body
  }
  …
}
causes the for loop to be divided into parts and the parts shared among threads in the team, equivalent to a “forall.” Different iterations will be executed by available threads.
Must be a “for” loop of a simple C form such as (i = 0; i < n; i++) where the lower bound and upper bound do not change while the loop runs.
The for statement must start on a new line after the #pragma omp for directive. The parallel directive provides the enclosing parallel region.
Example
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
  tid = omp_get_thread_num();
  if (tid == 0) { /* executed by one thread */
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
  }
  printf("Thread %d starting...\n",tid);
  #pragma omp for
  for (i=0; i<N; i++) { /* for loop */
    c[i] = a[i] + b[i];
    printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
  }
} /* end of parallel section */
Without “nowait”, threads wait after finishing the loop
Combined parallel and for constructs
If a parallel directive is followed by a single for directive, it can be combined into:
#pragma omp parallel for
<for loop> {
…
}
with similar effects.
Combining Directives Example
#pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
for (i = 0; i < N; i++) {
  tid = omp_get_thread_num(); // tid is private, so each thread must set its own copy
  c[i] = a[i] + b[i];
  printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
Declares a Parallel Region and a Parallel For
Scheduling a Parallel For
• By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to available threads (static mapping)
Thread 1 starting...
Thread 1: i = 2, c[1] = 9.000000
Thread 1: i = 3, c[1] = 11.000000
Thread 2 starting...
Thread 2: i = 4, c[2] = 13.000000
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] = 5.000000
Thread 0: i = 1, c[0] = 7.000000
Default chunk size = (number of iterations) / (number of threads)
Barrier here
Loop Scheduling and Partitioning
OpenMP offers scheduling clauses to add to the for construct:
1. Static
#pragma omp parallel for schedule (static,chunk_size)
Partitions loop iterations into equal sized chunks specified by chunk_size. Chunks assigned to threads in round robin fashion.
2. Dynamic
#pragma omp parallel for schedule (dynamic,chunk_size)
Uses internal work queue. Chunk-sized block of loop assigned to threads as they become available.
3. Guided
#pragma omp parallel for schedule (guided,chunk_size)
Similar to dynamic but chunk size starts large and gets smaller to reduce time threads have to go to work queue.
chunk size = (number of iterations remaining) / (2 * number of threads)
4. Runtime
#pragma omp parallel for schedule (runtime)
Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.
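The round-robin assignment of static chunks can be observed with a short sketch that records which thread ran each iteration. The helper name record_static_owners and the loop length N = 16 are assumptions for illustration, and the #else fallbacks are an assumption for builds without OpenMP. Changing static to dynamic or guided in the clause keeps the loop correct but makes the recorded owners vary between runs.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption: used only when OpenMP is not enabled). */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif

#define N 16   /* assumed number of loop iterations for this sketch */

/* Hypothetical helper: stores in owner[i] the ID of the thread that ran
   iteration i under schedule(static, chunk), and returns the team size.
   chunk must be a positive integer. */
int record_static_owners(int chunk, int owner[N])
{
    int nthreads = 1;

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            nthreads = omp_get_num_threads();   /* written by one thread only */

        /* Chunks of `chunk` consecutive iterations are handed out to
           threads 0, 1, 2, ... in turn, then around again. */
        #pragma omp for schedule(static, chunk)
        for (int i = 0; i < N; i++)
            owner[i] = omp_get_thread_num();
    }
    return nthreads;
}
```

With a chunk size of 2 and a team of T threads, iteration i is run by thread (i / 2) % T, so iterations 0 and 1 go to thread 0, iterations 2 and 3 to thread 1, and so on.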
Question
• Guided scheduling is similar to Static except that the chunk sizes start large and get smaller.
• What is the advantage of using Guided versus Static?
• Answer: Guided improves load balance
Reduction
• A reduction is when we apply a commutative operator to an aggregate of values, creating a single value (similar to MPI_Reduce)
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++ ) {
sum = sum + funct(k);
}
In reduction(+:sum), + is the operation and sum is the variable.
A private copy of sum is created for each thread by the compiler. Each private copy is added to sum at the end. This eliminates the need for critical sections here.
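A runnable sketch of the same pattern, assuming a simple stand-in for funct(k). The names square and sum_of_squares are hypothetical and are not part of the original slides.

```c
/* Hypothetical stand-in for funct(k). */
static long square(long k) { return k * k; }

/* Hypothetical helper: sums square(k) for k = 0 .. n-1. Each thread
   accumulates into its own private copy of sum, and the copies are
   combined with + when the loop ends. */
long sum_of_squares(int n)
{
    long sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++)
        sum += square(k);

    return sum;
}
```

For n = 100 the result is 328350 for any number of threads, because integer addition gives the same total in any order.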
Single
The directive
#pragma omp parallel
{
  …
  #pragma omp single
  structured_block
  …
}
causes the structured block to be executed by one thread only.
The structured block must start on a new line after the #pragma omp single directive.
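A minimal sketch showing that the block under single runs once per team, however many threads there are. The helper name count_single_executions is hypothetical.

```c
/* Hypothetical helper: counts how many times the single block runs
   within one parallel region. */
int count_single_executions(void)
{
    int executions = 0;   /* shared: declared outside the parallel region */

    #pragma omp parallel
    {
        #pragma omp single
        {
            executions++;   /* run by exactly one thread of the team */
        }
        /* Implicit barrier at the end of single: the other threads
           wait here until the block has finished. */
    }
    return executions;
}
```

The function returns 1 whatever the team size. Which thread executes the block is not specified, so it is not necessarily the master thread.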
Master
The master directive:
#pragma omp parallel
{
  …
  #pragma omp master
  structured_block
  …
}
causes only the master thread to execute the structured block.
Different to those in the work-sharing group in that there is no implied barrier at the end of the construct (nor the beginning). Other threads encountering the master directive will ignore it and the associated structured block, and will move on.
Master Example
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num();
printf ("Thread %d starting...\n", tid);
#pragma omp master
{
printf("Thread %d doing work\n",tid);
...
} /* end of master */
printf ("Thread %d done\n", tid);
} /* end of parallel section */
Is there any difference between these two approaches:
Master Directive:
#pragma omp parallel
{
...
#pragma omp master
structured_block
...
}
Using an if statement:
#pragma omp parallel private(tid)
{
...
tid=omp_get_thread_num();
if (tid == 0)
structured_block
...
}