ITCS 4145/5145, Parallel Programming. B. Wilkinson, Feb 11, 2013. slides 8b-1.ppt

Programming with Shared Memory

Introduction to OpenMP

Part 1


OpenMP

Thread-based shared memory programming model.

Accepted standard developed in the late 1990s by a group of industry specialists.

Higher-level than thread APIs such as Pthreads or Java threads.

Write programs in C/C++ (or Fortran!) and use OpenMP compiler directives to specify parallelism.

OpenMP also has a few supporting library routines and environment variables.

Several compilers are available to compile OpenMP programs, including recent Linux C compilers.
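For example, the GNU C compiler enables OpenMP with the -fopenmp flag:

$ gcc -fopenmp hello.c -o hello
$ ./hello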


OpenMP thread model

[Figure: fork-join diagram. A master thread forks a team of multiple threads at each parallel region, with synchronization at the end of each region; between parallel regions only the master thread runs.]

Initially, a single thread, the master thread, executes.

The parallel directive creates a team of threads: the subsequent block of code is executed by the multiple threads in parallel.

The exact number of threads is determined in one of several ways; see later.

Other directives within a parallel construct specify parallel for loops and different blocks of code for threads.

Code outside a parallel region is executed by the master thread only.


Number of threads in a team

Established in one of three ways:

1. num_threads clause after the parallel directive, e.g.

   #pragma omp parallel num_threads(5)

or

2. omp_set_num_threads() library routine previously called, e.g.

   omp_set_num_threads(6);

or

3. Environment variable OMP_NUM_THREADS defined, e.g.

   $ export OMP_NUM_THREADS=8
   $ ./hello

These take precedence in the order given; if none of the above is used, the number is system dependent. The number of threads available can be altered dynamically to achieve the best use of system resources. A minimal sketch combining the first two methods follows.
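In this sketch (hypothetical thread counts), the num_threads clause overrides the earlier omp_set_num_threads() call for its own region only:

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(6);               /* method 2: applies to later parallel regions */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("Region 1 has %d threads\n", omp_get_num_threads());
    }

    #pragma omp parallel num_threads(5)   /* method 1: this region only */
    {
        if (omp_get_thread_num() == 0)
            printf("Region 2 has %d threads\n", omp_get_num_threads());
    }
    return 0;
}

This should report 6 threads for the first region and 5 for the second, assuming the system can supply them.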

Finding number of threads and thread ID during program execution

• omp_get_num_threads() – returns the total number of threads in the team

• omp_get_thread_num() – returns the thread number (ID), an integer from 0 to omp_get_num_threads() - 1, where thread 0 is the master thread

The names of these two functions are similar; easy to confuse.


OpenMP parallel directive

#pragma omp parallel
structured_block

The C "pragma" directive instructs the compiler to use OpenMP features; all OpenMP directives begin with #pragma omp.

structured_block is a single statement or a compound statement created with { ... }, with a single entry point and a single exit point.

The directive creates multiple threads, each one executing the specified structured_block.

There is an implicit barrier at the end of the construct.


Hello world example

#pragma omp parallel
{
    printf("Hello World from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
}

VERY IMPORTANT: the opening brace must be on a new line (tabs/spaces OK).

Output from an 8-processor/core machine:

Hello World from thread 0 of 8
Hello World from thread 4 of 8
Hello World from thread 3 of 8
Hello World from thread 2 of 8
Hello World from thread 7 of 8
Hello World from thread 1 of 8
Hello World from thread 6 of 8
Hello World from thread 5 of 8


Global "shared" variables/data

Any variable declared outside a parallel construct is accessible by all threads unless otherwise specified:

int main (int argc, char *argv[])
{
    int x;   // accessible by all threads

    #pragma omp parallel
    {
        ...  // each thread sees the same x
    }
}


Private variables

Separate copies of variables for each thread. These can be declared within each parallel region, but OpenMP provides the private clause:

int tid;

#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();   /* each thread has a local variable tid */
    printf("Hello World from thread = %d\n", tid);
}

There is also a shared clause for shared variables.

Another example of shared and private data

int main (int argc, char *argv[])
{
    int x;     /* x is shared by all threads */
    int tid;   /* tid is private: each thread has its own copy */

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0) x = 42;
        printf ("Thread %d, x = %d\n", tid, x);
    }
}

Variables declared outside the parallel construct are shared unless otherwise specified.

Output

$ ./data

Thread 3, x = 0

Thread 2, x = 0

Thread 1, x = 0

Thread 0, x = 42

Thread 4, x = 42

Thread 5, x = 42

Thread 6, x = 42

Thread 7, x = 42

tid has a separate value for each thread.

Why does x change? Because x is shared: threads that print before thread 0 assigns 42 see the initial value 0, while threads that print afterwards see 42, a race condition.

Another Example: Shared versus Private

int a[100];

#pragma omp parallel private(tid, n)
{
    tid = omp_get_thread_num();
    n = omp_get_num_threads();
    a[tid] = 10*n;
}

OR, with an explicit (optional) shared clause:

#pragma omp parallel private(tid, n) shared(a)
...

tid and n are private; a[ ] is shared.


Variations of private variables

private clause – creates private copies of variables for each thread.

firstprivate clause – as private, but initializes each copy to the value the variable had immediately prior to the parallel construct.

lastprivate clause – as private, but "the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable's original object." A small sketch contrasting these follows.
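In this sketch (values chosen for illustration), firstprivate initializes each thread's copy while lastprivate copies back the value from the sequentially last iteration:

#include <stdio.h>

int main() {
    int x = 10;   /* firstprivate: each thread's copy starts at 10 */
    int y = -1;   /* lastprivate: receives the value from iteration i == 7 */
    int i;

    #pragma omp parallel for firstprivate(x) lastprivate(y)
    for (i = 0; i < 8; i++) {
        x = x + i;   /* works on the thread's private, pre-initialized copy */
        y = i;       /* only the sequentially last iteration's y survives */
    }

    /* the original x is untouched (still 10); y holds 7, from the last iteration */
    printf("x = %d, y = %d\n", x, y);
    return 0;
}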


Work-Sharing: Specifying work inside a parallel region

Four constructs in this classification:

sections – section

for

single

master

In all cases there is an implicit barrier at the end of the construct unless a nowait clause is included, which overrides the barrier. (As noted later, master is the exception: it carries no implied barrier.)

Note: These constructs do not start a new team of threads. That is done by an enclosing parallel construct.


Sections

The construct (inside an enclosing parallel directive):

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
            structured_block
        #pragma omp section
            structured_block
        ...
    }
}

causes the structured blocks to be shared among the threads in the team; each block is executed by one of the available threads. The first section directive is optional.


Example

#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
    tid = omp_get_thread_num();

    #pragma omp sections nowait
    {
        #pragma omp section      /* one thread does this */
        {
            printf("Thread %d doing section 1\n",tid);
            for (i=0; i<N; i++) {
                c[i] = a[i] + b[i];
                printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
            }
        }

        #pragma omp section      /* another thread does this */
        {
            printf("Thread %d doing section 2\n",tid);
            for (i=0; i<N; i++) {
                d[i] = a[i] * b[i];
                printf("Thread %d: d[%d]= %f\n",tid,i,d[i]);
            }
        }
    } /* end of sections */
} /* end of parallel section */

Another sections example

#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
    tid = omp_get_thread_num();

    #pragma omp sections nowait   /* threads do not wait after finishing their section */
    {
        #pragma omp section       /* one thread does this */
        {
            printf("Thread %d doing section 1\n",tid);
            for (i=0; i<N; i++) {
                c[i] = a[i] + b[i];
                printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
            }
        }

        #pragma omp section       /* another thread does this */
        {
            printf("Thread %d doing section 2\n",tid);
            for (i=0; i<N; i++) {
                d[i] = a[i] * b[i];
                printf("Thread %d: d[%d]= %f\n",tid,i,d[i]);
            }
        }
    } /* end of sections */

    printf ("Thread %d done\n", tid);
} /* end of parallel section */

Output

Thread 0 doing section 1

Thread 0: c[0]= 5.000000

Thread 0: c[1]= 7.000000

Thread 0: c[2]= 9.000000

Thread 0: c[3]= 11.000000

Thread 0: c[4]= 13.000000

Thread 3 done

Thread 2 done

Thread 1 doing section 2

Thread 1: d[0]= 0.000000

Thread 1: d[1]= 6.000000

Thread 1: d[2]= 14.000000

Thread 1: d[3]= 24.000000

Thread 0 done

Thread 1: d[4]= 36.000000

Thread 1 done

Threads do not wait (i.e. no barrier)

Output if nowait clause removed

Thread 0 doing section 1

Thread 0: c[0]= 5.000000

Thread 0: c[1]= 7.000000

Thread 0: c[2]= 9.000000

Thread 0: c[3]= 11.000000

Thread 0: c[4]= 13.000000

Thread 3 doing section 2

Thread 3: d[0]= 0.000000

Thread 3: d[1]= 6.000000

Thread 3: d[2]= 14.000000

Thread 3: d[3]= 24.000000

Thread 3: d[4]= 36.000000

Thread 3 done

Thread 1 done

Thread 2 done

Thread 0 done

If we remove the nowait, there is a barrier at the end of the sections construct. Threads wait at that barrier until they are all done with their sections, which is why all the "done" messages appear after section 2 completes.


Combining parallel and section constructs

If a parallel directive is followed by a single “sections” directive, they can be combined into:

#pragma omp parallel sections
{
    #pragma omp section
        structured_block
    #pragma omp section
        structured_block
}

with similar effect. (However, a nowait clause is not allowed.)
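For instance, a minimal self-contained sketch of the combined form (the printf bodies are placeholders for real work):

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("Section 1 run by thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("Section 2 run by thread %d\n", omp_get_thread_num());
    }   /* barrier here: nowait is not allowed on the combined form */
    return 0;
}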


Parallel For Loop

#pragma omp parallel
{
    ...
    #pragma omp for
    for ( i = 0; i < n; i++ ) {
        ...  // for loop body
    }
    ...
}

This causes the for loop to be divided into parts, with the parts shared among the threads in the team (inside the enclosing parallel region), equivalent to a "forall." Different iterations will be executed by available threads.

It must be a "for" loop of a simple C form such as (i = 0; i < n; i++), where the lower and upper bounds are constants.

The for statement must begin on a new line after the directive.


Example

#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
    tid = omp_get_thread_num();
    if (tid == 0) {                          /* executed by one thread */
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
    }
    printf("Thread %d starting...\n",tid);

    #pragma omp for                          /* for loop divided among threads */
    for (i=0; i<N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
    }
    /* without "nowait", threads wait here after finishing the loop */
} /* end of parallel section */


Combined parallel and for constructs

If a parallel directive is followed by a single for directive, they can be combined into:

#pragma omp parallel for
for ( ... ) {
    ...
}

with similar effect.

Combining Directives Example

#pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
for (i = 0; i < N; i++) {
    tid = omp_get_thread_num();   /* assigned inside the loop: there is no
                                     separate statement block before the for */
    c[i] = a[i] + b[i];
    printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}

This declares a parallel region and a parallel for in a single directive.

Scheduling a Parallel For

By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to the available threads (static mapping).

Default chunk size ≈ number of iterations / number of threads.

Output (4 threads, N = 5):

Thread 1 starting...
Thread 1: i = 2, c[2] = 9.000000
Thread 1: i = 3, c[3] = 11.000000
Thread 2 starting...
Thread 2: i = 4, c[4] = 13.000000
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] = 5.000000
Thread 0: i = 1, c[1] = 7.000000

There is a barrier at the end of the parallel for; all threads wait there.


Loop Scheduling and Partitioning

OpenMP offers scheduling clauses to add to the for construct:

1. Static

#pragma omp parallel for schedule (static,chunk_size)

Partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion.

2. Dynamic

#pragma omp parallel for schedule (dynamic,chunk_size)

Uses an internal work queue. Chunk-sized blocks of the loop are assigned to threads as they become available.


3. Guided

#pragma omp parallel for schedule (guided,chunk_size)

Similar to dynamic, but the chunk size starts large and gets smaller, reducing how often threads must return to the work queue:

chunk size = (number of iterations remaining) / (2 × number of threads)

4. Runtime

#pragma omp parallel for schedule (runtime)

Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used. A sketch exercising all four clauses follows.
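In this sketch (N and the chunk sizes are illustrative choices), each line of output shows which thread received which iteration under each schedule:

#include <stdio.h>
#include <omp.h>
#define N 16

int main() {
    int i;

    #pragma omp parallel for schedule(static, 4)
    for (i = 0; i < N; i++)
        printf("static : thread %d got i = %d\n", omp_get_thread_num(), i);

    #pragma omp parallel for schedule(dynamic, 2)
    for (i = 0; i < N; i++)
        printf("dynamic: thread %d got i = %d\n", omp_get_thread_num(), i);

    #pragma omp parallel for schedule(guided)
    for (i = 0; i < N; i++)
        printf("guided : thread %d got i = %d\n", omp_get_thread_num(), i);

    /* runtime: choice deferred to OMP_SCHEDULE,
       e.g. $ export OMP_SCHEDULE="dynamic,2" */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++)
        printf("runtime: thread %d got i = %d\n", omp_get_thread_num(), i);

    return 0;
}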

Question

• Guided scheduling deals out chunks of iterations as static does, except that the chunks are assigned as threads become available and the chunk sizes start large and get smaller.

• What is the advantage of using Guided versus Static?

• Answer: Guided improves load balance

Reduction

A reduction applies a commutative operator across a set of values, creating a single value (similar to MPI_Reduce):

sum = 0;

#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++ ) {
    sum = sum + funct(k);
}

The clause has the form reduction(operation : variable). A private copy of sum is created for each thread by the compiler; each private copy is added into sum at the end. This eliminates the need for a critical section here. A complete runnable sketch follows.
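In this runnable sketch of the same pattern, funct() is a made-up stand-in for whatever per-iteration work the loop performs:

#include <stdio.h>

/* illustrative placeholder for the per-iteration computation */
int funct(int k) { return k * k; }

int main() {
    int k, sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (k = 0; k < 100; k++) {
        sum = sum + funct(k);   /* each thread accumulates into its own copy */
    }

    /* private copies combined here: 0^2 + 1^2 + ... + 99^2 = 328350 */
    printf("sum = %d\n", sum);
    return 0;
}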


Single

The directive:

#pragma omp parallel
{
    ...
    #pragma omp single
    structured_block
    ...
}

causes the structured block to be executed by one thread only (not necessarily the master thread). As before, the opening brace of the parallel region must be on a new line. A minimal example follows.
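A typical use is one-time work such as output inside a parallel region; note the implicit barrier at the end of single:

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        printf("Thread %d before single\n", omp_get_thread_num());

        #pragma omp single
        printf("One thread only (here thread %d) prints this\n",
               omp_get_thread_num());

        /* implicit barrier at the end of single: all threads wait here */
        printf("Thread %d after single\n", omp_get_thread_num());
    }
    return 0;
}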


Master

The master directive:

#pragma omp parallel
{
    ...
    #pragma omp master
    structured_block
    ...
}

causes only the master thread to execute the structured block.

It differs from those in the work-sharing group in that there is no implied barrier at the end of the construct (nor at the beginning). Other threads encountering the master directive ignore it and the associated structured block, and move on.

Master Example

#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    printf ("Thread %d starting...\n", tid);

    #pragma omp master
    {
        printf("Thread %d doing work\n",tid);
        ...
    } /* end of master */

    printf ("Thread %d done\n", tid);
} /* end of parallel section */

Is there any difference between these two approaches:

Master directive:

#pragma omp parallel
{
    ...
    #pragma omp master
    structured_block
    ...
}

Using an if statement:

#pragma omp parallel private(tid)
{
    ...
    tid = omp_get_thread_num();
    if (tid == 0)
        structured_block
    ...
}

Questions
