
Page 1:

Introduction to OpenMP

For a more detailed tutorial see: http://www.openmp.org

Look at the presentations; also see: https://computing.llnl.gov/tutorials/openMP/

Page 2:

Concepts

• An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism
• Directive-based programming: declare properties of language structures
  – sections of code, loops
  – scope of variables
• A few service routines: get information
• Compiler options
• Environment variables

Page 3:

OpenMP Programming Model

• Directives
  – #pragma omp directive [clause list]
  – C$OMP construct [clause [clause] …]
• A program executes serially until it encounters a parallel directive
  – #pragma omp parallel [clause list]
    /* structured block of code */
• The clause list specifies conditions (see the sketch below)
  – Conditional parallelism: if (cond)
  – Degree of concurrency: num_threads(int)
  – Data handling: private(vlist), firstprivate(vlist), shared(vlist)
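As a minimal sketch of how a clause list combines on one directive (the bounds and names here are illustrative, not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 100000;
    int tid;
    /* Parallelize only when n is large enough to pay for thread startup;
       request 4 threads; give each thread its own copy of tid. */
#pragma omp parallel if (n > 1000) num_threads(4) private(tid)
    {
        tid = omp_get_thread_num();
        printf("thread %d of %d\n", tid, omp_get_num_threads());
    }
    return 0;
}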

Page 4:

OpenMP Programming Model

• Fork-join parallelism
• The master thread spawns a team of threads as needed.

Page 5:

Typical OpenMP Use

• Generally used to parallelize loops
  – Find the most time-consuming loops
  – Split iterations up between threads

Serial version:

int main() {
    double Res[1000];
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
}

Parallel version:

int main() {
    double Res[1000];
#pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
}

Page 6:

Thread Interaction

• OpenMP operates using shared memory
  – Threads communicate via shared variables
• Unintended sharing can lead to race conditions
  – output changes depending on thread scheduling
• Control race conditions using synchronization (see the sketch below)
  – synchronization is expensive
  – change the way data is stored to minimize the need for synchronization
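A minimal sketch of the race and one fix (the counter name is illustrative; the slides use critical sections later, while #pragma omp atomic shown here is a lighter-weight OpenMP directive suited to a single memory update):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int hits = 0;   /* shared by default */
    int i;
#pragma omp parallel for
    for (i = 0; i < 1000000; i++) {
        /* Without the atomic directive this read-modify-write races
           across threads and the final count is usually too small. */
#pragma omp atomic
        hits++;
    }
    printf("hits = %d\n", hits);   /* prints 1000000 */
    return 0;
}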

Page 7:

Syntax format

• Compiler directives
  – C/C++
    • #pragma omp construct [clause [clause] …]
  – Fortran
    • C$OMP construct [clause [clause] …]
    • !$OMP construct [clause [clause] …]
    • *$OMP construct [clause [clause] …]
• Since we use directives, no changes need to be made to a program for a compiler that doesn't support OpenMP

Page 8:

Using OpenMP

• Some compilers can automatically place directives with an option
  – -qsmp=auto (IBM xlc)
  – some loops may speed up, some may slow down
• A compiler option is required when you use directives
  – icc -openmp (Linux, Intel compiler)
  – gcc -fopenmp (gcc versions >= 4.2)
• Scoping variables is the hard part!
  – shared variables, thread-private variables

Page 9:

OpenMP Example

#include <stdio.h>
#include <omp.h>

int main() {
    int nthreads, tid;

    /* Fork a team of threads, giving each its own copy of tid */
#pragma omp parallel private(tid)
    {
        /* Obtain and print the thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only the master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join the master thread and terminate */
}

[snell@atropos openmp]$ gcc-4.2 -fopenmp hello.c -o hello
[snell@atropos openmp]$ ./hello
Hello World from thread = 0
Hello World from thread = 1
Number of threads = 2
[snell@atropos openmp]$


Page 10:

OpenMP Directives

• 5 categories
  – Parallel regions
  – Worksharing
  – Data environment
  – Synchronization
  – Runtime functions / environment variables
• Basically the same between C/C++ and Fortran

Page 11:

Parallel Regions

• Create threads with omp parallel
• Threads share A (default behavior)
• The master thread creates the threads
• Threads all start at the same time, then synchronize at a barrier at the end to continue with the code.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    dosomething(ID, A);
}

Page 12:

Parallel Regions

#pragma omp parallel [clause ...] newline
    structured_block

Clauses:
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)

Page 13:

Parallel Regions

• The number of threads in a parallel region is determined by the following factors, in order of precedence (see the sketch below):
  1. Evaluation of the IF clause
  2. Setting of the NUM_THREADS clause
  3. Use of the omp_set_num_threads() library function
  4. Setting of the OMP_NUM_THREADS environment variable
  5. Implementation default: usually the number of CPUs on a node, though it could be dynamic.
• Threads are numbered from 0 (the master thread) to N-1
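A minimal sketch of rules 2-4 interacting (assume OMP_NUM_THREADS=8 is set in the environment; the printed team size is what the precedence rules predict):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);             /* overrides OMP_NUM_THREADS=8 */

#pragma omp parallel num_threads(2)     /* the clause overrides the library call */
    {
        if (omp_get_thread_num() == 0)  /* the master thread is number 0 */
            printf("team size = %d\n", omp_get_num_threads());  /* prints 2 */
    }
    return 0;
}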

Page 14:

Sections construct

• The sections construct gives a different structured block to each thread
• By default there is a barrier at the end. Use the nowait clause to turn it off.

#pragma omp parallel
#pragma omp sections
{
    X_calculation();
#pragma omp section
    y_calculation();
#pragma omp section
    z_calculation();
}

Page 15:

Work-sharing constructs

• The for construct splits up loop iterations
• By default, there is a barrier at the end of the "omp for". Use the "nowait" clause to turn off the barrier.

#pragma omp parallel
#pragma omp for
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
}

Page 16:

Short-hand notation

• Can combine parallel and work-sharing constructs
• There is also a "parallel sections" construct

#pragma omp parallel for
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
}


Page 17:

A Rule

• In order to be made parallel, a loop must have canonical "shape" (examples below):

for (index = start; index OP end; incr)

where OP is one of <, <=, >=, > and incr is one of:

index++;   ++index;   index--;   --index;
index += inc;   index -= inc;
index = index + inc;   index = inc + index;   index = index - inc;
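A hedged illustration (the function names are made up for contrast): the first loop fits the canonical shape, so it may carry an omp for; the second does not, because its test is not of the form "index OP end", so its trip count cannot be computed up front.

#include <omp.h>

/* Canonical shape: test and increment match the table above */
void zero_every_fourth(double *a, int n) {
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i += 4)
        a[i] = 0.0;
}

/* Not canonical: the test i*i < n is not "index OP end",
   so this loop may not carry an omp for. */
void zero_squares(double *a, int n) {
    int i;
    for (i = 0; i * i < n; i++)
        a[i * i] = 0.0;
}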

Page 18:

An example

#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
    for (j = 0; j < n; j++)
        a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]);

By definition, private variable values are undefined at loop entry and exit.
To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses.

x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i = 0; i < n; i++) {
    for (j = 1; j < m; j++)
        x[j] = g(i, x[j-1]);
    answer[i] = x[m-1] - x[i];
}

Page 19:

Scheduling Iterations

• The schedule clause affects how loop iterations are mapped onto threads
• schedule(static [,chunk])
  – Deal out blocks of iterations of size "chunk" to each thread.
• schedule(dynamic [,chunk])
  – Each thread grabs "chunk" iterations off a queue until all iterations have been handled.
• schedule(guided [,chunk])
  – Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
• schedule(runtime)
  – Schedule and chunk size are taken from the OMP_SCHEDULE environment variable (see the sketch below).
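A minimal sketch of schedule(runtime) (the work in the loop body is a placeholder): the schedule is chosen when the program starts, for example by exporting OMP_SCHEDULE="guided,100" before the run, so you can experiment without recompiling.

#include <math.h>
#include <omp.h>

void smooth(double *a, int n) {
    int i;
    /* The actual schedule and chunk come from OMP_SCHEDULE at run time */
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        a[i] = sqrt(a[i]);   /* placeholder work */
}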

Page 20:

An example

#pragma omp parallel for private(j) schedule(static, 2)
for (i = 0; i < n; i++)
    for (j = 1; j < m; j++)
        x[j] = g(i, x[j-1]);

You can play with the chunk size to address load-balancing issues, etc.

Page 21:

Synchronization Directives

• BARRIER
  – inside PARALLEL, all threads synchronize
• CRITICAL (lock) / END CRITICAL (lock)
  – a section that can be executed by only one thread at a time
  – lock is an optional name to distinguish several critical constructs from each other (see the sketch below)
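A minimal sketch of named critical sections (the variable and lock names are illustrative): updates guarded by different names may proceed concurrently, while updates sharing a name exclude each other.

#include <omp.h>

double x_total = 0.0, y_total = 0.0;

void accumulate(double xv, double yv) {
    /* Threads inside differently named sections do not block each other */
#pragma omp critical (x_lock)
    x_total += xv;

#pragma omp critical (y_lock)
    y_total += yv;
}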

Page 22:

An example

double area, pi, x;
int i, n;

area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = (i + 0.5)/n;
#pragma omp critical
    area += 4.0/(1.0 + x*x);
}
pi = area / n;

Page 23:

Reductions

• Sometimes you want each thread to calculate part of a value, then collapse all of those partial values into a single one
• Done with the reduction clause

area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 0; i < n; i++) {
    x = (i + 0.5)/n;
    area += 4.0/(1.0 + x*x);
}
pi = area / n;

Page 24:

Another Example

/* A Monte Carlo algorithm for calculating pi */
int count;            /* points inside the unit quarter circle */
unsigned short xi[3]; /* random number seed */
int i;                /* loop index */
int samples;          /* Number of points to generate */
double x, y;          /* Coordinates of points */
double pi;            /* Estimate of pi */

xi[0] = 1;            /* These statements set up the random seed */
xi[1] = 1;
xi[2] = 0;
count = 0;
for (i = 0; i < samples; i++) {
    x = erand48(xi);
    y = erand48(xi);
    if (x*x + y*y <= 1.0) count++;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

OpenMP issues:
• Each thread needs a different random number seed
• count is shared; we need the aggregate

Page 25:

OpenMP Version

/* A Monte Carlo algorithm for calculating pi */
int count;            /* points inside the unit quarter circle */
unsigned short xi[3]; /* random number seed */
int i;                /* loop index */
int samples;          /* Number of points to generate */
double x, y;          /* Coordinates of points */
double pi;            /* Estimate of pi */

omp_set_num_threads(omp_get_num_procs());
xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();
count = 0;

#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
for (i = 0; i < samples; i++) {
    x = erand48(xi);
    y = erand48(xi);
    if (x*x + y*y <= 1.0) count++;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

What is wrong with this?

Page 26:

Corrected Version

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    /* A Monte Carlo algorithm for calculating pi */
    int count;     /* points inside the unit quarter circle */
    int i;         /* loop index */
    int samples;   /* Number of points to generate */
    double x, y;   /* Coordinates of points */
    double pi;     /* Estimate of pi */

    samples = atoi(argv[1]);

#pragma omp parallel
    {
        unsigned short xi[3]; /* random number seed */
        xi[0] = 1;            /* These statements set up the random seed */
        xi[1] = 1;
        xi[2] = omp_get_thread_num();
        count = 0;
        printf("I am thread %d\n", xi[2]);
#pragma omp for firstprivate(xi) private(x,y) reduction(+:count)
        for (i = 0; i < samples; i++) {
            x = erand48(xi);
            y = erand48(xi);
            if (x*x + y*y <= 1.0) count++;
        }
    }
    pi = 4.0 * (double)count / (double)samples;
    printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi);
}

[snell@atropos openmp]$ time ./montecarlopi.single 50000000
I am thread 0
Count = 39268604, Samples = 50000000, Estimate of pi: 3.14149

real 0m4.628s
user 0m4.568s
sys  0m0.030s
[snell@atropos openmp]$ time ./montecarlopi 50000000
I am thread 1
I am thread 0
Count = 39272420, Samples = 50000000, Estimate of pi: 3.14179

real 0m2.480s
user 0m4.620s
sys  0m0.030s
[snell@atropos openmp]$


Page 27:

An alternate version…

#pragma omp parallel private(xi, t, x, y, local_count)
{
    xi[0] = 1; xi[1] = 1; xi[2] = tid = omp_get_thread_num();
    t = omp_get_num_threads();
    local_count = 0;
    for (i = tid; i < samples; i += t) {
        x = erand48(xi);
        y = erand48(xi);
        if (x*x + y*y <= 1.0) local_count++;
    }
#pragma omp critical
    count += local_count;
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

[snell@atropos openmp]$ time ./montecarlopi2 50000000
I am thread 0
I am thread 1
1: local_count is 16461003
1: local_count is 16392925
Count = 32853927, Samples = 50000000, Estimate of pi: 2.62831

real 0m5.053s
user 0m9.697s
sys  0m0.053s
[snell@atropos openmp]$

Problems!

Page 28:

Corrected Version

/* A Monte Carlo algorithm for calculating pi */
int count;       /* points inside the unit quarter circle */
int local_count; /* points inside the unit quarter circle */
int t, tid;
int i;           /* loop index */
int samples;     /* Number of points to generate */
double x, y;     /* Coordinates of points */
double pi;       /* Estimate of pi */

samples = atoi(argv[1]);

#pragma omp parallel private(i,t,tid,x,y,local_count) reduction(+:count)
{
    unsigned short xi[3]; /* random number seed */
    xi[0] = 1;            /* These statements set up the random seed */
    xi[1] = 1;
    xi[2] = tid = omp_get_thread_num();
    t = omp_get_num_threads();
    count = 0;
    //printf("I am thread %d\n", xi[2]);
    for (i = tid; i < samples; i += t) {
        x = erand48(xi);
        y = erand48(xi);
        if (x*x + y*y <= 1.0) count++;
    }
    //printf("%d: count is %d\n", tid, count);
}
pi = 4.0 * (double)count / (double)samples;
printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n", count, samples, pi);

Corrections:
• i should be private!
• the random number seed array (xi) should also be private
• use reduction

[snell@atropos openmp]$ time ./montecarlopi3 50000000
Count = 39269476, Samples = 50000000, Estimate of pi: 3.14156

real 0m2.321s
user 0m4.559s
sys  0m0.014s


Page 29:

Serial Directives

• MASTER / END MASTER
  – executed by the master thread only
• DO SERIAL / END DO SERIAL
  – the loop immediately following should not be parallelized
  – useful with -qsmp=omp:auto
• SINGLE
  – only one thread executes the block (see the sketch below)
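A minimal sketch contrasting MASTER and SINGLE in C (the messages are illustrative): master executes only on thread 0 and implies no barrier, while single executes on whichever thread arrives first and ends with an implied barrier.

#include <stdio.h>
#include <omp.h>

int main(void) {
#pragma omp parallel
    {
        printf("working: thread %d\n", omp_get_thread_num());

#pragma omp master
        printf("master only (thread 0), no implied barrier\n");

#pragma omp single
        printf("exactly one thread, implied barrier at the end\n");
    }
    return 0;
}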

Page 30:

Example Serial Execution

/* A Monte Carlo algorithm for calculating pi */
…
omp_set_num_threads(omp_get_num_procs());
xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();
count = 0;

#pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
for (i = 0; i < samples; i++) {
    x = erand48(xi);
    y = erand48(xi);
    if (x*x + y*y <= 1.0) count++;

#pragma omp single
    {
        printf("Loop Iteration: %d\n", i);
    }
}
pi = 4.0 * count / samples;
printf("Estimate of pi: %7.5f\n", pi);

Page 31:

Conditional Execution

• The overhead of fork/join is high
• If a loop is small, you don't want to parallelize it
• But you may not know how big it is until runtime
• Conditional clause for parallel execution: if (expression)

area = 0.0;
#pragma omp parallel for private(x) reduction(+:area) if (n > 5000)
for (i = 0; i < n; i++) {
    x = (i + 0.5)/n;
    area += 4.0/(1.0 + x*x);
}
pi = area / n;

Page 32:

Scope Rules

• Shared memory programming model
  – most variables are shared by default
• Global variables are shared
• But not everything is shared
  – loop index variables
  – stack variables in functions called from a parallel region
• A variable that is set and then used within a for-loop is PRIVATE
• An array whose subscript is constant with respect to the PARALLEL for-loop, and which is set and then used within the loop, is PRIVATE (see the sketch below)
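A minimal sketch of these defaults (the names are illustrative): the global and the variable declared before the region are shared, the omp for index is made private automatically, and a block-scope variable inside the region is private to each thread.

#include <stdio.h>
#include <omp.h>

int global_hits = 0;            /* global: shared */

int main(void) {
    int n = 8;                  /* declared before the region: shared */
    int i;                      /* index of the omp for: private automatically */

#pragma omp parallel for
    for (i = 0; i < n; i++) {
        int local = i * i;      /* declared inside the region: private */
#pragma omp atomic
        global_hits += local;   /* shared, so the update must be synchronized */
    }
    printf("global_hits = %d\n", global_hits);
    return 0;
}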

Page 33:

Scope Clauses

• The for/DO directive has extra clauses, the most important being:
  – PRIVATE (variable list)
  – REDUCTION (op: variable list)
    • op is sum, min, max
    • variable is scalar; XLF allows arrays

Page 34:

Scope Clauses (2)

• PARALLEL, PARALLEL for/DO, and PARALLEL SECTIONS also have:
  – DEFAULT (PRIVATE | SHARED | NONE)
    • scope determined by rules
  – SHARED (variable list)
  – IF (scalar logical expression)
• Directives are like a programming language extension, not a compiler option

Page 35:

integer i, j, n
real*8 a(n,n), b(n)

read (1) b

!$OMP PARALLEL DO PRIVATE(i,j) SHARED(a,b,n)
do j = 1, n
   do i = 1, n
      a(i,j) = sqrt(1.d0 + b(j)*i)
   end do
end do
!$OMP END PARALLEL DO

Page 36:

Matrix Multiply

!$OMP PARALLEL DO PRIVATE(i,j,k)
do j = 1, n
   do i = 1, n
      do k = 1, n
         c(i,j) = c(i,j) + a(i,k) * b(k,j)
      end do
   end do
end do

Page 37:

Analysis

• The outer loop is parallel: columns of c
• Not optimal for cache use
• More directives could be put on each loop
• But then the granularity might be too fine

Page 38:

OMP Functions

• int omp_get_num_procs()
• int omp_get_num_threads()
• int omp_get_thread_num()
• void omp_set_num_threads(int)

(a sketch using all four follows)
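A minimal sketch exercising all four routines (the output format is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Request one thread per processor */
    omp_set_num_threads(omp_get_num_procs());

#pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* only the master reports */
            printf("%d threads on %d procs\n",
                   omp_get_num_threads(), omp_get_num_procs());
    }
    return 0;
}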

Page 39:

OpenMP Environment Variables

• OpenMP parallelism may be controlled via environment variables
  – OMP_NUM_THREADS
    • Sets the number of threads in parallel sections
  – OMP_DYNAMIC
    • When TRUE, allows the number of threads to be adjusted at runtime
  – OMP_NESTED
    • When TRUE, enables nested parallelism
  – OMP_SCHEDULE
    • Controls the scheduling assignment
    • Example: export OMP_SCHEDULE="static,4"

Page 40:

Fortran Parallel Directives

• PARALLEL / END PARALLEL
• PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS
• DO / END DO
  – work-sharing directive for the DO loop immediately following
• PARALLEL DO / END PARALLEL DO
  – combined parallel region and work-sharing

Page 41:

Pthread Translation