Parallel Programming using OpenMP


Transcript of Parallel Programming using OpenMP

Page 1: Parallel Programming using OpenMP

Parallel Programming using OpenMP

Qin Liu

The Chinese University of Hong Kong

1

Page 2: Parallel Programming using OpenMP

Overview

Why Parallel Programming?

Overview of OpenMP

Core Features of OpenMP

More Features and Details...

One Advanced Feature

2

Page 3: Parallel Programming using OpenMP

Introduction

• OpenMP is one of the most common parallel programming models in use today

• It is relatively easy to use, which makes it a great language to start with when learning to write parallel programs

• Assumptions:

I We assume you know C++ (OpenMP also supports Fortran)

I We assume you are new to parallel programming

I We assume you have access to a compiler that supports OpenMP (like gcc)

3


Page 9: Parallel Programming using OpenMP

Why Parallel Programming?

4

Page 10: Parallel Programming using OpenMP

Growth in processor performance since the late 1970s

Source: Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.

• Good old days: 17 years of sustained growth in performance at an annual rate of over 50%

5

Page 11: Parallel Programming using OpenMP

The Hardware/Software Contract

• We (SW developers) learn and design sequential algorithms such as quick sort and Dijkstra’s algorithm

• Performance comes from hardware

Results: Generations of performance ignorant software engineers write serial programs using performance-handicapped languages (such as Java)... This was OK since performance was a hardware job

But...

• In 2004, Intel canceled its high-performance uniprocessor projects and joined others in declaring that the road to higher performance would be via multiple processors per chip rather than via faster uniprocessors

6


Page 16: Parallel Programming using OpenMP

Computer Architecture and the Power Wall

[Figure: normalized power versus normalized scalar performance for multiple generations of Intel microprocessors (i486 through Pentium 4 and Core Duo), with the fitted curve power = perf^1.75, plus tables of performance, power, and energy per instruction for these processors.]

Source: Grochowski, Ed, and Murali Annavaram. “Energy per instruction trends in Intel microprocessors.” Technology@Intel Magazine 4, no. 3 (2006): 1-8.

• Growth in power is unsustainable (power = perf^1.74)

• Partial solution: simple low power cores

7

Page 17: Parallel Programming using OpenMP

The rest of the solution - Add Cores

Source: Multi-Core Parallelism for Low-Power Design - Vishwani D. Agrawal

8

Page 18: Parallel Programming using OpenMP

Microprocessor Trends

Individual processors are many core (and often heterogeneous) processors from Intel, AMD, NVIDIA

A new HW/SW contract:

• HW people will do what’s natural for them (lots of simple cores) and SW people will have to adapt (rewrite everything)

• The problem is this was presented as an ultimatum... nobody asked us if we were OK with this new contract... which is kind of rude

9


Page 21: Parallel Programming using OpenMP

Parallel Programming

Process:

1. We have a sequential algorithm

2. Split the program into tasks and identify shared and local data

3. Use some algorithm strategy to break dependencies between tasks

4. Implement the parallel algorithm in C++/Java/...

Can this process be automated by the compiler? Unlikely... We have to do it manually.

10

Page 22: Parallel Programming using OpenMP

Overview of OpenMP

11

Page 23: Parallel Programming using OpenMP

OpenMP: Overview

OpenMP: an API for writing multi-threaded applications

• A set of compiler directives and library routines for parallel application programmers

• Greatly simplifies writing multi-threaded programs in Fortran and C/C++

• Standardizes the last 20 years of symmetric multiprocessing (SMP) practice

12

Page 24: Parallel Programming using OpenMP

OpenMP Core Syntax

• Most of the constructs in OpenMP are compiler directives:
#pragma omp <construct> [clause1 clause2 ...]

• Example: #pragma omp parallel num_threads(4)

• Include file for runtime library: #include <omp.h>

• Most OpenMP constructs apply to a “structured block”

I Structured block: a block of one or more statements with one point of entry at the top and one point of exit at the bottom (see the sketch below)
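To make the “structured block” idea concrete, here is a minimal sketch (not from the slides): the block following a parallel directive can be a single statement or a compound block that is entered at the top and left at the bottom, with no jumping into or out of it.

#include <stdio.h>
#include <omp.h>

int main() {
  // A single statement is a structured block:
  #pragma omp parallel
  printf("one statement per thread\n");

  // A compound block is also a structured block: one entry at the
  // top, one exit at the bottom.
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    printf("thread %d running the block\n", id);
  } // all threads leave the block here
}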

13

Page 25: Parallel Programming using OpenMP

Exercise 1: Hello World

A multi-threaded “hello world” program

#include <stdio.h>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    printf(" hello(%d)", ID);
    printf(" world(%d)\n", ID);
  }
}

14

Page 26: Parallel Programming using OpenMP

Compiler Notes

• On Windows, you can use Visual Studio C++ 2005 (or later) or Intel C Compiler 10.1 (or later)

• Linux and OS X with gcc (4.2 or later):

$ g++ hello.cpp -fopenmp     # add -fopenmp to enable it
$ export OMP_NUM_THREADS=16  # set the number of threads
$ ./a.out                    # run our parallel program

• More information: http://openmp.org/wp/openmp-compilers/ (a portability sketch follows below)
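Since support varies by compiler, a common pattern is to guard OpenMP-specific code so the same source still builds serially. A minimal sketch (not from the slides), relying only on the standard _OPENMP macro and omp_get_max_threads():

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   // only available when compiling with OpenMP enabled
#endif

int main() {
#ifdef _OPENMP
  // _OPENMP is predefined by the compiler when -fopenmp (or equivalent) is used
  printf("OpenMP enabled, up to %d threads\n", omp_get_max_threads());
#else
  printf("Compiled without OpenMP; running serially\n");
#endif
}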

15

Page 27: Parallel Programming using OpenMP

Symmetric Multiprocessing (SMP)

• An SMP system: multiple identical processors connect to a single, shared main memory. Two classes:

I Uniform Memory Access (UMA): all the processors share the physical memory uniformly

I Non-Uniform Memory Access (NUMA): memory access time depends on the memory location relative to a processor

Source: https://moinakg.wordpress.com/2013/06/05/findings-by-google-on-numa-performance/

16

Page 28: Parallel Programming using OpenMP

Symmetric Multiprocessing (SMP)

• SMP computers are everywhere... Most laptops and servers have multi-core multiprocessor CPUs

• The shared address space and (as we will see) programming models encourage us to think of them as UMA systems

• Reality is more complex... Any multiprocessor CPU with a cache is a NUMA system

• Start out by treating the system as a UMA and just accept that much of your optimization work will address cases where that assumption breaks down

17


Page 32: Parallel Programming using OpenMP

SMP Programming

Source: https://computing.llnl.gov/tutorials/pthreads/

Process:

• an instance of a program in execution

• contains information about program resources and program execution state

18

Page 33: Parallel Programming using OpenMP

SMP Programming

Source: https://computing.llnl.gov/tutorials/pthreads/

Threads:

• “lightweight processes”

• share process state

• reduce the cost of switching context

19

Page 34: Parallel Programming using OpenMP

Concurrency

Threads can be interchanged, interleaved and/or overlapped in real time.

Source: https://computing.llnl.gov/tutorials/pthreads/

20

Page 35: Parallel Programming using OpenMP

Shared Memory Model

• All threads have access to the same global, shared memory

• Threads also have their own private data

• Programmers are responsible for synchronizing access to (protecting) globally shared data

Source: https://computing.llnl.gov/tutorials/pthreads/

21

Page 36: Parallel Programming using OpenMP

Exercise 1: Hello World

A multi-threaded “hello world” program

#include <stdio.h>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    printf(" hello(%d)", ID);
    printf(" world(%d)\n", ID);
  }
}

$ g++ hello.cpp -fopenmp
$ export OMP_NUM_THREADS=16
$ ./a.out

Sample Output:

hello(7) world(7)
hello(1) hello(9) world(9)
world(1)
hello(13) world(13)
hello(14) hello(4) hello(15) world(15)
world(4)
hello(2) world(2)
hello(10) world(10)
hello(11) world(11)
world(14)
hello(6) world(6)
hello(5) world(5)
hello(3) world(3)
hello(0) world(0)
hello(12) world(12)
hello(8) world(8)

Threads interleave and give different outputs every time

22

Page 37: Parallel Programming using OpenMP

How Do Threads Interact in OpenMP?

• OpenMP is a multi-threading, shared address model

I Threads communicate by sharing variables

• Unintended sharing of data causes race conditions (a minimal sketch follows below):

I Race condition: when the program’s outcome changes as the threads are scheduled differently

• To control race conditions:

I Use synchronization to protect data conflicts

• Synchronization is expensive so:

I Change how data is accessed to minimize the need for synchronization
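As an illustration, here is a minimal sketch (not from the slides) of a race condition on a shared counter and one way to remove it with synchronization; the variable names are hypothetical:

#include <stdio.h>
#include <omp.h>

int main() {
  const int N = 1000000;
  long unsafe_count = 0, safe_count = 0;

  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    unsafe_count++;       // race: threads read/modify/write the shared variable unprotected
    #pragma omp atomic
    safe_count++;         // synchronized update: one thread at a time
  }

  // unsafe_count often ends up below N; safe_count is always N
  printf("unsafe=%ld safe=%ld (expected %ld)\n", unsafe_count, safe_count, (long)N);
}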

23

Page 38: Parallel Programming using OpenMP

Core Features of OpenMP

24

Page 39: Parallel Programming using OpenMP

OpenMP Programming Model

Fork-Join Parallelism Model:

• OpenMP programs start with only the master thread

• FORK: the master thread creates a team of parallel threads when it encounters a parallel region construct

• The statements in the parallel region are then executed in parallel

• JOIN: when the team threads complete, they synchronize and terminate, leaving only the master thread

Source: https://computing.llnl.gov/tutorials/openMP/

25

Page 40: Parallel Programming using OpenMP

Thread Creation: Parallel Regions

• Create threads in OpenMP with the parallel construct

• For example, to create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4); // declared in omp.h
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
printf("all done\n");

• Each thread calls pooh(ID, A) for ID from 0 to 3

26

Page 41: Parallel Programming using OpenMP

Thread Creation: Parallel Regions

• Create threads in OpenMP with the parallel construct

• For example, to create a 4-thread parallel region:

double A[1000];
// specify the number of threads using a clause
#pragma omp parallel num_threads(4)
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
printf("all done\n");

• Each thread calls pooh(ID, A) for ID from 0 to 3

27

Page 42: Parallel Programming using OpenMP

Thread Creation: Parallel Regions

double A[1000];
omp_set_num_threads(4); // declared in omp.h
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
printf("all done\n");

28

Page 43: Parallel Programming using OpenMP

Compute π using Numerical Integration

[Figure: plot of F(x) = 4/(1 + x^2) on the interval [0, 1].]

Mathematically, we know that:

∫_0^1 4/(1 + x^2) dx = π

We can approximate the integral as a sum of rectangles:

Σ_{i=0}^{N} F(x_i) ∆x ≈ π

where each rectangle has width ∆x and height F(x_i) at the middle of interval i.

29

Page 44: Parallel Programming using OpenMP

Serial π Program

#include <stdio.h>

const long num_steps = 100000000;

int main() {
  double sum = 0.0;
  double step = 1.0 / (double) num_steps;

  for (int i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;
    sum += 4.0 / (1.0 + x*x);
  }

  double pi = step * sum;

  printf("pi is %f\n", pi);
}

30

Page 45: Parallel Programming using OpenMP

Exercise 2: First Parallel π Program

• Create a parallel version of the pi program using a parallel construct

• Pay close attention to shared versus private variables

• In addition to a parallel construct, you will need the runtime library routines

I int omp_get_num_threads(): number of threads in the team

I int omp_get_thread_num(): ID of current thread

31

Page 46: Parallel Programming using OpenMP

Exercise 2: First Parallel π Program

#include <stdio.h>
#include <omp.h>

const long num_steps = 100000000;
#define NUM_THREADS 4
double sum[NUM_THREADS];

int main() {
  double step = 1.0 / (double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    sum[id] = 0.0;
    for (int i = id; i < num_steps; i += NUM_THREADS) {
      double x = (i + 0.5) * step;
      sum[id] += 4.0 / (1.0 + x*x);
    }
  }
  double pi = 0.0;
  for (int i = 0; i < NUM_THREADS; i++)
    pi += sum[i] * step;
  printf("pi is %f\n", pi);
}

32

Page 47: Parallel Programming using OpenMP

Algorithm Strategy: SPMD

The SPMD (single program, multiple data) technique:

• Run the same program on P processing elements where P can be arbitrarily large

• Use the rank (an ID ranging from 0 to P − 1) to select between a set of tasks and to manage any shared data structures

This pattern is very general and has been used to support most (if not all) parallel software.

33

Page 48: Parallel Programming using OpenMP

Results

• Setup: gcc with no optimization on Ubuntu 14.04 with a quad-core Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz and 16 GB RAM

• The serial version ran in 1.25 seconds

Threads   1     2     3     4
SPMD      1.29  0.72  0.47  0.48

Why such poor scaling?

34


Page 50: Parallel Programming using OpenMP

Reason: False Sharing

• If independent data elements happen to sit on the same cache line, each update will cause the cache lines to “slosh back and forth” between threads... This is called “false sharing”

Source: https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads

• Solution: pad arrays so elements you use are on distinct cache lines

35

Page 51: Parallel Programming using OpenMP

Eliminate False Sharing by Padding

#include <stdio.h>
#include <omp.h>

const long num_steps = 100000000;
#define NUM_THREADS 4
#define PAD 8 // assume a 64-byte L1 cache line
double sum[NUM_THREADS][PAD];

int main() {
  double step = 1.0 / (double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    sum[id][0] = 0.0;
    for (int i = id; i < num_steps; i += NUM_THREADS) {
      double x = (i + 0.5) * step;
      sum[id][0] += 4.0 / (1.0 + x*x);
    }
  }
  double pi = 0.0;
  for (int i = 0; i < NUM_THREADS; i++)
    pi += sum[i][0] * step;
  printf("pi is %f\n", pi);
}

36

Page 52: Parallel Programming using OpenMP

Results

• Setup: gcc with no optimization on Ubuntu 14.04 with a quad-core Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz and 16 GB RAM

• The serial version ran in 1.25 seconds

Threads   1     2     3     4
SPMD      1.29  0.72  0.47  0.48
Padding   1.27  0.65  0.43  0.33

37

Page 53: Parallel Programming using OpenMP

Do We Really Need to Pad Our Arrays?

• Padding arrays requires deep knowledge of the cache architecture

• Move to a machine with different sized cache lines and your software performance falls apart

• There has got to be a better way to deal with false sharing

38

Page 54: Parallel Programming using OpenMP

High Level Synchronization

Recall: to control race conditions

• use synchronization to protect data conflicts

Synchronization: bringing one or more threads to a well-defined and known point in their execution

• Barrier: each thread waits at the barrier until all threads arrive

• Mutual exclusion: define a block of code that only one thread at a time can execute

39

Page 55: Parallel Programming using OpenMP

Synchronization: barrier

Barrier: each thread waits until all threads arrive

#pragma omp parallel
{
  int id = omp_get_thread_num();
  A[id] = big_calc1(id);
  #pragma omp barrier
  B[id] = big_calc2(id, A); // depends on A calculated by every thread
}

40

Page 56: Parallel Programming using OpenMP

Synchronization: critical

Mutual exclusion: define a block of code that only one thread at a time can execute

float res;
#pragma omp parallel
{
  float B; int i, id, nthrds;
  id = omp_get_thread_num();
  nthrds = omp_get_num_threads();
  for (i = id; i < niters; i += nthrds) {
    B = big_job(i);
    #pragma omp critical
    res += consume(B);
    // only one thread at a time calls consume() and modifies res
  }
}

41

Page 57: Parallel Programming using OpenMP

Synchronization: atomic

Atomic provides mutual exclusion but only applies to the update of a memory location and only supports x += expr, x++, --x, ...

float res;
#pragma omp parallel
{
  float tmp, B;
  B = big();
  tmp = calc(B);
  #pragma omp atomic
  res += tmp;
}

42

Page 58: Parallel Programming using OpenMP

Exercise 3: Rewrite π Program using Synchronization

#include <stdio.h>
#include <omp.h>

const long num_steps = 100000000;
#define NUM_THREADS 4

int main() {
  double step = 1.0 / (double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  double pi = 0.0;
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    double sum = 0.0; // local scalar, not an array
    for (int i = id; i < num_steps; i += NUM_THREADS) {
      double x = (i + 0.5) * step;
      sum += 4.0 / (1.0 + x*x); // no false sharing
    }
    #pragma omp critical
    pi += sum * step; // must do the summation here
  }
  printf("pi is %f\n", pi);
}

43

Page 59: Parallel Programming using OpenMP

Results

• Setup: gcc with no optimization on Ubuntu 14.04 with a quad-core Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz and 16 GB RAM

• The serial version ran in 1.25 seconds

Threads   1     2     3     4
SPMD      1.29  0.72  0.47  0.48
Padding   1.27  0.65  0.43  0.33
Critical  1.26  0.65  0.44  0.33

44

Page 60: Parallel Programming using OpenMP

Mutual Exclusion Done Wrong

Be careful where you put a critical section.

#include <stdio.h>
#include <omp.h>

const long num_steps = 100000000;
#define NUM_THREADS 4

int main() {
  double step = 1.0 / (double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  double pi = 0.0;
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    for (int i = id; i < num_steps; i += NUM_THREADS) {
      double x = (i + 0.5) * step;
      #pragma omp critical
      pi += 4.0 / (1.0 + x*x) * step;
    }
  }
  printf("pi is %f\n", pi);
}

Ran in 10 seconds with 4 threads.

45

Page 61: Parallel Programming using OpenMP

Example: Using Atomic

#include <stdio.h>
#include <omp.h>

const long num_steps = 100000000;
#define NUM_THREADS 4

int main() {
  double step = 1.0 / (double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  double pi = 0.0;
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    double sum = 0.0;
    for (int i = id; i < num_steps; i += NUM_THREADS) {
      double x = (i + 0.5) * step;
      sum += 4.0 / (1.0 + x*x);
    }
    sum *= step;
    #pragma omp atomic
    pi += sum;
  }
  printf("pi is %f\n", pi);
}

46

Page 62: Parallel Programming using OpenMP

For Loop Worksharing

Serial π program:

#include <stdio.h>

const long num_steps = 100000000;

int main() {
  double sum = 0.0;
  double step = 1.0 / (double) num_steps;

  for (int i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;
    sum += 4.0 / (1.0 + x*x);
  }

  double pi = step * sum;

  printf("pi is %f\n", pi);
}

What we want to parallelize:
for (int i = 0; i < num_steps; i++)

47

Page 63: Parallel Programming using OpenMP

For Loop Worksharing

Two equivalent directives:

double res[MAX];
#pragma omp parallel
{
  #pragma omp for
  for (int i = 0; i < MAX; i++)
    res[i] = huge();
}

double res[MAX];
#pragma omp parallel for
for (int i = 0; i < MAX; i++)
  res[i] = huge();

48

Page 64: Parallel Programming using OpenMP

Working with For Loops

• Find computationally intensive loops

• Make the loop iterations independent, so they can execute in any order

• Place the appropriate OpenMP directives and test

int j, A[MAX];
j = 5;
for (int i = 0; i < MAX; i++) {
  j += 2;
  A[i] = big(j);
}

Each iteration depends on the previous one.

int A[MAX];
#pragma omp parallel for
for (int i = 0; i < MAX; i++) {
  int j = 5 + 2*(i+1);
  A[i] = big(j);
}

The dependency is removed and j is now local to each iteration.

49


Page 67: Parallel Programming using OpenMP

The Schedule Clause

The schedule clause affects how loop iterations are mapped onto threads

• schedule(static, [chunk]): each thread independently decides which iterations of size “chunk” it will process

• schedule(dynamic, [chunk]): each thread grabs “chunk” iterations off a queue until all iterations have been handled

• guided, runtime, auto: skip... (a sketch contrasting static and dynamic follows below)
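For concreteness, a minimal sketch (not from the slides) contrasting the two clauses on a loop with uneven iteration costs; dynamic scheduling usually balances such a loop better, at the price of some scheduling overhead. The work() function is a hypothetical stand-in:

#include <stdio.h>
#include <omp.h>

// Hypothetical uneven work: later iterations are much more expensive.
double work(int i) {
  double s = 0.0;
  for (int k = 0; k < i * 1000; k++) s += 1.0 / (k + 1);
  return s;
}

int main() {
  const int N = 1000;
  double total = 0.0;

  // Static: chunks of 100 iterations are assigned to threads up front.
  #pragma omp parallel for schedule(static, 100) reduction(+:total)
  for (int i = 0; i < N; i++) total += work(i);

  // Dynamic: each thread grabs the next chunk of 10 iterations when it becomes idle.
  #pragma omp parallel for schedule(dynamic, 10) reduction(+:total)
  for (int i = 0; i < N; i++) total += work(i);

  printf("total = %f\n", total);
}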

50

Page 68: Parallel Programming using OpenMP

Nested Loops

• For perfectly nested rectangular loops we can parallelize multiple loops in the nest with the collapse clause:

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
    .....
  }
}

• Will form a single loop of length N × M and then parallelize that

• Useful if N is O(number of threads), so parallelizing only the outer loop makes balancing the load difficult

51

Page 69: Parallel Programming using OpenMP

Reduction

• How do we handle this case?

double ave = 0.0, A[MAX];
for (int i = 0; i < MAX; i++) {
  ave += A[i];
}
ave = ave/MAX;

• We are combining values into a single accumulation variable (ave). There is a true dependence between loop iterations that can’t be trivially removed

• This is a very common situation. It is called a “reduction”

• Support for reduction operations is included in most parallel programming environments such as MapReduce and MPI

52

Page 70: Parallel Programming using OpenMP

Reduction

• OpenMP reduction clause: reduction(op:list)

• Inside a parallel or a worksharing construct:

I A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”)

I Updates occur on the local copy

I Local copies are reduced into a single value and combined with the original global value

• The variables in “list” must be shared in the enclosing parallel region

double ave = 0.0, A[MAX];
#pragma omp parallel for reduction(+:ave)
for (int i = 0; i < MAX; i++) {
  ave += A[i];
}
ave = ave/MAX;

53

Page 71: Parallel Programming using OpenMP

Exercise 4: π Program with Parallel For

#include <stdio.h>
#include <omp.h>

const long num_steps = 100000000;
#define NUM_THREADS 4

int main() {
  double sum = 0.0;
  double step = 1.0 / (double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;
    sum += 4.0 / (1.0 + x*x);
  }
  double pi = step * sum;
  printf("pi is %f\n", pi);
}

Quite simple...

54

Page 72: Parallel Programming using OpenMP

Results

• Setup: gcc with no optimization on Ubuntu 14.04 with a quad-core Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz and 16 GB RAM

• The serial version ran in 1.25 seconds

Threads   1     2     3     4
SPMD      1.29  0.72  0.47  0.48
Padding   1.27  0.65  0.43  0.33
Critical  1.26  0.65  0.44  0.33
For       1.27  0.65  0.43  0.33

55

Page 73: Parallel Programming using OpenMP

More Features and Details...

56

Page 74: Parallel Programming using OpenMP

Synchronization: barrier

Barrier: each thread waits until all threads arrive

#pragma omp parallel
{
  int id = omp_get_thread_num();
  A[id] = big_calc1(id);
  #pragma omp barrier
  #pragma omp for
  for (int i = 0; i < N; i++) {
    C[i] = big_calc3(i, A);
  } // implicit barrier at the end of a for worksharing construct
  #pragma omp for nowait
  for (int i = 0; i < N; i++) {
    B[i] = big_calc2(C, i);
  } // no implicit barrier due to nowait
  A[id] = big_calc4(id);
} // implicit barrier at the end of a parallel region

57

Page 75: Parallel Programming using OpenMP

Master Construct

• The master construct denotes a structured block that is only executed by the master thread

• The other threads just skip it (no synchronization is implied)

#pragma omp parallel
{
  do_many_things();
  #pragma omp master
  { exchange_boundaries(); }
  #pragma omp barrier
  do_many_other_things();
}

58

Page 76: Parallel Programming using OpenMP

Single Construct

• The single construct denotes a structured block that is executed by only one thread

• A barrier is implied at the end of the single block (the barrier can be removed with a nowait clause)

#pragma omp parallel
{
  do_many_things();
  #pragma omp single
  { exchange_boundaries(); }
  do_many_other_things();
}

59

Page 77: Parallel Programming using OpenMP

Low Level Synchronization: lock routines

• Simple lock routines: a simple lock is available if it is unset
omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()

• Nested locks: a nested lock is available if it is unset or if it is set but owned by the thread executing the nested lock function
omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()

60

Page 78: Parallel Programming using OpenMP

Synchronization: Simple Locks

Example: conflicts are rare, but to play it safe, we must assure mutual exclusion for updates to histogram elements

#pragma omp parallel for
for (int i = 0; i < NBUCKETS; i++) {
  omp_init_lock(&hist_locks[i]); // one lock per element of hist
  hist[i] = 0;
}
#pragma omp parallel for
for (int i = 0; i < NVALS; i++) {
  ival = (int) sample(arr[i]);
  omp_set_lock(&hist_locks[ival]);
  hist[ival]++; // mutual exclusion, less waiting compared to critical
  omp_unset_lock(&hist_locks[ival]);
}
#pragma omp parallel for
for (int i = 0; i < NBUCKETS; i++)
  omp_destroy_lock(&hist_locks[i]); // free locks when done

61

Page 79: Parallel Programming using OpenMP

Sections Worksharing

• The sections worksharing construct gives a different structured block to each thread

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    x_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
  } // implicit barrier that can be turned off with nowait
}

Shorthand: #pragma omp parallel sections (a sketch of the combined form follows below)
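A minimal, self-contained sketch (not from the slides) of the combined form; the three printed tasks are hypothetical stand-ins for independent pieces of work:

#include <stdio.h>
#include <omp.h>

int main() {
  // Each section is handed to a different thread (if enough threads exist);
  // an implicit barrier follows the sections region.
  #pragma omp parallel sections
  {
    #pragma omp section
    printf("x calculation on thread %d\n", omp_get_thread_num());
    #pragma omp section
    printf("y calculation on thread %d\n", omp_get_thread_num());
    #pragma omp section
    printf("z calculation on thread %d\n", omp_get_thread_num());
  }
}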

62

Page 80: Parallel Programming using OpenMP

Runtime Library Routines

• Modify/check the number of threads:
omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()

• Are we in an active parallel region? omp_in_parallel()

• How many processors are in the system? omp_get_num_procs()

• Many less commonly used routines (a usage sketch follows below)
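A minimal usage sketch (not from the slides) of the routines above, querying the environment from outside and inside a parallel region:

#include <stdio.h>
#include <omp.h>

int main() {
  printf("procs=%d max_threads=%d in_parallel=%d\n",
         omp_get_num_procs(), omp_get_max_threads(), omp_in_parallel());

  omp_set_num_threads(4); // request 4 threads for subsequent parallel regions
  #pragma omp parallel
  {
    // Inside the region: team size and this thread's ID
    printf("thread %d of %d (in_parallel=%d)\n",
           omp_get_thread_num(), omp_get_num_threads(), omp_in_parallel());
  }
}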

63

Page 81: Parallel Programming using OpenMP

Data Sharing

• Most variables are SHARED by default: static variables, global variables, heap data (malloc(), new)

• But not everything is shared: stack variables in functions called from parallel regions are PRIVATE

Examples:

double A[10];
int main() {
  int index[10];
  #pragma omp parallel
  work(index);
  printf("%d\n", index[0]);
}

extern double A[10];
void work(int *index) {
  double temp[10];
  static int count;
  // do something
}

A, index, count are shared. temp is private to each thread.

64

Page 82: Parallel Programming using OpenMP

Data Scope Attribute Clauses

• Change scope attributes using: shared, private, firstprivate, lastprivate

• The default attributes can be overridden with: default(shared|none)

• Skip details... (a small lastprivate sketch follows below)
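Since lastprivate is the one clause in the list not used in the example on the next slide, here is a minimal sketch (not from the slides) of what it does: each thread gets a private copy, and the copy from the sequentially last iteration is written back to the original variable.

#include <stdio.h>
#include <omp.h>

int main() {
  int last_x = -1;
  #pragma omp parallel for lastprivate(last_x)
  for (int i = 0; i < 100; i++) {
    last_x = i * i; // each thread updates its private copy
  }
  // After the loop, last_x holds the value from iteration i = 99
  printf("last_x = %d\n", last_x); // prints 9801
}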

65

Page 83: Parallel Programming using OpenMP

Data Scope Attribute Clauses: Example

#include <iostream>
#include <string>
#include <omp.h>

int main() {
  std::string a = "a", b = "b", c = "c";
  #pragma omp parallel firstprivate(a) private(b) shared(c) num_threads(2)
  {
    a += "k"; // a is initialized with "a"
    b += "k"; // b is initialized with std::string()
    #pragma omp critical
    {
      c += "k"; // c is shared
      std::cout << omp_get_thread_num() << ": " << a
                << ", " << b << ", " << c << "\n";
    }
  }
  std::cout << a << ", " << b << ", " << c << "\n";
}

Sample Output:

0: ak, k, ck
1: ak, k, ckk
a, b, ckk

66

Page 84: Parallel Programming using OpenMP

One Advanced Feature

67

Page 85: Parallel Programming using OpenMP

OpenMP Tasks

• The for and sections worksharing constructs work well for many cases. However,

I loops need a known length at run time

I there is a finite number of parallel sections

• This didn’t work well with certain common problems:

I linked lists, recursive algorithms, etc.

• Introduce the task directive (OpenMP 3.0 and later)

68

Page 86: Parallel Programming using OpenMP

Task: Example

Traversal of a tree:

struct node { node *left, *right; };
extern void process(node*);
void traverse(node* p) {
  if (p->left)
    #pragma omp task // p is firstprivate by default
    traverse(p->left);
  if (p->right)
    #pragma omp task // p is firstprivate by default
    traverse(p->right);
  process(p);
}

#pragma omp parallel
{
  #pragma omp single nowait
  { traverse(root); }
}

69

Page 87: Parallel Programming using OpenMP

Task: Example

What if we want to force a postorder traversal of the tree?

struct node { node *left, *right; };
extern void process(node*);
void traverse(node* p) {
  if (p->left)
    #pragma omp task // p is firstprivate by default
    traverse(p->left);
  if (p->right)
    #pragma omp task // p is firstprivate by default
    traverse(p->right);
  #pragma omp taskwait // a barrier only for tasks
  process(p);
}

70

Page 88: Parallel Programming using OpenMP

Task: Example

Process elements of a linked list in parallel:

struct node { int data; node* next; };
extern void process(node*);
void increment_list_items(node* head) {
  #pragma omp parallel
  {
    #pragma omp single
    {
      for (node* p = head; p; p = p->next) {
        #pragma omp task
        process(p); // p is firstprivate by default
      }
    }
  }
}

71

Page 89: Parallel Programming using OpenMP

References

• Guide into OpenMP: Easy multithreading programming for C++:
http://bisqwit.iki.fi/story/howto/openmp

• Introduction to OpenMP - Tim Mattson (Intel):

I slides: http://openmp.org/mp-documents/Intro_To_OpenMP_Mattson.pdf

I videos: https://www.youtube.com/playlist?list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG

• OpenMP specification: http://openmp.org

72