A “Hands-on” Introduction to OpenMP*
* The name “OpenMP” is the property of the OpenMP Architecture Review Board.
Tim Mattson, Intel Corp.
Introduction
• OpenMP is one of the most common parallel programming models in use today.
• It is relatively easy to use, which makes it a great language to start with when learning to write parallel software.
• Assumptions:
  – We assume you know C. OpenMP supports Fortran and C++, but we will restrict ourselves to C.
  – We assume you are new to parallel programming.
  – We assume you have access to a compiler that supports OpenMP (more on that later).
Acknowledgements
• This course is based on a long series of tutorials presented at Supercomputing conferences. The following people helped prepare this content:
  – J. Mark Bull (the University of Edinburgh)
  – Rudi Eigenmann (Purdue University)
  – Barbara Chapman (University of Houston)
  – Larry Meadows, Sanjiv Shah, and Clay Breshears (Intel Corp.)
• Some slides are based on a course I teach with Kurt Keutzer of UC Berkeley, "CS194: Architecting parallel applications with design patterns". These slides are marked with the UC Berkeley ParLab logo.
Preliminaries
• Our plan ... active learning!
  – We will mix short lectures with short exercises.
• Download exercises and reference materials.
• Please follow these simple rules:
  – Do the exercises we assign and then change things around and experiment. Embrace active learning!
  – Don't cheat: do not look at the solutions before you complete an exercise ... even if you get really frustrated.
Outline
• Unit 1: Getting started with OpenMP
  – Mod 1: Introduction to parallel programming
  – Mod 2: The boring bits: using an OpenMP compiler (hello world)
  – Disc 1: Hello world and how threads work
• Unit 2: The core features of OpenMP
  – Mod 3: Creating threads (the Pi program)
  – Disc 2: The simple Pi program and why it sucks
  – Mod 4: Synchronization (Pi program revisited)
  – Disc 3: Synchronization overhead and eliminating false sharing
  – Mod 5: Parallel loops (making the Pi program simple)
  – Disc 4: Pi program wrap-up
• Unit 3: Working with OpenMP
  – Mod 6: Synchronize single masters and stuff
  – Mod 7: Data environment
  – Disc 5: Debugging OpenMP programs
  – Mod 8: Skills practice ... linked lists and OpenMP
  – Disc 6: Different ways to traverse linked lists
• Unit 4: A few advanced OpenMP topics
  – Mod 9: Tasks (linked lists the easy way)
  – Disc 7: Understanding tasks
  – Mod 10: The scary stuff ... memory model, atomics, and flush (pairwise synch)
  – Disc 8: The pitfalls of pairwise synchronization
  – Mod 11: Threadprivate data and how to support libraries (Pi again)
  – Disc 9: Random number generators
• Unit 5: Recapitulation
Moore's Law

Slide source: UCB CS 194, Fall 2010

• In 1965, Intel co-founder Gordon Moore predicted (from just 3 data points!) that semiconductor density would double every 18 months.
  – He was right! Transistors are still shrinking as he projected.
Consequences of Moore’s law…
The Hardware/Software contract
• Write your software as you choose, and we HW geniuses will take care of performance.
• The result: generations of performance-ignorant software engineers using performance-handicapped languages (such as Java) ... which was OK, since performance was a HW job.
Third party names are the property of their owners.
… Computer architecture and the power wall

[Figure: power vs. scalar performance for the i486, Pentium, Pentium Pro, Pentium M, Pentium 4 (Wmt), and Pentium 4 (Psc); the trend follows power = perf^1.74]

Growth in power is unsustainable.

Source: E. Grochowski of Intel
… Partial solution: simple low-power cores

[Figure: the same power vs. scalar performance plot (power = perf^1.74), annotated to show that mobile CPUs with shallow pipelines, such as the Pentium M, use less power]

Eventually the Pentium 4 used over 30 pipeline stages!!!!

Source: E. Grochowski of Intel
For the rest of the solution, consider power in a chip ...

[Diagram: a processor running at frequency f, transforming input to output; Capacitance = C, Voltage = V, Frequency = f, Power = CV²f]

C = capacitance ... it measures the ability of a circuit to store charge:

    C = q/V  →  q = CV

Work is pushing something (charge, or q) across a "distance" ... in electrostatic terms, pushing q from 0 to V:

    W = V * q

But for a circuit q = CV, so

    W = CV²

Power is work over time ... or how many times in a second we oscillate the circuit:

    Power = W * f  →  Power = CV²f
... The rest of the solution: add cores

[Diagram: a single processor at frequency f (Capacitance = C, Voltage = V, Frequency = f, Power = CV²f) vs. two processors at frequency f/2 working on the same input and output (Capacitance = 2.2C, Voltage = 0.6V, Frequency = 0.5f, Power = 0.396CV²f)]

Chandrakasan, A.P.; Potkonjak, M.; Mehra, R.; Rabaey, J.; Brodersen, R.W., "Optimizing power using transformations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, no. 1, pp. 12-31, Jan 1995

Source: Vishwani Agrawal
Microprocessor trends

Individual processors are many-core (and often heterogeneous) processors.

[Figure: die photos of many-core chips – IBM Cell (1 CPU + 6 cores), NVIDIA Tesla C1060 (30 cores, 8-wide SIMD), Intel SCC processor (48 cores), AMD ATI RV770 (10 cores, 16-wide SIMD), ARM MPCORE (4 cores), Intel® Xeon® processor (4 cores), and an 80-core research chip]

3rd party names are the property of their owners.

Source: OpenCL tutorial, Gaster, Howes, Mattson, and Lokhmotov, HiPEAC 2011
The result…

A new contract ... HW people will do what's natural for them (lots of simple cores) and SW people will have to adapt (rewrite everything).

The problem is this was presented as an ultimatum ... nobody asked us if we were OK with this new contract ... which is kind of rude.
Concurrency vs. Parallelism
• Two important definitions:
  – Concurrency: a condition of a system in which multiple tasks are logically active at one time.
  – Parallelism: a condition of a system in which multiple tasks are actually active at one time.

[Figure: timelines contrasting concurrent, parallel execution with concurrent, non-parallel execution]

Figure from "An Introduction to Concurrency in Programming Languages" by Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen, 2010
Concurrency vs. Parallelism

[Figure: nested sets – Parallel Programs is a subset of Concurrent Programs, which is a subset of all Programs]

Figure from "An Introduction to Concurrency in Programming Languages" by Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen, 2010
Concurrent vs. Parallel applications
• We distinguish between two classes of applications that exploit the concurrency in a problem:
  – Parallel application: an application for which the computations actually execute simultaneously in order to complete a problem in less time. The problem doesn't inherently require concurrency ... you can state it sequentially.
  – Concurrent application: an application for which computations logically execute simultaneously due to the semantics of the application. The problem is fundamentally concurrent.
The Parallel programming process:

Original Problem → Find Concurrency (tasks, shared and local data) → Implementation strategy (units of execution + new shared data for extracted dependencies) → Corresponding source code

The slide shows one copy of the source code per unit of execution, each running the same program:

    Program SPMD_Emb_Par ()
    {
        TYPE *tmp, *func();
        global_array Data(TYPE);
        global_array Res(TYPE);
        int Num = get_num_procs();
        int id  = get_proc_id();
        if (id == 0) setup_problem(N, Data);
        for (int I = id; I < N; I = I + Num) {
            tmp = func(I, Data);
            Res.accumulate(tmp);
        }
    }