Art of Multiprocessor Programming 1
This work is licensed under a
Creative Commons Attribution-ShareAlike 2.5 License.
• You are free:
– to Share — to copy, distribute and transmit the work
– to Remix — to adapt the work
• Under the following conditions:
– Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
– Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.
Parallel and distributed programming
Adam Piotrowski
Grzegorz Jabłoński
Lecture I
3
Parallel and distributed programming
lux.dmcs.pl/padp
• Grading policy
• Lecture slides and material
• Additional material
Course Overview
4
• Introduction to multiprocessor programming – Using OpenMP
• Distributed programming – MPI, CORBA programming
• Theoretical approach to multiprocessor programming – Fundamentals: models, algorithms
• Real-world programming – Architectures, techniques
Books
5
The Art of Multiprocessor Programming
by Maurice Herlihy and Nir Shavit
Using OpenMP: Portable Shared Memory Parallel Programming
by Barbara Chapman, Gabriele Jost and Ruud van der Pas
Art of Multiprocessor Programming 6
Still on some of your desktops: The Uniprocessor
(figure: a single CPU connected to memory)
Art of Multiprocessor Programming 7
In the Enterprise: The Shared Memory Multiprocessor (SMP)
(figure: processors with per-processor caches connected by a bus to shared memory)
Art of Multiprocessor Programming 8
Traditional Scaling Process
User code
(figure: traditional uniprocessor scaling; the same user code speeds up 1.8x, 3.6x, 7x on successive generations)
Time: Moore’s law
Art of Multiprocessor Programming 9
Multicore Scaling Process
User code
(figure: the same user code on multicore; hoped-for speedup 1.8x, 3.6x, 7x)
Unfortunately, not so simple…
Art of Multiprocessor Programming 10
Real-World Scaling Process
(figure: user code on multicore; actual speedup 1.8x, 2x, 2.9x)
Parallelization and Synchronization require great care…
Art of Multiprocessor Programming
11
Sequential Computation
(figure: a single thread operating on objects in memory)
Art of Multiprocessor Programming
12
Concurrent Computation
(figure: multiple threads operating on shared objects in memory)
Art of Multiprocessor Programming 13
Asynchrony
• Sudden unpredictable delays
– Cache misses (short)
– Page faults (long)
– Scheduling quantum used up (really long)
Art of Multiprocessor Programming 14
Model Summary
• Multiple threads
– Sometimes called processes
• Single shared memory
• Objects live in memory
• Unpredictable asynchronous delays
Art of Multiprocessor Programming 15
Parallel Primality Testing
• Challenge
– Print primes from 1 to 10^10
• Given
– Ten-processor multiprocessor
– One thread per processor
• Goal
– Get ten-fold speedup (or close)
Art of Multiprocessor Programming 16
Load Balancing
• Split the work evenly
• Each thread tests a range of 10^9 numbers
(figure: the range 1..10^10 split into tenths; P0 gets 1..10^9, P1 gets 10^9+1..2·10^9, …, P9 the last tenth)
Art of Multiprocessor Programming
17
Procedure for Thread i
void primePrint() {
  int i = ThreadID.get(); // IDs in {0..9}
  for (long j = i*1000000000L + 1; j <= (i+1)*1000000000L; j++) { // range of 10^9
    if (isPrime(j)) print(j);
  }
}
Art of Multiprocessor Programming 18
Issues
• Higher ranges have fewer primes
• Yet larger numbers are harder to test
• Thread workloads
– Uneven
– Hard to predict
• Need dynamic load balancing
Art of Multiprocessor Programming 19
Issues
• Higher ranges have fewer primes• Yet larger numbers harder to test• Thread workloads
– Uneven– Hard to predict
• Need dynamic load balancingre
jected
Art of Multiprocessor Programming
20
Shared Counter
(figure: a shared counter handing out successive values 17, 18, 19, …)
each thread takes a number
Art of Multiprocessor Programming
21
Procedure for Thread i

Counter counter = new Counter(1); // shared Counter object
void primePrint() {
  long j = 0;
  while (j < 10000000000L) { // 10^10: stop when every value is taken
    j = counter.getAndIncrement(); // increment & return each new value
    if (isPrime(j)) print(j);
  }
}
Art of Multiprocessor Programming
25
Counter Implementation

public class Counter {
  private long value;
  public long getAndIncrement() {
    return value++;
  }
}

OK for a single thread, not for concurrent threads.

What It Means

value++ is really three steps:
temp = value;
value = value + 1;
return temp;
Art of Multiprocessor Programming
29
Not so good…

value starts at 1; two threads, A and B, interleave the three steps over time:
A: read 1 … B: read 1 … A: write 2 … A: read 2 … A: write 3 … B: write 2
The value goes 1 → 2 → 3 → 2: three increments ran, but B’s stale write clobbers A’s result.
Art of Multiprocessor Programming
30
Is this problem inherent?
If we could only glue each read and its following write together…
(figure: read–write pairs executed as single indivisible steps)
Art of Multiprocessor Programming
31
Challenge

public class Counter {
  private long value;
  public long getAndIncrement() {
    temp = value;
    value = temp + 1;
    return temp;
  }
}

Make these steps atomic (indivisible).

Hardware Solution: a read-modify-write (RMW) instruction.
Art of Multiprocessor Programming
34
An Aside: Java™

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp;
    synchronized (this) { // synchronized block: mutual exclusion
      temp = value;
      value = temp + 1;
    }
    return temp;
  }
}
Art of Multiprocessor Programming 37
Why do we care?
• We want as much of the code as possible to execute concurrently (in parallel)
• A larger sequential part implies reduced performance
• Amdahl’s law: this relation is not linear…
Art of Multiprocessor Programming 38
Amdahl’s Law
Speedup = OldExecutionTime / NewExecutionTime
…of computation given n CPUs instead of 1

Speedup = 1 / ((1 − p) + p/n)

where p is the parallel fraction, (1 − p) the sequential fraction, and n the number of processors.
Art of Multiprocessor Programming
43
Example
• Ten processors
• 60% concurrent, 40% sequential
• How close to 10-fold speedup?

Speedup = 1 / (0.4 + 0.6/10) = 2.17
Art of Multiprocessor Programming
45
Example
• Ten processors
• 80% concurrent, 20% sequential
• How close to 10-fold speedup?

Speedup = 1 / (0.2 + 0.8/10) = 3.57
Art of Multiprocessor Programming
47
Example
• Ten processors
• 90% concurrent, 10% sequential
• How close to 10-fold speedup?

Speedup = 1 / (0.1 + 0.9/10) = 5.26
Art of Multiprocessor Programming
49
Example
• Ten processors
• 99% concurrent, 1% sequential
• How close to 10-fold speedup?

Speedup = 1 / (0.01 + 0.99/10) = 9.17
Art of Multiprocessor Programming 51
The Moral
• Making good use of our multiple processors (cores) means
– Finding ways to effectively parallelize our code
• Minimize sequential parts
• Reduce idle time in which threads wait without doing work
Art of Multiprocessor Programming
52
Implementing a Counter
public class Counter {
  private long value;
  public long getAndIncrement() {
    temp = value;
    value = temp + 1;
    return temp;
  }
}

Make these steps indivisible using locks.
Art of Multiprocessor Programming
53
Locks (Mutual Exclusion)

public interface Lock {
  public void lock();   // acquire lock
  public void unlock(); // release lock
}
Art of Multiprocessor Programming
56
Using Locks

public class Counter {
  private long value;
  private Lock lock;
  public long getAndIncrement() {
    lock.lock();           // acquire lock
    try {
      // critical section
      long temp = value;
      value = temp + 1;
      return temp;
    } finally {
      lock.unlock();       // release lock (no matter what)
    }
  }
}
Art of Multiprocessor Programming 60
Today: Revisit Mutual Exclusion
• Think of performance, not just correctness and progress
• Begin to understand how performance depends on our software properly utilizing the multiprocessor machine’s hardware
• And get to know a collection of locking algorithms…
Art of Multiprocessor Programming 61
What should you do if you can’t get a lock?
• Keep trying
– “spin” or “busy-wait”
– Good if delays are short
• Give up the processor
– Good if delays are long
– Always good on a uniprocessor
Our focus: keep trying (spinning).
Art of Multiprocessor Programming 63
Basic Spin-Lock
(figure: threads spin on the lock, enter the critical section (CS), and reset the lock upon exit)

…the lock introduces a sequential bottleneck
Art of Multiprocessor Programming 65
Review: Test-and-Set
• Boolean value
• Test-and-set (TAS)
– Swap true with current value
– Return value tells if prior value was true or false
• Can reset just by writing false
public class AtomicBoolean {
  boolean value;
  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value; // swap old and new values
    value = newValue;
    return prior;
  }
}

AtomicBoolean lock = new AtomicBoolean(false);
…
boolean prior = lock.getAndSet(true); // swapping in true is called “test-and-set” (TAS)
Art of Multiprocessor Programming 70
Test-and-Set Locks
• Locking
– Lock is free: value is false
– Lock is taken: value is true
• Acquire lock by calling TAS
– If result is false, you win
– If result is true, you lose
• Release lock by writing false
Art of Multiprocessor Programming 71
Test-and-set Lock

class TASlock {
  AtomicBoolean state = new AtomicBoolean(false); // lock state is an AtomicBoolean

  void lock() {
    while (state.getAndSet(true)) {} // keep trying until lock acquired
  }
  void unlock() {
    state.set(false); // release lock by resetting state to false
  }
}
Art of Multiprocessor Programming 75
Performance
• Experiment
– n threads
– Increment shared counter 1 million times
• How long should it take?
• How long does it take?
Art of Multiprocessor Programming 76
Graph
(figure: time vs. threads; flat “ideal” line — no speedup because of the sequential bottleneck)
Art of Multiprocessor Programming 77
Mystery #1
(figure: time vs. threads; the TAS lock degrades sharply compared with ideal)
What is going on?
Art of Multiprocessor Programming 78
Bus-Based Architectures

(figure: processors with per-processor caches connected by a shared bus to memory)

• Random access memory (10s of cycles)
• Shared bus
– Broadcast medium
– One broadcaster at a time
– Processors and memory all “snoop”
• Per-processor caches
– Small
– Fast: 1 or 2 cycles
– Address & state information
Art of Multiprocessor Programming 82
Jargon Watch
• Cache hit
– “I found what I wanted in my cache”
– Good Thing™
• Cache miss
– “I had to shlep all the way to memory for that data”
– Bad Thing™
Art of Multiprocessor Programming 84
Cave Canem
• This model is still a simplification
– But not in any essential way
– Illustrates basic principles
• Will discuss complexities later
Art of Multiprocessor Programming 85
Processor Issues Load Request
(figure: the cache misses; the processor broadcasts “gimme data” on the bus)

Memory Responds
(figure: “got your data right here” — memory puts the data on the bus and the cache fills)
Art of Multiprocessor Programming 88
Processor Issues Load Request
(figure: a second processor asks for the same data; the cache that holds it snoops the request — “I got data”)

Other Processor Responds
(figure: the owning cache puts the data on the bus; now two caches hold copies)
Art of Multiprocessor Programming 93
Modify Cached Data
(figure: one processor updates its cached copy; the copies in memory and in the other cache are now stale)
What’s up with the other copies?
Art of Multiprocessor Programming 97
Cache Coherence
• We have lots of copies of data
– Original copy in memory
– Cached copies at processors
• Some processor modifies its own copy
– What do we do with the others?
– How to avoid confusion?
Art of Multiprocessor Programming 98
Write-Back Caches
• Accumulate changes in cache
• Write back when needed
– Need the cache for something else
– Another processor wants it
• On first modification
– Invalidate other entries
– Requires non-trivial protocol…
Art of Multiprocessor Programming 99
Write-Back Caches
• Cache entry has three states
– Invalid: contains raw seething bits
– Valid: I can read but I can’t write
– Dirty: data has been modified
• Intercept other load requests
• Write back to memory before using cache
Art of Multiprocessor Programming 100
Invalidate
(figure: “mine, all mine!” — the writing cache broadcasts an invalidate; the other caches go “uh, oh”)
• Other caches lose read permission
• This cache acquires write permission
• Memory provides data only if it is not present in any cache, so no need to update memory now (that would be expensive)
Art of Multiprocessor Programming 106
Another Processor Asks for Data
(figure: a load request for the dirty line goes out on the bus)

Owner Responds
(figure: “here it is!” — the owning cache writes the data back over the bus)

End of the Day…
(figure: both caches hold clean copies; reading OK, no writing)
Art of Multiprocessor Programming 109
Simple TASLock
• TAS invalidates cache lines
• Spinners
– Miss in cache
– Go to bus
• A thread wanting to release the lock is delayed behind spinners
Art of Multiprocessor Programming 110
Test-and-test-and-set
• Wait until lock “looks” free
– Spin on local cache
– No bus use while lock busy
• Problem: when lock is released
– Invalidation storm…
Art of Multiprocessor Programming 111
Local Spinning while Lock is Busy
(figure: every cache spins on its local “busy” copy; no bus traffic)
Art of Multiprocessor Programming 112
On Release
(figure: the releaser writes “free”; the spinners’ cached copies become invalid)
Art of Multiprocessor Programming 113
On Release
(figure: everyone misses and rereads the line, then everyone tries TAS at once)
Art of Multiprocessor Programming 115
Mystery Explained
(figure: time vs. threads; the TTAS lock is better than TAS but still not as good as ideal)
OpenMP
• (Shared memory, thread-based parallelism) OpenMP is a shared-memory application programming interface (API) whose features are based on prior efforts to facilitate shared-memory parallel programming.
• (Compiler-directive based) OpenMP provides directives, library functions and environment variables to create and control the execution of parallel programs.
• (Explicit parallelism) OpenMP’s directives let the user tell the compiler which instructions to execute in parallel and how to distribute them among the threads that will run the code.
116
OpenMP example

Serial version:

int main() {
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0;
  for (i = 1; i <= n; i++) {
    sum = sum + a[i]*b[i];
  }
  printf("sum = %f", sum);
}

Parallel version:

int main() {
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0; t = 0;
  #pragma omp parallel private(t) shared(sum, a, b, n)
  {
    t = 0;
    #pragma omp for
    for (i = 1; i <= n; i++) {
      t = t + a[i]*b[i];
    }
    #pragma omp critical (update_sum)
    sum += t;
  }
  printf("sum = %f \n", sum);
}
117
PARALLEL Region Construct
• A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct.
• Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code.
• There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point.
• How Many Threads?
– Setting of the NUM_THREADS clause
– Use of the omp_set_num_threads() library function
– Setting of the OMP_NUM_THREADS environment variable
– Implementation default, usually the number of CPUs on a node
• A parallel region must be a structured block that does not span multiple routines or code files
• It is illegal to branch into or out of a parallel region
118
PARALLEL Region Construct
#pragma omp parallel [clause ...] newline
if (scalar_expression)
private (list)
shared (list)
default (shared | none)
firstprivate (list)
num_threads (integer-expression)
structured_block
119
PARALLEL Region Construct
#include <omp.h>
void main () {
  int nthreads, tid;
  /* Fork a team of threads with each thread having a private tid variable */
  #pragma omp parallel private(tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}
120
Work-Sharing Constructs: for Directive
• The for directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team. This assumes a parallel region has already been initiated, otherwise it executes in serial on a single processor.
121
Work-Sharing Constructs: for Directive
#pragma omp for [clause ...] newline
schedule (type [,chunk])
ordered
private (list)
firstprivate (list)
lastprivate (list)
shared (list)
nowait
for_loop
122
Work-Sharing Constructs: for Directive

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000
void main () {
  int i, chunk;
  float a[N], b[N], c[N];
  /* Some initializations */
  for (i = 0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;
  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {
    #pragma omp for schedule(dynamic,chunk) nowait
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];
  } /* end of parallel section */
}
123
Work-Sharing Constructs SECTIONS Directive
• The SECTIONS directive is a non-iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team.
124
Work-Sharing Constructs: SECTIONS Directive

#pragma omp sections [clause ...] newline
  private (list)
  firstprivate (list)
  lastprivate (list)
  nowait
{
  #pragma omp section newline
    structured_block
  #pragma omp section newline
    structured_block
}
125
Work-Sharing Constructs: SECTIONS Directive

#include <omp.h>
#define N 1000
void main () {
  int i;
  float a[N], b[N], c[N], d[N];
  /* Some initializations */
  for (i = 0; i < N; i++) {
    a[i] = i * 1.5;
    b[i] = i + 22.35;
  }
  #pragma omp parallel shared(a,b,c,d) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
      #pragma omp section
      for (i = 0; i < N; i++)
        d[i] = a[i] * b[i];
    } /* end of sections */
  } /* end of parallel section */
}
126
Work-Sharing Constructs: SINGLE Directive
• The SINGLE directive specifies that the enclosed code is to be executed by only one thread in the team.
• May be useful when dealing with sections of code that are not thread safe (such as I/O)
#pragma omp single [clause ...] newline
private (list)
firstprivate (list)
nowait
structured_block
127
Data Scope Attribute Clauses - PRIVATE
• The PRIVATE clause declares variables in its list to be private to each thread.
• A new object of the same type is declared once for each thread in the team
• All references to the original object are replaced with references to the new object
• Variables declared PRIVATE should be assumed to be uninitialized for each thread
128
Data Scope Attribute Clauses - SHARED
• The SHARED clause declares variables in its list to be shared among all threads in the team.
• A shared variable exists in only one memory location and all threads can read or write to that address
• It is the programmer's responsibility to ensure that multiple threads properly access SHARED variables (such as via CRITICAL sections)
129
Data Scope Attribute Clauses - FIRSTPRIVATE
• The FIRSTPRIVATE clause combines the behavior of the PRIVATE clause with automatic initialization of the variables in its list.
• Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct.
130
Data Scope Attribute Clauses - LASTPRIVATE
• The LASTPRIVATE clause combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the original variable object.
• The value copied back is the one from the sequentially last iteration or section of the enclosing construct: the team member that executes the final iteration of a loop, or the last SECTION of a SECTIONS construct, performs the copy with its own values.
131
Synchronization Constructs: CRITICAL Directive
• The CRITICAL directive specifies a region of code that must be executed by only one thread at a time.
• If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region.
132
Synchronization Constructs: CRITICAL Directive

#include <omp.h>
void main() {
  int x;
  x = 0;
  #pragma omp parallel shared(x)
  {
    #pragma omp critical
    x = x + 1;
  } /* end of parallel section */
}
133
Synchronization Constructs
• MASTER Directive
– The MASTER directive specifies a region that is to be executed only by the master thread of the team. All other threads on the team skip this section of code
• BARRIER Directive
– The BARRIER directive synchronizes all threads in the team.
– When a BARRIER directive is reached, a thread will wait at that point until all other threads have reached that barrier. All threads then resume executing in parallel the code that follows the barrier.
134
Synchronization Constructs
• ATOMIC Directive
– The ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads attempt to write to it. In essence, this directive provides a mini-CRITICAL section.
– The directive applies only to a single, immediately following statement.
135
Find an error

int main() {
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }
  sum = 0; t = 0;
  #pragma omp parallel shared(sum, a, b, n)
  {
    #pragma omp for private(t)
    for (i = 1; i <= n; i++) {
      t = t + a[i]*b[i];
    }
    #pragma omp critical (update_sum)
    sum += t;
  }
  printf("sum = %f \n", sum);
}
136
Find an error
for (i = 0; i < n-1; i++)
  a[i] = a[i] + b[i];

for (i = 0; i < n-1; i++)
  a[i] = a[i+1] + b[i];
137
Find an error
#pragma omp parallel
{
  int Xlocal = omp_get_thread_num();
  Xshared = omp_get_thread_num();
  printf("Xlocal = %d Xshared = %d\n", Xlocal, Xshared);
}

int i, j;
#pragma omp parallel for
for (i = 0; i < n; i++)
  for (j = 0; j < m; j++) {
    a[i][j] = compute(i,j);
  }
138
Find an error
void compute(int n)
{
  int i;
  double h, x, sum;
  h = 1.0/(double) n;
  sum = 0.0;
  #pragma omp for reduction(+:sum) shared(h)
  for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    sum += (1.0 / (1.0 + x*x));
  }
  pi = h * sum;
}
139
Find an error
void main ()
{
  .............
  #pragma omp parallel for private(i,a,b)
  for (i = 0; i < n; i++) {
    b++;
    a = b+i;
  } /*-- End of parallel for --*/
  c = a + b;
  .............
}
140