Programming with Shared Memory
CONTENT
Introduction Cilk TBB OpenMP
2
3
Alternatives for Programming Shared Memory
• Using heavyweight processes
• Using threads. Example: Pthreads
• Using a completely new programming language for parallel programming - not popular. Example: Ada
• Using library routines with an existing sequential programming language
• Modifying the syntax of an existing sequential programming language to create a parallel programming language
• Using an existing sequential programming language supplemented with compiler directives for specifying parallelism. Example: OpenMP
4
FAMILY TREE
(Figure: lineage of Intel® TBB, 1988-2006. Languages: Chare Kernel (small tasks, 1988), Cilk (space-efficient scheduler, cache-oblivious algorithms), Threaded-C (continuation tasks, task stealing, 1995). Libraries: STL (generic programming), STAPL (recursive ranges), JSR-166 (FJTask, containers, 2001). Pragmas: OpenMP (fork/join tasks), OpenMP taskqueue (while & recursion), ECMA .NET (parallel iteration classes). All feed into Intel® TBB, 2006.)
5
Using Heavyweight Processes
Operating systems are often based upon the notion of a process. Processor time is shared between processes, switching from one process to another; this might occur at regular intervals or when an active process becomes delayed.
This offers the opportunity to deschedule processes blocked from proceeding for some reason, e.g. waiting for an I/O operation to complete.
The concept could be used for parallel programming. It is not much used because of the overhead, but the fork/join concepts are used elsewhere.
6
FORK-JOIN construct
7
UNIX System Calls
SPMD model with different code for master process and forked slave process.
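The fork-join pattern described above can be sketched directly with the UNIX system calls. This is a minimal example of our own (POSIX only), not code from the slides; the child stands in for the forked slave process and its exit status stands in for the slave's result.

```cpp
#include <sys/wait.h>
#include <unistd.h>

// Master forks a slave process, waits (joins) for it, and collects
// its exit status - the essence of the UNIX fork-join construct.
int fork_join_demo() {
    pid_t pid = fork();
    if (pid == 0) {
        // Slave process: has its own copy of the address space.
        _exit(7);              // pretend result of the slave's work
    }
    int status = 0;
    waitpid(pid, &status, 0);  // join: master waits for the slave to terminate
    return WEXITSTATUS(status);
}
```

In the SPMD style, both master and slave run the same program and branch on the result of fork() to select their code, as the single if above shows.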
8
SAS (SHARED ADDRESS SPACE) PROGRAMMING MODEL
(Figure: differences between a process and threads.)
(Figure: two threads (or processes) in one system read(X) and write(X) a shared variable X.)
10
Thread-Safe Routines
A routine is thread safe if it can be called simultaneously from multiple threads and still produce correct results.
Standard I/O is thread safe: output messages are printed without characters from different threads being interleaved.
System calls that return the time may not be thread safe.
Routines that access shared data need to be specially designed to ensure they are thread safe.
11
Accessing Shared Data
Consider two processes, each of which adds 1 to a shared variable x: each first reads x, then computes the increment, and finally writes the result back.
12
(Figure: conflict in accessing a shared variable - when the reads and writes of the two processes interleave, one increment is lost.)
13
Critical Section
A critical section comprises the code and the resources it involves. Establishing a critical section ensures that at any moment only one process accesses a particular resource.
This mechanism is also called mutual exclusion.
14
Locks
The simplest mutual-exclusion mechanism is a lock.
One kind of lock is a 1-bit variable: 1 indicates that a process has entered the critical section; 0 indicates that no process is in the critical section.
It is like a door lock: a process arrives at the "door" of the critical section and, finding it open, enters and locks the door. When it finishes its operation, it opens the door and leaves the critical section.
15
Control of critical sections through busy waiting
16
Pthread Lock Routines
Locks are implemented in Pthreads with mutually exclusive lock variables, or "mutex" variables:

    pthread_mutex_lock(&mutex1);
    /* critical section */
    pthread_mutex_unlock(&mutex1);

If a thread reaches a mutex lock and finds it locked, it will wait for the lock to open. If more than one thread is waiting for the lock when it opens, the system will select one thread to be allowed to proceed. Only the thread that locks a mutex can unlock it.
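A complete, runnable version of this pattern, written by us as a sketch: two threads each add 100000 to a shared counter, with the increment protected by a pthread mutex so no updates are lost.

```cpp
#include <pthread.h>

long counter = 0;
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;

// Each worker increments the shared counter 100000 times,
// holding the mutex around the read-modify-write.
void* worker(void*) {
    for (int i = 0; i < 100000; ++i) {
        pthread_mutex_lock(&mutex1);
        counter = counter + 1;          /* critical section */
        pthread_mutex_unlock(&mutex1);
    }
    return nullptr;
}

// Run two workers and return the final counter value.
long locked_sum() {
    counter = 0;
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, worker, nullptr);
    pthread_create(&t2, nullptr, worker, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    return counter;   // always 200000 with the lock held
}
```

Without the lock the same program could return less than 200000, because the interleaved read-modify-write sequences would lose increments.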
17
Deadlock (deadly embrace)
Deadlock occurs when process P1, having locked resource R1, requests resource R2, which is locked by P2, while P2 simultaneously requests R1.
Deadlock can also occur among a circular chain of locks.
19
Pthreads
Pthreads offers one routine that can test whether a lock is actually closed without blocking the thread:

    pthread_mutex_trylock()

This will lock an unlocked mutex and return 0, or will return EBUSY if the mutex is already locked – it might find a use in overcoming deadlock.
20
Semaphores
A semaphore is a positive integer (including zero) operated upon by two operations:
P operation on semaphore s: waits until s is greater than zero, then decrements s by one and allows the process to continue.
V operation on semaphore s: increments s by one and releases one of the waiting processes (if any).
21
P and V operations are performed indivisibly.
Mechanism for activating waiting processes is also implicit in P and V operations. Though exact algorithm not specified, algorithm expected to be fair. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
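The P and V semantics above can be realized with a mutex and a condition variable. This is our own illustrative sketch (the class name and methods are ours, not from any library); the condition variable supplies the implicit "activate a waiting process" mechanism.

```cpp
#include <condition_variable>
#include <mutex>

// A counting semaphore mirroring the P and V operations defined above.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int s;                       // the semaphore value, always >= 0
public:
    explicit Semaphore(int initial) : s(initial) {}
    void P() {                   // wait until s > 0, then decrement
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return s > 0; });
        --s;
    }
    void V() {                   // increment and release one waiter, if any
        std::lock_guard<std::mutex> lk(m);
        ++s;
        cv.notify_one();
    }
    int value() {                // current value (for inspection)
        std::lock_guard<std::mutex> lk(m);
        return s;
    }
};
```

P and V are indivisible because both hold the mutex for the whole operation; which waiter wakes is left to notify_one, matching the "exact algorithm not specified" remark above.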
22
Mutual exclusion of critical sections can be achieved with one semaphore having the value 0 or 1 (a binary semaphore), which acts as a lock variable, but the P and V operations include a process scheduling mechanism:

    Process 1             Process 2             Process 3
    Noncritical section   Noncritical section   Noncritical section
    ...                   ...                   ...
    P(s)                  P(s)                  P(s)
    Critical section      Critical section      Critical section
    V(s)                  V(s)                  V(s)
    ...                   ...                   ...
    Noncritical section   Noncritical section   Noncritical section
23
General semaphore (or counting semaphore)
Can take on positive values other than zero and one. Provides, for example, a means of recording the number of "resource units" available or used, and can be used to solve producer/consumer problems.
24
Monitor
A suite of procedures that provides the only way to access a shared resource. Only one process can use a monitor procedure at any instant. Could be implemented using a semaphore or lock to protect entry, i.e.,

    monitor_proc1() {
        lock(x);
        /* monitor body */
        unlock(x);
        return;
    }
25
Condition Variables
Often, a critical section is to be executed if a specific global condition exists; for example, if a certain value of a variable has been reached.
With locks, the global variable would need to be examined at frequent intervals ("polled") within a critical section - a very time-consuming and unproductive exercise.
This can be overcome by introducing so-called condition variables.
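A minimal sketch of the idea using standard C++ (our own example): instead of polling the global condition, the waiting thread sleeps on a condition variable until it is signalled.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cond;
bool ready = false;        // the global condition
int shared_value = 0;

// One thread establishes the condition and signals; the other
// waits on the condition variable rather than polling.
int wait_for_value() {
    std::thread producer([] {
        {
            std::lock_guard<std::mutex> lk(m);
            shared_value = 42;   // establish the global condition
            ready = true;
        }
        cond.notify_one();       // wake the waiter - no polling needed
    });
    int result;
    {
        std::unique_lock<std::mutex> lk(m);
        cond.wait(lk, [] { return ready; });  // blocks; re-checks on wakeup
        result = shared_value;
    }
    producer.join();
    return result;
}
```

The predicate passed to wait handles the case where the signal arrives before the waiter starts waiting, and guards against spurious wakeups.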
26
Language Constructs for Parallelism
Shared Data
Shared memory variables might be declared as shared with, say,

    shared int x;
27
par Construct
For specifying concurrent statements:

    par {
        S1;
        S2;
        ...
        Sn;
    }
28
forall Construct
To start multiple similar processes together:

    forall (i = 0; i < n; i++) {
        S1;
        S2;
        ...
        Sm;
    }

which generates n processes, each consisting of the statements forming the body of the for loop, S1, S2, ..., Sm. Each process uses a different value of i.
29
Example

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
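The forall construct above can be imitated in standard C++ by launching one thread per iteration. This is our own illustrative sketch (real implementations would use far fewer threads than iterations); each lambda captures its own value of i, just as each forall process does.

```cpp
#include <thread>
#include <vector>

// Clear an n-element array concurrently, one thread per iteration,
// mimicking: forall (i = 0; i < n; i++) a[i] = 0;
std::vector<int> forall_clear(int n) {
    std::vector<int> a(n, -1);
    std::vector<std::thread> procs;
    for (int i = 0; i < n; ++i)
        procs.emplace_back([&a, i] { a[i] = 0; });  // each body has its own i
    for (auto& t : procs)
        t.join();    // implicit barrier at the end of the forall
    return a;
}
```

The join loop plays the role of the implicit barrier at the end of forall: no thread proceeds past the construct until every iteration has finished.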
DESIGN FOR MULTITHREADING
Good design is critical. Bad multithreading can be worse than no multithreading: deadlocks, synchronization bugs, poor performance, etc.
BAD MULTITHREADING
(Figure: threads 1-5, with a game thread handing work to rendering threads that sit mostly idle - the threads effectively run serially.)
GOOD MULTITHREADING
(Figure: a main thread coordinating a game thread and a rendering thread; work such as physics, animation/skinning, particle systems, AI, networking, file I/O, input, rendering, and present is spread across threads and pipelined over frames 2-4.)
ANOTHER PARADIGM: CASCADES
(Figure: threads 1-5 each handle one stage, cascading work across successive frames starting with frame 1.)
Advantages: synchronization points are few and well-defined.
Disadvantages: increases latency (for constant frame rate); needs simple (one-way) data flow.
MULTITHREADED PROGRAMMING IN CILK
34
CONTENT
• Introduction
• Inlets
• Abort
35
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
36
CILK IN ONE SLIDE
Extends C to support parallelism while preserving the original serial semantics. A programming model oriented around fork-join task creation - very well suited to recursive algorithms (e.g. branch-and-bound). Has a solid theoretical foundation ... performance can be proven.
Basic keywords:
cilk - marks a function as a "cilk" function that can be spawned
spawn - spawns a cilk function ... only 2 to 5 times the cost of a regular function call
sync - wait until immediate spawned children functions return
Advanced keywords:
inlet - define a function to handle return values from a cilk task
cilk_fence - a portable memory fence
abort - terminate all currently existing spawned tasks
37
RECURSION IS AT THE HEART OF CILK
Spawning new tasks in Cilk is very convenient. Rather than looping, many tasks are generated recursively, creating a nested queue of tasks; the scheduler uses work stealing to keep all the cores busy.
With Cilk, the programmer worries about expressing concurrency, not the details of how it is implemented.
INTRODUCTION ...
A Cilk program is a set of procedures.
A procedure is a sequence of threads.
Cilk threads are represented by nodes in the dag.
38
39
FIBONACCI – AN EXAMPLE

C code:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }

Cilk code:

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

Cilk provides no new data types.
40
BASIC CILK KEYWORDS

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x,y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

cilk: declares a Cilk function or procedure; the procedure can be spawned in parallel.
spawn: spawns a child thread, which may execute in parallel with the parent.
sync: control cannot pass this point until all spawned children have returned.
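For readers without a Cilk compiler, the spawn/sync pattern can be approximated in standard C++ with std::async. This is our own analogue, not Cilk itself: each "spawn" becomes an async call, and "sync" becomes the point where the futures are collected; the serial cutoff is an assumption we add to bound the thread count.

```cpp
#include <future>

// spawn/sync-style Fibonacci: one branch runs asynchronously,
// the other runs inline (as a Cilk worker would), and get() is the sync.
int fib(int n) {
    if (n < 2) return n;
    if (n < 12)                       // serial cutoff: avoid thread explosion
        return fib(n - 1) + fib(n - 2);
    auto x = std::async(std::launch::async, fib, n - 1);  // "spawn" fib(n-1)
    int y = fib(n - 2);               // run the other branch inline
    return x.get() + y;               // "sync": wait for the spawned child
}
```

Unlike Cilk's work-stealing scheduler, std::async gives no performance guarantees; the point here is only the shape of the control flow.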
41
DYNAMIC MULTITHREADING
(Code: the Cilk fib function from the previous slide.)
The computation dag unfolds dynamically.
Example: fib(4)
(Figure: spawn tree for fib(4): 4 spawns 3 and 2; 3 spawns 2 and 1; each 2 spawns 1 and 0.)
42
MULTITHREADED COMPUTATION
• A directed acyclic graph G = (V, E) represents the parallel instruction stream.
• Each vertex v represents a (Cilk) thread: a maximal instruction sequence containing no parallel-control instructions (spawn, sync, return).
• Each edge e can be a spawn edge, a return edge, or a continue edge.
(Figure: a dag from the initial thread to the final thread, with spawn, return, and continue edges.)
CACTUS STACK
(Figure: procedures A-E, where A spawns B and C, and C spawns D and E. Views of the stack: B sees A,B; D sees A,C,D; E sees A,C,E.)
Cilk supports C's rule for pointers: a pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.)
Cilk's cactus stack supports several views in parallel.
44
OPERATING ON RETURNED VALUES
• Cilk achieves this functionality using an internal function, called an inlet, which is executed as a secondary thread on the parent frame when the child returns.
• The inlet keyword defines a void internal function to be an inlet.

    x += spawn foo(a,b,c);
45
SEMANTICS OF INLETS

    cilk int fib (int n)
    {
        int x = 0;
        inlet void summer (int result)
        {
            x += result;
            return;
        }
        if (n<2) return n;
        else {
            summer(spawn fib (n-1));
            summer(spawn fib (n-2));
            sync;
            return (x);
        }
    }

1. The Cilk procedure fib(i) is spawned.
2. Control passes to the next statement.
3. When fib(i) returns, summer() is invoked.
46
47
SEMANTICS OF INLETS
• In the current implementation of Cilk, the inlet definition may not contain a spawn, and only the first argument of the inlet may be spawned at the call site.
48
IMPLICIT INLETS

    cilk int wfib(int n) {
        if (n == 0) {
            return 0;
        } else {
            int i, x = 1;
            for (i=0; i<=n-2; i++) {
                x += spawn wfib(i);
            }
            sync;
            return x;
        }
    }

For assignment operators, the Cilk compiler automatically generates an implicit inlet.
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 49
COMMON PATTERN FOR CILK
Consider a program containing a loop:

    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }

• Convert it to a recursive structure ... split the range in half until each piece is small enough:

    void vadd (real *A, real *B, int n) {
        if (n<MIN) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            vadd(A, B, n/2);
            vadd(A+n/2, B+n/2, n-n/2);
        }
    }

• Add the Cilk keywords:

    cilk void vadd (real *A, real *B, int n) {
        if (n<MIN) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd(A, B, n/2);
            spawn vadd(A+n/2, B+n/2, n-n/2);
            sync;
        }
    }
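The same divide-and-conquer vadd can be run with standard C++, using std::async in place of spawn. This is our own hedged sketch (the grain size MIN is an assumption): one half is spawned asynchronously, the other runs inline, and get() plays the role of sync.

```cpp
#include <future>

const int MIN = 4;   // illustrative grain size, as in the slide's n<MIN test

// Recursive vector add: A[i] += B[i] for i in [0, n),
// splitting the range in half until pieces are small enough.
void vadd(float* A, float* B, int n) {
    if (n < MIN) {
        for (int i = 0; i < n; ++i) A[i] += B[i];
    } else {
        auto left = std::async(std::launch::async, vadd, A, B, n / 2);  // "spawn"
        vadd(A + n / 2, B + n / 2, n - n / 2);  // other half runs inline
        left.get();                              // "sync"
    }
}
```

The two halves touch disjoint parts of A, so no lock is needed; the recursion itself expresses the parallelism, exactly the pattern the slide advocates.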
50
COMPUTING A PRODUCT

    p = A_0 × A_1 × ... × A_{n-1}

    int product(int *A, int n) {
        int i, p=1;
        for (i=0; i<n; i++) {
            p *= A[i];
        }
        return p;
    }

Optimization: if a partial result is 0, terminate the computation early:

    int product(int *A, int n) {
        int i, p=1;
        for (i=0; i<n; i++) {
            p *= A[i];
            if (p == 0) break;
        }
        return p;
    }
51
COMPUTING A PRODUCT IN PARALLEL

    p = A_0 × A_1 × ... × A_{n-1}

    cilk int prod(int *A, int n) {
        int p = 1;
        if (n == 1) {
            return A[0];
        } else {
            p *= spawn prod(A, n/2);
            p *= spawn prod(A+n/2, n-n/2);
            sync;
            return p;
        }
    }

How do we terminate the computation early?
CILK’S ABORT FEATURE

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

1. Recode the implicit inlet to make it explicit.
CILK’S ABORT FEATURE

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            if (p == 0) {
                abort;   /* Aborts existing children, */
            }            /* but not future ones. */
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

2. Check for 0 within the inlet.
56
CILK’S ABORT FEATURE
Finally, avoid spawning new work after an abort:

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            if (p == 0) {
                abort;   /* Aborts existing children, */
            }            /* but not future ones. */
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            if (p == 0) {    /* Don't spawn if we've */
                return 0;    /* already aborted! */
            }
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }
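The early-termination idea can be sketched without Cilk's abort: a shared flag checked before descending into new work plays the role of "don't spawn if we've already aborted". This is our own sequential C++ approximation; it does not cancel already-running children the way abort does.

```cpp
#include <atomic>

std::atomic<bool> aborted(false);   // shared "we have seen a zero" flag

// Divide-and-conquer product with early termination:
// once any subproblem yields 0, remaining work is skipped.
int product(const int* A, int n) {
    if (aborted.load()) return 0;   // "don't spawn if we've already aborted"
    if (n == 1) return A[0];
    int p = product(A, n / 2);
    if (p == 0) {
        aborted.store(true);        // the inlet's abort, as a flag
        return 0;
    }
    return p * product(A + n / 2, n - n / 2);
}
```

In a真 parallel setting the flag would be polled by every worker, which is weaker than abort (running children finish their current task), but captures the same control structure.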
57
58
MUTUAL EXCLUSION
Cilk's solution to mutual exclusion is no better than anybody else's.
Cilk provides a library of locks declared with Cilk_lockvar.
• To avoid deadlock with the Cilk scheduler, a lock should only be held within a Cilk thread.
• I.e., spawn and sync should not be executed while a lock is held.
LOCKING
Cilk_lockvar data type:

    #include <cilk-lib.h>

    Cilk_lockvar mylock;

    {
        Cilk_lock_init(mylock);
        Cilk_lock(mylock);    /* begin critical section */
        ...
        Cilk_unlock(mylock);  /* end critical section */
    }
60
KEY IDEAS
Cilk is simple: cilk, spawn, sync, SYNCHED, inlet, abort.
JCilk is simpler.
Work & span.
INTEL’S THREADING BUILDING BLOCKS
61
THREADING BUILDING BLOCKS LIBRARY CHARACTERISTICS
C++ library. Targets threading for performance (designed to parallelize computationally intensive work). Is compatible with other threading packages. Emphasizes scalable data-parallel programming. Specifies templates and tasks instead of threads - the library schedules tasks onto threads and manages load balancing.
62
63
COMPONENTS OF TBB (VERSION 2.1)
Parallel algorithms: parallel_for (improved), parallel_reduce (improved), parallel_do (new), pipeline (improved), parallel_sort, parallel_scan
Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector (all improved)
Synchronization primitives: atomic operations, various flavors of mutexes (improved)
Task scheduler: with new functionality
Memory allocators: tbb_allocator (new), cache_aligned_allocator, scalable_allocator
Utilities: tick_count, tbb_thread (new)
64
C++ REVIEW: FUNCTION TEMPLATE
Type-parameterized function. Strongly typed. Obeys scope rules. Actual arguments evaluated exactly once. Not redundantly instantiated.

    template<typename T>
    void swap( T& x, T& y ) {
        T z = x;
        x = y;
        y = z;
    }

    void reverse( float* first, float* last ) {
        while( first < last-1 ) swap( *first++, *--last );
    }

The compiler instantiates template swap with T=float. [first,last) defines a half-open interval.
65
GENERICITY OF SWAP
Requirements for T:

    T(const T&)                     Copy constructor
    void T::operator=(const T&)     Assignment
    ~T()                            Destructor

    template<typename T>
    void swap( T& x, T& y ) {
        T z = x;    // Construct z
        x = y;      // Assignment
        y = z;      // Assignment
    }               // Destroy z
66
C++ REVIEW: TEMPLATE CLASS
Type-parameterized class:

    template<typename T, typename U>
    class pair {
    public:
        T first;
        U second;
        pair( const T& x, const U& y ) : first(x), second(y) {}
    };

    pair<string,int> x;
    x.first = "abc";
    x.second = 42;

The compiler instantiates template pair with T=string and U=int.
67
TBB LIBRARY ALGORITHM – PARALLEL_FOR
parallel_for is a template function provided by the library:

    template <typename Range, typename Body>
    void parallel_for( const Range& range, const Body& body, partitioner );

Requirements for Range R:

    R(const R&)                   Copy a range
    R::~R()                       Destroy a range
    bool R::empty() const         Is range empty?
    bool R::is_divisible() const  Can range be split?
    R::R(R& r, split)             Split r into two subranges

The library provides blocked_range, blocked_range2d, blocked_range3d; the programmer can define new kinds of ranges.
68
REQUIREMENTS FOR FUNCTOR (THE BODY)

    template <typename Range, typename Body>
    void parallel_for( const Range& range, const Body& body, partitioner );

Requirements for the body object body:

    Body::Body( const Body& )                        Copy constructor
    Body::~Body()                                    Destructor
    void Body::operator()( Range& subrange ) const   Apply body to subrange
EXAMPLE – PARALLEL_FOR
69
• Example: concurrently apply a function to each element in an array.
• Serial version:

    void SerialApplyFoo( float a[], size_t n ) {
        for( size_t i=0; i<n; ++i )
            Foo(a[i]);
    }

• The iteration space is 0...(n-1).
TBB LIBRARY ALGORITHM – PARALLEL_FOR, CONTINUED
The parallel version requires two steps. First, define the body class:

    #include "tbb/blocked_range.h"

    class ApplyFoo {
        float *const my_a;
    public:
        ApplyFoo( float a[] ) : my_a(a) {}
        void operator()( const blocked_range<size_t>& r ) const {
            float *a = my_a;
            for( size_t i=r.begin(); i!=r.end(); ++i )
                Foo(a[i]);
        }
    };

parallel_for breaks the iteration space into chunks, each of which is run on a separate thread.
blocked_range<T>(begin,end,grainsize) is a recursively divisible range type. grainsize specifies the number of iterations in a "reasonable size" chunk to deal out to a processor. If the iteration space has more than grainsize iterations, parallel_for splits it into separate subranges that are scheduled separately.
operator() processes a chunk.
Second, invoke parallel_for:

    #include "tbb/parallel_for.h"

    void ParallelApplyFoo( float a[], size_t n ) {
        parallel_for( blocked_range<size_t>(0,n,IdealGrainSize),
                      ApplyFoo(a) );
    }
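To make the chunking idea concrete without TBB installed, here is a hand-rolled stand-in of our own: split [0,n) into fixed-size chunks and run each chunk's loop on its own std::thread. Foo and the grain size are our assumptions; real parallel_for additionally splits recursively and load-balances via work stealing.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

void Foo(float& x) { x *= 2; }   // stand-in for the Foo in the slides

// Chunked parallel apply: each [lo,hi) chunk is one "subrange",
// processed by its own thread, mimicking parallel_for's division of work.
void ParallelApplyFoo(float a[], std::size_t n) {
    const std::size_t grain = 1000;  // plays the role of IdealGrainSize
    std::vector<std::thread> workers;
    for (std::size_t lo = 0; lo < n; lo += grain) {
        std::size_t hi = (lo + grain < n) ? lo + grain : n;
        workers.emplace_back([a, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                Foo(a[i]);           // body applied to the subrange
        });
    }
    for (auto& t : workers)
        t.join();
}
```

Each chunk touches a disjoint slice of the array, so no synchronization is needed inside the loop bodies.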
TBB LIBRARY - ALGORITHMS
parallel_reduce – math ops with elements of an array in parallel
parallel_do – loops over iteration spaces of indeterminate length
parallel_* – several others...
72
73
WORK STEALING
(Figure: four threads, each with its own deque of tasks and a mailbox.)
Each thread chooses its next task in this order:
0. Override: do an explicitly specified task.
1. Locality: take the youngest task from my own deque.
2. Cache affinity: steal a task advertised in my mailbox.
3. Load balance: steal the oldest task from a random victim.
74
HOW THIS WORKS
(Figure: a range is split... recursively... until the pieces reach grainsize.)
75
WORK DEPTH FIRST; STEAL BREADTH FIRST
(Figure: a victim thread's deque shown against its L1 and L2 caches. The oldest task is the best choice for theft - a big piece of work whose data is far from the victim's hot data; the next-oldest is the second-best choice.)
76
PARALLEL SORT EXAMPLE (WITH WORK STEALING): QUICKSORT
(The original slides animate tbb::parallel_sort on a 64-element array; only the narration is kept here.)
Step 1: Thread 1 starts with the initial data: tbb::parallel_sort(color, color+64); Thread 1 partitions/splits its data (around the pivot 37).
Step 2: Thread 2 gets work by stealing from Thread 1. Threads 1 and 2 then each partition/split their own data (around pivots 7 and 49).
Step 3: Thread 3 gets work by stealing from Thread 1, and Thread 4 gets work by stealing from Thread 2. Threads 1, 2, and 4 sort the rest of their data; Thread 3 partitions/splits its data.
Step 4: Thread 1 gets more work by stealing from Thread 3; Thread 3 sorts the rest of its data.
Step 5: Thread 1 partitions/splits its data.
Step 6: Thread 2 gets more work by stealing from Thread 1; Thread 1 sorts the rest of its data.
Step 7: Thread 2 sorts the rest of its data. DONE - the array 0..63 is fully sorted.
TBB LIBRARY - PIPELINE
TBB implements the pipeline pattern: data flows through a series of pipeline stages, and each stage processes the data in some way.
87
PARALLEL PIPELINE
(Figure: items numbered 1-10 flowing through a parallel stage followed by two serial stages.)
A parallel stage scales because it can process items in parallel or out of order. A serial stage processes items one at a time, in order; another stage may also be serial. Items wait for their turn in a serial stage: incoming items are tagged with sequence numbers, and the sequence numbers are used to recover the order for the serial stage. The pipeline controls excessive parallelism by limiting the total number of items flowing through it. Throughput is limited by the throughput of the slowest serial stage.
EXAMPLE
Sample problem: read a text file (sequential), capitalize the first letter of each word (parallel), and write the modified text to a new file (sequential).
88
TBB LIBRARY – PIPELINE, CONTINUED
89
    // Create the pipeline
    tbb::pipeline pipeline;

    // Create file-reading stage and add it to the pipeline
    MyInputFilter input_filter( input_file );
    pipeline.add_filter( input_filter );

    // Create capitalization stage and add it to the pipeline
    MyTransformFilter transform_filter;
    pipeline.add_filter( transform_filter );

    // Create file-writing stage and add it to the pipeline
    MyOutputFilter output_filter( output_file );
    pipeline.add_filter( output_filter );

    // Run the pipeline
    pipeline.run( MyInputFilter::n_buffer );

    // Must remove filters from pipeline before they are implicitly destroyed.
    pipeline.clear();
TBB – PIPELINE, CONTINUED
90
    // Filter that writes each buffer to a file.
    class MyOutputFilter: public tbb::filter {
        FILE* my_output_file;
    public:
        MyOutputFilter( FILE* output_file );
        /*override*/ void* operator()( void* item );
    };

    MyOutputFilter::MyOutputFilter( FILE* output_file )
        : tbb::filter(serial), my_output_file(output_file) { }

    void* MyOutputFilter::operator()( void* item ) {
        MyBuffer& b = *static_cast<MyBuffer*>(item);
        fwrite( b.begin(), 1, b.size(), my_output_file );
        return NULL;
    }
TBB – PIPELINE, CONTINUED
91
    // Changes the first letter of each word from lower case to upper case.
    class MyTransformFilter: public tbb::filter {
    public:
        MyTransformFilter();
        /*override*/ void* operator()( void* item );
    };

    MyTransformFilter::MyTransformFilter()
        : tbb::filter(parallel) {}

    /*override*/ void* MyTransformFilter::operator()( void* item ) {
        // a for loop and 'toupper()' go here...
    }
TBB - TIMING
tick_count class:

    using namespace tbb;

    void Foo() {
        tick_count t0 = tick_count::now();
        ...action being timed...
        tick_count t1 = tick_count::now();
        printf("time for action = %g seconds\n", (t1-t0).seconds() );
    }
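The same timing pattern can be written with standard C++ <chrono> for readers without TBB; this is our own equivalent sketch (the dummy loop stands in for the action being timed).

```cpp
#include <chrono>

// Time an action the same way tick_count does:
// capture now() before and after, then take the difference in seconds.
double time_action() {
    auto t0 = std::chrono::steady_clock::now();
    volatile long s = 0;
    for (long i = 0; i < 1000000; ++i)
        s += i;                      // ...action being timed...
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

steady_clock is the right choice here (like tick_count, it is monotonic), whereas system_clock can jump if the wall clock is adjusted.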
TBB: TASK SCHEDULER
Intel propaganda: the task scheduler is the engine that powers the loop templates. When practical, use the loop templates instead of the task scheduler, because the templates hide the complexity of the scheduler.
However, if you have an algorithm that does not naturally map onto one of the high-level templates, use the task scheduler. All of the scheduler functionality used by the high-level templates is available for you to use directly, so you can build new high-level templates that are just as powerful as the existing ones.
93
TBB – TASK SCHEDULER
Maps tasks to threads. Handles load balancing and scheduling.
Hides threading details – just think in terms of tasks.
Any code using the task scheduler must have an initialized tbb::task_scheduler_init object.
#include "tbb/task_scheduler_init.h"
using namespace tbb;

int main() {
    task_scheduler_init init;
    ...
    return 0;
}
EXAMPLE: NAIVE FIBONACCI CALCULATION
Recursion typically used to calculate Fibonacci number F(n)=F(n-1)+F(n-2)
long SerialFib( long n ) {
    if( n<2 )
        return n;
    else
        return SerialFib(n-1) + SerialFib(n-2);
}
EXAMPLE: NAIVE FIBONACCI CALCULATION
Can envision the Fibonacci computation as a task graph, e.g. for SerialFib(4):

SerialFib(4)
├─ SerialFib(3)
│   ├─ SerialFib(2)
│   │   ├─ SerialFib(1)
│   │   └─ SerialFib(0)
│   └─ SerialFib(1)
└─ SerialFib(2)
    ├─ SerialFib(1)
    └─ SerialFib(0)
FIBONACCI - TASK SPAWNING SOLUTION
Use TBB tasks for the creation and execution of the task graph.
Create a new root task: allocate the task object, construct the task, spawn (execute) it, and wait for completion.

long ParallelFib( long n ) {
    long sum;
    FibTask& a = *new( task::allocate_root() ) FibTask(n,&sum);
    task::spawn_root_and_wait(a);
    return sum;
}
class FibTask: public task {
public:
    const long n;
    long* const sum;
    FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
    task* execute() {           // Overrides virtual function task::execute
        if( n<CutOff ) {
            *sum = SerialFib(n);
        } else {
            long x, y;
            FibTask& a = *new( allocate_child() ) FibTask(n-1,&x);
            FibTask& b = *new( allocate_child() ) FibTask(n-2,&y);
            set_ref_count(3);   // 3 = 2 children + 1 for wait
            spawn( b );
            spawn_and_wait_for_all( a );
            *sum = x+y;
        }
        return NULL;
    }
};
FIBONACCI - TASK SPAWNING SOLUTION
FibTask is derived from the TBB task class; the execute method does the computation of a task.
New child tasks are created to compute the (n-1)th and (n-2)th Fibonacci numbers.
The reference count is used to know when spawned tasks have completed; set it before spawning any children.
spawn( b ): spawn the task and return immediately; it can be scheduled at any time.
spawn_and_wait_for_all( a ): spawn the task and block until all children have completed execution.
CONCURRENT CONTAINERS
The TBB library provides highly concurrent containers. STL containers are not concurrency-friendly: an attempt to modify them concurrently can corrupt the container.
Standard practice is to wrap a lock around STL containers, which turns the container into a serial bottleneck.
The library instead provides fine-grained locking or lockless implementations: worse single-thread performance, but better scalability.
The containers can be used with the library, OpenMP, or native threads.
TBB - CONTAINERS
concurrent_hash_map<Key,T,HashCompare>
concurrent_queue<T,Allocator>
concurrent_vector<T,Allocator>

These support concurrent access, concurrent operations, and parallel iteration. The TBB library retains control over memory allocation.
CONCURRENCY-FRIENDLY INTERFACES
Some STL interfaces are inherently not concurrency-friendly. For example, suppose two threads each execute:

extern std::queue q;
if(!q.empty()) {     // At this instant, another thread might pop the last element.
    item = q.front();
    q.pop();
}

Solution: concurrent_queue has pop_if_present.
CONCURRENT QUEUE CONTAINER
concurrent_queue<T> preserves local FIFO order: if one thread pushes two values and another thread pops those two values, they come out in the same order that they went in.
Method push(const T&) places a copy of the item on the back of the queue.
Two kinds of pops: blocking pop(T&) and non-blocking pop_if_present(T&).
Method size() returns a signed integer: the difference between pushes and pops. If size() returns -n, it means n pops await corresponding pushes.
Method empty() returns size() == 0; it may return true if the queue is empty but there are pending pop() calls.
CONCURRENT QUEUE CONTAINER EXAMPLE
Simple example to enqueue and print integers: construct the queue, push items onto it, then while there are more items on the queue, pop one off and print it.

#include "tbb/concurrent_queue.h"
#include <stdio.h>
using namespace tbb;

int main() {
    concurrent_queue<int> queue;
    int j;

    for (int i = 0; i < 10; i++)
        queue.push(i);

    while (!queue.empty()) {
        queue.pop(j);
        printf("from queue: %d\n", j);
    }
    return 0;
}
TBB - ALLOCATION
tbb_allocator<T>: allocates and frees memory via the TBB malloc library if available, otherwise it reverts to using malloc and free.
scalable_allocator<T>: allocates and frees memory in a way that scales with the number of processors.
others…
SCALABLE MEMORY ALLOCATORS
Serial memory allocation can easily become a bottleneck in multithreaded applications: threads require mutual exclusion on the shared heap.
TBB offers two choices for scalable memory allocation, both similar to the STL template class std::allocator:
scalable_allocator — offers scalability, but not protection from false sharing; memory is returned to each thread from a separate pool.
cache_aligned_allocator — offers both scalability and false-sharing protection.
METHODS FOR SCALABLE_ALLOCATOR

#include "tbb/scalable_allocator.h"
template<typename T> class scalable_allocator;

Scalable versions of malloc, free, realloc, calloc:
void *scalable_malloc( size_t size );
void  scalable_free( void *ptr );
void *scalable_realloc( void *ptr, size_t size );
void *scalable_calloc( size_t nobj, size_t size );

STL allocator functionality:
T* A::allocate( size_type n, void* hint=0 )   — allocate space for n values
void A::deallocate( T* p, size_t n )          — deallocate n values starting at p
void A::construct( T* p, const T& value )
void A::destroy( T* p )
SCALABLE ALLOCATORS EXAMPLE
#include "tbb/scalable_allocator.h"

typedef char _Elem;
// Use the TBB scalable allocator for the STL basic_string class
typedef std::basic_string<_Elem, std::char_traits<_Elem>,
                          tbb::scalable_allocator<_Elem> > MyString;
. . .
{
    . . .
    int *p;
    MyString str1 = "qwertyuiopasdfghjkl";
    MyString str2 = "asdfghjklasdfghjkl";
    // Use the TBB scalable allocator to allocate 24 integers
    p = tbb::scalable_allocator<int>().allocate(24);
    . . .
}
TBB: SYNCHRONIZATION PRIMITIVES
Parallel tasks must sometimes touch shared data. When data updates might overlap, use mutual exclusion to avoid a race.
A high-level generic abstraction over hardware atomic operations atomically protects the update of a single variable.
Critical regions of code are protected by scoped locks:
  The range of the lock is determined by its lifetime (scope).
  Leaving the lock's scope calls the destructor, making it exception safe.
  Minimizing lock lifetime avoids possible contention.
Several mutex behaviors are available:
  Spin vs. queued ("are we there yet" vs. "wake me when we get there")
  Writer vs. reader/writer (supports multiple readers / single writer)
  Scoped wrapper of a native mutual exclusion function
ATOMIC EXECUTION
atomic<T>: T should be an integral type or a pointer type. Full type-safe support for 8-, 16-, 32-, and 64-bit integers.

atomic<int> i;
. . .
int z = i.fetch_and_add(2);

Operations:
'= x' and 'x ='          — read/write the value of x
x.fetch_and_store(y)     — z = x; x = y; return z
x.fetch_and_add(y)       — z = x; x += y; return z
x.compare_and_swap(y,p)  — z = x; if (x==p) x = y; return z
SUMMARY
Intel® Threading Building Blocks is a parallel programming model for C++ applications:
  Used for computationally intense code
  A focus on data-parallel programming
Intel® Threading Building Blocks provides:
  Generic parallel algorithms
  Highly concurrent containers
  Low-level synchronization primitives
  A task scheduler that can be used directly
SHARED MEMORY PROGRAMMING WITH OPENMP
CS267 Lecture 6 119
INTRODUCTION TO OPENMP
What is OpenMP?
  An open specification for Multi-Processing
  A "standard" API for defining multi-threaded shared-memory programs
  openmp.org – talks, examples, forums, etc.
A high-level API:
  Preprocessor (compiler) directives (~80%)
  Library calls (~19%)
  Environment variables (~1%)
© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 120
OPENMP* OVERVIEW:
A sampler of OpenMP syntax — C/C++ pragmas, Fortran directives, library calls, and environment variables:

#pragma omp parallel for private(A, B)
#pragma omp critical
omp_set_lock(lck)
call OMP_INIT_LOCK (ilok)
call omp_test_lock(jlok)
CALL OMP_SET_NUM_THREADS(10)
Nthrds = OMP_GET_NUM_PROCS()
setenv OMP_SCHEDULE "dynamic"
C$OMP parallel do shared(a, b, c)
C$OMP PARALLEL REDUCTION (+: A, B)
C$OMP DO lastprivate(XX)
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP ORDERED
C$OMP SINGLE PRIVATE(X)
C$OMP SECTIONS
C$OMP MASTER
C$OMP ATOMIC
C$OMP FLUSH
C$OMP THREADPRIVATE(/ABC/)
C$OMP PARALLEL COPYIN(/blk/)
!$OMP BARRIER
OpenMP: An API for Writing Multithreaded Applications
A set of compiler directives and library routines for parallel application programmers
Makes writing multi-threaded applications in Fortran, C and C++ as easy as we can make it.
Standardizes last 20 years of SMP practice
* The name “OpenMP” is the property of the OpenMP Architecture Review Board.
THE ESSENCE OF OPENMP
Create threads that execute in a shared address space:
  The only way to create threads is with the "parallel construct".
  Once created, all threads execute the code inside the construct.
Split up the work between threads by one of two means:
  SPMD (Single Program Multiple Data): all threads execute the same code, and you use the thread ID to assign work to a thread.
  Worksharing constructs split up loops and tasks between threads.
Manage the data environment to avoid data access conflicts:
  Synchronization, so correct results are produced regardless of how threads are scheduled.
  Carefully manage which data can be private (local to each thread) and which is shared.
A PROGRAMMER'S VIEW OF OPENMP
OpenMP is a portable, threaded, shared-memory programming specification with "light" syntax:
  Exact behavior depends on the OpenMP implementation!
  Requires compiler support (C or Fortran)
OpenMP will:
  Allow a programmer to separate a program into serial regions and parallel regions, rather than into T concurrently-executing threads
  Hide stack management
  Provide synchronization constructs
OpenMP will not:
  Parallelize automatically
  Guarantee speedup
  Provide freedom from data races
MOTIVATION – OPENMP
int main() {

    // Do this part in parallel
    printf( "Hello, World!\n" );

    return 0;
}
MOTIVATION – OPENMP
#include <omp.h>

int main() {

    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf( "Hello, World!\n" );
    }

    return 0;
}
OPENMP EXECUTION MODEL:
Fork-Join Parallelism:
  The master thread spawns a team of threads as needed.
  Parallelism is added incrementally until performance goals are met: i.e. the sequential program evolves into a parallel program.
(Figure: sequential parts alternating with parallel regions; the master thread, shown in red, runs throughout, and one parallel region contains a nested parallel region.)
PROGRAMMING MODEL – CONCURRENT LOOPS
OpenMP easily parallelizes loops. Requirement: no data dependences (read/write or write/write pairs) between iterations!
The compiler calculates the loop bounds for each thread directly from the serial source:

#pragma omp parallel for
for( i=0; i < 25; i++ ) {
    printf("Foo");
}
PROGRAMMING MODEL – LOOP SCHEDULING
The schedule clause determines how loop iterations are divided among the thread team:
  static([chunk]) divides iterations statically between threads. Each thread receives [chunk] iterations, rounding as necessary to account for all iterations. The default [chunk] is ceil( # iterations / # threads ).
  dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes. This forms a logical work queue consisting of all loop iterations. The default [chunk] is 1.
  guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation.
PROGRAMMING MODEL – DATA SHARING
Parallel programs often employ two types of data:
  Shared data, visible to all threads, similarly named
  Private data, visible to a single thread (often stack-allocated)

PThreads:
  Global-scoped variables are shared
  Stack-allocated variables are private

// shared, globals
int bigdata[1024];

void* foo(void* bar) {
    // private, stack
    int tid;

    /* Calculation goes here */
}

OpenMP:
  shared variables are shared
  private variables are private

int bigdata[1024];

void* foo(void* bar) {
    int tid;

    #pragma omp parallel \
        shared ( bigdata ) \
        private ( tid )
    {
        /* Calc. here */
    }
}
PROGRAMMING MODEL - SYNCHRONIZATION
OpenMP critical sections — named or unnamed, no explicit locks / mutexes:

#pragma omp critical
{ /* Critical code here */ }

Barrier directives:

#pragma omp barrier

Explicit lock functions, when all else fails (may require the flush directive):

omp_set_lock( lock l );
/* Code goes here */
omp_unset_lock( lock l );

Single-thread regions within parallel regions — master, single directives:

#pragma omp single
{ /* Only executed once */ }
EXAMPLE PROBLEM: NUMERICAL INTEGRATION
Mathematically, we know that:

    ∫₀¹ 4.0/(1+x²) dx = π

We can approximate the integral as a sum of rectangles:

    Σ_{i=0}^{N} F(xᵢ) Δx ≈ π

where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.
(Figure: plot of F(x) = 4.0/(1+x²) on [0,1], y-axis from 0.0 to 4.0, approximated by rectangles.)
PI PROGRAM: AN EXAMPLE

static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    x = 0.5 * step;
    for (i=0; i< num_steps; i++){
        sum += 4.0/(1.0+x*x);
        x += step;
    }
    pi = step * sum;
}
PI PROGRAM: IDENTIFY CONCURRENCY
The loop iterations can in principle be executed concurrently.

static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    x = 0.5 * step;
    for (i=0; i< num_steps; i++){
        sum += 4.0/(1.0+x*x);
        x += step;
    }
    pi = step * sum;
}
PI PROGRAM: EXPOSE CONCURRENCY, PART 1
Isolate data that must be shared from data local to a task. Redefine x to remove the loop-carried dependence. The accumulation into sum is a reduction: results from each iteration are accumulated into a single global.

static long num_steps = 100000;
double step;

void main ()
{
    double pi, sum = 0.0;

    step = 1.0/(double) num_steps;

    int i;
    double x;
    for (i=0; i< num_steps; i++){
        x = (i+0.5)*step;
        sum += 4.0/(1.0+x*x);
    }
    pi = step * sum;
}
PI PROGRAM: EXPOSE CONCURRENCY, PART 2 – DEAL WITH THE REDUCTION
Common trick: promote the scalar "sum" to an array indexed by thread number, to create thread-local copies of shared data.

static long num_steps = 100000;
#define NUM 4    // expected max thread count
double step;

void main ()
{
    double pi, sum[NUM] = {0.0};

    step = 1.0/(double) num_steps;

    int i, ID=0;
    double x;
    for (i=0; i< num_steps; i++){
        x = (i+0.5)*step;
        sum[ID] += 4.0/(1.0+x*x);
    }
    pi = 0.0;
    for(int i=0; i<NUM; i++)
        pi += step * sum[i];
}
PI PROGRAM: EXPRESS CONCURRENCY USING OPENMP
Create NUM threads; each thread executes the code in the parallel block. A simple modification to the loop deals out iterations to the threads. Variables defined inside the parallel block are private to each thread; automatic variables defined outside a parallel region are shared between threads.

#include <omp.h>
static long num_steps = 100000;
#define NUM 4
double step;

void main ()
{
    double pi, sum[NUM] = {0.0};

    step = 1.0/(double) num_steps;
#pragma omp parallel num_threads(NUM)
{
    int i, ID;
    double x;
    ID = omp_get_thread_num();
    for (i=ID; i< num_steps; i+=NUM){
        x = (i+0.5)*step;
        sum[ID] += 4.0/(1.0+x*x);
    }
}
    pi = 0.0;
    for(int i=0; i<NUM; i++)
        pi += step * sum[i];
}
PI PROGRAM: FIXING THE NUM THREADS BUG
NUM is a requested number of threads, but an OS can choose to give you fewer. Hence, you need to add a bit of code to get the actual number of threads.

#include <omp.h>
static long num_steps = 100000;
#define NUM 4
double step;

void main ()
{
    double pi, sum[NUM] = {0.0};

    step = 1.0/(double) num_steps;
#pragma omp parallel num_threads(NUM)
{
    int nthreads = omp_get_num_threads();
    int i, ID;
    double x;
    ID = omp_get_thread_num();
    for (i=ID; i< num_steps; i+=nthreads){
        x = (i+0.5)*step;
        sum[ID] += 4.0/(1.0+x*x);
    }
}
    pi = 0.0;
    for(int i=0; i<NUM; i++)
        pi += step * sum[i];
}
INCREMENTAL PARALLELISM
Software development with incremental parallelism:
  Use behavior-preserving transformations to expose concurrency.
  Express concurrency incrementally by adding OpenMP directives; in a large program you can do this loop by loop, to evolve the original program into a parallel OpenMP program.
  Build and time the program, and optimize as needed with behavior-preserving transformations until you reach the desired performance.
PI PROGRAM: EXECUTE CONCURRENCY
Build this program (the same code as on the previous slide) and execute it on parallel hardware.
The performance can suffer on some systems due to false sharing of sum[ID]: independent elements of the sum array share a cache line, so every update requires a cache-line transfer between threads.
PI PROGRAM: SAFE UPDATE OF SHARED DATA
Replace the sum array with a local/private version of sum (psum): no more false sharing. Use a critical section so only one thread at a time can update sum, i.e. the psum values can be combined safely.

#include <omp.h>
static long num_steps = 100000;
#define NUM 4
double step;

int main ()
{
    double pi, sum = 0.0;

    step = 1.0/(double) num_steps;
#pragma omp parallel num_threads(NUM)
{
    int i, ID;
    double x, psum = 0.0;
    int nthreads = omp_get_num_threads();
    ID = omp_get_thread_num();
    for (i=ID; i< num_steps; i+=nthreads){
        x = (i+0.5)*step;
        psum += 4.0/(1.0+x*x);
    }
    #pragma omp critical
    sum += psum;
}
    pi = step * sum;
}
PI PROGRAM: MAKING LOOP-SPLITTING AND REDUCTIONS EVEN EASIER
The reduction clause manages the dependence on sum; the private clause creates data local to each thread.

#include <omp.h>
static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
#pragma omp parallel for private(i, x) reduction(+:sum)
    for (i=0; i< num_steps; i++){
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}
SYNCHRONIZATION: BARRIER
Barrier: each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
#pragma omp barrier
#pragma omp for
    for(i=0; i<N; i++){ C[i] = big_calc3(i,A); }
    // implicit barrier at the end of a for worksharing construct
#pragma omp for nowait
    for(i=0; i<N; i++){ B[i] = big_calc2(C, i); }
    // no implicit barrier due to nowait
    A[id] = big_calc4(id);
}
// implicit barrier at the end of a parallel region
PUTTING THE MASTER THREAD TO WORK
The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no synchronization is implied).

#pragma omp parallel
{
    do_many_things();
#pragma omp master
    { exchange_boundaries(); }
#pragma omp barrier
    do_many_other_things();
}
RUNTIME LIBRARY ROUTINES AND ICVS
To use a known, fixed number of threads in a program: (1) tell the system that you don't want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you actually got.

#include <omp.h>
void main()
{
    int num_threads;
    omp_set_dynamic( 0 );                        // Disable dynamic adjustment of the number of threads.
    omp_set_num_threads( omp_get_num_procs() );  // Request as many threads as you have processors.
#pragma omp parallel
    {
        int id = omp_get_thread_num();
#pragma omp single                               // Protect this store, since memory stores are not atomic.
        num_threads = omp_get_num_threads();
        do_lots_of_stuff(id);
    }
}

Internal Control Variables (ICVs) define the state of the runtime system to a thread. Consistent pattern: set with an "omp_set" routine or an environment variable, read with "omp_get".
OPTIMIZING LOOP PARALLEL PROGRAMS
Short-range force computation for a particle system using the cut-off method. Particles may be unevenly distributed, i.e. different particles have different numbers of neighbors. Evenly spreading out loop iterations may fail to balance the load among threads; we need a way to tell the compiler how best to distribute the load.

#include <omp.h>
#pragma omp parallel
{
    // define neighborhood as the num_neigh particles
    // within "cutoff" of each particle "i".
    #pragma omp for
    for( int i = 0; i < n; i++ ) {
        Fx[i] = 0.0;
        Fy[i] = 0.0;
        for (int j = 0; j < num_neigh[i]; j++) {
            neigh_ind = neigh[i][j];
            Fx[i] += forceX(i, neigh_ind);
            Fy[i] += forceY(i, neigh_ind);
        }
    }
}
THE SCHEDULE CLAUSE
The schedule clause affects how loop iterations are mapped onto threads:
  schedule(static [,chunk]) — deal out blocks of iterations of size "chunk" to each thread.
  schedule(dynamic [,chunk]) — each thread grabs "chunk" iterations off a queue until all iterations have been handled.
  schedule(guided [,chunk]) — threads dynamically grab blocks of iterations; the size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
  schedule(runtime) — schedule and chunk size taken from the OMP_SCHEDULE environment variable (or from the runtime library, for OpenMP 3.0).
OPTIMIZING LOOP PARALLEL PROGRAMS
Short-range force computation for a particle system using the cut-off method, now with dynamic scheduling: divide the range of n into chunks of size 10. Each thread computes a chunk, then goes back to get its next chunk of 10 iterations. This dynamically balances the load between threads.

#include <omp.h>
#pragma omp parallel
{
    // define neighborhood as the num_neigh particles
    // within "cutoff" of each particle "i".
    #pragma omp for schedule(dynamic, 10)
    for( int i = 0; i < n; i++ ) {
        Fx[i] = 0.0;
        Fy[i] = 0.0;
        for (int j = 0; j < num_neigh[i]; j++) {
            neigh_ind = neigh[i][j];
            Fx[i] += forceX(i, neigh_ind);
            Fy[i] += forceY(i, neigh_ind);
        }
    }
}
LOOP WORK-SHARING CONSTRUCTS: THE SCHEDULE CLAUSE

Schedule clause | When to use
STATIC          | Work per iteration is pre-determined and predictable by the programmer. Least work at runtime: scheduling is done at compile time.
DYNAMIC         | Unpredictable, highly variable work per iteration. Most work at runtime: complex scheduling logic is used at run time.
GUIDED          | Special case of dynamic, to reduce scheduling overhead.
SECTIONS WORK-SHARING CONSTRUCT
The sections work-sharing construct gives a different structured block to each thread. By default, there is a barrier at the end of the "omp sections"; use the "nowait" clause to turn off the barrier.

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        x_calculation();
        #pragma omp section
        y_calculation();
        #pragma omp section
        z_calculation();
    }
}
SINGLE WORK-SHARING CONSTRUCT
The single construct denotes a block of code that is executed by only one thread. A barrier is implied at the end of the single block.

#pragma omp parallel
{
    do_many_things();
#pragma omp single
    { exchange_boundaries(); }
    do_many_other_things();
}
SUMMARY OF OPENMP'S KEY CONSTRUCTS
The only way to create threads is with the parallel construct:
    #pragma omp parallel
All threads execute the instructions in a parallel construct.
Split work between threads by:
  SPMD: use the thread ID to control execution
  Worksharing constructs to split loops (simple loops only):
    #pragma omp for
  Combined parallel/workshare as a shorthand:
    #pragma omp parallel for
High-level synchronization is safest:
    #pragma omp critical
    #pragma omp barrier