Programming with Shared Memory

Page 1:

Programming with Shared Memory

Page 2:

CONTENT

Introduction
Cilk
TBB
OpenMP

Page 3:

Alternatives for Programming Shared Memory

• Using heavyweight processes
• Using threads. Example: Pthreads
• Using a completely new programming language for parallel programming - not popular. Example: Ada
• Using library routines with an existing sequential programming language
• Modifying the syntax of an existing sequential programming language to create a parallel programming language
• Using an existing sequential programming language supplemented with compiler directives for specifying parallelism. Example: OpenMP

Page 4:

FAMILY TREE

[Figure: family tree of influences on Intel® TBB, timeline 1988-2006. Languages: Chare Kernel (small tasks), Cilk (space-efficient scheduler, cache-oblivious algorithms), Threaded-C (continuation tasks, task stealing). Pragmas: OpenMP (fork/join tasks), OpenMP taskqueue (while & recursion). Libraries: STL (generic programming), STAPL (recursive ranges), JSR-166 (FJTask) containers, ECMA .NET (parallel iteration classes). All feed into Intel® TBB (2006).]

Page 5:

Using Heavyweight Processes

Operating systems are often based upon the notion of a process.

The processor time-shares between processes, switching from one process to another. Switches might occur at regular intervals or when an active process becomes delayed.

This offers the opportunity to deschedule processes blocked from proceeding for some reason, e.g. waiting for an I/O operation to complete.

The concept could be used for parallel programming. It is not much used because of the overhead, but the fork/join concepts are used elsewhere.

Page 6:

FORK-JOIN construct

Page 7:

UNIX System Calls

SPMD model with different code for the master process and the forked slave process.
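A minimal sketch of this fork-join pattern with the UNIX system calls fork() and wait(); the pid test selects the master or slave code path (SPMD style), and the printed messages are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              /* FORK: create a child process */
    if (pid < 0) {
        perror("fork");
        exit(1);
    } else if (pid == 0) {
        printf("slave: forked work\n");   /* child (slave) code */
        exit(0);
    } else {
        printf("master: master work\n");  /* parent (master) code */
        wait(NULL);                  /* JOIN: wait for the child to finish */
    }
    return 0;
}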

Page 8:

Differences between a process and threads

Page 9:

SAS (SHARED ADDRESS SPACE) PROGRAMMING MODEL

[Figure: two threads (processes) within one system share a variable X; each performs read(X) and write(X) on the shared variable.]

Page 10:

Thread-Safe Routines

A routine is thread safe if it can be called simultaneously from multiple threads and still produce correct results.

Standard I/O is thread safe: output messages do not interleave characters.

System calls that return a time may not be thread safe.

Routines that access shared data need to be specially designed to be thread safe.

Page 11:

Accessing Shared Data

Consider two processes, each adding 1 to a shared variable x: each first reads x, then computes x + 1, and finally writes the result back.
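A minimal Pthreads sketch (illustrative thread bodies and counts) that exhibits this read-modify-write race: without a lock, the final count is usually less than 2,000,000 because updates are lost.

#include <pthread.h>
#include <stdio.h>

int x = 0;                           /* shared variable */

void *inc(void *arg) {
    for (int i = 0; i < 1000000; i++)
        x = x + 1;                   /* read x, add 1, write back - not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc, NULL);
    pthread_create(&t2, NULL, inc, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d (expected 2000000)\n", x);   /* typically smaller: lost updates */
    return 0;
}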

Page 12:

Conflict in accessing shared variable

Page 13:

Critical Section

A critical section contains code together with the resources it touches. Establishing a critical section ensures that at any moment only one process accesses a particular resource.

This mechanism is also called mutual exclusion.

Page 14:

Locks

The simplest mutual-exclusion mechanism is the lock.

One kind of lock is a one-bit variable: 1 indicates that a process has entered the critical section; 0 indicates that no process is in the critical section.

It is like a door lock: a process arrives at the "door" of the critical section and, finding it open, enters and locks the door. When it finishes, it unlocks the door and leaves the critical section.

Page 15:

Control of critical sections through busy waiting

Page 16:

Pthread Lock Routines

Locks are implemented in Pthreads with mutually exclusive lock variables, or "mutex" variables:

    pthread_mutex_lock(&mutex1);
    /* critical section */
    pthread_mutex_unlock(&mutex1);

If a thread reaches a mutex lock and finds it locked, it will wait for the lock to open. If more than one thread is waiting for the lock to open when it opens, the system will select one thread to be allowed to proceed. Only the thread that locks a mutex can unlock it.
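A self-contained sketch (illustrative counts) fixing the earlier race with a mutex:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int x = 0;

void *inc(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&mutex1);     /* enter critical section */
        x = x + 1;
        pthread_mutex_unlock(&mutex1);   /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc, NULL);
    pthread_create(&t2, NULL, inc, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);               /* now always 2000000 */
    return 0;
}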

Page 17:

Deadlock

Process P1 locks resource R1 and then requests resource R2, which is locked by P2, while P2 simultaneously requests R1.

Page 18:

Deadlock (deadly embrace)

Deadlock can also occur in a circular chain of locks.

Page 19:

Pthreads

Pthreads offers one routine that can test whether a lock is actually closed without blocking the thread:

    pthread_mutex_trylock()

It will lock an unlocked mutex and return 0, or return EBUSY if the mutex is already locked - this might find a use in overcoming deadlock.
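A small usage sketch (illustrative): instead of blocking, the thread backs off and does other work, which avoids the hold-and-wait condition behind many deadlocks.

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void try_critical(void) {
    if (pthread_mutex_trylock(&m) == 0) {
        /* got the lock: critical section */
        pthread_mutex_unlock(&m);
    } else {
        /* trylock returned EBUSY: lock held - do something else, retry later */
    }
}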

Page 20:

Semaphores

A semaphore is a positive integer (including zero) operated upon by two operations:

P operation on semaphore s: waits until s is greater than zero, then decrements s by one and allows the process to continue.

V operation on semaphore s: increments s by one and releases one of the waiting processes (if any).

Page 21:

P and V operations are performed indivisibly.

The mechanism for activating waiting processes is also implicit in the P and V operations. Though the exact algorithm is not specified, it is expected to be fair. Processes delayed by P(s) are kept in abeyance until released by a V(s) on the same semaphore.
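The P and V operations correspond to sem_wait and sem_post in the POSIX semaphore API; a minimal sketch:

#include <semaphore.h>

sem_t s;

void example(void) {
    sem_init(&s, 0, 1);   /* binary semaphore, initial value 1 */
    sem_wait(&s);         /* P(s): wait until s > 0, then decrement */
    /* critical section */
    sem_post(&s);         /* V(s): increment s, release a waiter */
    sem_destroy(&s);
}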

Page 22:

Mutual exclusion of critical sections can be achieved with one semaphore having the value 0 or 1 (a binary semaphore), which acts as a lock variable, but the P and V operations include a process-scheduling mechanism:

    Process 1            Process 2            Process 3
    Noncritical section  Noncritical section  Noncritical section
    ...                  ...                  ...
    P(s)                 P(s)                 P(s)
    Critical section     Critical section     Critical section
    V(s)                 V(s)                 V(s)
    ...                  ...                  ...
    Noncritical section  Noncritical section  Noncritical section

Page 23:

General semaphore (or counting semaphore)

A general semaphore can take on positive values other than zero and one. It provides, for example, a means of recording the number of "resource units" available or used, and can be used to solve producer/consumer problems.

Page 24:

Monitor

A suite of procedures that provides the only way to access a shared resource. Only one process can use a monitor procedure at any instant. It could be implemented using a semaphore or lock to protect entry, i.e.,

    monitor_proc1() {
        lock(x);
        /* monitor body */
        unlock(x);
        return;
    }

Page 25:

Condition Variables

Often, a critical section is to be executed if a specific global condition exists; for example, if a certain value of a variable has been reached.

With locks, the global variable would need to be examined at frequent intervals ("polled") within a critical section - a very time-consuming and unproductive exercise.

This can be overcome by introducing so-called condition variables.
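A minimal Pthreads sketch of the idea (illustrative names): the waiter sleeps on the condition variable instead of polling, and is woken when the condition becomes true.

#include <pthread.h>

pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
int ready = 0;                        /* the global condition */

void *waiter(void *arg) {
    pthread_mutex_lock(&m);
    while (!ready)                    /* no busy polling */
        pthread_cond_wait(&cv, &m);   /* atomically unlocks m and sleeps */
    /* condition holds here; m is locked again */
    pthread_mutex_unlock(&m);
    return NULL;
}

void *setter(void *arg) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);         /* wake one waiting thread */
    pthread_mutex_unlock(&m);
    return NULL;
}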

Page 26:

Language Constructs for Parallelism: Shared Data

Shared memory variables might be declared as shared with, say,

    shared int x;

Page 27:

par Construct

For specifying concurrent statements:

    par {
        S1;
        S2;
        .
        .
        Sn;
    }

Page 28:

forall Construct

To start multiple similar processes together:

    forall (i = 0; i < n; i++) {
        S1;
        S2;
        .
        .
        Sm;
    }

which generates n processes each consisting of the statements forming the body of the for loop, S1, S2, …, Sm. Each process uses a different value of i.

Page 29:

Example

    forall (i = 0; i < 5; i++)
        a[i] = 0;

clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently.
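forall is a language construct, not standard C; the closest widely available equivalent is an OpenMP worksharing loop (covered later in this deck). A sketch:

    int a[5];
    #pragma omp parallel for
    for (int i = 0; i < 5; i++)
        a[i] = 0;    /* the five iterations may run concurrently */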

Page 30:

DESIGN FOR MULTITHREADING

Good design is critical: bad multithreading can be worse than no multithreading - deadlocks, synchronization bugs, poor performance, etc.

Page 31:

BAD MULTITHREADING

[Figure: a timeline of Thread 1 through Thread 5 illustrating bad multithreading.]

Page 32:

GOOD MULTITHREADING

[Figure: a game architecture as an example of good multithreading - a Main Thread, a Game Thread running physics, animation/skinning, particle systems, networking, and file I/O, plus Rendering Threads.]

Page 33:

ANOTHER PARADIGM: CASCADES

[Figure: Thread 1 through Thread 5 pipelined across Frames 1-4; each frame's stages (Input, Physics, AI, Rendering, Present) cascade through successive threads.]

Advantages: synchronization points are few and well-defined.

Disadvantages: increases latency (for constant frame rate); needs simple (one-way) data flow.

Page 34:

MULTITHREADED PROGRAMMING IN CILK

Page 35:

CONTENT

• Introduction
• Inlets
• Abort

Page 36:

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

CILK IN ONE SLIDE

Cilk extends the C language to support parallelism without changing the original serial semantics. Its fork-join style of task creation is very well suited to recursive algorithms (e.g. branch-and-bound), and it has a solid theoretical foundation … its performance can be proven.

Basic keywords:
    cilk    Marks a function as a "cilk" function that can be spawned
    spawn   Spawns a cilk function … only 2 to 5 times the cost of a regular function call
    sync    Wait until immediately spawned children functions return

Advanced keywords:
    inlet       Define a function to handle return values from a cilk task
    cilk_fence  A portable memory fence
    abort       Terminate all currently existing spawned tasks

Page 37:

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

RECURSION IS AT THE HEART OF CILK

Spawning new tasks in Cilk is very convenient: instead of looping, you recursively spawn many tasks, creating a nested queue of tasks; the scheduler uses work stealing to keep all the cores busy.

With Cilk, the programmer worries about expressing concurrency, not the details of how it is implemented.

Page 38:

INTRODUCTION

A Cilk program is a set of procedures.

A procedure is a sequence of threads.

Cilk threads are represented by nodes in the DAG.

Page 39:

FIBONACCI - AN EXAMPLE

C code:

    int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = fib(n-1);
            y = fib(n-2);
            return (x+y);
        }
    }

Cilk code:

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

Cilk provides no new data types.

Page 40:

BASIC CILK KEYWORDS

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

cilk: declares a Cilk function or procedure, which can be spawned in parallel.

spawn: spawns a child thread, which can execute in parallel with the parent.

sync: control cannot pass this point until all spawned children have returned.

Page 41:

DYNAMIC MULTITHREADING

    cilk int fib (int n) {
        if (n<2) return (n);
        else {
            int x, y;
            x = spawn fib(n-1);
            y = spawn fib(n-2);
            sync;
            return (x+y);
        }
    }

The computation DAG unfolds dynamically.

[Figure: the spawn tree for fib(4) - node 4 spawns 3 and 2; 3 spawns 2 and 1; each 2 spawns 1 and 0.]

Page 42:

MULTITHREADED COMPUTATION

• A directed acyclic graph G = (V, E) represents the parallel instruction stream.
• Each vertex v represents a (Cilk) thread: a maximal sequence of instructions containing no parallel control instructions (spawn, sync, return).
• Each edge e is either a spawn edge, a return edge, or a continue edge.

[Figure: a DAG from an initial thread to a final thread, with spawn, return, and continue edges labelled.]

Page 43:

CACTUS STACK

[Figure: procedures A-E and the views of the stack each sees - A; A B; A C; A C D; A C E - forming a cactus stack.]

Cilk supports C's rule for pointers: a pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.)

Cilk's cactus stack supports several views in parallel.

Page 44:

OPERATING ON RETURNED VALUES

• Cilk achieves this functionality using an internal function, called an inlet, which is executed as a secondary thread on the parent frame when the child returns.

• The inlet keyword defines a void internal function to be an inlet.

    x += spawn foo(a,b,c);

Page 45:

SEMANTICS OF INLETS

    cilk int fib (int n)
    {
        int x = 0;
        inlet void summer (int result)
        {
            x += result;
            return;
        }

        if (n<2) return n;
        else {
            summer(spawn fib (n-1));
            summer(spawn fib (n-2));
            sync;
            return (x);
        }
    }

Page 46:

1. The Cilk procedure fib(i) is spawned.
2. Control passes to the next statement.
3. When fib(i) returns, summer() is invoked.

Page 47:

SEMANTICS OF INLETS

• In the current implementation of Cilk, the inlet definition may not contain a spawn, and only the first argument of the inlet may be spawned at the call site.

Page 48:

IMPLICIT INLETS

    cilk int wfib(int n) {
        if (n == 0) {
            return 0;
        } else {
            int i, x = 1;
            for (i=0; i<=n-2; i++) {
                x += spawn wfib(i);
            }
            sync;
            return x;
        }
    }

For assignment operators, the Cilk compiler automatically generates an implicit inlet.

Page 49:

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

COMMON PATTERN FOR CILK

Consider a program containing a loop:

    void vadd (real *A, real *B, int n) {
        int i;
        for (i=0; i<n; i++) A[i] += B[i];
    }

• Convert it to a recursive structure … split the range in half until each piece is small enough:

    void vadd (real *A, real *B, int n) {
        if (n<MIN) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            vadd(A, B, n/2);
            vadd(A+n/2, B+n/2, n-n/2);
        }
    }

• Then add the Cilk keywords:

    cilk void vadd (real *A, real *B, int n) {
        if (n<MIN) {
            int i;
            for (i=0; i<n; i++) A[i] += B[i];
        } else {
            spawn vadd(A, B, n/2);
            spawn vadd(A+n/2, B+n/2, n-n/2);
            sync;
        }
    }

Page 50:

COMPUTING A PRODUCT

$p = \prod_{i=0}^{n-1} A_i$

    int product(int *A, int n) {
        int i, p=1;
        for (i=0; i<n; i++) {
            p *= A[i];
        }
        return p;
    }

Optimization: if a partial result is 0, terminate the computation.

Page 51:

COMPUTING A PRODUCT

$p = \prod_{i=0}^{n-1} A_i$

    int product(int *A, int n) {
        int i, p=1;
        for (i=0; i<n; i++) {
            p *= A[i];
            if (p == 0) break;
        }
        return p;
    }

Optimization: if a partial result is 0, terminate the computation.

Page 52:

COMPUTING A PRODUCT IN PARALLEL

$p = \prod_{i=0}^{n-1} A_i$

    cilk int prod(int *A, int n) {
        int p = 1;
        if (n == 1) {
            return A[0];
        } else {
            p *= spawn product(A, n/2);
            p *= spawn product(A+n/2, n-n/2);
            sync;
            return p;
        }
    }

How do we terminate early?

Page 53:

CILK'S ABORT FEATURE

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

1. Recode the implicit inlet to make it explicit.

Page 54:

CILK'S ABORT FEATURE

2. Check for 0 within the inlet.

Page 55:

CILK'S ABORT FEATURE

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            if (p == 0) {
                abort;   /* Aborts existing children, */
            }            /* but not future ones. */
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

2. Check for 0 within the inlet.

Page 56:

CILK'S ABORT FEATURE

Page 57:

CILK'S ABORT FEATURE

    cilk int product(int *A, int n) {
        int p = 1;
        inlet void mult(int x) {
            p *= x;
            if (p == 0) {
                abort;   /* Aborts existing children, */
            }            /* but not future ones. */
            return;
        }
        if (n == 1) {
            return A[0];
        } else {
            mult( spawn product(A, n/2) );
            if (p == 0) {   /* Don't spawn if we've */
                return 0;   /* already aborted! */
            }
            mult( spawn product(A+n/2, n-n/2) );
            sync;
            return p;
        }
    }

Page 58:

MUTUAL EXCLUSION

Cilk's solution to mutual exclusion is no better than anybody else's.

Cilk provides a library of locks declared with Cilk_lockvar.

• To avoid deadlock with the Cilk scheduler, a lock should only be held within a Cilk thread.
• I.e., spawn and sync should not be executed while a lock is held.

Page 59:

LOCKING

Cilk_lockvar data type:

    #include <cilk-lib.h>

    Cilk_lockvar mylock;

    {
        Cilk_lock_init(mylock);
        Cilk_lock(mylock);    /* begin critical section */
        ...
        Cilk_unlock(mylock);  /* end critical section */
    }

Page 60:

KEY IDEAS

Cilk is simple: cilk, spawn, sync, SYNCHED, inlet, abort.

JCilk is simpler.

Work & span.

Page 61:

INTEL'S THREADING BUILDING BLOCKS

Page 62:

THREADING BUILDING BLOCKS LIBRARY CHARACTERISTICS

• C++ library.
• Targets threading for performance (designed to parallelize computationally intensive work).
• Is compatible with other threading packages.
• Emphasizes scalable data-parallel programming.
• Specifies templates and tasks instead of threads - the library schedules tasks onto threads and manages load balancing.

Page 63:

COMPONENTS OF TBB (VERSION 2.1)

Parallel algorithms: parallel_for (improved), parallel_reduce (improved), parallel_do (new), pipeline (improved), parallel_sort, parallel_scan

Concurrent containers: concurrent_hash_map, concurrent_queue, concurrent_vector (all improved)

Synchronization primitives: atomic operations, various flavors of mutexes (improved)

Task scheduler: with new functionality

Memory allocators: tbb_allocator (new), cache_aligned_allocator, scalable_allocator

Utilities: tick_count, tbb_thread (new)

Page 64:

C++ REVIEW: FUNCTION TEMPLATE

A type-parameterized function: strongly typed, obeys scope rules, actual arguments evaluated exactly once, not redundantly instantiated.

    template<typename T>
    void swap( T& x, T& y ) {
        T z = x;
        x = y;
        y = z;
    }

    void reverse( float* first, float* last ) {
        while( first < last-1 ) swap( *first++, *--last );
    }

The compiler instantiates template swap with T=float. [first,last) defines a half-open interval.

Page 65:

GENERICITY OF SWAP

    template<typename T>
    void swap( T& x, T& y ) {
        T z = x;   // Construct z
        x = y;     // Assignment
        y = z;     // Assignment
    }              // Destroy z

Requirements for T:

    T(const T&)                    Copy constructor
    void T::operator=(const T&)    Assignment
    ~T()                           Destructor

Page 66:

C++ REVIEW: TEMPLATE CLASS

A type-parameterized class:

    template<typename T, typename U>
    class pair {
    public:
        T first;
        U second;
        pair( const T& x, const U& y ) : first(x), second(y) {}
    };

    pair<string,int> x;
    x.first = "abc";
    x.second = 42;

The compiler instantiates template pair with T=string and U=int.

Page 67:

TBB LIBRARY ALGORITHM - PARALLEL_FOR

parallel_for is a template function provided by the library:

    template <typename Range, typename Body>
    void parallel_for(const Range& range, const Body& body);

Requirements for a Range R:

    R(const R&)                   Copy a range
    R::~R()                       Destroy a range
    bool R::empty() const         Is range empty?
    bool R::is_divisible() const  Can range be split?
    R::R(R& r, split)             Split r into two subranges

The library provides blocked_range, blocked_range2d, blocked_range3d; the programmer can define new kinds of ranges.

Page 68:

REQUIREMENTS FOR FUNCTOR

    template <typename Range, typename Functor>
    void parallel_for(const Range& range, const Functor& func);

Requirements for a Functor F:

    F::F( const F& )                      Copy constructor
    F::~F()                               Destructor
    void F::operator()(Range& subrange)   Apply F to subrange

Page 69:

EXAMPLE - PARALLEL_FOR

• Example: concurrently apply a function to each element in an array.

• Serial version:

    void SerialApplyFoo( float a[], size_t n ) {
        for( size_t i=0; i<n; ++i )
            Foo(a[i]);
    }

• The iteration space is 0…(n-1).

Page 70:

TBB LIBRARY ALGORITHM - PARALLEL_FOR, CONTINUED

The parallel version requires two steps. First, define a body class:

    #include "tbb/blocked_range.h"

    class ApplyFoo {
        float *const my_a;
    public:
        ApplyFoo( float a[] ) : my_a(a) {}
        void operator()( const blocked_range<size_t>& r ) const {
            float *a = my_a;
            for( size_t i=r.begin(); i!=r.end(); ++i )
                Foo(a[i]);
        }
    };

Page 71:

TBB LIBRARY ALGORITHM - PARALLEL_FOR, CONTINUED

Second, invoke parallel_for, which breaks the iteration space into chunks, each of which is run on a separate thread:

    #include "tbb/parallel_for.h"

    void ParallelApplyFoo( float a[], size_t n ) {
        parallel_for( blocked_range<size_t>(0,n,IdealGrainSize),
                      ApplyFoo(a) );
    }

blocked_range<T>(begin,end,grainsize) is a recursively divisible range. grainsize specifies the number of iterations in a "reasonable size" chunk to deal out to a processor. If the iteration space has more than grainsize iterations, parallel_for splits it into separate subranges that are scheduled separately. operator() processes a chunk.
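Later TBB versions also accept a lambda-based overload, which removes the hand-written body class; a sketch (Foo as on the previous slide, default partitioning):

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    using namespace tbb;

    void ParallelApplyFoo( float a[], size_t n ) {
        parallel_for( blocked_range<size_t>(0,n),
            [=]( const blocked_range<size_t>& r ) {
                for( size_t i=r.begin(); i!=r.end(); ++i )
                    Foo(a[i]);   // same body, expressed inline
            } );
    }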

Page 72:

TBB LIBRARY - ALGORITHMS

• parallel_reduce - math ops with elements of an array in parallel
• parallel_do - loops over iteration spaces of indeterminate length
• parallel_* - several others…
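A minimal parallel_reduce sketch (TBB 2.1-style body class, names illustrative) that sums an array; the splitting constructor creates a body for a stolen subrange and join() combines partial results:

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    using namespace tbb;

    struct SumFoo {
        float sum;
        const float* my_a;
        SumFoo( const float a[] ) : sum(0), my_a(a) {}
        SumFoo( SumFoo& s, split ) : sum(0), my_a(s.my_a) {}  // splitting constructor
        void operator()( const blocked_range<size_t>& r ) {
            for( size_t i=r.begin(); i!=r.end(); ++i )
                sum += my_a[i];
        }
        void join( SumFoo& rhs ) { sum += rhs.sum; }  // combine partial sums
    };

    float ParallelSum( const float a[], size_t n ) {
        SumFoo sf(a);
        parallel_reduce( blocked_range<size_t>(0,n), sf );
        return sf.sum;
    }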

Page 73:

WORK STEALING

[Figure: four threads, each with its own deque of tasks and a mailbox.]

Each thread chooses work in this order:

0. Override: do an explicitly specified task.
1. Locality: take the youngest task from my own deque.
2. Cache affinity: steal a task advertised in my mailbox.
3. Load balance: steal the oldest task from a random victim.

Page 74:

HOW THIS WORKS

[Figure: a range in a thread's deque - split range… recursively… until grainsize.]

Page 75:

WORK DEPTH FIRST; STEAL BREADTH FIRST

[Figure: a victim thread's task tree relative to its L1 and L2 caches. The best choice for theft is the oldest task: it is a big piece of work and its data is far from the victim's hot data. The next-oldest task is the second-best choice.]

Page 76:

PARALLEL SORT EXAMPLE (WITH WORK STEALING): QUICKSORT - STEP 1

    tbb::parallel_sort (color, color+64);

[Figure: 64 unsorted values.] Thread 1 starts with the initial data.

Page 77:

QUICKSORT - STEP 2

[Figure: Thread 1 partitions/splits its data around the pivot 37, leaving two subranges.]

Page 78:

QUICKSORT - STEP 2 (CONTINUED)

[Figure: Thread 2 gets work by stealing from Thread 1.]

Page 79:

QUICKSORT - STEP 3

[Figure: Thread 1 partitions/splits its data around pivot 7; Thread 2 partitions/splits its data around pivot 49.]

Page 80:

QUICKSORT - STEP 3 (CONTINUED)

[Figure: Thread 3 gets work by stealing from Thread 1; Thread 4 gets work by stealing from Thread 2.]

Page 81:

QUICKSORT - STEP 4

[Figure: Thread 1 sorts the rest of its data; Thread 2 sorts the rest of its data; Thread 3 partitions/splits its data; Thread 4 sorts the rest of its data.]

Page 82:

QUICKSORT - STEP 5

[Figure: Thread 1 gets more work by stealing from Thread 3; Thread 3 sorts the rest of its data.]

Page 83:

QUICKSORT - STEP 6

[Figure: Thread 1 partitions/splits its data.]

Page 84:

QUICKSORT - STEP 6 (CONTINUED)

[Figure: Thread 2 gets more work by stealing from Thread 1; Thread 1 sorts the rest of its data.]

Page 85:

QUICKSORT - STEP 7

[Figure: Thread 2 sorts the rest of its data. All 64 values, 0-63, are now in order. DONE.]

Page 86:

TBB LIBRARY - PIPELINE

TBB implements the pipeline pattern: data flows through a series of pipeline stages, and each stage processes the data in some way.

Page 87:

PARALLEL PIPELINE

[Figure: numbered items flowing through a serial stage, a parallel stage, and another serial stage.]

• A parallel stage scales because it can process items in parallel or out of order.
• A serial stage processes items one at a time, in order; items wait their turn in a serial stage.
• Incoming items are tagged with sequence numbers, and sequence numbers are used to recover order for a serial stage.
• Excessive parallelism is controlled by limiting the total number of items flowing through the pipeline.
• Throughput is limited by the throughput of the slowest serial stage.

Page 88:

EXAMPLE

Sample problem: read a text file (sequential), capitalize the first letter of each word (parallel), and write the modified text to a new file (sequential).

Page 89:

TBB LIBRARY - PIPELINE, CONTINUED

    // Create the pipeline
    tbb::pipeline pipeline;

    // Create the file-reading stage and add it to the pipeline
    MyInputFilter input_filter( input_file );
    pipeline.add_filter( input_filter );

    // Create the capitalization stage and add it to the pipeline
    MyTransformFilter transform_filter;
    pipeline.add_filter( transform_filter );

    // Create the file-writing stage and add it to the pipeline
    MyOutputFilter output_filter( output_file );
    pipeline.add_filter( output_filter );

    // Run the pipeline
    pipeline.run( MyInputFilter::n_buffer );

    // Must remove filters from pipeline before they are implicitly destroyed.
    pipeline.clear();

Page 90:

TBB - PIPELINE, CONTINUED

    // Filter that writes each buffer to a file.
    class MyOutputFilter: public tbb::filter {
        FILE* my_output_file;
    public:
        MyOutputFilter( FILE* output_file );
        /*override*/ void* operator()( void* item );
    };

    MyOutputFilter::MyOutputFilter( FILE* output_file )
        : tbb::filter(serial), my_output_file(output_file) { }

    void* MyOutputFilter::operator()( void* item ) {
        MyBuffer& b = *static_cast<MyBuffer*>(item);
        fwrite( b.begin(), 1, b.size(), my_output_file );
        return NULL;
    }

Page 91:

TBB - PIPELINE, CONTINUED

    // Changes the first letter of each word from lower case to upper case.
    class MyTransformFilter: public tbb::filter {
    public:
        MyTransformFilter();
        /*override*/ void* operator()( void* item );
    };

    MyTransformFilter::MyTransformFilter()
        : tbb::filter(parallel) {}

    /*override*/ void* MyTransformFilter::operator()( void* item ) {
        // a for loop and toupper() go here…
    }

Page 92:

TBB - TIMING

The tick_count class:

    using namespace tbb;

    void Foo() {
        tick_count t0 = tick_count::now();
        // ...action being timed...
        tick_count t1 = tick_count::now();
        printf("time for action = %g seconds\n", (t1-t0).seconds() );
    }

Page 93:

TBB: TASK SCHEDULER

Intel propaganda: the task scheduler is the engine that powers the loop templates. When practical, use the loop templates instead of the task scheduler, because the templates hide the complexity of the scheduler.

However, if you have an algorithm that does not naturally map onto one of the high-level templates, use the task scheduler. All of the scheduler functionality that is used by the high-level templates is available for you to use directly, so you can build new high-level templates that are just as powerful as the existing ones.

Page 94:

TBB - TASK SCHEDULER

Maps tasks to threads; handles load balancing and scheduling; hides threading details - just think in terms of tasks. Any task using the task scheduler must have an initialized tbb::task_scheduler_init object.

    #include "tbb/task_scheduler_init.h"
    using namespace tbb;

    int main() {
        task_scheduler_init init;
        ...
        return 0;
    }

Page 95:

EXAMPLE: NAIVE FIBONACCI CALCULATION

Recursion is typically used to calculate the Fibonacci number F(n) = F(n-1) + F(n-2):

    long SerialFib( long n ) {
        if( n<2 ) return n;
        else return SerialFib(n-1) + SerialFib(n-2);
    }

Page 96:

EXAMPLE: NAIVE FIBONACCI CALCULATION

The Fibonacci computation can be envisioned as a task graph:

[Figure: SerialFib(4) at the root, with children SerialFib(3) and SerialFib(2), recursively expanding down to SerialFib(1) and SerialFib(0) leaves.]

Page 97:

FIBONACCI - TASK SPAWNING SOLUTION

Use TBB tasks to thread the creation and execution of the task graph: create a new root task (allocate the task object, construct the task), then spawn it and wait for completion.

    long ParallelFib( long n ) {
        long sum;
        FibTask& a = *new(task::allocate_root()) FibTask(n,&sum);
        task::spawn_root_and_wait(a);
        return sum;
    }

Page 98:

FIBONACCI - TASK SPAWNING SOLUTION

    class FibTask: public task {     // derived from the TBB task class
    public:
        const long n;
        long* const sum;
        FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
        task* execute() {            // overrides virtual function task::execute
            if( n<CutOff ) {
                *sum = SerialFib(n);
            } else {
                long x, y;
                // create new child tasks to compute the (n-1)th and (n-2)th Fibonacci numbers
                FibTask& a = *new( allocate_child() ) FibTask(n-1,&x);
                FibTask& b = *new( allocate_child() ) FibTask(n-2,&y);
                // the reference count is used to know when spawned tasks have completed;
                // it must be set before spawning any children
                set_ref_count(3);    // 3 = 2 children + 1 for wait
                spawn( b );                  // spawn task; return immediately (can be scheduled at any time)
                spawn_and_wait_for_all( a ); // spawn task; block until all children have completed
                *sum = x+y;
            }
            return NULL;
        }
    };

The execute method does the computation of a task.

Page 99:

CONCURRENT CONTAINERS

The TBB library provides highly concurrent containers. STL containers are not concurrency-friendly: an attempt to modify them concurrently can corrupt the container.

Standard practice is to wrap a lock around STL containers, but that turns the container into a serial bottleneck.

The library provides fine-grained locking or lockless implementations: worse single-thread performance, but better scalability. They can be used with the library, OpenMP, or native threads.

Page 100:

TBB - CONTAINERS

• concurrent_hash_map<Key,T,HashCompare>
• concurrent_queue<T,Allocator>
• concurrent_vector<T,Allocator>

These support concurrent access, concurrent operations, and parallel iteration. The TBB library retains control over memory allocation.

Page 101:

CONCURRENCY-FRIENDLY INTERFACES

Some STL interfaces are inherently not concurrency-friendly. For example, suppose two threads each execute:

    extern std::queue q;

    if(!q.empty()) {     // at this instant, another thread might pop the last element
        item=q.front();
        q.pop();
    }

Solution: concurrent_queue has pop_if_present.

Page 102:

CONCURRENT QUEUE CONTAINER

concurrent_queue<T>

• Preserves local FIFO order: if one thread pushes two values and another thread pops those two values, they come out in the same order that they went in.
• Method push(const T&) places a copy of the item on the back of the queue.
• Two kinds of pops: blocking pop(T&) and non-blocking pop_if_present(T&).
• Method size() returns a signed integer - the difference between pushes and pops. If size() returns -n, it means n pops await corresponding pushes.
• Method empty() returns size() == 0; it may return true even when there are pending pop() calls.

Page 103:

CONCURRENT QUEUE CONTAINER EXAMPLE

A simple example to enqueue and print integers: construct the queue, push items onto it, then, while the queue is non-empty, pop an item off and print it.

    #include "tbb/concurrent_queue.h"
    #include <stdio.h>
    using namespace tbb;

    int main () {
        concurrent_queue<int> queue;
        int j;

        for (int i = 0; i < 10; i++)
            queue.push(i);

        while (!queue.empty()) {
            queue.pop(j);
            printf("from queue: %d\n", j);
        }
        return 0;
    }

Page 104:

TBB - ALLOCATION

tbb_allocator<T>: allocates and frees memory via the TBB malloc library if available; otherwise it reverts to using malloc and free.

scalable_allocator<T>: allocates and frees memory in a way that scales with the number of processors.

others…

Page 105:

SCALABLE MEMORY ALLOCATORS

Serial memory allocation can easily become a bottleneck in multithreaded applications, because threads require mutual exclusion into the shared heap.

TBB offers two choices for scalable memory allocation, similar to the STL template class std::allocator:

• scalable_allocator - offers scalability, but not protection from false sharing; memory is returned to each thread from a separate pool.
• cache_aligned_allocator - offers both scalability and false-sharing protection.

Page 106:

METHODS FOR SCALABLE_ALLOCATOR

    #include "tbb/scalable_allocator.h"
    template<typename T> class scalable_allocator;

Scalable versions of malloc, free, realloc, calloc:

    void *scalable_malloc( size_t size );
    void  scalable_free( void *ptr );
    void *scalable_realloc( void *ptr, size_t size );
    void *scalable_calloc( size_t nobj, size_t size );

STL allocator functionality:

    T* A::allocate( size_type n, void* hint=0 )    Allocate space for n values
    void A::deallocate( T* p, size_t n )           Deallocate n values from p
    void A::construct( T* p, const T& value )
    void A::destroy( T* p )

Page 107:

SCALABLE ALLOCATORS EXAMPLE

    #include "tbb/scalable_allocator.h"

    // Use the TBB scalable allocator for the STL basic_string class
    typedef char _Elem;
    typedef std::basic_string<_Elem,
        std::char_traits<_Elem>,
        tbb::scalable_allocator<_Elem>> MyString;
    . . .
    {
        . . .
        int *p;
        MyString str1 = "qwertyuiopasdfghjkl";
        MyString str2 = "asdfghjklasdfghjkl";

        // Use the TBB scalable allocator to allocate 24 integers
        p = tbb::scalable_allocator<int>().allocate(24);
        . . .
    }

Page 108:

TBB: SYNCHRONIZATION PRIMITIVES

Parallel tasks must sometimes touch shared data; when data updates might overlap, use mutual exclusion to avoid a race.

• A high-level generic abstraction for hardware atomic operations atomically protects the update of a single variable.
• Critical regions of code are protected by scoped locks: the range of the lock is determined by its lifetime (scope), and leaving the lock's scope calls the destructor, making it exception-safe. Minimizing lock lifetime avoids possible contention.
• Several mutex behaviors are available:
  - Spin vs. queued ("are we there yet" vs. "wake me when we get there")
  - Writer vs. reader/writer (supports multiple readers/single writer)
  - Scoped wrapper of a native mutual-exclusion function

Page 109:

ATOMIC EXECUTION

atomic<T>: T should be an integral type or pointer type; full type-safe support for 8-, 16-, 32-, and 64-bit integers.

    atomic<int> i;
    . . .
    int z = i.fetch_and_add(2);

Operations:

    '= x' and 'x ='             read/write the value of x
    x.fetch_and_store(y)        z = x, x = y, return z
    x.fetch_and_add(y)          z = x, x += y, return z
    x.compare_and_swap(y,p)     z = x, if (x==p) x=y; return z
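A small sketch (illustrative) of atomic<T> replacing a lock for a shared event counter:

    #include "tbb/atomic.h"
    #include <cstdio>

    tbb::atomic<int> counter;   // zero-initialized shared counter

    void record_event() {
        counter.fetch_and_add(1);   // atomic increment; no mutex needed
    }

    int main() {
        // ... many threads may call record_event() concurrently ...
        std::printf("events = %d\n", (int)counter);  // plain read of the value
        return 0;
    }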

Page 110:

SUMMARY

Intel® Threading Building Blocks is a parallel programming model for C++ applications, used for computationally intense code, with a focus on data-parallel programming.

Intel® Threading Building Blocks provides:
• Generic parallel algorithms
• Highly concurrent containers
• Low-level synchronization primitives
• A task scheduler that can be used directly

Page 111:

SHARED MEMORY PROGRAMMING WITH OPENMP

Page 112:

INTRODUCTION TO OPENMP

What is OpenMP?

• Open specification for Multi-Processing
• "Standard" API for defining multi-threaded shared-memory programs
• openmp.org - talks, examples, forums, etc.

A high-level API:
• Preprocessor (compiler) directives (~80%)
• Library calls (~19%)
• Environment variables (~1%)

Page 113:

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

OPENMP OVERVIEW

OpenMP: An API for Writing Multithreaded Applications

• A set of compiler directives and library routines for parallel application programmers
• Makes writing multi-threaded applications in Fortran, C and C++ as easy as we can make it
• Standardizes the last 20 years of SMP practice

[Figure: a cloud of example OpenMP constructs - #pragma omp parallel for private(A, B), #pragma omp critical, C$OMP parallel do shared(a, b, c), C$OMP PARALLEL REDUCTION (+: A, B), omp_set_lock(lck), call OMP_INIT_LOCK (ilok), call omp_test_lock(jlok), setenv OMP_SCHEDULE "dynamic", CALL OMP_SET_NUM_THREADS(10), C$OMP DO lastprivate(XX), C$OMP ORDERED, C$OMP SINGLE PRIVATE(X), C$OMP SECTIONS, C$OMP MASTER, C$OMP ATOMIC, C$OMP FLUSH, C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C), C$OMP THREADPRIVATE(/ABC/), C$OMP PARALLEL COPYIN(/blk/), Nthrds = OMP_GET_NUM_PROCS(), !$OMP BARRIER.]

* The name "OpenMP" is the property of the OpenMP Architecture Review Board.

Page 114:

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

THE ESSENCE OF OPENMP

• Create threads that execute in a shared address space: the only way to create threads is with the "parallel construct"; once created, all threads execute the code inside the construct.
• Split up the work between threads by one of two means:
  - SPMD (Single Program Multiple Data) … all threads execute the same code, and you use the thread ID to assign work to a thread.
  - Workshare constructs split up loops and tasks between threads.
• Manage the data environment to avoid data access conflicts: use synchronization so correct results are produced regardless of how threads are scheduled, and carefully manage which data is private (local to each thread) and which is shared.

Page 115:

A PROGRAMMER'S VIEW OF OPENMP

OpenMP is a portable, threaded, shared-memory programming specification with "light" syntax. Exact behavior depends on the OpenMP implementation! It requires compiler support (C or Fortran).

OpenMP will:
• Allow a programmer to separate a program into serial regions and parallel regions, rather than T concurrently-executing threads
• Hide stack management
• Provide synchronization constructs

OpenMP will not:
• Parallelize automatically
• Guarantee speedup
• Provide freedom from data races

Page 116:

MOTIVATION - OPENMP

    int main() {

        // Do this part in parallel
        printf( "Hello, World!\n" );

        return 0;
    }

Page 117:

MOTIVATION - OPENMP

    int main() {
        omp_set_num_threads(16);

        // Do this part in parallel
        #pragma omp parallel
        {
            printf( "Hello, World!\n" );
        }

        return 0;
    }

Page 118:

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

OPENMP EXECUTION MODEL

Fork-Join Parallelism:
• The master thread spawns a team of threads as needed.
• Parallelism is added incrementally until the performance goals are met, i.e. the sequential program evolves into a parallel program.

[Figure: a master thread (in red) running through sequential parts, forking thread teams for parallel regions, including a nested parallel region.]

Page 119:

PROGRAMMING MODEL - CONCURRENT LOOPS

OpenMP easily parallelizes loops. It requires that there be no data dependencies (read/write or write/write pairs) between iterations! The preprocessor calculates the loop bounds for each thread directly from the serial source:

    #pragma omp parallel for
    for( i=0; i < 25; i++ ) {
        printf("Foo");
    }

Page 120:

PROGRAMMING MODEL - LOOP SCHEDULING

The schedule clause determines how loop iterations are divided among the thread team:

• static([chunk]) divides iterations statically between threads. Each thread receives [chunk] iterations, rounding as necessary to account for all iterations. The default [chunk] is ceil(# iterations / # threads).
• dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes. This forms a logical work queue consisting of all loop iterations. The default [chunk] is 1.
• guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation.
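A small usage sketch (the loop body work(i) is illustrative): dynamic scheduling helps when iteration costs vary.

    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        work(i);   /* iterations handed out 4 at a time as threads finish */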

Page 121:

PROGRAMMING MODEL - DATA SHARING

Parallel programs often employ two types of data: shared data, visible to all threads and similarly named; and private data, visible to a single thread (often stack-allocated).

• In Pthreads: global-scoped variables are shared; stack-allocated variables are private.
• In OpenMP: shared variables are shared; private variables are private.

    // shared, globals
    int bigdata[1024];

    void* foo(void* bar) {
        // private, stack
        int tid;

        #pragma omp parallel \
            shared ( bigdata ) \
            private ( tid )
        {
            /* Calculation goes here */
        }
    }

Page 122:

PROGRAMMING MODEL - SYNCHRONIZATION

• OpenMP critical sections - named or unnamed, no explicit locks/mutexes:

    #pragma omp critical
    { /* Critical code here */ }

• Barrier directives:

    #pragma omp barrier

• Explicit lock functions - when all else fails (may require the flush directive):

    omp_set_lock( lock l );
    /* Code goes here */
    omp_unset_lock( lock l );

• Single-thread regions within parallel regions - master, single directives:

    #pragma omp single
    { /* Only executed once */ }

Page 123:

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen

EXAMPLE PROBLEM: NUMERICAL INTEGRATION

Mathematically, we know that

$\int_0^1 \frac{4.0}{1+x^2}\,dx = \pi$

We can approximate the integral as a sum of rectangles,

$\sum_{i=0}^{N} F(x_i)\,\Delta x \approx \pi$

where each rectangle has width $\Delta x$ and height $F(x_i)$ at the middle of interval $i$.

[Figure: the curve $F(x) = 4.0/(1+x^2)$ on $[0,1]$, approximated by rectangles.]

Page 124: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 131

PI PROGRAM: AN EXAMPLE

    static long num_steps = 100000;
    double step;
    void main ()
    {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        x = 0.5 * step;
        for (i = 0; i < num_steps; i++) {
            sum += 4.0/(1.0 + x*x);
            x += step;
        }
        pi = step * sum;
    }

Page 125: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 132

PI PROGRAM: IDENTIFY CONCURRENCY

    static long num_steps = 100000;
    double step;
    void main ()
    {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        x = 0.5 * step;
        for (i = 0; i < num_steps; i++) {
            sum += 4.0/(1.0 + x*x);
            x += step;
        }
        pi = step * sum;
    }

The loop iterations can in principle be executed concurrently.

Page 126: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 133

PI PROGRAM: EXPOSE CONCURRENCY, PART 1

    static long num_steps = 100000;
    double step;
    void main ()
    {
        double pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        int i; double x;
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5)*step;
            sum += 4.0/(1.0 + x*x);
        }
        pi = step * sum;
    }

Isolate data that must be shared from data local to a task. Redefining x as (i+0.5)*step removes the loop-carried dependence. The accumulation into sum is called a reduction: results from each iteration are accumulated into a single global.

Page 127: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 134

PI PROGRAM: EXPOSE CONCURRENCY, PART 2: DEAL WITH THE REDUCTION

    static long num_steps = 100000;
    #define NUM 4   // expected max thread count
    double step;
    void main ()
    {
        double pi, sum[NUM] = {0.0};
        step = 1.0/(double) num_steps;
        int i, ID = 0; double x;
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5)*step;
            sum[ID] += 4.0/(1.0 + x*x);
        }
        pi = 0.0;
        for (i = 0; i < NUM; i++)
            pi += step * sum[i];
    }

Common trick: promote the scalar sum to an array indexed by the thread number, creating thread-local copies of shared data.

Page 128: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 135

PI PROGRAM: EXPRESS CONCURRENCY USING OPENMP

    #include <omp.h>
    static long num_steps = 100000;
    #define NUM 4
    double step;
    void main ()
    {
        double pi, sum[NUM] = {0.0};
        step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)
        {
            int i, ID; double x;
            ID = omp_get_thread_num();
            for (i = ID; i < num_steps; i += NUM) {
                x = (i + 0.5)*step;
                sum[ID] += 4.0/(1.0 + x*x);
            }
        }
        pi = 0.0;
        for (int i = 0; i < NUM; i++)
            pi += step * sum[i];
    }

• The parallel construct creates NUM threads; each thread executes the code in the parallel block.
• A simple modification to the loop deals out iterations to the threads.
• Variables defined inside a parallel region are private to each thread; automatic variables defined outside a parallel region are shared between threads.

Page 129: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 136

PI PROGRAM: FIXING THE NUM THREADS BUG

    #include <omp.h>
    static long num_steps = 100000;
    #define NUM 4
    double step;
    void main ()
    {
        double pi, sum[NUM] = {0.0};
        step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)
        {
            int nthreads = omp_get_num_threads();
            int i, ID; double x;
            ID = omp_get_thread_num();
            for (i = ID; i < num_steps; i += nthreads) {
                x = (i + 0.5)*step;
                sum[ID] += 4.0/(1.0 + x*x);
            }
        }
        pi = 0.0;
        for (int i = 0; i < NUM; i++)
            pi += step * sum[i];
    }

NUM is a requested number of threads, but the OS can choose to give you fewer. Hence you need to add a bit of code to get the actual number of threads.

Page 130: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 137

INCREMENTAL PARALLELISM

Software development with incremental parallelism:

• Apply behavior-preserving transformations to expose concurrency.
• Express concurrency incrementally by adding OpenMP directives: in a large program you can do this loop by loop, evolving the original program into a parallel OpenMP program.
• Build and time the program, then optimize as needed with behavior-preserving transformations until you reach the desired performance.

Page 131: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 138

PI PROGRAM: EXECUTE CONCURRENCY

    #include <omp.h>
    static long num_steps = 100000;
    #define NUM 4
    double step;
    void main ()
    {
        double pi, sum[NUM] = {0.0};
        step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)
        {
            int nthreads = omp_get_num_threads();
            int i, ID; double x;
            ID = omp_get_thread_num();
            for (i = ID; i < num_steps; i += nthreads) {
                x = (i + 0.5)*step;
                sum[ID] += 4.0/(1.0 + x*x);
            }
        }
        pi = 0.0;
        for (int i = 0; i < NUM; i++)
            pi += step * sum[i];
    }

Build this program and execute it on parallel hardware. Performance can suffer on some systems due to false sharing of sum[ID]: independent elements of the sum array share a cache line, so every update requires a cache-line transfer between threads.
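One common remedy is to pad each thread's accumulator out to its own cache line. A sketch of this standard padding trick (mine, not a slide; it assumes 64-byte cache lines):

    #include <omp.h>
    #include <stdio.h>

    static long num_steps = 100000;
    #define NUM 4
    #define PAD 8   // assumed 64-byte cache line: 8 doubles * 8 bytes

    int main(void)
    {
        double pi = 0.0, sum[NUM][PAD] = {{0.0}};
        double step = 1.0/(double) num_steps;

    #pragma omp parallel num_threads(NUM)
        {
            int nthreads = omp_get_num_threads();
            int ID = omp_get_thread_num();
            for (long i = ID; i < num_steps; i += nthreads) {
                double x = (i + 0.5)*step;
                sum[ID][0] += 4.0/(1.0 + x*x);   // accumulators now sit on
            }                                    // separate cache lines
        }
        for (int i = 0; i < NUM; i++)
            pi += step * sum[i][0];
        printf("pi ~= %f\n", pi);
        return 0;
    }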


Page 132: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 139

PI PROGRAM: SAFE UPDATE OF SHARED DATA

    #include <omp.h>
    static long num_steps = 100000;
    #define NUM 4
    double step;
    int main ()
    {
        double pi, sum = 0.0;
        step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)
        {
            int i, ID; double x, psum = 0.0;
            int nthreads = omp_get_num_threads();
            ID = omp_get_thread_num();
            for (i = ID; i < num_steps; i += nthreads) {
                x = (i + 0.5)*step;
                psum += 4.0/(1.0 + x*x);
            }
    #pragma omp critical
            sum += psum;
        }
        pi = step * sum;
    }

Replacing the sum array with a local/private version (psum) eliminates the false sharing. The critical section ensures that only one thread at a time updates sum, so the psum values can be combined safely.
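Since the combine step is a single scalar update, the atomic directive is an alternative worth knowing (my variant, not from the slides); atomic supports only simple updates like x += expr, but is often cheaper than a general critical section.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        #pragma omp parallel
        {
            double psum = omp_get_thread_num() + 1.0;  // stand-in for per-thread work
            #pragma omp atomic
            sum += psum;   // single scalar update: atomic suffices here
        }
        printf("sum = %f\n", sum);
        return 0;
    }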

Page 133: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 140

PI PROGRAM: MAKING LOOP-SPLITTING AND REDUCTIONS EVEN EASIER

    #include <omp.h>
    static long num_steps = 100000;
    double step;
    void main ()
    {
        int i; double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
    #pragma omp parallel for private(i, x) reduction(+:sum)
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5)*step;
            sum = sum + 4.0/(1.0 + x*x);
        }
        pi = step * sum;
    }

The reduction clause manages the dependence on sum; the private clause creates data local to each thread.
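To follow the incremental-parallelism advice of building and timing the program, you might bracket the loop with omp_get_wtime(); a sketch (the timing scaffolding is mine, not the slides'):

    #include <omp.h>
    #include <stdio.h>

    static long num_steps = 100000;

    int main(void)
    {
        double step = 1.0/(double) num_steps;
        double sum = 0.0;

        double t0 = omp_get_wtime();      // wall-clock time in seconds
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < num_steps; i++) {
            double x = (i + 0.5)*step;    // declared in the loop body: private
            sum += 4.0/(1.0 + x*x);
        }
        double t1 = omp_get_wtime();

        printf("pi ~= %.10f in %f s\n", step * sum, t1 - t0);
        return 0;
    }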

Page 134: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 141

SYNCHRONIZATION: BARRIER

Barrier: each thread waits until all threads arrive.

    #pragma omp parallel shared(A, B, C) private(id)
    {
        id = omp_get_thread_num();
        A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
        for (i = 0; i < N; i++) { C[i] = big_calc3(i, A); }
        // implicit barrier at the end of the for worksharing construct
    #pragma omp for nowait
        for (i = 0; i < N; i++) { B[i] = big_calc2(C, i); }
        // no implicit barrier, due to nowait
        A[id] = big_calc4(id);
    }
    // implicit barrier at the end of the parallel region

Page 135: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 142

PUTTING THE MASTER THREAD TO WORK

The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no synchronization is implied).

    #pragma omp parallel
    {
        do_many_things();
    #pragma omp master
        { exchange_boundaries(); }
    #pragma omp barrier
        do_many_other_things();
    }

Page 136: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 143

RUNTIME LIBRARY ROUTINES AND ICVS

To use a known, fixed number of threads in a program: (1) tell the system that you don't want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you actually got.

    #include <omp.h>
    void main()
    {
        int num_threads;
        omp_set_dynamic(0);   // disable dynamic adjustment of the number of threads
        omp_set_num_threads(omp_get_num_procs());  // request as many threads as processors
    #pragma omp parallel
        {
            int id = omp_get_thread_num();
    #pragma omp single
            num_threads = omp_get_num_threads();   // protect this store: memory stores are not atomic
            do_lots_of_stuff(id);
        }
    }

Internal Control Variables (ICVs) define the state of the runtime system as seen by a thread. Consistent pattern: set them with an "omp_set" routine or an environment variable; read them with "omp_get".
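The same ICVs can be set from the environment instead of the API. A sketch (the shell settings in the comment are illustrative):

    #include <omp.h>
    #include <stdio.h>

    // Run with, e.g.:  OMP_DYNAMIC=false OMP_NUM_THREADS=4 ./a.out
    // These environment variables set the same ICVs as omp_set_dynamic()
    // and omp_set_num_threads().
    int main(void)
    {
        printf("dynamic adjustment: %d\n", omp_get_dynamic());
        printf("max threads:        %d\n", omp_get_max_threads());
        #pragma omp parallel
        #pragma omp single
        printf("actual team size:   %d\n", omp_get_num_threads());
        return 0;
    }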

Page 137: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 144

OPTIMIZING LOOP PARALLEL PROGRAMS

Short-range force computation for a particle system using the cut-off method:

    #include <omp.h>
    #pragma omp parallel
    {
        // define the neighborhood as the num_neigh[i] particles
        // within "cutoff" of each particle "i"
    #pragma omp for
        for (int i = 0; i < n; i++) {
            Fx[i] = 0.0; Fy[i] = 0.0;
            for (int j = 0; j < num_neigh[i]; j++) {
                neigh_ind = neigh[i][j];
                Fx[i] += forceX(i, neigh_ind);
                Fy[i] += forceY(i, neigh_ind);
            }
        }
    }

Particles may be unevenly distributed: different particles have different numbers of neighbors, so evenly spreading out the loop iterations may fail to balance the load among threads. We need a way to tell the compiler how best to distribute the load.

Page 138: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 145

THE SCHEDULE CLAUSE

The schedule clause affects how loop iterations are mapped onto threads:

• schedule(static [,chunk]): deal out blocks of iterations of size "chunk" to each thread.
• schedule(dynamic [,chunk]): each thread grabs "chunk" iterations off a queue until all iterations have been handled.
• schedule(guided [,chunk]): threads dynamically grab blocks of iterations; the size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
• schedule(runtime): schedule and chunk size are taken from the OMP_SCHEDULE environment variable (or from the runtime library, as of OpenMP 3.0).
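A sketch of schedule(runtime) in use (illustrative, not from the slides): the schedule is chosen at launch time via OMP_SCHEDULE, so you can experiment without recompiling.

    #include <omp.h>
    #include <stdio.h>

    // Try:  OMP_SCHEDULE="dynamic,10" ./a.out
    //       OMP_SCHEDULE="guided,4"   ./a.out
    int main(void)
    {
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < 100; i++) {
            printf("iteration %3d -> thread %d\n", i, omp_get_thread_num());
        }
        return 0;
    }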

Page 139: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 146

OPTIMIZING LOOP PARALLEL PROGRAMS

Short-range force computation for a particle system using the cut-off method, now with a dynamic schedule:

    #include <omp.h>
    #pragma omp parallel
    {
        // define the neighborhood as the num_neigh[i] particles
        // within "cutoff" of each particle "i"
    #pragma omp for schedule(dynamic, 10)
        for (int i = 0; i < n; i++) {
            Fx[i] = 0.0; Fy[i] = 0.0;
            for (int j = 0; j < num_neigh[i]; j++) {
                neigh_ind = neigh[i][j];
                Fx[i] += forceX(i, neigh_ind);
                Fy[i] += forceY(i, neigh_ind);
            }
        }
    }

This divides the range of n into chunks of size 10. Each thread computes a chunk, then goes back to get its next chunk of 10 iterations, dynamically balancing the load between threads.

Page 140: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 147

LOOP WORK-SHARING CONSTRUCTS: THE SCHEDULE CLAUSE

Schedule Clause | When To Use
STATIC  | Work per iteration is pre-determined and predictable by the programmer. Least work at runtime: scheduling is done at compile time.
DYNAMIC | Unpredictable, highly variable work per iteration. Most work at runtime: complex scheduling logic is used at run time.
GUIDED  | Special case of dynamic, to reduce scheduling overhead.

Page 141: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 148

SECTIONS WORK-SHARING CONSTRUCT

The sections work-sharing construct gives a different structured block to each thread.

    #pragma omp parallel
    {
    #pragma omp sections
        {
    #pragma omp section
            x_calculation();
    #pragma omp section
            y_calculation();
    #pragma omp section
            z_calculation();
        }
    }

By default, there is a barrier at the end of the "omp sections" construct. Use the "nowait" clause to turn off the barrier.

Page 142: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 149

SINGLE WORK-SHARING CONSTRUCT

The single construct denotes a block of code that is executed by only one thread. A barrier is implied at the end of the single block.

    #pragma omp parallel
    {
        do_many_things();
    #pragma omp single
        { exchange_boundaries(); }
        do_many_other_things();
    }

Page 143: 1 Programming with Shared Memory. C ONTENT Introduction Cilk TBB OpenMP 2.

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 150

SUMMARY OF OPENMP'S KEY CONSTRUCTS

The only way to create threads is with the parallel construct:

    #pragma omp parallel

All threads execute the instructions in a parallel construct. Split work between threads by:

• SPMD: use the thread ID to control execution.
• Worksharing constructs to split loops (simple loops only):

    #pragma omp for

• Combined parallel/workshare as a shorthand:

    #pragma omp parallel for

High-level synchronization is safest:

    #pragma omp critical
    #pragma omp barrier
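To close, a compact sketch (my recap, not a slide) that touches each construct from this summary: a parallel region, a worksharing loop, and a critical section guarding the combine step.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        long hits = 0;
        const long N = 1000000;

        #pragma omp parallel            // create the thread team
        {
            long local = 0;
            #pragma omp for             // split the loop between threads
            for (long i = 0; i < N; i++) {
                if (i % 3 == 0) local++;    // arbitrary per-iteration work
            }
            #pragma omp critical        // safely combine per-thread results
            hits += local;
        }                               // implicit barrier at region end

        printf("hits = %ld\n", hits);
        return 0;
    }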