
Futures, Scheduling, and Work Distribution

Companion slides for The Art of Multiprocessor Programming

by Maurice Herlihy & Nir Shavit

(Some images in this lecture courtesy of Charles Leiserson)

Art of Multiprocessor Programming 22

How to write Parallel Apps?

• Split a program into parallel parts
• In an effective way
• Thread management

Matrix Multiplication

(figure: matrices C = A · B)

c_ij = Σ_{k=0}^{N-1} a_ik · b_kj

Matrix Multiplication

class Worker extends Thread {
  int row, col;
  Worker(int row, int col) {
    this.row = row;  // which matrix entry to compute
    this.col = col;
  }
  public void run() {  // the actual computation
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}

Each Worker is a thread.

Matrix Multiplication

void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row, col);  // create n x n threads
  for (int row …)
    for (int col …)
      worker[row][col].start();  // start them
  for (int row …)
    for (int col …)
      worker[row][col].join();   // wait for them to finish
}

What’s wrong with this picture?

Thread Overhead

• Threads require resources
  – Memory for stacks
  – Setup, teardown
  – Scheduler overhead
• Short-lived threads
  – Ratio of work to overhead is bad


Thread Pools

• More sensible to keep a pool of long-lived threads
• Threads assigned short-lived tasks
  – Run the task
  – Rejoin pool
  – Wait for next assignment


Thread Pool = Abstraction

• Insulate programmer from platform
  – Big machine, big pool
  – Small machine, small pool
• Portable code
  – Works across platforms
  – Worry about algorithm, not platform


ExecutorService Interface

• In java.util.concurrent
  – Task = Runnable object
    • If no result value expected
    • Calls run() method
  – Task = Callable&lt;T&gt; object
    • If result value of type T expected
    • Calls T call() method


Future&lt;T&gt;

Callable&lt;T&gt; task = …;
…
Future&lt;T&gt; future = executor.submit(task);
T value = future.get();

• Submitting a Callable&lt;T&gt; task returns a Future&lt;T&gt; object
• The Future’s get() method blocks until the value is available

Future&lt;?&gt;

Runnable task = …;
…
Future&lt;?&gt; future = executor.submit(task);
future.get();

• Submitting a Runnable task returns a Future&lt;?&gt; object
• The Future’s get() method blocks until the computation is complete

Note

• ExecutorService submissions
  – Like New England traffic signs
  – Are purely advisory in nature
• The executor
  – Like the Boston driver
  – Is free to ignore any such advice
  – And could execute tasks sequentially …


Matrix Addition

(figure: C, A, B each partitioned into four half-size blocks indexed 00, 01, 10, 11, with C_ij = A_ij + B_ij)

4 parallel additions

Matrix Addition Task

class AddTask implements Runnable {
  Matrix a, b; // add this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case: add directly
    } else {
      // partition a, b into half-size matrices aij and bij (a constant-time operation)
      Future<?> f00 = exec.submit(new AddTask(a00, b00));
      …
      Future<?> f11 = exec.submit(new AddTask(a11, b11));
      // submit the 4 tasks, then let them finish
      f00.get(); …; f11.get();
      …
    }
  }
}

Dependencies

• Matrix example is not typical
• Tasks are independent
  – Don’t need results of one task …
  – To complete another
• Often tasks are not independent


Fibonacci

F(n) = 1 if n = 0 or 1
F(n) = F(n-1) + F(n-2) otherwise

• Note
  – Potential parallelism
  – Dependencies


Disclaimer

• This Fibonacci implementation is egregiously inefficient
  – So don’t try this at home or at work!
  – But it illustrates our point: how to deal with dependencies
• Exercise: make this implementation efficient!


Multithreaded Fibonacci

class FibTask implements Callable<Integer> {
  static ExecutorService exec = Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) { arg = n; }
  public Integer call() throws Exception {
    if (arg > 1) {  // base cases: F(0) = F(1) = 1
      // parallel calls
      Future<Integer> left = exec.submit(new FibTask(arg-1));
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      // pick up & combine results
      return left.get() + right.get();
    } else {
      return 1;
    }
  }
}

The Blumofe-Leiserson DAG Model

• Multithreaded program is– A directed acyclic graph (DAG)– That unfolds dynamically

• Each node is– A single unit of work


Fibonacci DAG

(figure: the DAG for fib(4) unfolds dynamically: fib(4) spawns fib(3) and fib(2); fib(3) spawns fib(2) and fib(1); each fib(2) spawns fib(1) nodes; edges are labeled call and get)

How Parallel is That?

• Define work:
  – Total time on one processor
• Define critical-path length:
  – Longest dependency path
  – Can’t beat that!

(figure: unfolded DAG)


Parallelism?


Serial fraction = 3/18 = 1/6 …

Amdahl’s Law says speedup cannot exceed 6.

Work?

(figure: DAG nodes numbered 1–18)

T1: time needed on one processor

Just count the nodes …

T1 = 18

Critical Path?

T∞: time needed on as many processors as you like

(figure: the longest path through the DAG, its nodes numbered 1–9)

Longest path …

T∞ = 9


Notation Watch

• TP = time on P processors

• T1 = work (time on 1 processor)

• T∞ = critical path length (time on ∞ processors)


Simple Laws

• Work Law: TP ≥ T1/P
  – In one step, can’t do more than P work
• Critical Path Law: TP ≥ T∞
  – Can’t beat infinite resources


Performance Measures

• Speedup on P processors
  – Ratio T1/TP
  – How much faster with P processors
• Linear speedup
  – T1/TP = Θ(P)
• Max speedup (average parallelism)
  – T1/T∞

Sequential Composition

(figure: A followed by B)

Work: T1(A) + T1(B)
Critical Path: T∞(A) + T∞(B)

Parallel Composition

(figure: A and B side by side)

Work: T1(A) + T1(B)
Critical Path: max{T∞(A), T∞(B)}


Matrix Addition

(figure: C, A, B each partitioned into four half-size blocks indexed 00, 01, 10, 11)

4 parallel additions

Addition

• Let AP(n) be running time
  – For an n x n matrix
  – On P processors
• For example
  – A1(n) is work
  – A∞(n) is critical-path length


Addition

• Work is

  A1(n) = 4 A1(n/2) + Θ(1)
        = Θ(n²)

  (4 spawned additions; Θ(1) for partition, synch, etc.)

Same as double-loop summation

Addition

• Critical-path length is

  A∞(n) = A∞(n/2) + Θ(1)
        = Θ(log n)

  (spawned additions run in parallel; Θ(1) for partition, synch, etc.)
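Both recurrences unroll by a standard Master-theorem calculation (sketched here; A_1(1) and A_∞(1) are the constant base cases):

```latex
A_1(n) = 4\,A_1(n/2) + \Theta(1)
       = 4^{\log_2 n} A_1(1) + \Theta\Bigl(\sum_{i=0}^{\log_2 n - 1} 4^i\Bigr)
       = \Theta(n^2),
\qquad
A_\infty(n) = A_\infty(n/2) + \Theta(1)
            = A_\infty(1) + \Theta(\log_2 n)
            = \Theta(\log n).
```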


Matrix Multiplication Redux

(figure: C = A · B)

(figure: C, A, B each partitioned into four half-size blocks indexed 11, 12, 21, 22)

First Phase …

(figure: the 8 half-size products A_ik · B_kj)

8 multiplications

Second Phase …

(figure: pairs of half-size products summed into the blocks of C)

4 additions

Multiplication

• Work is

  M1(n) = 8 M1(n/2) + A1(n)
        = 8 M1(n/2) + Θ(n²)
        = Θ(n³)

  (8 spawned multiplications, plus the final addition)

Same as serial triple-nested loop

Multiplication

• Critical-path length is

  M∞(n) = M∞(n/2) + A∞(n)
        = M∞(n/2) + Θ(log n)
        = Θ(log² n)

  (half-size multiplications run in parallel, plus the final addition)


Parallelism

• M1(n)/M∞(n) = Θ(n³/log² n)
• To multiply two 1000 x 1000 matrices
  – 1000³/10² = 10⁷
• Much more than the number of processors on any real machine
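The 1000 × 1000 figure is easy to reproduce (taking log base 2, so log² 1000 ≈ 10²; illustrative arithmetic only):

```java
public class ParallelismCheck {
    public static void main(String[] args) {
        double n = 1000;
        double work = n * n * n;                                  // Θ(n³) work
        double critPath = Math.pow(Math.log(n) / Math.log(2), 2); // Θ(log² n) critical path
        System.out.println(work / critPath);  // roughly 10^7
    }
}
```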


Shared-Memory Multiprocessors

• Parallel applications
  – Do not have direct access to HW processors
• Mix of other jobs
  – All run together
  – Come & go dynamically


Ideal Scheduling Hierarchy

(figure: a user-level scheduler maps tasks directly onto processors)

Realistic Scheduling Hierarchy

(figure: a user-level scheduler maps tasks onto threads; a kernel-level scheduler maps threads onto processors)


For Example

• Initially,
  – All P processors available for application
• Serial computation
  – Takes over one processor
  – Leaving P-1 for us
  – Waits for I/O
  – We get that processor back …


Speedup

• Map threads onto P processors
• Cannot get P-fold speedup
  – What if the kernel doesn’t cooperate?
• Can try for speedup proportional to P


Scheduling Hierarchy

• User-level scheduler
  – Tells kernel which threads are ready
• Kernel-level scheduler
  – Synchronous (for analysis, not correctness!)
  – Picks p_i threads to schedule at step i

Greedy Scheduling

• A node is ready if its predecessors are done
• Greedy: schedule as many ready nodes as possible
• Optimal scheduler is greedy (why?)
• But not every greedy scheduler is optimal


Greedy Scheduling

There are P processors.

• Complete step: ≥ P nodes ready; run any P of them
• Incomplete step: &lt; P nodes ready; run them all


Theorem

For any greedy scheduler,

  TP ≤ T1/P + T∞

• TP: actual time
• T1/P: no better than work divided among processors
• T∞: no better than critical-path length


Proof:

Number of complete steps ≤ T1/P …

… because each performs P work.

Number of incomplete steps ≤ T∞ …

… because each shortens the unexecuted critical path by 1.

QED


Near-Optimality

Theorem: any greedy scheduler is within a factor of 2 of optimal.

Remark: finding the optimal schedule is NP-hard!

Proof of Near-Optimality

Let TP* be the optimal time.

  TP* ≥ max{T1/P, T∞}   (from the work and critical-path laws)
  TP ≤ T1/P + T∞        (theorem)
     ≤ 2 max{T1/P, T∞}
     ≤ 2 TP*

QED


Work Distribution

(figure: an idle thread: zzz…)

Work Dealing

(figure: a busy thread deals work to an idle one: “Yes!”)

The Problem with Work Dealing

(figure: overloaded threads waste time dealing work to each other: “D’oh!”)

Work Stealing

(figure: a thread with no work steals from a busy one: “Yes!”)

Lock-Free Work Stealing

• Each thread has a pool of ready work
• Remove work without synchronizing
• If you run out of work, steal someone else’s
• Choose victim at random


Local Work Pools

Each work pool is a Double-Ended Queue


Work DEQueue¹

(figure: array of tasks; pushBottom and popBottom operate at the bottom)

¹ Double-Ended Queue


Obtain Work

• Obtain work with popBottom
• Run the task until it blocks or terminates


New Work

• Unblock node
• Spawn node

(figure: new task added with pushBottom)


Whatcha Gonna do When the Well Runs Dry?

(figure: thread finds its work pool empty: @&%$!!)


Steal Work from Others

Pick a random thread’s DEQueue


Steal this Task!

(figure: thief takes the top task with popTop)


Task DEQueue

• Methods
  – pushBottom
  – popBottom
  – popTop

pushBottom and popBottom never happen concurrently: only the owner calls them. They are also the most common operations, so make them fast (minimize use of CAS).


Ideal

• Wait-free
• Linearizable
• Constant time

Fortune Cookie: “It is better to be young, rich and beautiful, than old, poor, and ugly”


Compromise

• Method popTop may fail if
  – a concurrent popTop succeeds, or
  – a concurrent popBottom takes the last task

Blame the victim!


Dreaded ABA Problem

(figure sequence: a thief reads top and prepares a CAS; meanwhile other threads pop every task and the owner pushes new ones, so top returns to its old index; the thief’s CAS then succeeds on the wrong task. Uh-Oh …)


Fix for Dreaded ABA

(figure: the deque has a bottom index and a top index tagged with a stamp)
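The stamp is what java.util.concurrent’s AtomicStampedReference provides: compareAndSet succeeds only if both the expected value and the expected stamp match. A minimal illustration of the mechanism (not the deque code itself):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class StampDemo {
    public static void main(String[] args) {
        // top holds index 0 with stamp 0.
        AtomicStampedReference<Integer> top = new AtomicStampedReference<>(0, 0);
        // A thief reads (value 0, stamp 0) and prepares CAS(0 -> 1).
        // Meanwhile the deque empties and refills: top is 0 again, but stamped 1.
        top.set(0, 1);
        // The thief's CAS fails even though the value looks unchanged:
        System.out.println(top.compareAndSet(0, 1, 0, 1)); // prints false
    }
}
```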


Bounded DEQueue

public class BDEQueue {
  AtomicStampedReference<Integer> top;  // index & stamp, updated atomically
  volatile int bottom;                  // index of bottom task; no need to synchronize,
                                        // but a memory barrier is needed (hence volatile)
  Runnable[] tasks;                     // array holding the tasks
  …
}

pushBottom()

public class BDEQueue {
  …
  void pushBottom(Runnable r) {
    tasks[bottom] = r;  // bottom is the index at which to store the new task
    bottom++;           // adjust the bottom index
  }
  …
}

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;  // read top (value & stamp)
  int oldStamp = stamp[0], newStamp = oldStamp + 1;  // compute new value & stamp
  if (bottom <= oldTop)  // quit if queue is empty
    return null;
  Runnable r = tasks[oldTop];
  if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))  // try to steal the task
    return r;
  return null;  // give up if a conflict occurs
}

Take Work

Runnable popBottom() {
  if (bottom == 0) return null;  // make sure queue is non-empty
  bottom--;                      // prepare to grab the bottom task
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;           // read top, prepare new values
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)   // top & bottom one or more apart: no conflict
    return r;
  if (bottom == oldTop) {  // at most one task left: try to steal it
    bottom = 0;  // the DEQueue will be empty even if the steal is unsuccessful (why?)
    if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
      return r;  // I win the CAS
  }
  // If I lose the CAS, the thief must have won. Failed to get the last task
  // (bottom could be less than top); must still reset top and bottom
  // since the deque is empty.
  top.set(newTop, newStamp);
  bottom = 0;
  return null;
}

Old English Proverb

• “May as well be hanged for stealing a sheep as a goat”

• From which we conclude:– Stealing was punished severely– Sheep were worth more than goats


Variations

• Stealing is expensive
  – Pay CAS
  – Only one task taken
• What if
  – Move more than one task
  – Randomly balance loads?


Work Balancing

(figure: queues with 5 and 2 tasks are balanced)

⌊(2+5)/2⌋ = 3
⌈(2+5)/2⌉ = 4


Work-Balancing Thread

public void run() {
  int me = ThreadID.get();
  while (true) {
    Runnable task = queue[me].deq();
    if (task != null) task.run();  // keep running tasks
    int size = queue[me].size();
    if (random.nextInt(size+1) == size) {  // with probability 1/(size+1)
      int victim = random.nextInt(queue.length);  // choose random victim
      int min = …, max = …;  // lock queues in canonical order
      synchronized (queue[min]) {
        synchronized (queue[max]) {
          balance(queue[min], queue[max]);  // rebalance queues
        }
      }
    }
  }
}
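The balance method itself never appears on these slides. A minimal sketch of what it might do, assuming only a size()/pushBottom()/popBottom() queue interface like the one shown earlier (SimpleQueue and Balancer are hypothetical names, not from the slides):

```java
import java.util.ArrayDeque;

// Hypothetical stand-in for the slides' task queue type.
class SimpleQueue<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>();
    int size() { return items.size(); }
    void pushBottom(T t) { items.addLast(t); }
    T popBottom() { return items.pollLast(); }
}

class Balancer {
    // Move tasks from the larger queue to the smaller one until their
    // sizes differ by at most one: the floor and ceiling of the average.
    static <T> void balance(SimpleQueue<T> q0, SimpleQueue<T> q1) {
        SimpleQueue<T> small = (q0.size() <= q1.size()) ? q0 : q1;
        SimpleQueue<T> large = (small == q0) ? q1 : q0;
        while (large.size() - small.size() > 1)
            small.pushBottom(large.popBottom());
    }
}
```

Balancing queues of sizes 5 and 2 this way leaves sizes 4 and 3, matching the ⌈(2+5)/2⌉ = 4 and ⌊(2+5)/2⌋ = 3 on the earlier slide.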


Work Stealing & Balancing

• Clean separation between app & scheduling layer

• Works well when number of processors fluctuates.

• Works on “black-box” operating systems


This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.

• You are free:– to Share — to copy, distribute and transmit the work – to Remix — to adapt the work

• Under the following conditions:
  – Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).

– Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.

• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to– http://creativecommons.org/licenses/by-sa/3.0/.

• Any of the above conditions can be waived if you get permission from the copyright holder.

• Nothing in this license impairs or restricts the author's moral rights.

