

Have Your Cake In Parallel And Eat It Sequentially Too! Semantically Sequential, Parallel Execution of Multiprocessor Programs

Gagan  Gupta  

Multiprocessors are ubiquitous, but programming them continues to be challenging.

Our Goal: Simplify multiprocessor programming without compromising performance.

Benefits:
•  Simplified programming; simplified system design; better reliability
•  Performance at par or better (5% to 288%) than conventional methods

Summary  

Conventional Wisdom:
•  Order in programs obstructs parallelism
•  Use non-deterministic programs, or make dataflow in programs explicit
•  Programmer should expose parallelism

Our Approach:
•  Order can help to expose parallelism!
•  Use ordered programs; maintain precise program-order execution semantics
•  Use run-time dataflow and speculative techniques to expose parallelism

Sequencer  

•  Unrolls  dynamic  instances  of  tasks  

•  Computes  data  set  dynamically  (user  assisted)  

for (i = 0; i < n; i++) {
    call F(wr, rd)
}

Fn    Write Set    Read Set
F1:   {B, C}       {A}
F2:   {D}          {A}
F3:   {?}          {?}
F4:   {B}          {D}
F5:   {B}          {D}
F6:   {G}          {H}

Example code and dynamic task instances

Dataflow  Engine    

Precise-restart Engine

•  Tracks tasks and their order in a Reorder List
•  Checkpoints mod set in History Buffer
•  Retires task in (total) program order

Order-aware Load-balancing Task Scheduler

[Figure: Example speculative dataflow execution on 3 processors. Tasks F1-F6 of the multiprocessor program are issued across processors P1-P3 over time slots t1-t6. F3 executes out-of-order speculatively before declaring its data set, is detected as misspeculated, is rolled back non-speculatively using the History Buffer, and then re-executes.]

ParaKram speedup (harmonic mean) is 20% higher than non-deterministic Pthreads (excludes Cholesky).

[Chart: Harmonic Mean of Achieved Speedups; speedup (0-9x) of Non-deterministic vs. ParaKram on 8x Core i7-965, 16x Opteron 8350, and 32x Opteron 8356 systems.]

ParaKram speedup is 288% higher than non-deterministic OpenMP, 75% over Cilk.

Applications: Barneshut, Blackscholes, Pbzip2, Dedup, Histogram, ReverseIndex, Swaptions, Mergesort, RE, WordCount, ConjugateGradient

[Chart: Speculative Execution (8x Intel Xeon); speedup (0-7x) of ParaKram, TL2, and Cilk on Genome, Mergesort, and Labyrinth.]

ParaKram scales with system size; the non-deterministic method does not scale.

ParaKram speedup is up to 77% higher than non-deterministic Cilk and TL2 STM.

[Chart: Cholesky Decomposition (24x Intel Xeon); speedup (0-14x) vs. # Processors (1-24) for ParaKram Speculative, ParaKram Non-speculative, OpenMP, and Cilk.]

[Chart: Tolerating Exceptions; speedup (0-70x) vs. # Processors (1-24) for Conventional Checkpoint-and-Recovery vs. ParaKram.]

Exploiting Parallelism

Conventional:
•  Develop Parallel Algorithm
•  Ensure Independence between Parallel Tasks
•  Schedule Execution of Tasks
•  Programmer's Role: Reason About Inter-task Data Access Conflicts
•  Program text ≠> Order
•  Execution has to respect programmer-exposed parallelism
•  Cannot schedule F5 in t2

Our Approach:
•  Develop Parallel Algorithm
•  Programmer's Role: Reason About Task-local Data Accesses
•  Program text => Order
•  If dependences are known, distant parallelism can be exposed
•  Can schedule F5 in t2

[Figure: Task Dependence Graph of Cholesky Decomposition, with tasks F1-F5 scheduled across time slots under each approach.]

ParaKram
•  Run-time parallel execution manager (C++ library)
•  Performs out-of-order, superscalar processor-like execution on multiprocessors

[Table: Precisely restarting the misspeculated task F3 from above. Across epochs t1-t4, the Reorder List initially holds F1-F6; as tasks complete (F1, then F6 and F2, ...), the Precise-restart Engine retires them in total program order, leaving F3 and its successors to re-execute.]

•  Constructs dynamic data dependence graph using write and read sets
•  Executes tasks out-of-order
•  If task dependences/order are unknown, speculates tasks are independent
•  Uncovers parallelism past blocked tasks in the program
•  Detects and rectifies misspeculation

[Figure: ParaKram system overview. A multiprocessor program is fed to ParaKram's Sequencer, Dataflow Engine, Precise-restart Engine, and Task Scheduler, which together execute it on the multiprocessor system.]