

Have Your Cake In Parallel And Eat It Sequentially Too! Semantically Sequential, Parallel Execution of Multiprocessor Programs

Gagan  Gupta  

Multiprocessors are ubiquitous, but programming them continues to be challenging.

Our Goal: Simplify multiprocessor programming without compromising performance.

Benefits:
•  Simplified programming; simplified system design; better reliability
•  Performance at par or better (5% to 288%) than conventional methods

Summary  

Conventional Wisdom:
•  Order in programs obstructs parallelism
•  Use non-deterministic programs, or make dataflow in programs explicit
•  Programmer should expose parallelism

Our Approach:
•  Order can help to expose parallelism!
•  Use ordered programs; maintain precise program-order execution semantics
•  Use run-time dataflow and speculative techniques to expose parallelism

Sequencer  

•  Unrolls  dynamic  instances  of  tasks  

•  Computes  data  set  dynamically  (user  assisted)  

for (i = 0; i < n; i++) {
    call F(wr, rd)
}

Fn    Write Set    Read Set
F1:   {B, C}       {A}
F2:   {D}          {A}
F3:   {?}          {?}
F4:   {B}          {D}
F5:   {B}          {D}
F6:   {G}          {H}

Example code and dynamic task instances

Dataflow  Engine    

Precise-restart Engine

•  Tracks tasks and their order in a Reorder List
•  Checkpoints mod set in History Buffer
•  Retires task in (total) program order

Order-aware Load-balancing Task Scheduler

[Figure: Example speculative dataflow execution on 3 processors. Tasks F1-F6 of the multiprocessor program are issued across processors P1-P3 over time slots t1-t6. F3 executes out-of-order speculatively before declaring its data set, is detected as misspeculated, is rolled back non-speculatively using the History Buffer, and then re-executes.]

ParaKram speedup (harmonic mean) is 20% higher than non-deterministic Pthreads (excludes Cholesky).

[Chart: Harmonic Mean of Achieved Speedups; speedup (0-9x) of Non-deterministic vs. ParaKram on 8x Core i7-965, 16x Opteron 8350, and 32x Opteron 8356 systems.]

ParaKram speedup is 288% higher than non-deterministic OpenMP, 75% over Cilk.

Applications: Barneshut, Blackscholes, Pbzip2, Dedup, Histogram, ReverseIndex, Swaptions, Mergesort, RE, WordCount, ConjugateGradient

[Chart: Speculative Execution (8x Intel Xeon); speedup (0-7x) of ParaKram, TL2, and Cilk on Genome, Mergesort, and Labyrinth.]

ParaKram scales with system size; the non-deterministic method does not scale.

ParaKram speedup is up to 77% higher than non-deterministic Cilk and TL2 STM.

[Chart: Cholesky Decomposition (24x Intel Xeon); speedup (0-14x) vs. # Processors (1-24) for ParaKram Speculative, ParaKram Non-speculative, OpenMP, and Cilk.]

[Chart: Tolerating Exceptions; speedup (0-70x) vs. # Processors (1-24) for Conventional Checkpoint-and-Recovery vs. ParaKram.]

Exploiting Parallelism

Conventional:
•  Develop Parallel Algorithm
•  Ensure Independence between Parallel Tasks
•  Schedule Execution of Tasks
•  Programmer's Role: Reason About Inter-task Data Access Conflicts
•  Program text ≠> Order
•  Execution has to respect programmer-exposed parallelism
•  Cannot schedule F5 in t2

Our Approach:
•  Develop Parallel Algorithm
•  Programmer's Role: Reason About Task-local Data Accesses
•  Program text => Order
•  If dependences are known, distant parallelism can be exposed
•  Can schedule F5 in t2

[Figure: Task Dependence Graph of Cholesky Decomposition, with tasks F1-F5 scheduled across time slots under each approach.]

ParaKram
•  Run-time parallel execution manager (C++ library)
•  Performs out-of-order, superscalar processor-like execution on multiprocessors

[Table: Precisely restarting the misspeculated task F3 from above. Across epochs t1-t4, the Reorder List initially holds F1-F6; as tasks complete (F1, then F6 and F2, ...), the Precise-restart Engine retires them in total program order, leaving F3 and its successors to re-execute.]

•  Constructs dynamic data dependence graph using write and read sets
•  Executes tasks out-of-order
•  If task dependences/order are unknown, speculates tasks are independent
•  Uncovers parallelism past blocked tasks in the program
•  Detects and rectifies misspeculation

[Figure: ParaKram system overview. A multiprocessor program is fed to ParaKram's Sequencer, Dataflow Engine, Precise-restart Engine, and Task Scheduler, which together execute it on the multiprocessor system.]