
Transcript

Have Your Cake In Parallel And Eat It Sequentially Too!
Semantically Sequential, Parallel Execution of Multiprocessor Programs

Gagan  Gupta  

Multiprocessors are ubiquitous, but programming them continues to be challenging.

Our Goal: Simplify multiprocessor programming without compromising performance.

Benefits:
•  Simplified programming; simplified system design; better reliability
•  Performance on par with or better than conventional methods (by 5% to 288%)

Summary  

Conventional Wisdom: Order in programs obstructs parallelism
Our Approach: Order can help to expose parallelism!

Conventional Wisdom: Use non-deterministic programs, or make dataflow in programs explicit
Our Approach: Use ordered programs; maintain precise program-order execution semantics

Conventional Wisdom: Programmer should expose parallelism
Our Approach: Use run-time dataflow and speculative techniques to expose parallelism

Sequencer  

•  Unrolls  dynamic  instances  of  tasks  

•  Computes  data  set  dynamically  (user  assisted)  

    for (i = 0; i < n; i++) {
        call F(wr, rd)
    }

    Fn     Write Set    Read Set
    F1     {B, C}       {A}
    F2     {D}          {A}
    F3     {?}          {?}
    F4     {B}          {D}
    F5     {B}          {D}
    F6     {G}          {H}

Example code and dynamic task instances
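The Sequencer's two jobs, unrolling dynamic task instances and computing each instance's data set with user assistance, can be sketched roughly as follows. This is an illustrative Python model, not ParaKram's C++ interface: `TaskInstance`, `unroll`, and the `data_sets` callback are hypothetical names, and `None` stands in for the unknown `{?}` sets of F3.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple

DataSet = Optional[FrozenSet[str]]  # None models an unknown set, shown as {?}


@dataclass(frozen=True)
class TaskInstance:
    seq: int            # position in program order (1-based, as in F1..F6)
    name: str
    write_set: DataSet
    read_set: DataSet


def unroll(n: int, data_sets: Callable[[int], Tuple[DataSet, DataSet]]) -> List[TaskInstance]:
    """Unroll `for (i = 0; i < n; i++) call F(wr, rd)` into dynamic instances.

    `data_sets(i)` is the user-assisted part (a hypothetical callback): it
    returns the (write, read) sets for iteration i, or None if not yet known.
    """
    instances = []
    for i in range(n):
        wr, rd = data_sets(i)
        instances.append(TaskInstance(seq=i + 1, name=f"F{i + 1}", write_set=wr, read_set=rd))
    return instances


# Data sets copied from the example table; F3's sets are unknown.
EXAMPLE = [
    (frozenset("BC"), frozenset("A")),
    (frozenset("D"), frozenset("A")),
    (None, None),
    (frozenset("B"), frozenset("D")),
    (frozenset("B"), frozenset("D")),
    (frozenset("G"), frozenset("H")),
]

tasks = unroll(len(EXAMPLE), lambda i: EXAMPLE[i])
for t in tasks:
    print(t.seq, t.name, t.write_set, t.read_set)
```

Keeping the sequence number on every instance is what lets later stages recover program order even when tasks execute out of order.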

Dataflow  Engine    

     Precise-­‐restart  Engine  

•  Tracks tasks and their order in a Reorder List
•  Checkpoints mod set in History Buffer
•  Retires task in (total) program order
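The three responsibilities above can be sketched in a toy model. The Python below is an assumption-laden illustration rather than ParaKram's implementation: the class and method names are hypothetical, program state is a plain dictionary, and the mod set and body given to F3 are made up for the demonstration. It also ignores the case where several in-flight tasks modify the same location.

```python
class PreciseRestartEngine:
    """Toy model: a Reorder List plus a History Buffer over a dict of data."""

    def __init__(self, memory):
        self.memory = memory      # shared program state: name -> value
        self.reorder_list = []    # task names, kept in program order
        self.history = {}         # task name -> checkpoint of its mod set
        self.completed = set()
        self.retired = []

    def dispatch(self, name):
        # Tasks enter the Reorder List in program order.
        self.reorder_list.append(name)

    def run(self, name, mod_set, body):
        # Checkpoint the locations the task may modify before it runs.
        self.history[name] = {key: self.memory[key] for key in mod_set}
        body(self.memory)
        self.completed.add(name)

    def rollback(self, name):
        # Undo a misspeculated task by restoring its checkpoint.
        self.memory.update(self.history.pop(name))
        self.completed.discard(name)

    def retire(self):
        # Retire only from the head of the list, i.e. in program order.
        while self.reorder_list and self.reorder_list[0] in self.completed:
            name = self.reorder_list.pop(0)
            self.history.pop(name, None)
            self.completed.discard(name)
            self.retired.append(name)
        return list(self.retired)


memory = {"A": 1, "B": 0, "C": 0}
engine = PreciseRestartEngine(memory)
for name in ("F1", "F2", "F3"):
    engine.dispatch(name)

engine.run("F3", {"B"}, lambda m: m.update(B=99))  # speculative, out of order
engine.retire()                                    # nothing retires: F1 is not done
engine.rollback("F3")                              # misspeculation: B is restored
engine.run("F1", {"B", "C"}, lambda m: m.update(B=m["A"] + 1, C=5))
engine.retire()                                    # F1 retires; F2 blocks the rest
```

Because retirement only ever pops the head of the Reorder List, a completed F3 cannot retire ahead of F1 and F2, which is what keeps the externally visible state consistent with program order.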

Order-aware Load-balancing Task Scheduler

[Figure: Example speculative dataflow execution on 3 processors. The multiprocessor program issues tasks F1 to F6 in program order; the timeline shows them running on processors P1 to P3 over epochs t1 to t6. Annotations: tasks execute out of order, both speculatively and non-speculatively; F3 executes speculatively, then declares its data set, is found to be misspeculated, is rolled back using the History Buffer, and is re-executed.]

ParaKram speedup (harmonic mean) is 20% higher than non-deterministic Pthreads (excludes Cholesky)

[Figure: Harmonic Mean of Achieved Speedups. Y-axis: Speedup. Systems: 8x Core i7-965, 16x Opteron 8350, 32x Opteron 8356. Series: Non-deterministic, ParaKram.]

ParaKram speedup is 288% higher than non-deterministic OpenMP, 75% over Cilk

Applications: Barneshut, Blackscholes, Pbzip2, Dedup, Histogram, ReverseIndex, Swaptions, Mergesort, RE, WordCount, ConjugateGradient

[Figure: Speculative Execution (8x Intel Xeon). Y-axis: Speedup. Applications: Genome, Mergesort, Labyrinth. Series: ParaKram, TL2, Cilk.]

ParaKram scales with system size; the non-deterministic method does not scale

ParaKram speedup is up to 77% higher than non-deterministic Cilk and TL2 STM

[Figure: Cholesky Decomposition (24x Intel Xeon). X-axis: # Processors (1 to 24). Y-axis: Speedup. Series: ParaKram Speculative, ParaKram Non-speculative, OpenMP, Cilk.]

[Figure: Tolerating Exceptions. X-axis: # Processors (1 to 24). Y-axis: Speedup. Series: Conventional Checkpoint-and-Recovery, ParaKram.]

Exploiting Parallelism

Programmer's Role

Conventional (Program text ≠> Order)

•  Develop Parallel Algorithm

•  Schedule Execution of Tasks

•  Ensure Independence between Parallel Tasks

•  Reason About Inter-task Data Access Conflicts

Our Approach (Program text => Order)

•  Develop Parallel Algorithm

•  Reason About Task-local Data Accesses

[Figure: task graphs over F1 to F5, one for the conventional approach and one for our approach.]

Conventional

•  Execution has to respect programmer-exposed parallelism

•  Cannot schedule F5 in t2

Our Approach

•  If dependences are known, distant parallelism can be exposed

•  Can schedule F5 in t2

Task Dependence Graph of Cholesky Decomposition

ParaKram

•  Run-time parallel execution manager (C++ library)

•  Performs out-of-order, superscalar-processor-like execution on multiprocessors

[Table: Reorder List contents by epoch. Columns: Epoch, Reorder List Entries, Completed, Retired.]

    t1    F1 F2 F3 F4 F5 F6
    t2    F2 F3 F4 F5 F6
    t3    F2 F3 F4 F5 F6
    t4    F3 F4 F5 F6

Completed and Retired entries, in the order they appear: F1; F1; F1, F6, F2; F1; F2

Precisely restarting misspeculated task (F3 from above)

Uncovers parallelism past blocked tasks in the program

•  Constructs dynamic data dependence graph using write and read sets

•  Executes tasks out-of-order

•  If task dependences/order are unknown, speculates tasks are independent

•  Detects and rectifies misspeculation
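The dataflow behaviour in these bullets can be illustrated with a small sketch that builds a dependence graph from write and read sets and then runs ready tasks out of order. It is a simplified Python model under stated assumptions: `depends_on`, `build_graph`, and `schedule` are hypothetical helpers, the greedy per-epoch scheduler is a stand-in and not ParaKram's order-aware load-balancing scheduler, and detection of misspeculation is left out. The data sets are the ones from the F1 to F6 example table, with `None` for F3's unknown sets.

```python
def depends_on(later, earlier):
    """True if `later` must wait for `earlier` (RAW, WAR, or WAW conflict).

    Each task is a (write_set, read_set) pair; None marks an unknown set.
    A task with unknown sets is speculated to be independent.
    """
    later_wr, later_rd = later
    earlier_wr, earlier_rd = earlier
    if None in (later_wr, later_rd, earlier_wr, earlier_rd):
        return False  # speculate independence; a real engine must verify this later
    return bool(earlier_wr & later_rd) or bool(earlier_rd & later_wr) or bool(earlier_wr & later_wr)


def build_graph(tasks):
    """Map each task name to the set of earlier tasks it depends on."""
    names = list(tasks)  # dicts preserve insertion order, i.e. program order
    return {
        name: {prev for prev in names[:i] if depends_on(tasks[name], tasks[prev])}
        for i, name in enumerate(names)
    }


def schedule(graph, processors):
    """Greedy sketch: in each epoch run up to `processors` ready tasks."""
    done, epochs = set(), []
    pending = list(graph)
    while pending:
        ready = [t for t in pending if graph[t] <= done][:processors]
        epochs.append(ready)
        done.update(ready)
        pending = [t for t in pending if t not in done]
    return epochs


tasks = {
    "F1": ({"B", "C"}, {"A"}),
    "F2": ({"D"}, {"A"}),
    "F3": (None, None),  # unknown data sets, shown as {?}
    "F4": ({"B"}, {"D"}),
    "F5": ({"B"}, {"D"}),
    "F6": ({"G"}, {"H"}),
}

graph = build_graph(tasks)
epochs = schedule(graph, processors=3)
print(graph)
print(epochs)
```

In this model F6 becomes ready before F5 even though it comes later in the program, which is the sense in which parallelism is uncovered past blocked tasks; the resulting epochs are an artifact of the greedy stand-in scheduler and are not meant to reproduce the poster's timeline.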

[Figure: ParaKram sits between the multiprocessor program and the multiprocessor system. Its components are the Sequencer, Dataflow Engine, Precise-restart Engine, and Task Scheduler.]