MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a...

68
MPI as a Programming Model and Development Environment for Heterogeneous Compu9ng (with FPGAs) Paul Chow University of Toronto Department of Electrical and Computer Engineering ArchES Compu>ng Systems June 10, 2012

Transcript of MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a...

Page 1: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

MPI  as  a  Programming  Model  and  Development  Environment  for  Heterogeneous  Compu9ng  

(with  FPGAs)  

Paul  Chow    

University  of  Toronto  Department  of  Electrical  and  Computer  Engineering  

 

ArchES  Compu>ng  Systems  June  10,  2012  

   

Page 2: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Let’s  Build  a  Big  Machine  

•  FPGAs  seen  to  provide  significant  accelera>on  in  “one’s  and  two’s”  

•  What  about  thousands?  •  Big  applica>on?  –  Molecular  Dynamics  

June  10,  2012   CARL  2012   2  

Page 3: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

How  do  you  think  about  such  a  system?  

•  Not  just  a  hardware  circuit  with  inputs  and  outputs  

•  It’s  a  purpose-­‐built  compu>ng  machine    •  Desire  programmability,  scalability,  reuse,  maintainability  

•  Ini>ally  all  FPGAs,  now  includes  x86  systems  •  Needs  a  programming  model  –  BTW,  hardware  designers  don’t  think  this  way…    

June  10,  2012   CARL  2012   3  

Page 4: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

The  Typical  Accelerator  Model  •  Accelerator  is  a  

slave  to  so]ware  •  Host  ini>ates  and  

controls  all  interac>ons  

•  Does  not  scale  to  more  accelerators  easily  or  efficiently  

•  Communica>on  to  other  Accelerated  Func>ons  done  via  Host  

Host  

FPGA  

Accelerated  Func>on  

Func>on  Call  

Link  Layer  Interface  

Vendor-­‐  Specific  Interface  

FSB,  HT,  PCIe,  etc.  

June  10,  2012   4  CARL  2012  

OpenFPGA  GenAPI  is  acempt  to  standardize  an  API  but  s>ll  at  a  low-­‐level  

Page 5: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

June  10,  2012   CARL  2012   5  

Many  interfacing  and  communica>on  issues  before  even  thinking  about  the  applica>on.    Not  to  forget  the  hardware  design!  

Page 6: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Requirements  for  Programming  Model  

•  Unified  –  same  API  for  all  communica>on  •  Usable  by  applica>on  experts,  i.e.,  hide  the  heterogeneity  –  it’s  just  a  collec>on  of  processors  

•  Applica>on  portability  between  plahorms  •  Applica>on  scalability  

Leverage  exis9ng  soCware  programming  models  and  adapt  to  heterogeneous  environment  

June  10,  2012   CARL  2012   6  

Page 7: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Which  Programming  Model?  

•  Hardware  –  chunks  of  logic  with  acached  local  memory  

•  Sounds  like  distributed  memory  

June  10,  2012   CARL  2012   7  

Message  Passing                                MPI  

Page 8: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Using  MPI  •  Message  Passing  Interface  •  MPI  is  a  common  Applica>on  Programming  Interface  used  for  parallel  applica>ons  on  distributed-­‐memory  systems  – Freely  available,  suppor>ng  tools,  large  knowledge  base  

– Lots  to  leverage  from  

June  10,  2012   CARL  2012   8  

Page 9: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Version  Guide  

June  10,  2012   CARL  2012   9  

TMD-­‐MPI   ArchES-­‐MPI  Original  research  project  

Commercialized  

Will  see  both  names  in  publica>ons.    Treat  as  equivalent  here.  

Page 10: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

An  “Embedded  MPI”  

June  10,  2012   CARL  2012   10  

•  Lightweight  subset  of  the  MPI  standard  •  Tailored  to  a  par>cular  applica>on  •  No  opera>ng  system  required  •  Small  memory  footprint  ~8.7KB  •  Simple  protocol  •  Used  as  a  model  for  implemen>ng  a  message-­‐passing  engine  in  hardware  –  the  MPE  

•  Abstrac>on  isolates  so]ware  from  hardware  changes  providing  portability  

Saldaña,  et  al.,    ACM  TRETS,  Vol.  3,  No.  4,  November  2010  

Page 11: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Protocols  

June  10,  2012   CARL  2012   11  

TMD-­‐MPI    communica>on  protocols  

Page 12: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

The  Flow  

June  10,  2012   CARL  2012   12  

Also  a  system  simula>on  

Page 13: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

A  Parallel  Example  

June  10,  2012   CARL  2012   13  

T1   T3   T4   Tn  

T2  

T0  

T0  is  the  root  task  T3  to  Tn  are  parallel  tasks  replicated  as  resources  allow  

This  does  not  easily  map  to  the  coprocessor  model  

Page 14: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Mapping  Stage  

June  10,  2012   CARL  2012   14  

Hardware  Accelerator  

Hardware  Accelerator  

Hardware  Accelerator  

Tasks  are  assigned  to  the  available  types  of    processing  elements  

Page 15: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Accelerator  Architecture  

June  10,  2012   CARL  2012   15  

main  (  )  {          MPI_Init()                MPI_Recv()          Compute()          MPI_Send()                MPI_Finalize()  }  

Compute()  

MPI_Send()  

MPI_Finalize()  

MPI_Recv()  

MPI_Init()  Message  Passing  Engine  (MPE)  

On-­‐chip  Network  

Infrastructure  

So]ware   Hardware  

MPI  FSM   Compute()  

Data  Commands  

So]ware  control  and  dataflow    easily  maps  to  hardware  

Page 16: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Communica>on  Middleware  

FSB  Interface  

MPI  FSB  Bridge  

Xilinx  FPGA  

HW  Engine  

HW  MPI  (MPE)  

MicroBlaze  

SW  MPI  

HW  Engine  

HW  MPI  (MPE)  

LVDS  interface  

MPI  LVDS  Bridge  

MGT  Interface  

MPI  MGT  Bridge  

Packet  Switch    

June  10,  2012   CARL  2012   16  

Page 17: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Achieving  Portability  •  Portability  is  achieved  

by  using  a  Middleware  abstrac>on  layer.    MPI  na>vely  provides  so]ware  portability  

•  ArchES  MPI  provides  a  Hardware  Middleware  to  enable  hardware  portability.    The  MPE    provides  the  portable    hardware  interface  to  be  used  by  a  hardware  accelerator  

June  10,  2012   CARL  2012   17  

SW  Applica>on      

SW  OS  

SW  Middleware  

Host-­‐specific  Hardware  

HW  Applica>on  

HW  OS  

Heterogeneous  Environment  

Host  

FPGA  HW  Middleware  

SW  Applica>on                SW  OS  

Host-­‐specific  Hardware  

So]ware    Environment  

SW  Middleware  

Page 18: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Achieving  Scalability  

•  MPI  naturally  enables  scalability  •  Making  use  of  addi>onal  processing  capability  can  be  as  easy  as  changing  

a  configura>on  file  •  Scaling  at  the  different  levels  (FPGAs,  modules,  hosts)  is  transparent  to  the  

applica>on  June  10,  2012   CARL  2012   18  

Adding  FPGAs  

Adding  modules  Adding  hosts  

Direct  module  to  module  link  

Page 19: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

BUILDING  A  LARGE  HPC  APPLICATION  

June  10,  2012   CARL  2012   19  

Page 20: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

June  10,  2012   CARL  2012   20  

Molecular  Dynamics  

•  Simulate  mo>on  of  molecules  at  atomic  level  •  Highly  compute-­‐intensive  •  Understand  protein  folding  •  Computer-­‐aided  drug  design  

Page 21: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

The  TMD  Machine  

•  The  Toronto  Molecular  Dynamics  Machine  •  Use  mul>-­‐FPGA  system  to  accelerate  MD  •  Principal  algorithm  developer:    Chris  Madill,  Ph.D.  candidate  (now  done!)  in  Biochemistry  – Writes  C++  using  MPI,  not  Verilog/VHDL  

•  Have  used  three  plahorms  –  portability  

June  10,  2012   CARL  2012   21  

Page 22: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Plahorm  Evolu>on  

Network  of  Five  V2Pro  PCI  Cards  (2006)   Network  of  BEE2  Mul>-­‐FPGA  Boards  (2007)  

•   First  to  integrate  hardware  accelera>on  •   Simple  LJ  fluids  only  

•   Added  electrosta>c  terms  •   Added  bonded  terms  

FPGA  portability  and  design  abstrac9on  facilitated  ongoing  migra9on.  

Page 23: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

2010  –  Xilinx/Nallatech  ACP  

June  10,  2012   CARL  2012   23  

Stack  of  5  large  Virtex-­‐5  FPGAs  +  1  FPGA  for  FSB  PHY  interface  

Quad  socket  Xeon  Server  

Page 24: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Origin  of  Computa>onal  Complexity  

103  -­‐  101

0  

∑ −=i

iiib rrkU 20 )(

∑∑∑= = +

=N

i

N

j ij

ji

n nrqq

U1 12

1

τ

⎥⎥⎦

⎢⎢⎣

⎡⎟⎠

⎞⎜⎝

⎛−⎟⎠

⎞⎜⎝

⎛=612

4)(rr

rV σσε

∑ −=i

iiia kU 20 )( θθ

( )[ ]( )∑

⎩⎨⎧

=−

≠−+=

i iii

iiiiit nk

nnkU

0,00,cos1

γφ

O(n2)  

O(n)  

June  10,  2012   24  CARL  2012  

Page 25: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

CPUi  

Processi  Bonded  

Nonbonded  

PME  

Datai  

CPUi  

Processi  Bonded  

Nonbonded  

PME  

Datai  

Typical  MD  Simulator  

CPUi  

Processi  Bonded  

Nonbonded  

PME  

Datai  

CPUi  

Processi  Bonded  

Nonbonded  

PME  

Datai  

June  10,  2012   25  CARL  2012  

Page 26: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

TMD  Machine  Architecture  

Bond    Engine  

Visualizer  

Output  

Scheduler  

Input  

MPI::Send(&msg,  size,  dest  …);  Atom  

Manager  Atom  Manager  Atom  

Manager  

Bond    Engine  

Long  range  Electrosta>cs  

Engine  

Long  range  Electrosta>cs  

Engine  

Long  range  Electrosta>cs  

Engine  

Atom  Manager  

Short  range  Nonbond  Engine  

Short  range  Nonbond  Engine  

Short  range  Nonbond  Engine  

Short  range  Nonbond  Engine  

Short  range  Nonbond  Engine  

Short  range  Nonbond  Engine  

June  10,  2012   26  CARL  2012  

Page 27: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

FSB  

Target  Plahorm  for  MD  

FSB  

NBE   NBE  

NBE   NBE  

FSB  

NBE   NBE  

NBE   NBE  

MEM   PME  

FSB  

NBE   NBE  

NBE   NBE  

PME   MEM  

Socket  0  

Socket  2  

Socket  1  

Socket  3  

Short  range  Nonbonded  

Long  range  Electrosta>c  

Bonds  

Ini9al  Breakdown  of  CPU  Time  �  12  short  range  nonbond  FPGAs  �  2-­‐3  pipelines/NBE  FPGA;  Each  runs  15-­‐30x  CPU  �  NBE  360-­‐1080x  

�  2  PME  FPGAs  with  fast  memory  and  fibre  op>c  interconnects  �  PME  420x  

�  Bonds  on  quad-­‐core  Xeon  server  �  Bonds  1x  

Sys  Mem  

Sys  Mem  

Quad  Xeon  

Sys  Mem  

8.5  GB/s  @  1066  MHz  

72.5  GB/s  

27  June  10,  2012   CARL  2012  

Page 28: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Performance  Modeling  

Problem  :  Difficult  to  mathema>cally  predict  the  expected  speedup  a  priori  due  to  the  conten>ous  nature  of  many-­‐to-­‐many  communica>ons.  

 Solu9on:    Measuring  the  non-­‐determinis>c  behaviour  using  Jumpshot  on  the  so]ware  version  and  back-­‐annotate  the  determinis>c  behaviour.  •  Make  use  of  exis>ng  tools!  

June  10,  2012   28  CARL  2012  

Page 29: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Single  Timestep  Profile  

Timestep  =  108  ms  (327  506  atoms)  June  10,  2012   29  CARL  2012  

Page 30: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Performance  •  Significant  overlap  between  all  

force  calcula>ons.  •  108.02  ms  is  equivalent  to  

between  80  and  88  Infiniband-­‐connected  cores  at  U  of  T’s  supercomputer,  SciNet.    

•  160-­‐176  hyperthreaded  cores  •  Can  we  do  becer?  

–  140  with  hardware  bond  engines  –  change  engine  from  SW  to  HW,  no  architectural  change  

–  More  with  QPI  systems  

June  10,  2012   30  CARL  2012  

Page 31: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Final  Performance  Equivalent  for  MD  

FPGA/CPU     Supercomputer   Scaling  Factor  

Space   5U   17.5*2U   1/7  Cooling   N/A   Share  of  735-­‐ton  

chiller  ∞?  

Capital  Cost   $15000*   $120000   1/8  Annual  Electricity  Cost   $241    

(Assuming  500W)  $6758   1/30  

Performance  (Core  Equivalent)  

140  Cores   1*140  Cores   140x  

*Current  system  is  a  prototype.  Cost  is  based  on  projec>ons  for  next-­‐genera>on  system.  

Page 32: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

TMD  Perspec>ve  •  S>ll  comparing  apples  to  oranges.  •  Individually,  hardware  engines  are  able  to  sustain  calcula>ons  hundreds  of  >mes  faster  than  tradi>onal  CPUs.  

•  Communica>on  costs  degrade  overall  performance.  •  FPGA  plahorm  is  using  older  CPUs  and  older  communica>on  links  than  SciNet.  

•  Migra>ng  the  FPGA  por>on  to  a  SciNet  compa>ble  plahorm  will  further  increase  the  rela>ve  performance  and  provide  a  more  accurate  CPU/FPGA  comparison.  

Page 33: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

BUILDING  AN  EMBEDDED  APPLICATION  

June  10,  2012   CARL  2012   33  

Page 34: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Restricted  Boltzmann  Machine  

34  

Ly,  Saldaña,  Chow,  FPT  2009  

June  10,  2012   CARL  2012  

Page 35: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Class  II:  Processor-­‐based  Op>miza>ons  Direct  Memory  Access  MPI  Engine  

35  

MPI_Send(...)  1.  Processor  writes  4  words  

•  des>na>on  rank  •  address  of  data  buffer  •  message  size  •  message  tag  

2.  PLB_MPE  decodes  message  header  3.  PLB_MPE  transfers  data  from  memory  

Page 36: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

SW/HW  Op>miza>on  

June  10,  2012   36  

2.33x  

3.94x  

5.32x  CARL  2012  

Page 37: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Class  III:  HW/HW  Op>miza>on  Case  Study:  Vector  Addi>on  

June  10,  2012   37  CARL  2012  

Messages  vs  Streams  

Page 38: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

HW/HW  Op>miza>on  Case  Study:  Vector  Addi>on  

June  10,  2012   38  CARL  2012  

Page 39: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Scalable  Video  Processor  (SoC)  

June  10,  2012   CARL  2012   39  

SDRAM  

MPMC  

Video  Dec  

Video  Out  

MPI  NoC  

Micro  Blaze  

Memory  Writer  

EMAC  

Codec  1   BR

AM  

Codec  2   BR

AM  

To  other  boards  

PLB  

NPI   NPI  NPI  NPI  

NI  

NI  

Page 40: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Streaming  Codec  

June  10,  2012   CARL  2012   40  

SDRAM  

MPMC  

Video  Dec  

Video  Out  

MPI  NoC  

Micro  Blaze  

Memory  Writer  

EMAC  

Codec  1   BR

AM  

Codec  2   BR

AM  

To  other  boards  

PLB  

NPI   NPI  NPI  NPI  

NI  

NI  

Page 41: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Frame  Processing  

June  10,  2012   CARL  2012   41  

SDRAM  

MPMC  

Video  Dec  

Video  Out  

MPI  NoC  

Micro  Blaze  

Memory  Writer  

EMAC  

Codec  1   BR

AM  

Codec  2   BR

AM  

To  other  boards  

PLB  

NPI   NPI  NPI  NPI  

NI  

NI  

Page 42: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Mul>-­‐Card  System  

June  10,  2012   CARL  2012   42  

   

   

NI

   

   

NI

Applica>on  is  agnos>c  to  the  number  of  cards  

Page 43: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

SUPPORTING  PARTIAL  RECONFIGURATION  

June  10,  2012   CARL  2012   43  

Page 44: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Template-­‐based  bitstreams  

Dynamic  Region  

Sta>c  Region  

NoC   NoC   NoC  

DDR  Interface  

A  library  of  pre-­‐built  bitstreams  for  FPGAs  ...  

R4  

R1  

R3  

R2  R0  

Reconfigurable    Module  (RM)  

Saldaña,  Patel,  Liu,  Chow,  ReconFig  2010,  IJRC  2012  

Page 45: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

MPE  and  the  Dynamic  Region  

Main  Compute  Pipeline  

Ctrl  FSM  

MPE  

Data  Fifo  

Cmd  Fifo  

Status  and    Control  signals    (Reset,  busy,  done,  enable)  

Wrapper  

Main  Compute  Pipeline  

Ctrl  FSM  

MPE  

Data  Fifo  

Cmd  Fifo  

Status  and    Control  signals    (Reset,  busy,  done,  enable)  

Network-­‐on-­‐Chip  

Wrapper  

gRst  SRL  

Dynamic  Region  

Sta>c  Region  

Network-­‐on-­‐Chip  

FF  PR  Flag  

Wrapper-­‐contained    (Applica>on-­‐specific  template  bitstreams)  

Self-­‐contained  (Generic  template  bitstreams)  

Page 46: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

PR  Synchroniza>on,  Data  Store  and  Restore  

Par>al  Reconfigura>on  

Process  

Par>al  Reconfigura>on  

Process  Req-­‐to-­‐Cfg  

Cfg-­‐Done  

Cfg-­‐Done  

Time  

OK-­‐to-­‐Cfg  Cfg  

Configura>on  Controller  X86  

Configura>on  Controller  X86  

RM_A  

RM_B  

RM_A  

RM_B  

Processor-­‐Ini>ated  PR  

RM-­‐Ini>ated  PR  

Page 47: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

PR  User  code  

MPI_Init();      //  <-­‐-­‐-­‐  Template  bitstream  configura>on  (where  possible)  ...  MPI_Send  (  ...,  dest,  CFG_TAG,...);    MPI_Recv  (  status_data_RM_A,  …  ,  dest,    OK_TO_CFG_TAG,  ...);    ARCHES_MPI_Reconfig  (  RM_B.bit,  board_num,  fpga_num  );    MPI_Send  (  status_data_RM_B,  …  ,  dest,  CFG_DONE_TAG);  ...  …  MPI_Recv  (  status_data_RM_A,  …  ,  dest,    REQ_TO_CFG_TAG,  ...);    ARCHES_MPI_Reconfig  (  RM_B.bit,  board_num,  fpga_num  );    MPI_Send  (  status_data_RM_B,  …  ,  dest,  CFG_DONE_TAG);  ...  

Processor-­‐ini>ated    PR  

RM-­‐ini>ated    PR  

Page 48: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

HARDWARE  SUPPORT  FOR  BROADCAST  AND  REDUCE  

June  10,  2012   CARL  2012   48  

Page 49: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

June  10,  2012   CARL  2012   49  

Experimental  Plahorm  Peng,  Saldaña,  Chow,  FPL2011  

Page 50: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

June  10,  2012   CARL  2012   50  

Changes  to  the  NoC  

Page 51: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

June  10,  2012   CARL  2012   51  

Changes  to  the  NoC  

Page 52: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

DEBUGGING  AND  PROFILING  

June  10,  2012   CARL  2012   52  

Page 53: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

53   CARL 2012

Debugging: Multi-FPGA System-Level Simulation

ModelSim ModelSim ModelSim

MPI Messages

FPGA signals Stdout

ACP0 ACP1 ACP2 Intel

Quad-core Xeon

•  Test SW and HW ranks all at once •  No need to resynthesize •  Full visibility into the FPGA •  Good for modeling application-defined protocols at initial stages of development

June 10, 2012

Page 54: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

54   CARL 2012

Profiling: Jumpshot

•  Well-known tool •  Extracts MPI protocol states from the MPE •  Profile just like in Software •  Works only for embedded processors and hardware engines

June 10, 2012

Nunes,  Saldaña,  Chow,  FPT  2008  

Page 55: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

ADDING  HIGH-­‐LEVEL  SYNTHESIS  

June  10,  2012   CARL  2012   55  

Page 56: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

The  Flow  

June  10,  2012   CARL  2012   56  

Also  a  system  simula>on  

HLS  AutoESL  

Page 57: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Benefits  of  Using  MPI  for  HLS  

•  High-­‐level  data  and  control  flow  defined  at  MPI  message  level  

•  Not  synthesizing  the  en>re  system  – HLS  focus  is  on  kernels  of  computa>on  

•  Interfaces  are  well-­‐defined  – Easier  hand-­‐off  to  HLS  tool  (or  human)  

June  10,  2012   CARL  2012   57  

Page 58: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

SOME  NUMBERS  

June  10,  2012   CARL  2012   58  

Page 59: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Configura>ons  for  Performance  Tes>ng  

Send  round-­‐trip  messages  between  two  MPI  tasks  (black  squares)  X86  has  Xeon  cores  using  so]ware  MPI,    FPGA  has  hardware  engines  (HW)  using  the  MPE    Δt  =  round_trip_>me/(2*num_samples)  Latency  =  Δt  for  a  small  message  size  BW  =  message_size/Δt    Measurements  here  are  done  using  only  FSB-­‐Base  modules.    We  can  do  this  also  with  the  FSB-­‐Compute  and  FSB-­‐Expansion  Modules  by  moving  the  loca>on  of  the  HW  

June  10,  2012   59  CARL  2012  

Xeon-­‐Xeon   Xeon-­‐HW   Intra-­‐FPGA  HW-­‐HW   Inter-­‐FPGA  HW-­‐HW  

Page 60: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Preliminary  Performance  Numbers  

Xeon-Xeon Xeon-HW HW-HW (intra-FPGA)

HW-HW (inter-FPGA)

Latency [µs] (64-byte transfer) 1.9 2.78 0.39 3.5

Bandwidth [MB/s] 1000 410 531 400

June  10,  2012   60  CARL  2012  

On-chip network using 32-bit channels and clocked at 133 MHz MPI using Rendezvous Protocol

l   Xilinx  driver  performance  numbers  l  Latency  =  0.5  μs  (64  byte  transfer)  l  Bandwidth  =  2  GB/s    

l   MPI  Ready  Protocol  achieves  about  1/3  of  the  Rendezvous  latency.  For  Xeon-­‐HW  it  is  1μs  (only  2X  slower  than  Xilinx  driver  transfer  latency)    l   128-­‐bit  on-­‐chip  channels  will  quadruple  the  HW  bandwidth  (to  approx.  2GB/s)  and  also  reduce  latency  

l Other  performance  enhancements  possible      

Page 61: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Latency  

June  10,  2012   CARL  2012   61  

1

10

100

1000

10000

100000

0 8 32 128 512 2048 8192 32768 131072 524288 2097152

Tim

e [u

s]

Message Size [Bytes]

(delta = n^2) Point-to-point with Rendezvous Protocol

FSB

x8-Gen1-PCIe

1Gb Eth – UDP

Page 62: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Bandwidth  

June  10,  2012   CARL  2012   62  

0

100

200

300

400

500

600

700

800

0 8 32 128 512 2048 8192 32768 131072 524288 2097152

Thro

ughp

ut [M

B/s

]

Message Size [Bytes]

Point-to-point with Rendezvous Protocol

FSB

x8-Gen1-PCIe 1Gb Eth – UDP

Page 63: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Conclusions  I  

•  Raising  the  level  of  abstrac9on  important  – scalability,  portability,  flexibility,  reusability,  maintainability  

– produc>vity,  accessible  to  applica>on  experts  •  Adap9ng  an  exis9ng  programming  model  brings  an  ecosystem  that  can  be  leveraged  – Debugging    – Profiling    – Knowledge  and  experience  

June  10,  2012   CARL  2012   63  

Page 64: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Conclusions  II  •  Adap9ng  MPI  works  well  

– Well-­‐known  programming  model  for  parallel  processing  

–  Significant  ecosystem  for  heterogeneous  systems  possible  

–  Provides  for  incremental  and  itera>ve  design  •  MPI  can  be  easily  extended/adapted  for  a  heterogeneous  environment  – Messages  vs  streaming  –  Coalescing  –  Par>al  reconfigura>on  

June  10,  2012   CARL  2012   64  

Page 65: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Conclusions  III  •  Computa9onal  architecture  may  change  with  awareness  of  heterogeneous  compu9ng  elements  – MD  –  heterogeneous  versus  homogeneous  par>>oning  

– Messages  used  to  carry  instruc>ons  to  engines  •  Must  do  more  top-­‐down  thinking  about  how  to  use/incorporate  FPGAs  into  the  compu9ng  world  – Mostly  bocom-­‐up  (hardware)  thinking  so  far  – OpenCL  looks  to  be  a  popular  path  today  

June  10,  2012   CARL  2012   65  

Page 66: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

The  Real  Workers  Taneem  Ahmed  Xander  Chin  Charles  Lo  Chris  Madill  Vince  Mirian  Arun  Patel  Manuel  Saldaña  Kam  Pui  Tang  Ruediger  Willenberg      

June  10,  2012   CARL  2012   66  

Chris  Comis  Danny  Gupta  Alex  Kaganov  Daniel  Ly  Daniel  Nunes  Emanuel  Ramalho  Lesley  Shannon  David  Woods  Mike  Yan      

Page 67: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Research  Support  

June  10,  2012   CARL  2012   67  

SOCRN  

Page 68: MPI$as$a$Programming$Model$and ......The!Typical!Accelerator!Model! • Accelerator!is!a slave!to!so]ware! • Hostini>ates!and! controls!all! interac>ons! • Does!notscale!to! more!accelerators!

Thank  you  

June  10,  2012   CARL  2012   68  

Professor  Paul  Chow  University  of  Toronto  [email protected]