An MPI Approach to High-Performance Computing with FPGAs Chris Madill Molecular Structure and Function, Hospital for Sick Children Department of Biochemistry, University of Toronto Supervised by Dr. Paul Chow Electrical and Computer Engineering, University of Toronto SHARCNET Symposium on GPU and CELL Computing 2008
  • date post

    19-Dec-2015


Page 1: Chris Madill Molecular Structure and Function, Hospital for Sick Children Department of Biochemistry, University of Toronto Supervised by Dr. Paul Chow.

An MPI Approach to High-Performance Computing with FPGAs

Chris Madill

Molecular Structure and Function, Hospital for Sick Children

Department of Biochemistry, University of Toronto

Supervised by Dr. Paul Chow

Electrical and Computer Engineering, University of Toronto

SHARCNET Symposium on GPU and CELL Computing 2008

Page 2:

Introduction

Many scientific applications can be accelerated by targeting parallel machines

This work demonstrates a method for combining high-performance computer clusters with FPGAs for maximum computational power

Coarse-grained parallelization allows applications to be distributed across hundreds or thousands of nodes

FPGAs can accelerate many computing tasks by 2 or 3 orders of magnitude over a CPU

Page 3:

Popular HPC Configurations

[Figure: four node configurations, each joined by an interconnection network: CPUs with a single MEM block; CPUs with one MEM block each; ASPs (GPU/FPGA) with one MEM block each; a mix of CPU, FPGA, and GPU nodes with MEM blocks.]

Page 4:

A Demanding Application

Page 5:

How Do You Program This?

FPGAs can speed up applications, however...

High barrier to entry for designing digital hardware

Developing monolithic FPGA designs is very daunting

How does one easily take advantage of FPGAs for accelerating HPC applications?

Page 6:

TMD

The Toronto Molecular Dynamics (TMD) machine is an investigation into high-performance computing based on a scalable network of FPGAs

Applications are defined as a simple collection of computing tasks

A task is roughly equivalent to a software process/thread

Major focus is facilitating transition from cluster-based applications to TMD machine

[Figure labels: Task; Computing Engine; Embedded Microprocessor; Processor on CPU Node; TMD Machine; Supercomputer; CPU.]

Page 7:

Application Design Flow

Step 1: Application Prototyping
• Software prototype of application developed
• Profiling identifies compute-intensive routines

Step 2: Application Refinement
• Partitioning into tasks communicating using MPI
• Communication patterns analyzed to determine network topology

Step 3: TMD Prototyping
• Tasks are ported to soft-processors on TMD
• On-chip communication network verified

Step 4: TMD Optimization
• Intensive tasks replaced with hardware engines
• MPE handles communication for hardware engines
• Hardware engines easily moved, replicated
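The profiling in Step 1 can be sketched as follows. This is a minimal Python illustration, not the TMD group's tooling: `pair_energy`, `bond_energy`, and `prototype` are hypothetical stand-ins for application routines, and the standard-library `cProfile` module is assumed as the profiler.

```python
import cProfile
import pstats


def pair_energy(n):
    """Hypothetical O(n^2) routine standing in for a compute-intensive task."""
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1.0 / (j - i)
    return total


def bond_energy(n):
    """Hypothetical O(n) routine standing in for a cheap task."""
    return sum(0.5 * (i % 3) ** 2 for i in range(n))


def prototype(n=300):
    """Software prototype: runs every routine once."""
    return pair_energy(n) + bond_energy(n)


def hottest_function(func):
    """Profile func and return the name of the routine with the most own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    own_time = {key[2]: value[2] for key, value in stats.stats.items()}
    return max(own_time, key=own_time.get)


if __name__ == "__main__":
    print(hottest_function(prototype))
```

In this toy prototype the quadratic pair loop dominates, so it is the routine that would be a candidate for a hardware engine in Step 4.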

[Figure: the Application Prototype becomes Process A, Process B, and Process C connected by MPI on the CPU Cluster, then tasks A, B, and C connected by TMD-MPI on the FPGA Network, with B appearing more than once.]

Page 8:

Communication

Use essential subset of MPI standard

Software library for tasks run on processors

Hardware Message Passing Engine (MPE) for hardware-based tasks

Tasks do not know (or care) whether remote tasks are run as software processes or hardware engines

MPI isolation of tasks facilitates C-to-gates compilers
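The idea that a task neither knows nor cares how its peer is implemented can be sketched in Python. Everything here is hypothetical and for illustration only: `Network` is a toy in-memory message fabric, not TMD-MPI, and `HardwareSquarer` merely stands in for a hardware engine sitting behind an MPE.

```python
import queue
import threading


class Network:
    """Toy message-passing fabric: one mailbox per (source, destination) rank pair."""

    def __init__(self, size):
        self.size = size
        self._boxes = {(s, d): queue.Queue() for s in range(size) for d in range(size)}

    def send(self, src, dest, data):
        self._boxes[(src, dest)].put(data)

    def recv(self, src, dest):
        # The timeout only guards against a hung example.
        return self._boxes[(src, dest)].get(timeout=5)


def software_squarer(net, rank):
    """Rank implemented as an ordinary software process."""
    x = net.recv(0, rank)
    net.send(rank, 0, x * x)


class HardwareSquarer:
    """Stand-in for a hardware engine reached through a message-passing engine."""

    def __call__(self, net, rank):
        x = net.recv(0, rank)
        net.send(rank, 0, x * x)


def run(worker, value):
    """Rank 0 sends a value to rank 1 and waits for the reply."""
    net = Network(size=2)
    thread = threading.Thread(target=worker, args=(net, 1))
    thread.start()
    net.send(0, 1, value)
    result = net.recv(1, 0)
    thread.join()
    return result
```

The code for rank 0 in `run` is identical in both cases; only the implementation behind rank 1 changes.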


Page 9:

Xilinx ACP

The Xilinx Advanced Computing Platform (ACP) consists of modules that plug directly into a CPU socket

Direct access to the front-side bus (FSB)

CPU and FPGA are both peers in system

Equal priority main memory access

Page 10:

Xilinx ACP

CPU does not have to orchestrate activity of FPGA

CPU does not have to relay data to and from FPGAs

FPGA not on slow connection to CPU

All tasks can run independently


Page 11:

Tasks in MD

$$
U = \sum_{\text{All Bonds}} k_b (l - l_o)^2
  + \sum_{\text{All Angles}} k_\theta (\theta - \theta_o)^2
  + \sum_{\text{All Torsions}} A\,[1 + \cos(n\tau - \phi)]
  + \sum_{\text{All Pairs}} 4\varepsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right]
  + \sum_{\text{All Pairs}} \frac{q_1 q_2}{r}
$$
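The potential energy on this slide is a sum over bonds, angles, torsions, and two pairwise terms. A minimal Python sketch of those terms follows. It assumes the usual force-field forms (harmonic bonds and angles, a cosine torsion, a 12-6 Lennard-Jones term, and a Coulomb term with the constant set to 1); the function names, argument layouts, and the single shared `eps` and `sigma` are illustrative choices, not taken from the talk.

```python
import math


def bond_energy(bonds):
    """Sum k_b * (l - l_o)^2 over (k_b, l, l_o) tuples."""
    return sum(k * (l - l0) ** 2 for k, l, l0 in bonds)


def angle_energy(angles):
    """Sum k_theta * (theta - theta_o)^2 over (k_theta, theta, theta_o) tuples."""
    return sum(k * (th - th0) ** 2 for k, th, th0 in angles)


def torsion_energy(torsions):
    """Sum A * [1 + cos(n * tau - phi)] over (A, n, tau, phi) tuples."""
    return sum(a * (1.0 + math.cos(n * tau - phi)) for a, n, tau, phi in torsions)


def lennard_jones(eps, sigma, r):
    """4 * eps * [(sigma / r)^12 - (sigma / r)^6] for one pair."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)


def coulomb(q1, q2, r):
    """q1 * q2 / r for one pair, with the Coulomb constant taken as 1."""
    return q1 * q2 / r


def nonbonded_energy(positions, charges, eps, sigma):
    """Sum both pairwise terms over all unique pairs (O(n^2) work)."""
    total = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            total += lennard_jones(eps, sigma, r) + coulomb(charges[i], charges[j], r)
    return total
```

The bonded terms scale with the number of bonds, angles, and torsions, while the pairwise loop grows quadratically with the number of atoms, which makes it the natural candidate for hardware acceleration.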

Page 12:

Final MD Target

[Figure: a Quad Core CPU with MEM and three Xilinx ACP Modules attached to the FSB. The first module shows User FPGA 1, User FPGA 2, a Comm FPGA, and engines NBE 1 to NBE 4; the second shows User FPGA 3, User FPGA 4, a Comm FPGA, and engines NBE 5 to NBE 8; the third shows User FPGA 5, User FPGA 6, a Comm FPGA, and two Ewald engines. The bond, angle, torsion, and pairwise terms of the potential energy function are shown alongside.]

Page 13:

Conclusion

Target system is a combination of software running on CPUs and FPGA hardware accelerators

Key to performance is in identifying hotspots and adding corresponding hardware acceleration

Hardware engineer must focus only on small part of overall application

MPI facilitates hardware/software isolation, collaboration
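The point about hotspots can be made concrete with Amdahl's law. The law is not stated on the slide and is used here only as an illustration; the function name and the example fractions are assumptions.

```python
def overall_speedup(accelerated_fraction, engine_speedup):
    """Amdahl's law: whole-run speedup when only part of the run is accelerated.

    accelerated_fraction: share of the original run time spent in the hotspot (0 to 1).
    engine_speedup: how many times faster the hardware engine runs that hotspot.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if engine_speedup <= 0.0:
        raise ValueError("engine_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / engine_speedup)
```

For example, `overall_speedup(0.9, 100.0)` is about 9.2: a 100x engine applied to a routine that takes 90% of the run time yields less than a 10x gain overall, because the remaining 10% still runs at the original speed.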

Page 14:

Acknowledgements

SOCRN

Prof. Paul Chow, Prof. Régis Pomès (1,2)

David Chui, Christopher Comis, Sam Lee, Daniel Ly, Lesley Shannon, Mike Yan

Danny Gupta, Alireza Heiderbarghi, Alex Kaganov, Daniel Ly, Chris Madill (1,2)

Daniel Nunes, Emanuel Ramalho, David Woods

Arun Patel, Manuel Saldaña

Group headings on the slide: Arches Computing; TMD Group; Past Members

1: Molecular Structure and Function, The Hospital for Sick Children

2: Department of Biochemistry, University of Toronto

Page 15:

Layout Editor


Page 16:

TMD-MPI Implementation

[Figure: layer stack between Application (top) and Hardware (bottom): MPI Application Interface; Point-to-Point MPI Functions; Send/Receive Implementation; FSL Hardware Interface.]

Layer 4: MPI Interface
All MPI functions implemented in TMD-MPI that are available to the application.

Layer 3: Collective Operations
Barrier synchronization, data gathering and message broadcasts.

Layer 2: Communication Primitives
MPI_Send and MPI_Recv methods are used to transmit data between processes.

Layer 1: Hardware Interface
Low level methods to communicate with FSLs for both on and off-chip communication.
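The layering can be sketched in Python: a word-level FIFO at the bottom, send and receive built on it, and a broadcast built only from sends and receives. This is an illustration of the layering idea, not TMD-MPI itself; the class names, the length-word header, and the single-threaded calling order are all assumptions.

```python
from collections import deque


class Fifo:
    """Layer 1 stand-in: a word-wide FIFO channel (hypothetical, not the FSL API)."""

    def __init__(self):
        self._words = deque()

    def put(self, word):
        self._words.append(word)

    def get(self):
        return self._words.popleft()


class Node:
    """One rank, holding the shared table of (source, destination) FIFOs."""

    def __init__(self, rank, size, links):
        self.rank = rank
        self.size = size
        self.links = links

    # Layer 2: point-to-point primitives built from FIFO words.
    def send(self, dest, payload):
        link = self.links[(self.rank, dest)]
        link.put(len(payload))  # assumed header: one word giving the length
        for word in payload:
            link.put(word)

    def recv(self, src):
        link = self.links[(src, self.rank)]
        length = link.get()
        return [link.get() for _ in range(length)]

    # Layer 3: a collective operation built only from Layer 2 calls.
    def bcast(self, payload, root):
        if self.rank == root:
            for dest in range(self.size):
                if dest != root:
                    self.send(dest, payload)
            return list(payload)
        return self.recv(root)


def make_nodes(size):
    """Create `size` ranks that share one FIFO per ordered pair of ranks."""
    links = {(s, d): Fifo() for s in range(size) for d in range(size) if s != d}
    return [Node(rank, size, links) for rank in range(size)]
```

Because this sketch is single-threaded, the root must call `bcast` before the other ranks do. Only `Fifo` would need to change to target a different physical channel, which is the benefit of keeping the hardware interface in its own layer.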

Page 17:

Intra-FPGA Communication

Communication links are based on Fast Simplex Links (FSL)
• Unidirectional Point-to-Point FIFO
• Provides buffering and flow-control
• Can be used to isolate different clock domains

FSLs simplify component interconnects
• Standardized interface, used by both hardware engines and processors
• Can assemble system modules rapidly

Application-specific network topologies can be defined
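The buffering and flow-control behaviour of a unidirectional FIFO link can be modelled in a few lines of Python. This is a toy software model under assumed names (`SimplexLink`, `try_write`, `try_read`), not the FSL signal interface, and it does not model clock-domain crossing.

```python
from collections import deque


class SimplexLink:
    """Toy model of a unidirectional, bounded point-to-point FIFO."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self._buf = deque()
        self._depth = depth

    @property
    def full(self):
        """True when the producer must stall (back-pressure)."""
        return len(self._buf) >= self._depth

    @property
    def has_data(self):
        """True when the consumer can read a word."""
        return bool(self._buf)

    def try_write(self, word):
        """Producer side: store the word, or return False if the buffer is full."""
        if self.full:
            return False
        self._buf.append(word)
        return True

    def try_read(self):
        """Consumer side: return (True, word), or (False, None) if empty."""
        if not self._buf:
            return False, None
        return True, self._buf.popleft()
```

A producer that sees `try_write` return `False` simply retries later, so neither side needs to know the other's speed; the buffer depth sets how far the two can drift apart.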

Page 18:

Inter-FPGA Communication

Inter-FPGA communication uses abstracted communication links

Communication is independent of physical link
• Single serial transceivers (FSL-over-Aurora)
• Bonded serial transceivers (FSL-over-XAUI)
• Parallel buses (FSL-over-Wires)
• FSL-over-10GbE coming soon…