An MPI Approach to High-Performance Computing with FPGAs
Chris Madill
Molecular Structure and Function, Hospital for Sick Children
Department of Biochemistry, University of Toronto
Supervised by Dr. Paul Chow, Electrical and Computer Engineering, University of Toronto
SHARCNET Symposium on GPU and CELL Computing 2008
Introduction
Many scientific applications can be accelerated by targeting parallel machines
This work demonstrates a method for combining high performance computer clusters with FPGAs for maximum computational power
Coarse-grained parallelization allows applications to be distributed across hundreds or thousands of nodes
FPGAs can accelerate many computing tasks by 2 or 3 orders of magnitude over a CPU
Popular HPC Configurations
[Figure: common HPC configurations, each built around an interconnection network: CPUs sharing a single memory; a cluster of CPUs, each with its own memory; a network of application-specific processors (GPUs/FPGAs) with local memories; and a heterogeneous network mixing CPUs, FPGAs, and GPUs with local memories.]
A Demanding Application
How Do You Program This?
FPGAs can speed up applications, however...
High barrier of entry for designing digital hardware
Developing monolithic FPGA designs is very daunting
How does one easily take advantage of FPGAs for accelerating HPC applications?
TMD
The Toronto Molecular Dynamics (TMD) machine is an investigation into high-performance computing based on a scalable network of FPGAs
Applications are defined as a simple collection of computing tasks
A task is roughly equivalent to a software process/thread
Major focus is facilitating the transition from cluster-based applications to the TMD machine
[Figure: within the TMD machine supercomputer, a task maps onto a computing engine, an embedded microprocessor, or a processor on a CPU node.]
Application Design Flow
Step 1: Application Prototyping
• Software prototype of application developed
• Profiling identifies compute-intensive routines
Step 2: Application Refinement
• Partitioning into tasks communicating using MPI (see the sketch after these steps)
• Communication patterns analyzed to determine network topology
Step 3: TMD Prototyping
• Tasks are ported to soft-processors on TMD
• On-chip communication network verified
Step 4: TMD Optimization
• Intensive tasks replaced with hardware engines
• MPE handles communication for hardware engines
• Hardware engines easily moved, replicated
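As a sketch of Step 2, the fragment below partitions a hypothetical MD application into one task per MPI rank; the task names (bonded_task, nonbonded_task, ewald_task) are illustrative placeholders, not taken from the TMD code.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical task functions standing in for the compute-intensive
 * routines identified by profiling in Step 1. */
static void bonded_task(void)    { puts("computing bonded terms"); }
static void nonbonded_task(void) { puts("computing non-bonded terms"); }
static void ewald_task(void)     { puts("computing long-range electrostatics"); }

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One task per rank: because tasks touch their peers only through
     * MPI messages, Step 4 can later replace any rank with a hardware
     * engine without changing the others. */
    switch (rank) {
    case 0:  bonded_task();    break;
    case 1:  nonbonded_task(); break;
    default: ewald_task();     break;
    }

    MPI_Finalize();
    return 0;
}
```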
[Figure: the application prototype runs as Processes A, B, and C communicating over MPI on a CPU cluster; the same tasks A, B, and C then run on the FPGA network, communicating through TMD-MPI.]
Communication
Uses an essential subset of the MPI standard
Software library for tasks running on processors
Hardware Message Passing Engine (MPE) for hardware-based tasks
Tasks do not know (or care) whether remote tasks run as software processes or hardware engines
MPI isolation of tasks facilitates C-to-gates compilers
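This transparency is visible at the call site. The minimal sketch below assumes, purely for illustration, that rank 1 is a hardware engine fronted by an MPE; the sending code would be identical if rank 1 were an ordinary software process.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    float forces[64] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 1 may be a soft-processor running the TMD-MPI library
         * or a hardware engine behind an MPE; the send is the same. */
        MPI_Send(forces, 64, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(forces, 64, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```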
Xilinx ACP
The Xilinx Advanced Computing Platform (ACP) consists of modules that plug directly into a CPU socket
Direct access to the front-side bus (FSB)
CPU and FPGA are both peers in the system
Equal-priority access to main memory
Xilinx ACP
CPU does not have to orchestrate the activity of the FPGA
CPU does not have to relay data to and from the FPGAs
FPGA is not behind a slow connection to the CPU
All tasks can run independently
Tasks in MD
$$U = \sum_{\mathrm{All\ Bonds}} k_b\,(l - l_0)^2 + \sum_{\mathrm{All\ Angles}} k_\theta\,(\theta - \theta_0)^2 + \sum_{\mathrm{All\ Torsions}} A\,[1 + \cos(n\tau)] + \sum_{\mathrm{All\ Pairs}} 4\epsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right] + \sum_{\mathrm{All\ Pairs}} \frac{q_1 q_2}{r}$$
[Figure: water molecule with partial charges δ+ and δ-.]
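As a rough sketch of what one non-bonded engine evaluates, the function below computes the last two (pairwise) terms of the potential for a single pair at separation r; the parameter values are placeholders, and unit constants (e.g. Coulomb's constant) are deliberately folded away for simplicity.

```c
#include <math.h>
#include <stdio.h>

/* Lennard-Jones plus simplified Coulomb energy for one atom pair.
 * epsilon/sigma are LJ parameters, q1/q2 are partial charges; unit
 * conversion factors are omitted. */
static double pair_energy(double r, double epsilon, double sigma,
                          double q1, double q2)
{
    double sr6 = pow(sigma / r, 6.0);               /* (sigma/r)^6            */
    double lj  = 4.0 * epsilon * (sr6 * sr6 - sr6); /* 4e[(s/r)^12 - (s/r)^6] */
    double es  = q1 * q2 / r;                       /* electrostatic term     */
    return lj + es;
}

int main(void)
{
    /* Placeholder parameters for a single illustrative pair. */
    printf("U(pair) = %f\n", pair_energy(3.5, 0.2, 3.2, 0.4, -0.4));
    return 0;
}
```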
Final MD Target
[Figure: the target system on the FSB: a quad-core CPU with main memory and three Xilinx ACP modules. Modules 1 and 2 each contain a communication FPGA plus two user FPGAs hosting non-bonded engines NBE 1-8; module 3 contains a communication FPGA plus two user FPGAs hosting Ewald engines for the long-range electrostatics term of the MD potential above.]
Conclusion
Target system is a combination of software running on CPUs and FPGA hardware accelerators
The key to performance is identifying hotspots and adding corresponding hardware acceleration
The hardware engineer needs to focus only on a small part of the overall application
MPI facilitates hardware/software isolation and collaboration
Acknowledgements
Prof. Paul Chow
Prof. Régis Pomès¹,²
TMD Group: Danny Gupta, Alireza Heiderbarghi, Alex Kaganov, Daniel Ly, Chris Madill¹,², Daniel Nunes, Emanuel Ramalho, David Woods
Past Members: David Chui, Christopher Comis, Sam Lee, Daniel Ly, Lesley Shannon, Mike Yan
Arches Computing: Arun Patel, Manuel Saldaña
SOCRN
1: Molecular Structure and Function, The Hospital for Sick Children
2: Department of Biochemistry, University of Toronto
TMD-MPI Implementation
[Figure: layer stack between the application and the hardware: MPI Application Interface, Point-to-Point MPI Functions, Send/Receive Implementation, FSL Hardware Interface.]
Layer 4: MPI Interface
All MPI functions implemented in TMD-MPI that are available to the application.
Layer 3: Collective Operations
Barrier synchronization, data gathering, and message broadcasts.
Layer 2: Communication Primitives
MPI_Send and MPI_Recv methods used to transmit data between processes.
Layer 1: Hardware Interface
Low-level methods to communicate with FSLs for both on-chip and off-chip communication.
Intra-FPGA Communication
Communication links are based on Fast Simplex Links (FSLs)
• Unidirectional point-to-point FIFO
• Provides buffering and flow control
• Can be used to isolate different clock domains
FSLs simplify component interconnects
• Standardized interface, used by both hardware engines and processors
• System modules can be assembled rapidly
Application-specific network topologies can be defined
Inter-FPGA Communication
Inter-FPGA communication uses abstracted communication links
Communication is independent of the physical link
• Single serial transceivers (FSL-over-Aurora)
• Bonded serial transceivers (FSL-over-XAUI)
• Parallel buses (FSL-over-Wires)
• FSL-over-10GbE coming soon…