An MPI Approach to High-Performance Computing with FPGAs
Chris Madill
Molecular Structure and Function, Hospital for Sick Children
Department of Biochemistry, University of Toronto
Supervised by Dr. Paul Chow, Electrical and Computer Engineering, University of Toronto
SHARCNET Symposium on GPU and CELL Computing 2008
Introduction
Many scientific applications can be accelerated by targeting parallel machines
This work demonstrates a method for combining high performance computer clusters with FPGAs for maximum computational power
Coarse-grained parallelization allows applications to be distributed across hundreds or thousands of nodes
FPGAs can accelerate many computing tasks by 2 or 3 orders of magnitude over a CPU
Popular HPC Configurations
[Figure: common HPC configurations, each built around an interconnection network: CPUs sharing a single memory; a cluster of CPUs, each with its own memory; a network of application-specific processors (GPUs/FPGAs) with local memories; and a heterogeneous network mixing CPUs, FPGAs, and GPUs with local memories.]
A Demanding Application
How Do You Program This?
FPGAs can speed up applications, however...
High barrier of entry for designing digital hardware
Developing monolithic FPGA designs is very daunting
How does one easily take advantage of FPGAs for accelerating HPC applications?
TMD
The Toronto Molecular Dynamics (TMD) machine is an investigation into high-performance computing based on a scalable network of FPGAs
Applications are defined as a simple collection of computing tasks
A task is roughly equivalent to a software process/thread
Major focus is facilitating the transition from cluster-based applications to the TMD machine
[Figure: within the TMD machine supercomputer, a task maps onto a computing engine, an embedded microprocessor, or a processor on a CPU node.]
Application Design Flow
Step 1: Application Prototyping
• Software prototype of application developed
• Profiling identifies compute-intensive routines
Step 2: Application Refinement
• Partitioning into tasks communicating using MPI (see the sketch after these steps)
• Communication patterns analyzed to determine network topology
Step 3: TMD Prototyping
• Tasks are ported to soft-processors on TMD
• On-chip communication network verified
Step 4: TMD Optimization
• Intensive tasks replaced with hardware engines
• MPE handles communication for hardware engines
• Hardware engines easily moved, replicated
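As a sketch of Step 2, the fragment below partitions a hypothetical MD application into one task per MPI rank; the task names (bonded_task, nonbonded_task, ewald_task) are illustrative placeholders, not taken from the TMD code.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical task functions standing in for the compute-intensive
 * routines identified by profiling in Step 1. */
static void bonded_task(void)    { puts("computing bonded terms"); }
static void nonbonded_task(void) { puts("computing non-bonded terms"); }
static void ewald_task(void)     { puts("computing long-range electrostatics"); }

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One task per rank: because tasks touch their peers only through
     * MPI messages, Step 4 can later replace any rank with a hardware
     * engine without changing the others. */
    switch (rank) {
    case 0:  bonded_task();    break;
    case 1:  nonbonded_task(); break;
    default: ewald_task();     break;
    }

    MPI_Finalize();
    return 0;
}
```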
[Figure: the application prototype runs as Processes A, B, and C communicating over MPI on a CPU cluster; the same tasks A, B, and C then run on the FPGA network, communicating through TMD-MPI.]
Communication
Uses an essential subset of the MPI standard
Software library for tasks running on processors
Hardware Message Passing Engine (MPE) for hardware-based tasks
Tasks do not know (or care) whether remote tasks run as software processes or hardware engines
MPI isolation of tasks facilitates C-to-gates compilers
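This transparency is visible at the call site. The minimal sketch below assumes, purely for illustration, that rank 1 is a hardware engine fronted by an MPE; the sending code would be identical if rank 1 were an ordinary software process.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    float forces[64] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 1 may be a soft-processor running the TMD-MPI library
         * or a hardware engine behind an MPE; the send is the same. */
        MPI_Send(forces, 64, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(forces, 64, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```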
Xilinx ACP
The Xilinx Advanced Computing Platform (ACP) consists of modules that plug directly into a CPU socket
Direct access to the front-side bus (FSB)
CPU and FPGA are both peers in the system
Equal-priority access to main memory
Xilinx ACP
CPU does not have to orchestrate the activity of the FPGA
CPU does not have to relay data to and from the FPGAs
FPGA is not behind a slow connection to the CPU
All tasks can run independently
Tasks in MD
$$U = \sum_{\mathrm{All\ Bonds}} k_b\,(l - l_0)^2 + \sum_{\mathrm{All\ Angles}} k_\theta\,(\theta - \theta_0)^2 + \sum_{\mathrm{All\ Torsions}} A\,[1 + \cos(n\tau)] + \sum_{\mathrm{All\ Pairs}} 4\epsilon\left[\left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6}\right] + \sum_{\mathrm{All\ Pairs}} \frac{q_1 q_2}{r}$$
[Figure: water molecule with partial charges δ+ and δ-.]
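As a rough sketch of what one non-bonded engine evaluates, the function below computes the last two (pairwise) terms of the potential for a single pair at separation r; the parameter values are placeholders, and unit constants (e.g. Coulomb's constant) are deliberately folded away for simplicity.

```c
#include <math.h>
#include <stdio.h>

/* Lennard-Jones plus simplified Coulomb energy for one atom pair.
 * epsilon/sigma are LJ parameters, q1/q2 are partial charges; unit
 * conversion factors are omitted. */
static double pair_energy(double r, double epsilon, double sigma,
                          double q1, double q2)
{
    double sr6 = pow(sigma / r, 6.0);               /* (sigma/r)^6            */
    double lj  = 4.0 * epsilon * (sr6 * sr6 - sr6); /* 4e[(s/r)^12 - (s/r)^6] */
    double es  = q1 * q2 / r;                       /* electrostatic term     */
    return lj + es;
}

int main(void)
{
    /* Placeholder parameters for a single illustrative pair. */
    printf("U(pair) = %f\n", pair_energy(3.5, 0.2, 3.2, 0.4, -0.4));
    return 0;
}
```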
Final MD Target
[Figure: the target system on the FSB: a quad-core CPU with main memory and three Xilinx ACP modules. Modules 1 and 2 each contain a communication FPGA plus two user FPGAs hosting non-bonded engines NBE 1-8; module 3 contains a communication FPGA plus two user FPGAs hosting Ewald engines for the long-range electrostatics term of the MD potential above.]
Conclusion
Target system is a combination of software running on CPUs and FPGA hardware accelerators
The key to performance is identifying hotspots and adding corresponding hardware acceleration
The hardware engineer needs to focus only on a small part of the overall application
MPI facilitates hardware/software isolation and collaboration
Acknowledgements
Prof. Paul Chow
Prof. Régis Pomès¹,²
TMD Group: Danny Gupta, Alireza Heiderbarghi, Alex Kaganov, Daniel Ly, Chris Madill¹,², Daniel Nunes, Emanuel Ramalho, David Woods
Past Members: David Chui, Christopher Comis, Sam Lee, Daniel Ly, Lesley Shannon, Mike Yan
Arches Computing: Arun Patel, Manuel Saldaña
SOCRN
1: Molecular Structure and Function, The Hospital for Sick Children
2: Department of Biochemistry, University of Toronto
TMD-MPI Implementation
[Figure: layer stack between the application and the hardware: MPI Application Interface, Point-to-Point MPI Functions, Send/Receive Implementation, FSL Hardware Interface.]
Layer 4: MPI Interface
All MPI functions implemented in TMD-MPI that are available to the application.
Layer 3: Collective Operations
Barrier synchronization, data gathering, and message broadcasts.
Layer 2: Communication Primitives
MPI_Send and MPI_Recv methods used to transmit data between processes.
Layer 1: Hardware Interface
Low-level methods to communicate with FSLs for both on-chip and off-chip communication.
Intra-FPGA Communication
Communication links are based on Fast Simplex Links (FSLs)
• Unidirectional point-to-point FIFO
• Provides buffering and flow control
• Can be used to isolate different clock domains
FSLs simplify component interconnects
• Standardized interface, used by both hardware engines and processors
• System modules can be assembled rapidly
Application-specific network topologies can be defined
Inter-FPGA Communication
Inter-FPGA communication uses abstracted communication links
Communication is independent of the physical link
• Single serial transceivers (FSL-over-Aurora)
• Bonded serial transceivers (FSL-over-XAUI)
• Parallel buses (FSL-over-Wires)
• FSL-over-10GbE coming soon…