A Profiler for a Multi-Core Multi-FPGA System
![Page 1: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/1.jpg)
A Profiler for a Multi-Core Multi-FPGA System
by
Daniel Nunes
Supervisor:
Professor Paul Chow
September 30th, 2008
University of Toronto
Electrical and Computer Engineering Department
![Page 2: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/2.jpg)
Overview
Background Profiling Model The Profiler Case Studies Conclusions Future Work
![Page 3: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/3.jpg)
How Do We Program This System?
Let's look at what traditional clusters use and try to port it to this type of machine
[Figure: four User FPGAs and one Ctrl FPGA]
![Page 4: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/4.jpg)
Traditional Clusters
MPI is a de facto standard for parallel HPC
MPI can also be used to program a cluster of FPGAs
![Page 5: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/5.jpg)
The TMD
Heterogeneous multi-core multi-FPGA system developed at UofT
Uses message passing (TMD-MPI)
![Page 6: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/6.jpg)
TMD-MPI
Subset of the MPI standard
Allows independence between the application and the hardware
TMD-MPI functionality is also implemented in hardware (TMD-MPE)
![Page 7: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/7.jpg)
TMD-MPI – Rendezvous Protocol
This implementation uses the Rendezvous protocol, a synchronous communication mode
[Figure: message sequence between sender and receiver: Req. to Send, Acknowledge, Data]
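The three-message exchange named on this slide can be sketched in software. This is a minimal simulation, assuming two in-memory queues stand in for the links; the function name `rendezvous_transfer` and the message tuples are hypothetical and are not taken from TMD-MPI.

```python
from collections import deque


def rendezvous_transfer(payload):
    """Simulate one rendezvous transfer; return (trace, received payload)."""
    to_receiver = deque()  # sender -> receiver link (hypothetical)
    to_sender = deque()    # receiver -> sender link (hypothetical)
    trace = []

    # 1. The sender announces the message but holds the data back.
    to_receiver.append(("REQ_TO_SEND", len(payload)))
    trace.append("REQ_TO_SEND")

    # 2. The receiver is ready for the message, so it acknowledges.
    kind, size = to_receiver.popleft()
    assert kind == "REQ_TO_SEND"
    to_sender.append(("ACK", size))
    trace.append("ACK")

    # 3. Only after the acknowledge does the sender ship the data.
    kind, _ = to_sender.popleft()
    assert kind == "ACK"
    to_receiver.append(("DATA", list(payload)))
    trace.append("DATA")

    kind, received = to_receiver.popleft()
    return trace, received


trace, received = rendezvous_transfer([1, 2, 3])
print(trace)
```

Because the data only moves after the acknowledge, the sender cannot complete before the receiver is ready, which is what makes the mode synchronous.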
![Page 8: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/8.jpg)
The TMD Implementation on BEE2 Boards
[Figure: four User FPGAs and one Ctrl FPGA, each containing a NoC, with PPC and MB processors]
![Page 9: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/9.jpg)
How Do We Profile This System?
Let's look at how it is done in traditional clusters and try to adapt it to hardware
![Page 10: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/10.jpg)
MPICH - MPE
Collects information from MPI calls and user-defined states through embedded calls
Includes a tool to view all log files (Jumpshot)
![Page 11: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/11.jpg)
Goals Of This Work
Implement a hardware profiler capable of extracting the same data as the MPE
Make it less intrusive
Make it compatible with the API used by MPE
Make it compatible with Jumpshot
![Page 12: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/12.jpg)
Tracers
The Profiler interacts with the computation elements through tracers that register important events
TMD-MPE requires two tracers due to its parallel nature
[Figure: a PPC with a Processor's Computation Tracer; TMD-MPE blocks, each with a Receive Tracer and a Send Tracer; a hardware engine with an Engine's Computation Tracer]
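What "registering important events" amounts to can be illustrated with a toy software model. This is only a sketch: the `Tracer` class, its methods, and the begin/end record format are hypothetical, and the real tracers are hardware blocks driven by a cycle counter.

```python
class Tracer:
    """Toy software model of a tracer: timestamps events with a cycle counter."""

    def __init__(self):
        self.cycle = 0    # stands in for the hardware cycle counter
        self.events = []  # (cycle, state, "begin" | "end") records

    def tick(self, cycles=1):
        self.cycle += cycles

    def record(self, state, edge):
        self.events.append((self.cycle, state, edge))


def state_durations(events):
    """Pair each begin/end record and return (state, cycles spent) tuples."""
    open_states = {}
    durations = []
    for cycle, state, edge in events:
        if edge == "begin":
            open_states[state] = cycle
        else:
            durations.append((state, cycle - open_states.pop(state)))
    return durations


tracer = Tracer()
tracer.record("send", "begin")
tracer.tick(40)
tracer.record("send", "end")
tracer.record("compute", "begin")
tracer.tick(100)
tracer.record("compute", "end")
print(state_durations(tracer.events))
```

Timestamped begin/end pairs of this kind are the sort of record a timeline viewer needs in order to draw one bar per state.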
![Page 13: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/13.jpg)
Tracers - Hardware Engine Computation
[Figure: Tracer for Hardware Engine: a Cycle Counter and register R0 feed a MUX over 32-bit paths]
![Page 14: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/14.jpg)
Tracers - TMD-MPE
[Figure: Tracer for TMD-MPE: registers R0 to R4, an MPE Data Reg, and a Cycle Counter feed MUXes over 32-bit paths; input comes from the TMD-MPE]
![Page 15: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/15.jpg)
Tracers – Processors Computation
[Figure: Tracer for PowerPC/MicroBlaze: register banks (9 x 32 bits and 5 x 32 bits) and stacks for MPI Calls States and User Define States, plus a Cycle Counter, feed a MUX over 32-bit paths; input comes from the PPC]
![Page 16: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/16.jpg)
Profiler’s Network
[Figure: on the User FPGA, the Tracers feed a Gather block; on the Control FPGA, a Collector stores the data in DDR]
![Page 17: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/17.jpg)
Synchronization
Synchronization within the same board: release the reset of the cycle counters simultaneously
Synchronization between boards: periodic exchange of messages between the root board and all other boards
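The slide does not say how the exchanged messages are turned into a clock correction. One common approach, shown here purely as an assumption, is a round-trip estimate that splits the round-trip time evenly between both directions. The function names and the sample cycle counts are hypothetical.

```python
def estimate_offset(t_send, t_remote, t_recv):
    """Estimate a remote board's counter offset relative to the root board.

    t_send   : root cycle count when the request left
    t_remote : remote cycle count stamped on arrival (reply sent at once)
    t_recv   : root cycle count when the reply came back
    Assumes the link delay is the same in both directions.
    """
    one_way = (t_recv - t_send) / 2
    return t_remote - (t_send + one_way)


def to_root_time(remote_cycle, offset):
    """Translate a remote timestamp into the root board's time base."""
    return remote_cycle - offset


offset = estimate_offset(t_send=1000, t_remote=5030, t_recv=1060)
print(offset, to_root_time(6030, offset))
```

Repeating the exchange periodically, as the slide describes, would let such an estimate follow drift between the boards' oscillators.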
![Page 18: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/18.jpg)
Profiler's Flow
[Figure: flow of Collect Data, Dump to Host, Convert To CLOG2, Convert To SLOG2, Visualize with Jumpshot; labels: Back End, Front End, After Execution]
![Page 19: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/19.jpg)
Case Studies
Barrier: Sequential vs Binary Tree
TMD-MPE Unexpected Message Queue: Unexpected Message Queue addressable by rank
The Heat Equation: Blocking Calls vs Non-Blocking Calls
LINPACK Benchmark: 16 Node System Calculating an LU Decomposition of a Matrix
![Page 20: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/20.jpg)
Barrier
Synchronization call – No node will advance until all nodes have reached the barrier
[Figure: ranks 0 to 7 arranged as a binary tree (0; 1, 2; 3, 4, 5, 6; 7) and as a flat layout with rank 0 above ranks 1 to 7]
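A rough way to compare the two barrier layouts is to count how many messages the busiest node must handle. The sketch below assumes a gather-then-release scheme and heap-style numbering (children of rank r are 2r+1 and 2r+2), which matches the rank labels in the figure but is not confirmed by the slides; all function names are hypothetical.

```python
def tree_children(rank, n):
    """Children of `rank` in a heap-numbered binary tree of n nodes."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < n]


def sequential_root_messages(n):
    """Root receives from every other node, then releases each one."""
    return 2 * (n - 1)


def tree_max_messages(n):
    """Largest number of messages any single node handles in the tree."""
    worst = 0
    for rank in range(n):
        links = len(tree_children(rank, n)) + (0 if rank == 0 else 1)
        # One message per link on the way up, one on the way down.
        worst = max(worst, 2 * links)
    return worst


print(sequential_root_messages(8), tree_max_messages(8))
```

In the sequential layout the root's load grows linearly with the node count, while in the tree no node ever talks to more than three neighbours.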
![Page 21: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/21.jpg)
Barrier Implemented Sequentially
Send Receive
![Page 22: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/22.jpg)
Barrier Implemented as a Binary Tree
Send Receive
![Page 23: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/23.jpg)
TMD-MPE – Unexpected Messages Queue
All requests to send that arrive at a node before it issues an MPI_RECV are kept in this queue.
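The cost of searching such a queue can be illustrated with two toy data structures: a single list scanned front to back, and a queue addressed by source rank (the variant named in the case-study list). Both classes are hypothetical sketches; the slides do not describe the TMD-MPE's internal organization.

```python
from collections import defaultdict


class LinearUnexpectedQueue:
    """All pending requests in one list, searched from the front."""

    def __init__(self):
        self.entries = []

    def push(self, src, tag):
        self.entries.append((src, tag))

    def match(self, src, tag):
        """Remove the first match; return how many entries were inspected."""
        for i, entry in enumerate(self.entries):
            if entry == (src, tag):
                del self.entries[i]
                return i + 1
        return None


class RankAddressedQueue:
    """Pending requests stored per source rank, so lookup starts there."""

    def __init__(self):
        self.by_rank = defaultdict(list)

    def push(self, src, tag):
        self.by_rank[src].append(tag)

    def match(self, src, tag):
        pending = self.by_rank[src]
        for i, pending_tag in enumerate(pending):
            if pending_tag == tag:
                del pending[i]
                return i + 1
        return None


linear, ranked = LinearUnexpectedQueue(), RankAddressedQueue()
for src in range(1, 8):
    linear.push(src, tag=0)
    ranked.push(src, tag=0)
print(linear.match(7, 0), ranked.match(7, 0))
```

With requests from ranks 1 to 7 pending, matching rank 7 inspects seven entries in the linear queue but only one in the rank-addressed one.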
![Page 24: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/24.jpg)
TMD-MPE – Unexpected Messages Queue
Send Receive Queue Search and Reorganization
![Page 25: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/25.jpg)
TMD-MPE – Unexpected Messages Queue
Send Receive Queue Search and Reorganization
![Page 26: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/26.jpg)
TMD-MPE – Unexpected Messages Queue
Send Receive
![Page 27: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/27.jpg)
The Heat Equation Application
Partial differential equation that describes the temperature change over time
$$v_{i,j} = \frac{u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1}}{4}$$

$$(u_{i,j} - v_{i,j})^2$$
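The update on this slide replaces each interior grid point with the mean of its four neighbours. A minimal sketch of one sweep follows; it assumes fixed boundary values and accumulates the squared differences between old and new values as a convergence measure. Summing them is an assumption, and the function name `jacobi_sweep` is hypothetical.

```python
def jacobi_sweep(u):
    """One sweep over the interior of grid u; returns (new grid, change)."""
    rows, cols = len(u), len(u[0])
    v = [row[:] for row in u]  # boundary values are copied unchanged
    change = 0.0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            v[i][j] = (u[i - 1][j] + u[i + 1][j]
                       + u[i][j - 1] + u[i][j + 1]) / 4
            change += (u[i][j] - v[i][j]) ** 2  # assumed: summed over points
    return v, change


grid = [[100.0, 100.0, 100.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
new_grid, change = jacobi_sweep(grid)
print(new_grid[1][1], change)
```

In a parallel version each rank would own a block of the grid and exchange its edge values with neighbouring ranks before every sweep, which is where the choice between blocking and non-blocking calls comes in.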
![Page 28: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/28.jpg)
The Heat Equation Application
![Page 29: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/29.jpg)
The Heat Equation Application
Send Receive Computation
![Page 30: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/30.jpg)
The Heat Equation Application
Send Receive Computation
![Page 31: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/31.jpg)
The LINPACK Benchmark
Solves a system of linear equations
LU factorization with partial pivoting
$$PA = LU$$
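As a serial reference for the factorization this benchmark performs, here is a minimal sketch of LU factorization with partial pivoting in plain Python. It is a textbook version, not the distributed code profiled in this work; `lu_partial_pivot` is a hypothetical name, and `perm[i]` gives the original row that ends up in row i of PA.

```python
def lu_partial_pivot(a):
    """Return (perm, l, u) such that row perm[i] of a equals row i of l*u."""
    n = len(a)
    u = [row[:] for row in a]
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k.
        pivot = max(range(k, n), key=lambda r: abs(u[r][k]))
        if u[pivot][k] == 0:
            raise ValueError("matrix is singular")
        if pivot != k:
            u[k], u[pivot] = u[pivot], u[k]
            l[k], l[pivot] = l[pivot], l[k]  # carry earlier multipliers along
            perm[k], perm[pivot] = perm[pivot], perm[k]
        l[k][k] = 1.0
        for r in range(k + 1, n):
            factor = u[r][k] / u[k][k]
            l[r][k] = factor
            for c in range(k, n):
                u[r][c] -= factor * u[k][c]
    return perm, l, u


perm, l, u = lu_partial_pivot([[2.0, 1.0, 1.0],
                               [4.0, 3.0, 3.0],
                               [8.0, 7.0, 9.0]])
print(perm)
```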
![Page 32: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/32.jpg)
The LINPACK Benchmark
[Figure: matrix columns 0, 1, 2, 3, 4, 5, ..., n-3, n-2, n-1, each marked as assigned to Rank 0, Rank 1, or Rank 2]
![Page 33: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/33.jpg)
The LINPACK Benchmark
Send Receive Computation
![Page 34: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/34.jpg)
The LINPACK Benchmark
Send Receive Computation
![Page 35: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/35.jpg)
Profiler’s Overhead
| Block | LUTs | Flip-Flops | BRAMs |
|---|---|---|---|
| Collector | 3856 (5%) | 1279 (1%) | 0 (0%) |
| Gather | 187 (0%) | 53 (0%) | 0 (0%) |
| Engine Computation Tracer | 396 (0%) | 701 (1%) | 0 (0%) |
| TMD-MPE Tracer | 526 (0%) | 1000 (1%) | 0 (0%) |
| Processors Computation Tracer without MPE | 1196 (1%) | 1521 (2%) | 0 (0%) |
| Processors Computation Tracer with MPE | 855 (1%) | 1200 (1%) | 0 (0%) |
![Page 36: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/36.jpg)
Conclusions
All major features of the MPE were implemented
The profiler was successfully used to study the behavior of the applications
Less intrusive
More events available to profile
Can profile network components
Compatible with existing profiling software environments
![Page 37: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/37.jpg)
Future Work
Reduce the footprint of the profiler's hardware blocks
Profile the MicroBlaze and PowerPC in a non-intrusive way
Allow real-time profiling
![Page 38: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/38.jpg)
Thank You (Questions?)
![Page 39: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/39.jpg)
The TMD (2)
[Figure: Computation Nodes (a PPC and a Hardware Engine, each with a TMD-MPE and a Network Interface) and Off-Chip Communications Nodes (InterChip and XAUI, attached via FSL) on the On-chip Network]
![Page 40: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/40.jpg)
Profiler (2)
[Figure: Processor Profiler Architecture: a PPC on the PLB with a TMD-MPE; Tracer RX, Tracer TX, and Tracer Comp (reached over DCR through a DCR2FSL Bridge) take input From Cycle Counter and send To Gather. Engine Profiler Architecture: a TMD-MPE with Tracer RX, Tracer TX, and Tracer Comp, each taking input From Cycle Counter and sending To Gather. A GPIO block also appears in the diagram.]
![Page 41: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/41.jpg)
Profiler (1)
[Figure: Board 0 to Board N connected through a Switch over XAUI. On each board, User FPGA 1 to User FPGA 4 hold PPC and μB processors on an On-chip Network, with a Gather block and a Cycle Counter; the Control FPGA holds the Collector and DDR; IC blocks link the FPGAs.]
![Page 43: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/43.jpg)
Hardware Profiling Benefits
Less intrusive
More events available to profile
Can profile network components
Compatible with existing profiling software environments
![Page 44: A Profiler for a Multi-Core Multi-FPGA System](https://reader035.fdocuments.in/reader035/viewer/2022081514/568158a8550346895dc5fb5e/html5/thumbnails/44.jpg)
MPE PROTOCOL
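[Figure: MPE message format, one 32-bit word per row plus a Ctrl bit. Header word (Ctrl bit 1): Opcode, Message Size (NDW), and Src/Dest Rank, with bit boundaries marked at 31, 30, 29, 22, 21, and 0. Following words (Ctrl bit 0): Tag, Data-word (0), Data-word (1), ..., Data-word (NDW-1).]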
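The word sequence of such a message can be sketched as a list of (Ctrl bit, 32-bit word) pairs. The field placement used here (Opcode in bits 31:30, NDW in bits 29:22, rank in bits 21:0) is an assumption inferred from the bit labels on the slide, not a confirmed specification, and `build_message` is a hypothetical helper.

```python
OPCODE_SHIFT = 30  # assumed: Opcode in bits 31:30
NDW_SHIFT = 22     # assumed: Message Size (NDW) in bits 29:22
RANK_BITS = 22     # assumed: Src/Dest Rank in bits 21:0


def build_message(opcode, rank, tag, data):
    """Return the message as a list of (ctrl_bit, word) pairs."""
    ndw = len(data)
    if not 0 <= opcode < 4:
        raise ValueError("opcode does not fit in 2 bits")
    if not 0 <= ndw < 256:
        raise ValueError("NDW does not fit in 8 bits")
    if not 0 <= rank < (1 << RANK_BITS):
        raise ValueError("rank does not fit in 22 bits")
    header = (opcode << OPCODE_SHIFT) | (ndw << NDW_SHIFT) | rank
    words = [(1, header), (0, tag)]      # only the header sets the Ctrl bit
    words += [(0, word) for word in data]
    return words


for ctrl, word in build_message(opcode=1, rank=5, tag=7, data=[10, 20]):
    print(ctrl, hex(word))
```

Setting the Ctrl bit only on the header lets a receiver find the start of each message without counting words.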