High-Performance Custom Computing System with Stratix10...
Kentaro Sano
Processor Research Team, R-CCS Riken
High-Performance Custom Computing System
with Stratix10 FPGAs
April 16, 2019
JLESC Project on FPGA Platforms
What kinds of platforms are feasible or promising with FPGAs? What programming models? What killer applications?
Motivation
Many-core becomes difficult to scale in the post-Moore era.
Many-core architecture:
  Inefficient utilization of transistors: mechanisms to boost IPC, but no more transistors and no more power budget on a chip
  Latency-sensitive characteristic: von Neumann with memory-update cycle + control cycle
  Inefficient data movement among cores: read and write in memory hierarchy (scratch pad / cache)
What We Need?
  Inefficient utilization of transistors -> more efficient use of transistors and switching for computation
  Latency-sensitive characteristic -> latency-tolerant architecture w/o cycles
  Inefficient data movement among cores -> data movement w/o memory access
Answer: Spatial Custom Computing
  Spatial computing w/ data-flow: flow instead of cycles
  Customization: reconfigurable computing (with FPGAs)
[Figure: data-flow graph with inputs x_i, y_i, z_i, constants A, B, C, three multipliers and two adders, output out_i, and a flow controller]
We consider data-movement first.
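The idea of "flow instead of cycles" can be illustrated in software. The following is a minimal sketch, not the hardware design from the talk: it assumes the data-path in the figure computes out_i = A*x_i + B*y_i + C*z_i (the exact wiring is not recoverable from the transcript), and the name `dataflow_pipeline` is hypothetical. Each element streams through the multipliers and adders once, and no intermediate array is written back to memory.

```python
from typing import Iterable, Iterator, Tuple


def dataflow_pipeline(
    stream: Iterable[Tuple[float, float, float]],
    A: float,
    B: float,
    C: float,
) -> Iterator[float]:
    """Stream (x_i, y_i, z_i) tuples through a fixed multiply-add data-path.

    Assumed data-path (illustrative only): out_i = A*x_i + B*y_i + C*z_i.
    """
    for x, y, z in stream:
        # Three multipliers operate on the same element side by side.
        ax, by, cz = A * x, B * y, C * z
        # Two adders combine the products; nothing is stored in between.
        yield (ax + by) + cz


inputs = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
outputs = list(dataflow_pipeline(inputs, A=2.0, B=3.0, C=4.0))
```

In hardware the generator loop would correspond to a pipelined data-path that accepts a new element as soon as the previous one has moved on, so the operator latency is hidden by the stream.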
Concept and Architecture
System-wide spatial custom computing (stream data through custom data-paths)
[Figure: tasks (kernels) placed on FPGAs / CPUs, connected to memory / storage and to each other through a global network]
Architecture example (based on conservative ideas)
[Figure: nodes with CPU cores and main memory on the CPU's network; FPGAs with their own memory on an inter-FPGA network]
SCC: spatial custom computing
Toward Proof of Concept
We are building an experimental prototype system with a Stratix10 FPGA cluster.
  How to couple the SCC part with an existing machine?
  How to program custom data-paths on FPGAs?
  Ecosystem?
Experimental Prototype System
[Figure: system diagram; recoverable labels follow]
Host servers (as an existing machine): #1 (many core), #2, #3, each with 10GBASE-T and 100Gb/s IB
InfiniBand EDR 4X switch; system network (10GBASE-T Ethernet)
FPGA cluster for spatial custom computing
  FPGA-gateway servers #1 to #8, two FPGAs each (FPGA #1 to FPGA #16)
  100Gb/s IB per gateway server; PCIe Gen3 x16
  Dedicated inter-FPGA network (QSFP28 cables)
Software bridge for FPGAs; FPGA cluster bridged with IB
Management server (10GBASE-T x n), external network (10GBase-T); console, LCD KBD SW
PAC: Programmable Acceleration Card
Rush Creek (preliminary eval.)
  Arria10 FPGA (20nm): 10AX115N2F40E2LG
  1150K LEs, 53 Mb BRAMs, 1518 FP DSPs (1.5 TF in SP)
  8GB DDR4 x 2ch, PCIe Gen3 x8, 1x QSFP+ (40Gb/s)
Darby Creek (prototype system)
  Stratix10 FPGA (14nm): 1SX280HN2F43E2VG
  2753K LEs, 229 Mb BRAMs, 5760 FP DSPs (10 TF in SP)
  8GB DDR4 x 4ch, PCIe Gen3 x16, 2x QSFP28 (100Gb/s), ARM Cortex-A53 1.5 GHz
FPGA Shell and User Logic on FPGA
FIM (FPGA Interface Manager): HW shell made by Intel
AFU (Accelerator Functional Unit): reconfigurable logic region
AFU Shell: our own HW shell, with DMAs, interconnect, and custom computing cores
[Figure: FPGA board (PAC) with Arria10 FPGA, connected to the PCIe host (host CPU's memory), DDR4 memories, and QSFP+; recoverable labels follow]
FIM blocks: DDR4A EMIF, DDR4B EMIF, fabric, PCIe Gen3 x8 EP, FIU, FME, PR, platform management, HSSI (PHY, PHY mode ctrl, PLL, reset, reconf, controller)
Interfaces to the AFU slot: local memory interface; CCI-P interface, clocks, power, error; HSSI interface
Available to AFU: ALMs 391,213 (92%), M20K 2549 (94%), DSPs 1518 (100%)
AFU Shell blocks: M2S DMA, S2M DMA, memory interconnect, DDR4 EMIF, Switch 1, Switch 2, custom computing core
Intel OPAE (Open Programmable Accel. Engine)
Light-weight software interface for PAC
  PCIe driver, APIs, user space libraries (e.g. DMA)
  Resource management
  PCIe memory access
  Partial reconfiguration
[Figure: software stack; labels: DMA API, SGDMAs, Intel FPGA (PAC)]
Remote-OPAE (developed by us)
Software bridge to connect hosts and FPGAs by IB (EDR)
  Apps on hosts can use remote FPGAs transparently
[Figure: software stack spanning a host server and an FPGA-gateway server over IB EDR; labels: R-OPAE C API, R-OPAE Library, R-OPAE Daemon, Verbs / RDMA, DMA Library, DMA API, SGDMAs, Intel FPGA (PAC)]
Allocation Flexibility by Remote-OPAE
Software bridge for hosts connected by InfiniBand
  Apps on hosts can use remote FPGAs transparently
  We can allocate arbitrary groups of FPGAs to hosts.
[Figure: host servers #1, #2, ... #n on one side; gateway (GW) servers #1, #2, ... #n on the other, each with FPGAs #g-1, #g-2, #g-3]
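The allocation idea above can be sketched as a small bookkeeping routine. This is an illustration only and not the Remote-OPAE interface: the function `allocate`, the pool layout, and the host names are hypothetical, and the sketch assumes that any FPGA behind any gateway server can be handed to any host.

```python
from typing import Dict, List


def allocate(requests: Dict[str, int], fpgas: List[str]) -> Dict[str, List[str]]:
    """Assign each host the requested number of FPGAs from a shared pool.

    Hypothetical helper: it only shows that a host's group may span
    several gateway servers.
    """
    if sum(requests.values()) > len(fpgas):
        raise ValueError("not enough FPGAs in the pool")
    pool = iter(fpgas)
    return {host: [next(pool) for _ in range(n)] for host, n in requests.items()}


# Pool: 2 gateway servers with 3 FPGAs each, named as in the figure (FPGA #g-k).
pool = [f"FPGA #{g}-{k}" for g in (1, 2) for k in (1, 2, 3)]
groups = allocate({"Host server #1": 4, "Host server #2": 2}, pool)
```

Here the first host receives all three FPGAs of gateway #1 plus one FPGA of gateway #2, which a direct PCIe attachment could not express.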
Preliminary Evaluation of Computing Core with Stratix10 FPGA
Fluid dynamics simulation
  LBM (lattice Boltzmann method)
Array architecture of SPEs
  Stream Processing Elements
  Generated by SPGen compiler
[Figure: an LBM SPE is a pipeline of boundary, propagation, and collision stages; SPE 1, SPE 2, SPE 3 are cascaded (pipelining), each containing multiple pipes (spatially parallelizing)]
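The cascading of SPEs can be mimicked with chained generators. This is a toy sketch under stated assumptions: it assumes each cascaded SPE advances every streamed cell by one step, and the three stage functions are placeholders standing in for boundary, propagation, and collision, not the LBM equations. The names `spe` and `cascade` are hypothetical and unrelated to SPGen output.

```python
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[float], float]


def spe(cells: Iterable[float], stages: List[Stage]) -> Iterator[float]:
    """One stream processing element: apply its stages to each streamed cell."""
    for value in cells:
        for stage in stages:
            value = stage(value)
        yield value


def cascade(cells: Iterable[float], stages: List[Stage], n: int) -> Iterator[float]:
    """Connect n SPEs back to back so one pass over the stream does n steps."""
    stream: Iterable[float] = cells
    for _ in range(n):
        stream = spe(stream, stages)
    return iter(stream)


# Placeholder stages (NOT real LBM physics): identity, shift, relax.
toy_stages: List[Stage] = [lambda v: v, lambda v: v + 1.0, lambda v: v * 0.5]
result = list(cascade([0.0, 2.0], toy_stages, n=3))
```

Because the generators are lazy, each cell passes through all three SPEs before the next cell is read, which mirrors how a cascaded pipeline avoids writing intermediate time steps to memory.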
LBM Resource Consumption (Preliminary)
Relative consumption of n-cascaded LBM PEs
[Figure: bar charts for Arria10 (DE5A-NET) and Stratix10 (Darby Creek); series: logic, registers, on-chip memory, floating-point DSP blocks]
Annotations:
  Arria10 (1518 DSPs): 531 GF at 225 MHz
  Stratix10 (5760 DSPs), x4 cores: 2 TF at 225 MHz, ~7.1 TF at 800 MHz
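The slide does not state how the Stratix10 estimates were derived. The figures are consistent with a simple scaling rule, sketched below as an assumption: throughput grows linearly with the number of replicated cores and with clock frequency. The function name `scale_throughput` is hypothetical.

```python
def scale_throughput(
    gflops: float,
    cores: int = 1,
    f_from_mhz: float = 1.0,
    f_to_mhz: float = 1.0,
) -> float:
    """Scale a measured throughput linearly by core count and clock ratio.

    Assumption for illustration: no memory or routing bottleneck appears
    when replicating cores or raising the clock.
    """
    return gflops * cores * (f_to_mhz / f_from_mhz)


arria10_gf = 531.0  # from the slide: 531 GF at 225 MHz

# x4 cores at the same 225 MHz clock: 2124 GF, i.e. about 2 TF.
stratix10_225_gf = scale_throughput(arria10_gf, cores=4)

# Scaling the rounded 2 TF figure from 225 MHz to 800 MHz: about 7.1 TF.
stratix10_800_gf = scale_throughput(2000.0, f_from_mhz=225.0, f_to_mhz=800.0)
```

The x4 replication is also roughly in line with the DSP counts on the slide, since 5760 / 1518 is about 3.8.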
Summary
Need to change architecture for post-Moore era
  Spatial custom computing
Experimental prototype system with Stratix10 FPGAs
  Intel PAC and OPAE
  Data-flow compiler
  Loosely-coupled approach with Remote-OPAE
  Dedicated inter-FPGA network (Torus, Ethernet-based)
Preliminary evaluation
  Bandwidth and latency for OPAE and Remote-OPAE with Arria10
  LBM fluid simulation performance estimation for Stratix10
Future work and collaboration?