High-Performance Custom Computing System with Stratix10...

Kentaro Sano, Processor Research Team, R-CCS, RIKEN. High-Performance Custom Computing System with Stratix10 FPGAs. April 16, 2019


Page 1: High-Performance Custom Computing System with Stratix10 …icl.utk.edu/jlesc9/files/STA2.2/jlesc9_sano.pdf · 2019-04-18 · FPGA #1 FPGA #2 FPGA‐gateway server #1 FPGA Cluster

Kentaro Sano

Processor Research Team, R-CCS, RIKEN

High-Performance Custom Computing System with Stratix10 FPGAs

April 16, 2019

Page 2: JLESC Project on FPGA Platforms

What kinds of platforms are feasible and promising with FPGAs?

What programming models?

What killer applications?

Page 3: Motivation

Many-core architectures are becoming difficult to scale in the post-Moore era.

Inefficient utilization of transistors: mechanisms to boost IPC, but no more transistors and no more power budget on a chip.

Latency-sensitive characteristic: von Neumann with a memory-update cycle and a control cycle.

Inefficient data movement among cores: read and write in the memory hierarchy (scratch pad / cache).

[Figure: many-core architecture]

Page 4: What We Need?

Inefficient utilization of transistors (mechanisms to boost IPC, but no more transistors and no more power budget on a chip). Need: more efficient use of transistors and switching for computation.

Latency-sensitive characteristic (von Neumann with a memory-update cycle and a control cycle). Need: a latency-tolerant architecture without cycles.

Inefficient data movement among cores (read and write in the memory hierarchy: scratch pad / cache). Need: data movement without memory access.

Page 5: Answer: Spatial Custom Computing

More efficient use of transistors and switching for computation

Latency-tolerant architecture without cycles

Data movement without memory access

Spatial computing with data-flow: flow instead of cycles

[Figure: data-flow graph. Labels: A, B, C, x_i, y_i, z_i, multipliers (x), adders (+), out_i, flow controller]

Customization: reconfigurable computing (with FPGAs)

We consider data-movement first.
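The data-flow figure on this slide can be read as a fixed arrangement of multipliers and adders through which elements stream one after another. Below is a minimal Python sketch of that idea. It assumes the graph computes out_i = A*x_i + B*y_i + C*z_i; that wiring is inferred from the figure labels and is not stated on the slide, and the function name `dataflow_pipeline` is hypothetical.

```python
from typing import Iterable, Iterator


def dataflow_pipeline(xs: Iterable[float], ys: Iterable[float], zs: Iterable[float],
                      a: float, b: float, c: float) -> Iterator[float]:
    """Stream (x_i, y_i, z_i) through three multipliers and two adders.

    Assumed wiring (not confirmed by the slide): out_i = a*x_i + b*y_i + c*z_i.
    """
    for x, y, z in zip(xs, ys, zs):
        # Three multipliers work side by side on the same element.
        ax, by, cz = a * x, b * y, c * z
        # Two adders combine the products; the result leaves as a stream.
        yield (ax + by) + cz


outs = list(dataflow_pipeline([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], a=2.0, b=0.5, c=1.0))
print(outs)  # [8.5, 12.0]
```

The generator only models the order of operations. In a spatial implementation each operator would be its own circuit, so different elements can occupy different operators at the same time.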

Page 6: Concept and Architecture

System-wide spatial custom computing (SCC): stream data through custom data-paths

[Figure: concept. Labels: memory / storage (x3), FPGAs / CPUs, CPUs, global network, task (kernel)]

Architecture example (based on conservative ideas)

[Figure: architecture example. Labels: node, cores, main memory, CPU's network, FPGA, memory, inter-FPGA network]

Page 7: Toward Proof of Concept

We are building an experimental prototype system with a Stratix10 FPGA cluster.

How to couple the SCC part with an existing machine?

How to program custom data-paths on FPGAs?

Ecosystem?

Page 8: Experimental Prototype System

[Figure: system diagram of the prototype. The recoverable labels are listed below.]

Host servers #1 (many-core), #2 and #3, as an existing machine: each with 10GBASE-T and 100Gb/s IB

InfiniBand EDR 4X switch: FPGA cluster bridge with IB; software bridge for FPGAs

FPGA cluster for spatial custom computing: FPGA-gateway servers #1 to #8, each with 100Gb/s IB and two FPGAs (FPGA #1 to FPGA #16); PCIe Gen3 x16

Dedicated inter-FPGA network: QSFP28 cables

System network (10GBASE-T Ethernet): 10GBASE-T x n, management server, external network (10GBase-T)

Console, LCD KBD SW
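The figure lists FPGA #1 and #2 under FPGA-gateway server #1, and so on up to FPGA #15 and #16 under server #8. A small Python sketch of that numbering follows, assuming two FPGAs per gateway throughout (the figure elides the middle of the range); the helper name `gateway_of` is hypothetical.

```python
def gateway_of(fpga: int, fpgas_per_gateway: int = 2) -> int:
    """Return the 1-based gateway-server number hosting a 1-based FPGA number."""
    if fpga < 1:
        raise ValueError("FPGA numbers start at 1")
    return (fpga - 1) // fpgas_per_gateway + 1


# Build the gateway -> FPGA list for the 16-FPGA prototype.
cluster = {}
for fpga in range(1, 17):
    cluster.setdefault(gateway_of(fpga), []).append(fpga)

print(cluster[1], cluster[4], cluster[8])  # [1, 2] [7, 8] [15, 16]
```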

Page 9: PAC: Programmable Acceleration Card

Rush Creek (preliminary evaluation)
Arria10 FPGA (20nm): 10AX115N2F40E2LG
1150K LEs, 53 Mb BRAMs
1518 FP DSPs (1.5 TF in SP)
8GB DDR4 x 2ch
PCIe Gen3 x8
1x QSFP+ (40Gb/s)

Darby Creek (prototype system)
Stratix10 FPGA (14nm): 1SX280HN2F43E2VG
2753K LEs, 229 Mb BRAMs
5760 FP DSPs (10 TF in SP)
8GB DDR4 x 4ch
PCIe Gen3 x16
2x QSFP28 (100Gb/s)
ARM Cortex-A53 1.5 GHz
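The peak figures for the two cards (1.5 TF with 1518 FP DSPs, 10 TF with 5760 FP DSPs) can be related to a clock frequency with simple arithmetic. The Python sketch below assumes each FP DSP block completes one single-precision multiply-add (2 FLOPs) per cycle. That assumption, and the helper names, are not taken from the slides.

```python
FLOPS_PER_DSP_PER_CYCLE = 2  # assumption: one single-precision multiply-add per cycle


def implied_clock_mhz(peak_tflops: float, dsps: int) -> float:
    """Clock (MHz) at which `dsps` blocks would reach `peak_tflops`."""
    return peak_tflops * 1e12 / (dsps * FLOPS_PER_DSP_PER_CYCLE) / 1e6


def peak_tflops(dsps: int, clock_mhz: float) -> float:
    """Peak single-precision TFLOPS for `dsps` blocks at `clock_mhz`."""
    return dsps * FLOPS_PER_DSP_PER_CYCLE * clock_mhz * 1e6 / 1e12


for name, tf, dsps in [("Rush Creek (Arria10)", 1.5, 1518),
                       ("Darby Creek (Stratix10)", 10.0, 5760)]:
    print(f"{name}: about {implied_clock_mhz(tf, dsps):.0f} MHz implied")
```

Under this assumption the quoted peaks correspond to roughly 494 MHz and 868 MHz respectively, which suggests they are theoretical ceilings rather than clocks a full design would necessarily reach.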

Page 10: FPGA Shell and User Logic on FPGA

FIM (FPGA Interface Manager): HW shell made by Intel

AFU (Acceleration Function Unit): reconfigurable logic region

AFU Shell: our own HW shell, with DMAs, interconnect and custom computing cores

Available to AFU: ALMs 391,213 (92%), M20K 2549 (94%), DSPs 1518 (100%)

[Figure: block diagram of the FPGA board (PAC) with an Arria10 FPGA. FIM labels: PCIe Gen3 x8 EP, FME, PR, fabric, platform management, DDR4A EMIF, DDR4B EMIF, HSSI controller (HSSI PHY, PLL, reset, reconf, PHY mode ctrl), FIU. AFU slot labels: CCI-P interface, clocks, power, error; local memory interface; HSSI interface. AFU Shell labels: M2S DMA, S2M DMA, memory interconnect, DDR4 EMIF, Switch 1, Switch 2, custom computing core. External labels: PCIe (host CPU's memory), DDR4 memories, QSFP+.]

Page 11: Intel OPAE (Open Programmable Accel. Engine)

Light-weight software interface for PAC

PCIe driver, APIs, user space libraries (e.g. DMA)

Resource management

PCIe memory access

Partial reconfiguration

[Figure: software stack. Labels: DMA API, SGDMAs, Intel FPGA (PAC)]

Page 12: Remote-OPAE (developed by us)

Software bridge to connect hosts and FPGAs by IB (EDR)

Apps on hosts can use remote FPGAs transparently

[Figure: software stack. Host server: R-OPAE C API, R-OPAE Library, Verbs / RDMA. Link: IB EDR. FPGA-gateway server: R-OPAE Daemon, DMA Library, DMA API, SGDMAs, Intel FPGA (PAC).]

Page 13: Allocation Flexibility by Remote-OPAE

Software bridge for hosts connected by InfiniBand

Apps on hosts can use remote FPGAs transparently

We can allocate arbitrary groups of FPGAs to hosts.

[Figure: host servers #1 to #n on one side; gateway servers #1 to #n on the other, each with three FPGAs (FPGA #1-1 to #1-3, FPGA #2-1 to #2-3, ..., FPGA #n-1 to #n-3)]
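Allocating arbitrary groups of FPGAs to hosts amounts to bookkeeping over (gateway, FPGA) pairs. The Python sketch below illustrates only that bookkeeping. The `Allocator` class and its methods are hypothetical and are not the Remote-OPAE API, which the slides do not describe.

```python
from typing import Dict, List, Set, Tuple

Fpga = Tuple[int, int]  # (gateway number, FPGA index), e.g. (1, 2) for "FPGA #1-2"


class Allocator:
    """Hypothetical bookkeeping of which host currently holds which FPGAs."""

    def __init__(self, gateways: int, fpgas_per_gateway: int) -> None:
        self.free: Set[Fpga] = {(g, f)
                                for g in range(1, gateways + 1)
                                for f in range(1, fpgas_per_gateway + 1)}
        self.owned: Dict[str, List[Fpga]] = {}

    def allocate(self, host: str, wanted: List[Fpga]) -> None:
        wanted = list(dict.fromkeys(wanted))  # drop duplicates, keep order
        missing = [w for w in wanted if w not in self.free]
        if missing:
            raise ValueError(f"not available: {missing}")
        self.free.difference_update(wanted)
        self.owned.setdefault(host, []).extend(wanted)

    def release(self, host: str) -> None:
        # Return every FPGA held by this host to the free pool.
        self.free.update(self.owned.pop(host, []))


alloc = Allocator(gateways=2, fpgas_per_gateway=3)
alloc.allocate("host1", [(1, 1), (2, 3)])  # a group spanning two gateways
alloc.allocate("host2", [(1, 2)])
print(sorted(alloc.free))  # [(1, 3), (2, 1), (2, 2)]
```

Because a group is just a set of pairs, a host's FPGAs need not sit behind the same gateway server, which is the flexibility the slide points to.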

Page 14: Preliminary Evaluation of Computing Core with Stratix10 FPGA

Fluid dynamics simulation: LBM (lattice Boltzmann method)

Array architecture of SPEs (Stream Processing Elements), generated by the SPGen compiler

[Figure: an LBM SPE made of Boundary, Propagation and Collision stages; SPE 1, SPE 2 and SPE 3 connected by pipes; annotations: "Pipelining", "Spatially parallelizing"]

Page 15: LBM Resource Consumption (Preliminary)

Relative consumption of n-cascaded LBM PEs

[Figure: resource-consumption charts for Arria10 (DE5A-NET, 1518 DSPs) and Stratix10 (Darby Creek, 5760 DSPs). Series: logic, registers, on-chip memory, floating-point DSP blocks. Annotations: x4 cores; 531GF at 225MHz; 2TF at 225MHz; ~7.1TF at 800MHz]
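The slide does not state how the three performance figures relate. One reading consistent with the numbers is that 531 GF at 225 MHz is multiplied by the "x4 cores" factor to give about 2 TF, and that 2 TF is then scaled linearly with clock frequency to 800 MHz. The Python sketch below checks that arithmetic; the linear-scaling assumption and the function names are not from the slides.

```python
def scale_by_cores(gflops_per_core: float, cores: int) -> float:
    """Assume performance grows linearly with the number of cores."""
    return gflops_per_core * cores


def scale_by_clock(gflops: float, from_mhz: float, to_mhz: float) -> float:
    """Assume performance grows linearly with clock frequency."""
    return gflops * to_mhz / from_mhz


four_cores = scale_by_cores(531.0, 4)          # 2124 GF, close to the slide's "2TF"
at_800 = scale_by_clock(2000.0, 225.0, 800.0)  # starts from the rounded 2 TF figure
print(f"{four_cores / 1000:.2f} TF at 225 MHz, {at_800 / 1000:.2f} TF at 800 MHz")
# prints "2.12 TF at 225 MHz, 7.11 TF at 800 MHz"
```

Under this reading the ~7.1 TF figure is an estimate that depends on the design actually reaching 800 MHz.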

Page 16: Summary

Need to change architecture for the post-Moore era: spatial custom computing

Experimental prototype system with Stratix10 FPGAs
Intel PAC and OPAE
Data-flow compiler
Loosely-coupled approach with Remote-OPAE
Dedicated inter-FPGA network (Torus, Ethernet-based)

Preliminary evaluation
Bandwidth and latency for OPAE and Remote-OPAE with Arria10
LBM fluid simulation performance estimation for Stratix10

Future work and collaboration?