
© Copyright 2019 Xilinx

FPGA as a First-Class Citizen in Cloud Computing

Lintao Zhang, Microsoft Research Asia


Standard Disclaimer

• This presentation contains work of many people, including many of my colleagues, collaborators, and student interns.

• All mistakes are mine.

• This presentation contains only my personal opinions; it does not reflect the position of Microsoft.


[Figure: energy efficiency (GigaOps/Watt, log scale from 1 to 1000) versus estimated year of deployment (2014 to 2020). Labeled data points:]

• Haswell (fp32): 8.6
• Broadwell (fp32): 9.1
• K80 (fp32): 29
• P100 (fp16): 71
• P40 (int8): 188
• V100 (fp 4x4x4): 400
• TPU1 (int8): 1227
• TPU2 (fp): 281
• A10 (int8): 69
• A10 (int4): 200
• S10 (int8): 240
• S10 (int6): 440
• S10 (int4): 640

Note: assumes power consumption of 160W/TPU2 chip (not confirmed)

CPU: End of frequency & Dennard scaling

The rise of domain-specific computing

Credit: Doug Burger


Heterogeneous Architectures

                 CPU         GPU         FPGA    ASIC
Throughput       Low         High        High    Very high
Latency          High        Very high   Low     Low
Power            High        High        Low     Very low
Price at scale   High        Very high   Low     Low
Flexibility      Very high   High        High    Low


Project Catapult: FPGA in Azure


Deployment of FPGAs in Azure

[Figure: deployment diagram. Labels: CPU/GPU/TPU compute + NVMe SSD storage; CPU; QPI; FPGA; 40Gb/s links; 40Gb/s ToR; hardware acceleration plane; Search; ML; Network; DB; SmartNIC]


FPGA in Production: SmartNICs in Azure

High Performance FPGA-powered SDN

• Bandwidth: highest bandwidth of any cloud so far; standard compute VMs get up to 32 Gbps
• Latency: consistently low network latency, 5x+ latency improvement, sub-15 us within tenants
• Enables workloads requiring native performance to run in cloud VMs


FPGA in Production: Project Brainwave

A Scalable FPGA-powered DNN Serving Platform

• Performance: excellent inference performance at low batch size, >10x lower latency than CPUs and GPUs
• Flexibility: supports CNN, LSTM, MLP, RL, and decision trees; exploits sparsity and compression
• Scale: deployed in Azure and serving production workloads


Challenges of Using FPGA in Data Centers

• Programming FPGA is hard
  • Software developers are not familiar with HDL programming
• FPGA needs to communicate and work with other devices
  • No generic socket programming interface is available for hardware
• FPGA needs to provide value for data center workloads
  • FPGA is not traditionally leveraged by applications such as storage, database, stream processing, graph ……
• Algorithms need to adapt for FPGA
  • Need to leverage fine-grained parallelism and fight bandwidth, capacity, and complexity limits





Overview of The Talk

• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High-Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion


Overview of the Talk

• Introduction
• Simplifying FPGA Application Development
  ClickNP: Highly Flexible and High-Performance Network Processing with Reconfigurable Hardware (SIGCOMM’16)
• Simplifying FPGA Communication
• A High-Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion


Challenge: FPGA Programming

Writing and debugging HDL code is hard

88'h68656C6C6F20776F726C64
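The literal on this slide illustrates how opaque raw HDL constants can be. As a small illustration added here (not part of the original slide), the sketch below treats the 88-bit hex value as 11 ASCII bytes and decodes it.

```python
# Decode the 88-bit Verilog hex literal from the slide as ASCII text.
literal = "88'h68656C6C6F20776F726C64"

width_str, hex_digits = literal.split("'h")
width_bits = int(width_str)

# Each byte is two hex digits, so 88 bits should give 11 bytes.
payload = bytes.fromhex(hex_digits)
assert len(payload) * 8 == width_bits

text = payload.decode("ascii")
print(text)  # prints "hello world"
```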


ClickNP: Making FPGA accessible to software developers

• Flexible: fully programmable using a high-level language
• Modularized: Click abstractions familiar to software developers; easy code reuse
• High performance: high throughput; microsecond-scale latency
• Joint CPU/FPGA packet processing: FPGA is no panacea; fine-grained processing separation



ClickNP Programming Model

Stream processing model

[Figure: elements A, B, and C]


Element: single-threaded core

[Figure: an element drawn as a single-threaded core, annotated with (I/O) on both sides, (reg/mem), (main thread), (ISR), and (interrupt)]

• Developed in a C-like language (OpenCL)
• Write once, run on both CPU and FPGA
• Can embed native Verilog code in a ClickNP element
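To make the stream-processing model concrete, here is a minimal Python sketch of elements connected in sequence. It is an illustration only: ClickNP elements are written in a C-like language and run concurrently in hardware, whereas this sketch drains each stage in turn. The class `Element`, the function `run_pipeline`, and the three handlers are hypothetical names invented for this example.

```python
from collections import deque


class Element:
    """A single-threaded processing unit with private state and one output channel."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler  # called once per input item
        self.state = {}         # private state, like the (reg/mem) box
        self.out = deque()      # channel to the next element

    def push(self, item):
        result = self.handler(self.state, item)
        if result is not None:  # a handler may drop an item
            self.out.append(result)


def run_pipeline(elements, inputs):
    """Feed inputs through the elements in order and return the final outputs."""
    first = elements[0]
    for item in inputs:
        first.push(item)
    for upstream, downstream in zip(elements, elements[1:]):
        while upstream.out:
            downstream.push(upstream.out.popleft())
    return list(elements[-1].out)


def count_packets(state, pkt):
    state["seen"] = state.get("seen", 0) + 1
    return pkt


def drop_short(state, pkt):
    return pkt if len(pkt) >= 4 else None


def to_upper(state, pkt):
    return pkt.upper()


a = Element("A", count_packets)
b = Element("B", drop_short)
c = Element("C", to_upper)
outputs = run_pipeline([a, b, c], ["ping", "ab", "hello"])
print(outputs)          # ['PING', 'HELLO']
print(a.state["seen"])  # 3
```

Each element only touches its own state and its channels, which is what lets the same description map onto either a CPU thread or a block of FPGA logic.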


Workflow and Architecture

[Figure: workflow and architecture diagram. Labels: ClickNP script; ClickNP elements; ClickNP compiler; vendor HLS; C compiler; vendor libs; vendor specific runtime; FPGA; Catapult shell; ClickNP; host; Catapult PCIe driver; ClickNP library; ClickNP host process (manager thread, worker thread); ClickNP host manager; PCIe I/O channel]

Cross-platform toolchain:
• Altera OpenCL / Vivado HLS
• Visual Studio / GCC


Development Efficiency & Performance

Network function    Lines of code   Resource util.   Throughput (vs. CPU)   Avg latency (vs. CPU)
Packet generator    665             16%              18x                    25x
Packet capture      250             8%               17x                    18x
Firewall            538             32%              22x                    40x
IPSec gateway       695             35%              60x                    3x
Load balancer       860             36%              55x                    12x
Packet scheduler    584             11%              32x                    10x

[Figure: IPSec gateway, performance comparisons]



• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
  Direct Universal Access: Making Data Center Resources Available to FPGA (NSDI’19)
• A High-Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion


Current FPGA Communication Architecture

[Figure: layered view of the FPGA chip and board. Layers: application layer, transport layer, data link layer, physical layer. Labels: applications; DDR stack; LTL (with DCQCN); NVMe; Host DMA; DDR4 IP; PCIe Gen3 IP; QSFP IP; DMA; FPGA chip; FPGA board with DDR4, PCIe Gen3, and QSFP]


Current FPGA Communication Architecture

[Figure: the same components (applications; DDR stack, LTL with DCQCN, NVMe, Host DMA; DDR4 IP, PCIe Gen3 IP, QSFP IP, DMA; DDR4, PCIe Gen3, QSFP) shown beside a layered network stack in which the MAC layer and physical layer are labeled and one layer is marked "???". Image from Lixia Zhang et al., CCR 2014]


Direct Universal Access

[Figure: servers containing FPGA 1, FPGA 2, FPGA 3, and a CPU. Labels on each FPGA: DUA; LTL; DDR access; FPGA Connect; Host DMA; DDR; PCIe Gen3; QSFP. Other labels: App 1; App 2; intra-server networking fabric; datacenter networking fabric. Circled numbers ① to ④ annotate the diagram]


DUA Overview

[Figure: servers with FPGAs. Each FPGA runs applications on a DUA layer above the existing stacks (LTL, DDR access, FPGA Connect, Host DMA) and attaches to DDR, PCIe Gen3, and QSFP. Other labels: intra-server networking fabric. Circled numbers ① to ④ annotate the diagram]

• DUA is an “IP layer”: an abstract overlay network
  • Leverages all existing h/w stacks
  • Hierarchical addressing & routing
• Efficient routing: direct resource access by FPGA, totally bypassing the CPU
• Compatible BSD-socket interface for both applications and communication stacks
• General (unified) multiplexing
• Security: protects against both inside and outside attacks
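The slides describe DUA as an overlay with hierarchical addressing and routing on top of existing stacks. The sketch below shows one way such a routing decision could look. It is purely illustrative: the `Address` fields and the `choose_path` function are hypothetical and are not the DUA address format or API; only the two fabric names come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Hypothetical hierarchical address: server, then device, then resource."""
    server: str
    device: str
    resource: str


def choose_path(src: Address, dst: Address) -> str:
    """Pick the underlay a message would take, most local first."""
    if src.server != dst.server:
        return "datacenter networking fabric"
    if src.device != dst.device:
        return "intra-server networking fabric"
    return "on-device"


app = Address("server1", "fpga1", "app1")
local_ddr = Address("server1", "fpga1", "ddr")
peer_fpga = Address("server1", "fpga2", "ddr")
remote_fpga = Address("server2", "fpga3", "ddr")

for dst in (local_ddr, peer_fpga, remote_fpga):
    print(dst, "->", choose_path(app, dst))
```

Because the address is compared from the most significant field down, the sender never needs a global table: the first field that differs already tells it which fabric to use.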


System Architecture

[Figure: two servers, each with apps, CPU, NIC, and FPGA. Labels: DUA overlay; DUA underlay; DUA data plane; DUA control plane; CPU control agent; FPGA control agent; FPGA host stack; NVMe stack; LTL; FPGA Connect; Host DMA; DDR controller; QSFP; PCIe Gen3; DDR]


Evaluation – Regex Matching

• Up to 10^5~10^7 times higher throughput than CPU, up to 10^5 times lower latency than CPU
• Up to 3 times the throughput and up to 55% latency reduction compared to using the CPU to move data between FPGAs

[Figure: throughput (GB/s, 0 to 4.5) versus input string length (bytes, 0 to 18000); series: through FPGA Connect, through CPU, pure CPU]

[Figure: latency (us, log scale from 1.00E+00 to 1.68E+07) versus input string length (bytes, 64 to 16384); series: through FPGA Connect, through CPU, pure CPU]



• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High-Performance FPGA Based Key Value Store
  KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC (SOSP’17)
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion


Key-Value Store in Data Centers

• Web cache
• Shared data structure



Design Goals



Key-Value Store Architectures

• Bottleneck: network stack in OS (~300 Kops per core)
• e.g. DPDK, mTCP, libvma, two-sided RDMA
  Bottlenecks: CPU random memory access and KV operation computation (~5 Mops per core)
• Communication overhead: multiple round-trips per KV operation (fetch index, data)
  Synchronization overhead: write operations
• Offload KV processing from the CPU to a programmable NIC


Performance Challenges

[Figure: ToR switch; 40 GbE; FPGA; on-board DRAM (4 GB); PCIe Gen3 x16 DMA; host DRAM (256 GB)]

Annotations added across the slide builds:

• 13 GB/s, 120 Mops. Header overhead and limited parallelism: be frugal on memory accesses
• 1 us delay, 120 Mops. Atomic operations have dependency: PCIe latency hiding
• 0.2 us delay, 100 Mops. Load dispatch
• 60 Mpps. Client-side batching, vector-type operations
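The delay and throughput pairs on this slide hint at why latency hiding matters. By Little's law (average concurrency = throughput x latency), sustaining a given rate at a given delay requires that many operations in flight. The sketch below applies it to the two pairs quoted on the slide; treating them as sustained rates and average latencies is an assumption made here, and the function name is hypothetical.

```python
def in_flight_needed(ops_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrency = throughput x latency."""
    return ops_per_second * latency_seconds


at_1us = in_flight_needed(120e6, 1e-6)      # 120 Mops at 1 us delay
at_0_2us = in_flight_needed(100e6, 0.2e-6)  # 100 Mops at 0.2 us delay

print(round(at_1us))    # 120
print(round(at_0_2us))  # 20
```

Under these assumptions, a design that issues one dependent access at a time would reach only a small fraction of the quoted rate, which is consistent with the slide's emphasis on latency hiding.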


KV Processor Architecture

[Figure: KV processor architecture, with NIC, PCIe, and DRAM labeled]


KV-Direct Performance


Scalability with Multiple NICs

• 1.22 billion KV op/s
• 357 watts power
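The two numbers above can be combined into an energy-efficiency figure. The short calculation below assumes that the 357 W is the power drawn while delivering the 1.22 billion operations per second; the slide does not state the measurement conditions.

```python
kv_ops_per_second = 1.22e9  # from the slide
power_watts = 357           # from the slide

# Operations per second per watt is the same as operations per joule.
ops_per_joule = kv_ops_per_second / power_watts
print(f"{ops_per_joule / 1e6:.2f} million KV operations per joule")  # 3.42
```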



• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High-Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
  Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity (FPGA’19)
• Conclusion


Low Latency DNN Inference is Important

[Figure: network diagram with labels $x_t$, $y_t$, $y_{t-1}$, and $c_{t-1}$]

Matrix-vector multiplication is the workhorse of DNN inference


Weight Pruning

1. Train connectivity
2. Prune connections
3. Train weights

Pruning rule: $\hat{w}_i \leftarrow \mathrm{abs}(w_i)$; if $\hat{w}_i \le \mathrm{Threshold}$ then $w_i \leftarrow 0$

• Highly unstructured sparse pattern
• Sparsity = 50%~90% (near-zeros)
• Difficult to accelerate
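The prune step can be sketched in a few lines of plain Python. This is a minimal illustration of threshold-based magnitude pruning only; the training and retraining steps are omitted, and the weight values, the threshold, and the function names are chosen here for the example.

```python
def magnitude_prune(weights, threshold):
    """Zero every weight whose absolute value is at or below the threshold."""
    return [0.0 if abs(w) <= threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4]
pruned = magnitude_prune(weights, threshold=0.5)
print(pruned)            # [0.8, 0.0, 0.0, 1.5, 1.0, 0.0, 0.0, -1.4]
print(sparsity(pruned))  # 0.5
```

Nothing constrains where the zeros land, which is why the resulting pattern is irregular and hard to map onto parallel hardware.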


Accuracy and Speedup Tradeoff

Fine-grained (irregular)
Pros:
• High model accuracy
• High compression ratio
Cons:
• Irregular pattern
• Difficult to accelerate

Coarse-grained (regular)
Pros:
• Regular pattern
• Easy to accelerate
Cons:
• Low model accuracy
• Low compression ratio


Speedup and Accuracy Tradeoff

[Figure: model loss (the smaller the better; axis from 78 to 85) versus sparsity (0% to 100%); series: original, fine-grained, block size (4*4), block size (8*8), block size (16*16)]

• Fine-grained: accurate but irregular
• Coarse-grained: regular but not accurate


Bank Balanced Pruning

• Bank split: each dense matrix row is split into banks
• Traverse all rows
• Fine-grained pruning inside each bank
• Threshold percentage to obtain identical sparsity ratio among banks

Example (columns 0 to 15, banks separated by "|"):

Dense matrix row: 0.8 -0.1 0.2 1.5 | 1.0 0.3 -0.4 -1.4 | 0.7 2.0 0.9 -0.5 | 1.2 -1.3 2.1 0.2
BBS matrix row:   0.8 1.5 | 1.0 -1.4 | 2.0 0.9 | -1.3 2.1
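The pruning steps on this slide can be sketched as follows, using the 16-value example row from the slide. This is an illustrative version written for this transcript, not the authors' code; splitting the row into 4 banks at 50% sparsity is an assumption that happens to reproduce the slide's BBS row.

```python
def bank_balanced_prune(row, num_banks, sparsity):
    """Prune each bank of a row independently so all banks keep the same count."""
    bank_size = len(row) // num_banks
    keep = bank_size - int(bank_size * sparsity)
    pruned = []
    for b in range(num_banks):
        bank = row[b * bank_size:(b + 1) * bank_size]
        # Indices of the `keep` largest-magnitude weights in this bank.
        top = sorted(range(bank_size), key=lambda i: abs(bank[i]), reverse=True)[:keep]
        kept = set(top)
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(bank))
    return pruned


row = [0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4,
       0.7, 2.0, 0.9, -0.5, 1.2, -1.3, 2.1, 0.2]
pruned = bank_balanced_prune(row, num_banks=4, sparsity=0.5)
survivors = [w for w in pruned if w != 0.0]
print(survivors)  # [0.8, 1.5, 1.0, -1.4, 2.0, 0.9, -1.3, 2.1]
```

Note that 1.2 in the last bank is dropped even though it is larger than 0.8 and 0.9 kept elsewhere: pruning is fine-grained within a bank but balanced across banks.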


Weight map visualization

[Figure: weight maps, with Bank 0 and Bank 1 marked]


Model Accuracy

[Figure: accuracy results for speech recognition on the TIMIT dataset and for a language model on the PTB dataset; both plots are annotated "Very close"]


Sparse MV Multiplication (SpMxV)

BBS matrix row (Bank 0 to Bank 3): A 0 B | C D 0 | 0 E F | G 0 H
Dense vector (Bank 0 to Bank 3):   V0 V1 V2 | V3 V4 V5 | V6 V7 V8 | V9 V10 V11

• Step 1: the first non-zero of each bank (A, C, E, G) is multiplied with V0, V3, V7, V9 and accumulated.
  Partial dot product: S1 = V0A + V3C + V7E + V9G
• Step 2: the second non-zero of each bank (B, D, F, H) is multiplied with V2, V4, V8, V11 and accumulated.
  Partial dot product: S2 = V2B + V4D + V8F + V11H
• Result: S1 + S2
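The two-step example on this slide can be reproduced in software. The sketch below is an illustration under stated assumptions: the (values, per-bank offsets) encoding and the function name are chosen here and may differ from the accelerator's actual format, and the letters A to H and the vector entries are replaced by arbitrary numbers.

```python
def bbs_row_dot(values, indices, vector, bank_size):
    """Dot product of one BBS-encoded row with a dense vector.

    values[b][k] is the k-th non-zero of bank b and indices[b][k] is its
    offset inside that bank. Every bank holds the same number of non-zeros.
    """
    num_banks = len(values)
    nonzeros_per_bank = len(values[0])
    total = 0.0
    for k in range(nonzeros_per_bank):
        # One step: every bank contributes one product (parallel in hardware).
        partial = sum(values[b][k] * vector[b * bank_size + indices[b][k]]
                      for b in range(num_banks))
        total += partial
    return total


# Row "A 0 B | C D 0 | 0 E F | G 0 H" with numeric stand-ins for A..H.
A, B, C, D, E, F, G, H = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
values = [[A, B], [C, D], [E, F], [G, H]]
indices = [[0, 2], [0, 1], [1, 2], [0, 2]]
vector = [float(i) for i in range(12)]  # V0..V11, here V_i = i

result = bbs_row_dot(values, indices, vector, bank_size=3)
dense_row = [A, 0, B, C, D, 0, 0, E, F, G, 0, H]
assert result == sum(w * v for w, v in zip(dense_row, vector))
print(result)  # 263.0
```

Because every bank has the same number of non-zeros, each step reads exactly one vector element per bank, so the banks never conflict and no lane sits idle.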


Accelerator Overview

[Figure: FPGA accelerator block diagram. Labels: SpMxV PE (multipliers and adders); EWOP; ACT; controller; instruction buffer; DMA; private vector buffer; output; vector memory; matrix memory (indices, values); DRAM controller; PCIe controller; off-chip DRAM; host server]


Hardware Efficiency

[Figure: hardware efficiency comparison, annotated ~34x and ~7x]



• Programming FPGA is hard
  • Programming network functions with ClickNP
• FPGA needs to communicate and work with other devices
  • Facilitate communication with Direct Universal Access
• FPGA needs to provide value for data center workloads
  • KV-Direct is a fast key-value store
• Algorithms need to adapt for FPGA
  • Bank-Balanced Sparsity accelerates DNN inference


Conclusion

• FPGA is becoming a key resource in the cloud
• It is not easy to leverage FPGA for data center workloads
• We did several pieces of work to
  • Reduce development cost
  • Simplify communication
  • Find valuable applications to accelerate
  • Adapt algorithms to fit FPGA
• Many challenges remain to be addressed
• The trend is clear: FPGA is a first-class citizen in the cloud, and future applications need to make good use of it


Thanks!
