© Copyright 2019 Xilinx
FPGA as a First-Class Citizen in Cloud Computing
Lintao Zhang, Microsoft Research Asia
Standard Disclaimer
• This presentation contains work of many people, including many of my colleagues, collaborators, and student interns.
• All mistakes are mine.
• This presentation contains only my personal opinions; it does not reflect the position of Microsoft.
[Figure: energy efficiency (GigaOps/Watt, log scale) versus estimated year of deployment (2014 to 2020)]
Data points labeled in the figure:
Haswell (fp32), 8.6
Broadwell (fp32), 9.1
TPU1 (int8), 1227
TPU2 (fp), 281
K80 (fp32), 29
P100 (fp16), 71
P40 (int8), 188
V100 (fp 4x4x4), 400
A10 (int8), 69
A10 (int4), 200
S10 (int8), 240
S10 (int6), 440
S10 (int4), 640
Note: assumes power consumption of 160W/TPU2 chip (not confirmed)
CPU: End of frequency & Dennard scaling
The rise of domain-specific computing
Credit: Doug Burger
Heterogeneous Architectures
                 CPU        GPU        FPGA   ASIC
Throughput       Low        High       High   Very high
Latency          High       Very high  Low    Low
Power            High       High       Low    Very low
Price at scale   High       Very high  Low    Low
Flexibility      Very high  High       High   Low
Project Catapult: FPGA in Azure
Deployment of FPGAs in Azure
[Diagram labels: CPU/GPU/TPU compute + NVMe SSD storage; two CPUs and QPI; FPGA; 40Gb/s; 40Gb/s ToR; hardware acceleration plane]
Workloads shown on the hardware acceleration plane: Search, ML, Network, DB, SmartNIC
FPGA in Production: SmartNICs in Azure
High Performance FPGA-powered SDN
Bandwidth: highest bandwidth of any cloud so far; standard compute VMs get up to 32 Gbps
Latency: consistently low-latency network, 5x+ latency improvement, sub-15 us within tenants
Enables workloads requiring native performance to run in cloud VMs
FPGA in Production: Project Brainwave
A Scalable FPGA-powered DNN Serving Platform
Performance: excellent inference performance at low batch size, >10x lower latency than CPUs and GPUs
Flexibility: supports CNN, LSTM, MLP, RL, and decision trees; exploits sparsity and compression
Scale: deployed in Azure and serving production workloads
Challenges of Using FPGA in Data Centers
• Programming FPGA is hard
  • Software developers are not familiar with HDL (RTL) programming
• FPGA needs to communicate and work with other devices
  • No generic socket programming interface is available for hardware
• FPGA needs to provide value for data center workloads
  • FPGA is not traditionally leveraged by applications such as storage, database, stream processing, graph ……
• Algorithms need to adapt for FPGA
  • Need to leverage fine-grained parallelism and contend with bandwidth, capacity, and complexity limits
Overview of the Talk
• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High-Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion
Overview of the Talk
• Introduction
• Simplifying FPGA Application Development
  ClickNP: Highly Flexible and High-Performance Network Processing with Reconfigurable Hardware (SIGCOMM'16)
• Simplifying FPGA Communication
• A High-Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion
Challenge: FPGA Programming
Writing and debugging HDL code is hard
88'h68656C6C6F20776F726C64
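The 88-bit literal on this slide is a good illustration of why HDL is opaque to software developers. A minimal Python sketch, assuming the literal follows the usual Verilog sized-hex form (width, then 'h, then hex digits), decodes it as ASCII; the variable names are illustrative only.

```python
# Decode the 88-bit Verilog hex literal from the slide as ASCII text.
literal = "88'h68656C6C6F20776F726C64"

# Split "<width>'h<hex digits>" into its two parts.
width_text, hex_digits = literal.split("'h")
width_bits = int(width_text)
payload = bytes.fromhex(hex_digits)

# Sanity check: the declared bit width matches the payload size.
assert width_bits == len(payload) * 8

decoded = payload.decode("ascii")
print(decoded)  # hello world
```

In other words, the slide's example is the hardware spelling of a "hello world" string: 11 characters packed into an 88-bit constant.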
ClickNP: Making FPGA accessible to software developers
• Flexible: fully programmable using a high-level language
• Modularized: Click abstractions familiar to software developers; easy code reuse
• High performance: high throughput; microsecond-scale latency
• Joint CPU/FPGA packet processing: FPGA is no panacea; fine-grained processing separation
ClickNP Programming Model
Stream processing model
[Diagram: elements A, B, C]
Element: single-threaded core
[Diagram: an element drawn as a single-threaded core, with parts annotated (I/O), (reg/mem), (main thread), (interrupt), and (ISR)]
Developed in a C-like language (OpenCL)
Write once, run on both CPU and FPGA
Can embed native Verilog code in a ClickNP element
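The stream processing model can be pictured as elements with private state that pass items to each other over FIFO channels. The following is only a software analogy in Python, not ClickNP syntax: the `Element` class, `run_pipeline`, and the three handlers are hypothetical names, and the stages run one after another here, whereas elements on an FPGA would run concurrently.

```python
from collections import deque


class Element:
    """A single-threaded unit with private state and one handler."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler   # called once per input item
        self.state = {}          # private to this element (like reg/mem)
        self.out = deque()       # output channel (FIFO)

    def push(self, item):
        result = self.handler(self.state, item)
        if result is not None:   # None means the item was dropped
            self.out.append(result)


def run_pipeline(elements, inputs):
    """Feed inputs through the elements in order, draining each channel."""
    channel = deque(inputs)
    for element in elements:
        while channel:
            element.push(channel.popleft())
        channel = element.out
    return list(channel)


# Hypothetical handlers for a three-stage pipeline A -> B -> C.
def count_packets(state, pkt):
    state["seen"] = state.get("seen", 0) + 1
    return pkt


def drop_short(state, pkt):
    return pkt if len(pkt) >= 4 else None


def to_upper(state, pkt):
    return pkt.upper()


a = Element("A", count_packets)
b = Element("B", drop_short)
c = Element("C", to_upper)
result = run_pipeline([a, b, c], ["ping", "ab", "hello"])
print(result)           # ['PING', 'HELLO']
print(a.state["seen"])  # 3
```

Because each element touches only its own state and its channels, the same description can in principle be mapped either to a CPU thread or to a hardware block, which is the point of the "write once" claim on the slide.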
Workflow and Architecture
[Diagram labels]
FPGA: ClickNP elements, Catapult shell
Host: ClickNP host process (Mgr thrd, Worker thrd), ClickNP host mgr, ClickNP library, Catapult PCIe Driver
Between host and FPGA: PCIe I/O channel
Build flow: ClickNP script, ClickNP compiler, vendor HLS, vendor libs, vendor specific runtime, C compiler
Cross-platform toolchain: Altera OpenCL / Vivado HLS; Visual Studio / GCC
Development Efficiency & Performance
Network Function   Lines of Code  Resource Util.  Throughput (vs. CPU)  Avg Latency (vs. CPU)
Packet generator   665            16%             18x                   25x
Packet capture     250            8%              17x                   18x
Firewall           538            32%             22x                   40x
IPSec gateway      695            35%             60x                   3x
Load balancer      860            36%             55x                   12x
Packet scheduler   584            11%             32x                   10x
[Figure: IPSec Gateway performance comparisons]
Overview of the Talk
• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
  Direct Universal Access: Making Data Center Resources Available to FPGA (NSDI'19)
• A High Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion
Current FPGA Communication Architecture
[Diagram labels]
Layers: Application Layer, Transport Layer, Data Link Layer, Physical Layer
FPGA Chip: Application, Application; DDR Stack, LTL (with DCQCN), NVMe, Host DMA; DDR4 IP, PCIe Gen3 IP, QSFP IP; DMA
FPGA Board: DDR4, PCIe Gen3, QSFP
Current FPGA Communication Architecture
[Same diagram, with the layer annotations replaced by: MAC layer, Physical layer, and "???"]
Image from Lixia Zhang et al., CCR 2014
Direct Universal Access
[Diagram labels: Server, Server; CPU; FPGA 1, FPGA 2, FPGA 3; App 1, App 2; per-FPGA stacks: LTL, DDR access, FPGA Connect, Host DMA; interfaces: QSFP, DDR, PCIe Gen3; a DUA block on each FPGA; datacenter networking fabric; intra-server networking fabric; paths numbered ① to ④]
DUA Overview
[Same diagram as the previous slide, with generic FPGA and App labels]
DUA is an "IP layer"
An abstract overlay network
Leverage all existing h/w stacks
Hierarchical addressing & routing
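To make "hierarchical addressing & routing" concrete, here is a minimal Python sketch. The three-level address (server, device, resource) and the `next_fabric` function are assumptions made for illustration; the slides do not give DUA's actual address format. The only things taken from the slide are the two fabric names in the diagram.

```python
from typing import NamedTuple


class Address(NamedTuple):
    """Hypothetical three-level address; not DUA's real format."""
    server: int    # which server in the data center
    device: int    # which device (FPGA, CPU, ...) inside that server
    resource: int  # which resource or endpoint on that device


def next_fabric(src: Address, dst: Address) -> str:
    """Pick the fabric by comparing address prefixes, most significant first."""
    if src.server != dst.server:
        return "datacenter networking fabric"
    if src.device != dst.device:
        return "intra-server networking fabric"
    return "local"


app = Address(server=1, device=0, resource=7)
print(next_fabric(app, Address(2, 0, 3)))  # datacenter networking fabric
print(next_fabric(app, Address(1, 1, 3)))  # intra-server networking fabric
print(next_fabric(app, Address(1, 0, 3)))  # local
```

The design idea sketched here is the same one IP uses: a router only needs to inspect the prefix that differs, so the overlay can forward over whichever existing stack reaches that level of the hierarchy.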
DUA Overview
[Same diagram]
DUA is an "IP layer"
Efficient Routing
Direct resource access by FPGA, totally bypassing the CPU
DUA Overview
[Same diagram]
DUA is an "IP layer"
Efficient Routing
Compatible BSD-socket Interface
for both applications and communication stacks
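For readers who have not used it, the BSD-socket model that the slide refers to is the familiar connect, send, receive pattern. The sketch below uses Python's standard `socket` module purely to show that pattern in software; it is not DUA's hardware interface, which the slides do not spell out, and the message contents are made up.

```python
import socket

# socketpair() returns two already-connected endpoints, so the example
# runs without any network configuration.
left, right = socket.socketpair()
try:
    left.sendall(b"read block 42")    # requester sends a message
    request = right.recv(1024)        # responder receives it
    right.sendall(b"ack:" + request)  # responder replies
    reply = left.recv(1024)           # requester reads the reply
finally:
    left.close()
    right.close()

print(reply)
```

The appeal of offering the same shape of interface in hardware is that an FPGA application would address a remote resource the way software addresses a remote host, without caring which underlying stack carries the bytes.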
DUA Overview
[Same diagram]
DUA is an "IP layer"
Efficient Routing
Compatible BSD-socket Interface
General Multiplexing
for both applications and communication stacks
DUA Overview
[Same diagram]
DUA is an "IP layer"
Efficient Routing
Compatible BSD-socket Interface
Unified Multiplexing
Security: protect against both inside and outside attacks
System Architecture
[Diagram labels: Server, Server; CPU, NIC, FPGA, App; datacenter networking fabric; intra-server networking fabric; DUA Underlay; QSFP, PCIe Gen3, DDR; DUA Overlay; CPU Control Agent; FPGA Control Agent; stacks: FPGA Host Stack, NVMe Stack, LTL, FPGA Connect, Host DMA, DDR Controller]
Highlighted: DUA data plane
System Architecture
[Same diagram]
Highlighted: DUA data plane, DUA control plane
Evaluation – Regex Matching
• Up to 10^5~10^7 times higher throughput than CPU, up to 10^5 times lower latency than CPU
• Up to 3 times the throughput and up to 55% latency reduction compared to using the CPU to move data between FPGAs
[Figure: throughput (GB/s) versus input string length (bytes, 0 to 18000); series: through FPGA Connect, through CPU, pure CPU]
[Figure: latency (us, log scale) versus input string length (bytes, 64 to 16384); series: through FPGA Connect, through CPU, pure CPU]
Overview of the Talk
• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High Performance FPGA Based Key Value Store
  KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC (SOSP'17)
• Sparsity to Facilitate DNN Inference on FPGA
• Conclusion
Key-Value Store in Data Centers
Web Cache
Shared Data Structure
Design Goals
Key-Value Store Architectures
Bottleneck: network stack in OS (~300 Kops per core)
e.g. DPDK, mtcp, libvma, two-sided RDMA
Bottlenecks: CPU random memory access and KV operation computation (~5 Mops per core)
Communication overhead: multiple round-trips per KV operation (fetch index, data)
Synchronization overhead: write operations
Offload KV processing from the CPU to a programmable NIC
Performance Challenges
[Diagram labels: ToR switch; 40 GbE; FPGA; PCIe Gen3 x16 DMA; Host DRAM (256 GB); On-board DRAM (4 GB)]
13 GB/s, 120 Mops. Header overhead and limited parallelism: be frugal on memory accesses
1 us delay, 120 Mops. Atomic operations have dependency: PCIe latency hiding
0.2 us delay, 100 Mops. Load dispatch
60 Mpps. Client-side batching; vector-type operations
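The latency-hiding point can be quantified with Little's law: operations in flight = throughput × latency. The sketch below applies it to the two delay/throughput pairs labeled on the slide. Treating each pair as the round-trip delay and sustained rate of one memory path is an assumption made here for illustration, and the function name is hypothetical.

```python
def ops_in_flight(throughput_mops: float, latency_ns: float) -> float:
    """Little's law: concurrency = throughput * latency.

    Mops * ns = 1e6 ops/s * 1e-9 s = 1e-3 ops, hence the division by 1000.
    """
    return throughput_mops * latency_ns / 1000


# Pair 1 from the slide: 1 us delay at 120 Mops.
slow_path = ops_in_flight(120, 1000)
# Pair 2 from the slide: 0.2 us delay at 100 Mops.
fast_path = ops_in_flight(100, 200)

print(slow_path)  # 120.0
print(fast_path)  # 20.0
```

Under these assumptions, on the order of a hundred operations must be outstanding at once to keep the 1 us path busy, which is why dependent atomic operations that serialize on that path are called out as a challenge.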
KV Processor Architecture
NIC
PCIe
DRAM
KV-Direct Performance
Scalability with Multiple NICs
1.22 billion KV op/s
357 watts power
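Dividing the two figures on this slide gives the power efficiency they imply. This is plain arithmetic on the slide's numbers, written in Python; the variable names are illustrative.

```python
kv_ops_per_second = 1.22e9  # from the slide
power_watts = 357           # from the slide

# Operations per second per watt, expressed in millions.
mops_per_watt = kv_ops_per_second / power_watts / 1e6
print(round(mops_per_watt, 2))  # 3.42
```

That is roughly 3.4 million key-value operations per second for every watt drawn.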
Overview of the Talk
• Introduction
• Simplifying FPGA Application Development
• Simplifying FPGA Communication
• A High Performance FPGA Based Key Value Store
• Sparsity to Facilitate DNN Inference on FPGA
  Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity (FPGA'19)
• Conclusion
Low Latency DNN Inference is Important
[Diagram labels: $x_t$, $y_t$, $y_{t-1}$, $c_{t-1}$]
Matrix-Vector Multiplication is the workhorse of DNN inference
Han, Song, et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS'15
Redundancy in neural networks
Weights Distribution
Weight Pruning
1. Train Connectivity
2. Prune Connections
3. Train Weights
For all $w_i$: if $|w_i| \le \text{Threshold}$, then $w_i \leftarrow 0$
Highly unstructured sparse pattern
Sparsity = 50%~90%
near-zeros
Difficult to accelerate
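The pruning rule above (zero every weight whose magnitude is at or below a threshold) can be sketched in a few lines of Python. The weight values and the threshold of 0.5 are made-up inputs for illustration; the retraining step from the slide is not shown.

```python
def prune_by_magnitude(weights, threshold):
    """Set every weight with |w| <= threshold to zero."""
    return [0.0 if abs(w) <= threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4]
pruned = prune_by_magnitude(weights, threshold=0.5)

print(pruned)            # [0.8, 0.0, 0.0, 1.5, 1.0, 0.0, 0.0, -1.4]
print(sparsity(pruned))  # 0.5
```

Note that the surviving weights land wherever the large values happen to be, which is the "highly unstructured sparse pattern" that makes this form of sparsity hard to accelerate.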
Accuracy and Speedup Tradeoff
Fine-grained (irregular)
Pros:
• High model accuracy
• High compression ratio
Cons:
• Irregular pattern
• Difficult to accelerate
Coarse-grained (regular)
Pros:
• Regular pattern
• Easy to accelerate
Cons:
• Low model accuracy
• Low compression ratio
Speedup and Accuracy Tradeoff
[Figure: model loss (the smaller the better) versus sparsity (0% to 100%); series: Original, Fine-grained, Block Size (4*4), Block Size (8*8), Block Size (16*16)]
Fine-grained: accurate but irregular
Coarse-grained: regular but not accurate
Bank Balanced Pruning
Dense matrix row (column indices 0 to 15), after bank split:
Bank 0: 0.8 -0.1 0.2 1.5
Bank 1: 1.0 0.3 -0.4 -1.4
Bank 2: 0.7 2.0 0.9 -0.5
Bank 3: 1.2 -1.3 2.1 0.2
BBS matrix row (surviving weights per bank):
Bank 0: 0.8 1.5
Bank 1: 1.0 -1.4
Bank 2: 2.0 0.9
Bank 3: -1.3 2.1
Traverse all rows
Fine-grained pruning inside each bank
Threshold percentage to obtain identical sparsity ratio among banks
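The example row on this slide can be reproduced with a short Python sketch of bank-balanced pruning: split the row into equal banks, then keep the same number of largest-magnitude weights in every bank. The function name and its parameters are illustrative, and keeping a fixed count per bank is used here as a simple stand-in for the slide's "threshold percentage".

```python
def bank_balanced_prune(row, num_banks, keep_per_bank):
    """Keep the keep_per_bank largest-magnitude weights in each bank."""
    if len(row) % num_banks != 0:
        raise ValueError("row length must be divisible by num_banks")
    bank_size = len(row) // num_banks
    pruned = []
    for b in range(num_banks):
        bank = row[b * bank_size:(b + 1) * bank_size]
        # Positions of the largest-magnitude weights within this bank.
        by_magnitude = sorted(range(bank_size),
                              key=lambda i: abs(bank[i]), reverse=True)
        keep = set(by_magnitude[:keep_per_bank])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(bank))
    return pruned


# The dense row from the slide: 16 weights, 4 banks, 50% sparsity.
row = [0.8, -0.1, 0.2, 1.5, 1.0, 0.3, -0.4, -1.4,
       0.7, 2.0, 0.9, -0.5, 1.2, -1.3, 2.1, 0.2]
pruned = bank_balanced_prune(row, num_banks=4, keep_per_bank=2)
survivors = [w for w in pruned if w != 0.0]

print(survivors)  # [0.8, 1.5, 1.0, -1.4, 2.0, 0.9, -1.3, 2.1]
```

The survivors match the BBS row on the slide. Because every bank ends up with the same number of non-zeros, hardware that reads one weight per bank per cycle stays evenly loaded.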
Weight map visualization
Bank 0 Bank 1
Model Accuracy
Speech recognition on TIMIT dataset: very close
Language model on PTB dataset: very close
Sparse MV Multiplication (SpMxV)
BBS matrix row, split into 4 banks:
Bank 0: A 0 B
Bank 1: C D 0
Bank 2: 0 E F
Bank 3: G 0 H
Dense vector, split into the same banks:
Bank 0: V0 V1 V2
Bank 1: V3 V4 V5
Bank 2: V6 V7 V8
Bank 3: V9 V10 V11
Step 1: multiply A, C, E, G by V0, V3, V7, V9 and accumulate. Partial dot product S1 = V0A+V3C+V7E+V9G
Step 2: multiply B, D, F, H by V2, V4, V8, V11 and accumulate. Partial dot product S2 = V2B+V4D+V8F+V11H
Result: S1+S2
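The two-step schedule on these slides can be checked with a small Python sketch. The numeric values substituted for A to H and V0 to V11 are made up, and the encoding (per-bank value and index lists) is an assumption chosen to mirror the "Values" and "Indices" labels that appear later in the accelerator diagram.

```python
def encode_bbs_row(row, num_banks):
    """Store each bank's non-zero values and their in-bank positions."""
    bank_size = len(row) // num_banks
    values, indices = [], []
    for b in range(num_banks):
        bank = row[b * bank_size:(b + 1) * bank_size]
        nonzero = [(w, i) for i, w in enumerate(bank) if w != 0]
        values.append([w for w, _ in nonzero])
        indices.append([i for _, i in nonzero])
    return values, indices, bank_size


def spmxv_row(values, indices, bank_size, vector):
    """Each step takes one non-zero per bank; banks are read in parallel
    in hardware, sequentially here."""
    steps = len(values[0])  # identical for every bank in a BBS row
    partials = []
    for s in range(steps):
        partial = 0
        for b in range(len(values)):
            partial += values[b][s] * vector[b * bank_size + indices[b][s]]
        partials.append(partial)
    return partials, sum(partials)


# Row "A 0 B | C D 0 | 0 E F | G 0 H" with A..H = 1..8 (made-up values).
row = [1, 0, 2, 3, 4, 0, 0, 5, 6, 7, 0, 8]
vector = list(range(1, 13))  # V0..V11 = 1..12 (made-up values)

values, indices, bank_size = encode_bbs_row(row, num_banks=4)
partials, total = spmxv_row(values, indices, bank_size, vector)

print(partials)  # [123, 176]  -> S1, S2
print(total)     # 299
```

The total equals the ordinary dense dot product of the row and the vector. Since each step reads exactly one vector element from each bank, the four reads never collide on the same memory bank.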
Accelerator Overview
[Diagram labels. Outside the FPGA: Host Server, Off-chip DRAM. On the FPGA: PCIe Cntlr, DRAM Cntlr, DMA, Controller, Instruction Buffer, Matrix Memory (Values, Indices), Vector Memory, SpMxV PE (multipliers "*", adders "+", Private Vector Buffer), EWOP, ACT, Output]
Hardware Efficiency
~34x ~7x
Challenges of Using FPGA in Data Centers
• Programming FPGA is hard
  • Programming network functions with ClickNP
• FPGA needs to communicate and work with other devices
  • Facilitate communication with Direct Universal Access
• FPGA needs to provide value for data center workloads
  • KV-Direct is a fast key-value store
• Algorithms need to adapt for FPGA
  • Bank-Balanced Sparsity accelerates DNN inference
Conclusion
• FPGA is becoming a key resource in the cloud
• It is not easy to leverage FPGA for data center workloads
• We did several pieces of work to
  • Reduce development cost
  • Simplify communication
  • Find valuable applications to accelerate
  • Adapt algorithms to fit FPGA
• Many challenges remain to be addressed
• The trend is clear: FPGA is a first-class citizen in the cloud, and future applications need to make good use of it
Thanks!