Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network Jialiang Zhang and Jing Li

Outline

• Background
• Motivation
• Balance analysis model
• Proposed design
• Performance
• Conclusion

Convolutional Neural Network

[Figure: Macro architecture of VGG]

Image from: https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/l

OpenCL Framework

[Figure: OpenCL framework for FPGA, in three parts labeled (i) to (iii): the application source code (OpenCL kernel, plus host code on the OpenCL C host API); the offline FPGA OpenCL compiler (Clang front end, IR analysis and optimization, Verilog back end, basic block library); and the FPGA (infrastructure, kernel, FPGA runtime, FPGA driver).]

• OpenCL provides a good FPGA abstraction

• Unlike OpenCL on a GPU or CPU, OpenCL on an FPGA describes both hardware and software

• Use #pragma to guide hardware generation:
  – loop unrolling; SIMD factor; compute unit replication

OpenCL FPGA Framework

[Figure: OpenCL FPGA framework at three levels. (a) Top level: the host connects through a PCIe controller and SoC bus to the infrastructure region (dispatcher, DDR controllers to global memory); the kernel region holds compute units (CUs), each with local memory. (b) Compute unit level: a computation subsystem and a local memory subsystem, with control flow and data flow marked. (c) Processing element level: PEs paired with BRAM blocks. Annotations: 2x2 PE replication (controlled by user), 4x CU replication (controlled by user), 16x on-chip memory replication (compiler inferred).]

Related works

• Generic CNN accelerators: [1]-[3]
• Balance computation and external memory access: [4]-[7]
• Hardware abstraction: [7][8]
• Contribution:
  – Identify that the performance bottleneck in large-scale FPGA CNN accelerators is on-chip memory bandwidth
  – A CNN accelerator that achieves an optimized balance among computation, on-chip memory, and external memory access

[1] Farabet, et al. Hardware accelerated convolutional neural networks for synthetic vision systems. ISCAS 2010
[2] M. Peemen, et al. Memory-centric accelerator design for convolutional neural networks. ICCD 2013
[3] V. Gokhale, et al. A 240 G-ops/s mobile coprocessor for deep neural networks. CVPR Workshops, 2014
[4] C. Zhang, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. ISFPGA 2015
[5] N. Suda, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. ISFPGA 2016
[6] J. Qiu, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. ISFPGA 2016
[7] C. Zhang, et al. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. ICCAD 2016
[8] H. Sharma, et al. From High-Level Deep Neural Models to FPGAs. MICRO 2016

Motivation

• New FPGA devices have more and faster DSP resources

• We should use all DSP resources every cycle to obtain the GFLOPS number in the datasheet

• The existing designs cannot utilize the DSP resources well

Balance Analysis Model

• Balance is the relationship between storage and computation resources
• Assumptions:
  – Only one bottleneck in the system
  – Bandwidth bounded
  – Computation and memory access overlap perfectly

 | Machine balance (hardware determined) | Code balance (algorithm determined)
On-chip (refers to on-chip memory bandwidth) | $B_m^{On}$ | $B_C^{On}$
Off-chip (refers to off-chip memory bandwidth) | $B_m^{Off}$ | $B_C^{Off}$

Machine Balance

[Figure: two panels. On-chip: PEs (computational resources) fed by buffers (on-chip memory). Off-chip: PEs (computational resources) fed by DDR (off-chip memory).]

• On-chip memory: $N_{BRAM} \times WIDTH_{BRAM} \times f_{BRAM}$
• Off-chip memory: $N_{DDR} \times BW_{DDR}$
• Computational resources (both panels): $N_{DSP} \times OPS/DSP \times f_{DSP}$
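The three expressions above can be combined into a small calculation. The following is a minimal Python sketch, assuming that machine balance is the ratio of memory bandwidth to computational throughput (the slides give the terms but not the units). The function names and the example device numbers are hypothetical and are not meant to reproduce any value in the slides.

```python
def compute_throughput(n_dsp, ops_per_dsp, f_dsp_hz):
    """Peak operations per second: N_DSP * OPS/DSP * f_DSP."""
    return n_dsp * ops_per_dsp * f_dsp_hz


def on_chip_bandwidth(n_bram, width_bram_words, f_bram_hz):
    """On-chip words per second: N_BRAM * WIDTH_BRAM * f_BRAM."""
    return n_bram * width_bram_words * f_bram_hz


def off_chip_bandwidth(n_ddr, bw_ddr_words_per_s):
    """Off-chip words per second: N_DDR * BW_DDR."""
    return n_ddr * bw_ddr_words_per_s


def machine_balance(bandwidth_words_per_s, ops_per_s):
    """Assumed definition: words of memory traffic available per operation."""
    return bandwidth_words_per_s / ops_per_s


# Hypothetical device, for illustration only (not from a datasheet).
peak = compute_throughput(n_dsp=1000, ops_per_dsp=2, f_dsp_hz=400e6)
b_m_on = machine_balance(on_chip_bandwidth(2000, 1, 400e6), peak)
b_m_off = machine_balance(off_chip_bandwidth(2, 4e9), peak)
print(b_m_on, b_m_off)
```

Expressing both bandwidths in words per second keeps the two balances directly comparable with a code balance given in words per operation.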

Code Balance

• On-chip code balance is determined by the input and output data size of instructions

• For example, a MAC operation has a $B_C^{On}$ of 1

• Off-chip code balance is determined by the input and output data size of the whole program

• Assume unlimited on-chip memory size

Break the Memory Wall

To achieve $P_{max}$, we need to satisfy $B_m^{Off} > B_C^{Off}$ and $B_m^{On} > B_C^{On}$
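The two conditions can be checked mechanically. The sketch below is one possible reading, assuming the usual balance-model rule that a memory level sustains a fraction min(1, B_m / B_C) of peak performance when computation and memory access overlap perfectly. That rule and all function names are assumptions for illustration, not taken from the slides.

```python
def attainable_fraction(b_m, b_c):
    """Fraction of peak performance one memory level can sustain (assumed model)."""
    return min(1.0, b_m / b_c)


def reaches_peak(b_m_on, b_c_on, b_m_off, b_c_off):
    """True when machine balance exceeds code balance both on-chip and off-chip."""
    return b_m_on > b_c_on and b_m_off > b_c_off


def bottleneck(b_m_on, b_c_on, b_m_off, b_c_off):
    """Name the limiting resource under the single-bottleneck assumption."""
    if reaches_peak(b_m_on, b_c_on, b_m_off, b_c_off):
        return "compute"
    on = attainable_fraction(b_m_on, b_c_on)
    off = attainable_fraction(b_m_off, b_c_off)
    return "on-chip" if on <= off else "off-chip"


# Hypothetical balances: on-chip memory can feed only half of the MACs.
print(bottleneck(b_m_on=0.5, b_c_on=1.0, b_m_off=0.2, b_c_off=0.01))
```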

Off-chip Balance

Layer | $B_C^{Off}$
CONV1 | 0.0021
CONV2 | 0.0010
CONV3 | 0.00062
CONV4 | 0.00087
CONV5 | 0.00277
FC | 0.5

Technology node | $B_m^{Off}$
28nm | 0.168
20nm | 0.217
14/16nm | 0.177
14/16nm w/ HBM | 0.177

• $B_C^{Off} < B_m^{Off}$ for convolution layers

• $B_C^{Off} > B_m^{Off}$ for the FC layer, but it contributes only a small portion of the computation
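The comparison between the two off-chip tables can be reproduced directly from their values. This is a small Python sketch; the dictionary and function names are illustrative, and the numbers are copied from the tables above.

```python
# Off-chip code balance per layer, as listed on the slide.
B_C_OFF = {
    "CONV1": 0.0021, "CONV2": 0.0010, "CONV3": 0.00062,
    "CONV4": 0.00087, "CONV5": 0.00277, "FC": 0.5,
}

# Off-chip machine balance per technology node, as listed on the slide.
B_M_OFF = {
    "28nm": 0.168, "20nm": 0.217,
    "14/16nm": 0.177, "14/16nm w/ HBM": 0.177,
}


def off_chip_bound_layers(b_m_off):
    """Layers whose code balance exceeds the given machine balance."""
    return [layer for layer, b_c in B_C_OFF.items() if b_c > b_m_off]


for node, b_m in B_M_OFF.items():
    print(node, off_chip_bound_layers(b_m))
```

For every listed node only the FC layer exceeds the off-chip machine balance, which matches the two bullets.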

On-chip Balance

Technology node | $B_m^{On}$
28nm | 0.0033
20nm | 0.0026
14/16nm | 0.0010

Layer | $B_C^{On}$
CONV1 | 1
CONV2 | 1
CONV3 | 1
CONV4 | 1
CONV5 | 1
FC | 1

• $B_C^{On} > B_m^{On}$ for all layers

• On-chip memory bandwidth becomes the bottleneck

• We need to increase $B_m^{On}$

Data Reuse Requirement

[Bar chart: data reuse requirement by technology node. 28nm: 303; 20nm: 384; 14nm w/ HMC: 72; 14nm w/ DDR: 1000. Vertical axis from 0 to 1200.]

• The balance analysis assumes unlimited on-chip memory size (load data once)

• Under limited on-chip memory capacity, we need to load data more than once

• To satisfy the external memory bandwidth requirement, we need to reuse data at least a minimum number of times (the data reuse requirement)
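An observation on the chart above, offered as a reading of the numbers rather than something the slides state: for the three nodes that also appear in the on-chip balance table, the charted value equals the on-chip code balance divided by the on-chip machine balance, rounded down. The Python sketch below checks that arithmetic. It assumes that the chart's "14nm w/ DDR" bar corresponds to the table's "14/16nm" row; the names are illustrative.

```python
from decimal import Decimal

# On-chip machine balance from the on-chip balance table.
B_M_ON = {
    "28nm": Decimal("0.0033"),
    "20nm": Decimal("0.0026"),
    "14/16nm": Decimal("0.0010"),
}
B_C_ON = Decimal("1")  # on-chip code balance, 1 for every layer

# Values read from the data reuse chart (14nm w/ DDR assumed to be 14/16nm).
CHART = {"28nm": 303, "20nm": 384, "14/16nm": 1000}


def balance_ratio(b_c, b_m):
    """Code balance over machine balance, rounded down to an integer."""
    return int(b_c / b_m)


for node, b_m in B_M_ON.items():
    print(node, balance_ratio(B_C_ON, b_m), CHART[node])
```

The 14nm w/ HMC bar (72) has no matching row in the on-chip table, so it is left out of the check.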

Optimization Direction

• Increase $B_m^{On}$ to utilize all DSP resources

• Satisfy the data reuse requirement with limited on-chip memory size


Matrix Multiplication Kernel

[Figure: the OpenCL FPGA framework diagram (top level, compute unit level, processing element level, with 2x2 PE replication, 4x CU replication, and 16x on-chip memory replication), shown next to a PE and BRAM array annotated "Enable Multicasting".]

BRAM Usage Reduction

By multicasting, we can increase the on-chip machine balance by $N_{DSP}$

Experiment on Arria 10 GX1150 FPGA (with 1518 DSPs and 2713 M20K memory blocks):

[Figure not recovered in the transcript]
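Why multicasting raises on-chip machine balance can be illustrated by counting BRAM reads. The following Python sketch assumes a rows x cols array of PEs that each perform one MAC per cycle for a matrix multiplication. Without multicasting every PE reads its two operands from its own BRAM copies; with multicasting each operand of the first matrix is broadcast along a row and each operand of the second down a column. This is a simplified counting model and not the authors' exact design.

```python
def bram_reads_per_cycle(rows, cols, multicast):
    """Operand reads per cycle for a rows x cols PE array (simplified model)."""
    if multicast:
        # One read per row operand plus one per column operand, shared by PEs.
        return rows + cols
    # Every PE fetches both of its operands privately.
    return 2 * rows * cols


def macs_per_read(rows, cols, multicast):
    """MAC operations served by each BRAM read."""
    return rows * cols / bram_reads_per_cycle(rows, cols, multicast)


for size in (2, 8):
    print(size,
          macs_per_read(size, size, multicast=False),
          macs_per_read(size, size, multicast=True))
```

In this model the work done per BRAM read grows with the array size only when multicasting is enabled, which is the effect the slide attributes to multicasting.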

Optimization Direction

• Increase $B_m^{On}$ to utilize all DSP resources

• Satisfy the data reuse requirement with limited on-chip memory size

Kernel Design

[Figure: kernel design. A 2D dispatcher feeds four blocks of compute units; the remaining blocks are a buffer, pointer parameters, a shift register, a crossbar, and two external memories.]

CU Design

[Figure (a): CU design. An array of MAC units fed by BRAMs (32 bit x 512 each). INPUT A and INPUT B enter on links labeled 512x and are distributed to the MACs on links labeled 32x. MAC results pass through flip-flops (FF) and a MUX driven by the output address to an output labeled 512x. The array is annotated with 8x duplication along both axes.]

Work-item Scheduling


2D Scheduling

[Figure: a grid of work-items indexed (0,0) through (m,n), with axes m and n, shown twice: (a) one-dimensional dispatching and (b) two-dimensional dispatching. Panel (b) marks two block dimensions labeled $x$; their subscripts are garbled in the transcript.]
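The benefit of two-dimensional dispatching can be illustrated by counting input data per group of work-items. The Python sketch below assumes a matrix-multiplication mapping in which work-item (i, j) needs row i of one input and column j of the other, and compares 16 work-items taken in row-major order with a 4 x 4 block. The mapping, the group sizes, and the function names are assumptions for illustration; the slides do not spell out the dispatch rule.

```python
def one_d_group(start, count, n_cols):
    """Work-items taken in row-major order from linear index `start`."""
    return [divmod(k, n_cols) for k in range(start, start + count)]


def two_d_group(row0, col0, height, width):
    """Work-items forming a height x width block of the index grid."""
    return [(row0 + r, col0 + c) for r in range(height) for c in range(width)]


def vectors_needed(group):
    """Distinct input rows plus distinct input columns the group touches."""
    rows = {i for i, _ in group}
    cols = {j for _, j in group}
    return len(rows) + len(cols)


n_cols = 64
one_d = vectors_needed(one_d_group(0, 16, n_cols))
two_d = vectors_needed(two_d_group(0, 0, 4, 4))
print(one_d, two_d)
```

Both groups contain 16 work-items, but the square block touches fewer distinct input vectors, so each loaded vector is reused more often.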

Minimizing External Memory Bandwidth

• Objective:
  – Find the optimized scheduling policy
  – To minimize external memory bandwidth

• Constraint:
  – On-chip memory size

• Based on:
  – CNN layer parameters
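The objective, constraint, and inputs listed above describe a small constrained search. The sketch below shows one way to pose it in Python, assuming the layer is mapped to an M x K by K x N matrix multiplication, the schedule is a tm x tn output tile, the on-chip buffer must hold (tm + tn) * K words, and external traffic is the tile loads plus the output writes. This cost model and all names are hypothetical; the authors' actual formulation is not given in the slides.

```python
from math import ceil


def external_traffic(M, N, K, tm, tn):
    """Words moved to and from external memory for a tm x tn tiling."""
    tiles = ceil(M / tm) * ceil(N / tn)
    loads = tiles * (tm + tn) * K   # input slabs loaded for every output tile
    stores = M * N                  # each output element written once
    return loads + stores


def best_tile(M, N, K, capacity_words):
    """Exhaustively search tile sizes that fit on chip; return (traffic, tm, tn)."""
    best = None
    for tm in range(1, M + 1):
        for tn in range(1, N + 1):
            if (tm + tn) * K > capacity_words:
                continue  # violates the on-chip memory constraint
            cost = external_traffic(M, N, K, tm, tn)
            if best is None or cost < best[0]:
                best = (cost, tm, tn)
    return best


# Hypothetical layer and buffer sizes, for illustration only.
print(best_tile(M=64, N=64, K=32, capacity_words=1024))
```

Under this cost model the search favors square tiles, which is consistent with dispatching work-items in two dimensions.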

Optimization Result

The optimization effectively reduces the off-chip memory bandwidth requirement

Experiment Setup

• FPGA: Arria 10 GX1150
• 1518 DSPs
• 2713 M20K memory blocks
• 1 GB DDR4 with 17 GB/s bandwidth

• Kernel implemented as an OpenCL IP library package using Verilog

[Image: Arria 10 GX FPGA Development Kit]

Implementation Result

Resource | Total | Ours | Percentage (%)
DSP | 1518 | 1320 | 86
BRAM | 2713 | 1250 | 46
Logic | 1506k | 437k | 43

Frequency: 370 MHz

VGG Performance

Layers | Number of Ops | Duration (ms) | Performance (GOP/s)
Conv | 30.69G | 11.9 | 2568
FC | 0.073G | 5.5 | 13
Total | 30.76G | 17.4 | 1790

Overall VGG classification throughput: 57 images/s

Performance Comparison

 | Suda et al. [1] | Qiu et al. [2] | Ours
Platform | Stratix V GSD8 | Zynq XC7Z045 | Arria 10 GX1150
Performance (GOP/s) | 117.8 | 136.9 | 1790
Performance density (OPs/DSP/cycle) | 0.36 | 1.17 | 3.06
Power efficiency (GOP/J) | 1.84 | 14.22 | 47.88

[1] N. Suda, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. ISFPGA 2016
[2] J. Qiu, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. ISFPGA 2016
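The comparison can be summarized as ratios against each prior design. The Python sketch below only divides the values from the comparison table; the dictionary layout and function name are illustrative. The three designs run on different devices, so the ratios are not a like-for-like measure.

```python
# Values copied from the performance comparison table.
RESULTS = {
    "Suda et al.": {"gops": 117.8, "density": 0.36, "gop_per_j": 1.84},
    "Qiu et al.": {"gops": 136.9, "density": 1.17, "gop_per_j": 14.22},
    "Ours": {"gops": 1790.0, "density": 3.06, "gop_per_j": 47.88},
}


def ratio_vs(baseline, metric):
    """How many times larger the 'Ours' value is than the baseline's."""
    return RESULTS["Ours"][metric] / RESULTS[baseline][metric]


for name in ("Suda et al.", "Qiu et al."):
    ratios = {m: round(ratio_vs(name, m), 1)
              for m in ("gops", "density", "gop_per_j")}
    print(name, ratios)
```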

Summary

• Motivation
• Balance analysis model
• Our work:
  – Multicasting among PEs
  – 2D scheduling
• Performance comparison

Thanks!

Q&A
