Multipliers for FPGA Machine Learning Applications
Philip Leong (梁恆惠) | Computer Engineering Laboratory, School of Electrical and Information Engineering,
The University of Sydney
Australia
Population: ~25M (2017); Europe: ~743M (2018); Shanghai: ~24M (2018)
Population density: 3.2 people per km² (Shanghai: 2,059)
Computer Engineering Laboratory
› Focuses on how to use parallelism to solve demanding problems
- Novel architectures, applications and design techniques using VLSI, FPGA and parallel computing technology
› Research
- Reconfigurable computing
- Machine learning
- Signal processing
› Collaborations
- Xilinx, Intel, Exablaze
- Defence and DSTG
- clustertech.com
Overview
› Multipliers (and adders) play a key role in the implementation of DNNs
› This talk
- Two-speed multiplier with different critical paths for zero and non-zero recodings
- PIR-DSP block to support a range of precisions
- AddNet, which uses k levels of shifted values as multipliers
- A fully pipelined DNN implementation with ternary coefficients
› These slides are available at https://phwl.github.io/talks
A Two Speed Multiplier
D. J. M. Moss, D. Boland, and P. H. W. Leong
Unsigned Multiplication

› Shift-and-Add Algorithm

Example: multiply 118d (multiplicand) by 99d (multiplier)

Step 1) Initialize
Step 2) Find the partial products
Step 3) Sum the shifted partial products

Decimal: 118 × 9 = 1062 (ones digit) and 1062 shifted (tens digit); 1062 + 10620 = 11682d

Two's-Complement Method
118d = 0111 0110b, 99d = 0110 0011b
Each 1-bit of the multiplier (bits 0, 1, 5 and 6 of 99d) contributes a shifted copy of 0111 0110b; each 0-bit contributes zero.
Sum of partial products: 0010 1101 1010 0010b = 11682d

Source: Shawn Nicholl, Xilinx
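The shift-and-add algorithm can be modelled in a few lines of Python (an illustrative software model, not the hardware): each set bit of the multiplier contributes a shifted copy of the multiplicand.

```python
def shift_add_multiply(a: int, b: int, width: int = 8) -> int:
    """Unsigned shift-and-add: sum a << i for every set bit i of multiplier b."""
    product = 0
    for i in range(width):
        if (b >> i) & 1:            # partial product: shifted multiplicand, else 0
            product += a << i
    return product

print(shift_add_multiply(118, 99))  # 11682
```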
Signed Multiplication
› How can we handle signed multiplication?
› Could:
- multiply absolute values
- separately calculate the sign
- negate if necessary
› But …
Signed Multiplication using Booth Recoding
› Booth Recoding
- Reduces the number of partial products by recoding the multiplier operand
- Works for signed numbers

Example: multiply -118 by -99

Recall, 99 = 0110 0011b, so -99 = 1001 1101b

Radix-2 Booth recoding table (An = current bit, An-1 = last bit shifted out):

An  An-1  Partial product
0   0     0
0   1     +B
1   0     -B
1   1     0

Source: Shawn Nicholl, Xilinx
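The recoding table can be sketched as a few lines of Python (an illustrative model, not the circuit): digit i of the recoded multiplier is bit i−1 minus bit i of the two's-complement operand, so runs of 1s collapse into a +1/−1 pair with zeros in between.

```python
def booth_radix2_recode(a: int, width: int = 8):
    """Radix-2 Booth recoding: digit i = bit[i-1] - bit[i] of two's-complement a."""
    u = a & ((1 << width) - 1)                 # two's-complement bit pattern
    bit = lambda i: 0 if i < 0 else (u >> i) & 1
    return [bit(i - 1) - bit(i) for i in range(width)]  # LSB-first, in {-1, 0, +1}

# -99 recodes to -2^7 + 2^5 - 2^2 + 2^1 - 2^0, as on the next slide:
print(booth_radix2_recode(-99))  # [-1, 1, -1, 0, 0, 1, 0, -1]
```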
Example of Booth Radix-2 Recoding
› -99 = (-2^7 + 2^5 - 2^2 + 2^1 - 2^0)

Multiply -118 by -99 using radix-2 Booth:

Step 1) Initialize: B = -118 = 1000 1010b, so -B = 118 = 0111 0110b; multiplier A = -99 = 1001 1101b
Step 2) Find the partial products: recoding A (LSB to MSB) gives digits -1, +1, -1, 0, 0, +1, 0, -1, selecting -B, +B, -B, 0, 0, +B, 0, -B at successive shifts (with sign extension)
Step 3) Sum the shifted partial products: 0010 1101 1010 0010b = 11682d

Source: Shawn Nicholl, Xilinx
Booth Radix-4 Multiplication
› Similar to radix-2, but examines two low-order bits at a time (instead of one)
› -99 = (-2·4^3 + 2·4^2 - 1·4^1 + 1·4^0), i.e. radix-4 digits -2, +2, -1, +1

Radix-4 Booth recoding table (Yi+1, Yi = low-order bits; Yi is the last bit shifted out):

Yi+2  Yi+1  Yi   ei
0     0     0    0
0     0     1    +B
0     1     0    +B
0     1     1    +2B
1     0     0    -2B
1     0     1    -B
1     1     0    -B
1     1     1    0

Recall, 99d = 0110 0011b, so -99d = 1001 1101b

Source: Shawn Nicholl, Xilinx
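The radix-4 recoding and the resulting multiply can be sketched in Python (a software model for checking the arithmetic, not the datapath): each digit covers two bits plus the last bit shifted out, giving half as many partial products.

```python
def booth_radix4_recode(a: int, width: int = 8):
    """Radix-4 Booth: digit i = bit[2i-1] + bit[2i] - 2*bit[2i+1], in {-2..+2}."""
    u = a & ((1 << width) - 1)
    bit = lambda i: 0 if i < 0 else (u >> i) & 1
    return [bit(2*i - 1) + bit(2*i) - 2*bit(2*i + 1) for i in range(width // 2)]

def booth_radix4_multiply(a: int, b: int, width: int = 8) -> int:
    """Sum the recoded partial products: each digit scales b at a 2-bit shift."""
    return sum(d * (b << (2 * i))
               for i, d in enumerate(booth_radix4_recode(a, width)))

print(booth_radix4_recode(-99))         # [1, -1, 2, -2]
print(booth_radix4_multiply(-99, -118)) # 11682
```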
Example of Booth Radix-4 Multiplication
Example: multiply -118d by -99d with radix-4 Booth

Step 1) Initialize: B = -118d = 1000 1010b; -B = 118d = 0111 0110b; 2B = -236d = 1 0001 0100b; -2B = 236d = 0 1110 1100b; multiplier A = -99d = 1001 1101b
Step 2) Find the partial products: recoding A gives radix-4 digits +1, -1, +2, -2 (LSB to MSB), selecting B, -B, 2B, -2B at successive 2-bit shifts (with sign extension)
Step 3) Sum the shifted partial products: 0010 1101 1010 0010b = 11682d

- Reduces the number of partial products by half!

Source: Shawn Nicholl, Xilinx
Booth Radix-4 Multiplier Implementation
Booth Radix-4 Multiplier Datapath
Two-Speed Multiplier
• Booth radix-4 datapath split into two sections, each with its own critical path
• Non-zero encodings take the longer add-path time; zero encodings take the shorter skip time 𝜏
• Naturally supports sparse problems
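A toy latency model makes the idea concrete (the cycle costs t_add and t_skip below are illustrative, not the paper's measured figures): total time depends on how many radix-4 digits of the multiplier are non-zero, so sparse operands finish sooner.

```python
def radix4_digits(a: int, width: int = 8):
    """Radix-4 Booth digits of a (LSB-first), each in {-2..+2}."""
    u = a & ((1 << width) - 1)
    bit = lambda i: 0 if i < 0 else (u >> i) & 1
    return [bit(2*i - 1) + bit(2*i) - 2*bit(2*i + 1) for i in range(width // 2)]

def two_speed_latency(a: int, width: int = 8, t_add: int = 3, t_skip: int = 1) -> int:
    """Non-zero digits pay the add path; zero digits are skipped quickly."""
    return sum(t_add if d else t_skip for d in radix4_digits(a, width))

# A dense multiplier value (85) costs more cycles than a sparse one (4):
print(two_speed_latency(85), two_speed_latency(4))  # 12 6
```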
Two-Speed Multiplier Execution
Results
Area * Time Improvement of TSM
Summary
› Variant of the serial-parallel modified radix-4 Booth multiplier
› Adds only the non-zero Booth encodings and skips over the zero operations
› Two sub-circuits with different critical paths are utilised so that throughput and latency are improved for a subset of multiplier values
› For bit widths of 32 and 64, our optimisations can result in a 1.42-3.36x improvement over the standard parallel Booth multiplier
› Future work: explore training NNs with weights chosen to minimise execution time on the TSM
PIR-DSP: An FPGA DSP block Architecture for Multi-Precision Deep Neural Networks
SeyedRamin Rasoulinezhad, Hao Zhou, Lingli Wang, and Philip H.W. Leong
Overview

› Introduction
› PIR-DSP Architecture
› Results
› Conclusion
Embedded Deep Neural Networks
› DNNs for embedded applications share two features to reduce computation and storage requirements:
- Low precision (from 1-16 bits)
- Depthwise separable convolutions

Figure: standard convolution vs depthwise convolution (DW) and pointwise convolution (PW)
Motivation (1): Computation and Storage for Embedded DNNs

Figure: distribution of the number of MACs and of parameters across layer types (standard, DW, PW, FC, other) for NASNet-A (4@1056), MobileNet-v2, ShuffleNet-v2 and SqueezeNet
Motivation (2)
Low-Precision Neural Networks

Faraone et al., "SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks", CVPR'18

ImageNet accuracy with binary and ternary weights and 8-bit activations
Aims
› Optimise FPGA DSP architecture to better support:
- Efficient implementation of embedded DNNs
- Wordlengths down to ternary and binary
› Talk will focus on convolutions
Existing DSPs
› Xilinx DSP48
- 27×18 multiplier, 48-bit ALU (Add/Sub/Logic), 27-bit pre-adder, wide 96-bit XOR, 48-bit comparator
- No support for low-precision computations
- No run-time configuration
- 1D arrangement inefficient for implementing 2D systolic arrays
› Intel (multiprecision)
- 27×27 multiplier decomposable into two 19×18, compile-time configurable coefficient registers, two 18-bit pre-adders, 54-bit adder
PIR-DSP
› PIR-DSP: optimized version of the DSP48
- Precision: multiplier architecture
- Interconnect: shift register
- Reuse: RF/FIFO

Figure: PIR-DSP vs DSP48 block diagrams
Precision (1)
› Based on two approaches:
1. Chopping
2. Recursive decomposition
Precision (2)
› Notation: M×NCijDk
› PIR-DSP multiplier: 27×18C32D2
- Chopping factors 3 and 2 respectively for 27 and 18: (27 = 9+9+9) × (18 = 9+9), giving six 9×9 multipliers
- Decomposition factor is 2: each 9×9 multiplier decomposes into two 4×4 or four 2×2 multipliers
› PIR-DSP modes:
- One 27×18 → 1 MAC
- Two (9×9 + 9×9 + 9×9) → 6 MACs
- Four (4×4 + 4×4 + 4×4) → 12 MACs
- Eight (2×2 + 2×2 + 2×2) → 24 MACs

Parameterised Decomposable MAC unit
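The flavour of decomposing one wide multiplier into several narrow ones can be shown in software (a sketch of the general packing idea, not the PIR-DSP circuit itself): two operands packed with guard bits share a single wide multiplication, and the two products are sliced out of the result.

```python
def packed_two_multiplies(a0: int, a1: int, b: int):
    """Two 8x8 unsigned products from one wide multiply.
    An 18-bit lane offset leaves guard bits above each 16-bit product,
    so no carry can spill from the low lane into the high lane."""
    assert 0 <= a0 < 256 and 0 <= a1 < 256 and 0 <= b < 256
    packed = (a1 << 18) | a0      # pack both operands into one word
    p = packed * b                # a single wide multiplication
    return p & 0xFFFF, p >> 18    # (a0 * b, a1 * b)

print(packed_two_multiplies(100, 200, 99))  # (9900, 19800)
```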
Interconnect (1)
› Three types of convolutions
1. Depthwise: using three PIR-DSPs
2. Standard: based on the depthwise convolution implementation, adding the partial results

Figure: 2D systolic array (Eyeriss), conventional vs ours, for depthwise convolution (filter rows r1-r3, ifmap rows r1-r5, partial sums)
Interconnect (2)
3. Pointwise
Reuse
Figure: data reuse in depthwise convolution (DW) and pointwise convolution (PW)
Area and Frequency
› SMIC 65-nm standard cell technology
- Synopsys Design Compiler 2013.12

Version               Area ratio  Fmax (MHz)
DSP48E2               1.0         463
+ M27×18C32D2 MAC-IP  1.14        358
+ interconnect        1.18        362
+ reuse               1.28        357
Energy
› Other networks are similar
Related Work
› Sits between Sharma (low-precision) and Boutros (high-precision)
                      Bitfusion [56] ISCA'18  Ours  Boutros [44] FPL'18  Ours
Area                  0.24                    1     0.77                 1
Performance per area
  2×2                 1                       0.4
  4×4                 1                       0.7   1                    1.2
  8×8                 1                       1.4   1                    1.2
  16×16               1                       0.4
  27×18                                             1                    0.8
Summary
› Described optimizations to the DSP48 to support a range of low-precision DNNs and quantified their impact on performance
- Precision, interconnect and reuse
- Designs are available at http://github.com/raminrasoulinezhad/PIR-DSP
› Future research
- Consider what we can do if we give up DSP48-like functionality
- Other interconnect optimisations
AddNet: Deep Neural Networks using FPGA-Optimized Multipliers
Julian Faraone, Martin Kumm, Martin Hardieck, Peter Zipf, Xueyuan Liu, David Boland, Philip H.W. Leong
Introduction
› Reconfigurable constant coefficient multipliers (RCCMs) implement y = cx for a fixed set of coefficients c, using only adds and shifts
› Present FPGA logic-element-optimised RCCMs which implement large coefficient sets c with few resources
› When applied to neural networks (AddNet), achieve up to 50% resource savings over traditional 8-bit quantized networks
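A toy RCCM makes the idea concrete. The coefficient set below is hypothetical (chosen for readability, not taken from the paper): one adder/subtractor fed by two configurable shifts of the input realises a small set of constants, and the select input chooses which coefficient is applied at run time.

```python
def rccm(x: int, cfg: int) -> int:
    """Toy reconfigurable constant multiplier: y = (x << s1) +/- (x << s2).
    The (hypothetical) coefficient set is {3, 5, 7, 10}."""
    table = {0: (1, 0, False),   # 3x  = (x << 1) + x
             1: (2, 0, False),   # 5x  = (x << 2) + x
             2: (3, 0, True),    # 7x  = (x << 3) - x
             3: (3, 1, False)}   # 10x = (x << 3) + (x << 1)
    s1, s2, sub = table[cfg]
    return (x << s1) - (x << s2) if sub else (x << s1) + (x << s2)

print([rccm(9, c) for c in range(4)])  # [27, 45, 63, 90]
```

The hardware cost is one adder plus multiplexed shifts, independent of how many coefficients the set contains.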
Reconfigurable constant coefficient multipliers (RCCM)
› Another example has the coefficient set C = {12305, 20746}
› The top adder computes an intermediate sum of shifted inputs; the bottom adder produces the selected coefficient product (see the circuit figure)
Base Topologies
› Topology A has a potentially larger coefficient set: it computes ±A_p ± B_1, with σ(s) choosing the operation
› Topology B allows coefficient sets symmetric around 0, since B_1 − A_p = −(A_p − B_1)
› Maximum possible set size is 2

Topology A | Topology B
Bit Level Slice Mapping
Works for any 6-LUT FPGA, such as Xilinx UltraScale or Intel Stratix 10
RCCM Circuits
Distribution Matching
› Weights in a DNN follow a (Gaussian-like) distribution
› RCCM coefficient sets can be chosen to match it
- Find the best 5 sets in terms of Kullback-Leibler divergence, then choose the set with the largest number of coefficients
DNN Training using AddNet
› Trained using straight-through estimator (STE)
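A minimal sketch of STE-style training, assuming nearest-coefficient rounding (the coefficient set below is illustrative, not one of the paper's optimised sets): the forward pass snaps weights to the RCCM coefficient set, while the backward pass treats the quantiser as the identity so gradients flow to the full-precision weights.

```python
import numpy as np

def quantize_to_set(w, coeffs):
    """Forward pass: snap each weight to the nearest RCCM coefficient."""
    coeffs = np.asarray(coeffs, dtype=float)
    idx = np.abs(np.asarray(w, dtype=float)[..., None] - coeffs).argmin(axis=-1)
    return coeffs[idx]

def ste_grad(upstream_grad):
    """Backward pass: the straight-through estimator passes gradients unchanged."""
    return upstream_grad

w = np.array([0.12, -0.4, 0.9])
print(quantize_to_set(w, [-1.0, -0.5, 0.0, 0.5, 1.0]))  # [ 0.  -0.5  1. ]
```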
Distribution-Optimised Coefficient Sets (for Gaussian)
› AlexNet top1/top5: 53.8%/76.9% (unoptimised) vs 55.8%/79.8% (optimised)
Implementation
Resource Utilization
Power Consumption
› 8-bit disab. = DSP blocks disabled; 8-bit enab. = DSP blocks enabled
Accuracy-Area Tradeoff Compared with Uniform Multipliers
Summary
› Described AddNet, which uses reconfigurable constant coefficient multipliers to save area
› Showed it is advantageous over uniform multipliers
› Uses the trainability of DNNs to match the computer architecture
Unrolling Ternary Networks
Stephen Tridgell, Martin Kumm, Martin Hardieck, David Boland, Duncan Moss, Peter Zipf, Philip H.W. Leong
Introduction
› Not possible to make fully parallel implementations of a NN on contemporary FPGAs due to size
› Fit an entire DNN on the FPGA by exploiting unstructured sparsity and the following techniques:
1. Buffering of streaming inputs in a pipelined manner
2. Ternary weights implemented as pruned adder trees
3. Common subexpression merging
4. 16-bit bit-serial arithmetic to minimize accuracy loss with low area
5. Sparsity control
Buffering of Streaming Inputs
Implements a pipelined 3×3 convolution: the input FIFO outputs one pixel each cycle to both Buffer A and the first stage of a shift register; Buffer A and Buffer B each delay their output by one image width.
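The buffering scheme can be sketched as a streaming generator (an illustrative software model of the line-buffer structure, with boundary pixels simply zero-initialised): two row buffers delay the stream by one image width each, and three 3-stage shift registers expose the full 3×3 window every cycle.

```python
from collections import deque

def conv3x3_stream(pixels, width):
    """Yield a 3x3 window per input pixel using two row buffers (A and B)."""
    buf_a = deque([0] * width)          # one row of delay
    buf_b = deque([0] * width)          # a second row of delay
    win = [[0] * 3 for _ in range(3)]
    for p in pixels:
        a = buf_a.popleft(); buf_a.append(p)   # pixel from one row above
        b = buf_b.popleft(); buf_b.append(a)   # pixel from two rows above
        for row, v in zip(win, (b, a, p)):
            row.pop(0); row.append(v)          # horizontal shift registers
        yield [r[:] for r in win]

wins = list(conv3x3_stream(range(16), 4))      # 4x4 image, row-major
print(wins[10])  # [[0, 1, 2], [4, 5, 6], [8, 9, 10]]
```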
Ternary Weights as Pruned Adder Trees
› Weights are ternary
- Multiplication by ±1 is either addition or subtraction
- Multiplication by 0 makes the matrix sparse
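The consequence is that a ternary matrix-vector product needs no multipliers at all, as a small sketch shows (a software model of the unrolled adder trees, not the Verilog):

```python
def ternary_matvec(W, x):
    """y = W @ x with W in {-1, 0, +1}: additions and subtractions only."""
    y = []
    for row in W:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0: contributes no hardware at all once unrolled
        y.append(acc)
    return y

print(ternary_matvec([[1, 0, -1], [0, 1, 1]], [3, 4, 5]))  # [-2, 9]
```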
Common Subexpression Elimination
› Weights are ternary
- Reduces convolution to constructing an adder tree
- Subexpressions are merged to reduce the implementation
Common Subexpression Elimination (RPAG)
› RPAG Algorithm
- Greedy algorithm for the related Multiple Constant Multiplication problem
- Looks at all the outputs of a matrix-vector multiplication and calculates the minimal tree depth d required to get the results
- Tries to determine the minimum number of terms needed at depth d − 1 to compute the terms at depth d, and iterates until d = 1 (whole tree generated)
Top-down CSE (TD-CSE)
› Builds multiple adder trees from the inputs to the outputs, creating one adder each iteration
› Counts the frequency of all size-2 subexpressions and replaces the most frequent (e.g. x6 = x2 + x3)
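One TD-CSE iteration can be sketched as follows (a simplified model that ignores the ±1 signs real ternary CSE must track; the variable names are illustrative): count every variable pair across the output expressions and factor the most frequent one into a new intermediate term.

```python
from collections import Counter
from itertools import combinations

def td_cse_step(rows):
    """One TD-CSE iteration over output expressions (lists of term names).
    Returns the rewritten expressions and the new shared term (or None)."""
    pairs = Counter()
    for row in rows:
        for p in combinations(sorted(row), 2):
            pairs[p] += 1
    (a, b), count = pairs.most_common(1)[0]
    if count < 2:                       # nothing worth sharing
        return rows, None
    new = f"({a}+{b})"                  # one adder, reused everywhere
    out = [[t for t in row if t not in (a, b)] + [new]
           if a in row and b in row else row for row in rows]
    return out, new

rows = [["x0", "x2", "x3"], ["x2", "x3", "x5"], ["x1", "x2", "x3"]]
new_rows, term = td_cse_step(rows)
print(term)  # (x2+x3)
```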
Bottom-up CSE (BU-CSE)
› Starts at the outputs and works back to the inputs
› More computation than TD-CSE, but can find larger common subexpressions
› The largest common subexpression is then selected to be removed, e.g. x6 = x0 + x2 + x3 appears twice and is added to the bottom row
Comparison of CSE Techniques for all Layers
› RPAG too computationally expensive for layers 2-6
Digit Serial Arithmetic
› Used 16-bit fixed point
› Each layer followed by batch normalization with a floating-point scaling factor
› Suppose that for a given layer, p pixels arrive at the same time
- For p ≥ 1, have p adder trees in parallel
- For p < 1, word- or bit-serial adders can match the input rate with fewer hardware resources
- 4-bit digit-serial has 1/4 the area
- 1-bit bit-serial has 1/16 the area
› Avoids idle adders
Network Studied
› VGG-7 network
› Ternary weights
› 16-bit activations
› Accepts a single pixel every cycle (p = 1): a W×W image takes W×W cycles
Sparsity Control
› CIFAR10 dataset
› Images padded with 4 pixels on each side and randomly cropped back to 32×32
› Weights are compared with a threshold: 0 if less than the threshold, s(±1) otherwise (s is a scaling factor)
› We introduce the idea of changing 𝜖 to control sparsity
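The thresholding rule can be sketched as follows (a minimal model with s folded to 1; the actual training procedure learns s and applies this inside the STE loop): raising 𝜖 zeroes more weights and therefore shrinks the unrolled adder trees.

```python
import numpy as np

def ternarize(w, eps, s=1.0):
    """Weights within [-eps, eps] become 0; otherwise +/- s.
    Larger eps means more zeros, i.e. higher sparsity."""
    t = np.zeros_like(w, dtype=float)
    t[w >  eps] =  s
    t[w < -eps] = -s
    return t

w = np.array([0.05, -0.3, 0.7, -0.02])
print(ternarize(w, eps=0.1))  # [ 0. -1.  1.  0.]
```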
Breakdown of Layer Sparsity
Improvement in using CSE
Implementation
› System implemented on an UltraScale+ VU9P @ 125 MHz
› Open-source Verilog generator
- https://github.com/da-steve101/binary_connect_cifar
› Generated code used in an AWS F1 implementation
- https://github.com/da-steve101/aws-fpga
Area Breakdown
Accuracy
Comparison with ASIC and FPGA implementations
Accuracy vs Speed
Summary
› Presented a method to unroll convolutions with ternary weights into a fully parallel implementation
- Exploits unstructured sparsity with no overhead
- Uses CSE, sparsity control and digit-serial adders to further reduce area
- Limited amount of buffering, and only loosely dependent on image size
› As larger FPGAs become available this technique may become more favourable
Summary of the Four Techniques
                  Flexibility  Throughput  Area
Two speed         Normal       Normal      Normal
PIR               High         High        High
AddNet            Reduced      High        Low
Unrolled ternary  Ternary      High        Low

› Multipliers form the basis for the computational part of ML
› Presented a number of different techniques to trade off flexibility, throughput and area
› The first three are applicable to ASICs or FPGA hard blocks, but unrolled ternary is FPGA-specific
References
[1] D. J. M. Moss, D. Boland, and P. H. W. Leong. A two-speed, radix-4, serial-parallel multiplier. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 27(4):769-777, April 2019. (doi:10.1109/TVLSI.2018.2883645)
[2] SeyedRamin Rasoulinezhad, Hao Zhou, Lingli Wang, and Philip H.W. Leong. PIR-DSP: An FPGA DSP block architecture for multi-precision deep neural networks. In Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 1-8, 2019. (doi:10.1109/FCCM.2019.00015)
[3] Julian Faraone, Martin Kumm, Martin Hardieck, Peter Zipf, Xueyuan Liu, David Boland, and Philip H.W. Leong. AddNet: Deep neural networks using FPGA-optimized multipliers. IEEE Transactions on VLSI Systems, to appear (accepted 11 Aug 2019). (doi:10.1109/TVLSI.2019.2939429)
[4] Stephen Tridgell, Martin Kumm, Martin Hardieck, David Boland, Duncan Moss, Peter Zipf, and Philip H. W. Leong. Unrolling ternary neural networks. ACM Transactions on Reconfigurable Technology and Systems, to appear (accepted 30 Aug 2019).