VLSI & System Lab.
Research Trends on Convolutional Neural Network
(CNN) Accelerator Design
2019 SoC Conference
Sookmyung Women's University, Woong Choi
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
AI & Neural Network
Artificial Intelligence: the science and engineering of creating intelligent machines
Machine Learning: the field of study that gives computers the ability to learn without being explicitly programmed
Neural Network: an algorithm that takes its basic function from an understanding of how the brain operates
Convolutional Neural Network
Source : MIT Tutorial
Neural Network Applications
Image Processing, Autonomous Machines, Security & Defense, Medical, Games
Simple Neural Network
$$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$$

[Diagram: three input nodes (x1, x2, x3) feed a hidden-node layer through a bunch of weights W (W1, W2, W3), producing hidden outputs Yj and two output nodes. A simple imitation of neurons and synapses.]

MAC: multiply-and-accumulate (= dot product operation)
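A minimal sketch (plain Python, illustrative weights and inputs, not from the slides) of the MAC operation a single neuron performs, matching the equation above:

```python
# Minimal sketch of a single neuron's multiply-and-accumulate (MAC).
# Weight and input values below are illustrative assumptions.

def neuron_output(weights, inputs):
    """Dot product: Y = sum_k W[k] * X[k] for one neuron."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x          # one MAC per weight/input pair
    return acc

# Example: 3 inputs, 3 weights -> one hidden-node value (pre-nonlinearity)
print(neuron_output([0.5, -0.2, 0.1], [1.0, 2.0, 3.0]))  # ~0.4
```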
Simple Neural Network
[Diagrams: the same three-input network (input, hidden, and output node layers) at three stages]

• Random initialization: let's try a dog image with the untrained NN. Initial weights (e.g., 0.0, 0.1) give an ambiguous result: Dog 47% / Cat 53%.
• Training algorithm: loss function, back-propagation, batch normalization.
• Weight training: let's try inference again. Trained weights (e.g., 0.4, 0.8) now give Dog 95% / Cat 5%.
Deep Neural Network
[Pipeline: 5-1000 CONV layers extract low-level to high-level features with POOL-layer down-sampling, then 1-3 fully-connected layers produce the classes]

What enabled deep neural networks:
• Available big data: 350M images/day, 300 hours of video uploaded per minute
• GPU acceleration
• New ML techniques

[Chart: top-5 image classification accuracy, 2010-2015; deep neural networks (DNNs) reach human-level accuracy. Russakovsky et al., IJCV 2015]
Source : MIT Tutorial
Convolutional Neural Network
[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes]

Convolution: a filter (weights) is shifted across the input feature map (X), filtering it to produce the output feature map (Y).

Activation: Sigmoid, Hyperbolic Tangent, Leaky ReLU, Exponential LU, ReLU. ReLU is the most hardware-friendly function.
Source : MIT Tutorial
Convolutional Neural Network
[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes]

• Reduce resolution of each channel independently
• Increase translation-invariance and noise-resilience

Max pooling, max(∙) (the AlexNet case), and average pooling, avg(∙), on a 4x4 input with 2x2 windows (reproduced in the sketch below):

Input:     Max pooling:   Average pooling:
1 0 3 3
5 6 6 8    6 8            3 5
3 1 2 3    3 5            2 3
2 2 2 5
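A short NumPy sketch reproducing the 4x4 example above with 2x2, stride-2 windows:

```python
import numpy as np

# Reproduces the slide's 4x4 pooling example with 2x2 windows, stride 2.
# A sketch using plain NumPy, not an accelerator kernel.
x = np.array([[1, 0, 3, 3],
              [5, 6, 6, 8],
              [3, 1, 2, 3],
              [2, 2, 2, 5]])

# Split the 4x4 map into four 2x2 windows, flattened along the last axis.
windows = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
print(windows.max(axis=2))   # [[6 8] [3 5]]   -> max pooling
print(windows.mean(axis=2))  # [[3. 5.] [2. 3.]] -> average pooling
```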
Location of Pooling Layers
[Diagram, AlexNet: Input → CONV → POOL → CONV → POOL → CONV → CONV → CONV → POOL → FC → FC → FC → Softmax]
Source : MIT Tutorial
Convolutional Neural Network
[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes]

• Fully-connected layers account for more than 90% of the total number of parameters, dominating memory and energy
• Simple matrix multiplication

[Diagram: fully-connected layer]
Source : MIT Tutorial
And Others …
[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes]

[Sergey Ioffe et al., ICML 2015]

Softmax normalization:
$$s_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Logits → Probabilities (sum = 1):
-0.7 → 0.08
 1.2 → 0.53
 0.5 → 0.26
-0.2 → 0.13

• Pre-processing to balance between training and inference (accuracy relies heavily on this procedure)
• Not essential when the difference between the classes is not seriously important
• Ranking is maintained (see the sketch below)
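A small sketch (plain Python/NumPy) reproducing the slide's logits-to-probabilities example and the ranking property:

```python
import numpy as np

# Softmax over the slide's example logits; a sketch, not from the deck.
logits = np.array([-0.7, 1.2, 0.5, -0.2])
probs = np.exp(logits) / np.exp(logits).sum()   # s_i = e^{z_i} / sum_j e^{z_j}
print(probs.round(2))                  # [0.08 0.53 0.26 0.13] -> sums to 1
print(np.argsort(logits) == np.argsort(probs))  # ranking is maintained
```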
Various CNN Configurations
[1] Krizhevsky et al., NIPS 2012  [2] Simonyan and Zisserman, ICLR 2015  [3] He et al., CVPR 2016
AlexNet [1] VGG-16 [2] ResNet-50 [3]
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
Hardware Technologies

[Chart: hardware technologies used in machine learning, organized by training vs. inference and data center vs. edge/hybrid, ranked by performance & functionality: Google TPU, Google Cloud TPU, NVIDIA Pascal & Volta, NVIDIA Tesla P40 & P4, NVIDIA Jetson, NVIDIA Drive PX2, AMD Radeon GPU, Qualcomm Snapdragon, FPGAs (Xilinx & Intel), FPGA SoCs]
Source: https://tanjo.ai/contents/2323923
Deep CNN on “Cloud Platforms”

[Diagram: GPU-based, CPU+FPGA-based, and ASIC-based cloud platforms]

• Accelerators (FPGAs, ASICs) are more efficient than GPUs in terms of power and energy consumption
• Trade-off between energy efficiency and flexibility
Need Reconfigurable Accelerator
[Layer diagrams: AlexNet (5 CONV, 3 POOL, and 3 FC layers plus softmax) vs. VGG-16 (13 CONV, 5 POOL, and 3 FC layers plus softmax)]
[Jinmook Lee et al., JSSC 2018]
• Different networks: different numbers of layers, different numbers of filters/channels
• Different architectures: different algorithmic structures
• Different quantizations: different bit-widths in different layers
Reconfigurability vs. Energy Efficiency

Dynamically Reconfigurable DNN Accelerator
• Low-level reconfiguration: a 14x12 PE array operating in different modes
• High-level reconfiguration: use of an instruction set architecture

[Chart: programmability vs. performance & energy efficiency. A general-purpose processor (w/ software programming) maximizes programmability; an application-specific hard-wired accelerator maximizes performance & energy efficiency; the target position is a runtime reconfigurable accelerator between the two.]
Main Operation in CNN

$$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$$

Architecture | Weight Size | Ifmap Size | # Multiply-Adds | Top-1 Accuracy
AlexNet      | 238 MB      | 1.6 MB     | 724 M           | 57.10 %
VGG-16       | 528 MB      | 34.8 MB    | 15.5 B          | 70.50 %
ResNet-50    | 99 MB       | 37.5 MB    | 3.9 B           | 75.20 %

[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes; the CONV and FC layers are both dominated by the same MAC operation above]
Data-Centric CNN
[Diagram: layers of data flowing through the network]

Accelerator Design
• Maximize Data Reuse
• Reduction: Computation Size
• Reduction: Computation Number
• Processing-in-memory

Operation            | Energy (pJ)
8-bit Add            | 0.03
32-bit FP Add        | 0.9
32-bit FP Mult       | 3.7
32-bit SRAM Read     | 5
32-bit DRAM Read     | 640
→ ~21333x relative energy cost from an 8-bit add to a 32-bit DRAM read
[M Horowitz et al., ISSCC 2014]
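To make the table concrete, a back-of-the-envelope sketch (plain Python; the energies are the Horowitz numbers above, while the per-MAC access count of two DRAM reads is an assumption for illustration):

```python
# Rough energy bookkeeping from the Horowitz ISSCC 2014 numbers above.
E = {"add8": 0.03, "fp_add32": 0.9, "fp_mul32": 3.7,
     "sram32": 5.0, "dram32": 640.0}          # energies in pJ

print(E["dram32"] / E["add8"])                # ~21333x: DRAM read vs 8-bit add

# Why data reuse matters: one MAC that re-fetches both operands from DRAM
# (an assumed worst case) costs far more than the arithmetic itself.
mac = E["fp_mul32"] + E["fp_add32"]           # 4.6 pJ of compute
print((2 * E["dram32"]) / mac)                # ~278x memory-to-compute ratio
```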
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
Maximize Data Reuse : Data Flow
Weight Stationary
• Maximize weight reuse
• Broadcast activations
• Accumulate pSUMs (partial sums) spatially

Output Stationary
• Maximize pSUM reuse
• Broadcast weights
• Reuse activations spatially

No Local Reuse
• Use a large global buffer
• Reduce DRAM access
• Multicast activations & weights
• Accumulate pSUMs spatially
[DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015] [TPU, ISCA 2017]
[Peemen, ICCD 2013] [ShiDianNao, ISCA 2015] [Gupta, ICML 2015] [Moons, VLSI 2016]
[nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [ISAAC, ISCA 2016] [PRIME, ISCA 2016]
Source : MIT Tutorial
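The dataflows differ mainly in loop ordering, i.e., which operand a PE keeps pinned locally. A conceptual sketch in plain Python/NumPy (illustrative sizes; indexing uses the correlation form X[j+k] rather than the slide's X[j−k] for simplicity):

```python
import numpy as np

# Loop-nest view of two dataflows on a 1-D convolution. "Stationary"
# refers to which operand would stay in a PE's local register.

def conv_output_stationary(W, X):
    N, J = len(W), len(X) - len(W) + 1
    Y = np.zeros(J)
    for j in range(J):           # each output's partial sum stays local...
        for k in range(N):       # ...while weights and inputs stream past
            Y[j] += W[k] * X[j + k]
    return Y

def conv_weight_stationary(W, X):
    N, J = len(W), len(X) - len(W) + 1
    Y = np.zeros(J)
    for k in range(N):           # each weight is fetched once and reused...
        for j in range(J):       # ...across every output position
            Y[j] += W[k] * X[j + k]
    return Y

W, X = np.array([1., 2., 3.]), np.arange(8, dtype=float)
assert np.allclose(conv_output_stationary(W, X), conv_weight_stationary(W, X))
```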
Maximize Data Reuse : Filtering
Simple CNN on CIFAR-10:
CONV Layer | iFMAP Data | Reused Data | Reusing Rate
1          | 1024       | 540         | 52.7 %
2          | 256        | 220         | 85.9 %
3          | 64         | 60          | 93.8 %

AlexNet on ImageNet:
CONV Layer | iFMAP Data | Reused Data | Reusing Rate
1          | 51529      | 4158        | 8.1 %
2          | 961        | 520         | 54.1 %
3          | 225        | 72          | 32.0 %
4          | 225        | 72          | 32.0 %
5          | 225        | 72          | 32.0 %
[Diagram: a filter (weights) shifted across the input feature map (X) to produce the output feature map (Y), comparing one-directional and bi-directional sliding. For an input feature map Ii with Mi filters of size Fi, Ci channels, and stride Si, each downward sliding step (1st, 2nd, ..., ((I−F)/S)th) reuses Fi × (Fi − Si) input pixels and discards/updates the rest; see the sketch after the reference below.]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
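A sketch of the reuse bookkeeping for one-directional downward sliding (assuming AlexNet CONV1 parameters I=227, F=11, S=4, which reproduces the 8.1% row of the table above):

```python
# Per downward slide of stride S with an F x F filter, F*(F-S) input
# pixels carry over. Parameter values are assumed AlexNet CONV1 numbers.

def reuse_rate(I, F, S):
    total = I * I                         # iFMAP pixels
    slides = (I - F) // S                 # downward sliding steps
    reused = F * (F - S) * slides         # carried-over pixels per column walk
    return total, reused, 100.0 * reused / total

print(reuse_rate(227, 11, 4))   # (51529, 4158, ~8.1%) -> matches the table
```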
Reduction: Computation Size
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
• Directly reduces the required memory & PE size
• Trade-off: bit-width ↔ accuracy
[Plots: weight-value density under linear, log2, and non-linear quantization grids]

Fixed-point formats: sign + integer + fractional mantissa (e.g., 011001), or sign + fractional mantissa only.

Binarization:
standard network | real weights & activations   | operations: +, −, ×
binary weight    | binary weights               | operations: +, −
binary network   | binary weights & activations | operations: XNOR, bitcount
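A rough sketch contrasting uniform and log2 quantization grids (illustrative bit-widths and weight distribution, not the paper's exact scheme):

```python
import numpy as np

# Compare uniform (linear) steps vs powers-of-two (log2) steps that
# better match a bell-shaped weight density. All parameters are assumed.

def quant_linear(w, bits, w_max=1.0):
    step = 2 * w_max / (2**bits - 1)
    return np.round(w / step) * step            # uniform grid

def quant_log2(w, bits):
    sign = np.sign(w)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), -(2**(bits - 1)), 0)
    return sign * 2.0**exp                      # powers-of-two grid

w = np.random.randn(1000) * 0.2                 # bell-shaped weights
for q in (quant_linear(w, 4), quant_log2(w, 4)):
    print(np.mean((w - q)**2))                  # quantization MSE to compare
```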
Reduction: Computation Size
❑ Differential Input Method (DIM)

Previous output:
$$Y_{j-1} = \sum_{k=0}^{N-1} W_k X_{j-k-1}$$

Current output:
$$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$$

Rewriting the current output as a difference plus the previous output:
$$Y_j = (Y_j - Y_{j-1}) + Y_{j-1} = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$$
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
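A quick numerical check (plain Python/NumPy with made-up sizes) that the DIM identity above reproduces the direct convolution:

```python
import numpy as np

# Verify: computing Y_j from input differences plus the previous output
# matches the direct convolution. Sizes below are illustrative.
N, J = 5, 20
W = np.random.randn(N)
X = np.random.randn(J + N)

def y_direct(j):                # Y_j = sum_k W_k * X_{j-k}
    return sum(W[k] * X[j - k] for k in range(N))

def y_dim(j, y_prev):           # Y_j = sum_k W_k*(X_{j-k} - X_{j-k-1}) + Y_{j-1}
    return sum(W[k] * (X[j - k] - X[j - k - 1]) for k in range(N)) + y_prev

y_prev = y_direct(N - 1)
for j in range(N, J):
    y_prev = y_dim(j, y_prev)
    assert np.isclose(y_prev, y_direct(j))      # identical results
print("DIM matches direct convolution")
```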
Reduction: Computation Size
❑ Differential Input Method (DIM)

$$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$$

[Histograms: occurrence of input activation values (roughly −120 to 120) for ConvNet layers 1-3 and AlexNet layers 1-5, with and without DIM. With DIM, the distributions concentrate near zero: the dynamic range is reduced.]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Size
❑ Differential Input Method (DIM)

$$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$$

[Diagram: the raw input stream x0, x1, x2, x3, ... fed to the MAC datapath]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Size
❑ Differential Input Method (DIM)

$$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$$

[Diagram: the differential input stream x1−x0, x2−x1, x3−x2, x4−x3, ...; most datapath components operate on the reduced bit-width, at the cost of a small hardware overhead for forming the differences]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Size
❑ Adaptive Bit-Width Reduction w/ DIM

[Diagram: layer-by-layer quantization error. Layer A has many small values, making the upper integer-magnitude bits ineffective; layer B has many large values that need them. An adaptive decimal point per layer covers the differing dynamic ranges. With DIM, quantization applies pixel-by-pixel to adjacent-input differences: large differences that would cause quantization error (CASE-2) occur rarely, while small differences (CASE-1) leave the upper bits ineffective.]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Size
❑ Adaptive Bit-Width Reduction (ABWR)
Reference    | CIFAR-10 bit-width per layer | Accuracy loss (%) | ImageNet bit-width per layer | Accuracy loss (%)
[1]          | 6-6-5                        | 0.8               | 5-7-9-8-8                    | 0.8 a)
[2]          | 8-7-7                        | ~0.7              | 10-8-8-8-8                   | ~0.6
[3]          | 4-5-7                        | ~1.0 b)           | 9-7-4-5-7                    | ~1.0 b)
[4] c)       | 8-8-8                        | 0.3               | 8-8-8-8-8                    | 0.3
ABWR w/ DIM  | 5-5-6                        | 0.1               | 6-6-6-6-6                    | 0.4

a) Top-5 accuracy. b) Relative accuracy loss; baseline accuracy is not presented. c) Requires retraining weights for fine-tuning with software.
[1] B. Moons et al., "Energy efficient ConvNets through approximate computing," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2016, pp. 1-8.
[2] P. Judd et al., "Proteus: Exploiting numerical precision variability in deep neural networks," in Proc. Int. Conf. Supercomput. (ICS), 2016, Art. no. 23.
[3] P. Judd et al., "Stripes: Bit-serial deep neural network computing," IEEE Comput. Archit. Lett., vol. 16, no. 1, pp. 80-83, Jan./Jun. 2017.
[4] P. Gysel et al., "Hardware-oriented approximation of convolutional neural networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1-8.
Reduction: Computation Number
• Reduced activations → gating or skipping cycles & memory accesses
• Accuracy loss depends on the technique

Pruning: train connectivity → prune connections → train weights (before/after)

Filter decomposition (MobileNets): apply a 5x5 filter as 5x1 and 1x5 filters sequentially

ReLU sparsity (see the sketch after the reference below):
 9 -1 -3            9 0 0
 1 -5  5  → ReLU →  1 0 5
-2  6  1            0 6 1

Run-level coding:
Input:  0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22, ...
Output: 2, 12, 4, 53, 2, 22, 0   ((run, level) pairs, then a terminator)
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
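A sketch of the run-level coding shown above, reproducing the slide's example stream (the trailing 0 is assumed to be a terminator symbol):

```python
# Run-level coding of sparse post-ReLU activations: for each nonzero
# value, emit (number of preceding zeros, value), then a terminator.

def run_level_encode(xs):
    out, run = [], 0
    for x in xs:
        if x == 0:
            run += 1                 # count zeros in the current run
        else:
            out += [run, x]          # emit (run, level) pair
            run = 0
    out.append(0)                    # terminator
    return out

print(run_level_encode([0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22]))
# [2, 12, 4, 53, 2, 22, 0]
```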
Reduction: Computation Number
❑ DIM w/ Near-Zero Skipping

[Images: original image → normalized → after pruning, vs. DIM image → after pruning w/ DIM → after recovering. Pruning discards near-zero data; with DIM, the informative data survive within the reduced dynamic range.]

Activation skipping rate (AlexNet w/ 0.28% accuracy loss): 48.4 % (x1.00) without DIM → 56.8 % (x1.17) with DIM
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Number
❑ DIM w/ ABWR & Near-Zero Skipping
• Based on 65nm CMOS standard cell library
• 4xM-PE (24xNPE) for AlexNet
→ 32.8 % reduction in computation cycles
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
Processing in Memory
• PIM: a technique that performs simple logic within a memory device to reduce the amount of data passed to the processor.

[Diagram: von Neumann architecture (CPU/logic exchanging massive data with memory) vs. processing-in-memory, where simple logic sits inside the memory]
Processing in Memory
❑ Basic DRAM Operation : Read → Write-Back
[Figures: DRAM array architecture; read operation sensing “1” and sensing “0”]

• DRAM consists of a 1T1C cell array
• WL on → charge sharing between the cell capacitor and the bit-line (BL) capacitor → sensing
[Vivek Seshadri et al., MICRO 2017]
Processing in Memory
❑ DRAM-based PIM : AND, OR Operation
• AND/OR operation: 3 WLs on → charge sharing → sensing
• Majority function of the inputs A, B, and C
• Result = A OR B when C = 1, A AND B when C = 0

[Table: DRAM PIM OR/AND operation, R = A OR B (C = 1) and R = A AND B (C = 0)]
[Vivek Seshadri et al., MICRO 2017]
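A truth-table check (plain Python) that the triple-row-activation majority function behaves as stated above:

```python
from itertools import product

# MAJ(A, B, C) models charge sharing across three simultaneously
# activated DRAM rows: the sensed value is the majority of the cells.

def maj(a, b, c):
    return int(a + b + c >= 2)       # majority of three stored bits

for a, b in product([0, 1], repeat=2):
    assert maj(a, b, 1) == (a | b)   # C = 1 -> OR
    assert maj(a, b, 0) == (a & b)   # C = 0 -> AND
print("majority acts as OR (C=1) / AND (C=0)")
```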
Processing in Memory
❑ DRAM-based PIM : NOT Operation (Inverting)
• 2T1C cell: read → inverting → write-back
• Separate decoder to reduce hardware overhead
• 32x throughput and 35x energy improvement over DDR3 for AND/OR/NOT

[Diagrams: special 2T1C cell & column line; DRAM with a 2T1C row and a dedicated PIM operation region (row decoders, cell array, sense amps)]
[Vivek Seshadri et al., MICRO 2017]
Processing in Memory
❑ Binary Neural Network
[Diagram: floating- or fixed-point operations replaced by binary (bit-wise), hardware-friendly operations]

→ Explosive growth in PIM research

[1] M. Courbariaux et al., "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," 2016, arXiv:1602.02830.
[2] W. Tang et al., "How to train a compact binary neural network with high accuracy?," in Proc. AAAI, 2017, pp. 2625-2631.
Processing in Memory
❑ Content Addressable Memory Operation
• Content addressable memory (CAM): a look-up table
• Search data is applied to the search-lines (SLs) in parallel
• Search results develop on the match-lines (MLs) in parallel

RAM (random access memory): address in → data out
CAM: data in → address out
Processing in Memory
❑ Content Addressable Memory Operation
• Match-line (ML) precharged to VDD
• Search-line (SL) activation → ML evaluation

1. ML Precharge → 2. SL Activation
[Diagram: on a mismatch the ML discharges to VSS; on a match the ML stays at VDD]
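A software analogy of the lookup direction (plain Python; the stored words are made up): RAM maps address → data, while a CAM compares all entries against the search data in parallel and returns matching addresses:

```python
# Software analogy of a CAM lookup. Hardware evaluates every match-line
# at once; here we simply scan the (made-up) stored words.

cam = ["1010", "0110", "1100", "1010"]

def cam_search(key):
    return [addr for addr, word in enumerate(cam) if word == key]

print(cam_search("1010"))   # [0, 3] -> match-lines that stay at VDD
print(cam_search("1111"))   # []     -> all match-lines discharge to VSS
```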
Processing in Memory
❑ Binarized Convolution Operation
$$y = \sigma\!\left(\sum_{i=1}^{N} x_i w_i\right), \qquad \sigma(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

→ implemented as XNOR-popcount
[Circuit: activations X1...XN drive the SLs, weights W1...WN are stored along a CAM match-line; after precharge (PCHb), the ML discharge is sensed in the time domain by a sense amp (SA). Few mismatches → slow discharge → the SA resolves '1'; many mismatches → fast discharge → '0'. The N-bit comparison thus reduces to a 1-bit time-domain decision.]
[Ref.] Woong Choi, Kwanghyo Jeong, Kyungrak Choi, Kyeongho Lee and Jongsun Park, "Content Addressable Memory Based Binarized Neural Network Accelerator Using Time-Domain Signal Processing", Design Automation Conference (DAC), Jun. 2018.
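A sketch of the XNOR-popcount identity (plain Python/NumPy, random vectors): with x, w ∈ {−1, +1} encoded as bits {0, 1}, the dot product equals 2·popcount(XNOR(x, w)) − N:

```python
import numpy as np

# Binarized dot product via XNOR-popcount; vector length is illustrative.
N = 64
x = np.random.randint(0, 2, N)           # bit-encoded activations
w = np.random.randint(0, 2, N)           # bit-encoded weights

xnor = 1 - (x ^ w)                       # 1 where bits agree
dot_bits = 2 * int(xnor.sum()) - N       # XNOR-popcount result

xf, wf = 2 * x - 1, 2 * w - 1            # decode back to {-1, +1}
assert dot_bits == int((xf * wf).sum())  # matches the +-1 dot product
print(+1 if dot_bits >= 0 else -1)       # binarized activation sigma(.)
```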
Processing in Memory
❑ Binarized Convolution Operation
[Ref.] Woong Choi, Kwanghyo Jeong, Kyungrak Choi, Kyeongho Lee and Jongsun Park, "Content Addressable Memory Based Binarized Neural Network Accelerator Using Time-Domain Signal Processing", Design Automation Conference (DAC), Jun. 2018.
• Half of the weight bits are inverted to generate the reference ML delay
• Batch normalization bias: modify the number of inverted weight bits
Processing in Memory
❑ Binarized Convolution Operation
[Images: MNIST digits 7 and 2]

                  | JSSC '17 [1]         | Conv. [2] (redesigned) | This work
Technology        | 65nm                 | 65nm                   | 65nm
Supply Voltage    | 1.0V                 | 1.1V                   | 1.1V
Operation         | CONV-BNorm-BinAct (for the 2nd CONV layer for MNIST)
PE Components     | MUX, Delay, Resistor | XNOR, Adder            | CAM Cell
PE Area (1 PE)    | 85.1 um2             | 11.84 um2              | 4.0 um2
Energy Efficiency | *48.2 TSop/J         | **25.2 TSop/J          | *88.5 TSop/J

Computation error → 1.56% top-1 accuracy degradation
[1] D. Miyashita et al., "A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing," IEEE JSSC, vol. 52, no. 10, pp. 2679-2689, 2017.
[2] H. Yonekawa et al., "On-chip memory based binarized convolutional deep neural network applying batch normalization free technique on an FPGA," in Proc. IPDPSW, 2017, pp. 98-105.
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
Data-Centric CNN
[Diagram: layers of data flowing through the network]

Accelerator Design
• Maximize Data Reuse
• Reduction: Computation Size
• Reduction: Computation Number
• Processing-in-memory

Operation            | Energy (pJ)
8-bit Add            | 0.03
32-bit FP Add        | 0.9
32-bit FP Mult       | 3.7
32-bit SRAM Read     | 5
32-bit DRAM Read     | 640
→ ~21333x relative energy cost from an 8-bit add to a 32-bit DRAM read
[M Horowitz et al., ISSCC 2014]
2018 Issues on AI
❑ Phone reservations as natural as a human's: Google Duplex
❑ Autonomous taxi service: Waymo (Google)
❑ Fake images that look real: style-based GAN
❑ Predicting the 3D structure of proteins: AlphaFold
❑ Language processing smarter than humans: BERT, ELMo, Big Bird
❑ 'Trying on' clothes online: Virtual Try-On
Source: https://brunch.co.kr/@omniousofficial/32
AI Hardware vs Human
❑ Energy Discrepancy
AI hardware: 5×10⁴ W (AlphaGo) → 1~2×10³ W (AlphaGo Zero), vs. the human brain: 20 W
• Where does this inefficiency come from? Algorithm, Architecture, Circuits, Device, and Materials