How can we optimize convolutional neural network designs on mobile and embedded systems?
June 18, 2016
Sungjoo Yoo
Computing Memory Architecture Lab.
CSE, SNU
http://cmalab.snu.ac.kr
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
[Bear]
Retina: On-Center Cell ~ Image Sensor Cell
Retina → LGN → Primary Visual Cortex (V1)
[Kandel]
[Kolb_Whishaw]
Line Detection in V1 ~ Convolution
Convolutional Neural Network (CNN)
• LeNet (1989)
• Consists of convolution layers and subsampling (max-pooling) layers
• Training: backpropagation
Convolution: 2D Input Case
[Chen, 2016]
Convolution: 3D Input Case
• Each receptive field in the input feature maps produces one output neuron
• Each output feature map has its own set of kernel weights
[Chen, 2016]
Convolution: 3D Input / 3D Output
[Chen, 2016]
Training (backprop) determines kernel weights
Convolution: Computation and Model Size
• # multiplications = (k × k × D) × (N × N × H), for k×k kernels, D input feature maps, N×N output feature maps, and H output feature maps
• # weights = (k × k × D) × H
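As a sanity check, both counts are easy to compute directly. A minimal Python sketch, using AlexNet's second convolutional layer (5×5 kernels, 48 input maps, 27×27×128 output, as quantified on a later slide) as the example:

    def conv_costs(k, D, N, H):
        """Cost of one conv layer: k x k kernels, D input feature maps,
        N x N output feature maps, H output feature maps."""
        mults = (k * k * D) * (N * N * H)  # one k*k*D dot product per output neuron
        weights = (k * k * D) * H          # one k*k*D kernel per output feature map
        return mults, weights

    # AlexNet conv2: 5x5 kernels, 48 input maps, 27x27x128 output
    print(conv_costs(k=5, D=48, N=27, H=128))   # (111974400, 153600)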
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Simple Example: Classification for Two Classes
• Classification ~ drawing a decision surface between two groups
• A complex (high-order) surface ~ high cost
• Basic idea: classify the easy inputs first, at low cost
[Venkataramani, 2015]
Big/Little DNN: Overview
[Figure: input image → ① classification by the little DNN → success checker; ② a) high confidence → return the result; ② b) low confidence → re-run classification on the big DNN]
[Park, 2015]
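A minimal sketch of this flow, assuming hypothetical little_dnn/big_dnn functions that return class-probability vectors, and a success checker that simply thresholds the top-1 probability ([Park, 2015] sets this threshold statically or dynamically):

    import numpy as np

    def big_little_classify(image, little_dnn, big_dnn, threshold=0.9):
        """Run the little DNN first; fall back to the big DNN on low confidence."""
        probs = little_dnn(image)              # (1) cheap classification
        if np.max(probs) >= threshold:         # (2a) success checker: high confidence
            return int(np.argmax(probs))       #      keep the little DNN's result
        return int(np.argmax(big_dnn(image)))  # (2b) low confidence: big DNN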
Experiment Setup: Comparison of Computation Workload between Big & Little DNNs
• The big DNN has ~10X more computation than the little DNNs
• ImageNet classifiers (# multiplications, ×10^9): big 19.51, s 2.54, m 1.58, f 0.67, c 0.79
• MNIST classifiers (# multiplications, ×10^4): big 188.80, m1 54.40, m2 26.88, m3 4.48, m4 1.26, m5 28.80, m6 14.40, m7 3.60, m8 0.90
[Park, 2015]
Experiment Setup: HW NPU
• Based on [Zhang, FPGA 2015]
• HW NPU
  • 512 compute engines (PEs)
  • Double buffering
  • Loop unrolling
  • Verilog design, 65nm TSMC
• In-house cycle-accurate simulator + DRAMSim2
  • Micron DRAM power model
[Figure: NPU organization — DRAM and a DMA unit feeding SRAM input/output buffers and the PE array; each PE is a multiply-accumulate (×, +) unit]
*Zhang, et al., “Optimizing FPGA-based accelerator design for deep convolutional neural networks”, FPGA 2015.
[Park, 2015]
Result: MNIST
• 93.0% energy reduction
• Accuracy loss of 0.08%
[Chart: energy [mJ/image] vs. SRAM size [B], broken down into DRAM, computation, and SRAM energy]
[Chart: energy [mJ/image] for big/LITTLE configurations m1–m8 under big-only, static-threshold, and dynamic-threshold policies; 85.6% and 85.3% reductions marked]
[Chart: accuracy [%] for m1–m8: 99.07, 99.10, 98.97, 98.35, 99.06, 98.90, 99.04, 98.76; big-only 99.12]
[Park, 2015]
Result: ImageNet
• 56.7% energy reduction
• Top-1 accuracy loss of 0.51%
[Chart: energy [J/image] vs. SRAM size [B], broken down into DRAM, computation, and SRAM energy]
[Chart: energy [J/image] for big/LITTLE configurations s, m, f, c under big-only, static-threshold, and dynamic-threshold policies; 34.3% reductions marked]
[Chart: top-1 accuracy [%] for s, m, f, c: 68.81, 67.53, 67.21, 68.90; big-only 69.41]
[Park, 2015]
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Pruning CNN
[Han, 2015]
Neuron Pruning is Natural in Biological Systems
• # synapses increases until about age two and then decreases due to pruning
[https://universe-review.ca/R10-16-ANS12.htm]
Convolution with Matrix Multiplication (called Convolution Lowering)
• Input: 3x3x3
• Output: 2x2x2
• Convolutional kernel: 3x2x2
[Chetlur 2014]
Pruning [Han 2015] Hardly Reduces the Runtime of Convolution on GPU
[Han 2015][Chetlur 2014]
Group-wise Brain Damage
• For each input feature map, the 2D filter elements at the same spatial location are pruned across all output maps
• Pruning is performed incrementally
  • Repeat until no pruning candidate remains: prune one column of the lowered filter matrix F, then retrain the network to recover from the accuracy loss
• Result: 3X reduction in # multiplications for AlexNet
[Lebedev 2016]
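A minimal sketch of one such pruning step, assuming F is the lowered filter matrix with one row per output filter and one column per (input map, spatial offset) group, and using a simple smallest-L2-norm criterion to pick the column ([Lebedev 2016] instead drives the choice with group-sparsity regularization during training):

    import numpy as np

    def prune_one_group(F):
        """Zero out the not-yet-pruned column of F with the smallest L2 norm.
        F: (H output filters) x (k*k*D groups); zeroing a column removes one
        (input map, spatial offset) group from every filter at once."""
        norms = np.linalg.norm(F, axis=0)
        norms[norms == 0] = np.inf      # skip columns already pruned
        F[:, np.argmin(norms)] = 0.0
        return F

    # Incremental schedule (fine-tuning between steps is elided):
    #   while accuracy_is_acceptable():
    #       F = prune_one_group(F)
    #       fine_tune()                 # recover from the accuracy loss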
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Singular Value Decomposition (SVD)
[K. Baker, Singular Value Decomposition Tutorial]
Example of Truncated SVD: A ≈ USVᵀ
• S: take the square roots of the 3 largest eigenvalues of AᵀA (the singular values)
• U, V: take the 3 eigenvectors associated with the selected eigenvalues
[K. Baker, Singular Value Decomposition Tutorial]
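A minimal NumPy sketch of this rank-3 truncation (keeping the 3 largest singular values and their singular vectors):

    import numpy as np

    def truncated_svd(A, rank=3):
        """Best rank-`rank` approximation of A in the least-squares sense."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]

    A = np.random.randn(64, 48)
    A3 = truncated_svd(A, rank=3)
    print(np.linalg.norm(A - A3))   # Frobenius-norm approximation error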
Error degrades accuracy. How to reclaim lost accuracy?
[Figure: low-rank approximation of an AlexNet conv layer (input X: 55×55×48; output Z: 27×27×128): the original 5×5 convolution with 48×5×5 kernels is replaced by U3 (48 → 25 maps), a 5×5 core convolution C (25×5×5 kernels, 25 → 59 maps), and U4 (59 → 128 maps), producing the approximate output Z′]
Low-rank Approximation in CNN
[Kim, 2016]
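A minimal sketch of what this buys in multiplications, assuming for simplicity that all three stages run at the output resolution N×N and using the figure's ranks r_in = 25 and r_out = 59 (the exact per-stage resolutions in [Kim, 2016] differ slightly):

    def tucker2_mults(k, D, H, N, r_in, r_out):
        """Multiplications: one k x k conv vs. 1x1 -> k x k core -> 1x1 chain."""
        full = (k * k * D) * (N * N * H)
        low  = (D * r_in                  # 1x1 conv: D -> r_in maps
                + k * k * r_in * r_out    # k x k core conv: r_in -> r_out
                + r_out * H) * (N * N)    # 1x1 conv: r_out -> H maps
        return full, low

    full, low = tucker2_mults(k=5, D=48, H=128, N=27, r_in=25, r_out=59)
    print(full / low)   # ~3.4x fewer multiplications for this layer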
Experiments on Samsung Galaxy S6
• Exynos 7420 + LPDDR4
  • ARM Mali T760 GPU: 190 GFLOPS over 8 cores, max 256 threads/core, 32KB L1$/core, 1MB shared L2$, 25.6GB/s LPDDR4 with four x16 channels
• Comparison: an NVIDIA Titan X provides 6.6 TFLOPS over 24 cores, 336GB/s main memory, max 2K threads/core, 64KB L1$/core, 3MB shared L2$
[Kim, 2016]
AlexNet: Power Consumption
• Total: 245mJ/image, 117ms
• GPU power > DRAM power
• Convolutional layers dominate total energy consumption and runtime
• At the fully connected layers, GPU power drops while DRAM power increases, due to the large number of memory accesses for weights and little data reuse, i.e., low core utilization (long total idle time)
[Chart: GPU and DRAM power over time across layers C1–C5 and F6–F8]
[Kim, 2016]
VGG_S: Power Consumption
• Total: 825mJ/image, 357ms
• Convolutional layers dominate total energy consumption and runtime
• At the convolutional layers, DRAM consumes more power than in AlexNet due to the larger number of weights
• At the fully connected layers, the trend is similar to AlexNet: GPU power ~ DRAM power
[Chart: GPU and DRAM power over time, AlexNet vs. VGG_S]
[Kim, 2016]
GoogLeNet: Power Consumption
• Total: 473mJ/image, 273ms
• The 1st and 2nd convolutional layers consume 1/4 of total energy and runtime
• Inception modules
  • Relatively low power consumption in both GPU and DRAM
  • Power consumption fluctuates due to the many small inception modules and cache-unfriendly 1x1 convolutions
• The fully connected layer (1M parameters) consumes very little power in GPU and DRAM
[Chart: per-layer parameter/multiplication counts, AlexNet vs. GoogLeNet]
[Kim, 2016]
AlexNet, VGG_S vs. GoogLeNet: Top-5 Accuracy, Runtime and Energy
• AlexNet: 80.0% top-5, 117ms, 245mJ/image
• VGG_S: 84.6% top-5, 357ms, 825mJ/image
• GoogLeNet: 88.9% top-5, 273ms, 473mJ/image
[Kim, 2016]
Results of Low-Rank Approximation
• Significant reductions in energy consumption and runtime
  • Energy: 1.6X–4.26X reduction
  • Runtime: 1.42X–3.68X reduction
[Chart: per-network energy gains, e.g., 3.41X, 4.26X, 1.6X]
[Kim, 2016]
Fine-tuning is Required for Accuracy Recovery
• Low-rank approximation loses accuracy
• Fine-tuning recovers the lost accuracy
  • 1 epoch = one pass of backpropagation over the entire training set
[Kim, 2016]
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Narrow-Data CNN
• Performance improvement due to narrow data
  • E.g., going from 16-bit to 4-bit data gives a 4X speedup at the same memory bandwidth consumption
[Figure: conventional convolution multiplies one 16b weight by one 16b activation per MUL+ADD; with 4b data, four weight/activation products are computed per fetch at the same bandwidth]
Logarithm-based Quantization (Log-Quant)
• Smaller quantization error for small values
• No multiplications needed
[Miyashita, 2016]
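A minimal sketch of the idea, quantizing values to signed powers of two (a simplification of [Miyashita, 2016]; the bit width and clipping range below are illustrative assumptions):

    import numpy as np

    def log_quantize(x, bits=4):
        """Round |x| to the nearest power of two, keeping the sign; the
        exponent is clipped to the 2**bits representable levels."""
        exp = np.round(np.log2(np.abs(x) + 1e-12))
        exp = np.clip(exp, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return np.sign(x) * 2.0 ** exp

    print(log_quantize(np.array([0.03, -0.3, 0.9])))   # [ 0.03125 -0.25  1. ]

Because every quantized weight is ±2^e, a product activation × weight reduces to an arithmetic shift by e bits, which is what the shifter-based datapath on the next slide exploits.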
Log-Quant
• Performance improvement due to narrow data
• Replacing multipliers with shifters → better area/energy efficiency
[Figure: LogQuant-based convolution replaces the 16b multiplier with 4b arithmetic shifters (>>) feeding the adder]
Preliminary Results: Log-Quant AlexNet (accuracy loss [%] vs. full-precision baseline)

Act base = 1, Weight base = 4:
                    Activation bits
  Weight bits      3        4        5        6
      2         79.328   79.326   79.334   79.334
      3         77.236   77.254   77.254   77.546
      4          8.894    8.66     8.662    8.662
      5          1.466    1.186    1.222    1.222
      6          1.314    1.184    1.006    1.006
      7          1.318    1.294    1.342    1.342

Act base = 4, Weight base = 4:
                    Activation bits
  Weight bits      3        4        5        6
      2         79.488   79.41    79.336   79.328
      3         79.488   77.826   77.35    77.43
      4         64.632   12.364    6.402    7.4
      5         58.33     5.26     0.306    0.272
      6         58.296    5.354    0.158    0.172
      7         58.274    5.426    0.178    0.186

• 0.3% accuracy loss at 5-bit weight and activation
[CMA Lab, 2016]
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Convolution with Matrix Multiplication (a.k.a. Convolution Lowering)
• Input: 3x3x3
• Output: 2x2x2
• Convolutional kernel: 3x2x2
[Chetlur 2014]
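A minimal sketch of the lowering for this example (stride 1, no padding), turning the convolution into a single matrix multiplication:

    import numpy as np

    def lowered_conv(x, w):
        """x: input (D, N, N); w: kernels (H, D, k, k) -> output (H, M, M)."""
        D, N, _ = x.shape
        H, _, k, _ = w.shape
        M = N - k + 1                          # output size (stride 1, no pad)
        cols = np.empty((M * M, D * k * k))    # one row per output position
        for i in range(M):
            for j in range(M):
                cols[i * M + j] = x[:, i:i + k, j:j + k].ravel()
        F = w.reshape(H, D * k * k)            # lowered filter matrix
        return (cols @ F.T).T.reshape(H, M, M) # one big GEMM

    x = np.random.randn(3, 3, 3)               # 3x3x3 input, as on the slide
    w = np.random.randn(2, 3, 2, 2)            # two 3x2x2 kernels
    print(lowered_conv(x, w).shape)            # (2, 2, 2) output

Note that each input element is replicated into up to k×k rows of the lowered matrix, which is exactly why the lowered matrix blows up relative to the input, as the next slide quantifies.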
Matrix Size vs. GPU Cache Size
• Example: 2nd convolutional layer of AlexNet
  • Input size = 55x55x48x4B = 580KB
  • Lowered input matrix size = 580KB x 5x5 = 14.5MB
  • Output size = 27x27x128x4B = 387KB
  • Kernel size = 48x5x5x128x4B = 614KB
• The 14.5MB lowered input matrix far exceeds on-chip cache capacity
cuBLAS vs. cuDNN
[Figure, repeated over several animation slides: with cuBLAS, the entire lowered input matrix (rows built from overlapping windows of D0…D8) is materialized in device DRAM and multiplied by the filter matrix (F0–F3) to produce the outputs (O0–O3); with cuDNN, small tiles of the input are lowered on the fly inside each SM and multiplied tile by tile against the filters]
cuDNN has been widely adopted because it improves off-chip memory bandwidth utilization and on-chip cache utilization.
However, the number of multiplications remains the same.
Winograd Convolution
• Reduces # multiplications at the cost of additional additions
  • 2.26X faster than FFT for F(2x2, 3x3) [Lavin, 2015]
• Example: F(2,3) (1D) and F(2x2, 3x3) (2D)
[Lavin, 2015]
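A minimal NumPy sketch of F(2,3): two outputs of a 3-tap filter computed from a 4-element input tile with 4 multiplications instead of 6, using the standard transforms from [Lavin, 2015]:

    import numpy as np

    # y = A^T [ (G g) . (B^T d) ], where "." is the elementwise product
    BT = np.array([[1,  0, -1,  0],
                   [0,  1,  1,  0],
                   [0, -1,  1,  0],
                   [0,  1,  0, -1]], dtype=float)
    G  = np.array([[1,    0,   0  ],
                   [0.5,  0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0,    0,   1  ]], dtype=float)
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=float)

    d = np.random.randn(4)                 # input tile
    g = np.random.randn(3)                 # filter
    y = AT @ ((G @ g) * (BT @ d))          # only 4 multiplies in the product
    print(np.allclose(y, np.correlate(d, g, mode='valid')))   # True

The filter transform G·g can be precomputed once per kernel, so at inference time only the input/output transforms (additions) and the 4 multiplications remain.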
F(4x4, 3x3) and F(6x6, 3x3)
[Figure: 2D tiling of the input, the r×r kernel, and the corresponding output tiles for larger Winograd tiles]
Tile-based 2D Convolution: E.g., Nine F(2x2, 3x3)’s for 6x6 Output Feature Map
Three Steps in Winograd Convolution
• The larger the tiles, the fewer the multiplications and the more the additions
• Eventually, additions dominate total runtime
[Figure: the three steps per tile — input transform (V, applied D times), elementwise multiplications (M, applied D×H times), and output transform (applied H times)]
[CMA Lab, 2016]
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Hardware Accelerator, a.k.a. Neural Processing Unit (NPU)
• Commercial chip solutions
  • Movidius Myriad 2
  • Mobileye EyeQ3/4
  • Google TPU
  • …
• Academic works
  • DianNao, ASPLOS 2014
  • ShiDianNao, ISCA 2015
  • EIE (Stanford), ISCA 2016
  • Eyeriss (MIT), ISSCC/ISCA 2016
  • KAIST, ISSCC 2016

             IP solutions                              Chip solutions
GPU(-like)   CogniVue Opus                             NVIDIA Tegra X1, Samsung Exynos
CNN-aware    Synopsys EV52, TeraDeep                   Qualcomm Zeroth, Mobileye EyeQ4
VLIW/SIMD    Apical Spirit core, Cadence (Tensilica)   Movidius Myriad 2, Analog Devices BF609,
             IVP core, CEVA XM-4 core,                 Inuitive NU3000, Texas Instruments TDA3x
             videantis v-MP4 vision core
[BDTi 2015]
Off-chip memory traffic
• A moderate amount (400kB–4MB) of on-chip memory is enough for 3–4b to 32b data (3–4 bit data obtained from quantization)
On-chip memory traffic for parallel computation
• Reuse of data fetched from on-chip memory is critical
KAIST, ISSCC 2016
• Each kernel weight is reused 8 times; each data item is reused 4 times
[KAIST, 2016]
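More generally, the reuse available in a convolutional layer can be counted directly; a minimal sketch (ignoring stride and boundary effects) of the upper bounds that broadcast-style designs like this one exploit a slice of:

    def reuse_factors(k, H, N):
        """Upper-bound reuse per operand in one conv layer (stride 1)."""
        weight_reuse = N * N    # each kernel weight is used at every output position
        input_reuse = k * k * H # each activation feeds k*k offsets in all H kernels
        return weight_reuse, input_reuse

    print(reuse_factors(k=5, H=128, N=27))   # (729, 3200)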
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Take-Away
• Removing redundancy (= exploiting locality) in convolutional neural networks (CNNs)
  • Boosting, pruning, low-rank approximation, … → design-time solutions
  • What about runtime solutions?
  • How can we exploit value locality, e.g., zeros in weights and activations (at the granularity of neuron, sub-feature map, layer, and sub-network)?
• Exploiting parallelism and data reuse in CNN execution
  • For inference, a few megabytes (or even ~100kB) of on-chip memory is sufficient to keep the input/output feature maps and convolution kernel weights for each layer
  • How can we reduce on-chip memory accesses? → data reuse (by broadcast)
  • What about hardware accelerators for learning?
Reference
• [Bear] M. Bear et al., Neuroscience: Exploring the Brain, 3rd ed., Lippincott Williams and Wilkins, 2016.
• [Kandel] E. Kandel, Principles of Neural Science, 5th ed., McGraw-Hill Education / Medical, 2012.
• [Kolb_Whishaw] B. Kolb and I. Q. Whishaw, An Introduction to Brain and Behavior, 3rd ed., Worth Publishers, 2009.
• [Chen, 2016] Y. Chen et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” ISSCC, 2016.
• [Chetlur, 2014] S. Chetlur et al., “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759v3, 2014.
• [Han, 2015] S. Han et al., “Learning both Weights and Connections for Efficient Neural Network,” NIPS, 2015.
• [Kim, 2016] Y. Kim et al., “Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications,” Proc. International Conference on Learning Representations (ICLR), May 2016.
• [Park, 2015] E. Park et al., “Big/Little Deep Neural Network for Ultra Low Power Inference,” Proc. CODES+ISSS, Oct. 2015.
• [Lavin, 2015] A. Lavin and S. Gray, “Fast Algorithms for Convolutional Neural Networks,” arXiv preprint arXiv:1509.09308, 2015.
• [Lebedev, 2016] V. Lebedev and V. Lempitsky, “Fast ConvNets Using Group-wise Brain Damage,” arXiv preprint arXiv:1506.02515v2, 2015.
• [Miyashita, 2016] D. Miyashita et al., “Convolutional Neural Networks using Logarithmic Data Representation,” arXiv preprint arXiv:1603.01025v2, 2016.
• [Microsoft, 2015] K. Ovtcharov et al., “Accelerating Deep Convolutional Neural Networks Using Specialized Hardware,” Microsoft white paper, 2015.
• [KAIST, 2016] J. Sim et al., “A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems,” ISSCC, 2016.
Thank You!