How can we optimize convolutional neural network designs on mobile and embedded systems?
June 18, 2016
Sungjoo Yoo
Computing Memory Architecture Lab.
CSE, SNU
http://cmalab.snu.ac.kr
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
[Bear]
Retina: On-Center Cell ~ Image Sensor Cell
Retina → LGN → Primary Visual Cortex (V1)
[Kandel]
[Kolb_Whishaw]
Line Detection in V1 ~ Convolution
Convolutional Neural Network (CNN)
• LeNet (1989)
• Consists of convolution layers and subsampling (max-pooling) layers
• Training: backpropagation
Convolution: 2D Input Case
[Chen, 2016]
Convolution: 3D Input Case
• Each receptive field in the input feature maps produces one output neuron
• Each output feature map has its own set of kernel weights
[Chen, 2016]
Convolution: 3D Input / 3D Output
[Chen, 2016]
Training (backprop) determines kernel weights
Convolution: Computation and Model Size
• # multiplications = (k × k × D) × (N × N × H), for k×k kernels, D input feature maps, N×N output feature maps, and H output feature maps
• # weights = (k × k × D) × H
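As a sanity check, both counts are easy to compute directly. A minimal Python sketch, using AlexNet's second convolutional layer (5×5 kernels, 48 input maps, 27×27×128 output, as quantified on a later slide) as the example:

    def conv_costs(k, D, N, H):
        """Cost of one conv layer: k x k kernels, D input feature maps,
        N x N output feature maps, H output feature maps."""
        mults = (k * k * D) * (N * N * H)  # one k*k*D dot product per output neuron
        weights = (k * k * D) * H          # one k*k*D kernel per output feature map
        return mults, weights

    # AlexNet conv2: 5x5 kernels, 48 input maps, 27x27x128 output
    print(conv_costs(k=5, D=48, N=27, H=128))   # (111974400, 153600)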
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Simple Example: Classification for Two Classes
• Classification ~ drawing a decision surface between two groups
• A complex (high-order) surface ~ high cost
• Basic idea: classify the easy inputs first, at low cost
[Venkataramani, 2015]
Big/Little DNN: Overview
[Figure: input image → ① classification by the little DNN → success checker; ② a) high confidence → return the result; ② b) low confidence → re-run classification on the big DNN]
[Park, 2015]
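A minimal sketch of this flow, assuming hypothetical little_dnn/big_dnn functions that return class-probability vectors, and a success checker that simply thresholds the top-1 probability ([Park, 2015] sets this threshold statically or dynamically):

    import numpy as np

    def big_little_classify(image, little_dnn, big_dnn, threshold=0.9):
        """Run the little DNN first; fall back to the big DNN on low confidence."""
        probs = little_dnn(image)              # (1) cheap classification
        if np.max(probs) >= threshold:         # (2a) success checker: high confidence
            return int(np.argmax(probs))       #      keep the little DNN's result
        return int(np.argmax(big_dnn(image)))  # (2b) low confidence: big DNN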
Experiment Setup: Comparison of Computation Workload between Big & Little DNNs
• The big DNN has ~10X more computation than the little DNNs
• ImageNet classifiers (# multiplications, ×10^9): big 19.51, s 2.54, m 1.58, f 0.67, c 0.79
• MNIST classifiers (# multiplications, ×10^4): big 188.80, m1 54.40, m2 26.88, m3 4.48, m4 1.26, m5 28.80, m6 14.40, m7 3.60, m8 0.90
[Park, 2015]
Experiment Setup: HW NPU
• Based on [Zhang, FPGA 2015]
• HW NPU
  • 512 compute engines (PEs)
  • Double buffering
  • Loop unrolling
  • Verilog design, 65nm TSMC
• In-house cycle-accurate simulator + DRAMSim2
  • Micron DRAM power model
[Figure: NPU organization — DRAM and a DMA unit feeding SRAM input/output buffers and the PE array; each PE is a multiply-accumulate (×, +) unit]
*Zhang, et al., “Optimizing FPGA-based accelerator design for deep convolutional neural networks”, FPGA 2015.
[Park, 2015]
Result: MNIST
• 93.0% energy reduction
• Accuracy loss of 0.08%
[Chart: energy [mJ/image] vs. SRAM size [B], broken down into DRAM, computation, and SRAM energy]
[Chart: energy [mJ/image] for big/LITTLE configurations m1–m8 under big-only, static-threshold, and dynamic-threshold policies; 85.6% and 85.3% reductions marked]
[Chart: accuracy [%] for m1–m8: 99.07, 99.10, 98.97, 98.35, 99.06, 98.90, 99.04, 98.76; big-only 99.12]
[Park, 2015]
Result: ImageNet
• 56.7% energy reduction
• Top-1 accuracy loss of 0.51%
[Chart: energy [J/image] vs. SRAM size [B], broken down into DRAM, computation, and SRAM energy]
[Chart: energy [J/image] for big/LITTLE configurations s, m, f, c under big-only, static-threshold, and dynamic-threshold policies; 34.3% reductions marked]
[Chart: top-1 accuracy [%] for s, m, f, c: 68.81, 67.53, 67.21, 68.90; big-only 69.41]
[Park, 2015]
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Pruning CNN
[Han, 2015]
Neuron Pruning is Natural in Biological Systems
• # synapses increases until about age two and then decreases due to pruning
[https://universe-review.ca/R10-16-ANS12.htm]
Convolution with Matrix Multiplication (called Convolution Lowering)
• Input: 3x3x3
• Output: 2x2x2
• Convolutional kernel: 3x2x2
[Chetlur 2014]
Pruning [Han 2015] Hardly Reduces the Runtime of Convolution on GPU
[Han 2015][Chetlur 2014]
Group-wise Brain Damage
• For each input feature map, the 2D filter elements at the same spatial location are pruned across all output maps
• Pruning is performed incrementally
  • Repeat until no pruning candidate remains: prune one column of the lowered filter matrix F, then retrain the network to recover from the accuracy loss
• Result: 3X reduction in # multiplications for AlexNet
[Lebedev 2016]
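A minimal sketch of one such pruning step, assuming F is the lowered filter matrix with one row per output filter and one column per (input map, spatial offset) group, and using a simple smallest-L2-norm criterion to pick the column ([Lebedev 2016] instead drives the choice with group-sparsity regularization during training):

    import numpy as np

    def prune_one_group(F):
        """Zero out the not-yet-pruned column of F with the smallest L2 norm.
        F: (H output filters) x (k*k*D groups); zeroing a column removes one
        (input map, spatial offset) group from every filter at once."""
        norms = np.linalg.norm(F, axis=0)
        norms[norms == 0] = np.inf      # skip columns already pruned
        F[:, np.argmin(norms)] = 0.0
        return F

    # Incremental schedule (fine-tuning between steps is elided):
    #   while accuracy_is_acceptable():
    #       F = prune_one_group(F)
    #       fine_tune()                 # recover from the accuracy loss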
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Singular Value Decomposition (SVD)
[K. Baker, Singular Value Decomposition Tutorial]
Example of Truncated SVD: A ≈ USVᵀ
• S: take the square roots of the 3 largest eigenvalues of AᵀA (the singular values)
• U, V: take the 3 eigenvectors associated with the selected eigenvalues
[K. Baker, Singular Value Decomposition Tutorial]
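A minimal NumPy sketch of this rank-3 truncation (keeping the 3 largest singular values and their singular vectors):

    import numpy as np

    def truncated_svd(A, rank=3):
        """Best rank-`rank` approximation of A in the least-squares sense."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]

    A = np.random.randn(64, 48)
    A3 = truncated_svd(A, rank=3)
    print(np.linalg.norm(A - A3))   # Frobenius-norm approximation error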
Error degrades accuracy. How to reclaim lost accuracy?
[Figure: low-rank approximation of an AlexNet conv layer (input X: 55×55×48; output Z: 27×27×128): the original 5×5 convolution with 48×5×5 kernels is replaced by U3 (48 → 25 maps), a 5×5 core convolution C (25×5×5 kernels, 25 → 59 maps), and U4 (59 → 128 maps), producing the approximate output Z′]
Low-rank Approximation in CNN
[Kim, 2016]
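A minimal sketch of what this buys in multiplications, assuming for simplicity that all three stages run at the output resolution N×N and using the figure's ranks r_in = 25 and r_out = 59 (the exact per-stage resolutions in [Kim, 2016] differ slightly):

    def tucker2_mults(k, D, H, N, r_in, r_out):
        """Multiplications: one k x k conv vs. 1x1 -> k x k core -> 1x1 chain."""
        full = (k * k * D) * (N * N * H)
        low  = (D * r_in                  # 1x1 conv: D -> r_in maps
                + k * k * r_in * r_out    # k x k core conv: r_in -> r_out
                + r_out * H) * (N * N)    # 1x1 conv: r_out -> H maps
        return full, low

    full, low = tucker2_mults(k=5, D=48, H=128, N=27, r_in=25, r_out=59)
    print(full / low)   # ~3.4x fewer multiplications for this layer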
Experiments on Samsung Galaxy S6
• Exynos 7420 + LPDDR4
  • ARM Mali T760 GPU: 190 GFLOPS over 8 cores, max 256 threads/core, 32KB L1$/core, 1MB shared L2$, 25.6GB/s LPDDR4 with four x16 channels
• Comparison: an NVIDIA Titan X provides 6.6 TFLOPS over 24 cores, 336GB/s main memory, max 2K threads/core, 64KB L1$/core, 3MB shared L2$
[Kim, 2016]
AlexNet: Power Consumption
• Total: 245mJ/image, 117ms
• GPU power > DRAM power
• Convolutional layers dominate total energy consumption and runtime
• At the fully connected layers, GPU power drops while DRAM power increases, due to the large number of memory accesses for weights and little data reuse, i.e., low core utilization (long total idle time)
[Chart: GPU and DRAM power over time across layers C1–C5 and F6–F8]
[Kim, 2016]
VGG_S: Power Consumption
• Total: 825mJ/image, 357ms
• Convolutional layers dominate total energy consumption and runtime
• At the convolutional layers, DRAM consumes more power than in AlexNet due to the larger number of weights
• At the fully connected layers, the trend is similar to AlexNet: GPU power ~ DRAM power
[Chart: GPU and DRAM power over time, AlexNet vs. VGG_S]
[Kim, 2016]
GoogLeNet: Power Consumption
• Total: 473mJ/image, 273ms
• The 1st and 2nd convolutional layers consume 1/4 of total energy and runtime
• Inception modules
  • Relatively low power consumption in both GPU and DRAM
  • Power consumption fluctuates due to the many small inception modules and cache-unfriendly 1x1 convolutions
• The fully connected layer (1M parameters) consumes very little power in GPU and DRAM
[Chart: per-layer parameter/multiplication counts, AlexNet vs. GoogLeNet]
[Kim, 2016]
AlexNet, VGG_S vs. GoogLeNet: Top-5 Accuracy, Runtime and Energy
• AlexNet: 80.0% top-5, 117ms, 245mJ/image
• VGG_S: 84.6% top-5, 357ms, 825mJ/image
• GoogLeNet: 88.9% top-5, 273ms, 473mJ/image
[Kim, 2016]
Results of Low-Rank Approximation
• Significant reductions in energy consumption and runtime
  • Energy: 1.6X–4.26X reduction
  • Runtime: 1.42X–3.68X reduction
[Chart: per-network energy gains, e.g., 3.41X, 4.26X, 1.6X]
[Kim, 2016]
Fine-tuning is Required for Accuracy Recovery
• Low-rank approximation loses accuracy
• Fine-tuning recovers the lost accuracy
  • 1 epoch = one pass of backpropagation over the entire training set
[Kim, 2016]
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Narrow-Data CNN
• Performance improvement due to narrow data
  • E.g., going from 16-bit to 4-bit data gives a 4X speedup at the same memory bandwidth consumption
[Figure: conventional convolution multiplies one 16b weight by one 16b activation per MUL+ADD; with 4b data, four weight/activation products are computed per fetch at the same bandwidth]
Logarithm-based Quantization (Log-Quant)
• Smaller quantization error for small values
• No multiplications needed
[Miyashita, 2016]
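A minimal sketch of the idea, quantizing values to signed powers of two (a simplification of [Miyashita, 2016]; the bit width and clipping range below are illustrative assumptions):

    import numpy as np

    def log_quantize(x, bits=4):
        """Round |x| to the nearest power of two, keeping the sign; the
        exponent is clipped to the 2**bits representable levels."""
        exp = np.round(np.log2(np.abs(x) + 1e-12))
        exp = np.clip(exp, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return np.sign(x) * 2.0 ** exp

    print(log_quantize(np.array([0.03, -0.3, 0.9])))   # [ 0.03125 -0.25  1. ]

Because every quantized weight is ±2^e, a product activation × weight reduces to an arithmetic shift by e bits, which is what the shifter-based datapath on the next slide exploits.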
Log-Quant
• Performance improvement due to narrow data
• Replacing multipliers with shifters → better area/energy efficiency
[Figure: LogQuant-based convolution replaces the 16b multiplier with 4b arithmetic shifters (>>) feeding the adder]
Preliminary Results: Log-Quant AlexNet (accuracy loss [%] vs. full-precision baseline)

Act base = 1, Weight base = 4:
                    Activation bits
  Weight bits      3        4        5        6
      2         79.328   79.326   79.334   79.334
      3         77.236   77.254   77.254   77.546
      4          8.894    8.66     8.662    8.662
      5          1.466    1.186    1.222    1.222
      6          1.314    1.184    1.006    1.006
      7          1.318    1.294    1.342    1.342

Act base = 4, Weight base = 4:
                    Activation bits
  Weight bits      3        4        5        6
      2         79.488   79.41    79.336   79.328
      3         79.488   77.826   77.35    77.43
      4         64.632   12.364    6.402    7.4
      5         58.33     5.26     0.306    0.272
      6         58.296    5.354    0.158    0.172
      7         58.274    5.426    0.178    0.186

• 0.3% accuracy loss at 5-bit weight and activation
[CMA Lab, 2016]
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Convolution with Matrix Multiplication (a.k.a. Convolution Lowering)
• Input: 3x3x3
• Output: 2x2x2
• Convolutional kernel: 3x2x2
[Chetlur 2014]
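A minimal sketch of the lowering for this example (stride 1, no padding), turning the convolution into a single matrix multiplication:

    import numpy as np

    def lowered_conv(x, w):
        """x: input (D, N, N); w: kernels (H, D, k, k) -> output (H, M, M)."""
        D, N, _ = x.shape
        H, _, k, _ = w.shape
        M = N - k + 1                          # output size (stride 1, no pad)
        cols = np.empty((M * M, D * k * k))    # one row per output position
        for i in range(M):
            for j in range(M):
                cols[i * M + j] = x[:, i:i + k, j:j + k].ravel()
        F = w.reshape(H, D * k * k)            # lowered filter matrix
        return (cols @ F.T).T.reshape(H, M, M) # one big GEMM

    x = np.random.randn(3, 3, 3)               # 3x3x3 input, as on the slide
    w = np.random.randn(2, 3, 2, 2)            # two 3x2x2 kernels
    print(lowered_conv(x, w).shape)            # (2, 2, 2) output

Note that each input element is replicated into up to k×k rows of the lowered matrix, which is exactly why the lowered matrix blows up relative to the input, as the next slide quantifies.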
Matrix Size vs. GPU Cache Size
• Example: 2nd convolutional layer of AlexNet
  • Input size = 55x55x48x4B = 580KB
  • Lowered input matrix size = 580KB x 5x5 = 14.5MB
  • Output size = 27x27x128x4B = 387KB
  • Kernel size = 48x5x5x128x4B = 614KB
• The 14.5MB lowered input matrix far exceeds on-chip cache capacity
cuBLAS vs. cuDNN
[Figure, repeated over several animation slides: with cuBLAS, the entire lowered input matrix (rows built from overlapping windows of D0…D8) is materialized in device DRAM and multiplied by the filter matrix (F0–F3) to produce the outputs (O0–O3); with cuDNN, small tiles of the input are lowered on the fly inside each SM and multiplied tile by tile against the filters]
cuDNN has been widely adopted because it improves off-chip memory bandwidth utilization and on-chip cache utilization.
However, the number of multiplications remains the same.
Winograd Convolution
• Reduces # multiplications at the cost of additional additions
  • 2.26X faster than FFT for F(2x2, 3x3) [Lavin, 2015]
• Example: F(2,3) (1D) and F(2x2, 3x3) (2D)
[Lavin, 2015]
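A minimal NumPy sketch of F(2,3): two outputs of a 3-tap filter computed from a 4-element input tile with 4 multiplications instead of 6, using the standard transforms from [Lavin, 2015]:

    import numpy as np

    # y = A^T [ (G g) . (B^T d) ], where "." is the elementwise product
    BT = np.array([[1,  0, -1,  0],
                   [0,  1,  1,  0],
                   [0, -1,  1,  0],
                   [0,  1,  0, -1]], dtype=float)
    G  = np.array([[1,    0,   0  ],
                   [0.5,  0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0,    0,   1  ]], dtype=float)
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=float)

    d = np.random.randn(4)                 # input tile
    g = np.random.randn(3)                 # filter
    y = AT @ ((G @ g) * (BT @ d))          # only 4 multiplies in the product
    print(np.allclose(y, np.correlate(d, g, mode='valid')))   # True

The filter transform G·g can be precomputed once per kernel, so at inference time only the input/output transforms (additions) and the 4 multiplications remain.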
F(4x4, 3x3) and F(6x6, 3x3)
[Figure: 2D tiling of the input, the r×r kernel, and the corresponding output tiles for larger Winograd tiles]
Tile-based 2D Convolution: E.g., Nine F(2x2, 3x3)’s for 6x6 Output Feature Map
Three Steps in Winograd Convolution
• The larger the tiles, the fewer the multiplications and the more the additions
• Eventually, additions dominate total runtime
[Figure: the three steps per tile — input transform (V, applied D times), elementwise multiplications (M, applied D×H times), and output transform (applied H times)]
[CMA Lab, 2016]
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Hardware Accelerator, a.k.a. Neural Processing Unit (NPU)
• Commercial chip solutions
  • Movidius Myriad 2
  • Mobileye EyeQ3/4
  • Google TPU
  • …
• Academic works
  • DianNao, ASPLOS 2014
  • ShiDianNao, ISCA 2015
  • EIE (Stanford), ISCA 2016
  • Eyeriss (MIT), ISSCC/ISCA 2016
  • KAIST, ISSCC 2016

             IP solutions                              Chip solutions
GPU(-like)   CogniVue Opus                             NVIDIA Tegra X1, Samsung Exynos
CNN-aware    Synopsys EV52, TeraDeep                   Qualcomm Zeroth, Mobileye EyeQ4
VLIW/SIMD    Apical Spirit core, Cadence (Tensilica)   Movidius Myriad 2, Analog Devices BF609,
             IVP core, CEVA XM-4 core,                 Inuitive NU3000, Texas Instruments TDA3x
             videantis v-MP4 vision core
[BDTi 2015]
Off-chip memory traffic
• A moderate amount (400kB–4MB) of on-chip memory is enough for 3–4b to 32b data (3–4 bit data obtained from quantization)
On-chip memory traffic for parallel computation
• Reuse of data fetched from on-chip memory is critical
KAIST, ISSCC 2016
• Each kernel weight is reused 8 times; each data item is reused 4 times
[KAIST, 2016]
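More generally, the reuse available in a convolutional layer can be counted directly; a minimal sketch (ignoring stride and boundary effects) of the upper bounds that broadcast-style designs like this one exploit a slice of:

    def reuse_factors(k, H, N):
        """Upper-bound reuse per operand in one conv layer (stride 1)."""
        weight_reuse = N * N    # each kernel weight is used at every output position
        input_reuse = k * k * H # each activation feeds k*k offsets in all H kernels
        return weight_reuse, input_reuse

    print(reuse_factors(k=5, H=128, N=27))   # (729, 3200)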
Agenda
• Introduction
  • Connecting two convolutions: human visual cortex and artificial neural networks
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximation
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary
Take-Away
• Removing redundancy (= exploiting locality) in convolutional neural networks (CNNs)
  • Boosting, pruning, low-rank approximation, … → design-time solutions
  • What about runtime solutions?
  • How can we exploit value locality, e.g., zeros in weights and activations (at the granularity of neuron, sub-feature map, layer, and sub-network)?
• Exploiting parallelism and data reuse in CNN execution
  • For inference, a few megabytes (or even ~100kB) of on-chip memory is sufficient to keep the input/output feature maps and convolution kernel weights for each layer
  • How can we reduce on-chip memory accesses? → data reuse (by broadcast)
  • What about hardware accelerators for learning?
Reference
• [Bear] M. Bear et al., Neuroscience: Exploring the Brain, 3rd ed., Lippincott Williams and Wilkins, 2016.
• [Kandel] E. Kandel, Principles of Neural Science, 5th ed., McGraw-Hill Education / Medical, 2012.
• [Kolb_Whishaw] B. Kolb and I. Q. Whishaw, An Introduction to Brain and Behavior, 3rd ed., Worth Publishers, 2009.
• [Chen, 2016] Y. Chen et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” ISSCC, 2016.
• [Chetlur, 2014] S. Chetlur et al., “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759v3, 2014.
• [Han, 2015] S. Han et al., “Learning both Weights and Connections for Efficient Neural Network,” NIPS, 2015.
• [Kim, 2016] Y. Kim et al., “Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications,” Proc. International Conference on Learning Representations (ICLR), May 2016.
• [Park, 2015] E. Park et al., “Big/Little Deep Neural Network for Ultra Low Power Inference,” Proc. CODES+ISSS, Oct. 2015.
• [Lavin, 2015] A. Lavin and S. Gray, “Fast Algorithms for Convolutional Neural Networks,” arXiv preprint arXiv:1509.09308, 2015.
• [Lebedev, 2016] V. Lebedev and V. Lempitsky, “Fast ConvNets Using Group-wise Brain Damage,” arXiv preprint arXiv:1506.02515v2, 2015.
• [Miyashita, 2016] D. Miyashita et al., “Convolutional Neural Networks using Logarithmic Data Representation,” arXiv preprint arXiv:1603.01025v2, 2016.
• [Microsoft, 2015] K. Ovtcharov et al., “Accelerating Deep Convolutional Neural Networks Using Specialized Hardware,” Microsoft white paper, 2015.
• [KAIST, 2016] J. Sim et al., “A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems,” ISSCC, 2016.
Thank You!