How can we optimize convolutional neural network designs on mobile and embedded systems?

Page 1

How can we optimize convolutional neural network designs on mobile and embedded systems?

June 18, 2016

Sungjoo Yoo

Computing Memory Architecture Lab.

CSE, SNU

http://cmalab.snu.ac.kr

Page 2

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary

Page 3

[Bear]

Retina: On-Center Cell ~ Image Sensor Cell

Page 4

Retina → LGN → Primary Visual Cortex (V1)

[Kandel]

Page 5

Line Detection in V1 ~ Convolution

[Kolb_Whishaw]

Page 6

Convolutional Neural Network (CNN)

• LeNet (1989)

• Consists of convolution layers and subsampling (max-pooling) layers

• Training: backpropagation

Page 7

Convolution: 2D Input Case

[Chen, 2016]

Page 8

Convolution: 3D Input Case

• A receptive field in the input feature maps produces one output neuron

• Each output feature map has its own set of kernel weights

[Chen, 2016]

Page 9

Convolution: 3D Input / 3D Output

[Chen, 2016]

Training (backprop) determines kernel weights
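To make the 3D convolution concrete, below is a minimal, unoptimized sketch in NumPy (the toy sizes and names are illustrative, not from [Chen, 2016]): every output neuron is the weighted sum over one D x k x k receptive field, and each of the H output feature maps has its own kernel set.

```python
import numpy as np

def conv3d_naive(x, w):
    # x: D x N x N input feature maps, w: H x D x k x k kernel weights
    # (one D x k x k kernel set per output feature map)
    D, N, _ = x.shape
    H, _, k, _ = w.shape
    M = N - k + 1                                    # output size (no padding, stride 1)
    y = np.zeros((H, M, M))
    for h in range(H):                               # each output map uses its own kernels
        for i in range(M):
            for j in range(M):
                receptive_field = x[:, i:i+k, j:j+k]   # D x k x k input region
                y[h, i, j] = np.sum(receptive_field * w[h])
    return y

y = conv3d_naive(np.random.randn(3, 8, 8), np.random.randn(2, 3, 3, 3))
print(y.shape)    # (2, 6, 6): 2 output feature maps of 6x6
```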

Page 10

Convolution: Computation and Model Size

• # multiplications = (k x k x D) x (N x N x H)

• # weights = (k x k x D) x H

  (k: kernel width/height, D: # input feature maps, N: output feature map width/height, H: # output feature maps)
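As a worked example (using the AlexNet conv2 dimensions quoted later on the "Matrix Size vs. GPU Cache Size" slide: k = 5, D = 48, N = 27, H = 128): # multiplications = (5 x 5 x 48) x (27 x 27 x 128) ≈ 1.12 x 10^8, and # weights = (5 x 5 x 48) x 128 = 153,600 (≈ 614KB as 4-byte floats).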

Page 11

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary

Page 12

Simple Example: Classification for Two Classes

• Classification ~ draw a surface between two groups

• Complex (high order) surface ~ high cost

• Basic idea: classify simple ones first at low cost

[Venkataramani, 2015]

Page 13

Big/Little DNN: Overview

[Diagram: input image → ① classification on the little DNN → success checker; ② a) high confidence → result; ② b) low confidence → re-classification on the big DNN]

[Park, 2015]
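A minimal control-flow sketch of the big/little idea; the success checker here is simply the top-1 softmax probability against a threshold, which is an illustrative choice rather than the exact criterion of [Park, 2015].

```python
import numpy as np

def predict_big_little(x, little_dnn, big_dnn, threshold=0.9):
    """Run the cheap little DNN first; fall back to the big DNN only when the
    success checker judges the little DNN's output to be low-confidence."""
    probs = little_dnn(x)                      # e.g., softmax output of the little DNN
    confidence = np.max(probs)                 # simple success checker: top-1 probability
    if confidence >= threshold:                # (a) high confidence: accept little DNN result
        return np.argmax(probs)
    return np.argmax(big_dnn(x))               # (b) low confidence: re-classify with big DNN

# toy stand-ins for the two networks (random softmax-like outputs over 10 classes)
little = lambda x: np.random.dirichlet(np.ones(10))
big    = lambda x: np.random.dirichlet(np.ones(10))
print(predict_big_little(np.zeros((28, 28)), little, big))
```

The threshold can be fixed (static) or adapted at runtime (dynamic), the two policies compared in the result slides that follow.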

Page 14

Experiment Setup: Comparison of Computation Workload between Big & Little DNNs

• The big DNN requires ~10X more computation than the little DNNs

[Chart: # of multiplications [x10^9], ImageNet classifiers — big: 19.51, s: 2.54, m: 1.58, f: 0.67, c: 0.79]

[Chart: # of multiplications [x10^4], MNIST classifiers — big: 188.80, m1: 54.40, m2: 26.88, m3: 4.48, m4: 1.26, m5: 28.80, m6: 14.40, m7: 3.60, m8: 0.90]

[Park, 2015]

Page 15

Experiment Setup: HW NPU

• Based on [Zhang, FPGA 2015]
• HW NPU
  • 512 compute engines (PEs)
  • Double buffering
  • Loop unrolling
  • Verilog design, 65nm TSMC
• In-house cycle-accurate simulator + DRAMSim2
  • Micron DRAM power model

[Block diagram: DRAM ↔ DMA unit ↔ SRAM input/output buffers ↔ PE array; each PE contains a multiplier (X) and an adder (+)]

*Zhang, et al., “Optimizing FPGA-based accelerator design for deep convolutional neural networks”, FPGA 2015.

[Park, 2015]

Page 16

Result: MNIST

• 93.0% energy reduction

• Accuracy loss of 0.08%

[Charts: Energy [mJ/Image] vs. SRAM size [B], broken down into DRAM, computation and SRAM energy (85.6% and 85.3% reductions marked); Energy [mJ/Image] and Accuracy [%] for big/LITTLE configurations m1-m8 under Big only, static threshold and dynamic threshold policies (accuracies between 98.35% and 99.12%)]

[Park, 2015]

Page 17

Result: ImageNet

• 56.7% energy reduction

• Top-1 accuracy loss of 0.51%

[Charts: Energy [J/Image] vs. SRAM size [B], broken down into DRAM, computation and SRAM energy (34.3% reduction marked); Energy [J/Image] and Accuracy [%] for big/LITTLE configurations s, m, f, c under Big only, static threshold and dynamic threshold policies (accuracies between 67.21% and 69.41%)]

[Park, 2015]

Page 18

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary

Page 19

Pruning CNN

[Han, 2015]

Page 20

Neuron Pruning is Natural in Biological Systems

• # synapses increases until about 2 years of age and then decreases due to pruning

[https://universe-review.ca/R10-16-ANS12.htm]

Page 21

Convolution with Matrix Multiplication (called Convolution Lowering)

• Input: 3x3x3
• Output: 2x2x2
• Convolutional kernel: 3x2x2

[Chetlur 2014]

[Figure: the input feature maps are unrolled receptive field by receptive field into a matrix, so that the convolution becomes a single matrix multiplication with the filter matrix]
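A minimal im2col-style sketch of the lowering idea (a generic illustration, not the cuDNN implementation of [Chetlur 2014]):

```python
import numpy as np

def lower_and_convolve(x, w):
    """x: D x N x N input, w: H x D x k x k filters.
    Lower the input into a matrix of unrolled receptive fields, then do the
    whole convolution as one matrix multiplication."""
    D, N, _ = x.shape
    H, _, k, _ = w.shape
    M = N - k + 1                                  # output size (no padding, stride 1)
    # Each row of X_low is one unrolled D x k x k receptive field.
    X_low = np.zeros((M * M, D * k * k))
    for i in range(M):
        for j in range(M):
            X_low[i * M + j] = x[:, i:i+k, j:j+k].ravel()
    W_mat = w.reshape(H, D * k * k)                # one row per output feature map
    y = X_low @ W_mat.T                            # (M*M) x H
    return y.T.reshape(H, M, M)

# Slide's example: 3x3x3 input, 2x2 kernels over 3 channels, 2 output maps -> 2x2x2 output
out = lower_and_convolve(np.random.randn(3, 3, 3), np.random.randn(2, 3, 2, 2))
print(out.shape)   # (2, 2, 2)
```

Note that the lowered input matrix replicates each input element up to k x k times; the later "Matrix Size vs. GPU Cache Size" slide quantifies this blow-up (580KB → 14.5MB for AlexNet conv2).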

Page 22

Pruning [Han 2015] Hardly Reduces the Runtime of Convolution on GPU

[Han 2015] [Chetlur 2014]

Page 23

Group-wise Brain Damage

• For each input feature map, the same location of 2D filter elements is pruned
• Pruning is performed in an incremental manner
  • Repeat the following until there is no more pruning candidate:
    • Prune a column in the F (lowered filter) matrix and train the network to recover from the accuracy loss
• Result
  • 3X reduction in # multiplications for AlexNet

[Lebedev 2016]
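A rough sketch of group-wise pruning on a lowered filter matrix; the column-norm criterion and keep ratio are illustrative stand-ins for the formulation in [Lebedev 2016].

```python
import numpy as np

def prune_filter_columns(W_mat, keep_ratio=0.7):
    """W_mat: H x (D*k*k) lowered filter matrix (one row per output feature map).
    Group-wise pruning removes whole columns, i.e. the same (input map, spatial
    position) entry across ALL output filters, so the lowered matrices simply
    become narrower and the matrix multiplication gets cheaper."""
    col_norms = np.linalg.norm(W_mat, axis=0)              # importance of each column
    n_keep = int(keep_ratio * W_mat.shape[1])
    keep = np.argsort(col_norms)[-n_keep:]                 # keep the strongest columns
    mask = np.zeros(W_mat.shape[1], dtype=bool)
    mask[keep] = True
    return W_mat * mask, mask                              # zeroed columns; in practice the
                                                           # network is fine-tuned after each step

W = np.random.randn(128, 48 * 5 * 5)
W_pruned, mask = prune_filter_columns(W)
print(mask.mean())   # fraction of filter-matrix columns kept
```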

Page 24

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary

Page 25

Singular Value Decomposition (SVD)

[K. Baker, Singular Value Decomposition Tutorial]

Page 26

Example of Truncated SVD: A ≈ USV^T

Take the square roots of the 3 largest eigenvalues (the singular values in S); take the 3 eigenvectors associated with the selected eigenvalues (the corresponding columns of U and of V)

[K. Baker, Singular Value Decomposition Tutorial]

Error degrades accuracy. How to reclaim lost accuracy?
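A small NumPy sketch of the truncated SVD above (rank 3 to match the slide's example; the matrix is random, purely for illustration):

```python
import numpy as np

A = np.random.randn(8, 6)
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s holds the singular values
                                                   # (square roots of eigenvalues of A^T A)
r = 3                                              # keep the 3 largest singular values
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]        # rank-3 approximation of A

# Storage and compute drop from 8*6 values to 8*3 + 3 + 3*6, at the cost of an
# approximation error that must later be recovered (e.g., by fine-tuning).
print(np.linalg.norm(A - A_r) / np.linalg.norm(A))
```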

Page 27

Low-rank Approximation in CNN

[Figure: an AlexNet convolutional layer (48x5x5 kernels, 128 output maps, 27x27 output) is decomposed into three smaller convolutions: a 1x1 convolution U3 mapping the 48-channel input X to a 25-channel intermediate Z, a 5x5 core convolution C (25x5x5) mapping Z to a 59-channel Z', and a 1x1 convolution U4 mapping Z' to the 128-channel output Y]

[Kim, 2016]
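A back-of-the-envelope multiplication count for this decomposition, using the ranks (25, 59) from the figure and the conv2 sizes quoted on the later "Matrix Size vs. GPU Cache Size" slide; the per-stage spatial resolutions are an assumption, so this is only a rough estimate, not the measured result of [Kim, 2016].

```python
# Original layer: 48 -> 128 channels, 5x5 kernel, 27x27 output
orig = 5 * 5 * 48 * 27 * 27 * 128                      # ~112.0M multiplications

# Decomposed layer (1x1 -> 5x5 -> 1x1), assuming the first 1x1 runs at the
# 55x55 input resolution and the remaining stages at the 27x27 output resolution
u3   = 1 * 1 * 48 * 25 * 55 * 55                       # ~3.6M
core = 5 * 5 * 25 * 59 * 27 * 27                       # ~26.9M
u4   = 1 * 1 * 59 * 128 * 27 * 27                      # ~5.5M

print(orig / (u3 + core + u4))                         # roughly a 3X reduction
```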

Page 28

Experiments on Samsung Galaxy S6

• Exynos 7420 + LPDDR4
  • ARM Mali T760: 190 GFLOPS over 8 cores, max 256 threads/core, 32KB L1$/core, 1MB shared L2$, 25.6GB/s LPDDR4 with four x16 channels
• Comparison: NVIDIA Titan X provides 6.6 TFLOPS over 24 cores, 336GB/s main memory, max 2K threads/core, 64KB L1$/core, 3MB shared L2$

[Die photo: Exynos 7420 with the Mali T760 GPU]

[Kim, 2016]

Page 29

AlexNet: Power Consumption

• Total 245mJ/image, 117ms
• GPU power > DRAM power
• Convolutional layers dominate total energy consumption and runtime
• At fully connected layers, GPU power drops while DRAM power increases
  • Due to a large number of memory accesses for weights and less data reuse, i.e., low core utilization (= long total idle time)

[Chart: GPU and DRAM power over time across layers C1-C5 and F6-F8]

[Kim, 2016]

Page 30

VGG_S: Power Consumption

• Total 825mJ/image, 357ms
• Convolutional layers dominate total energy consumption and runtime
• At convolutional layers, DRAM consumes more power than in AlexNet due to a larger number of weights
• At fully connected layers, similar trend as in AlexNet
  • GPU power ~ DRAM power

[Chart: power over time for AlexNet vs. VGG_S]

[Kim, 2016]

Page 31

GoogLeNet: Power Consumption

• Total 473mJ/image, 273ms
• 1st and 2nd convolutional layers consume 1/4 of total energy and runtime
• Inception modules
  • Relatively low power consumption in both GPU and DRAM
  • Power consumption fluctuates due to many small inception modules and cache-unfriendly 1x1 convolutions
• Fully connected layer (1M parameters) consumes very little power in GPU and DRAM

[Charts: per-layer parameter counts and power over time for AlexNet vs. GoogLeNet]

[Kim, 2016]

Page 32

AlexNet, VGG_S vs GoogLeNet: Top-5 Accuracy, Runtime and Power

              Top-5    Runtime   Energy
  AlexNet     80.0%    117 ms    245 mJ
  VGG_S       84.6%    357 ms    825 mJ
  GoogLeNet   88.9%    273 ms    473 mJ

[Kim, 2016]

Page 33

Results of Low Rank Approximation

• Significant reductions in energy consumption and runtime
  • Energy: 1.6X ~ 4.26X
  • Runtime: 1.42X ~ 3.68X

[Chart annotations: 3.41X, 4.26X, 1.6X]

[Kim, 2016]

Page 34

Fine-tuning is Required for Accuracy Recovery

• Low-rank approximation loses accuracy
• Fine-tuning recovers the lost accuracy
  • 1 epoch = 1 pass of backpropagation over the entire training set

[Kim, 2016]

Page 35

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary

Page 36

Narrow-Data CNN

• Performance improvement due to narrow data
  • E.g., 16-bit → 4-bit data: 4X speedup with the same memory bandwidth consumption

[Figure: conventional convolution multiplies one 16b weight by one 16b activation (MUL + ADD), whereas with 4b narrow data four weight/activation pairs fit in the same bandwidth and feed four multipliers and the adder in parallel]

Page 37

Logarithm-based Quantization (Log-Quant)

• Smaller quantization error for small values

• No need for multiplication

[Miyashita, 2016]
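A minimal sketch of logarithmic quantization and shift-based multiplication; the base-2 rounding and 4-bit exponent clipping are illustrative simplifications of the scheme in [Miyashita, 2016].

```python
import numpy as np

def log_quantize(x, n_bits=4):
    """Quantize |x| to the nearest power of two, keeping the sign.
    The stored code is just the (clipped) exponent, so it fits in n_bits."""
    sign = np.sign(x)
    exp = np.round(np.log2(np.abs(x) + 1e-12))
    exp = np.clip(exp, -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return sign, exp.astype(int)

def shift_multiply(activation, w_sign, w_exp):
    """activation * (sign * 2^exp): the multiplication becomes a shift
    (emulated here with ldexp); in hardware, a barrel shifter replaces the multiplier."""
    return w_sign * np.ldexp(activation, w_exp)

w = np.array([0.12, -0.5, 0.03])
a = np.array([1.0, 2.0, 4.0])
s, e = log_quantize(w)
print(shift_multiply(a, s, e), "vs. exact", w * a)
```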

Page 38

Log-Quant

• Performance improvement due to narrow data

• Replace multipliers with shifters → better area/energy efficiency

[Figure: conventional convolution with 16b weights/activations (MUL + ADD) vs. LogQuant-based convolution with 4b data, where the four multipliers are replaced by shifters (>>) feeding the adder]

Page 39

Preliminary Results: Log-Quant AlexNet

Act base = 1, Weight base = 4:

  Weight bits \ Activation bits      3        4        5        6
  2                               79.328   79.326   79.334   79.334
  3                               77.236   77.254   77.254   77.546
  4                                8.894    8.660    8.662    8.662
  5                                1.466    1.186    1.222    1.222
  6                                1.314    1.184    1.006    1.006
  7                                1.318    1.294    1.342    1.342

Act base = 4, Weight base = 4:

  Weight bits \ Activation bits      3        4        5        6
  2                               79.488   79.410   79.336   79.328
  3                               79.488   77.826   77.350   77.430
  4                               64.632   12.364    6.402    7.400
  5                               58.330    5.260    0.306    0.272
  6                               58.296    5.354    0.158    0.172
  7                               58.274    5.426    0.178    0.186

• 0.3% accuracy loss at 5-bit weight and activation

[CMA Lab, 2016]

Page 40

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary

Page 41

Convolution with Matrix Multiplication (a.k.a. Convolution Lowering)

• Input: 3x3x3
• Output: 2x2x2
• Convolutional kernel: 3x2x2

[Chetlur 2014]

[Figure: same convolution-lowering illustration as on Page 21]

Page 42

Matrix Size vs. GPU Cache Size

• Example: 2nd convolutional layer on AlexNet

• Input size = 55x55x48x4B = 580KB
  • Lowered input matrix size = 580KB x 5 x 5 = 14.5MB

• Output size = 27x27x128x4B = 387KB

• Kernel size = 48x5x5x128x4B = 614KB
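The lowering blow-up as a tiny sketch (4-byte floats assumed, as on the slide):

```python
# Rough size check for AlexNet conv2 after lowering
input_bytes   = 55 * 55 * 48 * 4          # ~580 KB of input feature maps
lowered_bytes = input_bytes * 5 * 5       # ~14.5 MB: every element is replicated
                                          # once per position of the 5x5 kernel
kernel_bytes  = 48 * 5 * 5 * 128 * 4      # ~614 KB of weights
print(input_bytes, lowered_bytes, kernel_bytes)
```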

Page 43

cuBLAS vs. cuDNN

[Figure: the lowered input matrix (rows of receptive fields D4 D3 D1 D0, …), the filter matrix F0-F3 and the output O0-O3; with cuBLAS the whole lowered matrix is materialized in device DRAM before the matrix multiplication, while with cuDNN only small tiles of the lowered matrix are built on the fly inside an SM from the original feature map D0-D8]

Page 44

cuBLAS vs. cuDNN

[Figure: same cuBLAS vs. cuDNN lowering illustration, next animation step]

Page 45

cuBLAS vs. cuDNN

[Figure: same cuBLAS vs. cuDNN lowering illustration, next animation step]

Page 46

cuBLAS vs. cuDNN

[Figure: same cuBLAS vs. cuDNN lowering illustration, next animation step]

Page 47

cuBLAS vs. cuDNN

[Figure: same cuBLAS vs. cuDNN lowering illustration, final animation step]

cuDNN has been adopted because it improves
• off-chip memory BW utilization
• on-chip cache utilization

However, the # of multiplications remains the same

Page 48

Winograd Convolution

• Reduce # multiplications at the cost of additional additions
  • 2.26X faster than FFT for F(2x2, 3x3) [Lavin, 2015]
• Example: F(2,3) in 1D and F(2x2, 3x3) in 2D

[Lavin, 2015]
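A minimal sketch of F(2,3) using the standard transform matrices from [Lavin, 2015]: two outputs of a 3-tap 1D convolution are computed with 4 multiplications instead of 6; nesting the 1D transform gives the 2D F(2x2, 3x3) algorithm (16 multiplications instead of 36 per tile).

```python
import numpy as np

# Winograd F(2,3) transform matrices
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

d = np.random.randn(4)   # 4 input samples (one tile)
g = np.random.randn(3)   # 3 filter taps

m = (G @ g) * (BT @ d)   # 4 element-wise multiplications in the transformed domain
y_winograd = AT @ m      # inverse (output) transform: 2 outputs

# Direct computation for comparison: y[i] = sum_k d[i+k] * g[k]
y_direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(y_winograd, y_direct)
```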

Page 49

F(4x4, 3x3) and F(6x6, 3x3)

[Figure: the 2D forms of F(4x4, 3x3) and F(6x6, 3x3)]

Page 50

Tile-based 2D Convolution: E.g., Nine F(2x2, 3x3)'s for a 6x6 Output Feature Map

[Figure: input tiles, kernel (size r), and output tiles]

Page 51

Three Steps in Winograd Convolution

• The larger the tile, the fewer the multiplications and the more the additions

• Eventually, additions dominate total runtime

[CMA Lab, 2016]

[Figure: the three steps — input transform (V), element-wise multiplication (M), and output transform]

Page 52

Three Steps in Winograd Convolution

• The larger the tile, the fewer the multiplications and the more the additions

• Eventually, additions dominate total runtime

[CMA Lab, 2016]

Page 53

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary

Page 54

Hardware Accelerator, a.k.a. Neural Processing Unit (NPU)

• Commercial chip solutions
  • Movidius Myriad 2
  • Mobileye EyeQ3/4
  • Google TPU
  • …
• Academic works
  • DianNao, ASPLOS 2014
  • ShiDianNao, ISCA 2015
  • EIE (Stanford), ISCA 2016
  • Eyeriss (MIT), ISSCC/ISCA 2016
  • KAIST, ISSCC 2016

                IP solutions                                  Chip solutions
  GPU(-like)    CogniVue Opus                                 NVIDIA Tegra X1, Samsung Exynos
  CNN-aware     Synopsys EV52, TeraDeep                       Qualcomm Zeroth, Mobileye EyeQ4
  VLIW/SIMD     Apical Spirit core,                           Movidius Myriad 2, Analog Devices BF609,
                Cadence (Tensilica) IVP core,                 Inuitive NU3000, Texas Instruments TDA3x
                CEVA XM-4 core, videantis v-MP4 vision core

  [BDTi 2015]

• Off-chip memory traffic
  • Some large (400KB~4MB) on-chip memory is enough for 32b ~ 3-4b data (3-4 bit data obtained from quantization)
• On-chip memory traffic for parallel computation
  • Reuse of data fetched from on-chip memory is critical

Page 55

KAIST, ISSCC 2016

[KAIST, 2016]

Page 56

[KAIST, 2016]

Page 57

Kernel weight is reused 8 times

[KAIST, 2016]

Page 58

[KAIST, 2016]

Page 59

[KAIST, 2016]

Page 60

Kernel weight is reused 8 times; data item is reused 4 times

[KAIST, 2016]

Page 61

Agenda

• Introduction
  • Connecting two convolutions in human visual cortex and artificial neural network
• Optimizing CNN architecture
  • Boosting
  • Pruning unimportant connections
  • Low-rank approximations
  • Narrow data (quantization)
• Optimizing CNN implementation
  • GPU: cuDNN, Winograd convolution, …
  • Hardware accelerators
• Summary

Page 62

Take-Away

• Removing redundancy (= exploiting locality) in convolutional neural networks (CNNs)
  • Boosting, pruning, low-rank approximation, … → design-time solutions
  • What about runtime solutions?
    • How to exploit value locality, e.g., zeros in weights and activations (at the granularity of neuron, sub-feature map, layer and sub-network)?
• Exploiting parallelism and data reuse in CNN execution
  • For inference only, a few megabytes (or ~100KB) of on-chip memory will be sufficient to keep the input/output feature maps and convolution kernel weights for each layer
  • How to reduce on-chip memory accesses? → Data reuse (by broadcast)
• What about hardware accelerators for learning?

Page 63

Reference

• [Bear] M. Bear et al., Neuroscience: Exploring the Brain 3e, Lippincott Williams and Wilkins, 2016.

• [Kandel] E. Kandel, Principles of Neural Science 5e, McGraw-Hill Education / Medical, 2012.

• [Kolb_Whishaw] B. Kolb and I. Q. Whishaw, An Introduction to Brain and Behavior 3e, Worth Publishers, 2009.

• [Chen, 2016] Y.-H. Chen, et al., “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” ISSCC, 2016.

• [Chetlur, 2014] S. Chetlur, et al., “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759v3, 2014.

• [Han, 2015] S. Han, et al., “Learning both weights and connections for efficient neural network,” NIPS, 2015.

• [Kim, 2016] Y. Kim, et al., “Compression of Deep Convolutional Neural Networks for Fast and Low Power Applications,” Proc. International Conference on Learning Representations (ICLR), May 2016.

• [Park, 2015] E. Park, et al., “Big/Little Deep Neural Network for Ultra Low Power Inference,” Proc. CODES+ISSS, Oct. 2015.

• [Lavin, 2015] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” arXiv preprint arXiv:1509.09308, 2015.

• [Lebedev, 2016] V. Lebedev and V. Lempitsky, “Fast ConvNets Using Group-wise Brain Damage,” arXiv preprint arXiv:1506.02515v2, 2015.

• [Miyashita, 2016] D. Miyashita, et al., “Convolutional Neural Networks using Logarithmic Data Representation,” arXiv preprint arXiv:1603.01025v2, 2016.

• [Microsoft, 2015] K. Ovtcharov, et al., “Accelerating Deep Convolutional Neural Networks Using Specialized Hardware,” Microsoft Research whitepaper, 2015.

• [KAIST, 2016] J. Sim, et al., “A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems,” ISSCC, 2016.

Page 64

Thank You!