Page 1: Research Trends on Convolutional Neural Network (CNN) Accelerator Design

2019 SoC Conference
Woong Choi, VLSI & System Lab., Sookmyung Women's University

Page 2: Outline

❑ Overview: Convolutional Neural Network
❑ CNN Accelerator Design
  - Background
  - Maximize Data Reuse
  - Reduction: Computation Size
  - Reduction: Computation Number
  - Processing-in-Memory (PIM)
❑ Summary

Page 3: Outline

❑ Overview: Convolutional Neural Network
❑ CNN Accelerator Design
  - Background
  - Maximize Data Reuse
  - Reduction: Computation Size
  - Reduction: Computation Number
  - Processing-in-Memory (PIM)
❑ Summary

Page 4: AI & Neural Network

• Artificial Intelligence: the science and engineering of creating intelligent machines
• Machine Learning: the field of study that gives computers the ability to learn without being explicitly programmed
• Neural Network: an algorithm that takes its basic function from an understanding of how the brain operates
• Convolutional Neural Network

Source: MIT Tutorial

Page 5: Neural Network Applications

Image Processing, Autonomous Machines, Security & Defense, Medical, Games

Page 6: Simple Neural Network

$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$

[Figure: a three-layer network with an input nodes layer (x1, x2, x3), a hidden nodes layer producing Yj, and an output nodes layer (Output 1, Output 2); each edge carries a weight (W1, W2, W3, ...: a bunch of weights W)]

MAC: multiply-and-accumulate (= dot product operation)
A simple imitation of neurons & synapses.
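The MAC above is just a dot product. A minimal Python sketch (illustrative, not from the slides) of one output value:

```python
import numpy as np

# One output of the slide's formula Y_j = sum_{k=0}^{N-1} W_k * X_{j-k}:
# a chain of multiply-and-accumulate (MAC) operations, i.e., a dot product.
def mac(weights, inputs):
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x          # one MAC per weight/input pair
    return acc

w = np.array([0.2, -0.5, 0.9])
x = np.array([1.0, 2.0, 3.0])
assert np.isclose(mac(w, x), np.dot(w, x))
```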

Page 7: Simple Neural Network

[Figure: the same three-layer network (inputs x1-x3, hidden nodes layer, output nodes layer with y1 = Dog, y2 = Cat), shown before and after training]

1. Random initialization. Trying a dog image with the un-trained NN gives outputs 0.0 / 0.1, i.e., Dog 47% vs. Cat 53%.
2. Training algorithm: loss function, back-propagation, batch normalization.
3. After weight training, inference on the same image gives outputs 0.4 / 0.8, i.e., Dog 95% vs. Cat 5%.

Page 8: Deep Neural Network

[CNN pipeline figure: CONV layers extract low-level then high-level features, a POOL layer down-samples, and fully-connected layers produce the classes; 5-1000 CONV layers, 1-3 FC layers]

What enabled deep neural networks (DNNs):
• Available big data: 350M images uploaded per day; 300 hours of video uploaded per minute
• GPU acceleration
• New ML techniques

[Plot: top-5 image classification results on ImageNet, 2010-2015, approaching and then surpassing human performance] [Russakovsky et al., IJCV 2015]

Source: MIT Tutorial

Page 9: Convolutional Neural Network

[CNN pipeline figure: CONV layer highlighted]

Convolution: the filter (weights) is shifted across the input feature map (X), filtering it at each position to produce the output feature map (Y).

Activation: Sigmoid, Hyperbolic Tangent, Leaky ReLU, Exponential LU, ReLU. ReLU is the most hardware-friendly function.

Source: MIT Tutorial
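A minimal sketch of the shift-and-filter convolution followed by ReLU, in NumPy (function names and shapes are illustrative; stride 1, single channel, no padding):

```python
import numpy as np

def relu(x):
    # ReLU is hardware-friendly: just a sign check, no exponentials.
    return np.maximum(x, 0)

def conv2d_valid(ifmap, weights):
    # Shift & filter: slide the filter over the input feature map and
    # take a dot product (MAC chain) at every position.
    H, W = ifmap.shape
    Fh, Fw = weights.shape
    ofmap = np.empty((H - Fh + 1, W - Fw + 1))
    for i in range(ofmap.shape[0]):
        for j in range(ofmap.shape[1]):
            ofmap[i, j] = np.sum(ifmap[i:i+Fh, j:j+Fw] * weights)
    return relu(ofmap)

ofmap = conv2d_valid(np.random.randn(8, 8), np.random.randn(3, 3))
print(ofmap.shape)  # (6, 6)
```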

Page 10: Convolutional Neural Network

[CNN pipeline figure: POOL layer (down-sampling) highlighted]

• Reduce the resolution of each channel independently
• Increase translation-invariance and noise-resilience

Max pooling takes max(.) of each window; average pooling takes avg(.) of the same windows. 2x2 pooling of

  1 0 3 3
  5 6 6 8
  3 1 2 3
  2 2 2 5

gives

  6 8              3 5
  3 5  (max)       2 3  (avg)
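The 4x4 example can be checked in a few lines of NumPy (an illustrative sketch):

```python
import numpy as np

x = np.array([[1, 0, 3, 3],
              [5, 6, 6, 8],
              [3, 1, 2, 3],
              [2, 2, 2, 5]])

# 2x2 pooling with stride 2: regroup the matrix into four 2x2 blocks,
# then reduce each block with max(.) or avg(.).
blocks = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
print(blocks.max(axis=-1))   # [[6 8] [3 5]]    -> max pooling
print(blocks.mean(axis=-1))  # [[3. 5.] [2. 3.]] -> average pooling
```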

Location of pooling layers (AlexNet case):

Input -> CONV -> Pool -> CONV -> Pool -> CONV -> CONV -> CONV -> Pool -> FC -> FC -> FC -> Softmax

Source: MIT Tutorial

Page 11: Convolutional Neural Network

[CNN pipeline figure: fully-connected layer highlighted]

• Fully-connected layers account for more than 90% of the total number of parameters, dominating memory and energy
• Computationally, a simple matrix multiplication

Source: MIT Tutorial

Page 12: And Others ...

[CNN pipeline figure]

• Pre-processing to balance between training and inference (accuracy relies heavily on this procedure) [Sergey Ioffe et al., ICML 2015]
• Not essential when the difference between the classes is not seriously important

Softmax normalization: $s_i = e^{z_i} / \sum_j e^{z_j}$

Logits (-0.7, 1.2, 0.5, -0.2) map to probabilities (0.08, 0.53, 0.26, 0.13), which sum to 1. The ranking of the logits is maintained.
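The numbers above can be reproduced with a short NumPy sketch (illustrative, not from the slides):

```python
import numpy as np

logits = np.array([-0.7, 1.2, 0.5, -0.2])
probs = np.exp(logits) / np.exp(logits).sum()   # s_i = e^{z_i} / sum_j e^{z_j}
print(probs.round(2))                           # [0.08 0.53 0.26 0.13], sums to 1

# Softmax is monotonic, so the ranking of the logits is preserved.
assert (np.argsort(logits) == np.argsort(probs)).all()
```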

Page 13: Various CNN Configurations

AlexNet [1], VGG-16 [2], ResNet-50 [3]

[1] Krizhevsky et al., NIPS 2012  [2] Simonyan and Zisserman, ICLR 2015  [3] He et al., CVPR 2016

Page 14: Outline

❑ Overview: Convolutional Neural Network
❑ CNN Accelerator Design
  - Background
  - Maximize Data Reuse
  - Reduction: Computation Size
  - Reduction: Computation Number
  - Processing-in-Memory (PIM)
❑ Summary

Page 15: Hardware Technologies

HARDWARE TECHNOLOGIES USED IN MACHINE LEARNING (arranged by performance & functionality):

• Training (data center): NVIDIA Pascal & Volta, Google Cloud TPU, AMD Radeon GPU
• Inference (data center): Google TPU, FPGA (Xilinx & Intel), NVIDIA Tesla P40 & P4
• Inference (edge/hybrid): NVIDIA Jetson, NVIDIA Drive PX2, Qualcomm Snapdragon, FPGA SoCs, FPGA (Xilinx & Intel)

Source: https://tanjo.ai/contents/2323923

Page 16: Research Trends on Convolutional Neural Network (CNN ...

VLSI & System Lab.

ASICs

FPGAs

GPUs

Deep CNN on “Cloud Platforms”

• Accelerator is more efficient in terms of power and energy consumption

• Trade off between Energy Efficiency and Flexibility

GPU based CPU+FPGA based ASIC based

Page 17: Need Reconfigurable Accelerator

[Figure: layer sequences of AlexNet (5 CONV, 3 Pool, 3 FC, Softmax) and VGG-16 (13 CONV, 5 Pool, 3 FC, Softmax)] [Jinmook Lee et al., JSSC 2018]

• In different networks: different numbers of layers, different numbers of filters / channels
• In different architectures: different algorithmic structures
• In different quantizations: different bit-widths for different layers

Page 18: Reconfigurability vs. Energy Efficiency

Dynamically reconfigurable DNN accelerator:
• Low-level reconfiguration: a 14x12 PE array supporting different modes
• High-level reconfiguration: using an instruction set architecture

[Figure: programmability vs. performance & energy efficiency. A general-purpose processor (w/ software programming) is the most programmable; an application-specific, hard-wired accelerator is the most efficient; a runtime-reconfigurable accelerator targets the position in between.]

Page 19: Main Operation in CNN

[CNN pipeline figure] CONV and FC layers both reduce to the same MAC operation:

$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$

Architecture   Weight Size   Ifmap Size   # Multiply-Adds   Top-1 Accuracy
AlexNet        238 MB        1.6 MB       724 M             57.10 %
VGG-16         528 MB        34.8 MB      15.5 B            70.50 %
ResNet-50      99 MB         37.5 MB      3.9 B             75.20 %

Page 20: Data-Centric CNN

Accelerator design directions:
• Maximize data reuse
• Reduction: computation size
• Reduction: computation number
• Processing-in-memory

Operation energy (pJ) and relative energy cost [M. Horowitz et al., ISSCC 2014]:

  8-bit Add          0.03
  32-bit FP Add      0.9
  32-bit FP Mult     3.7
  32-bit SRAM Read   5
  32-bit DRAM Read   640   (~21333x the energy of an 8-bit add)

Page 21: Outline

❑ Overview: Convolutional Neural Network
❑ CNN Accelerator Design
  - Background
  - Maximize Data Reuse
  - Reduction: Computation Size
  - Reduction: Computation Number
  - Processing-in-Memory (PIM)
❑ Summary

Page 22: Maximize Data Reuse: Data Flow

Weight Stationary
• Maximize weight reuse
• Broadcast activations
• Accumulate pSUMs (partial sums) spatially
[nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [ISAAC, ISCA 2016] [PRIME, ISCA 2016]

Output Stationary
• Maximize pSUM reuse
• Broadcast weights
• Reuse activations spatially
[Peemen, ICCD 2013] [ShiDianNao, ISCA 2015] [Gupta, ICML 2015] [Moons, VLSI 2016]

No Local Reuse
• Use a large global buffer to reduce DRAM accesses
• Multicast activations & weights
• Accumulate pSUMs spatially
[DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015] [TPU, ISCA 2017]

Source: MIT Tutorial
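The stationary choices are easiest to see as loop orders. A minimal Python sketch for a 1-D convolution (illustrative only; a real accelerator maps these loops onto a spatial PE array):

```python
# Two schedules for the same 1-D convolution Y[j] = sum_k W[k] * X[j+k].
# Which loop is outermost decides which operand stays "stationary" in a PE.

def output_stationary(W, X):
    # Each output is produced to completion: its partial sum (pSUM)
    # stays local while weights and activations stream past it.
    K = len(W)
    Y = [0.0] * (len(X) - K + 1)
    for j in range(len(Y)):          # pin one output...
        for k in range(K):           # ...stream all weights through it
            Y[j] += W[k] * X[j + k]
    return Y

def weight_stationary(W, X):
    # Each weight is pinned and reused across every output position;
    # pSUMs are accumulated across the outer weight loop instead.
    K = len(W)
    Y = [0.0] * (len(X) - K + 1)
    for k in range(K):               # pin one weight...
        for j in range(len(Y)):      # ...reuse it for all outputs
            Y[j] += W[k] * X[j + k]
    return Y

assert output_stationary([1, 2], [1, 2, 3]) == weight_stationary([1, 2], [1, 2, 3])
```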

Page 23: Maximize Data Reuse: Filtering

[Figure: the filter (weights) is shifted over the input feature map (X, with C_i channels and M_i filters) to produce the output feature map (Y). On each downward slide of an F_i x F_i window by stride S_i, the bottom F_i x (F_i - S_i) elements of the previous window are reused; only S_i new rows are updated and the top rows are discarded. This repeats from the 1st to the ((I_i - F_i)/S_i)-th downward slide. Bi-directional (rather than one-directional) sliding also reuses data at the turning points.]

Simple CNN on CIFAR-10
CONV Layer   iFMAP Data   Reused Data   Reusing Rate
1            1024         540           52.7 %
2            256          220           85.9 %
3            64           60            93.8 %

AlexNet on ImageNet
CONV Layer   iFMAP Data   Reused Data   Reusing Rate
1            51529        4158          8.1 %
2            961          520           54.1 %
3            225          72            32.0 %
4            225          72            32.0 %
5            225          72            32.0 %

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
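A small sketch of the window-reuse count from the figure (illustrative; the table's per-layer rates also depend on the full layer dimensions):

```python
def reused_per_slide(F, S):
    # When an F x F filter window slides down by stride S, the bottom
    # F x (F - S) sub-block of the old window reappears in the new one.
    return F * max(F - S, 0)

def reuse_fraction(F, S):
    # Fraction of the new window that can come from local buffers
    # instead of being refetched from the input feature map.
    return reused_per_slide(F, S) / (F * F)

print(reused_per_slide(5, 1), reuse_fraction(5, 1))  # 20, 0.8
print(reused_per_slide(3, 2), reuse_fraction(3, 2))  # 3, ~0.33
```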

Page 24: Reduction: Computation Size

• Directly reduces the memory & PEs
• Trade-off: bit-width vs. accuracy

[Figure: weight-value density histograms quantized on a linear, log2, or non-linear grid; a fixed-point word (e.g., 011001) interpreted as sign / integer mantissa / fractional fields, or as sign / fractional mantissa only]

Network      Operations
standard     +, -, x          (real-valued activations & weights)
binWeight    +, -             (binary weights)
binNetwork   XNOR, bitcount   (binary activations & weights)

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
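A minimal sketch of linear fixed-point quantization with sign / integer / fractional fields (the function name and parameters are illustrative, not the paper's):

```python
import numpy as np

def quantize_fixed(x, int_bits, frac_bits):
    # Linear fixed-point quantization: 1 sign bit, `int_bits` integer
    # bits, `frac_bits` fractional bits. The step size is 2^-frac_bits.
    scale = 2.0 ** frac_bits
    lo = -(2 ** (int_bits + frac_bits))        # most negative code
    hi = 2 ** (int_bits + frac_bits) - 1       # most positive code
    code = np.clip(np.round(x * scale), lo, hi)
    return code / scale                        # de-quantized value

w = np.array([0.737, -1.42, 0.05])
print(quantize_fixed(w, int_bits=1, frac_bits=4))  # [ 0.75 -1.4375 0.0625]
```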

Page 25: Reduction: Computation Size

❑ Differential Input Method (DIM)

Current output: $Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$
Previous output: $Y_{j-1} = \sum_{k=0}^{N-1} W_k X_{j-k-1}$

Subtracting and adding $Y_{j-1}$:

$Y_j = (Y_j - Y_{j-1}) + Y_{j-1} = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
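The identity is easy to check numerically. A minimal NumPy sketch (illustrative, not the paper's implementation):

```python
import numpy as np

# Differential Input Method (DIM): feed differences of adjacent inputs
# and add back the previous output. Algebraically identical to the
# direct form, but the differences have a much smaller dynamic range.
rng = np.random.default_rng(0)
W = rng.standard_normal(3)          # N = 3 filter taps
X = rng.standard_normal(16)
N = len(W)

def y_direct(j):                    # Y_j = sum_k W_k * X_{j-k}
    return sum(W[k] * X[j - k] for k in range(N))

for j in range(N, len(X)):
    y_dim = y_direct(j - 1) + sum(W[k] * (X[j - k] - X[j - k - 1])
                                  for k in range(N))
    assert np.isclose(y_dim, y_direct(j))
```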

Page 26: Reduction: Computation Size

❑ Differential Input Method (DIM)

$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$

[Figure: histograms of input activation values (roughly -120 to +120) for Convnet layers 1-3 and AlexNet layers 1-5, with and without DIM. With DIM, the occurrences concentrate around zero: a reduced dynamic range.]

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.

Page 27: Reduction: Computation Size

❑ Differential Input Method (DIM)

$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$

[Figure: baseline datapath fed with the raw input stream x3, x2, x1, x0]

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.

Page 28: Reduction: Computation Size

❑ Differential Input Method (DIM)

$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$

[Figure: DIM datapath fed with the difference stream x4-x3, x3-x2, x2-x1, x1-x0; the bit-width-reduced components are highlighted against the hardware overhead of the differencing logic]

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.

Page 29: Reduction: Computation Size

❑ Adaptive Bit-Width Reduction w/ DIM

• Layer-by-layer quantization: layer A has many small values, so the upper integer-magnitude bits are ineffective; layer B has many large values, so a single fixed format causes quantization error.
• Pixel-by-pixel with DIM: for two adjacent inputs, an adaptive decimal point tracks the dynamic range; the ineffective upper bits are dropped (CASE-1), and quantization error occurs only rarely (CASE-2).

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.

Page 30: Reduction: Computation Size

❑ Adaptive Bit-Width Reduction (ABWR)

               CIFAR-10                      ImageNet
Reference      Bit-width      Accuracy      Bit-width      Accuracy
               per layer      Loss (%)      per layer      Loss (%)
[1]            6-6-5          0.8           5-7-9-8-8      0.8 a)
[2]            8-7-7          ~0.7          10-8-8-8-8     ~0.6
[3]            4-5-7          ~1.0 b)       9-7-4-5-7      ~1.0 b)
[4] c)         8-8-8          0.3           8-8-8-8-8      0.3
ABWR w/ DIM    5-5-6          0.1           6-6-6-6-6      0.4

a) Top-5 accuracy. b) Relative accuracy loss; baseline accuracy is not presented. c) Requires retraining weights for fine-tuning with software.

[1] B. Moons et al., "Energy-efficient ConvNets through approximate computing," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2016, pp. 1-8.
[2] P. Judd et al., "Proteus: Exploiting numerical precision variability in deep neural networks," in Proc. Int. Conf. Supercomput. (ICS), 2016, Art. no. 23.
[3] P. Judd et al., "Stripes: Bit-serial deep neural network computing," IEEE Comput. Archit. Lett., vol. 16, no. 1, pp. 80-83, Jan./Jun. 2017.
[4] P. Gysel et al., "Hardware-oriented approximation of convolutional neural networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1-8.

Page 31: Reduction: Computation Number

• Reduced activations -> gating or skipping of cycles & memory accesses
• Accuracy loss depends on the technique

Pruning: train connectivity -> prune connections -> retrain weights (before / after)

Filter decomposition (e.g., MobileNets): a 5x5 filter applied as sequential 5x1 and 1x5 filters

ReLU sparsity:

   9 -1 -3        9 0 0
   1 -5  5   ->   1 0 5
  -2  6  1        0 6 0

Run-length compression of the sparse activations (see the sketch below):
  Input:  0,0,12,0,0,0,0,53,0,0,22, ...
  Output: 2 12 4 53 2 22 0   (Run, Level pairs, then Term)

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
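The run-length coding above can be reproduced in a few lines (a sketch inferred from the slide's example, not the accelerator's exact encoder):

```python
def rle_zeros(values, term=0):
    # Encode as (run-of-zeros, nonzero level) pairs, then a terminator.
    out, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            out += [run, v]   # "Run" then "Level"
            run = 0
    out.append(term)          # "Term" marker
    return out

print(rle_zeros([0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22]))
# [2, 12, 4, 53, 2, 22, 0] -- matches the slide's output
```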

Page 32: Reduction: Computation Number

❑ DIM w/ Near-Zero Skipping

[Figure: original image -> normalized image after pruning (pruned data vs. dynamic range); DIM image -> after pruning w/ DIM -> after recovering the DIM data (pruned data vs. informative data)]

Activation skipping rate (AlexNet, with 0.28% accuracy loss): 48.4% (x1.00) without DIM vs. 56.8% (x1.17) with DIM.

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.

Page 33: Reduction: Computation Number

❑ DIM w/ ABWR & Near-Zero Skipping

• Based on a 65nm CMOS standard cell library
• 4x M-PE (24x N-PE) configuration for AlexNet
• 32.8% reduced computation cycles

[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.

Page 34: Outline

❑ Overview: Convolutional Neural Network
❑ CNN Accelerator Design
  - Background
  - Maximize Data Reuse
  - Reduction: Computation Size
  - Reduction: Computation Number
  - Processing-in-Memory (PIM)
❑ Summary

Page 35: Processing in Memory

• PIM: a technique that performs simple logic within the memory device to reduce the amount of data passed to the processor.

[Figure: von Neumann architecture (CPU/logic exchanging massive data with memory) vs. processing-in-memory (logic moved into the memory)]

Page 36: Processing in Memory

❑ Basic DRAM Operation: Read -> Write-Back

• DRAM consists of a cell (1T1C) array
• WL on -> charge sharing between the cell capacitor and the bit-line (BL) capacitor -> sensing

[Figure: DRAM array architecture; read operation sensing "1" and sensing "0"]

[Vivek Seshadri et al., MICRO 2017]

Page 37: Processing in Memory

❑ DRAM-based PIM: AND, OR Operation

• AND/OR operation: 3 WLs on -> charge sharing -> sensing
• Majority function over the inputs A, B, and C
• Result is A AND B when C = 0, and A OR B when C = 1

[Table: DRAM PIM OR/AND operation: R = A OR B; R = A AND B]

[Vivek Seshadri et al., MICRO 2017]
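A truth-table check of the majority trick (a software model only, assuming ideal charge sharing):

```python
from itertools import product

def majority(a, b, c):
    # Triple-row activation settles each bit-line to the majority of
    # the three cells: M(A, B, C) = AB + BC + CA.
    return int(a + b + c >= 2)

for a, b in product([0, 1], repeat=2):
    assert majority(a, b, 0) == (a & b)   # C = 0 -> AND
    assert majority(a, b, 1) == (a | b)   # C = 1 -> OR
```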

Page 38: Processing in Memory

❑ DRAM-based PIM: NOT Operation (Inverting)

• 2T1C cell: Read -> Inverting -> Write-Back
• Separate decoder for the PIM region to reduce hardware overhead
• ~32x throughput and ~35x energy improvement over DDR3 for AND/OR/NOT

[Figure: cell array with row decoders and sense amplifiers; a special 2T1C cell & column line; a 2T1C row and a dedicated PIM operation region]

[Vivek Seshadri et al., MICRO 2017]

Page 39: Processing in Memory

❑ Binary Neural Network

Floating- or fixed-point operations -> binary (bit-wise) operations (hardware-friendly) -> explosive growth of PIM research

[1] M. Courbariaux et al., "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," arXiv:1602.02830, 2016.
[2] W. Tang et al., "How to Train a Compact Binary Neural Network with High Accuracy?," in AAAI 2017, pp. 2625-2631.

Page 40: Processing in Memory

❑ Content Addressable Memory Operation

• Content addressable memory (CAM): a look-up table
• Search data is applied to the search-lines (SLs) in parallel
• Search results develop on the match-lines (MLs) in parallel
• RAM (random access memory): address in, data out. CAM: data in, matching address out.

Page 41: Processing in Memory

❑ Content Addressable Memory Operation

1. Match-line (ML) precharge to VDD
2. Search-line (SL) activation -> ML evaluation

• Mismatch case: any mismatching cell discharges its ML to VSS
• Match case: the ML stays precharged at VDD
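A software model of the parallel CAM search (illustrative only; real match-lines evaluate as analog discharge, not as a loop):

```python
def cam_search(stored_words, key):
    # All MLs are precharged; every row then compares its bits against
    # the search-lines in parallel, and any mismatching bit pulls its
    # ML low. Here, 1 means "ML stayed high" (match).
    return [int(word == key) for word in stored_words]

rows = ["1010", "0101", "1100"]
print(cam_search(rows, "1010"))  # [1, 0, 0]: only row 0 matches
```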

Page 42: Processing in Memory

❑ Binarized Convolution Operation

$y = \sigma\!\left(\sum_{i=1}^{N} x_i w_i\right), \quad \sigma(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases}$   (XNOR-popcount)

[Figure: a CAM row stores the weights W1..WN; the activations X1..XN drive the SLs. The ML discharge rate depends on the number of mismatches, so the match-line voltage over time encodes the popcount: a few mismatches ('1' case) discharge slowly, many mismatches ('0' case) discharge quickly. A time-domain sense amplifier reduces the N-bit comparison to a 1-bit output.]

[Ref.] Woong Choi, Kwanghyo Jeong, Kyungrak Choi, Kyeongho Lee and Jongsun Park, "Content Addressable Memory Based Binarized Neural Network Accelerator Using Time-Domain Signal Processing", Design Automation Conference (DAC), Jun. 2018.
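The XNOR-popcount equivalence for +/-1 vectors can be verified directly (a NumPy sketch; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16
x = rng.choice([-1, +1], size=N)    # binarized activations
w = rng.choice([-1, +1], size=N)    # binarized weights

# Encode -1 as bit 0 and +1 as bit 1. XNOR is 1 exactly where the
# signs agree, so: dot(x, w) = 2 * popcount(XNOR(xb, wb)) - N.
xb, wb = (x > 0).astype(int), (w > 0).astype(int)
xnor = 1 - (xb ^ wb)
assert x @ w == 2 * xnor.sum() - N

y = 1 if x @ w >= 0 else -1          # sign activation sigma(.)
print(y)
```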

Page 43: Processing in Memory

❑ Binarized Convolution Operation

• Half of the weight bits are inverted to create the reference ML delay
• Batch-normalization bias: modify the number of inverted weight bits

[Ref.] Woong Choi, Kwanghyo Jeong, Kyungrak Choi, Kyeongho Lee and Jongsun Park, "Content Addressable Memory Based Binarized Neural Network Accelerator Using Time-Domain Signal Processing", Design Automation Conference (DAC), Jun. 2018.

Page 44: Processing in Memory

❑ Binarized Convolution Operation

[Figure: MNIST digit 7 and digit 2 examples]

                     JSSC'17 [1]            Conv. [2] (redesigned)   This work
Technology           65nm                   65nm                     65nm
Supply Voltage       1.0V                   1.1V                     1.1V
Operation            CONV-BNorm-BinAct (for the 2nd CONV layer, MNIST)
PE Components        MUX, Delay, Resistor   XNOR, Adder              CAM Cell
PE Area (1 PE)       85.1 um2               11.84 um2                4.0 um2
Energy Efficiency    *48.2 TSop/J           **25.2 TSop/J            *88.5 TSop/J

Computation error -> 1.56% top-1 accuracy degradation

[1] D. Miyashita et al., "A Neuromorphic Chip Optimized for Deep Learning and CMOS Technology With Time-Domain Analog and Digital Mixed-Signal Processing," IEEE JSSC, vol. 52, no. 10, pp. 2679-2689, 2017.
[2] H. Yonekawa et al., "On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch Normalization Free Technique on an FPGA," in IPDPSW 2017, pp. 98-105.

Page 45: Outline

❑ Overview: Convolutional Neural Network
❑ CNN Accelerator Design
  - Background
  - Maximize Data Reuse
  - Reduction: Computation Size
  - Reduction: Computation Number
  - Processing-in-Memory (PIM)
❑ Summary

Page 46: Data-Centric CNN

Accelerator design directions:
• Maximize data reuse
• Reduction: computation size
• Reduction: computation number
• Processing-in-memory

Operation energy (pJ) and relative energy cost [M. Horowitz et al., ISSCC 2014]:

  8-bit Add          0.03
  32-bit FP Add      0.9
  32-bit FP Mult     3.7
  32-bit SRAM Read   5
  32-bit DRAM Read   640   (~21333x the energy of an 8-bit add)

Page 47: 2018 Issues on AI

❑ Phone reservations as natural as a human: Google Duplex
❑ Autonomous taxi service: Waymo (Google)
❑ Fake images that look real: Style-based GAN
❑ Predicting the 3D shape of proteins: AlphaFold
❑ Language processing smarter than humans: BERT, ELMo, Big Bird
❑ Trying on clothes online: Virtual Try-On

Source: https://brunch.co.kr/@omniousofficial/32

Page 48: AI Hardware vs. Human

❑ Energy discrepancy: AlphaGo ~5x10^4 W; AlphaGo Zero ~1-2x10^3 W; vs. the human brain, ~20 W

• Where does this inefficiency come from? Algorithm, architecture, circuits, devices, and materials.