VLSI & System Lab.
Research Trends on Convolutional Neural Network
(CNN) Accelerator Design
2019 SoC Conference
Sookmyung Women's University, Woong Choi
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
AI & Neural Network
Artificial Intelligence: the science and engineering of creating intelligent machines
Machine Learning: the field of study that gives computers the ability to learn without being explicitly programmed
Neural Network: an algorithm that takes its basic function from an understanding of how the brain operates
Convolutional Neural Network
Source : MIT Tutorial
Neural Network Applications
Image Processing, Autonomous Machines, Security & Defense, Medical, Games
Simple Neural Network
$$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$$

[Diagram: three input nodes (x1, x2, x3) feed a hidden-node layer through a bunch of weights W (W1, W2, W3), producing hidden outputs Yj and two output nodes. A simple imitation of neurons and synapses.]

MAC: multiply-and-accumulate (= dot product operation)
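A minimal sketch (plain Python, illustrative weights and inputs, not from the slides) of the MAC operation a single neuron performs, matching the equation above:

```python
# Minimal sketch of a single neuron's multiply-and-accumulate (MAC).
# Weight and input values below are illustrative assumptions.

def neuron_output(weights, inputs):
    """Dot product: Y = sum_k W[k] * X[k] for one neuron."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x          # one MAC per weight/input pair
    return acc

# Example: 3 inputs, 3 weights -> one hidden-node value (pre-nonlinearity)
print(neuron_output([0.5, -0.2, 0.1], [1.0, 2.0, 3.0]))  # ~0.4
```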
Simple Neural Network
[Diagrams: the same three-input network (input, hidden, and output node layers) at three stages]

• Random initialization: let's try a dog image with the untrained NN. Initial weights (e.g., 0.0, 0.1) give an ambiguous result: Dog 47% / Cat 53%.
• Training algorithm: loss function, back-propagation, batch normalization.
• Weight training: let's try inference again. Trained weights (e.g., 0.4, 0.8) now give Dog 95% / Cat 5%.
Deep Neural Network
[Pipeline: 5-1000 CONV layers extract low-level to high-level features with POOL-layer down-sampling, then 1-3 fully-connected layers produce the classes]

What enabled deep neural networks:
• Available big data: 350M images/day, 300 hours of video uploaded per minute
• GPU acceleration
• New ML techniques

[Chart: top-5 image classification accuracy, 2010-2015; deep neural networks (DNNs) reach human-level accuracy. Russakovsky et al., IJCV 2015]
Source : MIT Tutorial
Convolutional Neural Network
[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes]

Convolution: a filter (weights) is shifted across the input feature map (X), filtering it to produce the output feature map (Y).

Activation: Sigmoid, Hyperbolic Tangent, Leaky ReLU, Exponential LU, ReLU. ReLU is the most hardware-friendly function.
Source : MIT Tutorial
Convolutional Neural Network
[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes]

• Reduce resolution of each channel independently
• Increase translation-invariance and noise-resilience

Max pooling, max(∙) (the AlexNet case), and average pooling, avg(∙), on a 4x4 input with 2x2 windows (reproduced in the sketch below):

Input:     Max pooling:   Average pooling:
1 0 3 3
5 6 6 8    6 8            3 5
3 1 2 3    3 5            2 3
2 2 2 5
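A short NumPy sketch reproducing the 4x4 example above with 2x2, stride-2 windows:

```python
import numpy as np

# Reproduces the slide's 4x4 pooling example with 2x2 windows, stride 2.
# A sketch using plain NumPy, not an accelerator kernel.
x = np.array([[1, 0, 3, 3],
              [5, 6, 6, 8],
              [3, 1, 2, 3],
              [2, 2, 2, 5]])

# Split the 4x4 map into four 2x2 windows, flattened along the last axis.
windows = x.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
print(windows.max(axis=2))   # [[6 8] [3 5]]   -> max pooling
print(windows.mean(axis=2))  # [[3. 5.] [2. 3.]] -> average pooling
```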
Location of Pooling Layers
[Diagram, AlexNet: Input → CONV → POOL → CONV → POOL → CONV → CONV → CONV → POOL → FC → FC → FC → Softmax]
Source : MIT Tutorial
Convolutional Neural Network
[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes]

• Fully-connected layers account for more than 90% of the total number of parameters, dominating memory and energy
• Simple matrix multiplication

[Diagram: fully-connected layer]
Source : MIT Tutorial
And Others …
[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes]

[Sergey Ioffe et al., ICML 2015]

Softmax normalization:
$$s_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Logits → Probabilities (sum = 1):
-0.7 → 0.08
 1.2 → 0.53
 0.5 → 0.26
-0.2 → 0.13

• Pre-processing to balance between training and inference (accuracy relies heavily on this procedure)
• Not essential when the difference between the classes is not seriously important
• Ranking is maintained (see the sketch below)
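A small sketch (plain Python/NumPy) reproducing the slide's logits-to-probabilities example and the ranking property:

```python
import numpy as np

# Softmax over the slide's example logits; a sketch, not from the deck.
logits = np.array([-0.7, 1.2, 0.5, -0.2])
probs = np.exp(logits) / np.exp(logits).sum()   # s_i = e^{z_i} / sum_j e^{z_j}
print(probs.round(2))                  # [0.08 0.53 0.26 0.13] -> sums to 1
print(np.argsort(logits) == np.argsort(probs))  # ranking is maintained
```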
Various CNN Configurations
[1] Krizhevsky et al., NIPS 2012  [2] Simonyan and Zisserman, ICLR 2015  [3] He et al., CVPR 2016
AlexNet [1] VGG-16 [2] ResNet-50 [3]
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
Hardware Technologies

[Chart: hardware technologies used in machine learning, organized by training vs. inference and data center vs. edge/hybrid, ranked by performance & functionality: Google TPU, Google Cloud TPU, NVIDIA Pascal & Volta, NVIDIA Tesla P40 & P4, NVIDIA Jetson, NVIDIA Drive PX2, AMD Radeon GPU, Qualcomm Snapdragon, FPGAs (Xilinx & Intel), FPGA SoCs]
Source: https://tanjo.ai/contents/2323923
Deep CNN on “Cloud Platforms”

[Diagram: GPU-based, CPU+FPGA-based, and ASIC-based cloud platforms]

• Accelerators (FPGAs, ASICs) are more efficient than GPUs in terms of power and energy consumption
• Trade-off between energy efficiency and flexibility
Need Reconfigurable Accelerator
[Layer diagrams: AlexNet (5 CONV, 3 POOL, and 3 FC layers plus softmax) vs. VGG-16 (13 CONV, 5 POOL, and 3 FC layers plus softmax)]
[Jinmook Lee et al., JSSC 2018]
• Different networks: different numbers of layers, different numbers of filters/channels
• Different architectures: different algorithmic structures
• Different quantizations: different bit-widths in different layers
Reconfigurability vs. Energy Efficiency

Dynamically Reconfigurable DNN Accelerator
• Low-level reconfiguration: a 14x12 PE array operating in different modes
• High-level reconfiguration: use of an instruction set architecture

[Chart: programmability vs. performance & energy efficiency. A general-purpose processor (w/ software programming) maximizes programmability; an application-specific hard-wired accelerator maximizes performance & energy efficiency; the target position is a runtime reconfigurable accelerator between the two.]
Main Operation in CNN

$$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$$

Architecture | Weight Size | Ifmap Size | # Multiply-Adds | Top-1 Accuracy
AlexNet      | 238 MB      | 1.6 MB     | 724 M           | 57.10 %
VGG-16       | 528 MB      | 34.8 MB    | 15.5 B          | 70.50 %
ResNet-50    | 99 MB       | 37.5 MB    | 3.9 B           | 75.20 %

[Pipeline: CONV layers extract low-level to high-level features, POOL layers down-sample, and a fully-connected layer produces the classes; the CONV and FC layers are both dominated by the same MAC operation above]
Data-Centric CNN
[Diagram: layers of data flowing through the network]

Accelerator Design
• Maximize Data Reuse
• Reduction: Computation Size
• Reduction: Computation Number
• Processing-in-memory

Operation            | Energy (pJ)
8-bit Add            | 0.03
32-bit FP Add        | 0.9
32-bit FP Mult       | 3.7
32-bit SRAM Read     | 5
32-bit DRAM Read     | 640
→ ~21333x relative energy cost from an 8-bit add to a 32-bit DRAM read
[M Horowitz et al., ISSCC 2014]
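To make the table concrete, a back-of-the-envelope sketch (plain Python; the energies are the Horowitz numbers above, while the per-MAC access count of two DRAM reads is an assumption for illustration):

```python
# Rough energy bookkeeping from the Horowitz ISSCC 2014 numbers above.
E = {"add8": 0.03, "fp_add32": 0.9, "fp_mul32": 3.7,
     "sram32": 5.0, "dram32": 640.0}          # energies in pJ

print(E["dram32"] / E["add8"])                # ~21333x: DRAM read vs 8-bit add

# Why data reuse matters: one MAC that re-fetches both operands from DRAM
# (an assumed worst case) costs far more than the arithmetic itself.
mac = E["fp_mul32"] + E["fp_add32"]           # 4.6 pJ of compute
print((2 * E["dram32"]) / mac)                # ~278x memory-to-compute ratio
```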
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
Maximize Data Reuse : Data Flow
Weight Stationary
• Maximize weight reuse
• Broadcast activations
• Accumulate pSUMs (partial sums) spatially

Output Stationary
• Maximize pSUM reuse
• Broadcast weights
• Reuse activations spatially

No Local Reuse
• Use a large global buffer
• Reduce DRAM access
• Multicast activations & weights
• Accumulate pSUMs spatially
[DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015] [TPU, ISCA 2017]
[Peemen, ICCD 2013] [ShiDianNao, ISCA 2015] [Gupta, ICML 2015] [Moons, VLSI 2016]
[nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [ISAAC, ISCA 2016] [PRIME, ISCA 2016]
Source : MIT Tutorial
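The dataflows differ mainly in loop ordering, i.e., which operand a PE keeps pinned locally. A conceptual sketch in plain Python/NumPy (illustrative sizes; indexing uses the correlation form X[j+k] rather than the slide's X[j−k] for simplicity):

```python
import numpy as np

# Loop-nest view of two dataflows on a 1-D convolution. "Stationary"
# refers to which operand would stay in a PE's local register.

def conv_output_stationary(W, X):
    N, J = len(W), len(X) - len(W) + 1
    Y = np.zeros(J)
    for j in range(J):           # each output's partial sum stays local...
        for k in range(N):       # ...while weights and inputs stream past
            Y[j] += W[k] * X[j + k]
    return Y

def conv_weight_stationary(W, X):
    N, J = len(W), len(X) - len(W) + 1
    Y = np.zeros(J)
    for k in range(N):           # each weight is fetched once and reused...
        for j in range(J):       # ...across every output position
            Y[j] += W[k] * X[j + k]
    return Y

W, X = np.array([1., 2., 3.]), np.arange(8, dtype=float)
assert np.allclose(conv_output_stationary(W, X), conv_weight_stationary(W, X))
```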
Maximize Data Reuse : Filtering
Simple CNN on CIFAR-10:
CONV Layer | iFMAP Data | Reused Data | Reusing Rate
1          | 1024       | 540         | 52.7 %
2          | 256        | 220         | 85.9 %
3          | 64         | 60          | 93.8 %

AlexNet on ImageNet:
CONV Layer | iFMAP Data | Reused Data | Reusing Rate
1          | 51529      | 4158        | 8.1 %
2          | 961        | 520         | 54.1 %
3          | 225        | 72          | 32.0 %
4          | 225        | 72          | 32.0 %
5          | 225        | 72          | 32.0 %
[Diagram: a filter (weights) shifted across the input feature map (X) to produce the output feature map (Y), comparing one-directional and bi-directional sliding. For an input feature map Ii with Mi filters of size Fi, Ci channels, and stride Si, each downward sliding step (1st, 2nd, ..., ((I−F)/S)th) reuses Fi × (Fi − Si) input pixels and discards/updates the rest; see the sketch after the reference below.]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
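A sketch of the reuse bookkeeping for one-directional downward sliding (assuming AlexNet CONV1 parameters I=227, F=11, S=4, which reproduces the 8.1% row of the table above):

```python
# Per downward slide of stride S with an F x F filter, F*(F-S) input
# pixels carry over. Parameter values are assumed AlexNet CONV1 numbers.

def reuse_rate(I, F, S):
    total = I * I                         # iFMAP pixels
    slides = (I - F) // S                 # downward sliding steps
    reused = F * (F - S) * slides         # carried-over pixels per column walk
    return total, reused, 100.0 * reused / total

print(reuse_rate(227, 11, 4))   # (51529, 4158, ~8.1%) -> matches the table
```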
Reduction: Computation Size
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
• Directly reduces the required memory & PE size
• Trade-off: bit-width ↔ accuracy
[Plots: weight-value density under linear, log2, and non-linear quantization grids]

Fixed-point formats: sign + integer + fractional mantissa (e.g., 011001), or sign + fractional mantissa only.

Binarization:
standard network | real weights & activations   | operations: +, −, ×
binary weight    | binary weights               | operations: +, −
binary network   | binary weights & activations | operations: XNOR, bitcount
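A rough sketch contrasting uniform and log2 quantization grids (illustrative bit-widths and weight distribution, not the paper's exact scheme):

```python
import numpy as np

# Compare uniform (linear) steps vs powers-of-two (log2) steps that
# better match a bell-shaped weight density. All parameters are assumed.

def quant_linear(w, bits, w_max=1.0):
    step = 2 * w_max / (2**bits - 1)
    return np.round(w / step) * step            # uniform grid

def quant_log2(w, bits):
    sign = np.sign(w)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), -(2**(bits - 1)), 0)
    return sign * 2.0**exp                      # powers-of-two grid

w = np.random.randn(1000) * 0.2                 # bell-shaped weights
for q in (quant_linear(w, 4), quant_log2(w, 4)):
    print(np.mean((w - q)**2))                  # quantization MSE to compare
```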
Reduction: Computation Size
❑ Differential Input Method (DIM)

Previous output:
$$Y_{j-1} = \sum_{k=0}^{N-1} W_k X_{j-k-1}$$

Current output:
$$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k}$$

Rewriting the current output as a difference plus the previous output:
$$Y_j = (Y_j - Y_{j-1}) + Y_{j-1} = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$$
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
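A quick numerical check (plain Python/NumPy with made-up sizes) that the DIM identity above reproduces the direct convolution:

```python
import numpy as np

# Verify: computing Y_j from input differences plus the previous output
# matches the direct convolution. Sizes below are illustrative.
N, J = 5, 20
W = np.random.randn(N)
X = np.random.randn(J + N)

def y_direct(j):                # Y_j = sum_k W_k * X_{j-k}
    return sum(W[k] * X[j - k] for k in range(N))

def y_dim(j, y_prev):           # Y_j = sum_k W_k*(X_{j-k} - X_{j-k-1}) + Y_{j-1}
    return sum(W[k] * (X[j - k] - X[j - k - 1]) for k in range(N)) + y_prev

y_prev = y_direct(N - 1)
for j in range(N, J):
    y_prev = y_dim(j, y_prev)
    assert np.isclose(y_prev, y_direct(j))      # identical results
print("DIM matches direct convolution")
```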
Reduction: Computation Size
❑ Differential Input Method (DIM)

$$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$$

[Histograms: occurrence of input activation values (roughly −120 to 120) for ConvNet layers 1-3 and AlexNet layers 1-5, with and without DIM. With DIM, the distributions concentrate near zero: the dynamic range is reduced.]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Size
❑ Differential Input Method (DIM)

$$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$$

[Diagram: the raw input stream x0, x1, x2, x3, ... fed to the MAC datapath]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Size
❑ Differential Input Method (DIM)

$$Y_j = \sum_{k=0}^{N-1} W_k (X_{j-k} - X_{j-k-1}) + Y_{j-1}$$

[Diagram: the differential input stream x1−x0, x2−x1, x3−x2, x4−x3, ...; most datapath components operate on the reduced bit-width, at the cost of a small hardware overhead for forming the differences]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Size
❑ Adaptive Bit-Width Reduction w/ DIM

[Diagram: layer-by-layer quantization error. Layer A has many small values, making the upper integer-magnitude bits ineffective; layer B has many large values that need them. An adaptive decimal point per layer covers the differing dynamic ranges. With DIM, quantization applies pixel-by-pixel to adjacent-input differences: large differences that would cause quantization error (CASE-2) occur rarely, while small differences (CASE-1) leave the upper bits ineffective.]
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Size
❑ Adaptive Bit-Width Reduction (ABWR)
Reference    | CIFAR-10 bit-width per layer | Accuracy loss (%) | ImageNet bit-width per layer | Accuracy loss (%)
[1]          | 6-6-5                        | 0.8               | 5-7-9-8-8                    | 0.8 a)
[2]          | 8-7-7                        | ~0.7              | 10-8-8-8-8                   | ~0.6
[3]          | 4-5-7                        | ~1.0 b)           | 9-7-4-5-7                    | ~1.0 b)
[4] c)       | 8-8-8                        | 0.3               | 8-8-8-8-8                    | 0.3
ABWR w/ DIM  | 5-5-6                        | 0.1               | 6-6-6-6-6                    | 0.4

a) Top-5 accuracy. b) Relative accuracy loss; baseline accuracy is not presented. c) Requires retraining weights for fine-tuning with software.
[1] B. Moons et al., "Energy efficient ConvNets through approximate computing," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2016, pp. 1-8.
[2] P. Judd et al., "Proteus: Exploiting numerical precision variability in deep neural networks," in Proc. Int. Conf. Supercomput. (ICS), 2016, Art. no. 23.
[3] P. Judd et al., "Stripes: Bit-serial deep neural network computing," IEEE Comput. Archit. Lett., vol. 16, no. 1, pp. 80-83, Jan./Jun. 2017.
[4] P. Gysel et al., "Hardware-oriented approximation of convolutional neural networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1-8.
Reduction: Computation Number
• Reduced activations → gating or skipping cycles & memory accesses
• Accuracy loss depends on the technique

Pruning: train connectivity → prune connections → train weights (before/after)

Filter decomposition (MobileNets): apply a 5x5 filter as 5x1 and 1x5 filters sequentially

ReLU sparsity (see the sketch after the reference below):
 9 -1 -3            9 0 0
 1 -5  5  → ReLU →  1 0 5
-2  6  1            0 6 1

Run-level coding:
Input:  0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22, ...
Output: 2, 12, 4, 53, 2, 22, 0   ((run, level) pairs, then a terminator)
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
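A sketch of the run-level coding shown above, reproducing the slide's example stream (the trailing 0 is assumed to be a terminator symbol):

```python
# Run-level coding of sparse post-ReLU activations: for each nonzero
# value, emit (number of preceding zeros, value), then a terminator.

def run_level_encode(xs):
    out, run = [], 0
    for x in xs:
        if x == 0:
            run += 1                 # count zeros in the current run
        else:
            out += [run, x]          # emit (run, level) pair
            run = 0
    out.append(0)                    # terminator
    return out

print(run_level_encode([0, 0, 12, 0, 0, 0, 0, 53, 0, 0, 22]))
# [2, 12, 4, 53, 2, 22, 0]
```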
Reduction: Computation Number
❑ DIM w/ Near-Zero Skipping

[Images: original image → normalized → after pruning, vs. DIM image → after pruning w/ DIM → after recovering. Pruning discards near-zero data; with DIM, the informative data survive within the reduced dynamic range.]

Activation skipping rate (AlexNet w/ 0.28% accuracy loss): 48.4 % (x1.00) without DIM → 56.8 % (x1.17) with DIM
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Reduction: Computation Number
❑ DIM w/ ABWR & Near-Zero Skipping
• Based on 65nm CMOS standard cell library
• 4xM-PE (24xNPE) for AlexNet
→ 32.8 % reduction in computation cycles
[Ref.] Woong Choi, Kyungrak Choi, and Jongsun Park, "Low Cost Convolutional Neural Network Accelerator Based on Bi-directional Filtering and Bit-width Reduction", IEEE Access, vol. 6, pp. 14734-14746, Mar. 2018.
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
Processing in Memory
• PIM: a technique that performs simple logic within a memory device to reduce the amount of data passed to the processor.

[Diagram: von Neumann architecture (CPU/logic exchanging massive data with memory) vs. processing-in-memory, where simple logic sits inside the memory]
Processing in Memory
❑ Basic DRAM Operation : Read → Write-Back
[Figures: DRAM array architecture; read operation sensing “1” and sensing “0”]

• DRAM consists of a 1T1C cell array
• WL on → charge sharing between the cell capacitor and the bit-line (BL) capacitor → sensing
[Vivek Seshadri et al., MICRO 2017]
Processing in Memory
❑ DRAM-based PIM : AND, OR Operation
• AND/OR operation: 3 WLs on → charge sharing → sensing
• Majority function of the inputs A, B, and C
• Result = A OR B when C = 1, A AND B when C = 0

[Table: DRAM PIM OR/AND operation, R = A OR B (C = 1) and R = A AND B (C = 0)]
[Vivek Seshadri et al., MICRO 2017]
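A truth-table check (plain Python) that the triple-row-activation majority function behaves as stated above:

```python
from itertools import product

# MAJ(A, B, C) models charge sharing across three simultaneously
# activated DRAM rows: the sensed value is the majority of the cells.

def maj(a, b, c):
    return int(a + b + c >= 2)       # majority of three stored bits

for a, b in product([0, 1], repeat=2):
    assert maj(a, b, 1) == (a | b)   # C = 1 -> OR
    assert maj(a, b, 0) == (a & b)   # C = 0 -> AND
print("majority acts as OR (C=1) / AND (C=0)")
```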
Processing in Memory
❑ DRAM-based PIM : NOT Operation (Inverting)
• 2T1C cell: read → inverting → write-back
• Separate decoder to reduce hardware overhead
• 32x throughput and 35x energy improvement over DDR3 for AND/OR/NOT

[Diagrams: special 2T1C cell & column line; DRAM with a 2T1C row and a dedicated PIM operation region (row decoders, cell array, sense amps)]
[Vivek Seshadri et al., MICRO 2017]
Processing in Memory
❑ Binary Neural Network
[Diagram: floating- or fixed-point operations replaced by binary (bit-wise), hardware-friendly operations]

→ Explosive growth in PIM research

[1] M. Courbariaux et al., "Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," 2016, arXiv:1602.02830.
[2] W. Tang et al., "How to train a compact binary neural network with high accuracy?," in Proc. AAAI, 2017, pp. 2625-2631.
Processing in Memory
❑ Content Addressable Memory Operation
• Content addressable memory (CAM): a look-up table
• Search data is applied to the search-lines (SLs) in parallel
• Search results develop on the match-lines (MLs) in parallel

RAM (random access memory): address in → data out
CAM: data in → address out
Processing in Memory
❑ Content Addressable Memory Operation
• Match-line (ML) precharged to VDD
• Search-line (SL) activation → ML evaluation

1. ML Precharge → 2. SL Activation
[Diagram: on a mismatch the ML discharges to VSS; on a match the ML stays at VDD]
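A software analogy of the lookup direction (plain Python; the stored words are made up): RAM maps address → data, while a CAM compares all entries against the search data in parallel and returns matching addresses:

```python
# Software analogy of a CAM lookup. Hardware evaluates every match-line
# at once; here we simply scan the (made-up) stored words.

cam = ["1010", "0110", "1100", "1010"]

def cam_search(key):
    return [addr for addr, word in enumerate(cam) if word == key]

print(cam_search("1010"))   # [0, 3] -> match-lines that stay at VDD
print(cam_search("1111"))   # []     -> all match-lines discharge to VSS
```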
Processing in Memory
❑ Binarized Convolution Operation
$$y = \sigma\!\left(\sum_{i=1}^{N} x_i w_i\right), \qquad \sigma(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

→ implemented as XNOR-popcount
[Circuit: activations X1...XN drive the SLs, weights W1...WN are stored along a CAM match-line; after precharge (PCHb), the ML discharge is sensed in the time domain by a sense amp (SA). Few mismatches → slow discharge → the SA resolves '1'; many mismatches → fast discharge → '0'. The N-bit comparison thus reduces to a 1-bit time-domain decision.]
[Ref.] Woong Choi, Kwanghyo Jeong, Kyungrak Choi, Kyeongho Lee and Jongsun Park, "Content Addressable Memory Based Binarized Neural Network Accelerator Using Time-Domain Signal Processing", Design Automation Conference (DAC), Jun. 2018.
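A sketch of the XNOR-popcount identity (plain Python/NumPy, random vectors): with x, w ∈ {−1, +1} encoded as bits {0, 1}, the dot product equals 2·popcount(XNOR(x, w)) − N:

```python
import numpy as np

# Binarized dot product via XNOR-popcount; vector length is illustrative.
N = 64
x = np.random.randint(0, 2, N)           # bit-encoded activations
w = np.random.randint(0, 2, N)           # bit-encoded weights

xnor = 1 - (x ^ w)                       # 1 where bits agree
dot_bits = 2 * int(xnor.sum()) - N       # XNOR-popcount result

xf, wf = 2 * x - 1, 2 * w - 1            # decode back to {-1, +1}
assert dot_bits == int((xf * wf).sum())  # matches the +-1 dot product
print(+1 if dot_bits >= 0 else -1)       # binarized activation sigma(.)
```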
Processing in Memory
❑ Binarized Convolution Operation
[Ref.] Woong Choi, Kwanghyo Jeong, Kyungrak Choi, Kyeongho Lee and Jongsun Park, "Content Addressable Memory Based Binarized Neural Network Accelerator Using Time-Domain Signal Processing", Design Automation Conference (DAC), Jun. 2018.
• Half of the weight bits are inverted to generate the reference ML delay
• Batch normalization bias: modify the number of inverted weight bits
Processing in Memory
❑ Binarized Convolution Operation
[Images: MNIST digits 7 and 2]

                  | JSSC '17 [1]         | Conv. [2] (redesigned) | This work
Technology        | 65nm                 | 65nm                   | 65nm
Supply Voltage    | 1.0V                 | 1.1V                   | 1.1V
Operation         | CONV-BNorm-BinAct (for the 2nd CONV layer for MNIST)
PE Components     | MUX, Delay, Resistor | XNOR, Adder            | CAM Cell
PE Area (1 PE)    | 85.1 um2             | 11.84 um2              | 4.0 um2
Energy Efficiency | *48.2 TSop/J         | **25.2 TSop/J          | *88.5 TSop/J

Computation error → 1.56% top-1 accuracy degradation
[1] D. Miyashita et al., "A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing," IEEE JSSC, vol. 52, no. 10, pp. 2679-2689, 2017.
[2] H. Yonekawa et al., "On-chip memory based binarized convolutional deep neural network applying batch normalization free technique on an FPGA," in Proc. IPDPSW, 2017, pp. 98-105.
Outline
❑ Overview : Convolutional Neural Network
❑ CNN Accelerator Design
- Background
- Maximize Data Reuse
- Reduction: Computation Size
- Reduction: Computation Number
- Processing-in-Memory (PIM)
❑ Summary
Data-Centric CNN
[Diagram: layers of data flowing through the network]

Accelerator Design
• Maximize Data Reuse
• Reduction: Computation Size
• Reduction: Computation Number
• Processing-in-memory

Operation            | Energy (pJ)
8-bit Add            | 0.03
32-bit FP Add        | 0.9
32-bit FP Mult       | 3.7
32-bit SRAM Read     | 5
32-bit DRAM Read     | 640
→ ~21333x relative energy cost from an 8-bit add to a 32-bit DRAM read
[M Horowitz et al., ISSCC 2014]
2018 Issues on AI
❑ Phone reservations as natural as a human's: Google Duplex
❑ Autonomous taxi service: Waymo (Google)
❑ Fake images that look real: style-based GAN
❑ Predicting the 3D structure of proteins: AlphaFold
❑ Language processing smarter than humans: BERT, ELMo, Big Bird
❑ 'Trying on' clothes online: Virtual Try-On
Source: https://brunch.co.kr/@omniousofficial/32
AI Hardware vs Human
❑ Energy Discrepancy
AI hardware: 5×10⁴ W (AlphaGo) → 1~2×10³ W (AlphaGo Zero), vs. the human brain: 20 W
• Where does this inefficiency come from? Algorithm, Architecture, Circuits, Device, and Materials