SCNN: An Accelerator for Compressed-sparse Convolutional...


Angshuman Parashar, Minsoo Rhu*, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

* Now at POSTECH (Pohang University of Science and Technology), South Korea

© NVIDIA 2017

Motivation


Convolution (dense)

Inner product between every input pixel and every filter coefficient. The sliding window is intuitive and maps to a reasonable hardware implementation.

[Diagram: input activation maps * filters (weights) = output activation maps]

Many of these sliding-window multiplications involve zero-valued operands and are therefore useless MUL ops.

Convolution with Sparsity

Most operand values are zero:

Static sparsity: pruned network weights are set to '0' during training (Han et al., "Learning Both Weights and Connections for Efficient Neural Network", NIPS 2015).

Dynamic sparsity: negative-valued activations are clamped to '0' by ReLU during inference (Albericio et al., "CNVLUTIN: Ineffectual-Neuron-Free Deep Neural Network Computing", ISCA 2016).

Convolution with Sparsity

Sliding-window convolution is wasteful: the fraction of non-zero (NZ) activations and weights is roughly 20~50% per layer.

[Figure: per-layer density of activations and weights for VGGNet]

With both dynamic (activation) and static (weight) sparsity present, the sliding window still issues every multiplication, so many of the MUL ops are useless and their products are simply 0.
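A back-of-the-envelope check of how this waste compounds (my own illustration; the 35%/30% densities are hypothetical values inside the 20~50% range quoted above): a MUL is useful only when both operands are non-zero, so with activation density d_a and weight density d_w, only d_a·d_w of the dense MULs do useful work.

```python
# Fraction of dense convolution MULs that are useful when both
# operands are sparse: a MUL is useful only if the activation AND
# the weight are non-zero, so the useful fraction is d_act * d_wt.
def useful_mul_fraction(d_act: float, d_wt: float) -> float:
    return d_act * d_wt

# Hypothetical densities inside the 20~50% per-layer range.
d_act, d_wt = 0.35, 0.30
frac = useful_mul_fraction(d_act, d_wt)
print(f"useful MULs: {frac:.1%}")             # useful MULs: 10.5%
print(f"theoretical MUL reduction: {1/frac:.1f}x")  # ~9.5x if all waste is skipped
```

This is the headroom a sparsity-aware accelerator can chase; the rest of the deck is about actually capturing it in hardware.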

Motivation

CNN inference is often performed in power-limited environments.

Our goal: a sparsity-optimized CNN accelerator with high energy efficiency.

Possible Solutions


Option 1: Leverage a Dense CNN Design

Employ a pair of bit-masks to track the non-zero (NZ) weights and activations. For each sliding-window position, ANDing the activation bitmask with the weight bitmask identifies the useful MULs, and only those are issued to the vector ALUs. Across successive cycles of a 3x3 window, utilization varies widely:

CLK#0: 2 useful MULs
CLK#1: 1 useful MUL
CLK#2: 0 useful MULs

Challenge: finding enough useful MULs to fully populate the vector ALUs.
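The bitmask idea can be sketched in a few lines (my own illustration, not the slide's hardware; the per-cycle mask values are hypothetical, chosen to echo the 2/1/0 utilization shown above):

```python
# Sketch of the Option-1 bitmask scheme for one 4-lane vector ALU.
# A MUL is issued only where both the activation bitmask and the
# weight bitmask have a '1'; the remaining ALU lanes sit idle.
def useful_lanes(act_mask: int, wt_mask: int) -> int:
    return bin(act_mask & wt_mask).count("1")

NUM_ALUS = 4
# Hypothetical per-cycle window bitmasks (activations, weights).
cycles = [(0b1101, 0b0101), (0b1000, 0b1001), (0b0110, 0b1001)]
for clk, (a, w) in enumerate(cycles):
    n = useful_lanes(a, w)
    print(f"CLK#{clk}: {n}/{NUM_ALUS} ALUs do useful work")
```

The AND cheaply filters out useless MULs, but it does nothing to gather enough surviving MULs to keep all lanes busy every cycle, which is exactly the challenge the slide names.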

SCNN Intuition & Approach


Intuition behind SCNN

Forget sliding-window convolution: every NZ activation must, at some point in time, be multiplied by every NZ weight. This holds exactly for a convolution stride of '1'.

The SCNN Approach

A Cartesian-product (i.e., all-to-all) based convolution operation. Assuming a convolution stride of '1', the minimum number of MULs is the size of the Cartesian product of the NZ activations and the NZ weights.

Both operand streams are first compressed: only the NZ activations (x, y, z) and the NZ weights (a, b, c, d, e, f) are kept and delivered to the multiplier array.
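One way to sketch the compression step (a simplification of the paper's encoding, which stores values plus compact coordinate information; the array contents are hypothetical):

```python
# Compress a dense 1-D array into (index, value) pairs, keeping only
# the non-zero entries -- the form consumed by an all-to-all
# (Cartesian-product) multiplier array.
def compress(dense):
    return [(i, v) for i, v in enumerate(dense) if v != 0]

acts = [0, 3, 0, 0, 5, 7]   # hypothetical input activations
wts  = [2, 0, 4]            # hypothetical 3-tap filter
print(compress(acts))       # [(1, 3), (4, 5), (5, 7)]
print(compress(wts))        # [(0, 2), (2, 4)]
```

Keeping the coordinate alongside each value is essential: the backend needs it to work out which output each product belongs to.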

The PE frontend computes the Cartesian product of the compressed NZ activations (x, y, z) and NZ weights (a, b, ...): x·a, y·a, z·a, x·b, y·b, z·b, and so on. The PE backend's scatter network routes each product to the accumulator for the output it contributes to.
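The frontend/backend split can be sketched end-to-end in 1-D (my own simplification under the stride-1 assumption; SCNN itself operates on 2-D planar tiles with compressed coordinate encodings): the frontend multiplies every NZ activation by every NZ weight, and the backend scatter-accumulates each product into the output position it contributes to.

```python
# 1-D Cartesian-product convolution (stride 1, 'valid' output).
# Frontend: all-to-all products of the compressed NZ operands.
# Backend: scatter-accumulate each product into out[i - k].
def scnn_conv1d(acts, wts):
    nz_a = [(i, v) for i, v in enumerate(acts) if v != 0]
    nz_w = [(k, v) for k, v in enumerate(wts) if v != 0]
    out = [0] * (len(acts) - len(wts) + 1)   # dense accumulator array
    for i, a in nz_a:
        for k, w in nz_w:                    # Cartesian product of NZ values
            j = i - k                        # target output coordinate
            if 0 <= j < len(out):            # products falling outside are dropped
                out[j] += a * w              # scatter-accumulate
    return out

acts = [0, 3, 0, 0, 5, 7]  # hypothetical activations
wts  = [2, 0, 4]           # hypothetical filter
print(scnn_conv1d(acts, wts))   # [0, 6, 20, 28] -- matches dense sliding-window
```

Only 3 x 2 = 6 multiplications are issued, versus 4 x 3 = 12 for the dense sliding window, yet the result is identical.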

SCNN Architecture


SCNN Architecture

A 2D spatial array of processing elements (PEs).


SCNN Architecture: Workload Distribution

Weights are broadcast to all PEs; every PE holds a copy of all the NZ weights of the CNN model.

SCNN Architecture: Workload Distribution

Each PE is allocated a partial volume of the input and output activation maps (PE-0 through PE-3); input and output activations stay local to their PE.

SCNN Architecture: Halo Resolution

Output halos: PE-0 computes the product (A x B), but because the filter window straddles the partition boundary, the result must be accumulated at output location X, which belongs to PE-1.
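The halo condition amounts to an ownership check on the product's output coordinate (my own sketch with hypothetical sizes; SCNN's actual tiling parameters differ): a product whose output coordinate falls outside the producing PE's tile is a halo and must be forwarded to the neighboring PE that owns it.

```python
# Decide which PE owns an output coordinate under a 2x2 spatial
# tiling of an 8x8 output map (hypothetical sizes).
TILE = 4  # each PE owns a 4x4 output tile

def owner_pe(row: int, col: int) -> int:
    # PE-0..PE-3 laid out row-major over the 2x2 grid of tiles.
    return (row // TILE) * 2 + (col // TILE)

# PE-0 produces a partial sum for output (3, 4): that coordinate
# lies in PE-1's tile, so it is a halo and is sent over the
# inter-PE channel instead of being accumulated locally.
producing_pe = 0
target = owner_pe(3, 4)
print(f"halo: PE-{producing_pe} -> PE-{target}" if target != producing_pe
      else "local accumulation")
```
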

SCNN Architecture: Inter-PE Communication

An inter-PE communication channel resolves halos: each PE (holding its weights, input activations (IA), and output activations (OA)) forwards halo partial sums to its neighboring PEs (e.g., its NE and SE neighbors).

SCNN PE Microarchitecture

(Compressed-sparse frontend) + (scattered-dense backend): the frontend operates directly on the compressed NZ operands, while the backend scatters their products into dense accumulators.

Evaluation


Evaluation: Methodology

Network models and input activations: trained → pruned → retrained models using Caffe

Area & power: SystemC → Catapult HLS → Verilog RTL → synthesis of an SCNN PE

Performance & energy: a cycle-level performance model of SCNN, plus an analytical model for design-space exploration (dataflows, sparse vs. dense)

Evaluation: Architecture Configurations

DCNN: operates solely on dense weights and activations

DCNN-opt: DCNN with (de)compression of activations + ALU power-gating

SCNN: the sparsity-optimized CNN accelerator

Performance: Dense vs. Sparse

[Figure: per-layer performance of VGGNet, dense vs. sparse]

Network-wide improvement over DCNN: AlexNet (2.37x), GoogLeNet (2.19x)

Energy Consumption: Dense vs. Sparse

[Figure: per-layer energy consumption of VGGNet, dense vs. sparse]

Network-wide improvement over DCNN: AlexNet (2.13x), GoogLeNet (2.51x)

Related Work: Qualitative Comparison to Prior Work

Architecture              Sparse opt. (weights)   Sparse opt. (activations)   Convolution dataflow
DaDianNao [ASPLOS '14]    -                       -                           (variant of) sliding-window
Eyeriss [ISCA '16]        -                       Power-gating                (variant of) sliding-window
CNVLUTIN [ISCA '16]       -                       Zero-skipping               (variant of) sliding-window
Cambricon-X [MICRO '16]   Zero-skipping           -                           (variant of) sliding-window
SCNN                      Zero-skipping           Zero-skipping               Cartesian-product

Follow-up Questions?

Contact the authors. Technical leads:

SCNN architecture & sparse models: Minsoo Rhu

TimeLoop (CNN analytical model): Angshuman Parashar

Power & area modeling: Rangharajan Venkatesan

Conclusion

SCNN: a compressed-sparse CNN accelerator

Novel Cartesian-product based convolution operation

Simple, compressed-sparse PE frontend

Scattered-dense PE backend

Superior performance and energy efficiency:

On average, 2.7x higher performance than a dense CNN architecture

On average, 2.3x higher energy efficiency than a dense CNN architecture