A Scalable Time-based Integrate-and-Fire Neuromorphic Core with Brain-Inspired Leak and Local Lateral Inhibition Capabilities

Muqing Liu, Luke R. Everson, Chris H. Kim

Dept. of ECE, University of Minnesota, Minneapolis, MN

liux3300@umn.edu

Outline

• Background

• Time Based Neural Networks

• Leaky Neuron and Local Lateral Inhibition

• Digit Recognition Application

• Measurement Results

• Conclusion

Neuromorphic Computing

• Biological neuron behavior: Weight multiplication (synapse) → Weight integration (cell body) → Threshold comparison & fire.

• Applications: Image recognition/classification, natural language processing, speech recognition, etc.

Biological neuron model

* synaptic weights: excitatory (+) or inhibitory (-)

Artificial neuron model

http://juanribon.com/design/nerve-cell-body-diagram.php

Prior Art: Deep Learning Processors

• Circuit/Architecture innovations:

− Data reuse in convolutional neural network.

− Utilize sparsity by data gating/zero skipping.

− Reduced weight precision → binary neural networks.

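[Figure: die photos and key specifications of two prior deep learning processors.]

Eyeriss: DCNN Accelerator [1] (TSMC 65nm LP 1P9M) and DNPU: Reconfigurable CNN-RNN Processor [2] (65nm 1P8M CMOS).

Values shown in the figure: 108KB; 4000µm × 4000µm; peak performance 16.8 – 42.0 GOPS (1 OP = 1 MAC); power 278mW @ 1V; 21mW @ 1.1V; 235mW @ 1.1V; 3.9 TOPS/W @ 1.1V.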

[1] Y.-H. Chen, et al., ISSCC, 2016. [2] D. Shin, et al., ISSCC, 2017.

Prior Art: Emerging NVM-based Implementation

• Comparison with CMOS implementation:

− Pros: Compact, analog computation.

− Cons: Susceptible to noise, immature process.

Memristor-based crossbar array [3]

PCM-based crossbar array [4]

[3] K.-H. Kim, et al., Nano Lett., Dec. 2011. [4] D. Kuzum, et al., Nano Lett., Jun. 2011.

Time-based vs. Digital Implementation

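[Figure: Time-based Neural Network vs. Digital Neural Network.]

Both compute the same weighted sum:

$y = \sum_i x_i \cdot w_i = x_1 \cdot w_1 + x_2 \cdot w_2 + \cdots + x_i \cdot w_i$

Time-based Neural Network: each product $x_1 \cdot w_1, x_2 \cdot w_2, \ldots, x_i \cdot w_i$ is mapped to a delay $\mathrm{Delay}_1, \mathrm{Delay}_2, \ldots, \mathrm{Delay}_i$, and the delays accumulate:

$\sum \mathrm{Delay}_i = \mathrm{Delay}_1 + \mathrm{Delay}_2 + \cdots + \mathrm{Delay}_i$

Digital Neural Network: N-bit multipliers and an M-bit adder compute the sum, followed by an activation.

Time-based vs. Digital

• Time-based

− Core circuits: Programmable delay circuits.

− Pros: Area and power efficient.

− Cons: Moderate resolution.

• Digital

− Core circuits: Multipliers & adders.

− Pros: High resolution.

− Cons: Large area and power consumption.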

Comparison with Previous Time-based Neural Network

Proposed Time-based Neural Net

[Figure: block diagram of the proposed neuromorphic core. A DCO with 128 programmable delay stages, enabled by EN_DCO, receives input pixels X0 to X127 and SRAM-stored weights W0,1<2:0> to W126,127<2:0>, so that T_DCO ∝ ∑ Delay_i. The DCO output clocks an 8b counter (bits C0 to C7) that feeds a Compare & Fire block with an 8-bit Threshold. Neuron control logic drives the LEAK and rst signals of the counter bits. Supported features: Leaky Integrate & Fire, Local Lateral Inhibition.]

Proposed Time-based Neural Net

• Input pixel: Xi

− Determines whether a stage is activated or not.

• Weight: Wi<2:0>

− Determines how many capacitors are turned on as load in that stage.
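The two bullets above can be turned into a rough behavioral model. The sketch below is illustrative only: the timing constants, the linear relation between the weight code and the added delay, and the choice that an inactive stage still contributes its base delay are all assumptions, not values or behavior taken from the slides.

```python
# Hypothetical timing constants (NOT from the slides), in picoseconds.
BASE_STAGE_DELAY_PS = 50.0  # assumed delay of a stage with no capacitor load
UNIT_CAP_DELAY_PS = 5.0     # assumed extra delay per unit of capacitor load


def stage_delay_ps(x: int, w: int) -> float:
    """Delay of one stage for a binary input pixel x and a 3-bit weight w."""
    if x not in (0, 1):
        raise ValueError("input pixel must be 0 or 1")
    if not 0 <= w <= 7:
        raise ValueError("weight must fit in 3 bits (0..7)")
    # Assumption: the capacitor load grows linearly with the weight code and
    # is only switched in when the input pixel activates the stage.
    return BASE_STAGE_DELAY_PS + UNIT_CAP_DELAY_PS * w * x


def dco_period_ps(pixels, weights) -> float:
    """Total DCO period as the sum of all stage delays."""
    return sum(stage_delay_ps(x, w) for x, w in zip(pixels, weights))


# Three stages: the second pixel is 0, so its weight adds no load.
print(dco_period_ps([1, 0, 1], [3, 7, 2]))  # 175.0
```

Under these assumptions the period equals a fixed baseline plus a term proportional to ∑ xi·wi, which is the sense in which the oscillator accumulates the weighted sum in the time domain.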

[Figure: Programmable Delay Stage schematic. The weight Wi<2:0> is stored as bits wi<2>, wi<1>, wi<0> in 3 SRAM cells. Unit cell layout (2 stages, 3 SRAM cells per stage), dimension 8.1µm. BL and BLB are omitted for simplicity.]

64x128 Time-based Neural Network

• 8 DCO cores are grouped together to implement local lateral inhibition.

• 64 DCO neuromorphic cores in total.

• 121 out of 128 DCO stages are used as programmable inputs.

• Remaining 7 stages are reserved for calibration.
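As a quick way to keep the array organization straight, the numbers above can be written down as constants. This is a sketch only: the grouping of consecutive DCO indices into one inhibition group and the row-major pixel order are assumptions, and the 11x11 image size is taken from the measurement slides later in the deck.

```python
NUM_DCOS = 64          # DCO neuromorphic cores in total
STAGES_PER_DCO = 128   # programmable delay stages per DCO
INPUT_STAGES = 121     # stages used as programmable inputs
CALIBRATION_STAGES = STAGES_PER_DCO - INPUT_STAGES  # stages reserved for calibration
LLI_GROUP_SIZE = 8     # DCO cores sharing local lateral inhibition


def lli_neighbors(dco_index: int) -> list:
    """Other DCOs in the same inhibition group (assumes consecutive grouping)."""
    if not 0 <= dco_index < NUM_DCOS:
        raise ValueError("DCO index out of range")
    start = (dco_index // LLI_GROUP_SIZE) * LLI_GROUP_SIZE
    return [i for i in range(start, start + LLI_GROUP_SIZE) if i != dco_index]


def flatten_image(image) -> list:
    """Flatten an 11x11 binary image into 121 stage inputs (assumes row-major order)."""
    pixels = [p for row in image for p in row]
    if len(pixels) != INPUT_STAGES:
        raise ValueError("expected 121 pixels")
    return pixels


print(CALIBRATION_STAGES)  # 7
print(lli_neighbors(10))   # [8, 9, 11, 12, 13, 14, 15]
```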

Frequency Calibration and Linearity Test

• Frequency variation between 10 DCOs

− Before calibration: 1.17%, after calibration: 0.10%.

Bio-Inspired Features: Leaky Neuron and Local Lateral Inhibition (LLI)

• Leaky neuron: Ions diffuse through the neuron cell membrane.

• Local lateral inhibition: An active neuron strives to suppress the activities of its neighbors.

Electrical modeling of cell membrane [3]

Lateral inhibition: Mach band illusion [4]

[3] W. Gerstner, et al., Neuronal Dynamics. [4] Wikipedia.

Time-based Leak and LLI

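[Figure: Time-based Leaky Integrate & Fire Neuron: counter bits C0 to C7 feed a Compare & Fire block that compares against Threshold and outputs SPIKE; the LSB is reset for leak. Time-based Local Lateral Inhibition (LLI): SPIKE<0>, SPIKE<1>, SPIKE<2>, ..., SPIKE<7> trigger a bit reset in the neighbor counters.]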

• Leak enabled:

− LSB of every counter is reset periodically.

• LLI enabled:

− Specific bits in the neighboring counters are reset after a DCO spikes.

− The fastest DCO resets the other DCOs more often than it is reset by others.
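The counter behavior described above can be explored with a small tick-based simulation. This is a behavioral sketch, not the chip's logic: integer DCO periods in abstract ticks, the leak interval, the choice of which neighbor bit LLI clears, and resetting a counter to zero when it fires are all assumptions made for illustration. The default threshold of 16 matches the spiking threshold quoted in the performance comparison.

```python
def simulate_group(periods, steps, threshold=16, leak_every=None, lli_bit=None):
    """Count spikes per DCO in one group over `steps` abstract time ticks.

    periods    -- ticks per oscillation for each DCO (hypothetical values)
    leak_every -- if set, clear the LSB of every counter every `leak_every` ticks
    lli_bit    -- if set, a spiking DCO clears this bit in its neighbors' counters
    """
    n = len(periods)
    counters = [0] * n
    spikes = [0] * n
    for t in range(1, steps + 1):
        # Leak: the LSB of every counter is reset periodically.
        if leak_every and t % leak_every == 0:
            counters = [c & ~1 for c in counters]
        for i, period in enumerate(periods):
            if t % period != 0:
                continue
            # DCO i completed one oscillation: increment its 8-bit counter.
            counters[i] = (counters[i] + 1) & 0xFF
            if counters[i] >= threshold:
                spikes[i] += 1
                counters[i] = 0  # assumed: the counter restarts after firing
                if lli_bit is not None:
                    # LLI: clear one specific bit in every neighboring counter.
                    for j in range(n):
                        if j != i:
                            counters[j] &= ~(1 << lli_bit)
    return spikes


periods = [10, 11]  # DCO 0 oscillates slightly faster than DCO 1
print(simulate_group(periods, steps=17600))                # [110, 100]
print(simulate_group(periods, steps=17600, leak_every=7))  # leak enabled
print(simulate_group(periods, steps=17600, lli_bit=0))     # LLI enabled
```

In this model both mechanisms can only remove counts, so a neuron never spikes more often than in the basic case. How much the contrast between neurons improves depends on the chosen periods, leak interval, and reset bit, which is why the printed values for the leak and LLI runs are left for the reader to inspect.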

Leak and LLI

• Leak: Uniformly lower spiking frequency.

• LLI: Preferentially lower spiking frequency.

• Goal: Higher contrast between different neuron outputs.

*None: No leak and no LLI, basic DCO operation.

[Figure: two plots versus DCO No., one comparing *None with LEAK and one comparing *None with LLI. Both are annotated "Sharper contrast".]

Handwritten Digit Recognition

• Input database: MNIST.

• Learning method: Supervised learning.

• Learning network: Single-layer & multi-layer perceptron network.

Single-layer Digit Recognition

• Single-layer architecture: Proof-of-concept for time-based neural network

Multi-layer Digit Recognition

• Multi-layer architecture: Demonstrates the scalability of the core.

Measurement Results

[Figure: recognition accuracy for three configurations (single-layer with 11x11 images, two-layer with 11x11 images, two-layer with 4-patch 22x22 images), comparing Simulation, Measured (*None), and Measured (Leaky). 65nm LP CMOS, 1.2V, 25°C.]

*None: No leak and no LLI, basic DCO operation.

• Measured recognition accuracy from hardware is comparable to software simulation results.

Measurement Results

[Figure: spike count versus (Target) Digit, comparing *None and LLI, with the 1.7% difference annotated for *None. 65nm LP CMOS, 1.2V, 25°C.]

• Spike count difference between digit “2” and “0”

− Without LLI: 1.7%, with LLI: 17.7%.

Measurement Results

[Figure: DCO Frequency (MHz) and Power (µW per DCO) versus Supply voltage (V), with the voltage axis spanning 0.6 to 1.3. 65nm LP CMOS, 25°C.]

• Wide operating range: 0.7V ~ 1.2V.

Performance Comparison

This work
− Application: Hand writing recognition
− Technology: 65nm
− Circuit Type: Time-based
− Area: 0.24mm2 (64 DCOs)
− Voltage: 1.2V
− Frequency: 99MHz (nominal DCO freq.)
− Function: Multi-layer perceptron network
− Power: 320.4 µW/DCO
− Power Efficiency: 309G ÷ N spikes/s/W (N = spiking threshold) a; 37.4 TSOp/s/W b; 37.4 TOPS/W d; 0.43pJ/pixel (logic) e
− Hardware Efficiency: 16.6GE/PE c

ASSCC’17 [5]
− Application: Hand writing recognition
− Circuit Type: Time-based
− Area: 3.61mm2 (32K PEs)
− Function: Convolutional neural network
− Power Efficiency: 48.2 TSOp/s/W
− Hardware Efficiency: 76.5GE/PE

ISSCC’16 [6]
− Application: Object detection + intention prediction
− Circuit Type: Analog + Digital
− Area: 16.0mm2
− Frequency: 250MHz
− Function: Deep neural network
− Power Efficiency: 862GOPS/W

VLSI’15 [7]
− Application: Object Recognition
− Circuit Type: Digital
− Area: 1.8mm2
− Frequency: 40MHz (Inference)
− Function: Spiking LCA with classification
− Power: 3.65mW
− Power Efficiency: 5.7pJ/pixel (memory+logic)

a. N=16 in our measurements.

b. SOp: synaptic operation. In a DCO-based time-domain neural network, one oscillation of the DCO is equivalent to 121 SOp.

c. 1GE: 1.44µm2 (65nm). PE: processing element.

d. One operation is defined as one multiplication and accumulation (MAC). In a DCO-based time-domain neural network, one oscillation of the DCO is equivalent to 121 3-bit MACs.

e. Uses a spiking threshold of 16 and accounts only for the power consumption of the core logic circuits. Memory power is not included, since weights are not updated during inference.

[5] D. Miyashita, et al., ASSCC, 2017. [6] K. J. Lee, et al., ISSCC, 2016. [7] J. K. Kim, et al., VLSI, 2015.
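The efficiency figures quoted for this work appear to follow from the nominal DCO frequency, the per-DCO power, and the footnote definitions. The short check below reproduces them under one interpretive assumption that the slides do not state explicitly: that the energy per pixel is the energy spent over N = 16 oscillations, divided across the 121 input pixels.

```python
DCO_FREQ_HZ = 99e6          # nominal DCO frequency
POWER_PER_DCO_W = 320.4e-6  # measured power per DCO
SOP_PER_OSCILLATION = 121   # footnotes b and d: one oscillation = 121 SOp / MACs
SPIKING_THRESHOLD = 16      # footnote a: N = 16

# Oscillations per second per watt; dividing by N gives spikes/s/W.
oscillations_per_s_per_w = DCO_FREQ_HZ / POWER_PER_DCO_W

# Synaptic operations (or 3-bit MACs) per second per watt.
sop_per_s_per_w = oscillations_per_s_per_w * SOP_PER_OSCILLATION

# Assumed interpretation: one spike takes N oscillations over 121 pixels.
time_per_spike_s = SPIKING_THRESHOLD / DCO_FREQ_HZ
energy_per_pixel_j = POWER_PER_DCO_W * time_per_spike_s / SOP_PER_OSCILLATION

print(round(oscillations_per_s_per_w / 1e9))  # 309  (G ÷ N spikes/s/W)
print(round(sop_per_s_per_w / 1e12, 1))       # 37.4 (TSOp/s/W or TOPS/W)
print(round(energy_per_pixel_j * 1e12, 2))    # 0.43 (pJ/pixel, logic only)
```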

Die Photo and Performance Summary

Conclusion

• Neural network function is computed in time domain using standard digital circuits with high area and power efficiency.

• Implemented brain-inspired leak and local lateral inhibition features to enhance the contrast between neuron outputs.

• 65nm test chip measurements confirm 91% hand-written digit recognition accuracy.
