

Spintronic Stochastic Dataflow Computing

Problem: Computing Has Hit a Memory Wall

Stochastic Computing (SC) Reduces Memory Access

• Unique number representation: a value is encoded by the frequency of occurrence of “1”s in a random, binary bitstream (see the encoding sketch after this list)

• Highly compact stochastic computing (SC) MACs enable massive parallelization of compute, a.k.a. spatial unrolling
• Spatial unrolling greatly reduces scratchpad & DRAM access
• But prior art in SC struggles with large random bitstream generators
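To make the representation concrete, here is a minimal Python sketch (not from the poster; the stream length, RNG, and function names are illustrative assumptions): a value in [0, 1] is encoded as a Bernoulli bitstream and recovered as the frequency of 1s.

```python
import random

def encode(p, n_bits=1024):
    """Encode a value p in [0, 1] as a stochastic bitstream:
    each bit is 1 with probability p (independent Bernoulli trials)."""
    return [1 if random.random() < p else 0 for _ in range(n_bits)]

def decode(bits):
    """Recover the value as the frequency of occurrence of 1s."""
    return sum(bits) / len(bits)

print(decode(encode(0.7)))  # ~0.7; the sampling error shrinks as the stream lengthens
```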

Voltage-Controlled Spintronics Enables Dense, Low Energy, CMOS-Compatible Memory (MeRAM) [1]

4-bit stochastic MAC: 0.05 pJ/op, ~4 μm²
• The stochastic MAC operates bitwise on the streams: a single 2:1 MUX computes the scaled add Y = P*A + (1-P)*B, and multiplication Y = A*B likewise reduces to one logic gate per bit (see the sketch below).

4-bit fixed-point (conventional) MAC: 0.08 pJ/op, ~200 μm²
• The conventional MAC needs a 4-bit multiplier feeding a 4-bit adder built from a chain of full adders (FA) on operand bits A0/B0 through A3/B3.
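A hedged, bit-level illustration of the stochastic MAC above (a software emulation only, with stream length and operand values chosen for illustration; the 0.8 and 0.5 operands mirror the example bitstreams later in this transcript): in unipolar SC, ANDing two independent streams multiplies their values, and a 2:1 MUX whose select stream has 1-frequency P computes P*A + (1-P)*B.

```python
import random

def bitstream(p, n):
    """Bernoulli bitstream with P(bit = 1) = p."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def sc_mul(a, b):
    """Bitwise AND of independent streams: output 1-frequency ~= A*B."""
    return [ai & bi for ai, bi in zip(a, b)]

def sc_scaled_add(a, b, sel):
    """One 2:1 MUX per bit: take a_i when sel_i = 1, else b_i.
    Output 1-frequency ~= P*A + (1-P)*B, where P is sel's 1-frequency."""
    return [ai if si else bi for ai, bi, si in zip(a, b, sel)]

n = 8192
prod = sc_mul(bitstream(0.8, n), bitstream(0.5, n))              # ~0.40 = 0.8 * 0.5
acc = sc_scaled_add(prod, bitstream(0.3, n), bitstream(0.5, n))  # ~0.35 = 0.5*0.4 + 0.5*0.3
print(sum(prod) / n, sum(acc) / n)
```

The gate count is what makes the area comparison above plausible: each bit-slice of the stochastic MAC is a handful of gates instead of a 4-bit array multiplier plus adder.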

Distribution A: Approved for public release: distribution unlimited.

• Today’s compute units are large, requiring repeated access to memory for operands/results in data-intensive applications
• Off-chip memory (e.g., DRAM) incurs high communication costs, e.g., 100x more than compute
• On-compute-die memory (e.g., SRAM) is not big enough for applications such as machine learning
• Memory wall problems are omnipresent: both in edge devices and in the cloud

[Figure: MeRAM bit cell: VC-MTJ with resistance R_P or R_AP, an access transistor, and a read circuit]

Memory Type  | Unit Cell Area (μm²) | Read Time (ns) | Read Bitline Energy (fJ) | Write Time (ns) | Write Bitline Energy (fJ)
MeRAM        | 0.01                 | 4              | 27.5                     | <1              | 30
STT-MRAM     | 0.06                 | 1.3            | 209                      | 20              | 4500
SRAM (32nm)  | 0.17                 | 0.7            | 84                       | <2              | >100
eDRAM (45nm) | 0.07                 | 1.4            | 64                       | <2              | >70

• Voltage-controlled magnetic tunnel junction (VC-MTJ)
• Switching time < 1 ns, voltage < 1 V, current < 10 μA, energy < 5 fJ
• Nonvolatile, endurance > 10^15 cycles, area < 20F²

• MeRAM array is 5x smaller, 3x more efficient (simulated)

Spintronic Random Number Generator Solves the SC Problem

[Figure: VC-MTJ switching energy landscape: the free layer sees an energy barrier Eb at V = 0; applying V = Vb lowers the barrier, and a long voltage pulse lets thermal noise set the final state at random]

RNG Type | Energy/Bit (fJ)                               | Area (μm²) | Latency (ns) | Periodicity
VC-MTJ   | 37 (VCMA=90, T=400°C); 18 (VCMA=130, T=550°C) | 7          | 26           | True random
LFSR     | 28                                            | 35         | 8            | 2048

• True, uncorrelated random bits from thermal noise

• The spintronic random number generator is 5x smaller than an LFSR (see the sketch below)
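For context on the LFSR baseline in the table, here is a minimal sketch of a maximal-length Fibonacci LFSR (register width, taps, and seed are illustrative assumptions, not the design compared in the table). An 11-bit maximal-length LFSR revisits its state every 2^11 - 1 = 2047 cycles, the kind of periodicity (~2048) quoted above; the VC-MTJ bits, by contrast, come from thermal noise and have no period.

```python
def lfsr_bits(seed=0b10101010101, taps=(11, 9), width=11):
    """Fibonacci LFSR: the feedback bit is the XOR of the tapped stages.
    Taps (11, 9) give a maximal-length sequence for an 11-bit register."""
    state = seed
    while True:
        yield state & 1                             # output the last stage
        fb = 0
        for t in taps:
            fb ^= (state >> (width - t)) & 1        # stage t, counted from the input end
        state = (state >> 1) | (fb << (width - 1))  # shift and feed the new bit back in

gen = lfsr_bits()
stream = [next(gen) for _ in range(2 * 2047)]
print(stream[:2047] == stream[2047:])  # True: the pseudo-random stream repeats every 2047 bits
```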

Spintronics+SC is Great for ML on Edge Devices

• Compact SC MACs and dense, low-energy, non-volatile MeRAM offer great latency & energy benefits
• SC is a purely digital approach that can approach the efficiency of analog neuromorphic systems
• Runtime energy-latency tradeoff on the same hardware (illustrated in the sketch after this list)
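One way to read the runtime energy-latency tradeoff bullet: an SC result is just the 1-frequency of a stream, so the same hardware can stop evaluating early, spending fewer cycles (and hence less energy) in exchange for a noisier result. A hedged sketch, with the target value and stream lengths chosen only for illustration:

```python
import random

TARGET = 0.6  # the exact value the stream encodes
stream = [1 if random.random() < TARGET else 0 for _ in range(4096)]

# Truncating the same stream at runtime trades accuracy for latency/energy:
# fewer bits means fewer clock cycles on the same SC datapath.
for n in (64, 256, 1024, 4096):
    est = sum(stream[:n]) / n
    print(f"bits={n:5d}  estimate={est:.3f}  abs error={abs(est - TARGET):.3f}")
```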

References

[1] Grezes, C., et al., "Ultra-low switching energy and scaling in electric-field-controlled nanoscale magnetic tunnel junctions with high resistance-area product," Applied Physics Letters, vol. 108, 012403, 2016.
[2] Wang, S., et al., "Hybrid VC-MTJ/CMOS non-volatile stochastic logic for efficient computing," Proceedings of the Conference on Design, Automation & Test in Europe (DATE), March 2017, pp. 1442-1447.
[3] Chen, Y., et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, January 2017.

Progress So Far…

• Demonstrated MeRAM devices in the lab [1] and MeRAM array benefits using circuit simulations
• Developed a scalable digital SC architecture for deep neural networks & verified it in simulations
• Offers 6.9x fewer scratchpad accesses & 410x lower EDP compared to an iso-area Eyeriss [3]

What Will We Accomplish?

• Demonstrate MeRAM integrated with CMOS
• Develop a scalable SC+MeRAM architecture for Deep Neural Network (DNN) applications, e.g., AlexNet
• Achieve, in silicon, >200x improvement in EDP over prior art, with comparable inference accuracy

[Figure: example stochastic bitstreams with their encoded values: 1101111101 (0.8), 1010101100 (0.5), 1000101100 (0.4), 0010010010 (0.3), 1010111110 (0.7); the 0.4 stream is the bitwise AND of the 0.8 and 0.5 streams, and the 0.7 stream is the bitwise OR of the 0.4 and 0.3 streams]

Acknowledgment and Disclaimer

This material is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7867. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL and DARPA or the U.S. Government.

Sudhakar Pamarti, Puneet Gupta, and Kang-L. Wang
University of California, Los Angeles

New Materials and Devices: Framework for Novel Compute (FRANC)