The Design of Spintronic-based Circuitry for
Memory and Logic Units in Computer Systems
A DISSERTATION
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Cong Ma
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Advisor David J. Lilja
October, 2018
© Cong Ma 2018
ALL RIGHTS RESERVED
Acknowledgements
I would like to express my sincere gratitude to my advisor, Prof. David J. Lilja,
for his continuous support of my Ph.D. study and related research. His guidance
and encouragement helped me overcome challenges and obstacles. I am grateful
for his patience, his kindness, and his invaluable advice in research and in life.
Besides my advisor, I would like to thank the rest of my thesis committee:
Prof. Kia Bazargan, Prof. Sachin Sapatnekar, and Prof. Pen-Chung Yew; and
my preliminary exam committee: Prof. Chris H. Kim and Prof. Ulya Karpuzcu,
for their insightful comments and suggestions.
I am grateful for the resources from the University of Minnesota Supercomputing
Institute and for the support from C-SPIN, one of six centers of STARnet, a
Semiconductor Research Corporation program sponsored by MARCO and DARPA,
and from National Science Foundation grant no. CCF-1241987. Any opinions,
findings, and conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the NSF or C-SPIN.
I would like to thank my colleagues and fellow labmates: Dr. Peng Li,
Prof. Yuan Ji, Bill Tuohy, Pushkar Nandkar, Dr. Jongyeon Kim, Ibrahim Ahmed,
Zhaoxin Liang, Dr. Bingzhe Li, Dr. Manas Minglani, Dr. Hassan Najafi, and
Yaobin Qin, for the brainstorming, the inspiring discussions, the sleepless nights
before deadlines, and all the fun we had.
I would like to thank Prof. Emad Ebbini, who gave me the opportunity to
join his team during my Master's program and gain valuable research experience.
I would also like to thank the staff of the Electrical and Computer Engineering
Department, the Graduate School, and International Student and Scholar Services
at the University of Minnesota, especially the Graduate Advisor, Linda Jagerson,
for answering all my last-minute questions.
My sincere thanks also go to my former manager Alka Deshpande at Oracle,
for her support during difficult times, and to Prof. Shujuan Wang, Prof.
Guofu Zhai, Prof. Sijiu Liu, and Dr. Lei Kang at Harbin Institute of Technology,
whose guidance opened the door to research for me.
I would also like to thank my friends at the University of Minnesota, especially Dr.
Yinglong Feng, Jie Kang, Yi Wang, Dr. Xiaofan Wu, Dr. Wei Zhang, Dr. Zisheng
Zhang, Keping Song, and Qi Zhao, for the great parties, the homemade hot
pots, the fun trips, and the Dota nights.
Last but not least, I would like to thank my family: my wife, Wenwei Zhang,
for always believing in me and indulging my geeky talks; my parents, Shuhua Yang
and Baocheng Ma, for their continuous love; my parents-in-law, Yueyan Pan and
Yi Zhang, for their unconditional support; and my frenchie, MobyDick, for always
sitting next to me and warming my feet while I wrote this thesis. Thank you all
for the love and support. This thesis would not have been possible without you.
Dedication
To my wife, Wenwei and my dog, MobyDick. Thank you for your support and
companionship during the most difficult time in my life.
Abstract
As CMOS technology starts to face serious scaling and power consumption
issues, emerging beyond-CMOS technologies have drawn substantial attention in
recent years. Spintronic devices, among the most promising CMOS alternatives,
offer smaller size and low standby power consumption, fitting the needs of the
growing mobile and IoT markets. Spin-Transfer Torque-MRAM (STT-MRAM),
with read latency comparable to SRAM, and All-spin logic (ASL), capable of
implementing purely spin-based circuits, are potential candidates to replace CMOS
memory and logic devices. However, spintronic memory continues to require higher
write energy, presenting a challenge to memory hierarchy design when energy
consumption is a concern. This motivates the use of STT-MRAM for the first-level
caches of a multicore processor to reduce energy consumption without significantly
degrading performance. The large STT-MRAM first-level cache saves leakage
power, and a small level-0 cache recovers the performance lost to the long write
latency of STT-MRAM. This combination reduces the energy-delay product by
65% on average compared to a CMOS baseline. All-spin logic suffers from random
bit flips that significantly impact Boolean logic reliability. Stochastic computing,
which uses random bit streams for computation, has shown low hardware cost and
high fault tolerance compared to conventional binary encoding. This motivates
the use of ASL in stochastic computing to take advantage of its simplicity and
fault tolerance. The finite-state machine (FSM), a sequential stochastic computing
element, can compute complex functions, including the exponentiation and
hyperbolic tangent functions, more efficiently, but it suffers from long calculation
latency and autocorrelation issues. A parallel implementation scheme for the FSM
is proposed that uses an estimator and a dispatcher to directly initialize the FSM
to its steady state. It shows equivalent or better results than the serial
implementation with some hardware overhead. A re-randomizer that uses an
up/down counter is also proposed to solve the autocorrelation issue.
Contents
Acknowledgements i
Dedication iii
Abstract iv
List of Tables vii
List of Figures viii
1 Introduction 1
2 Background 5
2.1 Spintronic memory device, STT-MRAM . . . . . . . . . . . . . . 6
2.2 Spintronic logic device, All-spin logic . . . . . . . . . . . . . . . . 7
3 Related Works 11
4 Incorporating Spintronic Devices in CPU Caches 17
4.1 Simulation Methodology . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.1 Technology Modeling . . . . . . . . . . . . . . . . . . . . . 19
4.1.2 Architectural Simulation . . . . . . . . . . . . . . . . . . . 22
4.2 Performance, Energy and Scalability . . . . . . . . . . . . . . . . 25
4.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Energy Comparison . . . . . . . . . . . . . . . . . . . . . . 27
4.2.3 Energy-Delay Product . . . . . . . . . . . . . . . . . . . . 29
4.2.4 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.5 Larger L0 impact . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5 Incorporating Spintronic Devices in Logic Units 38
5.1 Spintronic logic devices . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Parallel Implementation of FSM . . . . . . . . . . . . . . . . . . . 46
5.2.1 The parallel FSM design . . . . . . . . . . . . . . . . . . . 47
5.2.2 Experiments and Results . . . . . . . . . . . . . . . . . . . 52
5.2.3 Latency and Hardware Cost . . . . . . . . . . . . . . . . . 59
5.3 Autocorrelation Issue of FSM . . . . . . . . . . . . . . . . . . . . 60
5.3.1 Autocorrelation Analysis . . . . . . . . . . . . . . . . . . . 60
5.3.2 Re-randomizer . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.3 Discussion and Analysis . . . . . . . . . . . . . . . . . . . 70
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6 Conclusion and Discussion 73
References 76
List of Tables
4.1 Energy consumption parameters for the STT cache structures. . . 21
4.2 Energy consumption parameters for the CMOS cache structures.
Leakage is in mW. . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Simulated Processor Configurations. . . . . . . . . . . . . . . . . . 24
4.4 Simulated STT-MRAM Cache Parameters. For writes, the access
latency is added to the write latency. . . . . . . . . . . . . . . . . 25
4.5 Simulated Cache Hierarchies. . . . . . . . . . . . . . . . . . . . . 25
4.6 Cache capacity impact on performance and cache reuse of canneal
with simlarge dataset. The execution time of each configuration is
normalized to 4 cores with 4MB L2 cache. The percentage is the
L2 cache miss rate. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.7 Cache capacity impact on performance and cache reuse of canneal
with simnative dataset. The execution time of each configuration
is normalized to 4 cores with 4MB L2 cache. The percentage is the
L2 cache miss rate. . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 The MSE and PSNR of Edge Detection using all-spin logics . . . 45
5.2 The MSE and PSNR of Frame Difference using all-spin logics . . . 46
5.3 The average error and deviation of the parallel FSMs. . . . . . . . 56
5.4 The MSE and PSNR of image processing applications . . . . . . . 58
5.5 Hardware Cost, Latency and Area-Delay Product of serial and par-
allel FSM with 32 degrees of parallelism . . . . . . . . . . . . . . 59
List of Figures
2.1 Typical multicore cache hierarchy, with multiple copies of data. . 7
2.2 The basic all-spin logic elements. . . . . . . . . . . . . . . . . . . 8
2.3 Stochastic Computing Multiplication using a single AND Gate . . 9
2.4 Finite-state machine diagram for approximating the exp function. 10
4.1 A STT-MRAM 1T1MTJ bit-cell (a), showing the access transistor
and MTJ storage element. In the Parallel (P) state (b), resistance
through the device is lower than in the Anti-Parallel (AP) state (c). 19
4.2 The different trends in read and write energy for MTJ cells used in
L1 (left) and L2 (right) caches. The Write total is a combination
of Write cell and Write access; Write cell is the per-bit switching
energy from SPICE and Write access is the array access energy
reported by Cacti. . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 STT-MRAM cache is modeled with separate read and write ports.
The read port works the same as the CMOS SRAM cache, but
the write port blocks a consecutive write operation when it is not
available. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 The coherence transition diagram modeled in Gem5 to accommodate
the write-blocking mechanism of STT-MRAM. Instead of having one
intermediate transition state IS, we added another transition state
IS_S to simulate the blocking state of the STT-MRAM; no write
requests can be served in this state until it finishes its transition to
the S state. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5 (a) and (b) compare the cmos-base (64K,4M) to STT hierarchies (128K,16M).
Write latency of the L1 cache results in a significant performance
drop. (c) and (d) compare the cmos-base to STT hierarchies that use
the write-merging L0. The hierarchy uses a 4K fully-associative L0
cache and a STT L1 cache with various write latencies. . . . . . . 27
4.6 The near-core cache miss rate. Though the 1KB fully-associative L0
cache does have a large miss rate, the 4KB cache's miss rate is on
average less than 5%. . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 (a, b) shows the total energy consumption normalized to cmos-
base (64K,4M) with stt-l2 (64K,16M), stt-l1d2 (128K,16M), and
stt-l0 with varying L0 sizes (1/4K, 128K, 16M). The CMOS3 L2
leakage is computed for a 4MB STT-MRAM cache to create a fair
baseline. (c, d) shows the dynamic energy consumption with the
same configurations as in (a, b). . . . . . . . . . . . . . . . . . . 30
4.8 (a, b) shows the Energy Delay Product of various STT hierarchies
with four cores. (c, d) shows the Energy Delay Product with 16
cores. All normalized to cmos-base. . . . . . . . . . . . . . . . . 31
4.9 The scalability of various architecture hierarchies using from 4 to 16
cores, including two-level cmos-base (64K,4/8/16M), stt-l2 (64K,16/32/64M)
and stt-l1d2 (128K,16/32/64M), three-level stt-l0z1 (1K,128K,16/32/64M)
and stt-l0z4 (4K,128K,16/32/64M). The 3-level hierarchy has a sim-
ilar scalability as the two-level cmos-base, but canneal and facesim
particularly show better results than the others. . . . . . . . . . . 33
4.10 The figure shows the performance and energy use of several config-
urations including CMOS baseline, stt-l1d2, stt-l0z1, to stt-l0z64,
where stt-l0z32 and stt-l0z64 have 32kB 4-way assoc and 64kB 8-
way assoc L0 cache. The average performance improves less than
10% and the energy use increases more than 25% from stt-l0z4 to
stt-l0z64. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.11 The figure shows the energy-delay product of several configurations
including the larger L0 implementations. The overall energy-delay
product increases 20% on average for both benchmark suites. . . . 36
5.1 Basic elements implemented using all-spin logic. . . . . . . . . . . 39
5.2 Flip flops, basic element of FSM, implemented using all-spin logic. 40
5.3 Combinational stochastic computing elements implemented using
all-spin logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4 3-bit finite-state machine implemented using all-spin logic. . . . . 41
5.5 Edge Detection application diagram using the proposed spintronic
stochastic circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.6 Frame Difference application diagram using the proposed spintronic
stochastic circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.7 Edge Detection Results with different error rate injection using
Spintronic logics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.8 Frame Difference Results with different error rate injection using
Spintronic logics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.9 The Steady-State distribution of a 16-state FSM. From the figure,
we can see that this distribution is symmetric about the input value
of 0.5. The distribution changes the most here, making it very
sensitive around 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.10 The mean output of the straightforward implementation of FSM
with 2, 4, 8, 16 and 32 parallel copies. The FSM is a typical
16-state absolute value function. The three subgraphs use different
initial states of 0, 7, and 15, respectively. . . . . . . . . . . . . . . 50
5.11 Simulation results of the conventional deterministic scheme, serial
and a straightforward parallel stochastic implementation on Frame
Difference with different initial states. The straightforward parallel
stochastic implementation is clearly not able to compute the correct
results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.12 The straightforward parallel FSM implementation and the pro-
posed parallel implementation. The proposed parallel FSM has 32
parallel short bit streams sent to the Estimator to obtain an initial
guess for the input. Two Estimator implementations are parallel
counter and majority gate counter. This initial estimate is then
sent to the Dispatcher to look up a set of state configurations to
initialize the parallel FSMs. . . . . . . . . . . . . . . . . . . . . . 53
5.13 The experiment and analytical output result of a 32-input Majority
Gate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.14 The output mean value of two parallel FSMs, one using a parallel
counter and one using a majority gate counter as the estimator, and
the serial FSM. Both estimators use 13 clocks, 13 × 32 = 416 bits,
to approximate the input value. . . . . . . . . . . . . . . . . . . . 55
5.15 Simulation results of the conventional deterministic scheme, serial
and parallel stochastic implementation on Frame Difference. . . . 56
5.16 Simulation results of the conventional deterministic scheme, serial
and parallel stochastic implementation on Edge Detection. . . . . 57
5.17 Experimental methodology to measure and compare autocorrelation. 61
5.18 Autocorrelation Comparison. The flat “Random” line is a typi-
cal autocorrelation of random bit streams, whereas the fluctuating
“FSM” is the autocorrelation when the output of the FSM is sup-
posed to be the same as “Random” streams. Generally, the flatter
the autocorrelation plot, the more random the stream. . . . . . . 62
5.19 FSM autocorrelation results at a bit length of 1024. All show that
the 32-state line is higher than the 16-state and 8-state cases, which
means that the autocorrelation metric of the 32-state case is the
largest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.20 FSM autocorrelation results with 16 states. All show that shorter
bit streams usually have a larger autocorrelation metric. Specifically,
the autocorrelation metric with a bit stream of length 256 is
almost 10 times larger than with a bit stream of length 4096. . . . 65
5.21 Proposed re-randomizer with feedback structure. . . . . . . . . . . 67
5.22 Autocorrelation comparison of the FSM, the re-randomizer, and an
ideal Bernoulli bit stream ("Random"). "FSM" has the worst
autocorrelation; the re-randomizer ("Follower") is almost the same
as the random sequence ("Random"). . . . . . . . . . . . . . . . 68
5.23 Worst Case Autocorrelation Plot with 16 states and a bit stream
length of 1024. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 1
Introduction
Although Moore's law [1] successfully predicted the scaling of CMOS technology
for several decades, that scaling has begun to slow as CMOS devices approach
quantum physical limitations [2]. To keep decreasing the size of electronic
transistors, researchers have been studying potential CMOS alternatives [3].
Spintronic devices, with their much smaller size and virtually zero leakage power,
have quickly emerged as some of the most promising beyond-CMOS devices.
However, spintronic devices suffer from large dynamic switching energy,
long write/switch latency, and random bit flips due to thermal instability.
Device-level optimization has been investigated to reduce these effects by carefully
choosing device dimensions and selecting reasonable retention times, but it
is still not enough to match CMOS technology in performance
and dynamic energy efficiency [3]. Optimization at the application level can take
the spintronic limitations into account and mitigate these weaknesses by introducing
novel architectures and computational models.
In the memory device usage area, spintronic memory (Spin-Transfer Torque-MRAM)
is an attractive alternative to CMOS since it offers higher density,
virtually no leakage current, and read latency similar to CMOS-based SRAM. It
continues to require higher write energy, however, presenting a challenge to
memory hierarchy design when energy consumption is a concern. In this thesis,
we use STT-MRAM for the first-level caches of a multicore processor, taking
advantage of its larger size and smaller leakage power to reduce energy
consumption. The large STT-MRAM first-level cache saves leakage power, but it
causes a significant performance drop due to the long write latency of the STT-MRAM.
This performance drop can be mitigated by implementing a small, fast, fully-associative
level-0 SRAM cache, which bridges the performance gap between the
CPU core and the slow STT-MRAM first-level cache. The proposed STT hierarchy
reduces the energy-delay product by 65% on average and shows good scalability
relative to the CMOS baseline, with a few benchmarks scaling significantly better.
The Parsec and Splash2 benchmark suites are analyzed running on a modern
multicore platform, comparing the performance, energy consumption, and scalability
of the spintronic cache system to a CMOS design.
In the logic device usage area, All-spin logic (ASL) has begun to draw significant
attention as a pure spin logic circuit that uses no charge-based devices.
However, its applications are limited by its random-flipping nature and relatively
large dynamic energy consumption. Conventional Boolean logic is very sensitive
to such reliability issues, so researchers have started to investigate other
computation models that fit the novel devices better. Stochastic computing,
which employs random bit streams for computation, has shown low hardware
cost and high fault tolerance compared to computations using a conventional
binary encoding. By combining ASL with stochastic computing, it is possible to
reduce the impact of random bit flips on the final results thanks to the high fault
tolerance of stochastic computing. Also, finite-state machine (FSM) based
stochastic computing elements can compute complex functions, such as the
exponentiation and hyperbolic tangent functions, which significantly simplifies the
spintronic circuit and makes it more efficient than Boolean logic. However, the FSM,
being sequential logic, cannot be directly implemented in parallel like combinational
logic, so reducing the long calculation latency is difficult. Applications in
relatively high frequency domains would require an extremely fast clock rate
when using an FSM. This work proposes a parallel implementation of the
FSM, using an estimator and a dispatcher to directly initialize the FSM to its
steady state. Experimental results show that the outputs of four typical functions
using the parallel implementation are very close to those of the serial version. The
parallel FSM scheme further shows equivalent or better image quality than the
serial implementation in two image processing applications, Edge Detection and
Frame Difference. Another issue with the FSM is autocorrelation, which changes
the randomness of the bit stream and impacts the results of relatively large
and complex stochastic circuits. We further analyze the autocorrelation, an
indicator of the randomness of a bit stream, with different FSM parameters,
including the number of states, the length of the bit stream, and the output value.
With a better understanding of the temporal correlation of the FSM output bits,
we propose a re-randomizer to solve this autocorrelation issue in FSM-based
computing elements.
Contributions: In this thesis, we have conducted a detailed study of optimizing
spintronic devices in both memory and logic circuitry design. For the memory
use case, we analyzed the impact of implementing spintronic memory caches at all
levels of the CPU memory hierarchy and proposed a write-merging L0 cache to
mitigate the performance drop caused by the spintronic memory device's long write
latency. To address the unreliability of spintronic logic devices, we propose
implementing spin logic circuits with the stochastic computing model to take
advantage of its high fault tolerance. Moreover, we propose a parallel
implementation scheme for the sequential stochastic computing element, the
finite-state machine, to further improve its performance. We also propose a
re-randomizer that solves the FSM autocorrelation issue and improves its output
quality for larger circuits. The rest of the thesis is organized as follows.
• Chapter 2 briefly presents the background of spintronic devices and stochas-
tic computing.
• Chapter 3 presents related works in the area.
• Chapter 4 demonstrates the spintronic device in the memory use case, including
the analysis of spintronic memory impact, the design and implementation
of the optimized cache hierarchy, and the interpretation of experimental results.
• Chapter 5 demonstrates the spintronic device in the logic use case, especially
when implemented using stochastic computing schemes. We further analyze
and design an optimized implementation of the stochastic computing unit,
the finite-state machine, to improve its performance and output quality.
• Chapter 6 presents a final discussion of the analysis presented in the thesis,
draws conclusions, and briefly discusses future work.
Chapter 2
Background
The scaling of electronic devices has successfully followed Moore's law for several
decades. CMOS has been shrinking its technology node every other year, but it is
starting to face serious scaling and power consumption issues. Current CMOS
devices, such as SRAM, have become unable to meet the demand for large, fast,
low-power on-chip caches in multi-core implementations. As for logic CMOS
devices, the technology node is reaching the 7 nanometer range and is close to
hitting physical limitations. Spintronic technology based on Spin-Transfer Torque
magnetics, capable of building both memory and logic circuitry, has drawn
substantial attention in recent years and quickly stands out as one of the most
promising CMOS alternatives. Specifically, Spin-Transfer Torque-Magnetic RAM
(STT-MRAM), a member of the novel non-volatile memory family, has shown great
potential for replacing on-chip caches due to its fast read latency, large capacity and
low leakage energy use. On the other hand, All-spin logic (ASL) can systematically
synthesize Boolean logic without charge-based devices, making it possible
to build pure spin circuitries.
2.1 Spintronic memory device, STT-MRAM
Spin-Transfer Torque-Magnetic RAM (STT-MRAM) offers higher density than
traditional SRAM cache, and its non-volatility facilitates low leakage power [4].
Also, STT-MRAM is one of the few candidates with similar read latency to current
SRAM technology. With this higher cell density and low leakage power, STT-MRAM
is generally considered a viable alternative to SRAM in future
on-chip caches.
However, due to its non-volatile nature, this technology suffers from high dy-
namic energy consumption, primarily due to high write power and longer write
latency [5]. The write latency of the STT-MRAM is commonly approximated as
3 to 4 times that of SRAM [6], but some consider it to be larger [7], so we perform
our analysis over a range of latencies. These characteristics seem to fit well at
the larger last-level caches of a processor, where high capacity is desirable and
longer latency is tolerated. Previous studies [8] and [9] also showed a performance
drop after directly implementing a first-level STT-MRAM cache. Indeed,
the majority of research in the area of on-chip STT-MRAM has been focused on
last-level caches in [5], [6] and [10].
To be a true replacement for CMOS, however, it would be desirable to use
STT-MRAM at all levels of on-chip cache. CMOS caches have evolved toward
deep hierarchies with multiple levels of private caches in multicore designs, since
read and write latency and power are similar. In a modern chip-multiprocessor
(CMP), multiple copies of data exist in different caches, and more data movement
occurs between caches for sharing. These extra cache updates beyond those seen
in a single-core processor increase the energy consumption. Figure 2.1 shows a
typical multicore hierarchy, highlighting the fact that multiple copies of a cache
line typically exist across the hierarchy. Data sharing requires extra data move-
ment across the hierarchy.
Figure 2.1: Typical multicore cache hierarchy, with multiple copies of data.

The significant leakage reduction potential and the extremely long write latency
of STT-MRAM motivate us to find an optimal cache hierarchy design that reduces
cache energy consumption without significantly degrading performance [11]. To
best exploit the increased density and reduced leakage power of STT-MRAM, it is
necessary to overcome the high dynamic write energy and latency of STT-MRAM
at the lower-level caches of a large CMP.
2.2 Spintronic logic device, All-spin logic
All-spin logic has been proposed to synthesize Boolean logic without charge-based
devices [2]. It is natively a majority gate logic and can implement multiple
logic functions, such as AND, OR, and NOT. Although all-spin logic has low standby
power and smaller size, it suffers from slow switching time and large switching
energy use [3]. Also, the nanomagnets can randomly flip depending on the thermal
stability factor, which can severely corrupt the output results.
The basic element of all-spin logic is an inverting/non-inverting gate, as in
Fig. 2.2a. The spin signal is transferred through a magnet from input A to output
B. When the applied VDD is positive, the gate is inverting; when the applied VDD
is negative, it is non-inverting and passes the original signal. If VDD is cut off,
the magnet does not switch and hence preserves the previous signal.

Figure 2.2: The basic all-spin logic elements. (a) All-Spin Logic Gate. (b) Majority Gate.
A majority gate can be built by combining multiple such gates, as in Fig. 2.2b.
This majority gate contains four magnets, where three of them are inputs and one is
the output. The majority of the inputs determines the output. By fixing one of
the input nodes to logic 0 or logic 1, we can easily obtain an AND or OR gate. This
majority logic can further synthesize various combinational logic.
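To make this reduction concrete, the following minimal Python sketch (our illustration of the logic, not of the ASL circuitry itself) models the three-input majority vote and shows how fixing one input yields AND and OR:

    def maj3(a, b, c):
        # Three-input majority vote, the native operation of an ASL majority gate.
        return (a & b) | (b & c) | (a & c)

    def and_gate(a, b):
        return maj3(a, b, 0)   # with a fixed 0, the output is 1 only when a = b = 1

    def or_gate(a, b):
        return maj3(a, b, 1)   # with a fixed 1, the output is 1 when either input is 1

    for a in (0, 1):
        for b in (0, 1):
            assert and_gate(a, b) == (a & b)
            assert or_gate(a, b) == (a | b)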
Current all-spin logic gates are capable of implementing simple circuits, and the
stochastic computing model fits this constraint well. Stochastic computing has been
shown to have low hardware area cost, high fault tolerance, and a short critical
path compared to computations using conventional binary encoding. Computations
based on this stochastic approach can be implemented with very simple
logic, and they can tolerate circuit unreliability through the unified data
representation [12], which mitigates the uncertainty of spintronic logic.
Combinational logic was studied in early stochastic computing. For instance,
an AND gate can be used to compute multiplication, as in Fig. 2.3.
Stochastic sequential logic using a finite-state machine (FSM) was first proposed
by Brown and Card [13] and later validated by Lilja and Li [12]. The FSM, as
in Figure 2.4, consisting of only a few D flip-flops and simple combinational logic,
is capable of approximating functions such as exponential, hyperbolic tangent
(tanh), and absolute value.
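As a behavioral illustration of the multiplication in Fig. 2.3 (a sketch of the computing model only, not of any spintronic hardware), the Python fragment below encodes two values as Bernoulli bit streams and ANDs them; for independent unipolar streams, P(a AND b) = P(a)P(b):

    import random

    def to_stream(p, length, rng):
        # Unipolar encoding: a value p in [0, 1] becomes a stream with P(bit = 1) = p.
        return [1 if rng.random() < p else 0 for _ in range(length)]

    rng = random.Random(42)
    a = to_stream(6 / 8, 1024, rng)
    b = to_stream(1 / 2, 1024, rng)
    c = [x & y for x, y in zip(a, b)]        # bitwise AND of the two streams

    # The mean of c estimates 6/8 * 1/2 = 3/8, within sampling noise.
    print(sum(a) / len(a), sum(b) / len(b), sum(c) / len(c))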
Figure 2.3: Stochastic Computing Multiplication using a single AND Gate. The
input streams 11010111 (a = 6/8) and 11001010 (b = 1/2) produce the output
stream 11000010 (c = 3/8).

Previous works already looked into the implementation of stochastic computing
circuits, including combinational stochastic logics [14], and peripheral circuits
such as random bit generators that take advantage of the spin devices' random-flipping
nature [15]. We further propose implementing the stochastic computing
sequential logic, the finite-state machine, with all-spin logic to provide more
complex functionality. With a relatively small number of spin devices, we can
even implement image applications, such as Edge Detection and Frame Difference,
using the stochastic computing model to minimize the impact of the spintronic
logic uncertainty.
However, stochastic computing incurs long latencies due to its long bit
streams [16]. This latency can be reduced by implementing parallel stochastic
units when only combinational logic is used [17]. Because the bits in combinational
logic without any feedback loop [18] are uncorrelated with each other, the
computation can be implemented in serial, distributed in time, or in parallel,
distributed in space; both have the same expected output value. On the
other hand, the FSM, as sequential logic with bits correlated in time, cannot
be directly implemented in parallel.
Currently, the length of a stochastic computing bit stream is typically 256
to 1024 bits, which means the clock frequency must be 256 to 1024 times
the sampling frequency. For example, audio applications with a sampling rate
of around 48kHz would require the stochastic computing circuit to boost
its clock rate to roughly 49MHz (48kHz × 1024). This is acceptable for
low-frequency situations, but for higher-frequency applications it will significantly
increase the hardware area and energy use.

Figure 2.4: Finite-state machine diagram for approximating the exp function. The
states S0 through S(N-1) form a saturating chain: an input bit of 1 (probability X)
moves the state toward S(N-1), and a 0 (probability 1-X) moves it toward S0. The
output Y is 1 in states S0 through S(N-k-1) and 0 otherwise, so the output stream
approximates exp(-2kX).
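The following Python sketch simulates this FSM behaviorally. It is our illustration under stated assumptions: the input X is taken to be bipolar-coded (carried by a stream with P(1) = (1 + X)/2), the state count and k are illustrative, and the quality of the exp(-2kX) approximation degrades toward the ends of the input range:

    import math
    import random

    def fsm_exp(x, n_states=16, k=2, length=8192, seed=1):
        rng = random.Random(seed)
        p_one = (1.0 + x) / 2.0          # bipolar coding of the input value X
        state = n_states // 2            # arbitrary initial state
        ones = 0
        for _ in range(length):
            # Output logic: Y = 1 in states S0 .. S(N-k-1), else Y = 0.
            if state < n_states - k:
                ones += 1
            # Next-state logic: a 1 input moves right, a 0 input moves left,
            # saturating at both ends of the state chain.
            if rng.random() < p_one:
                state = min(state + 1, n_states - 1)
            else:
                state = max(state - 1, 0)
        return ones / length

    for x in (0.1, 0.2, 0.3):
        print(f"X={x}: FSM output {fsm_exp(x):.3f} vs exp(-4X) = {math.exp(-4 * x):.3f}")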
Moreover, the finite-state machine suffers from an autocorrelation issue [13] due
to its sequential nature: its output bits no longer behave like independent draws
from a binomial model, the output is time dependent, and the random bit stream
assumption is broken. Autocorrelation, the cross-correlation of a signal with itself,
is a typical method for measuring the randomness of a sequence; an ideal random
sequence has a flat autocorrelation plot. Stochastic computing is based on the
assumption that each bit is an independent and identically distributed (iid)
Bernoulli random variable, so that output statistics such as the mean or variance
can be theoretically justified. The impact of autocorrelation can therefore affect
large stochastic networks, especially those with sequential dependencies such as
feedback structures. Feeding the output of an FSM into another FSM, or feeding
it back to the input side, can cause the circuit output value to drift away from
the expected probability.
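As a concrete illustration of this measurement (our sketch; the experimental setup used later in this thesis may differ in detail), the normalized autocorrelation of a 0/1 stream can be computed directly from its definition, and a correlated stream shows values far from zero:

    import random

    def autocorr(bits, max_lag=8):
        # Normalized autocorrelation at lags 1 .. max_lag; every value is
        # near 0 for an ideal Bernoulli stream (a "flat" plot).
        n = len(bits)
        mean = sum(bits) / n
        var = sum((b - mean) ** 2 for b in bits) / n
        return [sum((bits[t] - mean) * (bits[t + lag] - mean)
                    for t in range(n - lag)) / ((n - lag) * var)
                for lag in range(1, max_lag + 1)]

    rng = random.Random(3)
    bernoulli = [1 if rng.random() < 0.5 else 0 for _ in range(4096)]
    print(autocorr(bernoulli))   # values near 0: the stream is "flat"

    # A "sticky" stream that repeats its previous bit 90% of the time has the
    # same mean but large positive autocorrelation, much like an FSM output.
    sticky, bit = [], 0
    for _ in range(4096):
        if rng.random() < 0.1:
            bit = 1 - bit
        sticky.append(bit)
    print(autocorr(sticky))      # values well above 0 at small lags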
Chapter 3
Related Works
Spintronic devices have great potential as an alternative to CMOS devices
in both memory and logic circuits [3]. In [9], a detailed evaluation flow for the
emerging non-volatile memory technologies including STT-MRAM was described
to explore the next-generation memory hierarchy. Spintronic circuits have lower
standby power and smaller size, which is especially beneficial for mobile and
wearable devices [3].
Spintronic memory devices: Researchers have proposed implementing an
STT-MRAM L1 cache to take advantage of its larger capacity and significantly
smaller leakage power in [8], [19] and [9]. The feasibility of STT L1 data and
instruction cache implementations was evaluated, and a performance drop due to
the STT L1 data cache was observed. [8] investigated an STT-MRAM L1
implementation in a single-issue, in-order, 8-core system. A larger STT L1 in
the same area provided better performance but larger total power due to the
CMOS peripheral circuitry overhead. Li et al. [19] implemented a one-level STT-MRAM
cache in a simple embedded system with an in-order, single-core configuration.
They proposed a compiler-assisted refresh scheme for the implemented volatile
STT-MRAM, which significantly reduced the refresh frequency and minimized the
dynamic refresh energy. These studies mainly focused on evaluating the
implementation of STT caches in simple in-order CPUs rather than in
high-performance computing platforms.
To address the write power and latency problems, researchers have proposed
several techniques: decreasing the retention time [5], [7] and [10]; modifying the
cache hierarchy to use a mix of structures with different properties [4], [20], [21]
and [7]; implementing policies that limit write operations to high-power structures
[6] [22] [23] [24] [25] [26]; and using hybrid cache architectures [27] [22] [28].
Decreasing the retention time trades reliability for device area and energy at the
device level, while cache policies optimize energy consumption at the system
level. Both achieve significant energy reduction with comparatively modest
performance improvement, but require either additional logic or changes
to the cache control scheme.
Retention time can potentially be reduced in caches by reducing the MTJ
volume, since the lifetime of a cacheline can be much shorter than the typical 10
years. This would allow a reduction of the MTJ write current. Reduced retention
time was proposed and analyzed in [5] for on-chip caches on a single-core chip.
They proposed a hierarchy with an SRAM L1 cache and reduced-retention-time
STT-MRAM L2 and L3 caches, which showed an energy reduction of 70%, but at
a small performance loss. To ensure that the reduced-retention-time STT-MRAM
is reliable, they further proposed a refresh scheme similar to DRAM refresh
technology. Optimal retention times for the last-level cache were studied in [10],
settling on a retention time of about 10 ms after detailed application profiling
for CMPs. They proposed a victim-cache structure to handle cache lines that
exceed their corresponding retention time, achieving 18% and 60% improvements
in performance and energy, respectively. However, because a bit flip could happen
at any time during a cacheline's lifetime, an error correction coding scheme should
be introduced into the data checking procedure before refreshing [29].
Although reducing retention time can potentially decrease cacheline write
energy and latency, extra error-handling units must be added to maintain cache
reliability. This scheme reduces STT-MRAM dynamic energy in a way that is
orthogonal to our hierarchy scheme, leaving an opportunity to combine the two
in the future.
Implementation of STT-MRAM across the entire cache hierarchy, including the
L1 cache, was considered in [7]. That work implemented low-retention devices in the
L1 cache with a dynamic refresh scheme and further proposed a mixture of retention
times in the last-level cache. By using a data migration scheme, read-intensive data
and write-intensive data can be allocated to regions with different retention times,
which gives a 6.2% performance improvement and a 40% energy improvement over
the single-level relaxed-retention scheme. A read-write-aware hybrid cache hierarchy
was presented in [27], where the cache is divided into a Write section based
on SRAM and a Read section based on non-volatile memories including STT-MRAM.
They suggest an intra-cache data movement policy that produces an
overall power reduction of up to 55%, in addition to a 5% performance improvement
over the baseline SRAM L2 and L3 caches. A novel management policy using
a hybrid cache design was shown in [28] that aims at improving cache lifetime by
reducing write pressure on STT-MRAM caches. They show a 50% reduction in
power for an L2 shared cache along with a substantial improvement in cache
lifetime.
Hybrid schemes require complex control units to dispatch requests to different
cache devices. Our scheme, which only implements a standard cache level, avoids
these extra control units. This keeps our design simple and straightforward, and
lets it leverage existing schemes.
The idea of a small, fully-associative cache was first proposed in [30] to remove
mapping conflict misses in a direct-mapped cache by putting it in the refill path.
To reduce microprocessor energy use, [31] proposed a small direct-mapped
cache as a filter cache on the core side, which achieved almost 60% energy reduction
with a 20% performance drop. In [22], a read-preemptive write buffer of 20 entries
on the memory side was proposed to reduce read stalls to the STT-MRAM L2
cache during long write operations by implementing rules that favor read
operations. The write requests to an STT-MRAM L1 cache, however, mainly come
from the CPU, which issues stores at a much higher rate than memory-side
cacheline fills arrive. Also, the STT-MRAM L1 suffers from longer read access
latency due to the longer MTJ sensing time. In this thesis, the small fully-associative
cache is placed on the core side, first to improve the bandwidth of data
flowing to the STT-MRAM L1 cache by merging processor writes into cacheline
writes, similar to the write aggregation schemes in [32, 33], and also to provide
faster overall cache access due to its simplicity and small capacity.
Spintronic logic devices: Nikonov and Young proposed a new benchmarking
of beyond-CMOS exploratory devices, including various magnetoelectric, spin-torque,
and ferroelectric devices [3]. Behin-Aein et al. first proposed the all-spin
logic (ASL) device, which uses spin at every stage of operation rather than a
mixture of spin and charge-based devices [34]. A Functional Enhanced All Spin
Logic (FEASL) was proposed in [35] to enable the design of large Boolean logic
blocks. Pajouhi et al. further proposed a systematic methodology to synthesize
ASL circuits [2]. They identified that ASL requires large currents for fast switching
speeds, which causes a static power dissipation issue. The short spin-flip length in
the interconnects of ASL also becomes a key bottleneck. Moreover, spintronic
nanomagnets can flip on their own depending on their thermal stability level [36],
which can cause significant unreliability. Researchers have further explored
implementing spintronic devices in non-Boolean logic, such as stochastic computing
[14] and neural networks [37]. Stochastic computing, with its high error tolerance
and extremely simple hardware structure [12] [38], can effectively mitigate the
key drawbacks of spintronic devices. Stochastic computing combinational
logic and peripheral circuits such as random bit generators have been implemented
with spintronic logic devices in [14] [15], while stochastic computing
sequential logic such as the finite-state machine requires the implementation of
flip-flops, which were studied in [39] [40] [41] [42] [43]. All of these previous
attempts used both spin and charge-based devices in their designs.
Stochastic computing: Since the early works of Gaines [44], researchers have
employed stochastic computing algorithms in various areas including neural
networks [45] [46] [16] [47] [48] [49], signal processing [50] [51] [52] [53] [54], and
image processing applications [12] [55]. Qian et al. [56] proposed a synthesis
method using Bernstein polynomials to approximate functions with only
combinational logic. However, such synthesis requires multiple uncorrelated random
input bit streams, which increases the hardware cost. Besides, to achieve
a higher accuracy, the degree of the polynomial has to increase, which causes
the number of input sources to grow even larger. Functions such as the
exponential cannot be efficiently implemented this way due to the large hardware cost.
Brown and Card [13] proposed the sequential-logic FSM. Li and Lilja [12] later
validated the mathematics of the FSM and proposed systematic methods to
synthesize it and implement it in various image processing applications [55]. With
very limited hardware cost, the FSM is capable of approximating functions such
as absolute value, exponential, and tanh, and it is widely used in various
applications [45] [12] [16].
Although these applications benefit from the low hardware cost and high fault
tolerance of stochastic computing, the long sequence of bits required to reduce the
variance of the output estimate creates long latency and a significant performance
drop compared with conventional implementations. A parallel implementation
of combinational logic is proposed in [17] that achieves higher computing accuracy
and faster processing speed by using a nibble-serial data organization,
but this method applies only to combinational stochastic logic, not sequential
elements such as the FSM. Wang et al. [18] further studied the impact of feedback
loops on stochastic circuits. A re-randomizer is proposed to break the correlation
introduced by the feedback loop. Because the re-randomizer generates the bit
stream for the next real-domain clock value, it must preserve the equivalent
precision and use all the bits from the previous value to generate the next one,
which causes a delay of one real-domain clock. Pixel-level parallelism, which
requires a large array of stochastic computing units to compute the entire image,
is proposed in [12] to speed up application processing time. Although this method
can improve throughput, the calculation latency for each pixel remains the same.
Another issue with the FSM is that its output no longer follows a random
binomial distribution; instead, the output shows autocorrelation [13], the cross-correlation
of a signal with itself and a typical measure of the randomness of a
sequence. Saeed et al. [52] also mention the autocorrelation problem in their LDPC
stochastic decoders due to the feedback structure in their design. An ideal random
sequence should have a flat autocorrelation plot. Stochastic computing is based on
the assumption that each bit is an independent and identically distributed (iid)
Bernoulli random variable, so the output statistical characteristics such as the mean
or variance can be theoretically justified. However, the FSM does not produce such
a flat autocorrelation plot, indicating that the randomness assumption is violated.
Two possible re-randomizers are proposed in [52] to break the correlation within a
bit stream. One of them, called an Edge Memory, uses an M-bit shift register with
a selectable bit to produce new bit streams; it requires a short period, otherwise
it incurs a large hardware overhead. The other method uses an up/down counter
to integrate the input bits and simultaneously regenerate a new bit stream. We
followed the latter method to design our proposed re-randomizer.
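As an illustration of this up/down-counter approach, the Python sketch below gives one plausible realization (our assumption for illustration; the design proposed later in this thesis may differ in detail). The counter counts up on incoming 1s and down on emitted 1s, so ones are conserved and the stream's mean is preserved, while each output bit is drawn with a fresh random comparison that breaks the input's temporal correlation:

    import random

    def rerandomize(bits, counter_bits=6, seed=7):
        rng = random.Random(seed)
        levels = 1 << counter_bits       # counter range [0, levels)
        counter = levels // 2            # arbitrary starting value
        out = []
        for b in bits:
            counter = min(counter + b, levels - 1)    # integrate incoming 1s
            # Emit a 1 with probability counter/levels using a fresh random
            # draw; emitting a 1 consumes one count, conserving the mean.
            if rng.randrange(levels) < counter:
                counter -= 1
                out.append(1)
            else:
                out.append(0)
        return out

    # e.g. out = rerandomize(fsm_output_bits), where fsm_output_bits is any
    # autocorrelated 0/1 list (the name is hypothetical).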
Chapter 4
Incorporating Spintronic Devices
in CPU Caches
With higher density, low leakage power, and read latency similar to current SRAM
technology, STT-MRAM is generally considered a potential candidate to replace
SRAM in future on-chip caches. However, STT-MRAM suffers from high dynamic
energy consumption, due to the large write current and the long write pulse
needed to switch spin directions. This long latency is tolerated at the larger
last-level caches of a processor, but it causes a significant performance drop if
first-level caches are replaced directly [8] [9]. To overcome this performance
degradation and take advantage of the significant leakage reduction and large
capacity, we proposed an optimized cache hierarchy design [11]. We utilize a
novel physics-based model of the Magnetic Tunnel Junction (MTJ) to develop size
and energy models of STT-MRAM cells [57]. Because the usage of first-level and
last-level caches is quite different, we have evaluated different circuit-level tradeoffs
between MTJ read and write energy to find optimal design points for energy and
performance. A drop-in replacement of CMOS with STT-MRAM exposes a
fundamental mismatch between the bandwidth of data being written by the processor
and the ability of the STT-MRAM cache to absorb it. By introducing a small, fully-associative
Level-0 (L0) cache, this bandwidth mismatch can be accommodated.
This structure also benefits cache dynamic energy consumption, since it is so small
that both its static and dynamic energy use are quite low. This is an extension of
the analysis in [58], which analyzed the effectiveness of a small L0 of various sizes
compared to a simpler two-level CMOS hierarchy.
In this chapter, we present a detailed analysis of the impact of high write latency
at the L1 cache level, including the tradeoff between the read and write energy and
latency of STT-MRAM caches. We further demonstrate the benefit of the write-merging
L0 cache for both the performance and the energy consumption of a
fixed-core-count system. Finally, an analysis of scalability with an increasing
number of cores is conducted to compare CMOS caches to STT-MRAM caches.
4.1 Simulation Methodology
A storage cell using STT-MRAM is depicted in Figure 4.1(a). The
bit-cell consists of an access transistor and a storage element that uses a Magnetic
Tunneling Junction (MTJ), known as a 1T1MTJ cell. The MTJ consists of a
fixed layer and a free layer, separated by a thin insulator that allows a tunneling
current to flow when biased. The material used for the fixed and free layers has
two stable spin directions, with the spin direction of the fixed layer locked. The
free layer can have either the same spin orientation as the fixed layer, known
as Parallel or P (Fig. 4.1b), or the opposite orientation, known as Anti-Parallel
or AP (Fig. 4.1c). Resistance through the device is higher in the AP state, so a
read operation consists of sensing the high or low resistance value. For a write
operation, current passing through the device in one direction gives the free layer
the P orientation, while passing it in the other direction creates the AP orientation.
There is a critical minimum write current which must be maintained for an
adequate period of time to allow complete switching, leading to the longer latency
of write operations. The access transistor must be sized to provide a sufficient
switching current, and a higher current (above the critical value) enables a shorter
switching delay. The access transistor is
usually larger than the MTJ, so there is a tradeoff between bit-cell area and
switching time. A larger bit-cell can have a lower switching time and lower energy,
but it creates a larger array for the same storage capacity, leading to longer wires
and higher energy requirements at the array level.

Figure 4.1: A STT-MRAM 1T1MTJ bit-cell (a), showing the access transistor
and MTJ storage element. In the Parallel (P) state (b), resistance through the
device is lower than in the Anti-Parallel (AP) state (c).
4.1.1 Technology Modeling
A combination of SPICE and Cacti [59] simulations was used to develop the
technology models in this analysis. SPICE was used for bit-cell simulations, and
these results were entered into Cacti for array modeling. The SPICE models
developed for [57] were used to simulate MTJ switching energy and transistor
sizing for the write pulse widths of interest. Retention times of 10 years as well
as 1 year were simulated for a 20nm Predictive Technology Model [60]. A
6σ methodology is used in the SPICE models to eliminate defects from process
variations, especially for STT sensing and write delay.
The bit-cell and transistor sizes from these simulations were then used in a
modified Cacti to generate array energy and timing values for various operations.
Since Cacti does not support STT-MRAM modeling, we modified the original
Cacti SRAM model to simulate the STT-MRAM MTJ cell by changing the cell
width and aspect ratio to fit a 1T1MTJ cell and setting all cell leakage power to 0.
This model treated the STT-MRAM like an SRAM cache with a smaller cell size
and different cell dimensions. This approach created a conservative estimate of
the STT-MRAM cache array, since there is potential to further optimize the
STT-MRAM circuitry. We also evaluated the energy and timing values produced
by the NVSim [61] non-volatile memory simulator and found that our approach
produced slightly more conservative parameter values than NVSim, although the
results were comparable. The bit-cell write energy from the SPICE simulation,
multiplied by the cache line size, was then added to the dynamic write energy
value from Cacti to estimate the STT-MRAM cache write energy; a worked check
of this calculation follows the tables below. Since not every bit in a cache line
switches on a write operation, this gives a large, and thus conservative, estimate
of the write energy. Table 4.1 lists the values gathered from the SPICE simulations.
We observed that as the transistor size is made smaller, the bit-cell write energy
increases but the array-level dynamic read and write energy decreases. The total
write energy trends are in different directions for the L1 and L2 cache arrays, as
shown in Figure 4.2. L1 and L2 caches also have different read and write access
patterns, with an L1 cache typically seeing a higher percentage of read operations,
so the optimal design point can differ between the two. For the L1 cache, the
optimal point is somewhere in the middle, around the 5ns write region, since write
energy grows quickly to the left and read energy grows to the right. However,
since system performance is very sensitive to the L1 cache write pulse latency, a
short write is still preferred. For the L2 cache, the trends are similar, so the
optimal point there is in the long-write region. Previous work [10, 4] has shown
that L2 write latency has little effect, so we pick 7ns in this chapter. The CMOS
cache energy parameters were modeled directly with Cacti, as listed in Table 4.2.
Table 4.1: Energy consumption parameters for the STT cache structures.

size                       |      64kB      |     128kB      |     256kB      |      4MB       | 16MB | 32MB | 64MB
MTJ transistor size (F^2)  |  24   48  144  |  24   48  144  |  24   48  144  |  24   48   84  |  24  |  24  |  24
MTJ latency (ns)           |   7    5    3  |   7    5    3  |   7    5    3  |   7    5  3.6  |   7  |   7  |   7
MTJ flip energy (fJ)       | 508  418  378  | 508  418  378  | 508  418  378  | 531  432  397  | 531  | 531  | 531
read (pJ)                  | 15.6 17.8 30.1 | 18.6 21.4 31.0 | 24.3 28.4 48.4 | 152  190  223  | 255  | 406  | 766
WrtAccess (pJ)             | 19.1 22.7 47.6 | 25.2 30.2 44.9 | 37.4 45.6 62.7 | 173  269  315  | 286  | 448  | 949
WrtCell (pJ)               | 260  214  194  | 260  214  194  | 260  214  194  | 272  221  203  | 272  | 272  | 272
Wrt (pJ)                   | 279  237  241  | 285  244  238  | 298  260  256  | 445  490  518  | 558  | 720  | 1221
leakage (mW)               | 1.7  1.9  2.6  | 2.6  2.8  3.3  | 4.4  4.6  6.7  | 64.2 98.2 126  | 232  | 560  | 1048
Table 4.2: Energy consumption parameters for the CMOS cache structures.
Leakage is in mW.

size      1kB    4kB    32kB   64kB  4MB   8MB   16MB
rd (pJ)   6.9    7.2    28.3   35.0  293   520   1003
wrt (pJ)  7.6    10.7   30.6   40.2  344   512   1114
leakage   0.76   1.33   13.45  28.2  736   1440  2984
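As the worked check promised above, the write-energy composition can be reproduced from the 64kB, 24F^2 column of Table 4.1, assuming a 64-byte (512-bit) cache line (our assumption, consistent with the table's numbers):

    flip_energy_fJ = 508                         # per-bit MTJ switching energy (SPICE)
    wrt_cell_pJ = flip_energy_fJ * 512 / 1000    # ~260 pJ, matching the WrtCell row
    wrt_total_pJ = wrt_cell_pJ + 19.1            # plus the WrtAccess array energy (Cacti)
    print(wrt_cell_pJ, wrt_total_pJ)             # ~260 pJ and ~279 pJ, matching the Wrt row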
Figure 4.2: The different trends in read and write energy for MTJ cells used in
L1 (left) and L2 (right) caches, plotted against MTJ transistor size (F^2) and
write latency (ns). The Write total is a combination of Write cell and Write access;
Write cell is the per-bit switching energy from SPICE and Write access is the
array access energy reported by Cacti.
4.1.2 Architectural Simulation
Complete architectural simulations were used for both design-space comparisons
and scalability analysis. The simulations were performed with the Sniper
simulator [62] running the Parsec [63] and Splash2 [64] benchmark suites. The two
benchmark suites have fundamentally different properties: Splash2 focuses
more on high-performance computing, while Parsec includes a wide range of
applications [65]. A combination of the two improves the benchmark program
diversity. The Sniper simulator is based on an analytical model that estimates
performance by analyzing intervals. This model achieves a 10x simulation
speedup with relatively high accuracy [62]. The speedup gave us the ability to
directly measure the complete benchmark execution time and avoid any sampling
scheme, which is not ideal for multi-threaded benchmarks [66]. The simulator
was modified to properly model cache write latency in all relevant operations.
We modeled the STT-MRAM cache as in Fig. 4.3 in both the Sniper and Gem5 [67]
simulators. The charging write pulse blocks a subsequent write request,
reducing bandwidth. The read and write paths are separated so that a write in
progress does not also block a consecutive read operation, which avoids a further
performance drop [68]. The STT-MRAM model has more detail in Gem5 than in
Sniper: the Ruby memory model in Gem5 provides the flexibility to examine the
modified coherence transitions that accommodate the write-blocking mechanism of
the STT-MRAM, which validates this model. Fig. 4.4 shows one typical
coherence transition modification. NP and S are stable states, meaning empty
and shared, respectively. IS and IS_S are intermediate states, representing the
transition from NP to S and the transition from IS to S, respectively. To simulate
write blocking, we added the IS_S state to block all write requests to the same
cacheline until it reaches the S state after the write pulse latency.
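A toy version of this write-blocking behavior (our sketch of the mechanism described above, not the actual Sniper or Gem5 code) can be expressed as a port that stays busy for one write pulse per request:

    class STTWritePort:
        # A write occupies the port for one write pulse; a write arriving
        # while the port is busy stalls until the previous pulse completes.
        def __init__(self, write_pulse_ticks):
            self.write_pulse = write_pulse_ticks
            self.busy_until = 0                       # tick at which the port frees up

        def write(self, cur_tick):
            start = max(cur_tick, self.busy_until)    # stall while the port is busy
            self.busy_until = start + self.write_pulse
            return self.busy_until - cur_tick         # observed write latency

    port = STTWritePort(write_pulse_ticks=10)         # e.g. a 5ns pulse at 2GHz
    print(port.write(100))   # 10: the port is idle, only the pulse latency
    print(port.write(103))   # 17: arrives mid-pulse and stalls 7 extra ticks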
A range of cache hierarchies with various capacity, access and write latencies
were simulated, using the MESI cache coherence protocol with strict inclusive
policy. Table 4.3 lists the processor configurations used in the simulations. All
based on a four-wide out-of-order execution model running at 2GHz. Structures
23
NP IS SL1_GETS DATAAllocate TBEAllocate L2
Deallocate TBEsendDataToRequestor
writedataToCache
NP IS IS_SL1_GETS DATA
Allocate TBEAllocate L2
sendData
SWD_Time
Deallocate TBEwritedataToCache
Block state
Check Resource Available: fail; success
Bank0
curTick: 10003
busyTick: 10010
Bank0
curTick: 10000
busyTick: 9994
busyTick: 10010
Add write pulse
latency
trigger
Original Transition
Modified Transition
Figure 4.3: STT-MRAM cache is modeled to have separated read/write port. Theread port works the same as the CMOS SRAM cache, but the write port blocksa consecutive write operation when it is not available.
Cache
Rd/Wrt Port
Rd/Wr Request
Request SentEvery Cycle
CMOS SRAM Cache
Data SentAccess Latency
Cache
Rd Port
Rd Request
Same As
CMOS
STT-MRAM Cache
Wrt Port
Wrt RequestStall When
Wrt Port Not Avail
Request SentEvery
Write Pulse Latency
Figure 4.4: The coherence transition diagram modeled in Gem5 to accommodatethe write blocking mechanism of STT-MRAM. Instead of having one intermediatetransition state IS, we added another transition state IS S to simulate the blockingstate of the STT-MRAM, no write requests can be served in this state unless itfinished its transition to S state.
24
Table 4.3: Simulated Processor Configurations.
Parameter Values
Pipeline 4-wide, out-of-orderL1 ICache 64KB, 2-way
ROB entries 128Memory 45ns latency, 7.6GB/s bandwidth
such as the reorder buffer (ROB) were deliberately set on the large side to remove them as possible hidden bottlenecks in the simulations. The instruction cache was implemented as a 64KB CMOS cache in all configurations to minimize its impact.
All data is reported only for the parallel region-of-interest (ROI). The number of processors was varied from 4 to 16. Typical two-level and three-level cache hierarchies were analyzed, with a shared last-level cache and private cache(s) per core. A crossbar interconnect joined all cores. Table 4.4 lists the values of the system parameters simulated for each benchmark. The access latency in this table refers to the time for a cache access to complete or for the requested data to return, which typically corresponds to the read latency. The write latency refers to the STT-MRAM write pulse delay. We assumed the access latency of the STT L1 cache is 25% to 50% longer, and the STT L2 cache 20% faster, than the corresponding CMOS cache, according to [69]. The longer sensing time of the STT-MRAM due to read disturbance significantly increases the access latency of a smaller cache, but the shorter interconnect delay makes it faster for larger caches. The L2 cache associativity of each configuration was larger than the sum of all L1 cache sets to avoid cache misses due to the inclusive policy. The large datasets of both suites were used for all simulations. The native datasets were used on a few benchmarks to show that the scalability observations remain valid on real workloads.
Table 4.5 shows the main configurations used in the following graphs. The names in the left-hand column are how these configurations will be referred to and labeled in the graphs. We picked a 7ns write latency for the STT L2 and 5ns for the STT L1 cache due to performance and energy concerns. The MTJ transistor sizes for L2 and L1 are 24F² and 48F², as in Figure 4.2, while a common 6T SRAM cell can be 135F² [69]. All stt configurations used an STT L2 cache. Conservatively, we estimated the STT L2 to have four times the capacity of the cmos-base L2, and the STT L1 to have two to four times the capacity of the CMOS L1, in the same footprint. Specifically, l1d2 means the STT L1 has twice the density (same footprint) of the CMOS L1, and l0z4 means the CMOS L0 size is 4KB.

Table 4.4: Simulated STT-MRAM Cache Parameters. For writes, the access latency is added to the write latency.

Parameter               Values
Access Latency (CMOS)   L0 1 or 3, L1 4, L2 30 cycles
Access Latency (STT)    L1 5 or 6; L2 24 cycles
Write Latency (STT)     L1 5ns (optional 3, 7ns); L2 7ns
L0 DCache               1K, 4K fully-associative, private
L1 DCache               64K, 128K, 256K 4-way associative, private
L2 Cache (4 cores)      4MB, 16MB 16-way associative, 8 banks
L2 Cache (8 cores)      8MB, 32MB 32-way associative, 8 banks
L2 Cache (16 cores)     16MB, 64MB 64-way associative, 8 banks

Table 4.5: Simulated Cache Hierarchies.

Configuration Name   L0 Sz     L1 Sz      L2 Sz
cmos-base            -         CMOS 64K   CMOS 4MB/8MB/16MB
stt-l2               -         CMOS 64K   STT 16MB/32MB/64MB
stt-l1d2             -         STT 128K   STT 16MB/32MB/64MB
stt-l1d4             -         STT 256K   STT 16MB/32MB/64MB
stt-l0z1             CMOS 1K   STT 128K   STT 16MB/32MB/64MB
stt-l0z4             CMOS 4K   STT 256K   STT 16MB/32MB/64MB
4.2 Performance, Energy and Scalability
Performance, energy consumption, energy-delay product, and scalability of the cache hierarchy are the primary metrics of interest. Performance here is measured as total execution time. Energy refers to the overall CPU cache energy consumption, not that of the entire CPU. Scalability refers to the speedup obtained with different numbers of cores. In this section we examine the experimental results of our proposed STT-MRAM cache hierarchy compared to a CMOS baseline. The comparison of performance and energy results is based on a fixed number of cores across the different hierarchies. We explore the scalability of the proposed hierarchies from 4 to 16 cores to investigate the STT-MRAM impact. Finally, we show the performance and energy impact of implementing a larger L0 cache.
4.2.1 Performance
The challenge of using STT-MRAM for lower level caches (closer to the CPU)
is to overcome the added write latency and dynamic write energy. The reward
is increased density and significantly reduced leakage power, which comprises the
majority of the power consumed in a CMOS cache hierarchy. Long write latency
at a first-level cache creates a bandwidth mismatch between the processor pipeline
and the cache. Queueing and buffering can only absorb a certain amount of write
data before the processor must stall, if the cache system cannot keep up with the
offered load. Figure 4.5 (a, b) shows the impact of write latency on performance
when the CMOS L1 cache is replaced by STT-MRAM. Extra capacity in the L1
made possible by the higher density of STT-MRAM cannot compensate for the
reduction in write bandwidth seen by the processor.
A method is needed to match the bandwidth between the CPU and the L1 cache without significantly increasing cache energy consumption. Augmenting the STT-MRAM L1 with a small fully-associative CMOS Level-0 (L0) cache was investigated in [58] and found to be an effective method to restore the performance lost to the higher write latency. The L0 cache acts as a write-merging buffer, translating single-word writes from the CPU into cache line writes to the L1. If enough of the processor traffic is handled by this L0, performance can be restored. We have implemented this structure as a standard write-back cache, so it uses a standard cache controller with none of the extra functionality that a hybrid cache or low retention-time cache would require. By keeping the L0 as small as
possible, access time and leakage power are kept low.

Figure 4.5: (a), (b) compare the cmos-base (64K,4M) to STT hierarchies (128K,16M); the write latency of the L1 cache results in a significant performance drop. (c), (d) compare the cmos-base to STT hierarchies that use the write-merging L0; the hierarchy uses a 4K fully-associative L0 cache and an STT L1 cache with various write latencies.

Figure 4.5 (c, d) demonstrates the improvement from using the L0. When the L0 is 4KB, the average miss rate can be as low as 2%, as shown in Figure 4.6. The performance of the STT-MRAM two-level hierarchy differs by almost 50% among the 3, 5 and 7ns write latencies, while the performance difference of the three-level hierarchy with the L0 shrinks to 15%. The three-level hierarchy performs 40% better on average than the two-level hierarchy with a 5ns write latency.
4.2.2 Energy Comparison
By reconfiguring the cache hierarchy to contain as little CMOS circuitry as pos-
sible, leakage power is reduced significantly. With the L0 write-merging cache as
the only all-CMOS structure, the L1 and LLC can be configured for larger capacity in a given area. We have simulated several different combinations of cache sizes at all three levels. Figure 4.7 (a, b) shows the potential energy savings with this three-level configuration. All graphs in this section use the simlarge dataset to stress cache capacity as much as possible in simulation. SRAM L2 cache leakage and SRAM L1 cache leakage account on average for approximately 80% and 10% of the total cache energy consumption, respectively. The total energy use drops by almost 60% after adopting the STT L2 cache (stt-l2), and drops another 10% with the 4KB L0 STT configuration (stt-l0z4). Total energy savings with the small L0 implementation are on average approximately 70% for both the Parsec and Splash2 benchmark suites. Figure 4.7 (c, d) shows the dynamic energy consumption of various STT hierarchies, omitting L0, L1 and L2 leakage. Different L0 and L1 sizes are shown. The large cross-hatched segment in the middle of the 1KB L0 (stt-l0z1) bars represents cache line writes to the L1. With the smaller L0, there is a large number of L0 write-backs of modified data to the L1. This segment decreases rapidly as the L0 size increases.

Figure 4.6: The near-core cache miss rate. Though the 1KB fully-associative L0 cache does have a large miss rate, the 4KB cache's average miss rate is less than 5%.
4.2.3 Energy-Delay Product
Though the tradeoff between energy use and performance is common in system design, improving both at the same time rarely happens. To show the overall merit of the hierarchy design, we use the energy-delay product (EDP) as a metric that highlights the most balanced architecture between energy efficiency and performance [70]. Figure 4.8 (a, b) shows the EDP of a system with four cores. On average, for both benchmark suites, the 4KB L0 (stt-l0z4) shows a significant 65% EDP reduction over the baseline (cmos-base). The 4KB L0 (stt-l0z4) also shows an approximately 25% reduction over the configuration that only replaced the CMOS L2 with STT (stt-l2); this corresponds to approximately an additional 10% reduction relative to the CMOS baseline. Figure 4.8 (c, d) shows the EDP of a system with 16
cores. The observations for four cores still hold for the 16-core system.

Figure 4.7: (a, b) show the total energy consumption normalized to cmos-base (64K,4M) for stt-l2 (64K,16M), stt-l1d2 (128K,16M), and stt-l0 with varying L0 sizes (1/4K, 128K, 16M). The CMOS L2 leakage is computed for a 4MB STT-MRAM cache to create a fair baseline. (c, d) show the dynamic energy consumption with the same configurations as in (a, b).

Figure 4.8: (a, b) show the energy-delay product of various STT hierarchies with four cores. (c, d) show the energy-delay product with 16 cores. All are normalized to cmos-base.

Table 4.6: Cache capacity impact on performance and cache reuse of canneal with the simlarge dataset. The execution time of each configuration is normalized to 4 cores with a 4MB L2 cache. The percentage is the L2 cache miss rate.

cores   4MB       8MB           16MB          32MB          64MB           128MB
4       1 (43%)   0.84 (38%)    0.68 (31%)    0.61 (25%)    0.477 (20%)    0.476 (18%)
8       -         0.477 (38%)   0.41 (31%)    0.32 (22%)    0.30 (16%)     0.29 (12%)
16      -         -             0.36 (31%)    0.19 (22%)    0.16 (12%)     0.15 (8%)

In
summary, the 4KB L0 hierarchy achieves, for both benchmark suites, an average 65% EDP reduction over the CMOS baseline and a 25% EDP reduction over the configuration with only an STT-MRAM L2. The 65% EDP reduction comes mainly from the energy reduction, since the STT L2 has little impact on benchmark performance due to the low frequency of L2 write accesses. Moreover, since the L1 SRAM cache leakage grows to 25% of the total once the STT L2 cache is implemented, we could potentially achieve a further significant energy reduction by switching the SRAM L1 to an STT L1 as well. But due to the long write latency and the much higher write access frequency at L1, such a direct replacement results in much slower execution and larger dynamic energy use, as in Figure 4.7 (c, d). By implementing the small L0, a significant number of CPU writes are absorbed, so the write access frequency to the STT L1 becomes small, further reducing dynamic energy use and improving performance. Besides, the small L0 can potentially provide faster CPU-side cache access than the STT L1, which suffers from a longer sensing time. The 25% EDP reduction over the STT L2 implementation is thus achieved by this small L0 scheme.
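Since the EDP is simply the product of energy and delay, the normalization used in these graphs reduces to a one-line computation; a minimal sketch (the name and the numbers here are illustrative, not measured data):

    def normalized_edp(energy, delay, base_energy=1.0, base_delay=1.0):
        # Energy-delay product of a configuration, normalized to the baseline.
        return (energy * delay) / (base_energy * base_delay)

    # Illustration: a hierarchy using 0.35x the cache energy at roughly the
    # baseline delay yields a ~65% EDP reduction, as reported for stt-l0z4.
    print(1 - normalized_edp(0.35, 1.0))  # 0.65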
4.2.4 Scalability
To analyze the scalability of the workloads, we have used full simulation runs with
the large and native simulation datasets. Figure 4.9 shows the scalability from 4
to 16 cores of both Parsec and Splash2 benchmark suites. Only the slopes of the
lines are being compared here among different configurations. With the simlarge
dataset, both suites show that the STT-MRAM does not significantly impact the
scalability over the cmos-base. Canneal and facesim from Parsec show better scalability than the other benchmarks. According to [71], canneal is limited primarily by memory latency rather than bandwidth due to low data reuse. But this observation was made with a relatively small LLC, while Tables 4.6 and 4.7 show that a larger STT cache has better cache reuse for both the simlarge and simnative input datasets. The simlarge and simnative input datasets of canneal are 256MB and 2GB, respectively. Since it is possible that the working set of simlarge fits in a 64MB STT L2 while that of simnative does not, we further investigated the simnative input dataset of canneal to see whether the larger-L2 benefit persists. The scalability with the simlarge input dataset reaches a maximum at 32MB, while the scalability with the simnative input dataset keeps increasing with the L2 cache capacity. This means that with a real workload, a larger L2 cache (up to 128MB) could improve the scalability of canneal. We observed poor scalability from swaptions of Parsec and raytrace of Splash2. Freqmine only runs single-threaded due to the lack of OpenMP support in the Sniper simulator. In general, the STT-MRAM hierarchy with the 4K L0 (stt-l0z4) scales as well as the cmos-base, with a few cases, including facesim, canneal and cholesky, showing significant scalability improvements.

Figure 4.9: The scalability of the various cache hierarchies from 4 to 16 cores, including the two-level cmos-base (64K,4/8/16M), stt-l2 (64K,16/32/64M) and stt-l1d2 (128K,16/32/64M), and the three-level stt-l0z1 (1K,128K,16/32/64M) and stt-l0z4 (4K,128K,16/32/64M). The three-level hierarchy has similar scalability to the two-level cmos-base, but canneal and facesim in particular show better results than the others.

Table 4.7: Cache capacity impact on performance and cache reuse of canneal with the simnative dataset. The execution time of each configuration is normalized to 4 cores with a 4MB L2 cache. The percentage is the L2 cache miss rate.

cores   4MB       8MB           16MB          32MB          64MB          128MB
4       1 (94%)   0.94 (90%)    0.84 (83%)    0.73 (70%)    0.58 (53%)    0.49 (35%)
8       -         0.67 (90%)    0.61 (82%)    0.51 (70%)    0.38 (53%)    0.27 (35%)
16      -         -             0.55 (83%)    0.47 (70%)    0.35 (53%)    0.22 (34%)
4.2.5 Larger L0 impact
It is clear that the implementation of this small fully-associative L0 achieves a good tradeoff between performance and energy use. We further investigated a larger L0 to see whether a better EDP could be achieved. A 32kB 4-way associative and a 64kB 8-way associative L0 were implemented into the same hierarchy, as in Figure 4.10 and Figure 4.11. With caches this large, full associativity becomes too costly in dynamic energy consumption and access delay. Figure 4.10 shows an average 10% performance improvement, but 20% to 40% more energy use, after adopting these larger L0 caches. The overall EDP increases by more than 20% in both benchmark suites in Figure 4.11. Because the 4kB L0 already has a low average miss rate of 2% in Figure 4.6, there is little room for performance improvement from simply increasing the L0 size, while static leakage and dynamic energy increase significantly. The 4K L0 remains the better design point for trading off performance and energy use.

Figure 4.10: Performance and energy use of several configurations from the CMOS baseline, stt-l1d2 and stt-l0z1 through stt-l0z64, where stt-l0z32 and stt-l0z64 have 32kB 4-way associative and 64kB 8-way associative L0 caches. The average performance improves by less than 10% while the energy use increases by more than 25% from stt-l0z4 to stt-l0z64.

Figure 4.11: The energy-delay product of several configurations including the larger L0 implementations. The overall energy-delay product increases by 20% on average for both benchmark suites.
4.3 Summary
We have analyzed the impact of STT-MRAM as a replacement for CMOS at all
levels of a multiprocessor cache hierarchy. Though STT-MRAM has higher write
energy and latency, reducing these parameters at the circuit level does not lead to
an optimal design. The extra circuit area required to minimize MTJ bit-cell write
time and energy causes the cache arrays to grow, leading to higher read energy
and latency due to parasitic effects.
A fully-associative L0 cache as small as 4KB can effectively restore performance
lost to the higher write latency. This structure hides the extra write latency of
around 5ns when running at 2GHz, giving a total cache energy savings of 40-70%
and an average energy-delay product reduction of 60% compared to the CMOS
baseline. The L0 cache is implemented as a standard cache level, requiring no
additional control structures. We observed no significant scalability impact from using STT-MRAM with the L0 implemented. A few benchmarks show improved scalability up to 16 cores using the STT-MRAM hierarchy.
The introduction of new memory technologies can have significant impacts on
the best architectural choices for the memory hierarchy of a multicore system. This
chapter shows that simple solutions can help mitigate the negative impacts while
still allowing the system to take advantage of the benefits of the new technology.
Chapter 5
Incorporating Spintronic Devices
in Logic Units
All-spin logic (ASL) is capable of synthesizing Boolean logic using majority gates without charge-based devices [2]. It has low standby power and a much smaller size, which fits the needs of the growing mobile and IoT applications. However, the slow switching time and large dynamic energy consumption significantly impact the performance of ASL. Moreover, random bit flips of the nanomagnets greatly impact the output results of circuits using such devices. The basic element of ASL, the majority gate, can be efficiently used in combinational stochastic computing circuits, which provide high fault tolerance and low hardware cost. We also propose to implement sequential stochastic computing logic, the finite-state machine (FSM), which further expands its functionality and can be used in multiple image applications. One of the problems of stochastic computing is its long calculation latency. To improve the overall performance and reliability, we propose a parallel implementation scheme for the FSM to reduce the calculation time, and we analyze the autocorrelation issue of the FSM, which impacts the output results of large circuit networks, especially those with feedback loops.
Figure 5.1: Basic logic elements (AND, OR, NOR, and a multiple-input AND gate) implemented using all-spin logic.

In this chapter, we first propose the spintronic logic implementation of the
stochastic computing elements, including combinational and sequential logic. Then we propose a parallel implementation scheme for the sequential stochastic computing unit, the finite-state machine, to mitigate its long calculation latency [72]. Lastly, we analyze the parameters that impact the FSM autocorrelation and use a re-randomizer to resolve this issue, as in [73].
5.1 Spintronic logic devices
Using the majority-gate nature of all-spin logic (ASL), we can implement several basic combinational Boolean logic gates, including AND, OR, NOR and a multiple-input AND gate, as in Fig. 5.1. These gates are the basic elements for building larger stochastic circuits. Sequential elements, such as flip flops, are also crucial for building sequential stochastic computing logic, such as the finite-state machine (FSM), which can approximate complex functions, including the absolute value, hyperbolic tangent (tanh) and exponential functions [12]. Fig. 5.2 shows the diagram of a J-K flip flop implemented using ASL. The J-K flip flop uses an internal D flip flop, which is controlled by the system clock. Data is stored in the first magnet when the clock is positive, then moved to the second magnet when the clock becomes negative.
Figure 5.2: Flip flops, the basic elements of the FSM, implemented using all-spin logic.
This second magnet can only store the data when the clock changes from positive to negative, thus effectively separating the data node from the internal storage node.
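As a logic-level illustration of this master-slave behavior (a behavioral sketch of ours, not the magnet-level device model):

    class MasterSlaveJKFlipFlop:
        """Master latch follows the J-K next-state equation while the clock
        is high; the slave captures it on the falling edge, separating the
        data node from the stored output as described above."""

        def __init__(self):
            self.master = 0  # first magnet
            self.q = 0       # second magnet (stored output)
            self.prev_clk = 0

        def tick(self, clk, j, k):
            if clk == 1:
                # Q_next = J*Q' + K'*Q, evaluated while the clock is positive.
                self.master = (j & (1 - self.q)) | ((1 - k) & self.q)
            elif self.prev_clk == 1 and clk == 0:
                # Falling edge: slave captures the master's value.
                self.q = self.master
            self.prev_clk = clk
            return self.q

    # Usage: with J = K = 1 the output toggles once per full clock cycle.
    ff = MasterSlaveJKFlipFlop()
    print([ff.tick(clk, j=1, k=1) for clk in (1, 0, 1, 0, 1, 0)])
    # [0, 1, 1, 0, 0, 1]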
Using these basic gates, we can implement both combinational and sequential stochastic computing elements. Fig. 5.3 shows the combinational computing elements. The multiplier is implemented using a simple AND gate. By taking advantage of the majority-gate nature of ASL, we can implement the stochastic adder with a single majority gate by feeding its control node a bit stream of probability 0.5, instead of implementing a mux as in the CMOS design of [13]. This significantly reduces the hardware cost of a stochastic adder. With the J-K flip flop and various gates, we can further implement an FSM using ASL with fewer than 200 magnets, as in Fig. 5.4.
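The majority-gate adder is easy to verify numerically: with an independent control stream S of probability 0.5, the majority of (A, B, S) has probability ab + 0.5(a + b - 2ab) = (a + b)/2, exactly the scaled addition that the CMOS mux-based design computes. A minimal sketch (ours):

    import random

    def bernoulli_stream(p, n, rng):
        # Unipolar stochastic stream: each bit is 1 with probability p.
        return [1 if rng.random() < p else 0 for _ in range(n)]

    def majority(a, b, c):
        # 3-input majority gate, the basic ASL element.
        return 1 if (a + b + c) >= 2 else 0

    rng = random.Random(42)
    n = 4096
    a, b = 0.3, 0.8
    A = bernoulli_stream(a, n, rng)
    B = bernoulli_stream(b, n, rng)
    S = bernoulli_stream(0.5, n, rng)   # control stream at probability 0.5

    # Multiplier: AND of two independent streams encodes a*b.
    product = sum(x & y for x, y in zip(A, B)) / n
    # Scaled adder: majority(A, B, S) encodes (a + b)/2.
    scaled_sum = sum(majority(x, y, s) for x, y, s in zip(A, B, S)) / n

    print(product, a * b)            # ~0.24
    print(scaled_sum, (a + b) / 2)   # ~0.55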
Stochastic computing not only benefits spintronic logic with a simplified hardware structure, but also provides great fault tolerance. We implemented the spin-based circuits in two image applications commonly used in stochastic computing, Edge Detection and Frame Difference [12], as in Fig. 5.5 and Fig. 5.6. Edge Detection uses three subtraction units and two absolute-value FSM units to calculate the difference between adjacent pixels, using 326 magnets in total. Frame Difference uses two subtraction units, an absolute-value FSM unit
and a tanh FSM unit to calculate the difference between two sequential frames, using 322 magnets.

Figure 5.3: Combinational stochastic computing elements (multiplier, adder, subtractor) implemented using all-spin logic.

Figure 5.4: 3-bit finite-state machine implemented using all-spin logic.

Figure 5.5: Edge Detection application diagram using the proposed spintronic stochastic circuits.

Figure 5.6: Frame Difference application diagram using the proposed spintronic stochastic circuits.

We inject bit errors into every magnet to simulate the random
bit flips of the spintronic devices and to examine the fault-tolerance property of the stochastic scheme. The injection error rate ranges from 0.01 down to 0.00001, at which point the output image quality is similar to the standard stochastic output without bit flips. The experimental results compare the output image quality, using the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR), among image outputs with different error rates. The MSE is the mean of the squared per-pixel error between the stochastic implementation and the conventional implementation results, as in Equation 5.1.
MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \big( I(i,j) - K(i,j) \big)^2 \quad (5.1)
where I refers to the image result from the conventional implementation and
K refers to either of the stochastic implementations. PSNR is further calculated
using the MSE as in Equation 5.2.
PSNR = 20 \times \log_{10}(MAX_I) - 10 \times \log_{10}(MSE) \quad (5.2)
where MAX_I is the maximum possible pixel value of the image. The bit length of each pixel in all images in this work is 8, so MAX_I = 2^8 - 1 = 255. PSNR expresses the ratio between the maximum possible power of a signal and the noise, in dB.
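A small sketch of the error injection and the two quality metrics used in this comparison (function names are ours; the formulas are Equations 5.1 and 5.2):

    import math
    import random

    def inject_errors(stream, rate, rng):
        # Flip each bit with probability `rate` to model random magnet flips.
        return [b ^ 1 if rng.random() < rate else b for b in stream]

    def mse(ref, img):
        # Mean squared per-pixel error between equal-sized images (Eq. 5.1).
        m, n = len(ref), len(ref[0])
        return sum((ref[i][j] - img[i][j]) ** 2
                   for i in range(m) for j in range(n)) / (m * n)

    def psnr(ref, img, max_i=255):
        # Peak signal-to-noise ratio in dB for 8-bit images (Eq. 5.2).
        return 20 * math.log10(max_i) - 10 * math.log10(mse(ref, img))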
Fig. 5.7 and Fig. 5.8 compare the stochastic computing results under different levels of error injection. The error rates for Edge Detection in Fig. 5.7 range from 0.01 to 0.0001, and the image quality improves as the error rate drops toward 0.0001. When the error rate is 0.01, there is a lot of noise in the image; when the error rate drops to 0.0001, the output image looks similar to the output without error injection. The error rates for Frame Difference in Fig. 5.8 range from 0.001 to 0.00001, and the image quality again improves significantly as the error rate decreases. At an error rate of 0.001 we can hardly spot the shape of the walking person, while when the error rate drops to 0.0001 the output image looks almost the same as the output without error injection. We further inspect the output image quality using the MSE and PSNR, as in Table 5.1 and Table 5.2.

Figure 5.7: Edge Detection results with different injected error rates using spintronic logic. Panels: (a) Edge Detection input, (b) standard results, (c) error rate 0.01, (d) error rate 0.001, (e) error rate 0.0001.

Figure 5.8: Frame Difference results with different injected error rates using spintronic logic. Panels: (a) Frame Difference input, (b) standard results, (c) error rate 0.001, (d) error rate 0.0001, (e) error rate 0.00001.

Table 5.1: The MSE and PSNR of Edge Detection using all-spin logic.

Error rate   0.01    0.001   0.0001   0
MSE          412.8   151     58.8     53
PSNR (dB)    22.2    26.5    30.6     30.9
The two tables show the MSE and PSNR results with varying error rates. The PSNR becomes almost the same as that without error injection at an error rate of 0.0001 for Edge Detection and 0.00001 for Frame Difference. Therefore, an error rate of 0.00001 is good enough for both image processing applications. This error rate is still relatively high
Table 5.2: The MSE and PSNR of Frame Difference using all-spin logic.

Error rate   0.001   0.0001   0.00001   0
MSE          1259    275.6    250.4     200
PSNR (dB)    17.2    23.9     24.5      25.1
compared with CMOS soft error rates. Both applications use around 350 magnets, containing two 16-state FSMs and several supporting gates. Assuming the system clock is 1 GHz (clock period 1 ns), an error rate of 0.00001 means one bit flip every 100,000 bits, or one bit flip every 0.1 ms (0.1 ms = 100,000 × 1 ns). This could significantly relax the retention time requirement, which is usually set to 10 years. With a reduced retention time, the dimensions of the spintronic magnet can be decreased, reducing the switching power.
In summary, the stochastic computing scheme, with its high fault tolerance and extremely simple hardware structure, can benefit spintronic logic devices. The high fault tolerance relaxes the retention time requirement and reduces the dynamic power. Two image applications implemented with all-spin logic show that the output image quality is comparable to the standard output even with a relatively high error rate of 0.00001.
5.2 Parallel Implementation of FSM
Implementing stochastic computing with spintronic logic devices such as ASL can effectively reduce the reliability impact of random bit flips and tolerate the relatively small retention time of the ASL. This reduction in retention time can also reduce the dynamic energy use and shrink the device cell size. However, the FSM suffers from long calculation latency due to its probabilistic nature. To achieve acceptable accuracy, a long bit stream, usually 1024 bits, is required in such calculations. This long latency limits the usage of stochastic computing in high-frequency applications. To address this issue, we have studied a parallel implementation of the FSM to improve its performance.
5.2.1 The parallel FSM design
According to Markov chain theory, the probability distribution over the states of a finite-state machine after a long run is deterministic and unique for each input value; this is called the steady-state distribution [74]. For example, the transition matrix P of a 4-state FSM with input probability x is

P = \begin{pmatrix} 1-x & x & 0 & 0 \\ 1-x & 0 & x & 0 \\ 0 & 1-x & 0 & x \\ 0 & 0 & 1-x & x \end{pmatrix}
The steady-state distribution π of this transition matrix, as in Equation 5.3, can be shown to exist and can be solved for:
π = πP (5.3)
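The steady-state distribution, and the number of steps needed to reach it, can be computed numerically; below is a minimal NumPy sketch under the saturating-counter transition structure shown above (function names are ours):

    import numpy as np

    def transition_matrix(x, n_states=4):
        # Saturating-counter FSM: input bit 1 moves right, 0 moves left.
        P = np.zeros((n_states, n_states))
        for s in range(n_states):
            P[s, min(s + 1, n_states - 1)] += x      # input 1
            P[s, max(s - 1, 0)] += 1 - x             # input 0
        return P

    def steady_state(P):
        # Solve pi = pi P together with the normalization sum(pi) = 1.
        n = P.shape[0]
        A = np.vstack([P.T - np.eye(n), np.ones(n)])
        b = np.zeros(n + 1); b[-1] = 1
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def steps_to_converge(P, tol=1e-3):
        # Smallest n for which every row of P^n matches the steady state.
        pi = steady_state(P)
        Pn = P.copy()
        for n in range(1, 10000):
            if np.max(np.abs(Pn - pi)) < tol:
                return n
            Pn = Pn @ P
        return None

    P = transition_matrix(0.5, n_states=16)
    print(steady_state(P))
    print(steps_to_converge(P))  # on the order of hundreds of cycles,
                                 # consistent with the ~200-cycle estimate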
Fig. 5.9 shows the steady-state distributions of a typical 16-state FSM with different input values. The expected time (i.e., the number of clock cycles) to reach this steady state can be calculated from the transition matrix: when every row of P^n becomes the same as the steady-state distribution, n is the number of steps needed to reach the steady state. A 16-state FSM with an input value of 0.5 can be estimated to require at least 200 cycles to reach the steady state. This becomes a huge disadvantage if we want to implement a parallel FSM within a total bit stream length of 1024, since the convergence period is too long for each parallel copy. We implemented a straightforward parallel FSM of the absolute value function, as in Fig. 5.12a, to show this impact. The inputs are 32 uncorrelated bit streams generated by feeding the same value X into 32 linear-feedback shift register (LFSR) random bit generators. Each input is sent to an absolute-value FSM. The mean value of all output bit streams represents the final
output value Y. We initialize the FSMs with different starting states, 0, 7 and 15, which are the left extreme, middle and right extreme of the state range. The performance of this straightforward parallel implementation is shown in Fig. 5.10. When the number of parallel copies is 4 and the length of each bit stream is 256, the output mean value is still close to the real value. However, as the number of parallel copies increases, the output mean value becomes less accurate. When the initial state is 0, the right part of the results drifts significantly away from the expected results as the number of parallel copies increases. When the initial state is 15, the left part drifts away. When the FSM starts at the middle state 7, all data points move away from the correct value. This is because the FSM generates wrong outputs before it reaches the steady state, and this impact grows significantly as the bit stream becomes shorter.

Figure 5.9: The steady-state distribution of a 16-state FSM for input values of (a) 0.2, (b) 0.5, and (c) 0.8. The distribution is symmetric about the input value of 0.5, where it also changes the most, making the FSM very sensitive around 0.5.
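The start-up transient described above is easy to reproduce. The following sketch (ours) uses a 16-state saturating-counter FSM with the well-known tanh output rule (output 1 when the state is in the upper half) rather than the absolute-value FSM, but it exhibits the same effect: short streams started from state 0 are biased by the outputs produced before the steady state is reached:

    import random

    def fsm_stream(in_bits, n_states=16, init_state=0):
        # Saturating up/down counter with the tanh output rule:
        # input 1 moves right, 0 moves left; output 1 if state >= n_states/2.
        state, out = init_state, []
        for b in in_bits:
            out.append(1 if state >= n_states // 2 else 0)
            state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        return out

    rng = random.Random(1)
    x = 0.8  # input probability
    for length in (32, 256, 1024):
        copies = [fsm_stream([1 if rng.random() < x else 0
                              for _ in range(length)])
                  for _ in range(32)]
        mean = sum(sum(c) for c in copies) / (32 * length)
        print(length, mean)  # short streams underestimate: the climb from
                             # state 0 to the upper half dominates early bits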
Further, we implemented this straightforward parallel FSM in one of the image applications, Frame Difference, which uses the absolute value function as in [12]. We implemented 32 parallel copies of the FSM units with different initial states of 0, 7 and 15, and each bit stream has 32 bits, so the total number of stream bits per pixel is 1024, the same as in the serial stochastic implementation. Fig. 5.11 shows clearly that the straightforward implementations lose most of the information and fail to compute the correct Frame Difference results. Moreover, the results from initial states 0 and 15 are roughly complementary in shape; combining the two would give a result very close to the correct one. When the initial state is 7, the output shows a rough shape but most of the details are lost. This matches the observations of the absolute value implementation in Fig. 5.10.
This phenomenon is due to the Markov chain nature of the FSM. Each input value generates a transition matrix with a steady-state distribution, as in Fig. 5.9. When the input value is much smaller than 0.5, the steady-state distribution is concentrated near state 0; when the input value is much larger than 0.5, it is concentrated near state 15. Therefore only half of the outputs are correct when we set the initial state to 0 or 15, and none of the outputs are correct when we set the
initial state to 7.

Figure 5.10: The mean output of the straightforward parallel FSM implementation with 2, 4, 8, 16 and 32 parallel copies. The FSM is a typical 16-state absolute-value function; the three subgraphs use initial states of (a) 0, (b) 7 and (c) 15, respectively.

Figure 5.11: Simulation results of the conventional deterministic scheme, the serial scheme, and the straightforward parallel stochastic implementation of Frame Difference with different initial states: (a) original, (b) conventional, (c) serial stochastic, (d) parallel with initial state 0, (e) initial state 7, (f) initial state 15. The straightforward parallel stochastic implementation is clearly not able to compute the correct results.

To reach the steady state faster and decrease the convergence
time, we can manually store the steady-state distributions and initialize the FSMs directly to them when the input value is known. A dispatcher is proposed to initialize the FSMs for any input value, as in Fig. 5.12d.

The dispatcher itself is a look-up table (LUT) through which the input value picks its corresponding set of initial states. For example, when the input is 0.5, the initial states are evenly distributed among all FSMs, as in Fig. 5.9b: each of the 16 states is assigned to two FSMs as an initial state for a 16-state parallel FSM with 32 parallel copies, mimicking the steady-state distribution at 0.5. Thirty-two initial states are stored per entry to approximate the distribution for each input. The LUT has 20 entries from 0 to 1, with a step of 0.05. The dispatcher requires an estimate of the input value to pick the correct entry of states, so an estimator is implemented before the dispatcher unit.
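A sketch of how such a dispatcher LUT could be built from the steady-state distributions, reusing the transition_matrix and steady_state helpers from the earlier sketch (the construction details here are our assumption, not the exact thesis design):

    import numpy as np
    # Reuses transition_matrix() and steady_state() from the earlier sketch.

    def build_dispatcher_lut(n_states=16, n_copies=32, step=0.05):
        # For each quantized input value, store n_copies initial states whose
        # histogram approximates the steady-state distribution at that input.
        lut = {}
        for k in range(20):                   # 20 entries: 0.00, 0.05, ..., 0.95
            x = k * step
            pi = steady_state(transition_matrix(x, n_states))
            counts = np.floor(np.clip(pi, 0, 1) * n_copies).astype(int)
            # Hand out the copies lost to rounding to the most probable state.
            for _ in range(n_copies - counts.sum()):
                counts[np.argmax(pi)] += 1
            lut[round(x, 2)] = [s for s in range(n_states)
                                for _ in range(counts[s])]
        return lut

    def dispatch(lut, estimate, step=0.05):
        # Quantize the estimator output to the nearest stored entry.
        key = round(min(max(round(estimate / step) * step, 0.0), 0.95), 2)
        return lut[key]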
We propose two implementations of the estimation unit, a parallel counter and a majority-gate counter, as in Fig. 5.12c. Because the dispatcher LUT entries increase with a step of 0.05, we choose the estimation bit stream length to be 416 bits (32 parallel copies × 13 cycles), for which the standard deviation of the estimate is 0.025 (est ± 0.025), precise enough for the parallel counter to pick the entries in the dispatcher. Moreover, since the steady-state distribution is less sensitive around inputs 0 and 1, a simpler majority gate, as in Fig. 5.13, can meet the estimation needs as well. During the estimation process the dispatcher cannot yet provide an initial-state set, so the parallel FSM unit would stall. We can store the bits consumed during estimation and feed them back to the FSMs during the next estimation, forming a simple pipeline that avoids this stall. In this way only the first input datum is affected, and the overall calculation speed is not reduced. The complete parallel FSM implementation is shown in Fig. 5.12b.
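The two estimators can be sketched on a 32 × 13 block of input bits as follows (our illustrative code; in hardware they are the parallel counter and the majority-gate-plus-counter of Fig. 5.12c, and the majority-gate count would be mapped back to an input value through the transfer curve of Fig. 5.13):

    def counter_estimate(bit_block):
        # Parallel-counter estimator: average all bits in the 32 x 13 block.
        total = sum(sum(row) for row in bit_block)
        return total / (len(bit_block) * len(bit_block[0]))

    def majority_estimate(bit_block):
        # Majority-gate estimator: one 32-input majority vote per cycle,
        # counted over the 13 cycles.
        votes, cycles = 0, len(bit_block[0])
        for t in range(cycles):
            ones = sum(row[t] for row in bit_block)
            votes += 1 if ones * 2 > len(bit_block) else 0
        return votes / cycles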
5.2.2 Experiments and Results
In this section, we will introduce the experimental methodology and present the
computational results.

Figure 5.12: The straightforward parallel FSM implementation and the proposed parallel implementation. The proposed parallel FSM sends 32 parallel short bit streams to the Estimator to obtain an initial guess of the input; the two Estimator implementations are a parallel counter and a majority-gate counter. The initial estimate is then sent to the Dispatcher to look up a set of state configurations to initialize the parallel FSMs. Panels: (a) the straightforward parallel FSM implementation, (b) the complete parallel FSM implementation, (c) the Estimator, (d) the Dispatcher.

Figure 5.13: The experimental and analytical output of a 32-input majority gate.

We first set up the parallel FSM unit to approximate three
typical FSM functions, absolute value, tanh and exponential to study the parallel
implementation impact. We implemented our scheme with 32 parallel FSM units, which reduces the bit stream length from 1024 to 32. Due to the probabilistic nature of the stochastic computing scheme, we repeated each experiment 10 times for statistical significance. The experimental results compare the accuracy and consistency among different schemes: the serial FSM, the straightforward parallel FSM as in Fig. 5.12a, and the parallel FSM with a parallel-counter estimator or a majority-gate estimator as in Fig. 5.12c. Then we implemented our parallel FSM with the majority-gate estimator in the two image processing applications from the previous subsection, parallelizing the FSMs the same way as in the single-function implementations. The experimental results compare the output image quality, using the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR), among the conventional deterministic scheme (Conventional), the serial stochastic scheme (Serial) and the parallel stochastic scheme (Parallel).
Fig. 5.14 compares the output accuracy, showing the true output value and the
mean value of the repeated experimental results of the serial and different parallel
stochastic FSM implementations. The average error and the standard deviation of
each implementation are shown in Table 5.3.

Figure 5.14: The output mean value of the two parallel FSMs, with a parallel counter or a majority gate as the estimator, and the serial FSM, for (a) the absolute value function, (b) the exponential function, and (c) the tanh function. Both estimators use 13 clocks, 13 × 32 = 416 bits, to approximate the input value.

Table 5.3: The average error and deviation of the parallel FSMs.

          Abs err   Abs std   Tanh err   Tanh std   Exp err   Exp std
serial    0.0066    0.0127    0.0121     0.0224     0.0105    0.0171
straight  0.1320    0.0118    0.0840     0.0234     0.1500    0.0141
paraCnt   0.0038    0.0141    0.0049     0.0248     0.0176    0.0211
mjrEst    0.0057    0.0150    0.0045     0.0285     0.0199    0.0238

Figure 5.15: Simulation results of the conventional deterministic scheme and the serial and parallel stochastic implementations of Frame Difference: (a) original, (b) conventional, (c) serial stochastic, (d) parallel stochastic.

Figure 5.16: Simulation results of the conventional deterministic scheme and the serial and parallel stochastic implementations of Edge Detection: (a) original, (b) conventional, (c) serial stochastic, (d) parallel stochastic.

Table 5.4: The MSE and PSNR of the image processing applications.

Application   MSE serial   MSE parallel   PSNR serial (dB)   PSNR parallel (dB)
EdgeDetect    47.3         47.9           31.4               31.3
FrameDiff     156.5        133.4          26.2               26.9

The straightforward scheme shows
significant difference from the true output, while the other two parallel schemes, with estimator and dispatcher, are very close to the true output, showing very good accuracy. The error becomes larger as the input value approaches 0.5 due to larger autocorrelation and variance impacts [73]. The parallel FSM tends to be closer to the true value than the serial FSM when the input value is near 1, especially for the exponential and tanh functions. Since the serial FSM is initialized to state 0, it needs time to climb from state 0 to state 15 when the input value is close to 1. This transition requires at least 16 steps, generating 16 wrong output bits; a bit stream of 1024 bits therefore has an error rate of 16/1024 = 1.56%. The parallel FSM, however, does not fix the initial states, making no difference between different input values. This makes the parallel implementation more accurate near an input value of 1.
The Edge Detection and Frame Difference results are shown in Fig. 5.16 and Fig. 5.15. It is hard to visually find any difference in either application's results, other than some outliers due to the probabilistic nature of stochastic computing. Detailed image quality comparisons are listed in Table 5.4. Both applications achieve acceptable PSNR for 8-bit images under both the serial and parallel implementations [75], and both have similar MSE under the two schemes.
In summary, the experimental results show that the performance of the parallel FSM is as good as that of the serial implementation. The simplified majority-gate estimator can also compete with the more complex parallel counter, which suggests that further simplifications could be exploited for low-accuracy applications. As the number of parallel units increases and the length of each bit stream decreases, this estimator-dispatcher mechanism becomes crucial to ensure the accuracy of the parallel FSM scheme. The image processing applications show that the parallel FSM implementation can achieve image quality equivalent to or better than the serial implementation.

Table 5.5: Hardware Cost, Latency and Area-Delay Product of the serial and parallel FSM with 32 degrees of parallelism.

                      serial   parallel (PC)   parallel (MG)
FSM unit              4        8 × 32          8 × 32
supporting unit       -        72              66
total LUT-FF pairs    4        328             322
initial latency       0        13              13
latency               1024     32              32
Area-Delay Product    4096     10496           10304
5.2.3 Latency and Hardware Cost
We implemented the serial and parallel stochastic finite-state machines in Verilog using Xilinx ISE. The estimator of the parallel FSM unit is implemented using two different schemes, a parallel counter and a majority gate, as in Figure 5.12c. The hardware cost of the 32-degree-parallelism implementation is shown in Table 5.5, where parallel (PC) refers to the parallel FSM implementation using a parallel counter as the estimator and parallel (MG) refers to the implementation using a majority gate. The hardware area reported is the number of look-up table and flip-flop pairs (LUT-FF).
Although the parallel implementation of the FSM introduces hardware overhead, it reduces the latency compared to the serial version. For instance, with 32 parallel copies, the latency drops from 1024 cycles to 32 cycles. This significant latency reduction can be critical for high-frequency applications. Although the parallel implementation does introduce an initial latency of 13 cycles during the input estimation process, as in Table 5.5, we can minimize this impact by feeding these 13 bits back during the next estimation, as in a pipeline. We can further see that the parallel implementation's Area-Delay Product (ADP) is greater than the serial ADP due to the significant hardware overhead of the parallel implementation; of course, it is common practice to trade off area for better performance. Each FSM unit of the parallel implementation becomes about twice as large as the serial FSM, which contributes to the larger hardware overhead. This is because the parallel FSM must be able to initialize to different states, which increases the hardware complexity and area cost. Another hardware overhead of the parallel FSM comes from the dispatcher and the estimator, shown in Table 5.5 as the supporting unit. The dispatcher is simply a look-up table (LUT) with multiple entries, each storing a set of initial states for the parallel FSM units. As for the estimator implementation, the table shows that the majority-gate scheme reduces the supporting unit area significantly compared to the parallel counter scheme.
5.3 Autocorrelation Issue of FSM
The sequential stochastic finite-state machine has another issue: autocorrelation within the output bit stream. Since each output bit of the FSM is determined by the previous state and the current input bit, the outputs are correlated in time. Therefore the output of the FSM driven by a random input bit stream does not retain good randomness, where good randomness means a flat autocorrelation plot. This change in randomness compromises the randomness assumption of stochastic computing and can significantly impact the final output when the circuit becomes large and complicated. Therefore, in this section, we analyze the autocorrelation impact of the FSM and propose a re-randomizer to resolve this issue.
5.3.1 Autocorrelation Analysis
Several experiments were performed to evaluate and understand the autocorrela-
tion issue for random bit streams produced by FSM-based stochastic computing
elements. We feed different inputs into the FSM and compare its outputs with a regenerated random output stream, as in Fig. 5.17. A flat autocorrelation plot indicates better randomness, and vice versa.

Figure 5.17: Experimental methodology to measure and compare autocorrelation. In the FSM group the FSM output is measured directly; in the control group the stream value is counted and a fresh random stream is regenerated before measurement.
The main parameters of the FSM-based computing elements are the number of states and the length of the bit stream. The length of the bit stream determines the calculation time and has been proven to be inversely proportional to the output variance [56]. The number of states is generally related to the output accuracy. We vary the input stream values from 0 to 1 with an increment of 0.1, and choose bit stream lengths of 256, 512, 1024, 2048 and 4096. These experiments are performed with four FSM algorithms: absolute value, |p|; exponential, e^{-p}; exponential of absolute value, e^{-|p|}; and the tanh function. These algorithms were implemented in image processing applications in previous works [12].
The discrete autocorrelation is an unnormalized vector, shown in Formula 5.4, where R_{xx}[j] is the jth element of the autocorrelation vector, x_n is the nth element of the bit stream x, and x_{n-j} is the nth element of the j-bit circular shift x^{(j)} of the bit stream:

R_{xx}[j] = \sum_{n} x_n x_{n-j} \quad (5.4)

Figure 5.18: Autocorrelation comparison. The flat "Random" line is a typical autocorrelation of random bit streams, whereas the fluctuating "FSM" line is the autocorrelation of an FSM output that is supposed to match the "Random" stream. Generally, the flatter the autocorrelation plot, the more random the stream.
Autocorrelation vectors are not convenient to compare directly across different parameter settings. Under the randomness assumption, subsequent bits in a stochastic bit stream can be considered a series of i.i.d. Bernoulli random variables; the sum of N bits can therefore be modeled as a binomial distribution B(N, p). Many operations depend on such an assumption. For example, a power operation like x² can be computed simply by feeding x and a one-clock-delayed version of x into an AND gate. Such an operation requires x to be independent in time, which is characterized by a good, flat autocorrelation like "Random" in Fig. 5.18. To make the comparison of autocorrelations easier, we propose a variance-like metric in Formula 5.5, where N is the length of the autocorrelation vector and x̄ is the FSM output mean value:
M = \frac{1}{N} \sum_{n=1}^{N} \left( R_{xx}[n] - \bar{x}^2 \right)^2 \quad (5.5)
This metric is valid because the autocorrelation is actually a bit-shifted version of a self dot-product, as in Formula 5.4. The mean of each autocorrelation vector element, R̄_{xx}[j], equals the mean value of the product of x and the j-bit delayed stream x^{(j)}, which is x̄². Therefore the ideal autocorrelation should be identical to x̄². A good way to quantify the autocorrelation vector is simply to find the average squared difference between the actual measurement and the ideal one; the smaller this value, the less significant the autocorrelation problem. The 1024-bit stream's output variance of 2.4 × 10⁻⁴ at a mean value of 0.5 is small enough for our purposes. One issue with this metric, however, is that it only characterizes the average situation. We will therefore also examine the worst-case autocorrelation plots to reveal detailed information, especially about which part of the autocorrelation is bad.
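A minimal sketch of this measurement (ours): the circular autocorrelation is normalized by the stream length so that its ideal value is the squared mean, and M averages the squared deviation from that ideal:

    import numpy as np

    def autocorr_metric(x):
        # Circular autocorrelation R_xx[j] (Eq. 5.4, normalized by the stream
        # length so its ideal value is mean(x)**2) and the metric M (Eq. 5.5).
        x = np.asarray(x, dtype=float)
        n = len(x)
        R = np.array([np.dot(x, np.roll(x, j)) for j in range(n)]) / n
        M = np.mean((R - x.mean() ** 2) ** 2)
        return R, M

    rng = np.random.default_rng(0)
    stream = (rng.random(1024) < 0.5).astype(int)
    _, M = autocorr_metric(stream)
    print(M)  # roughly the 2.4e-4 reference level quoted above for a random
              # 1024-bit stream at value 0.5; FSM outputs score much higher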
The experimental results include single-factor comparisons among three different parameters: the number of states, the bit stream length, and the output value. Fig. 5.19 shows how the autocorrelation is related to the number of states in the four different algorithms with a bit stream length of 1024. As the number of states increases, the metric M becomes bigger and the output autocorrelation becomes worse.

As the bit stream length increases in Fig. 5.20, the metric M becomes smaller and the autocorrelation becomes better for all four algorithms when the number of states is fixed at 16. This trend holds for different numbers of states. The improvement due to increased bit length is significant and consistent among different numbers of states and different algorithms. Specifically, the results for ABS in Fig. 5.20a and EXP in Fig. 5.20b show an almost exact inverse relationship between the autocorrelation metric and the length of the bit stream: as the bit stream length doubles, the metric halves at the same output value.
Figure 5.19: FSM autocorrelation results at a bit length of 1024 for (a) the absolute, (b) the exponential, (c) the exponential absolute, and (d) the tanh algorithm. All show that the 32-state line is higher than the 16-state and 8-state cases, meaning the autocorrelation metric of the 32-state case is the largest.
Figure 5.20: FSM autocorrelation results with 16 states for (a) the absolute, (b) the exponential, (c) the exponential absolute, and (d) the tanh algorithm. All show that shorter bit streams usually have a larger autocorrelation metric; specifically, the autocorrelation metric with a bit stream of length 256 is almost 10 times larger than with a length of 4096.
Another relation between the output value and the autocorrelation metric can
be observed from the results. The autocorrelation is worst around an output
value of 0.5 and becomes smaller as the output value departs from the center.
For all four algorithms, the autocorrelation metric for output values below 0.2
or above 0.8 is typically 10 times smaller than the worst case at an output value
of 0.5. At these two extremes, the autocorrelation issue is less significant and is
close to 2.4 × 10⁻⁴, which can be considered negligible.
The difference between the FSM output and a random sequence output is
shown in Fig. 5.22. We can see that, for all four algorithms, the autocorrelation
metric of the FSM output is significantly larger than that of the random stream;
the gap can be as large as a factor of 10 when the output value is around 0.5.
Clearly, the FSM significantly degrades the output autocorrelation, which
indicates poor randomness in the output stream.
5.3.2 Re-randomizer
Although we found that a smaller number of states and a longer bit stream
generate an output stream with better autocorrelation, these characteristics work
against higher precision and efficiency. We therefore propose a re-randomizer that
retains the shorter bit stream and the relatively large number of states. The
re-randomizer is built around a saturating up/down counter with the feedback
circuit shown in Fig. 5.21. Counter structures have been widely used in neural
network applications [13] [76] and are generally treated as integrators. By
controlling the feedback loop and the counter size, we can manipulate the output
behavior. Our proposed re-randomizer uses a simple unity-gain feedback loop and
behaves like a low-pass follower: the larger the counter size, the lower the cutoff
frequency [76]. However, as the counter size increases, the steady-state settling
time also increases, impacting the output accuracy. To balance the output
accuracy against the cutoff frequency, the counter size is set to 128 states when
the bit stream is longer than 1024 bits, or to 1/10 of the bit stream length otherwise.
[Figure 5.21 appears here: block diagram of the re-randomizer, showing the FSM input, a counter (CNT), an LFSR, and the feedback loop.]

Figure 5.21: Proposed re-randomizer with feedback structure.
By connecting the re-randomizer directly after the FSM output, we can
analyze the autocorrelation improvement from the re-randomizer.
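The following Python sketch models one plausible reading of Fig. 5.21: the counter integrates the difference between the incoming bit and the regenerated bit (the unity-gain feedback), and a pseudo-random comparison against the counter value produces the new stream. A NumPy generator stands in for the hardware LFSR, and the demo reuses `tanh_fsm` and `autocorr_metric` from the earlier sketches; the internal details are an illustrative reconstruction, not the thesis's exact circuit.

```python
import numpy as np

def rerandomize(in_bits, counter_states=128, seed=1):
    """Counter-based re-randomizer sketch: a saturating up/down counter
    tracks the input stream's value through unity-gain feedback, and each
    output bit is drawn with probability count / counter_states."""
    rng = np.random.default_rng(seed)  # stands in for the hardware LFSR
    count = counter_states // 2
    out = np.empty(len(in_bits), dtype=int)
    for i, b in enumerate(in_bits):
        y = int(rng.integers(counter_states) < count)
        out[i] = y
        # Feedback: integrate (input bit - output bit), saturating
        count = min(max(count + int(b) - y, 0), counter_states - 1)
    return out

rng = np.random.default_rng(3)
fsm_out = tanh_fsm((rng.random(1024) < 0.5).astype(int))
print(autocorr_metric(fsm_out), autocorr_metric(rerandomize(fsm_out)))
```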
Fig. 5.22 compares the autocorrelations of the ideal binomial random bit stream,
the FSM output, and the re-randomizer output with 16 states and the bit stream
length set to 1024. The FSM output autocorrelation departs from that of the
ideal binomial random bit stream at around 0.2, merges back at around 0.8, and
reaches its worst case at around 0.5. The re-randomizer (the "Follower") generally
remains at the same level as the binomial random bit stream, which is less than
2.4 × 10⁻⁴ in all cases. The FSM output value and the random binomial output
are close to each other, while the re-randomizer exhibits varying degrees of value
shift from the random binomial stream outputs.
The re-randomizer can thus significantly reduce the autocorrelation metric.
Fig. 5.23, for a bit stream length of 1024 with 16 states and an output value
around 0.5, shows how the re-randomizer reshapes the autocorrelation. The
re-randomizer acts like a low-pass filter: it filters out the higher-frequency
components of the autocorrelation plots, making them much flatter. In the
worst-case situations, however, the lower-frequency components remain quite
noticeable.
[Figure 5.22 appears here: four panels, (a) Absolute Algorithm, (b) Exponential Algorithm, (c) Exponential Absolute Algorithm, and (d) Tanh Algorithm, each plotting the autocorrelation metric against the output value for the FSM, Random, and Follower streams.]

Figure 5.22: Autocorrelation comparison of the FSM, re-randomizer, and random streams, where "Random" is an ideal Bernoulli bit stream. "FSM" has the worst autocorrelation. The re-randomizer, labeled "Follower", is almost the same as the random sequence labeled "Random".
[Figure 5.23 appears here: four panels, (a) Absolute Algorithm, (b) Exponential Algorithm, (c) Exponential Absolute Algorithm, and (d) Tanh Algorithm, each plotting the autocorrelation against the bit shift for the FSM, Random, and Follower streams.]

Figure 5.23: Worst-case autocorrelation plot with 16 states and a bit stream length of 1024.
5.3.3 Discussion and Analysis
For a binomial random bit stream, the autocorrelation can be considered to be
the square of the output probability, except at a 0-bit shift. Therefore, all other
autocorrelation vector elements should have a mean value equal to E(X)², where
X is the random variable of the binomial random input, following the distribution
(1/N)B(N, p). Recall the autocorrelation metric M proposed previously, which is
exactly the variance of such a distribution. Because each autocorrelation element
is completely determined by the specific bit stream, the autocorrelation should be
linearly related to the variance of X, Var(X), but not Var(X²). The variance
of X for this distribution is:

Var(X) = p(1 − p) / N
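A quick numeric check of this formula, assuming the p = 0.5 and N = 1024 operating point used in the experiments:

```python
import numpy as np

# Sample X = (1/N) * B(N, p) many times and compare the empirical variance
# against p(1 - p)/N; both come out near 2.44e-4 for p = 0.5, N = 1024.
rng = np.random.default_rng(0)
p, n = 0.5, 1024
samples = rng.binomial(n, p, size=100_000) / n
print(samples.var(), p * (1 - p) / n)
```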
The autocorrelation, like the variance, should therefore be related to both the
output value and the bit stream length N. Specifically, the autocorrelation metric
M is inversely proportional to the bit stream length N and reaches its maximum
when p equals 0.5. Both parameters maintained the same relationship when we
evaluated the FSM output, where the binomial random assumption is not
guaranteed.
As for the number of states, it is reasonable to expect that as the number
of states increases, the time required for a full walk through all states becomes
longer. This time could be related to specific patterns in the output streams.
Although the Markov chain guarantees a steady-state distribution, it takes at
least one period to statistically realize this distribution. This period is determined
by the steady-state distribution, and we can roughly treat the number of states
as the shortest possible period. While a random stream generates each bit with
a constant probability, the FSM must take at least one period to simulate this
behavior, so the period should be positively related to the output autocorrelation.
Therefore, fewer states usually produce better autocorrelation.
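One way to make this argument concrete is to look at how quickly the FSM's Markov chain mixes. The sketch below builds the transition matrix of the saturating-counter FSM for a Bernoulli(0.5) input and reports the second-largest eigenvalue modulus, which governs the convergence rate to the steady state; this particular diagnostic is an illustrative choice on our part, not an analysis taken from the thesis.

```python
import numpy as np

def fsm_transition_matrix(n_states, p):
    """Markov transition matrix of a saturating up/down-counter FSM
    driven by a Bernoulli(p) input: up with probability p, down with 1 - p."""
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        P[s, min(s + 1, n_states - 1)] += p
        P[s, max(s - 1, 0)] += 1 - p
    return P

# The second-largest eigenvalue modulus creeps toward 1 as the number of
# states grows, i.e., bigger FSMs take longer to reach steady state.
for n in (8, 16, 32):
    mags = np.sort(np.abs(np.linalg.eigvals(fsm_transition_matrix(n, 0.5))))
    print(n, mags[-2])
```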
We know that the binomial distribution can be approximated by a Gaussian
distribution when N is large enough. It is therefore reasonable to view the
binomial random stream as a DC signal plus Gaussian white noise whose
magnitude equals the bit stream's standard deviation. The FSM output streams,
however, usually contain long runs of successive 1s or 0s, unlike the binomial
random streams, and thus contain more complex frequency components. The
re-randomizer we proposed successfully filters out most of the higher-frequency
components. However, we cannot lower the cutoff frequency by increasing the
counter size without limit, since a larger counter also means a longer settling
time, which would affect the output accuracy.
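The settling-time side of that trade-off can be seen directly in simulation. The sketch below feeds a Bernoulli(0.75) stream into the counter model from the re-randomizer sketch and counts the steps until the counter value first comes within a tolerance of the target; the 0.75 input and 0.05 tolerance are arbitrary illustrative choices.

```python
import numpy as np

def settling_steps(counter_states, p=0.75, tol=0.05, seed=5, limit=100_000):
    """Steps until the follower's counter value first comes within `tol`
    (as a fraction of full scale) of the input probability p."""
    rng = np.random.default_rng(seed)
    count = counter_states // 2
    for i in range(limit):
        b = int(rng.random() < p)
        y = int(rng.integers(counter_states) < count)
        count = min(max(count + b - y, 0), counter_states - 1)
        if abs(count / counter_states - p) <= tol:
            return i + 1
    return limit

for size in (32, 128, 512):
    print(size, settling_steps(size))  # larger counters settle more slowly
```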
5.4 Summary
In this chapter, we introduced spintronic logic devices and analyzed their merits
and weaknesses. Although spintronic logic technology benefits from low leakage
power and a small footprint, it is unreliable, and its cell dimensions must be
increased to overcome its tendency to flip randomly. This fits very well with
stochastic computing schemes, which can use simple circuit logic such as AND
gates and FSMs to perform complex calculations. Stochastic computing reduces
the circuit complexity of spintronic logic and is highly tolerant of errors.
Previous work has studied the implementation of stochastic computing
combinational logic and stochastic number generators using spintronic logic
devices. We proposed implementing sequential logic, the FSM, which is capable
of performing complex calculations, including the exponential, absolute value,
and hyperbolic tangent functions, using spintronic logic devices.
We further proposed a parallelization scheme for the FSM to optimize the
performance of the stochastic computing scheme. The proposed scheme uses a
look-up table dispatcher to set the initial states of multiple FSMs at the steady
state, avoiding a long convergence period. This scheme can effectively implement
the stochastic sequential logic FSM in parallel to reduce the long calculation
latency, at the cost of some hardware overhead. Experiments on three typical
FSM functions show that the accuracy and variance of the parallel FSM scheme
are comparable to the serial implementation. The parallel FSM scheme further
shows equivalent or better image quality than the serial implementation in two
image processing applications.
Finally, we proposed a re-randomizer to overcome the autocorrelation issue
of the FSM. Our analysis showed that the autocorrelation of the FSM is related
to the number of states, the length of the bit stream, and the output value: a
larger number of states, a shorter bit stream, and an output value around 0.5
all produce worse autocorrelation error. The proposed re-randomizer, which uses
an up/down counter and an LFSR, effectively breaks up the output bit sequence
of the FSM and solves the autocorrelation issue.
Chapter 6
Conclusion and Discussion
We have investigated spintronic device implementations in both memory and
logic circuits. The spintronic device is a promising alternative to CMOS, with
significantly smaller size and virtually zero standby energy use. However, its high
dynamic switching power and latency, together with its unreliable random flipping
nature, make it hard to meet performance and accuracy requirements. In this
work, we have optimized spintronic devices for both memory and logic circuits to
mitigate these weaknesses.
Spintronic memory devices: We have analyzed the impact of STT-MRAM as
a replacement for CMOS at all levels of a multiprocessor cache hierarchy. Though
STT-MRAM has higher write energy and latency, reducing these parameters at
the circuit level does not lead to an optimal design. The extra circuit area required
to minimize MTJ bit-cell write time and energy causes the cache arrays to grow,
leading to higher read energy and latency due to parasitic effects.
A fully-associative L0 cache as small as 4KB can effectively restore perfor-
mance lost to the higher write latency. This structure hides the extra write la-
tency of around 5ns when running at 2GHz, giving a total cache energy savings of
40-70% and an average energy-delay product reduction of 60% compared to the
CMOS baseline. The L0 cache is implemented as a standard cache level, requiring
no additional control structures. We have observed no significant scalability im-
pact using STT-MRAM with L0 implemented. A few benchmarks show improved
scalability up to 16 cores using the STT-MRAM hierarchy. The introduction of
new memory technologies can have significant impacts on the best architectural
choices for the memory hierarchy of a multicore system. This work shows that
simple solutions can help mitigate the negative impacts while still allowing the
system to take advantage of the benefits of the new technology.
Spintronic logic devices: We have introduced the spintronic logic device and
analyzed its merits and weaknesses. Although it benefits from low leakage power
and a smaller footprint, it suffers from random bit flips due to its thermal
instability. We proposed a scheme for implementing stochastic computing circuits
using all-spin logic (ASL) that takes advantage of the high fault tolerance of
stochastic computing to mitigate random spin flips. The proposed design can
tolerate an error rate of up to 0.00001 for two image applications, Edge Detection
and Frame Difference.
We further examined the optimization of one of the core stochastic computing
elements, the finite-state machine, to achieve better performance. A parallelization
scheme for the FSM was proposed. Using a look-up table dispatcher to set the
initial states of multiple FSMs, the parallel FSM can immediately work from the
steady state, avoiding the long convergence period. We also proposed two kinds of
estimators for the dispatcher. One, the parallel counter, requires a larger hardware
area but provides a better estimate of the input value. The other, using the
majority gate, naturally fits the trend of the steady-state distribution with the
input values and also simplifies the estimator hardware. The proposed scheme can
effectively implement the FSM in parallel to reduce the long calculation latency,
at the cost of some hardware overhead. Experiments on three typical FSM
functions show that the accuracy and variance of the parallel FSM scheme are
comparable to the serial implementation. The parallel FSM scheme further shows
equivalent or better image quality than the serial one in two image processing
applications. We conclude that quickly initializing the FSM by estimating the
initial state using only a few bits of the input value allows parallelism to be
effectively exploited in
stochastic logic that uses storage elements.
Another issue of the FSM, autocorrelation, was then studied experimentally.
Autocorrelation can introduce dependencies that break the randomness
assumption and could impact the results of large stochastic circuits, especially
those with feedback loops. The autocorrelation is related to three parameters: the
length of the bit stream, the number of states in the FSM, and the output value.
It becomes worse when the length of the bit stream decreases, when the number
of states becomes larger, or when the output value is close to 0.5. We proposed
a re-randomizer that uses an up/down counter to track the output value and
regenerate a new random bit stream, solving the autocorrelation issue of the FSM.
Future Work: The parallel implementation scheme for the FSM successfully
reduces the calculation latency by almost 32 times, but the hardware overhead
is substantial and causes the area-delay product (ADP) to increase to almost
twice that of the serial implementation. Synthesis of the Verilog implementation
suggests that the parallel FSM uses twice the number of LUT-FF pairs due to the
dynamic initial states, which contributes most of the hardware overhead. An
investigation of how to optimize and shrink each FSM will be performed in future
work, which could reduce the ADP by almost half. We also want to examine the
possibility of designing the parallel stochastic computing implementation entirely
with ASL devices, where the additional components such as the estimator and
the dispatcher can be easily implemented with ASL majority gates and MTJ
memory cells. A thorough comparison between the stochastic ASL and
conventional Boolean implementations will also be performed in future work.
References
[1] G. E. Moore. Cramming more components onto integrated circuits. Proceed-
ings of the IEEE, 86(1):82–85, Jan 1998.
[2] Z. Pajouhi, S. Venkataramani, K. Yogendra, A. Raghunathan, and K. Roy.
Exploring spin-transfer-torque devices for logic applications. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems,
34(9):1441–1454, Sept 2015.
[3] Dmitri E. Nikonov and Ian A. Young. Benchmarking of beyond-cmos ex-
ploratory devices for logic integrated circuits. IEEE Journal on Exploratory
Solid-State Computational Devices and Circuits, 1:3–11, 2015.
[4] Sang Phill Park, Sumeet Gupta, Niladri Mojumder, Anand Raghunathan,
and Kaushik Roy. Future cache design using STT MRAMs for improved en-
ergy efficiency: devices, circuits and architecture. In DAC ’12: Proceedings of
the 49th Annual Design Automation Conference. ACM Request Permissions,
June 2012.
[5] Clinton W. Smullen IV, Vidyabhushan Mohan, Anurag Nigam, Sudhanva
Gurumurthi, and Mircea R. Stan Jr. Relaxing Non-Volatility for Fast and
Energy-Efficient STT-RAM Caches. High Performance Computer Architec-
ture (HPCA), 2011 IEEE 17th International Symposium on, pages 50–61,
2011.
[6] Mitchelle Rasquinha, Dhruv Choudhary, Subho Chatterjee, Saibal
Mukhopadhyay, and Sudhakar Yalamanchili. An energy efficient cache de-
sign using spin torque transfer (STT) RAM. In ISLPED ’10: Proceedings of
the 16th ACM/IEEE international symposium on Low power electronics and
design. ACM Request Permissions, August 2010.
[7] Zhenyu Sun, Xiuyuan Bi, Hai Helen Li, Weng-Fai Wong, Zhong-Liang Ong,
Xiaochun Zhu, and Wenqing Wu. Multi retention level STT-RAM cache
designs with a dynamic refresh scheme. In MICRO-44 ’11: Proceedings of
the 44th Annual IEEE/ACM International Symposium on Microarchitecture.
ACM Request Permissions, December 2011.
[8] Xiaochen Guo, Engin Ipek, and Tolga Soyata. Resistive computation: avoid-
ing the power wall with low-leakage, STT-MRAM based computing. In ISCA
’10: Proceedings of the 37th annual international symposium on Computer
architecture. ACM Request Permissions, June 2010.
[9] S. Senni, L. Torres, G. Sassatelli, A. Gamatie, and B. Mussard. Emerging
non-volatile memory technologies exploration flow for processor architecture.
In VLSI (ISVLSI), 2015 IEEE Computer Society Annual Symposium on,
pages 460–460, July 2015.
[10] Adwait Jog, Asit K Mishra, Cong Xu, Yuan Xie, Vijaykrishnan Narayanan,
Ravishankar Krishnan Iyer, and Chita R Das. Cache revive: Architecting
volatile STT-RAM caches for enhanced performance in CMPs. In DAC ’12:
Proceedings of the 49th Annual Design Automation Conference, pages 243–
252, 2012.
[11] C. Ma, W. Tuohy, and D. J. Lilja. Impact of spintronic memory on multi-
core cache hierarchy design. IET Computers Digital Techniques, 11(2):51–59,
2017.
[12] Peng Li and D.J. Lilja. Using stochastic computing to implement digital
image processing algorithms. In Computer Design (ICCD), 2011 IEEE 29th
International Conference on, pages 154–161. IEEE, 2011.
[13] B.D. Brown and H.C. Card. Stochastic neural computation. I. Computational
elements. Computers, IEEE Transactions on, 50(9):891–905, 2001.
[14] R. Venkatesan, S. Venkataramani, X. Fong, K. Roy, and A. Raghunathan.
Spintastic: Spin-based stochastic logic for energy-efficient computing. In 2015
Design, Automation Test in Europe Conference Exhibition (DATE), pages
1575–1578, March 2015.
[15] X. Fong, M. C. Chen, and K. Roy. Generating true random numbers using on-
chip complementary polarizer spin-transfer torque magnetic tunnel junctions.
In 72nd Device Research Conference, pages 103–104, June 2014.
[16] Yuan Ji, Feng Ran, Cong Ma, and D.J. Lilja. A hardware implementation
of a radial basis function neural network using stochastic logic. In Design,
Automation Test in Europe Conference Exhibition (DATE), 2015, pages 880–
883, March 2015.
[17] L. Miao and C. Chakrabarti. A parallel stochastic computing system with
improved accuracy. In SiPS 2013 Proceedings, pages 195–200, Oct 2013.
[18] Zhiheng Wang, N. Saraf, K. Bazargan, and A. Scheel. Randomness meets
feedback: Stochastic implementation of logistic map dynamical system. In
2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pages
1–7, June 2015.
[19] Qingan Li, Jianhua Li, Liang Shi, C.J. Xue, Yiran Chen, and Yanxiang He.
Compiler-assisted refresh minimization for volatile stt-ram cache. In De-
sign Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific,
pages 273–278, Jan 2013.
[20] Wei Xu, Hongbin Sun, Xiaobin Wang, Yiran Chen, and Tong Zhang. Design
of last-level on-chip cache using spin-torque transfer ram (stt ram). Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, 19(3):483–
493, March 2011.
[21] Yusung Kim, Sumeet Kumar Gupta, Sang Phill Park, Georgios Panagopou-
los, and Kaushik Roy. Write-optimized reliable design of STT MRAM. In
ISLPED ’12: Proceedings of the 2012 ACM/IEEE international symposium
on Low power electronics and design. ACM Request Permissions, July 2012.
[22] Guangyu Sun, Xiangyu Dong, Yuan Xie, Jian Li, and Yiran Chen. A novel
architecture of the 3d stacked mram l2 cache for cmps. In High Performance
Computer Architecture, 2009. HPCA 2009. IEEE 15th International Sympo-
sium on, pages 239–249, Feb 2009.
[23] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. Energy reduction for
STT-RAM using early write termination. In ICCAD ’09: Proceedings of the
2009 International Conference on Computer-Aided Design. ACM Request
Permissions, November 2009.
[24] Kon-Woo Kwon, Sri Harsha Choday, Yusung Kim, and Kaushik Roy.
AWARE (Asymmetric Write Architecture With REdundant Blocks): A High
Write Speed STT-MRAM Cache Architecture. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 22(4):712–720.
[25] Zhenyu Sun, Hai Li, and Wenqing Wu. A dual-mode architecture for fast-
switching STT-RAM. In ISLPED ’12: Proceedings of the 2012 ACM/IEEE
international symposium on Low power electronics and design. ACM Request
Permissions, July 2012.
[26] Junwhan Ahn, Sungjoo Yoo, and Kiyoung Choi. Dasca: Dead write predic-
tion assisted stt-ram cache architecture. High Performance Computer Archi-
tecture (HPCA2014), 2014 IEEE 20th International Symposium on, February
2014.
[27] Xiaoxia Wu, Jian Li, Lixin Zhang, E. Speight, and Yuan Xie. Power and
performance of read-write aware hybrid caches with non-volatile memories.
In Design, Automation Test in Europe Conference Exhibition , 2009. DATE
’09., pages 737–742, April 2009.
[28] Amin Jadidi, Mohammad Arjomand, and Hamid Sarbazi-Azad. High-
endurance and performance-efficient design of hybrid cache architectures
through adaptive line replacement. In ISLPED ’11: Proceedings of the 17th
IEEE/ACM international symposium on Low-power electronics and design.
IEEE Press, August 2011.
[29] B. Del Bel, Jongyeon Kim, C.H. Kim, and S.S. Sapatnekar. Improving stt-
mram density through multibit error correction. In Design, Automation and
Test in Europe Conference and Exhibition (DATE), 2014, pages 1–6, March
2014.
[30] Norman P Jouppi. Improving direct-mapped cache performance by the addi-
tion of a small fully-associative cache and prefetch buffers. In ACM SIGARCH
Computer Architecture News, volume 18, pages 364–373. ACM, 1990.
[31] Johnson Kin, Munish Gupta, and William H. Mangione-Smith. The filter
cache: An energy efficient memory structure. In Proceedings of the 30th
Annual ACM/IEEE International Symposium on Microarchitecture, MICRO
30, pages 184–193, Washington, DC, USA, 1997. IEEE Computer Society.
[32] A. Varma and Q. Jacobson. Destage algorithms for disk arrays with non-
volatile caches. In Computer Architecture, 1995. Proceedings., 22nd Annual
International Symposium on, pages 83–95, June 1995.
[33] Binny S. Gill and Dharmendra S. Modha. Wow: Wise ordering for writes -
combining spatial and temporal locality in non-volatile caches. In Proceed-
ings of the 4th Conference on USENIX Conference on File and Storage Tech-
nologies - Volume 4, FAST’05, page 10, Berkeley, CA, USA, 2005. USENIX
Association.
[34] Behtash Behin-Aein, Deepanjan Datta, Sayeff Salahuddin, and Supriyo
Datta. Proposal for an all-spin logic device with built-in memory. Nature
Nanotechnology, 5:266–270, Feb 2010.
[35] C. Augustine, G. Panagopoulos, B. Behin-Aein, S. Srinivasan, A. Sarkar,
and K. Roy. Low-power functionality enhanced computation architecture
using spin-based devices. In 2011 IEEE/ACM International Symposium on
Nanoscale Architectures, pages 129–136, June 2011.
[36] J. Kim, B. Tuohy, C. Ma, W. H. Choi, I. Ahmed, D. Lilja, and C. H. Kim.
Spin-hall effect mram based cache memory: A feasibility study. In 2015 73rd
Annual Device Research Conference (DRC), pages 117–118, June 2015.
[37] S. G. Ramasubramanian, R. Venkatesan, M. Sharad, K. Roy, and A. Raghu-
nathan. Spindle: Spintronic deep learning engine for large-scale neuromorphic
computing. In 2014 IEEE/ACM International Symposium on Low Power
Electronics and Design (ISLPED), pages 15–20, Aug 2014.
[38] Meng Yang, Bingzhe Li, David J. Lilja, Bo Yuan, and Weikang Qian. To-
wards theoretical cost limit of stochastic number generators for stochastic
computing. In IEEE Computer Society Annual Symposium on VLSI, 2018.
[39] Weisheng Zhao, E. Belhaire, and C. Chappert. Spin-mtj based non-volatile
flip-flop. In 2007 7th IEEE Conference on Nanotechnology (IEEE NANO),
pages 399–402, Aug 2007.
[40] K. Ryu, J. Kim, J. Jung, J. P. Kim, S. H. Kang, and S. O. Jung. A magnetic
tunnel junction based zero standby leakage current retention flip-flop. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 20(11):2044–
2053, Nov 2012.
[41] T. Endoh, S. Togashi, F. Iga, Y. Yoshida, T. Ohsawa, H. Koike, S. Fukami,
S. Ikeda, N. Kasai, N. Sakimura, T. Hanyu, and H. Ohno. A 600mhz mtj-
based nonvolatile latch making use of incubation time in mtj switching. In
2011 International Electron Devices Meeting, pages 4.3.1–4.3.4, Dec 2011.
[42] H. Koike, T. Ohsawa, S. Ikeda, T. Hanyu, H. Ohno, T. Endoh, N. Sakimura,
R. Nebashi, Y. Tsuji, A. Morioka, S. Miura, H. Honjo, and T. Sugibayashi.
A power-gated mpu with 3-microsecond entry/exit delay using mtj-based
nonvolatile flip-flop. In 2013 IEEE Asian Solid-State Circuits Conference
(A-SSCC), pages 317–320, Nov 2013.
[43] K. Jabeur, G. Di Pendina, F. Bernard-Granger, and G. Prenat. Spin orbit
torque non-volatile flip-flop for high speed and low energy applications. IEEE
Electron Device Letters, 35(3):408–410, March 2014.
[44] B.R. Gaines. Techniques of identification with the stochastic computer. In
Proc. IFAC Symp. Problems of Identification, pages 1–18, 1967.
[45] B.D. Brown and H.C. Card. Stochastic neural computation. II. Soft compet-
itive learning. Computers, IEEE Transactions on, 50(9):906–920, 2001.
[46] H. Li, D. Zhang, and S. Y. Foo. A stochastic digital implementation of a
neural network controller for small wind turbine systems. IEEE Transactions
on Power Electronics, 21(5):1502–1507, September 2006.
[47] Bingzhe Li, M Hassan Najafi, and David J Lilja. An fpga implementation
of a restricted boltzmann machine classifier using stochastic bit streams.
In Application-specific Systems, Architectures and Processors (ASAP), 2015
IEEE 26th International Conference on, pages 68–69. IEEE, 2015.
[48] Bingzhe Li, Yaobin Qin, Bo Yuan, and David J Lilja. Neural network classi-
fiers using stochastic computing with a hardware-oriented approximate acti-
vation function. In 2017 IEEE 35th International Conference on Computer
Design (ICCD), pages 97–104. IEEE, 2017.
[49] B. Li, M. H. Najafi, B. Yuan, and D. J. Lilja. Quantized neural networks
with new stochastic multipliers. In 2018 19th International Symposium on
Quality Electronic Design (ISQED), pages 376–382, March 2018.
[50] J. G. Ortega, C. L. Janer, J. M. Quero, J. Pinilla, and J. Serrano. Analog to
digital and digital to analog conversion based on stochastic logic. In IEEE
21st Annual Conference of Industrial Electronics Society, IECON’95, pages
995–999, 1995.
[51] C. L. Janer, J. M. Quero, J. G. Ortega, and L. G. Franquelo. Fully parallel
stochastic computation architecture. IEEE Transactions on Signal Process-
ing, 44(8):2110–2117, August 1996.
[52] Saeed S. Tehrani, S. Mannor, and Warren J. Gross. Fully parallel stochastic
ldpc decoders. IEEE Transactions on signal processing, page 11, November
2008.
[53] M. Hori and M. Ueda. Fpga implementation of a blind source separation sys-
tem based on stochastic computing. In IEEE Conference on Soft Computing
in Industrial Applications, SMCia’08, pages 182–187, 2008.
[54] N. Saraf, K. Bazargan, D.J. Lilja, and M.D. Riedel. Iir filters using stochas-
tic arithmetic. In Design, Automation and Test in Europe Conference and
Exhibition (DATE), 2014, pages 1–6, March 2014.
[55] Peng Li and D.J. Lilja. A low power fault-tolerance architecture for the ker-
nel density estimation based image segmentation algorithm. In Application-
Specific Systems, Architectures and Processors (ASAP), 2011 IEEE Interna-
tional Conference on, pages 161–168. IEEE, 2011.
[56] Weikang Qian, Xin Li, M.D. Riedel, K. Bazargan, and D.J. Lilja. An ar-
chitecture for fault-tolerant computation with stochastic logic. Computers,
IEEE Transactions on, 60(1):93–105, 2011.
[57] Jongyeon Kim, Hui Zhao, Yanfeng Jiang, Angeline Klemm, Jian-Ping Wang,
and Chris H. Kim. Scaling Analysis of In-plane and Perpendicular Anisotropy
Magnetic Tunnel Junctions Using a Physics-Based Model. In Device Research
Conference (DRC), 2014, June 2014.
[58] William Tuohy, Cong Ma, Pushkar Nandkar, Nishant Borse, and David J
Lilja. Improving energy and performance with spintronics caches in multi-
core systems. In Europar ’14: OMHI - Third Annual Workshop on On-Chip
Memory Hierarchies and Interconnects. Springer-Verlag, August 2014.
[59] Hewlett-Packard Development Company, L.P. CACTI 6.5, 2009.
[60] Wei Zhao and Yu Cao. New generation of predictive technology model for
sub-45nm design exploration. Quality Electronic Design, 2006. ISQED ’06.
7th International Symposium on, page 6, 2006.
[61] Xiangyu Dong, Cong Xu, Yuan Xie, and N P Jouppi. NVSim: A Circuit-Level
Performance, Energy, and Area Model for Emerging Nonvolatile Memory.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 31(7):994–1007.
[62] Davy Genbrugge, Stijn Eyerman, and Lieven Eeckhout. Interval simula-
tion: Raising the level of abstraction in architectural simulation. In High
Performance Computer Architecture (HPCA), 2010 IEEE 16th International
Symposium on, pages 1–12. IEEE, January 2010.
[63] Christian Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Prince-
ton University, January 2011.
[64] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh,
and Anoop Gupta. The splash-2 programs: Characterization and method-
ological considerations. In Proceedings of the 22Nd Annual International
Symposium on Computer Architecture, ISCA ’95, pages 24–36, New York,
NY, USA, 1995. ACM.
[65] C. Bienia, S. Kumar, and K. Li. Parsec vs. splash-2: A quantitative com-
parison of two multithreaded benchmark suites on chip-multiprocessors. In
Workload Characterization, 2008. IISWC 2008. IEEE International Sympo-
sium on, pages 47–56, Sept 2008.
[66] A R Alameldeen and D A Wood. IPC Considered Harmful for Multiprocessor
Workloads. Micro, IEEE, 26(4):8–17, 2006.
[67] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt,
Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna,
Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay
Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH
Comput. Archit. News, 39(2):1–7, August 2011.
[68] Cong Ma, William Tuohy, Pushkar Nandkar, and David J. Lilja. Cycle-
accurate stt-mram model in gem5. In International Symposium on Computer
Architecture (ISCA) Second gem5 User Workshop, 2015.
[69] Ki C. Chun, Hui Zhao, Jonathan D. Harms, Tae-Hyoung Kim, Jian-Ping
Wang, and Chul H. Kim. A Scaling Roadmap and Performance Evaluation
of In-Plane and Perpendicular MTJ Based STT-MRAMs for High-Density
Cache Memory. Solid-State Circuits, IEEE Journal of, 48(2):598–610, Febru-
ary 2013.
[70] Ricardo Gonzales and Mark Horowitz. Energy dissipation in general purpose
processors. IEEE Journal of Solid State Circuits, 31:1277–1284, 1995.
[71] Major Bhadauria, Vincent M Weaver, and Sally A McKee. Understanding
PARSEC performance on contemporary CMPs. Workload Characterization,
2009. IISWC 2009. IEEE International Symposium on, pages 98–107, 2009.
[72] C. Ma and D. J. Lilja. Parallel implementation of finite state machines for
reducing the latency of stochastic computing. In 2018 19th International
Symposium on Quality Electronic Design (ISQED), pages 335–340, March
2018.
[73] Cong Ma, Peng Li, and David J. Lilja. Autocorrelation study for finite-state
machine-based stochastic computing elements. In International Workshop on
Logic Synthesis (IWLS), 2013.
[74] A. A. Markov. Extension of the limit theorems of probability theory to a sum
of variables connected in a chain. reprinted in Appendix B of: R. Howard.
Dynamic Probabilistic Systems, volume 1: Markov Chains. John Wiley and
Sons, 1971.
[75] Nikolaos Thomos, Nikolaos V. Boulgouris, and Michael G. Strintzis. Op-
timized transmission of JPEG2000 streams over wireless channels. IEEE
transactions on image processing : a publication of the IEEE Signal Process-
ing Society, 15(1):54–67, January 2006.
[76] J. M. Quero, S. L. Toral, J. G. Ortega, and L. G. Franquelo. Continuous
time filter design using stochastic logic. 1:113–116, 1999.