The Design of Spintronic-based Circuitry for
Memory and Logic Units in Computer Systems
A DISSERTATION
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Cong Ma
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Advisor David J. Lilja
October, 2018
© Cong Ma 2018
ALL RIGHTS RESERVED
Acknowledgements
I would like to express my sincere gratitude to my advisor, Prof. David J. Lilja,
for his continuous support of my Ph.D. study and related research. His guidance
and encouragement helped me overcome challenges and obstacles. I am grateful
for his patience, his kindness, and his invaluable advice in research and in life.
Besides my advisor, I would like to thank the rest of my thesis committee:
Prof. Kia Bazargan, Prof. Sachin Sapatnekar, and Prof. Pen-Chung Yew; and
my preliminary exam committee: Prof. Chris H. Kim and Prof. Ulya Karpuzcu,
for their insightful comments and suggestions.
I am grateful for the resources from the University of Minnesota Supercomputing
Institute and for the support from C-SPIN, one of six centers of STARnet, a
Semiconductor Research Corporation program sponsored by MARCO and DARPA,
and from National Science Foundation grant no. CCF-1241987. Any opinions,
findings, and conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the NSF or C-SPIN.
I would like to thank my colleagues and fellow labmates: Dr. Peng Li,
Prof. Yuan Ji, Bill Tuohy, Pushkar Nandkar, Dr. Jongyeon Kim, Ibrahim Ahmed,
Zhaoxin Liang, Dr. Bingzhe Li, Dr. Manas Minglani, Dr. Hassan Najafi, and
Yaobin Qin, for the brainstorming, the inspiring discussions, the sleepless nights
before deadlines, and all the fun we had.
I would like to thank Prof. Emad Ebbini, who gave me the opportunity to
join his team during my Master's program and gain valuable research experience.
I would also like to thank the staff of the Electrical and Computer Engineering
Department, the Graduate School, and International Student and Scholar Services
at the University of Minnesota, especially the Graduate Advisor, Linda Jagerson,
for answering all my last-minute questions.
My sincere thanks also go to my former manager Alka Deshpande at Oracle,
for her support during difficult times, and to Prof. Shujuan Wang, Prof.
Guofu Zhai, Prof. Sijiu Liu, and Dr. Lei Kang at Harbin Institute of Technology,
whose guidance opened the door to research for me.
I would also like to thank my friends at the University of Minnesota, especially Dr.
Yinglong Feng, Jie Kang, Yi Wang, Dr. Xiaofan Wu, Dr. Wei Zhang, Dr. Zisheng
Zhang, Keping Song, and Qi Zhao, for the great parties, the homemade hot
pots, the fun trips, and the Dota nights.
Last but not least, I would like to thank my family: my wife, Wenwei Zhang,
for always believing in me and indulging my geeky talks; my parents, Shuhua Yang
and Baocheng Ma, for their continuous love; my parents-in-law, Yueyan Pan and
Yi Zhang, for their unconditional support; and my frenchie, MobyDick, for always
sitting next to me and warming my feet while I wrote this thesis. Thank you all
for the love and support. This thesis would not have been possible without you.
Dedication
To my wife, Wenwei and my dog, MobyDick. Thank you for your support and
companionship during the most difficult time in my life.
Abstract
As CMOS technology starts to face serious scaling and power consumption
issues, emerging beyond-CMOS technologies have drawn substantial attention in
recent years. Spintronic devices, among the most promising CMOS alternatives,
offer smaller size and low standby power consumption, fitting the needs of the
growing mobile and IoT markets. Spin-Transfer Torque-MRAM (STT-MRAM),
with read latency comparable to SRAM, and All-spin logic (ASL), capable of
implementing purely spin-based circuits, are potential candidates to replace CMOS
memory and logic devices. However, spintronic memory continues to require higher
write energy, presenting a challenge to memory hierarchy design when energy
consumption is a concern. This motivates the use of STT-MRAM for the first-level
caches of a multicore processor to reduce energy consumption without significantly
degrading performance. The large STT-MRAM first-level cache saves leakage
power, and a small level-0 cache recovers the performance lost to the long write
latency of STT-MRAM. This combination reduces the energy-delay product by
65% on average compared to a CMOS baseline. All-spin logic suffers from random
bit flips that significantly impact Boolean logic reliability. Stochastic computing,
which uses random bit streams for computation, has shown low hardware cost and
high fault tolerance compared to conventional binary encoding. This motivates
the use of ASL in stochastic computing to take advantage of its simplicity and
fault tolerance. The finite-state machine (FSM), a sequential stochastic computing
element, can compute complex functions, including the exponentiation and
hyperbolic tangent functions, more efficiently, but it suffers from long calculation
latency and autocorrelation issues. A parallel implementation scheme for the FSM
is proposed that uses an estimator and a dispatcher to directly initialize the FSM
to its steady state. It shows equivalent or better results than the serial
implementation with some hardware overhead. A re-randomizer that uses an
up/down counter is also proposed to solve the autocorrelation issue.
Contents
Acknowledgements i
Dedication iii
Abstract iv
List of Tables vii
List of Figures viii
1 Introduction 1
2 Background 5
2.1 Spintronic memory device, STT-MRAM . . . . . . . . . . . . . . 6
2.2 Spintronic logic device, All-spin logic . . . . . . . . . . . . . . . . 7
3 Related Works 11
4 Incorporating Spintronic Devices in CPU Caches 17
4.1 Simulation Methodology . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.1 Technology Modeling . . . . . . . . . . . . . . . . . . . . . 19
4.1.2 Architectural Simulation . . . . . . . . . . . . . . . . . . . 22
4.2 Performance, Energy and Scalability . . . . . . . . . . . . . . . . 25
4.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Energy Comparison . . . . . . . . . . . . . . . . . . . . . . 27
4.2.3 Energy-Delay Product . . . . . . . . . . . . . . . . . . . . 29
4.2.4 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.5 Larger L0 impact . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5 Incorporating Spintronic Devices in Logic Units 38
5.1 Spintronic logic devices . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Parallel Implementation of FSM . . . . . . . . . . . . . . . . . . . 46
5.2.1 The parallel FSM design . . . . . . . . . . . . . . . . . . . 47
5.2.2 Experiments and Results . . . . . . . . . . . . . . . . . . . 52
5.2.3 Latency and Hardware Cost . . . . . . . . . . . . . . . . . 59
5.3 Autocorrelation Issue of FSM . . . . . . . . . . . . . . . . . . . . 60
5.3.1 Autocorrelation Analysis . . . . . . . . . . . . . . . . . . . 60
5.3.2 Re-randomizer . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.3 Discussion and Analysis . . . . . . . . . . . . . . . . . . . 70
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6 Conclusion and Discussion 73
References 76
List of Tables
4.1 Energy consumption parameters for the STT cache structures. . . 21
4.2 Energy consumption parameters for the CMOS cache structures.
Leakage is in mW. . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Simulated Processor Configurations. . . . . . . . . . . . . . . . . . 24
4.4 Simulated STT-MRAM Cache Parameters. For writes, the access
latency is added to the write latency. . . . . . . . . . . . . . . . . 25
4.5 Simulated Cache Hierarchies. . . . . . . . . . . . . . . . . . . . . 25
4.6 Cache capacity impact on performance and cache reuse of canneal
with simlarge dataset. The execution time of each configuration is
normalized to 4 cores with 4MB L2 cache. The percentage is the
L2 cache miss rate. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.7 Cache capacity impact on performance and cache reuse of canneal
with simnative dataset. The execution time of each configuration
is normalized to 4 cores with 4MB L2 cache. The percentage is the
L2 cache miss rate. . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 The MSE and PSNR of Edge Detection using all-spin logics . . . 45
5.2 The MSE and PSNR of Frame Difference using all-spin logics . . . 46
5.3 The average error and deviation of the parallel FSMs. . . . . . . . 56
5.4 The MSE and PSNR of image processing applications . . . . . . . 58
5.5 Hardware Cost, Latency and Area-Delay Product of serial and par-
allel FSM with 32 degrees of parallelism . . . . . . . . . . . . . . 59
List of Figures
2.1 Typical multicore cache hierarchy, with multiple copies of data. . 7
2.2 The basic all-spin logic elements. . . . . . . . . . . . . . . . . . . 8
2.3 Stochastic Computing Multiplication using a single AND Gate . . 9
2.4 Finite-state machine diagram for approximating the exp function. 10
4.1 A STT-MRAM 1T1MTJ bit-cell (a), showing the access transistor
and MTJ storage element. In the Parallel (P) state (b), resistance
through the device is lower than in the Anti-Parallel (AP) state (c). 19
4.2 The different trends in read and write energy for MTJ cells used in
L1 (left) and L2 (right) caches. The Write total is a combination
of Write cell and Write access; Write cell is the per-bit switching
energy from SPICE and Write access is the array access energy
reported by Cacti. . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 STT-MRAM cache is modeled with separate read and write ports.
The read port works the same as the CMOS SRAM cache, but
the write port blocks a consecutive write operation when it is not
available. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 The coherence transition diagram modeled in Gem5 to accommodate
the write-blocking mechanism of STT-MRAM. Instead of having one
intermediate transition state IS, we added another transition state
IS_S to simulate the blocking state of the STT-MRAM; no write
requests can be served in this state until it finishes its transition to
the S state. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.5 (a) and (b) compare the cmos-base (64K,4M) to STT hierarchies (128K,16M).
Write latency of the L1 cache results in a significant performance
drop. (c) and (d) compare the cmos-base to STT hierarchies that use
the write-merging L0. The hierarchy uses a 4K fully-associative L0
cache and a STT L1 cache with various write latencies. . . . . . . 27
4.6 The near-core cache miss rate. Though the 1KB fully-associative L0
cache does have a large miss rate, the 4KB cache's miss rate is on
average less than 5%. . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 (a, b) shows the total energy consumption normalized to cmos-
base (64K,4M) with stt-l2 (64K,16M), stt-l1d2 (128K,16M), and
stt-l0 with varying L0 sizes (1/4K, 128K, 16M). The CMOS3 L2
leakage is computed for a 4MB STT-MRAM cache to create a fair
baseline. (c, d) shows the dynamic energy consumption with the
same configurations as in (a, b). . . . . . . . . . . . . . . . . . . 30
4.8 (a, b) shows the Energy Delay Product of various STT hierarchies
with four cores. (c, d) shows the Energy Delay Product with 16
cores. All normalized to cmos-base. . . . . . . . . . . . . . . . . 31
4.9 The scalability of various architecture hierarchies using from 4 to 16
cores, including two-level cmos-base (64K,4/8/16M), stt-l2 (64K,16/32/64M)
and stt-l1d2 (128K,16/32/64M), three-level stt-l0z1 (1K,128K,16/32/64M)
and stt-l0z4 (4K,128K,16/32/64M). The 3-level hierarchy has a sim-
ilar scalability as the two-level cmos-base, but canneal and facesim
particularly show better results than the others. . . . . . . . . . . 33
4.10 The figure shows the performance and energy use of several config-
urations including CMOS baseline, stt-l1d2, stt-l0z1, to stt-l0z64,
where stt-l0z32 and stt-l0z64 have 32kB 4-way assoc and 64kB 8-
way assoc L0 cache. The average performance improves less than
10% and the energy use increases more than 25% from stt-l0z4 to
stt-l0z64. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.11 The figure shows the energy-delay product of several configurations
including the larger L0 implementations. The overall energy-delay
product increases 20% on average for both benchmark suites. . . . 36
5.1 Basic elements implemented using all-spin logic. . . . . . . . . . . 39
5.2 Flip flops, basic element of FSM, implemented using all-spin logic. 40
5.3 Combinational stochastic computing elements implemented using
all-spin logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4 3-bit finite-state machine implemented using all-spin logic. . . . . 41
5.5 Edge Detection application diagram using the proposed spintronic
stochastic circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.6 Frame Difference application diagram using the proposed spintronic
stochastic circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.7 Edge Detection Results with different error rate injection using
Spintronic logics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.8 Frame Difference Results with different error rate injection using
Spintronic logics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.9 The Steady-State distribution of a 16-state FSM. From the figure,
we can see that this distribution is symmetric about the input value
of 0.5. The distribution changes the most here, making it very
sensitive around 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.10 The mean output of the straightforward implementation of FSM
with 2, 4, 8, 16 and 32 parallel copies. The FSM is a typical
16-state absolute value function. The three subgraphs use different
initial states of 0, 7, and 15, respectively. . . . . . . . . . . . . . . 50
5.11 Simulation results of the conventional deterministic scheme, serial
and a straightforward parallel stochastic implementation on Frame
Difference with different initial states. The straightforward parallel
stochastic implementation is clearly not able to compute the correct
results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.12 The straightforward parallel FSM implementation and the pro-
posed parallel implementation. The proposed parallel FSM has 32
parallel short bit streams sent to the Estimator to obtain an initial
guess for the input. Two Estimator implementations are parallel
counter and majority gate counter. This initial estimate is then
sent to the Dispatcher to look up a set of state configurations to
initialize the parallel FSMs. . . . . . . . . . . . . . . . . . . . . . 53
5.13 The experiment and analytical output result of a 32-input Majority
Gate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.14 The output mean value of two parallel FSMs, one using a parallel
counter and one using a majority gate counter as the estimator, and
the serial FSM. Both estimators use 13 clocks, 13 × 32 = 416 bits,
to approximate the input value. . . . . . . . . . . . . . . . . . . . 55
5.15 Simulation results of the conventional deterministic scheme, serial
and parallel stochastic implementation on Frame Difference. . . . 56
5.16 Simulation results of the conventional deterministic scheme, serial
and parallel stochastic implementation on Edge Detection. . . . . 57
5.17 Experimental methodology to measure and compare autocorrelation. 61
5.18 Autocorrelation Comparison. The flat “Random” line is a typi-
cal autocorrelation of random bit streams, whereas the fluctuating
“FSM” is the autocorrelation when the output of the FSM is sup-
posed to be the same as “Random” streams. Generally, the flatter
the autocorrelation plot, the more random the stream. . . . . . . 62
5.19 FSM autocorrelation results at a bit length of 1024. All show that
the 32-state line is higher than the 16-state and 8-state cases, which
means that the autocorrelation metric of the 32-state case is the
largest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.20 FSM autocorrelation results with 16 states. All show that shorter
bit streams usually have a larger autocorrelation metric. Specifically,
the autocorrelation metric with a bit stream of length 256 is
almost 10 times larger than with a bit stream of length 4096. . . . 65
5.21 Proposed re-randomizer with feedback structure. . . . . . . . . . . 67
5.22 Autocorrelation comparison of the FSM, the re-randomizer, and an
ideal Bernoulli bit stream ("Random"). "FSM" has the worst
autocorrelation; the re-randomizer ("Follower") is almost the same
as the random sequence ("Random"). . . . . . . . . . . . . . . . 68
5.23 Worst Case Autocorrelation Plot with 16 states and a bit stream
length of 1024. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Chapter 1
Introduction
Although Moore's law [1] successfully predicted the scaling of CMOS technology
for several decades, that scaling has begun to slow as CMOS devices approach
quantum physical limitations [2]. To keep decreasing the size of electronic
transistors, researchers have been studying potential CMOS alternatives [3].
Spintronic devices, with their much smaller size and virtually zero leakage power,
have quickly emerged as some of the most promising beyond-CMOS devices.
However, spintronic devices suffer from large dynamic switching energy,
long write/switch latency, and random bit flips due to thermal instability.
Device-level optimization has been investigated to reduce these effects by carefully
choosing device dimensions and selecting reasonable retention times, but it
is still not enough to match CMOS technology in performance
and dynamic energy efficiency [3]. Optimization at the application level can take
the spintronic limitations into account and mitigate these weaknesses by introducing
novel architectures and computational models.
In the memory device usage area, spintronic memory (Spin-Transfer Torque-MRAM)
is an attractive alternative to CMOS since it offers higher density,
virtually no leakage current, and read latency similar to CMOS-based SRAM. It
continues to require higher write energy, however, presenting a challenge to
memory hierarchy design when energy consumption is a concern. In this thesis,
we use STT-MRAM for the first-level caches of a multicore processor, taking
advantage of its larger size and smaller leakage power to reduce energy
consumption. The large STT-MRAM first-level cache saves leakage power, but it
causes a significant performance drop due to the long write latency of the STT-MRAM.
This performance drop can be mitigated by implementing a small, fast, fully-associative
level-0 SRAM cache, which bridges the performance gap between the
CPU core and the slow STT-MRAM first-level cache. The proposed STT hierarchy
reduces the energy-delay product by 65% on average and shows good scalability
relative to the CMOS baseline, with a few benchmarks scaling significantly better.
The Parsec and Splash2 benchmark suites are analyzed running on a modern
multicore platform, comparing the performance, energy consumption, and scalability
of the spintronic cache system to a CMOS design.
In the logic device usage area, All-spin logic (ASL) has begun to draw significant
attention as a pure spin logic circuit that uses no charge-based devices.
However, its applications are limited by its random-flipping nature and relatively
large dynamic energy consumption. Conventional Boolean logic is very sensitive
to such reliability issues, so researchers have started to investigate other
computation models that fit the novel devices better. Stochastic computing,
which employs random bit streams for computation, has shown low hardware
cost and high fault tolerance compared to computations using a conventional
binary encoding. By combining ASL with stochastic computing, it is possible to
reduce the impact of random bit flips on the final results thanks to the high fault
tolerance of stochastic computing. Also, finite-state machine (FSM) based
stochastic computing elements can compute complex functions, such as the
exponentiation and hyperbolic tangent functions, which significantly simplifies the
spintronic circuit and makes it more efficient than Boolean logic. However, the FSM,
being sequential logic, cannot be directly implemented in parallel like combinational
logic, so reducing the long calculation latency is difficult. Applications in
relatively high frequency domains would require an extremely fast clock rate
when using an FSM. This work proposes a parallel implementation of the
FSM, using an estimator and a dispatcher to directly initialize the FSM to its
steady state. Experimental results show that the outputs of four typical functions
using the parallel implementation are very close to those of the serial version. The
parallel FSM scheme further shows equivalent or better image quality than the
serial implementation in two image processing applications, Edge Detection and
Frame Difference. Another issue with the FSM is autocorrelation, which changes
the randomness of the bit stream and impacts the results of relatively large
and complex stochastic circuits. We further analyze the autocorrelation, an
indicator of the randomness of a bit stream, with different FSM parameters,
including the number of states, the length of the bit stream, and the output value.
With a better understanding of the temporal correlation of the FSM output bits,
we propose a re-randomizer to solve this autocorrelation issue in FSM-based
computing elements.
Contributions: In this thesis, we have conducted a detailed study of optimizing
spintronic devices in both memory and logic circuitry design. For the memory
use case, we analyzed the impact of implementing spintronic memory caches at all
levels of the CPU memory hierarchy and proposed a write-merging L0 cache to
mitigate the performance drop caused by the spintronic memory device's long write
latency. To address the unreliability of spintronic logic devices, we propose
implementing spin logic circuits with the stochastic computing model to take
advantage of its high fault tolerance. Moreover, we propose a parallel
implementation scheme for the sequential stochastic computing element, the
finite-state machine, to further improve its performance. We also propose a
re-randomizer that solves the FSM autocorrelation issue and improves its output
quality for larger circuits. The rest of the thesis is organized as follows.
• Chapter 2 briefly presents the background of spintronic devices and stochas-
tic computing.
• Chapter 3 presents related works in the area.
• Chapter 4 demonstrates the spintronic device in the memory use case, including
the analysis of spintronic memory impact, the design and implementation
of the optimized cache hierarchy, and the interpretation of experimental results.
• Chapter 5 demonstrates the spintronic device in the logic use case, especially
when implemented using stochastic computing schemes. We further analyze
and design an optimized implementation of the stochastic computing unit,
the finite-state machine, to improve its performance and output quality.
• Chapter 6 presents a final discussion of the analysis presented in the thesis,
draws conclusions, and briefly discusses future work.
Chapter 2
Background
The scaling of electronic devices has successfully followed Moore's law for several
decades. CMOS has been shrinking its technology node every other year, but it is
starting to face serious scaling and power consumption issues. Current CMOS
devices, such as SRAM, have become unable to meet the demand for large, fast,
low-power on-chip caches in multi-core implementations. As for logic CMOS
devices, the technology node is reaching the 7 nanometer range and is close to
hitting physical limitations. Spintronic technology based on Spin-Transfer Torque
magnetics, capable of building both memory and logic circuitry, has drawn
substantial attention in recent years and quickly stands out as one of the most
promising CMOS alternatives. Specifically, Spin-Transfer Torque-Magnetic RAM
(STT-MRAM), a member of the novel non-volatile memory family, has shown great
potential for replacing on-chip caches due to its fast read latency, large capacity and
low leakage energy use. On the other hand, All-spin logic (ASL) can systematically
synthesize Boolean logic without charge-based devices, making it possible
to build pure spin circuitries.
2.1 Spintronic memory device, STT-MRAM
Spin-Transfer Torque-Magnetic RAM (STT-MRAM) offers higher density than
traditional SRAM cache, and its non-volatility facilitates low leakage power [4].
Also, STT-MRAM is one of the few candidates with similar read latency to current
SRAM technology. With this higher cell density and low leakage power, STT-MRAM
is generally considered a viable alternative to SRAM in future
on-chip caches.
However, due to its non-volatile nature, this technology suffers from high dy-
namic energy consumption, primarily due to high write power and longer write
latency [5]. The write latency of the STT-MRAM is commonly approximated as
3 to 4 times that of SRAM [6], but some consider it to be larger [7], so we perform
our analysis over a range of latencies. These characteristics seem to fit well at
the larger last-level caches of a processor, where high capacity is desirable and
longer latency is tolerated. Previous studies [8] and [9] also showed a performance
drop after directly implementing a first-level STT-MRAM cache. Indeed,
the majority of research in the area of on-chip STT-MRAM has been focused on
last-level caches in [5], [6] and [10].
To be a true replacement for CMOS, however, it would be desirable to use
STT-MRAM at all levels of on-chip cache. CMOS caches have evolved toward
deep hierarchies with multiple levels of private caches in multicore designs, since
read and write latency and power are similar. In a modern chip-multiprocessor
(CMP), multiple copies of data exist in different caches, and more data movement
occurs between caches for sharing. These extra cache updates beyond those seen
in a single-core processor increase the energy consumption. Figure 2.1 shows a
typical multicore hierarchy, highlighting the fact that multiple copies of a cache
line typically exist across the hierarchy. Data sharing requires extra data move-
ment across the hierarchy.
Figure 2.1: Typical multicore cache hierarchy, with multiple copies of data.

The significant leakage reduction potential and the extremely long write latency
of STT-MRAM motivate us to find an optimal cache hierarchy design that reduces
cache energy consumption without significantly degrading performance [11]. To
best exploit the increased density and reduced leakage power of STT-MRAM, it is
necessary to overcome the high dynamic write energy and latency of STT-MRAM
at the lower-level caches of a large CMP.
2.2 Spintronic logic device, All-spin logic
All-spin logic has been proposed to synthesize Boolean logic without charge-based
devices [2]. It is natively a majority gate logic and can implement multiple
logic functions, such as AND, OR, and NOT. Although all-spin logic has low standby
power and smaller size, it suffers from slow switching time and large switching
energy use [3]. Also, the nanomagnets can randomly flip depending on the thermal
stability factor, which can severely corrupt the output results.
The basic element of all-spin logic is an inverting/non-inverting gate, as in
Fig. 2.2a. The spin signal is transferred through a magnet from input A to output
B. When the applied VDD is positive, the gate is inverting; when the applied VDD
is negative, it is non-inverting and passes the original signal. If VDD is cut off,
the magnet does not switch and hence preserves the previous signal.

Figure 2.2: The basic all-spin logic elements. (a) All-Spin Logic Gate. (b) Majority Gate.
A majority gate can be built by combining multiple such gates, as in Fig. 2.2b.
This majority gate contains four magnets, where three of them are inputs and one is
the output. The majority of the inputs determines the output. By fixing one of
the input nodes to logic 0 or logic 1, we can easily obtain an AND or OR gate. This
majority logic can further synthesize various combinational logic.
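To make this reduction concrete, the following minimal Python sketch (our illustration of the logic, not of the ASL circuitry itself) models the three-input majority vote and shows how fixing one input yields AND and OR:

    def maj3(a, b, c):
        # Three-input majority vote, the native operation of an ASL majority gate.
        return (a & b) | (b & c) | (a & c)

    def and_gate(a, b):
        return maj3(a, b, 0)   # with a fixed 0, the output is 1 only when a = b = 1

    def or_gate(a, b):
        return maj3(a, b, 1)   # with a fixed 1, the output is 1 when either input is 1

    for a in (0, 1):
        for b in (0, 1):
            assert and_gate(a, b) == (a & b)
            assert or_gate(a, b) == (a | b)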
Current all-spin logic gates are capable of implementing simple circuits, and the
stochastic computing model fits this constraint well. Stochastic computing has been
shown to have low hardware area cost, high fault tolerance, and a short critical
path compared to computations using conventional binary encoding. Computations
based on this stochastic approach can be implemented with very simple
logic, and they can tolerate circuit unreliability through the unified data
representation [12], which mitigates the uncertainty of spintronic logic.
Combinational logic was studied in early stochastic computing. For instance,
an AND gate can be used to compute multiplication, as in Fig. 2.3.
Stochastic sequential logic using a finite-state machine (FSM) was first proposed
by Brown and Card [13] and later validated by Lilja and Li [12]. The FSM, as
in Figure 2.4, consisting of only a few D flip-flops and simple combinational logic,
is capable of approximating functions such as exponential, hyperbolic tangent
(tanh), and absolute value.
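As a behavioral illustration of the multiplication in Fig. 2.3 (a sketch of the computing model only, not of any spintronic hardware), the Python fragment below encodes two values as Bernoulli bit streams and ANDs them; for independent unipolar streams, P(a AND b) = P(a)P(b):

    import random

    def to_stream(p, length, rng):
        # Unipolar encoding: a value p in [0, 1] becomes a stream with P(bit = 1) = p.
        return [1 if rng.random() < p else 0 for _ in range(length)]

    rng = random.Random(42)
    a = to_stream(6 / 8, 1024, rng)
    b = to_stream(1 / 2, 1024, rng)
    c = [x & y for x, y in zip(a, b)]        # bitwise AND of the two streams

    # The mean of c estimates 6/8 * 1/2 = 3/8, within sampling noise.
    print(sum(a) / len(a), sum(b) / len(b), sum(c) / len(c))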
Figure 2.3: Stochastic Computing Multiplication using a single AND Gate. The
input streams 11010111 (a = 6/8) and 11001010 (b = 1/2) produce the output
stream 11000010 (c = 3/8).

Previous works already looked into the implementation of stochastic computing
circuits, including combinational stochastic logics [14], and peripheral circuits
such as random bit generators that take advantage of the spin devices' random-flipping
nature [15]. We further propose implementing the stochastic computing
sequential logic, the finite-state machine, with all-spin logic to provide more
complex functionality. With a relatively small number of spin devices, we can
even implement image applications, such as Edge Detection and Frame Difference,
using the stochastic computing model to minimize the impact of the spintronic
logic uncertainty.
However, stochastic computing incurs long latencies due to its long bit
streams [16]. This latency can be reduced by implementing parallel stochastic
units when only combinational logic is used [17]. Because the bits in combinational
logic without any feedback loop [18] are uncorrelated with each other, the
computation can be implemented in serial, distributed in time, or in parallel,
distributed in space; both have the same expected output value. On the
other hand, the FSM, as sequential logic with bits correlated in time, cannot
be directly implemented in parallel.
Currently, the length of a stochastic computing bit stream is typically 256
to 1024 bits, which means the clock frequency must be 256 to 1024 times
the sampling frequency. For example, audio applications with a sampling rate
of around 48kHz would require the stochastic computing circuit to boost
its clock rate to roughly 49MHz (48kHz × 1024). This is acceptable for
low-frequency situations, but for higher-frequency applications it will significantly
increase the hardware area and energy use.

Figure 2.4: Finite-state machine diagram for approximating the exp function. The
states S0 through S(N-1) form a saturating chain: an input bit of 1 (probability X)
moves the state toward S(N-1), and a 0 (probability 1-X) moves it toward S0. The
output Y is 1 in states S0 through S(N-k-1) and 0 otherwise, so the output stream
approximates exp(-2kX).
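The following Python sketch simulates this FSM behaviorally. It is our illustration under stated assumptions: the input X is taken to be bipolar-coded (carried by a stream with P(1) = (1 + X)/2), the state count and k are illustrative, and the quality of the exp(-2kX) approximation degrades toward the ends of the input range:

    import math
    import random

    def fsm_exp(x, n_states=16, k=2, length=8192, seed=1):
        rng = random.Random(seed)
        p_one = (1.0 + x) / 2.0          # bipolar coding of the input value X
        state = n_states // 2            # arbitrary initial state
        ones = 0
        for _ in range(length):
            # Output logic: Y = 1 in states S0 .. S(N-k-1), else Y = 0.
            if state < n_states - k:
                ones += 1
            # Next-state logic: a 1 input moves right, a 0 input moves left,
            # saturating at both ends of the state chain.
            if rng.random() < p_one:
                state = min(state + 1, n_states - 1)
            else:
                state = max(state - 1, 0)
        return ones / length

    for x in (0.1, 0.2, 0.3):
        print(f"X={x}: FSM output {fsm_exp(x):.3f} vs exp(-4X) = {math.exp(-4 * x):.3f}")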
Moreover, the finite-state machine suffers from an autocorrelation issue [13] due
to its sequential nature: its output bits no longer behave like independent draws
from a binomial model, the output is time dependent, and the random bit stream
assumption is broken. Autocorrelation, the cross-correlation of a signal with itself,
is a typical method for measuring the randomness of a sequence; an ideal random
sequence has a flat autocorrelation plot. Stochastic computing is based on the
assumption that each bit is an independent and identically distributed (iid)
Bernoulli random variable, so that output statistics such as the mean or variance
can be theoretically justified. The impact of autocorrelation can therefore affect
large stochastic networks, especially those with sequential dependencies such as
feedback structures. Feeding the output of an FSM into another FSM, or feeding
it back to the input side, can cause the circuit output value to drift away from
the expected probability.
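As a concrete illustration of this measurement (our sketch; the experimental setup used later in this thesis may differ in detail), the normalized autocorrelation of a 0/1 stream can be computed directly from its definition, and a correlated stream shows values far from zero:

    import random

    def autocorr(bits, max_lag=8):
        # Normalized autocorrelation at lags 1 .. max_lag; every value is
        # near 0 for an ideal Bernoulli stream (a "flat" plot).
        n = len(bits)
        mean = sum(bits) / n
        var = sum((b - mean) ** 2 for b in bits) / n
        return [sum((bits[t] - mean) * (bits[t + lag] - mean)
                    for t in range(n - lag)) / ((n - lag) * var)
                for lag in range(1, max_lag + 1)]

    rng = random.Random(3)
    bernoulli = [1 if rng.random() < 0.5 else 0 for _ in range(4096)]
    print(autocorr(bernoulli))   # values near 0: the stream is "flat"

    # A "sticky" stream that repeats its previous bit 90% of the time has the
    # same mean but large positive autocorrelation, much like an FSM output.
    sticky, bit = [], 0
    for _ in range(4096):
        if rng.random() < 0.1:
            bit = 1 - bit
        sticky.append(bit)
    print(autocorr(sticky))      # values well above 0 at small lags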
Chapter 3
Related Works
Spintronic devices have great potential as an alternative to CMOS devices
in both memory and logic circuits [3]. In [9], a detailed evaluation flow for the
emerging non-volatile memory technologies including STT-MRAM was described
to explore the next-generation memory hierarchy. Spintronic circuits have lower
standby power and smaller size, which is especially beneficial for mobile and
wearable devices [3].
Spintronic memory devices: Researchers have proposed implementing an
STT-MRAM L1 cache to take advantage of its larger capacity and significantly
smaller leakage power in [8], [19] and [9]. The feasibility of STT L1 data and
instruction cache implementations was evaluated, and a performance drop due to
the STT L1 data cache was observed. [8] investigated an STT-MRAM L1
implementation in a single-issue, in-order, 8-core system. A larger STT L1 in
the same area provided better performance but larger total power due to the
CMOS peripheral circuitry overhead. Li et al. [19] implemented a one-level STT-MRAM
cache in a simple embedded system with an in-order, single-core configuration.
They proposed a compiler-assisted refresh scheme for the implemented volatile
STT-MRAM, which significantly reduced the refresh frequency and minimized the
dynamic refresh energy. These studies mainly focused on evaluating the
implementation of STT caches in simple in-order CPUs rather than in
high-performance computing platforms.
To address the write power and latency problems, researchers have proposed
several techniques: decreasing the retention time [5], [7] and [10]; modifying the
cache hierarchy to use a mix of structures with different properties [4], [20], [21]
and [7]; implementing policies that limit write operations to high-power structures
[6] [22] [23] [24] [25] [26]; and using hybrid cache architectures [27] [22] [28].
Decreasing the retention time trades reliability for device area and energy at the
device level, while cache policies optimize energy consumption at the system
level. Both achieve significant energy reduction with comparatively modest
performance improvement, but require either additional logic or changes
to the cache control scheme.
Retention time can potentially be reduced in caches by reducing the MTJ
volume, since the lifetime of a cacheline can be much shorter than the typical 10
years. This would allow a reduction of the MTJ write current. Reduced retention
time was proposed and analyzed in [5] for on-chip caches on a single-core chip.
They proposed a hierarchy with an SRAM L1 cache and reduced-retention-time
STT-MRAM L2 and L3 caches, which showed an energy reduction of 70%, but at
a small performance loss. To ensure that the reduced-retention-time STT-MRAM
is reliable, they further proposed a refresh scheme similar to DRAM refresh
technology. Optimal retention times for the last-level cache were studied in [10],
settling on a retention time of about 10 ms after detailed application profiling
for CMPs. They proposed a victim-cache structure to handle cache lines that
exceed their corresponding retention time, achieving 18% and 60% improvements
in performance and energy, respectively. However, because a bit flip could happen
at any time during a cacheline's lifetime, an error correction coding scheme should
be introduced into the data checking procedure before refreshing [29].
Although reducing retention time can potentially decrease cacheline write
energy and latency, extra error-handling units must be added to maintain cache
reliability. This scheme reduces STT-MRAM dynamic energy in a way that is
orthogonal to our hierarchy scheme, leaving an opportunity to combine the two
in the future.
Implementation of STT-MRAM across the entire cache hierarchy, including the
L1 cache, was considered in [7]. That work implemented low-retention devices in the
L1 cache with a dynamic refresh scheme and further proposed a mixture of retention
times in the last-level cache. By using a data migration scheme, read-intensive data
and write-intensive data can be allocated to regions with different retention times,
which gives a 6.2% performance improvement and a 40% energy improvement over
the single-level relaxed-retention scheme. A read-write-aware hybrid cache hierarchy
was presented in [27], where the cache is divided into a Write section based
on SRAM and a Read section based on non-volatile memories including STT-MRAM.
They suggest an intra-cache data movement policy that produces an
overall power reduction of up to 55%, in addition to a 5% performance improvement
over the baseline SRAM L2 and L3 caches. A novel management policy using
a hybrid cache design was shown in [28] that aims at improving cache lifetime by
reducing write pressure on STT-MRAM caches. They show a 50% reduction in
power for an L2 shared cache along with a substantial improvement in cache
lifetime.
Hybrid schemes require complex control units to dispatch requests to different
cache devices. Our scheme, which only implements a standard cache level, avoids
these extra control units. This keeps our design simple and straightforward, and
lets it leverage existing schemes.
The idea of a small, fully-associative cache was first proposed in [30] to remove
mapping conflict misses in a direct-mapped cache by putting it in the refill path.
To reduce microprocessor energy use, [31] proposed a small direct-mapped
cache as a filter cache on the core side, which achieved almost 60% energy reduction
with a 20% performance drop. In [22], a read-preemptive write buffer of 20 entries
on the memory side was proposed to reduce read stalls to the STT-MRAM L2
cache during long write operations by implementing rules that favor read
operations. The write requests to an STT-MRAM L1 cache, however, mainly come
from the CPU, which issues stores at a much higher rate than memory-side
cacheline fills arrive. Also, the STT-MRAM L1 suffers from longer read access
latency due to the longer MTJ sensing time. In this thesis, the small fully-associative
cache is placed on the core side, first to improve the bandwidth of data
flowing to the STT-MRAM L1 cache by merging processor writes into cacheline
writes, similar to the write aggregation schemes in [32, 33], and also to provide
faster overall cache access due to its simplicity and small capacity.
Spintronic logic devices: Nikonov and Young proposed a new benchmarking
of beyond-CMOS exploratory devices, including various magnetoelectric, spin-torque,
and ferroelectric devices [3]. Behin-Aein et al. first proposed the all-spin
logic (ASL) device, which uses spin at every stage of operation rather than a
mixture of spin and charge-based devices [34]. A Functional Enhanced All Spin
Logic (FEASL) was proposed in [35] to enable the design of large Boolean logic
blocks. Pajouhi et al. further proposed a systematic methodology to synthesize
ASL circuits [2]. They identified that ASL requires large currents for fast switching
speeds, which causes a static power dissipation issue. The short spin-flip length in
the interconnects of ASL also becomes a key bottleneck. Moreover, spintronic
nanomagnets can flip on their own depending on their thermal stability level [36],
which can cause significant unreliability. Researchers have further explored
implementing spintronic devices in non-Boolean logic, such as stochastic computing
[14] and neural networks [37]. Stochastic computing, with its high error tolerance
and extremely simple hardware structure [12] [38], can effectively mitigate the
key drawbacks of spintronic devices. Stochastic computing combinational
logic and peripheral circuits such as random bit generators have been implemented
with spintronic logic devices in [14] [15], while stochastic computing
sequential logic such as the finite-state machine requires the implementation of
flip-flops, which were studied in [39] [40] [41] [42] [43]. All of these previous
attempts used both spin and charge-based devices in their designs.
Stochastic computing: Since the early works of Gaines [44], researchers have
employed stochastic computing algorithms in various areas including neural
networks [45] [46] [16] [47] [48] [49], signal processing [50] [51] [52] [53] [54], and
image processing applications [12] [55]. Qian et al. [56] proposed a synthesis
method using Bernstein polynomials to approximate functions with only
combinational logic. However, such synthesis requires multiple uncorrelated random
input bit streams, which increases the hardware cost. Besides, to achieve
a higher accuracy, the degree of the polynomial has to increase, which causes
the number of input sources to grow even larger. Functions such as the
exponential cannot be efficiently implemented this way due to the large hardware cost.
Brown and Card [13] proposed the sequential-logic FSM. Li and Lilja [12] later
validated the mathematics of the FSM and proposed systematic methods to
synthesize it and implement it in various image processing applications [55]. With
very limited hardware cost, the FSM is capable of approximating functions such
as absolute value, exponential, and tanh, and it is widely used in various
applications [45] [12] [16].
Although these applications benefit from the low hardware cost and high fault
tolerance of stochastic computing, the long sequence of bits required to reduce the
variance of the output estimate creates long latency and a significant performance
drop compared with conventional implementations. A parallel implementation
of combinational logic is proposed in [17] that achieves higher computing accuracy
and faster processing speed by using a nibble-serial data organization,
but this method applies only to combinational stochastic logic, not sequential
elements such as the FSM. Wang et al. [18] further studied the impact of feedback
loops on stochastic circuits. A re-randomizer is proposed to break the correlation
introduced by the feedback loop. Because the re-randomizer generates the bit
stream for the next real-domain clock value, it must preserve the equivalent
precision and use all the bits from the previous value to generate the next one,
which causes a delay of one real-domain clock. Pixel-level parallelism, which
requires a large array of stochastic computing units to compute the entire image,
is proposed in [12] to speed up application processing time. Although this method
can improve throughput, the calculation latency for each pixel remains the same.
Another issue with the FSM is that its output no longer follows a random
binomial distribution; instead, the output shows autocorrelation [13], the cross-correlation
of a signal with itself and a typical measure of the randomness of a
sequence. Saeed et al. [52] also mention the autocorrelation problem in their LDPC
stochastic decoders due to the feedback structure in their design. An ideal random
sequence should have a flat autocorrelation plot. Stochastic computing is based on
the assumption that each bit is an independent and identically distributed (iid)
Bernoulli random variable, so the output statistical characteristics such as the mean
or variance can be theoretically justified. However, the FSM does not produce such
a flat autocorrelation plot, indicating that the randomness assumption is violated.
Two possible re-randomizers are proposed in [52] to break the correlation within a
bit stream. One of them, called an Edge Memory, uses an M-bit shift register with
a selectable bit to produce new bit streams; it requires a short period, otherwise
it incurs a large hardware overhead. The other method uses an up/down counter
to integrate the input bits and simultaneously regenerate a new bit stream. We
followed the latter method to design our proposed re-randomizer.
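As an illustration of this up/down-counter approach, the Python sketch below gives one plausible realization (our assumption for illustration; the design proposed later in this thesis may differ in detail). The counter counts up on incoming 1s and down on emitted 1s, so ones are conserved and the stream's mean is preserved, while each output bit is drawn with a fresh random comparison that breaks the input's temporal correlation:

    import random

    def rerandomize(bits, counter_bits=6, seed=7):
        rng = random.Random(seed)
        levels = 1 << counter_bits       # counter range [0, levels)
        counter = levels // 2            # arbitrary starting value
        out = []
        for b in bits:
            counter = min(counter + b, levels - 1)    # integrate incoming 1s
            # Emit a 1 with probability counter/levels using a fresh random
            # draw; emitting a 1 consumes one count, conserving the mean.
            if rng.randrange(levels) < counter:
                counter -= 1
                out.append(1)
            else:
                out.append(0)
        return out

    # e.g. out = rerandomize(fsm_output_bits), where fsm_output_bits is any
    # autocorrelated 0/1 list (the name is hypothetical).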
Chapter 4
Incorporating Spintronic Devices
in CPU Caches
With higher density, low leakage power, and read latency similar to current SRAM
technology, STT-MRAM is generally considered a potential candidate to replace
SRAM in future on-chip caches. However, STT-MRAM suffers from high dynamic
energy consumption, due to the large write current and the long write pulse
needed to switch spin directions. This long latency is tolerated at the larger
last-level caches of a processor, but it causes a significant performance drop if
first-level caches are replaced directly [8] [9]. To overcome this performance
degradation and take advantage of the significant leakage reduction and large
capacity, we proposed an optimized cache hierarchy design [11]. We utilize a
novel physics-based model of the Magnetic Tunnel Junction (MTJ) to develop size
and energy models of STT-MRAM cells [57]. Because the usage of first-level and
last-level caches is quite different, we have evaluated different circuit-level tradeoffs
between MTJ read and write energy to find optimal design points for energy and
performance. A drop-in replacement of CMOS with STT-MRAM exposes a
fundamental mismatch between the bandwidth of data being written by the processor
and the ability of the STT-MRAM cache to absorb it. By introducing a small, fully-associative
Level-0 (L0) cache, this bandwidth mismatch can be accommodated.
This structure also benefits cache dynamic energy consumption, since it is so small
that both its static and dynamic energy use are quite low. This is an extension of
the analysis in [58], which analyzed the effectiveness of a small L0 of various sizes
compared to a simpler two-level CMOS hierarchy.
In this chapter, we present a detailed analysis of the impact of high write latency
at the L1 cache level, including the tradeoff between the read and write energy and
latency of STT-MRAM caches. We further demonstrate the benefit of the write-merging
L0 cache for both the performance and the energy consumption of a
fixed-core-count system. Finally, an analysis of scalability with an increasing
number of cores is conducted to compare CMOS caches to STT-MRAM caches.
4.1 Simulation Methodology
A storage cell using STT-MRAM is depicted in Figure 4.1(a). The
bit-cell consists of an access transistor and a storage element that uses a Magnetic
Tunneling Junction (MTJ), known as a 1T1MTJ cell. The MTJ consists of a
fixed layer and a free layer, separated by a thin insulator that allows a tunneling
current to flow when biased. The material used for the fixed and free layers has
two stable spin directions, with the spin direction of the fixed layer locked. The
free layer can have either the same spin orientation as the fixed layer, known
as Parallel or P (Fig. 4.1b), or the opposite orientation, known as Anti-Parallel
or AP (Fig. 4.1c). Resistance through the device is higher in the AP state, so a
read operation consists of sensing the high or low resistance value. For a write
operation, current passing through the device in one direction gives the free layer
the P orientation, while passing it in the other direction creates the AP orientation.
There is a critical minimum write current which must be maintained for an
adequate period of time to allow complete switching, leading to the longer latency
of write operations. The access transistor must be sized to provide a sufficient
switching current, and a higher current (above the critical value) enables a shorter
switching delay. The access transistor is
usually larger than the MTJ, so there is a tradeoff between bit-cell area and
switching time. A larger bit-cell can have a lower switching time and lower energy,
but it creates a larger array for the same storage capacity, leading to longer wires
and higher energy requirements at the array level.

Figure 4.1: A STT-MRAM 1T1MTJ bit-cell (a), showing the access transistor
and MTJ storage element. In the Parallel (P) state (b), resistance through the
device is lower than in the Anti-Parallel (AP) state (c).
4.1.1 Technology Modeling
A combination of SPICE and Cacti [59] simulations was used to develop the
technology models in this analysis. SPICE was used for bit-cell simulations, and
these results were entered into Cacti for array modeling. The SPICE models
developed for [57] were used to simulate MTJ switching energy and transistor
sizing for the write pulse widths of interest. Retention times of 10 years as well
as 1 year were simulated for a 20nm Predictive Technology Model [60]. A
6σ methodology is used in the SPICE models to eliminate defects from process
variations, especially for STT sensing and write delay.
The bit-cell and transistor sizes from these simulations were then used in a
modified Cacti to generate array energy and timing values for various operations.
Since Cacti does not support STT-MRAM modeling, we modified the original
Cacti SRAM model to simulate the STT-MRAM MTJ cell by changing the cell
width and aspect ratio to fit a 1T1MTJ cell and setting all cell leakage power to 0.
This model treated the STT-MRAM like an SRAM cache with a smaller cell size
and different cell dimensions. This approach created a conservative estimate of
the STT-MRAM cache array, since there is potential to further optimize the
STT-MRAM circuitry. We also evaluated the energy and timing values produced
by the NVSim [61] non-volatile memory simulator and found that our approach
produced slightly more conservative parameter values than NVSim, although the
results were comparable. The bit-cell write energy from the SPICE simulation,
multiplied by the cache line size, was then added to the dynamic write energy
value from Cacti to estimate the STT-MRAM cache write energy; a worked check
of this calculation follows the tables below. Since not every bit in a cache line
switches on a write operation, this gives a large, and thus conservative, estimate
of the write energy. Table 4.1 lists the values gathered from the SPICE simulations.
We observed that as the transistor size is made smaller, the bit-cell write energy
increases but the array-level dynamic read and write energy decreases. The total
write energy trends are in different directions for the L1 and L2 cache arrays, as
shown in Figure 4.2. L1 and L2 caches also have different read and write access
patterns, with an L1 cache typically seeing a higher percentage of read operations,
so the optimal design point can differ between the two. For the L1 cache, the
optimal point is somewhere in the middle, around the 5ns write region, since write
energy grows quickly to the left and read energy grows to the right. However,
since system performance is very sensitive to the L1 cache write pulse latency, a
short write is still preferred. For the L2 cache, the trends are similar, so the
optimal point there is in the long-write region. Previous work [10, 4] has shown
that L2 write latency has little effect, so we pick 7ns in this chapter. The CMOS
cache energy parameters were modeled directly with Cacti, as listed in Table 4.2.
Table 4.1: Energy consumption parameters for the STT cache structures.

size                       |      64kB      |     128kB      |     256kB      |      4MB       | 16MB | 32MB | 64MB
MTJ transistor size (F^2)  |  24   48  144  |  24   48  144  |  24   48  144  |  24   48   84  |  24  |  24  |  24
MTJ latency (ns)           |   7    5    3  |   7    5    3  |   7    5    3  |   7    5  3.6  |   7  |   7  |   7
MTJ flip energy (fJ)       | 508  418  378  | 508  418  378  | 508  418  378  | 531  432  397  | 531  | 531  | 531
read (pJ)                  | 15.6 17.8 30.1 | 18.6 21.4 31.0 | 24.3 28.4 48.4 | 152  190  223  | 255  | 406  | 766
WrtAccess (pJ)             | 19.1 22.7 47.6 | 25.2 30.2 44.9 | 37.4 45.6 62.7 | 173  269  315  | 286  | 448  | 949
WrtCell (pJ)               | 260  214  194  | 260  214  194  | 260  214  194  | 272  221  203  | 272  | 272  | 272
Wrt (pJ)                   | 279  237  241  | 285  244  238  | 298  260  256  | 445  490  518  | 558  | 720  | 1221
leakage (mW)               | 1.7  1.9  2.6  | 2.6  2.8  3.3  | 4.4  4.6  6.7  | 64.2 98.2 126  | 232  | 560  | 1048
Table 4.2: Energy consumption parameters for the CMOS cache structures.
Leakage is in mW.

size      1kB    4kB    32kB   64kB  4MB   8MB   16MB
rd (pJ)   6.9    7.2    28.3   35.0  293   520   1003
wrt (pJ)  7.6    10.7   30.6   40.2  344   512   1114
leakage   0.76   1.33   13.45  28.2  736   1440  2984
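As the worked check promised above, the write-energy composition can be reproduced from the 64kB, 24F^2 column of Table 4.1, assuming a 64-byte (512-bit) cache line (our assumption, consistent with the table's numbers):

    flip_energy_fJ = 508                         # per-bit MTJ switching energy (SPICE)
    wrt_cell_pJ = flip_energy_fJ * 512 / 1000    # ~260 pJ, matching the WrtCell row
    wrt_total_pJ = wrt_cell_pJ + 19.1            # plus the WrtAccess array energy (Cacti)
    print(wrt_cell_pJ, wrt_total_pJ)             # ~260 pJ and ~279 pJ, matching the Wrt row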
Figure 4.2: The different trends in read and write energy for MTJ cells used in
L1 (left) and L2 (right) caches, plotted against MTJ transistor size (F^2) and
write latency (ns). The Write total is a combination of Write cell and Write access;
Write cell is the per-bit switching energy from SPICE and Write access is the
array access energy reported by Cacti.
4.1.2 Architectural Simulation
Complete architectural simulations were used for both design-space comparisons
and scalability analysis. The simulations were performed with the Sniper
simulator [62] running the Parsec [63] and Splash2 [64] benchmark suites. The two
benchmark suites have fundamentally different properties: Splash2 focuses
more on high-performance computing, while Parsec includes a wide range of
applications [65]. A combination of the two improves the benchmark program
diversity. The Sniper simulator is based on an analytical model that estimates
performance by analyzing intervals. This model achieves a 10x simulation
speedup with relatively high accuracy [62]. The speedup gave us the ability to
directly measure the complete benchmark execution time and avoid any sampling
scheme, which is not ideal for multi-threaded benchmarks [66]. The simulator
was modified to properly model cache write latency in all relevant operations.
We modeled the STT-MRAM cache as in Fig. 4.3 in both the Sniper and Gem5 [67]
simulators. The charging write pulse blocks a subsequent write request,
reducing bandwidth. The read and write paths are separated so that a write in
progress does not also block a consecutive read operation, which avoids a further
performance drop [68]. The STT-MRAM model has more detail in Gem5 than in
Sniper: the Ruby memory model in Gem5 provides the flexibility to examine the
modified coherence transitions that accommodate the write-blocking mechanism of
the STT-MRAM, which validates this model. Fig. 4.4 shows one typical
coherence transition modification. NP and S are stable states, meaning empty
and shared, respectively. IS and IS_S are intermediate states, representing the
transition from NP to S and the transition from IS to S, respectively. To simulate
write blocking, we added the IS_S state to block all write requests to the same
cacheline until it reaches the S state after the write pulse latency.
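A toy version of this write-blocking behavior (our sketch of the mechanism described above, not the actual Sniper or Gem5 code) can be expressed as a port that stays busy for one write pulse per request:

    class STTWritePort:
        # A write occupies the port for one write pulse; a write arriving
        # while the port is busy stalls until the previous pulse completes.
        def __init__(self, write_pulse_ticks):
            self.write_pulse = write_pulse_ticks
            self.busy_until = 0                       # tick at which the port frees up

        def write(self, cur_tick):
            start = max(cur_tick, self.busy_until)    # stall while the port is busy
            self.busy_until = start + self.write_pulse
            return self.busy_until - cur_tick         # observed write latency

    port = STTWritePort(write_pulse_ticks=10)         # e.g. a 5ns pulse at 2GHz
    print(port.write(100))   # 10: the port is idle, only the pulse latency
    print(port.write(103))   # 17: arrives mid-pulse and stalls 7 extra ticks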
A range of cache hierarchies with various capacity, access and write latencies
were simulated, using the MESI cache coherence protocol with strict inclusive
policy. Table 4.3 lists the processor configurations used in the simulations. All
based on a four-wide out-of-order execution model running at 2GHz. Structures
23
NP IS SL1_GETS DATAAllocate TBEAllocate L2
Deallocate TBEsendDataToRequestor
writedataToCache
NP IS IS_SL1_GETS DATA
Allocate TBEAllocate L2
sendData
SWD_Time
Deallocate TBEwritedataToCache
Block state
Check Resource Available: fail; success
Bank0
curTick: 10003
busyTick: 10010
Bank0
curTick: 10000
busyTick: 9994
busyTick: 10010
Add write pulse
latency
trigger
Original Transition
Modified Transition
Figure 4.3: STT-MRAM cache is modeled to have separated read/write port. Theread port works the same as the CMOS SRAM cache, but the write port blocksa consecutive write operation when it is not available.
Cache
Rd/Wrt Port
Rd/Wr Request
Request SentEvery Cycle
CMOS SRAM Cache
Data SentAccess Latency
Cache
Rd Port
Rd Request
Same As
CMOS
STT-MRAM Cache
Wrt Port
Wrt RequestStall When
Wrt Port Not Avail
Request SentEvery
Write Pulse Latency
Figure 4.4: The coherence transition diagram modeled in Gem5 to accommodatethe write blocking mechanism of STT-MRAM. Instead of having one intermediatetransition state IS, we added another transition state IS S to simulate the blockingstate of the STT-MRAM, no write requests can be served in this state unless itfinished its transition to S state.
24
Table 4.3: Simulated Processor Configurations.
Parameter Values
Pipeline 4-wide, out-of-orderL1 ICache 64KB, 2-way
ROB entries 128Memory 45ns latency, 7.6GB/s bandwidth
such as the reorder buffer (ROB) were deliberately set on the large side to remove them as possible hidden bottlenecks in the simulations. The instruction cache was implemented as a 64KB CMOS cache in all configurations to minimize its impact.
All data is reported only for the parallel region-of-interest (ROI). The number of processors was varied from 4 to 16. Typical two-level and three-level cache hierarchies were analyzed, with a shared last-level cache and private cache(s) per core. A crossbar interconnect joined all cores. Table 4.4 lists the values of the system parameters simulated for each benchmark. The access latency in this table refers to the time for a cache access to complete or for the requested data to return, which typically corresponds to the read latency. The write latency refers to the STT-MRAM write pulse delay. We assumed the access latency of the STT L1 cache is 25% to 50% longer, and the STT L2 cache 20% faster, than the corresponding CMOS cache, according to [69]. The longer sensing time of the STT-MRAM due to read disturbance significantly increases the access latency of a smaller cache, but the shorter interconnect delay makes it faster for larger caches. The L2 cache associativity of each configuration was larger than the sum of all L1 cache sets to avoid cache misses due to the inclusive policy. The large datasets of both suites were used for all simulations. The native datasets were used on a few benchmarks to show that the scalability observations remain valid on real workloads.
Table 4.5 shows the main configurations used in the following graphs. The names in the left-hand column are how these configurations will be referred to and labeled in the graphs. We picked a 7ns write latency for the STT L2 and 5ns for the STT L1 cache due to performance and energy concerns. The MTJ transistor sizes for L2 and L1 are 24F² and 48F², as in Figure 4.2, while a common 6T SRAM cell can be 135F² [69]. All stt configurations used an STT L2 cache. Conservatively, we estimated the STT L2 to have four times the capacity of the cmos-base L2, and the STT L1 to have two to four times the capacity of the CMOS L1, in the same footprint. Specifically, l1d2 means the STT L1 has twice the density (same footprint) of the CMOS L1, and l0z4 means the CMOS L0 size is 4KB.

Table 4.4: Simulated STT-MRAM Cache Parameters. For writes, the access latency is added to the write latency.

Parameter               Values
Access Latency (CMOS)   L0 1 or 3, L1 4, L2 30 cycles
Access Latency (STT)    L1 5 or 6; L2 24 cycles
Write Latency (STT)     L1 5ns (optional 3, 7ns); L2 7ns
L0 DCache               1K, 4K fully-associative, private
L1 DCache               64K, 128K, 256K 4-way associative, private
L2 Cache (4 cores)      4MB, 16MB 16-way associative, 8 banks
L2 Cache (8 cores)      8MB, 32MB 32-way associative, 8 banks
L2 Cache (16 cores)     16MB, 64MB 64-way associative, 8 banks

Table 4.5: Simulated Cache Hierarchies.

Configuration Name   L0 Sz     L1 Sz      L2 Sz
cmos-base            -         CMOS 64K   CMOS 4MB/8MB/16MB
stt-l2               -         CMOS 64K   STT 16MB/32MB/64MB
stt-l1d2             -         STT 128K   STT 16MB/32MB/64MB
stt-l1d4             -         STT 256K   STT 16MB/32MB/64MB
stt-l0z1             CMOS 1K   STT 128K   STT 16MB/32MB/64MB
stt-l0z4             CMOS 4K   STT 256K   STT 16MB/32MB/64MB
4.2 Performance, Energy and Scalability
Performance, energy consumption, energy-delay product, and scalability of the cache hierarchy are the primary metrics of interest. Performance here is measured as total execution time. Energy refers to the overall CPU cache energy consumption, not that of the entire CPU. Scalability refers to the speedup obtained with different numbers of cores. In this section we examine the experimental results of our proposed STT-MRAM cache hierarchy compared to a CMOS baseline. The comparison of performance and energy results is based on a fixed number of cores across the different hierarchies. We explore the scalability of the proposed hierarchies from 4 to 16 cores to investigate the STT-MRAM impact. Finally, we show the performance and energy impact of implementing a larger L0 cache.
4.2.1 Performance
The challenge of using STT-MRAM for lower level caches (closer to the CPU)
is to overcome the added write latency and dynamic write energy. The reward
is increased density and significantly reduced leakage power, which comprises the
majority of the power consumed in a CMOS cache hierarchy. Long write latency
at a first-level cache creates a bandwidth mismatch between the processor pipeline
and the cache. Queueing and buffering can only absorb a certain amount of write
data before the processor must stall, if the cache system cannot keep up with the
offered load. Figure 4.5 (a, b) shows the impact of write latency on performance
when the CMOS L1 cache is replaced by STT-MRAM. Extra capacity in the L1
made possible by the higher density of STT-MRAM cannot compensate for the
reduction in write bandwidth seen by the processor.
A method is needed to match the bandwidth between the CPU and the L1 cache without significantly increasing cache energy consumption. Augmenting the STT-MRAM L1 with a small fully-associative CMOS Level-0 (L0) cache was investigated in [58] and found to be an effective method to restore the performance lost to the higher write latency. The L0 cache acts as a write-merging buffer, translating single-word writes from the CPU into cache line writes to the L1. If enough of the processor traffic is handled by this L0, performance can be restored. We have implemented this structure as a standard write-back cache, so it uses a standard cache controller with none of the extra functionality that a hybrid cache or low retention-time cache would require. By keeping the L0 as small as
possible, access time and leakage power are kept low.

Figure 4.5: (a), (b) compare the cmos-base (64K,4M) to STT hierarchies (128K,16M); the write latency of the L1 cache results in a significant performance drop. (c), (d) compare the cmos-base to STT hierarchies that use the write-merging L0; the hierarchy uses a 4K fully-associative L0 cache and an STT L1 cache with various write latencies.

Figure 4.5 (c, d) demonstrates the improvement from using the L0. When the L0 is 4KB, the average miss rate can be as low as 2%, as shown in Figure 4.6. The performance of the STT-MRAM two-level hierarchy differs by almost 50% among the 3, 5 and 7ns write latencies, while the performance difference of the three-level hierarchy with the L0 shrinks to 15%. The three-level hierarchy performs 40% better on average than the two-level hierarchy with a 5ns write latency.
4.2.2 Energy Comparison
By reconfiguring the cache hierarchy to contain as little CMOS circuitry as pos-
sible, leakage power is reduced significantly. With the L0 write-merging cache as
the only all-CMOS structure, the L1 and LLC can be configured for larger capacity in a given area. We have simulated several different combinations of cache sizes at all three levels. Figure 4.7 (a, b) shows the potential energy savings with this three-level configuration. All graphs in this section use the simlarge dataset to stress cache capacity as much as possible in simulation. SRAM L2 cache leakage and SRAM L1 cache leakage account on average for approximately 80% and 10% of the total cache energy consumption, respectively. The total energy use drops by almost 60% after adopting the STT L2 cache (stt-l2), and drops another 10% with the 4KB L0 STT configuration (stt-l0z4). Total energy savings with the small L0 implementation are on average approximately 70% for both the Parsec and Splash2 benchmark suites. Figure 4.7 (c, d) shows the dynamic energy consumption of various STT hierarchies, omitting L0, L1 and L2 leakage. Different L0 and L1 sizes are shown. The large cross-hatched segment in the middle of the 1KB L0 (stt-l0z1) bars represents cache line writes to the L1. With the smaller L0, there is a large number of L0 write-backs of modified data to the L1. This segment decreases rapidly as the L0 size increases.

Figure 4.6: The near-core cache miss rate. Though the 1KB fully-associative L0 cache does have a large miss rate, the 4KB cache's average miss rate is less than 5%.
4.2.3 Energy-Delay Product
Though the tradeoff between energy use and performance is common in system design, improving both at the same time rarely happens. To show the overall merit of the hierarchy design, we use the energy-delay product (EDP) as a metric that highlights the most balanced architecture between energy efficiency and performance [70]. Figure 4.8 (a, b) shows the EDP of a system with four cores. On average, for both benchmark suites, the 4KB L0 (stt-l0z4) shows a significant 65% EDP reduction over the baseline (cmos-base). The 4KB L0 (stt-l0z4) also shows an approximately 25% reduction over the configuration that only replaced the CMOS L2 with STT (stt-l2); this corresponds to approximately an additional 10% reduction relative to the CMOS baseline. Figure 4.8 (c, d) shows the EDP of a system with 16
cores. The observations for four cores still hold for the 16-core system.

Figure 4.7: (a, b) show the total energy consumption normalized to cmos-base (64K,4M) for stt-l2 (64K,16M), stt-l1d2 (128K,16M), and stt-l0 with varying L0 sizes (1/4K, 128K, 16M). The CMOS L2 leakage is computed for a 4MB STT-MRAM cache to create a fair baseline. (c, d) show the dynamic energy consumption with the same configurations as in (a, b).

Figure 4.8: (a, b) show the energy-delay product of various STT hierarchies with four cores. (c, d) show the energy-delay product with 16 cores. All are normalized to cmos-base.

Table 4.6: Cache capacity impact on performance and cache reuse of canneal with the simlarge dataset. The execution time of each configuration is normalized to 4 cores with a 4MB L2 cache. The percentage is the L2 cache miss rate.

cores   4MB       8MB           16MB          32MB          64MB           128MB
4       1 (43%)   0.84 (38%)    0.68 (31%)    0.61 (25%)    0.477 (20%)    0.476 (18%)
8       -         0.477 (38%)   0.41 (31%)    0.32 (22%)    0.30 (16%)     0.29 (12%)
16      -         -             0.36 (31%)    0.19 (22%)    0.16 (12%)     0.15 (8%)

In
summary, the 4KB L0 hierarchy achieves, for both benchmark suites, an average 65% EDP reduction over the CMOS baseline and a 25% EDP reduction over the configuration with only an STT-MRAM L2. The 65% EDP reduction comes mainly from the energy reduction, since the STT L2 has little impact on benchmark performance due to the low frequency of L2 write accesses. Moreover, since the L1 SRAM cache leakage grows to 25% of the total once the STT L2 cache is implemented, we could potentially achieve a further significant energy reduction by switching the SRAM L1 to an STT L1 as well. But due to the long write latency and the much higher write access frequency at L1, such a direct replacement results in much slower execution and larger dynamic energy use, as in Figure 4.7 (c, d). By implementing the small L0, a significant number of CPU writes are absorbed, so the write access frequency to the STT L1 becomes small, further reducing dynamic energy use and improving performance. Besides, the small L0 can potentially provide faster CPU-side cache access than the STT L1, which suffers from a longer sensing time. The 25% EDP reduction over the STT L2 implementation is thus achieved by this small L0 scheme.
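Since the EDP is simply the product of energy and delay, the normalization used in these graphs reduces to a one-line computation; a minimal sketch (the name and the numbers here are illustrative, not measured data):

    def normalized_edp(energy, delay, base_energy=1.0, base_delay=1.0):
        # Energy-delay product of a configuration, normalized to the baseline.
        return (energy * delay) / (base_energy * base_delay)

    # Illustration: a hierarchy using 0.35x the cache energy at roughly the
    # baseline delay yields a ~65% EDP reduction, as reported for stt-l0z4.
    print(1 - normalized_edp(0.35, 1.0))  # 0.65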
4.2.4 Scalability
To analyze the scalability of the workloads, we have used full simulation runs with
the large and native simulation datasets. Figure 4.9 shows the scalability from 4
to 16 cores of both Parsec and Splash2 benchmark suites. Only the slopes of the
lines are being compared here among different configurations. With the simlarge
dataset, both suites show that the STT-MRAM does not significantly impact the
scalability over the cmos-base. Canneal and facesim from Parsec show better scalability than the other benchmarks. According to [71], canneal is limited primarily by memory latency rather than bandwidth due to low data reuse. But this observation was made with a relatively small LLC, while Tables 4.6 and 4.7 show that a larger STT cache has better cache reuse for both the simlarge and simnative input datasets. The simlarge and simnative input datasets of canneal are 256MB and 2GB, respectively. Since it is possible that the working set of simlarge fits in a 64MB STT L2 while that of simnative does not, we further investigated the simnative input dataset of canneal to see whether the larger-L2 benefit persists. The scalability with the simlarge input dataset reaches a maximum at 32MB, while the scalability with the simnative input dataset keeps increasing with the L2 cache capacity. This means that with a real workload, a larger L2 cache (up to 128MB) could improve the scalability of canneal. We observed poor scalability from swaptions of Parsec and raytrace of Splash2. Freqmine only runs single-threaded due to the lack of OpenMP support in the Sniper simulator. In general, the STT-MRAM hierarchy with the 4K L0 (stt-l0z4) scales as well as the cmos-base, with a few cases, including facesim, canneal and cholesky, showing significant scalability improvements.

Figure 4.9: The scalability of the various cache hierarchies from 4 to 16 cores, including the two-level cmos-base (64K,4/8/16M), stt-l2 (64K,16/32/64M) and stt-l1d2 (128K,16/32/64M), and the three-level stt-l0z1 (1K,128K,16/32/64M) and stt-l0z4 (4K,128K,16/32/64M). The three-level hierarchy has similar scalability to the two-level cmos-base, but canneal and facesim in particular show better results than the others.

Table 4.7: Cache capacity impact on performance and cache reuse of canneal with the simnative dataset. The execution time of each configuration is normalized to 4 cores with a 4MB L2 cache. The percentage is the L2 cache miss rate.

cores   4MB       8MB           16MB          32MB          64MB          128MB
4       1 (94%)   0.94 (90%)    0.84 (83%)    0.73 (70%)    0.58 (53%)    0.49 (35%)
8       -         0.67 (90%)    0.61 (82%)    0.51 (70%)    0.38 (53%)    0.27 (35%)
16      -         -             0.55 (83%)    0.47 (70%)    0.35 (53%)    0.22 (34%)
4.2.5 Larger L0 impact
It is clear that the implementation of this small fully-associative L0 achieves a good tradeoff between performance and energy use. We further investigated a larger L0 to see whether a better EDP could be achieved. A 32kB 4-way associative and a 64kB 8-way associative L0 were implemented into the same hierarchy, as in Figure 4.10 and Figure 4.11. With caches this large, full associativity becomes too costly in dynamic energy consumption and access delay. Figure 4.10 shows an average 10% performance improvement, but 20% to 40% more energy use, after adopting these larger L0 caches. The overall EDP increases by more than 20% in both benchmark suites in Figure 4.11. Because the 4kB L0 already has a low average miss rate of 2% in Figure 4.6, there is little room for performance improvement from simply increasing the L0 size, while static leakage and dynamic energy increase significantly. The 4K L0 remains the better design point for trading off performance and energy use.

Figure 4.10: Performance and energy use of several configurations from the CMOS baseline, stt-l1d2 and stt-l0z1 through stt-l0z64, where stt-l0z32 and stt-l0z64 have 32kB 4-way associative and 64kB 8-way associative L0 caches. The average performance improves by less than 10% while the energy use increases by more than 25% from stt-l0z4 to stt-l0z64.

Figure 4.11: The energy-delay product of several configurations including the larger L0 implementations. The overall energy-delay product increases by 20% on average for both benchmark suites.
4.3 Summary
We have analyzed the impact of STT-MRAM as a replacement for CMOS at all
levels of a multiprocessor cache hierarchy. Though STT-MRAM has higher write
energy and latency, reducing these parameters at the circuit level does not lead to
an optimal design. The extra circuit area required to minimize MTJ bit-cell write
time and energy causes the cache arrays to grow, leading to higher read energy
and latency due to parasitic effects.
A fully-associative L0 cache as small as 4KB can effectively restore performance
lost to the higher write latency. This structure hides the extra write latency of
around 5ns when running at 2GHz, giving a total cache energy savings of 40-70%
and an average energy-delay product reduction of 60% compared to the CMOS
baseline. The L0 cache is implemented as a standard cache level, requiring no
additional control structures. We observed no significant scalability impact from using STT-MRAM with the L0 implemented. A few benchmarks show improved scalability up to 16 cores using the STT-MRAM hierarchy.
The introduction of new memory technologies can have significant impacts on
the best architectural choices for the memory hierarchy of a multicore system. This
chapter shows that simple solutions can help mitigate the negative impacts while
still allowing the system to take advantage of the benefits of the new technology.
Chapter 5
Incorporating Spintronic Devices
in Logic Units
All-spin logic (ASL) is capable of synthesizing Boolean logic using majority gates without charge-based devices [2]. It has low standby power and a much smaller size, which fits the needs of the growing mobile and IoT applications. However, the slow switching time and large dynamic energy consumption significantly impact the performance of ASL. Moreover, random bit flips of the nanomagnets greatly impact the output results of circuits using such devices. The basic element of ASL, the majority gate, can be efficiently used in combinational stochastic computing circuits, which provide high fault tolerance and low hardware cost. We also propose to implement sequential stochastic computing logic, the finite-state machine (FSM), which further expands its functionality and can be used in multiple image applications. One of the problems of stochastic computing is its long calculation latency. To improve the overall performance and reliability, we propose a parallel implementation scheme for the FSM to reduce the calculation time, and we analyze the autocorrelation issue of the FSM, which impacts the output results of large circuit networks, especially those with feedback loops.
Figure 5.1: Basic logic elements (AND, OR, NOR, and a multiple-input AND gate) implemented using all-spin logic.

In this chapter, we first propose the spintronic logic implementation of the
stochastic computing elements, including combinational and sequential logic. Then we propose a parallel implementation scheme for the sequential stochastic computing unit, the finite-state machine, to mitigate its long calculation latency [72]. Lastly, we analyze the parameters that impact the FSM autocorrelation and use a re-randomizer to resolve this issue, as in [73].
5.1 Spintronic logic devices
Using the majority-gate nature of all-spin logic (ASL), we can implement several basic combinational Boolean logic gates, including AND, OR, NOR and a multiple-input AND gate, as in Fig. 5.1. These gates are the basic elements for building larger stochastic circuits. Sequential elements, such as flip flops, are also crucial for building sequential stochastic computing logic, such as the finite-state machine (FSM), which can approximate complex functions, including the absolute value, hyperbolic tangent (tanh) and exponential functions [12]. Fig. 5.2 shows the diagram of a J-K flip flop implemented using ASL. The J-K flip flop uses an internal D flip flop, which is controlled by the system clock. Data is stored in the first magnet when the clock is positive, then moved to the second magnet when the clock becomes negative.
Figure 5.2: Flip flops, the basic elements of the FSM, implemented using all-spin logic.
This second magnet can only store the data when the clock changes from positive to negative, thus effectively separating the data node from the internal storage node.
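As a logic-level illustration of this master-slave behavior (a behavioral sketch of ours, not the magnet-level device model):

    class MasterSlaveJKFlipFlop:
        """Master latch follows the J-K next-state equation while the clock
        is high; the slave captures it on the falling edge, separating the
        data node from the stored output as described above."""

        def __init__(self):
            self.master = 0  # first magnet
            self.q = 0       # second magnet (stored output)
            self.prev_clk = 0

        def tick(self, clk, j, k):
            if clk == 1:
                # Q_next = J*Q' + K'*Q, evaluated while the clock is positive.
                self.master = (j & (1 - self.q)) | ((1 - k) & self.q)
            elif self.prev_clk == 1 and clk == 0:
                # Falling edge: slave captures the master's value.
                self.q = self.master
            self.prev_clk = clk
            return self.q

    # Usage: with J = K = 1 the output toggles once per full clock cycle.
    ff = MasterSlaveJKFlipFlop()
    print([ff.tick(clk, j=1, k=1) for clk in (1, 0, 1, 0, 1, 0)])
    # [0, 1, 1, 0, 0, 1]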
Using these basic gates, we can implement both combinational and sequential stochastic computing elements. Fig. 5.3 shows the combinational computing elements. The multiplier is implemented using a simple AND gate. By taking advantage of the majority-gate nature of ASL, we can implement the stochastic adder with a single majority gate by feeding its control node a bit stream of probability 0.5, instead of implementing a mux as in the CMOS design of [13]. This significantly reduces the hardware cost of a stochastic adder. With the J-K flip flop and various gates, we can further implement an FSM using ASL with fewer than 200 magnets, as in Fig. 5.4.
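The majority-gate adder is easy to verify numerically: with an independent control stream S of probability 0.5, the majority of (A, B, S) has probability ab + 0.5(a + b - 2ab) = (a + b)/2, exactly the scaled addition that the CMOS mux-based design computes. A minimal sketch (ours):

    import random

    def bernoulli_stream(p, n, rng):
        # Unipolar stochastic stream: each bit is 1 with probability p.
        return [1 if rng.random() < p else 0 for _ in range(n)]

    def majority(a, b, c):
        # 3-input majority gate, the basic ASL element.
        return 1 if (a + b + c) >= 2 else 0

    rng = random.Random(42)
    n = 4096
    a, b = 0.3, 0.8
    A = bernoulli_stream(a, n, rng)
    B = bernoulli_stream(b, n, rng)
    S = bernoulli_stream(0.5, n, rng)   # control stream at probability 0.5

    # Multiplier: AND of two independent streams encodes a*b.
    product = sum(x & y for x, y in zip(A, B)) / n
    # Scaled adder: majority(A, B, S) encodes (a + b)/2.
    scaled_sum = sum(majority(x, y, s) for x, y, s in zip(A, B, S)) / n

    print(product, a * b)            # ~0.24
    print(scaled_sum, (a + b) / 2)   # ~0.55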
Stochastic computing not only benefits spintronic logic with a simplified hardware structure, but also provides great fault tolerance. We implemented the spin-based circuits in two image applications commonly used in stochastic computing, Edge Detection and Frame Difference [12], as in Fig. 5.5 and Fig. 5.6. Edge Detection uses three subtraction units and two absolute-value FSM units to calculate the difference between adjacent pixels, using 326 magnets in total. Frame Difference uses two subtraction units, an absolute-value FSM unit
and a tanh FSM unit to calculate the difference between two sequential frames, using 322 magnets.

Figure 5.3: Combinational stochastic computing elements (multiplier, adder, subtractor) implemented using all-spin logic.

Figure 5.4: 3-bit finite-state machine implemented using all-spin logic.

Figure 5.5: Edge Detection application diagram using the proposed spintronic stochastic circuits.

Figure 5.6: Frame Difference application diagram using the proposed spintronic stochastic circuits.

We inject bit errors into every magnet to simulate the random
bit flips of the spintronic devices and to examine the fault-tolerance property of the stochastic scheme. The injection error rate ranges from 0.01 down to 0.00001, at which point the output image quality is similar to the standard stochastic output without bit flips. The experimental results compare the output image quality, using the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR), among image outputs with different error rates. The MSE is the mean of the squared per-pixel error between the stochastic implementation and the conventional implementation results, as in Equation 5.1.
MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \big( I(i,j) - K(i,j) \big)^2 \quad (5.1)
where I refers to the image result from the conventional implementation and
K refers to either of the stochastic implementations. PSNR is further calculated
using the MSE as in Equation 5.2.
PSNR = 20 \times \log_{10}(MAX_I) - 10 \times \log_{10}(MSE) \quad (5.2)
where MAX_I is the maximum possible pixel value of the image. The bit length of each pixel in all images in this work is 8, so MAX_I = 2^8 - 1 = 255. PSNR expresses the ratio between the maximum possible power of a signal and the noise, in dB.
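A small sketch of the error injection and the two quality metrics used in this comparison (function names are ours; the formulas are Equations 5.1 and 5.2):

    import math
    import random

    def inject_errors(stream, rate, rng):
        # Flip each bit with probability `rate` to model random magnet flips.
        return [b ^ 1 if rng.random() < rate else b for b in stream]

    def mse(ref, img):
        # Mean squared per-pixel error between equal-sized images (Eq. 5.1).
        m, n = len(ref), len(ref[0])
        return sum((ref[i][j] - img[i][j]) ** 2
                   for i in range(m) for j in range(n)) / (m * n)

    def psnr(ref, img, max_i=255):
        # Peak signal-to-noise ratio in dB for 8-bit images (Eq. 5.2).
        return 20 * math.log10(max_i) - 10 * math.log10(mse(ref, img))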
Fig. 5.7 and Fig. 5.8 compare the stochastic computing results under different levels of error injection. The error rates for Edge Detection in Fig. 5.7 range from 0.01 to 0.0001, and the image quality improves as the error rate drops toward 0.0001. When the error rate is 0.01, there is a lot of noise in the image; when the error rate drops to 0.0001, the output image looks similar to the output without error injection. The error rates for Frame Difference in Fig. 5.8 range from 0.001 to 0.00001, and the image quality again improves significantly as the error rate decreases. At an error rate of 0.001 we can hardly spot the shape of the walking person, while when the error rate drops to 0.0001 the output image looks almost the same as the output without error injection. We further inspect the output image quality using the MSE and PSNR, as in Table 5.1 and Table 5.2.

Figure 5.7: Edge Detection results with different injected error rates using spintronic logic. Panels: (a) Edge Detection input, (b) standard results, (c) error rate 0.01, (d) error rate 0.001, (e) error rate 0.0001.

Figure 5.8: Frame Difference results with different injected error rates using spintronic logic. Panels: (a) Frame Difference input, (b) standard results, (c) error rate 0.001, (d) error rate 0.0001, (e) error rate 0.00001.

Table 5.1: The MSE and PSNR of Edge Detection using all-spin logic.

Error rate   0.01    0.001   0.0001   0
MSE          412.8   151     58.8     53
PSNR (dB)    22.2    26.5    30.6     30.9
The two tables show the MSE and PSNR results with varying error rates. The PSNR becomes almost the same as that without error injection at an error rate of 0.0001 for Edge Detection and 0.00001 for Frame Difference. Therefore, an error rate of 0.00001 is good enough for both image processing applications. This error rate is still relatively high
Table 5.2: The MSE and PSNR of Frame Difference using all-spin logic.

Error rate   0.001   0.0001   0.00001   0
MSE          1259    275.6    250.4     200
PSNR (dB)    17.2    23.9     24.5      25.1
compared with CMOS soft error rates. Both applications use around 350 magnets, containing two 16-state FSMs and several supporting gates. Assuming the system clock is 1 GHz (clock period 1 ns), an error rate of 0.00001 means one bit flip every 100,000 bits, or one bit flip every 0.1 ms (0.1 ms = 100,000 × 1 ns). This could significantly relax the retention time requirement, which is usually set to 10 years. With a reduced retention time, the dimensions of the spintronic magnet can be decreased, reducing the switching power.
In summary, the stochastic computing scheme, with its high fault tolerance and extremely simple hardware structure, can benefit spintronic logic devices. The high fault tolerance relaxes the retention time requirement and reduces the dynamic power. Two image applications implemented with all-spin logic show that the output image quality is comparable to the standard output even with a relatively high error rate of 0.00001.
5.2 Parallel Implementation of FSM
Implementing stochastic computing with spintronic logic devices such as ASL can effectively reduce the reliability impact of random bit flips and tolerate the relatively small retention time of the ASL. This reduction in retention time can also reduce the dynamic energy use and shrink the device cell size. However, the FSM suffers from long calculation latency due to its probabilistic nature. To achieve acceptable accuracy, a long bit stream, usually 1024 bits, is required in such calculations. This long latency limits the usage of stochastic computing in high-frequency applications. To address this issue, we have studied a parallel implementation of the FSM to improve its performance.
5.2.1 The parallel FSM design
According to Markov chain theory, the probability distribution over the states of a finite-state machine after a long run is deterministic and unique for each input value; this is called the steady-state distribution [74]. For example, the transition matrix P of a 4-state FSM with input probability x is

P = \begin{pmatrix} 1-x & x & 0 & 0 \\ 1-x & 0 & x & 0 \\ 0 & 1-x & 0 & x \\ 0 & 0 & 1-x & x \end{pmatrix}
The steady-state distribution π of this transition matrix, as in Equation 5.3, can be shown to exist and can be solved for:
π = πP (5.3)
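The steady-state distribution, and the number of steps needed to reach it, can be computed numerically; below is a minimal NumPy sketch under the saturating-counter transition structure shown above (function names are ours):

    import numpy as np

    def transition_matrix(x, n_states=4):
        # Saturating-counter FSM: input bit 1 moves right, 0 moves left.
        P = np.zeros((n_states, n_states))
        for s in range(n_states):
            P[s, min(s + 1, n_states - 1)] += x      # input 1
            P[s, max(s - 1, 0)] += 1 - x             # input 0
        return P

    def steady_state(P):
        # Solve pi = pi P together with the normalization sum(pi) = 1.
        n = P.shape[0]
        A = np.vstack([P.T - np.eye(n), np.ones(n)])
        b = np.zeros(n + 1); b[-1] = 1
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def steps_to_converge(P, tol=1e-3):
        # Smallest n for which every row of P^n matches the steady state.
        pi = steady_state(P)
        Pn = P.copy()
        for n in range(1, 10000):
            if np.max(np.abs(Pn - pi)) < tol:
                return n
            Pn = Pn @ P
        return None

    P = transition_matrix(0.5, n_states=16)
    print(steady_state(P))
    print(steps_to_converge(P))  # on the order of hundreds of cycles,
                                 # consistent with the ~200-cycle estimate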
Fig. 5.9 shows the steady-state distributions of a typical 16-state FSM with different input values. The expected time (i.e., the number of clock cycles) to reach this steady state can be calculated from the transition matrix: when every row of P^n becomes the same as the steady-state distribution, n is the number of steps needed to reach the steady state. A 16-state FSM with an input value of 0.5 can be estimated to require at least 200 cycles to reach the steady state. This becomes a huge disadvantage if we want to implement a parallel FSM within a total bit stream length of 1024, since the convergence period is too long for each parallel copy. We implemented a straightforward parallel FSM of the absolute value function, as in Fig. 5.12a, to show this impact. The inputs are 32 uncorrelated bit streams generated by feeding the same value X into 32 linear-feedback shift register (LFSR) random bit generators. Each input is sent to an absolute-value FSM. The mean value of all output bit streams represents the final
output value Y. We initialize the FSMs with different starting states, 0, 7 and 15, which are the left extreme, middle and right extreme of the state range. The performance of this straightforward parallel implementation is shown in Fig. 5.10. When the number of parallel copies is 4 and the length of each bit stream is 256, the output mean value is still close to the real value. However, as the number of parallel copies increases, the output mean value becomes less accurate. When the initial state is 0, the right part of the results drifts significantly away from the expected results as the number of parallel copies increases. When the initial state is 15, the left part drifts away. When the FSM starts at the middle state 7, all data points move away from the correct value. This is because the FSM generates wrong outputs before it reaches the steady state, and this impact grows significantly as the bit stream becomes shorter.

Figure 5.9: The steady-state distribution of a 16-state FSM for input values of (a) 0.2, (b) 0.5, and (c) 0.8. The distribution is symmetric about the input value of 0.5, where it also changes the most, making the FSM very sensitive around 0.5.
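The start-up transient described above is easy to reproduce. The following sketch (ours) uses a 16-state saturating-counter FSM with the well-known tanh output rule (output 1 when the state is in the upper half) rather than the absolute-value FSM, but it exhibits the same effect: short streams started from state 0 are biased by the outputs produced before the steady state is reached:

    import random

    def fsm_stream(in_bits, n_states=16, init_state=0):
        # Saturating up/down counter with the tanh output rule:
        # input 1 moves right, 0 moves left; output 1 if state >= n_states/2.
        state, out = init_state, []
        for b in in_bits:
            out.append(1 if state >= n_states // 2 else 0)
            state = min(state + 1, n_states - 1) if b else max(state - 1, 0)
        return out

    rng = random.Random(1)
    x = 0.8  # input probability
    for length in (32, 256, 1024):
        copies = [fsm_stream([1 if rng.random() < x else 0
                              for _ in range(length)])
                  for _ in range(32)]
        mean = sum(sum(c) for c in copies) / (32 * length)
        print(length, mean)  # short streams underestimate: the climb from
                             # state 0 to the upper half dominates early bits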
Further, we implemented this straightforward parallel FSM in one of the image applications, Frame Difference, which uses the absolute value function as in [12]. We implemented 32 parallel copies of the FSM units with different initial states of 0, 7 and 15, and each bit stream has 32 bits, so the total number of stream bits per pixel is 1024, the same as in the serial stochastic implementation. Fig. 5.11 shows clearly that the straightforward implementations lose most of the information and fail to compute the correct Frame Difference results. Moreover, the results from initial states 0 and 15 are roughly complementary in shape; combining the two would give a result very close to the correct one. When the initial state is 7, the output shows a rough shape but most of the details are lost. This matches the observations of the absolute value implementation in Fig. 5.10.
This phenomenon is due to the Markov chain nature of the FSM. Each input value generates a transition matrix with a steady-state distribution, as in Fig. 5.9. When the input value is much smaller than 0.5, the steady-state distribution is concentrated near state 0; when the input value is much larger than 0.5, it is concentrated near state 15. Therefore only half of the outputs are correct when we set the initial state to 0 or 15, and none of the outputs are correct when we set the
initial state to 7.

Figure 5.10: The mean output of the straightforward parallel FSM implementation with 2, 4, 8, 16 and 32 parallel copies. The FSM is a typical 16-state absolute-value function; the three subgraphs use initial states of (a) 0, (b) 7 and (c) 15, respectively.

Figure 5.11: Simulation results of the conventional deterministic scheme, the serial scheme, and the straightforward parallel stochastic implementation of Frame Difference with different initial states: (a) original, (b) conventional, (c) serial stochastic, (d) parallel with initial state 0, (e) initial state 7, (f) initial state 15. The straightforward parallel stochastic implementation is clearly not able to compute the correct results.

To reach the steady state faster and decrease the convergence
time, we can manually store the steady-state distributions and initialize the FSMs directly to them when the input value is known. A dispatcher is proposed to initialize the FSMs for any input value, as in Fig. 5.12d.

The dispatcher itself is a look-up table (LUT) through which the input value picks its corresponding set of initial states. For example, when the input is 0.5, the initial states are evenly distributed among all FSMs, as in Fig. 5.9b: each of the 16 states is assigned to two FSMs as an initial state for a 16-state parallel FSM with 32 parallel copies, mimicking the steady-state distribution at 0.5. Thirty-two initial states are stored per entry to approximate the distribution for each input. The LUT has 20 entries from 0 to 1, with a step of 0.05. The dispatcher requires an estimate of the input value to pick the correct entry of states, so an estimator is implemented before the dispatcher unit.
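A sketch of how such a dispatcher LUT could be built from the steady-state distributions, reusing the transition_matrix and steady_state helpers from the earlier sketch (the construction details here are our assumption, not the exact thesis design):

    import numpy as np
    # Reuses transition_matrix() and steady_state() from the earlier sketch.

    def build_dispatcher_lut(n_states=16, n_copies=32, step=0.05):
        # For each quantized input value, store n_copies initial states whose
        # histogram approximates the steady-state distribution at that input.
        lut = {}
        for k in range(20):                   # 20 entries: 0.00, 0.05, ..., 0.95
            x = k * step
            pi = steady_state(transition_matrix(x, n_states))
            counts = np.floor(np.clip(pi, 0, 1) * n_copies).astype(int)
            # Hand out the copies lost to rounding to the most probable state.
            for _ in range(n_copies - counts.sum()):
                counts[np.argmax(pi)] += 1
            lut[round(x, 2)] = [s for s in range(n_states)
                                for _ in range(counts[s])]
        return lut

    def dispatch(lut, estimate, step=0.05):
        # Quantize the estimator output to the nearest stored entry.
        key = round(min(max(round(estimate / step) * step, 0.0), 0.95), 2)
        return lut[key]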
We propose two implementations of the estimation unit, a parallel counter and a majority-gate counter, as in Fig. 5.12c. Because the dispatcher LUT entries increase with a step of 0.05, we choose the estimation bit stream length to be 416 bits (32 parallel copies × 13 cycles), for which the standard deviation of the estimate is 0.025 (est ± 0.025), precise enough for the parallel counter to pick the entries in the dispatcher. Moreover, since the steady-state distribution is less sensitive around inputs 0 and 1, a simpler majority gate, as in Fig. 5.13, can meet the estimation needs as well. During the estimation process the dispatcher cannot yet provide an initial-state set, so the parallel FSM unit would stall. We can store the bits consumed during estimation and feed them back to the FSMs during the next estimation, forming a simple pipeline that avoids this stall. In this way only the first input datum is affected, and the overall calculation speed is not reduced. The complete parallel FSM implementation is shown in Fig. 5.12b.
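The two estimators can be sketched on a 32 × 13 block of input bits as follows (our illustrative code; in hardware they are the parallel counter and the majority-gate-plus-counter of Fig. 5.12c, and the majority-gate count would be mapped back to an input value through the transfer curve of Fig. 5.13):

    def counter_estimate(bit_block):
        # Parallel-counter estimator: average all bits in the 32 x 13 block.
        total = sum(sum(row) for row in bit_block)
        return total / (len(bit_block) * len(bit_block[0]))

    def majority_estimate(bit_block):
        # Majority-gate estimator: one 32-input majority vote per cycle,
        # counted over the 13 cycles.
        votes, cycles = 0, len(bit_block[0])
        for t in range(cycles):
            ones = sum(row[t] for row in bit_block)
            votes += 1 if ones * 2 > len(bit_block) else 0
        return votes / cycles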
5.2.2 Experiments and Results
In this section, we will introduce the experimental methodology and present the
computational results.

Figure 5.12: The straightforward parallel FSM implementation and the proposed parallel implementation. The proposed parallel FSM sends 32 parallel short bit streams to the Estimator to obtain an initial guess of the input; the two Estimator implementations are a parallel counter and a majority-gate counter. The initial estimate is then sent to the Dispatcher to look up a set of state configurations to initialize the parallel FSMs. Panels: (a) the straightforward parallel FSM implementation, (b) the complete parallel FSM implementation, (c) the Estimator, (d) the Dispatcher.

Figure 5.13: The experimental and analytical output of a 32-input majority gate.

We first set up the parallel FSM unit to approximate three
typical FSM functions, absolute value, tanh and exponential to study the parallel
implementation impact. We implemented our scheme with 32 parallel FSM units, which reduces the bit stream length from 1024 to 32. Due to the probabilistic nature of the stochastic computing scheme, we repeated each experiment 10 times for statistical significance. The experimental results compare the accuracy and consistency among different schemes: the serial FSM, the straightforward parallel FSM as in Fig. 5.12a, and the parallel FSM with a parallel-counter estimator or a majority-gate estimator as in Fig. 5.12c. Then we implemented our parallel FSM with the majority-gate estimator in the two image processing applications from the previous subsection, parallelizing the FSMs the same way as in the single-function implementations. The experimental results compare the output image quality, using the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR), among the conventional deterministic scheme (Conventional), the serial stochastic scheme (Serial) and the parallel stochastic scheme (Parallel).
Fig. 5.14 compares the output accuracy, showing the true output value and the
mean value of the repeated experimental results of the serial and different parallel
stochastic FSM implementations. The average error and the standard deviation of
each implementation are shown in Table 5.3.

Figure 5.14: The output mean value of the two parallel FSMs, with a parallel counter or a majority gate as the estimator, and the serial FSM, for (a) the absolute value function, (b) the exponential function, and (c) the tanh function. Both estimators use 13 clocks, 13 × 32 = 416 bits, to approximate the input value.

Table 5.3: The average error and deviation of the parallel FSMs.

          Abs err   Abs std   Tanh err   Tanh std   Exp err   Exp std
serial    0.0066    0.0127    0.0121     0.0224     0.0105    0.0171
straight  0.1320    0.0118    0.0840     0.0234     0.1500    0.0141
paraCnt   0.0038    0.0141    0.0049     0.0248     0.0176    0.0211
mjrEst    0.0057    0.0150    0.0045     0.0285     0.0199    0.0238

Figure 5.15: Simulation results of the conventional deterministic scheme and the serial and parallel stochastic implementations of Frame Difference: (a) original, (b) conventional, (c) serial stochastic, (d) parallel stochastic.

Figure 5.16: Simulation results of the conventional deterministic scheme and the serial and parallel stochastic implementations of Edge Detection: (a) original, (b) conventional, (c) serial stochastic, (d) parallel stochastic.

Table 5.4: The MSE and PSNR of the image processing applications.

Application   MSE serial   MSE parallel   PSNR serial (dB)   PSNR parallel (dB)
EdgeDetect    47.3         47.9           31.4               31.3
FrameDiff     156.5        133.4          26.2               26.9

The straightforward scheme shows
significant difference from the true output, while the other two parallel schemes, with estimator and dispatcher, are very close to the true output, showing very good accuracy. The error becomes larger as the input value approaches 0.5 due to larger autocorrelation and variance impacts [73]. The parallel FSM tends to be closer to the true value than the serial FSM when the input value is near 1, especially for the exponential and tanh functions. Since the serial FSM is initialized to state 0, it needs time to climb from state 0 to state 15 when the input value is close to 1. This transition requires at least 16 steps, generating 16 wrong output bits; a bit stream of 1024 bits therefore has an error rate of 16/1024 = 1.56%. The parallel FSM, however, does not fix the initial states, making no difference between different input values. This makes the parallel implementation more accurate near an input value of 1.
The Edge Detection and Frame Difference results are shown in Fig. 5.16 and Fig. 5.15. It is hard to visually find any difference in either application's results, other than some outliers due to the probabilistic nature of stochastic computing. Detailed image quality comparisons are listed in Table 5.4. Both applications achieve acceptable PSNR for 8-bit images under both the serial and parallel implementations [75], and both have similar MSE under the two schemes.
In summary, the experimental results show that the performance of the parallel FSM is as good as that of the serial implementation. The simplified majority-gate estimator can also compete with the more complex parallel counter, which suggests that further simplifications could be exploited for low-accuracy applications. As the number of parallel units increases and the length of each bit stream decreases, this estimator-dispatcher mechanism becomes crucial to ensure the accuracy of the parallel FSM scheme. The image processing applications show that the parallel FSM implementation can achieve image quality equivalent to or better than the serial implementation.

Table 5.5: Hardware Cost, Latency and Area-Delay Product of the serial and parallel FSM with 32 degrees of parallelism.

                      serial   parallel (PC)   parallel (MG)
FSM unit              4        8 × 32          8 × 32
supporting unit       -        72              66
total LUT-FF pairs    4        328             322
initial latency       0        13              13
latency               1024     32              32
Area-Delay Product    4096     10496           10304
5.2.3 Latency and Hardware Cost
We implemented the serial and parallel stochastic finite-state machines in Verilog using Xilinx ISE. The estimator of the parallel FSM unit is implemented using two different schemes, a parallel counter and a majority gate, as in Figure 5.12c. The hardware cost of the 32-degree-parallelism implementation is shown in Table 5.5, where parallel (PC) refers to the parallel FSM implementation using a parallel counter as the estimator and parallel (MG) refers to the implementation using a majority gate. The hardware area reported is the number of look-up table and flip-flop pairs (LUT-FF).
Although the parallel implementation of the FSM introduces hardware overhead, it reduces the latency compared to the serial version. For instance, with 32 parallel copies, the latency drops from 1024 cycles to 32 cycles. This significant latency reduction can be critical for high-frequency applications. Although the parallel implementation does introduce an initial latency of 13 cycles during the input estimation process, as in Table 5.5, we can minimize this impact by feeding these 13 bits back during the next estimation, as in a pipeline. We can further see that the parallel implementation's Area-Delay Product (ADP) is greater than the serial ADP due to the significant hardware overhead of the parallel implementation; of course, it is common practice to trade off area for better performance. Each FSM unit of the parallel implementation becomes about twice as large as the serial FSM, which contributes to the larger hardware overhead. This is because the parallel FSM must be able to initialize to different states, which increases the hardware complexity and area cost. Another hardware overhead of the parallel FSM comes from the dispatcher and the estimator, shown in Table 5.5 as the supporting unit. The dispatcher is simply a look-up table (LUT) with multiple entries, each storing a set of initial states for the parallel FSM units. As for the estimator implementation, the table shows that the majority-gate scheme reduces the supporting unit area significantly compared to the parallel counter scheme.
5.3 Autocorrelation Issue of FSM
The sequential stochastic finite-state machine has another issue: autocorrelation within the output bit stream. Since each output bit of the FSM is determined by the previous state and the current input bit, the outputs are correlated in time. Therefore the output of the FSM driven by a random input bit stream does not retain good randomness, where good randomness means a flat autocorrelation plot. This change in randomness compromises the randomness assumption of stochastic computing and can significantly impact the final output when the circuit becomes large and complicated. Therefore, in this section, we analyze the autocorrelation impact of the FSM and propose a re-randomizer to resolve this issue.
5.3.1 Autocorrelation Analysis
Several experiments were performed to evaluate and understand the autocorrela-
tion issue for random bit streams produced by FSM-based stochastic computing
elements. We feed different inputs into the FSM and compare its outputs with a regenerated random output stream, as in Fig. 5.17. A flat autocorrelation plot indicates better randomness, and vice versa.

Figure 5.17: Experimental methodology to measure and compare autocorrelation. In the FSM group the FSM output is measured directly; in the control group the stream value is counted and a fresh random stream is regenerated before measurement.
The main parameters of the FSM-based computing elements are the number of states and the length of the bit stream. The length of the bit stream determines the calculation time and has been proven to be inversely proportional to the output variance [56]. The number of states is generally related to the output accuracy. We vary the input stream values from 0 to 1 with an increment of 0.1, and choose bit stream lengths of 256, 512, 1024, 2048 and 4096. These experiments are performed with four FSM algorithms: absolute value, |p|; exponential, e^{-p}; exponential of absolute value, e^{-|p|}; and the tanh function. These algorithms were implemented in image processing applications in previous works [12].
The discrete autocorrelation is an unnormalized vector, shown in Formula 5.4, where R_{xx}[j] is the jth element of the autocorrelation vector, x_n is the nth element of the bit stream x, and x_{n-j} is the nth element of the j-bit circular shift x^{(j)} of the bit stream:

R_{xx}[j] = \sum_{n} x_n x_{n-j} \quad (5.4)

Figure 5.18: Autocorrelation comparison. The flat "Random" line is a typical autocorrelation of random bit streams, whereas the fluctuating "FSM" line is the autocorrelation of an FSM output that is supposed to match the "Random" stream. Generally, the flatter the autocorrelation plot, the more random the stream.
Autocorrelation vectors are not convenient to compare directly across different parameter settings. Under the randomness assumption, subsequent bits in a stochastic bit stream can be considered a series of i.i.d. Bernoulli random variables; the sum of N bits can therefore be modeled as a binomial distribution B(N, p). Many operations depend on such an assumption. For example, a power operation like x² can be computed simply by feeding x and a one-clock-delayed version of x into an AND gate. Such an operation requires x to be independent in time, which is characterized by a good, flat autocorrelation like "Random" in Fig. 5.18. To make the comparison of autocorrelations easier, we propose a variance-like metric in Formula 5.5, where N is the length of the autocorrelation vector and x̄ is the FSM output mean value:
M = \frac{1}{N} \sum_{n=1}^{N} \left( R_{xx}[n] - \bar{x}^2 \right)^2 \quad (5.5)
This metric is valid because the autocorrelation is actually a bit-shifted version of a self dot-product, as in Formula 5.4. The mean of each autocorrelation vector element, R̄_{xx}[j], equals the mean value of the product of x and the j-bit delayed stream x^{(j)}, which is x̄². Therefore the ideal autocorrelation should be identical to x̄². A good way to quantify the autocorrelation vector is simply to find the average squared difference between the actual measurement and the ideal one; the smaller this value, the less significant the autocorrelation problem. The 1024-bit stream's output variance of 2.4 × 10⁻⁴ at a mean value of 0.5 is small enough for our purposes. One issue with this metric, however, is that it only characterizes the average situation. We will therefore also examine the worst-case autocorrelation plots to reveal detailed information, especially about which part of the autocorrelation is bad.
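A minimal sketch of this measurement (ours): the circular autocorrelation is normalized by the stream length so that its ideal value is the squared mean, and M averages the squared deviation from that ideal:

    import numpy as np

    def autocorr_metric(x):
        # Circular autocorrelation R_xx[j] (Eq. 5.4, normalized by the stream
        # length so its ideal value is mean(x)**2) and the metric M (Eq. 5.5).
        x = np.asarray(x, dtype=float)
        n = len(x)
        R = np.array([np.dot(x, np.roll(x, j)) for j in range(n)]) / n
        M = np.mean((R - x.mean() ** 2) ** 2)
        return R, M

    rng = np.random.default_rng(0)
    stream = (rng.random(1024) < 0.5).astype(int)
    _, M = autocorr_metric(stream)
    print(M)  # roughly the 2.4e-4 reference level quoted above for a random
              # 1024-bit stream at value 0.5; FSM outputs score much higher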
The experimental results include single-factor comparisons among three different parameters: the number of states, the bit stream length, and the output value. Fig. 5.19 shows how the autocorrelation is related to the number of states in the four different algorithms with a bit stream length of 1024. As the number of states increases, the metric M becomes bigger and the output autocorrelation becomes worse.

As the bit stream length increases in Fig. 5.20, the metric M becomes smaller and the autocorrelation becomes better for all four algorithms when the number of states is fixed at 16. This trend holds for different numbers of states. The improvement due to increased bit length is significant and consistent among different numbers of states and different algorithms. Specifically, the results for ABS in Fig. 5.20a and EXP in Fig. 5.20b show an almost exact inverse relationship between the autocorrelation metric and the length of the bit stream: as the bit stream length doubles, the metric halves at the same output value.
Figure 5.19: FSM autocorrelation results at a bit length of 1024 for (a) the absolute, (b) the exponential, (c) the exponential absolute, and (d) the tanh algorithm. All show that the 32-state line is higher than the 16-state and 8-state cases, meaning the autocorrelation metric of the 32-state case is the largest.
Figure 5.20: FSM autocorrelation results with 16 states for (a) the absolute, (b) the exponential, (c) the exponential absolute, and (d) the tanh algorithm. All show that shorter bit streams usually have a larger autocorrelation metric; specifically, the autocorrelation metric with a bit stream of length 256 is almost 10 times larger than with a length of 4096.
Another relation between the output value and the autocorrelation metric can
be observed from the results. The autocorrelation is worst around an output
value of 0.5 and becomes smaller as the output value departs from the center.
For all four algorithms, the autocorrelation metric for output values below 0.2
or above 0.8 is typically 10 times smaller than the worst case at an output value
of 0.5. At these two extremes, the autocorrelation issue is less significant and is
close to 2.4 × 10⁻⁴, which can be considered negligible.
The difference between the FSM output and a random sequence output is
shown in Fig. 5.22. We can see that, for all four algorithms, the autocorrelation
metric of the FSM output is significantly larger than that of the random stream;
the gap can be as large as a factor of 10 when the output value is around 0.5.
Clearly, the FSM significantly degrades the output autocorrelation, which
indicates poor randomness in the output stream.
5.3.2 Re-randomizer
Although we found that a smaller number of states and a longer bit stream
generate an output stream with better autocorrelation, these characteristics work
against higher precision and efficiency. We therefore propose a re-randomizer that
retains the shorter bit stream and the relatively large number of states. The
re-randomizer is built around a saturating up/down counter with the feedback
circuit shown in Fig. 5.21. Counter structures have been widely used in neural
network applications [13] [76] and are generally treated as integrators. By
controlling the feedback loop and the counter size, we can manipulate the output
behavior. Our proposed re-randomizer uses a simple unity-gain feedback loop and
behaves like a low-pass follower: the larger the counter size, the lower the cutoff
frequency [76]. However, as the counter size increases, the steady-state settling
time also increases, impacting the output accuracy. To balance the output
accuracy against the cutoff frequency, the counter size is set to 128 states when
the bit stream is longer than 1024 bits, or to 1/10 of the bit stream length otherwise.
[Figure 5.21 appears here: block diagram of the re-randomizer, showing the FSM input, a counter (CNT), an LFSR, and the feedback loop.]

Figure 5.21: Proposed re-randomizer with feedback structure.
By connecting the re-randomizer directly after the FSM output, we can
analyze the autocorrelation improvement from the re-randomizer.
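The following Python sketch models one plausible reading of Fig. 5.21: the counter integrates the difference between the incoming bit and the regenerated bit (the unity-gain feedback), and a pseudo-random comparison against the counter value produces the new stream. A NumPy generator stands in for the hardware LFSR, and the demo reuses `tanh_fsm` and `autocorr_metric` from the earlier sketches; the internal details are an illustrative reconstruction, not the thesis's exact circuit.

```python
import numpy as np

def rerandomize(in_bits, counter_states=128, seed=1):
    """Counter-based re-randomizer sketch: a saturating up/down counter
    tracks the input stream's value through unity-gain feedback, and each
    output bit is drawn with probability count / counter_states."""
    rng = np.random.default_rng(seed)  # stands in for the hardware LFSR
    count = counter_states // 2
    out = np.empty(len(in_bits), dtype=int)
    for i, b in enumerate(in_bits):
        y = int(rng.integers(counter_states) < count)
        out[i] = y
        # Feedback: integrate (input bit - output bit), saturating
        count = min(max(count + int(b) - y, 0), counter_states - 1)
    return out

rng = np.random.default_rng(3)
fsm_out = tanh_fsm((rng.random(1024) < 0.5).astype(int))
print(autocorr_metric(fsm_out), autocorr_metric(rerandomize(fsm_out)))
```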
Fig. 5.22 compares the autocorrelations of the ideal binomial random bit stream,
the FSM output, and the re-randomizer output with 16 states and the bit stream
length set to 1024. The FSM output autocorrelation departs from that of the
ideal binomial random bit stream at around 0.2, merges back at around 0.8, and
reaches its worst case at around 0.5. The re-randomizer (the "Follower") generally
remains at the same level as the binomial random bit stream, which is less than
2.4 × 10⁻⁴ in all cases. The FSM output value and the random binomial output
are close to each other, while the re-randomizer exhibits varying degrees of value
shift from the random binomial stream outputs.
The re-randomizer can thus significantly reduce the autocorrelation metric.
Fig. 5.23, for a bit stream length of 1024 with 16 states and an output value
around 0.5, shows how the re-randomizer reshapes the autocorrelation. The
re-randomizer acts like a low-pass filter: it filters out the higher-frequency
components of the autocorrelation plots, making them much flatter. In the
worst-case situations, however, the lower-frequency components remain quite
noticeable.
[Figure 5.22 appears here: four panels, (a) Absolute Algorithm, (b) Exponential Algorithm, (c) Exponential Absolute Algorithm, and (d) Tanh Algorithm, each plotting the autocorrelation metric against the output value for the FSM, Random, and Follower streams.]

Figure 5.22: Autocorrelation comparison of the FSM, re-randomizer, and random streams, where "Random" is an ideal Bernoulli bit stream. "FSM" has the worst autocorrelation. The re-randomizer, labeled "Follower", is almost the same as the random sequence labeled "Random".
[Figure 5.23 appears here: four panels, (a) Absolute Algorithm, (b) Exponential Algorithm, (c) Exponential Absolute Algorithm, and (d) Tanh Algorithm, each plotting the autocorrelation against the bit shift for the FSM, Random, and Follower streams.]

Figure 5.23: Worst-case autocorrelation plot with 16 states and a bit stream length of 1024.
5.3.3 Discussion and Analysis
For a binomial random bit stream, the autocorrelation can be considered to be
the square of the output probability, except at a 0-bit shift. Therefore, all other
autocorrelation vector elements should have a mean value equal to E(X)², where
X is the random variable of the binomial random input, following the distribution
(1/N)B(N, p). Recall the autocorrelation metric M proposed previously, which is
exactly the variance of such a distribution. Because each autocorrelation element
is completely determined by the specific bit stream, the autocorrelation should be
linearly related to the variance of X, Var(X), but not Var(X²). The variance
of X for this distribution is:

Var(X) = p(1 − p) / N
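A quick numeric check of this formula, assuming the p = 0.5 and N = 1024 operating point used in the experiments:

```python
import numpy as np

# Sample X = (1/N) * B(N, p) many times and compare the empirical variance
# against p(1 - p)/N; both come out near 2.44e-4 for p = 0.5, N = 1024.
rng = np.random.default_rng(0)
p, n = 0.5, 1024
samples = rng.binomial(n, p, size=100_000) / n
print(samples.var(), p * (1 - p) / n)
```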
The autocorrelation, like the variance, should therefore be related to both the
output value and the bit stream length N. Specifically, the autocorrelation metric
M is inversely proportional to the bit stream length N and reaches its maximum
when p equals 0.5. Both parameters maintained the same relationship when we
evaluated the FSM output, where the binomial random assumption is not
guaranteed.
As for the number of states, it is reasonable to expect that as the number
of states increases, the time required for a full walk through all states becomes
longer. This time could be related to specific patterns in the output streams.
Although the Markov chain guarantees a steady-state distribution, it takes at
least one period to statistically realize this distribution. This period is determined
by the steady-state distribution, and we can roughly treat the number of states
as the shortest possible period. While a random stream generates each bit with
a constant probability, the FSM must take at least one period to simulate this
behavior, so the period should be positively related to the output autocorrelation.
Therefore, fewer states usually produce better autocorrelation.
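One way to make this argument concrete is to look at how quickly the FSM's Markov chain mixes. The sketch below builds the transition matrix of the saturating-counter FSM for a Bernoulli(0.5) input and reports the second-largest eigenvalue modulus, which governs the convergence rate to the steady state; this particular diagnostic is an illustrative choice on our part, not an analysis taken from the thesis.

```python
import numpy as np

def fsm_transition_matrix(n_states, p):
    """Markov transition matrix of a saturating up/down-counter FSM
    driven by a Bernoulli(p) input: up with probability p, down with 1 - p."""
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        P[s, min(s + 1, n_states - 1)] += p
        P[s, max(s - 1, 0)] += 1 - p
    return P

# The second-largest eigenvalue modulus creeps toward 1 as the number of
# states grows, i.e., bigger FSMs take longer to reach steady state.
for n in (8, 16, 32):
    mags = np.sort(np.abs(np.linalg.eigvals(fsm_transition_matrix(n, 0.5))))
    print(n, mags[-2])
```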
We know that the binomial distribution can be approximated by a Gaussian
distribution when N is large enough. It is therefore reasonable to view the
binomial random stream as a DC signal plus Gaussian white noise whose
magnitude equals the bit stream's standard deviation. The FSM output streams,
however, usually contain long runs of successive 1s or 0s, unlike the binomial
random streams, and thus contain more complex frequency components. The
re-randomizer we proposed successfully filters out most of the higher-frequency
components. However, we cannot lower the cutoff frequency by increasing the
counter size without limit, since a larger counter also means a longer settling
time, which would affect the output accuracy.
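The settling-time side of that trade-off can be seen directly in simulation. The sketch below feeds a Bernoulli(0.75) stream into the counter model from the re-randomizer sketch and counts the steps until the counter value first comes within a tolerance of the target; the 0.75 input and 0.05 tolerance are arbitrary illustrative choices.

```python
import numpy as np

def settling_steps(counter_states, p=0.75, tol=0.05, seed=5, limit=100_000):
    """Steps until the follower's counter value first comes within `tol`
    (as a fraction of full scale) of the input probability p."""
    rng = np.random.default_rng(seed)
    count = counter_states // 2
    for i in range(limit):
        b = int(rng.random() < p)
        y = int(rng.integers(counter_states) < count)
        count = min(max(count + b - y, 0), counter_states - 1)
        if abs(count / counter_states - p) <= tol:
            return i + 1
    return limit

for size in (32, 128, 512):
    print(size, settling_steps(size))  # larger counters settle more slowly
```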
5.4 Summary
In this chapter, we introduced spintronic logic devices and analyzed their merits
and weaknesses. Although spintronic logic technology benefits from low leakage
power and a small footprint, it is unreliable, and its cell dimensions must be
increased to overcome its tendency to flip randomly. This fits very well with
stochastic computing schemes, which can use simple circuit logic such as AND
gates and FSMs to perform complex calculations. Stochastic computing reduces
the circuit complexity of spintronic logic and is highly tolerant of errors.
Previous work has studied the implementation of stochastic computing
combinational logic and stochastic number generators using spintronic logic
devices. We proposed implementing sequential logic, the FSM, which is capable
of performing complex calculations, including the exponential, absolute value,
and hyperbolic tangent functions, using spintronic logic devices.
We further proposed a parallelization scheme for the FSM to optimize the
performance of the stochastic computing scheme. The proposed scheme uses a
look-up table dispatcher to set the initial states of multiple FSMs at the steady
state, avoiding a long convergence period. This scheme can effectively implement
the stochastic sequential logic FSM in parallel to reduce the long calculation
latency, at the cost of some hardware overhead. Experiments on three typical
FSM functions show that the accuracy and variance of the parallel FSM scheme
are comparable to the serial implementation. The parallel FSM scheme further
shows equivalent or better image quality than the serial implementation in two
image processing applications.
Finally, we proposed a re-randomizer to overcome the autocorrelation issue
of the FSM. Our analysis showed that the autocorrelation of the FSM is related
to the number of states, the length of the bit stream, and the output value: a
larger number of states, a shorter bit stream, and an output value around 0.5
all produce worse autocorrelation error. The proposed re-randomizer, which uses
an up/down counter and an LFSR, effectively breaks up the output bit sequence
of the FSM and solves the autocorrelation issue.
Chapter 6
Conclusion and Discussion
We have investigated spintronic device implementations in both memory and
logic circuits. The spintronic device is a promising alternative to CMOS, with
significantly smaller size and virtually zero standby energy use. However, its high
dynamic switching power and latency, together with its unreliable random flipping
nature, make it hard to meet performance and accuracy requirements. In this
work, we have optimized spintronic devices for both memory and logic circuits to
mitigate these weaknesses.
Spintronic memory devices: We have analyzed the impact of STT-MRAM as
a replacement for CMOS at all levels of a multiprocessor cache hierarchy. Though
STT-MRAM has higher write energy and latency, reducing these parameters at
the circuit level does not lead to an optimal design. The extra circuit area required
to minimize MTJ bit-cell write time and energy causes the cache arrays to grow,
leading to higher read energy and latency due to parasitic effects.
A fully-associative L0 cache as small as 4KB can effectively restore perfor-
mance lost to the higher write latency. This structure hides the extra write la-
tency of around 5ns when running at 2GHz, giving a total cache energy savings of
40-70% and an average energy-delay product reduction of 60% compared to the
CMOS baseline. The L0 cache is implemented as a standard cache level, requiring
no additional control structures. We have observed no significant scalability im-
pact using STT-MRAM with L0 implemented. A few benchmarks show improved
scalability up to 16 cores using the STT-MRAM hierarchy. The introduction of
new memory technologies can have significant impacts on the best architectural
choices for the memory hierarchy of a multicore system. This work shows that
simple solutions can help mitigate the negative impacts while still allowing the
system to take advantage of the benefits of the new technology.
Spintronic logic devices: We have introduced the spintronic logic device and
analyzed its merits and weaknesses. Although it benefits from low leakage power
and a smaller footprint, it suffers from random bit flips due to its thermal
instability. We proposed a scheme for implementing stochastic computing circuits
using all-spin logic (ASL) that takes advantage of the high fault tolerance of
stochastic computing to mitigate random spin flips. The proposed design can
tolerate an error rate of up to 0.00001 for two image applications, Edge Detection
and Frame Difference.
We further examined the optimization of one of the core stochastic computing
elements, the finite-state machine, to achieve better performance. A parallelization
scheme for the FSM was proposed. Using a look-up table dispatcher to set the
initial states of multiple FSMs, the parallel FSM can immediately work from the
steady state, avoiding the long convergence period. We also proposed two kinds of
estimators for the dispatcher. One, the parallel counter, requires a larger hardware
area but provides a better estimate of the input value. The other, using the
majority gate, naturally fits the trend of the steady-state distribution with the
input values and also simplifies the estimator hardware. The proposed scheme can
effectively implement the FSM in parallel to reduce the long calculation latency,
at the cost of some hardware overhead. Experiments on three typical FSM
functions show that the accuracy and variance of the parallel FSM scheme are
comparable to the serial implementation. The parallel FSM scheme further shows
equivalent or better image quality than the serial one in two image processing
applications. We conclude that quickly initializing the FSM by estimating the
initial state using only a few bits of the input value allows parallelism to be
effectively exploited in
stochastic logic that uses storage elements.
Another issue of the FSM, autocorrelation, was then studied experimentally.
Autocorrelation can introduce dependencies that break the randomness
assumption and could impact the results of large stochastic circuits, especially
those with feedback loops. The autocorrelation is related to three parameters: the
length of the bit stream, the number of states in the FSM, and the output value.
It becomes worse when the length of the bit stream decreases, when the number
of states becomes larger, or when the output value is close to 0.5. We proposed
a re-randomizer that uses an up/down counter to track the output value and
regenerate a new random bit stream, solving the autocorrelation issue of the FSM.
Future Work: The parallel implementation scheme for the FSM successfully
reduces the calculation latency by almost 32 times, but the hardware overhead
is substantial and causes the area-delay product (ADP) to increase to almost
twice that of the serial implementation. Synthesis of the Verilog implementation
suggests that the parallel FSM uses twice the number of LUT-FF pairs due to the
dynamic initial states, which contributes most of the hardware overhead. An
investigation of how to optimize and shrink each FSM will be performed in future
work, which could reduce the ADP by almost half. We also want to examine the
possibility of designing the parallel stochastic computing implementation entirely
with ASL devices, where the additional components such as the estimator and
the dispatcher can be easily implemented with ASL majority gates and MTJ
memory cells. A thorough comparison between the stochastic ASL and
conventional Boolean implementations will also be performed in future work.
References
[1] G. E. Moore. Cramming more components onto integrated circuits. Proceed-
ings of the IEEE, 86(1):82–85, Jan 1998.
[2] Z. Pajouhi, S. Venkataramani, K. Yogendra, A. Raghunathan, and K. Roy.
Exploring spin-transfer-torque devices for logic applications. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems,
34(9):1441–1454, Sept 2015.
[3] Dmitri E. Nikonov and Ian A. Young. Benchmarking of beyond-cmos ex-
ploratory devices for logic integrated circuits. IEEE Journal on Exploratory
Solid-State Computational Devices and Circuits, 1:3–11, 2015.
[4] Sang Phill Park, Sumeet Gupta, Niladri Mojumder, Anand Raghunathan,
and Kaushik Roy. Future cache design using STT MRAMs for improved en-
ergy efficiency: devices, circuits and architecture. In DAC ’12: Proceedings of
the 49th Annual Design Automation Conference. ACM Request Permissions,
June 2012.
[5] Clinton W. Smullen IV, Vidyabhushan Mohan, Anurag Nigam, Sudhanva
Gurumurthi, and Mircea R. Stan Jr. Relaxing Non-Volatility for Fast and
Energy-Efficient STT-RAM Caches. High Performance Computer Architec-
ture (HPCA), 2011 IEEE 17th International Symposium on, pages 50–61,
2011.
[6] Mitchelle Rasquinha, Dhruv Choudhary, Subho Chatterjee, Saibal
Mukhopadhyay, and Sudhakar Yalamanchili. An energy efficient cache de-
sign using spin torque transfer (STT) RAM. In ISLPED ’10: Proceedings of
the 16th ACM/IEEE international symposium on Low power electronics and
design. ACM Request Permissions, August 2010.
[7] Zhenyu Sun, Xiuyuan Bi, Hai Helen Li, Weng-Fai Wong, Zhong-Liang Ong,
Xiaochun Zhu, and Wenqing Wu. Multi retention level STT-RAM cache
designs with a dynamic refresh scheme. In MICRO-44 ’11: Proceedings of
the 44th Annual IEEE/ACM International Symposium on Microarchitecture.
ACM Request Permissions, December 2011.
[8] Xiaochen Guo, Engin Ipek, and Tolga Soyata. Resistive computation: avoid-
ing the power wall with low-leakage, STT-MRAM based computing. In ISCA
’10: Proceedings of the 37th annual international symposium on Computer
architecture. ACM Request Permissions, June 2010.
[9] S. Senni, L. Torres, G. Sassatelli, A. Gamatie, and B. Mussard. Emerging
non-volatile memory technologies exploration flow for processor architecture.
In VLSI (ISVLSI), 2015 IEEE Computer Society Annual Symposium on,
pages 460–460, July 2015.
[10] Adwait Jog, Asit K Mishra, Cong Xu, Yuan Xie, Vijaykrishnan Narayanan,
Ravishankar Krishnan Iyer, and Chita R Das. Cache revive: Architecting
volatile STT-RAM caches for enhanced performance in CMPs. In DAC ’12:
Proceedings of the 49th Annual Design Automation Conference, pages 243–
252, 2012.
[11] C. Ma, W. Tuohy, and D. J. Lilja. Impact of spintronic memory on multi-
core cache hierarchy design. IET Computers Digital Techniques, 11(2):51–59,
2017.
[12] Peng Li and D.J. Lilja. Using stochastic computing to implement digital
image processing algorithms. In Computer Design (ICCD), 2011 IEEE 29th
International Conference on, pages 154–161. IEEE, 2011.
[13] B.D. Brown and H.C. Card. Stochastic neural computation. I. Computational
elements. Computers, IEEE Transactions on, 50(9):891–905, 2001.
[14] R. Venkatesan, S. Venkataramani, X. Fong, K. Roy, and A. Raghunathan.
Spintastic: Spin-based stochastic logic for energy-efficient computing. In 2015
Design, Automation Test in Europe Conference Exhibition (DATE), pages
1575–1578, March 2015.
[15] X. Fong, M. C. Chen, and K. Roy. Generating true random numbers using on-
chip complementary polarizer spin-transfer torque magnetic tunnel junctions.
In 72nd Device Research Conference, pages 103–104, June 2014.
[16] Yuan Ji, Feng Ran, Cong Ma, and D.J. Lilja. A hardware implementation
of a radial basis function neural network using stochastic logic. In Design,
Automation Test in Europe Conference Exhibition (DATE), 2015, pages 880–
883, March 2015.
[17] L. Miao and C. Chakrabarti. A parallel stochastic computing system with
improved accuracy. In SiPS 2013 Proceedings, pages 195–200, Oct 2013.
[18] Zhiheng Wang, N. Saraf, K. Bazargan, and A. Scheel. Randomness meets
feedback: Stochastic implementation of logistic map dynamical system. In
2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pages
1–7, June 2015.
[19] Qingan Li, Jianhua Li, Liang Shi, C.J. Xue, Yiran Chen, and Yanxiang He.
Compiler-assisted refresh minimization for volatile stt-ram cache. In De-
sign Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific,
pages 273–278, Jan 2013.
[20] Wei Xu, Hongbin Sun, Xiaobin Wang, Yiran Chen, and Tong Zhang. Design
of last-level on-chip cache using spin-torque transfer ram (stt ram). Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, 19(3):483–
493, March 2011.
[21] Yusung Kim, Sumeet Kumar Gupta, Sang Phill Park, Georgios Panagopou-
los, and Kaushik Roy. Write-optimized reliable design of STT MRAM. In
ISLPED ’12: Proceedings of the 2012 ACM/IEEE international symposium
on Low power electronics and design. ACM Request Permissions, July 2012.
[22] Guangyu Sun, Xiangyu Dong, Yuan Xie, Jian Li, and Yiran Chen. A novel
architecture of the 3d stacked mram l2 cache for cmps. In High Performance
Computer Architecture, 2009. HPCA 2009. IEEE 15th International Sympo-
sium on, pages 239–249, Feb 2009.
[23] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. Energy reduction for
STT-RAM using early write termination. In ICCAD ’09: Proceedings of the
2009 International Conference on Computer-Aided Design. ACM Request
Permissions, November 2009.
[24] Kon-Woo Kwon, Sri Harsha Choday, Yusung Kim, and Kaushik Roy.
AWARE (Asymmetric Write Architecture With REdundant Blocks): A High
Write Speed STT-MRAM Cache Architecture. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 22(4):712–720.
[25] Zhenyu Sun, Hai Li, and Wenqing Wu. A dual-mode architecture for fast-
switching STT-RAM. In ISLPED ’12: Proceedings of the 2012 ACM/IEEE
international symposium on Low power electronics and design. ACM Request
Permissions, July 2012.
[26] Junwhan Ahn, Sungjoo Yoo, and Kiyoung Choi. Dasca: Dead write predic-
tion assisted stt-ram cache architecture. High Performance Computer Archi-
tecture (HPCA2014), 2014 IEEE 20th International Symposium on, February
2014.
[27] Xiaoxia Wu, Jian Li, Lixin Zhang, E. Speight, and Yuan Xie. Power and
performance of read-write aware hybrid caches with non-volatile memories.
In Design, Automation Test in Europe Conference Exhibition , 2009. DATE
’09., pages 737–742, April 2009.
[28] Amin Jadidi, Mohammad Arjomand, and Hamid Sarbazi-Azad. High-
endurance and performance-efficient design of hybrid cache architectures
through adaptive line replacement. In ISLPED ’11: Proceedings of the 17th
IEEE/ACM international symposium on Low-power electronics and design.
IEEE Press, August 2011.
[29] B. Del Bel, Jongyeon Kim, C.H. Kim, and S.S. Sapatnekar. Improving stt-
mram density through multibit error correction. In Design, Automation and
Test in Europe Conference and Exhibition (DATE), 2014, pages 1–6, March
2014.
[30] Norman P Jouppi. Improving direct-mapped cache performance by the addi-
tion of a small fully-associative cache and prefetch buffers. In ACM SIGARCH
Computer Architecture News, volume 18, pages 364–373. ACM, 1990.
[31] Johnson Kin, Munish Gupta, and William H. Mangione-Smith. The filter
cache: An energy efficient memory structure. In Proceedings of the 30th
Annual ACM/IEEE International Symposium on Microarchitecture, MICRO
30, pages 184–193, Washington, DC, USA, 1997. IEEE Computer Society.
[32] A. Varma and Q. Jacobson. Destage algorithms for disk arrays with non-
volatile caches. In Computer Architecture, 1995. Proceedings., 22nd Annual
International Symposium on, pages 83–95, June 1995.
[33] Binny S. Gill and Dharmendra S. Modha. Wow: Wise ordering for writes -
combining spatial and temporal locality in non-volatile caches. In Proceed-
ings of the 4th Conference on USENIX Conference on File and Storage Tech-
nologies - Volume 4, FAST’05, page 10, Berkeley, CA, USA, 2005. USENIX
Association.
[34] Behtash Behin-Aein, Deepanjan Datta, Sayeff Salahuddin, and Supriyo
Datta. Proposal for an all-spin logic device with built-in memory. Nature
Nanotechnology, 5:266–270, Feb 2010.
[35] C. Augustine, G. Panagopoulos, B. Behin-Aein, S. Srinivasan, A. Sarkar,
and K. Roy. Low-power functionality enhanced computation architecture
using spin-based devices. In 2011 IEEE/ACM International Symposium on
Nanoscale Architectures, pages 129–136, June 2011.
[36] J. Kim, B. Tuohy, C. Ma, W. H. Choi, I. Ahmed, D. Lilja, and C. H. Kim.
Spin-hall effect mram based cache memory: A feasibility study. In 2015 73rd
Annual Device Research Conference (DRC), pages 117–118, June 2015.
[37] S. G. Ramasubramanian, R. Venkatesan, M. Sharad, K. Roy, and A. Raghu-
nathan. Spindle: Spintronic deep learning engine for large-scale neuromorphic
computing. In 2014 IEEE/ACM International Symposium on Low Power
Electronics and Design (ISLPED), pages 15–20, Aug 2014.
[38] Meng Yang, Bingzhe Li, David J. Lilja, Bo Yuan, and Weikang Qian. To-
wards theoretical cost limit of stochastic number generators for stochastic
computing. In IEEE Computer Society Annual Symposium on VLSI, 2018.
[39] Weisheng Zhao, E. Belhaire, and C. Chappert. Spin-mtj based non-volatile
flip-flop. In 2007 7th IEEE Conference on Nanotechnology (IEEE NANO),
pages 399–402, Aug 2007.
[40] K. Ryu, J. Kim, J. Jung, J. P. Kim, S. H. Kang, and S. O. Jung. A magnetic
tunnel junction based zero standby leakage current retention flip-flop. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 20(11):2044–
2053, Nov 2012.
[41] T. Endoh, S. Togashi, F. Iga, Y. Yoshida, T. Ohsawa, H. Koike, S. Fukami,
S. Ikeda, N. Kasai, N. Sakimura, T. Hanyu, and H. Ohno. A 600mhz mtj-
based nonvolatile latch making use of incubation time in mtj switching. In
2011 International Electron Devices Meeting, pages 4.3.1–4.3.4, Dec 2011.
[42] H. Koike, T. Ohsawa, S. Ikeda, T. Hanyu, H. Ohno, T. Endoh, N. Sakimura,
R. Nebashi, Y. Tsuji, A. Morioka, S. Miura, H. Honjo, and T. Sugibayashi.
A power-gated mpu with 3-microsecond entry/exit delay using mtj-based
nonvolatile flip-flop. In 2013 IEEE Asian Solid-State Circuits Conference
(A-SSCC), pages 317–320, Nov 2013.
[43] K. Jabeur, G. Di Pendina, F. Bernard-Granger, and G. Prenat. Spin orbit
torque non-volatile flip-flop for high speed and low energy applications. IEEE
Electron Device Letters, 35(3):408–410, March 2014.
[44] B.R. Gaines. Techniques of identification with the stochastic computer. In
Proc. IFAC Symp. Problems of Identification, pages 1–18, 1967.
[45] B.D. Brown and H.C. Card. Stochastic neural computation. II. Soft compet-
itive learning. Computers, IEEE Transactions on, 50(9):906–920, 2001.
[46] H. Li, D. Zhang, and S. Y. Foo. A stochastic digital implementation of a
neural network controller for small wind turbine systems. IEEE Transactions
on Power Electronics, 21(5):1502–1507, September 2006.
[47] Bingzhe Li, M Hassan Najafi, and David J Lilja. An fpga implementation
of a restricted boltzmann machine classifier using stochastic bit streams.
In Application-specific Systems, Architectures and Processors (ASAP), 2015
IEEE 26th International Conference on, pages 68–69. IEEE, 2015.
[48] Bingzhe Li, Yaobin Qin, Bo Yuan, and David J Lilja. Neural network classi-
fiers using stochastic computing with a hardware-oriented approximate acti-
vation function. In 2017 IEEE 35th International Conference on Computer
Design (ICCD), pages 97–104. IEEE, 2017.
[49] B. Li, M. H. Najafi, B. Yuan, and D. J. Lilja. Quantized neural networks
with new stochastic multipliers. In 2018 19th International Symposium on
Quality Electronic Design (ISQED), pages 376–382, March 2018.
[50] J. G. Ortega, C. L. Janer, J. M. Quero, J. Pinilla, and J. Serrano. Analog to
digital and digital to analog conversion based on stochastic logic. In IEEE
21st Annual Conference of Industrial Electronics Society, IECON’95, pages
995–999, 1995.
[51] C. L. Janer, J. M. Quero, J. G. Ortega, and L. G. Franquelo. Fully parallel
stochastic computation architecture. IEEE Transactions on Signal Process-
ing, 44(8):2110–2117, August 1996.
[52] Saeed S. Tehrani, S. Mannor, and Warren J. Gross. Fully parallel stochastic
ldpc decoders. IEEE Transactions on signal processing, page 11, November
2008.
[53] M. Hori and M. Ueda. Fpga implementation of a blind source separation sys-
tem based on stochastic computing. In IEEE Conference on Soft Computing
in Industrial Applications, SMCia’08, pages 182–187, 2008.
[54] N. Saraf, K. Bazargan, D.J. Lilja, and M.D. Riedel. Iir filters using stochas-
tic arithmetic. In Design, Automation and Test in Europe Conference and
Exhibition (DATE), 2014, pages 1–6, March 2014.
[55] Peng Li and D.J. Lilja. A low power fault-tolerance architecture for the ker-
nel density estimation based image segmentation algorithm. In Application-
Specific Systems, Architectures and Processors (ASAP), 2011 IEEE Interna-
tional Conference on, pages 161–168. IEEE, 2011.
[56] Weikang Qian, Xin Li, M.D. Riedel, K. Bazargan, and D.J. Lilja. An ar-
chitecture for fault-tolerant computation with stochastic logic. Computers,
IEEE Transactions on, 60(1):93–105, 2011.
[57] Jongyeon Kim, Hui Zhao, Yanfeng Jiang, Angeline Klemm, Jian-Ping Wang,
and Chris H. Kim. Scaling Analysis of In-plane and Perpendicular Anisotropy
Magnetic Tunnel Junctions Using a Physics-Based Model. In Device Research
Conference (DRC), 2014, June 2014.
[58] William Tuohy, Cong Ma, Pushkar Nandkar, Nishant Borse, and David J
Lilja. Improving energy and performance with spintronics caches in multi-
core systems. In Europar ’14: OMHI - Third Annual Workshop on On-Chip
Memory Hierarchies and Interconnects. Springer-Verlag, August 2014.
[59] Hewlett-Packard Development Company, L.P. CACTI 6.5, 2009.
[60] Wei Zhao and Yu Cao. New generation of predictive technology model for
sub-45nm design exploration. Quality Electronic Design, 2006. ISQED ’06.
7th International Symposium on, page 6, 2006.
[61] Xiangyu Dong, Cong Xu, Yuan Xie, and N P Jouppi. NVSim: A Circuit-Level
Performance, Energy, and Area Model for Emerging Nonvolatile Memory.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 31(7):994–1007.
[62] Davy Genbrugge, Stijn Eyerman, and Lieven Eeckhout. Interval simula-
tion: Raising the level of abstraction in architectural simulation. In High
Performance Computer Architecture (HPCA), 2010 IEEE 16th International
Symposium on, pages 1–12. IEEE, January 2010.
[63] Christian Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Prince-
ton University, January 2011.
[64] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh,
and Anoop Gupta. The splash-2 programs: Characterization and method-
ological considerations. In Proceedings of the 22Nd Annual International
Symposium on Computer Architecture, ISCA ’95, pages 24–36, New York,
NY, USA, 1995. ACM.
[65] C. Bienia, S. Kumar, and K. Li. Parsec vs. splash-2: A quantitative com-
parison of two multithreaded benchmark suites on chip-multiprocessors. In
Workload Characterization, 2008. IISWC 2008. IEEE International Sympo-
sium on, pages 47–56, Sept 2008.
[66] A R Alameldeen and D A Wood. IPC Considered Harmful for Multiprocessor
Workloads. Micro, IEEE, 26(4):8–17, 2006.
[67] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt,
Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna,
Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay
Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH
Comput. Archit. News, 39(2):1–7, August 2011.
[68] Cong Ma, William Tuohy, Pushkar Nandkar, and David J. Lilja. Cycle-
accurate stt-mram model in gem5. In International Symposium on Computer
Architecture (ISCA) Second gem5 User Workshop, 2015.
[69] Ki C. Chun, Hui Zhao, Jonathan D. Harms, Tae-Hyoung Kim, Jian-Ping
Wang, and Chul H. Kim. A Scaling Roadmap and Performance Evaluation
of In-Plane and Perpendicular MTJ Based STT-MRAMs for High-Density
Cache Memory. Solid-State Circuits, IEEE Journal of, 48(2):598–610, Febru-
ary 2013.
[70] Ricardo Gonzales and Mark Horowitz. Energy dissipation in general purpose
processors. IEEE Journal of Solid State Circuits, 31:1277–1284, 1995.
[71] Major Bhadauria, Vincent M Weaver, and Sally A McKee. Understanding
PARSEC performance on contemporary CMPs. Workload Characterization,
2009. IISWC 2009. IEEE International Symposium on, pages 98–107, 2009.
[72] C. Ma and D. J. Lilja. Parallel implementation of finite state machines for
reducing the latency of stochastic computing. In 2018 19th International
Symposium on Quality Electronic Design (ISQED), pages 335–340, March
2018.
[73] Cong Ma, Peng Li, and David J. Lilja. Autocorrelation study for finite-state
machine-based stochastic computing elements. In International Workshop on
Logic Synthesis (IWLS), 2013.
[74] A. A. Markov. Extension of the limit theorems of probability theory to a sum
of variables connected in a chain. reprinted in Appendix B of: R. Howard.
Dynamic Probabilistic Systems, volume 1: Markov Chains. John Wiley and
Sons, 1971.
[75] Nikolaos Thomos, Nikolaos V. Boulgouris, and Michael G. Strintzis. Op-
timized transmission of JPEG2000 streams over wireless channels. IEEE
transactions on image processing : a publication of the IEEE Signal Process-
ing Society, 15(1):54–67, January 2006.
[76] J. M. Quero, S. L. Toral, J. G. Ortega, and L. G. Franquelo. Continuous
time filter design using stochastic logic. 1:113–116, 1999.