System-Level Power, Thermal and Reliability ... - Queen's U
System-Level Power, Thermal and Reliability
Optimization
by
Changyun Zhu
A thesis submitted to the
Department of Electrical and Computer Engineering
in conformity with the requirements for
the degree of Doctor of Philosophy
Queen’s University
Kingston, Ontario, Canada
July 2009
Copyright © Changyun Zhu, 2009
Abstract
An integrated circuit can now contain more than one billion transistors. With
increasing system integration and technology scaling, power and power-related issues
have become the primary challenges of integrated circuit design. In this disserta-
tion, techniques and algorithms, from system-level synthesis to emerging integration
and device technologies, are proposed to address the power and power-induced ther-
mal and reliability challenges of modern billion-transistor integrated circuit design.
In Chapter 1, the challenges of semiconductor technology scaling are introduced.
Chapter 2 reviews the related works. Chapter 3 focuses on the reliability optimiza-
tion issue during system-level design. A reliable application-specific multiprocessor
system-on-chip synthesis system is proposed, called TASR, which exploits redundancy
and thermal-aware design planning to produce reliable and compact circuit designs.
Chapter 4 introduces three-dimensional (3D) integration, a new integrated circuit
fabrication and integration technology. Thermal issues are a primary concern of 3D
integration. A 3D integrated circuit heat flow analytical framework is proposed in this
chapter. Proactive, continuously-engaged hardware and operating system thermal
management techniques are presented and evaluated; they achieve better system
performance than state-of-the-art techniques while honoring the same temperature bound.
Chapter 5 presents reconfigurable architecture design using single-electron tunneling
transistor, an ultra-low-power nanometer-scale device. The proposed design has the
potential to overcome the power and energy barriers for both high-performance com-
puting and ultra-low-power embedded systems. Conclusions are drawn in Chapter 6.
Co-Authorship
All work regarding Reliable MPSoC Synthesis, 3D CMP Thermal Management
and Characterization of SET Transistors in this thesis (i.e., Chapter 3, Chapter 4
and Chapter 5 of the thesis) was done in collaboration with Zhenyu Gu.
Acknowledgments
First, I would like to gratefully thank my supervisor, Professor Li Shang, not only
for his supervision of my research work, but also for his patience and help which
encouraged me to complete my studies. He has all the traits of an excellent research
supervisor. I thank my committee members, Professor Robert Knobel, Professor
Ahmad Afsahi and Professor Alireza Bakhshai, for their corrections, suggestions,
and valuable feedback.
I would also like to thank Professor Naraig Manjikian for his kind help during
my studies at Queen’s University.
Thanks are also given to Zhenyu Gu, Yonghong Yang, Kun Li, Nicholas Allec,
Assem Bsoul, Zyad Mohamed, Professor Robert P. Dick and Professor Qin Lv for
their invaluable discussions.
Finally, I am grateful to my parents, wife and friends for their support and en-
couragement over these years.
Table of Contents
Abstract
Co-Authorship
Acknowledgments
Table of Contents
List of Symbols
List of Tables
List of Figures
Chapter 1: Introduction
1.1 Technology Scaling and Design Challenges
1.2 Dissertation Overview
Chapter 2: Related works
2.1 Reliability-aware synthesis
2.2 Three-dimensional integrated circuit
2.3 Single-electron tunneling transistors
Chapter 3: Reliable MPSoC Synthesis
3.1 Introduction
3.2 TASR: Temperature-Aware Synthesis of Reliable MPSoCs
3.3 Experimental Results
3.4 Conclusions and Future Work
Chapter 4: 3D CMP Thermal Management
4.1 Introduction
4.2 Contribution
4.3 Heat Flow in 3D CMPs
4.4 3D CMP Thermal Management
4.5 Experimental Setup
4.6 Experimental Results
4.7 Conclusions
Chapter 5: Characterization of SET Transistors
5.1 Introduction
5.2 SET Modeling
5.3 IceFlex: A Fault-Tolerant Hybrid SET/CMOS Reconfigurable Architecture
5.4 Experimental Results
5.5 Conclusions
Chapter 6: Conclusions and Future Work
6.1 Thesis Summary
6.2 Future Work
Bibliography
List of Symbols
A Thermal conductance matrix
C Capacitance
CD Drain capacitance
CG Gate capacitance
CS Source capacitance
CP Island capacitance
EaEM Activation energy of electromigration
EaSM Activation energy of stress migration
F (t) Cumulative distribution function
G Gain
I Current
J Current density
K Diagonal matrix containing the thermal conductances of adjacent thermal elements
Keff Effective vertical thermal conductivity
Klayer Thermal conductivity of the region without any vias
Kvia Thermal conductivity of the via material
L Laplacian matrix
Pd Power density
R Resistance
RD Drain resistance
RS Source resistance
T Temperature
T0 Metal deposition temperature during fabrication
Tambient Ambient temperature
Taverage Chip average temperature
VTH Threshold voltage
β Alpha power law parameter
κ or κB Boltzmann’s constant
µ Scale parameter of lognormal distribution
ρvia Via density
σ Shape parameter of lognormal distribution
ξ Run-time switching activity multiplied by the capacitance of the switched nodes
ζij Thermal impact coefficient for core i due to j
e Elementary charge
f Frequency
f(t) Probability density function
g Conductance
h Planck’s constant
3D Three dimensional
BIPS Billion instructions per second
BJT Bipolar junction transistor
CDF Cumulative distribution function
CMOS Complementary metal-oxide-semiconductor
CMP Chip-level multiprocessor
CR Component redundancy
DRAM Dynamic random access memory
DSP Digital signal processing
DTM Dynamic thermal management
DVFS Dynamic voltage and frequency scaling
EEMBC The embedded microprocessor benchmark consortium
FPGA Field-programmable gate array
IC Integrated circuit
IPC Instructions per cycle
LUT Lookup table
MPSoC Multiprocessor system-on-chip
MTTF Mean time to failure
MVL Majority voting logic
NoC Network on chip
OS Operating system
PDF Probability density function
PE Processing element
PRSA Parallel recombinative simulated annealing
SET Single-electron tunneling transistor
SMT Simultaneous multithreading
TIP Thermal impact per performance
List of Tables
3.1 System MTTF Improvement Under Area Bound [132].
4.1 ThermOS Implementation [134].
4.2 DVFS and Clock Throttling Comparison [134].
4.3 Design Parameters for Alpha 21264 [134].
4.4 3D Package Setup [134].
4.5 Benchmark Characteristics [134].
4.6 Benchmark Suites [134].
5.1 Island Size Estimation [133].
5.2 Design Space Characterization [133].
5.3 Impact of Majority Vote Logic on SELB Fault Probability [133].
5.4 Characterization of IceFlex Microarchitecture for $C_\Sigma = e^2/(40 k_B T)$ [133].
5.5 Characterization of IceFlex Interconnect Fabric for $C_\Sigma = e^2/(40 k_B T)$ [133].
5.6 Latency and Energy Improvement for Exclusive-Or Design [133].
5.7 IceFlex Performance and Power Consumption at Room Temperature for $C_\Sigma = e^2/(40 k_B T)$ [133].
List of Figures
1.1 Intel CPU Transistor Count [2].
1.2 Microprocessor Power Consumption.
1.3 Temperature Profile for Active Layer and Heatsink [123].
3.1 Reliable MPSoC Synthesis Example [132].
3.2 TASR Flow for the Temperature-Aware Synthesis of Reliable MPSoCs [132].
3.3 Temperature Impact on MTTF [38].
3.4 Comparison of MPSoC Area–Reliability Tradeoffs [38].
3.5 Comparison of Different Optimization Heuristics [132].
4.1 (a) Comparison of Face-to-Face (Left) and Face-to-Back (Right) Configurations for Two Stacked Dies, (b) 3D Three Stacked Die Floorplan Used in This Work, and (c) 3D CMP Chip-package Thermal Modeling [134].
4.2 Inter-layer and Intra-layer Thermal Heterogeneity and Dominance in 3D CMPs [134].
4.3 ThermOS: 3D CMP Run-time Thermal Management [134].
4.4 Comparison of ThermOS and Distributed Approach [28, 134].
4.5 Reduction in Temperature Constraint Violations due to Local DVFS and Elimination of Temperature Constraint Violations due to Clock Throttling [134].
4.6 Temporal Temperature Variation for Eight Processor Cores (P0–P7) Running lv-mipc2 Using Local DVFS Without (Top) and With (Bottom) Clock Throttling [134].
4.7 Negligible CMP Instruction Throughput Reduction Resulting from Local DVFS and Clock Throttling [134].
4.8 Impact of Global Guidance Interval [134].
4.9 Impact of Lookup Table Size [134].
4.10 Impact of Floorplan Rotation [134].
5.1 SET Structure and Schematic [133].
5.2 SET Coulomb Oscillation ($C_g = 3.2$ aF, $C_s = C_d = 1.0$ aF, and $R_s = R_d = 10$ MΩ) [133].
5.3 IceFlex Microarchitecture [133].
5.4 Multi-gate SET Multiplexer Tree [133].
5.5 SET Configuration Memory [135].
5.6 SET Parity Circuit [133].
5.7 Hybrid SET/CMOS Interface Circuitry [133].
5.8 Power and Performance of the Multi-gate SET Multiplexer Tree for High Performance, $C_\Sigma = e^2/(40 k_B T)$ [133].
5.9 Performance and Power Characterization of Exclusive-or Logic for Low Power for $C_\Sigma = e^2/(40 k_B T)$ [133].
Chapter 1
Introduction
1.1 Technology Scaling and Design Challenges
As observed by Gordon E. Moore in 1965, the number of transistors that can be
integrated on a chip doubles roughly every 18 to 24 months [78]. During the past four decades,
semiconductor technology scaling has provided consistent improvements in circuit
performance and integration density. Figure 1.1 shows the technology scaling of
Intel microprocessors since 1971. With increasing system integration and technology
scaling, integrated circuit design becomes increasingly complex. Power and power-
induced design issues, such as chip temperature and circuit reliability, have become
the primary concerns of modern integrated circuit design.
Power Challenges
Although scaling of technology provides higher functional integration, more
computing resources, better performance and parallel operation capability, the increased
operating frequency and transistor density raise the circuit dynamic power
consumption. Furthermore, because subthreshold leakage grows exponentially as a
transistor’s threshold voltage (VTH) decreases, and VTH is reduced with technology
scaling under the constant electric field scaling scenario, the chip leakage power increases
exponentially [97]. Figure 1.2 shows the power consumption of microprocessors
released during the past twenty years. It indicates an exponential increase in power
due to increased voltage, frequency, temperature and decreased threshold voltage.
Figure 1.1: Intel CPU Transistor Count [2]. (Plot of transistor count, on a logarithmic axis from 100 to 1e+10, versus year, 1970 to 2020, for Intel CPUs from the 4004 to the Core i7 and Quad-Core Itanium.)
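The exponential sensitivity of subthreshold leakage to threshold voltage can be illustrated numerically. The following Python sketch is not taken from the thesis. It assumes the textbook relation I_sub ∝ exp(-VTH / (n kT/q)), ignores the temperature dependence of the prefactor and drain-induced barrier lowering, and uses a hypothetical subthreshold slope factor n = 1.5. All function names are illustrative.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def thermal_voltage(temp_kelvin: float) -> float:
    """Return kT/q in volts at the given absolute temperature."""
    return BOLTZMANN_EV_PER_K * temp_kelvin


def relative_subthreshold_leakage(v_th: float,
                                  temp_kelvin: float = 300.0,
                                  slope_factor: float = 1.5) -> float:
    """Leakage relative to a device with V_TH = 0 (assumed model).

    Uses exp(-V_TH / (n * kT/q)); slope_factor is the assumed n.
    """
    return math.exp(-v_th / (slope_factor * thermal_voltage(temp_kelvin)))


def leakage_ratio(v_th_old: float, v_th_new: float,
                  temp_kelvin: float = 300.0,
                  slope_factor: float = 1.5) -> float:
    """Factor by which leakage changes when V_TH moves from old to new."""
    new = relative_subthreshold_leakage(v_th_new, temp_kelvin, slope_factor)
    old = relative_subthreshold_leakage(v_th_old, temp_kelvin, slope_factor)
    return new / old


if __name__ == "__main__":
    # Hypothetical scaling step: V_TH reduced from 0.4 V to 0.3 V.
    print(f"leakage grows by {leakage_ratio(0.4, 0.3):.1f}x")
```

Under these assumptions, a 100 mV reduction in VTH at 300 K raises subthreshold leakage by roughly an order of magnitude, which is why leakage becomes significant as VTH scales down.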
Thermal Challenges
As more power is consumed by increasingly dense integrated circuits, more heat
is generated, which raises chip temperatures. Elevated temperature has
a huge impact on IC performance, cooling cost, reliability, and power consumption.
The latencies of transistors and metal wires increase with increasing chip temperature,
as do the probabilities of many lifetime reliability faults [53, 102]. For example, the
electromigration failure rate is an exponential function of temperature. Leakage power
consumption is now responsible for a substantial proportion of overall power
consumption in commercial designs and increases with temperature [67]. IC chips and
packages exhibit significant spatial and temporal temperature variations due to the
heterogeneity of thermal conductivity and heat capacity in different materials, as well as the
variation of power profiles. This requires accurate chip-package heat flow analysis,
which is complex and computationally intensive. As illustrated by the example shown in
Figure 1.3, the steady-state thermal profile of the active layer of the silicon die in
conjunction with the top layer of the cooling package is characterized using a multigrid
thermal solver, which has to partition the chip and the cooling package into 131,072
homogeneous thermal elements. Compared to steady-state thermal modeling,
characterizing an IC dynamic thermal profile is even more time consuming. IC synthesis
requires a large number of optimization steps; thermal modeling can easily become
its performance bottleneck [123].
Figure 1.2: Microprocessor Power Consumption. (Plot of power in watts, on a logarithmic axis from 1 to 1000, versus year, 1980 to 2010, for Intel, Alpha, SPARC, MIPS, HP PA, PowerPC, AMD and Sun processors.)
Figure 1.3: Temperature Profile for Active Layer and Heatsink [123]. (Plot of temperature, 35 to 90 °C, over position, -8 to 8 mm in each direction, for the heatsink/IC interface and the IC active layer.)
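To make the exponential temperature dependence of electromigration concrete, the sketch below uses Black's equation, MTTF ∝ J^(-n) exp(Ea / (k T)), a widely used model that the text above does not itself state. The activation energy of 0.9 eV and the two operating temperatures are hypothetical values chosen only for illustration, and the function name is not from the thesis.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_mttf_ratio(temp_cool_c: float, temp_hot_c: float,
                  activation_energy_ev: float = 0.9) -> float:
    """Return MTTF(cool) / MTTF(hot) for electromigration.

    Assumes Black's equation with the same current density at both
    temperatures, so only the Arrhenius term exp(Ea / kT) differs.
    """
    t_cool = temp_cool_c + 273.15  # convert to kelvin
    t_hot = temp_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_cool - 1.0 / t_hot)
    return math.exp(exponent)


if __name__ == "__main__":
    # Hypothetical hotspot at 85 degrees C versus a cooler region at 60 degrees C.
    print(f"lifetime ratio: {em_mttf_ratio(60.0, 85.0):.1f}x")
```

With these assumed parameters, a 25 °C temperature difference changes the expected electromigration lifetime by nearly an order of magnitude, which illustrates why accurate thermal profiles matter for reliability estimation.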
Reliability Challenges
Moreover, aggressive scaling of CMOS process technology poses serious challenges
to the lifetime reliability of ICs. Reduction of feature size and increases in power den-
sity have resulted in increasing chip temperature and failure rates. Increased system
integration using these vulnerable devices and interconnects results in reduced system
reliability. The severity of many reliability problems, such as time-dependent dielec-
tric breakdown in MOS transistors and electromigration in interconnects, increases
exponentially with temperature. Lifetime reliability is becoming an important
quality metric in high-performance ICs. Optimizing lifetime reliability requires careful
planning during IC design and synthesis. At the architectural level, careful assign-
ment of tasks to processing elements (PEs) can balance the thermal profile of the chip,
thereby improving system reliability. Synthesis-time architectural planning and care-
ful use of PE-level and component-level (e.g., functional unit) redundancy will permit
continued MPSoC operation after the failure of some processors or components, while
limiting area overhead. At the physical level, a fast floorplanner is needed to pro-
vide physical information for generating the power profile which, in turn, is used to
determine the thermal profile. The evaluation and optimization of system reliability
and other design metrics, such as area and performance, require a comprehensive and
efficient architectural-level and physical-level synthesis infrastructure.
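The benefit of PE-level redundancy described above can be sketched with a classical k-of-n reliability model. This is a deliberately simplified illustration, not the reliability model used in this thesis: it assumes identical, independent PEs with exponentially distributed lifetimes, and it ignores the fact that failures shift workload and temperature onto the surviving PEs. The function names and parameters are hypothetical.

```python
from math import comb


def k_of_n_reliability(k: int, n: int, unit_reliability: float) -> float:
    """Probability that at least k of n independent units still work.

    unit_reliability is the probability that one unit works at the
    time of interest (assumed identical for all units).
    """
    r = unit_reliability
    return sum(comb(n, i) * r**i * (1.0 - r)**(n - i) for i in range(k, n + 1))


def k_of_n_mttf(k: int, n: int, failure_rate: float) -> float:
    """MTTF of a k-of-n system with i.i.d. exponential unit lifetimes.

    With i working units the next failure arrives at rate i * failure_rate,
    and the system fails once fewer than k units remain.
    """
    return sum(1.0 / (i * failure_rate) for i in range(k, n + 1))


if __name__ == "__main__":
    # One spare PE: the system needs 1 working PE out of 2.
    print(k_of_n_reliability(1, 2, 0.9))
    print(k_of_n_mttf(1, 2, failure_rate=1.0))
```

Under these assumptions, adding one spare PE to a single-PE system raises the MTTF by a factor of 1.5 at the cost of doubling PE area, which hints at the area and reliability tradeoff that a synthesis flow has to explore.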
In summary, power, thermal and reliability issues have become dominant
constraints in modern nanoscale integrated circuit design. For high-performance
applications, temperature affects integration density, performance, power consumption
and cost. For battery-powered embedded systems, energy consumption directly
determines system lifetime. For any system, reliability strongly depends on the thermal
profile during operation.
1.2 Dissertation Overview
In this dissertation, the power, thermal and reliability challenges are
addressed from three aspects: system-level synthesis algorithms, recently
proposed circuit integration technology and emerging device technology. First,
reliability considerations are integrated into the system-level synthesis algorithms of the IC
design flow. Then, three-dimensional integration, a recently proposed technology
that overcomes limitations of 2D technology, is discussed.
Finally, an emerging device technology, single-electron tunneling transistors, is
evaluated as a way to overcome the coming challenges for CMOS devices. The rest of this
dissertation is organized as follows.
First, technology scaling and increasing power densities are increasing the severity
of IC lifetime reliability problems. The lifetime reliability problem cannot be
solved well at any single level of the design process. Reliability characterization requires
chip-package thermal profiles, which in turn require physical information, including
an IC floorplan, power profile, and chip-package thermal model. Reliability-aware IC
design requires a unified architectural-level and physical-level design flow. Therefore,
a system-level synthesis flow that conducts architectural synthesis, floorplanning,
on-chip network synthesis, chip-package thermal analysis, and reliability analysis is
proposed in Chapter 3. Optimization algorithms within this flow exploit redundancy
and temperature-aware design planning to produce reliable, compact IC designs. My
major contributions to this chapter are MPSoC reliability modeling,
temperature-dependent reliability modeling and reliability-aware optimization algorithm design.
My collaborator, Zhenyu Gu, contributed to the floorplanning and on-chip network
synthesis. Two papers have been published on this project [132, 38].
Second, three-dimensional (3D) integration has the potential to improve the com-
munication latency and integration density of IC designs. By stacking multiple device
layers connected through inter-die vias, 3D technology significantly reduces on-chip
wire length, enables efficient interconnect and logic design, and further boosts logic
integration density. However, the stacked high power density layers of 3D chips in-
crease the importance and difficulty of thermal management. Chip power density
increases linearly with the number of vertically-stacked active circuit layers. In addi-
tion, the bonding layers used in 3D integration have low thermal conductivities, which
further exacerbates thermal effects. Chapter 4 identifies and describes the critical
concepts required for optimal thermal management and proposes proactive,
continuously-engaged hardware and operating system thermal management techniques that achieve
better performance than state-of-the-art techniques while honouring the same
temperature bound. My major contributions to this chapter are the characterization of
heat flow in 3D CMPs, the derivation of optimal workload assignment and power–thermal
budgeting, and the thermal management implementation in the Linux kernel. My
collaborator, Zhenyu Gu, contributed to the design of the 3D CMP architecture and technology,
the construction of the full simulation framework, and the characterization and
generation of the benchmark suites. Two papers have been published on this project [131, 134].
Third, device researchers have seen the coming challenges for CMOS devices
and evaluated alternative technologies such as single-electron tunneling transistors
(SETs). The International Technology Roadmap for Semiconductors projects that
SETs have the potential to achieve the lowest projected energy per switching event of
any known device. However, their use poses unique architectural, circuit design and
fabrication challenges. Chapter 5 explores the potential use of SETs in low-power
embedded systems, evaluates the benefits and limitations of SETs, and characterizes the
impacts of SETs on system design metrics. Based on the evaluation of the
architectural and circuit-level features, a fault-tolerant, reconfigurable, hybrid SET/CMOS-based
architecture is proposed in this chapter. My major contributions to this chapter
are SET modeling, SET design space characterization and characterization of the
IceFlex architecture. My collaborator, Zhenyu Gu, contributed to the global/local
interconnect design and characterization of embedded applications. Two papers have
been published on this project [135, 133].
Finally, a conclusion of this dissertation and the potential future research problems
are presented in Chapter 6.
Chapter 2
Related works
2.1 Reliability-aware synthesis
Our reliable MPSoC synthesis work draws from research in the areas of integrated
circuit reliability modeling and optimization [103, 21], system synthesis [30, 42, 120,
64], physical design, and thermal analysis [99, 123]. Coskun et al. [21] and Srini-
vasan et al. [103] provided architectural reliability models and run-time optimization
techniques for MPSoCs and microprocessors, respectively. Eles et al. contrasted opti-
mization algorithms for use in hardware–software partitioning [30]. Henkel and Ernst
proposed flexible task discretization during hardware–software partitioning [42]. Xie
et al. proposed a technique to duplicate tasks on idle processors during embedded
system synthesis to tolerate transient faults [120]. Lee and Ha proposed an alloca-
tion, assignment, and scheduling algorithm for real-time MPSoCs [64]. Ogras et al.
proposed a branch-and-bound algorithm for NoC synthesis [81]. Glaß et al. proposed
an evolutionary algorithm that binds tasks to resources with the goal of improving
mean time to failure (MTTF) [36]. They considered fault processes with exponential
or Weibull distributions; their fault model supports permanent faults. Our system
and fault models differ primarily by considering the influence of faults on subsequent
fault rates due to the impact of run-time rebinding on the temperature profile.
2.2 Three-dimensional integrated circuit
This section summarizes the current status of 3D integration in microprocessor
design, surveys related work in microprocessor thermal management, and indicates
the special thermal management challenges 3D CMPs will bring.
Several 3D fabrication technologies have been proposed and developed [109, 108,
95]. Topol et al. reviewed the 3D fabrication process and design techniques developed
at IBM [109]. Tezzaron [108] and Samsung [95] developed 3D fabrication technologies
and Intel is planning to use 3D integration in the Terascale project [115].
3D integration increases the importance of, and complicates, thermal manage-
ment. The 2D heat flux density through the heatsink increases roughly linearly with
the number of stacked wafers. As a result, unless per-layer power densities are greatly
reduced, 3D CMPs will often operate near their thermal limits. Today’s 2D CMPs
already operate at or near their thermal limits, and rely on reactive management
techniques to maintain thermal safety.
In addition to increasing the importance of thermal management, 3D integration
complicates thermal management policy design. In contrast with 2D CMPs, the
temperatures of some pairs of 3D CMP processor cores, e.g., vertically-adjacent cores,
are highly correlated. Moreover, in 2D CMPs, processor cores have similar thermal
resistances to the ambient, and high thermal resistances to other cores. In 3D CMPs,
core resistances to ambient and thermal interactions are highly heterogeneous. For
example, heat generated in cores farther from the heatsink must flow through more
layers of silicon and polyimide bonding before reaching the heatsink.
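The heterogeneity described above can be illustrated with a one-dimensional steady-state thermal resistance chain. The sketch below is a toy model under stated assumptions, not the heat flow framework of this thesis: each die is a single node, all heat leaves through the heatsink, lateral heat flow is ignored, and the resistance and power values are hypothetical.

```python
from typing import List


def stack_temperatures(layer_powers_w: List[float],
                       r_heatsink_k_per_w: float,
                       r_interlayer_k_per_w: float,
                       t_ambient_c: float = 45.0) -> List[float]:
    """Steady-state temperature of each die in a vertical stack.

    layer_powers_w[0] is the die adjacent to the heatsink; later entries
    are farther away. Heat generated in a die must cross every interface
    between that die and the heatsink.
    """
    if not layer_powers_w:
        return []
    heat_through_interface = sum(layer_powers_w)
    # All heat crosses the heatsink-to-ambient resistance.
    temp = t_ambient_c + r_heatsink_k_per_w * heat_through_interface
    temps = [temp]
    for i in range(1, len(layer_powers_w)):
        # Only heat from dies farther from the heatsink crosses interface i.
        heat_through_interface -= layer_powers_w[i - 1]
        temp += r_interlayer_k_per_w * heat_through_interface
        temps.append(temp)
    return temps


if __name__ == "__main__":
    # Hypothetical three-die stack, 20 W per die.
    print(stack_temperatures([20.0, 20.0, 20.0], 0.5, 0.25))
```

In this toy model the temperature rise at the heatsink grows linearly with the number of equally powered dies, and the die farthest from the heatsink is always the hottest, which matches the qualitative behavior described in the text.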
We next survey work in microprocessor thermal management. Initially, thermal
control strategies were seen as infrequently engaged last resorts. However, due
to increasing transistor densities and limitations in cooling technology, thermal
control will be constantly engaged. ThermOS was developed for this emerging thermal
management paradigm.
Black et al. evaluated the performance improvement yielded by stacking memory
and logic layers [12]. Healy et al. proposed a microarchitecture-level floorplanning al-
gorithm that works for both 2D and 3D ICs [39]. Kgil et al. proposed an architecture
in which processing core layers are vertically integrated with main memory consisting
of multiple DRAM dies, permitting performance and power consumption improve-
ments compared to 2D designs [57]. Li et al. proposed a 3D topology that combines
the benefits of network-on-chip and 3D technology to reduce L2 cache latencies [65].
Tsai et al. explored cache implementation in 3D technologies [110].
Thermal issues are critical for 3D integration. Puttaswamy and Loh evaluated the
thermal impact of 3D integration on high-performance microprocessors [89]. They
also proposed a family of techniques that reduce 3D power density and assign more
power to the die closest to the heat sink [90]. These approaches are principally applied
at design time. Skadron et al. described a compact thermal analysis technique that
has been extended to support 3D integration [99]. Loi et al. studied processor and
memory behavior under temperature constraints for 3D technology [72]. Link and
Vijaykrishnan examined thermal effects in 3D technologies [71].
Brooks and Martonosi presented one of the first evaluations of dynamic thermal
management (DTM) [14]. In essence, DTM allows microprocessor designers to
design for the average-case, instead of the worst-case, power profile and to rely on
run-time mechanisms to detect and resolve potential thermal emergencies. This
yields better overall performance than pessimistically designing systems based on the
worst-case power profile. Li et al. examined the impact of several design constraints,
including thermal effect, on CMP architecture design [69]. Sun et al. proposed a
temperature-aware synthesis technique for 3D CMPs [104], but did not consider
run-time OS management.
Migration strategies can improve the use of multi-core processors by distributing
heat generation more uniformly across the chip. Heo et al. proposed reducing peak
power density by moving computation to another physical location [43]. Powell et
al. explored the benefit of OS thermal management for SMTs and CMPs [87]. They
proposed the Heat and Run strategy, in which the OS co-schedules and migrates SMT
threads to maximize resource utilization before a thermal emergency arises and then
migrates computation to an idle core. Kumar et al. examined hardware-software ther-
mal management that uses hardware performance counters to characterize thermal
behavior and kernel support to schedule tasks [63]. They evaluated their mechanism
on a real system with SMT support and found significant benefits from considering
system-level effects that cannot be accounted for with pure hardware techniques. We
also take advantage of kernel scheduling and performance counters, and additionally
consider multi-core management. Recent work by Park et al. examined energy-performance
tradeoffs in multi-threaded applications [83].
2.3 Single-electron tunneling transistors
Since single-electron tunneling transistors (SETs) were discovered in the 1980s [9, 33], there
has been extensive research on fabrication, design, and modeling of SETs [70]. SET
fabrication and use in high-sensitivity amplifiers at cryogenic temperatures has been
the main research focus [25]. SETs and simple circuits with a variety of structures
were proposed and fabricated using different methods and materials [80, 105, 6]. Re-
cently, researchers have fabricated SETs that operate at room-temperature [75, 98,
84]. Various SET-based circuit applications, such as logic [111, 112, 79, 19] and memory
[126, 118, 122], have been developed. These works provide a promising start for
SET circuit design. However, they did not provide an architectural evaluation.
We do not claim to have improved the performance of SET-based logic gates.
Instead, we are the first to develop the modules necessary to support architectural
design and synthesis and to evaluate the architectural performance and power
consumption implications of using SETs. In this evaluation, SETs demonstrate
orders-of-magnitude improvements in power consumption and energy efficiency compared to CMOS.
Research on SET modeling and simulation has been an active area. Monte Carlo
simulation has been widely used to model SETs. SIMON [117] and MOSES [17] are
the two most popular SET simulators. However, they are too slow for analysis of large
circuits. Uchida et al. proposed an analytical SET model and incorporated it into
SPICE [113]. Recently, Inokawa et al. extended this model to a more general form to
include asymmetric SETs [49]. Mahapatra et al. proposed a simulation framework for
hybrid SET/CMOS circuit design and analysis [73]. Their model for SET behavior
is similar to that of Uchida et al. These compact modeling techniques are efficient
enough for use in SET circuit design and analysis and closely match Monte Carlo
simulation results.
Significant challenges still remain for large-scale integration of SETs and for room-
temperature operation. SETs that operate reliably at room temperature have critical
dimensions of ∼1–10 nm. They are challenging to fabricate using current top-down
lithographic techniques. However, several exciting advances make the evaluation of
architectures for high-density logic based on SETs worthwhile. Scanning-probe mi-
croscopes can be used to create devices smaller than those using conventional lithog-
raphy [75]. Continual progress has been made on bottom-up nano-fabrication tech-
niques, where chemical techniques are used to make individual molecules with useful
electronic properties. Molecular quantum dots [40] can display SET behavior. Larger
structures, such as carbon nanotubes and nanowires, can act as SETs [6]. These
bottom-up techniques can create structures supporting room-temperature SET oper-
ation. However, more research is needed in order to integrate individual devices into
large-scale circuits. Very recent advances in graphene [35] devices show promise for
SETs. Reliable methods for cooling to very low temperatures without supplies of liq-
uid helium or nitrogen are also becoming more common [114]. For high-performance
computing, the added complexity of operating at cryogenic temperatures may not be
a limiting factor. Similarly, cryogenic temperatures are readily attained using passive
methods in outer space.
Chapter 3
Reliable Multiprocessor
System-On-Chip Synthesis
This chapter presents a multiprocessor system-on-chip (MPSoC) synthesis algorithm
that optimizes system mean time to failure (MTTF). The input is a set of periodic
directed acyclic graphs in which nodes represent operations and edges represent
communication events. To minimize system failure rate and area while meeting
functionality and timing constraints, the proposed algorithm determines 1) a
processor core allocation, which selects the processor cores to include in the MPSoC;
2) processor-level redundancy, which adds identical processor cores to the MPSoC
architecture; 3) component-level structural redundancy, which adds appropriate
control mechanisms and redundant hardware to individual processor cores; 4) an
assignment of tasks to processors, which maps each task to a processor core; 5) a
floorplan, which estimates the area of each processor core and arranges the cores within
a given region; and 6) a schedule, which determines when each operation is given
access to system resources. Changes to the thermal profile resulting from changes in
CHAPTER 3. RELIABLE MPSOC SYNTHESIS 16
allocation, assignment, scheduling, and floorplan are modeled and optimized during
synthesis, as is the impact of thermal profile on temperature-dependent failure mech-
anisms. The proposed techniques have the potential to substantially increase MPSoC
system mean time to failure compared to area-optimized solutions. If power densities
are high and the dominant lifetime failure mechanisms are strongly dependent on tem-
perature, our results indicate that thermal and structural redundancy optimization
during synthesis have the potential to greatly increase MPSoC lifetime with low area
cost. My major contributions to this chapter are MPSoC reliability modeling,
temperature-dependent reliability modeling, and reliability-aware optimization algorithm
design (Sections 3.2.1, 3.2.2, 3.2.3, 3.2.4, and 3.2.5). My collaborator, Zhenyu
Gu, contributed the floorplanning and on-chip network synthesis (Section 3.2.6).
3.1 Introduction
A single integrated circuit can now contain more than one billion transistors.
It has been necessary to move to MPSoCs to control design complexity and power
consumption.
Increasing power density due to continued scaling of CMOS process technology
accelerates temperature-dependent and current-dependent failure mechanisms such as
electromigration. Lifetime reliability is becoming an important quality metric in high-
performance MPSoCs. Optimizing lifetime reliability requires careful planning during
MPSoC design and synthesis. This problem cannot be well solved at any single level
of the design process. Reliability characterization requires MPSoC thermal profiles,
which in turn requires physical information, including an MPSoC floorplan, power
profile, and chip-package thermal model. Reliability-aware MPSoC design requires
a unified architectural-level and physical-level design flow.
3.1.1 Contributions
Our work addresses synthesis of MPSoCs capable of reliable operation in the pres-
ence of permanent faults. The proposed algorithm generates MPSoC architectures
that satisfy the functionality and performance constraints of a specification while si-
multaneously optimizing die area and MTTF. The problem specification consists of
graphs composed of data-dependent, multirate, periodic tasks as well as a database of
processor cores. Each processor core executes different tasks with different execution
times and power consumptions. This work makes the following main contributions.
1. We have developed and implemented an MPSoC synthesis flow that conducts
architectural synthesis, floorplanning, on-chip network synthesis, chip-package
thermal analysis, and reliability analysis. Optimization algorithms within this
flow exploit redundancy and temperature-aware design planning to produce
reliable, compact MPSoC designs.
2. We propose a two-phase reliability optimization flow that builds on a stochastic
functionality, performance, and area optimization algorithm and an iterative
reliability enhancement algorithm that explores the trade-off between MPSoC
reliability and area. This algorithm improves MPSoC system MTTF by an
average of 85% with less than 5% area cost and by an average of 436% with less
than 25% area cost, compared to area-optimized solutions.
To the best of our knowledge, this is the first work to propose and implement a
method of predicting and optimizing the impact of design changes during synthesis
Figure 3.1: Reliable MPSoC Synthesis Example [132]. (Floorplans of Solution I and Solution II, built from PowerPC, AMD K6-2E+, and PowerPC(RE) cores.)
on temperature-dependent MPSoC failure processes.
3.1.2 System MTTF Definition and Example
We define system MTTF to be the expected amount of time an MPSoC will
operate, possibly in the presence of component faults, before its performance drops
below some designer-specified constraint or it is no longer able to meet its functionality
requirements. Using system MTTF to characterize reliability has the advantage of
taking into account performance; this is important for consumer electronics and most
other MPSoC applications.
To concurrently optimize the system MTTF and area of an MPSoC, it is necessary
to exploit both hardware redundancy and temperature profile. Processor-level redun-
dancy is achieved by adding processors to the MPSoC architecture. Component-level
redundancy is achieved by adding appropriate control mechanisms and redundant
hardware such as additional arithmetic logic units (ALUs) or cache banks to individ-
ual processors [103]. We will illustrate each method of improving system MTTF us-
ing an example. Figure 3.1 shows two synthesized solutions for a telecommunication
application based on processor performance data from the Embedded Microprocessor
Benchmark Consortium [31]. Each solution contains three embedded processors con-
nected by an on-chip router. The temperature of each on-chip component is indicated
by its brightness: brighter components are hotter. The AMD K6-2E+ embedded
processor used in Solution I is replaced with an IBM PowerPC 405GP-RE in Solution II.
The 405GP-RE is a low-power, redundant version of the 405GP; the floating/fixed
point units and register files are duplicated. The system MTTFs of Solution I and
Solution II are 0.7 years and 1.5 years, respectively; this change roughly doubled MTTF.
Further reliability enhancements can be used to increase MTTF to 7 years at small area cost. In
this example, solutions contain processors from different companies. If necessary, the
database can be limited to processors from a single company. To simplify the
synthesis problem, we ignore the possibility that some processors are better suited
than others to a particular task, as long as overall performance meets
the deadline requirements.
This example illustrates the potential improvement to system MTTF due to temperature
reduction and resource redundancy. First, MPSoC reliability strongly depends on
temperature. In Solution I, the K6-2E+ has a peak temperature of 59.9 °C. In Solution II,
replacing the K6-2E+ with the 405GP-RE reduces the peak temperature
by 5.1 °C, thereby decreasing the run-time fault rate. Second, increasing system
redundancy improves fault tolerance. Compared to the K6-2E+, the 405GP-RE can
tolerate more run-time faults. This results in an improvement to system MTTF.
Figure 3.2: TASR Flow for the Temperature-Aware Synthesis of Reliable MPSoCs [132]. (Flowchart with two phases. Stochastic optimization of functionality, timing, and area: initial construction of solutions, core allocation change, task assignment change, adaptive list scheduling, floorplanning, and functionality, performance, and area evaluation, repeated until convergence to an area-optimized MPSoC. Reliability/area curve exploration: reliability enhancement by core reinforcement, core swapping, and core addition, with thermal analysis, reliability analysis, and functionality, performance, area, and reliability evaluation, repeated until the maximum area is reached.)
3.2 TASR: Temperature-Aware Synthesis of Reliable MPSoCs
In this section, we describe TASR, the proposed reliable application-specific MP-
SoC synthesis infrastructure.
3.2.1 TASR Infrastructure
Determining and optimizing MPSoC system MTTF requires substantial infras-
tructure. Figure 3.2 illustrates the main steps and components in the proposed
synthesis flow. Computing system MTTF requires knowledge of component MT-
TFs and run-time performance constraints. Computing component MTTFs requires
knowledge of MPSoC thermal profile and architecture. Computing MPSoC thermal
profile during synthesis requires a floorplan, task-assignment-dependent power modeling,
and a thermal analysis algorithm. Finally, determining and optimizing MPSoC
architecture requires a system-level synthesis infrastructure that allocates processor
cores, assigns tasks to processors, rapidly generates floorplans, assigns communication
events to network links, and schedules operations and communication events.
TASR is composed of algorithms from three domains: system-level synthesis,
physical synthesis, and solution analysis. The system-level design contains a single-
objective stochastic optimization algorithm that minimizes MPSoC area subject to
functionality and performance requirements, and an iterative reliability enhancement
algorithm that uses knowledge of redundancy and thermal profile to improve system
MTTF at a small cost in MPSoC area. Physical-level synthesis consists of a slicing
floorplanning algorithm and an on-chip network synthesis algorithm. In addition,
TASR contains a novel statistical lifetime reliability model, and also performance,
power, and thermal models to guide MPSoC reliability optimization.
Given
1. Functionality and timing requirements consisting of a directed acyclic graph of
periodic graphs of communicating heterogeneous tasks, each of which may have
a different deadline;
2. Databases indicating the properties of the available heterogeneous processor
cores and on-chip network resources when used with the tasks in the function-
ality requirements specification, e.g., task execution times and power consump-
tions on each processor and processor areas; and
3. Temperature-dependent reliability models for the processors and functional
units within them.
TASR uses a two-stage optimization flow to determine
1. An allocation of processor cores that are selected based on their performance
and reliability characteristics;
2. An assignment of tasks to processor cores that takes task impact on temperature
and therefore reliability into account;
3. A schedule of all the tasks and communication events in the system; and
4. A floorplan for the MPSoC.
The solutions are optimized for reliability (maximized MTTF) and area. Each so-
lution is associated with numerous alternative task assignments and schedules to
permit continued operation in the event of processor core failure. If a processor fails,
the resulting change in task assignment and schedule required to maintain functional
correctness and meet timing requirements is pre-planned.
3.2.2 Two-Phase Synthesis Flow
This section explains the two-phase synthesis process used within TASR. The
first phase uses a parallel recombinative simulated annealing (PRSA) algorithm, i.e.,
an advanced form of genetic algorithm, to search for low-area MPSoC architectures
that meet functionality and timing requirements without violating area constraints.
Previous studies [26] have demonstrated that PRSA allocation and assignment
together with adaptive list scheduling finds optimal solutions to problems
for which optimal solutions are known [88]. For problem instances with previously
published results, the PRSA approach rapidly produces solutions of equal or better
quality [44, 127]. Adaptive list scheduling makes multiple scheduling attempts with
different prioritization metrics in order to meet timing and functionality constraints.
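The idea of retrying list scheduling under different prioritization metrics can be sketched as follows. This is a minimal illustration only, not TASR's scheduler: the two metrics (longest task first and critical path first), the fixed task-to-processor assignment, the single deadline, and the three-task instance are all assumptions introduced here; the text does not state which metrics TASR uses.

```python
# Hypothetical sketch of adaptive list scheduling: try several priority
# metrics in turn and keep the first schedule that meets the deadline.

def bottom_levels(durations, deps):
    """Longest time from the start of each task to the end of any sink task."""
    succs = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            succs[p].append(task)
    memo = {}

    def level(t):
        if t not in memo:
            memo[t] = durations[t] + max((level(s) for s in succs[t]), default=0.0)
        return memo[t]

    return {t: level(t) for t in durations}


def list_schedule(durations, deps, proc_of, priority):
    """Non-preemptive list scheduling for a fixed task-to-processor assignment."""
    finish, proc_free = {}, {}
    remaining = set(durations)
    while remaining:
        # Ready tasks: all predecessors already scheduled (sorted for determinism).
        ready = sorted(t for t in remaining
                       if all(p in finish for p in deps.get(t, [])))
        if not ready:
            raise ValueError("task graph contains a cycle")
        task = max(ready, key=priority)
        proc = proc_of[task]
        start = max([proc_free.get(proc, 0.0)] +
                    [finish[p] for p in deps.get(task, [])])
        finish[task] = start + durations[task]
        proc_free[proc] = finish[task]
        remaining.remove(task)
    return finish


def adaptive_list_schedule(durations, deps, proc_of, deadline):
    """Return (metric name, finish times) for the first metric meeting the deadline."""
    levels = bottom_levels(durations, deps)
    metrics = [
        ("longest_task_first", lambda t: durations[t]),
        ("critical_path_first", lambda t: levels[t]),
    ]
    for name, priority in metrics:
        finish = list_schedule(durations, deps, proc_of, priority)
        if max(finish.values()) <= deadline:
            return name, finish
    return None  # no metric produced a valid schedule


# Illustrative instance: task c (on P1) depends on task a (on P0).
durations = {"a": 1.0, "b": 3.0, "c": 4.0}
deps = {"c": ["a"]}
proc_of = {"a": "P0", "b": "P0", "c": "P1"}
result = adaptive_list_schedule(durations, deps, proc_of, deadline=6.0)
```

In this instance the first metric delays task a behind task b and misses the deadline, while the second metric schedules a first and meets it, which is the situation in which a second scheduling attempt pays off.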
The MPSoC lifetime reliability optimization problem can potentially be solved
using a PRSA synthesis flow by including system MTTF with the other optimization
objectives. However, the addition of reliability optimization to functional, timing,
and area optimization greatly increases problem complexity. Moreover, the time cost
of determining the reliability impact of a design change is much higher than that
of determining the area and performance impact. It becomes necessary to conduct
thermal and reliability analysis and to determine multiple task assignments and sched-
ules for each MPSoC in order to support runtime adaptation to processor core failure.
Therefore, we propose starting from an area-optimized solution meeting functionality
and timing constraints and using a reliability enhancement algorithm to explore the
area–reliability tradeoff curve.
Lifetime reliability is inversely related to chip temperature. Increasing chip
area decreases power density and chip temperature, thereby increasing chip reliability.
Structural redundancy, which permits continued processor or MPSoC operation after
component failure and generally increases area, can also improve reliability.
3.2.3 Integrated Circuit Failure Mechanisms
In this section, we characterize integrated circuit (IC) failure mechanisms. The
lifetime reliability of ICs is primarily affected by the following failure mechanisms:
electromigration, thermal cycling, time-dependent dielectric breakdown, and stress
migration [103].
Electromigration is the gradual displacement of the atoms in metal wires caused
by electrical current. It leads to voids and hillocks that cause open and short circuit
failures. The MTTF due to electromigration is given by the following equation [55]:
$$\mathrm{MTTF}_{EM} = \frac{A_{EM}}{J^{n}}\, e^{\frac{E_{a_{EM}}}{\kappa T}} \tag{3.1}$$
where $A_{EM}$ is a constant determined by the physical characteristics of the metal
interconnect, $J$ is the current density, $E_{a_{EM}}$ is the activation energy of electromigration,
$n$ is an empirically-determined constant, $\kappa$ is Boltzmann’s constant, and $T$ is the
temperature.
Thermal cycling refers to IC fatigue failures caused by thermal mismatch deforma-
tion. In IC chip and package, adjacent material layers such as copper/low-k dielectric
have different coefficients of thermal expansion. As a result, run-time thermal vari-
ation causes fatigue deformation, leading to failures. The MTTF due to thermal
cycling is given by the following equation [55]:
$$\mathrm{MTTF}_{TC} = \frac{A_{TC}}{(T_{average} - T_{ambient})^{q}} \tag{3.2}$$
where $A_{TC}$ is a constant coefficient, $T_{average}$ is the chip average run-time temperature,
$T_{ambient}$ is the ambient temperature, and $q$ is the Coffin-Manson exponent constant.
Time-dependent dielectric breakdown is the deterioration of the gate dielectric
layer. This effect depends strongly on temperature, and is becoming increasingly
prominent with the reduction of gate-oxide dielectric thickness and non-ideal supply
voltage reduction. The MTTF due to time-dependent dielectric breakdown is given
by the following equation [55, 103]:
$$\mathrm{MTTF}_{TDDB} = A_{TDDB} \left(\frac{1}{V}\right)^{(a - bT)} e^{\frac{A + B/T + CT}{\kappa T}} \tag{3.3}$$
where $A_{TDDB}$ is a constant, $V$ is the supply voltage, and $a$, $b$, $A$, $B$, and $C$ are fitting
parameters.
Stress migration is the mass transportation of metal atoms in metal wires due to
mechanical stress caused by thermal mismatch among metal and dielectric materials.
The MTTF resulting from stress migration is given by the following equation [55]:
$$\mathrm{MTTF}_{SM} = A_{SM}\, |T_0 - T|^{-n}\, e^{\frac{E_{a_{SM}}}{\kappa T}} \tag{3.4}$$
where $A_{SM}$ is a constant, $T_0$ is the metal deposition temperature during fabrication,
$T$ is the run-time temperature of the metal layer, $n$ is an empirically-determined
constant, and $E_{a_{SM}}$ is the activation energy for stress migration.
Equations 3.1–3.4 indicate that the lifetime reliability of ICs is strongly influenced
by temperature. Therefore, thermal analysis and optimization techniques play important
roles in reliability optimization. Generally, the MTTF values resulting from the different
mechanisms range from 20 to 30 years.
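To make the temperature dependence of Equations 3.1–3.4 concrete, the following sketch evaluates each expression and compares MTTF at two temperatures. It is an illustration under stated assumptions rather than the thesis's implementation: every default parameter value (activation energies, exponents, fitting parameters, deposition and ambient temperatures, supply voltage) is a placeholder chosen here for demonstration, and the proportionality constants are set to 1 so that only ratios are meaningful.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def mttf_em(temp_k, current_density=1.0, n=2.0, ea_ev=0.9, a_em=1.0):
    """Equation 3.1 (electromigration); all defaults are illustrative placeholders."""
    return a_em / current_density ** n * math.exp(ea_ev / (BOLTZMANN_EV_PER_K * temp_k))


def mttf_tc(t_average_k, t_ambient_k=300.0, q=2.35, a_tc=1.0):
    """Equation 3.2 (thermal cycling); requires t_average_k > t_ambient_k."""
    return a_tc / (t_average_k - t_ambient_k) ** q


def mttf_tddb(temp_k, voltage=1.8, a=78.0, b=-0.081,
              A=0.759, B=-66.8, C=-8.37e-4, a_tddb=1.0):
    """Equation 3.3 (dielectric breakdown); fitting parameters are placeholders."""
    exponent = (A + B / temp_k + C * temp_k) / (BOLTZMANN_EV_PER_K * temp_k)
    return a_tddb * (1.0 / voltage) ** (a - b * temp_k) * math.exp(exponent)


def mttf_sm(temp_k, t0_k=500.0, n=2.5, ea_ev=0.9, a_sm=1.0):
    """Equation 3.4 (stress migration); t0_k is an assumed deposition temperature."""
    return a_sm * abs(t0_k - temp_k) ** (-n) * math.exp(ea_ev / (BOLTZMANN_EV_PER_K * temp_k))


def relative_mttf(model, temp_k, ref_temp_k):
    """MTTF at temp_k divided by MTTF at ref_temp_k; the constant prefactor cancels."""
    return model(temp_k) / model(ref_temp_k)


if __name__ == "__main__":
    # Compare 70 °C against a 50 °C reference for each mechanism.
    for model in (mttf_em, mttf_tc, mttf_tddb, mttf_sm):
        print(model.__name__, relative_mttf(model, 343.15, 323.15))
```

With these placeholder values, every mechanism yields a ratio below one, i.e., a shorter expected lifetime at the higher temperature.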
3.2.4 MPSoC Reliability Modeling
The system MTTF of an MPSoC is a function of the lifetime reliabilities of all its
PEs. In this work, we propose a system-level lifetime reliability model for MPSoCs.
Our first step is to derive an efficient modeling method that can accurately predict
the lifetime reliability of each MPSoC PE.
3.2.4.1 Reliability Modeling of On-Chip PEs
The lifetime reliability of an on-chip PE is influenced by numerous design-time and
run-time factors, such as architecture-level and circuit-level redundancy, accumulation
of wear, and run-time temperature. Accurate lifetime characterization of each PE is
challenging.
We propose a PE reliability model that is capable of incorporating the effects of
multiple fault mechanisms, component-level resource redundancy, and temperature.
The dependence of lifetime failure processes on other parameters, such as current
density, is not directly considered. Constant values of these parameters resulting in
PE MTTFs of 30 years at 50 °C and 1.8 V are used [103]. For the sake of explanation,
our description of PE reliability modeling starts from the simplest case, i.e., a
single failure mechanism, single point of failure (no resource redundancy), and con-
stant temperature. These assumptions are later relaxed, and the reliability model
generalized.
3.2.4.2 Lognormal Distribution Reliability Model for Single PE, Single Point of Failure
Statistical modeling is commonly used in IC reliability characterization. Re-
searchers have proposed using various statistical models, e.g., exponential, Weibull,
and lognormal, to characterize IC lifetime failures. Compared to other commonly-
considered statistical models, the lognormal distribution more accurately models the
time-dependent degradation processes of ICs, e.g., diffusion, corrosion, migration, and
crack propagation [103] caused by the failure mechanisms described in Section 3.2.3.
However, using the lognormal distribution complicates the derivation of analytical
solutions. Numerical methods, such as Monte-Carlo simulation or statistical fitting
techniques, are required. These methods are computationally intensive.
Starting from the simplest assumption, for a failure mechanism i, the run-time
fault probability density function (PDF), $f_i(t)$, and the corresponding fault cumulative
distribution function (CDF), $F_i(t)$, have two parameters: $\sigma^{i}_{PE}$ (a shape parameter)
and $\mu^{i}_{PE}$ (a scale parameter). The MTTF of an on-chip PE due to a particular
failure mechanism $i$, $\mathrm{MTTF}^{i}_{PE}$, is then estimated:
$$\mathrm{MTTF}^{i}_{PE} = \int_{0}^{\infty} t\, f_i(t)\, dt = \int_{0}^{1} t\, dF_i(t) = e^{\mu^{i}_{PE} + (\sigma^{i}_{PE})^{2}/2} \tag{3.5}$$
The overall lifetime reliability of each on-chip PE, $\mathrm{MTTF}_{PE}$, is modeled by a joint
lognormal distribution that depends on the major failure mechanisms described in
Section 3.2.3. We assume that the relationships among different failure mechanisms
are serial, i.e., each individual failure mechanism can result in the failure of a
non-redundant PE. Therefore, for each non-redundant PE, the CDF of its overall lifetime
failure probability follows:
$$F_{PE}(t) = 1 - \prod_{i} \left(1 - F_i(t)\right) \tag{3.6}$$
where $i$ is the index of different failure mechanisms.
Researchers have often used exponential distributions for statistical modeling due
to their convenience. Given Fi(t) with exponential distributions, Equation 3.6 would
yield an easily-computed analytical solution. However, as a consequence of using
the more accurate lognormal distribution for each Fi(t), Equation 3.6 does not allow
straight-forward estimation of PE MTTF, MTTFPE . In this work, we use statistical
fitting to approximate MTTFPE using a single lognormal distribution, governed by
µPE and σPE . The parameters for this approximation follow:
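$$\mu_{PE} = \frac{1}{2} \log\left( \frac{\left(\int_{0}^{\infty} t\, dF_{PE}(t)\right)^{4}}{\int_{0}^{\infty} t^{2}\, dF_{PE}(t)} \right) \tag{3.7}$$
$$\sigma_{PE} = \sqrt{ \log\left( \frac{\int_{0}^{\infty} t^{2}\, dF_{PE}(t)}{\left(\int_{0}^{\infty} t\, dF_{PE}(t)\right)^{2}} \right) } \tag{3.8}$$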
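The fitting step in Equations 3.6–3.8 amounts to matching the first two moments of the serial failure distribution to a single lognormal. The sketch below shows one way to do this numerically; it is an assumption-laden illustration, not the thesis's fitting code. The per-mechanism parameters, the integration horizon `t_max`, and the step count are hypothetical, and the moments are computed with the standard survival-function identities E[t] = ∫(1 − F(t))dt and E[t²] = ∫2t(1 − F(t))dt for non-negative lifetimes.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def lognormal_cdf(t, mu, sigma):
    """CDF of a lognormal lifetime with parameters mu and sigma."""
    if t <= 0.0:
        return 0.0
    return _STD_NORMAL.cdf((math.log(t) - mu) / sigma)


def pe_survival(t, mechanisms):
    """1 - F_PE(t) from Equation 3.6: product of per-mechanism survival terms."""
    survival = 1.0
    for mu, sigma in mechanisms:
        survival *= 1.0 - lognormal_cdf(t, mu, sigma)
    return survival


def fit_pe_lognormal(mechanisms, t_max=500.0, steps=50_000):
    """Return (mu_PE, sigma_PE, MTTF_PE) via Equations 3.7 and 3.8.

    t_max must be large enough that the survival probability beyond it is
    negligible; the trapezoidal rule is used for both moments.
    """
    h = t_max / steps
    m1 = m2 = 0.0
    for k in range(steps + 1):
        t = k * h
        weight = 0.5 if k in (0, steps) else 1.0
        s = pe_survival(t, mechanisms)
        m1 += weight * s * h            # first moment, E[t]
        m2 += weight * 2.0 * t * s * h  # second moment, E[t^2]
    mu_pe = 0.5 * math.log(m1 ** 4 / m2)
    sigma_pe = math.sqrt(math.log(m2 / m1 ** 2))
    return mu_pe, sigma_pe, m1


# Two hypothetical failure mechanisms, time measured in years.
mu_pe, sigma_pe, mttf_pe = fit_pe_lognormal([(3.0, 0.5), (3.2, 0.4)])
```

As a sanity check on the formulas, fitting a single mechanism recovers its own parameters, and the fitted distribution's mean, exp(µ_PE + σ_PE²/2), equals the numerically integrated first moment by construction.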
3.2.4.3 Reliability Models for Inactive Spare and Active Spare Redundant PEs
PEs may have component redundancy to improve reliability or performance. Such
PEs can be designed to continue functioning even after some of their components,
e.g., an ALU or a cache bank, fail. Inactive spares are redundant resources that
are not activated until a fault occurs in an active resource. The impact of faults in
inactive spares upon the lifetime reliabilities of PEs can be characterized as follows.
Assume a PE contains $M$ types of resources. Each type of resource $S_i$, $i \in \{1, \cdots, M\}$,
is comprised of $N_i$ identical elements. Assume the cumulative failure
probability of resource element $E_{i,j}$, $i \in \{1, \cdots, M\}$, $j \in \{1, \cdots, N_i\}$, is $F_{i,j}(t)$. Then,
the cumulative failure probability of resource $S_i$ is $F_{S_i}(t) = \prod_{j} F_{i,j}(t)$. The MIN–MAX
approximation [103] may be used to bound the MTTF of a PE with $M$ types of
resources as follows:
$$\mathrm{MTTF}_{PE} = \min_{i=1}^{M} \left( \int_{0}^{1} t\, dF_{S_i}(t) \right) \tag{3.9}$$
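Equation 3.9 can be evaluated numerically as sketched below. This is a hedged illustration only: the resource names, the lognormal element parameters, and the integration settings are assumptions made here, and each integral is computed as ∫(1 − F_Si(t))dt over time, which equals ∫t dF_Si(t) for non-negative lifetimes.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def lognormal_cdf(t, mu, sigma):
    """CDF of a lognormal lifetime with parameters mu and sigma."""
    if t <= 0.0:
        return 0.0
    return _STD_NORMAL.cdf((math.log(t) - mu) / sigma)


def resource_mttf(elements, t_max=500.0, steps=50_000):
    """MTTF of one resource type S_i.

    The resource fails only once all of its elements have failed, so
    F_Si(t) is the product of the element CDFs (the MAX part).
    """
    h = t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        f_s = 1.0
        for mu, sigma in elements:
            f_s *= lognormal_cdf(t, mu, sigma)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * (1.0 - f_s) * h
    return total


def pe_mttf_min_max(resources):
    """Equation 3.9: the PE MTTF is bounded by its weakest resource type (the MIN part)."""
    return min(resource_mttf(elements) for elements in resources.values())


# Hypothetical PE: one ALU and one cache bank, then the same PE with a spare ALU.
baseline = {"alu": [(3.0, 0.5)], "cache": [(3.2, 0.5)]}
with_spare = {"alu": [(3.0, 0.5), (3.0, 0.5)], "cache": [(3.2, 0.5)]}
mttf_baseline = pe_mttf_min_max(baseline)
mttf_with_spare = pe_mttf_min_max(with_spare)
```

In this hypothetical instance the ALU limits the baseline PE; adding a spare ALU raises that resource's MTTF enough that the cache bank becomes the limiting resource.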
Active spares are redundant resources that are actively used even before any faults
have occurred. Faults in active spares reduce the performance of the affected PE.
Determining the reliability impact of faults that result in changes to observable PE
behavior involves system-level design decisions, and will be described in detail in
Section 3.2.5.
3.2.4.4 Temperature-Dependent Reliability Model for Potentially Redun-
dant PEs
The lifetime reliability of a PE strongly depends on its temperature. After each
MPSoC solution is derived, performance and power analysis are conducted. The
Figure 3.3: Temperature Impact on MTTF [38]. (Plot of fault probability density versus time in years for temperatures T1 and T2, with times t1 and t2 marked.)
estimated power profile, MPSoC floorplan, and cooling configuration are provided
to a thermal analysis algorithm [123] to determine the thermal profile. Note that
Equation 3.9 is derived under an assumption of constant PE temperature. Next, we
discuss temperature-dependent PE MTTF estimation.
The temperature profile of an MPSoC varies as the tasks assigned to it change.
Task assignments change whenever migration is used to compensate for a partial or
complete PE failure. The impact of temperature variation on MTTF calculation is
illustrated in Figure 3.3. In this example, T1 and T2 are temperatures. The PE is
initially hot (T1) and, at time t1, becomes cooler (T2). Functions f1(t) and f2(t) are the
fault PDFs given temperatures T1 and T2, respectively. The overall fault distribution
of the PE should satisfy the following equation, i.e., the overall cumulative fault
distribution equals one.

$$\int_{0}^{t_1} f_1(t)\,dt + \int_{t_2}^{\infty} f_2(t)\,dt = 1 \tag{3.10}$$

When we switch from the fault PDF associated with one temperature, e.g., T1, to that
associated with another temperature, e.g., T2, it is necessary to adjust our start time
to the value, in the new time scale, associated with the appropriate amount of wear
that had been experienced in the previous time scale, i.e., we must start integrating
from the effective age of the PE. For this example the concept can be summarized as
follows: F1(t1) = F2(t2).
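The condition F1(t1) = F2(t2) can be solved numerically for the effective age t2. The following is a minimal sketch, assuming lognormal fault distributions with hypothetical parameters and a simple bisection search; neither the distribution parameters nor the function names come from the text.

```python
import math

def lognormal_cdf(t, mu, sigma):
    """Lognormal failure CDF; mu and sigma are hypothetical parameters."""
    if t <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))

def equivalent_age(cdf_old, cdf_new, t_old, upper=1e6, tol=1e-9):
    """Find t_new such that cdf_new(t_new) == cdf_old(t_old) by bisection.

    Assumes cdf_new is non-decreasing and reaches the target below `upper`.
    """
    target = cdf_old(t_old)
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf_new(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hot distribution (T1) wears the PE out faster than the cool one (T2); values assumed.
hot = lambda t: lognormal_cdf(t, 2.0, 0.5)
cool = lambda t: lognormal_cdf(t, 2.5, 0.5)

t1 = 4.0
t2 = equivalent_age(hot, cool, t1)
print(t1, round(t2, 3))
```

With equal sigma values the answer has the closed form t2 = t1 · exp(mu2 − mu1), which gives a convenient check on the search.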
Given that $T_0, T_1, \cdots, T_{N-1}$ denote the PE thermal profile, the overall fault
distribution should satisfy the following equation:

$$\int_{t_{s_0}=0}^{t_{e_0}} f_0(t)\,dt + \int_{t_{s_1}}^{t_{e_1}} f_1(t)\,dt + \cdots + \int_{t_{s_{N-1}}}^{\infty} f_{N-1}(t)\,dt = 1 \tag{3.11}$$

where $f_i(t)$ denotes the fault PDF of the PE at temperature $T_i$, $t_{e_i}$ denotes the
transition time at which the temperature changes from $T_i$ to $T_{i+1}$, and $t_{s_i}$ denotes
the equivalent age of the PE, starting from $t_{e_{i-1}}$, when the temperature switches to
Ti. The value of tsi can be determined using Equation 3.11, allowing the MTTF of a
PE to be determined using the following equation:

$$\mathrm{MTTF} = \sum_{i=0}^{N-1} \int_{t_{s_i}}^{t_{e_i}} t\, f_i(t)\,dt \tag{3.12}$$

This has the effect of breaking time into regions ($\sum_{i=0}^{N-1}$) during which the
temperature of the PE is uniform and, during each region, weighting each time instant by
the probability of failure at that instant ($t \cdot f_i(t)$). Values for $t_{s_i}$ and
$t_{e_i}$ are computed based on Equation 3.11.
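Equations 3.11 and 3.12 can be traced with a small worked computation. The sketch below swaps in exponential fault distributions, an assumption made purely so that the integrals have closed forms (the text discusses lognormal distributions), and it follows Equation 3.12 as written, with t measured on each distribution's own equivalent-age time scale. The function name, rates, and durations are hypothetical.

```python
import math

def piecewise_mttf(rates, durations):
    """Evaluate Equations 3.11 and 3.12 for exponential fault PDFs.

    rates[i]     -- assumed failure rate at temperature T_i
    durations[i] -- time spent at T_i; the last region is open-ended,
                    so len(durations) == len(rates) - 1
    Returns (mttf, total_probability_mass).
    """
    mttf = 0.0
    mass = 0.0
    t_start = 0.0  # t_s0 = 0
    for i, lam in enumerate(rates):
        last = i == len(rates) - 1
        surv_start = math.exp(-lam * t_start)
        if last:
            surv_end, upper = 0.0, 0.0  # upper limit is infinity
        else:
            t_end = t_start + durations[i]
            surv_end = math.exp(-lam * t_end)
            upper = (t_end + 1.0 / lam) * surv_end
        # Probability mass of this region: F_i(t_e) - F_i(t_s).
        mass += surv_start - surv_end
        # Closed form of the integral of t * lam * exp(-lam * t) over [t_s, t_e].
        mttf += (t_start + 1.0 / lam) * surv_start - upper
        if not last:
            # Equivalent age: F_{i+1}(t_s) = F_i(t_e)  =>  lam_next * t_s = lam * t_e.
            t_start = lam * t_end / rates[i + 1]
    return mttf, mass

# Hot for two years, then cool indefinitely (assumed rates per year).
mttf, mass = piecewise_mttf([0.2, 0.1], [2.0])
print(round(mttf, 3), round(mass, 6))
```

The returned probability mass should equal one, which is the condition stated in Equation 3.11; with a single temperature the result reduces to the familiar 1/rate.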
Reliability analysis may be conducted numerous times during reliability optimiza-
tion. Therefore, modeling efficiency is critical. An MPSoC consists of numerous
PEs. If the cumulative fault probability distributions, Fi(t), are lognormal, then
solving Equation 3.9 requires computationally-intensive numerical analysis. To im-
prove computational efficiency, we produce a PE reliability library before reliability
optimization by pre-characterizing the reliability distributions of PEs as functions
of temperature and supply voltage. During MPSoC reliability optimization, when
solving Equation 3.12, the value of Fi(t) is efficiently obtained using table look-ups.
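A pre-characterized reliability library of this kind can be pictured as a table of sampled F(t) values per operating point. The following is a minimal sketch under stated assumptions: the class name, the time grid, the keying by exact (temperature, voltage) pairs, and the linear interpolation in time are all illustrative choices, not details given in the text.

```python
import bisect

class ReliabilityLibrary:
    """Hypothetical lookup table of F(t) samples per (temperature, voltage) point."""

    def __init__(self, time_grid):
        self.time_grid = list(time_grid)  # must be sorted in increasing order
        self.tables = {}

    def characterize(self, temperature, voltage, cdf):
        """Sample the (expensive) CDF once, before optimization starts."""
        self.tables[(temperature, voltage)] = [cdf(t) for t in self.time_grid]

    def lookup(self, temperature, voltage, t):
        """Return F(t) by linear interpolation between stored samples."""
        values = self.tables[(temperature, voltage)]
        grid = self.time_grid
        if t <= grid[0]:
            return values[0]
        if t >= grid[-1]:
            return values[-1]
        k = bisect.bisect_right(grid, t)
        t0, t1 = grid[k - 1], grid[k]
        w = (t - t0) / (t1 - t0)
        return values[k - 1] + w * (values[k] - values[k - 1])

# Toy characterization: a linear CDF over ten years stands in for a real model.
lib = ReliabilityLibrary(range(0, 11))
lib.characterize(350.0, 1.8, lambda t: t / 10.0)
print(lib.lookup(350.0, 1.8, 2.5))
```

The design point is that the costly distribution evaluation happens once per operating point, while each query during optimization costs only a binary search and one interpolation.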
3.2.5 Reliability Optimization of MPSoCs
Figure 3.2 illustrates the proposed reliability analysis and optimization flow. In
TASR, reliability optimization starts by evaluating the system MTTF of area-optimized
solutions (using Algorithm 1). Such solutions tend to have high power density,
high temperature, low resource redundancy and, therefore, low system MTTF. An
iterative reliability enhancement algorithm is invoked if these solutions do not provide
the required system MTTF. During each iteration, Algorithm 2 optimizes MTTF by
improving processor core and component redundancy and/or optimizing chip thermal
profile by introducing new processors. System-level (task assignment and scheduling)
and physical-level (floorplanning and network synthesis) algorithms are then invoked
to produce valid MPSoC solutions. Through performance, power, thermal, and relia-
bility analyses, the system MTTFs of new solutions are estimated and evaluated. The
iterative optimization flow continues until the targeted system MTTF is achieved.
Algorithm 1 estimates system MTTF based on statistical models of MPSoC run-
time failure processes. Starting from time t = 0, it determines the minimal MTTF
among all the processor cores (line 4). Each fault may result in partial or complete
processor core failure. In either case, task migration is used to optimize system
performance. The task migration routine moves tasks from the faulty or partially-
faulty processor to other processors (line 6). After task migration, if the MPSoC
still meets its performance requirements, the algorithm considers the next processor
core with minimal MTTF. Task migration results in run-time changes in chip power
consumption and temperature profiles, thereby changing the lifetime reliability of
each processor core. To accurately predict subsequent processor MTTFs, power and
thermal analysis are conducted (line 8). This process continues until the MPSoC fails
to meet its performance or functionality requirements. The system MTTF of the
MPSoC solution is then reported (line 10).

Algorithm 1 System MTTF Analysis of an MPSoC Solution
 1: Given an MPSoC solution, set MTTF_MPSoC ← 0
 2: while system schedule is valid do
 3:   MPSoC_Func are the functioning processors in the MPSoC
 4:   Fault interval e_i ← min over p ∈ MPSoC_Func of (MTTF_p)
 5:   MTTF_MPSoC ← MTTF_MPSoC + e_i
 6:   Task migration, scheduling
 7:   if system scheduling is valid then
 8:     Power analysis, thermal analysis, compute processor temperatures
 9:   else
10:     Return MTTF_MPSoC
11:   end if
12: end while
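The control flow of Algorithm 1 can be sketched in a few lines. This is a simplified illustration rather than the TASR implementation: every fault is treated as a complete processor failure, and the migration, scheduling, power, and thermal steps are collapsed into two caller-supplied callbacks (`schedule_is_valid` and `reanalyze`), both of which are hypothetical names.

```python
def system_mttf(time_to_failure, schedule_is_valid, reanalyze):
    """Accumulate fault intervals until the schedule can no longer be met.

    time_to_failure   -- dict: processor name -> expected time to its next fault
    schedule_is_valid -- callback(set of functioning processors) -> bool
    reanalyze         -- callback(remaining dict, elapsed interval) -> updated dict,
                         standing in for power and thermal re-analysis
    """
    functioning = dict(time_to_failure)
    total = 0.0
    while functioning:
        # Line 4: the next fault hits the processor with the minimal MTTF.
        failed = min(functioning, key=functioning.get)
        interval = functioning.pop(failed)
        # Line 5: accumulate the fault interval.
        total += interval
        # Lines 6-7: migrate tasks and check whether the schedule still holds.
        if not schedule_is_valid(set(functioning)):
            return total  # Line 10
        # Line 8: update the remaining processors after re-analysis.
        functioning = reanalyze(functioning, interval)
    return total

def age_uniformly(functioning, interval):
    """Assumed re-analysis: temperatures unchanged, so remaining life just shrinks."""
    return {p: ttf - interval for p, ttf in functioning.items()}

def at_least_two(functioning):
    """Assumed performance requirement: two working processors are needed."""
    return len(functioning) >= 2

print(system_mttf({"a": 3.0, "b": 5.0, "c": 9.0}, at_least_two, age_uniformly))
```

A realistic `reanalyze` would recompute each survivor's remaining life from the new temperature profile instead of subtracting the elapsed interval.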
At run-time, on-line fault detection algorithms should determine when an execu-
tion unit has failed. A proper treatment of on-line fault detection is beyond the scope
of this dissertation but can be found in the literature [77]. Upon fault detection, the
pre-planned task assignment changes associated with the particular fault are made.
If it is acceptable to reboot the system in the presence of a fault (a few times in the
system lifespan), no further provisions are necessary. If uninterrupted operation is
necessary, distributed system checkpointing may be used.
TASR is equipped with an efficient workload migration algorithm to maintain sys-
tem functionality and meet performance requirements in the presence of partial and
complete processor failures. When an MPSoC fails to meet its performance require-
ments due to run-time faults, tasks migrate to other processors using the following
policy. Tasks on faulty processors are first sorted in order of increasing time slack, the
difference between the task’s latest finish time and earliest finish time. They are then
migrated from the processor to other processors in this order until the system perfor-
mance requirements are met and no tasks are assigned to a totally failed processor.
When moving a task from one processor to another, the new processor is selected by
Pareto-ranking processors in order of increasing utilization ratio (the proportion of
time during which the processor is actively executing tasks) and increasing execution
time for the task and processor under consideration. Depending on whether a proces-
sor is inoperational or partially-failed, all or some of the tasks assigned to it migrate
to other processors.
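The two orderings used by the migration policy can be sketched as follows. This is an illustration under stated assumptions: the task and processor records, the numeric values, and the tie-break among Pareto-optimal candidates (lowest utilization first) are hypothetical, since the text does not specify how a single target is chosen from the Pareto front.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    utilization: float  # proportion of time spent executing tasks
    exec_time: float    # execution time of the migrating task on this processor

def order_by_slack(tasks):
    """Sort (name, earliest_finish, latest_finish) tuples by increasing time slack."""
    return sorted(tasks, key=lambda task: task[2] - task[1])

def pareto_front(candidates):
    """Keep candidates not dominated in (utilization, exec_time), both minimized."""
    front = []
    for c in candidates:
        dominated = any(
            o.utilization <= c.utilization
            and o.exec_time <= c.exec_time
            and (o.utilization < c.utilization or o.exec_time < c.exec_time)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

def pick_target(candidates):
    """Assumed tie-break: the least-utilized processor on the Pareto front."""
    return min(pareto_front(candidates), key=lambda c: c.utilization)

tasks = [("t1", 4.0, 10.0), ("t2", 6.0, 7.0), ("t3", 2.0, 5.0)]
candidates = [
    Candidate("A", 0.2, 9.0),
    Candidate("B", 0.5, 4.0),
    Candidate("C", 0.6, 8.0),
]
print([t[0] for t in order_by_slack(tasks)])
print([c.name for c in pareto_front(candidates)], pick_target(candidates).name)
```

Here candidate C is dominated by B, which is both less utilized and faster for the task, so only A and B remain eligible.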
TASR optimizes the lifetime reliability of MPSoCs by focusing on architectural
changes that improve redundancy and thermal profile, while maintaining low area
overhead. Algorithm 2 shows the actions taken by TASR to improve the MTTF of an
MPSoC architecture. First, the MTTF of each individual processor is estimated (line
2). The processor with the minimal MTTF is identified as the MPSoC’s most vulnerable
point, Pvul (line 3). One of the proposed reliability optimization moves is then
applied: processor reinforcement, processor swapping, or processor addition (line
4). Processor reinforcement introduces component redundancy (see Section 3.1.2)
into the most vulnerable processor. Processor swapping replaces the most vulnerable
processor with a different, more reliable, processor. Processor addition introduces
a new processor into the MPSoC, enabling tasks to migrate from the vulnerable
processor to other processors. These moves consider multiple candidate processors.
TASR uses the relative reliability gain, defined in Equation 3.13, to select the best
candidate move. This equation takes power density reduction, resource redundancy
improvement, and area overhead associated with the move into consideration.

$$G_{TASR} = e^{-P_d} \times \mathrm{MTTF}_{ref} / A \tag{3.13}$$

Note that this value is used only to guide changes. The detailed effect of each
tentative change is computed using thermal profile and reliability analysis. MPSoC
power profile influences MPSoC temperature profile, which strongly influences reli-
ability. The MTTFs associated with some major fault mechanisms are exponential
functions of temperature. Therefore, in Equation 3.13, TASR uses an exponential
term, $e^{-P_d}$, to characterize the impact of power density reduction on reliability
improvement. $P_d$ is the power density reduction resulting from applying a candidate
move. In Equation 3.13, the impact of redundancy is characterized by the second
term, $\mathrm{MTTF}_{ref}$, the system MTTF improvement resulting from the candidate move.
MTTFref is calculated under the assumption that other design characteristics, e.g.,
temperature profile and supply voltage, remain the same. The relative reliability
gain introduced by each candidate move is the product of these two terms divided
by the area overhead. The move with the highest gain is applied (line 5). After each
optimization move, system-level and physical-level synthesis algorithms are invoked
to update the MPSoC solution. Cost analysis is then conducted to determine the
improvement in system reliability, determine the impact on MPSoC area, and vali-
date the system schedule. This optimization process continues until the target system
MTTF is achieved.
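The move-selection step based on Equation 3.13 can be sketched as below. The formula is implemented exactly as written, with P_d, MTTF_ref, and A taken as given inputs; the candidate values, the tuple layout, and the function names are hypothetical, and the sketch assumes a strictly positive area overhead A.

```python
import math

def relative_gain(p_d, mttf_ref, area_overhead):
    """Equation 3.13: G = exp(-P_d) * MTTF_ref / A (assumes area_overhead > 0)."""
    return math.exp(-p_d) * mttf_ref / area_overhead

def best_move(moves):
    """Pick the candidate (name, p_d, mttf_ref, area_overhead) with the highest gain."""
    return max(moves, key=lambda m: relative_gain(m[1], m[2], m[3]))

# Illustrative candidates for the three move types; the numbers are made up.
moves = [
    ("reinforce", 0.5, 3.0, 2.0),
    ("swap", 0.2, 2.0, 1.0),
    ("add", 0.1, 4.0, 5.0),
]
for name, p_d, mttf_ref, area in moves:
    print(name, round(relative_gain(p_d, mttf_ref, area), 3))
print(best_move(moves)[0])
```

Because the gain only guides the choice, the selected move would still be validated by full thermal and reliability analysis, as the text notes.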
Two other optimization moves were implemented for the sake of comparison.
The first considers only power density, $e^{-P_d}$, and the second considers only
resource redundancy, $\mathrm{MTTF}_{ref}$. Performance comparisons among these three
heuristics are provided in Section 3.3.

Algorithm 2 Reliability-Aware Optimization Algorithm
 1: while MTTF_MPSoC < MTTF_target do
 2:   ∀ pe ∈ MPSoC compute MTTF_pe
 3:   Find vulnerable point: P_vul is the processor with minimal MTTF
 4:   Optimization moves (processor reinforcement, processor swapping, processor addition)
 5:   Apply the best move based on Equation 3.13
 6:   System-level synthesis: Task assignment and Scheduling
 7:   Physical-level synthesis: Floorplanning and network synthesis
 8:   Performance, power, thermal, reliability analysis
 9:   if system MTTF does not improve or system schedule invalid then
10:     Revert this change
11:   end if
12: end while
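The apply-evaluate-revert structure of Algorithm 2 can be sketched generically. This is a skeleton under stated assumptions, not the TASR code: synthesis and analysis are hidden behind hypothetical callbacks, an invalid schedule is modeled by `evaluate` returning `None`, and the `rejected` set and `max_iterations` guard are additions that keep the sketch from retrying a reverted move forever.

```python
def optimize_reliability(solution, target, evaluate, propose_moves, gain,
                         apply_move, max_iterations=100):
    """Iteratively apply the highest-gain move, reverting moves that do not help."""
    best_mttf = evaluate(solution)  # the starting solution is assumed to be valid
    rejected = set()
    for _ in range(max_iterations):
        if best_mttf >= target:
            break  # Line 1: target system MTTF reached
        moves = [m for m in propose_moves(solution) if m not in rejected]
        if not moves:
            break  # no untried candidate moves remain
        move = max(moves, key=gain)          # Line 5: best move by gain
        trial = apply_move(solution, move)   # Lines 6-7: synthesis
        trial_mttf = evaluate(trial)         # Line 8: analysis
        if trial_mttf is None or trial_mttf <= best_mttf:
            rejected.add(move)               # Lines 9-10: revert this change
        else:
            solution, best_mttf = trial, trial_mttf
            rejected.clear()
    return solution, best_mttf

# Toy problem: a solution is a tuple of increments and its "MTTF" is their sum.
result = optimize_reliability(
    solution=(),
    target=5.0,
    evaluate=lambda s: float(sum(s)),
    propose_moves=lambda s: [1, 2, -1],
    gain={1: 1.0, 2: 2.0, -1: 3.0}.get,  # the gain estimate misjudges the move -1
    apply_move=lambda s, m: s + (m,),
)
print(result)
```

The toy gain function deliberately overrates a harmful move, which exercises the revert path: the move is tried, found not to improve the result, and discarded.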
3.2.6 Floorplanning, Thermal Analysis, and Network Synthesis
We use a fast, constructive, area- and communication-aware block placement
floorplanning algorithm, based on network partitioning and optimal processor
orientation and rotation selection, to determine the MPSoC power profile as well as
communication latency and communication power consumption [26]. A fine-grained MPSoC thermal
model is used within a thermal analysis algorithm designed for accuracy and high
enough speed for use within the inner loop of synthesis [123]. Finally, we carry out
on-chip network synthesis, using network topology to explicitly model communication
contention.
3.3 Experimental Results
This section describes the benchmarks used to evaluate TASR and presents the
results of evaluation.
3.3.1 Benchmarks
The proposed reliable MPSoC synthesis algorithm was evaluated using a num-
ber of benchmarks taken from the E3S embedded systems benchmark suite, which
is based on EEMBC benchmark data [31]. This suite contains 17 PEs, e.g., the
AMD ElanSC520, Analog Devices 21065L, the Motorola MPC555, and the Texas
Instruments TMS320C6203. These processors are characterized based on the mea-
sured execution times of 47 tasks commonly encountered in embedded applications,
power numbers derived from datasheets, and additional information, e.g., processor
areas, some of which were necessarily estimated, and prices gathered by emailing and
calling vendors. Results for any processor whose datasheet reflected a coarser
technology were linearly scaled to a 0.18 µm technology. The task sets follow the
organization of the EEMBC benchmarks. There is one task set for each of the five
application suites: Automotive/Industrial, Consumer, Networking, Office Automa-
tion, and Telecommunications. The Office Automation problem contains only five
tasks. Our modified version of Office Automation contains four copies of the origi-
nal task set. In addition, TGFF [27] was used to generate five random benchmarks,
each of which has 30–50 tasks. The graphs have different structures, ranging from
random connectivity to a series-parallel structure commonly encountered in DSP ap-
plications. For the random benchmarks, tasks were randomly assigned task types
from the EEMBC benchmarks.
The EEMBC processors do not have component redundancy, i.e., each processor
will fail if any of its functional units fails. We introduce a redundant version for
each processor by duplicating floating/fixed point units and floating/integer register
files. We assume that instruction scheduling units and instruction decode units do
not have redundancy [103]; a run-time fault in these units will result in processor
failure. On-chip caches have redundancy; a single fault reduces performance but the
processor remains operational. We relied on previous work to estimate the cost of
component redundancy [103]. Processors with component redundancy suffer a 24%
area penalty and, while their additional functional units are still operational, have
25% higher performance and power consumption.
The embedded microprocessors in EEMBC have fairly homogeneous energy–delay
products. It is our goal to develop a synthesis algorithm that is effective at improving
the reliability of application-specific MPSoCs, which commonly contain heterogeneous
processors. Therefore, for each processor, we introduced one corresponding processor
operating at a higher voltage and another operating at a lower voltage. A maximum
of three voltages need to be provided by off-chip regulators. The alpha power law was
used to calculate the impact of voltage scaling on performance. A 0.18 µm process,
supply voltage of 1.8 V, and alpha of 1.3 were used [93]. To model high-performance
processors, the supply voltage was scaled to 2.5 V, performance increased by 25%,
and power consumption increased to 2.4×. To model low-power processors, the sup-
ply voltage was scaled to 1.28 V, performance was decreased by 25%, and power
consumption was decreased to 0.38×.
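The quoted power figures can be cross-checked with a short calculation. The sketch below assumes that dynamic power dominates and scales as V² times frequency (the standard CMOS relation), and it evaluates the alpha-power-law speed ratio with a threshold voltage of 0.5 V, which is a placeholder because the text does not state the threshold voltage used. Function names are illustrative.

```python
def power_scale(v_new, v_ref, performance_scale):
    """Dynamic power ratio, assuming P is proportional to V^2 * f."""
    return (v_new / v_ref) ** 2 * performance_scale

def alpha_power_speed(v_new, v_ref, v_threshold, alpha=1.3):
    """Alpha-power-law speed ratio: f is proportional to (V - Vt)^alpha / V.

    v_threshold is an assumed value; it is not given in the text.
    """
    def drive(v):
        return (v - v_threshold) ** alpha / v
    return drive(v_new) / drive(v_ref)

V_REF = 1.8  # nominal supply voltage from the text

# High-performance variant: 2.5 V, performance +25%.
print(round(power_scale(2.5, V_REF, 1.25), 2))
# Low-power variant: 1.28 V, performance -25%.
print(round(power_scale(1.28, V_REF, 0.75), 2))
# Speed ratios under the assumed 0.5 V threshold.
print(round(alpha_power_speed(2.5, V_REF, 0.5), 2),
      round(alpha_power_speed(1.28, V_REF, 0.5), 2))
```

Under these assumptions the V² times frequency estimate lands close to the 2.4× and 0.38× power factors quoted above, while the speed ratios depend on the assumed threshold voltage and should be read only as indicative.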
3.3.2 TASR vs. Stochastic Area Optimization
Figure 3.4: Comparison of MPSoC Area–Reliability Tradeoffs [38]. (Plot of area in
mm², on a logarithmic scale, against MTTF in years for the auto, consumer,
networking, office4x, telecom, and random1 through random5 benchmarks.)

As described in Section 3.2.1, TASR consists of a two-stage optimization flow. It
first uses a stochastic optimization algorithm to minimize MPSoC area under per-
formance constraints. The area-optimized solution is used as a starting point for
the proposed reliability enhancements. The TASR lines in Figure 3.5 illustrate the
solutions produced by the MTTF optimization technique when run on all the bench-
marks. The initial area-optimized solutions appear at the left-most points of the
lines. TASR applied the optimization moves described in Section 3.2.5 until several
subsequent moves did not significantly improve system MTTF. Table 3.1 shows the
average system MTTF improvement over initial area-optimized solutions under dif-
ferent area overhead constraints for all ten benchmarks. These results illustrate three
key points about the reliable application-specific MPSoC synthesis problem.
1. The area cost to improve reliability is initially small. In Figure 3.4, area is
shown on a logarithmic scale. As shown in Table 3.1, improving the average
system MTTF over all benchmarks by 40%, 85%, and 180% results in maximum
area overheads of 0.0%, 5.0%, and 10.0%. MTTF is not directly considered
in the first optimization phase. As a result, TASR can sometimes improve
MTTF without area overhead because two solutions with the same area can
have different MTTFs. Initial solutions are optimized for area and tend to have
high power densities, high temperatures, and low resource redundancy: the
fault rates are high and single faults may cause failure. Therefore, the system
reliability can be improved at low area cost. TASR introduces processor cores
with lower power densities and/or replaces non-redundant cores with redundant
ones, thereby optimizing thermal properties and allowing the system to continue
operating despite runtime hardware faults.
2. As shown in Table 3.1, TASR automatically trades off system reliability for
area, allowing system designers to choose a desirable solution based on problem-
specific design constraints.
3. As system MTTF increases, the area penalty associated with further improving
system reliability increases. As shown in Table 3.1, TASR achieves 436% average
system MTTF improvement with a maximum area overhead of 25%. Further
improvements to system MTTF become prohibitively expensive. Processor core
failure cumulative distribution functions are non-decreasing. For a large enough
duration, there is a low probability that any processor will operate without a
fault. As a result, at very large MTTFs, adding processors or reinforcing a
subset of existing processors with redundant components has little impact on
MTTF.

Figure 3.5: Comparison of Different Optimization Heuristics [132]. (Ten panels, one
per benchmark: auto, office4x, consumer, telecom, networking, and random1 through
random5. Each panel plots area in mm² against MTTF in years for TASR, CR-only,
PD-only, and 1PHASE.)

Table 3.1: System MTTF Improvement Under Area Bound [132]

Area        MTTF           Area        MTTF           Area        MTTF
bound (%)   improve. (%)   bound (%)   improve. (%)   bound (%)   improve. (%)
 0.0         40.0          15.0        180.0          30.0        457.0
 5.0         85.0          20.0        240.0          35.0        468.0
10.0        180.0          25.0        436.0          40.0        470.0

The MTTF improvement under each area bound is computed by selecting the highest-
MTTF solution for each benchmark that honors the area bound, and computing the
average of their MTTF improvements.
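The diminishing returns described in the third point can be read directly from the Table 3.1 values. The following sketch computes the marginal MTTF improvement per additional percentage point of area bound; the data are copied from the table, while the function and variable names are illustrative.

```python
# (area bound %, average MTTF improvement %) pairs from Table 3.1.
TABLE_3_1 = [
    (0.0, 40.0), (5.0, 85.0), (10.0, 180.0),
    (15.0, 180.0), (20.0, 240.0), (25.0, 436.0),
    (30.0, 457.0), (35.0, 468.0), (40.0, 470.0),
]

def marginal_gains(rows):
    """MTTF improvement gained per extra percentage point of area bound."""
    return [
        (a1, (m1 - m0) / (a1 - a0))
        for (a0, m0), (a1, m1) in zip(rows, rows[1:])
    ]

for area_bound, gain in marginal_gains(TABLE_3_1):
    print(f"up to {area_bound:4.1f}% area: {gain:5.1f} MTTF points per area point")
```

Beyond the 25% area bound the marginal gain drops to a few MTTF percentage points per area point, which matches the observation that further improvement becomes expensive.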
3.3.3 Evaluation of Optimization Moves
TASR optimizes system reliability by controlling processor temperatures and im-
proving system redundancy. To evaluate the effectiveness of the proposed optimiza-
tion moves, we compare TASR with two alternative moves described in Section 3.2.5:
power density only (PD-only) and component redundancy only (CR-only) moves.
PD-only minimizes power density. CR-only increases resource redundancy. Fig-
ure 3.5 shows the results produced by TASR, CR-only, and PD-only optimization
moves. TASR almost always produces architectures with both superior area and
system MTTF. In some cases, PD-only or CR-only also do well. PD-only does not
consider component redundancy. However, introducing redundant processors in order
to improve power density still improves system MTTF. CR-only does not consider
processor power density. However, redundant processors tend to have lower power
densities than non-redundant processors; although their instantaneous spatial power
densities are similar to non-redundant processors, they have higher performance, per-
mitting lower temporal power densities. In general, it is necessary to use both struc-
tural redundancy and power density to produce high-quality solutions.
3.3.4 Evaluation of Optimization Flow
As explained in Section 3.2.2, it appears that a two-phase optimization flow in
which a stochastic optimization algorithm is first used to find a promising, low-area,
region of the solution space and then an iterative reliability enhancement algorithm
is used to trade off area for reliability is superior to a one-phase optimization flow.
To determine whether this argument has merit, we compared TASR with a one-
phase stochastic optimization algorithm in which functionality, timing, area, and
reliability are concurrently optimized. This algorithm, which we call 1PHASE, has the
ability to apply all the allocation, assignment, floorplanning, and scheduling changes
available to TASR. It optimizes MTTF within its multi-objective cost function. We
found that TASR can almost always produce solutions of equal or better quality than
1PHASE. In addition, TASR generally requires less CPU time (an average of 635.9 s
per benchmark) than 1PHASE (an average of 2,394 s per benchmark).
3.4 Conclusions and Future Work
This chapter has described a synthesis algorithm for reliable application-specific
MPSoCs. The dominant failure processes today, and in the near future, have rates
exponentially dependent on temperature. Therefore, the impact of tentative design
changes on the detailed temperature profile should be considered during the synthesis
process. This, in turn, requires power profiles, which depend on floorplanning and power
models. Even the fastest detailed thermal analysis and floorplanning algorithms cannot
be included within the inner loop of synthesis without greatly reducing the solution
space explored in a given amount of time. Therefore, we have proposed a two-stage
synthesis process in which a potentially-slow but high-quality stochastic optimiza-
tion algorithm is first used to minimize solution area. Starting from this promising
location in the solution space, a reliability enhancement heuristic explores the area–
MTTF tradeoff curve.
Our results indicate that this synthesis approach greatly outperforms simply
adding MTTF into a stochastic optimization algorithm as another objective. The
proposed synthesis flow increases MPSoC system mean time to failure by an average
of 85% with less than 5% area cost and by an average of 436% with less than 25%
area cost, compared to area-optimized solutions. As long as power densities remain
high and the dominant lifetime failure processes remain strongly dependent on tem-
perature, our results indicate that thermal and structural redundancy optimization
during synthesis have the potential to increase MPSoC lifetime with low area cost.
Chapter 4
Three-Dimensional Chip-Multiprocessor Run-Time Thermal Management
Three-dimensional (3D) integration has the potential to improve the communica-
tion latency and integration density of chip-level multiprocessors (CMPs). However,
the stacked high power density layers of 3D CMPs increase the importance and diffi-
culty of thermal management. In this chapter, we investigate the 3D CMP run-time
thermal management problem and describe efficient management techniques. This
chapter makes the following main contributions: (1) it identifies and describes the
critical concepts required for optimal thermal management, namely the methods by
which heterogeneity in both workload power characteristics and processor core ther-
mal characteristics should be exploited and (2) it proposes an efficient, proactive,
continuously-engaged hardware and operating system thermal management technique
governed by optimal thermal management policies. The proposed technique is evaluated
using multiprogrammed and multithreaded benchmarks in an integrated power,
performance, and temperature full-system simulation environment. We find that
proactive power–thermal budgeting allows a 30% improvement in instruction through-
put compared to a proactive thermal management approach that bases decisions only
upon local information. The software components of the proposed thermal manage-
ment technique have been implemented in the Linux 2.6.8 kernel. The analysis and
technique developed in this chapter provide a general solution for future 3D and 2D
CMPs. My major contributions to this chapter are the characterization of heat flow in
3D CMPs, the derivation of optimal workload assignment and power–thermal budgeting,
and the thermal management implementation in the Linux kernel (Sections 4.3 and 4.4).
My collaborator, Zhenyu Gu, contributed to the design of the 3D CMP architecture and
technology, the construction of the full-system simulation framework, and the
characterization and generation of the benchmark suites (Section 4.5).
4.1 Introduction
Continued increases in integration density, and achieving higher application per-
formance without corresponding increases in processor frequency, are now primary
goals for microprocessor designers. As a result, microprocessor design is rapidly mov-
ing towards highly-scalable chip-multiprocessor (CMP) architectures. Today’s main-
stream microprocessors are multi-core [56, 60, 7, 50, 107, 96]. The trend for future
CMPs is to increase the number of on-chip cores: 80-core prototypes have recently
been demonstrated by Intel [115].
Performance scalability is a major challenge in CMP design. Using the mainstream
two-dimensional (2D) planar CMOS fabrication process, on-chip interconnect shows
poor scalability in both performance and power consumption [5]. Three-dimensional
(3D) integration has the potential to overcome the limitations of 2D technology [109,
12, 95, 108]. By stacking multiple device layers connected through inter-die vias, 3D
integration increases logic integration density significantly and reduces on-chip wire
length, especially for global and semi-global wires. This has motivated computer
architects to evaluate 3D technology for CMP architecture design [12, 65, 57, 58].
However, none of this work describes a thermal management solution appropriate for
3D CMPs.
Thermal issues are a large and growing concern for CMPs [68, 28, 14, 99]. Increas-
ing chip power consumption and temperature affect circuit reliability (via negative
bias temperature instability, electromigration, time-dependent dielectric breakdown,
thermal cycling, etc.), power and energy consumption (via increased leakage power),
and system cost (via increased cooling and packaging cost). The use of 3D integra-
tion magnifies power dissipation problems [12, 89, 90, 71]. Chip cross-sectional power
density increases linearly with the number of vertically-stacked active circuit layers.
In addition, the interconnect and bonding layers used in 3D integration have low ther-
mal conductivities, which further exacerbates thermal effects. Temperature-related
concerns that can sometimes be safely ignored in 2D CMPs, such as temperature-
induced performance or reliability degradation, become increasingly prominent in 3D
CMPs. 3D integration holds promise but without solutions to the thermal problems
it brings, 3D CMPs will be impractical.
Run-time thermal management techniques, such as dynamic voltage and frequency
scaling, clock throttling, execution unit toggling, and workload migration, have been
proposed for 2D high-performance microprocessors [14, 99, 87, 54, 68, 28]. Using
these techniques, cooling solutions and packages need not be designed for worst-case
power consumption scenarios. Cooling cost can thereby be significantly reduced. Past
work, however, cannot effectively optimize the performance–temperature tradeoff in
3D CMPs for the following reasons.
First, the thermal management techniques deployed in current microprocessors
and operating systems are primarily used to handle rare, worst-case processor power
consumption events and eliminate thermal emergencies. Although they can poten-
tially introduce significant performance overhead, they are rarely invoked. In con-
trast, the higher power densities of future 3D (and some 2D) CMPs will frequently
require operation at or near thermal limits. Already, processors contain reactive tech-
niques to permit the use of reduced-cost packaging and cooling configurations that
are not capable of handling maximum power dissipation. Today’s laptops frequently
invoke thermal management mechanisms that drastically reduce performance, even
under normal operating conditions [74]. Power should be viewed as a limited resource
and processor cores should spend carefully-budgeted amounts. Thermal management
should be used to proactively, continuously optimize CMP performance and temper-
ature, instead of merely reacting to emergencies.
Second, 3D CMPs have heterogeneous power and thermal characteristics. On-
chip processor cores have different cooling efficiencies. For instance, cores in the
layers closer to the heatsink have higher cooling efficiencies than those farther from
the heatsink. Processor cores farther from the heatsink will have higher tempera-
tures than their neighbors nearer the heatsink, even when their power consumptions
are lower. Inter-core thermal correlation is heterogeneous. The thermal correlation
Figure 4.1: (a) Comparison of Face-to-Face (Left) and Face-to-Back (Right) Configurations for Two Stacked Dies, (b) 3D Three Stacked Die Floorplan Used in This Work, and (c) 3D CMP Chip-package Thermal Modeling [134].
between vertically-aligned processor cores is stronger than that between processor
cores within the same layer. The power and thermal heterogeneity of 3D CMP poses
unique challenges for run-time thermal management. Achieving optimal 3D CMP
performance under a temperature constraint requires careful system-wide control of
each processor core’s performance and power consumption. Local control, alone, is
insufficient.
In this chapter, we develop the analytical framework necessary to determine the
thermal impact of every core in a 3D CMP upon every other core. This framework
yields guidelines for near-optimal thermal management. The guidelines are embodied
in a proactive global power–thermal budgeting algorithm, performance counter-based
workload monitor, and distributed thermal control techniques, which we have implemented
in version 2.6.8 of the Linux kernel. The resulting 3D CMP thermal management
solution, which we call ThermOS, is evaluated using detailed full-system
simulation with M5 [11]. We have integrated power modeling and thermal analysis
tools within the simulator, allowing unified architectural/power/thermal simulation
of arbitrary single-threaded and multi-threaded applications and the Linux operating
system (OS). Our results for a wide range of multiprogrammed and multithreaded
applications indicate that, given a peak temperature constraint, ThermOS improves
CMP throughput by an average of 29.84% when compared to state-of-the-art proac-
tive distributed thermal management. This improvement is primarily due to the
power–thermal budgeting guidelines used by ThermOS.
4.2 Contribution
Our work is most closely related to Donald and Martonosi’s research on CMP
thermal management using distributed control-theoretic core management and a
global controller that guides migration [28]. Both their thermal management tech-
nique and ThermOS are continuously-engaged thermal management techniques. How-
ever, existing proactive thermal management techniques are not appropriate for CMPs
with heterogeneous thermal environments, such as 3D CMPs. Global guidance and
power–thermal budgeting are particularly beneficial for 3D CMPs. By matching core
cooling characteristics, application features and voltage levels, we can improve perfor-
mance by limiting throttling and migration. We are the first to examine the impact
of thermal heterogeneity on thermal management of 3D architectures. We evaluate
our proposed policies in a full system simulator. This experimental setup accounts
for the overhead of DTM in the OS, including migration costs and context switches.
4.3 Heat Flow in 3D CMPs
This section uses examples to explain the special thermal characteristics of 3D
CMPs and develop a mathematical model that will be used to derive the thermal
management policies described in Section 4.4 and validated in Section 5.4.
Figure 4.2: Inter-layer and Intra-layer Thermal Heterogeneity and Dominance in 3D CMPs [134].
4.3.1 Introduction to Thermal Modeling
Heat conduction within CMP chip and package can be modeled using Fourier heat
flow analysis, which has been the standard method used by industry and academia
for circuit-level and architecture-level IC chip–package thermal analysis during the
past few decades [20, 8, 99, 125]. This method is analogous to Georg Simon Ohm’s
method of modeling electrical current. (In fact, Ohm borrowed this model from Fourier;
it was initially proposed to model heat flow.) Using Fourier heat flow analysis, heat flow is
analogous to electrical current and temperature is analogous to voltage. The CMP is
virtually partitioned into numerous discrete blocks, as shown in Figure 4.2. The thermal
conductance of each block is a linear function of the conductivity of its material
and its cross-sectional area divided by length; it is analogous to electrical conductance.
Blocks also have heat capacities that are analogous to electrical capacitance.
Therefore, an instantaneous change in heat generation results in a gradual change in
temperature. As a result, the temperature profile of a CMP is essentially its power
profile after applying a complicated RC filter. We will deal with this effect in detail
in Section 4.3.3. For a thermal model to be accurate, each block must be so small
that the temperature within it is uniform. A fine-grained, and thus more accurate
model was used to validate ThermOS. However, for the sake of explanation, this sec-
tion will describe the coarse-grained model shown in Figure 4.2, in which each core
is represented with a single thermal model element.
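The RC-filter view of a single thermal element can be illustrated with a minimal sketch. It integrates C dT/dt + g (T − Tamb) = P with forward Euler for a step in power. The conductance of 0.82 W/K is the heatsink-path value quoted later in this section; the heat capacity, ambient temperature, power step, and the function name `step_response` are assumptions chosen only for illustration.

```python
def step_response(g, c, p_step, t_amb, dt, steps):
    """Forward-Euler integration of c*dT/dt + g*(T - t_amb) = p_step.

    g: thermal conductance to the ambient (W/K)
    c: heat capacity of the element (J/K)
    Returns the list of temperatures, starting at t_amb.
    """
    temps = [t_amb]
    for _ in range(steps):
        t = temps[-1]
        temps.append(t + dt * (p_step - g * (t - t_amb)) / c)
    return temps

G_HS = 0.82     # W/K, value quoted in this section
C_CORE = 0.05   # J/K, hypothetical heat capacity
T_AMB = 45.0    # degrees C, hypothetical ambient temperature
P_STEP = 10.0   # W, hypothetical power step

tau = C_CORE / G_HS                  # RC thermal time constant (s)
steady = T_AMB + P_STEP / G_HS       # steady-state temperature
temps = step_response(G_HS, C_CORE, P_STEP, T_AMB, dt=tau / 100, steps=500)

# The instantaneous power step produces a gradual temperature rise.
print(f"after 1 tau: {temps[100]:.2f} C, after 5 tau: {temps[-1]:.2f} C, steady: {steady:.2f} C")
```

After one time constant the element has covered roughly 63% of the distance to its steady-state temperature, which is why the rate of temperature change, and hence the required reaction speed of a thermal manager, is bounded.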
In 3D CMPs fabricated from multiple stacked wafers, the thermal environment
varies from layer to layer. Moreover, the intra-layer and inter-layer thermal rela-
tionships among CMP cores are heterogeneous. The rest of this section explains the
impact of this heterogeneity on heat flow and builds the theoretical foundations for
developing near-optimal 3D CMP thermal management policies. This understanding
is essential for proper thermal management of 3D CMPs but no prior work is based
on it.
Homogeneous Intra-Layer Characteristics
Figure 4.2 illustrates a simplified heat conduction model for a pair of adjacent
CMP cores on the same layer (J and K) and a pair of adjacent CMP cores on different
layers (I and K) of a 3D CMP. As shown in this figure, since the heat dissipation paths
of Cores J and K are nearly identical, the thermal conductances of these two cores
are nearly equal. In other words, processor cores within the same layer have similar
cooling efficiencies.
Heterogeneous Inter-Layer Characteristics
In contrast to cores on the same layer, Cores I and K have different conductances
to the ambient: ghs = 0.82 W/K for Core K and 1/(1/ghs + 1/ginter) = 0.73 W/K
for Core I (see footnote 2). In addition, the steady-state temperature of Core I is always higher
than that of Core K, even if Core I has a lower power consumption. The following
equations formalize this effect, which we refer to as thermal dominance. Neglecting
the limited intra-layer heat flow,
TK = Tamb + (PK + PI)/ghs (4.1)
TI = TK + PI/ginter
= Tamb + (PK + PI)/ghs + PI/ginter (4.2)
where TK and TI are the temperatures of Cores K and I, Tamb is the ambient temper-
ature, PK and PI are the power consumptions of Cores K and I, ghs is the thermal
conductance from Core K to the ambient through the cooling solution, and ginter
is the inter-layer thermal conductance between Cores I and K. Besides thermally
dominating Core K, Core I also has a higher total thermal resistance to the ambient, i.e.,
a lower cooling efficiency. As a result, a unit of power consumption on Core I
will have at least as great an impact on temperature as a unit of power consumption
on Core J or K.
Footnote 2: The thermal conductance values in this section are derived using a thermal analysis package developed by Yang et al. [125], which constructs a fine-grained 3D CMP thermal model based on the material properties and physical structure of the chip–package configuration described in Section 4.5.1.2, Table 4.3, and Table 4.4. For the sake of explanation, a coarse-grained thermal model with compact equations is used in this section to illustrate fundamental 3D CMP thermal properties.
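Thermal dominance can be checked numerically with Equations 4.1 and 4.2. The sketch below uses the conductances quoted in this section; the ambient temperature, the power values, and the function name are assumptions for illustration only.

```python
G_HS = 0.82      # W/K, Core K to ambient (from the text)
G_INTER = 6.67   # W/K, between Cores I and K (from the text)

def stacked_temperatures(p_i, p_k, t_amb):
    """Steady-state temperatures of Cores I and K, neglecting intra-layer heat flow."""
    t_k = t_amb + (p_k + p_i) / G_HS   # Equation 4.1
    t_i = t_k + p_i / G_INTER          # Equation 4.2
    return t_i, t_k

# Hypothetical operating point: Core I draws only a third of Core K's power.
t_i, t_k = stacked_temperatures(p_i=5.0, p_k=15.0, t_amb=45.0)
g_i_to_ambient = 1.0 / (1.0 / G_HS + 1.0 / G_INTER)

print(f"T_I = {t_i:.2f} C, T_K = {t_k:.2f} C")
print(f"Core I conductance to ambient = {g_i_to_ambient:.2f} W/K")
```

Because all of Core I's heat must pass through Core K on its way to the heatsink, Core I remains hotter than Core K for any positive power on Core I, regardless of how the power is split.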
Thermal Coupling
The thermal conductance between J and K (gintra) is approximately 0.41 W/K.
Heat can flow between Cores J and K. As a result, the power consumption of one can
influence the temperature of the other. However, this thermal coupling is relatively
minor compared to that between vertically-aligned cores. The thermal conductance
between Cores I and K (ginter) is approximately 6.67 W/K, roughly 16× gintra. The
large interface area between Cores I and K results in a high thermal conductance,
despite the interposed high thermal resistivity (but thin, and therefore low resistance)
10 µm polyimide bonding layer.
Summary and Open Questions
At this point, we can draw some qualitative conclusions. The temperatures of
vertically-aligned cores are highly correlated, relative to the temperatures of horizontally-
adjacent cores. Cores farther from the heatsink have higher temperatures than their
neighbors closer to the heatsink. In addition, the temperature impact of a unit of
power dissipation will be at least as high for Core I as for Cores J and K, due to their
differing thermal conductances to the ambient. However, a few questions remain:
1. How can we use this knowledge of thermal environment heterogeneity to guide
the development of a CMP thermal management algorithm? and
2. What is the impact of the power consumption of each core upon all other cores
in the system?
We will now introduce a general analytical framework that answers these questions.
4.3.2 3D CMP Heat Flow Analytical Framework
In this section, we formulate the problem of determining the impact of a unit
change in power consumption for any given processor core upon the temperatures of
all other cores. This formulation provides the theoretical foundation for determining
the principles of near-optimal thermal management. We can represent the thermal
characteristics of a 3D CMP using the following notation, which follows naturally
from the heat conduction analysis ideas discussed in Section 4.3.1:
$$C \frac{dT(t)}{dt} + A\,T(t) = P\,u(t) \tag{4.3}$$
In this equation, given a system of N thermal elements, C is an N × N matrix
with thermal element heat capacities along the diagonal and zeros elsewhere, T is
a length N thermal element temperature vector, t is time, A is an N × N matrix
containing the thermal conductances of adjacent elements at the corresponding row–column
intersections and zeros elsewhere, P is a length N thermal element power
vector, and u(t) is a unit step function that changes from 0 to 1 at time t = 0. In addition,
matrix A = L^T K L, where L is a Laplacian matrix and K is a diagonal matrix
containing the thermal conductances of adjacent thermal elements. Given an IC
chip–package partition with N connected thermal elements plus a ground element
that models the ambient temperature, matrix A is full rank, i.e., nonsingular [76]. The
impact of the C dT(t)/dt term will be explained in detail in Section 4.3.3. In order
to ease explanation, we neglect C (i.e., consider the steady state) and solve Equation 4.3 for T as follows:
$$T = A^{-1} P \tag{4.4}$$
This leads to an interesting observation: A−1 gives the thermal impact of unit changes
in power consumption. It is conventionally referred to as the thermal resistance ma-
trix [18] but it would be better to view it as a thermal impact matrix. In order to
determine the thermal impact of one core’s power consumption on another core’s
temperature, we need only consider the value in the corresponding row–column inter-
section in A−1. Let us assume that Core I is currently the hottest in the CMP. ζij is
the thermal impact coefficient for core i due to j. This value indicates the change in
the temperature for element i as a consequence of a unit change in power consumption
for element j. To determine the impact of power consumed in Cores J and K upon
Core I’s temperature, we need only consider the thermal impact coefficients in row I
in A−1, i.e., [ζI,I , ζI,J , ζI,K ]. Thus,
TI = PI × ζI,I + PJ × ζI,J + PK × ζI,K (4.5)
The thermal impact matrix will be used extensively in Section 4.4 to develop
thermal management guidelines. It also gives us a new view of thermal heterogeneity
in 3D CMPs. For a representative stacked-wafer 3D CMP design, the ζ value for
vertically-adjacent cores is 1.22 K/W and the ζ value for laterally-adjacent cores is
0.39 K/W, yielding a thermal impact ratio of 3.12 for the two cases.
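To make the thermal impact matrix concrete, the following sketch builds A for the three-element network of Figure 4.2 and inverts it. The layout of A (the standard nodal conductance form, with the ambient as the implicit ground node), the node ordering, the power vector, and the helper names are assumptions. The numbers it produces describe this toy network only and are not expected to reproduce the 1.22 K/W and 0.39 K/W values above, which come from the fine-grained model.

```python
G_HS, G_INTER, G_INTRA = 0.82, 6.67, 0.41   # W/K, values quoted in Section 4.3.1
I, J, K = 0, 1, 2                            # assumed node order

# Diagonal: sum of conductances attached to the node.
# Off-diagonal: minus the conductance linking the two nodes.
A = [
    [G_INTER,  0.0,            -G_INTER],
    [0.0,      G_HS + G_INTRA, -G_INTRA],
    [-G_INTER, -G_INTRA,       G_HS + G_INTER + G_INTRA],
]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (pure Python)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

zeta = invert(A)   # thermal impact matrix, in K/W

def temperature_rise(power):
    """Steady-state rise above ambient for each element (Equation 4.4)."""
    return [sum(zeta[r][c] * power[c] for c in range(len(power)))
            for r in range(len(power))]

power = [5.0, 10.0, 10.0]   # hypothetical powers (W) for Cores I, J, K
rise = temperature_rise(power)
print("row I of the thermal impact matrix:", [round(z, 3) for z in zeta[I]])
print("temperature rise of Core I:", round(rise[I], 2), "K")
```

Row I of `zeta` is exactly the coefficient set used in Equation 4.5, and even in this coarse network the vertical coefficient (Core K on Core I) is several times the lateral one (Core J on Core I).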
4.3.3 Power Model, Dynamic Thermal Analysis, and Model-
ing Granularity
In the previous subsections, we made a number of simplifying assumptions about
the thermal environment in order to ease explanation. Our actual analysis and ther-
mal management implementation relaxes many of these assumptions for greater ac-
curacy. We now expound on our thermal model.
In order to determine the thermal profile, the power profile must first be known. We
model both dynamic power consumption and leakage power consumption [129]. Dependence
on voltage, switching activity, capacitance, and temperature is considered.
These equations are used together with a Wattch-based EV6 power model [15] to de-
termine the power consumption distribution among architectural units. The power
distributions of real multiprogrammed and multithreaded workloads on CMPs may
be spatially and temporally heterogeneous. The proposed modeling approach allows
us to capture the impact of workload heterogeneity on power and thermal profiles.
As explained in Section 4.3.2, the thermal analysis of real ICs must consider heat
capacity (C) as well as thermal conductance, i.e., transient analysis is necessary. The
thermal analysis infrastructure we use in architectural–thermal simulation captures
these effects using a frequency-domain moment matching analysis technique. Our
on-line thermal management technique continuously adjusts its behavior based on
thermal sensor readings. Prior subsections assumed that each CMP core is repre-
sented by a single thermal element to simplify explanation. In reality, our analysis
infrastructure is capable of dividing each CMP core into numerous three-dimensional
thermal elements to permit accurate temperature estimation.
Heat capacity plays a role in thermal modeling and management. Considering
transient effects complicates the power and thermal analysis infrastructure. Fortu-
nately, heat capacity limits the rate of temperature change, i.e., the maximum tem-
perature change of a CMP core in a given time interval is limited by the RC thermal
time constant of the core and the maximum power consumption change. Although
we used a thermal analysis infrastructure that considers transient thermal effects in
detail, the proposed thermal management technique is designed to react to transient
thermal effects by periodically adapting its behavior based on temperatures measured
with thermal sensors or estimated using run-time thermal models.
4.4 3D CMP Thermal Management
In this section, we investigate the 3D CMP run-time thermal management problem
and propose efficient management techniques. Given a 3D CMP with N on-chip
processor cores, our goal is to maximize the CMP throughput under run-time thermal
constraints. CMP throughput is defined as the total number of instructions executed
by the CMP per second.
$$\mathrm{CMP\ IPS} = \sum_{i=0}^{N-1} IPC_i \times f_i \tag{4.6}$$
where IPC i and fi are the run-time instructions per cycle and frequency of Core i.
Run-time thermal safety requires that
$$\forall i \in \{0, \ldots, N-1\}: \quad T_i \le T_{MAX} \tag{4.7}$$
i.e., the temperature of each processor core cannot exceed the maximum safe temper-
ature: TMAX .
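The objective and constraint of Equations 4.6 and 4.7 translate directly into code. The sketch below is illustrative only; the function names and the sample IPC, frequency, and temperature values are assumptions.

```python
def cmp_ips(ipc, freq_hz):
    """Equation 4.6: total instructions per second over all cores."""
    return sum(i * f for i, f in zip(ipc, freq_hz))

def thermally_safe(temps, t_max):
    """Equation 4.7: every core temperature must stay at or below t_max."""
    return all(t <= t_max for t in temps)

# Hypothetical three-core snapshot.
ipc = [1.2, 0.8, 1.5]
freq_hz = [2.0e9, 2.5e9, 1.5e9]
temps_c = [78.0, 81.5, 84.9]

throughput = cmp_ips(ipc, freq_hz)
safe = thermally_safe(temps_c, t_max=85.0)
print(f"throughput = {throughput:.3e} instructions/s, safe = {safe}")
```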
In the following sections, we analyze the thermal management problem for 3D
CMPs and determine the policies necessary for performance optimization under tem-
perature constraints. This study will be used to guide the development of our run-time
thermal management techniques.
4.4.1 Conditions Required for Optimal 3D CMP Thermal
Management and Derivations of Resulting Policy Guide-
lines
This section derives performance optimization guidelines. The central theme is
to optimize the performance of CMP cores under a constraint on peak temperature
during workload assignment and power–thermal budgeting.
Observation: To maximize CMP throughput, processor cores should operate at dif-
ferent voltages and frequencies due to heterogeneous processor core thermal charac-
teristics and heterogeneous run-time workloads.
As described in Section 4.3.1, processor cores in a 3D CMP are thermally correlated.
The temperature of each Core i is affected by the power consumptions of all cores,
as follows:
$$T_i = \sum_{j=0}^{N-1} \zeta_{i,j} \times p_j \le T_{MAX} \tag{4.8}$$
where Ti is the temperature of processor Core i; ζi,j, i, j ∈ [0, N − 1], is an inter-core
thermal impact coefficient, which indicates the impact of a unit power consumption
of Core j on the temperature of Core i; pj is Core j’s power consumption; and N is
the number of processor cores of the CMP.
We would like to guide migration of tasks among cores, and budget power to cores,
in order to optimize CMP throughput under a temperature constraint. To facilitate
developing the necessary guidelines, we introduce the concept of thermal impact per
performance gain, TIP :
$$TIP^{f}_{i,j} = \frac{dT_i}{df_j}, \qquad TIP^{IPC}_{i,j} = \frac{dT_i}{dIPC_j} \tag{4.9}$$
TIP i,j indicates the thermal impact on processor Core i due to an increase in Core j’s
performance, achieved by increasing its frequency and voltage, by assigning a higher-IPC
job to this core, or both. Intuitively, TIP is the thermal cost per unit increase in processor core
performance. It can be viewed as the inverse of a core’s thermal efficiency. Subject
to a temperature bound, maximizing CMP performance thus requires that all the
processor cores achieve the same thermal impact per performance improvement on
the maximum-temperature core, i.e.,
$$TIP^{f,IPC}_{i,0} \equiv TIP^{f,IPC}_{i,1} \equiv \cdots \equiv TIP^{f,IPC}_{i,N-1} \tag{4.10}$$
Note that the impact on Ti due to the power consumption of Core j is ζi,j Pj. Given that
dynamic power consumption is Pj = ξj Vj^2 fj (where Vj and fj are the supply voltage
and frequency of Core j), Vj ∝ fj^β with β ≈ 1 [13], and ξj is Core j’s run-time switching
activity multiplied by the capacitance of the switched nodes (which is approximately
linearly proportional to the IPC of the job running on Core j), then
$$\begin{aligned}
\zeta_{i,0}\, f_0^{2\beta+1} &\equiv \zeta_{i,1}\, f_1^{2\beta+1} \equiv \cdots \equiv \zeta_{i,N-1}\, f_{N-1}^{2\beta+1} \\
\zeta_{i,0}\, \xi_0 f_0^{2\beta} &\equiv \zeta_{i,1}\, \xi_1 f_1^{2\beta} \equiv \cdots \equiv \zeta_{i,N-1}\, \xi_{N-1} f_{N-1}^{2\beta}
\end{aligned} \tag{4.11}$$
This result indicates that processor cores with heterogeneous power and thermal
characteristics, i.e., different power–thermal impact coefficients, ζi,j, running jobs
with different IPCs should be clocked at different frequencies. A similar conclusion
can be drawn when both dynamic and leakage power variants are considered.
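The step from Equation 4.10 to Equation 4.11 can be made explicit. The sketch below considers dynamic power only, treats each ζi,j as a constant, and writes Vj = κ fj^β, where the proportionality constant κ is a symbol introduced here for illustration. It also assumes that ξj is proportional to IPC j with the same constant on every core.

```latex
\begin{align*}
P_j &= \xi_j V_j^2 f_j = \kappa^2\, \xi_j f_j^{2\beta+1} \\
T_i &= \sum_{j=0}^{N-1} \zeta_{i,j} P_j
     = \kappa^2 \sum_{j=0}^{N-1} \zeta_{i,j}\, \xi_j f_j^{2\beta+1} \\
\frac{\partial T_i}{\partial \xi_j} &= \kappa^2\, \zeta_{i,j}\, f_j^{2\beta+1}
  \qquad \text{(proportional to } TIP^{IPC}_{i,j}\text{)} \\
TIP^{f}_{i,j} = \frac{\partial T_i}{\partial f_j}
  &= (2\beta+1)\, \kappa^2\, \zeta_{i,j}\, \xi_j f_j^{2\beta}
\end{align*}
```

Requiring each of the two quantities to be equal for all j, as in Equation 4.10, and cancelling the common factors κ² and (2β + 1) leaves the two lines of Equation 4.11.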
As shown in Section 4.3.1, the inter-layer and intra-layer thermal characteristics
of 3D CMPs show distinct differences. This leads to different thermal management
policies for inter-layer and intra-layer processor cores. In the following sections, we
determine the conditions required for optimal 3D CMP thermal management and
derive the resulting policy guidelines.
4.4.1.1 Inter-Layer Power–Thermal Budgeting and Workload Assignment
Inter-layer processor cores have heterogeneous thermal characteristics. In addi-
tion, vertically-aligned cores have strongly-correlated temperatures. We now derive
heterogeneity-aware guidelines for power–thermal budgeting and workload assignment
among vertically-aligned cores.
Guideline I: To maximize CMP throughput, the thermal efficiencies of vertically-
aligned processor cores should be optimized under the thermal constraint, i.e., the
voltage and frequency assignment among vertically-aligned processor cores should fol-
low Equations 4.8–4.11.
As shown in Section 4.3.1, among each group of vertically-aligned processor cores,
the Core i farthest from the heat sink is thermally dominant, i.e., it has the highest
temperature and also the lowest cooling efficiency. Therefore, given the thermal
constraint for processor Core i, i.e., Ti ≤ TMAX , the performance-optimal voltage
and frequency setup produced by Equations 4.8–4.11 also guarantees the thermal
safety for other vertically-aligned processor cores. In other words, Equations 4.8–4.11
provide the performance-optimal power–thermal budget policy for vertically-aligned
processor cores. Considering Cores I and K in Figure 4.2,
ζI (= 1/ginter + 1/ghs) > ζK (= 1/ghs), and
TI (= ζI × PI + ζK × PK) > TK (= ζK × PI + ζK × PK)
Equations 4.8–4.11 yield
$$\frac{f_I}{f_K} = \left( \frac{IPC_K \times \zeta_K}{IPC_I \times \zeta_I} \right)^{\frac{1}{2\beta}}.$$
Given homogeneous workload assignment, i.e., IPC I ≡ IPC K, this implies that fK > fI, i.e., to optimize CMP throughput,
the processor core with higher cooling efficiency should be clocked at a higher
frequency.
Guideline II: Given jobs with different IPCs, the maximal CMP throughput can
only be achieved by maximizing the IPC heterogeneity during workload distribution.
To maximize throughput, jobs with higher IPCs should be assigned to cores with higher
thermal efficiencies.
This guideline indicates how to distribute run-time workload among vertically-
aligned processor cores. We will again use Figure 4.2 to illustrate the reason for this
guideline. Given a temperature constraint TMAX and an arbitrary workload assign-
ment with Core I’s IPC equal to IPC I and Core K’s IPC equal to IPCK , Equa-
tions 4.8–4.11 yield the following performance-optimal power and thermal budget
assignment under the given workload distribution:
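$$f_I = f_K \times \left( \frac{IPC_K \times \zeta_K}{IPC_I \times \zeta_I} \right)^{\frac{1}{2\beta}} \tag{4.12}$$
$$f_K = \left( \frac{T_{MAX}}{\zeta_K \times IPC_K \left( 1 + \left( \frac{\zeta_K \times IPC_K}{\zeta_I \times IPC_I} \right)^{\frac{1}{2\beta}} \right)} \right)^{\frac{1}{2\beta+1}} \tag{4.13}$$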
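Next, we swap the workloads of Core I and Core K. Equations 4.8–4.11 then
yield the following performance-optimal power and thermal budget assignment for
the new distribution:
$$f'_I = f'_K \times \left( \frac{IPC_I \times \zeta_K}{IPC_K \times \zeta_I} \right)^{\frac{1}{2\beta}} \tag{4.14}$$
$$f'_K = \left( \frac{T_{MAX}}{\zeta_K \times IPC_I \left( 1 + \left( \frac{\zeta_K \times IPC_I}{\zeta_I \times IPC_K} \right)^{\frac{1}{2\beta}} \right)} \right)^{\frac{1}{2\beta+1}} \tag{4.15}$$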
A simple calculation then shows that the difference in CMP throughput between
these two workload distributions satisfies
$$(IPC_I \times f_I + IPC_K \times f_K) - (IPC_K \times f'_I + IPC_I \times f'_K) \ge 0 \iff IPC_I \le IPC_K \tag{4.16}$$
In other words, assigning jobs with higher IPCs to cores with higher thermal efficien-
cies yields higher overall throughput under the same temperature constraint.
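Guidelines I and II can be checked numerically by evaluating Equations 4.12 and 4.13 for both workload assignments. The sketch below takes ξ equal to IPC, sets β = 1, works in normalized units (TMAX = 1 stands for the allowed temperature rise), and uses the ζ values of the coarse two-core model of Section 4.3.1. The function names and the two job IPCs are assumptions.

```python
ZETA_K = 1.0 / 0.82                 # K/W, Core K (next to the heatsink)
ZETA_I = 1.0 / 0.82 + 1.0 / 6.67    # K/W, Core I (farther from the heatsink)
T_MAX = 1.0                         # normalized temperature budget

def budget(ipc_i, ipc_k, zeta_i, zeta_k, t_max, beta=1.0):
    """Frequencies (f_i, f_k) given by Equations 4.12 and 4.13."""
    ratio = (ipc_k * zeta_k / (ipc_i * zeta_i)) ** (1.0 / (2.0 * beta))
    f_k = (t_max / (zeta_k * ipc_k * (1.0 + ratio))) ** (1.0 / (2.0 * beta + 1.0))
    return f_k * ratio, f_k

def throughput(ipc_i, ipc_k, f_i, f_k):
    """CMP throughput of the two cores (Equation 4.6)."""
    return ipc_i * f_i + ipc_k * f_k

LOW_IPC, HIGH_IPC = 1.0, 2.0   # hypothetical jobs

# Guideline I: with equal IPCs, the better-cooled Core K runs faster.
f_i_eq, f_k_eq = budget(LOW_IPC, LOW_IPC, ZETA_I, ZETA_K, T_MAX)

# Guideline II: high-IPC job on Core K versus the swapped assignment.
f_i, f_k = budget(LOW_IPC, HIGH_IPC, ZETA_I, ZETA_K, T_MAX)
matched = throughput(LOW_IPC, HIGH_IPC, f_i, f_k)
f_i_sw, f_k_sw = budget(HIGH_IPC, LOW_IPC, ZETA_I, ZETA_K, T_MAX)
swapped = throughput(HIGH_IPC, LOW_IPC, f_i_sw, f_k_sw)

print("f_K > f_I for equal IPCs:", f_k_eq > f_i_eq)
print("high-IPC job on Core K gives higher throughput:", matched > swapped)
```

Both assignments exactly exhaust the same temperature budget on Core I, so the throughput comparison is made under the same constraint, as Equation 4.16 requires.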
4.4.1.2 Intra-Layer Power–Thermal Budgeting
Intra-layer cores have mostly-homogeneous thermal characteristics with almost
identical cooling efficiencies (see Section 4.3.1), i.e., ζi,i ≈ ζj,j, when Core i and Core j
are in the same layer. In addition, the inter-core thermal impact is significantly lower
than the self power–thermal impact of each core, i.e., ζi,i ≫ ζi,j when i ≠ j. We
derive the following policies for intra-layer power–thermal budgeting and workload
assignment.
Guideline III: To maximize aggregate CMP frequency or instruction throughput,
power–thermal budget and workload should be balanced among intra-layer processor
cores.
Consider two intra-layer processor cores J and K with ζJ,J ≡ ζK,K ≫ ζJ,K ≡
ζK,J. The temperature of each core depends mainly on its own power consumption,
i.e., TJ ≈ ζJ,J × PJ and TK ≈ ζK,K × PK (steady-state). Given thermal constraint
TJ , TK ≤ TMAX , performance optimization yields PJ ≡ PK and TIPJ ≡ TIPK , i.e.,
both cores should be clocked at the same frequency and execute workload with the
same IPC. This guideline can also be motivated as follows. Assume both cores are
assigned the same voltage V , frequency f , and workload (ξ and IPC ). Therefore,
TJ ≡ TK. Next, by adjusting the workload assignment, we increase the IPCs of the
jobs assigned to one core and decrease the IPCs of the jobs assigned to the other core.
Since ζJ,J, ζK,K ≫ ζJ,K, ζK,J, the temperature of one of the cores increases and the
peak temperature of these two cores increases. As a result, frequency reduction and
performance degradation are required to meet temperature constraints.
Figure 4.3: ThermOS: 3D CMP Run-time Thermal Management [134]. (Operating system: global power–thermal budgeting and distributed thermal-aware workload migration. CMP hardware: temperature monitoring, workload monitoring, and distributed run-time thermal management.)
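A small numerical sketch illustrates Guideline III. It assumes two same-layer cores with identical self-impact ζ, neglects the inter-core coefficients, takes ξ equal to IPC with β = 1, and lets each core run at the highest frequency that keeps its own temperature rise within a normalized budget. The function names and IPC values are assumptions.

```python
def max_frequency(ipc, zeta, t_max, beta=1.0):
    """Largest f with zeta * ipc * f**(2*beta + 1) <= t_max (self-heating only)."""
    return (t_max / (zeta * ipc)) ** (1.0 / (2.0 * beta + 1.0))

def layer_throughput(ipcs, zeta, t_max, beta=1.0):
    """Sum of IPC * f over cores that each run at their thermal limit."""
    return sum(ipc * max_frequency(ipc, zeta, t_max, beta) for ipc in ipcs)

# Same total IPC, distributed evenly or unevenly across Cores J and K.
balanced = layer_throughput([1.5, 1.5], zeta=1.0, t_max=1.0)
skewed = layer_throughput([1.0, 2.0], zeta=1.0, t_max=1.0)

print(f"balanced = {balanced:.4f}, skewed = {skewed:.4f}")
```

Under these assumptions throughput per core grows as IPC^(2β/(2β+1)), a concave function, so for a fixed total IPC the even split gives the highest aggregate throughput.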
4.4.2 ThermOS: 3D CMP Thermal Management
Based on the thermal management guidelines developed in Section 4.4.1, we have
developed ThermOS, a unified hardware and OS thermal management solution for
3D CMP. As shown in Figure 4.3 and Table 4.1, ThermOS consists of hardware-
based temperature–workload monitoring and distributed run-time thermal manage-
ment built into a 3D CMP microarchitecture, as well as a temperature-aware Linux
kernel equipped for global power–thermal budgeting and distributed temperature-
aware workload migration. ThermOS is a proactive, continuously-engaged solution
designed to handle 3D CMP power–thermal heterogeneities, distribute run-time work-
load, and manage the limited power–thermal budget to optimize performance under
a temperature constraint. ThermOS is built upon the Linux 2.6.8 kernel, which has an
O(1) time complexity scheduler. Our temperature-aware scheduling algorithm maintains
the same time complexity. Table 4.1 summarizes the proposed offline, run-time,
and hardware management techniques.
4.4.2.1 Temperature Monitoring
ThermOS gathers CMP temperature profiles at run-time, which are used to guide
temperature-aware workload migration as well as power–thermal budgeting. Either
thermal sensors or online thermal analysis may be used for on-line temperature mon-
itoring. Thermal sensors have been widely used in high-performance microproces-
sors [85, 56]. Efficient software-based online thermal analysis techniques have also
been developed [99].
4.4.2.2 Workload Monitoring
In addition to CMP thermal profile, ThermOS gathers run-time performance and
power characteristics to guide job migration as well as power–thermal budgeting. A
processor core’s activity factor is a function of the capacitances of its functional units
and the corresponding run-time activity factors resulting from its workload. Most
modern processors provide hardware performance counters for monitoring specific
events [56, 101]. These performance counters can be used to inform accurate and
efficient regression-based run-time performance and power models [52, 63]. ThermOS
uses this technique for linear regression estimation of run-time processor core activ-
ity factors. The model was developed offline and integrated with the OS. During
execution, each processor core’s hardware performance counter values are gathered
periodically when triggered by OS timer interrupts (every 1 ms in the Linux 2.6.8 kernel).
These performance counter values are used for run-time workload activity and IPC
estimation.

Table 4.1: ThermOS Implementation [134].

| Stage | Component | Description |
|---|---|---|
| Offline | Offline computation | Given the activity factor range of on-chip processor cores, derive the look-up table, which contains the optimal voltages and frequencies yielded by Equations 4.8–4.11. |
| Online, OS | rebalance_tick() | Invoke cluster_opt() and group_opt() at the beginning of each workload migration time interval (every 20 ms). |
| Online, OS | cluster_opt() | Conduct inter-layer migration according to Guideline II. |
| Online, OS | group_opt() | Conduct intra-layer migration according to Guideline III. |
| Online, OS | scheduler_tick() | 1) Monitor the activity factors of run-time processes using hardware performance counters. 2) Determine the global power–thermal budgeting using run-time table lookup. |
| Online, hardware | Local DVFS | Proactive distributed DVFS based on global guidance and local variation. |
| Online, hardware | Local clock | Reactive distributed clock throttling to guarantee thermal safety. |
4.4.2.3 Distributed Thermal-Aware Workload Migration
ThermOS contains a distributed online workload migration technique to support
performance optimization. The proposed technique follows the guidelines derived in
Section 4.4.1 and carefully handles 3D CMP inter-layer thermal heterogeneity and
run-time workload heterogeneity. ThermOS uses a distributed approach that swaps
jobs with high IPCs to processor cores with higher thermal efficiencies.
Consider two vertically-adjacent processor cores: Core I and Core K. Assume
Core K has higher cooling efficiency than Core I. To optimize instruction throughput,
ThermOS compares the jobs stored in each processor core’s job queue. It first identifies
the lowest-IPC job ($IPC_{MIN,K}$) on Core K and the highest-IPC job ($IPC_{MAX,I}$)
on Core I. If $IPC_{MIN,K} < IPC_{MAX,I}$, ThermOS swaps the corresponding jobs. Intra-layer
thermal heterogeneity and thermal correlation are small. Therefore, ThermOS
balances the intra-layer IPC distribution to optimize instruction throughput. Aver-
age IPCs of jobs on horizontally-adjacent cores are compared. If appropriate, they
are swapped to further balance the distribution. The proposed distributed thermal-
aware workload migration technique has been integrated within the default Linux
kernel workload balancing policy. In the current implementation, workload migration
occurs every 20 ms.
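
The inter-layer swap rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Job` and `Core` classes, the function name `inter_layer_swap`, and the sample IPC values are hypothetical and are not taken from the ThermOS kernel code.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    ipc: float  # estimated instructions per cycle (hypothetical field)


@dataclass
class Core:
    name: str
    jobs: list = field(default_factory=list)


def inter_layer_swap(cool_core, hot_core):
    """Swap the lowest-IPC job on the better-cooled core with the
    highest-IPC job on the worse-cooled core when doing so moves the
    higher-IPC job onto the better-cooled core.

    Returns True if a swap was performed.
    """
    if not cool_core.jobs or not hot_core.jobs:
        return False
    lowest = min(cool_core.jobs, key=lambda job: job.ipc)
    highest = max(hot_core.jobs, key=lambda job: job.ipc)
    if lowest.ipc < highest.ipc:
        cool_core.jobs.remove(lowest)
        hot_core.jobs.remove(highest)
        cool_core.jobs.append(highest)
        hot_core.jobs.append(lowest)
        return True
    return False


# Core K is assumed to have the higher cooling efficiency.
core_k = Core("K", [Job("mcf", 1.25), Job("gzip", 2.78)])
core_i = Core("I", [Job("gcc", 3.36), Job("vpr", 1.47)])
first_swap = inter_layer_swap(core_k, core_i)
second_swap = inter_layer_swap(core_k, core_i)
```

Calling the function repeatedly stops swapping once every job on the better-cooled core has an IPC at least as high as every job on the other core.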
4.4.2.4 Global Power–Thermal Budgeting
ThermOS dynamically adjusts the power–thermal budgets of processor cores to
optimize 3D CMP performance. Following the guidelines in Section 4.4.1, ThermOS
balances the power–thermal budget assignment among processor cores in the same
layer. Equations 4.8–4.11 are used to guide inter-layer power–thermal budgeting. The
leakage-temperature dependency introduces temperature variables on both sides of
Equation 4.10. Solving this equation requires numerical iteration and detailed chip-
package thermal analysis, which are computationally intensive. To minimize run-time
overhead, we have developed a hybrid offline/online budgeting technique.
Given the switching activity (or IPC) range of the workload, the optimal voltage
and frequency settings for vertically-aligned processor cores are pre-computed. The
offline component of the budgeting algorithm is iterative. During each iteration, based
on the IPC and the switching activity of each processor core, Equations 4.8–4.11 are
used to determine the optimal processor core power–thermal budgets. Thermal anal-
ysis is then used to estimate the 3D CMP thermal profile and update the leakage
power profile estimate. This process iterates until the chip-package thermal profile
converges, subject to feedback from temperature-dependent leakage power consump-
tion. The final voltage and frequency configurations are stored in a look-up table
for efficient use during online power–thermal budgeting. Given that the number of
processor layers is $L$ and the number of activity factor settings is $n$, the lookup table
has $n^L$ entries. Increasing $n$, i.e., the resolution of the activity factor index, improves
performance but increases storage overhead, as demonstrated in Section 4.6.4.2. In
ThermOS, run-time power–thermal budgeting is implemented in the Linux kernel and
invoked periodically. Periods ranging from 1 ms to 100 ms are currently supported.
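
The offline loop alternates between a power estimate and a thermal estimate until the two agree. The sketch below shows only that fixed-point structure. It does not implement Equations 4.8–4.11 or the chip-package thermal analysis: the single thermal resistance, the exponential leakage model, and every parameter value are assumptions chosen for illustration.

```python
import math


def leakage_power(temp_c, t_ref=45.0, leak_ref=2.0, leak_coeff=0.01):
    """Toy temperature-dependent leakage model in watts (assumed form)."""
    return leak_ref * math.exp(leak_coeff * (temp_c - t_ref))


def converge_temperature(p_dynamic, t_ambient=45.0, r_thermal=1.0,
                         tol=1e-6, max_iter=100):
    """Iterate temperature -> leakage -> temperature until it stops changing.

    A single lumped thermal resistance (K/W) stands in for the detailed
    chip-package thermal analysis used in the thesis.
    """
    temp = t_ambient
    for _ in range(max_iter):
        new_temp = t_ambient + r_thermal * (p_dynamic + leakage_power(temp))
        if abs(new_temp - temp) < tol:
            return new_temp
        temp = new_temp
    raise RuntimeError("thermal profile did not converge")


steady_temp = converge_temperature(p_dynamic=20.0)
print(f"converged temperature: {steady_temp:.2f} C")
```

In the actual flow, the converged voltage and frequency settings for each activity factor combination would then be written into the lookup table.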
4.4.2.5 Distributed Run-Time Thermal Management
ThermOS uses distributed run-time thermal management to honor the power and
thermal budgets described in Section 4.4.2.4 and adhere to a temperature constraint.
Periodically, each processor core adjusts its voltage and frequency based on its as-
signed power–thermal budget. However, transient variations may not be immediately
detected by the OS. In order to honor the temperature constraint, ThermOS uses
local dynamic voltage and frequency scaling (DVFS) and clock throttling to react
to transient variation with lower latency than global power–thermal budgeting. Ta-
ble 4.2 compares these two widely-used power management techniques. DVFS has
high area overhead, mainly due to complex power supply circuitry and the need for
off-chip capacitors and inductors for each independent voltage domain. It also has
a higher response latency than clock throttling. For modern high-performance mi-
croprocessors equipped with DVFS, the voltage transition rate is in the range of
10 mV/µs [51]. Clock throttling, on the other hand, has low area overhead and low
latency. However, DVFS has less performance impact per unit power reduction than
clock throttling, thanks to the superlinear dependence of power on voltage. Note that
most modern high-performance processors already support DVFS. We are proposing
to use this existing DVFS hardware to the best effect. In ThermOS, local DVFS con-
tinuously tracks temperature changes and clock throttling is used as a final defense
to guarantee thermal safety.
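
One way to picture the division of labor between local DVFS and clock throttling is a per-core decision made every control period. The sketch below is a hypothetical policy, not the controller used in ThermOS: the function name, the 1 °C guard band, and the integer operating-point index are all assumptions.

```python
def local_thermal_step(temp_c, level, budget_level,
                       limit_c=85.0, margin_c=1.0):
    """Return (new_level, throttle) for one control period of one core.

    level indexes an ordered list of voltage/frequency operating points
    (0 is the slowest). budget_level is the highest level allowed by the
    global power-thermal budget.
    """
    if temp_c >= limit_c:
        # Final defense: gate the clock and also step the V/f level down.
        return max(level - 1, 0), True
    if temp_c >= limit_c - margin_c:
        # Close to the limit: DVFS alone backs off one step.
        return max(level - 1, 0), False
    # Comfortably below the limit: climb back toward the assigned budget.
    return min(level + 1, budget_level), False


decisions = [
    local_thermal_step(85.2, level=3, budget_level=5),
    local_thermal_step(84.5, level=3, budget_level=5),
    local_thermal_step(80.0, level=3, budget_level=5),
]
```

DVFS acts in every period, while the throttle flag is raised only when the temperature limit is actually reached, which mirrors the "continuously engaged" versus "final defense" roles described above.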
Table 4.2: DVFS and Clock Throttling Comparison [134].

| | Area overhead | Response | Performance impact |
|---|---|---|---|
| DVFS | High | Slow | Low |
| Clock throttling | Low | Fast | High |

4.5 Experimental Setup
This section describes the experimental setup used to evaluate the proposed 3D
CMP dynamic thermal management techniques. We describe our simulation and OS
infrastructure, 3D chip and package models, and benchmark suites.
4.5.1 Infrastructure
Performance and temperature estimation for 3D CMP architectures is challenging.
Estimating spatial and temporal thermal profiles requires time-varying power profiles.
This, in turn, requires timing and power analysis. To accurately estimate the run-time
characteristics of 3D CMPs, we developed a full-system out-of-order multiprocessor
simulation environment with integrated processor performance, power, and thermal
models.
4.5.1.1 Full-System Simulation Setup
We use the M5 Full System Simulator [11]. M5 provides a detailed, cycle-accurate,
out-of-order simulation mode and a faster functional simulation mode. We use a com-
bination of full-system checkpoints and the functional simulation mode to boot the
system and fast-forward past the initialization portion of our benchmarks. We then
switch to detailed simulation mode to evaluate thermal and performance characteristics.

Table 4.3: Design Parameters for Alpha 21264 [134].

| Alpha 21264 Configuration (90 nm) | |
|---|---|
| Die size | 4.56×4.56 mm² |
| Frequency and Voltage | 2 GHz, 1.2 V |
| Instruction Queue | 64 entries |
| Functional Units | 4 IXU, 2 FPU, 1 BPU |
| Physical Registers | 80 GPR, 72 FPR |
| Branch Predictor | 1 K local, 4 K global |
| Memory Hierarchy | |
| L1 DCache/core | 32 KB, 2-way, 64 B blocks, 3 cycle lat. |
| L1 ICache/core | 64 KB, 2-way, 64 B blocks, 1 cycle lat. |
| Shared L2 Cache | 16 MB, 8-way LRU, 64 B blocks, 25 cycle lat. |

Table 4.4: 3D Package Setup [134].

| Layer | Thermal cond. (W/mK) | Heat cap. (J/m³K) | Depth (µm) |
|---|---|---|---|
| Eff. Active Layer (Silicon) | 160.11 | 1.66×10⁶ | 50 |
| Eff. Interface Layer (Polyimide) | 6.83 | 3.99×10⁶ | 10 |
| Heatsink (Cu) | 400 | 3.55×10⁶ | 6,900 |
| Thermal Grease [94] | 3–5 (5 used) | 4×10⁶* | 50 |

* From configuration used in HotSpot [99].

We added a Wattch-based EV6 power model to M5 [15], scaled to a 90 nm process.
Our cache power model is based on CACTI [106]. Static power consumption was
estimated using an area-based, temperature-sensitive leakage model [103]. A 3D
frequency-domain dynamic thermal analysis package was used [125]. Each active
layer was modeled using numerous thermal elements.
4.5.1.2 Processor Architecture
There are two ways to stack device layers: face-to-face and face-to-back. For
designs with more than two layers, face-to-back bonding decreases worst-case inter-
wafer via delay. We evaluate a three-layer front-to-back CMP structure. As shown
in Figure 4.1, there are eight Alpha 21264 microprocessor cores in the top two layers.
Each layer contains four microprocessor cores. Layers are connected with polyimide
glue. There is 50 µm of thermal grease between the heatsink and die. Parameters for
thermal grease and interface material follow Samson et al. [94].
Each processor core has a 32 KB L1 data cache and a 64 KB L1 instruction cache.
There is a 16 MB shared L2 cache on Layer 2 and 1,024 MB of main memory. A
90 nm technology is modeled. Details can be found in Table 4.3 and Table 4.4.
We have accounted for inter-layer vias in the thermal model in the following way.
The via density in a region follows $\rho_{via} = n A_{via}/(wh)$, where $n$ is the number of vias
in the region, $A_{via}$ is the cross-sectional area of each via, $w$ is the width of the region,
and $h$ is the height of the region. The relationship between via density and effective
vertical thermal conductivity follows:
$$K_{eff} = \rho_{via} K_{via} + (1 - \rho_{via}) K_{layer} \tag{4.17}$$
where $K_{via}$ is the thermal conductivity of the via material and $K_{layer}$ is the thermal
conductivity of the region without any vias. Here, the via is assumed to be copper
with a thermal conductivity of 400 W/mK. A typical via size is 15 µm×15 µm.
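
The via density definition and Equation 4.17 translate directly into code. In the sketch below, the function names, the 1 mm × 1 mm region with 100 vias, and the via-free layer conductivity of 150 W/mK are illustrative assumptions. The copper conductivity, the via size, and the 0.64% via area fraction come from the surrounding text.

```python
def via_density(n_vias, via_area, width, height):
    """Fraction of a region's area occupied by vias (rho_via)."""
    return n_vias * via_area / (width * height)


def effective_conductivity(rho_via, k_via, k_layer):
    """Effective vertical thermal conductivity of a region (Equation 4.17)."""
    return rho_via * k_via + (1.0 - rho_via) * k_layer


K_COPPER = 400.0  # W/mK, via material assumed in the text

# Hypothetical region: 100 vias of 15 um x 15 um in a 1000 um x 1000 um area.
rho_example = via_density(100, 15.0 * 15.0, 1000.0, 1000.0)

# Via area fraction quoted for the Alpha 21264 core, with an assumed
# via-free layer conductivity of 150 W/mK (not a value from the thesis).
k_eff_example = effective_conductivity(0.0064, K_COPPER, 150.0)

print(f"rho_via = {rho_example:.4f}, K_eff = {k_eff_example:.2f} W/mK")
```

Because the via fraction is small, the effective conductivity stays close to that of the via-free layer, with the copper term contributing a fixed 2.56 W/mK at a 0.64% fraction.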
For the Alpha 21264, there are 587 package pins (389 die pins). Interconnect
vias use 0.64% of the core area. This results in the effective bulk silicon layer and
interface layer thermal conductivities reported in Table 4.4. There are three types of
heat sinks: extruded, folded-fin, and integrated vapor-chamber. In this chapter, we
assume an extruded copper heat sink with a thermal conductivity of 400 W/mK [116].
4.5.1.3 Operating System
The ThermOS run-time thermal management algorithms are implemented within
the Linux 2.6.8 kernel. We made two main changes to the kernel:
• Performance-counter based power modeling: We enable OS-level power estima-
tion using performance counters. Hardware event counters of the sort typical
for modern processors were added to M5. A regression-based power model was
added to the OS [52].
• Power–thermal budgeting, task migration, and thermal management: The pro-
posed power–thermal budgeting and temperature-aware task migration tech-
niques were implemented in the Linux kernel. We modified M5 to support
kernel control of DVFS, clock throttling, and temperature monitoring through
privileged machine registers.
4.5.2 Benchmark Suites
Multithreaded and multiprogrammed benchmarks from SPEC2000, MediaBench,
ALPBench [66], and SPLASH2 [100] are used. Phansalkar et al. did a detailed analysis
of SPEC2000 and found that it can be divided into different groups based on several
benchmark-specific metrics [86]. In order to build a complete set of test cases for our
proposed techniques, we selected two benchmark-specific metrics: IPC and expected
temperature variation. Although the absolute values of these metrics depend on
microarchitectural characteristics, their relative differences in a set of benchmarks
are mostly micro-architecture independent.
Table 4.5: Benchmark Characteristics [134].

| Group | Name | Avg. IPC | Avg. Pow. (W) | Max. T | Max. δT |
|---|---|---|---|---|---|
| SPEC High IPC | gcc | 3.36 | 14.67 | 64.88 | 0.20 |
| | applu | 3.13 | 14.37 | 65.64 | 0.12 |
| | gzip | 2.78 | 13.34 | 63.49 | 0.34 |
| | mgrid | 2.58 | 13.66 | 61.84 | 0.31 |
| SPEC Low IPC | twolf | 1.58 | 11.33 | 64.30 | 0.19 |
| | parser | 1.55 | 10.41 | 60.70 | 0.28 |
| | vpr | 1.47 | 10.63 | 60.43 | 0.29 |
| | mcf | 1.25 | 10.91 | 63.79 | 0.25 |
| Media High IPC | gsmenc | 3.10 | 13.50 | 63.38 | 0.09 |
| | jpegdec | 2.72 | 13.42 | 65.89 | 0.13 |
| Media Low IPC | g721enc | 1.94 | 11.91 | 61.39 | 0.08 |
| Multithreaded (two threads) | MPGenc | 2.95 | 14.34 | 68.78 | 0.20 |
| | Sphinx3 | 1.13 | 9.93 | 61.68 | 0.02 |
| | cholesky | 2.83 | 14.27 | 70.57 | 0.32 |
| | lu | 2.26 | 12.10 | 66.97 | 0.08 |
| | radix | 0.84 | 5.81 | 57.17 | 0.28 |
| | water-nsquared | 1.85 | 11.99 | 65.32 | 0.12 |
| | water-spatial | 1.74 | 10.57 | 62.35 | 0.08 |

Table 4.6: Benchmark Suites [134].

Multiprogrammed test setups

| Group | Filename | Clusters | Benchmarks |
|---|---|---|---|
| SPEC | hv-hipc | High T var., high IPC | gzip, mgrid |
| | lv-hipc | Low T var., high IPC | applu, gcc |
| | hv-lipc | High T var., low IPC | parser, vpr |
| | lv-lipc | Low T var., low IPC | twolf, mcf |
| | hv-mipc1 | High T var., mixed IPC | gzip, parser |
| | hv-mipc2 | High T var., mixed IPC | mgrid, vpr |
| | lv-mipc1 | Low T var., mixed IPC | applu, mcf |
| | lv-mipc2 | Low T var., mixed IPC | gcc, twolf |
| Media | media-hipc | High IPC | jpegdec, gsmenc |
| | media-mipc | Mixed IPC | gsmenc, g721enc |

Multithreaded test setups: MPGenc, sphinx3, cholesky, lu, radix, water-nsquared, water-spatial

• IPC: IPC is approximately linearly related to power consumption, which has
a strong influence on temperature.
• Expected temperature variation: The main goal of the proposed 3D CMP ther-
mal management technique is to maximize performance subject to a tempera-
ture constraint. In order to evaluate it, we have selected a set of benchmarks
with a wide range of spatial and temporal thermal characteristics.
Based on these metrics, the benchmarks were analyzed, yielding the results in
Table 4.5. Dynamic power traces were gathered during 500 ms to determine average
power consumption, the temporal average of peak temperature, and the maximum
peak temperature variation.
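
The claim that IPC is approximately linearly related to power can be checked against the averages reported in Table 4.5. The sketch below fits an ordinary least-squares line to those (IPC, power) pairs. The helper name `fit_line` is an assumption, and this fit is only a sanity check on the table, not the regression model of [52] used inside the OS.

```python
import math

# (average IPC, average power in W) pairs transcribed from Table 4.5.
BENCHMARKS = {
    "gcc": (3.36, 14.67), "applu": (3.13, 14.37), "gzip": (2.78, 13.34),
    "mgrid": (2.58, 13.66), "twolf": (1.58, 11.33), "parser": (1.55, 10.41),
    "vpr": (1.47, 10.63), "mcf": (1.25, 10.91), "gsmenc": (3.10, 13.50),
    "jpegdec": (2.72, 13.42), "g721enc": (1.94, 11.91),
    "MPGenc": (2.95, 14.34), "Sphinx3": (1.13, 9.93),
    "cholesky": (2.83, 14.27), "lu": (2.26, 12.10), "radix": (0.84, 5.81),
    "water-nsquared": (1.85, 11.99), "water-spatial": (1.74, 10.57),
}


def fit_line(points):
    """Least-squares fit y = slope * x + intercept, plus Pearson's r."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(list(BENCHMARKS.values()))
print(f"power ~= {slope:.2f} * IPC + {intercept:.2f} W (r = {r:.2f})")
```

A correlation coefficient close to 1 supports using IPC as a proxy for power when grouping benchmarks.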
We created 17 test setups (see Table 4.6). Ten of these were for multiprogrammed
benchmarks. Each contains mixes of benchmarks with high and low temperature
variation and IPC. Each test setup contains two SPEC or Media benchmarks. For
multithreaded benchmarks, seven test setups are created. Each test setup contains
one ALPBench or SPLASH2 benchmark with two parallel threads. During experi-
ments, each run contains eight copies of each test setup, i.e., 16 processes/threads in
total with two processes or threads per core on average.
4.6 Experimental Results
This section evaluates ThermOS, the proposed run-time thermal management
solution for 3D CMPs.
4.6.1 Comparison of ThermOS With Alternatives
In this section, we first contrast ThermOS with solutions used in existing pro-
cessors. Then we provide a detailed quantitative comparison with a state-of-the-art
continuously-engaged thermal management technique. The following experiments use
85 °C as a predefined thermal constraint.
Most thermal management techniques used in practice react to emergencies in-
stead of being continuously engaged. They detect dangerously-high temperatures and
reduce power consumption, generally via hardware clock throttling. Such solutions
are adequate when temperatures approach their limits only very rarely. However,
high power densities and constraints on cooling costs require proactive thermal man-
agement. Some researchers have moved in this direction.
Donald and Martonosi [28] proposed a distributed continuously-engaged thermal
management technique for 2D CMPs. Their approach is based on closed-loop control
theory, and continuously adjusts the voltage and frequency of each processor core
to maintain safe temperatures. Each core has its own controller and the controllers
act independently, without knowledge of the conditions of other cores. This per-
mits significantly better performance than reactive approaches because DVFS can
generally reduce power consumption by the same amount as clock throttling with
a smaller performance penalty. In fact, their results indicate that, compared with
a stop-go based thermal control policy, distributed DVFS improves throughput by
2.5×. However, independent local control has limitations. The power consumed in
one processor can impact the temperatures of other processors in nonuniform ways.
As a result, continuously-engaged global control can permit better performance than
continuously-engaged local control. This is especially true for 3D architectures, in
which the power consumption of a particular processor core has great impact on the
temperature of vertically-aligned cores and relatively less impact on other cores.
ThermOS uses continuously-engaged, distributed global/local control to maximize
performance given a temperature bound. It supports both 3D and 2D architectures.
It differs from state-of-the-art temperature control techniques in two primary ways.
First, it uses global power budgeting that takes into account the thermal interaction
between processor cores. Second, it directs temperature-aware workload migration of
threads among processor cores.
Figure 4.4: Comparison of ThermOS and Distributed Approach [28, 134]. (Plot of throughput (BIPS) for each test setup in Table 4.6; series: ThermOS, Distributed approach.)

Figure 4.4 shows 3D CMP run-time instruction throughput (BIPS: billion instructions
per second) achieved by ThermOS and by Donald and Martonosi’s approach.
Compared to the distributed local approach, ThermOS improves instruction throughput
by 29.84% on average (ranging from 15.22% to 53.79%). This can be explained
as follows. In 3D CMPs, the strong thermal correlation among inter-layer vertically-
aligned processor cores has significant impact on the temperature of the processor
layer farthest from the heat sink. Using the proposed power–thermal budgeting
and thermal-aware workload migration techniques, ThermOS determines appropri-
ate power budgets for each group of vertically-aligned processor cores. In addition, it
uses DVFS to optimize the power–thermal efficiency of each processor core. Together,
these techniques maximize overall throughput. Donald and Martonosi’s work, on
the other hand, is a distributed, processor-local technique. Using this technique, each
processor core regulates its power and performance to ensure local thermal safety
without considering the thermal impact on neighboring cores. As a result, vertically-
aligned processor cores are unable to collaboratively share the power and thermal
budget, which can reduce CMP performance. In other words, when a distributed,
local management technique is used, power consumption on processor cores near the
heatsink can push processor cores farther from the heatsink to their thermal limits.

Figure 4.5: Reduction in Temperature Constraint Violations due to Local DVFS and Elimination of Temperature Constraint Violations due to Clock Throttling [134]. (Plot of thermal violation (%) for each test setup in Table 4.6; series: w local DVFS, w clock throttling; w local DVFS, w/o clock throttling; w/o local DVFS, w/o clock throttling.)

4.6.2 Efficiency Impact of Guaranteeing Thermal Safety
In this section, we establish an upper bound on performance by evaluating a
thermal management technique with near-optimal performance, but vulnerability to
temperature constraint violations due to transient changes in workload. We then
show that there is only a small performance reduction resulting from the additional
management techniques ThermOS uses to guarantee thermal safety.
ThermOS uses the temperature-aware workload migration and global power–
thermal budgeting guidelines derived in Section 4.4.1. These techniques can poten-
tially offer near-optimal run-time performance subject to a temperature constraint.
However, they do not immediately react to transient workload variation occurring
in individual processor cores, which may cause run-time temperature constraint vi-
olations. ThermOS uses distributed run-time thermal management techniques to
guarantee thermal safety, i.e., local DVFS and clock throttling dynamically adjust
the voltage and frequency of each processor core to eliminate thermal emergencies.
Compared to DVFS, clock throttling is more responsive but degrades performance
more for the same thermal improvement. Therefore, in ThermOS, DVFS is continu-
ously engaged and clock throttling is invoked only when local DVFS cannot guarantee
thermal safety. These techniques, however, may cause the run-time operations of the
processor cores to deviate from the guidelines derived in Section 4.4.1. Straying from
these guidelines has the potential to reduce performance.
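
A back-of-the-envelope comparison shows why DVFS costs less performance than clock throttling for the same power reduction. The sketch rests on textbook simplifications that are not taken from the thesis: dynamic power scales as $V^2 f$, voltage is scaled linearly with frequency (so power scales as $f^3$), leakage is ignored, and performance is proportional to effective clock rate.

```python
def throttling_speed(power_fraction):
    """Relative performance when clock gating cuts power to power_fraction.

    Power is assumed proportional to the fraction of cycles that are
    clocked, so performance falls linearly with power.
    """
    return power_fraction


def dvfs_speed(power_fraction):
    """Relative performance when DVFS cuts power to power_fraction.

    With V scaled in proportion to f, dynamic power follows f**3, so the
    remaining frequency is the cube root of the remaining power.
    """
    return power_fraction ** (1.0 / 3.0)


target = 0.8  # keep 80% of the original power
print(f"throttling keeps {throttling_speed(target):.3f} of performance")
print(f"DVFS keeps       {dvfs_speed(target):.3f} of performance")
```

Under these assumptions, a 20% power cut costs 20% of performance with throttling but only about 7% with DVFS, which is consistent with engaging DVFS continuously and reserving throttling for emergencies.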
Figure 4.5 illustrates the levels of thermal safety achieved by various control tech-
niques. As shown in this figure, when distributed control is disabled, the voltage and
frequency of each processor core is solely controlled by global power–thermal budget-
ing, which does not consider the temporal workload variation within each processor
core. This local workload variation can cause significant run-time power variation,
and therefore temperature constraint violations. Local DVFS can adapt to rapid
workload variation occurring within each processor core and adjust voltage and fre-
quency accordingly, thereby reducing run-time thermal emergencies. When clock
throttling is also enabled, processor thermal emergencies are completely eliminated
(see Figure 4.5).
Figure 4.6: Temporal Temperature Variation for Eight Processor Cores (P0–P7) Running lv-mipc2 Using Local DVFS w.o. (Top) and w. (Bottom) Clock Throttling [134]. (Temperature profiles shown in pairs P0/P4, P1/P5, P2/P6, and P3/P7; x-axis: Time (ms), 300 to 500; y-axis: Temperature (°C).)

Figure 4.7: Negligible CMP Instruction Throughput Reduction Resulting from Local DVFS and Clock Throttling [134]. (Plot of normalized throughput (BIPS) for each test setup in Table 4.6; series: w local DVFS, w clock throttling; w local DVFS, w/o clock throttling; w/o local DVFS, w/o clock throttling.)

To further illustrate the effectiveness of the distributed run-time control tech-
niques, Figure 4.6 shows the run-time thermal profiles of eight processor cores when
running the lv-mipc2 benchmark, with and without local clock throttling. Proces-
sors 0–3 are adjacent to the heatsink and processors 4–7 are farther from it. Local
DVFS balances the CMP thermal profile, and run-time temperature constraint violations
(exceeding 85 °C, a predefined thermal threshold used in this experiment) occur only
rarely. When both local DVFS and clock throttling are enabled, the temperature
constraint is never violated.
Figure 4.7 indicates that the performance penalty introduced by the distributed
control techniques required to guarantee thermal safety is low. To help quantify the
performance impact, we normalize the CMP throughput to the value achieved by
global power–thermal budgeting and then evaluate the CMP throughput with local
DVFS only and with both local DVFS and clock throttling. These results indicate that
local DVFS degrades instruction throughput by 0.55% on average. Since local DVFS
is capable of eliminating most run-time thermal emergencies, clock throttling is rarely
invoked. As shown in Figure 4.7, enabling both local DVFS and clock throttling
reduces instruction throughput by only 0.60% on average.
In summary, the proposed distributed run-time thermal control technique achieves
thermal safety with little performance impact.
4.6.3 Robustness to Changes in 3D Integration
To show the robustness of ThermOS to variation in 3D integration style,
we evaluated its performance improvement for CMPs using front-to-back
and front-to-front wafer integration (see Section 4.5.1). We simulated the proposed
technique and Donald and Martonosi’s distributed local approach [28] for both
integration styles using all benchmark mixes shown in Table 4.6. The average CMP
instruction throughput improvement was 29.84% for front-to-back integration and
23.77% for front-to-front integration. For all combinations of benchmarks and
packages, the instruction throughput improvements were greater than 7%. We
conclude that ThermOS permits substantial improvements in performance over Donald
and Martonosi’s distributed local technique for different 3D integration styles.
4.6.4 Scalability Analysis of ThermOS
ThermOS uses distributed temperature-aware workload migration, global power–thermal
budgeting, and distributed run-time thermal control techniques to optimize
3D CMP throughput and guarantee thermal safety. In contrast with purely local
distributed techniques, run-time power–thermal budgeting is global. This might raise
concerns about the scalability of ThermOS when used on many-core 3D CMPs. In this
section, we evaluate the scalability of the proposed global power–thermal budgeting
technique.

Figure 4.8: Impact of Global Guidance Interval [134]. (Plot of throughput (BIPS) for hv-hipc, hv-lipc, hv-mipc1, hv-mipc2, lv-hipc, cholesky, and radix; series: 1 ms, 10 ms, 50 ms, 100 ms.)

4.6.4.1 Performance Impact
ThermOS periodically decides power–thermal budgets for processor cores. This
involves inter-layer and intra-layer assignment. Run-time inter-layer assignment uses
efficient table lookup. Intra-layer assignment uses an efficient homogeneous assign-
ment policy, i.e., processor cores within the same layer are assigned the same power–
thermal budgets. In the current setup, i.e., an eight-core 3D CMP with a 1 ms global
guidance interval, detailed simulation shows that the overall run-time overhead intro-
duced by global power–thermal budgeting is only 0.22%.
The run-time overhead of global power–thermal budgeting is inversely proportional
to the run-time global guidance/budgeting interval. In general, shorter global guid-
ance intervals can more accurately track run-time workload variation but may intro-
duce more run-time overhead and communication contention when aggregating data
from different CMP cores. It might therefore be useful to reduce this overhead by
increasing the global guidance interval.
In the current setup, a 1 ms guidance interval is used. This is frequent enough to
allow adjustments in global power–thermal budget before temporal workload variation
can produce large temperature changes, i.e., a higher frequency is unnecessary. To
evaluate the impact of increasing global guidance interval on system performance,
we run all six benchmarks with high workload variation from Table 4.6. One low-
variation benchmark (lv-hipc) is also included for the sake of comparison. The results
are shown in Figure 4.8. They indicate that, for guidance intervals up to and including
100 ms, ThermOS maintains nearly-identical performance. Only hv-hipc, cholesky,
and radix experience noticeable performance degradation, due to their high temporal
workload variation. However, changing the global guidance interval from 1 ms to
100 ms only reduces CMP instruction throughput by 1.81%, 1.06%, and 2.61% for
hv-hipc, cholesky, and radix, respectively. We conclude that even if it were necessary
to increase the global guidance interval by two orders of magnitude in order to maintain
low global power–thermal budgeting run-time overhead in many-core 3D CMPs, there
would be little reduction in thermally-safe performance.
Figure 4.9: Impact of Lookup Table Size [134]. (Plot of throughput (BIPS) for each test setup in Table 4.6; series: 6×6 lookup table, 11×11 lookup table, 51×51 lookup table.)

4.6.4.2 Storage Impact
As described in Section 4.4.2.4, ThermOS uses an offline iterative budgeting al-
gorithm to precompute some power–thermal budgeting decisions, which are stored
using a lookup table in the main memory for efficient run-time usage. This lookup
table has $n^L$ entries. Each entry requires 4 B storage. L is the number of processor
layers. It is expected that the number of processor layers in 3D CMPs will be limited.
n is the number of activity factor settings, which affects the power–thermal budgeting
resolution. Higher resolution improves the accuracy of the run-time power–thermal
budgeting decisions, but also increases the storage requirements for the table. In the
current setup, we use a two-dimensional lookup table with 51×51 entries (10.4 KB)
which provides sufficient resolution for accurate power–thermal budgeting.
Figure 4.10: Impact of Floorplan Rotation [134]. (Bar chart omitted: throughput (BIPS) for ThermOS and the distributed approach, each with and without floorplan rotation, for the benchmarks hv-hipc, hv-lipc, hv-mipc1, hv-mipc2, lv-hipc, lv-lipc, lv-mipc1, lv-mipc2, media-hipc, media-mipc, MPGenc, Sphinx3, cholesky, lu, radix, water-nsquared, and water-spatial.)
It might be useful to decrease lookup table resolution for many-core systems in
order to limit storage overhead. We evaluated the impact of decreasing lookup table
resolution on thermally-safe CMP performance by running all benchmark mixes using
51×51, 11×11, and 6×6 tables. As shown in Figure 4.9, compared to the 51×51
lookup table, the 11×11 lookup table setting reduces the memory usage from 10,404 B
to 484 B, with average CMP instruction throughput reductions of 0.75%. When the
table is reduced to 6×6 entries, memory usage decreases to 144 B, with average CMP
instruction throughput reductions of 2.87%. We conclude that ThermOS requires
little storage and that its performance degrades slowly with reduced lookup table
size.
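The storage figures in this subsection follow from the entry count (number of activity factor settings raised to the number of processor layers) and the 4 B entry size. The Python sketch below reproduces that arithmetic; the function name `lookup_table_bytes` and its parameter names are illustrative assumptions and are not part of ThermOS.

```python
def lookup_table_bytes(n_settings, n_layers, bytes_per_entry=4):
    """Storage for a table with n_settings**n_layers entries.

    Hypothetical helper: n_settings is the number of activity factor
    settings (n), n_layers is the number of processor layers (L).
    """
    return (n_settings ** n_layers) * bytes_per_entry


if __name__ == "__main__":
    # Two processor layers, matching the two-dimensional tables evaluated above.
    for n in (51, 11, 6):
        print(f"{n}x{n} table: {lookup_table_bytes(n, 2)} B")
```

For two processor layers, the helper returns the 10,404 B, 484 B, and 144 B sizes quoted above. Because the entry count grows exponentially in the number of layers, the same calculation gives a quick estimate of how storage would scale for deeper stacks.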
4.6.5 Interaction with 3D CMP Floorplan Optimization
This experiment evaluates ThermOS for 3D CMPs with different floorplans. CMP
thermal profile is strongly influenced by on-die power distribution. In 3D CMPs, inter-
layer vertically-aligned processor cores have strong thermal correlation. If all cores
have identical floorplans, functional units with high power densities are vertically-
aligned, potentially creating local thermal hotspots. Intelligent inter-layer floorplan
arrangement can potentially balance inter-layer power profile and minimize chip peak
temperature. Using the three-layer 3D CMP setup with two processor core layers and
one L2 cache layer, detailed thermal analysis shows that, by rotating the floorplan of
top-layer processor cores by 180 degrees, chip power profile is more balanced, intra-core
local hotspots are minimized, and chip peak temperature is reduced by 1.99 °C
on average and 4.24 °C maximum among the multiprogramming and multithreading
benchmarks. Figure 4.10 compares ThermOS and the baseline distributed technique,
with and without floorplan rotation. It shows that both run-time techniques can
leverage the temperature reduction offered by floorplan rotation and achieve higher
throughput under the same temperature constraint. In addition, ThermOS consis-
tently outperforms the distributed technique by 31.45% and 29.84% on average with
and without floorplan rotation, respectively.
4.7 Conclusions
3D integration has the potential to significantly improve performance and inte-
gration density. However, it will also increase power density, thereby increasing the
importance of using continuously-engaged thermal management techniques. It will
also increase the heterogeneity in thermal interaction among processor cores. This
requires careful consideration during thermal management policy design.
We have developed a mathematical formulation for optimizing workload assign-
ment, power–thermal budgeting, and voltage mode selection for 3D CMP thermal
management. This formulation has been used to develop a continuously-engaged
hardware–software thermal management solution for 3D CMPs. The proposed solu-
tion has been implemented within the Linux kernel and evaluated using full-system
3D CMP and OS simulation. Our strategy outperforms a state-of-the-art proactive
thermal management technique that does not make use of power–thermal budgeting.
Chapter 5
Characterization of Single-Electron Tunneling Transistors for Designing Low-Power Embedded Systems
Minimizing power consumption is vitally important in embedded system design;
power consumption determines battery lifespan. Ultra-low-power designs may even
permit embedded systems to operate without batteries by scavenging energy from the
environment. Moreover, managing power dissipation is now a key factor in integrated
circuit packaging and cooling. As a result, embedded system price, size, weight, and
reliability are all strongly dependent on power dissipation.
Recent developments in nanoscale devices open new alternatives for low-power
embedded system design. Among these, single-electron tunneling transistors (SETs)
hold the promise of achieving the lowest power consumption. Unfortunately, most
analysis of SETs has focused on single devices instead of architectures, making it
difficult to determine whether they are appropriate for low-power embedded systems.
Evaluating the use of SETs in large-scale digital systems requires novel architec-
tural and circuit design. SET-based design imposes numerous challenges resulting
from low driving strength, relatively large static power consumption, and the pres-
ence of reliability problems resulting from random background charge effects. We
propose a fault-tolerant, hybrid SET/CMOS, reconfigurable architecture, named Ice-
Flex, that can be tailored to specific requirements and allows trade-offs among power
consumption, performance requirements, operation temperature, fabrication cost, and
reliability. Using IceFlex as a testbed, we characterize the benefits and limitations
of SETs in embedded system designs. In particular, we focus on the use of SETs
in room-temperature ultra-low-power embedded systems such as wireless sensor net-
work nodes. We also consider higher-performance applications such as multimedia
consumer electronics. We see this work as a first step in determining the potential of
ultra-low-power embedded system design using SETs. My major contributions to this
chapter are SET modeling, SET design space characterization, and characterization of the
IceFlex architecture (Sections 5.2, 5.3.1, 5.3.2.1, 5.3.2.2, 5.3.2.3, 5.3.2.5, 5.4.1.1,
and 5.4.1.3). My collaborator, Zhenyu Gu, contributed to the global/local interconnect
design and the characterization of embedded applications (Sections 5.3.2.4, 5.4.1.2,
and 5.4.2).
5.1 Introduction
Energy consumption and thermal issues are now central concerns in electronic system
design. In high-performance applications, temperature affects integration density,
performance, reliability, power consumption, and cost. For battery-powered embedded
systems, power consumption determines system lifetime. Power consumption
crises were historically solved by moving to new technologies that decreased energy
per operation, allowing increases in density and eventually performance. Power and
thermal concerns were primary motivations for replacing vacuum tubes with semicon-
ductor devices in the 1960s and replacing bipolar junction transistors with CMOS in
the 1990s. Although CMOS is the mainstream fabrication technology used today, as
IC and system integration further increase, it will reach fabrication, power consump-
tion, and thermal limits; it may soon be time for another transition to a dramatically
different technology.
Device researchers have seen the coming challenges for CMOS devices and evalu-
ated alternative technologies such as carbon nanotube transistors [29], nanowires [46],
and single-electron tunneling transistors (SETs) [70]. The International Technology
Roadmap for Semiconductors projects that SETs have the potential to achieve the
lowest projected energy per switching event of any known device ($1 \times 10^{-18}$ J) [53].
However, their use poses unique architectural, circuit design, and fabrication chal-
lenges. For example, SETs are susceptible to reliability problems caused by random
background offset charges. They have cyclic I–V curves (see Figure 5.2) that can
complicate design but permit highly-efficient implementation of some useful logic
functions that have proven inefficient using CMOS and threshold logic. Although the
fabrication of SETs capable of operating at low temperatures is now common, feature
sizes of only a few nanometers are required for room-temperature operation, making
fabrication challenging.
5.1.1 Past Work
Since the discovery of SETs in the 1980s [9, 33], there has been extensive research on
fabrication, design, and modeling of SETs [70]. SET fabrication and use in high-
sensitivity amplifiers at cryogenic temperatures has been the main research focus [25].
SETs and simple circuits with a variety of structures were proposed and fabricated
using different methods and materials [80, 105, 6]. Recently, researchers have fabri-
cated SETs that operate at room-temperature [75, 98, 84]. Various SET-based circuit
applications, such as logic [111, 112, 79, 19] and memory [126, 118, 122] have been
developed. These works provide a promising start for SET circuit design. However,
they did not provide an architectural evaluation. We do not claim to
have improved the performance of SET-based logic gates. Instead, we are the first
to develop the modules necessary to support architectural design and synthesis and
to evaluate the architectural performance and power consumption implications of using
SETs. The resulting designs demonstrate orders of magnitude improvement in power
consumption and energy efficiency compared to CMOS.
Research on SET modeling and simulation has been an active area. Monte Carlo
simulation has been widely used to model SETs. SIMON [117] and MOSES [17] are
the two most popular SET simulators. However, they are too slow for analysis of large
circuits. Uchida et al. proposed an analytical SET model and incorporated it into
SPICE [113]. Recently, Inokawa et al. extended this model to a more general form to
include asymmetric SETs [49]. Mahapatra et al. proposed a simulation framework for
hybrid SET/CMOS circuit design and analysis [73]. Their model for SET behavior
is similar to that of Uchida et al. These compact modeling techniques are efficient
enough for use in SET circuit design and analysis and closely match Monte Carlo
simulation results.
Significant challenges still remain for large-scale integration of SETs and for room-
temperature operation. SETs that operate reliably at room temperature have critical
dimensions of ∼1–10 nm. They are challenging to fabricate using current top-down
lithographic techniques. However, several exciting advances make the evaluation of
architectures for high-density logic based on SETs worthwhile. Scanning-probe mi-
croscopes can be used to create devices smaller than those using conventional lithog-
raphy [75]. Continual progress has been made on bottom-up nano-fabrication tech-
niques, where chemical techniques are used to make individual molecules with useful
electronic properties. Molecular quantum dots [40] can display SET behavior. Larger
structures, such as carbon nanotubes and nanowires, can act as SETs [6]. These
bottom-up techniques can create structures supporting room-temperature SET oper-
ation. However, more research is needed in order to integrate individual devices into
large-scale circuits. Very recent advances in graphene [35] devices show promise for
SETs. Reliable methods for cooling to very low temperatures without supplies of liq-
uid helium or nitrogen are also becoming more common [114]. For high-performance
computing, the added complexity of operating at cryogenic temperatures may not be
a limiting factor. Similarly, cryogenic temperatures are readily attained using passive
methods in outer space.
5.1.2 Contributions
In this chapter, we explore the potential use of SETs in low-power embedded
systems. In order to take advantage of the power efficiency of SETs, it is critical
to bring SET-based design to the system level, characterize the impacts of SETs on
system design metrics, and evaluate the benefits and limitations of SETs. Our work
starts from design space characterization of SET-based architectures. We evaluate
the impacts of using SETs upon architectural, circuit-level, and device-level design,
considering metrics such as energy efficiency, performance, reliability, maximum op-
erating temperature, and ease of fabrication.
Based on our evaluation of the architectural and circuit-level features that can
most effectively exploit the strengths of SETs while working within their limitations,
we propose a fault-tolerant, reconfigurable, hybrid SET/CMOS based architecture
called IceFlex. IceFlex is regular and cell-based. It is reconfigurable, permitting
compensation for fabrication defects. It incorporates flexible, modular circuits to en-
able tolerance of run-time faults. In addition to compensating for the weaknesses of
SETs, IceFlex exploits their strengths, e.g., we develop a two-SET design to imple-
ment Boolean functions that are not linearly separable.
We tailor IceFlex to both high-performance and battery-powered embedded sys-
tems and characterize its energy efficiency, performance, and power consumption by
using it for a number of instruction processors and application-specific cores. Com-
pared to CMOS-based designs, IceFlex improves energy efficiency by two orders of
magnitude for both battery-powered and high-performance applications, while main-
taining good performance. However, our results also indicate great challenges to the
use of SET-based designs in portable embedded systems. Their use will either require
advances in compact cooling technologies or the fabrication of features with sizes
approaching physical limits.
Figure 5.1: SET Structure and Schematic [133]. (Drawing omitted: a conductive island coupled to the gate (G) and an optional 2nd gate (G2), and connected to the source (S) and drain (D) through tunnel junctions. Legend: CG, gate capacitance; CG2, optional 2nd gate capacitance; CS, source tunnel junction capacitance; CD, drain tunnel junction capacitance; RS, source tunnel junction resistance; RD, drain tunnel junction resistance.)
5.2 SET Modeling
In this section, we introduce the physical properties of SETs, and discuss SET
analytical device modeling.
5.2.1 SET Basics
The operation of a single-electron tunneling device is governed by the Coulomb
charging effect. As shown in Figure 5.1, a single-electron tunneling device consists of
a nanometer-scale conductive island embedded in an insulating material. Electrons
travel between the island, source (S), and drain (D) through thin insulating tunnel
junctions. When an electron tunnels into the island, the overall electrostatic potential
of the island increases by $e^2/C_\Sigma$, where $e$ is the elementary charge and $C_\Sigma$ is the island
capacitance. For large devices, this change in potential is negligible due to the high
island capacitance $C_\Sigma$. However, for nanometer-scale islands, $C_\Sigma$ is much smaller.
As a result, the electrostatic energy change due to the addition or removal of a single
electron can be larger than the thermal energy, particularly at low temperatures.
Figure 5.2: SET Coulomb Oscillation ($C_G$ = 3.2 aF, $C_S = C_D$ = 1.0 aF, and $R_S = R_D$ = 10 MΩ) [133]. (Plot omitted: $I_{DS}$ (nA, logarithmic axis from 0.001 to 10) versus $V_{GS}$ (mV, from −60 to 80) at temperatures of 5 K, 10 K, and 20 K, with the PVC and NVC regions marked.)
Changes to SET island potential result in an energy gap at the Fermi energy,
preventing further electron tunneling. This phenomenon is called Coulomb blockade.
It prevents current from flowing between source and drain ($I_{DS} = 0$), i.e., the SET is
turned off. The Coulomb blockade effect can be overcome by changing the voltage of
a conductor capacitively coupled to the island, thereby turning tunneling on and off.
Although their transfer functions differ significantly from those of CMOS transistors,
with careful circuit design, SETs can be used to realize logic functions using circuits
analogous to CMOS, or using radically different design techniques [70].
As shown in Figure 5.1, a SET typically has four terminals. The source and
drain terminals (S, D) serve as electron reservoirs. When the SET is turned on,
electrons tunnel from one terminal, through the junction, to the conductive island.
They then tunnel through the other junction to the other terminal. Each tunneling
junction is modeled as a resistor (RS or RD) and a capacitor (CS or CD) in paral-
lel. A gate terminal (G), with coupling capacitance CG, controls the transport of
electrons. A SET may also contain an optional second gate terminal (G2), which is
generally used to tune SET $V_{GS}$ bias. The Coulomb blockade effect is maximized
when $V_{GS} = me/C_G$, where $m = 0, \pm 1, \pm 2, \ldots$ [32] because, at these voltages, the
system is in a minimal-energy state when an integer number of electrons are present
on the island. Any single tunneling event between island and either source or drain
would move the system from this state. The Coulomb blockade effect vanishes when
$m = \pm 1/2, \pm 3/2, \ldots$, i.e., when $m$ is a half-integer value, because, at these voltages,
the system is in a minimal-energy state when a half-integer number of electrons are
present on the island. In this case, a single tunneling event does not move the system
from a minimum energy state. Electrons can therefore tunnel through the island as
determined by VDS. The I–V curve of a SET is shown in Figure 5.2; drain current
changes as a function of the gate voltage, with a period of $e/C_G$. These periodic changes
are called Coulomb oscillations.
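The gate voltages at which blockade is strongest and at which it vanishes can be tabulated directly from the expressions above. The following Python sketch does this for the gate capacitance quoted in the Figure 5.2 caption (3.2 aF); the function names are illustrative assumptions, and the sketch ignores the drain-voltage and background-charge offsets that shift the real curve.

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)


def coulomb_oscillation_period(c_gate):
    """Gate-voltage period e/C_G of the Coulomb oscillations, in volts."""
    return E_CHARGE / c_gate


def blockade_voltages(c_gate, m_values=(-1, 0, 1)):
    """Gate voltages V_GS = m*e/C_G (integer m) where blockade is maximized."""
    return [m * E_CHARGE / c_gate for m in m_values]


def conduction_voltages(c_gate, m_values=(-0.5, 0.5, 1.5)):
    """Gate voltages V_GS = m*e/C_G (half-integer m) where blockade vanishes."""
    return [m * E_CHARGE / c_gate for m in m_values]


if __name__ == "__main__":
    c_g = 3.2e-18  # gate capacitance from the Figure 5.2 caption (F)
    print(f"period: {coulomb_oscillation_period(c_g) * 1e3:.2f} mV")
    print("blockade at (mV):", [round(v * 1e3, 2) for v in blockade_voltages(c_g)])
    print("conduction at (mV):", [round(v * 1e3, 2) for v in conduction_voltages(c_g)])
```

For a 3.2 aF gate the period is roughly 50 mV.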
In order to observe the Coulomb blockade effect, the following constraints must
be satisfied.
1) Since thermal fluctuations can suppress the Coulomb blockade effect, the
electrostatic charging energy, $e^2/C_\Sigma$, must be much greater than $k_BT$, where $k_B$
is Boltzmann’s constant and $T$ is the temperature. In order to ensure reliability,
$e^2/C_\Sigma \geq 10k_BT$ or the more conservative $e^2/C_\Sigma \geq 40k_BT$ constraint is enforced.
Table 5.1: Island Size Estimation [133].
               $C_\Sigma = e^2/(10k_BT)$         $C_\Sigma = e^2/(40k_BT)$
Temperature    Island        Island       Island        Island
               capacitance   diameter     capacitance   diameter
(K)            (aF)          (nm)         (aF)          (nm)
40             4.65          52.48        1.16          13.12
77             2.41          27.26        0.60          6.82
103            1.80          20.38        0.45          5.10
120            1.55          17.49        0.39          4.37
200            0.93          10.50        0.23          2.62
250            0.74          8.40         0.19          2.10
300            0.62          7.00         0.15          1.75
Assuming disc capacitor model ($C_\Sigma = 8\varepsilon r$). One side of island embedded in silicon dioxide. Other side exposed to nitrogen.
These equations imply that the maximum allowed island capacitance is inversely pro-
portional to temperature. At room temperature, an island capacitance below 1 aF
is required. Island capacitance is a function of island size. As shown in Table 5.1,
room-temperature operation requires an island size in the nanometer range, making
fabrication challenging. At present, the smallest island capacitance of a fabricated
device is around 0.15 aF [98].
2) To observe single-electron charging effects, electrons must be confined to the
island, which requires that the junction resistance be higher than the quantum resistance,
i.e., $R_S, R_D > h/e^2$, $h/e^2 = 25.8$ kΩ, where $h$ is Planck’s constant. Therefore,
SETs have high resistances and low driving currents.
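The two constraints above can be checked numerically. The Python sketch below computes the maximum island capacitance for a given temperature and charging-energy margin, the matching island diameter under the disc capacitor model of Table 5.1, and the quantum resistance. The function names are illustrative, and the effective relative permittivity of 2.5 is an assumption: the text only states that one side of the island is in silicon dioxide and the other in nitrogen, and 2.5 is the value that reproduces the diameters in Table 5.1.

```python
E_CHARGE = 1.602176634e-19   # elementary charge (C)
K_B = 1.380649e-23           # Boltzmann constant (J/K)
H_PLANCK = 6.62607015e-34    # Planck constant (J*s)
EPS_0 = 8.8541878128e-12     # vacuum permittivity (F/m)

# ASSUMPTION: effective relative permittivity, chosen to match Table 5.1.
EPS_R_EFF = 2.5


def max_island_capacitance(temp_k, margin=10.0):
    """Largest C_sigma (F) satisfying e^2/C_sigma >= margin * k_B * T."""
    return E_CHARGE ** 2 / (margin * K_B * temp_k)


def disc_diameter(c_sigma, eps_r=EPS_R_EFF):
    """Island diameter (m) from the disc model C_sigma = 8 * eps * r."""
    radius = c_sigma / (8.0 * eps_r * EPS_0)
    return 2.0 * radius


def quantum_resistance():
    """Quantum resistance h/e^2 in ohms."""
    return H_PLANCK / E_CHARGE ** 2


if __name__ == "__main__":
    for temp in (40, 77, 300):
        for margin in (10.0, 40.0):
            c = max_island_capacitance(temp, margin)
            print(f"T={temp} K, {margin:.0f} kBT: "
                  f"C={c * 1e18:.2f} aF, d={disc_diameter(c) * 1e9:.2f} nm")
    print(f"h/e^2 = {quantum_resistance() / 1e3:.1f} kOhm")
```

The capacitance values depend only on physical constants. The diameters additionally depend on the assumed permittivity and should be read as an estimate.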
In order to operate voltage-state logic, SETs must exhibit voltage gain. The low-
temperature voltage gain is equal to the gate capacitance divided by the sum of the
junction capacitances: G = CG/(CS+CD). Achieving this gain requires low tunneling
junction capacitances. It also requires close coupling of gate and island without a large
increase in the total island capacitance. High gain has only been demonstrated for a
few devices and has required operation at low temperatures [82, 41]. However, further
advances in nanofabrication may overcome this limitation.
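As an illustrative calculation (not a result reported in the text), applying the gain expression to the device parameters given in the Figure 5.2 caption, with a 3.2 aF gate capacitance and 1.0 aF junction capacitances, gives:

```latex
\[
G = \frac{C_G}{C_S + C_D}
  = \frac{3.2\,\mathrm{aF}}{1.0\,\mathrm{aF} + 1.0\,\mathrm{aF}}
  = 1.6
\]
```

The same device has a total island capacitance of $C_\Sigma = C_G + C_S + C_D = 5.2$ aF, which shows the tension noted above: raising the gate capacitance to increase gain also raises the island capacitance and therefore lowers the maximum operating temperature.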
5.2.2 Random Background Charge Effects
Constant background charge effects have been a persistent problem for SETs.
Charges near the SET island influence its equilibrium state [119]. Although the
resulting voltage offsets can be compensated for with a biased second gate terminal,
the required bias is unknown until fabrication. Worse yet, some devices are affected
by random background charge effects, which result in run-time voltage fluctuations.
It is the tentative consensus of the research community that random background
charge effects are caused by multiple, closely-spaced charge traps near the island,
among which charge carriers tunnel. This produces run-time variation in gate bias,
and may cause logic errors. Much work has been done to understand the nature and
density of these defects [34, 62, 136]. Most SETs have been fabricated with aluminum
islands. Some researchers have attempted to eliminate random background charge
effects by fabricating SETs with alternative island materials such as silicon. Silicon
island based devices have high immunity to random background charge noise, with
operation unchanged over several weeks [137]. However, random background charge
effects remain the main source of run-time reliability problems for most SET designs.
In this chapter, we describe a reconfigurable architecture that provides architectural
resistance to the effects of random background charges.
5.2.3 SET Modeling
Circuit design involves extensive simulation. Despite their accuracy, Monte Carlo
methods are too slow for large-scale circuit analysis. We build upon the SET an-
alytical model developed by Inokawa et al. [49], which has been incorporated into
SPICE. Combined with MOS transistor models, it provides an efficient and accurate
simulation solution for hybrid SET/CMOS circuits. Inokawa’s model ignores random
background charge effects and multi-gate effects. We incorporate these effects into
the model.
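The I–V characteristics of a SET with island charge equal to $n$ or $n+1$ electrons
follow [49]:

\[
I_{DS} = \frac{e}{4R_TC_\Sigma} \times
\frac{(1-r^2)(\tilde{V}_{GS}^2 - \tilde{V}_{DS}^2)\sinh(\tilde{V}_{DS}/\tilde{T})}
{(\tilde{V}_{GS} + r\tilde{V}_{DS})\sinh(\tilde{V}_{GS}/\tilde{T}) - (\tilde{V}_{DS} + r\tilde{V}_{GS})\sinh(\tilde{V}_{DS}/\tilde{T})}
\tag{5.1}
\]

where

\[
\tilde{V}_{GS} = \frac{2\sum C_{Gi}V_{GSi}}{e} - \frac{(\sum C_{Gi} + C_S - C_D)V_{DS}}{e} - 1 - 2n + \zeta
\tag{5.2}
\]

\[
\tilde{V}_{DS} = \frac{C_\Sigma V_{DS}}{e}, \qquad \tilde{T} = \frac{2k_BTC_\Sigma}{e^2}
\tag{5.3}
\]

\[
r = \frac{R_D - R_S}{R_D + R_S}, \qquad R_T = \frac{2}{\frac{1}{R_S} + \frac{1}{R_D}}
\tag{5.4}
\]

\[
C_\Sigma = C_S + C_D + \sum C_{Gi}
\tag{5.5}
\]

In this model, $2\sum C_{Gi}V_{GSi}/e$ models the Coulomb charging effects of the multiple gate
terminals. $\zeta$ is a real number that characterizes the random background charge effect.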
This compact model is derived based on the steady-state master equation, which is
not directly applicable to transient circuit analysis. However, when used in circuits,
SETs are connected by metal wires. Based on existing fabrication processes, the
capacitance of local interconnect is at least two orders of magnitude higher than
SET island capacitance, thereby eliminating inter-SET Coulomb interaction. The
independence of SETs enables the use of quasi-steady-state analysis [49, 128].
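To make the compact model concrete, the Python sketch below evaluates Equations (5.1) to (5.5) for a multi-gate SET. It is a minimal illustration rather than the SPICE implementation used in this work. The function name, the argument layout, the rule for choosing $n$ (folding the normalized gate voltage into [−1, 1)), and the small nudge used at the removable 0/0 points are all assumptions introduced here.

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge (C)
K_B = 1.380649e-23          # Boltzmann constant (J/K)


def set_drain_current(v_gates, v_ds, temp_k, c_gates, c_s, c_d, r_s, r_d, zeta=0.0):
    """Steady-state drain current (A) from Equations (5.1)-(5.5).

    v_gates and c_gates are equal-length sequences (one entry per gate).
    zeta is the background-charge offset of Equation (5.2).
    """
    c_g_total = sum(c_gates)
    c_sigma = c_s + c_d + c_g_total                      # Eq. (5.5)
    r = (r_d - r_s) / (r_d + r_s)                        # Eq. (5.4)
    r_t = 2.0 / (1.0 / r_s + 1.0 / r_d)                  # Eq. (5.4)
    v_ds_n = c_sigma * v_ds / E_CHARGE                   # Eq. (5.3)
    t_n = 2.0 * K_B * temp_k * c_sigma / E_CHARGE ** 2   # Eq. (5.3)

    # Eq. (5.2) before subtracting 2n.
    u = (2.0 * sum(c * v for c, v in zip(c_gates, v_gates)) / E_CHARGE
         - (c_g_total + c_s - c_d) * v_ds / E_CHARGE - 1.0 + zeta)
    # ASSUMPTION: pick the integer n that folds the result into [-1, 1).
    n = math.floor(u / 2.0 + 0.5)
    v_gs_n = u - 2.0 * n

    def num_den(g):
        num = (1.0 - r * r) * (g * g - v_ds_n ** 2) * math.sinh(v_ds_n / t_n)
        den = ((g + r * v_ds_n) * math.sinh(g / t_n)
               - (v_ds_n + r * g) * math.sinh(v_ds_n / t_n))
        return num, den

    num, den = num_den(v_gs_n)
    if den == 0.0:
        # Removable 0/0 point; nudge the gate term slightly (assumption).
        num, den = num_den(v_gs_n + 1e-9)
    return E_CHARGE / (4.0 * r_t * c_sigma) * num / den


if __name__ == "__main__":
    # Device of Figure 5.2: C_G = 3.2 aF, C_S = C_D = 1.0 aF, R_S = R_D = 10 MOhm.
    dev = dict(c_gates=[3.2e-18], c_s=1e-18, c_d=1e-18, r_s=1e7, r_d=1e7)
    for mv in range(-60, 81, 20):
        i = set_drain_current([mv * 1e-3], 1e-3, 5.0, **dev)
        print(f"V_GS = {mv:4d} mV -> I_DS = {i:.3e} A")
```

Setting `zeta=1.0` shifts the curve by half a period, so a device that is blockaded at zero gate voltage conducts instead. This is the failure mode that the background charge term is meant to capture. At very low temperatures the `sinh` arguments grow large and may overflow; a production implementation would need a numerically safer formulation.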
5.3 IceFlex: A Fault-Tolerant Hybrid SET/CMOS Reconfigurable Architecture
This section describes the design and analysis of IceFlex, the proposed low-power,
fault-tolerant, reconfigurable, hybrid SET/CMOS architecture. The vast majority
of devices in IceFlex are SETs, allowing extremely low power consumption. CMOS
devices are sparingly used to improve the driving strength of global interconnect.
Our evaluation of the architectural constraints imposed by SETs led to four main
conclusions.
1. Flawless fabrication will be challenging, especially for circuits that operate
at room temperature. It is important to simplify fabrication and use post-
fabrication adaptation to improve reliability.
2. An unpredictable subset of devices will be susceptible to random background
offset charge effect noise: SET-based architectures should have the ability to
tolerate run-time errors.
3. SETs have poor driving strength; this must be remedied, especially when driving
global interconnect.
4. SETs have the ability to efficiently implement some functions that are ineffi-
cient using BJTs, CMOS logic, or threshold logic, e.g., non-linearly-separable
functions. SET-based architectures should exploit such special properties.
5.3.1 SET Design Space Characterization
In order to characterize the benefits and limitations of SET circuits and archi-
tectures, we analyze the tradeoffs among the following metrics: temperature, perfor-
mance, power consumption, reliability, and fabrication constraints. This study yields
two design configurations, each of which is shown in Table 5.2. One targets high-
performance embedded applications such as multimedia consumer electronics and
one targets ultra-low-power embedded applications such as wireless sensor networks.
5.3.1.1 Temperature
IceFlex was evaluated at seven temperature settings (see Table 5.2). IceFlex is a
hybrid SET/CMOS design; the temperature range starts at 40 K to permit reliable
operation of the CMOS components. 77 K is achieved by liquid nitrogen cooling.
103 K is the average cloud top temperature. 120 K and below are defined as cryogenic.
At 200 K, functional SET devices have been widely demonstrated in the literature.
250 K is a temperature that might be reached using a stacked Peltier heat pump.
300 K is room temperature.
5.3.1.2 Capacitance
To observe well-defined Coulomb blockade effects, electron charging energy must
be higher than the thermal energy, i.e., $e^2/C_\Sigma \geq 10k_BT$ or $e^2/C_\Sigma \geq 40k_BT$, where $k_B$ is
Boltzmann’s constant and $T$ is temperature. At room temperature, this constraint
requires an island capacitance below 1 aF, making fabrication challenging but possible [98].
In order to operate voltage-state logic, SETs must exhibit voltage gain,
which is equal to the gate capacitance divided by the sum of the junction capacitances:
$G = C_G/(C_S + C_D)$. Our results indicate that a gain of 1.5 is sufficient
for use in digital logic. Targeting battery-powered systems, using $C_\Sigma \leq e^2/(10k_BT)$,
$C_\Sigma \leq e^2/(40k_BT)$ and $G = 1.5$, the maximum allowed gate and junction capacitances
are derived and shown in the “Low power, Capacitance” columns of Table 5.2.
The maximal allowed capacitance decreases with increasing temperature. However,
fabricating SETs with low gate capacitance is challenging. We assume the capacitances
at 300 K are the minimum allowed. Given $e^2/C_\Sigma \geq 10k_BT$, for high-performance
applications, these minimal gate and junction capacitances are used at all the temperature
settings and shown in the corresponding “High Performance, Capacitance”
columns of Table 5.2. Given $e^2/C_\Sigma \geq 40k_BT$, which requires very low SET capacitance
at room temperature, $C_G = 0.09$ aF. This makes fabrication very challenging. Due
to fabrication concerns, for high-performance design, the capacitance and voltage are
determined at the appropriate operation temperature, instead of room temperature.
5.3.1.3 Voltage
Consider a SET biased via a second gate, such that a VGS of zero places it in the
middle of the positive voltage coefficient (PVC) region in Figure 5.2. In this case, the
maximum range of current values can be traversed by letting VGS (i.e., Vin) vary in
the range [−e/(4CG), e/(4CG)]. At all but the lowest temperatures, this range also
provides near-optimal sensitivity to VGS; we use this range. Once the range of VGS
is known, a VSS of −e/(4CG) and a VDD of e/(4CG) naturally follow, shown in the
“Voltage” columns of Table 5.2. Note that a bias voltage applied via a second gate
can be used to shift the zero VGS point from the PVC to negative voltage coefficient
(NVC) region in Figure 5.2, permitting NMOS-like or PMOS-like behavior.
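The capacitance and voltage entries of Table 5.2 can be derived from the constraints in Sections 5.3.1.2 and 5.3.1.3. The Python sketch below shows one way to do so. It assumes a single gate (the optional second gate capacitance is neglected) and symmetric junctions ($C_S = C_D$); both are simplifications introduced here, and the function name `design_point` is hypothetical.

```python
E_CHARGE = 1.602176634e-19  # elementary charge (C)
K_B = 1.380649e-23          # Boltzmann constant (J/K)


def design_point(temp_k, margin=10.0, gain=1.5):
    """Return (C_G, C_S, V_DD) in SI units for a given temperature.

    Assumptions (not stated explicitly in the text): one gate, C_S = C_D.
    Then C_sigma = C_G + 2*C_S and G = C_G / (2*C_S), which gives
    C_S = C_sigma / (2*(1 + G)) and C_G = 2*G*C_S.
    """
    c_sigma = E_CHARGE ** 2 / (margin * K_B * temp_k)  # largest allowed C_sigma
    c_s = c_sigma / (2.0 * (1.0 + gain))
    c_g = 2.0 * gain * c_s
    v_dd = E_CHARGE / (4.0 * c_g)                      # V_DD = e/(4*C_G), V_SS = -V_DD
    return c_g, c_s, v_dd


if __name__ == "__main__":
    for margin in (10.0, 40.0):
        for temp in (40, 77, 103, 300):
            c_g, c_s, v_dd = design_point(temp, margin)
            print(f"{margin:.0f} kBT, T={temp:3d} K: C_G={c_g * 1e18:.2f} aF, "
                  f"C_S=C_D={c_s * 1e18:.2f} aF, V_DD={v_dd * 1e3:.2f} mV")
```

Under these assumptions the sketch agrees with the low-power columns of Table 5.2 to within rounding, for example about 2.79 aF, 0.93 aF, and 14.36 mV at 40 K with the $10k_BT$ margin.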
5.3.1.4 Junction Resistance
To observe single-electron charging effects, electrons must be confined to the
island. This requires junction resistances that are much higher than the quantum
resistance, i.e., RS, RD ≫ h/e², where h/e² ≈ 25.8 kΩ and h is Planck’s constant.
Therefore, SETs have high resistances and low driving currents. In this chapter, we
pick two resistance settings: 100 kΩ for high-performance applications and 10 MΩ for
battery-powered systems, shown in the “Resist.” columns of Table 5.2.
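To make the margin over the quantum resistance concrete, the following sketch computes h/e² from the SI values of the constants and compares it with the two resistance settings. The function name is illustrative.

```python
# Sketch: compare the two junction-resistance settings with h/e^2.

PLANCK = 6.62607015e-34      # Planck's constant (J s)
E_CHARGE = 1.602176634e-19   # elementary charge (C)

R_QUANTUM = PLANCK / E_CHARGE ** 2   # about 25.8 kOhm


def resistance_margin(r_junction):
    """Return how many times r_junction exceeds h/e^2."""
    return r_junction / R_QUANTUM


if __name__ == "__main__":
    print(f"h/e^2 = {R_QUANTUM / 1e3:.1f} kOhm")
    for label, r in (("high performance", 100e3), ("low power", 10e6)):
        print(f"{label}: R = {r:.0e} Ohm, margin = {resistance_margin(r):.1f}x")
```

By this arithmetic, the 100 kΩ setting exceeds h/e² by a factor of roughly four, and the 10 MΩ setting by a factor of several hundred.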
5.3.1.5 Reliability Implications
Researchers have pointed out the dangers posed by thermal noise as charging
(state change) energy approaches thermal energy. We explicitly consider the effects of
temperature on steady-state current during circuit analysis and its effects are reflected
in our design decisions. We implicitly consider, and guard against, the effects of
temperature-dependent shot noise by requiring charging energy to be a large multiple
of the thermal energy. Designs with charging energies of both 10 and 40 times the
thermal energy (10kBT and 40kBT) are evaluated in this chapter. Researchers have
Table 5.2: Design Space Characterization [133]. Vdd, Vin = e/(4CG).

CΣ = e²/(10kBT)
         Low power                                High performance
Temp.    CG      CS, CD   Vdd, Vin   RS, RD       CG      CS, CD   Vdd, Vin   RS, RD
(K)      (aF)    (aF)     (mV)       (MΩ)         (aF)    (aF)     (mV)       (kΩ)
40       2.78    0.93     14.36      10           0.37    0.12     107.70     100
77       1.45    0.48     27.65      10           0.37    0.12     107.70     100
103      1.08    0.36     36.99      10           0.37    0.12     107.70     100
120      0.93    0.31     43.09      10           0.37    0.12     107.70     100
200      0.56    0.19     71.82      10           0.37    0.12     107.70     100
250      0.45    0.15     89.77      10           0.37    0.12     107.70     100
300      0.37    0.12     107.70     10           0.37    0.12     107.70     100

CΣ = e²/(40kBT)
         Low power                                High performance
Temp.    CG      CS, CD   Vdd, Vin   RS, RD       CG      CS, CD   Vdd, Vin   RS, RD
(K)      (aF)    (aF)     (mV)       (MΩ)         (aF)    (aF)     (mV)       (kΩ)
40       0.70    0.23     57.46      10           0.70    0.23     57.46      100
77       0.36    0.12     110.60     10           0.36    0.12     110.60     100
103      0.27    0.09     147.95     10           0.27    0.09     147.95     100
120      0.23    0.08     172.37     10           0.23    0.08     172.37     100
200      0.14    0.05     287.28     10           0.14    0.05     287.28     100
250      0.11    0.04     359.10     10           0.11    0.04     359.10     100
300      0.09    0.03     430.91     10           0.09    0.03     430.91     100
[Figure: block diagram with components labeled SET configuration memory, SET local
interconnect, hybrid SET/CMOS global interconnect, majority voting logic, SET
multi-gate lookup table, SET input switch fabric, and SET registers.]
Figure 5.3: IceFlex Microarchitecture [133].
reported device operation at each level, but the 40kBT requirement is more reliable.
At charging energies over 10kBT , the model we use is accurate to within 4% of the
time-dependent master equation [59, 113].
Random background charge effects [62, 136] are the main barrier to SET reliability.
They are observed as 1/f noise on SET gate voltages, with some SETs susceptible
and others immune. Several recent devices have shown improved immunity to this
noise, as described in Section 5.2.2. Currently, the distribution of random background
offset charges can only be determined after fabrication [70]. Susceptible SETs may
suffer transient errors infrequently, e.g., only once per day. In this chapter, we use
architectural techniques to reduce the probability of failure using an entirely SET-
based design. SETs are used in parallel to exploit the lack of SET-to-SET correlation
in random background offset charge effects.
5.3.2 IceFlex Design
In this section, we present the architecture and circuit design of IceFlex. The
microarchitecture of IceFlex is shown in Figure 5.3. IceFlex is a cell-based design.
Each cell is a SET logic block (SELB) composed of the following components: (1)
multi-gate SET-based reconfigurable look-up tables that can realize arbitrary n-input
Boolean functions; (2) a SET-based arithmetic unit that allows efficient implementa-
tions of non-linearly separable arithmetic operations; (3) a SET-based reconfiguration
memory array that caches multiple configuration contexts to support efficient run-
time reconfiguration; (4) a multi-gate SET-based input switch fabric; and (5) SET
registers. In addition, IceFlex includes SET threshold logic-based majority voting
logic units, allowing a flexible solution to run-time reliability problems. In IceFlex, a
multi-level on-chip interconnect fabric forms inter-SELB connections. Local connec-
tions rely on a custom-designed, SET-driven, variable-length, constant-latency inter-
connect. Using a constant-latency interconnect structure reduces power consumption
and simplifies physical-level design automation, e.g., placement and routing. SETs
have limited driving strength. Therefore, IceFlex uses hybrid SET/CMOS circuits to
drive global interconnects.
We now explain each IceFlex component and discuss both circuit and architecture
design tradeoffs.
5.3.2.1 Multi-Gate SET Reconfigurable Lookup Table Component
Each SELB is equipped with l sets of n-input reconfigurable look-up tables. Each
look-up table can realize an arbitrary n-input Boolean function. The basic structure
of the look-up table consists of an m-to-1 multi-gate SET multiplexer tree (m = 2^n),
and an m-bit SET storage cell, which will be described in the next section.
The proposed multi-gate SET multiplexer tree differs from existing CMOS-based
designs in the following way. A CMOS m-to-1 multiplexer tree requires ⌈log2 m⌉
stages of transmission gates, plus buffers to meet the required driving strength. SETs
may have multiple gate terminals. As described in Equation 5.5, the gate charging
[Figure: m-to-1 multi-gate SET multiplexer tree built from mc-to-1 multi-gate SET
multiplexers, driven by configuration bits Config. Bit0 through Config. Bitm-1 and
select signals s0, s1, ..., snc, with Vdd, Vss, VG2 and -VG2 biasing; a 4-to-1
multi-gate SET multiplexer example with select inputs a and b and paths P0 to P3;
and the corresponding IDS versus VG curve marking the operating points of P0 to P3
for a = 1, b = 1.]
Figure 5.4: Multi-gate SET Multiplexer Tree [133].
effect is a function of Σi CGi·VGSi. Therefore, multiple control signals, e.g., the select
signals for a multiplexer, can be supplied to a single SET, enabling a more compact
circuit structure with better performance and power efficiency.
Figure 5.4 shows the proposed SET multi-gate multiplexer tree design. The basic
building block is a q-to-1 multi-gate single-stage multiplexer, in which each of the q
paths consists of a single multi-gate SET controlled by ⌈log2 q⌉ select signals. Using
this design, the logic depth of an m-to-1 multiplexer tree reduces to ⌈logq m⌉ instead
of ⌈log2 m⌉. Figure 5.4 also shows a design case for q = 4. The output SET buffer is
used to break long resistive paths and improve the driving strength.
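The depth reduction can be illustrated with a short calculation. The sketch below computes ⌈logq m⌉ with integer arithmetic; the function names and the example LUT size (n = 4) are illustrative choices, not values from the design.

```python
# Sketch: logic depth of an m-to-1 multiplexer tree built from q-to-1 stages.

def tree_depth(m, q):
    """Return ceil(log_q(m)) using integer arithmetic (m >= 1, q >= 2)."""
    depth, reach = 0, 1
    while reach < m:
        reach *= q
        depth += 1
    return depth


def select_signals_per_set(q):
    """Return ceil(log2(q)), the number of select signals per multi-gate SET."""
    return tree_depth(q, 2)


if __name__ == "__main__":
    n = 4                      # hypothetical LUT input count
    m = 2 ** n                 # number of stored configuration bits
    for q in (2, 4):
        print(f"q={q}: depth={tree_depth(m, q)}, "
              f"selects per SET={select_signals_per_set(q)}")
```

For this hypothetical 16-to-1 tree, moving from q = 2 to q = 4 halves the number of stages, at the cost of two select gates on each SET.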
As described in Section 5.2, thermal energy has significant impact on electron
tunneling and the ratio of on to off currents, i.e., the ratio of the off to on resistance.
This ratio decreases as the ratio of Coulomb charging energy (e²/C) to thermal energy
(kBT) decreases. On the other hand, as the number of gate control signals per SET
(hence the number of off paths connected in parallel) increases, the impact of the off
paths on the circuit output increases. Consider, for the sake of example, the dual-gate
4-to-1 multiplexer design shown in Figure 5.4. The four logic inputs are 0001 and
both select signals are logic one, i.e., Va = Vb = V . Assume Ca = Cb = C. As shown
in the I–V curve on the right side of Figure 5.4, for the SET on path P3, the overall
gate charge equals 2CV . Therefore, the SET becomes fully conductive. For paths P1
and P2, the gate charges both equal CV −CV = 0, hence both switches are partially
conductive. For path P0, even though the overall gate charge equals −2CV , at high
temperature its resistance may still be within the same order of magnitude as that
of path P3. Since the inputs of paths P0, P1 and P2 are all connected to logic zero
(the worst-case scenario), these three parallel paths may pull down the output voltage,
producing incorrect results.
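The gate-charge bookkeeping in this example can be written out as a small sketch. It assumes bipolar signaling (logic one is +V, logic zero is −V), that a complemented select input inverts the sign, that Ca = Cb = C, and that path Pk uses the true form of a select signal wherever the corresponding bit of k is one. These conventions are assumptions chosen to match the charges quoted above; the function name is illustrative.

```python
# Sketch: net gate charge sum(C_Gi * V_GSi) on each path of a dual-gate
# 4-to-1 multiplexer. Assumptions: logic one is +V, logic zero is -V, a
# complemented select input inverts the sign, and C_a = C_b = C.

def path_gate_charges(a, b, c=1.0, v=1.0):
    """Return the net gate charge of paths P0..P3 for select bits a and b."""
    v_a = v if a else -v
    v_b = v if b else -v
    charges = []
    for path in range(4):
        sign_a = 1 if path & 0b10 else -1   # true or complemented form of a
        sign_b = 1 if path & 0b01 else -1   # true or complemented form of b
        charges.append(c * sign_a * v_a + c * sign_b * v_b)
    return charges


if __name__ == "__main__":
    for path, q in enumerate(path_gate_charges(a=1, b=1)):
        print(f"P{path}: gate charge = {q:+.0f} CV")
```

With a = b = 1, the sketch gives −2CV for P0, zero for P1 and P2, and +2CV for P3, matching the values in the text.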
In the high-performance setting, the same capacitance settings are used across the
whole temperature range. Therefore, the ratio of Coulomb charging energy to thermal
energy increases as the temperature decreases. Consequently, lower temperatures permit
fewer multiplexer levels in the multiplexer tree, with more inputs to each individual
multiplexer.
Detailed circuit analysis shows that, using the high-performance setting and
e²/CΣ ≥ 10kBT, the dual-gate design may be used at temperatures up to 200 K. At
250 K and 300 K, only the single-gate design is feasible. For the low-power setting,
capacitance scaling maintains the same e²/(CΣkBT) ratio. Therefore, the same design
should be used for the whole temperature range. In addition, since both the low-power
setting and the high-performance setting at room temperature use the same
e²/(CΣkBT) ratio, only the single-gate design is feasible for low-power,
room-temperature operation. For the e²/CΣ ≥ 40kBT configurations of IceFlex, the
dual-gate design may be used at all temperatures due to the increased charging energy.
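The temperature dependence of the charging ratio in the high-performance setting follows from simple scaling. The sketch below assumes CΣ is fixed at the value that gives a ratio of 10 at 300 K; the function and parameter names are illustrative.

```python
# Sketch: ratio of charging energy to thermal energy, e^2 / (C_sigma k_B T),
# when C_sigma is fixed at its value for a reference temperature.

def charging_ratio(temp_k, ratio_at_ref=10.0, ref_temp_k=300.0):
    """Return e^2 / (C_sigma k_B T) at temp_k for a fixed C_sigma."""
    return ratio_at_ref * ref_temp_k / temp_k


if __name__ == "__main__":
    for temp in (40, 77, 103, 120, 200, 250, 300):
        print(f"T = {temp:3d} K: ratio = {charging_ratio(temp):.1f}")
```

With these numbers, the 200 K limit reported for the dual-gate design corresponds to a ratio of 15. The text does not state that this value is the design criterion, so it should be read only as an observation.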
5.3.2.2 SET Configuration Memory
In IceFlex, run-time reconfiguration is enabled by SET configuration memory,
which consists of SET configuration cache and current configuration memory. In
[Figure: SET configuration memory holding configuration sets set 0 through set k-1 of
dual-island SET memory cells (control gate VCG, gate VG, output Vout, and IDS versus
VG curves for the “Store 1” and “Store 0” states), and the dual-island SET buffer
(In, Out, Vdd, Vss, biased by VG2 and -VG2).]
Figure 5.5: SET Configuration Memory [135].
each SELB, the configuration cache stores multiple configurations. During run-time
reconfiguration, one set of configuration bits stored in the configuration cache is
placed into the current configuration memory to program SELB logic and interconnect.
If k configuration sets are stored in the configuration cache, then the circuit can be
reconfigured k times during run-time execution without the need to access off-chip
memory.
The left portion of Figure 5.5 shows the circuit structure of the configuration
memory in IceFlex. The SET configuration cache is the main on-chip configuration
memory. Each storage cell consists of a dual-island SET [70]. A dual-island SET
contains two capacitively coupled SETs: a primary SET and a secondary SET. By
controlling VCG, electrons can tunnel through the control gate and charge the island
of the secondary SET. The charge state of the secondary SET shifts the phase of the
Coulomb oscillations of the primary SET, i.e., its conductivity condition shifts as a
function of gate control voltage, VGS. Therefore, under a certain VGS, the primary
SET is either conductive or open due to different island charges, representing either
a logic one or logic zero.
In the configuration cache, selecting a configuration forms a short-circuit path
between the pull-up resistor and SETs with a stored zero within the selected configu-
ration set. The power consumption will be high if the configuration cache constantly
controls the logic and interconnect. To minimize power consumption, separate on-chip
memories are used to store the currently-used configuration.
We designed a dual-island based SET buffer to hold the current configuration. As
shown to the right of Figure 5.5, this buffer uses two biasing voltages, VG2 and −VG2 ,
and behaves like a complementary SET inverter. During run-time reconfiguration, for
each dual-island SET, the corresponding configuration bit stored in the configuration
memory updates the island charge of its secondary SET and conductivity of the
primary SET, thereby controlling the buffer output.
5.3.2.3 Efficient SET Implementations of Non-Unate Functions and Implications
for Arithmetic
SETs support efficient implementations of some critical logic functions that have
long frustrated designers using threshold logic, BJT, and CMOS technologies. Most
conventional transistors have either non-decreasing or non-increasing I–V curves. As
a result, numerous devices are required to implement Boolean functions that are not
unate and hence not linearly separable. However, such functions are widely used,
especially in digital arithmetic. The periodic nature of SET I–V curves can be
exploited for efficient implementation of highly useful non-unate functions such as
exclusive-OR.
[Figure: SET parity circuit with two multi-gate SETs connected between VDD and VSS,
each with inputs IN1 through INk−1 and biased by VG2 and −VG2, respectively.]
Figure 5.6: SET Parity Circuit [133].
The most efficient CMOS static pass-transistor logic design of a two-input exclusive-
OR gate in general use requires six transistors [91]. Moreover, it relies on strong input
signals because it is not capable of signal restoration. A restoring version would re-
quire at least eight transistors. In contrast, it is possible to implement a two-transistor
SET-based exclusive-OR gate that is structurally equivalent to a CMOS inverter. In
this design, each SET has two gates, each of which is connected to one of the exclusive-
OR inputs. The circuit structure for a SET-based n-input parity gate is shown in
Figure 5.6. This design is capable of signal restoration. Thanks to the periodic SET
I–V curve, it is possible to directly determine whether the number of high inputs is
odd or even. By appropriately choosing the gate capacitances, the device can be
tuned such that switching a single gate results in a 180° phase shift in the
I–V curve (see Figure 5.2). Note that even or odd parity functions with additional
inputs may be implemented using only two SETs. The number of inputs is bounded
primarily by geometrical constraints on fabrication of additional gates.
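The parity behavior can be illustrated with a toy model. The sketch below is not the SPICE model used in this chapter: it simply assumes that each high input shifts the periodic I–V curve by 180° and that the pull-up SET conducts in proportion to (1 − cos φ)/2. Both the conductance expression and the function name are assumptions made for illustration.

```python
# Toy model (an assumption, not the SPICE model used in this chapter):
# each high input shifts the periodic I-V curve by 180 degrees, and the
# pull-up SET conducts in proportion to (1 - cos(phase)) / 2.
import math


def set_parity(inputs):
    """Return 1 if an odd number of inputs are high, else 0."""
    phase = math.pi * sum(1 for bit in inputs if bit)
    pull_up_conductance = (1.0 - math.cos(phase)) / 2.0
    return 1 if pull_up_conductance > 0.5 else 0


if __name__ == "__main__":
    for bits in ((0, 0), (0, 1), (1, 0), (1, 1), (1, 1, 1)):
        print(bits, "->", set_parity(bits))
```

In this idealized picture, adding inputs only adds gates to the same two devices, which is consistent with the observation that the input count is bounded mainly by geometry.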
In SET-based architectures, we propose the use of fast carry chains based on the
proposed exclusive-OR (sum) computation logic. We have found that this design is
approximately 75% more energy-efficient and 25% faster than a design based on a
conventional CMOS-style exclusive-OR sum implementation, when both are imple-
mented using SETs. This design style is impossible for threshold logic, BJTs, and
CMOS technologies. Note that carry-out logic is equivalent to 2-out-of-3 majority
vote logic.
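The logical relationship between the sum, the carry-out, and majority voting can be checked with a short sketch. It is purely a Boolean illustration with hypothetical function names; it does not model the energy or latency figures quoted above.

```python
# Sketch: one-bit full adder with the sum as three-input parity and the
# carry-out as a 2-out-of-3 majority vote.

def full_adder(a, b, carry_in):
    """Return (sum_bit, carry_out) for one-bit inputs."""
    sum_bit = a ^ b ^ carry_in                     # odd parity of the inputs
    carry_out = 1 if a + b + carry_in >= 2 else 0  # 2-out-of-3 majority
    return sum_bit, carry_out


def ripple_add(x, y, width=8):
    """Add two unsigned integers bit by bit through the carry chain."""
    carry, total = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total | (carry << width)


if __name__ == "__main__":
    print(ripple_add(100, 57))
```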
5.3.2.4 Reconfigurable Interconnect Network
IceFlex contains a variety of reconfigurable interconnect resources, including
SET local interconnects, hybrid SET/CMOS global interconnects, and SET switch
fabric.
Interconnect accounts for a substantial proportion of total power consumption in
IceFlex, so its power efficiency is important. For SET-based interconnect, the static power
consumption dominates due to the impact of thermal energy on device conductance,
especially at high temperatures. In addition, static power consumption increases with
wireload because maintaining unchanged communication latency with higher wireload
requires lower junction resistance. In contrast, the dynamic power consumption of
SETs is low due to the low SET gate capacitance and low voltage swing. For hybrid
SET/CMOS-based interconnect, SETs are only used to drive CMOS buffers, which in
turn drive wires. In this case, SETs with low driving strength, hence high junction re-
sistance, are allowed. Compared to SETs, CMOS has lower static power consumption
but higher capacitance and dynamic power consumption. Therefore, dynamic power
dominates in the hybrid SET/CMOS-based design. Circuit analysis shows that, given
the same performance constraint, SET-based design is more energy-efficient for local
interconnect and the hybrid SET/CMOS design is more energy-efficient for global
interconnect.
In IceFlex, local interconnects driven directly by SET buffers support communica-
tion between nearby SELBs. Three types of local interconnects are supported: single
length, double length, and hex length. The proposed SET local interconnect design
guarantees a constant latency across different routing lengths. Consider, for the sake
of example, a local communication architecture in which the maximum interconnect
delay is constrained and the longest interconnect is appropriately buffered to meet
this constraint. In this case, it would be possible to similarly drive shorter intercon-
nects, thereby decreasing their delays, relative to that of the longest interconnect.
It would also be possible to reduce the driving strength on shorter interconnects to
reduce power consumption and produce a local interconnect architecture in which
all interconnects have uniform delay. We propose the second design because it im-
proves interconnect power efficiency and also simplifies placement and routing during
physical design.
The proposed SET local interconnect is designed as follows. A SET buffer with
minimal driving strength (hence high junction resistance) is first determined. Next,
for local interconnects with different routing lengths, minimal driving strength SET
buffers are connected in parallel to meet driving strength requirements imposed by
performance constraints. The main motivation for using parallel SET buffers is that
SET junction resistance cannot be reduced arbitrarily (RD, RS ≫ h/e²). Using
homogeneous SET buffers in parallel instead of heterogeneous SET buffers may also
simplify fabrication.
Remote connections introduce the high capacitive loads of long metal wires. To
address the driving strength problem of SET-only circuits, we have designed hybrid
Table 5.3: Impact of Majority Vote Logic on SELB Fault Probability [133].

SET fault probability   1/1,000                      1/10,000                     1/100,000
Majority vote inputs    3        5        7          3        5        7          3        5        7
Raw fail prob.          6.20E-2  6.20E-2  6.20E-2    6.38E-3  6.38E-3  6.38E-3    6.40E-4  6.40E-4  6.40E-4
Best prob.              1.11E-2  2.17E-3  4.45E-4    1.22E-4  2.57E-6  5.71E-8    1.23E-6  2.62E-9  5.86E-12
SET MVL prob.           1.11E-2  2.18E-3  4.57E-4    1.22E-4  2.69E-6  1.77E-7    1.23E-6  3.82E-9  1.21E-9
[Figure: an HLB output drives SET inverter SINV1 and CMOS inverters CINV1 and CINV2;
an inter-HLB metal wire connects to SET inverter SINV2 at the HLB input; the SET
inverters are biased by VG2 and -VG2.]
Figure 5.7: Hybrid SET/CMOS Interface Circuitry [133].
SET/CMOS interface circuitry to drive global interconnect. Figure 5.7 shows the
circuit structure, which contains two complementary SET inverters and two CMOS
inverters. A SELB output is first fed to the input of SET inverter SINV1. SINV1
drives the CMOS inverter, CINV1. Unlike the SET logic used inside SELBs, SINV1
uses a low-resistance design to improve driving strength. Fortunately, it is possible
to achieve sufficient driving strength with a single SET. Since the voltage range of
SET logic is much smaller than that of CMOS logic, the output signal of SINV1 is
within the switching range of the CMOS inverter. Since both MOS transistors are
conductive within the switching region, short-circuit power is high. To solve the short-
circuit power consumption problem, CINV1 is designed to satisfy the following two
constraints. First, Vtn + |Vtp| > Vdd − Vss ensures that at least one MOS transistor
is off at all times, reducing static power consumption. Second, the output signal
range of SINV1 must be greater than Vtn + |Vtp| − (Vdd− Vss). Therefore, the NMOS
(PMOS) transistor of CINV1 is conductive when SINV1 has a high (low) output
signal. Therefore, CINV1 serves as a signal converter, and CINV2 provides driving
strength.
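The two CINV1 constraints can be expressed as a simple check. In the sketch below, all voltages are hypothetical example values chosen only to exercise the inequalities; they are not parameters from this chapter, and the function name is illustrative.

```python
# Sketch: check the two CINV1 design constraints. All voltages are
# hypothetical example values, not parameters from this chapter.

def cinv1_constraints_ok(v_tn, v_tp, v_dd, v_ss, sinv1_swing):
    """Return True if both CINV1 constraints hold (voltages in volts)."""
    threshold_sum = v_tn + abs(v_tp)
    supply_span = v_dd - v_ss
    # First constraint: at least one MOS transistor is off at all times.
    never_both_on = threshold_sum > supply_span
    # Second constraint: the SINV1 output range spans the dead zone.
    swing_sufficient = sinv1_swing > threshold_sum - supply_span
    return never_both_on and swing_sufficient


if __name__ == "__main__":
    print(cinv1_constraints_ok(v_tn=0.45, v_tp=-0.45,
                               v_dd=0.4, v_ss=-0.4, sinv1_swing=0.2))
```

In this hypothetical case the threshold sum (0.9 V) exceeds the supply span (0.8 V), and the 0.2 V SINV1 swing exceeds the 0.1 V difference, so both constraints hold.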
CINV2 cannot be used to drive the input SET logic of a SELB directly. SET
current is a periodic function of the gate control voltage and has a period of e/CG,
which is much smaller than the output voltage range of CINV2. Therefore, this
output voltage range cannot be used directly. To solve this problem, we design a
special SET inverter, SINV2, that is used for SELB inputs. SINV2 is fabricated with
a large distance between gate and island in order to reduce the gate capacitance, CG.
Thus, e/CG can match the output signal range of CMOS inverter CINV2. Although
source–island and drain–island junctions must be short to permit tunneling, there is
no such bound on gate–island separation.
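The mismatch between the SET oscillation period and a CMOS output swing can be quantified as follows. The sketch uses the period e/CG from the text; the 0.8 V swing is a hypothetical value chosen for illustration, and the function names are not from the original design.

```python
# Sketch: Coulomb oscillation period e / C_G and the gate capacitance that
# makes this period equal to a given voltage swing.

E_CHARGE = 1.602176634e-19   # elementary charge (C)


def oscillation_period(c_gate):
    """Return the gate-voltage period e / C_G in volts."""
    return E_CHARGE / c_gate


def gate_capacitance_for_period(v_swing):
    """Return the C_G (farads) whose period e / C_G equals v_swing."""
    return E_CHARGE / v_swing


if __name__ == "__main__":
    print(f"period at 0.37 aF: {oscillation_period(0.37e-18) * 1e3:.0f} mV")
    swing = 0.8   # hypothetical CINV2 output swing (V)
    print(f"C_G for {swing} V swing: "
          f"{gate_capacitance_for_period(swing) * 1e18:.2f} aF")
```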
In IceFlex, each SELB is equipped with a reconfigurable input switch fabric that
selects the connections among local and global interconnects. The input switch fabric
is implemented using a multi-gate SET multiplexer tree, similar to that in the
reconfigurable look-up table described in Section 5.3.2.1.
5.3.2.5 Design and Modeling of IceFlex Majority Voting Logic
Although researchers are making progress on reducing the severity of noise result-
ing from random background offset charge effects, it may continue to pose run-time
noise problems in the future. Even if this problem can be entirely solved, resistance
to run-time faults may be useful in SETs, e.g., to provide resistance to
alpha-particle-induced faults or other single-event upsets. IceFlex incorporates
support for hierarchical spatial redundancy to improve fault tolerance. Although much
of the literature predicts the need for fault-tolerant architectures in nanoelectronics,
the required level of fault tolerance is currently unknown. Therefore, we consider the
results for a number of possible SET failure rates and in the presence of three
fault-tolerance configurations.
Other researchers have proposed a number of architectural techniques to support
reliable computation using nanoscale electronics that are susceptible to
fabrication-time and run-time faults. DeHon describes the use of structural redundancy
and programming-time defect-aware configuration in a programmable logic array
architecture based on carbon nanotubes and silicon nanowires [24]. Goldstein et al.
describe the use
of a defect map that is generated during post-fabrication testing to avoid the use of
faulty devices [37]. Bahar et al. present a method of expressing logic circuits using
Markov Random Fields, permitting Boolean functions to be computed using devices
susceptible to potentially-frequent transient faults [10]. We think it likely that the
random background charge problem will ultimately be dealt with by a combination
of improved fabrication technology, post-fabrication testing to identify and avoid a
subset of the affected SETs, and run-time fault-tolerance via conventional structural
redundancy or recent advances in probabilistic computation. IceFlex provides for
regular structural redundancy and run-time error correction.
We now consider the fault model for IceFlex SELBs. Every path from SELB input
to output contains 64 SETs. In the third row of Table 5.3, we show the SELB raw
failure probabilities, i.e., the probability of a SELB producing an incorrect output.
SELB failure probability is a function of the SET fault probability, for which Ta-
ble 5.3 shows three values. Likharev estimates the long-term density of background
offset charge susceptible SETs [70]. We follow his assumptions arriving at one suscep-
tible SET in 10,000. The resulting 1/f noise produces long-duration failure periods.
Therefore, in this analysis, we (conservatively) assume that susceptible devices consis-
tently fail. In reality, errors may not be consistent. We also consider the higher SET
fault probability of 1/1,000 and the lower fault probability of 1/100,000. Advances
in fabrication and detection of most SETs susceptible to random background offset
charge effects by post-fabrication testing may permit reduction in run-time SET fault
probability.
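The raw SELB failure probabilities can be approximated with a simple series model. The sketch below assumes that SET faults are independent and that any faulty SET on a 64-SET path corrupts the SELB output; these assumptions and the function name are ours, although the results appear to agree with the raw row of Table 5.3 to the precision shown.

```python
# Sketch: raw SELB failure probability, assuming independent SET faults and
# that any faulty SET on a 64-SET path corrupts the SELB output.

def selb_raw_failure(p_set, sets_per_path=64):
    """Return the probability that at least one SET on the path is faulty."""
    return 1.0 - (1.0 - p_set) ** sets_per_path


if __name__ == "__main__":
    for denom in (1_000, 10_000, 100_000):
        print(f"SET fault prob. 1/{denom:,}: "
              f"raw SELB fail prob. = {selb_raw_failure(1 / denom):.2e}")
```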
We have considered the effect of using no MVL (Raw fail prob.), fault-free MVL
(Best prob.), and SET MVL. Using a given reliability configuration, it is not possible
for MVL-based designs to produce lower SELB fault probabilities than those shown
in the Best prob. row. SET MVLs are constructed from multi-gate SETs. We focus
on the three-input SET MVL design to simplify depiction; the five-input and
seven-input SET MVLs follow an analogous design style. This circuit has identical
structure to the parity gate shown in Figure 5.6. However, the separation between the
gates and the island is adjusted such that the circuit traverses only 1/2 Coulomb
oscillation period during use. The SET pull-up gates are separated sufficiently to
require the majority of the gates to be high. The converse is true of the pull-down
gates. For each SET depicted in the figure, four SETs are used in parallel in order to
permit the failure of one SET while still producing correct results. We have computed
the delay of the SET MVL by considering the worst-case scenario, in which a path that
is 3/5 or 4/7 closed has a faulty driver SET and a path that is 2/5 or 3/7 closed has
no faulty SETs.
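The fault-free MVL bound can be approximated with a binomial tail. The sketch below assumes that the redundant SELB copies fail independently with the raw probability from Table 5.3 and that an ideal voter errs only when more than half of its inputs are wrong. It does not model faults in the SET MVL itself; the function name is illustrative.

```python
# Sketch: SELB fault probability behind an ideal (fault-free) majority
# voter, assuming the redundant SELB copies fail independently.
from math import comb


def majority_failure(p_selb, inputs):
    """Return the probability that more than half of the copies fail."""
    need = inputs // 2 + 1
    return sum(comb(inputs, k) * p_selb ** k * (1.0 - p_selb) ** (inputs - k)
               for k in range(need, inputs + 1))


if __name__ == "__main__":
    raw = 6.20e-2   # raw SELB fail prob. for a SET fault prob. of 1/1,000
    for inputs in (3, 5, 7):
        print(f"{inputs}-input voter: {majority_failure(raw, inputs):.2e}")
```

Starting from the rounded raw probability of 6.20E-2, the results agree with the Best prob. row of Table 5.3 to within the rounding of the input.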
As shown in Table 5.3, it is possible for a seven-input SET-only MVL with redundant
SELBs to reduce the failure rate to 1/8,500,000, given a SET fault probability of
1/10,000, or 1/830,000,000, given a SET fault probability of 1/100,000. Given recent
trends in noise-resistant SET design and fabrication, it seems likely that a less
aggressive fault-tolerance configuration will suffice in the future (see Section 5.2.2).
If a method of rapidly determining which SETs are susceptible to random back-
ground charge effects is ever developed, these effects can be avoided in the same way
that fabrication defects are avoided: via the use of a regular computation structure in
which operations are mapped only to fault-free devices. There has been some promis-
ing work on this topic, in which illumination is used to produce ions, accelerating the
onset of random background charge effects [16].
5.4 Experimental Results
In this section, we evaluate the suitability of using SETs in low-power embedded
system design. We start with the microarchitecture characterization of IceFlex.
IceFlex is then used as a testbed to characterize the benefits and limitations of SETs
for both high-performance and battery-powered embedded applications.
5.4.1 Characterization of the IceFlex Architecture
Following the design parameters shown in Table 5.2, we evaluate the performance
and power consumption of IceFlex using HSPICE. For SET circuitry, the SPICE
model and device parameters are described in Section 5.2.3. For CMOS logic and
metal wire, we use the 22 nm Berkeley BSIM4 predictive technology model, which
models the impact of temperature on MOS devices. We analyzed designs adhering
to the CΣ = e²/(40kBT) constraint as well as designs with the less conservative
CΣ = e²/(10kBT) constraint. Both a low-power setting (targeting megahertz-range
frequencies) and a high-performance setting (targeting gigahertz-range frequencies)
are considered.
Tables 5.4 and 5.5 summarize the performance and power characterization of the
Table 5.4: Characterization of IceFlex Microarchitecture for CΣ = e²/(40kBT) [133].

Latency (ns), low power
Component             40 K      77 K      103 K     120 K     200 K     250 K     300 K
LUT                   10.04     7.86      7.09      6.80      5.57      5.03      4.75
Register              1.42      1.09      1.02      1.00      0.90      0.88      0.86
7-INPUT MVL           0.58      0.57      0.58      0.58      0.59      0.56      0.58
SET-MVL               1.15      1.13      1.13      1.00      1.08      1.04      1.06
Arithmetic SUM, MG    2.32      2.31      2.31      2.31      2.31      2.28      2.29
Arithmetic SUM, CS    3.02      2.97      2.95      2.96      2.95      2.89      2.93
Arithmetic CO         1.15      1.13      1.13      1.00      1.08      1.04      1.06

Latency (ns), high performance
Component             40 K      77 K      103 K     120 K     200 K     250 K     300 K
LUT                   0.08      0.06      0.05      0.05      0.05      0.04      0.04
Register              0.01      0.01      0.01      0.01      0.01      0.01      0.01
7-INPUT MVL           3.28E-03  3.18E-03  3.16E-03  3.20E-03  3.24E-03  2.99E-03  3.14E-03
SET-MVL               0.01      0.01      0.01      0.01      0.01      0.01      0.01
Arithmetic SUM, MG    0.01      0.01      0.01      0.01      0.01      0.01      0.01
Arithmetic SUM, CS    0.01      0.01      0.01      0.01      0.01      0.01      0.01
Arithmetic CO         0.01      0.01      0.01      0.01      0.01      0.01      0.01

Power (nW), low power
Component             40 K      77 K      103 K     120 K     200 K     250 K     300 K
LUT                   0.07      0.26      0.44      0.58      1.60      2.64      3.70
Register              0.08      0.30      0.53      0.72      1.99      3.14      4.48
7-INPUT MVL           0.05      0.20      0.36      0.48      1.32      2.17      3.02
SET-MVL               0.01      0.03      0.06      0.08      0.21      0.34      0.48
Arithmetic SUM, MG    1.61E-03  0.01      0.01      0.01      0.04      0.07      0.09
Arithmetic SUM, CS    0.01      0.04      0.07      0.09      0.25      0.40      0.57
Arithmetic CO         0.01      0.03      0.06      0.08      0.21      0.34      0.48

Power (nW), high performance
Component             40 K      77 K      103 K     120 K     200 K     250 K     300 K
LUT                   6.67      25.76     44.53     58.19     162.20    266.69    373.81
Register              8.02      29.88     53.16     72.12     199.64    315.21    450.34
7-INPUT MVL           5.37      20.05     35.87     48.15     132.24    217.31    302.60
SET-MVL               0.94      3.51      6.26      8.44      23.24     37.58     52.90
Arithmetic SUM, MG    0.22      0.80      1.44      1.91      5.19      8.88      12.04
Arithmetic SUM, CS    1.04      3.87      6.90      9.30      25.60     41.51     58.35
Arithmetic CO         0.94      3.51      6.26      8.44      23.24     37.58     52.90
Table 5.5: Characterization of IceFlex Interconnect Fabric for CΣ = e²/(40kBT) [133].

Latency (ns), low power
Component   40 K       77 K       103 K      120 K      200 K      250 K      300 K
ISF         6.696      5.238      4.727      4.537      3.712      3.351      3.169
Single      0.728      0.699      0.694      0.697      0.799      0.770      0.784
Double      0.704      0.687      0.685      0.689      0.794      0.766      0.781
Hex         0.692      0.680      0.680      0.684      0.791      0.763      0.779
Global      2.996      4.523      4.657      4.237      4.572      4.520      6.785

Latency (ns), high performance
Component   40 K       77 K       103 K      120 K      200 K      250 K      300 K
ISF         0.050      0.039      0.037      0.036      0.030      0.028      0.027
Single      0.006      0.006      0.006      0.007      0.005      0.005      0.005
Double      0.006      0.006      0.006      0.007      0.005      0.005      0.005
Hex         0.006      0.006      0.006      0.007      0.005      0.005      0.005
Global      0.163      0.110      0.092      0.086      0.074      0.073      0.099

Power (nW), low power
Component   40 K       77 K       103 K      120 K      200 K      250 K      300 K
ISF         0.219      0.844      1.457      1.903      5.302      8.727      12.226
Single      0.008      0.032      0.057      0.076      0.210      0.342      0.479
Double      0.017      0.063      0.113      0.152      0.420      0.684      0.958
Hex         0.034      0.127      0.226      0.305      0.840      1.368      1.917
Global      271.780    23.912     6.668      4.460      3.555      4.513      5.857

Power (nW), high performance
Component   40 K       77 K       103 K      120 K      200 K      250 K      300 K
ISF         22.022     85.034     146.920    191.957    535.072    879.837    1233.147
Single      0.959      3.387      6.193      7.977      24.992     34.101     53.581
Double      1.917      6.775      12.386     15.955     49.984     68.202     107.160
Hex         3.835      13.549     24.771     31.909     99.967     136.400    214.320
Global      6674.800   5146.700   5560.900   5824.100   5318.200   4856.100   4745.700
logic components and interconnect fabric IceFlex, including the multi-gate SET
reconfigurable lookup table (LUT)¹, SET register (Register), SET and CMOS
four-out-of-seven majority voting logic (MVL), multi-gate (MG) and CMOS-style (CS)
exclusive-OR, carry-out (CO) logic, SET local interconnect (Single, Double, and Hex),
hybrid SET/CMOS global interconnect (Global), and SET input switch fabric (ISF).
From these results, we make the following observations.
First, IceFlex has high energy efficiency, good performance, and high flexibility
in trading performance against energy efficiency. At the low-power setting,
the power consumption of the SET-based logic components and local interconnect
fabric is on the order of nanowatts. The hybrid SET/CMOS global interconnect has the
highest power consumption, a result of the high capacitance of global wires and the
high power consumption of the CMOS buffers. All components in the low-power version
of IceFlex still have latencies in the range of nanoseconds. SETs have high junction
resistance and low driving strength. At the high-performance setting, which scales
the SET junction resistance down to 100 kΩ, the latencies of the SET-based logic and
local interconnect fabric are consistently lower than 100 ps. Even though reducing
the resistance results in a 100× increase in power, the overall energy efficiency of
IceFlex remains orders of magnitude higher than that of CMOS-based solutions, as
demonstrated in Section 5.4.2.
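The power and latency entries can be combined into a rough energy-per-operation estimate. The following minimal sketch uses the 300 K LUT values read from Table 5.4 for the low-power and high-performance settings. Treating the power-latency product as the energy per operation is an assumption made here for illustration, and the function name is hypothetical.

```python
# Rough power-latency product for the IceFlex LUT at 300 K (Table 5.4 values).
# Assumption of this sketch: energy per operation ~ power x latency.

def power_delay_product(power_nw: float, latency_ns: float) -> float:
    """Return the power-latency product in joules (1 nW * 1 ns = 1e-18 J)."""
    return power_nw * 1e-9 * latency_ns * 1e-9

# Low-power setting: 3.70 nW, 4.75 ns. High-performance: 373.81 nW, 0.04 ns.
low_power = power_delay_product(power_nw=3.70, latency_ns=4.75)
high_perf = power_delay_product(power_nw=373.81, latency_ns=0.04)

print(f"low power:        {low_power:.3e} J")
print(f"high performance: {high_perf:.3e} J")
print(f"power ratio:      {373.81 / 3.70:.0f}x")
```

Under this approximation the two settings land within the same order of magnitude of energy per operation, even though their power differs by about 100×.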
Second, these results demonstrate the impact of temperature on SET performance
and power consumption – as the temperature increases, performance increases and
the power efficiency decreases. This is a result of the impact of thermal energy on
tunneling events and therefore circuit behavior, which is described in Section 5.2.
The number of electrons with sufficient energy to overcome the Coulomb blockade
effect increases with temperature, thereby increasing tunneling rate, performance,
and power consumption.

¹ To allow comparison with Xilinx FPGAs, a 16-to-1 setting is used.
The $C_\Sigma = e^2/(40 k_B T)$ setting enables greater resistance to shot noise than the
$C_\Sigma = e^2/(10 k_B T)$ setting. However, it also imposes performance and power
consumption penalties. For SET circuitry, the required supply voltage is inversely proportional
to gate capacitance. Compared to the $C_\Sigma = e^2/(10 k_B T)$ setting, $C_\Sigma = e^2/(40 k_B T)$
requires a further reduction of SET gate capacitance and an increase in supply voltage.
Note that the driven capacitance of a SET circuit is dominated by the metal wires.
Therefore, decreased gate capacitance has negligible impact on power consumption.
The increased supply voltage, on the other hand, increases circuit dynamic power
consumption. Moreover, the increased voltage range increases the duration of signal
swing, thereby increasing latency.
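To give a sense of scale for the two capacitance settings, the sketch below evaluates $C_\Sigma = e^2/(n k_B T)$ for n = 10 and n = 40 at the temperatures used in Tables 5.4 and 5.5. The physical constants are the exact SI values; the function name and the output formatting are illustrative choices, not part of the original characterization flow.

```python
# Island capacitance bound C_sigma = e^2 / (n * k_B * T) for the two settings.
# Constants are the exact SI (2019) values; the function name is illustrative.

E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
K_B = 1.380649e-23           # Boltzmann constant, joules per kelvin

def max_island_capacitance(temp_k: float, n: float) -> float:
    """Capacitance (F) at which the charging energy e^2/C equals n*k_B*T."""
    return E_CHARGE ** 2 / (n * K_B * temp_k)

for temp_k in (40, 77, 103, 120, 200, 250, 300):
    c10 = max_island_capacitance(temp_k, 10)
    c40 = max_island_capacitance(temp_k, 40)
    # 1 aF = 1e-18 F
    print(f"{temp_k:3d} K: n=10 -> {c10 * 1e18:.3f} aF, n=40 -> {c40 * 1e18:.3f} aF")
```

At a given temperature, the n = 40 setting requires a capacitance four times smaller than the n = 10 setting, which is the reduction in gate capacitance the paragraph above refers to.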
5.4.1.1 SET Multi-Gate Multiplexer Tree
As described in Section 5.3.2.1, multi-gate SETs improve the performance, power
consumption, and area efficiency of the multiplexer tree design. This section charac-
terizes the impact of thermal energy on the proposed multi-gate design.
As described in Section 5.3.2.1, at the high-performance $C_\Sigma = e^2/(10 k_B T)$ setting,
the dual-gate design is used for temperatures at or below 200 K. For this setting, only
the single-gate design is feasible at temperatures greater than 250 K due to the high static
current at these temperatures. As a result, circuit power consumption is increased
at high temperatures. From 200 K to 250 K, both latency and power consumption
increase. In addition, when using the same design, we observe that both the circuit
performance and power consumption increase with temperature. The same trend
was described in Section 5.4.1. Using the low-power design of IceFlex, only the
single-gate design is feasible (see Section 5.3.2.1). Using $e^2/C_\Sigma \ge 40 k_B T$, SET
circuitry is less susceptible to thermal energy thanks to the increased charging energy.
Therefore, both low-power and high-performance dual-gate multiplexer tree designs
become feasible across the entire temperature range. As shown in Figure 5.8, using the
high-performance $C_\Sigma = e^2/(40 k_B T)$ setting, the performance and power consumption
of the multi-gate multiplexer tree design increase consistently with temperature. A
similar trend can be shown for the corresponding low-power design case.

Figure 5.8: Power and Performance of the Multi-gate SET Multiplexer Tree for High Performance, $C_\Sigma = e^2/(40 k_B T)$ [133].
[Plot: LUT latency (ns) and LUT power (nW) versus temperature T (K), 0 K to 300 K.]
5.4.1.2 Power and Performance of Interconnect Design
Power consumption, performance, and the tradeoff between them are of central
importance in interconnect design. We considered both SET-only and SET/CMOS hybrid
interconnect driver designs. The relative static power benefit of the SET/CMOS
hybrid design over the SET-only design increases as the wire load increases. This
is mainly due to an increase in the static power consumption of the SET-only
design as more SET buffers are used to meet the driving strength requirements. The
SET-only design has superior dynamic power efficiency. As the wire length increases, the
proportion of capacitance contributed by CMOS buffer gates becomes less significant
relative to wire capacitance. Therefore, compared to the SET-only design, the
dynamic power consumption of the SET/CMOS hybrid design also improves, but is still
inferior to that of the SET-only design. At 300 K, for both the $C_\Sigma = e^2/(40 k_B T)$ and
$C_\Sigma = e^2/(10 k_B T)$ settings, we found that SET-only designs had better energy
efficiencies for wires shorter than approximately 1 mm, and SET/CMOS hybrid designs
were better for longer wires. As temperature increases, the impact of thermal energy
increases and, as a result, so does the static power consumption of SETs. Therefore,
the wire length at which the SET/CMOS design begins to outperform the SET-only
design decreases as temperature increases.
Table 5.5 illustrates two interesting trends for global interconnect. The power
consumption of both the low-power and the high-performance $C_\Sigma \le e^2/(40 k_B T)$ hybrid
SET/CMOS designs decreases with increasing temperature. At low temperatures, the
output voltage ranges and driving currents of the SETs are small, increasing CMOS
buffer static power consumption.
5.4.1.3 Performance and Power Characterization of SET Non-Unate Logic
SETs support the efficient implementation of some non-unate arithmetic
functions. We evaluate the power consumption and performance of an exclusive-OR
gate, a non-unate Boolean function widely used in arithmetic logic, e.g., in addition
and multiplication. We compared the two different implementations described in
Section 5.3.2.3, the proposed SET-based design and the CMOS-style SET
implementation. Figure 5.9 shows the power and performance characterization of these two
designs at the low-power and high-performance settings for $C_\Sigma = e^2/(40 k_B T)$.
These results demonstrate the superior power consumption and performance of this
design style, which is not possible using BJTs, CMOS, or threshold logic. Compared
to the CMOS-style SET implementation, the design that exploits the periodic I–V
curve of SETs achieves the latency and power consumption reductions indicated in
Table 5.6, i.e., approximately a 25% reduction in latency and 75% reduction in energy
consumption.

Figure 5.9: Performance and Power Characterization of Exclusive-OR Logic for Low Power for $C_\Sigma = e^2/(40 k_B T)$ [133].
[Plot: multi-gate style and CMOS style latency (ns) and power (nW) versus temperature T (K), 0 K to 300 K.]

Table 5.6: Latency and Energy Improvement for Exclusive-OR Design [133].

Performance   $C_\Sigma$ constraint     Performance        Energy
setting       (F)                       improvement (%)    improvement (%)
Battery       $e^2/(10 k_B T)$          40.8               64.1
Battery       $e^2/(40 k_B T)$          22.0               87.1
High          $e^2/(10 k_B T)$          32.1               84.6
High          $e^2/(40 k_B T)$          25.2               84.4
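The kind of improvement reported in Table 5.6 can be approximated from the exclusive-OR rows of Table 5.4. The sketch below compares the multi-gate (MG) and CMOS-style (CS) designs at the low-power setting at 300 K. It assumes that energy is approximated by the power-latency product and uses the rounded table entries; the source does not state exactly how Table 5.6 was computed, so this is only a plausibility check with hypothetical names.

```python
# Latency and energy reduction of the multi-gate (MG) exclusive-OR relative to
# the CMOS-style (CS) SET implementation: low-power setting, 300 K, Table 5.4.
# Assumption of this sketch: energy ~ power x latency, using rounded entries.

def reduction_percent(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

mg_latency_ns, mg_power_nw = 2.29, 0.09   # SUM MG row
cs_latency_ns, cs_power_nw = 2.93, 0.57   # SUM CS row

latency_gain = reduction_percent(cs_latency_ns, mg_latency_ns)
energy_gain = reduction_percent(cs_latency_ns * cs_power_nw,
                                mg_latency_ns * mg_power_nw)

print(f"latency reduction: {latency_gain:.1f} %")
print(f"energy reduction:  {energy_gain:.1f} %")
```

The results come out close to the 22.0% and 87.1% listed for the battery-powered $e^2/(40 k_B T)$ row of Table 5.6; the small differences are plausibly due to rounding of the Table 5.4 entries.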
5.4.2 Characterization of High-Performance and Battery-Powered
Embedded Applications
This section characterizes the performance and power consumption of IceFlex
when used to implement numerous general-purpose and application-specific processor
cores. We evaluate the suitability of IceFlex for use in both portable battery-powered
and high-performance embedded systems by determining its performance and energy
efficiency when used to implement the processor cores described below. We have
divided the cores into battery-powered and high-performance categories.
Battery-Powered
AES (Rijndael) IP core (AES), ATMega103 microcontroller (AVR), coordinate
rotation computer (CORDIC), ECC core (ECC), 32-bit IEEE 754 floating-point unit
(FPU), Reed–Solomon encoder (RS), USB 2.0 function (USB), and video compression
systems (VC).
High-Performance
Power-efficient RISC CPU (ARM7), synchronous / DLX core (ASPIDA DLX),
five-stage pipeline RISC CPU (Jam RISC), entire SPARC V8 processor (LEON2
SPARC), RISC CPU (Microblaze), MIPS I clone (miniMIPS), MIPS processor (MIPS)
supporting most MIPS I opcodes (Plasma), MIPS I integer-only clone (UCore), and
MIPS I clone (YACC).

Table 5.7: IceFlex Performance and Power Consumption at Room Temperature for $C_\Sigma = e^2/(40 k_B T)$ [133].

                         FPGA, 22 nm CMOS      IceFlex               IceFlex
                         technology*           battery-powered       high-performance
Benchmarks               Freq      Energy      Freq      Energy      Freq      Energy
                         (MHz)     (J/cycle)   (MHz)     (J/cycle)   (MHz)     (J/cycle)
ARM7                     26.3      2.96e-09    2.0       5.47e-11    224.0     4.79e-11
ASPIDA DLX               125.7     8.86e-10    11.5      6.37e-12    1333.3    5.58e-12
Jam RISC                 95.9      8.92e-10    12.8      3.65e-12    1481.5    3.19e-12
LEON2 SPARC              85.9      1.88e-09    8.8       2.39e-11    1025.6    2.09e-11
Microblaze RISC          115.1     7.28e-10    16.4      2.01e-12    1904.8    1.76e-12
miniMIPS                 88.0      4.87e-10    9.6       9.78e-12    1111.1    8.56e-12
MIPS                     80.4      1.02e-09    10.5      4.34e-12    1212.1    3.80e-12
Plasma                   75.4      1.13e-09    8.8       6.91e-12    1025.6    6.05e-12
UCore                    136.4     8.19e-10    12.8      5.45e-12    1481.5    4.78e-12
YACC                     72.1      1.18e-09    19.2      3.08e-12    2222.2    2.69e-12
AES                      205.3     3.43e-10    28.7      2.34e-12    3333.3    2.05e-12
AVR                      71.9      2.67e-10    9.6       5.34e-12    1111.1    4.67e-12
CORDIC                   271.8     1.37e-10    114.9     2.05e-13    13333.3   1.79e-13
ECC                      39.1      4.91e-10    11.5      6.92e-12    1333.3    6.05e-12
FPU                      28.4      1.00e-09    2.6       8.02e-11    296.3     7.02e-11
RS                       496.7     1.28e-11    57.5      4.61e-14    6666.7    4.05e-14
USB                      171.6     3.24e-10    38.3      1.53e-12    4444.4    1.34e-12
VC                       114.16    1.24e-09    23.0      1.04e-11    2666.8    9.10e-12
Avg. energy improvement                                  68.58×                78.46×
The Xilinx Virtex-II XC2V2000 FPGA is used as a base case for comparison.
Each application is synthesized with Xilinx ISE to determine the number of required
LUTs, maximum frequency, and power consumption, using a switching probability of
10% [121] and a 65 nm feature size. Then, we scale the FPGA synthesis results into
a 22 nm process based on HSPICE predictive technology model simulation results
for the two technologies [130]. We used FPGA synthesis software to estimate the
number of IceFlex SELBs required. 16-entry Virtex-II LUTs were used due to their
functional (but not structural) similarity to IceFlex SELBs. For each design, the
critical-path delay of IceFlex was determined by multiplying the number of SELBs
along the longest combinational path by the sum of the delay of an IceFlex SELB and
the delay of a local interconnect; the maximum frequency is the reciprocal of this
delay. IceFlex power consumption was computed by taking
the sum of the power consumptions of all components at the maximum operating
frequency. Note that, since Xilinx ISE does not report use of global interconnect
for any of the processors we synthesized, we exclude the hybrid global interconnect
from IceFlex power analysis. In designs that use primarily local interconnect (i.e.,
single, double, and hex interconnect), the reported power consumption results will
be accurate. However, for designs in which global hybrid SET–CMOS interconnect
dominates, the power consumption may approach that of global interconnect in a
corresponding 22 nm CMOS design.
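The frequency estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tool flow: the SELB delay is taken here as the LUT latency plus the input switch fabric (ISF) latency, and the local interconnect delay as the "Single" latency, all at 300 K for the low-power setting (Tables 5.4 and 5.5). The path depths and the function name are hypothetical.

```python
# Maximum-frequency estimate for an IceFlex design from its logic depth.
# Assumptions (not stated in the source): SELB delay = LUT + ISF latency,
# local interconnect delay = "Single" latency, low-power setting at 300 K.

LUT_NS, ISF_NS, SINGLE_NS = 4.75, 3.169, 0.784   # Tables 5.4 and 5.5

def max_frequency_mhz(path_depth: int, selb_ns: float, local_ns: float) -> float:
    """f_max = 1 / (depth * (SELB delay + local interconnect delay))."""
    critical_path_ns = path_depth * (selb_ns + local_ns)
    return 1e3 / critical_path_ns   # 1/ns is GHz, so scale by 1e3 for MHz

for depth in (1, 2, 12):   # hypothetical critical-path depths, in SELBs
    f_mhz = max_frequency_mhz(depth, LUT_NS + ISF_NS, SINGLE_NS)
    print(f"depth {depth:2d}: {f_mhz:7.1f} MHz")
```

With these assumed per-stage delays, a depth of one SELB gives roughly 115 MHz, which is in line with the fastest battery-powered entry in Table 5.7.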
Table 5.7 shows the operating frequencies and energy efficiency in Joules per clock
cycle of the CMOS FPGA and IceFlex variants for each benchmark application. As
described in Section 5.3.1.5, recent progress in fabrication is reducing the severity of
the random background charge problem. If that work is successful, it may be less critical
to use redundancy and majority voting logic in IceFlex.
5.4.2.1 Ultra-Low-Power Applications
The data in Table 5.7 indicate that the non-redundant, room temperature,
low-power version of IceFlex is suitable for use in applications such as sensor
network nodes, if such devices can be fabricated with sufficiently small island capacitances. In
the following analysis, we shall focus on the AVR core, which is representative of
a commonly-used sensor network node processor. Alkaline AA batteries typically
have a capacity of 2,800 mAh and nominal operating voltages of 1.5 V, i.e., they can
deliver approximately 15,000 J. Using the conservative $C_\Sigma \le e^2/(40 k_B T)$ constraint, a
low-power IceFlex AVR implementation running at 4 MHz consumes approximately
20 µW, permitting it to run for 20 years on one AA battery, i.e., longer than the shelf
life of most such batteries. When the less conservative $C_\Sigma \le e^2/(10 k_B T)$ constraint
is used, the average energy consumption improvements increase to 95.60×
(non-redundant battery powered), 115.65× (non-redundant high performance), 12.27×
(redundant battery powered), and 15.27× (redundant high performance).
This power consumption is also low enough to permit an AVR processor to operate
on energy scavenged from the environment. If we assume an energy scavenging
volume of 5 cm³ and use Roundy’s power densities of 4 µW/cm³ for indoor solar
energy, 200 µW/cm³ for vibrations, 10 µW/cm³ for daily temperature variation, and
0.003 µW/cm³ for acoustic noise at 75 dB [92], we find that one sensor network node
is capable of scavenging enough energy to power an IceFlex AVR processor running
at the maximum clock frequency from vibrations or daily temperature variation, at
3.7 MHz from indoor solar energy, and at 2.8 kHz from 75 dB acoustic noise. However,
SET circuits that operate at room temperature and adhere to the $C_\Sigma \le e^2/(40 k_B T)$
constraint will rely on features with sizes approaching (but not crossing) physical
limits. Although the use of SETs in battery-powered applications has potential, it
depends on the solution of formidable fabrication challenges or the development of
compact, low-power cooling methods.
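The battery-life and energy-scavenging arithmetic in this section can be reproduced with a short sketch. It takes the energy per cycle and maximum frequency of the battery-powered AVR entry in Table 5.7 and the scavenging densities quoted above. The function names, the 365.25-day year, and the assumption that power scales linearly with clock frequency are illustrative choices made here.

```python
# Battery-life and energy-scavenging estimates for the low-power IceFlex AVR.
# Energy per cycle and maximum frequency: battery-powered AVR row, Table 5.7.
# Assumption of this sketch: power = energy per cycle x clock frequency.

ENERGY_PER_CYCLE_J = 5.34e-12
MAX_FREQ_HZ = 9.6e6
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SCAVENGING_VOLUME_CM3 = 5.0

def battery_energy_j(capacity_mah: float, voltage_v: float) -> float:
    """Stored energy in joules: mAh converted to coulombs, times volts."""
    return capacity_mah * 1e-3 * 3600 * voltage_v

def sustainable_freq_hz(power_w: float) -> float:
    """Highest clock rate a power budget supports, capped at MAX_FREQ_HZ."""
    return min(power_w / ENERGY_PER_CYCLE_J, MAX_FREQ_HZ)

energy = battery_energy_j(2800, 1.5)
power_4mhz = ENERGY_PER_CYCLE_J * 4e6
life_years = energy / power_4mhz / SECONDS_PER_YEAR
print(f"AA battery energy: {energy:.0f} J")
print(f"power at 4 MHz:    {power_4mhz * 1e6:.1f} uW")
print(f"battery life:      {life_years:.1f} years")

densities_uw_per_cm3 = {"indoor solar": 4.0, "vibrations": 200.0,
                        "temperature variation": 10.0, "acoustic noise": 0.003}
for source, density in densities_uw_per_cm3.items():
    budget_w = density * 1e-6 * SCAVENGING_VOLUME_CM3
    print(f"{source:22s}: {sustainable_freq_hz(budget_w):.3e} Hz")
```

Under these assumptions, daily temperature variation supports a clock rate slightly below the 9.6 MHz maximum, so "maximum clock frequency" holds only approximately for that source.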
5.4.2.2 Energy-Efficient High-Performance Applications
We can draw the following general conclusions from Table 5.7. For a wide range
of processor cores, the SET-based IceFlex architecture is capable of achieving energy
efficiencies two orders of magnitude better than 22 nm CMOS-based FPGAs. Peak
frequencies ranging from 200 MHz to 2 GHz are maintained for all processors.
One might expect the high-performance version of IceFlex to consistently achieve
higher frequency but lower energy efficiency than the low-power version of IceFlex.
However, its energy efficiency is typically better, as well. Operating at higher fre-
quencies can permit reduced static energy consumption, and therefore better energy
efficiency, especially at room temperature where static power consumption is high
(see Figure 5.2). Therefore, for SET-based architectures that are operated at room
temperature and have low performance requirements, it will generally be more energy
efficient to operate the device at high frequency and periodically enter a power-gated
sleep mode than to continuously operate at a low frequency.
In high-performance applications for which parallel computation is appropriate,
improved energy efficiency can be traded for improved performance with the same
energy budget. For example, given a power budget of 125 mW and $C_\Sigma \le e^2/(40 k_B T)$,
one could use one LEON2 SPARC implemented with an FPGA and running at 85 MHz
or 5 LEON2 SPARCs implemented with the high-performance variant of IceFlex and
operating at 1,025 MHz. This implies an overall performance 60× higher than that of
the FPGA version. Taken to its logical extreme, assuming a power budget of 100 W
and one instruction per cycle, one could execute 4.8 tera instructions per second (IPS). These numbers are
intended to give the reader some indication of the potential to improve performance
given a power budget. In practice some of this performance will be lost due to
parallelization inefficiency and off-chip communication latency. A similar comparison
can be used for the MIPS processor, for which IceFlex permits a 268× improvement
in energy efficiency compared with an FPGA implementation.
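The power-budget comparison above can be checked with a small calculation. The sketch below uses the LEON2 SPARC and MIPS rows of Table 5.7 and, as in the text, assumes perfect parallel scaling and one instruction per cycle. The function names are hypothetical, and core power is taken as frequency times energy per cycle.

```python
# Trading energy efficiency for parallel throughput under a power budget,
# using the LEON2 SPARC row of Table 5.7 (high-performance IceFlex vs. FPGA).
# Assumes perfect parallel scaling and one instruction per cycle.

ICEFLEX_FREQ_HZ = 1025.6e6
ICEFLEX_ENERGY_J = 2.09e-11    # J/cycle, high-performance IceFlex
FPGA_FREQ_HZ = 85.9e6

def cores_within_budget(budget_w: float) -> int:
    """Number of IceFlex cores whose combined power fits the budget."""
    core_power_w = ICEFLEX_FREQ_HZ * ICEFLEX_ENERGY_J
    return int(budget_w // core_power_w)

def aggregate_ips(budget_w: float) -> float:
    """Aggregate instructions per second across all cores in the budget."""
    return cores_within_budget(budget_w) * ICEFLEX_FREQ_HZ

cores = cores_within_budget(0.125)
speedup = cores * ICEFLEX_FREQ_HZ / FPGA_FREQ_HZ
mips_energy_ratio = 1.02e-9 / 3.80e-12   # MIPS row: FPGA vs. IceFlex energy

print(f"cores in 125 mW:  {cores}")
print(f"speedup vs. FPGA: {speedup:.1f}x")
print(f"IPS in 100 W:     {aggregate_ips(100.0):.2e}")
print(f"MIPS energy gain: {mips_energy_ratio:.0f}x")
```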
5.5 Conclusions
In this chapter, we have analyzed the impact of using SETs in architecture and
circuit design; proposed IceFlex, a fault-tolerant, reconfigurable, hybrid SET/CMOS
architecture for use in high-performance and battery-powered embedded systems;
and evaluated the energy efficiency, power consumption, and performance of IceFlex
in these applications. Our results indicate that using SETs for computation poses
many design challenges, some of which can be solved with the proposed architecture
and circuit design techniques. In addition, we find that SETs have unique proper-
ties that permit significant improvements in circuit efficiency when compared with
BJT, CMOS, and threshold logic based design. In summary, we find that a hybrid
SET/CMOS architecture has the potential to improve energy efficiency in
battery-powered and high-performance applications by two orders of magnitude compared with
22 nm CMOS while permitting operating frequencies that are as high or higher.
Although SETs hold great promise, their practical use will require additional
research into fault tolerance techniques, processing technologies, and novel circuit de-
signs. In particular, the use of SET-based designs in portable applications will either
require the fabrication of features with sizes approaching physical limits or the devel-
opment of compact, energy-efficient technologies permitting operation below ambient
temperature.
Chapter 6
Conclusions and Future Work
This chapter summarizes the proposed techniques and discusses possible directions
for future work.
6.1 Thesis Summary
This thesis proposes several techniques and algorithms, spanning system-level
synthesis, a recently developed integration technology, and an emerging device
technology, to address the power, thermal and reliability problems of modern
integrated circuit design.
Technology scaling and increasing power densities make IC design lifetime reli-
ability problems more severe. Lifetime reliability strongly depends on system-level
architecture, redundancy, and IC thermal profile during operation. To explore
system-level synthesis algorithms that increase IC lifetime through thermal and
structural redundancy optimization, a two-stage synthesis process has been proposed. A
potentially-slow but high-quality stochastic optimization algorithm is first used to
minimize solution area. Starting from this promising location in the solution space,
a reliability enhancement heuristic explores the area-MTTF tradeoff curve. The pro-
posed algorithm has been integrated into a system-level synthesis flow that conducts
architectural synthesis, floorplanning, on-chip network synthesis, chip-package ther-
mal analysis and reliability analysis. As indicated by our results, the proposed syn-
thesis system achieves 436% average system MTTF improvement with a maximum
area overhead of 25%. Compared with a one-phase stochastic optimization algorithm,
the proposed synthesis system can always produce solutions of equal or better quality while
requiring less CPU time.
Several three-dimensional integration technologies have been proposed and
developed to overcome the limitations of 2D technology. 3D technology offers two
main benefits: (1) it increases logic integration density significantly; and (2) it
reduces on-chip wire length, especially for global and semi-global wires. However,
by stacking multiple device layers connected through inter-die vias, 3D integration
increases the importance and difficulty of thermal management for the following
reasons: (1) chip cross-sectional power density increases linearly with the number of
vertically-stacked active circuit layers; (2) the interconnect and bonding layers used
in 3D integration have low thermal conductivities, which further exacerbates thermal
effects; (3) the high power density of 3D chips will frequently require operation at or
near thermal limits; and (4) 3D chips have heterogeneous power and thermal
characteristics, which challenge run-time thermal management. To investigate the
run-time thermal management problem of 3D integrated circuits, we developed an
analytical framework for 3D heat flow and proposed a proactive global power-thermal
budgeting algorithm, a performance counter-based workload monitor, and distributed
thermal control techniques.
The proposed technique, called ThermOS and built upon the Linux 2.6.8 kernel, is
a unified hardware and OS thermal management solution that maximizes thermally-safe
3D IC performance. The results indicate that proactive power-thermal budgeting
allows a 30% improvement in instruction throughput compared to a state-of-the-art
proactive thermal management approach. Evaluation results also indicate that the
proposed technique has small performance overhead and good scalability.
Device researchers have seen the coming challenges for CMOS devices and eval-
uated alternative technologies. The International Technology Roadmap for Semi-
conductors projects that single-electron tunneling transistors have the potential to
achieve the lowest projected energy per switching event of any known device. In
order to explore the potential use of SETs in low-power embedded systems, SET-
based design was brought to the system level to characterize the impacts of SETs on
system design metrics and evaluate the benefits and limitations of SETs. Based on
the evaluation of the architectural and circuit-level features, a fault-tolerant,
reconfigurable, hybrid SET/CMOS based architecture called IceFlex was proposed. The
results indicate that a hybrid SET/CMOS architecture has the potential to
improve energy efficiency in battery-powered and high-performance applications by two
orders of magnitude compared with 22 nm CMOS while permitting operating
frequencies that are as high or higher. Although they hold great promise, the practical use
of SETs will require additional research into fault tolerance techniques, processing
technologies and novel circuit designs.
6.2 Future Work
The following research directions can be further pursued.
3D Thermal-Aware and Reliability-Aware Synthesis
Due to the additional constraints of stacking multiple device layers, the synthesis
algorithms for 3D circuits are quite different from those for traditional planar
integrated circuits. Besides the traditional optimization goals, such as performance,
area and interconnect latency, 3D synthesis also needs to address issues unique to 3D
circuits, such as minimizing the number of inter-die vias [4, 23, 22]. Furthermore, the
use of 3D integration magnifies power dissipation problems. Temperature-related
concerns that can sometimes be safely ignored in 2D circuit design, such as
temperature-induced performance or reliability degradation, become increasingly
prominent in 3D integrated circuits. In addition, the dependence of leakage power
consumption on temperature will further exacerbate the thermal effect. These issues
must be tackled during the physical-level synthesis procedure [47]. In addition, in
high-level and system-level synthesis, task assignment and scheduling need to be
carefully designed to balance power consumption in the spatial and time domains,
respectively [61]. The road ahead presents many challenges in developing EDA tools
to explore the design space of 3D integrated circuits before one can fully benefit from
this new technology.
Thermal and Reliability Modeling for SET Circuits
Although single-electron tunneling transistors hold great promise, practical use
of SETs will require additional research into thermal and reliability modeling for
SET devices. Enabling operation at the desired temperature is a major concern for
SETs. Fabricating SET islands of small enough size and capacitance to permit
room-temperature operation is a major challenge. Researchers have proposed accurate
chip-package thermal analysis techniques for use in IC synthesis and design [124,
125, 123, 45]. However, there is no established work on thermal modeling of SET
devices that includes detailed thermal characterization and fast nanoscale thermal
analysis methods. In addition, SET circuits are susceptible to logic errors resulting from a
phenomenon called the random background charge effect. Fault tolerance must be
carefully addressed when using SETs in system-level design. In order to do so,
accurate SET fault probability estimation is required.
Energy Optimization on Application Layer
This dissertation discussed power, thermal and reliability optimization at the
hardware and operating system layers. In the future, energy optimization at the
application layer can be explored, especially for applications running on
battery-powered, portable devices. Personal, portable communication and computation
devices are now part of hundreds of millions of lives, often in the form of smart-phones.
From Daniel Henderson’s 1993 prototype, intellect, which can receive and display im-
ages and video media [1], to the first photo taken by Philippe Kahn in 1997 using a
camera phone and shared instantly with more than 2,000 families [3], the functional-
ity and adoption of personal portable devices have continuously increased. Today’s
personal portable devices, such as the iPhone from Apple, Blackberry from RIM,
and Android phone from Google, have integrated many system functions, such as the
global positioning system (GPS), cameras, sensors, large touch screens, and easy-to-
use interfaces. Global mobile phone subscriptions reached 3.3 billion in 2007 [48].
Users are able to capture information anywhere and anytime. These devices are also
heavily used for information sharing and social interaction. In battery-powered mobile
systems, energy consumption is a primary design concern. A limited battery energy
budget forces hardware designers to use energy-efficient, but slow microprocessors
and limited storage hardware. These constraints, in turn, limit the performance and
functionality of software applications running on portable devices. These challenges
need to be addressed during the application design process.
Bibliography
[1] American museum. http://americanhistory.si.edu/.
[2] Transistor count on Wikipedia. http://en.wikipedia.org/wiki/Transistor_count.
[3] Wikipedia. http://en.wikipedia.org/wiki/Philippe_Kahn/.
[4] Cristinel Ababei, Yan Feng, Brent Goplen, Hushrav Mogal, Tianpei Zhang,
Kia Bazargan, and Sachin Sapatnekar. Placement and routing in 3D integrated
circuits. IEEE Design & Test, 22(6):520–531, November 2005.
[5] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate vs. IPC:
The end of the road for conventional microarchitectures. In Proc. Int. Symp.
Computer Architecture, pages 276–283, June 2000.
[6] M. Ahlskog, R. Tarkiainen, L. Roschier, and P. Hakonen. Single-electron transis-
tor made of two crossing multiwalled carbon nanotubes and its noise properties.
Applied Physics Ltrs., 77:4037–4039, December 2000.
[7] AMD multi-core white paper. http://www.amd.com.
[8] ANSYS. http://www.ansys.com/.
[9] D. V. Averin and K. K. Likharev. Coulomb blockade of tunneling and coherent
oscillations in small tunnel junctions. J. Low Temperature Physics, 62:345–372,
February 1986.
[10] R. Iris Bahar, Joseph Mundy, and Jie Chen. A probabilistic-based design
methodology for nanoscale computation. In Proc. Int. Conf. Computer-Aided
Design, pages 480–486, November 2003.
[11] Nathan L. Binkert, Ronald G. Dreslinski, Lisa R. Hsu, Kevin T. Lim, Ali G.
Saidi, and Steven K. Reinhardt. The M5 simulator: Modeling networked systems.
IEEE Micro, 26(4):52–60, 2006.
[12] Bryan Black, Murali M. Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang,
Gabriel H. Loh, Don McCauley, Pat Morrow, Donald W. Nelson, Daniel Pantuso,
Paul Reed, Jeff Rupley, Sadasivan Shankar, John Shen, and Clair Webb. Die
stacking (3D) microarchitecture. In Proc. Int. Symp. Microarchitecture, pages
469–479, December 2006.
[13] K. A. Bowman, B. L. Austin, J. C. Eble, X. Tang, and J. D. Meindl. A physical
alpha-power law MOSFET model. IEEE J. Solid-State Circuits, 34:1410–1414,
October 1999.
[14] David Brooks and Margaret Martonosi. Dynamic thermal management for high-
performance microprocessors. In Proc. Int. Symp. High-Performance Computer
Architecture, pages 171–182, January 2001.
[15] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A framework
for architectural-level power analysis and optimizations. In Proc. Int. Symp.
Computer Architecture, pages 83–94, June 2000.
[16] K. R. Brown, L. Sun, and B. E. Kane. Electric-field-dependent spectroscopy
of charge motion using a single-electron transistor. Applied Physics Ltrs., 88,
May 2006.
[17] R. H. Chen. MOSES: a general Monte Carlo simulator for single-electron cir-
cuits. Meeting Abstracts, The Electrochemical Society, 96(2):576, October 1996.
[18] Yi-Kan Cheng, Ching-Chi Teng, Sung-Mo Kang, and Ching-Han Tsai. Elec-
trothermal Analysis of VLSI Systems. Cambridge University Press, 2000.
[19] Young-Kyun Cho and Yoon-Ha Jeong. Single-electron pass-transistor logic with
multiple tunnel junctions and its hybrid circuit with MOSFETs. ETRI J.,
26(6):669–672, December 2004.
[20] COMSOL Multiphysics. http://www.comsol.com/products/multiphysics/.
[21] A. K. Coskun, T. S. Rosing, K. Mihic, G. De Micheli, and Y. Leblebici. Analysis
and optimization of MPSoC reliability. J. Low Power Electronics, pages 56–69,
April 2006.
[22] Shamik Das, Anantha Chandrakasan, and Rafael Reif. Three-dimensional integrated
circuits: Performance, design methodology, and CAD tools. pages 13–18,
February 2003.
[23] Shamik Das, Andy Fan, Kuan-Neng Chen, and C. S. TanAnantha. Technol-
ogy, performance, and computer-aided design of three-dimensional integrated
circuits. In Proc. Int. Symp. Physical Design, pages 108–115, April 2004.
[24] Andre DeHon. Array-based architecture for FET-based nanoscale electronics.
IEEE Trans. Nanotechnology, 2(1):23–32, March 2003.
[25] Michel H. Devoret and Robert J. Schoelkopf. Amplifying quantum signals with
the single-electron transistor. Nature, 406:1039–1046, August 2000.
[26] Robert P. Dick. Multiobjective synthesis of low-power real-time distributed em-
bedded systems. PhD thesis, Dept. of Electrical Engineering, Princeton Univer-
sity, July 2002.
[27] Robert P. Dick, David L. Rhodes, and Wayne Wolf. TGFF: task graphs for
free. In Proc. Int. Wkshp. Hardware/Software Co-Design, pages 97–101, March
1998.
[28] James Donald and Margaret Martonosi. Techniques for multicore thermal man-
agement: Classification and new exploration. In Proc. Int. Symp. Computer
Architecture, June 2006.
[29] M. S. Dresselhaus, G. Dresselhaus, and Phaedon Avouris. Carbon Nanotubes.
Springer-Verlag, Germany, February 2001.
[30] Petru Eles, Zebo Peng, Krzysztof Kuchcinski, and Alexa Doboli. System level
hardware/software partitioning based on simulated annealing and tabu search.
ACM Trans. Design Automation Electronic Systems, 2:5–32, January 1997.
[31] Embedded microprocessor benchmark consortium. http://www.eembc.org.
[32] David K. Ferry and Stephen M. Goodnick. Transport in Nanostructures. Cam-
bridge University Press, 1997.
[33] T. A. Fulton and G. J. Dolan. Observation of single-electron charging effects
in small tunnel junctions. Physical Review Ltrs., 59:109–112, July 1987.
[34] M. Furlan and S. V. Lotkhov. Electrometry on charge traps with a single-electron
transistor. Physical Review B, 67:205313, 2003.
[35] A. K. Geim and K. S. Novoselov. The rise of graphene. Nature Materials,
6:183–191, March 2007.
[36] M. Glaß, M. Lukasiewycz, T. Streichert, C. Haubelt, and J. Teich. Reliability-
aware system synthesis. In Proc. Design, Automation & Test in Europe Conf.,
April 2007.
[37] Seth Copen Goldstein and Mihai Budiu. Nanofabrics: spatial computing using
molecular electronics. In Proc. Int. Symp. Computer Architecture, pages 178–
189, June 2001.
[38] Zhenyu (Peter) Gu, Changyun Zhu, Li Shang, and Robert P. Dick. Application-
specific MPSoC reliability optimization. IEEE Trans. VLSI Systems, 16(5),
May 2008.
[39] Michael Healy, Mario Vittes, Mongkol Ekpanyapong, Chinnakrishna Ballapu-
ram, Sung Kyu Lim, Hsien-Hsin S. Lee, and Gabriel H. Loh. Multi-objective
microarchitectural floorplanning for 2D and 3D ICs. IEEE Trans. Computer-Aided
Design of Integrated Circuits and Systems, 26(1):38–52, January 2007.
[40] James R. Heath and Mark A. Ratner. Molecular electronics. Physics Today,
56:43–49, May 2003.
[41] C. P. Heij, P. Hadley, and J. E. Mooij. Single-electron inverter. Applied Physics
Ltrs., 78:1140–1142, 2001.
[42] Jörg Henkel and Rolf Ernst. A hardware/software partitioner using a dynamically
determined granularity. In Proc. Design Automation Conf., pages 691–696,
June 1997.
[43] Seongmoo Heo, Kenneth Barr, and Krste Asanovic. Reducing power density
through activity migration. In Proc. Int. Symp. Low Power Electronics & De-
sign, pages 217–222, August 2003.
[44] J. Hou and Wayne Wolf. Process partitioning for distributed embedded systems.
In Proc. Int. Wkshp. Hardware/Software Co-Design, pages 70–76, March 1996.
[45] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M.R.
Stan. HotSpot: A compact thermal modeling methodology for early-stage VLSI
design. IEEE Trans. VLSI Systems, 14(5):501–524, May 2006.
[46] Yu Huang, Xiangfeng Duan, Yi Cui, Lincoln J. Lauhon, Kyoung-Ha Kim, and
Charles M. Lieber. Logic gates and computation from assembled nanowire
building blocks. Science, 294(5545):1313–1317, November 2001.
[47] W.-L. Hung, G. M. Link, Y. Xie, N. Vijaykrishnan, and M. J. Irwin. Inter-
connect and thermal-aware floorplanning for 3D microprocessors. In Proc. Int.
Symp. Quality of Electronic Design, pages 98–104, March 2006.
[48] Global mobile forecast to 2012. In Informa Telecomms & Media Report, Novem-
ber 2007.
[49] Hiroshi Inokawa and Yasuo Takahashi. A compact analytical model for asym-
metric single-electron tunneling transistors. IEEE Trans. Electron Devices,
50(2):455–461, February 2003.
[50] Intel multi-core processor architecture. http://www.intel.com.
[51] Canturk Isci, Alper Buyuktosunoglu, Chen-Yong Cher, Pradip Bose, and Mar-
garet Martonosi. An analysis of efficient multi-core global power management
policies: Maximizing performance for a given power budget. In Proc. Int. Symp.
Microarchitecture, pages 78–88, December 2006.
[52] Canturk Isci and Margaret Martonosi. Runtime power monitoring in high-end
processors: Methodology and empirical data. In Proc. Int. Symp. Microarchi-
tecture, pages 93–104, December 2003.
[53] International Technology Roadmap for Semiconductors, 2006. http://public.
itrs.net/.
[54] J. McGregor. x86 power and thermal management. In Microprocessor Report,
December 2004.
[55] Joint Electron Device Engineering Council. Failure mechanisms and models for
semiconductor devices. In JEDEC Publication JEP 122-B, August 2003.
[56] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 chip: a dual-core
multithreaded processor. IEEE Micro, 24(2):40–47, 2004.
[57] Taeho Kgil, Shaun D’Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski,
Trevor Mudge, Steven Reinhardt, and Krisztian Flautner. PicoServer: using
3D stacking technology to enable a compact energy efficient chip multiproces-
sor. In Proc. Int. Conf. Architectural Support for Programming Languages and
Operating Systems, pages 117–128, October 2006.
[58] Jongman Kim, Chrysostomos Nicopoulos, Dongkook Park, Reetuparna Das,
Yuan Xie, Vijaykrishnan Narayanan, Mazin S. Yousif, and Chita R. Das. A
novel dimensionally-decomposed router for on-chip communication in 3D archi-
tectures. In Proc. Int. Symp. Computer Architecture, June 2007.
[59] Masaharu Kirihara, Kazuo Nakazato, and Mathias Wagner. Hybrid circuit
simulator including a model for single electron tunneling devices. Japanese J.
Applied Physics, 38(4A), April 1999.
[60] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded
SPARC processor. IEEE Micro, 25(2):21–29, 2005.
[61] Vyas Krishnan and Srinivas Katkoori. A 3D-layout aware binding algorithm
for high-level synthesis of three-dimensional integrated circuits. In Proc. Int.
Symp. Quality of Electronic Design, pages 885–892, March 2007.
[62] V. A. Krupenin, D.E. Presnov, A.B. Zorin, and J. Niemeyer. Aluminum single
electron transistors with islands isolated from a substrate. J. Low Temperature
Physics, 118(5/6), December 1999.
[63] Amit Kumar, Li Shang, Li-Shiuan Peh, and Niraj K. Jha. HybDTM: a coordi-
nated hardware-software approach for dynamic thermal management. In Proc.
Design Automation Conf., pages 548–553, July 2006.
[64] Choonseung Lee and Soonhoi Ha. Hardware-software cosynthesis of multitask
MPSoCs with real-time constraints. In Proc. Int. Conf. ASIC, pages 919–924,
October 2005.
[65] Feihui Li, Chrysostomos Nicopoulos, Thomas Richardson, Yuan Xie, Vijaykr-
ishnan Narayanan, and Mahmut Kandemir. Design and management of 3D
chip multiprocessors using network-in-memory. In Proc. Int. Symp. Computer
Architecture, pages 130–141, June 2006.
[66] Man-Lap Li, Ruchira Sasanka, Sarita V. Adve, Yen-Kuang Chen, and Eric
Debes. The ALPbench benchmark suite for complex multimedia applications.
In Int. Symp. Workload Characterization, pages 34–35, October 2005.
[67] Peng Li, Yangdong Deng, and Lawrence T. Pileggi. Temperature-dependent
optimization of cache leakage power dissipation. In Proc. Int. Conf. Computer
Design, October 2005.
[68] Yingmin Li, David Brooks, Zhigang Hu, and Kevin Skadron. Performance,
energy, and thermal considerations for SMT and CMP architectures. In Proc.
Int. Symp. Computer Architecture, pages 71–82, February 2005.
[69] Yingmin Li, Benjamin Lee, David Brooks, Zhigang Hu, and Kevin Skadron.
CMP design space exploration subject to physical constraints. In Proc. Int.
Symp. High-Performance Computer Architecture, pages 17–28, February 2006.
[70] Konstantin K. Likharev. Single-electron devices and their applications. Proc.
IEEE, 87(4):606–632, April 1999.
[71] G. M. Link and N. Vijaykrishnan. Thermal trends in emerging technologies. In
Proc. Int. Symp. Quality of Electronic Design, pages 625–632, March 2006.
[72] Gian Luca Loi, Banit Agrawal, Navin Srivastava, Sheng-Chih Lin, Timothy
Sherwood, and Kaustav Banerjee. A thermally-aware performance analysis of
vertically integrated (3-D) processor-memory hierarchy. In Proc. Design Automation
Conf., pages 991–996, July 2006.
[73] S. Mahapatra, V. Vaish, C. Wasshuber, and K. Banerjee. Analytical modelling
of single electron transistor (SET) for hybrid CMOS-SET analog IC design.
IEEE Trans. Electron Devices, 51(11):1772–1782, June 2004.
[74] Arindam Mallik, Jack Cosgrove, Robert P. Dick, Gokhan Memik, and Peter
Dinda. PICSEL: Measuring user-perceived performance to control dynamic
frequency scaling. In Proc. Int. Conf. Architectural Support for Programming
Languages and Operating Systems, March 2008.
[75] K. Matsumoto, M. Ishii, K. Segawa, and Y. Oka. Room temperature opera-
tion of a single electron transistor made by the scanning tunneling microscope
nanooxidation process for the TiOx/Ti system. Applied Physics Ltrs., 68(1):34–
36, January 1996.
[76] Ulla Miekkala. Graph properties for splitting with grounded Laplacian matrices.
BIT Numerical Mathematics, pages 485–495, September 1993.
[77] A. Mishra and P. Banerjee. An algorithm-based error detection scheme for the
multigrid method. IEEE Trans. Computers, 52(9):1089–1099, September 2003.
[78] Gordon E. Moore. Cramming more components onto integrated circuits. Elec-
tronics, 38(8):82–85, April 1965.
[79] F. Nakajima, Y. Miyoshi, J. Motohisa, and T. Fukui. Single-electron
AND/NAND logic circuits based on a self-organized dot network. Applied
Physics Ltrs., 83(13):2680–2682, September 2003.
[80] Y. Nakamura, C. D. Chen, and J. S. Tsai. 100-K operation of Al-based single-electron
transistors. Japanese J. Applied Physics, 35:1465–1467, November
1996.
[81] Umit Y. Ogras and Radu Marculescu. Energy- and performance- driven NoC
communication architectures synthesis using a decomposition approach. In
Proc. Design, Automation & Test in Europe Conf., pages 352–357, March 2005.
[82] Y. Ono, Y. Takahashi, K. Yamazaki, M. Nagase, H. Namatsu, K. Kurihara,
and K. Murase. Si complementary single-electron inverter. IEDM Technology
Dig., pages 367–370, 1999.
[83] Soyeon Park, Weihang Jiang, Yuanyuan Zhou, and Sarita Adve. Managing
energy-performance tradeoffs for multi-threaded applications. In Proc. Int.
Conf. on Measurement and Modeling of Computer Systems, pages 169–180,
June 2007.
[84] Yu A. Pashkin, Y. Nakamura, and J. S. Tsai. Room-temperature Al single-
electron transistor made by electron-beam lithography. Applied Physics Ltrs.,
76(16):2256–2258, April 2000.
[85] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle,
A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak,
M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and
K. Yazawa. The design and implementation of a first-generation CELL proces-
sor. In Proc. Int. Solid-State Circuits Conf., pages 49–52, February 2007.
[86] Aashish Phansalkar, Ajay Joshi, Lieven Eeckhout, and Lizy K. John. Measuring
program similarity: Experiments with SPEC CPU benchmark suites. In Proc.
Int. Symp. on Performance Analysis of Systems and Software, pages 10–20,
March 2005.
[87] M. D. Powell, M. Gomaa, and T. N. Vijaykumar. Heat-and-run: Leveraging
SMT and CMP to manage power density through the operating system. In
Proc. Int. Conf. Architectural Support for Programming Languages and Oper-
ating Systems, pages 260–270, November 2004.
[88] S. Prakash and A. Parker. SOS: Synthesis of application-specific heterogeneous
multiprocessor systems. J. Parallel & Distributed Computing, 16:338–351, De-
cember 1992.
[89] Kiran Puttaswamy and Gabriel H. Loh. Thermal analysis of a 3D die-stacked
high-performance microprocessor. In Proc. Great Lakes Symp. VLSI, pages
19–24, May 2006.
[90] Kiran Puttaswamy and Gabriel H. Loh. Thermal herding: Microarchitecture
techniques for controlling hotspots in high-performance 3D-integrated processors.
In Proc. Int. Symp. High-Performance Computer Architecture, pages 193–204,
February 2007.
[91] Jan M. Rabaey. Digital Integrated Circuits. Prentice-Hall, NJ, 1998.
[92] Shad Roundy, Paul K. Wright, and Jan Rabaey. A study of low level vibra-
tions as a power source for wireless sensor nodes. Computer Communications,
26:1131–1144, October 2003.
[93] Takayasu Sakurai. A JSSC classic paper: The simple model of CMOS drain
current. IEEE Solid State Circuits Society Quarterly Newsletter, pages 4–5,
October 2004.
[94] Eric C. Samson, Sridhar V. Machiroutu, Je-Young Chang, Ishmael Santos, Jim
Hermerding, Ashay Dani, Ravi Prasher, and David W. Song. Interface material
selection and a thermal management technique in second-generation platforms
built on Intel Centrino mobile technology. Intel Technology J., 09(1):75–86,
February 2005.
[95] Samsung. http://www.samsung.com/.
[96] K. Sankaralingam, R. Nagarajan, H. Liu, J. Huh, C. K. Kim, D. Burger, S. W.
Keckler, and C. R. Moore. Exploiting ILP, TLP, and DLP using polymorphism
in the TRIPS architecture. In Proc. Int. Symp. Computer Architecture, pages
422–433, June 2003.
[97] Oleg Semenov, Arman Vassighi, Manoj Sachdev, Ali Keshavarzi, and C. F.
Hawkins. Effect of CMOS technology scaling on thermal management during
burn-in. 16:686–695, November 2003.
[98] J.-I. Shirakashi, K. Matsumoto, N. Miura, and M. Konagai. Single-electron
charging effects in Nb/Nb oxide-based single-electron transistors at room tem-
perature. Applied Physics Ltrs., 72(15):1893–1895, April 1998.
[99] Kevin Skadron, Mircea R. Stan, Wei Huang, Sivakumar Velusamy, Karthik
Sankaranarayanan, and David Tarjan. Temperature-aware microarchitecture.
In Proc. Int. Symp. Computer Architecture, pages 2–13, June 2003.
[100] SPLASH2 website. http://www-flash.stanford.edu/apps/SPLASH/.
[101] R. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro, 22(4):72–
82, 2002.
[102] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers. The impact of technology
scaling on lifetime reliability. In Proc. Int. Conf. Dependable Systems
and Networks, pages 177–186, June 2004.
[103] Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. Exploit-
ing structural duplication for lifetime reliability enhancement. In Proc. Int.
Symp. Computer Architecture, pages 520–531, June 2005.
[104] Chong Sun, Li Shang, and Robert P. Dick. Three-dimensional multi-processor
system-on-chip thermal optimization. In Proc. Int. Conf. Hardware/Software
Codesign and System Synthesis, pages 117–122, October 2007.
[105] X. Tang, X. Baie, V. Bayot, F. Van de Wiele, and J. P. Colinge. An SOI single-
electron transistor. In Proc. Silicon-on-Insulator Conf., pages 46–47, October
1999.
[106] David Tarjan, Shyamkumar Thoziyoor, and Norman P. Jouppi. CACTI 4.0.
Technical report, HP Laboratories, June 2006.
[107] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt,
Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota,
Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Ama-
rasinghe, and Anant Agarwal. Evaluation of the raw microprocessor: An
exposed-wire-delay architecture for ILP and streams. In Proc. Int. Symp. Com-
puter Architecture, June 2004.
[108] Tezzaron. http://www.tezzaron.com/technology/FaStack.htm.
[109] A. W. Topol, D. C. La Tulipe, L. Shi Jr., D. J. Frank, K. Bernstein, S. E. Steen,
A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini, and M. Ieong. Three-
dimensional integrated circuits. IBM J. Research and Development, 4:491–506,
2006.
[110] Y. Tsai, Y. Xie, N. Vijaykrishnan, and M. J. Irwin. Three-dimensional cache
design exploration using 3DCacti. In Proc. Int. Conf. Computer Design, pages
519–524, October 2005.
[111] J. R. Tucker. Complementary digital logic based on the Coulomb blockade. J.
Applied Physics, 72(99):4399–4413, 1992.
[112] K. Uchida, J. Koga, R. Ohba, and A. Toriumi. Programmable single-electron
transistor logic for future low-power intelligent LSI: proposal and room-temperature
operation. IEEE Trans. Electron Devices, 50(7):1623–1630, July 2003.
[113] Ken Uchida, Kazuya Matsuzawa, Junji Koga, Ryuji Ohba, Shin-ichi Takagi, and
Akira Toriumi. Analytical single-electron transistor (SET) model for design and
analysis of realistic SET circuits. Japanese J. Applied Physics, 39:2321–2324,
April 2000.
[114] Srinivas Vanapalli, Michael Lewis, Zhihua Gan, and Ray Radebaugh. 120 Hz
pulse tube cryocooler for fast cooldown to 50 K. Applied Physics Ltrs., 90,
February 2007.
[115] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan,
P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and
N. Borkar. An 80-tile 1.28TFLOPS networks-on-chip in 65nm CMOS. In Proc.
Int. Solid-State Circuits Conf., February 2007.
[116] Ram Viswanath, Vijay Wakharkar, Abhay Watwe, and Vassou Lebonheur.
Thermal performance challenges from silicon to systems. Intel Technology J.,
04(3):1–16, August 2000.
[117] C. Wasshuber, H. Kosina, and S. Selberherr. A single-electron device and cir-
cuit simulator. IEEE Trans. Computer-Aided Design of Integrated Circuits and
Systems, 16:937–944, September 1997.
[118] C. Wasshuber, H. Kosina, and S. Selberherr. A comparative study of single
electron memories. IEEE Trans. Electron Devices, 45:2365–2371, November
1998.
[119] Henning Wolf, Franz Josef Ahlers, J. Niemeyer, Hansjörg Scherer, Thomas
Weimann, Alexander B. Zorin, Vladimir A. Krupenin, Sergey V. Lotkhov, and
Denis E. Presnov. Investigation of the offset charge noise in single electron
tunneling devices. IEEE Trans. Instrumentation and Measurement, 46(2):303–306,
April 1997.
[120] Y. Xie, L. Lu, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin. Reliability-
aware co-synthesis for embedded systems. In Proc. Int. Conf. Application-
Specific Systems, Architectures, and Processors, September 2004.
[121] Xilinx XPower. http://www.xilinx.com.
[122] K. K. Yadavalli, A. O. Orlov, G. L. Snider, and A. N. Korotkov. Single electron
memory devices: toward background charge insensitive operation. J. Vacuum
Science Technology B Microelectronics and Nanometer Structures, 21:2860–
2864, 2003.
[123] Yonghong Yang, Zhenyu (Peter) Gu, Changyun Zhu, Robert P. Dick, and
Li Shang. ISAC: Integrated space and time adaptive chip-package thermal anal-
ysis. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems,
January 2007.
[124] Yonghong Yang, Zhenyu (Peter) Gu, Changyun Zhu, Li Shang, and Robert P.
Dick. Adaptive chip-package thermal analysis for synthesis and design. In Proc.
Design, Automation, and Test in Europe, pages 844–849, March 2006.
[125] Yonghong Yang, Changyun Zhu, Zhenyu (Peter) Gu, Li Shang, and Robert P.
Dick. Adaptive multi-domain thermal modeling and analysis for integrated
circuit synthesis and design. In Proc. Int. Conf. Computer-Aided Design, pages
575–582, November 2006.
[126] K. Yano, T. Ishii, T. Hashimoto, T. Kobayashi, F. Murai, and K. Seki. Room-
temperature single-electron memory. IEEE Trans. Electron Devices, 41:1628–
1638, September 1994.
[127] Ti-Yen Yen. Hardware-Software Co-Synthesis of Distributed Embedded Systems.
PhD thesis, Dept. of Electrical Engg., Princeton University, June 1996.
[128] Y. S. Yu, S. W. Hwang, and D. Ahn. Transient modelling of single-electron tran-
sistors for efficient circuit simulation by SPICE. Electronics Ltrs., 152(6):691–
696, December 2005.
[129] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeak-
age: A temperature-aware model of subthreshold and gate leakage for architects.
Technical report, Univ. of Virginia, May 2003. CS-2003-05.
[130] W. Zhao and Y. Cao. New generation of predictive technology model for sub-
45nm design exploration. In Proc. Int. Symp. Quality of Electronic Design,
pages 585–590, March 2006.
[131] Changyun Zhu, Zhenyu Gu, Li Shang, Robert P. Dick, and Russ Joseph. Run-
time thermal management of three-dimensional chip multiprocessors. In Proc.
Wkshp. Quality-Aware Design, June 2008. Invited paper.
[132] Changyun Zhu, Zhenyu (Peter) Gu, Robert P. Dick, and Li Shang. Reliable
multiprocessor system-on-chip synthesis. In Proc. Int. Conf. Hardware/Software
Codesign and System Synthesis, pages 239–244, October 2007.
[133] Changyun Zhu, Zhenyu (Peter) Gu, Robert P. Dick, Li Shang, and Robert
Knobel. Characterization of Single-Electron Tunneling Transistors for Design-
ing Low-Power Embedded Systems. IEEE Trans. VLSI Systems, 17(5), May
2009.
[134] Changyun Zhu, Zhenyu (Peter) Gu, Li Shang, Robert P. Dick, and Russ Joseph.
Three-dimensional chip-multiprocessor run-time thermal management. IEEE
Trans. Computer-Aided Design of Integrated Circuits and Systems, 27(8), Au-
gust 2008.
[135] Changyun Zhu, Zhenyu (Peter) Gu, Li Shang, Robert P. Dick, and Robert
Knobel. Towards an ultra-low-power architecture using single-electron tunnel-
ing transistors. In Proc. Design Automation Conf., pages 312–317, June 2007.
[136] N. M. Zimmerman, W. H. Huber, A. Fujiwara, and Y. Takahashi. Excellent
charge offset stability in Si-based SET transistors. In Proc. Precision Electro-
magnetic Measurements, pages 124–125, November 2002.
[137] N. M. Zimmerman, W. H. Huber, A. Fujiwara, and Y. Takahashi. Excellent
charge offset stability in a Si-based single-electron tunneling transistor. Applied
Physics Ltrs., 79:3186–3190, 2002.