System-Level Power, Thermal and Reliability

Optimization

by

Changyun Zhu

A thesis submitted to the

Department of Electrical and Computer Engineering

in conformity with the requirements for

the degree of Doctor of Philosophy

Queen’s University

Kingston, Ontario, Canada

July 2009

Copyright © Changyun Zhu, 2009

Abstract

An integrated circuit can now contain more than one billion transistors. With

increasing system integration and technology scaling, power and power-related issues

have become the primary challenges of integrated circuit design. In this disserta-

tion, techniques and algorithms, from system-level synthesis to emerging integration

and device technologies, are proposed to address the power and power-induced ther-

mal and reliability challenges of modern billion-transistor integrated circuit design.

In Chapter 1, the challenges of semiconductor technology scaling are introduced.

Chapter 2 reviews the related works. Chapter 3 focuses on the reliability optimiza-

tion issue during system-level design. A reliable application-specific multiprocessor

system-on-chip synthesis system is proposed, called TASR, which exploits redundancy

and thermal-aware design planning to produce reliable and compact circuit designs.

Chapter 4 introduces three-dimensional (3D) integration, a new integrated circuit

fabrication and integration technology. Thermal issues are a primary concern of 3D

integration. A 3D integrated circuit heat flow analytical framework is proposed in this

chapter. Proactive, continuously-engaged hardware and operating system thermal

management techniques are presented and evaluated; they achieve better system

performance than state-of-the-art techniques while honoring the same temperature bound.

Chapter 5 presents a reconfigurable architecture design using the single-electron tunneling

transistor, an ultra-low-power nanometer-scale device. The proposed design has the

potential to overcome the power and energy barriers for both high-performance com-

puting and ultra-low-power embedded systems. Conclusions are drawn in Chapter 6.

Co-Authorship

All work regarding Reliable MPSoC Synthesis, 3D CMP Thermal Management

and Characterization of SET Transistors in this thesis (i.e., Chapter 3, Chapter 4

and Chapter 5 of the thesis) was done in collaboration with Zhenyu Gu.

Acknowledgments

First, I would like to gratefully thank my supervisor, Professor Li Shang, not only

for his supervision of my research work, but also for his patience and help which

encouraged me to complete my studies. He has all the traits of an excellent research

supervisor. I thank my committee members, Professor Robert Knobel, Professor Ahmad

Afsahi and Professor Alireza Bakhshai, for their corrections, suggestions and valuable

feedback.

I would also like to thank Professor Naraig Manjikian for his kind help during

my studies at Queen’s University.

Thanks are also given to Zhenyu Gu, Yonghong Yang, Kun Li, Nicholas Allec,

Assem Bsoul, Zyad Mohamed, Professor Robert P. Dick and Professor Qin Lv for

their invaluable discussions.

Finally, I am grateful to my parents, wife and friends for their support and en-

couragement over these years.

Table of Contents

Abstract

Co-Authorship

Acknowledgments

Table of Contents

List of Symbols

List of Tables

List of Figures

Chapter 1: Introduction

1.1 Technology Scaling and Design Challenges

1.2 Dissertation Overview

Chapter 2: Related works

2.1 Reliability-aware synthesis

2.2 Three-dimensional integrated circuit

2.3 Single-electron tunneling transistors

Chapter 3: Reliable MPSoC Synthesis

3.1 Introduction

3.2 TASR: Temperature-Aware Synthesis of Reliable MPSoCs

3.3 Experimental Results

3.4 Conclusions and Future Work

Chapter 4: 3D CMP Thermal Management

4.1 Introduction

4.2 Contribution

4.3 Heat Flow in 3D CMPs

4.4 3D CMP Thermal Management

4.5 Experimental Setup

4.7 Conclusions

Chapter 5: Characterization of SET Transistors

5.1 Introduction

5.2 SET Modeling

5.3 IceFlex: A Fault-Tolerant Hybrid SET/CMOS Reconfigurable Architecture

5.5 Conclusions

Chapter 6: Conclusions and Future Work

6.1 Thesis Summary

6.2 Future Work

Bibliography

List of Symbols

A Thermal conductance matrix

C Capacitance

CD Drain capacitance

CG Gate capacitance

CS Source capacitance

CP Island capacitance

EaEM Activation energy of electromigration

EaSM Activation energy of stress migration

F (t) Cumulative distribution function

G Gain

I Current

J Current density

K Diagonal matrix containing the thermal conductances of adjacent thermal elements

Keff Effective vertical thermal conductivity

Klayer Thermal conductivity of the region without any vias

Kvia Thermal conductivity of the via material

L Laplacian matrix

Pd Power density

R Resistance

RD Drain resistance

RS Source resistance

T Temperature

T0 Metal deposition temperature during fabrication

Tambient Ambient temperature

Taverage Chip average temperature

VTH Threshold voltage

β Alpha power law parameter

κ or κB Boltzmann’s constant

µ Scale parameter of lognormal distribution

ρvia Via density

σ Shape parameter of lognormal distribution

ξ Run-time switching activity multiplied by the capacitance of the switched nodes

ζij Thermal impact coefficient for core i due to j

e Elementary charge

f Frequency

f(t) Probability density function

g Conductance

h Planck’s constant

3D Three dimensional

BIPS Billion instructions per second

BJT Bipolar junction transistor

CDF Cumulative distribution function

CMOS Complementary metal-oxide-semiconductor

CMP Chip-level multiprocessor

CR Component redundancy

DRAM Dynamic random access memory

DSP Digital signal processing

DTM Dynamic thermal management

DVFS Dynamic voltage and frequency scaling

EEMBC The embedded microprocessor benchmark consortium

FPGA Field-programmable gate array

IC Integrated circuit

IPC Instructions per cycle

LUT Lookup table

MPSoC Multiprocessor system-on-chip

MTTF Mean time to failure

MVL Majority voting logic

NoC Network on chip

OS Operating system

PDF Probability density function

PE Processing element

PRSA Parallel recombinative simulated annealing

SET Single-electron tunneling transistor

SMT Simultaneous multithreading

TIP Thermal impact per performance

List of Tables

3.1 System MTTF Improvement Under Area Bound [132]

4.1 ThermOS Implementation [134]

4.2 DVFS and Clock Throttling Comparison [134]

4.3 Design Parameters for Alpha 21264 [134]

4.4 3D Package Setup [134]

4.5 Benchmark Characteristics [134]

4.6 Benchmark Suites [134]

5.1 Island Size Estimation [133]

5.2 Design Space Characterization [133]

5.3 Impact of Majority Vote Logic on SELB Fault Probability [133]

5.4 Characterization of IceFlex Microarchitecture for $C_\Sigma = e^2/(40 k_B T)$ [133]

5.5 Characterization of IceFlex Interconnect Fabric for $C_\Sigma = e^2/(40 k_B T)$ [133]

5.6 Latency and Energy Improvement for Exclusive-Or Design [133]

5.7 IceFlex Performance and Power Consumption at Room Temperature for $C_\Sigma = e^2/(40 k_B T)$ [133]

List of Figures

1.1 Intel CPU Transistor Count [2]

1.2 Microprocessor Power Consumption

1.3 Temperature Profile for Active Layer and Heatsink [123]

3.1 Reliable MPSoC Synthesis Example [132]

3.2 TASR Flow for the Temperature-Aware Synthesis of Reliable MPSoCs [132]

3.3 Temperature Impact on MTTF [38]

3.4 Comparison of MPSoC Area–Reliability Tradeoffs [38]

3.5 Comparison of Different Optimization Heuristics [132]

4.1 (a) Comparison of Face-to-Face (Left) and Face-to-Back (Right) Configurations for Two Stacked Dies, (b) 3D Three Stacked Die Floorplan Used in This Work, and (c) 3D CMP Chip-package Thermal Modeling [134]

4.2 Inter-layer and Intra-layer Thermal Heterogeneity and Dominance in 3D CMPs [134]

4.3 ThermOS: 3D CMP Run-time Thermal Management [134]

4.4 Comparison of ThermOS and Distributed Approach [28, 134]

4.5 Reduction in Temperature Constraint Violations due to Local DVFS and Elimination of Temperature Constraint Violations due to Clock Throttling [134]

4.6 Temporal Temperature Variation for Eight Processor Cores (P0–P7) Running lv-mipc2 Using Local DVFS without (Top) and with (Bottom) Clock Throttling [134]

4.7 Negligible CMP Instruction Throughput Reduction Resulting from Local DVFS and Clock Throttling [134]

4.8 Impact of Global Guidance Interval [134]

4.9 Impact of Lookup Table Size [134]

4.10 Impact of Floorplan Rotation [134]

5.1 SET Structure and Schematic [133]

5.2 SET Coulomb Oscillation ($C_G = 3.2$ aF, $C_S = C_D = 1.0$ aF, and $R_S = R_D = 10$ MΩ) [133]

5.3 IceFlex Microarchitecture [133]

5.4 Multi-gate SET Multiplexer Tree [133]

5.5 SET Configuration Memory [135]

5.6 SET Parity Circuit [133]

5.7 Hybrid SET/CMOS Interface Circuitry [133]

5.8 Power and Performance of the Multi-gate SET Multiplexer Tree for High Performance, $C_\Sigma = e^2/(40 k_B T)$ [133]

5.9 Performance and Power Characterization of Exclusive-or Logic for Low Power for $C_\Sigma = e^2/(40 k_B T)$ [133]

Chapter 1

Introduction

1.1 Technology Scaling and Design Challenges

Following the trend first observed by Gordon E. Moore in 1965, the number of transistors

that can be integrated on a chip has doubled roughly every 18 to 24 months [78]. During the past four decades,

semiconductor technology scaling has provided consistent improvements in circuit

performance and integration density. Figure 1.1 shows the technology scaling of

Intel microprocessors since 1971. With increasing system integration and technology

scaling, integrated circuit design becomes increasingly complex. Power and power-

induced design issues, such as chip temperature and circuit reliability, have become

the primary concerns of modern integrated circuit design.

Power Challenges

Although scaling of technology provides higher functional integration, more com-

puting resources, better performance and parallel operation capability, the increased

[Figure 1.1 plot: Intel CPU transistor count (log scale, 100 to 1e+10) versus year (1970 to 2020), with data points labeled from the 4004 to the Core i7 and Quad-Core Itanium.]

Figure 1.1: Intel CPU Transistor Count [2].

operating frequency and transistor density raise the circuit dynamic power consump-

tion. Furthermore, because the subthreshold leakage is an inverse exponential func-

tion of a transistor’s threshold voltage (VTH) and VTH is reduced with technology scal-

ing under the constant electric field scaling scenario, the chip leakage power increases

exponentially [97]. Figure 1.2 shows the power consumption of microprocessors re-

leased during the past twenty years. It indicates the exponential increase in power

due to increased voltage, frequency, temperature and decreased threshold voltage.
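
The two power trends described above can be illustrated with a minimal sketch. It assumes the usual first-order models: dynamic power P = ξ V² f, where ξ is the switching activity multiplied by the switched capacitance (see the List of Symbols), and a subthreshold leakage current proportional to exp(-V_TH/(n v_T)) with thermal voltage v_T = k_B T/e. The slope factor `n`, the prefactor `i0` and all numeric values are illustrative assumptions, not parameters taken from this dissertation.

```python
import math

K_B = 1.380649e-23          # Boltzmann's constant (J/K)
E_CHARGE = 1.602176634e-19  # elementary charge (C)


def dynamic_power(xi, vdd, freq):
    """First-order dynamic power: P = xi * Vdd^2 * f.

    xi is the switching activity multiplied by the switched capacitance (F).
    """
    return xi * vdd ** 2 * freq


def subthreshold_leakage(i0, vth, temp_k, n=1.5):
    """Subthreshold leakage current: I = i0 * exp(-Vth / (n * vT)).

    i0 (A) and the slope factor n are assumed, technology-dependent values.
    """
    v_thermal = K_B * temp_k / E_CHARGE  # thermal voltage kT/e (V)
    return i0 * math.exp(-vth / (n * v_thermal))


if __name__ == "__main__":
    # Hypothetical numbers: lowering Vth by 100 mV at 350 K.
    before = subthreshold_leakage(1e-6, 0.4, 350.0)
    after = subthreshold_leakage(1e-6, 0.3, 350.0)
    print(f"leakage grows {after / before:.1f}x when Vth drops by 100 mV")
    print(f"dynamic power: {dynamic_power(1e-9, 1.0, 2e9):.2f} W")
```

Under these assumptions, leakage rises both when the threshold voltage is lowered and when the temperature increases, which is the coupling between power and temperature discussed in the following subsection.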

Thermal Challenges

As increasingly dense integrated circuits consume more power, more heat is

generated, which raises chip temperature. Rising temperature has


[Figure 1.2 plot: power in W (log scale, 1 to 1000) versus year (1980 to 2010) for Intel, Alpha, Sparc, MIPS, HP PA, PowerPC, AMD and Sun microprocessors.]

Figure 1.2: Microprocessor Power Consumption.

a huge impact on IC performance, cooling cost, reliability, and power consumption.

The latencies of transistors and metal wires increase with increasing chip temperature

as do the probabilities of many lifetime reliability faults [53, 102]. For example, elec-

tromigration failure rate is an exponential function of temperature. Leakage power

consumption is now responsible for a substantial proportion of overall power

consumption in commercial designs and increases with temperature [67]. IC chips and

packages exhibit significant spatial and temporal temperature variations due to the

heterogeneity of thermal conductivity and heat capacity in different materials, as well as

the variation of power profiles. Capturing these variations requires accurate chip-package

heat flow analysis, which is complex and computationally intensive. As illustrated by the example shown in


[Figure 1.3 plot: temperature (°C, 35 to 90) versus position (mm, -8 to 8 along each axis) for the heatsink/IC interface and the IC active layer.]

Figure 1.3: Temperature Profile for Active Layer and Heatsink [123].

Figure 1.3, the steady-state thermal profile of the active layer of the silicon die in

conjunction with the top layer of the cooling package is characterized using a multigrid

thermal solver, which has to partition the chip and the cooling package into 131,072

homogeneous thermal elements. Compared to steady-state thermal modeling, char-

acterizing an IC dynamic thermal profile is even more time consuming. IC synthesis

requires a large number of optimization steps; thermal modeling can easily become

its performance bottleneck [123].
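
To give a feel for what steady-state thermal analysis computes, the following is a minimal sketch that solves the heat balance A·T = P for a toy one-dimensional chain of thermal elements. It deliberately swaps in plain Gauss-Seidel iteration for the multigrid solver mentioned above, and the chain topology, conductance values and function name are hypothetical choices made only for illustration.

```python
def solve_steady_state(power, g_lateral, g_ambient, t_ambient,
                       iters=5000, tol=1e-9):
    """Solve A*T = P for a 1D chain of thermal elements by Gauss-Seidel.

    power[i]  : heat injected into element i (W)
    g_lateral : conductance between adjacent elements (W/K)
    g_ambient : conductance from each element to ambient (W/K)
    """
    n = len(power)
    temps = [t_ambient] * n
    for _ in range(iters):
        max_change = 0.0
        for i in range(n):
            # Heat balance at node i: sum of conductance * temperature
            # of every neighbour (including ambient) plus injected power.
            g_sum = g_ambient
            flow = g_ambient * t_ambient + power[i]
            if i > 0:
                g_sum += g_lateral
                flow += g_lateral * temps[i - 1]
            if i < n - 1:
                g_sum += g_lateral
                flow += g_lateral * temps[i + 1]
            new_t = flow / g_sum
            max_change = max(max_change, abs(new_t - temps[i]))
            temps[i] = new_t
        if max_change < tol:
            break
    return temps


if __name__ == "__main__":
    # Hypothetical hotspot: 10 W injected into the fourth of eight elements.
    profile = solve_steady_state([0, 0, 0, 10, 0, 0, 0, 0], 2.0, 0.5, 45.0)
    print([round(t, 2) for t in profile])
```

Even this toy solver sweeps every element repeatedly until convergence, which suggests why a model with over a hundred thousand elements becomes expensive when it sits inside a synthesis loop.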


Reliability Challenges

Moreover, aggressive scaling of CMOS process technology poses serious challenges

to the lifetime reliability of ICs. Reduction of feature size and increases in power den-

sity have resulted in increasing chip temperature and failure rates. Increased system

integration using these vulnerable devices and interconnects results in reduced system

reliability. The severity of many reliability problems, such as time-dependent dielec-

tric breakdown in MOS transistors and electromigration in interconnects, increases

exponentially with temperature. Lifetime reliability is becoming an important quality

metric in high-performance ICs. Optimizing lifetime reliability requires careful

planning during IC design and synthesis. At the architectural level, careful assign-

ment of tasks to processing elements (PEs) can balance the thermal profile of the chip,

thereby improving system reliability. Synthesis-time architectural planning and care-

ful use of PE-level and component-level (e.g., functional unit) redundancy will permit

continued MPSoC operation after the failure of some processors or components, while

limiting area overhead. At the physical level, a fast floorplanner is needed to pro-

vide physical information for generating the power profile which, in turn, is used to

determine the thermal profile. The evaluation and optimization of system reliability

and other design metrics, such as area and performance, require a comprehensive and

efficient architectural-level and physical-level synthesis infrastructure.
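
The exponential temperature dependence mentioned above can be made concrete with a minimal sketch. It assumes an Arrhenius-type acceleration model and Black's equation for electromigration, MTTF = A·J^(-n)·exp(E_a/(k·T)). The prefactor, the current-density exponent n = 2 and the 0.9 eV activation energy used in the usage lines are illustrative assumptions rather than values used in this dissertation.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann's constant (eV/K)


def arrhenius_acceleration(temp_k, ref_temp_k, activation_energy_ev):
    """Failure-rate acceleration factor at temp_k relative to ref_temp_k."""
    return math.exp(activation_energy_ev / K_B_EV
                    * (1.0 / ref_temp_k - 1.0 / temp_k))


def electromigration_mttf(current_density, temp_k, activation_energy_ev,
                          prefactor=1.0, exponent=2.0):
    """Black's equation: MTTF = A * J^(-n) * exp(Ea / (k * T)).

    prefactor (A) and exponent (n) are assumed, process-dependent values.
    """
    return (prefactor * current_density ** (-exponent)
            * math.exp(activation_energy_ev / (K_B_EV * temp_k)))


if __name__ == "__main__":
    # Hypothetical comparison: a block at 85 C versus the same block at 65 C.
    factor = arrhenius_acceleration(358.15, 338.15, 0.9)
    print(f"failure rate is {factor:.1f}x higher at 85 C than at 65 C")
```

A model of this form is one reason balancing the thermal profile across processing elements can improve lifetime: lowering the temperature of the hottest element yields a disproportionately large gain in its expected time to failure.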

In summary, power, thermal and reliability issues have become dominant con-

straints in modern nanoscale integrated circuit design. For high-performance ap-

plications, temperature affects integration density, performance, power consumption

and cost. For battery-powered embedded systems, energy consumption directly

determines system lifetime. For any system, reliability strongly depends on the thermal


profile during operation.

1.2 Dissertation Overview

In this dissertation, the issues of power, thermal and reliability challenges will be

addressed from the following three aspects: system-level synthesis algorithms, recently

proposed circuit integration technology and emerging device technology. First, reliability

considerations will be integrated into the system-level synthesis algorithms of the IC

design flow. Then, a recently proposed integration technology, the three-dimensional

integrated circuit, which aims to overcome the limitations of 2D technology, will be discussed.

Finally, an emerging device technology, single-electron tunneling transistors, will be

evaluated to overcome the coming challenges for CMOS devices. The rest of this

dissertation will be organized as follows.

First, technology scaling and increasing power densities are increasing the severity

of IC lifetime reliability problems. The lifetime reliability problem cannot be well

solved at any single level of the design process. Reliability characterization requires

chip-package thermal profiles, which in turn require physical information, including

an IC floorplan, power profile, and chip-package thermal model. Reliability-aware IC

design requires a unified architectural-level and physical-level design flow. Therefore,

a system-level synthesis flow which conducts architectural synthesis, floorplanning,

on-chip network synthesis, chip-package thermal analysis, and reliability analysis is

proposed in Chapter 3. Optimization algorithms within this flow exploit redundancy

and temperature-aware design planning to produce reliable, compact IC designs. My

major contribution to this chapter is on the MPSoC reliability modeling, temperature-

dependent reliability modeling and reliability-aware optimization algorithm design.


My collaborator, Zhenyu Gu, contributed to the floorplanning and on-chip network

synthesis. Two papers have been published on this project [132, 38].

Second, three-dimensional (3D) integration has the potential to improve the com-

munication latency and integration density of IC designs. By stacking multiple device

layers connected through inter-die vias, 3D technology significantly reduces on-chip

wire length, enables efficient interconnect and logic design, and further boosts logic

integration density. However, the stacked high power density layers of 3D chips in-

crease the importance and difficulty of thermal management. Chip power density

increases linearly with the number of vertically-stacked active circuit layers. In addi-

tion, the bonding layers used in 3D integration have low thermal conductivities, which

further exacerbates thermal effects. Chapter 4 identifies and describes the critical concepts

required for optimal thermal management and proposes proactive, continuously-

engaged hardware and operating system thermal management techniques that achieve

better performance than state-of-the-art techniques while honouring the same tem-

perature bound. My major contribution to this chapter is on the characterization of

heat flow in 3D CMPs, derivation of optimal workload assignment and power–thermal

budgeting and thermal management implementation in the Linux kernel. My collaborator,

Zhenyu Gu, contributed to the design of the 3D CMP architecture and technology,

construction of the full simulation framework, and benchmark suite characterization

and generation. Two papers have been published on this project [131, 134].

Third, device researchers have anticipated the coming challenges for CMOS devices

and evaluated alternative technologies such as single-electron tunneling transistors

(SETs). The International Technology Roadmap for Semiconductors projects that

SETs have the potential to achieve the lowest projected energy per switching event of


any known device. However, their use poses unique architectural, circuit design and

fabrication challenges. Chapter 5 explores the potential use of SETs in low-power em-

bedded systems, evaluates the benefits and limitations of SETs, and characterizes the

impacts of SETs on system design metrics. Based on the evaluation of the architec-

tural and circuit-level features, a fault-tolerant, reconfigurable, hybrid SET/CMOS

based architecture is proposed in this chapter. My major contribution to this chapter

is on SET modeling, SET design space characterization and characterization of the

IceFlex architecture. My collaborator, Zhenyu Gu, contributed to the global/local

interconnect design and characterization of embedded applications. Two papers have

been published on this project [135, 133].

Finally, Chapter 6 concludes this dissertation and presents potential future research

problems.

Chapter 2

Related works

2.1 Reliability-aware synthesis

Our reliable MPSoC synthesis work draws from research in the areas of integrated

circuit reliability modeling and optimization [103, 21], system synthesis [30, 42, 120,

64], physical design, and thermal analysis [99, 123]. Coskun et al. [21] and Srini-

vasan et al. [103] provided architectural reliability models and run-time optimization

techniques for MPSoCs and microprocessors, respectively. Eles et al. contrasted opti-

mization algorithms for use in hardware–software partitioning [30]. Henkel and Ernst

proposed flexible task discretization during hardware–software partitioning [42]. Xie

et al. proposed a technique to duplicate tasks on idle processors during embedded

system synthesis to tolerate transient faults [120]. Lee and Ha proposed an alloca-

tion, assignment, and scheduling algorithm for real-time MPSoCs [64]. Ogras et al.

proposed a branch-and-bound algorithm for NoC synthesis [81]. Glaß et al. proposed

an evolutionary algorithm that binds tasks to resources with the goal of improving

mean time to failure (MTTF) [36]. They considered fault processes with exponential

or Weibull distributions; their fault model supports permanent faults. Our system

and fault models differ primarily in that they consider the influence of faults on subsequent

fault rates due to the impact of run-time rebinding on the temperature profile.
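
To make the role of the assumed failure distribution concrete, the following minimal sketch computes the MTTF for exponential, Weibull and lognormal lifetime models from their standard closed-form means. The function names and parameter values are hypothetical and are not taken from the cited works.

```python
import math


def mttf_exponential(failure_rate):
    """Mean of an exponential lifetime with constant failure rate (1/time)."""
    return 1.0 / failure_rate


def mttf_weibull(scale, shape):
    """Mean of a Weibull lifetime: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def mttf_lognormal(mu, sigma):
    """Mean of a lognormal lifetime with scale mu and shape sigma."""
    return math.exp(mu + 0.5 * sigma ** 2)


if __name__ == "__main__":
    # Hypothetical parameters; a Weibull shape above 1 models wear-out.
    print(mttf_exponential(0.1))
    print(mttf_weibull(10.0, 2.0))
    print(mttf_lognormal(2.0, 0.5))
```

With a Weibull shape parameter of 1 the Weibull model reduces to the exponential one, so the first two functions agree in that special case.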

2.2 Three-dimensional integrated circuit

This section summarizes the current status of 3D integration in microprocessor

design, surveys related work in microprocessor thermal management, and indicates

the special thermal management challenges 3D CMPs will bring.

Several 3D fabrication technologies have been proposed and developed [109, 108,

95]. Topol et al. reviewed the 3D fabrication process and design techniques developed

at IBM [109]. Tezzaron [108] and Samsung [95] developed 3D fabrication technologies

and Intel is planning to use 3D integration in the Terascale project [115].

3D integration increases the importance of, and complicates, thermal manage-

ment. The 2D heat flux density through the heatsink increases roughly linearly with

the number of stacked wafers. As a result, unless per-layer power densities are greatly

reduced, 3D CMPs will often operate near their thermal limits. Today’s 2D CMPs

already operate at or near their thermal limits, and rely on reactive management

techniques to maintain thermal safety.

In addition to increasing the importance of thermal management, 3D integration

complicates thermal management policy design. In contrast with 2D CMPs, the

temperatures of some pairs of 3D CMP processor cores, e.g., vertically-adjacent cores,

are highly correlated. Moreover, in 2D CMPs, processor cores have similar thermal

resistances to the ambient, and high thermal resistances to other cores. In 3D CMPs,

core resistance to ambient and thermal interaction are highly heterogeneous. For

example, heat generated in cores farther from the heatsink must flow through more

layers of silicon and polyimide bonding before reaching the heatsink.
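
The heterogeneity described above can be sketched with a one-dimensional series thermal-resistance model of a die stack. This is a simplification that assumes all heat leaves through the heatsink and ignores lateral spreading; the resistance and power values are hypothetical and the function name is introduced only for this example.

```python
def stack_temperatures(layer_power, r_heatsink, r_interlayer, t_ambient):
    """Temperatures of vertically stacked layers (1D series model).

    layer_power[k] : power dissipated in layer k (W); index 0 is the
                     layer closest to the heatsink
    r_heatsink     : thermal resistance from layer 0 to ambient (K/W)
    r_interlayer   : thermal resistance of each bonding interface (K/W)
    """
    temps = []
    heat_crossing = sum(layer_power)  # all heat leaves through the heatsink
    temp = t_ambient + r_heatsink * heat_crossing
    temps.append(temp)
    for k in range(1, len(layer_power)):
        # Only heat generated at or above layer k crosses this interface.
        heat_crossing -= layer_power[k - 1]
        temp += r_interlayer * heat_crossing
        temps.append(temp)
    return temps


if __name__ == "__main__":
    # Hypothetical three-layer stack, 10 W per layer.
    print(stack_temperatures([10.0, 10.0, 10.0], 0.5, 0.2, 45.0))
```

In this model the temperature rise across the heatsink scales with the number of equally powered layers, and each layer farther from the heatsink is hotter than the one below it, matching the two effects noted in the text.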

We next survey work in microprocessor thermal management. Initially, thermal

control strategies were seen as an infrequently engaged last resort. However, due

to increasing transistor densities and limitations in cooling technology, thermal con-

trol will be constantly engaged. ThermOS was developed for this emerging thermal

management paradigm.

Black et al. evaluated the performance improvement yielded by stacking memory

and logic layers [12]. Healy et al. proposed a microarchitecture-level floorplanning al-

gorithm that works for both 2D and 3D ICs [39]. Kgil et al. proposed an architecture

in which processing core layers are vertically integrated with main memory consisting

of multiple DRAM dies, permitting performance and power consumption improve-

ments compared to 2D designs [57]. Li et al. proposed a 3D topology that combines

the benefits of network-on-chip and 3D technology to reduce L2 cache latencies [65].

Tsai et al. explored cache implementation in 3D technologies [110].

Thermal issues are critical for 3D integration. Puttaswamy and Loh evaluated the

thermal impact of 3D integration on high-performance microprocessors [89]. They

also proposed a family of techniques that reduce 3D power density and assign more

power to the die closest to the heat sink [90]. These approaches are principally applied

at design time. Skadron et al. described a compact thermal analysis technique that

has been extended to support 3D integration [99]. Loi et al. studied processor and

memory behavior under temperature constraints for 3D technology [72]. Link and

Vijaykrishnan examined thermal effects in 3D technologies [71].


Brooks and Martonosi presented one of the first evaluations of dynamic thermal

management (DTM) [14]. In essence, DTM allows microprocessor designers to

design for the average-case, instead of the worst-case, power profile and to rely on

run-time mechanisms to detect and resolve potential thermal emergencies. This

yields better overall performance than pessimistically designing systems based on the

worst-case power profile. Li et al. examined the impact of several design constraints,

including thermal effects, on CMP architecture design [69]. Sun et al. proposed a

temperature-aware synthesis technique for 3D CMPs [104], but did not consider

run-time OS management.
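
The basic idea of reactive DTM described in this paragraph can be sketched with a lumped RC thermal model and a simple clock-throttling policy. This is an illustrative toy rather than any of the cited mechanisms or ThermOS; the thermal resistance, heat capacity, trigger temperature and throttling factor are hypothetical values.

```python
def simulate_dtm(power_w, r_th, c_th, t_ambient, duration_s,
                 dt=0.01, trigger=None, throttle=0.5):
    """Forward-Euler simulation of a lumped RC thermal model.

    When trigger is set, power (and speed) is scaled by `throttle`
    whenever the temperature is at or above the trigger.
    Returns (temperature trace, fraction of full-speed work completed).
    """
    temp = t_ambient
    trace = []
    work = 0.0
    steps = round(duration_s / dt)
    for _ in range(steps):
        # Reactive policy: throttle only while the reading is too high.
        throttled = trigger is not None and temp >= trigger
        scale = throttle if throttled else 1.0
        # C * dT/dt = P - (T - T_ambient) / R
        temp += dt * (power_w * scale - (temp - t_ambient) / r_th) / c_th
        work += scale * dt
        trace.append(temp)
    return trace, work / (steps * dt)


if __name__ == "__main__":
    free, perf_free = simulate_dtm(100.0, 0.5, 20.0, 45.0, 60.0)
    dtm, perf_dtm = simulate_dtm(100.0, 0.5, 20.0, 45.0, 60.0, trigger=85.0)
    print(f"no DTM: peak {max(free):.1f} C, performance {perf_free:.2f}")
    print(f"DTM:    peak {max(dtm):.1f} C, performance {perf_dtm:.2f}")
```

The sketch shows the tradeoff that motivates the surveyed work: the temperature bound is honored at the cost of some throughput, and a better policy is one that gives up less performance for the same bound.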

Migration strategies can improve the use of multi-core processors by distributing

heat generation more uniformly across the chip. Heo et al. proposed reducing peak

power density by moving computation to another physical location [43]. Powell et

al. explored the benefit of OS thermal management for SMTs and CMPs [87]. They

proposed the Heat and Run strategy, in which the OS co-schedules and migrates SMT

threads to maximize resource utilization before a thermal emergency arises and then

migrates computation to an idle core. Kumar et al. examined hardware-software ther-

mal management that uses hardware performance counters to characterize thermal

behavior and kernel support to schedule tasks [63]. They evaluated their mechanism

on a real system with SMT support and found significant benefits from considering

system-level effects which cannot be accounted for with pure hardware techniques. We

likewise take advantage of kernel scheduling and performance counters, but additionally

consider multi-core management. Recent work by Park et al. examined energy-performance

tradeoffs in multi-threaded applications [83].


2.3 Single-electron tunneling transistors

Since single-electron tunneling transistors were discovered in the 1980s [9, 33], there

has been extensive research on fabrication, design, and modeling of SETs [70]. SET

fabrication and use in high-sensitivity amplifiers at cryogenic temperatures has been

the main research focus [25]. SETs and simple circuits with a variety of structures

were proposed and fabricated using different methods and materials [80, 105, 6]. Re-

cently, researchers have fabricated SETs that operate at room-temperature [75, 98,

84]. Various SET-based circuit applications, such as logic [111, 112, 79, 19] and

memory [126, 118, 122], have been developed. These works provide a promising start for

SET circuit design. However, these articles did not provide an architectural evalua-

tion. We do not claim to have improved the performance of SET-based logic gates.

Instead, we are the first to develop the modules necessary to support architectural

design and synthesis and to evaluate the architectural performance and power consumption

implications of using SETs. Our evaluations demonstrate orders of magnitude improvement

in power consumption and energy efficiency compared to CMOS.

Research on SET modeling and simulation has been an active area. Monte Carlo

simulation has been widely used to model SETs. SIMON [117] and MOSES [17] are

the two most popular SET simulators. However, they are too slow for analysis of large

circuits. Uchida et al. proposed an analytical SET model and incorporated it into

SPICE [113]. Recently, Inokawa et al. extended this model to a more general form to

include asymmetric SETs [49]. Mahapatra et al. proposed a simulation framework for

hybrid SET/CMOS circuit design and analysis [73]. Their model for SET behavior

is similar to that of Uchida et al. These compact modeling techniques are efficient

enough for use in SET circuit design and analysis and closely match Monte Carlo


simulation results.

Significant challenges still remain for large-scale integration of SETs and for room-

temperature operation. SETs that operate reliably at room temperature have critical

dimensions of ∼1–10 nm. They are challenging to fabricate using current top-down

lithographic techniques. However, several exciting advances make the evaluation of

architectures for high-density logic based on SETs worthwhile. Scanning-probe mi-

croscopes can be used to create devices smaller than those using conventional lithog-

raphy [75]. Continual progress has been made on bottom-up nano-fabrication tech-

niques, where chemical techniques are used to make individual molecules with useful

electronic properties. Molecular quantum dots [40] can display SET behavior. Larger

structures, such as carbon nanotubes and nanowires, can act as SETs [6]. These

bottom-up techniques can create structures supporting room-temperature SET oper-

ation. However, more research is needed in order to integrate individual devices into

large-scale circuits. Very recent advances in graphene [35] devices show promise for

SETs. Reliable methods for cooling to very low temperatures without supplies of liq-

uid helium or nitrogen are also becoming more common [114]. For high-performance

computing, the added complexity of operating at cryogenic temperatures may not be

a limiting factor. Similarly, cryogenic temperatures are readily attained using passive

methods in outer space.

Chapter 3

Reliable Multiprocessor

System-On-Chip Synthesis

This chapter presents a multiprocessor system-on-chip (MPSoC) synthesis algorithm
that optimizes system mean time to failure (MTTF). Given a set of directed acyclic
periodic graphs in which nodes represent a number of operations and edges represent
communication events, the proposed algorithm minimizes system failure rate and area
while meeting functionality and timing constraints. It determines 1) a
processor core allocation, which allocates the necessary processor cores to the MPSoC;
2) processor-level redundancy, which adds identical processor cores to the MPSoC
architecture; 3) component-level structural redundancy, which adds appropriate
control mechanisms and redundant hardware to individual processor cores; 4) an
assignment of tasks to processors, which maps each task to a processor core; 5) a
floorplan, which estimates the area of each processor core and arranges the cores within
a given region; and 6) a schedule, which determines when each operation is given
access to system resources. Changes to the thermal profile resulting from changes in


allocation, assignment, scheduling, and floorplan are modeled and optimized during

synthesis, as is the impact of thermal profile on temperature-dependent failure mech-

anisms. The proposed techniques have the potential to substantially increase MPSoC

system mean time to failure compared to area-optimized solutions. If power densities

are high and the dominant lifetime failure mechanisms are strongly dependent on tem-

perature, our results indicate that thermal and structural redundancy optimization

during synthesis have the potential to greatly increase MPSoC lifetime with low area

cost. My major contributions to this chapter are the MPSoC reliability modeling,
temperature-dependent reliability modeling, and reliability-aware optimization
algorithm design (Sections 3.2.1, 3.2.2, 3.2.3, 3.2.4, and 3.2.5). My collaborator, Zhenyu
Gu, contributed the floorplanning and on-chip network synthesis (Section 3.2.6).

3.1 Introduction

A single integrated circuit can now contain more than one billion transistors.

It has been necessary to move to MPSoCs to control design complexity and power

consumption.

Increasing power density due to continued scaling of CMOS process technology

accelerates temperature-dependent and current-dependent failure mechanisms such as

electromigration. Lifetime reliability is becoming an important quality metric in high-

performance MPSoCs. Optimizing lifetime reliability requires careful planning during

MPSoC design and synthesis. This problem cannot be well solved at any single level

of the design process. Reliability characterization requires MPSoC thermal profiles,
which in turn require physical information, including an MPSoC floorplan, power
profile, and chip-package thermal model. Reliability-aware MPSoC design requires
a unified architectural-level and physical-level design flow.

3.1.1 Contributions

Our work addresses synthesis of MPSoCs capable of reliable operation in the pres-

ence of permanent faults. The proposed algorithm generates MPSoC architectures

that satisfy the functionality and performance constraints of a specification while si-

multaneously optimizing die area and MTTF. The problem specification consists of

graphs composed of data-dependent, multirate, periodic tasks as well as a database of

processor cores. Each processor core executes different tasks with different execution

times and power consumptions. This work makes the following main contributions.

1. We have developed and implemented an MPSoC synthesis flow that conducts

architectural synthesis, floorplanning, on-chip network synthesis, chip-package

thermal analysis, and reliability analysis. Optimization algorithms within this

flow exploit redundancy and temperature-aware design planning to produce

reliable, compact MPSoC designs.

2. We propose a two-phase reliability optimization flow that builds on a stochastic

functionality, performance, and area optimization algorithm and an iterative

reliability enhancement algorithm that explores the trade-off between MPSoC

reliability and area. This algorithm improves MPSoC system MTTF by an

average of 85% with less than 5% area cost and by an average of 436% with less

than 25% area cost, compared to area-optimized solutions.

To the best of our knowledge, this is the first work to propose and implement a

method of predicting and optimizing the impact of design changes during synthesis


on temperature-dependent MPSoC failure processes.

Figure 3.1: Reliable MPSoC Synthesis Example [132]. (Two floorplans, Solution I and Solution II; block labels include PowerPC, AMD K6-2E+, and PowerPC(RE).)

3.1.2 System MTTF Definition and Example

We define system MTTF to be the expected amount of time an MPSoC will

operate, possibly in the presence of component faults, before its performance drops

below some designer-specified constraint or it is no longer able to meet its functionality

requirements. Using system MTTF to characterize reliability has the advantage of

taking into account performance; this is important for consumer electronics and most

other MPSoC applications.

To concurrently optimize the system MTTF and area of an MPSoC, it is necessary

to exploit both hardware redundancy and temperature profile. Processor-level redun-

dancy is achieved by adding processors to the MPSoC architecture. Component-level

redundancy is achieved by adding appropriate control mechanisms and redundant

hardware such as additional arithmetic logic units (ALUs) or cache banks to individ-

ual processors [103]. We will illustrate each method of improving system MTTF us-

ing an example. Figure 3.1 shows two synthesized solutions for a telecommunication


application based on processor performance data from the Embedded Microprocessor
Benchmark Consortium [31]. Each solution contains three embedded processors
connected by an on-chip router. The temperature of each on-chip component is indicated
by its brightness: brighter components are hotter. The AMD K6-2E+ embedded processor
used in Solution I is replaced with an IBM PowerPC 405GP-RE in Solution II.
The 405GP-RE is a low-power, redundant version of the 405GP; its floating-point and
fixed-point units and register files are duplicated. The system MTTFs of Solution I and
Solution II are 0.7 years and 1.5 years, respectively; this change roughly doubles MTTF.
Further reliability enhancements can be used to increase MTTF to 7 years at small area
cost. In this example, solutions contain processors from different companies. If necessary, the
database can be limited to processors from a single company. To simplify the
synthesis problem, we do not consider whether some processors are better suited than
others to a particular task, as long as overall performance meets
the deadline requirements.

This example illustrates the potential improvement to system MTTF due to
temperature reduction and resource redundancy. First, MPSoC reliability strongly depends on
temperature. In Solution I, the K6-2E+ has a peak temperature of 59.9 °C. In
Solution II, replacing the K6-2E+ with the 405GP-RE reduces the peak temperature
by 5.1 °C, thereby decreasing the run-time fault rate. Second, increasing system
redundancy improves fault tolerance. Compared to the K6-2E+, the 405GP-RE can
tolerate more run-time faults. This results in an improvement to system MTTF.


Figure 3.2: TASR Flow for the Temperature-Aware Synthesis of Reliable MPSoCs [132]. (Flowchart with two phases: stochastic optimization of functionality, timing, and area, which produces an area-optimized MPSoC; and reliability/area curve exploration, which applies core reinforcement, core swapping, and core addition to produce an area- and reliability-optimized MPSoC.)

3.2 TASR: Temperature-Aware Synthesis of Reliable MPSoCs

In this section, we describe TASR, the proposed reliable application-specific MP-

SoC synthesis infrastructure.

3.2.1 TASR Infrastructure

Determining and optimizing MPSoC system MTTF requires substantial infras-

tructure. Figure 3.2 illustrates the main steps and components in the proposed

synthesis flow. Computing system MTTF requires knowledge of component MT-

TFs and run-time performance constraints. Computing component MTTFs requires

knowledge of MPSoC thermal profile and architecture. Computing MPSoC thermal

profile during synthesis requires a floorplan, task-assignment-dependent power
modeling, and a thermal analysis algorithm. Finally, determining and optimizing MPSoC
architecture requires a system-level synthesis infrastructure that allocates processor
cores, assigns tasks to processors, rapidly generates floorplans, assigns communication
events to network links, and schedules operations and communication events.

TASR is composed of algorithms from three domains: system-level synthesis,

physical synthesis, and solution analysis. The system-level design contains a single-

objective stochastic optimization algorithm that minimizes MPSoC area subject to

functionality and performance requirements, and an iterative reliability enhancement

algorithm that uses knowledge of redundancy and thermal profile to improve system

MTTF at a small cost in MPSoC area. Physical-level synthesis consists of a slicing

floorplanning algorithm and an on-chip network synthesis algorithm. In addition,

TASR contains a novel statistical lifetime reliability model, and also performance,

power, and thermal models to guide MPSoC reliability optimization.

Given

1. Functionality and timing requirements consisting of a set of directed acyclic
periodic graphs of communicating heterogeneous tasks, each of which may have
a different deadline;

2. Databases indicating the properties of the available heterogeneous processor
cores and on-chip network resources when used with the tasks in the functionality
requirements specification, e.g., task execution times and power consumptions
on each processor and processor areas; and

3. Temperature-dependent reliability models for the processors and functional
units within them,

TASR uses a two-phase optimization flow to determine


1. An allocation of processor cores that are selected based on their performance

and reliability characteristics;

2. An assignment of tasks to processor cores that takes task impact on temperature

and therefore reliability into account;

3. A schedule of all the tasks and communication events in the system; and

4. A floorplan for the MPSoC.

The solutions are optimized for reliability (maximized MTTF) and area. Each so-

lution is associated with numerous alternative task assignments and schedules to

permit continued operation in the event of processor core failure. If a processor fails,

the resulting change in task assignment and schedule required to maintain functional

correctness and meet timing requirements is pre-planned.

3.2.2 Two-Phase Synthesis Flow

This section explains the two-phase synthesis process used within TASR. The

first phase uses a parallel recombinative simulated annealing (PRSA) algorithm, i.e.,

an advanced form of genetic algorithm, to search for low-area MPSoC architectures

that meet functionality and timing requirements without violating area constraints.

Previous studies [26] have demonstrated that PRSA allocation and assignment,
together with adaptive list scheduling, finds optimal solutions to problems
for which optimal solutions are known [88]. For problem instances with previously
published results, the PRSA approach rapidly produces solutions of equal or better
quality [44, 127]. Adaptive list scheduling makes multiple scheduling attempts with
different prioritization metrics in order to meet timing and functionality constraints.


The MPSoC lifetime reliability optimization problem can potentially be solved

using a PRSA synthesis flow by including system MTTF with the other optimization

objectives. However, the addition of reliability optimization to functional, timing,

and area optimization greatly increases problem complexity. Moreover, the time cost

of determining the reliability impact of a design change is much higher than that

of determining the area and performance impact. It becomes necessary to conduct

thermal and reliability analysis and to determine multiple task assignments and sched-

ules for each MPSoC in order to support runtime adaptation to processor core failure.

Therefore, we propose starting from an area-optimized solution meeting functionality

and timing constraints and using a reliability enhancement algorithm to explore the

area–reliability tradeoff curve.

Lifetime reliability is inversely related to chip temperature. Increasing chip
area reduces power density and chip temperature, thereby increasing chip reliability.
Structural redundancy, which permits continued processor or MPSoC operation after
component failure and generally increases area, can also improve reliability.

3.2.3 Integrated Circuit Failure Mechanisms

In this section, we characterize integrated circuit (IC) failure mechanisms. The

lifetime reliability of ICs is primarily affected by the following failure mechanisms:

electromigration, thermal cycling, time-dependent dielectric breakdown, and stress

migration [103].

Electromigration is the gradual displacement of the atoms in metal wires caused

by electrical current. It leads to voids and hillocks that cause open and short circuit


failures. The MTTF due to electromigration is given by the following equation [55]:

$$ \mathit{MTTF}_{EM} = \frac{A_{EM}}{J^{n}} \, e^{\frac{E_{a_{EM}}}{\kappa T}} \tag{3.1} $$

where AEM is a constant determined by the physical characteristics of the metal inter-

connect, J is the current density, EaEM is the activation energy of electromigration,

n is an empirically-determined constant, κ is Boltzmann’s constant, and T is the

temperature.

Thermal cycling refers to IC fatigue failures caused by thermal mismatch deforma-

tion. In IC chip and package, adjacent material layers such as copper/low-k dielectric

have different coefficients of thermal expansion. As a result, run-time thermal vari-

ation causes fatigue deformation, leading to failures. The MTTF due to thermal

cycling is given by the following equation [55]:

$$ \mathit{MTTF}_{TC} = \frac{A_{TC}}{(T_{average} - T_{ambient})^{q}} \tag{3.2} $$

where ATC is a constant coefficient, Taverage is the chip average run-time temperature,

Tambient is the ambient temperature, and q is the Coffin-Manson exponent constant.

Time-dependent dielectric breakdown is the deterioration of the gate dielectric

layer. This effect depends strongly on temperature, and is becoming increasingly

prominent with the reduction of gate-oxide dielectric thickness and non-ideal supply

voltage reduction. The MTTF due to time-dependent dielectric breakdown is given

by the following equation [55, 103]:

$$ \mathit{MTTF}_{TDDB} = A_{TDDB} \left(\frac{1}{V}\right)^{(a - bT)} e^{\frac{A + B/T + CT}{\kappa T}} \tag{3.3} $$

where ATDDB is a constant, V is the supply voltage, and a, b, A,B, and C are fitting

parameters.


Stress migration is the mass transportation of metal atoms in metal wires due to

mechanical stress caused by thermal mismatch among metal and dielectric materials.

The MTTF resulting from stress migration is given by the following equation [55]:

$$ \mathit{MTTF}_{SM} = A_{SM} \, |T_{0} - T|^{-n} \, e^{\frac{E_{a_{SM}}}{\kappa T}} \tag{3.4} $$

where ASM is a constant, T0 is the metal deposition temperature during fabrication,

T is the run-time temperature of the metal layer, n is an empirically-determined

constant, and EaSM is the activation energy for stress migration.

Equations 3.1–3.4 indicate that the lifetime reliability of ICs is strongly influenced

by temperature. Therefore, thermal analysis and optimization techniques play important
roles in reliability optimization. Generally, the MTTF values resulting from the different
mechanisms range from 20 to 30 years.
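The temperature dependence expressed by Equations 3.1–3.4 can be illustrated with a minimal Python sketch. The function names and all default parameter values (`a_em`, `n`, `ea_em`, `q`, the TDDB fitting parameters, `t0_k`, and so on) are placeholders chosen only for illustration; they are assumptions, not values taken from this work. Because the leading constants are arbitrary here, only ratios between two temperatures are meaningful.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann's constant in eV/K

def mttf_em(temp_k, j, a_em=1.0, n=1.1, ea_em=0.9):
    """Equation 3.1: electromigration. All defaults are placeholder values."""
    return a_em / (j ** n) * math.exp(ea_em / (BOLTZMANN_EV * temp_k))

def mttf_tc(t_avg_k, t_amb_k, a_tc=1.0, q=2.35):
    """Equation 3.2: thermal cycling (requires t_avg_k > t_amb_k)."""
    return a_tc / (t_avg_k - t_amb_k) ** q

def mttf_tddb(temp_k, v, a_tddb=1.0, a=78.0, b=-0.081,
              A=0.759, B=-66.8, C=-8.37e-4):
    """Equation 3.3: time-dependent dielectric breakdown (placeholder fit)."""
    voltage_term = (1.0 / v) ** (a - b * temp_k)
    thermal_term = math.exp((A + B / temp_k + C * temp_k)
                            / (BOLTZMANN_EV * temp_k))
    return a_tddb * voltage_term * thermal_term

def mttf_sm(temp_k, t0_k=500.0, a_sm=1.0, n=2.5, ea_sm=0.9):
    """Equation 3.4: stress migration; t0_k is the deposition temperature."""
    return (a_sm * abs(t0_k - temp_k) ** (-n)
            * math.exp(ea_sm / (BOLTZMANN_EV * temp_k)))

if __name__ == "__main__":
    cool, hot = 330.0, 350.0  # kelvin
    print("EM   hot/cool:", mttf_em(hot, 1.0) / mttf_em(cool, 1.0))
    print("TC   hot/cool:", mttf_tc(hot, 300.0) / mttf_tc(cool, 300.0))
    print("TDDB hot/cool:", mttf_tddb(hot, 1.8) / mttf_tddb(cool, 1.8))
    print("SM   hot/cool:", mttf_sm(hot) / mttf_sm(cool))
```

With these placeholder parameters, each ratio is below one: a 20 K temperature increase shortens the MTTF attributed to every mechanism.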

3.2.4 MPSoC Reliability Modeling

The system MTTF of an MPSoC is a function of the lifetime reliabilities of all its

PEs. In this work, we propose a system-level lifetime reliability model for MPSoCs.

Our first step is to derive an efficient modeling method that can accurately predict

the lifetime reliability of each MPSoC PE.

3.2.4.1 Reliability Modeling of On-Chip PEs

The lifetime reliability of an on-chip PE is influenced by numerous design-time and

run-time factors, such as architecture-level and circuit-level redundancy, accumulation

of wear, and run-time temperature. Accurate lifetime characterization of each PE is

challenging.


We propose a PE reliability model that is capable of incorporating the effects of

multiple fault mechanisms, component-level resource redundancy, and temperature.

The dependence of lifetime failure processes on other parameters, such as current

density, is not directly considered. Constant values of these parameters resulting in

PE MTTFs of 30 years at 50 °C and 1.8 V are used [103]. For the sake of
explanation, our description of PE reliability modeling starts from the simplest case, i.e., a

single failure mechanism, single point of failure (no resource redundancy), and con-

stant temperature. These assumptions are later relaxed, and the reliability model

generalized.

3.2.4.2 Lognormal Distribution Reliability Model for Single PE, Single

Point of Failure

Statistical modeling is commonly used in IC reliability characterization. Re-

searchers have proposed using various statistical models, e.g., exponential, Weibull,

and lognormal, to characterize IC lifetime failures. Compared to other commonly-

considered statistical models, the lognormal distribution more accurately models the

time-dependent degradation processes of ICs, e.g., diffusion, corrosion, migration, and

crack propagation [103] caused by the failure mechanisms described in Section 3.2.3.

However, using the lognormal distribution complicates the derivation of analytical

solutions. Numerical methods, such as Monte-Carlo simulation or statistical fitting

techniques, are required. These methods are computationally intensive.

Starting from the simplest assumption, for a failure mechanism i, the run-time


fault probability density function (PDF), fi(t), and the corresponding fault cumula-

tive distribution function (CDF), Fi(t), have two parameters: σiPE (a shape parame-

ter) and µiPE (a scale parameter). The MTTF of an on-chip PE due to a particular

failure mechanism i, MTTF iPE , is then estimated:

$$ \mathit{MTTF}^{i}_{PE} = \int_{0}^{\infty} t \, f_i(t) \, dt = \int_{0}^{1} t \, dF_i(t) = e^{\mu^{i}_{PE} + (\sigma^{i}_{PE})^{2}/2} \tag{3.5} $$
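The closed form in Equation 3.5 can be checked numerically. The sketch below is only an illustration: it evaluates the integral of t dF_i(t) over F_i from 0 to 1 with a midpoint rule, using the lognormal quantile function, and compares the result with the closed form. The function names and the example parameters (a 30-year median and a shape parameter of 0.5) are assumptions made for this example.

```python
import math
from statistics import NormalDist

def lognormal_mttf_numeric(mu, sigma, bins=100_000):
    """Midpoint-rule estimate of the integral of t dF(t) for F in (0, 1).

    For a lognormal distribution the quantile function is
    t(u) = exp(mu + sigma * Phi^{-1}(u)).
    """
    std = NormalDist()
    total = 0.0
    for k in range(bins):
        u = (k + 0.5) / bins  # midpoint of the k-th probability bin
        total += math.exp(mu + sigma * std.inv_cdf(u))
    return total / bins

def lognormal_mttf_closed(mu, sigma):
    """Closed form on the right-hand side of Equation 3.5."""
    return math.exp(mu + sigma ** 2 / 2.0)

if __name__ == "__main__":
    mu, sigma = math.log(30.0), 0.5  # hypothetical scale and shape
    print(lognormal_mttf_numeric(mu, sigma), lognormal_mttf_closed(mu, sigma))
```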

The overall lifetime reliability of each on-chip PE, MTTFPE , is modeled by a joint

lognormal distribution that depends on the major failure mechanisms described in

Section 3.2.3. We assume that the relationships among different failure mechanisms

are serial, i.e., each individual failure mechanism can result in the failure of a non-

redundant PE. Therefore, for each non-redundant PE, the CDF of its overall lifetime

failure probability follows:

$$ F_{PE}(t) = 1 - \prod_{i} \left(1 - F_i(t)\right) \tag{3.6} $$

where i is the index of different failure mechanisms.

Researchers have often used exponential distributions for statistical modeling due

to their convenience. Given Fi(t) with exponential distributions, Equation 3.6 would

yield an easily-computed analytical solution. However, as a consequence of using

the more accurate lognormal distribution for each Fi(t), Equation 3.6 does not allow

straight-forward estimation of PE MTTF, MTTFPE . In this work, we use statistical

fitting to approximate MTTFPE using a single lognormal distribution, governed by

µPE and σPE . The parameters for this approximation follow:

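$$ \mu_{PE} = \frac{1}{2} \log \left( \frac{\left( \int_{0}^{\infty} t \, dF_{PE}(t) \right)^{4}}{\int_{0}^{\infty} t^{2} \, dF_{PE}(t)} \right) \tag{3.7} $$

$$ \sigma_{PE} = \sqrt{ \log \left( \frac{\int_{0}^{\infty} t^{2} \, dF_{PE}(t)}{\left( \int_{0}^{\infty} t \, dF_{PE}(t) \right)^{2}} \right) } \tag{3.8} $$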
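Equations 3.7 and 3.8 amount to matching the first two moments of the serial failure distribution of Equation 3.6. A minimal sketch of this fit follows, assuming each mechanism is described by a hypothetical `(mu, sigma)` pair and that the moments are approximated with a simple Riemann–Stieltjes sum over a truncated time range (`t_max` and `steps` are illustrative choices, not part of the model).

```python
import math

def lognormal_cdf(t, mu, sigma):
    """CDF of a lognormal distribution with parameters mu and sigma."""
    if t <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))

def series_cdf(t, mechanisms):
    """Equation 3.6: the PE fails as soon as any mechanism causes a failure."""
    survive = 1.0
    for mu, sigma in mechanisms:
        survive *= 1.0 - lognormal_cdf(t, mu, sigma)
    return 1.0 - survive

def fit_lognormal(mechanisms, t_max=500.0, steps=50_000):
    """Equations 3.7 and 3.8: fit (mu_PE, sigma_PE) by moment matching."""
    dt = t_max / steps
    m1 = m2 = 0.0
    prev = 0.0
    for k in range(1, steps + 1):
        cur = series_cdf(k * dt, mechanisms)
        mid = (k - 0.5) * dt       # midpoint of the current time bin
        d_f = cur - prev           # probability mass in the bin
        m1 += mid * d_f            # first moment:  integral of t dF
        m2 += mid * mid * d_f      # second moment: integral of t^2 dF
        prev = cur
    mu_pe = 0.5 * math.log(m1 ** 4 / m2)
    sigma_pe = math.sqrt(math.log(m2 / m1 ** 2))
    return mu_pe, sigma_pe

if __name__ == "__main__":
    mechanisms = [(math.log(30.0), 0.5), (math.log(40.0), 0.4)]  # hypothetical
    mu_pe, sigma_pe = fit_lognormal(mechanisms)
    print(mu_pe, sigma_pe, math.exp(mu_pe + sigma_pe ** 2 / 2.0))
```

For a single mechanism the fit should return that mechanism's own parameters, which gives a quick sanity check of the implementation.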


3.2.4.3 Reliability Models for Inactive Spare and Active Spare Redun-

dant PEs

PEs may have component redundancy to improve reliability or performance. Such

PEs can be designed to continue functioning even after some of their components,

e.g., an ALU or a cache bank, fail. Inactive spares are redundant resources that

are not activated until a fault occurs in an active resource. The impact of faults in

inactive spares upon the lifetime reliabilities of PEs can be characterized as follows.

Assume a PE contains $M$ types of resources. Each type of resource $S_i$, $i \in \{1, \cdots, M\}$,
is comprised of $N_i$ identical elements. Assume the cumulative failure
probability of resource element $E_{i,j}$, $i \in \{1, \cdots, M\}$, $j \in \{1, \cdots, N_i\}$, is $F_{i,j}(t)$. Then,
the cumulative failure probability of resource $S_i$ is $F_{S_i}(t) = \prod_{j} F_{i,j}(t)$. The MIN–MAX
approximation [103] may be used to bound the MTTF of a PE with $M$ types of
resources as follows:

$$ \mathit{MTTF}_{PE} = \min_{i=1}^{M} \left( \int_{0}^{1} t \, dF_{S_i}(t) \right) \tag{3.9} $$
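A minimal sketch of Equation 3.9 as written follows. It assumes each resource element has a lognormal failure distribution given by a hypothetical `(mu, sigma)` pair, and it approximates each integral with a truncated Riemann–Stieltjes sum; the resource names and parameter values are illustrative only.

```python
import math

def lognormal_cdf(t, mu, sigma):
    """CDF of a lognormal distribution with parameters mu and sigma."""
    if t <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))

def resource_cdf(t, elements):
    """F_Si(t): the resource fails once all of its elements have failed."""
    prob = 1.0
    for mu, sigma in elements:
        prob *= lognormal_cdf(t, mu, sigma)
    return prob

def resource_mean(elements, t_max=1000.0, steps=20_000):
    """Approximate the integral of t dF_Si(t) with a midpoint sum."""
    dt = t_max / steps
    mean, prev = 0.0, 0.0
    for k in range(1, steps + 1):
        cur = resource_cdf(k * dt, elements)
        mean += (k - 0.5) * dt * (cur - prev)
        prev = cur
    return mean

def pe_mttf(resources):
    """Equation 3.9: the weakest resource type bounds the PE MTTF."""
    return min(resource_mean(elems) for elems in resources.values())

if __name__ == "__main__":
    unit = (math.log(30.0), 0.5)                 # hypothetical element model
    pe = {"alu": [unit], "cache": [unit, unit]}  # the cache bank is duplicated
    print(resource_mean(pe["alu"]), resource_mean(pe["cache"]), pe_mttf(pe))
```

In this example the duplicated cache bank outlives the single ALU on average, so the ALU determines the PE MTTF.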

Active spares are redundant resources that are actively used even before any faults

have occurred. Faults in active spares reduce the performance of the affected PE.

Determining the reliability impact of faults that result in changes to observable PE

behavior involves system-level design decisions, and will be described in detail in

Section 3.2.5.

3.2.4.4 Temperature-Dependent Reliability Model for Potentially Redun-

dant PEs

The lifetime reliability of a PE strongly depends on its temperature. After each

MPSoC solution is derived, performance and power analysis are conducted. The
estimated power profile, MPSoC floorplan, and cooling configuration are provided
to a thermal analysis algorithm [123] to determine the thermal profile. Note that
Equation 3.9 is derived under an assumption of constant PE temperature. Next, we
discuss temperature-dependent PE MTTF estimation.

Figure 3.3: Temperature Impact on MTTF [38]. (Plot of fault probability density versus time in years for temperatures T1 and T2, with times t1 and t2 marked.)

The temperature profile of an MPSoC varies as the tasks assigned to it change.

Task assignments change whenever migration is used to compensate for a partial or

complete PE failure. The impact of temperature variation on MTTF calculation is

illustrated in Figure 3.3. In this example, T1 and T2 are temperatures. The PE is

initially hot (T1) and, at time t1, becomes cooler (T2). Functions f1(t) and f2(t) are the

fault PDFs given temperatures T1 and T2, respectively. The overall fault distribution

of the PE should satisfy the following equation, i.e., the overall cumulative fault

distribution equals one.

$$ \int_{0}^{t_1} f_1(t) \, dt + \int_{t_2}^{\infty} f_2(t) \, dt = 1 \tag{3.10} $$

When we switch from the fault PDF associated with one temperature, e.g., T1, to that

associated with another temperature, e.g., T2, it is necessary to adjust our start time


to the value, in the new time scale, associated with the appropriate amount of wear

that had been experienced in the previous time scale, i.e., we must start integrating

from the effective age of the PE. For this example the concept can be summarized as

follows: F1(t1) = F2(t2).

Given that T0, T1, · · · , TN−1 denote the PE thermal profile, the overall fault

distribution should satisfy the following equation:

$$ \int_{t_{s_0} = 0}^{t_{e_0}} f_0(t) \, dt + \int_{t_{s_1}}^{t_{e_1}} f_1(t) \, dt + \cdots + \int_{t_{s_{N-1}}}^{\infty} f_{N-1}(t) \, dt = 1 \tag{3.11} $$

where fi(t) denotes the fault PDF of the PE at temperature Ti, tei(t) denotes the

transition time at which the temperature changes from Ti−1 to Ti, and tsi(t) denotes

the equivalent age of the PE, starting from tei−1, when the temperature switches to

Ti. The value of tsi can be determined using Equation 3.11, allowing the MTTF of a

PE to be determined using the following equation:

$$ \mathit{MTTF} = \sum_{i=0}^{N-1} \int_{t_{s_i}}^{t_{e_i}} t \, f_i(t) \, dt \tag{3.12} $$

This has the effect of breaking time into regions (the sum over i = 0, ..., N−1) during which the temperature
of the PE is uniform and, during each region, weighting each time instant by the
probability of failure at that instant (t · fi(t)). Values for tsi and tei are computed
based on Equation 3.11.
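The effective-age bookkeeping behind Equations 3.11 and 3.12 can be sketched as follows. This is an illustration under stated assumptions rather than the TASR implementation: each temperature region is described by hypothetical lognormal parameters and the time spent in the region, the end of a region is taken to be its effective start age plus that time, and the next start age is chosen so that the cumulative fault probability is continuous across the switch (the F1(t1) = F2(t2) condition). For lognormal distributions this matching and the partial integrals of t · f(t) have closed forms.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF

def partial_mean(a, b, mu, sigma):
    """Integral of t * f(t) from a to b for a lognormal(mu, sigma) PDF f."""
    def upper(x):
        if x <= 0.0:
            return 0.0
        if math.isinf(x):
            return 1.0
        return _PHI((math.log(x) - mu - sigma ** 2) / sigma)
    return math.exp(mu + sigma ** 2 / 2.0) * (upper(b) - upper(a))

def piecewise_mttf(segments):
    """Equation 3.12 for segments given as (mu_i, sigma_i, duration_i).

    The last duration should be math.inf; earlier durations must be positive.
    """
    total = 0.0
    ts = 0.0  # effective age at the start of the current segment
    for idx, (mu, sigma, duration) in enumerate(segments):
        te = ts + duration  # assumed end of the segment in its own time scale
        total += partial_mean(ts, te, mu, sigma)
        if idx + 1 < len(segments):
            next_mu, next_sigma, _ = segments[idx + 1]
            # Continuity of accumulated wear: F_i(te_i) = F_{i+1}(ts_{i+1}).
            z = (math.log(te) - mu) / sigma
            ts = math.exp(next_mu + next_sigma * z)
    return total

if __name__ == "__main__":
    hot = (math.log(10.0), 0.5)   # hypothetical parameters at temperature T1
    cool = (math.log(30.0), 0.5)  # hypothetical parameters at temperature T2
    print(piecewise_mttf([(*hot, 5.0), (*cool, math.inf)]))
```

When every region uses the same parameters, the effective age equals the elapsed time and the result reduces to the constant-temperature MTTF of Equation 3.5.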

Reliability analysis may be conducted numerous times during reliability optimiza-

tion. Therefore, modeling efficiency is critical. An MPSoC consists of numerous

PEs. If the cumulative fault probability distributions, Fi(t), are lognormal, then

solving Equation 3.9 requires computationally-intensive numerical analysis. To im-

prove computational efficiency, we produce a PE reliability library before reliability

optimization by pre-characterizing the reliability distributions of PEs as functions


of temperature and supply voltage. During MPSoC reliability optimization, when

solving Equation 3.12, the value of Fi(t) is efficiently obtained using table look-ups.

3.2.5 Reliability Optimization of MPSoCs

Figure 3.2 illustrates the proposed reliability analysis and optimization flow. In

TASR, reliability optimization starts by evaluating the system MTTF of area-optimized
solutions (using Algorithm 1). Such solutions tend to have high power density,
high temperature, low resource redundancy and, therefore, low system MTTF. An

iterative reliability enhancement algorithm is invoked if these solutions do not provide

the required system MTTF. During each iteration, Algorithm 2 optimizes MTTF by

improving processor core and component redundancy and/or optimizing chip thermal

profile by introducing new processors. System-level (task assignment and scheduling)

and physical-level (floorplanning and network synthesis) algorithms are then invoked

to produce valid MPSoC solutions. Through performance, power, thermal, and relia-

bility analyses, the system MTTFs of new solutions are estimated and evaluated. The

iterative optimization flow continues until the targeted system MTTF is achieved.

Algorithm 1 estimates system MTTF based on statistical models of MPSoC run-

time failure processes. Starting from time t = 0, it determines the minimal MTTF

among all the processor cores (line 4). Each fault may result in partial or complete

processor core failure. In either case, task migration is used to optimize system

performance. The task migration routine moves tasks from the faulty or partially-

faulty processor to other processors (line 6). After task migration, if the MPSoC

still meets its performance requirements, the algorithm considers the next processor

core with minimal MTTF. Task migration results in run-time changes in chip power
consumption and temperature profiles, thereby changing the lifetime reliability of
each processor core. To accurately predict subsequent processor MTTFs, power and
thermal analysis are conducted (line 8). This process continues until the MPSoC fails
to meet its performance or functionality requirements. The system MTTF of the
MPSoC solution is then reported (line 10).

Algorithm 1 System MTTF Analysis of an MPSoC Solution
 1: Given an MPSoC solution, set MTTF_MPSoC ← 0
 2: while system schedule is valid do
 3:   MPSoC_Func are the functioning processors in the MPSoC
 4:   Fault interval e_i ← min over p ∈ MPSoC_Func of (MTTF_p)
 5:   MTTF_MPSoC ← MTTF_MPSoC + e_i
 6:   Task migration, scheduling
 7:   if system scheduling is valid then
 8:     Power analysis, thermal analysis, compute processor temperatures
 9:   else
10:     Return MTTF_MPSoC
11:   end if
12: end while
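The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the TASR implementation: every fault is treated as a complete core failure, and the callbacks `schedule_is_valid` and `remaining_mttf` are hypothetical stand-ins for the task migration and scheduling step and for the power, thermal, and reliability analyses, respectively.

```python
from typing import Callable, Dict, Set

def system_mttf(
    core_mttf: Dict[str, float],
    schedule_is_valid: Callable[[Set[str]], bool],
    remaining_mttf: Callable[[Set[str]], Dict[str, float]],
) -> float:
    """Accumulate fault intervals until no valid schedule remains.

    core_mttf maps each functioning core to its expected time to next fault.
    The initial solution is assumed to have a valid schedule.
    """
    functioning = set(core_mttf)
    mttf_mpsoc = 0.0                                   # line 1
    while True:                                        # line 2
        # Lines 3-4: the next fault hits the core with the smallest MTTF.
        failing = min(functioning, key=lambda c: core_mttf[c])
        mttf_mpsoc += core_mttf[failing]               # line 5
        functioning.discard(failing)                   # simplification: total failure
        # Lines 6-7: migrate tasks, reschedule, and check validity.
        if functioning and schedule_is_valid(functioning):
            # Line 8: re-run the analyses to update the remaining core MTTFs.
            core_mttf = remaining_mttf(functioning)
        else:
            return mttf_mpsoc                          # line 10

if __name__ == "__main__":
    cores = {"A": 2.0, "B": 5.0, "C": 9.0}             # hypothetical values (years)
    result = system_mttf(
        cores,
        schedule_is_valid=lambda alive: len(alive) >= 2,
        remaining_mttf=lambda alive: {c: 3.0 for c in alive},
    )
    print(result)
```

In the usage example, the system tolerates one core failure; the second failure leaves too few cores to meet the schedule, so the two fault intervals are summed.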

At run-time, on-line fault detection algorithms should determine when an execu-

tion unit has failed. A proper treatment of on-line fault detection is beyond the scope

of this dissertation but can be found in the literature [77]. Upon fault detection, the

pre-planned task assignment changes associated with the particular fault are made.

If it is acceptable to reboot the system in the presence of a fault (a few times in the

system lifespan), no further provisions are necessary. If uninterrupted operation is

necessary, distributed system checkpointing may be used.

TASR is equipped with an efficient workload migration algorithm to maintain sys-

tem functionality and meet performance requirements in the presence of partial and

complete processor failures. When an MPSoC fails to meet its performance require-

ments due to run-time faults, tasks migrate to other processors using the following

policy. Tasks on faulty processors are first sorted in order of increasing time slack, the


difference between the task’s latest finish time and earliest finish time. They are then

migrated from the processor to other processors in this order until the system perfor-

mance requirements are met and no tasks are assigned to a totally failed processor.

When moving a task from one processor to another, the new processor is selected by

Pareto-ranking processors in order of increasing utilization ratio (the proportion of

time during which the processor is actively executing tasks) and increasing execution

time for the task and processor under consideration. Depending on whether a
processor is inoperable or partially failed, all or some of the tasks assigned to it migrate
to other processors.
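The migration policy described above can be sketched as follows. The dictionary keys, the tuple layout, and the tie-breaking rule among non-dominated processors (lowest utilization first) are assumptions made for this illustration; the text does not specify how ties in the Pareto ranking are resolved.

```python
def migration_order(tasks):
    """Sort tasks by increasing slack (latest finish minus earliest finish)."""
    return sorted(tasks, key=lambda t: t["latest_finish"] - t["earliest_finish"])

def pareto_ranks(candidates):
    """Rank each candidate by the number of candidates that dominate it.

    candidates: list of (name, utilization, exec_time); lower is better
    in both objectives.
    """
    ranks = {}
    for name, util, exec_time in candidates:
        ranks[name] = sum(
            1
            for _, u2, e2 in candidates
            if u2 <= util and e2 <= exec_time and (u2 < util or e2 < exec_time)
        )
    return ranks

def pick_processor(candidates):
    """Choose a non-dominated processor; ties are broken by utilization (assumed)."""
    ranks = pareto_ranks(candidates)
    return min(candidates, key=lambda c: (ranks[c[0]], c[1], c[2]))[0]

if __name__ == "__main__":
    tasks = [
        {"name": "t1", "earliest_finish": 4.0, "latest_finish": 10.0},
        {"name": "t2", "earliest_finish": 6.0, "latest_finish": 7.0},
        {"name": "t3", "earliest_finish": 6.0, "latest_finish": 9.0},
    ]
    candidates = [("P0", 0.9, 2.0), ("P1", 0.4, 3.0), ("P2", 0.5, 4.0)]
    print([t["name"] for t in migration_order(tasks)], pick_processor(candidates))
```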

TASR optimizes the lifetime reliability of MPSoCs by focusing on architectural

changes that improve redundancy and thermal profile, while maintaining low area

overhead. Algorithm 2 shows the actions taken by TASR to improve the MTTF of an

MPSoC architecture. First, the MTTF of each individual processor is estimated (line

2). The processor with the minimal MTTF is identified as the MPSoC’s most
vulnerable point, Pvul (line 3). One of the proposed reliability optimization moves is then
applied: processor reinforcement, processor swapping, or processor addition (line
4). Processor reinforcement introduces component redundancy (see Section 3.1.2)
into the most vulnerable processor. Processor swapping replaces the most vulnerable
processor with a different, more reliable, processor. Processor addition introduces
a new processor into the MPSoC, enabling tasks to migrate from the vulnerable
processor to other processors. These moves consider multiple candidate processors.

TASR uses the relative reliability gain, defined in Equation 3.13, to select the best

candidate move. This equation takes power density reduction, resource redundancy


improvement, and area overhead associated with the move into consideration.

$$ G_{TASR} = e^{-P_d} \times \mathit{MTTF}_{ref} / A \tag{3.13} $$

Note that this value is used only to guide changes. The detailed effect of each

tentative change is computed using thermal profile and reliability analysis. MPSoC

power profile influences MPSoC temperature profile, which strongly influences reli-

ability. The MTTFs associated with some major fault mechanisms are exponential

functions of temperature. Therefore, in Equation 3.13, TASR uses an exponential

term, $e^{-P_d}$, to characterize the impact of power density reduction on reliability

improvement. $P_d$ is the power density reduction resulting from applying a candidate

move. In Equation 3.13, the impact of redundancy is characterized by the second

term, $\mathrm{MTTF}_{\mathrm{ref}}$, the system MTTF improvement resulting from the candidate move.

$\mathrm{MTTF}_{\mathrm{ref}}$ is calculated under the assumption that other design characteristics, e.g.,

temperature profile and supply voltage, remain the same. The relative reliability

gain introduced by each candidate move is the product of these two terms divided

by the area overhead. The move with the highest gain is applied (line 5). After each

optimization move, system-level and physical-level synthesis algorithms are invoked

to update the MPSoC solution. Cost analysis is then conducted to determine the

improvement in system reliability, determine the impact on MPSoC area, and vali-

date the system schedule. This optimization process continues until the target system

MTTF is achieved.
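As a rough illustration of how Equation 3.13 ranks candidate moves, the sketch below evaluates the gain exactly as the equation is printed, including its sign convention for the power density term. The move names, the numeric values, and the requirement that the area overhead be positive are assumptions made for the example.

```python
import math


def relative_gain(pd_reduction, mttf_ref, area_overhead):
    """Equation 3.13 as printed: G = exp(-P_d) * MTTF_ref / A."""
    if area_overhead <= 0:
        # Assumption: the sketch only handles moves with a positive area overhead.
        raise ValueError("area overhead must be positive")
    return math.exp(-pd_reduction) * mttf_ref / area_overhead


def best_move(moves):
    """moves: iterable of (name, P_d, MTTF_ref, A); returns the name with the highest gain."""
    return max(moves, key=lambda m: relative_gain(m[1], m[2], m[3]))[0]


# Hypothetical candidate moves in arbitrary, consistent units
example_moves = [
    ("reinforce", 0.1, 1.5, 4.0),
    ("swap", 0.2, 1.3, 2.0),
    ("add", 0.3, 2.0, 10.0),
]
```

For these made-up numbers the swap has the highest gain, mainly because of its small area overhead. In TASR the gain only guides the choice; the detailed effect of the selected move is still computed by thermal and reliability analysis.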

Two additional optimization moves were implemented for the sake of comparison.

The first considers only power density, $e^{-P_d}$, and the second considers only

resource redundancy, $\mathrm{MTTF}_{\mathrm{ref}}$. Performance comparisons among these three

heuristics are provided in Section 3.3.


Algorithm 2 Reliability-Aware Optimization Algorithm

 1: while MTTF_MPSoC < MTTF_target do
 2:     ∀ pe ∈ MPSoC: compute MTTF_pe
 3:     Find vulnerable point: P_vul is the processor with minimal MTTF
 4:     Optimization moves (processor reinforcement, processor swapping, processor addition)
 5:     Apply the best move based on Equation 3.13
 6:     System-level synthesis: Task assignment and Scheduling
 7:     Physical-level synthesis: Floorplanning and network synthesis
 8:     Performance, power, thermal, reliability analysis
 9:     if system MTTF does not improve or system schedule invalid then
10:         Revert this change
11:     end if
12: end while
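The control flow of Algorithm 2 can be sketched in Python as follows. This is only a skeleton under stated assumptions: `evaluate`, `propose_moves`, and `apply_move` are hypothetical callbacks standing in for the analysis and synthesis steps, and the `max_iters` guard is an addition that is not part of the algorithm as printed.

```python
def optimize_mttf(design, mttf_target, evaluate, propose_moves, apply_move, max_iters=100):
    """Skeleton of Algorithm 2. All callbacks are hypothetical placeholders.

    evaluate(design)             -> (system_mttf, {pe: mttf}, schedule_valid)
    propose_moves(design, p_vul) -> list of (gain, move), gain as in Equation 3.13
    apply_move(design, move)     -> new design after system- and physical-level synthesis
    """
    system_mttf, pe_mttf, _ = evaluate(design)               # lines 2 and 8
    for _ in range(max_iters):                               # guard added for the sketch
        if system_mttf >= mttf_target:                       # line 1
            break
        p_vul = min(pe_mttf, key=pe_mttf.get)                # line 3
        moves = propose_moves(design, p_vul)                 # line 4
        if not moves:
            break
        _, move = max(moves, key=lambda gm: gm[0])           # line 5
        candidate = apply_move(design, move)                 # lines 6-7
        new_mttf, new_pe_mttf, valid = evaluate(candidate)   # line 8
        if valid and new_mttf > system_mttf:                 # lines 9-11
            design, system_mttf, pe_mttf = candidate, new_mttf, new_pe_mttf
        # otherwise the candidate is discarded, which reverts the change
    return design, system_mttf
```

Building a candidate design and committing it only when it improves MTTF with a valid schedule is one simple way to express the revert step; an implementation that modifies the design in place would instead need an explicit undo.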

3.2.6 Floorplanning, Thermal Analysis, and Network Synthesis

We use a fast, constructive, area- and communication-aware floorplanning (block

placement) algorithm, based on network partitioning and optimal processor orientation

and rotation selection, to determine the MPSoC power profile as well as communication

latency and communication power consumption [26]. A fine-grained MPSoC thermal

model is used within a thermal analysis algorithm designed for accuracy and high

enough speed for use within the inner loop of synthesis [123]. Finally, we carry out

on-chip network synthesis, using network topology to explicitly model communication

contention.

3.3 Experimental Results

This section describes the benchmarks used to evaluate TASR and presents the

results of evaluation.


3.3.1 Benchmarks

The proposed reliable MPSoC synthesis algorithm was evaluated using a num-

ber of benchmarks taken from the E3S embedded systems benchmark suite, which

is based on EEMBC benchmark data [31]. This suite contains 17 PEs, e.g., the

AMD ElanSC520, Analog Devices 21065L, the Motorola MPC555, and the Texas

Instruments TMS320C6203. These processors are characterized based on the mea-

sured execution times of 47 tasks commonly encountered in embedded applications,

power numbers derived from datasheets, and additional information, e.g., processor

areas, some of which were necessarily estimated, and prices gathered by emailing and

calling vendors. For any processor whose datasheet reflected results in a coarser

technology, the results were linearly scaled to a 0.18 µm technology. The task sets follow the

organization of the EEMBC benchmarks. There is one task set for each of the five

application suites: Automotive/Industrial, Consumer, Networking, Office Automa-

tion, and Telecommunications. The Office Automation problem contains only five

tasks. Our modified version of Office Automation contains four copies of the origi-

nal task set. In addition, TGFF [27] was used to generate five random benchmarks,

each of which has 30–50 tasks. The graphs have different structures, ranging from

random connectivity to a series-parallel structure commonly encountered in DSP ap-

plications. For the random benchmarks, tasks were randomly assigned task types

from the EEMBC benchmarks.

The EEMBC processors do not have component redundancy, i.e., each processor

will fail if any of its functional units fails. We introduce a redundant version for

each processor by duplicating floating/fixed point units and floating/integer register

files. We assume that instruction scheduling units and instruction decode units do


not have redundancy [103]; a run-time fault in these units will result in processor

failure. On-chip caches have redundancy; a single fault reduces performance but the

processor remains operational. We relied on previous work to estimate the cost of

component redundancy [103]. Processors with component redundancy suffer a 24%

area penalty and, while their additional functional units are still operational, have

25% higher performance and power consumption.

The embedded microprocessors in EEMBC have fairly homogeneous energy–delay

products. It is our goal to develop a synthesis algorithm that is effective at improving

the reliability of application-specific MPSoCs, which commonly contain heterogeneous

processors. Therefore, for each processor, we introduced one corresponding processor

operating at a higher voltage and another operating at a lower voltage. A maximum

of three voltages need to be provided by off-chip regulators. The alpha power law was

used to calculate the impact of voltage scaling on performance. A 0.18 µm process,

supply voltage of 1.8 V, and alpha of 1.3 were used [93]. To model high-performance

processors, the supply voltage was scaled to 2.5 V, performance increased by 25%,

and power consumption increased to 2.4×. To model low-power processors, the sup-

ply voltage was scaled to 1.28 V, performance was decreased by 25%, and power

consumption was decreased to 0.38×.
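The quoted power scaling factors can be checked with a short calculation. The sketch assumes that dynamic power dominates and scales as V² f, and that performance scales with clock frequency f; both are assumptions made for this check, not statements from the benchmark description.

```python
def power_ratio(v_new, v_nominal, perf_ratio):
    """Power relative to nominal, assuming P ~ V^2 * f and f ~ performance."""
    return (v_new / v_nominal) ** 2 * perf_ratio


V_NOMINAL = 1.8  # V, nominal supply voltage for the 0.18 um process

# High-performance variant: 2.5 V, 25% higher performance
high = power_ratio(2.5, V_NOMINAL, 1.25)
# Low-power variant: 1.28 V, 25% lower performance
low = power_ratio(1.28, V_NOMINAL, 0.75)
```

Under these assumptions the ratios come out close to the 2.4× and 0.38× figures quoted above.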

3.3.2 TASR vs. Stochastic Area Optimization

As described in Section 3.2.1, TASR consists of a two-stage optimization flow. It

first uses a stochastic optimization algorithm to minimize MPSoC area under per-

formance constraints. The area-optimized solution is used as a starting point for

the proposed reliability enhancements. The TASR lines in Figure 3.5 illustrate the


[Figure 3.4: plot of area (mm², logarithmic scale) versus MTTF (years) for the auto, consumer, networking, office4x, telecom, and random1 through random5 benchmarks.]

Figure 3.4: Comparison of MPSoC Area–Reliability Tradeoffs [38].

solutions produced by the MTTF optimization technique when run on all the bench-

marks. The initial area-optimized solutions appear at the left-most points of the

lines. TASR applied the optimization moves described in Section 3.2.5 until several

subsequent moves did not significantly improve system MTTF. Table 3.1 shows the

average system MTTF improvement over initial area-optimized solutions under dif-

ferent area overhead constraints for all ten benchmarks. These results illustrate three

key points about the reliable application-specific MPSoC synthesis problem.

1. The area cost to improve reliability is initially small. In Figure 3.4, area is

shown on a logarithmic scale. As shown in Table 3.1, improving the average

system MTTF over all benchmarks by 40%, 85%, and 180% results in maximum


[Figure 3.5: per-benchmark plots of area (mm²) versus MTTF (years) comparing TASR, CR-only, PD-only, and 1PHASE. Recoverable panel titles: auto, office4x, consumer, telecom, networking, random1.]

Figure 3.5: Comparison of Different Optimization Heuristics [132].


Table 3.1: System MTTF Improvement Under Area Bound [132]

Area bound (%)    MTTF improve. (%)
      0.0               40.0
      5.0               85.0
     10.0              180.0
     15.0              180.0
     20.0              240.0
     25.0              436.0
     30.0              457.0
     35.0              468.0
     40.0              470.0

The MTTF improvement under each area bound is computed by selecting, for each benchmark, the highest-MTTF solution that honors the area bound, and computing the average of their MTTF improvements.

area overheads of 0.0%, 5.0%, and 10.0%. MTTF is not directly considered

in the first optimization phase. As a result, TASR can sometimes improve

MTTF without area overhead because two solutions with the same area can

have different MTTFs. Initial solutions are optimized for area and tend to have

high power densities, high temperatures, and low resource redundancy: the

fault rates are high and single faults may cause failure. Therefore, the system

reliability can be improved at low area cost. TASR introduces processor cores

with lower power densities and/or replaces non-redundant cores with redundant

ones, thereby optimizing thermal properties and allowing the system to continue

operating despite runtime hardware faults.

2. As shown in Table 3.1, TASR automatically trades off system reliability for

area, allowing system designers to choose a desirable solution based on problem-

specific design constraints.

3. As system MTTF increases, the area penalty associated with further improving

system reliability increases. As shown in Table 3.1, TASR achieves 436% average

system MTTF improvement with a maximum area overhead of 25%. Further


improvements to system MTTF become prohibitively expensive. Processor core

failure cumulative distribution functions are non-decreasing. For a large enough

duration, there is a low probability that any processor will operate without a

fault. As a result, at very large MTTFs, adding processors or reinforcing a

subset of existing processors with redundant components has little impact on

MTTF.
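The averaging rule stated in the note to Table 3.1 can be sketched as follows. The data layout and the sample points are hypothetical; they only illustrate the selection and averaging steps and are not results from the experiments.

```python
def average_improvement(solutions, area_bound_pct):
    """Average, over benchmarks, of the best MTTF improvement within the area bound.

    solutions: {benchmark: [(area_overhead_pct, mttf_improvement_pct), ...]}
    """
    best = []
    for points in solutions.values():
        feasible = [imp for area, imp in points if area <= area_bound_pct]
        if feasible:  # benchmarks with no solution under the bound are skipped (assumption)
            best.append(max(feasible))
    return sum(best) / len(best) if best else 0.0


# Hypothetical solution sets; (0.0, 0.0) is the initial area-optimized solution
example_solutions = {
    "bench_a": [(0.0, 0.0), (0.0, 30.0), (4.0, 90.0), (12.0, 200.0)],
    "bench_b": [(0.0, 0.0), (5.0, 50.0), (20.0, 300.0)],
}
```

Note that `bench_a` has a second solution with zero area overhead and a higher MTTF, which is how an average improvement can be nonzero even under a 0% area bound.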

3.3.3 Evaluation of Optimization Moves

TASR optimizes system reliability by controlling processor temperatures and im-

proving system redundancy. To evaluate the effectiveness of the proposed optimiza-

tion moves, we compare TASR with two alternative moves described in Section 3.2.5:

power density only (PD-only) and component redundancy only (CR-only) moves.

PD-only minimizes power density. CR-only increases resource redundancy. Fig-

ure 3.5 shows the results produced by TASR, CR-only, and PD-only optimization

moves. TASR almost always produces architectures with both superior area and

system MTTF. In some cases, PD-only or CR-only also do well. PD-only does not

consider component redundancy. However, introducing redundant processors in order

to improve power density still improves system MTTF. CR-only does not consider

processor power density. However, redundant processors tend to have lower power

densities than non-redundant processors; although their instantaneous spatial power

densities are similar to non-redundant processors, they have higher performance, per-

mitting lower temporal power densities. In general, it is necessary to use both struc-

tural redundancy and power density to produce high-quality solutions.


3.3.4 Evaluation of Optimization Flow

As explained in Section 3.2.2, it appears that a two-phase optimization flow in

which a stochastic optimization algorithm is first used to find a promising, low-area,

region of the solution space and then an iterative reliability enhancement algorithm

is used to trade off area for reliability is superior to a one-phase optimization flow.

To determine whether this argument has merit, we compared TASR with a one-

phase stochastic optimization algorithm in which functionality, timing, area, and

reliability are concurrently optimized. This algorithm, which we call 1PHASE, has the

ability to apply all the allocation, assignment, floorplanning, and scheduling changes

available to TASR. It optimizes MTTF within its multi-objective cost function. We

found that TASR can almost always produce solutions of equal or better quality than

1PHASE. In addition, TASR generally requires less CPU time (an average of 635.9 s

per benchmark) than 1PHASE (an average of 2,394 s per benchmark).

3.4 Conclusions and Future Work

This chapter has described a synthesis algorithm for reliable application-specific

MPSoCs. The dominant failure processes today, and in the near future, have rates

exponentially dependent on temperature. Therefore, the impact of tentative design

changes on the detailed temperature profile should be considered during synthesis.

This, in turn, requires power profiles, which depend on floorplanning and power

models. Even the fastest detailed thermal analysis and floorplanning algorithms cannot

be included within the inner loop of synthesis without greatly reducing the solution

space explored in a given amount of time. Therefore, we have proposed a two-stage


synthesis process in which a potentially-slow but high-quality stochastic optimiza-

tion algorithm is first used to minimize solution area. Starting from this promising

location in the solution space, a reliability enhancement heuristic explores the area–

MTTF tradeoff curve.

Our results indicate that this synthesis approach greatly outperforms simply

adding MTTF into a stochastic optimization algorithm as another objective. The

proposed synthesis flow increases MPSoC system mean time to failure by an average

of 85% with less than 5% area cost and by an average of 436% with less than 25%

area cost, compared to area-optimized solutions. As long as power densities remain

high and the dominant lifetime failure processes remain strongly dependent on tem-

perature, our results indicate that thermal and structural redundancy optimization

during synthesis have the potential to increase MPSoC lifetime with low area cost.

Chapter 4

Three-Dimensional

Chip-Multiprocessor Run-Time

Thermal Management

Three-dimensional (3D) integration has the potential to improve the communica-

tion latency and integration density of chip-level multiprocessors (CMPs). However,

the stacked high power density layers of 3D CMPs increase the importance and diffi-

culty of thermal management. In this chapter, we investigate the 3D CMP run-time

thermal management problem and describe efficient management techniques. This

chapter makes the following main contributions: (1) it identifies and describes the

critical concepts required for optimal thermal management, namely the methods by

which heterogeneity in both workload power characteristics and processor core ther-

mal characteristics should be exploited and (2) it proposes an efficient, proactive,

continuously-engaged hardware and operating system thermal management technique

governed by optimal thermal management policies. The proposed technique is

evaluated using multiprogrammed and multithreaded benchmarks in an integrated power,

performance, and temperature full-system simulation environment. We find that

proactive power–thermal budgeting allows a 30% improvement in instruction through-

put compared to a proactive thermal management approach that bases decisions only

upon local information. The software components of the proposed thermal manage-

ment technique have been implemented in the Linux 2.6.8 kernel. The analysis and

technique developed in this chapter provide a general solution for future 3D and 2D

CMPs. My major contributions to this chapter are the characterization of heat flow in

3D CMPs, the derivation of optimal workload assignment and power–thermal budgeting,

and the thermal management implementation in the Linux kernel (Sections 4.3 and 4.4).

My collaborator, Zhenyu Gu, contributed to the design of the 3D CMP architecture and

technology, the construction of the full-system simulation framework, and benchmark

suite characterization and generation (Section 4.5).

4.1 Introduction

Continued increases in integration density, and achieving higher application per-

formance without corresponding increases in processor frequency, are now primary

goals for microprocessor designers. As a result, microprocessor design is rapidly mov-

ing towards highly-scalable chip-multiprocessor (CMP) architectures. Today’s main-

stream microprocessors are multi-core [56, 60, 7, 50, 107, 96]. The trend for future

CMPs is to increase the number of on-chip cores: 80-core prototypes have recently

been demonstrated by Intel [115].

Performance scalability is a major challenge in CMP design. Using the mainstream


two-dimensional (2D) planar CMOS fabrication process, on-chip interconnect shows

poor scalability in both performance and power consumption [5]. Three-dimensional

(3D) integration has the potential to overcome the limitations of 2D technology [109,

12, 95, 108]. By stacking multiple device layers connected through inter-die vias, 3D

integration increases logic integration density significantly and reduces on-chip wire

length, especially for global and semi-global wires. This has motivated computer

architects to evaluate 3D technology for CMP architecture design [12, 65, 57, 58].

However, none of this work describes a thermal management solution appropriate for

3D CMPs.

Thermal issues are a large and growing concern for CMPs [68, 28, 14, 99]. Increas-

ing chip power consumption and temperature affect circuit reliability (via negative

bias temperature instability, electromigration, time-dependent dielectric breakdown,

thermal cycling, etc.), power and energy consumption (via increased leakage power),

and system cost (via increased cooling and packaging cost). The use of 3D integra-

tion magnifies power dissipation problems [12, 89, 90, 71]. Chip cross-sectional power

density increases linearly with the number of vertically-stacked active circuit layers.

In addition, the interconnect and bonding layers used in 3D integration have low ther-

mal conductivities, which further exacerbates thermal effects. Temperature-related

concerns that can sometimes be safely ignored in 2D CMPs, such as temperature-

induced performance or reliability degradation, become increasingly prominent in 3D

CMPs. 3D integration holds promise but without solutions to the thermal problems

it brings, 3D CMPs will be impractical.

Run-time thermal management techniques, such as dynamic voltage and frequency

scaling, clock throttling, execution unit toggling, and workload migration, have been


proposed for 2D high-performance microprocessors [14, 99, 87, 54, 68, 28]. Using

these techniques, cooling solutions and packages need not be designed for worst-case

power consumption scenarios. Cooling cost can thereby be significantly reduced. Past

work, however, cannot effectively optimize the performance–temperature tradeoff in

3D CMPs for the following reasons.

First, the thermal management techniques deployed in current microprocessors

and operating systems are primarily used to handle rare, worst-case processor power

consumption events and eliminate thermal emergencies. Although they can poten-

tially introduce significant performance overhead, they are rarely invoked. In con-

trast, the higher power densities of future 3D (and some 2D) CMPs will frequently

require operation at or near thermal limits. Already, processors contain reactive tech-

niques to permit the use of reduced-cost packaging and cooling configurations that

are not capable of handling maximum power dissipation. Today’s laptops frequently

invoke thermal management mechanisms that drastically reduce performance, even

under normal operating conditions [74]. Power should be viewed as a limited resource

and processor cores should spend carefully-budgeted amounts. Thermal management

should be used to proactively, continuously optimize CMP performance and temper-

ature, instead of merely reacting to emergencies.

Second, 3D CMPs have heterogeneous power and thermal characteristics. On-

chip processor cores have different cooling efficiencies. For instance, cores in the

layers closer to the heatsink have higher cooling efficiencies than those farther from

the heatsink. Processor cores farther from the heatsink will have higher tempera-

tures than their neighbors nearer the heatsink, even when their power consumptions

are lower. Inter-core thermal correlation is heterogeneous. The thermal correlation


[Figure 4.1: diagrams of two stacked-die configurations (labels: Die 1, Die 2, device layer, metal layers, die-to-die vias, backside vias, I/O and power, bulk Si, heat sink) and a floorplan containing an L2 cache and processor cores.]

Figure 4.1: (a) Comparison of Face-to-Face (Left) and Face-to-Back (Right) Configurations for Two Stacked Dies, (b) 3D Three Stacked Die Floorplan Used in This Work, and (c) 3D CMP Chip-package Thermal Modeling [134].

between vertically-aligned processor cores is stronger than that between processor

cores within the same layer. The power and thermal heterogeneity of 3D CMPs poses

unique challenges for run-time thermal management. Achieving optimal 3D CMP

performance under a temperature constraint requires careful system-wide control of

each processor core’s performance and power consumption. Local control, alone, is

insufficient.

In this chapter, we develop the analytical framework necessary to determine the

thermal impact of every core in a 3D CMP upon every other core. This framework

yields guidelines for near-optimal thermal management. The guidelines are embodied

in a proactive global power–thermal budgeting algorithm, performance counter-based

workload monitor, and distributed thermal control techniques, which we have

implemented in version 2.6.8 of the Linux kernel. The resulting 3D CMP thermal

management solution, which we call ThermOS, is evaluated using detailed full-system

simulation with M5 [11]. We have integrated power modeling and thermal analysis

tools within the simulator, allowing unified architectural/power/thermal simulation

of arbitrary single-threaded and multi-threaded applications and the Linux operating

system (OS). Our results for a wide range of multiprogrammed and multithreaded


applications indicate that, given a peak temperature constraint, ThermOS improves

CMP throughput by an average of 29.84% when compared to state-of-the-art proac-

tive distributed thermal management. This improvement is primarily due to the

power–thermal budgeting guidelines used by ThermOS.

4.2 Contribution

Our work is most closely related to Donald and Martonosi’s research on CMP

thermal management using distributed control-theoretic core management and a

global controller that guides migration [28]. Both their thermal management tech-

nique and ThermOS are continuously-engaged thermal management techniques. How-

ever, existing proactive thermal management techniques are not appropriate for CMPs

with heterogeneous thermal environments, such as 3D CMPs. Global guidance and

power–thermal budgeting are particularly beneficial for 3D CMPs. By matching core

cooling characteristics, application features and voltage levels, we can improve perfor-

mance by limiting throttling and migration. We are the first to examine the impact

of thermal heterogeneity on thermal management of 3D architectures. We evaluate

our proposed policies in a full system simulator. This experimental setup accounts

for the overhead of DTM in the OS, including migration costs and context switches.

4.3 Heat Flow in 3D CMPs

This section uses examples to explain the special thermal characteristics of 3D

CMPs and develop a mathematical model that will be used to derive the thermal

management policies described in Section 4.4 and validated in Section 5.4.


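[Figure 4.2: thermal circuit diagram (labels: Cores I, J, and K; power sources P_I, P_J, and P_K; thermal resistances 1/g_inter, 1/g_intra, and 1/g_hs; heat capacities C; ambient temperature T_amb).]

Figure 4.2: Inter-layer and Intra-layer Thermal Heterogeneity and Dominance in 3D CMPs [134].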

4.3.1 Introduction to Thermal Modeling

Heat conduction within the CMP chip and package can be modeled using Fourier heat

flow analysis, which has been the standard method used by industry and academia

for circuit-level and architecture-level IC chip–package thermal analysis during the

past few decades [20, 8, 99, 125]. This method is analogous to Georg Simon Ohm’s

method¹ of modeling electrical current. In Fourier heat flow analysis, heat flow is

analogous to electrical current and temperature is analogous to voltage. The CMP is

virtually partitioned into numerous discrete blocks, as shown in Figure 4.2. The

thermal conductance of each block is proportional to the conductivity of its material

and to its cross-sectional area, and inversely proportional to its length; it is

analogous to electrical conductance. Blocks also have heat capacities that are

analogous to electrical capacitance.

¹ In fact, Ohm borrowed this model from Fourier and it was initially proposed to model heat flow.


Therefore, an instantaneous change in heat generation results in a gradual change in

temperature. As a result, the temperature profile of a CMP is essentially its power

profile after applying a complicated RC filter. We will deal with this effect in detail

in Section 4.3.3. For a thermal model to be accurate, each block must be so small

that the temperature within it is uniform. A fine-grained, and thus more accurate,

model was used to validate ThermOS. However, for the sake of explanation, this sec-

tion will describe the coarse-grained model shown in Figure 4.2, in which each core

is represented with a single thermal model element.
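To make the RC-filter analogy concrete, the sketch below gives the step response of a single lumped thermal element with conductance g to the ambient and heat capacity C, i.e., the solution of C dT/dt + g (T − T_amb) = P for a power step applied at t = 0. The single-element simplification and all numeric values are assumptions made for illustration only.

```python
import math


def step_response(t, power, g, heat_capacity, t_amb=0.0):
    """Temperature of one thermal element at time t after a power step at t = 0.

    power in W, g in W/K, heat_capacity in J/K, t in seconds.
    """
    tau = heat_capacity / g  # thermal time constant in seconds
    return t_amb + (power / g) * (1.0 - math.exp(-t / tau))


# Hypothetical element: 10 W step, g = 0.8 W/K, C = 0.1 J/K, 45 degrees ambient
POWER, G, C_HEAT, T_AMB = 10.0, 0.8, 0.1, 45.0
samples = [step_response(t, POWER, G, C_HEAT, T_AMB) for t in (0.0, 0.1, 0.5, 2.0)]
```

The temperature starts at the ambient value and rises gradually toward the steady-state value T_amb + P/g, which is the low-pass behavior described above.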

In 3D CMPs fabricated from multiple stacked wafers, the thermal environment

varies from layer to layer. Moreover, the intra-layer and inter-layer thermal rela-

tionships among CMP cores are heterogeneous. The rest of this section explains the

impact of this heterogeneity on heat flow and builds the theoretical foundations for

developing near-optimal 3D CMP thermal management policies. This understanding

is essential for proper thermal management of 3D CMPs but no prior work is based

on it.

Homogeneous Intra-Layer Characteristics

Figure 4.2 illustrates a simplified heat conduction model for a pair of adjacent

CMP cores on the same layer (J and K) and a pair of adjacent CMP cores on different

layers (I and K) of a 3D CMP. As shown in this figure, since the heat dissipation paths

of Cores J and K are nearly identical, the thermal conductances of these two cores

are nearly equal. In other words, processor cores within the same layer have similar

cooling efficiencies.


Heterogeneous Inter-Layer Characteristics

In contrast to cores on the same layer, Cores I and K have different conductances

to the ambient: $g_{hs} = 0.82$ W/K for Core K and $1/(1/g_{hs} + 1/g_{inter}) = 0.73$ W/K

for Core I.² In addition, the steady-state temperature of Core I is always higher

than that of Core K, even if Core I has a lower power consumption. The following

equations formalize this effect, which we refer to as thermal dominance. Neglecting

the limited intra-layer heat flow,

\[ T_K = T_{amb} + (P_K + P_I)/g_{hs} \tag{4.1} \]

\[ T_I = T_K + P_I/g_{inter} = T_{amb} + (P_K + P_I)/g_{hs} + P_I/g_{inter} \tag{4.2} \]

where $T_K$ and $T_I$ are the temperatures of Cores K and I, $T_{amb}$ is the ambient

temperature, $P_K$ and $P_I$ are the power consumptions of Cores K and I, $g_{hs}$ is the thermal

conductance from Core K to the ambient through the cooling solution, and $g_{inter}$

is the inter-layer thermal conductance between Cores I and K. In addition to Core I

thermally dominating Core K, it also has a higher total resistance to the ambient, i.e.,

it has a lower cooling efficiency. As a result, a unit of power consumption on Core I

will have at least as great an impact on temperature as a unit of power consumption

on Core J or K.
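Equations 4.1 and 4.2 can be evaluated directly with the conductance values quoted in this section. In the sketch below, the ambient temperature and the power values are hypothetical; only the two conductances come from the text.

```python
G_HS = 0.82     # W/K, Core K to ambient through the cooling solution
G_INTER = 6.67  # W/K, inter-layer conductance between Cores I and K


def core_temps(p_k, p_i, t_amb=45.0):
    """Steady-state temperatures (T_K, T_I), neglecting intra-layer heat flow."""
    t_k = t_amb + (p_k + p_i) / G_HS  # Equation 4.1
    t_i = t_k + p_i / G_INTER         # Equation 4.2
    return t_k, t_i


# Effective conductance from Core I to the ambient (series combination)
g_eff_i = 1.0 / (1.0 / G_HS + 1.0 / G_INTER)

# Hypothetical case: Core I consumes far less power than Core K
t_k, t_i = core_temps(p_k=10.0, p_i=2.0)
```

Even though Core I dissipates only 2 W against 10 W for Core K in this example, Core I is still the hotter of the two, which is the thermal dominance effect.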

² The thermal conductance values in this section are derived using a thermal analysis package developed by Yang et al. [125], which constructs a fine-grained 3D CMP thermal model based on the material properties and physical structure of the chip–package configuration described in Section 4.5.1.2, Table 4.3, and Table 4.4. For the sake of explanation, this section uses a coarse-grained thermal model with compact equations to illustrate the fundamental thermal properties of 3D CMPs.


Thermal Coupling

The thermal conductance between Cores J and K ($g_{intra}$) is approximately 0.41 W/K.

Heat can flow between Cores J and K. As a result, the power consumption of one can

influence the temperature of the other. However, this thermal coupling is relatively

minor compared to that between vertically-aligned cores. The thermal conductance

between Cores I and K ($g_{inter}$) is approximately 6.67 W/K, about 16× $g_{intra}$. The

large interface area between Cores I and K results in a high thermal conductance,

despite the interposed high thermal resistivity (but thin, and therefore low resistance)

10 µm polyimide bonding layer.

Summary and Open Questions

At this point, we can draw some qualitative conclusions. The temperatures of

vertically-aligned cores are highly correlated, relative to the temperatures of horizontally-

adjacent cores. Cores farther from the heatsink have higher temperatures than their

neighbors closer to the heatsink. In addition, the temperature impact of a unit of

power dissipation will be at least as high for Core I as for Cores J and K, due to their

differing thermal conductances to the ambient. However, a few questions remain:

1. How can we use this knowledge of thermal environment heterogeneity to guide

the development of a CMP thermal management algorithm? and

2. What is the impact of the power consumption of each core upon all other cores

in the system?

We will now introduce a general analytical framework that answers these questions.


4.3.2 3D CMP Heat Flow Analytical Framework

In this section, we formulate the problem of determining the impact of a unit

change in power consumption for any given processor core upon the temperatures of

all other cores. This formulation provides the theoretical foundation for determining

the principles of near-optimal thermal management. We can represent the thermal

characteristics of a 3D CMP using the following notation, which follows naturally

from the heat conduction analysis ideas discussed in Section 4.3.1:

\[ C \frac{dT(t)}{dt} + A\,T(t) = P\,u(t) \tag{4.3} \]

In this equation, given a system of N thermal elements, C is an N × N matrix

with thermal element heat capacities along the diagonal and zeros elsewhere, T is

a length N thermal element temperature vector, t is time, A is an N × N matrix

containing the thermal conductances of adjacent elements at the corresponding

row–column intersections and zeros elsewhere, P is a length N thermal element power

vector, and u(t) is a unit step function that changes from 0 to 1 at time t = 0. In addition,

matrix $A = L^{T} K L$, where L is a Laplacian matrix and K is a diagonal matrix

containing the thermal conductances of adjacent thermal elements. Given an IC

chip–package partition with N connected thermal elements plus a ground element

that models the ambient temperature, matrix A is full rank, i.e., nonsingular [76]. The

impact of the $C\,dT(t)/dt$ term will be explained in detail in Section 4.3.3. In order

to ease explanation, neglect C, then solve Equation 4.3 for T as follows:

$$T = A^{-1} P \qquad (4.4)$$


This leads to an interesting observation: $A^{-1}$ gives the thermal impact of unit changes

in power consumption. It is conventionally referred to as the thermal resistance matrix [18],

but it would be better to view it as a thermal impact matrix. In order to

determine the thermal impact of one core’s power consumption on another core’s

temperature, we need only consider the value in the corresponding row–column intersection

in $A^{-1}$. Let us assume that Core I is currently the hottest in the CMP. ζi,j is

the thermal impact coefficient for Core i due to Core j. This value indicates the change in

the temperature for element i as a consequence of a unit change in power consumption

for element j. To determine the impact of power consumed in Cores J and K upon

Core I’s temperature, we need only consider the thermal impact coefficients in row I

of $A^{-1}$, i.e., [ζI,I , ζI,J , ζI,K ]. Thus,

$$T_I = P_I \times \zeta_{I,I} + P_J \times \zeta_{I,J} + P_K \times \zeta_{I,K} \qquad (4.5)$$

The thermal impact matrix will be used extensively in Section 4.4 to develop

thermal management guidelines. It also gives us a new view of thermal heterogeneity

in 3D CMPs. For a representative stacked-wafer 3D CMP design, the ζ value for

vertically-adjacent cores is 1.22 K/W and the ζ value for laterally-adjacent cores is

0.39 K/W, yielding a thermal impact ratio of 3.12 for the two cases.
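As a minimal sketch of how a thermal impact matrix can be obtained, the following Python fragment builds a conductance matrix for three cores and inverts it. The topology (Core I stacked on Core K, Core J beside Core K, with J and K reaching the ambient through the heat sink), the conductance and power values, and the function names are assumptions made for illustration. The sign convention (negative off-diagonal conductances, row sums on the diagonal) is the usual nodal-analysis form and is also an assumption here.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def conductance_matrix(g_inter, g_lat, g_hs):
    """Nodal conductance matrix A for cores ordered (I, J, K); hypothetical topology."""
    return [
        [g_inter, 0.0, -g_inter],
        [0.0, g_lat + g_hs, -g_lat],
        [-g_inter, -g_lat, g_inter + g_lat + g_hs],
    ]


def temperatures(zeta, power):
    """Steady-state temperature rise of every core, T = A^-1 P (one row is Eq. 4.5)."""
    return [sum(z * p for z, p in zip(row, power)) for row in zeta]


# Illustrative conductances in W/K and powers in W (not taken from the thesis).
zeta = invert(conductance_matrix(g_inter=2.0, g_lat=0.0, g_hs=1.0))
rise = temperatures(zeta, [10.0, 10.0, 10.0])
```

With the lateral conductance set to zero, the entry for Core I equals 1/g_inter + 1/g_hs and the entry for Core K equals 1/g_hs, which are the expressions used for ζI and ζK in Section 4.4.1.1.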

4.3.3 Power Model, Dynamic Thermal Analysis, and Modeling Granularity

In the previous subsections, we made a number of simplifying assumptions about

the thermal environment in order to ease explanation. Our actual analysis and ther-

mal management implementation relaxes many of these assumptions for greater ac-

curacy. We now expound on our thermal model.


To determine the thermal profile, the power profile must first be known. We

model both dynamic power consumption and leakage power consumption [129]. Dependences

on voltage, switching activity, capacitance, and temperature are considered.

These equations are used together with a Wattch-based EV6 power model [15] to de-

termine the power consumption distribution among architectural units. The power

distributions of real multiprogrammed and multithreaded workloads on CMPs may

be spatially and temporally heterogeneous. The proposed modeling approach allows

us to capture the impact of workload heterogeneity on power and thermal profiles.

As explained in Section 4.3.2, the thermal analysis of real ICs must consider heat

capacity (C) as well as thermal conductance, i.e., transient analysis is necessary. The

thermal analysis infrastructure we use in architectural–thermal simulation captures

these effects using a frequency-domain moment matching analysis technique. Our

on-line thermal management technique continuously adjusts its behavior based on

thermal sensor readings. Prior subsections assumed that each CMP core is repre-

sented by a single thermal element to simplify explanation. In reality, our analysis

infrastructure is capable of dividing each CMP core into numerous three-dimensional

thermal elements to permit accurate temperature estimation.

Heat capacity plays a role in thermal modeling and management. Considering

transient effects complicates the power and thermal analysis infrastructure. Fortu-

nately, heat capacity limits the rate of temperature change, i.e., the maximum tem-

perature change of a CMP core in a given time interval is limited by the RC thermal

time constant of the core and the maximum power consumption change. Although

we used a thermal analysis infrastructure that considers transient thermal effects in

detail, the proposed thermal management technique is designed to react to transient


thermal effects by periodically adapting its behavior based on temperatures measured

with thermal sensors or estimated using run-time thermal models.
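The rate limit imposed by heat capacity can be illustrated with a single thermal element, for which Equation 4.3 reduces to the scalar form C dT/dt + g T = P. The sketch below uses the closed-form solution of that scalar equation. The function names and numeric values are hypothetical, and a real core would be modeled with many coupled elements rather than one.

```python
import math


def temperature_after(t_start, power, conductance, heat_capacity, dt):
    """Temperature rise after dt seconds for one thermal element at constant power.

    Exact solution of C dT/dt + g T = P:
    T(dt) = T_ss + (T_start - T_ss) * exp(-dt / tau), with T_ss = P / g, tau = C / g.
    """
    steady_state = power / conductance
    tau = heat_capacity / conductance
    return steady_state + (t_start - steady_state) * math.exp(-dt / tau)


def max_change(power_step, conductance, heat_capacity, dt):
    """Largest temperature change within dt that a power step can cause."""
    tau = heat_capacity / conductance
    return (power_step / conductance) * (1.0 - math.exp(-dt / tau))


# Illustrative values: 10 W step, g = 2 W/K, C = 4 J/K, so tau = 2 s.
change_in_one_tau = max_change(10.0, 2.0, 4.0, dt=2.0)
```

After one time constant the element has covered a fraction 1 - 1/e of the final 5 K rise, so a management period much shorter than the time constant sees only a small temperature change per interval.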

4.4 3D CMP Thermal Management

In this section, we investigate the 3D CMP run-time thermal management problem

and propose efficient management techniques. Given a 3D CMP with N on-chip

processor cores, our goal is to maximize the CMP throughput under run-time thermal

constraints. CMP throughput is defined as the total number of instructions executed

by the CMP per second.

$$\mathrm{CMP}_{\mathrm{IPS}} = \sum_{i=0}^{N-1} \mathrm{IPC}_i \times f_i \qquad (4.6)$$

where IPC i and fi are the run-time instructions per cycle and frequency of Core i.

Run-time thermal safety requires that

$$\forall i \in \{0, \ldots, N-1\}: \quad T_i \le T_{\mathrm{MAX}} \qquad (4.7)$$

i.e., the temperature of each processor core cannot exceed the maximum safe temperature, TMAX.
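The objective and the constraint translate directly into code. The following is a small sketch with hypothetical function names; the per-core IPC, frequency, and temperature values are assumed to come from the monitoring mechanisms described later in this chapter.

```python
def cmp_ips(ipc, freq_hz):
    """CMP throughput (Eq. 4.6): sum over cores of IPC_i * f_i, in instructions/s."""
    if len(ipc) != len(freq_hz):
        raise ValueError("need one IPC and one frequency per core")
    return sum(i * f for i, f in zip(ipc, freq_hz))


def thermally_safe(temps, t_max):
    """Thermal safety (Eq. 4.7): every core temperature is at most t_max."""
    return all(t <= t_max for t in temps)


# Two cores with illustrative values: IPC 2.0 at 2 GHz and IPC 1.0 at 1.5 GHz.
example_ips = cmp_ips([2.0, 1.0], [2.0e9, 1.5e9])
```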

In the following sections, we analyze the thermal management problem for 3D

CMPs and determine the policies necessary for performance optimization under tem-

perature constraints. This study will be used to guide the development of our run-time

thermal management techniques.


4.4.1 Conditions Required for Optimal 3D CMP Thermal Management and Derivations of Resulting Policy Guidelines

This section derives performance optimization guidelines. The central theme is

to optimize the performance of CMP cores under a constraint on peak temperature

during workload assignment and power–thermal budgeting.

Observation: To maximize CMP throughput, processor cores should operate at dif-

ferent voltages and frequencies due to heterogeneous processor core thermal charac-

teristics and heterogeneous run-time workloads.

As described in Section 4.3.1, processor cores in a 3D CMP are thermally correlated.

The temperature of each Core i is affected by the power consumptions of all cores,

as follows:

$$T_i = \sum_{j=0}^{N-1} \zeta_{i,j} \times p_j \le T_{\mathrm{MAX}} \qquad (4.8)$$

where Ti is the temperature of processor Core i; ζi,j, i, j ∈ [0, N −1] is an inter-core

thermal impact coefficient, which indicates the impact of a unit power consumption

of Core j on the temperature of Core i; pj is Core j’s power consumption; and N is

the number of processor cores of the CMP.

We would like to guide migration of tasks among cores, and budget power to cores,

in order to optimize CMP throughput under a temperature constraint. To facilitate

developing the necessary guidelines, we introduce the concept of thermal impact per

performance gain, TIP :

$$\mathrm{TIP}^{f}_{i,j} = \frac{dT_i}{df_j}, \qquad \mathrm{TIP}^{\mathrm{IPC}}_{i,j} = \frac{dT_i}{d\,\mathrm{IPC}_j} \qquad (4.9)$$

TIP i,j indicates the thermal impact on processor Core i due to the increase in Core j’s

performance, achieved by increasing its frequency and voltage and/or assigning a high-IPC

job to this core. Intuitively, TIP is the thermal cost per unit increase in processor core

performance. It can be viewed as the inverse of a core’s thermal efficiency. Subject

to a temperature bound, maximizing CMP performance thus requires that all the

processor cores achieve the same thermal impact per performance improvement on

the maximum-temperature core, i.e.,

$$\mathrm{TIP}^{f,\mathrm{IPC}}_{i,0} \equiv \mathrm{TIP}^{f,\mathrm{IPC}}_{i,1} \equiv \cdots \equiv \mathrm{TIP}^{f,\mathrm{IPC}}_{i,N-1} \qquad (4.10)$$

Note that the impact on Ti due to the power consumption of Core j is ζi,j Pj. Dynamic

power consumption is $P_j = \xi_j V_j^{2} f_j$, where Vj and fj are the supply voltage

and frequency of Core j, $V_j \propto f_j^{\beta}$ with β ≈ 1 [13], and ξj is Core j’s run-time switching

activity multiplied by the capacitance of the switched nodes (which is approximately

linearly proportional to the IPC of the job running on Core j). Then

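$$\begin{aligned}
\zeta_{i,0}\, f_0^{2\beta+1} &\equiv \zeta_{i,1}\, f_1^{2\beta+1} \equiv \cdots \equiv \zeta_{i,N-1}\, f_{N-1}^{2\beta+1} \\
\zeta_{i,0}\, \xi_0\, f_0^{2\beta} &\equiv \zeta_{i,1}\, \xi_1\, f_1^{2\beta} \equiv \cdots \equiv \zeta_{i,N-1}\, \xi_{N-1}\, f_{N-1}^{2\beta}
\end{aligned} \qquad (4.11)$$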

This result indicates that processor cores with heterogeneous power and thermal

characteristics, i.e., different power–thermal impact coefficients, ζi,j, running jobs

with different IPCs should be clocked at different frequencies. A similar conclusion

can be drawn when both dynamic and leakage power variants are considered.
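The step from Equation 4.10 to Equation 4.11 can be spelled out as follows. This is a sketch that considers dynamic power only and writes the proportionality between switching activity and IPC as $\xi_j = \kappa\,\mathrm{IPC}_j$, where the constant $\kappa$ is introduced here for illustration.

```latex
\begin{align*}
\zeta_{i,j} P_j &= \zeta_{i,j}\,\xi_j V_j^{2} f_j \;\propto\; \zeta_{i,j}\,\xi_j f_j^{2\beta+1}, \\
\mathrm{TIP}^{f}_{i,j} &= \frac{dT_i}{df_j} \;\propto\; (2\beta+1)\,\zeta_{i,j}\,\xi_j f_j^{2\beta}, \\
\mathrm{TIP}^{\mathrm{IPC}}_{i,j} &= \frac{dT_i}{d\,\mathrm{IPC}_j} \;\propto\; \kappa\,\zeta_{i,j} f_j^{2\beta+1}.
\end{align*}
```

Requiring each quantity to be equal across all cores j and dropping the common factors $(2\beta+1)$ and $\kappa$ gives the second and first lines of Equation 4.11, respectively.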

As shown in Section 4.3.1, the inter-layer and intra-layer thermal characteristics

of 3D CMPs show distinct differences. This leads to different thermal management

policies for inter-layer and intra-layer processor cores. In the following sections, we

determine the conditions required for optimal 3D CMP thermal management and

derive the resulting policy guidelines.


4.4.1.1 Inter-Layer Power–Thermal Budgeting and Workload Assignment

Inter-layer processor cores have heterogeneous thermal characteristics. In addi-

tion, vertically-aligned cores have strongly-correlated temperatures. We now derive

heterogeneity-aware guidelines for power–thermal budgeting and workload assignment

among vertically-aligned cores.

Guideline I: To maximize CMP throughput, the thermal efficiencies of vertically-

aligned processor cores should be optimized under the thermal constraint, i.e., the

voltage and frequency assignment among vertically-aligned processor cores should fol-

low Equations 4.8–4.11.

As shown in Section 4.3.1, among each group of vertically-aligned processor cores,

the Core i farthest from the heat sink is thermally dominant, i.e., it has the highest

temperature and also the lowest cooling efficiency. Therefore, given the thermal

constraint for processor Core i, i.e., Ti ≤ TMAX , the performance-optimal voltage

and frequency setup produced by Equations 4.8–4.11 also guarantees the thermal

safety for other vertically-aligned processor cores. In other words, Equations 4.8–4.11

provide the performance-optimal power–thermal budget policy for vertically-aligned

processor cores. Considering Cores I and K in Figure 4.2,

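$$\zeta_I \left(= \frac{1}{g_{inter}} + \frac{1}{g_{hs}}\right) > \zeta_K \left(= \frac{1}{g_{hs}}\right), \text{ and}$$

$$T_I \left(= \zeta_I \times P_I + \zeta_K \times P_K\right) > T_K \left(= \zeta_K \times P_I + \zeta_K \times P_K\right)$$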

Equations 4.8–4.11 yield

$$\frac{f_I}{f_K} = \left( \frac{\mathrm{IPC}_K \times \zeta_K}{\mathrm{IPC}_I \times \zeta_I} \right)^{\frac{1}{2\beta}}$$

Given homogeneous workload assignment, i.e., IPC I ≡ IPCK , this implies that fK > fI , i.e., to optimize CMP

throughput, the processor core with higher cooling efficiency should be clocked at a higher

frequency.


Guideline II: Given jobs with different IPCs, the maximal CMP throughput can

only be achieved by maximizing the IPC heterogeneity during workload distribution.

To maximize throughput, jobs with higher IPCs should be assigned to cores with higher

thermal efficiencies.

This guideline indicates how to distribute run-time workload among vertically-

aligned processor cores. We will again use Figure 4.2 to illustrate the reason for this

guideline. Given a temperature constraint TMAX and an arbitrary workload assign-

ment with Core I’s IPC equal to IPC I and Core K’s IPC equal to IPCK , Equa-

tions 4.8–4.11 yield the following performance-optimal power and thermal budget

assignment under the given workload distribution:

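$$f_I = f_K \times \left( \frac{\mathrm{IPC}_K \times \zeta_K}{\mathrm{IPC}_I \times \zeta_I} \right)^{\frac{1}{2\beta}} \qquad (4.12)$$

$$f_K = \left( \frac{T_{\mathrm{MAX}}}{\zeta_K \times \mathrm{IPC}_K \left( 1 + \left( \frac{\zeta_K \times \mathrm{IPC}_K}{\zeta_I \times \mathrm{IPC}_I} \right)^{\frac{1}{2\beta}} \right)} \right)^{\frac{1}{2\beta+1}} \qquad (4.13)$$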

Next, we switch the workloads of Core I and Core K. Equations 4.8–4.11 then

yield the following performance-optimal power and thermal budget assignment for

the new distribution:

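$$f'_I = f'_K \times \left( \frac{\mathrm{IPC}_I \times \zeta_K}{\mathrm{IPC}_K \times \zeta_I} \right)^{\frac{1}{2\beta}} \qquad (4.14)$$

$$f'_K = \left( \frac{T_{\mathrm{MAX}}}{\zeta_K \times \mathrm{IPC}_I \left( 1 + \left( \frac{\zeta_K \times \mathrm{IPC}_I}{\zeta_I \times \mathrm{IPC}_K} \right)^{\frac{1}{2\beta}} \right)} \right)^{\frac{1}{2\beta+1}} \qquad (4.15)$$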

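Then, a simple calculation shows that the difference in CMP throughput between

these two workload distributions satisfies

$$(\mathrm{IPC}_I \times f_I + \mathrm{IPC}_K \times f_K) - (\mathrm{IPC}_K \times f'_I + \mathrm{IPC}_I \times f'_K) \ge 0 \iff \mathrm{IPC}_I \le \mathrm{IPC}_K \qquad (4.16)$$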

In other words, assigning jobs with higher IPCs to cores with higher thermal efficien-

cies yields higher overall throughput under the same temperature constraint.
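A numerical check of Guideline II can be sketched as follows. The code evaluates Equations 4.12 and 4.13 for one workload assignment and then for the swapped assignment. It assumes dynamic power only, takes ξ equal to the IPC (normalized units), and uses illustrative values for ζI, ζK, and TMAX; the function names are hypothetical.

```python
def optimal_frequencies(ipc_i, ipc_k, zeta_i, zeta_k, t_max, beta=1.0):
    """Return (f_I, f_K) from Eqs. 4.12 and 4.13.

    Assumes P = IPC * f**(2*beta + 1) in normalized units and that
    T_I = zeta_i * P_I + zeta_k * P_K is the binding temperature constraint.
    """
    ratio = ((ipc_k * zeta_k) / (ipc_i * zeta_i)) ** (1.0 / (2.0 * beta))
    f_k = (t_max / (zeta_k * ipc_k * (1.0 + ratio))) ** (1.0 / (2.0 * beta + 1.0))
    return f_k * ratio, f_k


def throughput(ipc_i, ipc_k, zeta_i, zeta_k, t_max, beta=1.0):
    """Aggregate throughput IPC_I * f_I + IPC_K * f_K at the optimal frequencies."""
    f_i, f_k = optimal_frequencies(ipc_i, ipc_k, zeta_i, zeta_k, t_max, beta)
    return ipc_i * f_i + ipc_k * f_k


# Normalized, illustrative values: Core I is farther from the heat sink.
ZETA_I, ZETA_K, T_MAX = 1.5, 1.0, 1.0
low_ipc_on_i = throughput(1.0, 2.0, ZETA_I, ZETA_K, T_MAX)   # high-IPC job on Core K
high_ipc_on_i = throughput(2.0, 1.0, ZETA_I, ZETA_K, T_MAX)  # jobs swapped
```

For these values the assignment that places the high-IPC job on Core K gives the larger throughput, in agreement with Equation 4.16.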

4.4.1.2 Intra-Layer Power–Thermal Budgeting

Intra-layer cores have mostly-homogeneous thermal characteristics with almost

identical cooling efficiencies (see Section 4.3.1), i.e., ζi,i ≈ ζj,j, when Core i and Core j

are in the same layer. In addition, the inter-core thermal impact is significantly lower

than the self power–thermal impact of each core, i.e., ζi,i ≫ ζi,j when i ≠ j. We

derive the following policies for intra-layer power–thermal budgeting and workload

assignment.

Guideline III: To maximize aggregate CMP frequency or instruction throughput,

power–thermal budget and workload should be balanced among intra-layer processor

cores.

Consider two intra-layer processor cores J and K with ζJ,J ≡ ζK,K ≫ ζJ,K ≡

ζK,J . The temperature of each core depends mainly on its own power consumption,

i.e., TJ ≈ ζJ,J × PJ and TK ≈ ζK,K × PK (steady-state). Given thermal constraint

TJ , TK ≤ TMAX , performance optimization yields PJ ≡ PK and TIPJ ≡ TIPK , i.e.,

both cores should be clocked at the same frequency and execute workload with the

same IPC. This guideline can also be motivated as follows. Assume both cores are

assigned the same voltage V , frequency f , and workload (ξ and IPC ). Therefore,

TJ ≡ TK . Next, by adjusting the workload assignment, we increase the IPCs of the


Figure 4.3: ThermOS: 3D CMP Run-time Thermal Management [134]. (Block diagram. Operating system: global power–thermal budgeting, distributed thermal-aware workload migration. CMP hardware: temperature monitoring, workload monitoring, distributed run-time thermal management.)

jobs assigned to one core and decrease the IPCs of the jobs assigned to the other core.

Since ζJ,J , ζK,K ≫ ζJ,K , ζK,J , the temperature of one of the cores increases and the

peak temperature of these two cores increases. As a result, frequency reduction and

performance degradation are required to meet temperature constraints.
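The argument for Guideline III can be checked numerically. The sketch below compares a balanced and a skewed split of the same total IPC over two same-layer cores. It assumes negligible inter-core coupling, dynamic power only, and P = IPC × f^(2β+1) in normalized units; the function name and the values are hypothetical.

```python
def balanced_vs_skewed(total_ipc, skew, zeta, t_max, beta=1.0):
    """Throughput of two same-layer cores for a balanced and a skewed IPC split.

    Each core runs at the highest frequency that keeps T_j = zeta * P_j at t_max.
    Requires 0 < skew < total_ipc / 2.
    """
    def core_throughput(ipc):
        freq = (t_max / (zeta * ipc)) ** (1.0 / (2.0 * beta + 1.0))
        return ipc * freq

    half = total_ipc / 2.0
    balanced = 2.0 * core_throughput(half)
    skewed = core_throughput(half + skew) + core_throughput(half - skew)
    return balanced, skewed


balanced, skewed = balanced_vs_skewed(total_ipc=4.0, skew=1.0, zeta=1.0, t_max=1.0)
```

Under these assumptions the per-core throughput grows as IPC^(2β/(2β+1)), which is concave in IPC, so for a fixed total the balanced split gives the larger sum.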

4.4.2 ThermOS: 3D CMP Thermal Management

Based on the thermal management guidelines developed in Section 4.4.1, we have

developed ThermOS, a unified hardware and OS thermal management solution for

3D CMP. As shown in Figure 4.3 and Table 4.1, ThermOS consists of hardware-

based temperature–workload monitoring and distributed run-time thermal manage-

ment built into a 3D CMP microarchitecture, as well as a temperature-aware Linux

kernel equipped for global power–thermal budgeting and distributed temperature-

aware workload migration. ThermOS is a proactive, continuously-engaged solution

designed to handle 3D CMP power–thermal heterogeneities, distribute run-time work-

load, and manage the limited power–thermal budget to optimize performance under


temperature constraint. Our ThermOS is built upon the Linux 2.6.8 kernel. It has an

O (1) time complexity scheduler. Our temperature-aware scheduling algorithm main-

tains the same time complexity. Table 4.1 summarizes the proposed offline, run-time,

and hardware management techniques.

4.4.2.1 Temperature Monitoring

ThermOS gathers CMP temperature profiles at run-time, which are used to guide

temperature-aware workload migration as well as power–thermal budgeting. Either

thermal sensors or online thermal analysis may be used for on-line temperature mon-

itoring. Thermal sensors have been widely used in high-performance microproces-

sors [85, 56]. Efficient software-based online thermal analysis techniques have also

been developed [99].

4.4.2.2 Workload Monitoring

In addition to CMP thermal profile, ThermOS gathers run-time performance and

power characteristics to guide job migration as well as power–thermal budgeting. A

processor core’s activity factor is a function of the capacitances of its functional units

and the corresponding run-time activity factors resulting from its workload. Most

modern processors provide hardware performance counters for monitoring specific

events [56, 101]. These performance counters can be used to inform accurate and

efficient regression-based run-time performance and power models [52, 63]. ThermOS

uses this technique for linear regression estimation of run-time processor core activ-

ity factors. The model was developed offline and integrated with the OS. During

execution, each processor core’s hardware performance counter values are gathered


Table 4.1: ThermOS Implementation [134].

Offline computation
    Given the activity factor range of the on-chip processor cores, derive the look-up table, which contains the optimal voltages and frequencies yielded by Equations 4.8–4.11.

Online (OS)
    rebalance_tick()    Invoke cluster_opt() and group_opt() at the beginning of each workload migration time interval (every 20 ms).
    cluster_opt()       Conduct inter-layer migration according to Guideline II.
    group_opt()         Conduct intra-layer migration according to Guideline III.
    scheduler_tick()    1) Monitor the activity factors of run-time processes using hardware performance counters.
                        2) Determine the global power–thermal budgeting using run-time table lookup.

Hardware
    Local DVFS          Proactive distributed DVFS based on global guidance and local variation.
    Local clock         Reactive distributed clock throttling to guarantee thermal safety.


periodically when triggered by OS timer interrupts (every 1 ms in Linux 2.6.8 kernel).

These performance counter values are used for run-time workload activity and IPC

estimation.
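The offline-fit, online-evaluate structure of such a regression model can be sketched as follows. The example fits a single-predictor line by ordinary least squares and evaluates it from two counter values. The choice of IPC as the only predictor, the training samples, and the function names are hypothetical; the models cited above may use several counters.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x (done offline)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimate_activity(model, instructions, cycles):
    """Run-time estimate from two counter readings taken over one sampling period."""
    intercept, slope = model
    return intercept + slope * (instructions / cycles)


# Hypothetical offline training samples: (IPC, measured activity factor).
MODEL = fit_line([0.5, 1.0, 2.0, 3.0], [0.15, 0.20, 0.30, 0.40])
activity = estimate_activity(MODEL, instructions=4_000_000, cycles=2_000_000)
```

At run time only one multiply and one add per predictor are needed, which is consistent with evaluating the model inside a timer interrupt handler.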

4.4.2.3 Distributed Thermal-Aware Workload Migration

ThermOS contains a distributed online workload migration technique to support

performance optimization. The proposed technique follows the guidelines derived in

Section 4.4.1 and carefully handles 3D CMP inter-layer thermal heterogeneity and

run-time workload heterogeneity. ThermOS uses a distributed approach that swaps

jobs with high IPCs to processor cores with higher thermal efficiencies.

Consider two vertically-adjacent processor cores: Core I and Core K. Assume

Core K has higher cooling efficiency than Core I. To optimize instruction throughput,

ThermOS compares the jobs stored in each processor core’s job queue. It first identi-

fies the lowest-IPC job (IPCMINK) on core K and the highest-IPC job (IPCMAX I)

on Core I. If IPCMINK < IPCMAX I , ThermOS swaps the corresponding jobs. Intra-

layer thermal heterogeneity and thermal correlation are small. Therefore, ThermOS

balances the intra-layer IPC distribution to optimize instruction throughput. Aver-

age IPCs of jobs on horizontally-adjacent cores are compared. If appropriate, they

are swapped to further balance the distribution. The proposed distributed thermal-

aware workload migration technique has been integrated within the default Linux

kernel workload balancing policy. In the current implementation, workload migration

occurs every 20 ms.
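The inter-layer swap rule described above can be sketched as follows. Job queues are modeled as dictionaries from job name to measured IPC, which is an assumption made for illustration and not the kernel's run-queue structure. Job names are assumed to be unique across both queues, and the IPC values are taken from Table 4.5.

```python
def inter_layer_swap(queue_i, queue_k):
    """One migration step between vertically adjacent cores I and K.

    Core K is assumed to have the higher cooling efficiency. The lowest-IPC job
    on K and the highest-IPC job on I are exchanged when IPC_MIN_K < IPC_MAX_I.
    Returns True if a swap happened.
    """
    if not queue_i or not queue_k:
        return False
    min_k = min(queue_k, key=queue_k.get)
    max_i = max(queue_i, key=queue_i.get)
    if queue_k[min_k] < queue_i[max_i]:
        queue_k[max_i] = queue_i.pop(max_i)
        queue_i[min_k] = queue_k.pop(min_k)
        return True
    return False


core_i = {"gcc": 3.36, "mcf": 1.25}
core_k = {"gzip": 2.78, "vpr": 1.47}
swapped = inter_layer_swap(core_i, core_k)   # moves gcc to Core K and vpr to Core I
```

A second call on the same queues makes no further change, because every job left on Core I now has a lower IPC than every job on Core K.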


4.4.2.4 Global Power–Thermal Budgeting

ThermOS dynamically adjusts the power–thermal budgets of processor cores to

optimize 3D CMP performance. Following the guidelines in Section 4.4.1, ThermOS

balances the power–thermal budget assignment among processor cores in the same

layer. Equations 4.8–4.11 are used to guide inter-layer power–thermal budgeting. The

leakage-temperature dependency introduces temperature variables on both sides of

Equation 4.10. Solving this equation requires numerical iteration and detailed chip-

package thermal analysis, which are computationally intensive. To minimize run-time

overhead, we have developed a hybrid offline/online budgeting technique.
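The feedback between temperature and leakage can be illustrated with a fixed-point iteration on a single thermal element. This is only a sketch: the linearized leakage model, the parameter values, and the function name are hypothetical, and the iteration described in this section uses detailed chip-package thermal analysis instead of one scalar coefficient.

```python
def converge_temperature(zeta, p_dynamic, leak_ref, leak_slope, t_ref, t_ambient,
                         tol=1e-6, max_iter=1000):
    """Iterate T = t_ambient + zeta * (P_dyn + P_leak(T)) until T stops changing.

    P_leak(T) = leak_ref * (1 + leak_slope * (T - t_ref)) is a hypothetical
    linearized leakage model used only for illustration.
    """
    temp = t_ambient
    for _ in range(max_iter):
        p_leak = leak_ref * (1.0 + leak_slope * (temp - t_ref))
        new_temp = t_ambient + zeta * (p_dynamic + p_leak)
        if abs(new_temp - temp) < tol:
            return new_temp
        temp = new_temp
    raise RuntimeError("thermal-leakage iteration did not converge")


# Illustrative values: 1.2 K/W, 12 W dynamic, 3 W leakage at 60 degrees, 2 %/K slope.
t_final = converge_temperature(1.2, 12.0, 3.0, 0.02, t_ref=60.0, t_ambient=45.0)
```

The iteration contracts when zeta × leak_ref × leak_slope is below one; otherwise the leakage feedback diverges in this simple model.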

Given the switching activity (or IPC) range of the workload, the optimal voltage

and frequency settings for vertically-aligned processor cores are pre-computed. The

offline component of the budgeting algorithm is iterative. During each iteration, based

on the IPC and the switching activity of each processor core, Equations 4.8–4.11 are

used to determine the optimal processor core power–thermal budgets. Thermal anal-

ysis is then used to estimate the 3D CMP thermal profile and update the leakage

power profile estimate. This process iterates until the chip-package thermal profile

converges, subject to feedback from temperature-dependent leakage power consump-

tion. The final voltage and frequency configurations are stored in a look-up table

for efficient use during online power–thermal budgeting. Given that the number of

processor layers is L and the number of activity factor settings is n, the lookup table

has $n^{L}$ entries. Increasing n, i.e., the resolution of the activity factor index, improves

performance but increases storage overhead, as demonstrated in Section 4.6.4.2. In

ThermOS, run-time power–thermal budgeting is implemented in the Linux kernel and

invoked periodically. Periods ranging from 1 ms to 100 ms are currently supported.
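The structure of the look-up table can be sketched as follows. The table is indexed by one quantized activity level per layer, giving n^L entries. The `solve` argument stands in for the offline optimization and is supplied by the caller; the identity function used below is a placeholder, and normalizing activity factors to [0, 1] is an assumption.

```python
from itertools import product


def build_table(n_levels, n_layers, solve):
    """Precompute one entry per combination of quantized activity levels.

    `solve` maps a tuple of per-layer activity factors to the settings to store.
    """
    levels = [(k + 0.5) / n_levels for k in range(n_levels)]  # bin centres in (0, 1)
    return {idx: solve(tuple(levels[k] for k in idx))
            for idx in product(range(n_levels), repeat=n_layers)}


def lookup(table, n_levels, activities):
    """Quantize run-time activity factors in [0, 1] and fetch the stored settings."""
    idx = tuple(min(int(a * n_levels), n_levels - 1) for a in activities)
    return table[idx]


# Placeholder solver: store the bin centres themselves instead of V/f settings.
table = build_table(n_levels=4, n_layers=2, solve=lambda acts: acts)
entry = lookup(table, 4, (0.1, 0.99))
```

The run-time cost is one quantization per layer and one table access, independent of n, while storage grows as n^L.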


4.4.2.5 Distributed Run-Time Thermal Management

ThermOS uses distributed run-time thermal management to honor the power and

thermal budgets described in Section 4.4.2.4 and adhere to a temperature constraint.

Periodically, each processor core adjusts its voltage and frequency based on its as-

signed power–thermal budget. However, transient variations may not be immediately

detected by the OS. In order to honor the temperature constraint, ThermOS uses

local dynamic voltage and frequency scaling (DVFS) and clock throttling to react

to transient variation with lower latency than global power–thermal budgeting. Ta-

ble 4.2 compares these two widely-used power management techniques. DVFS has

high area overhead, mainly due to complex power supply circuitry and the need for

off-chip capacitors and inductors for each independent voltage domain. It also has

a higher response latency than clock throttling. For modern high-performance mi-

croprocessors equipped with DVFS, the voltage transition rate is in the range of

10 mV/µs [51]. Clock throttling, on the other hand, has low area overhead and low

latency. However, DVFS has less performance impact per unit power reduction than

clock throttling, thanks to the superlinear dependence of power on voltage. Note that

most modern high-performance processors already support DVFS. We are proposing

to use this existing DVFS hardware to the best effect. In ThermOS, local DVFS con-

tinuously tracks temperature changes and clock throttling is used as a final defense

to guarantee thermal safety.
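The performance difference between the two mechanisms follows from the power model of Section 4.4.1 and can be sketched numerically. The code assumes dynamic power only, performance proportional to clock frequency, and throttling that scales power and performance by the same duty cycle; the function names are hypothetical.

```python
def throttling_performance(power_fraction):
    """Clock throttling: power and throughput both scale with the duty cycle."""
    return power_fraction


def dvfs_performance(power_fraction, beta=1.0):
    """DVFS: P ~ V^2 f with V ~ f^beta, so f scales as P^(1 / (2*beta + 1))."""
    return power_fraction ** (1.0 / (2.0 * beta + 1.0))


# Relative performance retained when power must be cut to half.
throttled = throttling_performance(0.5)
scaled = dvfs_performance(0.5)
```

With β = 1, halving power through DVFS retains the cube root of one half of the original performance, roughly 79%, whereas throttling retains 50%.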


Table 4.2: DVFS and Clock Throttling Comparison [134].

                    Area overhead   Response   Performance impact
DVFS                High            Slow       Low
Clock throttling    Low             Fast       High

4.5 Experimental Setup

This section describes the experimental setup used to evaluate the proposed 3D

CMP dynamic thermal management techniques. We describe our simulation and OS

infrastructure, 3D chip and package models, and benchmark suites.

4.5.1 Infrastructure

Performance and temperature estimation for 3D CMP architectures is challenging.

Estimating spatial and temporal thermal profiles requires time-varying power profiles.

This, in turn, requires timing and power analysis. To accurately estimate the run-time

characteristics of 3D CMPs, we developed a full-system out-of-order multiprocessor

simulation environment with integrated processor performance, power, and thermal

models.

4.5.1.1 Full-System Simulation Setup

We use the M5 Full System Simulator [11]. M5 provides a detailed, cycle-accurate,

out-of-order simulation mode and a faster functional simulation mode. We use a com-

bination of full-system checkpoints and the functional simulation mode to boot the

system and fast-forward past the initialization portion of our benchmarks. We then


Table 4.3: Design Parameters for Alpha 21264 [134].

Alpha 21264 Configuration (90 nm)
Die size                 4.56 × 4.56 mm²
Frequency and Voltage    2 GHz, 1.2 V
Instruction Queue        64 entries
Functional Units         4 IXU, 2 FPU, 1 BPU
Physical Registers       80 GPR, 72 FPR
Branch Predictor         1 K local, 4 K global

Memory Hierarchy
L1 DCache/core           32 KB, 2-way, 64 B blocks, 3 cycle lat.
L1 ICache/core           64 KB, 2-way, 64 B blocks, 1 cycle lat.
Shared L2 Cache          16 MB, 8-way LRU, 64 B blocks, 25 cycle lat.

Table 4.4: 3D Package Setup [134].

Layer                               Thermal cond. (W/mK)   Heat cap. (J/m³K)   Depth (µm)
Eff. Active Layer (Silicon)         160.11                 1.66 × 10^6         50
Eff. Interface Layer (Polyimide)    6.83                   3.99 × 10^6         10
Heatsink (Cu)                       400                    3.55 × 10^6         6,900
Thermal Grease [94]                 3–5 (5 used)           4 × 10^6 *          50

* From configuration used in HotSpot [99].

switch to detailed simulation mode to evaluate thermal and performance character-

istics.

We added a Wattch-based EV6 power model to M5 [15], scaled to a 90 nm process.

Our cache power model is based on CACTI [106]. Static power consumption was

estimated using an area-based, temperature-sensitive leakage model [103]. A 3D

frequency-domain dynamic thermal analysis package was used [125]. Each active

layer was modeled using numerous thermal elements.


4.5.1.2 Processor Architecture

There are two ways to stack device layers: face-to-face and face-to-back. For

designs with more than two layers, face-to-back bonding decreases worst-case inter-wafer

via delay. We evaluate a three-layer face-to-back CMP structure. As shown

in Figure 4.1, there are eight Alpha 21264 microprocessor cores in the top two layers.

Each layer contains four microprocessor cores. Layers are connected with polyimide

glue. There is 50 µm of thermal grease between the heatsink and die. Parameters for

thermal grease and interface material follow Samson et al. [94].

Each processor core has a 32 KB L1 data cache and a 64 KB L1 instruction cache.

There is a 16 MB shared L2 cache on Layer 2 and 1,024 MB of main memory. A

90 nm technology is modeled. Details can be found in Table 4.3 and Table 4.4.

We have accounted for inter-layer vias in the thermal model in the following way.

The via density in a region follows ρvia = nAvia/(wh) where n is the number of vias

in the region, Avia is the cross section area of each via, w is the width of the region,

and h is the height of the region. The relationship between via density and effective

vertical thermal conductivity follows:

Keff = ρviaKvia + (1− ρvia)Klayer (4.17)

where Kvia is the thermal conductivity of the via material and Klayer is thermal

conductivity of the region without any vias. Here, the via is assumed to be copper

with a thermal conductivity of 400 W/mK. A typical via size is 15 µm×15 µm.
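Equation 4.17 and the via density definition can be evaluated as in the sketch below. The via count, region size, and via-free layer conductivity used in the example are hypothetical and are not the values behind Table 4.4; only the copper conductivity and the 15 µm × 15 µm via size come from the text.

```python
def via_density(n_vias, via_area, width, height):
    """rho_via = n * A_via / (w * h); lengths in metres, area in square metres."""
    return n_vias * via_area / (width * height)


def effective_conductivity(rho_via, k_via, k_layer):
    """Eq. 4.17: area-weighted vertical thermal conductivity in W/mK."""
    if not 0.0 <= rho_via <= 1.0:
        raise ValueError("via density must lie in [0, 1]")
    return rho_via * k_via + (1.0 - rho_via) * k_layer


# 100 copper vias of 15 um x 15 um in a 1 mm x 1 mm region (hypothetical count),
# with a hypothetical via-free layer conductivity of 150 W/mK.
rho = via_density(100, 15e-6 * 15e-6, 1e-3, 1e-3)
k_eff = effective_conductivity(rho, k_via=400.0, k_layer=150.0)
```

Because the model is a linear mix, even a via density of a few percent shifts the effective conductivity noticeably when the via material conducts much better than the surrounding layer.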

For the Alpha 21264, there are 587 package pins (389 die pins). Interconnect

vias use 0.64% of the core area. This results in the effective bulk silicon layer and

interface layer thermal conductivities reported in Table 4.4. There are three types of

heat sinks: extruded, folded-fin, and integrated vapor-chamber. In this chapter, we


assume an extruded copper heat sink with a thermal conductivity of 400 W/mK [116].

4.5.1.3 Operating System

The ThermOS run-time thermal management algorithms are implemented within

the Linux 2.6.8 kernel. We made two main changes to the kernel:

• Performance-counter based power modeling: We enable OS-level power estima-

tion using performance counters. Hardware event counters of the sort typical

for modern processors were added to M5. A regression-based power model was

added to the OS [52].

• Power–thermal budgeting, task migration, and thermal management: The pro-

posed power–thermal budgeting and temperature-aware task migration tech-

niques were implemented in the Linux kernel. We modified M5 to support

kernel control of DVFS, clock throttling, and temperature monitoring through

privileged machine registers.

4.5.2 Benchmark Suites

Multithreaded and multiprogrammed benchmarks from SPEC2000, Media Bench,

ALPBench [66], and SPLASH2 [100] are used. Phansalkar et al. did a detailed analysis

of SPEC2000 and found that it can be divided into different groups based on several

benchmark-specific metrics [86]. In order to build a complete set of test cases for our

proposed techniques, we selected two benchmark-specific metrics: IPC and expected

temperature variation. Although the absolute values of these metrics depend on

microarchitectural characteristics, their relative differences in a set of benchmarks

are mostly micro-architecture independent.


Table 4.5: Benchmark Characteristics [134].

Group             Name             Avg. IPC   Avg. Pow. (W)   Max. T   Max. δT
SPEC High IPC     gcc              3.36       14.67           64.88    0.20
                  applu            3.13       14.37           65.64    0.12
                  gzip             2.78       13.34           63.49    0.34
                  mgrid            2.58       13.66           61.84    0.31
SPEC Low IPC      twolf            1.58       11.33           64.30    0.19
                  parser           1.55       10.41           60.70    0.28
                  vpr              1.47       10.63           60.43    0.29
                  mcf              1.25       10.91           63.79    0.25
Media High IPC    gsmenc           3.10       13.50           63.38    0.09
                  jpegdec          2.72       13.42           65.89    0.13
Media Low IPC     g721enc          1.94       11.91           61.39    0.08
Multithreaded     MPGenc           2.95       14.34           68.78    0.20
(two threads)     Sphinx3          1.13       9.93            61.68    0.02
                  cholesky         2.83       14.27           70.57    0.32
                  lu               2.26       12.10           66.97    0.08
                  radix            0.84       5.81            57.17    0.28
                  water-nsquared   1.85       11.99           65.32    0.12
                  water-spatial    1.74       10.57           62.35    0.08


Table 4.6: Benchmark Suites [134].

Multiprogrammed test setups
Group   Filename      Clusters                  Benchmarks
SPEC    hv-hipc       High T var., high IPC     gzip, mgrid
        lv-hipc       Low T var., high IPC      applu, gcc
        hv-lipc       High T var., low IPC      parser, vpr
        lv-lipc       Low T var., low IPC       twolf, mcf
        hv-mipc1      High T var., mixed IPC    gzip, parser
        hv-mipc2      High T var., mixed IPC    mgrid, vpr
        lv-mipc1      Low T var., mixed IPC     applu, mcf
        lv-mipc2      Low T var., mixed IPC     gcc, twolf
Media   media-hipc    High IPC                  jpegdec, gsmenc
        media-mipc    Mixed IPC                 gsmenc, g721enc

Multithreaded test setups
MPGenc, sphinx3, cholesky, lu, radix, water-nsquared, water-spatial

• IPC: IPC is approximately linearly related to power consumption, which has

a strong influence on temperature.

• Expected temperature variation: The main goal of the proposed 3D CMP ther-

mal management technique is to maximize performance subject to a tempera-

ture constraint. In order to evaluate it, we have selected a set of benchmarks

with a wide range of spatial and temporal thermal characteristics.

Based on these metrics, the benchmarks were analyzed, yielding the results in

Table 4.5. Dynamic power traces were gathered during 500 ms to determine average

power consumption, the temporal average of peak temperature, and the maximum

peak temperature variation.

We created 17 test setups (see Table 4.6). Ten of these were for multiprogrammed

benchmarks. Each contains mixes of benchmarks with high and low temperature


variation and IPC. Each test setup contains two SPEC or Media benchmarks. For

multithreaded benchmarks, seven test setups are created. Each test setup contains

one ALPBench or SPLASH2 benchmark with two parallel threads. During experi-

ments, each run contains eight copies of each test setup, i.e., 16 processes/threads in

total with two processes or threads per core on average.


This section evaluates ThermOS, the proposed run-time thermal management

solution for 3D CMPs.

4.6.1 Comparison of ThermOS With Alternatives

In this section, we first contrast ThermOS with solutions used in existing pro-

cessors. Then we provide a detailed quantitative comparison with a state-of-the-art

continuously-engaged thermal management technique. The following experiments use

85 °C as a predefined thermal constraint.

Most thermal management techniques used in practice react to emergencies in-

stead of being continuously engaged. They detect dangerously-high temperatures and

reduce power consumption, generally via hardware clock throttling. Such solutions

are adequate when temperatures approach their limits only very rarely. However,

high power densities and constraints on cooling costs require proactive thermal man-

agement. Some researchers have moved in this direction.

Donald and Martonosi [28] proposed a distributed continuously-engaged thermal

management technique for 2D CMPs. Their approach is based on closed-loop control


theory, and continuously adjusts the voltage and frequency of each processor core

to maintain safe temperatures. Each core has its own controller and the controllers

act independently, without knowledge of the conditions of other cores. This per-

mits significantly better performance than reactive approaches because DVFS can

generally reduce power consumption by the same amount as clock throttling with

a smaller performance penalty. In fact, their results indicate that, compared with

a stop-go based thermal control policy, distributed DVFS improves throughput by

2.5×. However, independent local control has limitations. The power consumed in

one processor can impact the temperatures of other processors in nonuniform ways.

As a result, continuously-engaged global control can permit better performance than

continuously-engaged local control. This is especially true for 3D architectures, in

which the power consumption of a particular processor core has great impact on the

temperature of vertically-aligned cores and relatively less impact on other cores.
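The per-core closed-loop behavior described above can be sketched as follows. This is a minimal illustration, not the controller from [28]: the class name, the frequency range, and the gain are all hypothetical, and voltage scaling is not modeled. The point it makes is that each controller reads only its own core's temperature, so it cannot account for heat contributed by vertically-aligned neighbors.

```python
from dataclasses import dataclass


@dataclass
class LocalDvfsController:
    """Illustrative per-core controller; every numeric default is hypothetical."""
    t_limit_c: float = 85.0   # thermal constraint used in this section
    f_min_ghz: float = 1.0    # assumed lowest DVFS frequency
    f_max_ghz: float = 3.0    # assumed highest DVFS frequency
    gain: float = 0.05        # assumed GHz of correction per degree C of error
    freq_ghz: float = 3.0     # current operating frequency

    def step(self, temp_c: float) -> float:
        """Update the frequency from this core's own temperature reading only."""
        error = self.t_limit_c - temp_c            # positive means thermal headroom
        target = self.freq_ghz + self.gain * error
        # Clamp to the available DVFS range.
        self.freq_ghz = min(self.f_max_ghz, max(self.f_min_ghz, target))
        return self.freq_ghz


if __name__ == "__main__":
    ctrl = LocalDvfsController()
    for reading in (90.0, 88.0, 84.0, 80.0):
        print(reading, round(ctrl.step(reading), 3))
```

Because the update uses no information from other cores, two stacked cores running this loop independently can each settle at a frequency that is locally safe but globally suboptimal, which is the limitation the text attributes to local control.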

ThermOS uses continuously-engaged, distributed global/local control to maximize

performance given a temperature bound. It supports both 3D and 2D architectures.

It differs from state-of-the-art temperature control techniques in two primary ways.

First, it uses global power budgeting that takes into account the thermal interaction

between processor cores. Second, it directs temperature-aware workload migration of

threads among processor cores.

Figure 4.4 shows the 3D CMP run-time instruction throughput (BIPS: billion instructions
per second) achieved by ThermOS and by Donald and Martonosi’s approach. Compared to the
distributed local approach, ThermOS improves instruction throughput by 29.84% on average
(ranging from 15.22% to 53.79%). This can be explained


Figure 4.4: Comparison of ThermOS and Distributed Approach [28, 134]. (Bar chart: throughput in BIPS for each benchmark mix; series: ThermOS, Distributed approach.)

as follows. In 3D CMPs, the strong thermal correlation among inter-layer vertically-

aligned processor cores has significant impact on the temperature of the processor

layer farthest from the heat sink. Using the proposed power–thermal budgeting

and thermal-aware workload migration techniques, ThermOS determines appropri-

ate power budgets for each group of vertically-aligned processor cores. In addition, it

uses DVFS to optimize the power–thermal efficiency of each processor core. Together,

these techniques maximize overall throughput. Donald and Martonosi’s work, on

the other hand, is a distributed, processor-local technique. Using this technique, each

processor core regulates its power and performance to ensure local thermal safety

without considering the thermal impact on neighboring cores. As a result, vertically-

aligned processor cores are unable to collaboratively share the power and thermal

budget, which can reduce CMP performance. In other words, when a distributed,


Figure 4.5: Reduction in Temperature Constraint Violations due to Local DVFS and Elimination of Temperature Constraint Violations due to Clock Throttling [134]. (Bar chart: thermal violation in % for each benchmark mix; series: with local DVFS and clock throttling, with local DVFS but without clock throttling, without local DVFS or clock throttling.)

local management technique is used, power consumption on processor cores near the

heatsink can push processor cores farther from the heatsink to their thermal limits.

4.6.2 Efficiency Impact of Guaranteeing Thermal Safety

In this section, we establish an upper bound on performance by evaluating a

thermal management technique with near-optimal performance, but vulnerability to

temperature constraint violations due to transient changes in workload. We then

show that there is only a small performance reduction resulting from the additional

management techniques ThermOS uses to guarantee thermal safety.


ThermOS uses the temperature-aware workload migration and global power–

thermal budgeting guidelines derived in Section 4.4.1. These techniques can poten-

tially offer near-optimal run-time performance subject to a temperature constraint.

However, they do not immediately react to transient workload variation occurring

in individual processor cores, which may cause run-time temperature constraint vi-

olations. ThermOS uses distributed run-time thermal management techniques to

guarantee thermal safety, i.e., local DVFS and clock throttling dynamically adjust

the voltage and frequency of each processor core to eliminate thermal emergencies.

Compared to DVFS, clock throttling is more responsive but degrades performance

more for the same thermal improvement. Therefore, in ThermOS, DVFS is continu-

ously engaged and clock throttling is invoked only when local DVFS cannot guarantee

thermal safety. These techniques, however, may cause the run-time operations of the

processor cores to deviate from the guidelines derived in Section 4.4.1. Straying from

these guidelines has the potential to reduce performance.
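The escalation between the two local mechanisms can be sketched as below. This is an illustrative reading of the paragraph above, not the ThermOS implementation: the function name, the one-step-per-decision policy, the margin, and the rule that throttling engages only once the lowest voltage/frequency level has been reached are all assumptions.

```python
def local_thermal_action(temp_c, level, num_levels, t_limit_c=85.0, margin_c=1.0):
    """Return (new_level, throttle) for one core.

    level 0 is the lowest voltage/frequency setting, num_levels - 1 the highest.
    margin_c is a hypothetical guard band below the thermal limit.
    """
    if temp_c < t_limit_c - margin_c:
        # Comfortable headroom: DVFS may step up, throttling stays off.
        return min(level + 1, num_levels - 1), False
    if temp_c < t_limit_c:
        # Inside the guard band: hold the current setting.
        return level, False
    if level > 0:
        # At or above the limit: DVFS is the first line of defense.
        return level - 1, False
    # DVFS has no lower setting left, so fall back to clock throttling.
    return 0, True


if __name__ == "__main__":
    for temp, lvl in ((80.0, 2), (84.5, 2), (85.2, 2), (85.2, 0)):
        print(temp, lvl, local_thermal_action(temp, lvl, num_levels=4))
```

Keeping throttling as the last resort mirrors the trade-off stated in the text: throttling responds faster but costs more performance for the same thermal improvement.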

Figure 4.5 illustrates the levels of thermal safety achieved by various control tech-

niques. As shown in this figure, when distributed control is disabled, the voltage and

frequency of each processor core is solely controlled by global power–thermal budget-

ing, which does not consider the temporal workload variation within each processor

core. This local workload variation can cause significant run-time power variation,

and therefore temperature constraint violations. Local DVFS can adapt to rapid

workload variation occurring within each processor core and adjust voltage and fre-

quency accordingly, thereby reducing run-time thermal emergencies. When clock

throttling is also enabled, processor thermal emergencies are completely eliminated

(see Figure 4.5).


Figure 4.6: Temporal Temperature Variation for Eight Processor Cores (P0–P7) Running lv-mipc2 Using Local DVFS without (Top) and with (Bottom) Clock Throttling [134]. (Line plots: temperature in °C versus time in ms, from 300 ms to 500 ms, one profile per processor core.)


Figure 4.7: Negligible CMP Instruction Throughput Reduction Resulting from Local DVFS and Clock Throttling [134]. (Bar chart: normalized throughput in BIPS for each benchmark mix; series: with local DVFS and clock throttling, with local DVFS but without clock throttling, without local DVFS or clock throttling.)

To further illustrate the effectiveness of the distributed run-time control tech-

niques, Figure 4.6 shows the run-time thermal profiles of eight processor cores when

running the lv-mipc2 benchmark, with and without local clock throttling. Proces-

sors 0–3 are adjacent to the heatsink and processors 4–7 are farther from it. Local

DVFS balances the CMP thermal profile, and run-time temperature constraint violations

(exceeding 85 °C, a predefined thermal threshold used in this experiment) occur only

rarely. When both local DVFS and clock throttling are enabled, the temperature

constraint is never violated.

Figure 4.7 indicates that the performance penalty introduced by the distributed

control techniques required to guarantee thermal safety is low. To help quantify the

performance impact, we normalize the CMP throughput to the value achieved by


global power–thermal budgeting and then evaluate the CMP throughput with local

DVFS only and with both local DVFS and clock throttling. These results indicate that

local DVFS degrades instruction throughput by 0.55% on average. Since local DVFS

is capable of eliminating most run-time thermal emergencies, clock throttling is rarely

invoked. As shown in Figure 4.7, enabling both local DVFS and clock throttling

reduces instruction throughput by only 0.60% on average.

In summary, the proposed distributed run-time thermal control technique achieves

thermal safety with little performance impact.

4.6.3 Robustness to Changes in 3D Integration

In order to show the robustness of ThermOS to variation in 3D integration style,
we evaluated its performance improvement for CMPs using front-to-back and
front-to-front wafer integration (see Section 4.5.1). We simulated the proposed
technique and Donald and Martonosi’s distributed local approach [28] for both
integration styles using all benchmark mixes shown in Table 4.6. The average CMP
instruction throughput improvement was 29.84% for front-to-back integration and
23.77% for front-to-front integration. For all combinations of benchmarks and
packages, the instruction throughput improvements were greater than 7%. We can
conclude that ThermOS permits substantial improvements in performance over Donald
and Martonosi’s distributed local technique for different 3D integration styles.

4.6.4 Scalability Analysis of ThermOS

ThermOS uses distributed temperature-aware workload migration, global power–

thermal budgeting, and distributed run-time thermal control techniques to optimize


Figure 4.8: Impact of Global Guidance Interval [134]. (Bar chart: throughput in BIPS for hv-hipc, hv-lipc, hv-mipc1, hv-mipc2, lv-hipc, cholesky, and radix; series: 1 ms, 10 ms, 50 ms, and 100 ms guidance intervals.)

3D CMP throughput and guarantee thermal safety. In contrast with purely local

distributed techniques, run-time power–thermal budgeting is global. This might raise

concerns about the scalability of ThermOS when used on many-core 3D CMPs. In this

section, we evaluate the scalability of the proposed global power–thermal budgeting

technique.

4.6.4.1 Performance Impact

ThermOS periodically decides power–thermal budgets for processor cores. This

involves inter-layer and intra-layer assignment. Run-time inter-layer assignment uses

efficient table lookup. Intra-layer assignment uses an efficient homogeneous assign-

ment policy, i.e., processor cores within the same layer are assigned the same power–

thermal budgets. In the current setup, i.e., an eight-core 3D CMP with a 1 ms global


guidance interval, detailed simulation shows that the overall run-time overhead intro-

duced by global power–thermal budgeting is only 0.22%.
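The two-step assignment described above (table lookup across layers, then a uniform split within each layer) might look like the following sketch. It is an illustration under stated assumptions rather than the ThermOS data structure: the function names, the indexing of the table by quantized per-layer activity factors, the two-layer shape, and the table contents are all hypothetical.

```python
def quantize(activity, n):
    """Map an activity factor in [0, 1] to one of n table indices."""
    clamped = min(max(activity, 0.0), 1.0)
    return round(clamped * (n - 1))


def assign_budgets(table, layer_activity, cores_per_layer):
    """Look up per-layer budgets, then give every core in a layer the same budget.

    table[i][j] holds a (layer0_budget, layer1_budget) pair that is assumed
    to have been precomputed offline for activity settings i and j.
    """
    n = len(table)
    i = quantize(layer_activity[0], n)
    j = quantize(layer_activity[1], n)
    layer_budgets = table[i][j]                      # inter-layer: one table lookup
    return [[budget] * cores_per_layer               # intra-layer: homogeneous
            for budget in layer_budgets]


if __name__ == "__main__":
    # Toy 3 x 3 table with made-up budgets in watts.
    toy_table = [[(8.0 - i, 6.0 - j) for j in range(3)] for i in range(3)]
    print(assign_budgets(toy_table, layer_activity=(0.0, 1.0), cores_per_layer=4))
```

Both steps are constant-time per budgeting decision, which is consistent with the small run-time overhead reported for the eight-core setup.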

The run-time overhead of global power–thermal budgeting is inversely proportional
to the run-time global guidance/budgeting interval. In general, shorter global
guidance intervals can more accurately track run-time workload variation but may
introduce more run-time overhead and communication contention when aggregating data
from different CMP cores. It might therefore be useful to reduce this overhead by
increasing the global guidance interval.
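As a rough worked example of this trade-off, the sketch below scales the measured 0.22% overhead at a 1 ms interval to longer intervals. It assumes a fixed cost per budgeting invocation, so that overhead falls in proportion to how rarely budgeting runs; that scaling model and the function name are assumptions made here, not measurements from this chapter.

```python
def budgeting_overhead(interval_ms, base_overhead=0.0022, base_interval_ms=1.0):
    """Estimated overhead fraction, assuming a fixed cost per budgeting invocation.

    base_overhead is the 0.22% reported for the eight-core setup at a 1 ms interval.
    """
    return base_overhead * base_interval_ms / interval_ms


if __name__ == "__main__":
    for interval in (1.0, 10.0, 50.0, 100.0):
        print(f"{interval:6.1f} ms -> {100 * budgeting_overhead(interval):.4f}%")
```

Under this assumption, the intervals evaluated in Figure 4.8 leave substantial room to absorb a higher per-invocation cost on many-core systems.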

In the current setup, a 1 ms guidance interval is used. This is frequent enough to
allow adjustments in the global power–thermal budget before temporal workload
variation can produce large temperature changes, i.e., a higher frequency is
unnecessary. To evaluate the impact of increasing the global guidance interval on
system performance, we run all six benchmarks with high workload variation from
Table 4.6. One low-variation benchmark (lv-hipc) is also included for the sake of
comparison. The results are shown in Figure 4.8. They indicate that, for guidance
intervals up to and including 100 ms, ThermOS maintains nearly identical performance.
Only hv-hipc, cholesky, and radix experience noticeable performance degradation, due
to their high temporal workload variation. However, changing the global guidance
interval from 1 ms to 100 ms only reduces CMP instruction throughput by 1.81%, 1.06%,
and 2.61% for hv-hipc, cholesky, and radix, respectively. We conclude that even if it
were necessary to increase the global guidance interval by two orders of magnitude in
order to maintain low global power–thermal budgeting run-time overhead in many-core
3D CMPs, there would be little reduction in thermally-safe performance.


Figure 4.9: Impact of Lookup Table Size [134]. (Bar chart: throughput in BIPS for each benchmark mix; series: 6×6, 11×11, and 51×51 lookup tables.)

4.6.4.2 Storage Impact

As described in Section 4.4.2.4, ThermOS uses an offline iterative budgeting
algorithm to precompute some power–thermal budgeting decisions, which are stored
in a lookup table in main memory for efficient run-time use. This lookup table has
n^L entries, each requiring 4 B of storage. L is the number of processor layers; the
number of processor layers in 3D CMPs is expected to be limited. n is the number of
activity factor settings, which determines the power–thermal budgeting resolution.
Higher resolution improves the accuracy of the run-time power–thermal budgeting
decisions, but also increases the storage requirements for the table. In the current
setup, we use a two-dimensional lookup table with 51×51 entries (10.4 KB), which
provides sufficient resolution for accurate power–thermal budgeting.

It might be useful to decrease lookup table resolution for many-core systems in


Figure 4.10: Impact of Floorplan Rotation [134]. (Bar chart: throughput in BIPS for each benchmark mix; series: ThermOS and the distributed approach, each with and without floorplan rotation.)

order to limit storage overhead. We evaluated the impact of decreasing lookup table

resolution on thermally-safe CMP performance by running all benchmark mixes using

51×51, 11×11, and 6×6 tables. As shown in Figure 4.9, compared to the 51×51

lookup table, the 11×11 lookup table setting reduces the memory usage from 10,404 B

to 484 B, with average CMP instruction throughput reductions of 0.75%. When the

table is reduced to 6×6 entries, memory usage decreases to 144 B, with average CMP

instruction throughput reductions of 2.87%. We conclude that ThermOS requires

little storage and that its performance degrades slowly with reduced lookup table

size.
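The storage figures quoted in this section follow from the table dimensions and the 4 B entry size. The short check below reproduces them; the function name is illustrative, and the two-layer default reflects the two-dimensional table used in the current setup.

```python
def lookup_table_bytes(n, layers=2, entry_bytes=4):
    """Storage for a table with n settings per layer: n**layers entries of entry_bytes each."""
    return (n ** layers) * entry_bytes


if __name__ == "__main__":
    for n in (51, 11, 6):
        print(f"{n} x {n} table: {lookup_table_bytes(n)} B")
```

The same function shows why the layer count matters more than the resolution: with a hypothetical third processor layer, a 51-setting table would need 51^3 × 4 B, roughly 530 KB.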


4.6.5 Interaction with 3D CMP Floorplan Optimization

This experiment evaluates ThermOS for 3D CMPs with different floorplans. CMP

thermal profile is strongly influenced by on-die power distribution. In 3D CMPs, inter-

layer vertically-aligned processor cores have strong thermal correlation. If all cores

have identical floorplans, functional units with high power densities are vertically-

aligned, potentially creating local thermal hotspots. Intelligent inter-layer floorplan

arrangement can potentially balance inter-layer power profile and minimize chip peak

temperature. Using the three-layer 3D CMP setup with two processor core layers and

one L2 cache layer, detailed thermal analysis shows that, by rotating the floorplan of

top-layer processor cores by 180 degrees, chip power profile is more balanced, intra-

core local hotspots are minimized, and chip peak temperature is reduced by 1.99 °C

on average and by at most 4.24 °C among the multiprogramming and multithreading

benchmarks. Figure 4.10 compares ThermOS and the baseline distributed technique,

with and without floorplan rotation. It shows that both run-time techniques can

leverage the temperature reduction offered by floorplan rotation and achieve higher

throughput under the same temperature constraint. In addition, ThermOS consis-

tently outperforms the distributed technique by 31.45% and 29.84% on average with

and without floorplan rotation, respectively.
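The intuition behind floorplan rotation can be illustrated with a toy power map. The sketch below is not the thermal analysis used in this chapter: it uses the summed power of vertically-aligned grid cells as a crude proxy for hotspot severity, and the 2×2 power values are made up.

```python
def rotate_180(grid):
    """Rotate a 2D power map by 180 degrees."""
    return [row[::-1] for row in grid[::-1]]


def stacked_peak(lower, upper):
    """Largest summed power over vertically-aligned cells of two stacked layers."""
    return max(a + b
               for row_lower, row_upper in zip(lower, upper)
               for a, b in zip(row_lower, row_upper))


if __name__ == "__main__":
    # Hypothetical per-unit power map of one core layer; the hot unit is top-left.
    core_layer = [[9, 2],
                  [1, 1]]
    print("identical floorplans:", stacked_peak(core_layer, core_layer))
    print("top layer rotated:   ", stacked_peak(core_layer, rotate_180(core_layer)))
```

With identical floorplans the hot units stack directly on top of each other, whereas rotating one layer places each hot unit over a cool one, which is the balancing effect described above.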

4.7 Conclusions

3D integration has the potential to significantly improve performance and inte-

gration density. However, it will also increase power density, thereby increasing the

importance of using continuously-engaged thermal management techniques. It will


also increase the heterogeneity in thermal interaction among processor cores. This

requires careful consideration during thermal management policy design.

We have developed a mathematical formulation for optimizing workload assign-

ment, power–thermal budgeting, and voltage mode selection for 3D CMP thermal

management. This formulation has been used to develop a continuously-engaged

hardware–software thermal management solution for 3D CMPs. The proposed solu-

tion has been implemented within the Linux kernel and evaluated using full-system

3D CMP and OS simulation. Our strategy outperforms a state-of-the-art proactive

thermal management technique that does not make use of power–thermal budgeting.

Chapter 5

Characterization of Single-Electron Tunneling Transistors for Designing Low-Power Embedded Systems

Minimizing power consumption is vitally important in embedded system design;

power consumption determines battery lifespan. Ultra-low-power designs may even

permit embedded systems to operate without batteries by scavenging energy from the

environment. Moreover, managing power dissipation is now a key factor in integrated

circuit packaging and cooling. As a result, embedded system price, size, weight, and

reliability are all strongly dependent on power dissipation.

Recent developments in nanoscale devices open new alternatives for low-power

embedded system design. Among these, single-electron tunneling transistors (SETs)

hold the promise of achieving the lowest power consumption. Unfortunately, most

analysis of SETs has focused on single devices instead of architectures, making it

difficult to determine whether they are appropriate for low-power embedded systems.

Evaluating the use of SETs in large-scale digital systems requires novel architec-

tural and circuit design. SET-based design imposes numerous challenges resulting

from low driving strength, relatively large static power consumption, and the pres-

ence of reliability problems resulting from random background charge effects. We

propose a fault-tolerant, hybrid SET/CMOS, reconfigurable architecture, named Ice-

Flex, that can be tailored to specific requirements and allows trade-offs among power

consumption, performance requirements, operation temperature, fabrication cost, and

reliability. Using IceFlex as a testbed, we characterize the benefits and limitations

of SETs in embedded system designs. In particular, we focus on the use of SETs

in room-temperature ultra-low-power embedded systems such as wireless sensor net-

work nodes. We also consider higher-performance applications such as multimedia

consumer electronics. We see this work as a first step in determining the potential of

ultra-low-power embedded system design using SETs. My major contributions to this
chapter are the SET modeling, SET design space characterization, and characterization
of the IceFlex architecture (Sections 5.2, 5.3.1, 5.3.2.1, 5.3.2.2, 5.3.2.3, 5.3.2.5,
5.4.1.1, and 5.4.1.3). My collaborator, Zhenyu Gu, contributed the global/local
interconnect design and the characterization of embedded applications (Sections
5.3.2.4, 5.4.1.2, and 5.4.2).


5.1 Introduction

Energy consumption and thermal issues are now central concerns in electronic system
design. In high-performance applications, temperature affects integration density,
performance, reliability, power consumption, and cost. For battery-powered embedded
systems, power consumption determines system lifetime. Power consumption

crises were historically solved by moving to new technologies that decreased energy

per operation, allowing increases in density and eventually performance. Power and

thermal concerns were primary motivations for replacing vacuum tubes with semicon-

ductor devices in the 1960s and replacing bipolar junction transistors with CMOS in

the 1990s. Although CMOS is the mainstream fabrication technology used today, as

IC and system integration further increase, it will reach fabrication, power consump-

tion, and thermal limits; it may soon be time for another transition to a dramatically

different technology.

Device researchers have seen the coming challenges for CMOS devices and evalu-

ated alternative technologies such as carbon nanotube transistors [29], nanowires [46],

and single-electron tunneling transistors (SETs) [70]. The International Technology

Roadmap for Semiconductors projects that SETs have the potential to achieve the

lowest energy per switching event of any known device (1 × 10^−18 J) [53].

However, their use poses unique architectural, circuit design, and fabrication chal-

lenges. For example, SETs are susceptible to reliability problems caused by random

background offset charges. They have cyclic I–V curves (see Figure 5.2) that can

complicate design but permit highly-efficient implementation of some useful logic

functions that have proven inefficient using CMOS and threshold logic. Although the

fabrication of SETs capable of operating at low temperatures is now common, feature


sizes of only a few nanometers are required for room-temperature operation, making

fabrication challenging.

5.1.1 Past Work

Since their discovery in the 1980s [9, 33], SETs have been the subject of extensive
research on fabrication, design, and modeling [70]. SET fabrication and use in
high-sensitivity amplifiers at cryogenic temperatures has been the main research
focus [25]. SETs and simple circuits with a variety of structures were proposed and
fabricated using different methods and materials [80, 105, 6]. Recently, researchers
have fabricated SETs that operate at room temperature [75, 98, 84]. Various SET-based
circuit applications, such as logic [111, 112, 79, 19] and memory [126, 118, 122],
have been developed. These works provide a promising start for SET circuit design.
However, they did not provide an architectural evaluation. We do not claim to have
improved the performance of SET-based logic gates. Instead, we are the first to
develop the modules necessary to support architectural design and synthesis and to
evaluate the architectural performance and power consumption implications of using
SETs. Our results demonstrate orders of magnitude improvement in power consumption
and energy efficiency compared to CMOS.

Research on SET modeling and simulation has been an active area. Monte Carlo

simulation has been widely used to model SETs. SIMON [117] and MOSES [17] are

the two most popular SET simulators. However, they are too slow for analysis of large

circuits. Uchida et al. proposed an analytical SET model and incorporated it into

SPICE [113]. Recently, Inokawa et al. extended this model to a more general form to

include asymmetric SETs [49]. Mahapatra et al. proposed a simulation framework for


hybrid SET/CMOS circuit design and analysis [73]. Their model for SET behavior

is similar to that of Uchida et al. These compact modeling techniques are efficient

enough for use in SET circuit design and analysis and closely match Monte Carlo

simulation results.

Significant challenges still remain for large-scale integration of SETs and for room-

temperature operation. SETs that operate reliably at room temperature have critical

dimensions of ∼1–10 nm. They are challenging to fabricate using current top-down

lithographic techniques. However, several exciting advances make the evaluation of

architectures for high-density logic based on SETs worthwhile. Scanning-probe mi-

croscopes can be used to create devices smaller than those using conventional lithog-

raphy [75]. Continual progress has been made on bottom-up nano-fabrication tech-

niques, where chemical techniques are used to make individual molecules with useful

electronic properties. Molecular quantum dots [40] can display SET behavior. Larger

structures, such as carbon nanotubes and nanowires, can act as SETs [6]. These

bottom-up techniques can create structures supporting room-temperature SET oper-

ation. However, more research is needed in order to integrate individual devices into

large-scale circuits. Very recent advances in graphene [35] devices show promise for

SETs. Reliable methods for cooling to very low temperatures without supplies of liq-

uid helium or nitrogen are also becoming more common [114]. For high-performance

computing, the added complexity of operating at cryogenic temperatures may not be

a limiting factor. Similarly, cryogenic temperatures are readily attained using passive

methods in outer space.


5.1.2 Contributions

In this chapter, we explore the potential use of SETs in low-power embedded

systems. In order to take advantage of the power efficiency of SETs, it is critical

to bring SET-based design to the system level, characterize the impacts of SETs on

system design metrics, and evaluate the benefits and limitations of SETs. Our work

starts from design space characterization of SET-based architectures. We evaluate

the impacts of using SETs upon architectural, circuit-level, and device-level design,

considering metrics such as energy efficiency, performance, reliability, maximum op-

erating temperature, and ease of fabrication.

Based on our evaluation of the architectural and circuit-level features that can

most effectively exploit the strengths of SETs while working within their limitations,

we propose a fault-tolerant, reconfigurable, hybrid SET/CMOS based architecture

called IceFlex. IceFlex is regular and cell-based. It is reconfigurable, permitting

compensation for fabrication defects. It incorporates flexible, modular circuits to en-

able tolerance of run-time faults. In addition to compensating for the weaknesses of

SETs, IceFlex exploits their strengths, e.g., we develop a two-SET design to imple-

ment Boolean functions that are not linearly separable.

We tailor IceFlex to both high-performance and battery-powered embedded sys-

tems and characterize its energy efficiency, performance, and power consumption by

using it for a number of instruction processors and application-specific cores. Com-

pared to CMOS-based designs, IceFlex improves energy efficiency by two orders of

magnitude for both battery-powered and high-performance applications, while main-

taining good performance. However, our results also indicate great challenges to the

use of SET-based designs in portable embedded systems. Their use will either require


Figure 5.1: SET Structure and Schematic [133]. (Schematic: a conductive island coupled to the source (S) and drain (D) through tunnel junctions, and to the gate (G) and an optional second gate (G2) through capacitors. CG: gate capacitance; CG2: optional second gate capacitance; CS, CD: source and drain tunnel junction capacitances; RS, RD: source and drain tunnel junction resistances.)

advances in compact cooling technologies or the fabrication of features with sizes

approaching physical limits.

5.2 SET Modeling

In this section, we introduce the physical properties of SETs, and discuss SET

analytical device modeling.

5.2.1 SET Basics

The operation of a single-electron tunneling device is governed by the Coulomb

charging effect. As shown in Figure 5.1, a single-electron tunneling device consists of

a nanometer-scale conductive island embedded in an insulating material. Electrons

travel between the island, source (S), and drain (D) through thin insulating tunnel

junctions. When an electron tunnels into the island, the overall electrostatic potential

of the island increases by e^2/CΣ, where e is the elementary charge and CΣ is the island


Figure 5.2: SET Coulomb Oscillation (CG = 3.2 aF, CS = CD = 1.0 aF, and RS = RD = 10 MΩ) [133]. (Plot: IDS in nA on a logarithmic axis versus VGS in mV, at temperatures of 5 K, 10 K, and 20 K; regions labeled PVC and NVC.)

capacitance. For large devices, this change in potential is negligible due to the high

island capacitance CΣ. However, for nanometer-scale islands, CΣ is much smaller.

As a result, the electrostatic energy change due to the addition or removal of a single

electron can be larger than the thermal energy, particularly at low temperatures.

Changes to SET island potential result in an energy gap at the Fermi energy,

preventing further electron tunneling. This phenomenon is called Coulomb blockade.

It prevents current from flowing between source and drain (Ids = 0), i.e., the SET is

turned off. The Coulomb blockade effect can be overcome by changing the voltage of

a conductor capacitively coupled to the island, thereby turning tunneling on and off.

Although their transfer functions differ significantly from those of CMOS transistors,

with careful circuit design, SETs can be used to realize logic functions using circuits

analogous to CMOS, or using radically different design techniques [70].

As shown in Figure 5.1, a SET typically has four terminals. The source and


drain terminals (S, D) serve as electron reservoirs. When the SET is turned on,

electrons tunnel from one terminal, through the junction, to the conductive island.

They then tunnel through the other junction to the other terminal. Each tunneling

junction is modeled as a resistor (RS or RD) and a capacitor (CS or CD) in paral-

lel. A gate terminal (G), with coupling capacitance CG, controls the transport of

electrons. A SET may also contain an optional second gate terminal (G2), which is

generally used to tune SET VGS bias. The Coulomb blockade effect is maximized

when VGS = me/CG, where m = 0, ±1, ±2, … [32] because, at these voltages, the
system is in a minimal-energy state when an integer number of electrons are present
on the island. Any single tunneling event between the island and either source or
drain would move the system from this state. The Coulomb blockade effect vanishes
when m = ±1/2, ±3/2, …, i.e., when m is a half-integer value because, at these
voltages, the system is in a minimal-energy state when a half-integer number of
electrons are present on the island. In this case, a single tunneling event does not
move the system from a minimum energy state. Electrons can therefore tunnel through
the island as determined by VDS. The I–V curve of a SET is shown in Figure 5.2; drain
current changes as a function of the gate voltage, with a period of e/CG. The periodic
changes are called Coulomb oscillations.
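As a worked example of the periodicity above, the sketch below evaluates e/CG for the gate capacitance quoted in the Figure 5.2 caption and lists the gate voltages at which the blockade is lifted (half-integer m). The function names are illustrative, and the calculation ignores the influence of VDS, temperature, and any background charge offset.

```python
E_CHARGE = 1.602176634e-19  # elementary charge in coulombs


def oscillation_period_v(c_gate_f):
    """Coulomb oscillation period e/CG in volts."""
    return E_CHARGE / c_gate_f


def conduction_peaks_v(c_gate_f, count=3):
    """Gate voltages VGS = m*e/CG for m = 1/2, 3/2, ... where the blockade vanishes."""
    period = oscillation_period_v(c_gate_f)
    return [(k + 0.5) * period for k in range(count)]


if __name__ == "__main__":
    c_gate = 3.2e-18  # 3.2 aF, from the Figure 5.2 caption
    print(f"period: {1e3 * oscillation_period_v(c_gate):.2f} mV")
    print("peaks (mV):", [round(1e3 * v, 2) for v in conduction_peaks_v(c_gate)])
```

For CG = 3.2 aF the period evaluates to roughly 50 mV, so a gate swing of about half that value moves the idealized device between maximal blockade and conduction.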

In order to observe the Coulomb blockade effect, the following constraints must

be satisfied.

1) Since thermal fluctuations can suppress the Coulomb blockade effect, the
electrostatic charging energy, e^2/CΣ, must be much greater than kBT, where kB
is Boltzmann’s constant and T is the temperature. In order to ensure reliability,
e^2/CΣ ≥ 10kBT or the more conservative e^2/CΣ ≥ 40kBT constraint is enforced.


Table 5.1: Island Size Estimation [133].

                  CΣ = e^2/(10kBT)             CΣ = e^2/(40kBT)
Temperature    Island        Island         Island        Island
               capacitance   diameter       capacitance   diameter
    (K)           (aF)         (nm)            (aF)         (nm)
     40           4.65        52.48            1.16        13.12
     77           2.41        27.26            0.60         6.82
    103           1.80        20.38            0.45         5.10
    120           1.55        17.49            0.39         4.37
    200           0.93        10.50            0.23         2.62
    250           0.74         8.40            0.19         2.10
    300           0.62         7.00            0.15         1.75

Assuming a disc capacitor model (CΣ = 8εr), with one side of the island embedded in silicon dioxide and the other side exposed to nitrogen.

These equations imply that the maximum allowed island capacitance is inversely pro-

portional to temperature. At room temperature, an island capacitance below 1 aF

is required. Island capacitance is a function of island size. As shown in Table 5.1,

room-temperature operation requires an island size in the nanometer range, making

fabrication challenging. At present, the smallest island capacitance of a fabricated

device is around 0.15 aF [98].

2) To observe single-electron charging effects, electrons must be confined to the
island, which requires that the junction resistance be higher than the quantum
resistance, i.e., RS, RD > h/e^2 ≈ 25.8 kΩ, where h is Planck’s constant. Therefore,
SETs have high resistances and low driving currents.

In order to operate voltage-state logic, SETs must exhibit voltage gain. The low-

temperature voltage gain is equal to the gate capacitance divided by the sum of the

junction capacitances: G = CG/(CS+CD). Achieving this gain requires low tunneling

junction capacitances. It also requires close coupling of gate and island without a large

increase in the total island capacitance. High gain has only been demonstrated for a


few devices and has required operation at low temperatures [82, 41]. However, further

advances in nanofabrication may overcome this limitation.

5.2.2 Random Background Charge Effects

Constant background charge effects have been a persistent problem for SETs.

Charges near the SET island influence its equilibrium state [119]. Although the

resulting voltage offsets can be compensated for with a biased second gate terminal,

the required bias is unknown until fabrication. Worse yet, some devices are affected

by random background charge effects, which result in run-time voltage fluctuations.

It is the tentative consensus of the research community that random background

charge effects are caused by multiple, closely-spaced charge traps near the island,

among which charge carriers tunnel. This produces run-time variation in gate bias,

and may cause logic errors. Much work has been done to understand the nature and

density of these defects [34, 62, 136]. Most SETs have been fabricated with aluminum

islands. Some researchers have attempted to eliminate random background charge

effects by fabricating SETs with alternative island materials such as silicon. Silicon

island based devices have high immunity to random background charge noise, with

operation unchanged over several weeks [137]. However, random background charge

effects remain the main source of run-time reliability problems for most SET designs.

In this chapter, we describe a reconfigurable architecture that provides architectural

resistance to the effects of random background charges.


5.2.3 SET Modeling

Circuit design involves extensive simulation. Despite their accuracy, Monte Carlo

methods are too slow for large-scale circuit analysis. We build upon the SET an-

alytical model developed by Inokawa et al. [49], which has been incorporated into

SPICE. Combined with MOS transistor models, it provides an efficient and accurate

simulation solution for hybrid SET/CMOS circuits. Inokawa’s model ignores random

background charge effects and multi-gate effects. We incorporate these effects into

the model.

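The I–V characteristics of a SET with island charge equal to n or n + 1 electrons follow [49]:

$$I_{DS} = \frac{e}{4 R_T C_\Sigma} \times \frac{(1 - r^2)\,(\tilde{V}_{GS}^2 - \tilde{V}_{DS}^2)\,\sinh(\tilde{V}_{DS}/\tilde{T})}{(\tilde{V}_{GS} + r\tilde{V}_{DS})\sinh(\tilde{V}_{GS}/\tilde{T}) - (\tilde{V}_{DS} + r\tilde{V}_{GS})\sinh(\tilde{V}_{DS}/\tilde{T})} \tag{5.1}$$

where

$$\tilde{V}_{GS} = \frac{2\sum_i C_{Gi} V_{GSi}}{e} - \frac{\left(\sum_i C_{Gi} + C_S - C_D\right) V_{DS}}{e} - 1 - 2n + \zeta \tag{5.2}$$

$$\tilde{V}_{DS} = \frac{C_\Sigma V_{DS}}{e}, \qquad \tilde{T} = \frac{2 k_B T C_\Sigma}{e^2} \tag{5.3}$$

$$r = \frac{R_D - R_S}{R_D + R_S}, \qquad R_T = \frac{2}{\dfrac{1}{R_S} + \dfrac{1}{R_D}} \tag{5.4}$$

$$C_\Sigma = C_S + C_D + \sum_i C_{Gi} \tag{5.5}$$

In this model, $2\sum_i C_{Gi} V_{GSi}/e$ models the Coulomb charging effects of the multiple gate terminals. ζ is a real number that characterizes the random background charge effect.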
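Equations 5.1 to 5.5 can be evaluated directly. The sketch below is an illustrative transcription into Python, not the SPICE implementation referred to in the text. The function name, argument order, and the example device values (taken from the 300 K row of Table 5.2) are choices made here. The expression is a 0/0 form where the normalized gate and drain voltages have equal magnitude; the sketch does not treat that case.

```python
import math

E = 1.602176634e-19   # elementary charge (C)
KB = 1.380649e-23     # Boltzmann constant (J/K)


def set_current(v_gs, v_ds, c_g, c_s, c_d, r_s, r_d, temp, n=0, zeta=0.0):
    """Drain current (A) of a multi-gate SET following Eqs. (5.1)-(5.5).

    v_gs and c_g are equal-length sequences (one entry per gate terminal).
    zeta is the random background charge term of Eq. (5.2).
    """
    c_sigma = c_s + c_d + sum(c_g)                           # Eq. (5.5)
    r = (r_d - r_s) / (r_d + r_s)                            # Eq. (5.4)
    r_t = 2.0 / (1.0 / r_s + 1.0 / r_d)
    vds_n = c_sigma * v_ds / E                               # Eq. (5.3)
    t_n = 2.0 * KB * temp * c_sigma / E**2
    gate_charge = sum(c * v for c, v in zip(c_g, v_gs))
    vgs_n = (2.0 * gate_charge / E
             - (sum(c_g) + c_s - c_d) * v_ds / E
             - 1.0 - 2.0 * n + zeta)                         # Eq. (5.2)
    num = (1.0 - r**2) * (vgs_n**2 - vds_n**2) * math.sinh(vds_n / t_n)
    den = ((vgs_n + r * vds_n) * math.sinh(vgs_n / t_n)
           - (vds_n + r * vgs_n) * math.sinh(vds_n / t_n))
    return E / (4.0 * r_t * c_sigma) * num / den             # Eq. (5.1)


# Example device: C_G = 0.37 aF, C_S = C_D = 0.12 aF, R_S = R_D = 100 kOhm, 300 K.
AF = 1e-18
DEVICE = dict(c_g=[0.37 * AF], c_s=0.12 * AF, c_d=0.12 * AF,
              r_s=1e5, r_d=1e5, temp=300.0)

i_low = set_current([0.05], 0.01, **DEVICE)
i_peak = set_current([E / (2 * 0.37 * AF)], 0.01, **DEVICE)
print(f"I_DS away from the peak: {i_low:.3e} A, near the peak: {i_peak:.3e} A")
```

Shifting the gate voltage by e/CG while incrementing n by one leaves the normalized gate voltage unchanged, which is how the periodic Coulomb oscillations appear in this model. A nonzero ζ shifts the normalized gate voltage in the same way a gate bias would.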


This compact model is derived based on the steady-state master equation, which is

not directly applicable to transient circuit analysis. However, when used in circuits,

SETs are connected by metal wires. Based on existing fabrication processes, the

capacitance of local interconnect is at least two orders of magnitude higher than

SET island capacitance, thereby eliminating inter-SET Coulomb interaction. The

independence of SETs enables the use of quasi-steady-state analysis [49, 128].

5.3 IceFlex: A Fault-Tolerant Hybrid SET/CMOS Reconfigurable Architecture

This section describes the design and analysis of IceFlex, the proposed low-power,

fault-tolerant, reconfigurable, hybrid SET/CMOS architecture. The vast majority

of devices in IceFlex are SETs, allowing extremely low power consumption. CMOS

devices are sparingly used to improve the driving strength of global interconnect.

Our evaluation of the architectural constraints imposed by SETs led to four main

conclusions.

1. Flawless fabrication will be challenging, especially for circuits that operate

at room temperature. It is important to simplify fabrication and use post-

fabrication adaptation to improve reliability.

2. An unpredictable subset of devices will be susceptible to random background

offset charge effect noise: SET-based architectures should have the ability to

tolerate run-time errors.

3. SETs have poor driving strength; this must be remedied, especially when driving


global interconnect.

4. SETs have the ability to efficiently implement some functions that are ineffi-

cient using BJTs, CMOS logic, or threshold logic, e.g., non-linearly-separable

functions. SET-based architectures should exploit such special properties.

5.3.1 SET Design Space Characterization

In order to characterize the benefits and limitations of SET circuits and archi-

tectures, we analyze the tradeoffs among the following metrics: temperature, perfor-

mance, power consumption, reliability, and fabrication constraints. This study yields

two design configurations, each of which is shown in Table 5.2. One targets high-

performance embedded applications such as multimedia consumer electronics and

one targets ultra-low-power embedded applications such as wireless sensor networks.

5.3.1.1 Temperature

IceFlex was evaluated at seven temperature settings (see Table 5.2). IceFlex is a

hybrid SET/CMOS design; the temperature range starts at 40 K to permit reliable

operation of the CMOS components. 77 K is achieved by liquid nitrogen cooling.

103 K is the average cloud top temperature. 120 K and below are defined as cryogenic.

At 200 K, functional SET devices have been widely demonstrated in the literature.

250 K is a temperature that might be reached using a stacked Peltier heat pump.

300 K is room temperature.


5.3.1.2 Capacitance

To observe well-defined Coulomb blockade effects, electron charging energy must
be higher than the thermal energy, i.e., e²/CΣ ≥ 10kBT or e²/CΣ ≥ 40kBT, where kB is
Boltzmann’s constant and T is temperature. At room temperature, this constraint
requires an island capacitance below 1 aF, making fabrication challenging but
possible [98]. In order to operate voltage-state logic, SETs must exhibit voltage gain,
which is equal to the gate capacitance divided by the sum of the junction
capacitances: G = CG/(CS + CD). Our results indicate that a gain of 1.5 is sufficient
for use in digital logic. Targeting battery-powered systems, using CΣ ≤ e²/(10kBT),
CΣ ≤ e²/(40kBT) and G = 1.5, the maximum allowed gate and junction capacitances
are derived and shown in the “Low power, Capacitance” columns of Table 5.2.
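The derivation of the low-power capacitances can be sketched as follows. This is an illustration under two assumptions that are made here rather than stated in the text: the junctions are symmetric (CS = CD, as the shared column of Table 5.2 suggests), and only one gate contributes to CΣ. With those assumptions, CΣ = CG + CS + CD = (1 + G)(CS + CD).

```python
E = 1.602176634e-19   # elementary charge (C)
KB = 1.380649e-23     # Boltzmann constant (J/K)


def set_capacitances(temp_k, margin=10.0, gain=1.5):
    """Return (C_G, C_S, C_D) in farads.

    Uses C_sigma = e^2 / (margin * kB * T) and G = C_G / (C_S + C_D).
    ASSUMPTION: C_S == C_D and a single gate contributes to C_sigma.
    """
    c_sigma = E**2 / (margin * KB * temp_k)
    c_junctions = c_sigma / (1.0 + gain)     # C_S + C_D
    c_g = gain * c_junctions                 # from G = C_G / (C_S + C_D)
    return c_g, c_junctions / 2.0, c_junctions / 2.0


for temp in (40, 77, 300):
    c_g, c_s, _ = set_capacitances(temp)
    print(f"{temp:3d} K: C_G = {c_g * 1e18:.2f} aF, C_S = C_D = {c_s * 1e18:.2f} aF")
```

Passing `margin=40.0` gives the corresponding values for the more conservative 40kBT constraint.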

The maximal allowed capacitance decreases with increasing temperature. However,
fabricating SETs with low gate capacitance is challenging. We assume the
capacitances at 300 K are the minimum allowed. Given e²/CΣ ≥ 10kBT, for high-performance
applications, these minimal gate and junction capacitances are used at all the
temperature settings and shown in the corresponding “High Performance, Capacitance”
columns of Table 5.2. The e²/CΣ ≥ 40kBT constraint requires very low SET capacitance
at room temperature, CG = 0.09 aF, which makes fabrication very challenging. Due
to these fabrication concerns, for high-performance design under this constraint, the
capacitance and voltage are determined at the appropriate operation temperature,
instead of room temperature.

5.3.1.3 Voltage

Consider a SET biased via a second gate, such that a VGS of zero places it in the

middle of the positive voltage coefficient (PVC) region in Figure 5.2. In this case, the


maximum range of current values can be traversed by letting VGS (i.e., Vin) vary in

the range [−e/(4CG), e/(4CG)]. At all but the lowest temperatures, this range also

provides near-optimal sensitivity to VGS; we use this range. Once the range of VGS

is known, a VSS of −e/(4CG) and a VDD of e/(4CG) naturally follow, shown in the

“Voltage” columns of Table 5.2. Note that a bias voltage applied via a second gate

can be used to shift the zero VGS point from the PVC to negative voltage coefficient

(NVC) region in Figure 5.2, permitting NMOS-like or PMOS-like behavior.
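As a small worked example, the supply rails follow from the gate capacitance alone. The sketch below assumes nothing beyond VDD = e/(4CG) and VSS = −e/(4CG). The gate capacitances are the rounded values printed in Table 5.2, so the resulting voltages differ slightly from the tabulated ones, which appear to have been computed from unrounded capacitances.

```python
E = 1.602176634e-19  # elementary charge (C)


def supply_rails(c_g):
    """Return (V_SS, V_DD) in volts for gate capacitance c_g in farads."""
    v = E / (4.0 * c_g)
    return -v, v


# Rounded gate capacitances (aF) from the low-power rows at 40 K and 300 K.
for c_g_af in (2.78, 0.37):
    v_ss, v_dd = supply_rails(c_g_af * 1e-18)
    print(f"C_G = {c_g_af} aF: V_SS = {v_ss * 1e3:.2f} mV, V_DD = {v_dd * 1e3:.2f} mV")
```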

5.3.1.4 Junction Resistance

To observe single-electron charging effects, electrons must be confined to the
island. This requires junction resistances that are much higher than the quantum
resistance, i.e., RS, RD ≫ h/e², h/e² = 25.8 kΩ, where h is Planck’s constant. Therefore,
SETs have high resistances and low driving currents. In this chapter, we pick two
resistance settings: 100 kΩ for high-performance applications and 10 MΩ for
battery-powered systems, shown in the “Resist.” columns of Table 5.2.

5.3.1.5 Reliability Implications

Researchers have pointed out the dangers posed by thermal noise as charging

(state change) energy approaches thermal energy. We explicitly consider the effects of

temperature on steady-state current during circuit analysis and its effects are reflected

in our design decisions. We implicitly consider, and guard against, the effects of

temperature-dependent shot noise by requiring charging energy to be a large multiple

of the thermal energy. Designs with charging energies of both 10 and 40 times the

thermal energy are evaluated in this chapter (10kBT or 40kBT). Researchers have
reported device operation at each level, but the 40kBT requirement is more reliable.
At charging energies over 10kBT, the model we use is accurate to within 4% of the
time-dependent master equation [59, 113].

Table 5.2: Design Space Characterization [133].

Capacitances CG and CS, CD are in aF. Vdd and Vin, both equal to e/(4CG), are in mV.
RS and RD are in MΩ for the low-power setting and in kΩ for the high-performance setting.

CΣ = e²/(10kBT)

         Low power                              High performance
Temp.    CG      CS, CD   Vdd, Vin   RS, RD     CG      CS, CD   Vdd, Vin   RS, RD
(K)      (aF)    (aF)     (mV)       (MΩ)       (aF)    (aF)     (mV)       (kΩ)
 40      2.78    0.93      14.36     10         0.37    0.12     107.70     100
 77      1.45    0.48      27.65     10         0.37    0.12     107.70     100
103      1.08    0.36      36.99     10         0.37    0.12     107.70     100
120      0.93    0.31      43.09     10         0.37    0.12     107.70     100
200      0.56    0.19      71.82     10         0.37    0.12     107.70     100
250      0.45    0.15      89.77     10         0.37    0.12     107.70     100
300      0.37    0.12     107.70     10         0.37    0.12     107.70     100

CΣ = e²/(40kBT)

         Low power                              High performance
Temp.    CG      CS, CD   Vdd, Vin   RS, RD     CG      CS, CD   Vdd, Vin   RS, RD
(K)      (aF)    (aF)     (mV)       (MΩ)       (aF)    (aF)     (mV)       (kΩ)
 40      0.70    0.23      57.46     10         0.70    0.23      57.46     100
 77      0.36    0.12     110.60     10         0.36    0.12     110.60     100
103      0.27    0.09     147.95     10         0.27    0.09     147.95     100
120      0.23    0.08     172.37     10         0.23    0.08     172.37     100
200      0.14    0.05     287.28     10         0.14    0.05     287.28     100
250      0.11    0.04     359.10     10         0.11    0.04     359.10     100
300      0.09    0.03     430.91     10         0.09    0.03     430.91     100

Figure 5.3: IceFlex Microarchitecture [133]. Block labels: SET configuration memory, SET local interconnect, hybrid SET/CMOS global interconnect, majority voting logic, SET multi-gate lookup table, SET input switch fabric, SET registers.

Random background charge effects [62, 136] are the main barrier to SET reliability.

They are observed as 1/f noise on SET gate voltages, with some SETs susceptible

and others immune. Several recent devices have shown improved immunity to this

noise, as described in Section 5.2.2. Currently, the distribution of random background

offset charges can only be determined after fabrication [70]. Susceptible SETs may

suffer transient errors infrequently, e.g., only once per day. In this chapter, we use

architectural techniques to reduce the probability of failure using an entirely SET-

based design. SETs are used in parallel to exploit the lack of SET-to-SET correlation

in random background offset charge effects.

5.3.2 IceFlex Design

In this section, we present the architecture and circuit design of IceFlex. The

microarchitecture of IceFlex is shown in Figure 5.3. IceFlex is a cell-based design.

Each cell is a SET logic block (SELB) composed of the following components: (1)

multi-gate SET-based reconfigurable look-up tables that can realize arbitrary n-input


Boolean functions; (2) a SET-based arithmetic unit that allows efficient implementa-

tions of non-linearly separable arithmetic operations; (3) a SET-based reconfiguration

memory array that caches multiple configuration contexts to support efficient run-

time reconfiguration; (4) a multi-gate SET-based input switch fabric; and (5) SET

registers. In addition, IceFlex includes SET threshold logic-based majority voting

logic units, allowing a flexible solution to run-time reliability problems. In IceFlex, a

multi-level on-chip interconnect fabric forms inter-SELB connections. Local connec-

tions rely on a custom-designed, SET-driven, variable-length, constant-latency inter-

connect. Using a constant-latency interconnect structure reduces power consumption

and simplifies physical-level design automation, e.g., placement and routing. SETs

have limited driving strength. Therefore, IceFlex uses hybrid SET/CMOS circuits to

drive global interconnects.

We now explain each IceFlex component and discuss both circuit and architecture

design tradeoffs.

5.3.2.1 Multi-Gate SET Reconfigurable Lookup Table Component

Each SELB is equipped with l sets of n-input reconfigurable look-up tables. Each

look-up table can realize an arbitrary n-input Boolean function. The basic structure

of the look-up table consists of an m-to-1 multi-gate SET multiplexer tree (m = 2^n),

and an m-bit SET storage cell, which will be described in the next section.

The proposed multi-gate SET multiplexer tree differs from existing CMOS-based
designs in the following way. A CMOS m-to-1 multiplexer tree requires ⌈log₂ m⌉
stages of transmission gates, plus buffers to meet the required driving strength. SETs
may have multiple gate terminals. As described in Equation 5.2, the gate charging
effect is a function of ΣCGiVGSi. Therefore, multiple control signals, e.g., the select
signals for a multiplexer, can be supplied to a single SET, enabling a more compact
circuit structure with better performance and power efficiency.

Figure 5.4: Multi-gate SET Multiplexer Tree [133]. Panels: the m-to-1 multi-gate SET multiplexer tree with its configuration bits; the mc-to-1 multi-gate SET multiplexer; a 4-to-1 multi-gate SET multiplexer example with select signals a and b and paths P0 to P3; and the IDS–VG curve marking the operating points of paths P0 to P3 for a = 1, b = 1.

Figure 5.4 shows the proposed SET multi-gate multiplexer tree design. The basic
building block is a q-to-1 multi-gate single-stage multiplexer, in which each of the q
paths consists of a single multi-gate SET controlled by ⌈log₂ q⌉ select signals. Using
this design, the logic depth of an m-to-1 multiplexer tree reduces to ⌈log_q m⌉ instead
of ⌈log₂ m⌉. Figure 5.4 also shows a design case for q = 4. The output SET buffer is
used to break long resistive paths and improve the driving strength.
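The depth reduction can be checked with a few lines of integer arithmetic. The sketch below is illustrative only; the helper names are chosen here, and the ceiling logarithm is computed by repeated multiplication to avoid floating-point rounding.

```python
def ceil_log(base, value):
    """Smallest integer d with base**d >= value (exact integer arithmetic)."""
    depth, reach = 0, 1
    while reach < value:
        reach *= base
        depth += 1
    return depth


def mux_tree_depth(n_inputs, q):
    """Logic depth of the m-to-1 tree (m = 2**n_inputs) built from q-to-1 stages."""
    return ceil_log(q, 2 ** n_inputs)


def selects_per_set(q):
    """Number of select signals feeding each multi-gate SET in a q-to-1 stage."""
    return ceil_log(2, q)


for n in (3, 4, 6):
    print(f"n = {n}: depth {mux_tree_depth(n, 2)} with q = 2, "
          f"depth {mux_tree_depth(n, 4)} with q = 4")
```

For a four-input look-up table (m = 16), dual-gate stages (q = 4) halve the depth relative to single-gate stages (q = 2).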

As described in Section 5.2, thermal energy has significant impact on electron

tunneling and the ratio of on to off currents, i.e., the ratio of the off to on resistance.

This ratio decreases as the ratio of Coulomb charging energy (e2/C) to thermal energy

(kBT ) decreases. On the other hand, as the number of gate control signals per SET

(hence the number of off paths connected in parallel) increases, the impact of the off

paths on the circuit output increases. Consider, for the sake of example, the dual-gate

4-to-1 multiplexer design shown in Figure 5.4. The four logic inputs are 0001 and

both select signals are logic one, i.e., Va = Vb = V . Assume Ca = Cb = C. As shown

in the I–V curve on the right side of Figure 5.4, for the SET on path P3, the overall

gate charge equals 2CV . Therefore, the SET becomes fully conductive. For paths P1


and P2, the gate charges both equal CV −CV = 0, hence both switches are partially

conductive. For path P0, even though the overall gate charge equals −2CV , at high

temperature its resistance may still be within the same order of magnitude as that

of path P3. Since the inputs of paths P0, P1 and P2 are all connected to logic zero

(the worst-case scenario), these three parallel paths may reduce the output voltage,

producing incorrect results.
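The gate charges quoted in this example can be reproduced with a short sketch. It assumes, as the −2CV figure for path P0 suggests, that a logic one drives +V and a logic zero drives −V, and that the SET on path k receives each select signal in true or complemented form according to the bits of k. Both the signal encoding and the function name are assumptions made for illustration.

```python
def path_gate_charges(select_bits, c, v):
    """Net gate charge on the SET of each path in a dual-gate 4-to-1 multiplexer.

    ASSUMPTION: logic one drives +v, logic zero drives -v, and path k sees
    each select signal true or complemented according to the bits of k.
    """
    a, b = select_bits
    charges = {}
    for path in range(4):
        want_a, want_b = (path >> 1) & 1, path & 1
        va = v if a == want_a else -v
        vb = v if b == want_b else -v
        charges[f"P{path}"] = c * va + c * vb
    return charges


# Both select signals high, unit capacitance and voltage for readability.
print(path_gate_charges((1, 1), c=1.0, v=1.0))
```

Only the selected path reaches the full 2CV gate charge. The two half-selected paths sit at zero gate charge, which is why they remain partially conductive at high temperature.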

In the high-performance setting, the same capacitance settings are used across the

whole temperature range. Therefore, the ratio of Coulomb charging energy to thermal

energy increases as the temperature decreases. Therefore, lower temperatures permit

fewer multiplexer levels in the multiplexer tree, with more inputs to each individual

multiplexer.

Detailed circuit analysis shows that, using the high-performance setting and
e²/CΣ ≥ 10kBT, the dual-gate design may be used at temperatures up to 200 K. At 250 K and
300 K, only the single-gate design is feasible. For the low-power setting, capacitance
scaling maintains the same e²/(CΣkBT) ratio. Therefore, the same design should be
used for the whole temperature range. In addition, since both the low-power setting
and the high-performance setting at room temperature use the same e²/(CΣkBT)
ratio, only the single-gate design is feasible for low-power, room-temperature operation.
For the e²/CΣ ≥ 40kBT configurations of IceFlex, the dual-gate design may be used
at all temperatures due to the increased charging energy.

5.3.2.2 SET Configuration Memory

In IceFlex, run-time reconfiguration is enabled by SET configuration memory,
which consists of a SET configuration cache and a current configuration memory. In
each SELB, the configuration cache stores multiple configurations. During run-time
reconfiguration, one set of configuration bits stored in the configuration cache is
placed into the current configuration memory to program SELB logic and
interconnect. If k copies of configuration sets are stored in the configuration cache, then the
circuit can be reconfigured k times during run-time execution without the need to
access off-chip memory.

Figure 5.5: SET Configuration Memory [135]. Panels: the SET configuration memory, holding configuration sets set0 to set k−1 built from SET memory cells (control gate voltage VCG, gate voltage VG, output Vout, and IDS–VG curves for “Store 1” and “Store 0”); and the dual-island SET buffer (In, Out, Vdd, Vss, bias voltages VG2 and −VG2).

The left portion of Figure 5.5 shows the circuit structure of the configuration

memory in IceFlex. The SET configuration cache is the main on-chip configuration

memory. Each storage cell consists of a dual-island SET [70]. A dual-island SET

contains two capacitively-coupled SETs: a primary SET and a secondary SET. By

controlling VCG, electrons can tunnel through the control gate and charge the island

of the secondary SET. The charge state of the secondary SET shifts the phase of the

Coulomb oscillations of the primary SET, i.e., its conductivity condition shifts as a

function of gate control voltage, VGS. Therefore, under a certain VGS, the primary

SET is either conductive or open due to different island charges, representing either


a logic one or logic zero.

In the configuration cache, selecting a configuration forms a short-circuit path

between the pull-up resistor and SETs with a stored zero within the selected configu-

ration set. The power consumption will be high if the configuration cache constantly

controls the logic and interconnect. To minimize power consumption, separate on-chip

memories are used to store the currently-used configuration.

We designed a dual-island based SET buffer to hold the current configuration. As

shown to the right of Figure 5.5, this buffer uses two biasing voltages, VG2 and −VG2 ,

and behaves like a complementary SET inverter. During run-time reconfiguration, for

each dual-island SET, the corresponding configuration bit stored in the configuration

memory updates the island charge of its secondary SET and conductivity of the

primary SET, thereby controlling the buffer output.

5.3.2.3 Efficient SET Implementations of Non-Unate Functions and Implications for Arithmetic

SETs have the ability to support efficient implementation of some critical logic

functions that have long frustrated designers using threshold logic, BJT, and CMOS

technologies. Most conventional transistors have either non-decreasing or non-increasing

I–V curves. As a result, numerous devices are required to implement Boolean functions

that are not unate and therefore not linearly separable. However, such functions are widely

used, especially in digital arithmetic. The periodic nature of SET I–V curves can

be exploited for efficient implementation of highly-useful non-unate functions such as

exclusive-OR.


Figure 5.6: SET Parity Circuit [133]. Labels: inputs IN1 to INk−1 on both the pull-up and pull-down SETs, bias voltages VG2 and −VG2, supplies VDD and VSS.

The most efficient CMOS static pass-transistor logic design of a two-input exclusive-

OR gate in general use requires six transistors [91]. Moreover, it relies on strong input

signals because it is not capable of signal restoration. A restoring version would re-

quire at least eight transistors. In contrast, it is possible to implement a two-transistor

SET-based exclusive-OR gate that is structurally equivalent to a CMOS inverter. In

this design, each SET has two gates, each of which is connected to one of the exclusive-

OR inputs. The circuit structure for a SET-based n-input parity gate is shown in

Figure 5.6. This design is capable of signal restoration. Thanks to the periodic SET

I–V curve, it is possible to directly determine whether the number of high inputs is

odd or even. By appropriately adjusting the gate capacitances, the device can be

adjusted such that switching a single gate will result in a 180° phase shift in the

I–V curve (see Figure 5.2). Note that even or odd parity functions with additional

inputs may be implemented using only two SETs. The number of inputs is bounded

primarily by geometrical constraints on fabrication of additional gates.
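The way a periodic I–V curve yields parity can be illustrated with a toy model. The sketch below is not a device simulation: it assumes an ideal cosine-shaped Coulomb oscillation in which each high input shifts the phase by exactly 180°, and it ignores temperature, bias, and the pull-down device. The function names are chosen here for illustration.

```python
import math
from functools import reduce
from itertools import product
from operator import xor


def pull_up_conductance(inputs):
    """Toy normalized conductance, periodic in the number of high inputs.

    ASSUMPTION: ideal cosine-shaped Coulomb oscillation; each high input
    shifts the phase by 180 degrees (pi radians).
    """
    phase = math.pi * sum(inputs)
    return 0.5 * (1.0 - math.cos(phase))


def set_parity(inputs):
    """Output is high when the pull-up SET conducts (odd number of high inputs)."""
    return int(pull_up_conductance(inputs) > 0.5)


# The toy model agrees with a chain of exclusive-OR operations.
for bits in product((0, 1), repeat=3):
    assert set_parity(bits) == reduce(xor, bits)
```

A monotonic device cannot reproduce this behaviour with a single transistor, because its conductance cannot return to the low state once additional inputs go high.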

In SET-based architectures, we propose the use of fast carry chains based on the

proposed exclusive-OR (sum) computation logic. We have found that this design is


approximately 75% more energy-efficient and 25% faster than a design based on a

conventional CMOS-style exclusive-OR sum implementation, when both are imple-

mented using SETs. This design style is impossible for threshold logic, BJTs, and

CMOS technologies. Note that carry-out logic is equivalent to 2-out-of-3 majority

vote logic.
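The equivalence between adder outputs and the two SET-friendly functions can be verified exhaustively. The following sketch is a plain logic-level illustration with names chosen here; it says nothing about circuit timing or energy.

```python
from itertools import product


def full_adder(a, b, carry_in):
    """One-bit full adder: sum is 3-input parity, carry-out is 2-of-3 majority."""
    total_high = a + b + carry_in
    sum_bit = total_high % 2           # parity of the three inputs
    carry_out = int(total_high >= 2)   # majority of the three inputs
    return sum_bit, carry_out


def ripple_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists using a carry chain."""
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]


# Exhaustive check that parity and majority together form a correct adder.
for a, b, c in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, c)
    assert s + 2 * cout == a + b + c
```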

5.3.2.4 Reconfigurable Interconnect Network

IceFlex consists of a variety of reconfigurable interconnect resources, including

SET local interconnects, hybrid SET/CMOS global interconnects, and SET switch

fabric.

Interconnect accounts for a substantial proportion of total power consumption in

IceFlex, so its power efficiency is important. For SET-based interconnect, the static power

consumption dominates due to the impact of thermal energy on device conductance,

especially at high temperatures. In addition, static power consumption increases with

wireload because maintaining unchanged communication latency with higher wireload

requires lower junction resistance. In contrast, the dynamic power consumption of

SETs is low due to the low SET gate capacitance and low voltage swing. For hybrid

SET/CMOS-based interconnect, SETs are only used to drive CMOS buffers, which in

turn drive wires. In this case, SETs with low driving strength, hence high junction re-

sistance, are allowed. Compared to SETs, CMOS has lower static power consumption

but higher capacitance and dynamic power consumption. Therefore, dynamic power

dominates in the hybrid SET/CMOS-based design. Circuit analysis shows that, given

the same performance constraint, SET-based design is more energy-efficient for local

interconnect and the hybrid SET/CMOS design is more energy-efficient for global


interconnect.

In IceFlex, local interconnects driven directly by SET buffers support communica-

tion between nearby SELBs. Three types of local interconnects are supported: single

length, double length, and hex length. The proposed SET local interconnect design

guarantees a constant latency across different routing lengths. Consider, for the sake

of example, a local communication architecture in which the maximum interconnect

delay is constrained and the longest interconnect is appropriately buffered to meet

this constraint. In this case, it would be possible to similarly drive shorter intercon-

nects, thereby decreasing their delays, relative to that of the longest interconnect.

It would also be possible to reduce the driving strength on shorter interconnects to

reduce power consumption and produce a local interconnect architecture in which

all interconnects have uniform delay. We propose the second design because it im-

proves interconnect power efficiency and also simplifies placement and routing during

physical design.

The proposed SET local interconnect is designed as follows. A SET buffer with

minimal driving strength (hence high junction resistance) is first determined. Next,

for local interconnects with different routing lengths, minimal driving strength SET

buffers are connected in parallel to meet driving strength requirements imposed by

performance constraints. The main motivation for using parallel SET buffers is that
SET junction resistance cannot be reduced arbitrarily (RD, RS ≫ h/e²). Using
homogeneous SET buffers in parallel instead of heterogeneous SET buffers may also
simplify fabrication.
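The sizing procedure above can be sketched with a lumped RC estimate. Everything numeric in this example is hypothetical, and the delay model is an assumption made here: delay is taken as proportional to the effective driver resistance times the wire capacitance, wire capacitance is taken as proportional to length, and N identical buffers in parallel are taken to give an effective resistance of R/N.

```python
import math


def parallel_buffers_needed(r_buffer, c_wire_per_unit, length_units, delay_target):
    """Number of minimal-strength SET buffers to place in parallel for one wire.

    ASSUMPTIONS: lumped RC delay (delay ~ R_eff * C_wire), wire capacitance
    proportional to length, and R_eff = r_buffer / N for N identical buffers.
    """
    c_wire = c_wire_per_unit * length_units
    return max(1, math.ceil(r_buffer * c_wire / delay_target))


# Hypothetical numbers, chosen only to illustrate the scaling.
R_BUFFER = 1.0e6    # ohms, one minimal-strength buffer
C_UNIT = 1.0e-16    # farads per single-length segment
DELAY = 1.3e-10     # seconds, latency target shared by all local wires

for name, units in (("single", 1), ("double", 2), ("hex", 6)):
    n = parallel_buffers_needed(R_BUFFER, C_UNIT, units, DELAY)
    print(f"{name:6s}: {n} buffer(s) in parallel")
```

Under these assumptions the buffer count grows roughly in proportion to wire length, so shorter wires use fewer buffers and less static power while meeting the same latency target.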

Table 5.3: Impact of Majority Vote Logic on SELB Fault Probability [133].

SET fault probability    1/1,000                       1/10,000                      1/100,000
Majority vote inputs     3        5        7           3        5        7           3        5        7
Raw fail prob.           6.20E-2  6.20E-2  6.20E-2     6.38E-3  6.38E-3  6.38E-3     6.40E-4  6.40E-4  6.40E-4
Best prob.               1.11E-2  2.17E-3  4.45E-4     1.22E-4  2.57E-6  5.71E-8     1.23E-6  2.62E-9  5.86E-12
SET MVL prob.            1.11E-2  2.18E-3  4.57E-4     1.22E-4  2.69E-6  1.77E-7     1.23E-6  3.82E-9  1.21E-9

Figure 5.7: Hybrid SET/CMOS Interface Circuitry [133]. Labels: HLB output, HLB input, SET inverters SINV1 and SINV2 (each biased by VG2 and −VG2), CMOS inverters CINV1 and CINV2, inter-HLB metal wire.

Remote connections introduce the high capacitive loads of long metal wires. To
address the driving strength problem of SET-only circuits, we have designed hybrid
SET/CMOS interface circuitry to drive global interconnect. Figure 5.7 shows the

circuit structure, which contains two complementary SET inverters and two CMOS

inverters. A SELB output is first fed to the input of SET inverter SINV1. SINV1

drives the CMOS inverter, CINV1. Unlike the SET logic used inside SELBs, SINV1

uses a low-resistance design to improve driving strength. Fortunately, it is possible

to achieve sufficient driving strength with a single SET. Since the voltage range of

SET logic is much smaller than that of CMOS logic, the output signal of SINV1 is

within the switching range of the CMOS inverter. Since both MOS transistors are

conductive within the switching region, short-circuit power is high. To solve the short-

circuit power consumption problem, CINV1 is designed to satisfy the following two

constraints. First, Vtn + |Vtp| > Vdd − Vss ensures that at least one MOS transistor
is off at all times, reducing static power consumption. Second, the output signal
range of SINV1 must be greater than Vtn + |Vtp| − (Vdd − Vss). As a result, the NMOS
(PMOS) transistor of CINV1 is conductive when SINV1 has a high (low) output
signal. CINV1 thus serves as a signal converter, and CINV2 provides driving
strength.
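The two CINV1 constraints can be written as a simple check. The sketch below is illustrative; the function name and the example threshold, supply, and swing values are hypothetical and are not taken from the design.

```python
def cinv1_constraints(vtn, vtp, vdd, vss, sinv1_swing):
    """Check the two CINV1 design constraints; all arguments are in volts.

    vtp is the (negative) PMOS threshold voltage and sinv1_swing is the
    peak-to-peak output range of SINV1.
    """
    supply = vdd - vss
    threshold_sum = vtn + abs(vtp)
    no_short_circuit = threshold_sum > supply          # never both devices on
    can_switch = sinv1_swing > threshold_sum - supply  # swing spans the dead zone
    return no_short_circuit, can_switch


# Hypothetical values: 0.6 V thresholds, +/-0.5 V rails, 0.3 V SINV1 swing.
print(cinv1_constraints(vtn=0.6, vtp=-0.6, vdd=0.5, vss=-0.5, sinv1_swing=0.3))
```

The first flag reports whether short-circuit current is avoided; the second reports whether the SINV1 swing is wide enough to turn on one transistor at each end of its range.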

CINV2 cannot be used to drive the input SET logic of a SELB directly. SET


current is a periodic function of the gate control voltage and has a period of e/CG,

which is much smaller than the output voltage range of CINV2. Therefore, this

output voltage range cannot be used directly. To solve this problem, we design a

special SET inverter, SINV2, that is used for SELB inputs. SINV2 is fabricated with

a large distance between gate and island in order to reduce the gate capacitance, CG.

Thus, e/CG can match the output signal range of CMOS inverter CINV2. Although

source–island and drain–island junctions must be short to permit tunneling, there is

no such bound on gate–island separation.

In IceFlex, each SELB is equipped with a reconfigurable input switch fabric that

selects the connections among local and global interconnects. The input switch fabric

is implemented using a multi-gate SET multiplexer tree, similar to that in the

reconfigurable look-up table described in Section 5.3.2.1.

5.3.2.5 Design and Modeling of IceFlex Majority Voting Logic

Although researchers are making progress on reducing the severity of noise result-

ing from random background offset charge effects, it may continue to pose run-time

noise problems in the future. Even if this problem can be entirely solved, resistance

to run-time faults may be useful in SETs, e.g., to allow resistance to Alpha particle

induced faults or other single event upsets. IceFlex incorporates support for hierar-

chical spatial redundancy to improve fault tolerance. Although much of the literature

predicts the need for fault-tolerant architectures in nanoelectronics, the level of fault

tolerance is currently unknown. Therefore, we consider the results for a number of

possible SET failure rates and in the presence of three fault-tolerance configurations.

Other researchers have proposed a number of architectural techniques to support


reliable computation using nanoscale electronics that are susceptible to fabrication-

time and run-time faults. DeHon described the use of structural redundancy and

programming-time defect-aware configuration in a carbon nanotube and silicon nanowire

based programmable logic array architecture [24]. Goldstein et al. describe the use

of a defect map that is generated during post-fabrication testing to avoid the use of

faulty devices [37]. Bahar et al. present a method of expressing logic circuits using

Markov Random Fields, permitting Boolean functions to be computed using devices

susceptible to potentially-frequent transient faults [10]. We think it likely that the

random background charge problem will ultimately be dealt with by a combination

of improved fabrication technology, post-fabrication testing to identify and avoid a

subset of the affected SETs, and run-time fault-tolerance via conventional structural

redundancy or recent advances in probabilistic computation. IceFlex provides for

regular structural redundancy and run-time error correction.

We now consider the fault model for IceFlex SELBs. Every path from SELB input

to output contains 64 SETs. In the third row of Table 5.3, we show the SELB raw

failure probabilities, i.e., the probability of a SELB producing an incorrect output.

SELB failure probability is a function of the SET fault probability, for which Ta-

ble 5.3 shows three values. Likharev estimates the long-term density of background

offset charge susceptible SETs [70]. We follow his assumptions arriving at one suscep-

tible SET in 10,000. The resulting 1/f noise produces long-duration failure periods.

Therefore, in this analysis, we (conservatively) assume that susceptible devices consis-

tently fail. In reality, errors may not be consistent. We also consider the higher SET

fault probability of 1/1,000 and the lower fault probability of 1/100,000. Advances

in fabrication and detection of most SETs susceptible to random background offset


charge effects by post-fabrication testing may permit reduction in run-time SET fault

probability.
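As a rough illustration of how the raw SELB failure probability depends on the SET fault probability, the following Python sketch assumes that SET faults are independent and that a SELB output is incorrect whenever at least one of the 64 SETs on the active path is faulty. Both assumptions, and the function name `selb_raw_failure_probability`, are introduced here for illustration only; the values reported in Table 5.3 may rest on a more detailed model.

```python
def selb_raw_failure_probability(p_set: float, sets_per_path: int = 64) -> float:
    """Probability that at least one SET on a path is faulty.

    Assumes independent SET faults (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_set) ** sets_per_path


if __name__ == "__main__":
    # The three SET fault probabilities considered in the text.
    for p_set in (1e-3, 1e-4, 1e-5):
        q = selb_raw_failure_probability(p_set)
        print(f"SET fault probability {p_set:.0e}: raw SELB failure probability {q:.3e}")
```

For small fault probabilities the result is close to 64 times the SET fault probability, which is why even a 1/10,000 SET fault rate leaves an unprotected SELB failing well under one percent of the time but far too often for reliable computation.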

We have considered the effect of using no MVL (Raw fail prob.), fault-free MVL

(Best prob.), and SET MVL. Using a given reliability configuration, it is not possible

for MVL-based designs to produce lower SELB fault probabilities than those shown

in the Best prob. row. SET MVLs are constructed from multi-gate SETs. We focus on the three-input SET MVL design to simplify depiction; the five-input and seven-input SET MVLs follow an analogous design style. This circuit has the same structure as the parity gate shown in Figure 5.6. However, the separation between the gates and the island is adjusted such that the circuit traverses only half of a Coulomb oscillation period during

use. The SET pull-up gates are separated sufficiently to require the majority of the

gates to be high. The converse is true of the pull-down gates. For each SET depicted

in the figure, four SETs are used in parallel in order to permit the failure of one SET

while still producing correct results. We have computed the delay of the SET MVL

by considering the worst-case scenario, in which a path that is 3/5 or 4/7 closed has a faulty driver SET and a path that is 2/5 or 3/7 closed has no faulty SETs.

As shown in Table 5.3, it is possible for a seven-input SET-only MVL with redundant SELBs to reduce the failure rate to 1/8,500,000, given a SET fault probability of 1/10,000, or 1/830,000,000, given a SET fault probability of 1/100,000. Given recent trends in noise-resistant SET design and fabrication, it seems likely that a less aggressive fault tolerance configuration will suffice in the future (see Section 5.2.2).
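To illustrate why redundant SELBs combined with majority voting lower the failure rate so sharply, the following sketch computes the probability that a majority of n replicas are wrong. It assumes independent replica failures and an ideal, fault-free voter (as in the Best prob. case), and it reuses the simplified series model of 64 independent SETs per path. These are illustrative assumptions, and the function names are hypothetical. Because faults in the SET MVL itself are not modeled, the output is not expected to reproduce the Table 5.3 entries.

```python
from math import comb


def raw_failure_probability(p_set: float, sets_per_path: int = 64) -> float:
    """Failure probability of one replica, assuming independent SET faults in series."""
    return 1.0 - (1.0 - p_set) ** sets_per_path


def majority_failure_probability(q: float, n: int) -> float:
    """Probability that more than half of n independent replicas are wrong.

    q is the failure probability of a single replica; the voter is assumed ideal.
    """
    k_min = n // 2 + 1
    return sum(comb(n, k) * q**k * (1.0 - q) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    for p_set in (1e-3, 1e-4, 1e-5):
        q = raw_failure_probability(p_set)
        for n in (3, 5, 7):
            print(f"p_set={p_set:.0e}, {n}-input voting: {majority_failure_probability(q, n):.3e}")
```

The dominant term scales as the replica failure probability raised to the majority threshold, so moving from three-input to seven-input voting reduces the failure rate by several orders of magnitude when the replica failure probability is small.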

If a method of rapidly determining which SETs are susceptible to random back-

ground charge effects is ever developed, these effects can be avoided in the same way

that fabrication defects are avoided: via the use of a regular computation structure in


which operations are mapped only to fault-free devices. There has been some promis-

ing work on this topic, in which illumination is used to produce ions, accelerating the

onset of random background charge effects [16].


In this section, we evaluate the suitability of SETs for low-power embedded system design. We start with the microarchitecture characterization of IceFlex. IceFlex is then used as a testbed to characterize the benefits and limitations of SETs for both high-performance and battery-powered embedded applications.

5.4.1 Characterization of the IceFlex Architecture

Following the design parameters shown in Table 5.2, we evaluate the performance

and power consumption of IceFlex using HSPICE. For SET circuitry, the SPICE

model and device parameters are described in Section 5.2.3. For CMOS logic and

metal wire, we use the 22 nm Berkeley BSIM4 predictive technology model, which

models the impact of temperature on MOS devices. We analyzed designs adhering to the $C_\Sigma = e^2/(40 k_B T)$ constraint, as well as designs adhering to the less conservative $C_\Sigma = e^2/(10 k_B T)$ constraint. Both a low-power setting (targeting megahertz-range frequencies) and a high-performance setting (targeting gigahertz-range frequencies) are considered.
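To give a sense of scale for these constraints, the following sketch evaluates the total island capacitance implied by $C_\Sigma = e^2/(m k_B T)$ for m = 40 and m = 10 at the temperatures used in this characterization. The function name `total_capacitance_f` is hypothetical; the physical constants are the exact SI values of the elementary charge and the Boltzmann constant.

```python
E_CHARGE = 1.602176634e-19  # elementary charge in coulombs (exact SI value)
K_B = 1.380649e-23          # Boltzmann constant in J/K (exact SI value)


def total_capacitance_f(temp_k: float, margin: float) -> float:
    """Capacitance C for which the charging energy e^2/C equals margin * k_B * T."""
    return E_CHARGE**2 / (margin * K_B * temp_k)


if __name__ == "__main__":
    for temp_k in (40, 77, 103, 120, 200, 250, 300):
        c40 = total_capacitance_f(temp_k, 40)
        c10 = total_capacitance_f(temp_k, 10)
        # Report in attofarads (1 aF = 1e-18 F).
        print(f"T = {temp_k:3d} K: C(40 kBT) = {c40 / 1e-18:.3f} aF, C(10 kBT) = {c10 / 1e-18:.3f} aF")
```

At 300 K the more conservative constraint corresponds to roughly 0.15 aF and the less conservative one to roughly 0.62 aF, which indicates how small the island capacitance must be for room-temperature operation.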

Table 5.4: Characterization of IceFlex Microarchitecture for $C_\Sigma = e^2/(40 k_B T)$ [133].

Latency (ns), low-power setting
Component           40 K      77 K     103 K     120 K     200 K     250 K     300 K
LUT                10.04      7.86      7.09      6.80      5.57      5.03      4.75
Register            1.42      1.09      1.02      1.00      0.90      0.88      0.86
7-input MVL         0.58      0.57      0.58      0.58      0.59      0.56      0.58
SET-MVL             1.15      1.13      1.13      1.00      1.08      1.04      1.06
SUM (MG)            2.32      2.31      2.31      2.31      2.31      2.28      2.29
SUM (CS)            3.02      2.97      2.95      2.96      2.95      2.89      2.93
CO                  1.15      1.13      1.13      1.00      1.08      1.04      1.06

Latency (ns), high-performance setting
Component           40 K      77 K     103 K     120 K     200 K     250 K     300 K
LUT                 0.08      0.06      0.05      0.05      0.05      0.04      0.04
Register            0.01      0.01      0.01      0.01      0.01      0.01      0.01
7-input MVL     3.28E-03  3.18E-03  3.16E-03  3.20E-03  3.24E-03  2.99E-03  3.14E-03
SET-MVL             0.01      0.01      0.01      0.01      0.01      0.01      0.01
SUM (MG)            0.01      0.01      0.01      0.01      0.01      0.01      0.01
SUM (CS)            0.01      0.01      0.01      0.01      0.01      0.01      0.01
CO                  0.01      0.01      0.01      0.01      0.01      0.01      0.01

Power (nW), low-power setting
Component           40 K      77 K     103 K     120 K     200 K     250 K     300 K
LUT                 0.07      0.26      0.44      0.58      1.60      2.64      3.70
Register            0.08      0.30      0.53      0.72      1.99      3.14      4.48
7-input MVL         0.05      0.20      0.36      0.48      1.32      2.17      3.02
SET-MVL             0.01      0.03      0.06      0.08      0.21      0.34      0.48
SUM (MG)        1.61E-03      0.01      0.01      0.01      0.04      0.07      0.09
SUM (CS)            0.01      0.04      0.07      0.09      0.25      0.40      0.57
CO                  0.01      0.03      0.06      0.08      0.21      0.34      0.48

Power (nW), high-performance setting
Component           40 K      77 K     103 K     120 K     200 K     250 K     300 K
LUT                 6.67     25.76     44.53     58.19    162.20    266.69    373.81
Register            8.02     29.88     53.16     72.12    199.64    315.21    450.34
7-input MVL         5.37     20.05     35.87     48.15    132.24    217.31    302.60
SET-MVL             0.94      3.51      6.26      8.44     23.24     37.58     52.90
SUM (MG)            0.22      0.80      1.44      1.91      5.19      8.88     12.04
SUM (CS)            1.04      3.87      6.90      9.30     25.60     41.51     58.35
CO                  0.94      3.51      6.26      8.44     23.24     37.58     52.90

Table 5.5: Characterization of IceFlex Interconnect Fabric for $C_\Sigma = e^2/(40 k_B T)$ [133].

Latency (ns), low-power setting
Component           40 K      77 K     103 K     120 K     200 K     250 K     300 K
ISF                6.696     5.238     4.727     4.537     3.712     3.351     3.169
Single             0.728     0.699     0.694     0.697     0.799     0.770     0.784
Double             0.704     0.687     0.685     0.689     0.794     0.766     0.781
Hex                0.692     0.680     0.680     0.684     0.791     0.763     0.779
Global             2.996     4.523     4.657     4.237     4.572     4.520     6.785

Latency (ns), high-performance setting
Component           40 K      77 K     103 K     120 K     200 K     250 K     300 K
ISF                0.050     0.039     0.037     0.036     0.030     0.028     0.027
Single             0.006     0.006     0.006     0.007     0.005     0.005     0.005
Double             0.006     0.006     0.006     0.007     0.005     0.005     0.005
Hex                0.006     0.006     0.006     0.007     0.005     0.005     0.005
Global             0.163     0.110     0.092     0.086     0.074     0.073     0.099

Power (nW), low-power setting
Component           40 K      77 K     103 K     120 K     200 K     250 K     300 K
ISF                0.219     0.844     1.457     1.903     5.302     8.727    12.226
Single             0.008     0.032     0.057     0.076     0.210     0.342     0.479
Double             0.017     0.063     0.113     0.152     0.420     0.684     0.958
Hex                0.034     0.127     0.226     0.305     0.840     1.368     1.917
Global           271.780    23.912     6.668     4.460     3.555     4.513     5.857

Power (nW), high-performance setting
Component           40 K      77 K     103 K     120 K     200 K     250 K     300 K
ISF               22.022    85.034   146.920   191.957   535.072   879.837  1233.147
Single             0.959     3.387     6.193     7.977    24.992    34.101    53.581
Double             1.917     6.775    12.386    15.955    49.984    68.202   107.160
Hex                3.835    13.549    24.771    31.909    99.967   136.400   214.320
Global          6674.800  5146.700  5560.900  5824.100  5318.200  4856.100  4745.700

Tables 5.4 and 5.5 summarize the performance and power characterization of the logic components and interconnect fabric of IceFlex, including the multi-gate SET reconfigurable lookup table (LUT)¹, SET register (Register), SET and CMOS four-out-of-seven majority voting logic (MVL), multi-gate (MG) and CMOS-style (CS) exclusive-OR, carry-out logic (CO), SET local interconnect (Single, Double, and Hex), hybrid SET/CMOS global interconnect (Global), and SET input switch fabric (ISF). From these results, we make the following observations.

First, IceFlex has high energy efficiency, good performance, and high flexibility in trading off performance against energy efficiency. At the low-power setting, the power consumption of each SET-based logic component and local interconnect element is in the nanowatt range or below. The hybrid SET/CMOS global interconnect has the highest power consumption, a result of the high capacitance of global wires and the high power consumption of the CMOS buffers. All components in the low-power version of IceFlex still have latencies in the nanosecond range, because SETs have high junction resistance and low driving strength. At the high-performance setting, scaling the SET junction resistance down to 100 kΩ brings the latencies of the SET-based logic and local interconnect fabric consistently below 100 ps. Even though reducing the resistance results in a 100× increase in power, the overall energy efficiency of IceFlex remains orders of magnitude higher than that of CMOS-based solutions, as demonstrated in Section 5.4.2.

Second, these results demonstrate the impact of temperature on SET performance and power consumption: as the temperature increases, performance increases and power efficiency decreases. This is a result of the impact of thermal energy on tunneling events, and therefore on circuit behavior, which is described in Section 5.2. The number of electrons with sufficient energy to overcome the Coulomb blockade effect increases with temperature, thereby increasing tunneling rate, performance, and power consumption.

¹ To allow comparison with Xilinx FPGAs, a 16-to-1 setting is used.

The $C_\Sigma = e^2/(40 k_B T)$ setting enables greater resistance to shot noise than the $C_\Sigma = e^2/(10 k_B T)$ setting. However, it also imposes performance and power consumption penalties. For SET circuitry, the required supply voltage is inversely proportional to gate capacitance. Compared to the $C_\Sigma = e^2/(10 k_B T)$ setting, the $C_\Sigma = e^2/(40 k_B T)$ setting requires a further reduction of SET gate capacitance and a corresponding increase in supply voltage. Note that the driven capacitance of a SET circuit is dominated by the metal wires. Therefore, the decreased gate capacitance has negligible impact on power consumption. The increased supply voltage, on the other hand, increases circuit dynamic power consumption. Moreover, the increased voltage range increases the duration of the signal swing, thereby increasing latency.

5.4.1.1 SET Multi-Gate Multiplexer Tree

As described in Section 5.3.2.1, multi-gate SETs improve the performance, power

consumption, and area efficiency of the multiplexer tree design. This section charac-

terizes the impact of thermal energy on the proposed multi-gate design.

As described in Section 5.3.2.1, at the high-performance $C_\Sigma = e^2/(10 k_B T)$ setting, the dual-gate design is used for temperatures at or below 200 K. For this setting, only the single-gate design is feasible at temperatures greater than 250 K due to the high static current at these temperatures. As a result, circuit power consumption is increased at high temperatures. From 200 K to 250 K, both latency and power consumption increase. In addition, when using the same design, we observe that both the circuit performance and power consumption increase with temperature. The same trend was described in Section 5.4.1. Using the low-power design of IceFlex, only the single-gate design is feasible (see Section 5.3.2.1). Using $e^2/C_\Sigma \geq 40 k_B T$, SET circuitry is less susceptible to thermal energy thanks to the increased charging energy. Therefore, both low-power and high-performance dual-gate multiplexer tree designs become feasible across the entire temperature range. As shown in Figure 5.8, using the high-performance $C_\Sigma = e^2/(40 k_B T)$ setting, the performance and power consumption of the multi-gate multiplexer tree design increase consistently with temperature. A similar trend can be shown for the corresponding low-power design case.

Figure 5.8: Power and Performance of the Multi-gate SET Multiplexer Tree for High Performance, $C_\Sigma = e^2/(40 k_B T)$ [133]. (Plot of LUT latency in ns and LUT power in nW versus temperature T in K.)

5.4.1.2 Power and Performance of Interconnect Design

Power consumption, performance, and the tradeoff between them are of central importance in interconnect design. We considered both SET-only and SET/CMOS hybrid interconnect driver designs. The relative static power benefit of the SET/CMOS hybrid design over the SET-only design increases as the wire load increases. This is mainly due to an increase in the static power consumption of the SET-only design as more SET buffers are used to meet the driving strength requirements. The SET-only design has superior dynamic power efficiency. As the wire length increases, the proportion of capacitance contributed by CMOS buffer gates becomes less significant relative to wire capacitance. Therefore, relative to the SET-only design, the dynamic power consumption of the SET/CMOS hybrid design also improves, but it is still inferior to that of the SET-only design. At 300 K, for both the $C_\Sigma = e^2/(40 k_B T)$ and $C_\Sigma = e^2/(10 k_B T)$ settings, we found that SET-only designs had better energy efficiencies for wires shorter than approximately 1 mm, and SET/CMOS hybrid designs were better for longer wires. As temperature increases, the impact of thermal energy increases and, as a result, the static power consumption of SETs increases. Therefore, the wire length at which the SET/CMOS design begins to outperform the SET-only design decreases as temperature increases.

Table 5.5 illustrates an interesting trend for global interconnect: the power consumption of both the low-power and the high-performance $C_\Sigma \leq e^2/(40 k_B T)$ hybrid SET/CMOS designs generally decreases with increasing temperature. At low temperatures, the output voltage ranges and driving currents of the SETs are small, increasing CMOS buffer static power consumption.

5.4.1.3 Performance and Power Characterization of SET Non-Unate Logic

SETs support the efficient implementation of some non-unate arithmetic functions. We evaluate the power consumption and performance of an exclusive-OR gate, a non-unate Boolean function widely used in arithmetic logic, e.g., in addition and multiplication. We compared the two implementations described in Section 5.3.2.3: the proposed SET-based design and the CMOS-style SET implementation. Figure 5.9 shows the power and performance characterization of these two designs at the low-power $C_\Sigma = e^2/(40 k_B T)$ setting. These results demonstrate the superior power consumption and performance of the proposed design style, which is not possible using BJTs, CMOS, or threshold logic. Compared to the CMOS-style SET implementation, the design that exploits the periodic I–V curve of SETs achieves the latency and energy consumption reductions indicated in Table 5.6, i.e., a 22.0% to 40.8% reduction in latency and a 64.1% to 87.1% reduction in energy consumption.

Figure 5.9: Performance and Power Characterization of Exclusive-or Logic for Low Power for $C_\Sigma = e^2/(40 k_B T)$ [133]. (Plot of latency in ns and power in nW versus temperature T in K for the multi-gate style and CMOS style designs.)

Table 5.6: Latency and Energy Improvement For Exclusive-Or Design [133].

Performance setting   C_Σ constraint (F)   Performance improvement (%)   Energy improvement (%)
Battery               e^2/(10 k_B T)                              40.8                     64.1
Battery               e^2/(40 k_B T)                              22.0                     87.1
High                  e^2/(10 k_B T)                              32.1                     84.6
High                  e^2/(40 k_B T)                              25.2                     84.4
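As a sanity check on the relationship between the component characterization and the improvement percentages, the following sketch derives latency and energy reductions from the 300 K low-power exclusive-OR entries of Table 5.4 (multi-gate: 2.29 ns, 0.09 nW; CMOS-style: 2.93 ns, 0.57 nW). It assumes that energy per operation can be approximated as power times latency and that the battery-powered row of Table 5.6 corresponds to this operating point; both are assumptions made here for illustration, and the helper name `reduction_percent` is hypothetical.

```python
def reduction_percent(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# 300 K, low-power setting, values read from Table 5.4.
CS_LATENCY_NS, CS_POWER_NW = 2.93, 0.57  # CMOS-style SET exclusive-OR
MG_LATENCY_NS, MG_POWER_NW = 2.29, 0.09  # multi-gate SET exclusive-OR

latency_reduction = reduction_percent(CS_LATENCY_NS, MG_LATENCY_NS)
# Energy per operation approximated as power x latency (illustrative assumption).
energy_reduction = reduction_percent(CS_LATENCY_NS * CS_POWER_NW,
                                     MG_LATENCY_NS * MG_POWER_NW)

if __name__ == "__main__":
    print(f"latency reduction: {latency_reduction:.1f} %")
    print(f"energy reduction:  {energy_reduction:.1f} %")
```

Under these assumptions the sketch gives roughly 22% and 88%, close to the 22.0% and 87.1% listed for the battery-powered $e^2/(40 k_B T)$ case; the small difference is plausibly due to rounding of the tabulated values.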

5.4.2 Characterization of High-Performance and Battery-Powered

Embedded Applications

This section characterizes the performance and power consumption of IceFlex

when used to implement numerous general-purpose and application-specific processor

cores. We evaluate the suitability of IceFlex for use in both portable battery-powered

and high-performance embedded systems by determining its performance and energy

efficiency when used to implement the processor cores described below. We have

divided the cores into battery-powered and high-performance categories.

Battery-Powered

AES (Rijndael) IP core (AES), ATMega103 microcontroller (AVR), coordinate rotation computer (CORDIC), ECC core (ECC), 32-bit IEEE 754 floating-point unit (FPU), Reed–Solomon encoder (RS), USB 2.0 function (USB), and video compression systems (VC).

High-Performance

Power-efficient RISC CPU (ARM7), synchronous / DLX core (ASPIDA DLX), five-stage pipeline RISC CPU (Jam RISC), entire SPARC V8 processor (LEON2 SPARC), RISC CPU (Microblaze), MIPS I clone (miniMIPS), MIPS processor (MIPS), CPU supporting most MIPS I opcodes (Plasma), MIPS I integer-only clone (UCore), and MIPS I clone (YACC).

Table 5.7: IceFlex Performance and Power Consumption at Room Temperature For $C_\Sigma = e^2/(40 k_B T)$ [133].

Benchmark           FPGA, 22 nm CMOS technology*      IceFlex, battery-powered     IceFlex, high-performance
                    Freq (MHz)  Energy (J/cycle)  Freq (MHz)  Energy (J/cycle)  Freq (MHz)  Energy (J/cycle)
ARM7                      26.3          2.96e-09         2.0          5.47e-11       224.0          4.79e-11
ASPIDA DLX               125.7          8.86e-10        11.5          6.37e-12      1333.3          5.58e-12
Jam RISC                  95.9          8.92e-10        12.8          3.65e-12      1481.5          3.19e-12
LEON2 SPARC               85.9          1.88e-09         8.8          2.39e-11      1025.6          2.09e-11
Microblaze RISC          115.1          7.28e-10        16.4          2.01e-12      1904.8          1.76e-12
miniMIPS                  88.0          4.87e-10         9.6          9.78e-12      1111.1          8.56e-12
MIPS                      80.4          1.02e-09        10.5          4.34e-12      1212.1          3.80e-12
Plasma                    75.4          1.13e-09         8.8          6.91e-12      1025.6          6.05e-12
UCore                    136.4          8.19e-10        12.8          5.45e-12      1481.5          4.78e-12
YACC                      72.1          1.18e-09        19.2          3.08e-12      2222.2          2.69e-12
AES                      205.3          3.43e-10        28.7          2.34e-12      3333.3          2.05e-12
AVR                       71.9          2.67e-10         9.6          5.34e-12      1111.1          4.67e-12
CORDIC                   271.8          1.37e-10       114.9          2.05e-13     13333.3          1.79e-13
ECC                       39.1          4.91e-10        11.5          6.92e-12      1333.3          6.05e-12
FPU                       28.4          1.00e-09         2.6          8.02e-11       296.3          7.02e-11
RS                       496.7          1.28e-11        57.5          4.61e-14      6666.7          4.05e-14
USB                      171.6          3.24e-10        38.3          1.53e-12      4444.4          1.34e-12
VC                      114.16          1.24e-09        23.0          1.04e-11      2666.8          9.10e-12
Avg. energy improvement                                                 68.58×                        78.46×

The Xilinx Virtex-II XC2V2000 FPGA is used as a base case for comparison.

Each application is synthesized with Xilinx ISE to determine the number of required

LUTs, maximum frequency, and power consumption, using a switching probability of

10% [121] and a 65 nm feature size. We then scale the FPGA synthesis results to a 22 nm process based on HSPICE predictive technology model simulation results for the two technologies [130]. We used FPGA synthesis software to estimate the number of IceFlex SELBs required. 16-entry Virtex-II LUTs were used due to their functional (but not structural) similarity to IceFlex SELBs. For each design, the maximum frequency for IceFlex was taken as the reciprocal of the critical-path delay, which was determined by multiplying the number of SELBs along the longest combinational path by the delay of an IceFlex SELB plus the delay of a local interconnect. IceFlex power consumption was computed by taking

the sum of the power consumptions of all components at the maximum operating

frequency. Note that, since Xilinx ISE does not report use of global interconnect

for any of the processors we synthesized, we exclude the hybrid global interconnect

from IceFlex power analysis. In designs that use primarily local interconnect (i.e.,

single, double, and hex interconnect), the reported power consumption results will

be accurate. However, for designs in which global hybrid SET–CMOS interconnect

dominates, the power consumption may approach that of global interconnect in a

corresponding 22 nm CMOS design.
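The frequency and power estimation procedure described above can be sketched as follows. The sketch interprets the critical-path delay as the number of SELBs on the longest combinational path multiplied by the sum of one SELB delay and one local interconnect delay, which is one reading of the description rather than a confirmed detail. The function names and the numeric inputs in the example are hypothetical and are not taken from the synthesis results.

```python
from typing import Iterable


def max_frequency_hz(selbs_on_critical_path: int,
                     selb_delay_s: float,
                     local_interconnect_delay_s: float) -> float:
    """Maximum clock frequency as the reciprocal of the estimated critical-path delay."""
    critical_path_delay_s = selbs_on_critical_path * (selb_delay_s + local_interconnect_delay_s)
    return 1.0 / critical_path_delay_s


def total_power_w(component_powers_w: Iterable[float]) -> float:
    """Total power as the sum of the power of all components at the operating frequency."""
    return sum(component_powers_w)


if __name__ == "__main__":
    # Hypothetical example: 10 SELBs on the critical path, 5 ns per SELB,
    # 0.8 ns per local interconnect hop.
    f_max = max_frequency_hz(10, 5e-9, 0.8e-9)
    print(f"estimated maximum frequency: {f_max / 1e6:.2f} MHz")
    # Hypothetical per-component powers in watts.
    print(f"estimated total power: {total_power_w([3.7e-9, 4.5e-9, 0.5e-9]):.2e} W")
```

Dividing the total power by the maximum frequency then gives an energy-per-cycle figure of the kind reported in Table 5.7.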

Table 5.7 shows the operating frequencies and energy efficiency, in joules per clock cycle, of the CMOS FPGA and IceFlex variants for each benchmark application. As described in Section 5.3.1.5, recent progress in fabrication is reducing the severity of the random background charge problem. If that work is successful, it may be less critical to use redundancy and majority voting logic in IceFlex.

5.4.2.1 Ultra-Low-Power Applications

The data in Table 5.7 indicate that the non-redundant, room temperature,

low-power version of IceFlex is suitable for use in applications such as sensor net-

work nodes, if they can be fabricated with sufficiently small island capacitances. In

the following analysis, we shall focus on the AVR core, which is representative of

a commonly-used sensor network node processor. Alkaline AA batteries typically have a capacity of 2,800 mAh and a nominal operating voltage of 1.5 V, i.e., they can deliver approximately 15,000 J. Using the conservative $C_\Sigma \leq e^2/(40 k_B T)$ constraint, a low-power IceFlex AVR implementation running at 4 MHz consumes approximately 20 µW (5.34 pJ per cycle in Table 5.7), permitting it to run for 20 years on one AA battery, i.e., longer than the shelf life of most such batteries. When the less conservative $C_\Sigma \leq e^2/(10 k_B T)$ constraint is used, the average energy consumption improvements increase to 95.60× (non-redundant battery-powered), 115.65× (non-redundant high-performance), 12.27× (redundant battery-powered), and 15.27× (redundant high-performance).
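The battery lifetime estimate can be reproduced with a short calculation. The sketch below uses the battery-powered AVR energy per cycle from Table 5.7 (5.34e-12 J/cycle) and assumes that this energy per cycle also holds at 4 MHz and that battery self-discharge and conversion losses are negligible; these simplifications and the function names are assumptions made here for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def battery_energy_j(capacity_mah: float, voltage_v: float) -> float:
    """Stored energy in joules: charge (A*s) times nominal voltage."""
    return capacity_mah * 1e-3 * 3600.0 * voltage_v


def average_power_w(energy_per_cycle_j: float, frequency_hz: float) -> float:
    """Average power for a given energy per clock cycle and clock frequency."""
    return energy_per_cycle_j * frequency_hz


def lifetime_years(energy_j: float, power_w: float) -> float:
    """Run time in years when drawing `power_w` from a store of `energy_j`."""
    return energy_j / power_w / SECONDS_PER_YEAR


if __name__ == "__main__":
    energy = battery_energy_j(2800, 1.5)     # alkaline AA cell
    power = average_power_w(5.34e-12, 4e6)   # AVR core at 4 MHz
    print(f"battery energy: {energy:.0f} J")
    print(f"average power:  {power * 1e6:.1f} uW")
    print(f"lifetime:       {lifetime_years(energy, power):.1f} years")
```

Under these assumptions the battery stores about 15,120 J, the core draws about 21 µW, and the resulting lifetime is a little over 22 years, consistent with the 20-year figure quoted above.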

This power consumption is also low enough to permit an AVR processor to operate on energy scavenged from the environment. If we assume an energy scavenging volume of 5 cm³ and use Roundy's power densities of 4 µW/cm³ for indoor solar energy, 200 µW/cm³ for vibrations, 10 µW/cm³ for daily temperature variation, and 0.003 µW/cm³ for acoustic noise at 75 dB [92], we find that one sensor network node is capable of scavenging enough energy to power an IceFlex AVR processor running at the maximum clock frequency from vibrations or daily temperature variation, at 3.7 MHz from indoor solar energy, and at 2.8 kHz from 75 dB acoustic noise. However, SET circuits that operate at room temperature and adhere to the $C_\Sigma \leq e^2/(40 k_B T)$ constraint will rely on features with sizes approaching (but not crossing) physical limits. Although the use of SETs in battery-powered applications has potential, it depends on the solution of formidable fabrication challenges or the development of compact, low-power cooling methods.
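The energy-scavenging estimates follow from dividing the harvested power by the energy per clock cycle. The sketch below uses the power densities and 5 cm³ volume quoted above together with the battery-powered AVR entry of Table 5.7 (5.34e-12 J/cycle, 9.6 MHz maximum frequency). It assumes a frequency-independent energy per cycle and lossless energy conversion, which are simplifications introduced here; the function name is hypothetical.

```python
def sustainable_frequency_hz(power_density_uw_per_cm3: float,
                             volume_cm3: float,
                             energy_per_cycle_j: float,
                             max_frequency_hz: float) -> float:
    """Clock frequency sustainable from harvested power, capped at the core's maximum."""
    harvested_power_w = power_density_uw_per_cm3 * 1e-6 * volume_cm3
    return min(harvested_power_w / energy_per_cycle_j, max_frequency_hz)


# Power densities in uW/cm^3 as quoted in the text [92].
SOURCES = {
    "indoor solar": 4.0,
    "vibrations": 200.0,
    "daily temperature variation": 10.0,
    "acoustic noise (75 dB)": 0.003,
}

if __name__ == "__main__":
    for name, density in SOURCES.items():
        f = sustainable_frequency_hz(density, 5.0, 5.34e-12, 9.6e6)
        print(f"{name}: {f / 1e3:.1f} kHz")
```

With these inputs, indoor solar energy supports about 3.7 MHz and acoustic noise about 2.8 kHz, matching the figures in the text. Vibrations comfortably support the 9.6 MHz maximum, while daily temperature variation yields about 9.4 MHz under these simplifications, just under the maximum.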


5.4.2.2 Energy-Efficient High-Performance Applications

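We can draw the following general conclusions from Table 5.7. For a wide range of processor cores, the SET-based IceFlex architecture is capable of achieving energy efficiencies approaching two orders of magnitude better than those of 22 nm CMOS-based FPGAs (68.58× and 78.46× on average for the battery-powered and high-performance variants, respectively). With the high-performance variant, peak frequencies above 200 MHz are maintained for all processors, and most exceed 1 GHz.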

One might expect the high-performance version of IceFlex to consistently achieve

higher frequency but lower energy efficiency than the low-power version of IceFlex.

However, its energy efficiency is typically better, as well. Operating at higher fre-

quencies can permit reduced static energy consumption, and therefore better energy

efficiency, especially at room temperature where static power consumption is high

(see Figure 5.2). Therefore, for SET-based architectures that are operated at room

temperature and have low performance requirements, it will generally be more energy

efficient to operate the device at high frequency and periodically enter a power-gated

sleep mode than to continuously operate at a low frequency.

In high-performance applications for which parallel computation is appropriate, improved energy efficiency can be traded for improved performance within the same power budget. For example, given a power budget of 125 mW and $C_\Sigma \leq e^2/(40 k_B T)$, one could use one LEON2 SPARC implemented with an FPGA and running at 85 MHz or five LEON2 SPARCs implemented with the high-performance variant of IceFlex and operating at 1,025 MHz. This implies an overall performance 60× higher than that of the FPGA version. Taken to its logical extreme, assuming a power budget of 100 W and one instruction per cycle, one could execute 4.8 tera-instructions per second (4.8 × 10^12 IPS). These numbers are intended to give the reader some indication of the potential to improve performance given a power budget. In practice, some of this performance will be lost due to parallelization inefficiency and off-chip communication latency. A similar comparison can be made for the MIPS processor, for which IceFlex permits a 268× improvement in energy efficiency compared with an FPGA implementation.
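The power-budget comparison can be checked against the LEON2 SPARC and MIPS entries of Table 5.7. The sketch below assumes ideal parallel scaling, one instruction per cycle, and no communication overhead, so it gives upper bounds rather than achievable figures; the function names are hypothetical.

```python
import math


def core_power_w(energy_per_cycle_j: float, frequency_hz: float) -> float:
    """Power drawn by one core running continuously at `frequency_hz`."""
    return energy_per_cycle_j * frequency_hz


def cores_within_budget(power_budget_w: float, energy_per_cycle_j: float,
                        frequency_hz: float) -> int:
    """Number of whole cores that fit within the power budget."""
    return math.floor(power_budget_w / core_power_w(energy_per_cycle_j, frequency_hz))


# Table 5.7 entries (frequency in Hz, energy in J/cycle).
LEON2_FPGA = (85.9e6, 1.88e-09)
LEON2_ICEFLEX_HP = (1025.6e6, 2.09e-11)
MIPS_FPGA_ENERGY, MIPS_ICEFLEX_HP_ENERGY = 1.02e-09, 3.80e-12

n_cores = cores_within_budget(0.125, LEON2_ICEFLEX_HP[1], LEON2_ICEFLEX_HP[0])
speedup = n_cores * LEON2_ICEFLEX_HP[0] / LEON2_FPGA[0]
ips_at_100_w = 100.0 / LEON2_ICEFLEX_HP[1]   # one instruction per cycle assumed
mips_energy_ratio = MIPS_FPGA_ENERGY / MIPS_ICEFLEX_HP_ENERGY

if __name__ == "__main__":
    print(f"IceFlex LEON2 cores within 125 mW: {n_cores}")
    print(f"aggregate speedup over one FPGA core: {speedup:.1f}x")
    print(f"instructions per second at 100 W: {ips_at_100_w:.2e}")
    print(f"MIPS energy-efficiency ratio: {mips_energy_ratio:.0f}x")
```

Under these assumptions the sketch yields five cores, a speedup of about 60×, about 4.8 × 10^12 instructions per second at 100 W, and a 268× energy-efficiency ratio for the MIPS core, in line with the figures quoted above.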

5.5 Conclusions

In this chapter, we have analyzed the impact of using SETs in architecture and

circuit design; proposed IceFlex, a fault-tolerant, reconfigurable, hybrid SET/CMOS

architecture for use in high-performance and battery-powered embedded systems;

and evaluated the energy efficiency, power consumption, and performance of IceFlex

in these applications. Our results indicate that using SETs for computation poses

many design challenges, some of which can be solved with the proposed architecture

and circuit design techniques. In addition, we find that SETs have unique proper-

ties that permit significant improvements in circuit efficiency when compared with

BJT, CMOS, and threshold logic based design. In summary, we find that a hybrid SET/CMOS architecture has the potential to improve energy efficiency in battery-powered and high-performance applications by two orders of magnitude compared with 22 nm CMOS while permitting operating frequencies that are as high or higher. Although SETs hold great promise, their practical use will require additional

research into fault tolerance techniques, processing technologies, and novel circuit de-

signs. In particular, the use of SET-based designs in portable applications will either

require the fabrication of features with sizes approaching physical limits or the devel-

opment of compact, energy-efficient technologies permitting operation below ambient

temperature.

Chapter 6

Conclusions and Future Work

This chapter summarizes the proposed techniques and discusses possible directions

for future work.

6.1 Thesis Summary

This thesis proposes several techniques and algorithms, spanning system-level synthesis, a recently developed integration technology, and an emerging device technology, to address the power, thermal, and reliability problems of modern integrated circuit design.

Technology scaling and increasing power densities make IC lifetime reliability problems more severe. Lifetime reliability strongly depends on system-level architecture, redundancy, and the IC thermal profile during operation. To increase IC lifetime through thermal and structural redundancy optimization during system-level synthesis, a two-stage synthesis process has been proposed. A

potentially-slow but high-quality stochastic optimization algorithm is first used to


minimize solution area. Starting from this promising location in the solution space,

a reliability enhancement heuristic explores the area-MTTF tradeoff curve. The pro-

posed algorithm has been integrated into a system-level synthesis flow that conducts

architectural synthesis, floorplanning, on-chip network synthesis, chip-package ther-

mal analysis and reliability analysis. As indicated by our results, the proposed syn-

thesis system achieves 436% average system MTTF improvement with a maximum

area overhead of 25%. Compared with a one-phase stochastic optimization algorithm, the proposed synthesis system consistently produces solutions of equal or better quality while requiring less CPU time.

Several three-dimensional integration technologies have been proposed and developed to overcome the limitations of 2D technology: (1) 3D technology increases logic integration density significantly; (2) 3D technology reduces on-chip wire length, especially for global and semi-global wires. However, by stacking multiple device layers connected through inter-die vias, 3D integration increases the importance and difficulty of thermal management for the following reasons: (1) chip cross-sectional power density increases linearly with the number of vertically-stacked active circuit layers; (2) the interconnect and bonding layers used in 3D integration have low thermal conductivities, which further exacerbate thermal effects; (3) the high power density of 3D chips will frequently require operation at or near thermal limits; and (4) 3D chips have heterogeneous power and thermal characteristics, which challenge run-time thermal management. To investigate the run-time thermal management problem of 3D integrated circuits, we developed an analytical framework for 3D heat flow and proposed a proactive global power-thermal budgeting algorithm, a performance counter-based workload monitor, and distributed thermal control techniques. The proposed technique, called ThermOS and built upon the Linux 2.6.8 kernel, is a unified hardware and OS thermal management solution that maximizes thermally-safe 3D IC performance. The results indicate that proactive power-thermal budgeting allows a 30% improvement in instruction throughput compared to a state-of-the-art proactive thermal management approach. Evaluation results also indicate that the proposed technique has small performance overhead and good scalability.

Device researchers have anticipated the coming challenges for CMOS devices and evaluated alternative technologies. The International Technology Roadmap for Semiconductors projects that single-electron tunneling transistors (SETs) have the potential to achieve the lowest projected energy per switching event of any known device. To explore the potential use of SETs in low-power embedded systems, SET-based design was brought to the system level to characterize the impact of SETs on system design metrics and to evaluate their benefits and limitations. Based on the evaluation of architectural and circuit-level features, a fault-tolerant, reconfigurable, hybrid SET/CMOS architecture called IceFlex was proposed. The results indicate that a hybrid SET/CMOS architecture has the potential to improve energy efficiency in battery-powered and high-performance applications by two orders of magnitude compared with 22 nm CMOS while permitting operating frequencies that are as high or higher. Although SETs hold great promise, their practical use will require additional research into fault tolerance techniques, processing technologies, and novel circuit designs.

6.2 Future Work

The following research directions can be further pursued.


3D Thermal-Aware and Reliability-Aware Synthesis

Due to the additional constraints of stacking multiple device layers, the synthesis algorithms for 3D circuits are quite different from those for traditional planar integrated circuits. Besides the traditional optimization goals, such as performance, area, and interconnect latency, 3D synthesis also needs to address issues unique to 3D circuits, such as minimizing the number of inter-die vias [4, 23, 22]. Furthermore, the use of 3D integration magnifies power dissipation problems. Temperature-related concerns that can sometimes be safely ignored in 2D circuit design, such as temperature-induced performance or reliability degradation, become increasingly prominent in 3D integrated circuits. In addition, the dependence of leakage power consumption on temperature will further exacerbate thermal effects. These issues must be tackled during physical-level synthesis [47]. Moreover, in high-level and system-level synthesis, task assignment and scheduling need to be carefully designed to balance power consumption in the spatial and time domains, respectively [61]. Many challenges remain in developing EDA tools that explore the design space of 3D integrated circuits before the benefits of this new technology can be fully realized.

Thermal and Reliability Modeling for SET Circuits

Although single-electron tunneling transistors hold great promise, practical use of SETs will require additional research into thermal and reliability modeling for SET devices. Enabling operation at the desired temperature is a major concern for SETs. Fabricating SET islands of small enough size and capacitance to permit room-temperature operation is a major challenge. Researchers have proposed accurate chip-package thermal analysis techniques for use in IC synthesis and design [124, 125, 123, 45]. However, there is no established work on thermal modeling of SET devices that includes detailed thermal characterization and fast nanoscale thermal analysis methods. In addition, SET circuits are susceptible to logic errors resulting from a phenomenon called the random background charge effect. Fault tolerance must be carefully addressed when using SETs in system-level design. To do so, accurate SET fault probability estimation is required.

Energy Optimization at the Application Layer

This dissertation discussed power, thermal and reliability optimization at the

hardware and operating system layers. In the future, energy optimization at the

application layer can be explored, especially for applications running on battery-powered

portable devices. Personal, portable communication and computation devices

are now part of hundreds of millions of lives, often in the form of smartphones.

From Daniel Henderson’s 1993 prototype, Intellect, which could receive and display

images and video media [1], to the first photo taken by Philippe Kahn in 1997 using a

camera phone and shared instantly with more than 2,000 families [3], the functionality

and adoption of personal portable devices have continuously increased. Today’s

personal portable devices, such as the iPhone from Apple, BlackBerry from RIM,

and Android phones from Google, integrate many system functions, such as the

global positioning system (GPS), cameras, sensors, large touch screens, and easy-to-use

interfaces. Global mobile phone subscriptions reached 3.3 billion in 2007 [48].

Users are able to capture information anywhere and anytime. These devices are also

heavily used for information sharing and social interaction. In battery-powered mobile

systems, energy consumption is a primary design concern. A limited battery energy

budget forces hardware designers to use energy-efficient but slow microprocessors

and limited storage hardware. These constraints, in turn, limit the performance and

functionality of software applications running on portable devices. These challenges

need to be addressed during application design.

Bibliography

[1] American museum. http://americanhistory.si.edu/.

[2] Transistor count on Wikipedia. http://en.wikipedia.org/wiki/Transistor_count.

[3] Wikipedia. http://en.wikipedia.org/wiki/Philippe_Kahn/.

[4] Cristinel Ababei, Yan Feng, Brent Goplen, Hushrav Mogal, Tianpei Zhang,

Kia Bazargan, and Sachin Sapatnekar. Placement and routing in 3D integrated

circuits. IEEE Design & Test, 22(6):520–531, November 2005.

[5] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock rate vs. IPC:

The end of the road for conventional microarchitectures. In Proc. Int. Symp.

Computer Architecture, pages 276–283, June 2000.

[6] M. Ahlskog, R. Tarkiainen, L. Roschier, and P. Hakonen. Single-electron transis-

tor made of two crossing multiwalled carbon nanotubes and its noise properties.

Applied Physics Ltrs., 77:4037–4039, December 2000.

[7] AMD multi-core white paper. http://www.amd.com.

[8] ANSYS. http://www.ansys.com/.

[9] D. V. Averin and K. K. Likharev. Coulomb blockade of tunneling and coherent

oscillations in small tunnel junctions. J. Low Temperature Physics, 62:345–372,

February 1986.

[10] R. Iris Bahar, Joseph Mundy, and Jie Chen. A probabilistic-based design

methodology for nanoscale computation. In Proc. Int. Conf. Computer-Aided

Design, pages 480–486, November 2003.

[11] Nathan L. Binkert, Ronald G. Dreslinski, Lisa R. Hsu, Kevin T. Lim, Ali G.

Saidi, and Steven K. Reinhardt. The M5 simulator: Modeling networked systems.

IEEE Micro, 26(4):52–60, 2006.

[12] Bryan Black, Murali M. Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang,

Gabriel H. Loh, Don McCauley, Pat Morrow, Donald W. Nelson, Daniel Pantuso,

Paul Reed, Jeff Rupley, Sadasivan Shankar, John Shen, and Clair Webb. Die

stacking (3D) microarchitecture. In Proc. Int. Symp. Microarchitecture, pages

469–479, December 2006.

[13] K. A. Bowman, B. L. Austin, J. C. Eble, X. Tang, and J. D. Meindl. A physical

alpha-power law MOSFET model. IEEE J. Solid-State Circuits, 34:1410–1414,

October 1999.

[14] David Brooks and Margaret Martonosi. Dynamic thermal management for high-

performance microprocessors. In Proc. Int. Symp. High-Performance Computer

Architecture, pages 171–182, January 2001.

[15] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A framework

for architectural-level power analysis and optimizations. In Proc. Int. Symp.

Computer Architecture, pages 83–94, June 2000.

[16] K. R. Brown, L. Sun, and B. E. Kane. Electric-field-dependent spectroscopy

of charge motion using a single-electron transistor. Applied Physics Ltrs., 88,

May 2006.

[17] R. H. Chen. MOSES: a general Monte Carlo simulator for single-electron cir-

cuits. Meeting Abstracts, The Electrochemical Society, 96(2):576, October 1996.

[18] Yi-Kan Cheng, Ching-Chi Teng, Sung-Mo Kang, and Ching-Han Tsai. Elec-

trothermal Analysis of VLSI Systems. Cambridge University Press, 2000.

[19] Young-Kyun Cho and Yoon-Ha Jeong. Single-electron pass-transistor logic with

multiple tunnel junctions and its hybrid circuit with MOSFETs. ETRI J.,

26(6):669–672, December 2004.

[20] COMSOL Multiphysics. http://www.comsol.com/products/multiphysics/.

[21] A. K. Coskun, T. S. Rosing, K. Mihic, G. De Micheli, and Y. Leblebici. Analysis

and optimization of MPSoC reliability. J. Low Power Electronics, pages 56–69,

April 2006.

[22] Shamik Das, Anantha Chandrakasan, and Rafael Reif. Three-dimensional integrated

circuits: Performance, design methodology, and CAD tools. pages 13–18,

February 2003.

[23] Shamik Das, Andy Fan, Kuan-Neng Chen, and C. S. TanAnantha. Technol-

ogy, performance, and computer-aided design of three-dimensional integrated

circuits. In Proc. Int. Symp. Physical Design, pages 108–115, April 2004.

[24] Andre DeHon. Array-based architecture for FET-based nanoscale electronics.

IEEE Trans. Nanotechnology, 2(1):23–32, March 2003.

[25] Michel H. Devoret and Robert J. Schoelkopf. Amplifying quantum signals with

the single-electron transistor. Nature, 406:1039–1046, August 2000.

[26] Robert P. Dick. Multiobjective synthesis of low-power real-time distributed em-

bedded systems. PhD thesis, Dept. of Electrical Engineering, Princeton Univer-

sity, July 2002.

[27] Robert P. Dick, David L. Rhodes, and Wayne Wolf. TGFF: task graphs for

free. In Proc. Int. Wkshp. Hardware/Software Co-Design, pages 97–101, March

1998.

[28] James Donald and Margaret Martonosi. Techniques for multicore thermal man-

agement: Classification and new exploration. In Proc. Int. Symp. Computer

Architecture, June 2006.

[29] M. S. Dresselhaus, G. Dresselhaus, and Phaedon Avouris. Carbon Nanotubes.

Springer-Verlag, Germany, February 2001.

[30] Petru Eles, Zebo Peng, Krzysztof Kuchcinski, and Alexa Doboli. System level

hardware/software partitioning based on simulated annealing and tabu search.

ACM Trans. Design Automation Electronic Systems, 2:5–32, January 1997.

[31] Embedded microprocessor benchmark consortium. http://www.eembc.org.

[32] David K. Ferry and Stephen M. Goodnick. Transport in Nanostructures. Cam-

bridge University Press, 1997.

[33] T. A. Fulton and G. J. Dolan. Observation of single-electron charging effects

in small tunnel junctions. Physical Review Ltrs., 59:109–112, July 1987.

[34] M. Furlan and S. V. Lotkhov. Electrometry on charge traps with a single-electron

transistor. Physical Rev. B, 67:205313, 2003.

[35] A. K. Geim and K. S. Novoselov. The rise of graphene. Nature Materials,

6:183–191, March 2007.

[36] M. Glaß, M. Lukasiewycz, T. Streichert, C. Haubelt, and J. Teich. Reliability-

aware system synthesis. In Proc. Design, Automation & Test in Europe Conf.,

April 2007.

[37] Seth Copen Goldstein and Mihai Budiu. Nanofabrics: spatial computing using

molecular electronics. In Proc. Int. Symp. Computer Architecture, pages 178–

189, June 2001.

[38] Zhenyu (Peter) Gu, Changyun Zhu, Li Shang, and Robert P. Dick. Application-

specific MPSoC reliability optimization. IEEE Trans. VLSI Systems, 16(5),

May 2008.

[39] Michael Healy, Mario Vittes, Mongkol Ekpanyapong, Chinnakrishna Ballapu-

ram, Sung Kyu Lim, Hsien-Hsin S. Lee, and Gabriel H. Loh. Multi-objective

microarchitectural floorplanning for 2D and 3D ICs. IEEE Trans. Computer-Aided

Design of Integrated Circuits and Systems, 26(1):38–52, January 2007.

[40] James R. Heath and Mark A. Ratner. Molecular electronics. Physics Today,

56:43–49, May 2003.

[41] C. P. Heij, P. Hadley, and J. E. Mooij. Single-electron inverter. Applied Physics

Ltrs., 78:1140–1142, 2001.

[42] Jorg Henkel and Rolf Ernst. A hardware/software partitioner using a dynami-

cally determined granularity. In Proc. Design Automation Conf., pages 691–696,

June 1997.

[43] Seongmoo Heo, Kenneth Barr, and Krste Asanovic. Reducing power density

through activity migration. In Proc. Int. Symp. Low Power Electronics & De-

sign, pages 217–222, August 2003.

[44] J. Hou and Wayne Wolf. Process partitioning for distributed embedded systems.

In Proc. Int. Wkshp. Hardware/Software Co-Design, pages 70–76, March 1996.

[45] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M.R.

Stan. HotSpot: A compact thermal modeling methodology for early-stage VLSI

design. IEEE Trans. VLSI Systems, 14(5):501–524, May 2006.

[46] Yu Huang, Xiangfeng Duan, Yi Cui, Lincoln J. Lauhon, Kyoung-Ha Kim, and

Charles M. Lieber. Logic gates and computation from assembled nanowire

building blocks. Science, 294(5545):1313–1317, November 2001.

[47] W.-L. Hung, G. M. Link, Y. Xie, N. Vijaykrishnan, and M. J. Irwin. Inter-

connect and thermal-aware floorplanning for 3D microprocessors. In Proc. Int.

Symp. Quality of Electronic Design, pages 98–104, March 2006.

[48] Global mobile forecast to 2012. In Informa Telecomms & Media Report, Novem-

ber 2007.

[49] Hiroshi Inokawa and Yasuo Takahashi. A compact analytical model for asym-

metric single-electron tunneling transistors. IEEE Trans. Electron Devices,

50(2):455–461, February 2003.

[50] Intel multi-core processor architecture. http://www.intel.com.

[51] Canturk Isci, Alper Buyuktosunoglu, Chen-Yong Cher, Pradip Bose, and Mar-

garet Martonosi. An analysis of efficient multi-core global power management

policies: Maximizing performance for a given power budget. In Proc. Int. Symp.

Microarchitecture, pages 78–88, December 2006.

[52] Canturk Isci and Margaret Martonosi. Runtime power monitoring in high-end

processors: Methodology and empirical data. In Proc. Int. Symp. Microarchi-

tecture, pages 93–104, December 2003.

[53] International Technology Roadmap for Semiconductors, 2006. http://public.

itrs.net/.

[54] J. McGregor. x86 power and thermal management. In Microprocessor Report,

December 2004.

[55] Joint Electron Device Engineering Council. Failure mechanisms and models for

semiconductor devices. In JEDEC Publication JEP 122-B, August 2003.

[56] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 chip: a dual-core

multithreaded processor. IEEE Micro, 24(2):40–47, 2004.

[57] Taeho Kgil, Shaun D’Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski,

Trevor Mudge, Steven Reinhardt, and Krisztian Flautner. PicoServer: using

3D stacking technology to enable a compact energy efficient chip multiproces-

sor. In Proc. Int. Conf. Architectural Support for Programming Languages and

Operating Systems, pages 117–128, October 2006.

[58] Jongman Kim, Chrysostomos Nicopoulos, Dongkook Park, Reetuparna Das,

Yuan Xie, Vijaykrishnan Narayanan, Mazin S. Yousif, and Chita R. Das. A

novel dimensionally-decomposed router for on-chip communication in 3D archi-

tectures. In Proc. Int. Symp. Computer Architecture, June 2007.

[59] Masaharu Kirihara, Kazuo Nakazato, and Mathias Wagner. Hybrid circuit

simulator including a model for single electron tunneling devices. Japanese J.

of Applied Physics, 38(4A), April 1999.

[60] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded

SPARC processor. IEEE Micro, 25(2):21–29, 2005.

[61] Vyas Krishnan and Srinivas Katkoori. A 3D-layout aware binding algorithm

for high-level synthesis of three-dimensional integrated circuits. In Proc. Int.

Symp. Quality of Electronic Design, pages 885–892, March 2007.

[62] V. A. Krupenin, D.E. Presnov, A.B. Zorin, and J. Niemeyer. Aluminum single

electron transistors with islands isolated from a substrate. J. of Low Tempera-

ture Physics, 118(5/6), December 1999.

[63] Amit Kumar, Li Shang, Li-Shiuan Peh, and Niraj K. Jha. HybDTM: a coordi-

nated hardware-software approach for dynamic thermal management. In Proc.

Design Automation Conf., pages 548–553, July 2006.

[64] Choonseung Lee and Soonhoi Ha. Hardware-software cosynthesis of multitask

MPSoCs with real-time constraints. In Proc. Int. Conf. ASIC, pages 919–924,

October 2005.

[65] Feihui Li, Chrysostomos Nicopoulos, Thomas Richardson, Yuan Xie, Vijaykr-

ishnan Narayanan, and Mahmut Kandemir. Design and management of 3D

chip multiprocessors using network-in-memory. In Proc. Int. Symp. Computer

Architecture, pages 130–141, June 2006.

[66] Man-Lap Li, Ruchira Sasanka, Sarita V. Adve, Yen-Kuang Chen, and Eric

Debes. The ALPbench benchmark suite for complex multimedia applications.

In Int. Symp. Workload Characterization, pages 34–35, October 2005.

[67] Peng Li, Yangdong Deng, and Lawrence T. Pileggi. Temperature-dependent

optimization of cache leakage power dissipation. In Proc. Int. Conf. Computer

Design, October 2005.

[68] Yingmin Li, David Brooks, Zhigang Hu, and Kevin Skadron. Performance,

energy, and thermal considerations for SMT and CMP architectures. In Proc.

Int. Symp. Computer Architecture, pages 71–82, February 2005.

[69] Yingmin Li, Benjamin Lee, David Brooks, Zhigang Hu, and Kevin Skadron.

CMP design space exploration subject to physical constraints. In Proc. Int.

Symp. High-Performance Computer Architecture, pages 17–28, February 2006.

[70] Konstantin K. Likharev. Single-electron devices and their applications. Proc.

IEEE, 87(4):606–632, April 1999.

[71] G. M. Link and N. Vijaykrishnan. Thermal trends in emerging technologies. In

Proc. Int. Symp. Quality of Electronic Design, pages 625–632, March 2006.

[72] Gian Luca Loi, Banit Agrawal, Navin Srivastava, Sheng-Chih Lin, Timothy

Sherwood, and Kaustav Banerjee. A thermally-aware performance analysis of

vertically integrated (3-D) processor-memory hierarchy. In Proc. Design Automation

Conf., pages 991–996, July 2006.

[73] S. Mahapatra, V. Vaish, C. Wasshuber, and K. Banerjee. Analytical modelling

of single electron transistor (SET) for hybrid CMOS-SET analog IC design.

IEEE Trans. Electron Devices, 51(11):1772–1782, June 2004.

[74] Arindam Mallik, Jack Cosgrove, Robert P. Dick, Gokhan Memik, and Peter

Dinda. PICSEL: Measuring user-perceived performance to control dynamic

frequency scaling. In Proc. Int. Conf. Architectural Support for Programming

Languages and Operating Systems, March 2008.

[75] K. Matsumoto, M. Ishii, K. Segawa, and Y. Oka. Room temperature opera-

tion of a single electron transistor made by the scanning tunneling microscope

nanooxidation process for the TiOx/Ti system. Applied Physics Ltrs., 68(1):34–

36, January 1996.

[76] Ulla Miekkala. Graph properties for splitting with grounded Laplacian matrices.

BIT Numerical Mathematics, pages 485–495, September 1993.

[77] A. Mishra and P. Banerjee. An algorithm-based error detection scheme for the

multigrid method. IEEE Trans. Computers, 52(9):1089–1099, September 2003.

[78] Gordon E. Moore. Cramming more components onto integrated circuits. Elec-

tronics, 38(8):82–85, April 1965.

[79] F. Nakajima, Y. Miyoshi, J. Motohisa, and T. Fukui. Single-electron

AND/NAND logic circuits based on a self-organized dot network. Applied

Physics Ltrs., 83(13):2680–2682, September 2003.

[80] Y. Nakamura, C. D. Chen, and J. S. Tsai. 100-K operation of Al-based single-electron

transistors. Japanese J. Applied Physics, 35:1465–1467, November

1996.

[81] Umit Y. Ogras and Radu Marculescu. Energy- and performance- driven NoC

communication architectures synthesis using a decomposition approach. In

Proc. Design, Automation & Test in Europe Conf., pages 352–357, March 2005.

[82] Y. Ono, Y. Takahashi, K. Yamazaki, M. Nagase, H. Namatsu, K. Kurihara,

and K. Murase. Si complementary single-electron inverter. IEDM Technology

Dig., pages 367–370, 1999.

[83] Soyeon Park, Weihang Jiang, Yuanyuan Zhou, and Sarita Adve. Managing

energy-performance tradeoffs for multi-threaded applications. In Proc. Int.

Conf. on Measurement and Modeling of Computer Systems, pages 169–180,

June 2007.

[84] Yu A. Pashkin, Y. Nakamura, and J. S. Tsai. Room-temperature Al single-

electron transistor made by electron-beam lithography. Applied Physics Ltrs.,

76(16):2256–2258, April 2000.

[85] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle,

A. Kameyama, J. Keaty, Y. Massubuchi, M. Riley, D. Shippy, D. Stasiak,

M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and

K. Yazawa. The design and implementation of a first-generation CELL proces-

sor. In Proc. Int. Solid-State Circuits Conf., pages 49–52, February 2007.

[86] Aashish Phansalkar, Ajay Joshi, Lieven Eeckhout, and Lizy K. John. Measuring

program similarity: Experiments with SPEC CPU benchmark suites. In Proc.

Int. Symp. on Performance Analysis of Systems and Software, pages 10–20,

March 2005.

[87] M. D. Powell, M. Gomaa, and T. N. Vijaykumar. Heat-and-run: Leveraging

SMT and CMP to manage power density through the operating system. In

Proc. Int. Conf. Architectural Support for Programming Languages and Oper-

ating Systems, pages 260–270, November 2004.

[88] S. Prakash and A. Parker. SOS: Synthesis of application-specific heterogeneous

multiprocessor systems. J. Parallel & Distributed Computing, 16:338–351, De-

cember 1992.

[89] Kiran Puttaswamy and Gabriel H. Loh. Thermal analysis of a 3D die-stacked

high-performance microprocessor. In Proc. Great Lakes Symp. VLSI, pages

19–24, May 2006.

[90] Kiran Puttaswamy and Gabriel H. Loh. Thermal herding: Microarchitecture

techniques for controlling hotspots in high-performance 3D-integrated processors.

In Proc. Int. Symp. High-Performance Computer Architecture, pages 193–204,

February 2007.

[91] Jan M. Rabaey. Digital Integrated Circuits. Prentice-Hall, NJ, 1998.

[92] Shad Roundy, Paul K. Wright, and Jan Rabaey. A study of low level vibra-

tions as a power source for wireless sensor nodes. Computer Communications,

26:1131–1144, October 2003.

[93] Takayasu Sakurai. A JSSC classic paper: The simple model of CMOS drain

current. IEEE Solid State Circuits Society Quarterly Newsletter, pages 4–5,

October 2004.

[94] Eric C. Samson, Sridhar V. Machiroutu, Je-Young Chang, Ishmael Santos, Jim

Hermerding, Ashay Dani, Ravi Prasher, and David W. Song. Interface material

selection and a thermal management technique in second-generation platforms

built on Intel Centrino mobile technology. Intel Technology J., 09(1):75–86,

February 2005.

[95] Samsung. http://www.samsung.com/.

[96] K. Sankaralingam, R. Nagarajan, H. Liu, J. Huh, C. K. Kim, D. Burger, S. W.

Keckler, and C. R. Moore. Exploiting ILP, TLP, and DLP using polymorphism

in the TRIPS architecture. In Proc. Int. Symp. Computer Architecture, pages

422–433, June 2003.

[97] Oleg Semenov, Arman Vassighi, Manoj Sachdev, Ali Keshavarzi, and C. F.

Hawkins. Effect of CMOS technology scaling on thermal management during

burn-in. 16:686–695, November 2003.

[98] J.-I. Shirakashi, K. Matsumoto, N. Miura, and M. Konagai. Single-electron

charging effects in Nb/Nb oxide-based single-electron transistors at room tem-

perature. Applied Physics Ltrs., 72(15):1893–1895, April 1998.

[99] Kevin Skadron, Mircea R. Stan, Wei Huang, Sivakumar Velusamy, Karthik

Sankaranarayanan, and David Tarjan. Temperature-aware microarchitecture.

In Proc. Int. Symp. Computer Architecture, pages 2–13, June 2003.

[100] SPLASH2 website. http://www-flash.stanford.edu/apps/SPLASH/.

[101] R. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro, 22(4):72–

82, 2002.

[102] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers. The impact of technology

scaling on lifetime reliability. In Proc. International Conf. Dependable Systems

and Networks, pages 177–186, June 2004.

[103] Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. Exploit-

ing structural duplication for lifetime reliability enhancement. In Proc. Int.

Symp. Computer Architecture, pages 520–531, June 2005.

[104] Chong Sun, Li Shang, and Robert P. Dick. Three-dimensional multi-processor

system-on-chip thermal optimization. In Proc. Int. Conf. Hardware/Software

Codesign and System Synthesis, pages 117–122, October 2007.

[105] X. Tang, X. Baie, V. Bayot, F. Van de Wiele, and J. P. Colinge. An SOI single-

electron transistor. In Proc. Silicon-on-Insulator Conf., pages 46–47, October

1999.

[106] David Tarjan, Shyamkumar Thoziyoor, and Norman P. Jouppi. CACTI 4.0.

Technical report, HP Laboratories, June 2006.

[107] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt,

Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota,

Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Ama-

rasinghe, and Anant Agarwal. Evaluation of the raw microprocessor: An

exposed-wire-delay architecture for ILP and streams. In Proc. Int. Symp. Com-

puter Architecture, June 2004.

[108] Tezzaron. http://www.tezzaron.com/technology/FaStack.htm.

[109] A. W. Topol, D. C. La Tulipe Jr., L. Shi, D. J. Frank, K. Bernstein, S. E. Steen,

A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini, and M. Ieong. Three-

dimensional integrated circuits. IBM J. Research and Development, 4:491–506,

2006.

[110] Y. Tsai, Y. Xie, N. Vijaykrishnan, and M. J. Irwin. Three-dimensional cache

design exploration using 3DCacti. In Proc. Int. Conf. Computer Design, pages

519–524, October 2005.

[111] J. R. Tucker. Complementary digital logic based on the Coulomb blockade. J.

Applied Physics, 72(9):4399–4413, 1992.

[112] K. Uchida, J. Koga, R. Ohba, and A. Toriumi. Programmable single-electron

transistor logic for future low-power intelligent LSI: proposal and room-temperature

operation. IEEE Trans. Electron Devices, 50(7):1623–1630, July 2003.

[113] Ken Uchida, Kazuya Matsuzawa, Junji Koga, Ryuji Ohba, Shin-ichi Takagi, and

Akira Toriumi. Analytical single-electron transistor (SET) model for design and

analysis of realistic SET circuits. Japanese J. Applied Physics, 39:2321–2324,

April 2000.

[114] Srinivas Vanapalli, Michael Lewis, Zhihua Gan, and Ray Radebaugh. 120 Hz

pulse tube cryocooler for fast cooldown to 50 K. Applied Physics Ltrs., 90,

February 2007.

[115] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan,

P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and

N. Borkar. An 80-tile 1.28TFLOPS networks-on-chip in 65nm CMOS. In Proc.

Int. Solid-State Circuits Conf., February 2007.

[116] Ram Viswanath, Vijay Wakharkar, Abhay Watwe, and Vassou Lebonheur.

Thermal performance challenges from silicon to systems. Intel Technology J.,

04(3):1–16, August 2000.

[117] C. Wasshuber, H. Kosina, and S. Selberherr. A single-electron device and cir-

cuit simulator. IEEE Trans. Computer-Aided Design of Integrated Circuits and

Systems, 16:937–944, September 1997.

[118] C. Wasshuber, H. Kosina, and S. Selberherr. A comparative study of single

electron memories. IEEE Trans. Electron Devices, 45:2365–2371, November

1998.

[119] Henning Wolf, Franz Josef Ahlers, J. Niemeyer, Hansjorg Scherer, Thomas

Weimann, Alexander B. Zorin, Vladimir A. Krupenin, Sergey V. Lotkhov, and

Denis E. Presnov. Investigation of the offset charge noise in single electron tun-

neling devices. Trans. on Instrumentation and Measurement, 46(2):303–306,

April 1997.

[120] Y. Xie, L. Lu, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin. Reliability-

aware co-synthesis for embedded systems. In Proc. Int. Conf. Application-

Specific Systems, Architectures, and Processors, September 2004.

[121] Xilinx XPower. http://www.xilinx.com.

[122] K. K. Yadavalli, A. O. Orlov, G. L. Snider, and A. N. Korotkov. Single electron

memory devices: toward background charge insensitive operation. J. Vacuum

Science Technology B Microelectronics and Nanometer Structures, 21:2860–

2864, 2003.

[123] Yonghong Yang, Zhenyu (Peter) Gu, Changyun Zhu, Robert P. Dick, and

Li Shang. ISAC: Integrated space and time adaptive chip-package thermal anal-

ysis. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems,

January 2007.

[124] Yonghong Yang, Zhenyu (Peter) Gu, Changyun Zhu, Li Shang, and Robert P.

Dick. Adaptive chip-package thermal analysis for synthesis and design. In Proc.

Design, Automation, and Test in Europe, pages 844–849, March 2006.

[125] Yonghong Yang, Changyun Zhu, Zhenyu (Peter) Gu, Li Shang, and Robert P.

Dick. Adaptive multi-domain thermal modeling and analysis for integrated

circuit synthesis and design. In Proc. Int. Conf. Computer-Aided Design, pages

575–582, November 2006.

[126] K. Yano, T. Ishii, T. Hashimoto, T. Kobayashi, F. Murai, and K. Seki. Room-

temperature single-electron memory. IEEE Trans. Electron Devices, 41:1628–

1638, September 1994.

[127] Ti-Yen Yen. Hardware-Software Co-Synthesis of Distributed Embedded Systems.

PhD thesis, Dept. of Electrical Engg., Princeton University, June 1996.

[128] Y. S. Yu, S. W. Hwang, and D. Ahn. Transient modelling of single-electron tran-

sistors for efficient circuit simulation by SPICE. Electronics Ltrs., 152(6):691–

696, December 2005.

[129] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeak-

age: A temperature-aware model of subthreshold and gate leakage for architects.

Technical report, Univ. of Virginia, May 2003. CS-2003-05.

[130] W. Zhao and Y. Cao. New generation of predictive technology model for sub-

45nm design exploration. In Proc. Int. Symp. Quality of Electronic Design,

pages 585–590, March 2006.

[131] Changyun Zhu, Zhenyu Gu, Li Shang, Robert P. Dick, and Russ Joseph. Run-

time thermal management of three-dimensional chip multiprocessors. In Proc.

Wkshp. Quality-Aware Design, June 2008. Invited paper.

[132] Changyun Zhu, Zhenyu (Peter) Gu, Robert P. Dick, and Li Shang. Reliable

multiprocessor system-on-chip synthesis. In Proc. Int. Conf. Hardware/Software

Codesign and System Synthesis, pages 239–244, October 2007.

[133] Changyun Zhu, Zhenyu (Peter) Gu, Robert P. Dick, Li Shang, and Robert

Knobel. Characterization of Single-Electron Tunneling Transistors for Design-

ing Low-Power Embedded Systems. IEEE Trans. VLSI Systems, 17(5), May

2009.

[134] Changyun Zhu, Zhenyu (Peter) Gu, Li Shang, Robert P. Dick, and Russ Joseph.

Three-dimensional chip-multiprocessor run-time thermal management. IEEE

Trans. Computer-Aided Design of Integrated Circuits and Systems, 27(8), Au-

gust 2008.

[135] Changyun Zhu, Zhenyu (Peter) Gu, Li Shang, Robert P. Dick, and Robert

Knobel. Towards an ultra-low-power architecture using single-electron tunnel-

ing transistors. In Proc. Design Automation Conf., pages 312–317, June 2007.

[136] N. M. Zimmerman, W. H. Huber, A. Fujiwara, and Y. Takahashi. Excellent

charge offset stability in Si-based SET transistors. In Proc. Precision Electro-

magnetic Measurements, pages 124–125, November 2002.

[137] N. M. Zimmerman, W. H. Huber, A. Fujiwara, and Y. Takahashi. Excellent

charge offset stability in a Si-based single-electron tunneling transistor. Applied

Physics Ltrs., 79:3186–3190, 2002.
