IEEE TRANSACTIONS ON COMPUTERS, VOL. X, NO. X, MONTH YEAR 1

A Survey of Lifetime Reliability-Aware System-Level Design Techniques for Embedded Multiprocessor Systems

Anup Das, Student Member, IEEE, Akash Kumar, Senior Member, IEEE, and Bharadwaj Veeravalli, Senior Member, IEEE

Abstract—Lifetime reliability is emerging as a major concern for system design, as escalating power density and the resulting temperature variation continue to accelerate wear-out, leading to a growing prominence of device defects. This has attracted significant attention in both industry and academia to investigate wear-out mitigation techniques, from micro-architectural adaptations to system-level optimization. Task mapping and scheduling-based system-level design techniques provide a low-overhead approach for reliability optimization at design-time as well as reliability management at run-time. This paper provides an overview of the developments in reliability optimization using task mapping and scheduling over the past decade, since the introduction of the first reliability-aware system-level design technique in 2004.

1 INTRODUCTION

Technology scaling principles introduced in [1] have revolutionized the semiconductor industry, providing a roadmap for systematic and predictable improvement in transistor density, switching speed and power dissipation with scaled transistor feature size. Ever since the principles were introduced, the transistor feature size has reduced by a factor of √2 with every new technology generation, starting from a transistor gate length of 1 µm in 1974 [1] to 45 nm in 2007 [2]. However, at deep sub-micron technology nodes (especially 65 nm and lower), sub-threshold leakage and non-ideal voltage and gate oxide scaling are causing the power dissipation to increase with shrinking transistor feature size (contrary to Dennard's principles [1]), challenging further technology scaling [3]–[5]. Increasing power density in modern integrated circuits causes localization of heat (hot spots) and a corresponding increase in temperature. This increase in chip temperature accelerates wear-out of the semiconductor devices [6]–[8], which challenges the reliable operation of integrated circuits and reduces their useful life. In this context, lifetime reliability is defined as the long-term reliability of a circuit and is measured in terms of the useful life or mean time to failure of the circuit.

With the increasing design cost and complexity of integrated circuits, semiconductor industries have shifted toward design and re-use methodologies collectively referred to as system-on-chip (SoC) design. To accommodate the ever increasing demands of performance and application support and to address scalability, multiple processing cores are integrated together on a single SoC to form multiprocessor systems-on-chip (MPSoCs), or multiprocessor systems in short. As for other integrated circuits, lifetime reliability is an emerging threat for multiprocessor systems in current and future technology nodes. This has attracted significant attention in both industry and academia to control the thermal acceleration of the wear-out of the processing cores. The existing wear-out mitigation techniques can be classified into three categories – micro-architecture oriented techniques, compiler-directed techniques and system-level design techniques.

• A. Das, A. Kumar and B. Veeravalli are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, 117583. E-mail: {akdas, akash, elebv}@nus.edu.sg

The gate-level micro-architecture techniques involve CMOS device adaptations for wear-out mitigation. Examples include the use of the adaptive body biasing technique [9], the 22 nm tri-gate transistor architecture [10] and device-geometry aware design rules [11]. A summary of the design challenges and wear-out mitigation techniques at the micro-architecture level is provided in [12]. The compiler-directed techniques deal with processor instruction adaptations and scheduling for minimizing different wear-out mechanisms. Examples include intelligent instruction scheduling [13], exploitation of instruction timing slacks [14] and compiler-directed register assignment [15].

Finally, the system-level design techniques work at a higher layer of abstraction, at the operating system level, dealing with intelligent application mapping and scheduling on multiprocessor systems to improve the lifetime reliability. These techniques can be further classified into design-time and run-time based approaches, depending on whether the reliability optimization is performed statically at system design-time or dynamically during application execution. This paper provides a brief survey of the recent advances in system-level design techniques for minimizing temperature-related wear-out in multiprocessor systems.

The remainder of this paper is organized as follows. An overview of the application and architecture modeling for system-level design optimization is provided in Section 2. The different wear-out phenomena and the associated fault modeling are provided in Section 3. This is followed by the two system-level reliability-aware design techniques in Sections 4 and 5, respectively. Finally, the paper is concluded in Section 6 with an overview of future research directions.

Fig. 1. Memory-based multiprocessor classification: (a) Shared Memory, (b) Distributed Memory, (c) Private Memory.

2 APPLICATION AND ARCHITECTURE MODEL

In most system-level optimization approaches, the application is abstracted as a directed graph with nodes representing tasks and edges representing dependencies. This section provides a brief overview of two of the most popular application abstraction models used in system-level optimization techniques – directed acyclic graphs (DAGs) and synchronous data flow graphs (SDFGs) – along with an introduction to the different types of multiprocessor systems used in existing research studies.

2.1 Directed Acyclic Graphs

Directed acyclic graphs (DAGs) are the most common abstraction model for applications used for reliability optimization of multiprocessor systems. The DAG of an application consists of a set of nodes representing the tasks of the application and a set of edges representing data dependencies among the tasks. The execution time of a node represents the worst-case execution time of the corresponding task. However, there are studies that consider probabilistic execution times of tasks [16], [17].
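As an illustration of the DAG abstraction, the following minimal Python sketch stores a task graph as a dictionary of worst-case execution times plus a list of dependency edges, and computes the length of its critical path. The task names, execution times and function names are hypothetical and are not taken from any of the surveyed works.

```python
from collections import deque

# Hypothetical task graph: worst-case execution times (ms) and data dependencies.
wcet = {"A": 4, "B": 3, "C": 2, "D": 5}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]

def topological_order(nodes, edges):
    """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(indeg):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order

def critical_path_length(wcet, edges):
    """Longest WCET-weighted path, a lower bound on any schedule length."""
    preds = {n: [] for n in wcet}
    for u, v in edges:
        preds[v].append(u)
    finish = {}
    for n in topological_order(wcet, edges):
        start = max((finish[p] for p in preds[n]), default=0)
        finish[n] = start + wcet[n]
    return max(finish.values())
```

The critical path is only a lower bound on the schedule length; an actual mapping onto a limited number of cores can be longer.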

2.2 Synchronous Data Flow Graphs

Synchronous Data Flow Graphs (SDFGs, see [18]) are often used for modeling modern DSP applications and for designing concurrent multimedia applications implemented on a multiprocessor system-on-chip. Both pipelined streaming and cyclic dependencies between tasks can be easily modeled in SDFGs. SDFGs allow analysis of a system in terms of throughput and other performance properties, e.g. latency and buffer requirements [19]. The nodes of an SDFG are called actors; they represent functions that are computed by reading tokens (data items) from their input ports and writing the results of the computation as tokens on the output ports. The edges in the graph, called channels, represent dependencies among different actors. SDFGs support the modeling of cyclic task dependencies, multi-input tasks, multi-rate tasks and pipelined execution.

TABLE 1. Comparison between homogeneous and heterogeneous multiprocessor systems.

| | Homogeneous | Heterogeneous |
|---|---|---|
| Advantages | Less replication effort, highly scalable | Application specific, high computation efficiency, low power consumption |
| Limitations | Moderate computation efficiency, high power consumption | Less flexible, less scalable |
| Compatibility | Data parallelism, shared memory architecture, static and dynamic task mapping | Task parallelism, message passing interface, static task mapping |
| Examples | Philips' WASABI [20] | Texas Instruments' OMAP [21] |

2.3 Multiprocessor Systems

Multiprocessor architectures considered in this literature survey can be classified into two categories – static and reconfigurable.

2.3.1 Static Multiprocessor Systems

The static multiprocessor systems consist of processing cores interconnected using a communication medium such as a bus or a network-on-chip. Static multiprocessor architectures differ according to the memory distribution and the processing core types. The memory-based classification is shown in Figure 1. Further discussion on this is omitted in the remainder of this paper, as most of the existing techniques provide little to no description of the type of system being considered. Instead, we focus on the classification based on processing core type. The cores of a multiprocessor system can be of the same type (i.e. homogeneous) or can differ from one another in processing power or energy consumption (i.e. heterogeneous). The advantages and limitations of both these types are outlined in Table 1.

2.3.2 Reconfigurable Multiprocessor Systems

An emerging trend in multiprocessor design is to integrate reconfigurable logic such as field programmable gate arrays (FPGAs) along with homogeneous or heterogeneous cores [22]. These systems allow implementation of custom instructions [23] or custom logic [24] specific to the application being executed. Examples include the Xilinx Zynq architecture [25].

2.4 Run-time System Requirements

For design-time approaches, the multiprocessor system is not available (see footnote 1) and therefore reliability optimization at design-time is often performed on a multiprocessor system simulator. On the other hand, for dynamic reliability management, most of the existing techniques consider real multiprocessor systems. Since transistor wear-out is temperature dependent, we provide a brief comparison of the different temperature and wear-out measurement techniques for real systems.

Footnote 1: The objective of the design-time approaches is to provide reliability margin for designing the platform.

TABLE 2. Summary of the run-time temperature measurement approaches.

| Temperature measurement approach | Advantages and limitations |
|---|---|
| Simulator such as HotSpot & 3D-ICE | High simulation time |
| Solving RC thermal model | High simulation time |
| Thermal gun or infra-red thermography | Limited accuracy to capture the peak/average temperature and thermal cycling |
| Performance counter | Limited accuracy to model the temperature contribution of the memory controller |
| Thermal and wear-out sensors | High cost, limited response time to monitor temperature |

Temperature measurement for the available multiprocessor systems can be performed in the following ways:

1. temperature simulation using thermal simulators such as HotSpot [26] and 3D-ICE [27];

2. temperature estimation by solving the thermal RC model directly;

3. using an external thermal gun or infrared thermography for temperature measurement [28];

4. using performance counters for estimating temperature [29], [30]; and

5. using on-board thermal and wear-out sensors [31], [32].

Table 2 summarizes the key features of these techniques for adoption in run-time systems.

3 WEAR-OUT RELATED FAILURE MODELING

This section reviews the failure modeling of multiprocessor systems, which forms the basis for comparing the existing works on reliability optimization. The multiprocessor system-level reliability modeling is dependent on the core-level reliability modeling, which in turn is dependent on the device-level reliability modeling. These are discussed next.

3.1 Device-Level Reliability Modeling

The following are the dominant wear-out related failure mechanisms for semiconductor devices.

3.1.1 Electromigration

Electromigration refers to the movement of metal atoms from the interconnect wires and vias due to the flow of current, temperature gradient and electric diffusion, causing open and short circuits in the interconnect. The mean time to failure (MTTF) due to electromigration is given by the following equation [33], [34].

$$\mathrm{MTTF}_{EM} = \frac{A_{EM}}{J^{n}} \exp\left(\frac{E_{a_{EM}}}{KT}\right) \tag{1}$$

where $A_{EM}$ is a material-dependent constant, $J$ is the current density, $n$ is an empirically determined constant with a typical value of 2 for stress related failures, $E_{a_{EM}}$ is the activation energy of electromigration, $K$ is Boltzmann's constant and $T$ is the temperature. The current density $J$ is determined as

$$J = \frac{f \cdot C \cdot V_{dd} \cdot P_t}{W \cdot H} \tag{2}$$

where $C$ is the parasitic capacitance, $V_{dd}$ is the supply voltage, $f$ is the clock frequency, $P_t$ is the probability of line toggling in a clock cycle, $W$ is the width of the metal line and $H$ is the thickness of the metal line.
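Equations (1) and (2) translate directly into code. The Python sketch below is illustrative only: the default values for $A_{EM}$ and $E_{a_{EM}}$ are placeholders rather than calibrated constants, and the function names are hypothetical. Because $A_{EM}$ is usually unknown, the ratio of two MTTF values at different temperatures is often more useful than the absolute value, so a helper for that ratio is included.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def current_density(f, c, vdd, p_toggle, width, height):
    """Equation (2): J = f * C * Vdd * Pt / (W * H)."""
    return f * c * vdd * p_toggle / (width * height)

def mttf_em(j, temp_k, a_em=1.0, n=2.0, ea_em=0.9):
    """Equation (1): MTTF_EM = A_EM / J^n * exp(Ea_EM / (K * T)).

    a_em and ea_em (in eV) are placeholder values, not calibrated constants.
    """
    return a_em / j ** n * math.exp(ea_em / (BOLTZMANN_EV * temp_k))

def em_acceleration(temp_low_k, temp_high_k, ea_em=0.9):
    """MTTF(T_low) / MTTF(T_high) at equal current density; A_EM and J cancel."""
    return (mttf_em(1.0, temp_low_k, ea_em=ea_em)
            / mttf_em(1.0, temp_high_k, ea_em=ea_em))
```

With $n = 2$, doubling the current density at a fixed temperature reduces the electromigration MTTF to one quarter.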

3.1.2 Negative Bias Temperature Instability

Negative bias temperature instability refers to the shift in the threshold voltage and saturation current in p-channel MOS (PMOS) devices after an extended period of stress caused by the application of a negative voltage across the gate to channel region [35], [36]. The MTTF due to negative bias temperature instability is given by the following equation [37]

$$\mathrm{MTTF}_{NBTI} = \frac{A_{NBTI}}{(V_{GS})^{\gamma}} \exp\left(\frac{E_{a_{NBTI}}}{KT}\right) \tag{3}$$

where $A_{NBTI}$ is a constant dependent on the fabrication process, $V_{GS}$ is the gate voltage, $\gamma$ is the voltage acceleration factor and $E_{a_{NBTI}}$ is the activation energy of negative bias temperature instability.

3.1.3 Hot Carrier Injection

There are three types of hot carrier injection – channel hot electron injection, drain avalanche hot carrier injection and secondary generated hot electron injection. Channel hot electron injection refers to the escape of electrons from the channel, causing a degradation of the Si-SiO2 interface [38], [39]. Drain avalanche hot carrier injection refers to the gate oxide degradation due to the electron and hole gate currents caused by impact ionization [38]. Finally, secondary generated hot electron injection refers to the injection of minority carriers due to secondary impact ionization [38]. The MTTF due to hot carrier injection is given by

$$\mathrm{MTTF}_{HCI} = A_{HCI} \exp\left(\frac{\theta}{V_{DS}}\right) \tag{4}$$

where $A_{HCI}$ and $\theta$ are empirically determined constants and $V_{DS}$ is the drain to source voltage.

3.1.4 Time Dependent Dielectric Breakdown

Time dependent dielectric breakdown refers to the degradation of the SiO2 insulating layer between the gate and the conducting channel of the MOS device. The applied voltage and the tunneling electrons create defects in the volume of the oxide film, which accumulate over time, triggering a sudden loss of dielectric properties. The exact physical mechanism behind the degradation is, however, still an open question. The MTTF due to time dependent dielectric breakdown is given by [34]

$$\mathrm{MTTF}_{TDDB} = A_{TDDB} \cdot A_G \cdot \left(\frac{1}{V_{GS}}\right)^{\alpha - \beta T} \exp\left(\frac{X}{T} + \frac{Y}{T^{2}}\right) \tag{5}$$

where $V_{GS}$ is the gate voltage, $T$ is the temperature, $\alpha$, $\beta$, $X$ and $Y$ are fitting parameters, $A_G$ is the surface area of the gate oxide and $A_{TDDB}$ is an empirically determined constant.

3.1.5 Stress Migration

Stress migration refers to the motion of atoms in metal wires due to mechanical stress caused by the mismatch of temperature between the metal and the dielectric material. The MTTF due to stress migration is given by

$$\mathrm{MTTF}_{SM} = A_{SM} \, |T_0 - T|^{-n} \exp\left(\frac{E_{a_{SM}}}{KT}\right) \tag{6}$$

where $A_{SM}$ is a material dependent constant, $T_0$ is the metal deposition temperature and $E_{a_{SM}}$ is the activation energy of stress migration.

3.1.6 Thermal Cycling

Thermal cycling refers to the wear-out caused by thermal stress due to mismatched coefficients of thermal expansion of adjacent material layers. Thermal cycling related MTTF is computed in three steps.

1. Calculating the thermal cycles from a thermal profile using Downing's simple rainflow counting algorithm [40].

2. Calculating, from each thermal cycle, the number of cycles to failure using the Coffin-Manson rule [41].

$$N_{TC}(i) = A_{TC} \left(\delta T_i - T_{Th}\right)^{-b} e^{\frac{E_{a_{TC}}}{K T_{max}(i)}} \tag{7}$$

where $N_{TC}(i)$ is the number of cycles to failure due to the $i$th thermal cycle, $A_{TC}$ is an empirically determined constant, $\delta T_i$ is the amplitude of the $i$th thermal cycle, $T_{Th}$ is the temperature at which elastic deformation begins, $b$ is the Coffin-Manson exponent constant, $E_{a_{TC}}$ is the activation energy of thermal cycling and $T_{max}(i)$ is the maximum temperature in the $i$th thermal cycle.

3. Calculating the MTTF using Miner's rule [42].

$$\mathrm{MTTF} = \frac{N_{TC} \sum_{i=1}^{m} t_i}{m} \tag{8}$$

where $t_i$ is the time for the $i$th thermal cycle, $m$ is the number of thermal cycles obtained in step 1 and $N_{TC}$ is the effective number of cycles to failure, determined using

$$N_{TC} = \frac{m}{\sum_{i=1}^{m} \frac{1}{N_{TC}(i)}} \tag{9}$$

Combining Equations 7-9,

$$\mathrm{MTTF} = \frac{A_{TC} \sum_{i=1}^{m} t_i}{\text{Thermal Stress}} \tag{10}$$

where Thermal Stress is an indication of the stress experienced due to the thermal cycling. This is obtained using the following equation.

$$\text{Thermal Stress} = \sum_{i=1}^{m} \left(\delta T_i - T_{Th}\right)^{b} \times e^{-\frac{E_{a_{TC}}}{K T_{max}(i)}} \tag{11}$$
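Steps 2 and 3 above can be sketched in Python as follows. The sketch assumes that the thermal cycles have already been extracted from the thermal profile (step 1 is not implemented here) and that every cycle amplitude exceeds $T_{Th}$. The default parameter values and function names are placeholders chosen for illustration, not calibrated constants.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def cycles_to_failure(delta_t, t_max_k, a_tc=1.0, t_th=0.0, b=2.35, ea_tc=0.5):
    """Equation (7): Coffin-Manson cycles to failure for one thermal cycle.

    Assumes delta_t > t_th; the default parameters are placeholders.
    """
    return (a_tc * (delta_t - t_th) ** (-b)
            * math.exp(ea_tc / (BOLTZMANN_EV * t_max_k)))

def thermal_cycling_mttf(cycles, **params):
    """Equations (8) and (9). cycles: list of (amplitude, T_max in K, duration)."""
    m = len(cycles)
    n_i = [cycles_to_failure(dt, tmax, **params) for dt, tmax, _ in cycles]
    n_tc = m / sum(1.0 / n for n in n_i)         # Equation (9)
    total_time = sum(t for _, _, t in cycles)
    return n_tc * total_time / m                 # Equation (8), Miner's rule

def thermal_stress(cycles, t_th=0.0, b=2.35, ea_tc=0.5):
    """Equation (11): accumulated stress over all thermal cycles."""
    return sum((dt - t_th) ** b * math.exp(-ea_tc / (BOLTZMANN_EV * tmax))
               for dt, tmax, _ in cycles)
```

Computing the MTTF through Equations (8) and (9), or through Equation (10) as `a_tc * total_time / thermal_stress(cycles)`, gives the same value up to rounding, which is a convenient consistency check.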

3.2 Core-Level Reliability Modeling

Core-level reliability modeling combines the device-level reliability models to estimate the mean time to failure of the cores of a multiprocessor system.

The fault density at the device level is typically characterized by a Weibull or Lognormal distribution. For example, time dependent dielectric breakdown follows a Weibull distribution and electromigration follows a Lognormal distribution. The distributions for the other wear-out mechanisms are not known with certainty. The reliability computation is demonstrated for these two types of distribution.

3.2.1 Weibull Distribution

The reliability of a device considering a Weibull distribution for the fault density is given by

$$R(t) = e^{-\left(\frac{t}{\eta}\right)^{\beta}} \tag{12}$$

where $\eta$ is the scale parameter and $\beta$ is the shape parameter. When temperature is not constant but varies over time, any time interval 0 to $t$ can be split into $n$ disjoint time intervals $\{[0, t_1), [t_1, t_2), \cdots, [t_{n-1}, t_n)\}$ such that the temperature is constant within each interval. Let $T_i$ be the temperature in the $i$th interval $[t_{i-1}, t_i)$, with $t_0 = 0$. The scale parameter of the Weibull distribution for each interval is given by

$$\eta_i = \frac{\mathrm{MTTF}_{WO}(T_i)}{\Gamma\left(1 + \frac{1}{\beta}\right)} \tag{13}$$

where $\Gamma$ is the gamma function and $\mathrm{MTTF}_{WO}$ is the MTTF for the specific wear-out (WO) type under consideration, i.e.

$$WO = \begin{cases} EM & \text{for electromigration} \\ NBTI & \text{for negative bias temperature instability} \\ \cdots \end{cases} \tag{14}$$

The reliability is therefore given by [43]

$$R(t) = e^{-\left(\sum_{i=1}^{n} \frac{t_i - t_{i-1}}{\eta_i}\right)^{\beta}} \tag{15}$$

The expression for reliability is derived assuming a constant value for the shape parameter $\beta$. When process variation is considered, $\beta$ is not constant but is given by a Gaussian distribution function $\phi(\mu_g, \sigma_g)$, where $\mu_g$ is the mean and $\sigma_g^2$ is the variance. The reliability of a component (core) with $N$ devices considering process variation is given by [44]

$$R(t) = \exp\left(-N \left(\sum_{i=1}^{n} \frac{t_i - t_{i-1}}{\eta_i}\right)^{\mu_g - \frac{\sigma_g^2}{2} \ln\left(\sum_{i=1}^{n} \frac{t_i - t_{i-1}}{\eta_i}\right)}\right) \tag{16}$$
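The reliability under a time-varying temperature, Equation (15), can be evaluated with a few lines of Python. In this sketch the scale parameter is obtained from the standard relation between the mean and the scale of a Weibull distribution, $\eta = \mathrm{MTTF} / \Gamma(1 + 1/\beta)$. The temperature-to-MTTF model `example_mttf` is a hypothetical Arrhenius-type placeholder (10 years at 330 K, activation energy 0.7 eV); any of the device-level models of Section 3.1 could be substituted.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def weibull_scale(mttf, beta):
    """Scale parameter of a Weibull distribution with the given mean (MTTF)."""
    return mttf / math.gamma(1.0 + 1.0 / beta)

def reliability(intervals, mttf_at, beta=2.0):
    """Equation (15). intervals: list of (duration, temperature in K)."""
    damage = sum(duration / weibull_scale(mttf_at(temp), beta)
                 for duration, temp in intervals)
    return math.exp(-(damage ** beta))

def example_mttf(temp_k, mttf_ref=10.0, temp_ref_k=330.0, ea=0.7):
    """Hypothetical model: mttf_ref years at temp_ref_k, Arrhenius scaling."""
    return mttf_ref * math.exp(ea / BOLTZMANN_EV * (1.0 / temp_k - 1.0 / temp_ref_k))
```

For example, `reliability([(2.0, 330.0), (3.0, 360.0)], example_mttf)` gives the probability of surviving two years at 330 K followed by three years at 360 K under these assumed parameters.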

3.2.2 Lognormal Distribution

The reliability of a core considering a Lognormal distribution for the fault density is given by

$$R(t) = \frac{1}{2} - \frac{1}{2}\,\mathrm{erf}\left(\frac{\ln(t) - \mu}{\sqrt{2\sigma^{2}}}\right) \tag{17}$$

where $\mu$ is the scale parameter, $\sigma$ is the shape parameter and erf is the error function. Considering a time distribution of temperature as before, the reliability is

$$R(t) = \frac{1}{2} - \frac{1}{2}\,\mathrm{erf}\left(\frac{\ln(t) + \ln\left(\frac{1}{t}\sum_{i=1}^{n} \frac{t_i - t_{i-1}}{e^{\mu_i}}\right)}{\sqrt{2\sigma^{2}}}\right) \tag{18}$$

where the scale parameter $\mu_i$ for the $i$th interval is

$$\mu_i = \ln\left(\mathrm{MTTF}_{WO}(T_i)\right) - \frac{\sigma^{2}}{2} \tag{19}$$

The reliability of a core with $N$ devices using a Lognormal distribution for the fault density is given by [43]

$$R(t) = \exp\left(N \int_{-\infty}^{\infty} f(x) \ln\left(\frac{1}{2} - \frac{1}{2}\,\mathrm{erf}\left(\frac{\ln(t) - \mu}{\sqrt{2x^{2}}}\right)\right) dx\right) \tag{20}$$

where $f(x)$ is the probability density function of $\mu$. This equation is too complex to integrate in the reliability computation for systems. Furthermore, recent studies reveal that for a large number of devices per component, the Weibull distribution provides more accurate modeling than the Lognormal distribution, and it has therefore been adopted in most of the existing works on reliability optimization.

3.2.3 MTTF of a Core

The mean time to failure for a core $c_i$ is therefore

$$\mathrm{MTTF}_i = \int_{0}^{\infty} R(t)\,dt \tag{21}$$
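When $R(t)$ has no convenient closed form, Equation (21) can be approximated numerically. The Python sketch below uses the trapezoidal rule over a finite horizon, which is an assumption of this example: the horizon must be chosen large enough that $R(t)$ is negligible beyond it. The function names and the Weibull parameters are illustrative only.

```python
import math

def mttf_from_reliability(reliability, horizon, steps=100_000):
    """Approximate Equation (21) with the trapezoidal rule on [0, horizon]."""
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h

def weibull_reliability(t, eta=10.0, beta=2.0):
    """Equation (12) with illustrative parameters (eta in years)."""
    return math.exp(-((t / eta) ** beta))
```

For a Weibull distribution the result can be checked against the closed-form mean $\eta\,\Gamma(1 + 1/\beta)$, for instance by comparing `mttf_from_reliability(weibull_reliability, horizon=100.0)` with `10.0 * math.gamma(1.5)`.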

3.3 System-Level Reliability Modeling

System-level reliability modeling combines the reliability of the different cores to determine the mean time to failure of the multiprocessor system.

Some of the most widely used MTTF computation techniques are highlighted here.

Max-Min Approach: One of the most widely adopted approaches for MTTF optimization of a multiprocessor system is to maximize the minimum MTTF of the different cores. The MTTF of the multiprocessor system is therefore represented as

$$\mathrm{MTTF}_{sys} = \min_{i} \mathrm{MTTF}_i \tag{22}$$

Iterative Approach: In the iterative approach, the mean time to failure is determined iteratively, considering one failure at a time. After every failure, the tasks on the faulty core are remapped to the other active cores. This changes the operating temperature and hence the scale parameter of the fault distribution functions. The new MTTFs are computed for the cores and the minimum MTTF of all the cores is added to the system MTTF. This process is repeated until the performance of an application drops below an acceptable limit. Algorithm 1 provides the pseudo-code of the approach.

Algorithm 1 Iterative reliability computation
Input: Application and architecture graphs
Output: MTTF of the multiprocessor system
Initialize MTTFsys = 0
Map and schedule the application on the system
while schedule is valid do
    for all cores ci of the system do
        Determine MTTFi
    end for
    MTTFsys = MTTFsys + min{MTTFi}
    Task migration and determine new schedule
end while

Multi-Convolution Integral Approach: The reliability of a multiprocessor system with $l$ failures is given by [45]

$$R^{sys}_{N_c-l}(t) = \int_{0}^{t} dt_1 \int_{t_1}^{t} dt_2 \cdots \int_{t_{l-1}}^{t} R^{sys}_{N_c-l}(t, \mathbf{t}_l)\, dt_l \tag{23}$$

where $N_c$ is the number of cores of the multiprocessor system, $l$ is the number of failures, $\mathbf{t}_l = (t_1, t_2, \cdots, t_l)$, where $t_i$ is the time of occurrence of the $i$th failure, and

$$R^{sys}_{N_c-l}(t, \mathbf{t}_l) = R^{sys}_{N_c-l}(t \mid \mathbf{t}_l) \cdot g^{sys}_{N_c}(t_1) \cdot g^{sys}_{N_c-1}(t_2 \mid t_1) \cdots g^{sys}_{N_c-l+1}(t_l \mid t_1, t_2, \cdots, t_{l-1}) \tag{24}$$

where $R^{sys}_{N_c-l}(t \mid \mathbf{t}_l)$ is the probability that a core survives at time $t$ given that the system experiences $l$ failures, and $g^{sys}_{N_c-r+1}(t_r \mid t_1, t_2, \cdots, t_{r-1})$ is the probability that a system containing $N_c - r + 1$ working cores experiences the $r$th failure at time $t_r$ with the past $r - 1$ failures occurring at $t_1, t_2, \cdots, t_{r-1}$. The MTTF of the system is

$$\mathrm{MTTF}_{sys} = \int_{0}^{\infty} \sum_{l=0}^{N_c - N_c^{min}} R^{sys}_{N_c-l}(t)\, dt \tag{25}$$

where $N_c^{min}$ is the minimum number of cores required to satisfy the performance of a system. For a gracefully degrading system, $N_c^{min} = 1$.

Monte-Carlo Simulation Approach: The MTTF of a system can be derived using Monte-Carlo simulation, with a survival lattice describing the system structure. The time to failure of a component assuming a Weibull distribution is given by [43]

$$TTF = \frac{t}{\sum_{i=1}^{n} \frac{t_i - t_{i-1}}{\eta_i}} \; \exp\left(\frac{\mu - \sqrt{\mu^{2} + 2\sigma \ln\left(\frac{-\ln(1-u)}{N_c}\right)}}{\sigma^{2}}\right) \tag{26}$$

where $u$ is a uniform random number in [0, 1] representing the expected lifetime during system-level simulation. The system-level MTTF is determined using the survival lattice. The system reliability can be calculated as the percentage of trials for which the system survives longer than $TTF$.
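The iterative approach of Algorithm 1 can be sketched in Python under strong simplifying assumptions. Here tasks are reduced to utilization values, the mapper is a greedy least-loaded heuristic standing in for a real mapping and scheduling step, the schedule is considered valid while no core is loaded beyond `max_load`, and `core_mttf` is a hypothetical aging model in which MTTF shrinks with load. None of these choices comes from the surveyed works. As in Algorithm 1, wear accumulated by the surviving cores before a failure is not carried over to the next iteration.

```python
def map_tasks(task_loads, cores):
    """Greedy least-loaded mapping (a stand-in for a real mapper)."""
    load = {c: 0.0 for c in cores}
    for u in sorted(task_loads, reverse=True):
        target = min(cores, key=lambda c: load[c])
        load[target] += u
    return load

def core_mttf(load, mttf_idle=20.0, slope=3.0):
    """Hypothetical aging model: MTTF (years) decreases as utilization grows."""
    return mttf_idle / (1.0 + slope * load)

def iterative_system_mttf(task_loads, num_cores, max_load=1.0):
    """Algorithm 1: accumulate the minimum core MTTF, one failure at a time."""
    cores = list(range(num_cores))
    mttf_sys = 0.0
    while cores:
        load = map_tasks(task_loads, cores)
        if max(load.values()) > max_load:     # schedule is no longer valid
            break
        mttf = {c: core_mttf(load[c]) for c in cores}
        first_to_fail = min(cores, key=lambda c: mttf[c])
        mttf_sys += mttf[first_to_fail]
        cores.remove(first_to_fail)           # tasks are remapped on the next pass
    return mttf_sys
```

The max-min approach of Equation (22) corresponds to stopping after the first pass of the loop, so the iterative estimate is never smaller than the max-min estimate for the same initial mapping.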

4 DESIGN-TIME APPROACHES

The design-time based lifetime reliability optimization techniques assume worst-case execution times for the tasks of a given application to determine its allocation on a multiprocessor system. These techniques are used to estimate the reliability budget early in the design cycle. There are two popular methodologies for multiprocessor system design – platform-based design and hardware-software co-design. A brief overview is provided here of the research studies for each of these methodologies.

4.1 Platform-Based Design Methodology

The platform-based design techniques determine the application mapping on a given multiprocessor system [46]–[48]. These techniques assume a given multiprocessor floorplan consisting of the processing elements. The selection of these elements (number and/or types) for the multiprocessor architecture and their placement (floorplan) are not addressed.

As established in Section 1, a major challenge of modern multiprocessor systems is lifetime reliability, threatened by high power densities and hence elevated operating temperatures. These accelerate temperature-related wear-out, leading to a growing prominence of permanent device defects [49]. Thermal management using voltage and frequency scaling has attracted significant research focus in recent years to increase the lifetime of multiprocessor systems. This benefit rides on top of the energy savings obtained by scaling down the voltage and frequency of operation within an acceptable performance degradation. However, recent studies on the causes of transient faults have shown that the probability of transient faults in a circuit increases by several orders of magnitude when operating at lower voltage and frequency. The offline approaches distribute the tasks of a given application on a given multiprocessor system to address these aspects, in order to delay the occurrence of faults assuming worst-case conditions.

Since temperature has a significant impact on device wear-out, there are a number of research studies on offline task allocation techniques for temperature minimization. A thermal-aware task mapping and scheduling technique is proposed in [50]. Steady-state temperature is generated using the HotSpot [26] tool from the average power of the task schedule. The temperature computation is integrated in the design space exploration framework, and thus the scalability of the approach is limited by the high simulation time of the HotSpot tool. This has motivated researchers to study the duality between heat transfer and electrical phenomena; as such, the temporal and spatial temperature dependency is modeled using a thermal equivalent resistive capacitive (RC) model. Several research studies in the literature solve the thermal equivalent RC model in a design space exploration framework. A mixed integer linear programming (MILP)-based task mapping and scheduling technique for peak temperature minimization is proposed in [51]. The proposed approach solves the steady-state and spatial temperature dependency from the RC thermal equivalent model using a phased steady-state thermal analysis technique, which is integrated directly in the MILP formulation.

A convex optimization formulation of the voltage-frequency dependency of temperature is presented in [52]. An interior-point algorithm is proposed to solve the convex problem. A thermal-aware task sequencing technique is proposed in [53]. The proposed approach maximizes the throughput of a periodic task set subject to a peak temperature constraint by solving the temperature using a steady-state approximation. A thermal-aware scheduling of sporadic real-time tasks is proposed in [54]. The proposed approach models the heat transfer among cores and heat sink using the Fourier cooling model, in which the steady-state temperature is calculated using the RC equivalent thermal model. A fast and accurate steady-state thermal simulator based on the HotSpot tool is proposed in [55] for homogeneous multiprocessor systems. The simulator is based on an analytical approach to solve the temperature-power dependency using the RC equivalent thermal model. A fast event-driven approach is proposed in [56] to estimate the temperature of multiprocessor systems. The estimation overhead is minimized using prebuilt look-up tables and predefined leakage calibration parameters. A stop-go scheduling of task graphs on DVFS-enabled multiprocessor systems is proposed in [57]. All these techniques determine the steady-state temperature only. The temperatures obtained from these approaches are accurate only if the execution times of the tasks of a given application are of the order of the thermal time constant of the package, which is typically hundreds of seconds.
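The limitation of steady-state analysis noted above can be illustrated with a single-node lumped RC thermal model, which is a deliberate simplification: the surveyed works use multi-node RC networks that also capture spatial dependency. In the Python sketch below, the thermal resistance, thermal capacitance, power and temperatures are hypothetical values chosen so that the time constant $R \cdot C$ is 100 s.

```python
import math

def steady_state_temp(power_w, r_th, t_ambient):
    """Steady state of a single RC node: T_ss = T_amb + P * R."""
    return t_ambient + power_w * r_th

def transient_temp(power_w, r_th, c_th, t_ambient, t_initial, elapsed_s):
    """T(t) = T_ss + (T_0 - T_ss) * exp(-t / (R * C)) for constant power."""
    t_ss = steady_state_temp(power_w, r_th, t_ambient)
    return t_ss + (t_initial - t_ss) * math.exp(-elapsed_s / (r_th * c_th))

# Hypothetical package: R = 2 K/W, C = 50 J/K, so the time constant is 100 s.
early = transient_temp(10.0, 2.0, 50.0, 45.0, 45.0, elapsed_s=10.0)
late = transient_temp(10.0, 2.0, 50.0, 45.0, 45.0, elapsed_s=500.0)
```

Under these assumptions a task that runs for 10 s leaves the node far below its steady-state temperature, while after 500 s (five time constants) the two nearly coincide, which is why steady-state estimates overstate the temperature of short tasks.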

Very few thermal management techniques in the literature account for both the transient and the steady-state phases. A temperature-aware technique is proposed in [58] to distribute the idle time in order to control the power consumption and hence the temperature. The proposed temperature model incorporates the dynamic and the leakage power components and therefore results in fairly accurate temperature estimation. A similar approach is proposed in [59]. However, both of these techniques consider a uniprocessor system and suffer from scalability issues and difficulty in modeling spatial dependency when applied to multicore systems.

The discussion thus far relates to thermal optimization for multiprocessor systems, which indirectly leads to reliability improvement. However, different wear-out mechanisms are influenced by temperature differently. Hence, there are studies that optimize lifetime reliability directly, considering these wear-out mechanisms through intelligent task mapping and scheduling. A reliability estimation technique is proposed in [60] for application-specific multiprocessor systems. First, the mean time to failure of a multiprocessor system is computed by considering the lifetime reliability of individual processing elements. The model is based on a single point of failure, i.e., assuming a series failure system. This assumption is later relaxed to consider multiple failures by incorporating the change in fault density with core failures. A slack allocation technique is proposed in [61] to improve the lifetime reliability of network-on-chip based multiprocessor systems. The proposed technique exploits the critical quantity slack arising from execution and storage resources. This slack is distributed to other components to increase the mean time to failure. A simulated annealing based technique is proposed in [62] to maximize the lifetime reliability of a multiprocessor system. The steady-state temperature values are determined using the HotSpot tool for all combinations of the active tasks on different processors. These temperature data are stored in a lookup table and used during the optimization step. An ant colony optimization based technique is proposed in [63] to determine the task mapping that maximizes lifetime, defined as the time to the first failure. This work shows that the lifetime of a multiprocessor system obtained using a temperature-aware optimization technique can be significantly lower than when lifetime is explicitly optimized. However, not enough details are provided on the thermal model used to estimate the temperature and hence the wear-out related aging.
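The series failure assumption mentioned above for [60] can be illustrated with a short sketch. It assumes, purely for illustration, that each core has a Weibull reliability function with its own scale and shape parameters. The system then fails at the first core failure, so the system reliability is the product of the per-core reliabilities and the mean time to failure is its integral over time. The function names and parameter values are hypothetical and are not taken from the cited work.

```python
import math


def weibull_reliability(t, scale, shape):
    """Probability that a component survives beyond time t."""
    return math.exp(-((t / scale) ** shape))


def series_system_mttf(cores, horizon, steps=100000):
    """MTTF of a series system by trapezoidal integration of R_sys(t).

    cores   : list of (scale, shape) Weibull parameters, one pair per core
    horizon : upper integration limit; must be large enough that R_sys is ~0
    """
    dt = horizon / steps
    total = 0.0
    prev = 1.0  # R_sys(0) = 1
    for k in range(1, steps + 1):
        t = k * dt
        cur = 1.0
        for scale, shape in cores:
            cur *= weibull_reliability(t, scale, shape)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total


# With shape = 1 the Weibull model reduces to a constant failure rate,
# so the series MTTF should approach 1 / (1/10 + 1/5).
print(series_system_mttf([(10.0, 1.0), (5.0, 1.0)], horizon=200.0))
```

A series model of this kind is pessimistic for systems that can remap tasks after a core failure, which is the case the relaxed multiple-failure model addresses.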

A statistical gate delay aging model is presented in [64], which considers both workload and fabrication-induced process variations. This model is integrated in an iterative design flow for reliability analysis and optimization. A wear-out aware schedulability analysis technique is proposed in [65] for real-time independent tasks mapped to a processor with dynamic voltage and frequency scaling capabilities.

A convex optimization based approach is proposed in [66] to maximize the lifetime reliability of the cores of a multiprocessor system considering the electromigration-related wear-out mechanism. The proposed approach also incorporates the wear-out of the underlying network-on-chip, using a simplistic model to determine the steady-state temperature. A simulated annealing based energy-reliability joint optimization technique is proposed in [67]. This work considers the time to the first failure, and the temperature is pre-determined for different combinations of active tasks, similar to the approach in [62]. A sequential quadratic programming based approach is proposed in [68] to maximize the lifetime of a multiprocessor system considering the electromigration-related lifetime reliability of the communication links. The steady-state temperature is determined by solving the RC equivalent thermal model considering the placement of the heat sink and heat spreader. In all these reliability optimization techniques, the transient phase of the temperature is ignored.
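The temperature sensitivity of electromigration-related lifetime can be sketched with Black's equation [33], in which the mean time to failure scales as J^(-n) exp(Ea / (k T)) for current density J and absolute temperature T. The sketch below is illustrative only. The activation energy (0.9 eV), the current density exponent (n = 2) and the function names are assumptions chosen for the example, not values reported by the works surveyed above.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_mttf(current_density, temperature_k, activation_energy_ev=0.9,
            exponent_n=2.0, prefactor=1.0):
    """Black's equation with an arbitrary prefactor (relative units only)."""
    arrhenius = math.exp(activation_energy_ev / (BOLTZMANN_EV_PER_K * temperature_k))
    return prefactor * current_density ** (-exponent_n) * arrhenius


def em_acceleration(t_cool_k, t_hot_k, activation_energy_ev=0.9):
    """Ratio MTTF(t_cool) / MTTF(t_hot) at equal current density."""
    return math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K
                    * (1.0 / t_cool_k - 1.0 / t_hot_k))


# How much longer a core is expected to last at 330 K than at 350 K
# under these assumed parameters.
print(em_acceleration(330.0, 350.0))
```

Because the ratio depends exponentially on temperature, a mapping that lowers the steady-state temperature of the hottest core by a few kelvin can change the estimated lifetime noticeably, which is why the optimization techniques above are sensitive to the accuracy of the thermal model.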

To address this, the technique in [69] uses an eigenvalue decomposition based approach to determine the steady-state dynamic temperature profile from a time-varying and periodic power profile. The approach is shown to be the most accurate among the existing techniques in determining not only the peak or average temperatures but also the thermal amplitude and frequency, which are crucial for the optimization of thermal cycling related lifetime. The approach also includes the leakage power dependency on temperature using a linear approximation technique. Based on this temperature model, a simple heuristic is proposed to maximize the lifetime of a multiprocessor system considering the thermal cycling related wear-out effect. A fast thermal model is proposed in [70] to incorporate the transient and steady-state temperatures considering the spatial temperature dependency. Based on this model, a gradient-based optimization technique is proposed to maximize the lifetime reliability with applications represented as synchronous data flow graphs. This technique is later extended in [71] to consider multiple concurrent applications, with reliability computed using an iterative approach.

Finally, there are also techniques to optimize the lifetime of multiprocessor systems considering transient and intermittent faults. A combination of dynamic voltage scaling with partial redundant multi-threading is proposed in [72] to explore the trade-off between soft-error tolerance and lifetime reliability. A resource management technique is proposed in [73] to minimize processor wear-out while simultaneously providing tolerance for transient and intermittent faults. The approach minimizes the task communication energy and congestion on the network-on-chip by considering the steady-state temperature model of [62]. A genetic algorithm based lifetime optimization technique is proposed in [74]. The approach determines the voltage and frequency of the cores of a multiprocessor system to maximize the lifetime and minimize the soft-error susceptibility. A Markov-decision based steady-state availability model for multiprocessor systems is derived in [75] considering intermittent faults. The fault arrival rate is modeled using a Weibull distribution considering the electromigration-related wear-out mechanism. A simple heuristic is then proposed to maximize the lifetime reliability and minimize the task communication energy, while satisfying the availability requirement of modern multiprocessor systems. Table 3 summarizes these related works.

TABLE 3
Related works on reliability-aware platform-based design methodology.

| Related Works | Temperature Model | Optimization Objective | Reliability Modeling | Application Model | Architecture Model |
|---|---|---|---|---|---|
| [26], [50]–[57] | steady-state & spatial | temperature | – | independent DAGs | static homogeneous & heterogeneous |
| [58], [59] | transient & steady-state | temperature | – | independent DAGs | static homogeneous |
| Gu et al. [60] | steady-state & spatial | lifetime reliability | Iterative | independent DAGs | static heterogeneous |
| Meyer et al. [61], Huang et al. [62], Hartman et al. [63] | steady-state & spatial | lifetime reliability | Max-Min | independent DAGs | static homogeneous |
| Lu et al. [64] | – | lifetime reliability | core-level modeling | independent tasks | – |
| Masrur et al. [65] | – | lifetime reliability | core-level reliability | independent tasks | – |
| Das et al. [66] | steady-state & spatial | lifetime reliability | Max-Min | independent & concurrent DAGs | static homogeneous |
| Huang et al. [67], Wang et al. [68] | steady-state & spatial | lifetime reliability & energy | Max-Min | independent DAGs | static homogeneous |
| Ukhov et al. [69] | transient & steady-state, temporal & spatial | lifetime reliability & energy | Max-Min | independent DAGs | static homogeneous |
| Das et al. [70] | transient & steady-state, temporal & spatial | lifetime reliability & energy | Max-Min | independent SDFGs | static homogeneous |
| Das et al. [71] | transient & steady-state, temporal & spatial | lifetime reliability & energy | Iterative | independent & concurrent SDFGs | static homogeneous |
| Siddiqua et al. [72] | steady-state & spatial | lifetime reliability, transient faults | core-level modeling | independent tasks | static homogeneous |
| Chou et al. [73], Das et al. [74], [75] | steady-state & spatial | lifetime reliability, transient & intermittent faults | Max-Min | independent DAGs | static homogeneous |

4.2 Hardware-Software Co-Design Methodology

Hardware-software co-design [86]–[89] involves meeting the system-level design objectives through concurrent development of the multiprocessor hardware and associated software. This methodology addresses two key aspects: the number and types of processing elements for a multiprocessor platform satisfying a design budget (hardware), and the allocation of resources for every application supported on the platform (software). Traditionally, the hardware-software co-design problem has been addressed to optimize performance or minimize energy consumption [24], [90], [91]. Recently, significant focus has been directed towards reliability-aware hardware-software co-design.

A hardware-software co-synthesis of fault-tolerant systems is proposed in [76]. The proposed approach uses task duplication with comparison and re-execution as the fault-tolerant mechanism. The co-synthesis framework performs task clustering to determine the best placement of the duplicate and compare tasks and to determine the minimum resources needed for error recovery. A design methodology is proposed in [77] to handle conditional execution in multi-rate embedded systems and selectively duplicates critical tasks to correct transient errors. Two heuristics are proposed to insert duplicated tasks in the task graph schedule. The system reliability is improved using idle computation resources. A design methodology is proposed in [79] to determine the cost, performance and reliability trade-offs for multiprocessor systems considering permanent faults. A design space exploration of multimedia multiprocessor systems is proposed in [80]. The proposed approach explores the trade-off between different metrics such as performance, energy and cost while incorporating soft-error tolerance in the optimization process. A system-level reliability analysis technique is proposed in [81] considering process re-execution in software and selective hardening of hardware for fault-tolerance. Based on this, a design optimization heuristic is proposed to select the fault-tolerant architecture and the task mapping such that the overall cost is minimized. A system-level synthesis flow is proposed in [78] for the design of reliable embedded systems. The methodology explores different hardening strategies under a given user-level reliability specification. The strategy with the least resource utilization is selected and the given application is mapped on the resulting platform to optimize reliability. All these techniques are applicable for reactive fault-tolerance, i.e., they determine the multiprocessor platform to maximize the fault-tolerance capability considering transient and permanent faults.

There are very few research studies on the hardware-software co-design methodology for proactive fault-tolerance considering temperature-related wear-out. A system-level design methodology is proposed in [82] for the automatic synthesis of reliable embedded systems. The proposed methodology addresses the following: selection of resources with different reliability, area and latency parameters; and mapping of a data flow application on the platform to simultaneously optimize reliability, area and latency using a multi-objective evolutionary algorithm. Although the wear-out mechanism is dealt with in this paper, not enough details are provided on temperature measurement and its effect on the fault rate. A co-design methodology is proposed in [83] to determine the minimum resources required to improve system lifetime, measured as mean time to failure. The proposed technique starts with a given set of applications represented as directed acyclic graphs and a database of available processors with an associated temperature-dependent reliability model. The objectives of the approach are (1) to select the cores for the platform based on performance and reliability characteristics; (2) to allocate and schedule the tasks to maximize the reliability; and (3) to determine the floorplan of the resultant architecture. The objectives are achieved in an iterative manner to jointly optimize for lifetime reliability and multiprocessor area. A trade-off analysis technique is proposed in [84] to maximize the lifetime of a multiprocessor system while simultaneously improving the reliability considering transient faults. The proposed approach determines the minimum number of checkpoints for the tasks of a multiprocessor system that maximizes the transient fault reliability while incurring the least degradation of lifetime reliability. This is achieved using a gradient-based fast heuristic. This technique is later integrated in a hardware-software co-design framework [85] to determine the minimum resources of a reconfigurable multiprocessor system that guarantee the performance while simultaneously maximizing the lifetime reliability, considering individual and concurrent applications represented as synchronous data flow graphs. Table 4 summarizes the related works.

TABLE 4
Related works on reliability-aware hardware-software co-design methodology.

| Related Works | Optimization Objective | Reliability Modeling | Fault-tolerance | Multiprocessor Platform | Application Model |
|---|---|---|---|---|---|
| Dave et al. [76], Xie et al. [77], Bolchini et al. [78] | area | – | reactive | static heterogeneous | independent DAGs |
| Bolchini et al. [79], Stralen et al. [80], Izosimov et al. [81] | reliability | – | reactive | static homogeneous | independent DAGs |
| Glaß et al. [82] | lifetime reliability | Convolution Integral | proactive | static homogeneous | independent DAGs |
| Zhu et al. [83] | area & lifetime reliability | Iterative | proactive | static heterogeneous | independent DAGs |
| Das et al. [84] | lifetime reliability | Max-Min | proactive & reactive | homogeneous & reconfigurable | independent DAGs |
| Das et al. [85] | lifetime reliability | Max-Min | proactive & reactive | homogeneous & reconfigurable | independent & concurrent SDFGs |

5 RUN-TIME APPROACHES

Design-time reliability management is useful to estimate the lifetime of the multiprocessor system early in the design cycle. During in-field operation, these systems execute applications with different performance requirements. These applications exercise the hardware differently depending on the types of computation being carried out. Dynamic (run-time) reliability management (DRM) is therefore essential to deal efficiently with these workload and performance variations in order to satisfy the useful life requirement. Additionally, new embedded applications and games are continuously being developed, and there is an increasing tendency among users to add these applications to embedded devices. Not all of these new applications and use-cases can be foreseen at design-time. Dynamic reliability management is essential to allocate resources for these new applications and use-cases in order to maximize or satisfy the lifetime requirement of these devices.

A thermal model is developed for networks-on-chip based on the temperature characterization of a multiprocessor system [92]. Based on this, a dynamic thermal management approach is proposed to reduce the peak temperature using thermal correlation based routing. A dynamic approach is proposed in [93] to manage the reliability of a multiprocessor system considering different aging effects such as electromigration, time-dependent dielectric breakdown, and thermal cycling, assuming a constant failure rate. However, with every failure, the load on the functional cores increases and therefore the thermal profile of the multiprocessor system changes. The fault rate is no longer constant under such dynamic behavior. Furthermore, not enough details are provided on how the temperature is modeled, which is an essential factor governing the failure rate of the different wear-out mechanisms. A dynamic reliability management technique is proposed in [94], where workload characteristics and thermal information are used to project the degradation caused by various failure mechanisms. The projected failure probability is used to control the maximum voltage and frequency of the cores of the multiprocessor system. The relationship between temperature and the voltage and frequency of operation is formulated in [95], [96]. Based on this, an online heuristic is proposed to determine the voltage and frequency of the cores to minimize the temperature. These techniques do not learn the relationship between the workload characteristics and thermal behavior and therefore perform the same optimization repeatedly for similar workloads.

An operating system level technique is proposed in [97] to minimize the temperature of a multiprocessor system by careful scheduling of hot and cold tasks. This is based on the observation that executing a hot job before a cold one results in a lower temperature than the other way round. The temperature dependency on power is solved using the fourth-order Runge-Kutta method. However, the technique incurs significantly high overhead. In [98], a scalable distributed dynamic thermal management technique is proposed to avoid the development of hot spots that accelerate thermal aging and transient faults. The proposed technique consists of multiple agents, each managing a cluster of the many-core architecture. The agents negotiate among themselves to minimize the temperature. A neighbor-aware technique is proposed in [99], which migrates the tasks of an application considering the thermal impact of neighboring processors. This technique also minimizes thermal hot spots. An adaptive random scheduling approach is proposed in [100], which determines the workload allocation to processing cores considering the temperature history of the chip. The approach not only balances the chip temperature but also distributes the thermal stress evenly throughout the system lifetime. A coordinated hardware-software approach for dynamic thermal management is proposed in [101]. The approach develops a regression-based processor thermal model based on the performance counter readings. This model is integrated in the operating system to perform proactive dynamic thermal management, minimizing the peak temperature through process scheduling. This is combined with hardware techniques such as clock gating to perform reactive thermal management when the temperature exceeds a threshold. A run-time task mapping technique is proposed in [102] to minimize wear-out by utilizing wear-out sensors. The proposed technique computes the task mapping at regular intervals and when a component fails. The scheduling decision is left to the operating system. An online approach is proposed in [103] to minimize wear-out considering different aging mechanisms. The power of a core is used to estimate its temperature, and a simple heuristic is proposed to dynamically manage peak temperature and thermal cycling. Both these techniques are based on simulation of a real multiprocessor system.
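The hot-before-cold observation attributed to [97] above can be reproduced with a small sketch. It assumes a single lumped thermal node governed by C dT/dt = P - (T - T_amb) / R, integrated with the classical fourth-order Runge-Kutta method. The thermal resistance, capacitance, ambient temperature and job powers are hypothetical values chosen only for illustration, and the single-node model is a simplification rather than the model used in the cited work.

```python
def rk4_step(temp, power, dt, r_th, c_th, t_amb):
    """Advance the single-node thermal ODE by one RK4 step of length dt."""
    def dtemp(t):
        return (power - (t - t_amb) / r_th) / c_th

    k1 = dtemp(temp)
    k2 = dtemp(temp + 0.5 * dt * k1)
    k3 = dtemp(temp + 0.5 * dt * k2)
    k4 = dtemp(temp + dt * k3)
    return temp + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def simulate(schedule, r_th=2.0, c_th=5.0, t_amb=45.0, dt=0.01):
    """Run jobs back to back, starting from ambient temperature.

    schedule : list of (power_watts, duration_seconds)
    returns  : (peak_temperature, final_temperature)
    """
    temp = t_amb
    peak = temp
    for power, duration in schedule:
        for _ in range(round(duration / dt)):
            temp = rk4_step(temp, power, dt, r_th, c_th, t_amb)
            peak = max(peak, temp)
    return peak, temp


hot_job = (10.0, 5.0)   # assumed 10 W for 5 s
cold_job = (2.0, 5.0)   # assumed 2 W for 5 s
print("hot then cold:", simulate([hot_job, cold_job]))
print("cold then hot:", simulate([cold_job, hot_job]))
```

Under these assumptions both the peak and the final temperature are lower when the hot job runs first, because the hot job starts from a cooler die and the cold job then lets the die relax. Calling an integrator like this at every scheduling decision is also one plausible source of the run-time overhead noted above.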

To efficiently manage temperature for multiprocessor systems, a machine learning based technique is proposed in [104] to reduce the peak temperature as well as thermal cycling. A set of expert policies that includes load balancing, dynamic power management and thread migration is pre-determined, and the technique uses machine learning principles to select the best policy based on the given workload characteristics. A slack borrowing technique is proposed in [105] to dynamically manage the peak temperature for MPEG2 video decoding. The technique is very application specific and can be applied only to video decoding applications such as MPEG4 or H.264. A neural network-based adaptive thermal management policy is proposed in [106]. The technique relies on temperature prediction using the HotSpot tool. Although the HotSpot tool can be used to validate static thermal management policies, the high simulation time of the tool can potentially lead to deadline misses for dynamic thermal management of real-time systems. A reinforcement learning algorithm is proposed in [107] to manage performance-thermal trade-offs by sampling temperature data from the on-board thermal sensors. This technique is based on thermal optimization using the instantaneous temperature at the decision interval, which is not a true indication of the peak temperature or the average temperature in that interval. Furthermore, the proposed technique minimizes the peak and average temperatures only. Thermal cycling, which is an emerging concern in modern multiprocessor systems, is not addressed. A distributed learning agent is proposed in [108] to optimize the peak temperature within a given power budget. The technique is implemented on an FPGA with temperature measurement using an external thermal gun. This suffers from limited accuracy, especially in capturing thermal cycling. Finally, a reinforcement learning algorithm is proposed in [109] to optimize the lifetime of a multicore system by controlling the average temperature and thermal cycling. The objective is to adapt to intra- and inter-application workload variations. The algorithm controls the voltage and frequency of operation of the cores and the thread affinity through a cross-layer handshaking, using feedback from the hardware performance monitoring unit and direct temperature measurement using thermal sensors.

A proactive thermal management policy is proposed in [110] to balance the temperature on the die. The proposed approach uses autoregressive moving average modeling to forecast the future temperature. The model is adapted to the workload variation, which is detected using the sequential probability ratio test. The approach allocates the incoming jobs to the processing cores based on this temperature forecast to balance the temperature of the die. A proactive dynamic thermal management technique is proposed in [111] based on the predict-and-act philosophy. In the proposed framework, the operating system scheduler predicts the temperature of the individual cores; if the temperature of a core crosses a pre-defined value, the operating system decides to migrate one or more threads of a given workload to the coldest core in the system. A thermal prediction technique is proposed in [112] to predict the temperature of a multiprocessor system considering intra- and inter-application workload variations. The proposed approach is a hybrid offline/run-time approach in which the offline phase is used to derive the thermal model of workloads using the data generated from a set of representative workloads. Based on this, the run-time phase identifies the workload phase change and selects the appropriate voltage and frequency of operation of the cores to minimize the peak temperature.
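The predict-and-act idea described above can be sketched as follows. This is a deliberately simplified illustration: the forecast is a linear extrapolation of the last two temperature samples, used here as a stand-in for the autoregressive moving average predictor of [110], and the migration rule only pairs each predicted-hot core with the predicted-coldest core. The function names, the data layout and the threshold are hypothetical.

```python
def forecast_next(history):
    """Forecast the next sample by linear extrapolation of the last two.

    This is an assumed stand-in for a trained ARMA predictor.
    """
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


def plan_migrations(histories, threshold):
    """Return (hot_core, target_core) pairs for cores predicted to overheat.

    histories : dict mapping core id to a list of temperatures, oldest first
    threshold : temperature above which a core is considered too hot
    """
    predicted = {core: forecast_next(h) for core, h in histories.items()}
    coldest = min(predicted, key=predicted.get)
    moves = []
    for core in sorted(predicted):
        if predicted[core] > threshold and core != coldest:
            moves.append((core, coldest))
    return moves


# Hypothetical sensor readings in degrees Celsius for three cores.
readings = {0: [70.0, 76.0], 1: [60.0, 61.0], 2: [55.0, 54.0]}
print(plan_migrations(readings, threshold=80.0))
```

Here core 0 is still below the threshold at the last sample, but its forecast exceeds it, so a migration is planned before the violation occurs. Which threads to move, and how to account for the migration cost, is left open in this sketch.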

A low-power sensor to monitor oxide and NBTI degradation is designed in [113] for the 45nm technology node; a dynamic NBTI management scheme is then proposed based on the sensor measurements for a single-core system. A linear discrete-time thermal model is proposed in [114] for dynamic thermal management. The proposed technique is implemented in the Linux kernel to monitor the temperature from the on-board temperature sensors, decide on the length of the idle time needed to reduce thermal emergencies, and insert this idle time during execution by controlling the Linux scheduler. An adaptive approach is proposed in [115] to minimize the electromigration-related wear-out simultaneously with the communication energy. This work is based on an observe-decide-act framework where the adaptation engine monitors the aging at run-time. If the aging crosses a threshold, the engine changes the task mapping to extend the lifetime of the multiprocessor system. A control theoretic approach is proposed in [116] to maximize the lifetime of homogeneous multicore systems. The proposed approach is based on a long-term controller, which samples data from aging sensors to compute the wear-out degradation. Based on this, a short-term controller adjusts the voltage and frequency of the tasks to minimize temperature while satisfying the performance requirement and the user experience. Table 5 summarizes these related works.

TABLE 5
Related works on run-time reliability optimization techniques.

| Related Works | DRM Approach | Workload Variation | Thermal Cycling | Temperature/Reliability Measurements | Validation Platform |
|---|---|---|---|---|---|
| Shang et al. [92] | heuristic | intra variation | × | model | simulation & actual chip |
| Coskun et al. [93] | heuristic | intra variation | √ | constant temperature | simulation |
| Karl et al. [94] | heuristic | intra variation | × | thermal model | simulation |
| Mulas et al. [95], Hanumaiah et al. [96], Zhou et al. [97], Faruque et al. [98] | heuristic | intra variation | × | thermal model | simulation |
| Liu et al. [99] | heuristic | intra variation | × | thermal model | multicore |
| Coskun et al. [100] | heuristic | intra variation | √ | temperature pre-characterization | simulator |
| Kumar et al. [101] | heuristic | intra variation | × | thermal regression model | multicore |
| Hartman et al. [102] | heuristic | intra variation | × | wear-out sensors | simulation |
| Chantem et al. [103] | heuristic | intra variation | √ | thermal model | simulation |
| Coskun et al. [104] | machine learning | intra variation | √ | HotSpot | multicore |
| Lee et al. [105] | machine learning | intra variation | √ | sensors | multicore |
| Jayaseelan et al. [106] | machine learning | intra variation | × | HotSpot | simulation |
| Ge et al. [107] | machine learning | intra variation | × | thermal models & sensors | multicore |
| Ebi et al. [108] | machine learning | intra variation | × | thermal gun | FPGA |
| Das et al. [109] | machine learning | inter & intra variation | √ | thermal sensors | multicore |
| Coskun et al. [110] | predict and act | inter & intra variation | √ | thermal sensors | multicore |
| Ayoub et al. [111] | predict and act | intra variation | × | HotSpot | simulator |
| Cochran et al. [112] | predict and act | inter & intra variation | × | thermal sensors | multicore |
| Singh et al. [113] | control theoretic | intra variation | × | wear-out sensors | single core |
| Sironi et al. [114] | observe-decide-act | intra variation | × | thermal sensors | multicore |
| Bolchini et al. [115] | observe-decide-act | intra variation | × | HotSpot | simulation |
| Mercati et al. [116] | control theoretic | intra variation | √ | wear-out sensors | multicore |

6 CONCLUSIONS

This paper provides an overview of reliability modeling for multiprocessor systems and a perspective on the different system-level design techniques for lifetime optimization.

REFERENCES

[1] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, and A. LeBlanc, “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions,” IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974.

[2] K. Mistry, C. Allen, C. Auth, B. Beattie, D. Bergstrom, M. Bost et al., “A 45nm Logic Technology with High-k+Metal Gate Transistors, Strained Silicon, 9 Cu Interconnect Layers, 193nm Dry Patterning, and 100% Pb-free Packaging,” in IEEE International Electron Devices Meeting (IEDM), 2007, pp. 247–250.

[3] M. Bohr, “A 30 Year Retrospective on Dennard’s MOSFET Scaling Paper,” IEEE Solid-State Circuits Society Newsletter, vol. 12, no. 1, pp. 11–13, 2007.

[4] S. Borkar, “Design Perspectives on 22nm CMOS and Beyond,” in Proceeding of the Annual Design Automation Conference (DAC). ACM, 2009, pp. 93–94.

[5] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Power Challenges May End the Multicore Era,” Communications of the ACM, vol. 56, no. 2, pp. 93–102, 2013.


[6] F. Reynolds, “Thermally Accelerated Aging of Semiconductor Components,” Proceedings of the IEEE, vol. 62, no. 2, pp. 212–222, 1974.

[7] R. Moazzami, J. C. Lee, and C. Hu, “Temperature Acceleration of Time-Dependent Dielectric Breakdown,” IEEE Transactions on Electron Devices, vol. 36, no. 11, pp. 2462–2465, 1989.

[8] T. Brozek, Y. D. Chan, and C. R. Viswanathan, “Temperature Accelerated Gate Oxide Degradation Under Plasma-Induced Charging,” IEEE Electron Device Letters, vol. 17, no. 6, pp. 288–290, 1996.

[9] S. Kumar, C. Kim, and S. Sapatnekar, “Adaptive Techniques for Overcoming Performance Degradation Due to Aging in CMOS Circuits,” IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 19, no. 4, pp. 603–614, 2011.

[10] S. Ramey, A. Ashutosh, C. Auth, J. Clifford, M. Hattendorf, J. Hicks, R. James, A. Rahman, V. Sharma, A. St.Amour, and C. Wiegand, “Intrinsic Transistor Reliability Improvements from 22nm Tri-Gate Technology,” in IEEE International Reliability Physics Symposium (IRPS), 2013, pp. 4C.5.1–4C.5.5.

[11] Y. Leblebici, “Design Considerations for CMOS Digital Circuits with Improved Hot-Carrier Reliability,” IEEE Journal of Solid-State Circuits, vol. 31, no. 7, pp. 1014–1024, 1996.

[12] V. Huard, C. Parthasarathy, A. Bravaix, C. Guerin, and E. Pion, “CMOS Device Design-in Reliability Approach in Advanced Nodes,” in IEEE International Reliability Physics Symposium, 2009, pp. 624–633.

[13] A. Tiwari and J. Torrellas, “Facelift: Hiding and Slowing Down Aging in Multicores,” in Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, 2008, pp. 129–140.

[14] F. Oboril, F. Firouzi, S. Kiamehr, and M. Tahoori, “Reducing NBTI-induced Processor Wearout by Exploiting the Timing Slack of Instructions,” in Proceedings of the Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), ser. CODES+ISSS ’12. ACM, 2012, pp. 443–452.

[15] F. Ahmed, M. Sabry, D. Atienza, and L. Milor, “Wearout-Aware Compiler-Directed Register Assignment for Embedded Systems,” in Proceedings of the International Symposium on Quality Electronic Design (ISQED), 2012, pp. 33–40.

[16] S. Manolache, P. Eles, and Z. Peng, “Schedulability Analysis of Applications with Stochastic Task Execution Times,” ACM Transactions on Embedded Computing Systems (TECS), vol. 3, no. 4, pp. 706–735, Nov. 2004.

[17] M. Lombardi, M. Milano, and L. Benini, “Robust Scheduling of Task Graphs Under Execution Time Uncertainty,” IEEE Transactions on Computers, vol. 62, no. 1, pp. 98–111, 2013.

[18] E. Lee and D. Messerschmitt, “Synchronous Data Flow,” Proceedings of the IEEE, vol. 75, no. 9, pp. 1235–1245, 1987.

[19] S. Stuijk, M. Geilen, and T. Basten, “Exploring Trade-offs in Buffer Requirements and Throughput Constraints for Synchronous Dataflow Graphs,” in Proceeding of the Annual Design Automation Conference (DAC). ACM, 2006, pp. 899–904.

[20] P. Stravers and J. Hoogerbrugge, “Homogeneous Multiprocessing and the Future of Silicon Design Paradigms,” in Proceedings of the International Symposium on VLSI Technology, Systems, and Applications, 2001, pp. 184–187.

[21] P. Cumming, “The ti omap platform approach to soc,” Winningthe SOC Revolution, 2003.

[22] M. A. Watkins and D. H. Albonesi, “ReMAP: A ReconfigurableHeterogeneous Multicore Architecture,” in Proceedings of theIEEE/ACM International Symposium on Microarchitecture (MICRO).IEEE Computer Society, 2010, pp. 497–508.

[23] L. Chen and T. Mitra, “Shared Reconfigurable Fabric for Multi-core Customization,” in Proceeding of the Annual Design Automa-tion Conference (DAC). ACM, 2011, pp. 830–835.

[24] L. Jiashu, A. Das, and A. Kumar, “A Design Flow for PartiallyReconfigurable Heterogeneous Multi-processor Platforms,” inProceedings of the International Symposium on Rapid System Pro-totyping (RSP). IEEE, 2012, pp. 170–176.

[25] M. Santarini, “Zynq-7000 EPP Sets Stage for New Era of Inno-vations,” Xcell journal, vol. 75, pp. 8–13, 2011.

[26] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang,S. Velusamy, and D. Tarjan, “Temperature-Aware Microarchi-tecture: Modeling and I1mplementation,” ACM Transactions onArchitecture and Code Optimization (TACO), vol. 1, no. 1, pp. 94–125, 2004.

[27] A. Sridhar, A. Vincenzi, M. Ruggiero, T. Brunschwiler, and D. Atienza, “3D-ICE: Fast Compact Transient Thermal Modeling for 3D ICs with Inter-tier Liquid Cooling,” in Proceedings of the International Conference on Computer Aided Design (ICCAD). IEEE Press, 2010, pp. 463–470.

[28] F. Farkhani and F. Mohammadi, “Temperature and Power Measurement of Modern Dual Core Processor by Infrared Thermography,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 2010, pp. 1603–1606.

[29] K.-J. Lee and K. Skadron, “Using Performance Counters for Runtime Temperature Sensing in High-Performance Processors,” in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2005.

[30] S. W. Chung and K. Skadron, “Using On-Chip Event Counters For High-Resolution, Real-Time Temperature Measurement,” in Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronics Systems (ITHERM), 2006, pp. 114–120.

[31] Y. Zhang and A. Srivastava, “Accurate Temperature Estimation Using Noisy Thermal Sensors,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2009, pp. 472–477.

[32] Z. Qi, J. Wang, A. Cabe, S. Wooters, T. Blalock, B. Calhoun, and M. Stan, “SRAM-based NBTI/PBTI Sensor System Design,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2010, pp. 849–852.

[33] J. R. Black, “Electromigration Failure Modes in Aluminum Metallization for Semiconductor Devices,” Proceedings of the IEEE, vol. 57, no. 9, pp. 1587–1594, 1969.

[34] JEDEC Solid State Technology Association, “Failure Mechanisms and Models for Semiconductor Devices,” JEDEC Publication JEP122-B, 2003.

[35] T. Yamamoto, K. Uwasawa, and T. Mogami, “Bias Temperature Instability in Scaled p+ Polysilicon Gate p-MOSFET’s,” IEEE Transactions on Electron Devices, vol. 46, no. 5, pp. 921–926, 1999.

[36] M. Alam, “A Critical Examination of the Mechanics of Dynamic NBTI for PMOSFETs,” in IEEE International Electron Devices Meeting (IEDM), 2003, pp. 14.4.1–14.4.4.

[37] M. Alam and S. Mahapatra, “A Comprehensive Model of PMOS NBTI Degradation,” Microelectronics Reliability, vol. 45, no. 1, pp. 71–81, 2005.

[38] E. Takeda and N. Suzuki, “An Empirical Model for Device Degradation Due to Hot-Carrier Injection,” IEEE Electron Device Letters, vol. 4, no. 4, pp. 111–113, 1983.

[39] S. Tam, P.-K. Ko, and C. Hu, “Lucky-Electron Model of Channel Hot-Electron Injection in MOSFET’s,” IEEE Transactions on Electron Devices, vol. 31, no. 9, pp. 1116–1125, 1984.

[40] S. Downing and D. Socie, “Simple Rainflow Counting Algorithms,” International Journal of Fatigue, vol. 4, no. 1, pp. 31–40, 1982.

[41] V. Radhakrishnan, “On the Bilinearity of the Coffin-Manson Low-Cycle Fatigue Relationship,” International Journal of Fatigue, vol. 14, no. 5, pp. 305–311, 1992.

[42] T. Shimokawa and S. Tanaka, “A Statistical Consideration of Miner’s Rule,” International Journal of Fatigue, vol. 2, no. 4, pp. 165–170, 1980.

[43] Y. Xiang, T. Chantem, R. P. Dick, X. S. Hu, and L. Shang, “System-level Reliability Modeling for MPSoCs,” in Proceedings of the Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). ACM, 2010, pp. 297–306.

[44] K. Chopra, C. Zhuo, D. Blaauw, and D. Sylvester, “A Statistical Approach for Full-chip Gate-oxide Reliability Analysis,” in Proceedings of the International Conference on Computer Aided Design (ICCAD). IEEE Press, 2008, pp. 698–705.

[45] L. Huang and Q. Xu, “Lifetime Reliability for Load-Sharing Redundant Systems With Arbitrary Failure Distributions,” IEEE Transactions on Reliability, vol. 59, no. 2, pp. 319–330, 2010.

[46] G. Smith, “Platform Based Design: Does It Answer the Entire SoC Challenge?” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2004, pp. 407–407.

[47] A. Pinto, A. Bonivento, A. L. Sangiovanni-Vincentelli, R. Passerone, and M. Sgroi, “System Level Design Paradigms: Platform-based Design and Communication Synthesis,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2004, pp. 537–563.

[48] K. Popovici, X. Guerin, F. Rousseau, P. S. Paolucci, and A. A. Jerraya, “Platform-based Software Design Flow for Heterogeneous MPSoC,” ACM Transactions on Embedded Computing Systems (TECS), vol. 7, no. 4, pp. 39:1–39:23, 2008.

[49] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The Case for Lifetime Reliability-Aware Microprocessors,” in Proceedings of the Annual International Symposium on Computer Architecture (ISCA). IEEE Computer Society, 2004, pp. 276–287.

[50] W.-L. Hung, Y. Xie, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Thermal-Aware Task Allocation and Scheduling for Embedded Systems,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). IEEE Computer Society, 2005, pp. 898–899.

[51] T. Chantem, X. Hu, and R. Dick, “Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs,” IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 19, no. 10, pp. 1884–1897, 2011.

[52] A. Mutapcic, S. Boyd, S. Murali, D. Atienza, G. De Micheli, and R. Gupta, “Processor Speed Control With Thermal Constraints,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 56, no. 9, pp. 1994–2008, 2009.

[53] S. Zhang and K. S. Chatha, “Thermal Aware Task Sequencing on Embedded Processors,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2010, pp. 585–590.

[54] N. Fisher, J.-J. Chen, S. Wang, and L. Thiele, “Thermal-Aware Global Real-Time Scheduling on Multicore Systems,” in Proceedings of the IEEE Symposium on Real-Time and Embedded Technology and Applications (RTAS). IEEE Computer Society, 2009, pp. 131–140.

[55] R. Rao and S. Vrudhula, “Fast and Accurate Prediction of the Steady-State Throughput of Multicore Processors Under Thermal Constraints,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 28, no. 10, pp. 1559–1572, Oct. 2009.

[56] J. Cui and D. Maskell, “A Fast High-Level Event-Driven Thermal Estimator for Dynamic Thermal Aware Scheduling,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 31, no. 6, pp. 904–917, 2012.

[57] P. Kumar and L. Thiele, “Thermally Optimal Stop-go Scheduling of Task Graphs with Real-time Constraints,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE Press, 2011, pp. 123–128.

[58] M. Bao, A. Andrei, P. Eles, and Z. Peng, “Temperature-aware Idle Time Distribution for Energy Optimization with Dynamic Voltage Scaling,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2010, pp. 21–26.

[59] D. Rai, H. Yang, I. Bacivarov, J.-J. Chen, and L. Thiele, “Worst-case Temperature Analysis for Real-time Systems,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2011, pp. 1–6.

[60] Z. Gu, C. Zhu, L. Shang, and R. Dick, “Application-Specific MPSoC Reliability Optimization,” IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 16, no. 5, pp. 603–608, 2008.

[61] B. H. Meyer, A. S. Hartman, and D. E. Thomas, “Cost-effective Slack Allocation for Lifetime Improvement in NoC-based MPSoCs,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2010, pp. 1596–1601.

[62] L. Huang, F. Yuan, and Q. Xu, “On Task Allocation and Scheduling for Lifetime Extension of Platform-Based MPSoC Designs,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 22, no. 12, pp. 2088–2099, 2011.

[63] A. S. Hartman, D. E. Thomas, and B. H. Meyer, “A Case for Lifetime-aware Task Mapping in Embedded Chip Multiprocessors,” in Proceedings of the Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). ACM, 2010, pp. 145–154.

[64] Y. Lu, L. Shang, H. Zhou, H. Zhu, F. Yang, and X. Zeng, “Statistical Reliability Analysis Under Process Variation and Aging Effects,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2009, pp. 514–519.

[65] A. Masrur, P. Kindt, M. Becker, S. Chakraborty, V. Kleeberger, M. Barke, and U. Schlichtmann, “Schedulability Analysis for Processors with Aging-Aware Autonomic Frequency Scaling,” in Proceedings of the IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), 2012, pp. 11–20.

[66] A. Das, A. Kumar, and B. Veeravalli, “Reliability-driven Task Mapping for Lifetime Extension of Networks-on-chip Based Multiprocessor Systems,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2013, pp. 689–694.

[67] L. Huang and Q. Xu, “Energy-efficient Task Allocation and Scheduling for Multi-mode MPSoCs Under Lifetime Reliability Constraint,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2010, pp. 1584–1589.

[68] S. Wang and J.-J. Chen, “Thermal-Aware Lifetime Reliability in Multicore Systems,” in Proceedings of the International Symposium on Quality Electronic Design (ISQED), 2010, pp. 399–405.

[69] I. Ukhov, M. Bao, P. Eles, and Z. Peng, “Steady-state Dynamic Temperature Analysis and Reliability Optimization for Embedded Multiprocessor Systems,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2012, pp. 197–204.

[70] A. Das, A. Kumar, and B. Veeravalli, “Temperature Aware Energy-Reliability Trade-offs for Mapping of Throughput-Constrained Applications on Multimedia MPSoCs,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2014.

[71] ——, “Reliability-Aware Platform-Based Design Methodology for Energy-Efficient Multiprocessor Systems,” National University of Singapore, Tech. Rep., 2014.

[72] T. Siddiqua and S. Gurumurthi, “Balancing Soft Error Coverage with Lifetime Reliability in Redundantly Multithreaded Processors,” in Proceedings of the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2009, pp. 1–12.

[73] C.-L. Chou and R. Marculescu, “FARM: Fault-aware Resource Management in NoC-based Multiprocessor Platforms,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2011, pp. 1–6.

[74] A. Das, A. Kumar, B. Veeravalli, C. Bolchini, and A. Miele, “Combined DVFS and Mapping Exploration for Lifetime and Soft-Error Susceptibility Improvement in MPSoCs,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2014.

[75] A. Das, A. Kumar, and B. Veeravalli, “Communication and Migration Energy Aware Design Space Exploration for Multicore Systems with Intermittent Faults,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2013, pp. 1631–1636.

[76] B. Dave and N. Jha, “COFTA: Hardware-Software Co-Synthesis of Heterogeneous Distributed Embedded Systems for Low Overhead Fault Tolerance,” IEEE Transactions on Computers, vol. 48, no. 4, pp. 417–441, 1999.

[77] Y. Xie, L. Li, M. Kandemir, N. Vijaykrishnan, and M. Irwin, “Reliability-aware Co-synthesis for Embedded Systems,” The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 49, no. 1, pp. 87–99, 2007.

[78] C. Bolchini and A. Miele, “Reliability-Driven System-Level Synthesis of Embedded Systems,” in Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Oct. 2010, pp. 35–43.

[79] C. Bolchini, A. Miele, F. Salice, D. Sciuto, and L. Pomante, “Reliable System Co-Design: the FIR Case Study,” in Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Oct. 2004, pp. 433–441.

[80] P. v. Stralen and A. Pimentel, “A SAFE Approach Towards Early Design Space Exploration of Fault-tolerant Multimedia MPSoCs,” in Proceedings of the Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). ACM, 2012, pp. 393–402.

[81] V. Izosimov, I. Polian, P. Pop, P. Eles, and Z. Peng, “Analysis and Optimization of Fault-tolerant Embedded Systems with Hardened Processors,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2009, pp. 682–687.

[82] M. Glaß, M. Lukasiewycz, T. Streichert, C. Haubelt, and J. Teich, “Reliability-aware System Synthesis,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). EDA Consortium, 2007, pp. 409–414.

[83] C. Zhu, Z. P. Gu, R. P. Dick, and L. Shang, “Reliable Multiprocessor System-on-chip Synthesis,” in Proceedings of the Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). ACM, 2007, pp. 239–244.

[84] A. Das, A. Kumar, and B. Veeravalli, “Aging-aware Hardware-software Task Partitioning for Reliable Reconfigurable Multiprocessor Systems,” in Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). IEEE Press, 2013, pp. 1:1–1:10.

[85] ——, “Reliability-Aware Hardware-Software Co-Design of Reconfigurable Multiprocessor Systems,” National University of Singapore, Tech. Rep., 2014.

[86] W. Wolf, “Hardware-Software Co-Design of Embedded Systems,” Proceedings of the IEEE, vol. 82, no. 7, pp. 967–989, 1994.

[87] G. De Micheli and R. Gupta, “Hardware/software Co-Design,” Proceedings of the IEEE, vol. 85, no. 3, pp. 349–365, 1997.

[88] S. Ha, S. Kim, C. Lee, Y. Yi, S. Kwon, and Y.-P. Joo, “PeaCE: A Hardware-software Codesign Environment for Multimedia Embedded Systems,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 12, no. 3, pp. 24:1–24:25, 2008.

[89] J. Teich, “Hardware/Software Codesign: The Past, the Present, and Predicting the Future,” Proceedings of the IEEE, vol. 100, no. Special Centennial Issue, pp. 1411–1430, May 2012.

[90] M. T. Schmitz, B. M. Al-Hashimi, and P. Eles, “Cosynthesis of Energy-Efficient Multimode Embedded Systems With Consideration of Mode-Execution Probabilities,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 24, no. 2, pp. 153–169, 2005.

[91] M. Kim, S. Banerjee, N. Dutt, and N. Venkatasubramanian, “Energy-Aware Cosynthesis of Real-Time Multimedia Applications on MPSoCs Using Heterogeneous Scheduling Policies,” ACM Transactions on Embedded Computing Systems (TECS), vol. 7, no. 2, pp. 1–19, 2008.

[92] L. Shang, L.-S. Peh, A. Kumar, and N. K. Jha, “Thermal Modeling, Characterization and Management of On-Chip Networks,” in Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, 2004, pp. 67–78.

[93] A. Kivilcim Coskun, T. Simunic Rosing, K. Mihic, G. De Micheli, and Y. Leblebici, “Analysis and Optimization of MPSoC Reliability,” Journal of Low Power Electronics, vol. 2, no. 1, pp. 56–69, 2006.

[94] E. Karl, D. Blaauw, D. Sylvester, and T. Mudge, “Reliability Modeling and Management in Dynamic Microprocessor-based Systems,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2006, pp. 1057–1060.

[95] F. Mulas, D. Atienza, A. Acquaviva, S. Carta, L. Benini, and G. De Micheli, “Thermal Balancing Policy for Multiprocessor Stream Computing Platforms,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 28, no. 12, pp. 1870–1882, 2009.

[96] V. Hanumaiah and S. Vrudhula, “Temperature-Aware DVFS for Hard Real-Time Applications on Multicore Processors,” IEEE Transactions on Computers, vol. 61, no. 10, pp. 1484–1494, 2012.

[97] X. Zhou, J. Yang, M. Chrobak, and Y. Zhang, “Performance-aware Thermal Management via Task Scheduling,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 7, no. 1, pp. 5:1–5:31, 2010.

[98] M. A. Faruque, J. Jahn, and J. Henkel, “Runtime Thermal Management Using Software Agents for Multi- and Many-Core Architectures,” IEEE Design & Test of Computers, vol. 27, no. 6, pp. 58–68, 2010.

[99] G. Liu, M. Fan, and G. Quan, “Neighbor-aware Dynamic Thermal Management for Multi-core Platform,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). EDA Consortium, 2012, pp. 187–192.

[100] A. Coskun, T. Rosing, K. Whisnant, and K. Gross, “Static and Dynamic Temperature-Aware Scheduling for Multiprocessor SoCs,” IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 16, no. 9, pp. 1127–1140, 2008.

[101] A. Kumar, L. Shang, L.-S. Peh, and N. K. Jha, “HybDTM: A Coordinated Hardware-software Approach for Dynamic Thermal Management,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2006, pp. 548–553.

[102] A. S. Hartman and D. E. Thomas, “Lifetime Improvement Through Runtime Wear-based Task Mapping,” in Proceedings of the Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). ACM, 2012, pp. 13–22.

[103] T. Chantem, Y. Xiang, X. S. Hu, and R. P. Dick, “Enhancing Multicore Reliability Through Wear Compensation in Online Assignment and Scheduling,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE). European Design and Automation Association, 2013, pp. 1373–1378.

[104] A. K. Coskun, T. S. Rosing, and K. C. Gross, “Temperature Management in Multiprocessor SoCs Using Online Learning,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2008, pp. 890–893.

[105] W. Lee, K. Patel, and M. Pedram, “GOP-level Dynamic Thermal Management in MPEG-2 Decoding,” IEEE Transactions on Very Large Scale Integration Systems (TVLSI), vol. 16, no. 6, pp. 662–672, 2008.

[106] R. Jayaseelan and T. Mitra, “Dynamic Thermal Management via Architectural Adaptation,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2009, pp. 484–489.

[107] Y. Ge and Q. Qiu, “Dynamic Thermal Management for Multimedia Applications Using Machine Learning,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2011, pp. 95–100.

[108] T. Ebi, D. Kramer, W. Karl, and J. Henkel, “Economic Learning for Thermal-aware Power Budgeting in Many-core Architectures,” in Proceedings of the Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). ACM, 2011, pp. 189–196.

[109] A. Das, R. A. Shafik, G. V. Merrett, B. M. Al-Hashimi, A. Kumar, and B. Veeravalli, “Reinforcement Learning-Based Inter- and Intra-Application Thermal Optimization for Lifetime Improvement of Multicore Systems,” in Proceedings of the Annual Design Automation Conference (DAC), 2014.

[110] A. Coskun, T. Rosing, and K. Gross, “Utilizing Predictors for Efficient Thermal Management in Multiprocessor SoCs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 28, no. 10, pp. 1503–1516, 2009.

[111] R. Z. Ayoub and T. S. Rosing, “Predict and Act: Dynamic Thermal Management for Multi-core Processors,” in Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED). ACM, 2009, pp. 99–104.

[112] R. Cochran and S. Reda, “Consistent Runtime Thermal Prediction and Control Through Workload Phase Detection,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2010, pp. 62–67.

[113] P. Singh, E. Karl, D. Sylvester, and D. Blaauw, “Dynamic NBTI Management Using a 45 nm Multi-Degradation Sensor,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 9, pp. 2026–2037, 2011.

[114] F. Sironi, M. Maggio, R. Cattaneo, G. Del Nero, D. Sciuto, and M. Santambrogio, “ThermOS: System Support for Dynamic Thermal Management of Chip Multi-Processors,” in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2013, pp. 41–50.

[115] C. Bolchini, M. Carminati, A. Miele, A. Das, A. Kumar, and B. Veeravalli, “Run-time Mapping for Reliable Many-cores Based on Energy/Performance Trade-offs,” in Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2013, pp. 58–64.

[116] P. Mercati, A. Bartolini, F. Paterna, T. S. Rosing, and L. Benini, “Workload and User Experience-aware Dynamic Reliability Management in Multicore Processors,” in Proceedings of the Annual Design Automation Conference (DAC). ACM, 2013, pp. 2:1–2:6.