Analyzing the Issues of Soft Error Rates (SER) in Technology Trend & Reliability for CMOS Designs

Sirisha.Gadiparti

Electrical Engineering

University of Central Florida

Orlando, United States

[email protected]

Abstract—Soft error rate (SER) has become a critical reliability issue for CMOS designs due to continuous technology scaling. However, the striking-time and multi-cycle effects have not been properly considered in SER estimation for advanced CMOS designs. This paper reviews how the striking-time and multi-cycle effects are formulated into the problem of SER estimation. It also reviews the impact of new microprocessor technology on microprocessor SER. As microprocessor feature sizes decreased from 180nm to 65nm, memory error rates per bit decreased, but the data indicate a reversal of this trend at 40nm. SER is also examined as a function of power supply voltage (Vdd) over a range of 1.2V down to 0.5V, and the data show that SER increases significantly as Vdd decreases. This result implies that dynamic voltage frequency scaling (DVFS), a commonly used microprocessor energy reduction technique, could cause a significant decrease in microprocessor reliability. The data also show that more energy-efficient transistors using the back-bias technique do not appear to significantly impact microprocessor reliability.


I. INTRODUCTION

The vulnerability of integrated circuits has been increasing in recent years due to smaller transistor sizes and higher integration density. When an energetic particle strikes a sensitive circuit node of a transistor in the OFF state, it ionizes the semiconductor drain region, connecting it temporarily to the transistor substrate. Alpha particles originate from the radioactive decay of uranium or thorium impurities in chip and packaging materials [1]. When they penetrate the silicon substrate, the α-particles generate electron-hole pairs that may be collected by p-n junctions. An important quantity that characterizes the susceptibility of a memory cell to corruption of its logic state is the critical charge (Qcrit) [3]. By definition, a minimum charge of Qcrit needs to be collected at an internal node to flip the state of a memory cell. This concept is applicable to any storage cell (SRAMs, DRAMs, latches). The soft error rate (SER) of a cell is a function of Qcrit and the sensitive node area. Another critical parameter is the duty cycle.

Technology scaling brings higher operating frequencies, shallower logic depth, and smaller transistor-to-transistor spacing. All of these phenomena magnify the sensitivity to transient faults, leading to an exponential growth of soft error rates (SERs) in combinational circuits. In other words, soft errors greatly degrade the reliability of a system and can no longer be ignored in nanometer technologies, especially for safety-critical (high-reliability) applications such as automotive, aerospace, and medical systems.

II. TECHNOLOGY TREND & VOLTAGE SCALING

Figure 1 shows the microprocessor SRAM single-event upset (SEU) rate and voltage as a function of the technology node, normalized to a value of 1 at the 90nm technology node for ease of comparison and to protect proprietary data. (SEU rate is reported in FITs/kbit, equivalent to cell upset events per kilobit per billion hours.) The nominal power supply voltage (Vdd) has been slowly decreasing, which decreases the critical charge necessary for a cell upset event, thus making the cells more vulnerable to bit flips. However, cell size reduction and the corresponding sensitive area reduction with technology scaling have led to a reduction in SRAM cell SEU rate, even as the voltage has decreased, until the 40nm technology node. The large reduction in SRAM cell SEU rate from 130nm technology to 90nm technology was most likely the result of an SRAM design change. The layout of the SRAM cell changed from the traditional one to a lithographically friendly one with uni-directional poly orientation.

Because soft error susceptibility increases exponentially as voltage decreases and decreases linearly as area decreases (quadratically as feature size decreases), it has long been expected that the voltage reduction that has accompanied feature size reduction would eventually cause SEU rates to increase.

Figure 1 also shows how flop SEU rates compare to SRAM SEU rates as a function of technology. The flop data presented here are averaged over all the different drive strengths and different flop design families in the product. The flop data also have larger error bars because the total number of flops is smaller than the total number of memory cells on the same die. As shown in Fig. 1, the SEU rate for flops in the 130nm product is similar to that for flops in the 90nm product, while flops in the 65nm product show a reduction in SEU rate.

One reason that the trend may not yet have appeared in the flop data is that flops are bigger than SRAMs, so the sensitive area reduction may have more relative importance. However, it should be noted that due to the special treatment by fabs, SRAM cells get close to the expected area shrink entitled by the feature size reduction. Flops don't enjoy the same treatment, and hence, the sensitive area reduction for flops is more modest. Another contributing factor to the flop trend is that flop design style plays a big role in determining the flop SEU rate. Due to the manufacturing and design dependencies, some difference can be expected in the SEU rate versus technology trend for memory cells and flops. This will be especially true when comparing trend results from different vendors.

In the past, designers have not been greatly concerned about soft errors in microprocessor logic because the number of flops/latches on a microprocessor is much smaller than the number of SRAM cells, and flop SEU rates were lower than SRAM SEU rates. In 90nm, 65nm, and 40nm technology, flop SEU rates are larger than SRAM SEU rates. Because flop protection mechanisms such as state machine encoding and invariant checking are more difficult to implement than simple parity and ECC, flops are quickly becoming the major contributor to system soft error rate as technology scales to smaller feature sizes.

As shown in Fig. 1b for the 40nm technology node, the average flop SEU rate increases significantly as Vdd decreases. This is expected because the critical charge decreases linearly as Vdd decreases and SEU rate has an exponential dependence on critical charge. The SEU rate increases by approximately 30% per 0.1V as the voltage decreases from 1.25V to 0.5V as shown by the exponential fit in Fig. 1b. The error bars in this paper indicate +/- 1 standard deviation range around the average value. The SEU rate approximately doubles when the voltage decreases from 0.7V to 0.5V. The 28nm technology node is also shown in Fig. 1b and indicates the same trend, albeit with fewer data points.
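The quoted sensitivity of roughly 30% per 0.1V can be turned into a quick estimate. The sketch below assumes the rate grows by a constant factor of 1.30 for every 0.1V drop, which is only one reading of the exponential fit; the function name and the default growth factor are illustrative and are not taken from the measured data.

```python
def seu_rate_factor(v_from, v_to, growth_per_100mv=1.30):
    """Multiplicative change in SEU rate when Vdd moves from v_from to v_to (volts).

    Assumes a constant growth factor per 0.1 V decrease. The default of 1.30
    is a hypothetical model parameter based on the ~30% per 0.1 V figure
    quoted in the text.
    """
    steps = (v_from - v_to) / 0.1  # number of 0.1 V decrements
    return growth_per_100mv ** steps


if __name__ == "__main__":
    print(round(seu_rate_factor(0.7, 0.5), 2))   # 0.7 V down to 0.5 V
    print(round(seu_rate_factor(1.25, 0.5), 2))  # full reported voltage range
```

Under this assumption, the step from 0.7V to 0.5V gives a factor of about 1.7, in rough agreement with the near doubling reported, and the full 1.25V to 0.5V range gives a factor of roughly 7.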

Because the dynamic power consumption of transistors is proportional to V²F, where V is the voltage (usually Vdd) and F is the frequency, microprocessor designers would like to reduce Vdd as much as possible to make the microprocessor more energy-efficient. These test results provide an indication of the SEU rate impact of that Vdd reduction. Dynamic voltage frequency scaling (DVFS) is the (usually simultaneous) reduction of both voltage and frequency to reduce microprocessor power consumption. Many modern microprocessors use this technique to reduce power consumption during periods of light utilization or during periods when external conditions require that the system reduce its power consumption.
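To make the V²F dependence concrete, the following minimal sketch computes dynamic power relative to a reference operating point. The operating points in the example are hypothetical and chosen only for illustration; the proportionality constant (switched capacitance and activity) is assumed unchanged and cancels out.

```python
def relative_dynamic_power(v, f, v_ref, f_ref):
    """Dynamic power at (v, f) relative to (v_ref, f_ref), assuming P is proportional to V^2 * F."""
    return (v / v_ref) ** 2 * (f / f_ref)


if __name__ == "__main__":
    # Hypothetical DVFS step: 1.0 V at 2 GHz scaled down to 0.7 V at 1 GHz.
    ratio = relative_dynamic_power(0.7, 1.0e9, 1.0, 2.0e9)
    print(f"dynamic power falls to {ratio:.3f} of the reference")
```

For this hypothetical step, the voltage term contributes 0.49 and the frequency term 0.5, so dynamic power drops to about a quarter of the reference value.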

To reduce microprocessor power consumption, designers are creating more energy-efficient circuits. One example is the use of back-bias transistors, in which a bias voltage is applied to the “back gate” or the “body” terminal of a transistor. This can be used to increase the effective threshold voltage of an extremely leaky device, thus reducing leakage current and making the device usable in a system that would have otherwise rejected it due to power consumption. Back bias can also be used to reduce the threshold voltages of the devices. A reduction in threshold voltage increases the operating speed of the device and allows it to be used in a system for which it originally would have been too slow at the cost of added leakage current as noted above. SEU rate implications as a function of both forward bias and reverse bias for the complementary NMOS and PMOS transistors need to be understood before this scheme can be qualified for field usage.

III. TRANSIENT FAULTS

A transient fault is generated when a particle strikes under a passing logic condition (i.e. the particle strikes the NMOS (PMOS) transistor when the output is at logic-1 (logic-0)). Taking Fig. 2 as an example, an inverter is described in 3D mixed-mode simulation, where the NMOS transistor is modeled in the 3D device domain and the PMOS transistor is modeled using a SPICE model. In this example, a radiation particle strikes the drain region of the NMOS transistor. As a result, a negative transient is generated and results in a fault at the output. Hence, logically, a transient fault is assumed to occur when the output of the inverter is at logic-1. In Fig. 2(a), the black dashed line represents the expected logic signal, the red solid line represents the signal with a transient fault, and the arrow marks the striking time of the particle. However, a radiation-induced transient results from charge deposition and collection. Hence, a question arises: "What will happen if the particle strikes under a blocking logic condition (i.e. the particle strikes the NMOS (PMOS) transistor when the output is at logic-0 (logic-1))?"

The example in Fig. 2 is used again. As shown in Fig. 2(b), the striking time is set at 0.2ns, and the result shows no transient fault. However, a transient fault can be observed when the striking time is shifted from 0.2ns to 0.45ns, as shown in Fig. 2(c). In other words, whether a transient fault is generated depends on when the particle strikes relative to the logic condition of the device. This striking-time effect should be considered during SER estimation.

Moreover, with continuous technology scaling, especially in nanometer technologies, operating frequency increases significantly to satisfy high performance demands. But higher operating frequency is not a free lunch. In particular, an additional challenge for SER estimation arises: a transient fault can result not only in a single-cycle error but also in a multi-cycle error. As shown in Fig. 3, the pulse width of the generated transient fault is 698ps, which means that if the operating frequency is higher than 1.5GHz, an incorrect value will be captured in multiple cycles. Transient faults with larger pulse widths are therefore prone to causing multi-cycle errors.
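The link between pulse width and multi-cycle errors can be checked with a small calculation. The sketch below is a simplification under stated assumptions: the flip-flop samples at ideal clock edges (setup and hold times are ignored), the pulse reaches the flip-flop undistorted, and the function name is hypothetical. It returns the smallest and largest number of capturing edges a pulse can overlap, depending on its alignment with the clock.

```python
import math


def capture_edges(pulse_width_s, clock_hz):
    """Return (min_edges, max_edges) of clock edges a pulse can overlap.

    Assumes ideal edge sampling; the result depends only on the pulse width
    measured in clock periods.
    """
    periods = pulse_width_s * clock_hz   # pulse width in units of clock periods
    min_edges = math.floor(periods)      # worst alignment for the pulse
    max_edges = min_edges + 1            # best alignment for the pulse
    return min_edges, max_edges


if __name__ == "__main__":
    for freq in (1.0e9, 1.5e9, 3.0e9):
        lo, hi = capture_edges(698e-12, freq)
        print(f"{freq / 1e9:.1f} GHz: between {lo} and {hi} capturing edges")
```

A multi-cycle error becomes possible once the upper bound reaches two, which happens when the pulse is at least one clock period wide. For the 698ps pulse this is the case at 1.5GHz but not at 1GHz.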

IV. GENERATION AND PROPAGATION OF TRANSIENT FAULTS

The generation of a transient fault can be summarized in two steps: charge deposition and charge collection. As shown in Fig. 4, when a radiation particle strikes a node and passes through the device, it generates electron-hole pairs along its path. This step is called charge deposition. The deposited charge is then collected by the charge collection mechanisms, including drift-diffusion, the bipolar effect, and alpha-particle source-drain penetration (ALPEN). As a result, a transient current is generated at the drain node of the struck device, regardless of which charge collection mechanism dominates. In general, this transient current can be modeled at the circuit level by an exponential current pulse, shown as follows:
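One commonly used single-time-constant pulse shape with these two parameters, given here as an assumption rather than as the exact expression of the reviewed work, and with a prefactor chosen so that the pulse integrates to Q over t ≥ 0, is:

```latex
I(t) = \frac{2Q}{\tau\sqrt{\pi}} \, \sqrt{\frac{t}{\tau}} \; e^{-t/\tau}, \qquad t \ge 0
```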

where Q is the total amount of collected charge and τ is a time constant determined by process-related factors. When a transient fault is induced by a high-energy charged particle, it may propagate to a primary output of the circuit and thus result in a soft error. However, not every generated transient fault is latched by a memory element: three masking effects exist that can prevent transient faults from becoming soft errors. These three effects are:

A. Electrical Masking:

When a transient fault propagates through a subsequent gate, it may be attenuated due to the electrical properties of the gates it passes through. The transient fault may disappear if the attenuation is strong enough. As illustrated in Fig. 5, the transient fault is masked because its amplitude is no longer high enough after propagating through G3.

B. Logical Masking:

Logical masking occurs when there is no sensitizable path from the struck node to any output of the circuit. In Fig. 5, the transient fault is blocked at G1, whose other side-input holds a controlling value (logic-0).

C. Timing Masking:

Because a flip-flop is insensitive to any signal arriving outside the latching window (i.e. setup time + hold time), an arriving transient fault is masked if it falls outside the latching window or if its pulse width is smaller than the latching window. As shown in Fig. 5, only one transient fault, T2, will be captured because its arrival time falls within the latching window.
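The timing-masking condition can be expressed as a simple interval check. The sketch below is an illustration under a simplifying assumption: a pulse is captured only if it is present for the entire latching window around the clock edge, which covers both cases named above (arriving outside the window, or being narrower than the window). The function and parameter names are hypothetical, and all times are in picoseconds.

```python
def is_latched(arrival_ps, width_ps, clock_edge_ps, setup_ps, hold_ps):
    """True if the pulse covers the whole latching window, False if it is masked.

    Latching window: [clock_edge - setup, clock_edge + hold].
    Pulse interval:  [arrival, arrival + width].
    """
    window_start = clock_edge_ps - setup_ps
    window_end = clock_edge_ps + hold_ps
    pulse_end = arrival_ps + width_ps
    return arrival_ps <= window_start and pulse_end >= window_end


if __name__ == "__main__":
    # Clock edge at 1000 ps with 50 ps setup and 50 ps hold (illustrative values).
    print(is_latched(900, 200, 1000, 50, 50))  # covers the window
    print(is_latched(100, 200, 1000, 50, 50))  # arrives far too early
    print(is_latched(960, 60, 1000, 50, 50))   # narrower than the window
```

A more detailed model would assign a capture probability to pulses that only partially overlap the window; the all-or-nothing rule here is chosen for clarity.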

V. FRAMEWORK

Fig. 6 shows the flowchart of the overall SER estimation, which consists of three stages: (1) sensitized-probability computation, (2) generation and propagation of transient faults, and (3) total SER estimation. The following sections provide the details of each stage in reverse order, starting from total SER estimation.

Total SER Estimation: First, we introduce the estimation of total SER for the circuit under test (CUT). The total SER can be computed as the summation of the SER of each individual node n in the circuit. That is,

$$SER_{CUT} = \sum_{n=1}^{N_{node}} SER_n$$

where Nnode is the total number of nodes in the CUT that are susceptible to being struck by radiation particles. Each SERn can be further formulated by integrating the product of the particle-hit frequency and the error probability over the range of charge QMIN to QMAX. Hence,

$$SER_n = \int_{Q_{MIN}}^{Q_{MAX}} F(q) \, P_{err}(n, q) \, dq \tag{6}$$

where F(q) represents the effective frequency of a particle hit in unit time, which can be found in [19]. Perr(n, q) denotes the probability that a transient fault, induced by a particle with collected charge q striking node n, becomes a soft error after propagating to the flip-flops.

In (6), the error probability Perr(n, q) comprises the generation and propagation of a transient fault, including all three masking effects described in Section IV, and can be further formulated as:

where Nff and i indicate the total number of flip-flops and flip-flop i in the CUT, respectively. Psen(·) is the sensitized probability of a transient fault with respect to logical masking. Pgen(·) and Plat(·) are the generation probability and latching probability of a transient fault with respect to electrical masking and timing masking, respectively. Moreover, in order to incorporate the striking-time effect, the generation of a transient fault is classified into three cases.
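The node-by-node summation and the integration over collected charge described in this section can be sketched numerically. The code below is a minimal illustration, not the reviewed framework: the trapezoidal rule, the function names, and the placeholder hit-rate and error-probability functions are all assumptions made for the example.

```python
def node_ser(hit_rate, p_err, q_min, q_max, steps=1000):
    """Trapezoidal estimate of the integral of hit_rate(q) * p_err(q) over [q_min, q_max]."""
    h = (q_max - q_min) / steps
    total = 0.5 * (hit_rate(q_min) * p_err(q_min) + hit_rate(q_max) * p_err(q_max))
    for k in range(1, steps):
        q = q_min + k * h
        total += hit_rate(q) * p_err(q)
    return total * h


def total_ser(nodes, hit_rate, q_min, q_max):
    """Sum the per-node SER; `nodes` maps a node name to its p_err(q) callable."""
    return sum(node_ser(hit_rate, p_err, q_min, q_max) for p_err in nodes.values())


if __name__ == "__main__":
    # Placeholder models: constant hit rate and constant per-node error probability.
    nodes = {"n1": lambda q: 0.5, "n2": lambda q: 0.25}
    print(total_ser(nodes, lambda q: 2.0, 0.0, 10.0))
```

In a real flow, the per-node error probability would itself combine the sensitized, generation, and latching probabilities over all flip-flops; the constants here only exercise the summation and integration structure.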

VI. GANG ERROR SIMULATION

A recent tool, called Relyzer, systematically analyzes an entire application's resiliency to single-bit soft errors using a small set of carefully selected error injection sites. Relyzer provides a practical resiliency evaluation mechanism but still requires significant evaluation time, most of which is spent on error simulations. A new technique called GangES (Gang Error Simulator) aims to reduce error simulation time. GangES observes that a set or gang of error simulations that result in the same intermediate execution state (after their error injections) will produce the same error outcome; therefore, only one simulation of the gang needs to be completed, resulting in significant overall savings in error simulation time. GangES leverages program structure to carefully select when to compare simulations and what state to compare. For the evaluated workloads, GangES saves 57% of the total error simulation time with an overhead of just 1.6%.

GangES takes a different, but complementary, approach to Relyzer to reduce error simulation time. It is based on the observation that a set or "gang" of error simulations that result in the same intermediate execution state (after their error injections) will produce the same error outcome. If such an intermediate state can be detected, only one simulation from such a gang needs to be completed; the others can be terminated early and reuse the outcome of the single completed simulation. This observation has the potential to significantly reduce overall error simulation time, but requires addressing two challenges: (i) when to compare the state of error simulations and (ii) what state to compare.

A judicious choice for when to compare (which execution points) is critical because the instruction sequences executed by multiple error simulations may temporarily diverge but merge again. A judicious choice of what to compare is also critical because naively comparing all register and memory locations can be prohibitively inefficient.

For when to compare, GangES selects program locations that multiple error simulations are likely to reach even if the error (temporarily) exercised different system events and branch directions. These locations are derived from the program structure tree (PST). A PST organizes an application's control flow graph (CFG) into nested single-entry single-exit (SESE) regions. If an execution exercises the entry edge of a SESE region, then it will also exercise the exit edge, as long as the dynamic control flow complies with the static CFG. Therefore, if an error is injected in a particular SESE region, the corresponding SESE exit edge will most likely be exercised. Such SESE exit points provide common program locations to compare execution states of error simulations. The SESE exit points also represent program locations where potentially few program variables are live, limiting the amount of state to compare. For what to compare, GangES compares all the (potentially) live registers at these points and any memory locations to which there have been stores before these points.
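The idea of grouping simulations by the state they hold at a common SESE exit can be sketched as follows. This is an illustrative simplification, not the GangES implementation: it assumes every simulation has already reached the same exit point, and the data layout (dictionaries of live registers and of stored memory locations) and the function names are hypothetical.

```python
from collections import defaultdict


def state_signature(live_registers, stored_memory):
    """Hashable summary of the state compared at a SESE exit.

    live_registers: dict mapping register name -> value
    stored_memory:  dict mapping address -> value for locations written so far
    """
    return (
        tuple(sorted(live_registers.items())),
        tuple(sorted(stored_memory.items())),
    )


def form_gangs(states_at_exit):
    """Group simulation ids whose compared state is identical."""
    gangs = defaultdict(list)
    for sim_id, (regs, mem) in states_at_exit.items():
        gangs[state_signature(regs, mem)].append(sim_id)
    return list(gangs.values())


if __name__ == "__main__":
    states = {
        "sim_a": ({"r1": 7, "r2": 0}, {0x1000: 42}),
        "sim_b": ({"r2": 0, "r1": 7}, {0x1000: 42}),  # same state as sim_a
        "sim_c": ({"r1": 7, "r2": 1}, {0x1000: 42}),  # differs in r2
    }
    for gang in form_gangs(states):
        print(gang[0], "runs to completion;", gang[1:], "can terminate early")
```

Only the first member of each gang needs to run to completion; the remaining members inherit its outcome.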

GangES can be used to determine error-outcome equivalence for arbitrary sets of error simulations. Here it is used in conjunction with Relyzer; i.e., its input is the error sites (pilots) that are not pruned by Relyzer. Without GangES, Relyzer would run error simulations for each of the input pilot instructions until the error was detected or the application completed. GangES instead checks for equivalence and terminates several of these simulations. Overall, after applying GangES, only 36% of the error simulations in its input required running the application to completion and checking the output to determine the fault outcome. Of the error simulations that were terminated early by this approach, 92% required an average of only 3,025 instructions to be executed before termination; the next 7% required about 16,000 instructions on average. Overall, GangES replaced Relyzer's error simulation time of 14,225 CPU hours (for analysis of 95% of all error sites) with a total time of 6,010 CPU hours, providing a wall-clock time savings of 57.7% for the evaluated workloads and error model.
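The reported savings follow directly from the two CPU-hour totals. The short check below reproduces the arithmetic; the function name is illustrative, and the two input values are the ones quoted in the text.

```python
def time_savings(baseline_hours, new_hours):
    """Fraction of the baseline time that is saved."""
    return (baseline_hours - new_hours) / baseline_hours


if __name__ == "__main__":
    savings = time_savings(14225, 6010)
    speedup = 14225 / 6010
    print(f"savings: {savings:.1%}, speedup: {speedup:.2f}x")
```

With these inputs the saving is a little under 58%, in line with the quoted figure, and corresponds to a speedup of a bit more than 2.3x.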

VII. HYBRID ANALYSIS TECHNIQUE

GangES presents a transient hardware error simulation technique that takes as input a set of errors to be simulated for an application and outputs the outcome for each error (masked, detected, or SDC). Each error in the error set specifies a dynamic instruction instance, a hardware resource used by that instruction instance, and the type of error to be injected in that hardware resource for that instruction instance. The experiments focus on single-bit flips in the integer architectural registers used by the specified instruction instance (one bit flip per simulation). For error detection, detectors similar to those used in Relyzer are employed. In the experiments, the input set of errors for GangES consists of errors that Relyzer is not able to prune (i.e., pilots of instruction equivalence classes as categorized by Relyzer). Relyzer must perform error injections for all of these errors; for those not detected, Relyzer must execute the application to the end and compare the output with that of the error-free execution to determine masked or SDC outcomes.

GangES aims to reduce the overall evaluation time for its input error set by terminating as many error simulations as possible soon after error injection and well before the end of the application (including those simulations that would eventually be masked or produce SDCs). The approach is to repeatedly compare execution states of a set or gang of multiple simulations in progress (hence the name Gang Error Simulator). Any simulations in the gang that reach identical states will produce the same outcomes, and all but one of them can be terminated. Figure 7 illustrates the differences between Relyzer and GangES.

A naive implementation would compare the entire system state (processor and memory state) at every cycle to identify the earliest point in the execution to terminate an error simulation. Since the entire system state may consist of megabytes to gigabytes of data, comparing it on every cycle can be prohibitively expensive in time. Moreover, such comparisons may not identify error equivalence effectively because even a single mismatch in temporarily divergent state (e.g., due to temporarily divergent control flow) or in dead values will flag non-equivalent error outcomes. Hence, the challenge in developing a time-effective simulation framework is in identifying what state to compare and when to compare it.

VIII. METHODOLOGY

An instruction-level transient error model is used. This model injects a transient error in the form of a single bit flip in a specified bit of a specified architectural integer register accessed by the specified dynamic instruction. Specifically, an error is identified by a tuple consisting of a dynamic instruction count in an execution, the program counter of the instruction that exercises the error, the integer register operand, and the bit location.
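The error tuple and the single-bit flip can be written down compactly. The sketch below is a hypothetical representation: the field names, the use of a register index, and the 64-bit register width are assumptions made for illustration.

```python
from typing import NamedTuple


class ErrorSite(NamedTuple):
    """One injectable error, following the four-part tuple described above."""
    dyn_instr_count: int  # dynamic instruction count in the execution
    pc: int               # program counter of the instruction exercising the error
    register: int         # index of the integer register operand
    bit: int              # bit position to flip


def flip_bit(value, bit, width=64):
    """Return `value` with one bit inverted, truncated to `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


if __name__ == "__main__":
    site = ErrorSite(dyn_instr_count=123_456, pc=0x4000, register=3, bit=1)
    registers = {3: 0b1010}
    registers[site.register] = flip_bit(registers[site.register], site.bit)
    print(bin(registers[3]))
```

Applying the same flip twice restores the original value, which is a convenient sanity check when validating an injection routine.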

As mentioned earlier, although GangES can accept any set of error sites as input, the focus is on the sites that Relyzer is not able to prune. Even these require substantial time to determine their outcome through error injection simulations (the full simulations are required to determine Relyzer's wall-clock time and the speedup provided by GangES). Therefore, the input to GangES is restricted to the minimal set of error sites that would provide 95% coverage; i.e., the outcomes of error injection simulations for this set enable Relyzer to determine the error outcomes for 95% of all error sites (at least 92% for each application).

The GangES implementation has two components: (1) static program structure identification and (2) a framework to perform dynamic error injections and state comparisons.

Static program structure identification: The static program analyses are implemented at the binary level, because the error injection infrastructure is developed for the SPARC V9 ISA. The analysis tool constructs a control flow graph from the binary and performs basic control flow analyses. Intra-procedural SESE region identification and PST generation algorithms are implemented in this infrastructure.

Framework performing dynamic error injections and state comparisons: Once the points at which to compare execution states are identified, the next steps are to (1) identify when to start an error simulation and how to group error simulations together for efficiency, (2) inject the error, and (3) collect and compare state at comparison points for early termination.

Identifying when to start an error simulation and how to group injections together: Several application checkpoints are taken (using Simics' checkpointing feature) at periodic points (after every 100 million instructions) in the application. This allows simulations to start from intermediate execution points (Figure 8) instead of from the beginning of the application for every simulation, saving running time. Error injection sites are grouped (ganged) such that each error site in a gang has the same checkpoint immediately preceding it; all simulations in a gang therefore start from that immediately preceding checkpoint.
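Grouping error sites by their preceding checkpoint amounts to bucketing dynamic instruction counts by the checkpoint interval. The sketch below is illustrative: the function names are hypothetical, error sites are reduced to their dynamic instruction counts, and a site that falls exactly on a checkpoint boundary is assumed to use that checkpoint.

```python
from collections import defaultdict

CHECKPOINT_INTERVAL = 100_000_000  # instructions between checkpoints, as in the text


def preceding_checkpoint(dyn_instr_count, interval=CHECKPOINT_INTERVAL):
    """Index of the last checkpoint taken at or before this instruction count."""
    return dyn_instr_count // interval


def gang_by_checkpoint(dyn_instr_counts, interval=CHECKPOINT_INTERVAL):
    """Map checkpoint index -> sorted list of error sites that start from it."""
    gangs = defaultdict(list)
    for count in sorted(dyn_instr_counts):
        gangs[preceding_checkpoint(count, interval)].append(count)
    return dict(gangs)


if __name__ == "__main__":
    sites = [150_000_000, 50, 199_999_999, 200_000_000]
    for checkpoint, gang in gang_by_checkpoint(sites).items():
        print(f"checkpoint {checkpoint}: {gang}")
```

Each gang then shares one restore from its checkpoint, so the cost of reaching the injection region is paid once per gang rather than once per error site.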

Starting a gang of error injections: The simulations for a gang start from an application checkpoint (➀ in Figure 8), and a Simics bookmark is created just before the first error injection point (➁ in Figure 8). A bookmark set at a particular point in an execution allows Simics to move the simulation backwards to that point from anywhere in the application, restoring the execution state at that point. This feature makes it possible to move backwards in an execution to start a different error injection run from a particular gang. At an error injection point, the error is injected directly into the architectural register, according to the error model.

Collecting and comparing execution states: After an error injection, simulation continues until a comparison point is reached, namely the exit point of the (current) SESE region that contains the instruction where the error was injected. The breakpoint is set at the program counter of the instruction that immediately follows the current SESE region's exit edge (➂ in Figure 8).

Evaluation metrics: To evaluate GangES, the first step is to determine the wall clock time needed to identify the outcomes of all its input error sites by performing full error injection simulations (on all input error sites). This is compared with the wall clock time that GangES needs to identify the errors that need full simulations plus the time needed to run such simulations to completion to obtain the outcomes. Error simulation time is measured from the same application checkpoint for both sets of wall clock times. The evaluation also reports the fraction of full simulations that were saved by showing them equivalent to others and the fraction of error injections that need full simulation after applying GangES. In cases where equivalence was observed, the number of instructions simulated until equalization is measured.

IX. RESULTS

A total of about 1.33 million application error sites identified by Relyzer were evaluated. The error simulation experiments for these sites (to determine the Relyzer wall clock baseline) required approximately 14,225 hours of CPU time. GangES seeks to reduce this time and the number of full simulations. Figure 9a shows the effectiveness of GangES. For each application, the left bar shows the total wall clock time (in CPU hours) to identify the outcomes of all the input error sites by performing full error injection simulations on each such site as the baseline. The right bar shows the wall clock time to run GangES to identify the errors that need full simulations (GangES overhead) and to run such full simulations to completion (need full).

Figure 9b shows the fraction of the total error simulations that were saved from full execution, that need full simulation, and that result in a detection (these would be terminated early regardless of GangES). On average, approximately 36% of the total error simulations were saved; i.e., they were shown equivalent to another execution, saving the simulation time of running them to completion and comparing their output to the error-free output. Overall, 39% of the total input error set required full simulation.

Figure 10(a) shows when the equalization was performed for the saved simulations (from Figure 9b). It shows that on average about 92% of saved simulations were terminated at the first SESE region exit from the point of error injection.

Approximately 7% and 1% of the saved simulations were equalized at the second and third SESE exits respectively. Figure 10(b) shows the average distance in the number of executed instructions from the point of error injection to the simulation state equalization (at a SESE exit). Specifically, it shows that the distance to first successful comparison (averaged across applications) is approximately 3,000 instructions (where 92% of saved simulations are equalized). Distance to successful comparisons varied significantly with applications because the comparison points are identified according to the PST, which is application-specific.

X. CONCLUSION

In the technology trend, SEU rates per SRAM cell have reversed a long-term decline and show an increase at the 40nm technology node. Data from the 28nm node is needed to confirm the trend. Microprocessor energy reduction techniques can negatively impact SEU rate, e.g., reducing Vdd to reduce energy consumption. The use of back-bias transistors to improve energy efficiency does not seem to significantly impact SEU rate. For transient faults, both striking-time and multi-cycle effects need to be considered in SER analysis to avoid underestimating SER. In particular, high-performance or low-power designs may cause a rapid increase in SER and thus need different treatment in the future.

Limitations and future directions

GangES was evaluated using an instruction-level transient error model. Prior work has performed resiliency evaluations using lower-level error models. Extending the concepts to such simulators is an interesting future direction. One of the main challenges in directly employing GangES on lower-level error simulators (by collecting and comparing states at the architecture level) is handling latent errors at comparison points, i.e., errors that are live at the gate or microarchitecture level but not visible at the architecture level.

XI. REFERENCES

Anand Dixit and Alan Wood, “The Impact of New Technology on Soft Error Rates,” Systems Group and Oracle Labs, Oracle Corporation, Santa Clara, CA, USA.

Ryan H.-M. Huang and Charles H.-P. Wen, “Advanced Soft-Error-Rate (SER) Estimation with Striking-Time and Multi-Cycle Effects,” Department of Electrical and Computer Engineering, National Chiao Tung University, Taiwan.

Jorge Tonfat, José Rodrigo Azambuja, Gabriel Nazar, Paolo Rech, Christopher Frost, Fernanda Lima Kastensmidt, Luigi Carro, Ricardo Reis, Juliano Benfica, Fabian Vargas, and Eduardo Bezerra, “Analyzing the Influence of Voltage Scaling for Soft Errors in SRAM-based FPGAs.”

Juan Andres Perez Celis, Saul de la Rosa Nieves, Carlos Romo Fuentes, Saul Daniel Santillan Gutierrez, and Alvar Saenz-Otero, “Methodology for designing highly reliable Fault Tolerance Space Systems based on COTS devices,” High Technology Center, School of Engineering, National Autonomous University of Mexico, Querétaro, México, and Space Systems Laboratory, Massachusetts Institute of Technology, Massachusetts, U.S.A.

Norbert Seifert, Balkaran Gill, Shah Jahinuzzaman, Joseph Basile, Vinod Ambrose, Quan Shi, Randy Allmon, and Arkady Bramnik, “Soft Error Susceptibilities of 22 nm Tri-Gate Devices.”

Siva Kumar Sastry Hari, Radha Venkatagiri, Sarita V. Adve, and Helia Naeimi, “GangES: Gang Error Simulation for Hardware Resiliency Evaluation,” NVIDIA, University of Illinois at Urbana-Champaign, and Intel Labs.
