
ERROR-RESILIENT SYSTEMS VIA STATISTICAL SIGNAL PROCESSING

Rami A. Abdallah

Visual and Parallel Computing Group, Intel Corporation

Hillsboro, OR 97124

Naresh R. Shanbhag

Coordinated Science Laboratory/ECE Department, University of Illinois at Urbana-Champaign

Urbana, IL 61801

ABSTRACT

This paper provides an overview of error-resilient techniques for designing robust and energy-efficient nanoscale DSP systems, focusing on statistical error compensation techniques. We show that logic-level error resiliency devises techniques that are independent of the application context, which results in significant complexity overhead, especially on highly unreliable circuit fabrics. System-level error resiliency, such as statistical error compensation, instead employs techniques from statistical signal processing to exploit the hardware error behavior at the application level and to engineer the error compensation mechanism to match the application requirements. The benefits of such a design philosophy are tremendous gains in robustness (> 1000×) and energy efficiency (3×-to-6×). In addition, the paper paves the way for the deployment of novel statistical error compensation techniques based on principles from pattern recognition and iterative/turbo detection.

Index Terms— Error resiliency, voltage overscaling, low power, statistical error compensation.

1. INTRODUCTION

Nanoscale systems suffer from increased variations in process, temperature, and voltage (PVT) [1]. These variations manifest themselves as increased delay variations, such that a worst-case design leads to high power consumption and a nominal-case design results in a loss of yield. Error resiliency is an attractive design approach which copes with PVT variations while maintaining performance and yield. Error-resilient designs are implemented at the nominal PVT corner, and the intermittent timing errors that occur whenever a critical path is excited are corrected via logical, architectural, or algorithmic techniques.

Logic-level error resiliency (LLER) dates back to Von Neumann [2] in 1956, where modular redundancy (MR) and N-wire logic signal representation were proposed to design reliable systems from unreliable components. This work led to several information-theoretic bounds for error resiliency, but without a clear construction or design approach to achieve these bounds [3][4]. Advanced forms of LLER, which are based on coding theory and on probability and statistical theory, have also emerged [5][6][7]. However, the complexity overhead of such techniques is large, rendering LLER impractical for energy-efficient system design.

LLER misses out on opportunities to save energy afforded by a system-level top-down view. System-level information, such as error locality, error susceptibility, criticality of processing elements, and the application performance metrics, reduces the overhead of resiliency. Microarchitecture-level error resiliency (MLER) techniques, such as RAZOR [8], error detection sequential (EDS) [9],

Fig. 1. Statistical error compensation.

elastic clocking [10], and non-uniform redundancy [11][12], exploit the error locality and susceptibility in order to decrease the correction overhead. Nevertheless, the component probability of error that can be tolerated in these systems while remaining energy efficient is relatively small (< 1%).

System-level error resiliency, such as statistical error compensation [13], exploits full system-level information in order to match the statistical nature of the application performance metrics to the statistical attributes of the underlying circuits fabric. Statistical error compensation utilizes the relaxed definition of "correctness" afforded by emerging applications to decrease the error correction overhead while tolerating a large component probability of error (60-70%). Fortunately, most emerging applications, such as recognition, mining and synthesis (RMS), media processing, and immersive computing, employ statistical system performance metrics such as bit error rate (BER), probability of detection, and signal-to-noise ratio (SNR), which expand the set of acceptable outputs and admit approximate error correction.

Statistical error compensation (see Fig. 1) views the nanoscale circuits fabric as a noisy communication network and incorporates statistical behavioral models of the underlying hardware to develop communications-inspired error-resilient techniques based on the well-established foundations of estimation and detection theory. The benefits of such a design approach are tremendous gains in robustness (> 1000×) and energy efficiency (3×-to-6×) at a high degree of circuit-fabric unreliability.

This paper provides an overview of error-resilient techniques at the logic, microarchitecture, and system levels.


Fig. 2. Logic-level error resiliency: (a) NMR, (b) cascaded NMR, and (c) multiplexing logic.

It demonstrates the effectiveness of system-level error resiliency, specifically statistical error compensation, and speculates on new forms of statistical error compensation techniques. The paper is organized as follows: Section 2 presents LLER techniques and discusses their limitations. Section 3 introduces MLER techniques and discusses the importance of system-level information in decreasing error correction overhead. Section 4 presents the statistical error compensation paradigm, describing existing techniques and demonstrating their robustness and energy efficiency. Finally, Section 5 introduces new forms of statistical error compensation based on turbo/iterative detection and pattern recognition.

2. LOGIC-LEVEL ERROR RESILIENCY (LLER)

Logic-level error resiliency techniques employ the ε-noisy gate error model, where each gate output is in error with probability ε. Redundancy is commonly used to correct errors. N-modular redundancy (NMR) (see Fig. 2(a)) replicates the logic gates or processing elements N times and selects an estimate ŷ of the correct output using a majority voter (N-Maj). Modular redundancy requires that the errors across the replicated elements be independent to avoid common-mode failures, which can be achieved by design diversity at the software or hardware level [14]. Variants of NMR include: 1) R-fold cascaded or recursive NMR [15] (see Fig. 2(b)), where logic replication by a factor of RN is employed, followed by log_2 R cascaded layers of N-Maj voters, and 2) reconfigurable NMR [16], where redundant processing elements are reconfigured based on a defect map to avoid errors. Cascaded NMR improves NMR reliability at the expense of more layers of redundancy, while reconfigurable NMR is effective in correcting manufacturing faults where a defect map is available but falls short in correcting transient errors. A Monte Carlo sketch of the basic voting mechanism follows.
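
As a concrete illustration (not from the paper), here is a minimal Python/NumPy sketch of NMR under the ε-noisy model, assuming each replica's one-bit output flips independently with probability eps; the values of eps and N are illustrative.

```python
import numpy as np

def nmr_error_rate(eps, N, trials=100_000, seed=0):
    """Monte Carlo estimate of the NMR output error rate when each of the
    N replicas independently flips its 1-bit output with probability eps."""
    rng = np.random.default_rng(seed)
    flips = rng.random((trials, N)) < eps       # True = replica output in error
    return np.mean(flips.sum(axis=1) > N // 2)  # voter errs on a wrong majority

for N in (1, 3, 5):
    print(f"N = {N}: p_e,sys ~ {nmr_error_rate(0.05, N):.4f}")
# For eps = 0.05, TMR (N = 3) yields roughly 3*eps^2 ~ 0.007 at 3x the area
# plus a voter; the benefit vanishes as eps approaches 0.5.
```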

Applying coding theory, logic redundancy can be utilized more efficiently for error detection and correction, at the expense of increased decoding complexity. Techniques such as algorithm-based fault tolerance [17], where checksum encoding is added to data processing, algebraic concurrent error detection [18], and, recently, probabilistic encoding [5], which embeds LDPC codes into logic, have been proposed for fault tolerance. Coded redundancy achieves a higher error detection and correction capability than plain redundancy, but it suffers from large encoding and decoding complexity overhead. For example, in rate-1/2 LDPC-coded resilient logic [5] with a large number of information bits (> 100) per codeword, the error correction overhead is 2×-to-6× the original system complexity, in addition to the need for at least 5 decoding iterations to reach a low output BER.

Advanced forms of non-redundant LLER designs have also emerged. Markov random field (MRF) design [19] exploits the probability distribution of the system's logic states and uses feedback to enforce the most probable state. MRF exhibits robustness to voltage noise and enables operation at a low supply voltage (0.15 V). However, it suffers from large overhead (4×-to-5×) and an increased critical path (8×), which leads to a power increase at 0.15 V when compared to a 1 V design.

Von Neumann [2] showed that reliable computation, i.e., a system/final-output probability of error p_e,sys below 0.5, is impossible if the component probability of error p_e = ε ≥ 0.167. Later, Pippenger [20] and Feder [21] generalized the bound to $p_e \ge \frac{1}{2} - \frac{1}{2k}$ when using k-input gates, and Evans [4] tightened it to $p_e \ge \frac{1}{2} - \frac{1}{2\sqrt{k}}$. On the other hand, Hajek [3] showed that reliable computation can be achieved using 3-input erroneous gates if p_e ≤ 0.167, though without a clear construction. For p_e < 0.0107, Von Neumann presented the multiplexing logic construction to achieve p_e,sys < p_e. In multiplexing logic (see Fig. 2(c)), the gates are replicated N times and each Boolean variable/bit b is represented by a bundle of N wires such that the ratio of "1"-to-"0" wires in the bundle represents p(b = 1). Intermediate "restoration stages", consisting of permutations followed by 3-Maj voters or NAND gates, are also needed to perform intermediate soft voting and prevent logic saturation. Multiplexing achieves reliable computation only for p_e < 0.0107 and requires very large overhead to achieve p_e,sys < p_e; for example, the replication factor N must exceed 1724 to achieve reliable computation when p_e = 0.005. The bundle encoding and restoration mechanics are sketched below.
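
As a hedged sketch of the bundle mechanics, the following Python/NumPy fragment applies one ε-noisy NAND stage to two N-wire bundles and one restoration stage (random permutation plus noisy 3-input majority voters); the bundle width, gate error rate, and single-stage depth are simplifying assumptions.

```python
import numpy as np
rng = np.random.default_rng(1)

def noisy_nand(a, b, eps):
    """eps-noisy NAND applied wire-by-wire to two N-wire bundles."""
    out = ~(a & b)
    return out ^ (rng.random(a.shape) < eps)    # each gate flips w.p. eps

def restore(bundle, eps):
    """Restoration stage: random permutation + noisy 3-input majority,
    pulling the fraction of '1' wires back toward 0 or 1."""
    w = rng.permutation(bundle).reshape(-1, 3)  # N must be divisible by 3
    maj = w.sum(axis=1) >= 2
    return np.repeat(maj ^ (rng.random(maj.shape) < eps), 3)

N, eps = 3 * 1000, 0.005                        # bundle width, gate error rate
one, zero = np.ones(N, dtype=bool), np.zeros(N, dtype=bool)
y = restore(noisy_nand(one, zero, eps), eps)    # NAND(1, 0) should be '1'
print("fraction of '1' wires:", y.mean())       # stays near 1 for small eps
```

Cascading many such noisy stages is where the p_e < 0.0107 threshold bites; this single-stage sketch only illustrates the encoding and restoration mechanics.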

Multiplexing was generalized by [17] and [22] into what is referred to as interwoven logic, where the N-bundle redundant inputs and gates are interconnected to exploit the error-masking property of logic gates. In such constructions, the intermediate soft voting is performed implicitly, without restoration stages. Recently, based on the Von Neumann bundle representation, stochastic logic [7] was introduced, which enhances robustness to input errors while assuming the logic itself is error-free, i.e., deterministic logic operating on stochastic signals.

These LLER constructions do not address energy efficiency and require a large complexity overhead to achieve p_e,sys < p_e. In fact, LLER ignores the application context and the type of error by assuming that all logic gates fail uniformly and independently.

3. ERROR RESILIENT MICROARCHITECTURE

Logic-level resiliency techniques ignore the behavior of hardware errors, which results in a large complexity overhead. Microarchitecture-level error resiliency (MLER) techniques achieve lower overhead by exploiting:

1. Error locality: errors are localized at latched outputs and can be masked by combinational logic.

2. Error susceptibility: different components have different error rates, and errors are usually input dependent.

3. Criticality of processing elements: certain logic blocks can affect the final output severely, while others can tolerate some errors.

RAZOR [8] and EDS [9] employ shadow latches to detect timing errors, followed by pipeline stalling or flushing to correct these errors. Other techniques [23][10] employ special latches to stretch the clock period whenever a critical path is excited. Similarly, non-uniform redundancy utilizes the error locality and the non-uniform error susceptibility to decrease replication overhead. For example, in [11], the gates that show higher vulnerability to errors and are spatially closer to latches are replicated.


This achieved within 80% of the error-correction capability of uniform NMR while saving 40-50% of the replication overhead. In [12], an exhaustive search across different non-uniform NMR configurations was applied and the configuration with the best reliability was selected; this achieved reliability close to uniform NMR while saving 80-90% of the replication overhead. In [24], the authors exploited the asymmetric resiliency of many-integrated-core systems to avoid replication and improve detection.

Although MLER techniques reduce the error detection and correction overhead, the component probability of error p_e that they can tolerate at low overhead is relatively small (for example, p_e = 0.001 in the case of RAZOR [8]). This limits their energy efficiency and robustness. MLER does not exploit explicit application-level information, such as the algorithmic criticality of the computing modules and the application performance metrics. In fact, MLER and LLER adopt performance metrics that require strict output correctness, such as the output error rate and the number of errors per word. This limits their performance in an application context. For example, decreasing the coding rate in coded logic improves the error detection and correction capability at the expense of longer codewords and higher decoding complexity. This results in an increased susceptibility to errors, and thus lower performance when related to application-level metrics such as the program success ratio and completion time [25].

4. STATISTICAL ERROR COMPENSATION

In addition to error locality, error susceptibility, and the criticality of the processing elements, the application performance metric and its tolerance limit offer further degrees of freedom that error-resilient techniques can utilize. As most emerging applications adopt statistical performance metrics, they can tolerate small output errors (estimation errors). Statistical error compensation views the unreliable circuits fabric as a noisy communication channel and devises communications-inspired techniques to compensate for output errors such that the residual error after correction is within the application tolerance limit. This leads to significant energy and robustness benefits. At the center of such a design philosophy is abstracting the circuit-level behavior of errors in the form of error statistics.

The hardware error-free output, y_o, of a DSP kernel M is expressed as:

$$y_o = y_s + n_s \qquad (1)$$

where y_s is the desired signal component in y_o, and n_s is the undesired algorithm/application-level noise in the absence of any hardware error. Most application metrics can be related to the SNR of y_o (SNR_o), given by:

$$\mathrm{SNR}_o = 10\log_{10}\frac{E\{y_s^2\}}{E\{n_s^2\}} \qquad (2)$$

where y_s and n_s are assumed, without loss of generality, to have zero mean. When a nominal-case design is subjected to a worst-case process corner (better-than-worst-case (BTWC) design), or the supply voltage V_dd is scaled below the critical voltage V_dd,crit needed for error-free operation (voltage overscaling (VOS)) in order to save energy, the latched output y suffers from an intermittent timing/hardware-related error η. The erroneous output can be written as:

$$y = y_o + \eta = y_s + n_s + \eta \qquad (3)$$

Assuming η is zero mean and independent of n_s, the SNR of y

Fig. 3. Computational error model: (a) additive error model, (b) sample error statistics, and (c) measured error PMF P_η(η) of a voltage-overscaled 20-bit output filter IC in 45-nm CMOS.

(SNR_y) can be expressed in terms of SNR_o in (2) as:

$$\mathrm{SNR}_y = 10\log_{10}\frac{E\{y_s^2\}}{E\{n_s^2\} + E\{\eta^2\}} = \mathrm{SNR}_o - 10\log_{10}\left(1 + \frac{E\{\eta^2\}}{E\{n_s^2\}}\right) \qquad (4)$$

Statistical error compensation minimizes the SNR loss $|\mathrm{SNR}_o - \mathrm{SNR}_y|$ by minimizing the ratio of $E\{\eta^2\}$ to $E\{n_s^2\}$. This involves minimizing not only the system error rate $p_{e,sys} = p(\eta \neq 0)$ but also the output error magnitude |η| and p(η) for all η. A quick numeric check of (4) is given below.
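
A numerical check of (4) in Python/NumPy, with illustrative assumptions: zero-mean Gaussian y_s and n_s, and a sparse, large-magnitude η independent of n_s.

```python
import numpy as np
rng = np.random.default_rng(2)

n   = 1_000_000
y_s = rng.normal(0.0, 1.0, n)                  # desired signal, E{y_s^2} = 1
n_s = rng.normal(0.0, 0.1, n)                  # application noise, E{n_s^2} = 0.01
eta = rng.normal(0.0, 2.0, n) * (rng.random(n) < 0.01)   # rare, large errors

snr_o = 10 * np.log10(np.mean(y_s**2) / np.mean(n_s**2))
snr_y = 10 * np.log10(np.mean(y_s**2) / np.mean((n_s + eta)**2))
loss  = 10 * np.log10(1 + np.mean(eta**2) / np.mean(n_s**2))
print(snr_o, snr_y, snr_o - loss)              # snr_y matches snr_o - loss
```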

Statistical error compensation minimizes the output error magnitude and its probability by providing estimates y_i of y_o corrupted by (small) estimation errors ε_i. Thus, an observed output y of a computing kernel M (see Fig. 3(a)) under statistical error compensation is given by:

$$y = y_o + \eta + \varepsilon = y_o + e \qquad (5)$$

where y is a B_y-bit observed output, η and ε are the hardware and estimation errors, respectively, and e = η + ε is the composite error. The set of all possible outputs of M is referred to as the output space 𝒴, i.e., y, y_o, e ∈ 𝒴. Generally, the hardware errors η arising from timing violations are large in magnitude because the arithmetic operations in DSP kernels are least-significant-bit (LSB) first. This is reflected in the sample probability mass function (PMF) P_η(η) in Fig. 3(b). On the other hand, the estimation error ε is small in magnitude (see P_ε(ε) in Fig. 3(b)) because it can arise as a difference between y_o and an internal signal in block M, or perhaps the output of another lower-complexity block M_E that tries to approximate/estimate the output of M, i.e., an estimator. The topic of error characterization of a computational block is an interesting one in its own right and can be accomplished in many ways, both off-line and via in situ calibration using typical inputs. Figure 3(c) shows a measured error PMF P_η(η) obtained from a voltage-overscaled IC in a 45-nm CMOS process. A crude characterization sketch follows.
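
As an illustration of off-line characterization, the sketch below builds an empirical PMF P_η(η) from a deliberately crude behavioral model of a voltage-overscaled LSB-first adder; the wordlength, violation rate, and failure mechanism (upper output bits reading zero on a timing violation) are all assumptions, not circuit-accurate behavior.

```python
import numpy as np
rng = np.random.default_rng(3)

B = 20                                           # output wordlength (illustrative)

def vos_add(a, b, k, p_viol):
    """Behavioral VOS adder: with probability p_viol a timing violation
    occurs and only the k least-significant output bits settle."""
    exact = a + b
    viol = rng.random(a.shape) < p_viol
    return np.where(viol, exact & ((1 << k) - 1), exact)

a = rng.integers(0, 1 << (B - 1), 100_000)
b = rng.integers(0, 1 << (B - 1), 100_000)
eta = vos_add(a, b, k=B - 2, p_viol=0.05) - (a + b)   # hardware-error samples
for v, c in zip(*np.unique(eta, return_counts=True)):
    print(f"P_eta({v}) = {c / eta.size:.4f}")    # mass at 0 plus a few large values
```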


Fig. 4. Statistical error compensation techniques: (a) algorithmic noise-tolerance, (b) stochastic sensor network-on-chip, (c) soft-NMR, and (d) likelihood processing.

Statistical error compensation utilizes the application-level computation flow and the error statistics to:

1. Provide a low-overhead observation vector of y_o: Y = {y_i}, i = 1, ..., N, where y_i = y_o + η_i + ε_i = y_o + e_i, with y_o, y_i, η_i, ε_i ∈ 𝒴.

2. Employ a decision rule R that uses the error statistics to select an output estimate ŷ ∈ 𝒴: ŷ = f(Y, P_η, P_ε).

Generally, intermediate signals in the main block M, as well as dedicated estimator blocks M_E,i, are used to form Y. The decision block (which implements the decision rule) usually constitutes less than 5% of the main block M's complexity and is designed to be free of timing errors at all process corners and reduced voltages. Next, we introduce the various statistical error compensation techniques and demonstrate their benefits.

4.1. Statistical Error Compensation Techniques

Algorithmic noise-tolerance (ANT) [26] employs an estimator block M_E (see Fig. 4(a)), a low-complexity version of the M-block, to generate an estimate y_2 of the error-free output y_o. The estimator block M_E is designed to be free of hardware errors, i.e., η_2 = 0 and thus e_2 = ε_2. ANT relies on the difference between the statistics of the M-block's hardware error η_1 and the estimation error ε_2, which can be made small (see P_ε(ε) in Fig. 3(b)), to detect and correct for η. ANT can be described as:

1. Y_ANT = (y_1 = y_o + η_1, y_2 = y_o + ε_2), where y_1, y_2, η_1, ε_2 ∈ 𝒴.

2. R_ANT:
$$\hat{y} = \begin{cases} y_1 & \text{if } |y_1 - y_2| < Th \\ y_2 & \text{otherwise} \end{cases}$$
where Th is a user-defined threshold chosen to maximize the application-level performance metric.

ANT has shown up to 3× energy savings while tolerating a high component probability of error (p_e = 10-57%), both in theory and in practice via prototype ICs of filters [27] and biomedical processors [28].
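
A minimal Python/NumPy sketch of R_ANT; the error statistics (a 20% rate of magnitude-8 hardware errors, small Gaussian estimation error) and the threshold Th are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(4)

n   = 100_000
y_o = rng.normal(0, 1, n)                                   # error-free output
eta = 8.0 * rng.choice([-1, 1], n) * (rng.random(n) < 0.2)  # large, 20% rate
eps = rng.normal(0, 0.1, n)                                 # small estimation error

y1 = y_o + eta                                   # main (voltage-overscaled) block
y2 = y_o + eps                                   # low-complexity estimator
Th = 1.0                                         # illustrative threshold
y  = np.where(np.abs(y1 - y2) < Th, y1, y2)      # R_ANT: detect and replace

mse = lambda v: float(np.mean((v - y_o) ** 2))
print(mse(y1), mse(y2), mse(y))                  # corrected output beats both
```

On error-free cycles ANT keeps the exact y_1, which is why the corrected output also beats the estimator alone.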

Stochastic sensor network-on-chip (SSNOC) [29] (see Fig. 4(b)) decomposes the M-block into a set of statistically similar estimator blocks M_E,i and employs robust estimation techniques for error compensation, such as the median or the Huber estimator, which penalizes outliers [30]. SSNOC can be described as:

1. Y_SSNOC = (y_1, y_2, ..., y_N), where y_i = y_o + e_i and e_i = η_i + ε_i, with y_o, y_i, e_i ∈ 𝒴.

2. R_SSNOC: ŷ = f_robust(Y_SSNOC), e.g., f_robust(Y_SSNOC) = median(Y_SSNOC) or f_robust(Y_SSNOC) = Huber(Y_SSNOC).

SSNOC was applied to a CDMA PN-code acquisition system and has shown, both in theory [29] and via IC measurements [31], an 800× improvement in detection probability along with 40% power savings.
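
A sketch of SSNOC-style robust fusion using the median as f_robust (the Huber estimator would be a drop-in alternative); the per-estimator error statistics are assumed, not taken from [29].

```python
import numpy as np
rng = np.random.default_rng(5)

n, N = 100_000, 5                                # samples, estimator blocks
y_o  = rng.normal(0, 1, n)
# Each estimator observes y_o plus small eps_i and occasional large eta_i.
Y = (y_o[:, None]
     + rng.normal(0, 0.1, (n, N))
     + 8.0 * (rng.random((n, N)) < 0.2) * rng.choice([-1, 1], (n, N)))

y_mean   = Y.mean(axis=1)                        # non-robust fusion
y_median = np.median(Y, axis=1)                  # robust fusion (R_SSNOC)
mse = lambda v: float(np.mean((v - y_o) ** 2))
print(mse(y_mean), mse(y_median))                # the median rejects outliers
```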

Soft-NMR [32] (see Fig. 4(c)) is another statistical error compensation technique, which employs the same observation vector as NMR. Unlike NMR, however, soft-NMR exploits the hardware error PMF P_η(η) to implement a decision rule R based on the maximum-likelihood (ML) principle, enhancing system robustness. Soft-NMR is described as:

1. Y_SNMR = (y_1, y_2, ..., y_N), where y_i = y_o + η_i, with y_o, y_i, e_i = η_i ∈ 𝒴.

2. R_SNMR:
$$\hat{y} = \arg\max_{y_o} P(\mathbf{Y}_{SNMR} \mid y_o) = \arg\max_{y_o} P_\eta(\boldsymbol{\eta} \mid y_o), \quad \boldsymbol{\eta} = (\eta_1, \eta_2, \ldots, \eta_N)$$

Soft-NMR [32] was applied to a discrete cosine transform codec, where it showed 12-40% power savings and the ability to handle a high degree of unreliability (p_e > 30%).
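
A toy Python/NumPy sketch of the soft-NMR ML rule over a small discrete output space, assuming a known (here invented) error PMF P_η that acts independently on each replica.

```python
import numpy as np
rng = np.random.default_rng(6)

space   = np.arange(16)                  # 4-bit output space (illustrative)
err_val = [0, 8, -8]                     # assumed eta support: none or an MSB flip
err_pmf = [0.6, 0.2, 0.2]                # assumed P_eta
pmf = dict(zip(err_val, err_pmf))

def soft_nmr(ys):
    """R_SNMR: choose the y_o maximizing prod_i P_eta(y_i - y_o)."""
    scores = [np.prod([pmf.get(int(y) - int(c), 1e-12) for y in ys])
              for c in space]
    return int(space[int(np.argmax(scores))])

y_o = 5
ys = y_o + rng.choice(err_val, size=3, p=err_pmf)   # three noisy replicas
print(ys, "->", soft_nmr(ys))
# When the two erroneous replicas take distinct error values (+8 and -8),
# a majority voter has no majority, yet the ML rule still recovers y_o
# because P_eta concentrates on a few large magnitudes.
```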

Likelihood processing (LP) [33] (see Fig. 4(d)) consists of a computational block M_LP generating an N-element observation vector Y_LP, a likelihood generator (LG), and a slicer. The block M_LP can be designed via one or more of the following techniques: 1) replication, 2) estimation, and 3) utilization of inherent signal correlations in M. In the latter case, intermediate signals from M are employed to generate Y_LP, thereby avoiding any hardware replication. For example, adjacent pixels in image/video processing applications have correlated values, thereby providing multiple observations at low overhead. The LG-block in Fig. 4(d) employs the composite error PMF P_E(e = η + ε) to compute the a posteriori probability (APP) ratio λ_j = P(b_j = 1 | Y_LP)/P(b_j = 0 | Y_LP) for each output bit b_j (j = 0, ..., B_y − 1) of the B_y-bit output y_o. This soft information provides a measure of confidence/reliability for each output bit: our confidence in bit b_j being a '1' increases with the numerical value of λ_j for λ_j > 1, and vice versa for b_j being a '0'. The slicer in Fig. 4(d) thresholds λ_j to obtain a hard estimate b̂_j. The LP technique can be described as:

1. Y_LP = (y_1, y_2, ..., y_N), where y_i = y_o + η_i + ε_i = y_o + e_i, with y_o, y_i, e_i ∈ 𝒴.

2. R_LP:
$$\hat{b}_j = \begin{cases} 1 & \text{if } \lambda_j \ge 1 \\ 0 & \text{otherwise} \end{cases} \qquad \lambda_j = \frac{P(b_j = 1 \mid \mathbf{Y}_{LP})}{P(b_j = 0 \mid \mathbf{Y}_{LP})}, \quad j = 0, 1, \ldots, B_y - 1$$

LP was shown to tolerate up to 100× higher component probability of error than conventional systems and achieved 71% energy savings compared to NMR [33].
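
A toy sketch of likelihood processing: a posterior over y_o computed from an assumed composite-error PMF P_E, per-bit APP ratios λ_j, and a slicer; the 4-bit output space and the PMF are illustrative.

```python
import numpy as np
rng = np.random.default_rng(7)

B = 4
space = np.arange(1 << B)                # all B-bit outputs
e_val = [0, 1, -1, 8]                    # assumed composite-error support
e_pmf = [0.55, 0.15, 0.15, 0.15]         # assumed P_E
pmf = dict(zip(e_val, e_pmf))

def lp_decode(ys):
    """Posterior over y_o (uniform prior), per-bit APP ratios lambda_j,
    then the slicer R_LP: b_j = 1 iff lambda_j >= 1."""
    post = np.array([np.prod([pmf.get(int(y) - int(c), 1e-12) for y in ys])
                     for c in space])
    post /= post.sum()
    bits = []
    for j in range(B):                   # bits LSB first: b_0, ..., b_{B-1}
        p1 = post[((space >> j) & 1) == 1].sum()   # P(b_j = 1 | Y_LP)
        bits.append(1 if p1 >= 1.0 - p1 else 0)    # lambda_j >= 1
    return bits

y_o = 0b0101
ys = y_o + rng.choice(e_val, size=3, p=e_pmf)
print(ys.tolist(), "->", lp_decode(ys))  # typically recovers the bits of 0b0101
```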

4.2. Case Study: A Voltage-Overscaled Discrete Cosine Transform (DCT) Codec

We compare the performance of the statistical error compensation techniques soft-NMR and LP to conventional NMR in the design of a DCT codec in a commercial 45-nm CMOS process. Two architectural setups are employed: one involves modular redundancy, and the other is based on signal correlation, where adjacent pixels are used as estimators to avoid redundancy.


Fig. 5. System robustness of the 2D DCT-IDCT codec under: (a) redundancy, (b) signal correlation [33].

The peak signal-to-noise ratio (PSNR) of the DCT codec under modular replication is shown in Fig. 5(a) for different component probabilities of error p_e induced by voltage overscaling. Figure 5(a) shows that LP3R, i.e., LP with triple-MR, can tolerate 70× and 5× higher p_e than the conventional (single-codec) design and TMR, respectively, at a PSNR of 30 dB. Soft-NMR shows robustness and performance similar to LP. Interestingly, as seen in Fig. 5(a), LP2R, i.e., LP with dual-MR (DMR), behaves close to, or even better than, TMR when p_e ≥ 0.05; that is, LP with a replication factor of two outperforms TMR. This is unlike conventional DMR, which can only detect errors but not correct them.

The PSNR of the DCT codec under the spatial-correlation setup, where two adjacent pixels are used as estimators in the case of LP, is shown in Fig. 5(b). LP under spatial correlation (LP3C) achieves a 14× increase in robustness at a PSNR of 30 dB compared to the conventional system. This level of robustness is similar to that achieved by TMR; however, TMR employs two more DCT codecs than LP3C.

5. FUTURE WORK

Viewing the underlying hardware as a noisy communication channel enables the application of other advanced principles from pattern recognition, signal processing, and coding theory, thus opening up a number of interesting problems.

Fig. 6. Iterative likelihood processing.


The decision rule in ANT is in fact an elegant linear binary classifier. Such a classifier partitions the output space into two classes, according to whether a hardware error occurred, by comparing against a threshold (a linear boundary). Using principles from linear classification, it can be shown systematically how such a classification rule can be derived from training data. This analogy opens up a number of interesting directions from pattern recognition, where advanced (non-linear, Bayesian) classifiers and pre-classification transformations can be utilized to improve classification or decrease classifier complexity. An example of a pre-classification transformation is linear discriminant analysis, which ensures maximal linear class separability. Case studies employing training data along with classification to detect hardware faults have appeared recently in [34].
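
As a hedged sketch of this classifier view (a speculation here, not an implementation from the paper), the ANT threshold Th can be learned from labeled training data as a one-dimensional decision stump on |y_1 − y_2|; the data model is assumed.

```python
import numpy as np
rng = np.random.default_rng(8)

# Labeled training data: feature |y1 - y2|, label "hardware error occurred".
n   = 50_000
y_o = rng.normal(0, 1, n)
err = rng.random(n) < 0.2                        # assumed error occurrences
y1  = y_o + 8.0 * rng.choice([-1, 1], n) * err   # main block, large eta
y2  = y_o + rng.normal(0, 0.1, n)                # estimator, small eps
x   = np.abs(y1 - y2)

# 1-D linear classifier: sweep candidate thresholds and keep the one
# with the lowest training misclassification rate (a decision stump).
cands = np.quantile(x, np.linspace(0.01, 0.99, 99))
errs  = [np.mean((x > t) != err) for t in cands]
Th    = float(cands[int(np.argmin(errs))])
print("learned Th:", Th, " training error:", min(errs))
```

Swapping the stump for a Bayesian or non-linear classifier, or preceding it with a discriminant transformation, corresponds to the generalizations suggested above.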

The estimation process under the ANT framework can also be generalized. Under conventional ANT, the estimate of the correct output is either the main-block output or the estimator output. More advanced forms of classification-based ANT involve providing advanced (e.g., minimum mean-square error) estimates for each class, depending on the error statistics. Last but not least, this generalization can involve multiple estimators and multiple classification classes.

The soft bit-level information λ_j provided by LP can be exploited further in an iterative setup, similar to turbo detection, in order to improve system robustness and energy efficiency. As shown in Fig. 6, multiple (spatial/temporal) observation vectors can be provided along with multiple likelihood generators. These generators can then exchange their estimates of the correct output bits in the form of prior information to improve the final decision. Such an iterative system needs to engineer the error statistics in the observation vectors so that early detection saturation is avoided.

Statistical error compensation has thus far been applied to logic datapaths while assuming error-free memory. Being data-intensive, emerging applications rely heavily on memory storage. Statistical error compensation principles can be similarly utilized to decrease the energy consumption and improve the reliability of memory systems.

ACKNOWLEDGMENTS

This work was supported in part by the Systems on Nanoscale Information fabriCs (SONIC) center, one of six centers supported by the STARnet phase of the Focus Center Research Program (FCRP), a Semiconductor Research Corporation program sponsored by MARCO and DARPA.


6. REFERENCES

[1] S. Borkar, "Tackling variability and reliability challenges," IEEE Design & Test of Computers, vol. 23, no. 6, p. 520, 2006.

[2] J. von Neumann, "Probabilistic logics and the synthesis of reliable organisms from unreliable components," in Automata Studies. Princeton, NJ: Princeton Univ. Press, 1956.

[3] B. Hajek and T. Weller, "On the maximum tolerable noise for reliable computation by formulas," IEEE Trans. on Inf. Theory, vol. 37, no. 2, pp. 388-391, 1991.

[4] W. Evans and L. Schulman, "Signal propagation with application to a lower bound on the depth of noisy formulas," in Proc. Foundations of Computer Science Symp., 1993, pp. 594-603.

[5] C. Winstead and S. Howard, "A probabilistic LDPC-coded fault compensation technique for reliable nanoscale computing," IEEE Trans. Circuits Syst., vol. 56, no. 6, pp. 484-488, Jun. 2009.

[6] K. Nepal, R. I. Bahar, J. Mundy, W. R. Patterson, and A. Zaslavsky, "Designing nanoscale logic circuits based on Markov random fields," J. Electron. Test., vol. 23, no. 2-3, pp. 255-266, Jun. 2007.

[7] W. Qian, M. D. Riedel, K. Bazargan, and D. J. Lilja, "The synthesis of combinational logic to generate probabilities," in Proc. of Int. Conf. on Computer-Aided Design, 2009, pp. 367-374.

[8] S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, "A self-tuning DVS processor using delay-error detection and correction," IEEE Journal of Solid-State Circuits, vol. 41, no. 4, pp. 792-804, 2006.

[9] K. A. Bowman, J. W. Tschanz, N.-S. Kim, J. C. Lee, C. B. Wilkerson, S.-L. L. Lu, T. Karnik, and V. K. De, "Energy-efficient and metastability-immune resilient circuits for dynamic variation tolerance," IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 49-63, 2009.

[10] K. Chae, S. Mukhopadhyay, C.-H. Lee, and J. Laskar, "A dynamic timing control technique utilizing time borrowing and clock stretching," in IEEE Custom Integrated Circuits Conf., 2010, pp. 1-4.

[11] K. Mohanram and N. A. Touba, "Partial error masking to reduce soft error failure rate in logic circuits," in Proc. of IEEE Int. Symp. on Defect and Fault Tolerance in VLSI Systems, 2003, pp. 433-440.

[12] K. Woo and M. Guthaus, "Fault-tolerant synthesis using non-uniform redundancy," in IEEE Int. Conf. on Computer Design, 2009, pp. 213-218.

[13] N. R. Shanbhag, R. A. Abdallah, R. Kumar, and D. L. Jones, "Stochastic computation," in Proc. of the 47th Design Automation Conf. (DAC '10), 2010, pp. 859-864.

[14] A. Avizienis and J. P. J. Kelly, "Fault tolerance by design diversity: Concepts and experiments," IEEE Computer, vol. 17, no. 8, pp. 67-80, Aug. 1984.

[15] M. Stanisavljevic, A. Schmid, and Y. Leblebici, "Optimization of nanoelectronic systems reliability under massive defect density using distributed R-fold modular redundancy (DRMR)," in IEEE Int. Symp. on Defect and Fault Tolerance in VLSI, 2009, pp. 340-348.

[16] C. He, M. Jacome, and G. De Veciana, "A reconfiguration-based defect-tolerant design paradigm for nanotechnologies," IEEE Design & Test of Computers, vol. 22, no. 4, pp. 316-326, 2005.

[17] K.-H. Huang and J. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Trans. on Computers, vol. C-33, no. 6, pp. 518-528, 1984.

[18] C. Hadjicostis, Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems. Boston: Kluwer Academic Publishers, 2001.

[19] R. I. Bahar, J. Mundy, and J. Chen, "A probabilistic-based design methodology for nanoscale computation," in Proc. Int. Conf. on CAD, Washington, DC, USA, 2003, pp. 480-486.

[20] N. Pippenger, "Reliable computation by formulas in the presence of noise," IEEE Trans. on Inf. Theory, vol. 34, no. 2, pp. 194-197, 1988.

[21] T. Feder, "Reliable computation by networks in the presence of noise," IEEE Trans. on Inf. Theory, vol. 35, no. 3, pp. 569-571, 1989.

[22] W. Pierce, Failure-Tolerant Computer Design. Academic Press, 1965.

[23] R. Samanta, G. Venkataraman, N. Shah, and J. Hu, "Elastic timing scheme for energy-efficient and robust performance," in Int. Symp. on Quality Electronic Design, 2008, pp. 537-542.

[24] H. Cho, L. Leem, and S. Mitra, "ERSA: Error resilient system architecture for probabilistic applications," IEEE Trans. on CAD, vol. 31, no. 4, pp. 546-558, 2012.

[25] A. A. Al-Yamani, N. Oh, and E. J. McCluskey, "Algorithm-based fault tolerance: A performance perspective based on error rate," in Proc. Int. Symp. on Dependable Systems and Networks, 2001.

[26] R. Hegde and N. R. Shanbhag, "Soft digital signal processing," IEEE Trans. Very Large Scale Integr. Syst., vol. 9, no. 6, pp. 813-823, Dec. 2001.

[27] R. Hegde and N. R. Shanbhag, "A voltage overscaled low-power digital filter IC," IEEE Journal of Solid-State Circuits, vol. 39, no. 2, pp. 388-391, 2004.

[28] R. A. Abdallah and N. R. Shanbhag, "A 14.5 fJ/cycle/k-gate, 0.33 V ECG processor in 45-nm CMOS using statistical error compensation," in Proc. of IEEE Custom Integrated Circuits Conf. (CICC), 2012, pp. 1-4.

[29] G. V. Varatkar, S. Narayanan, N. R. Shanbhag, and D. L. Jones, "Stochastic networked computation," IEEE Trans. on VLSI Systems, vol. 18, no. 10, pp. 1421-1432, 2010.

[30] P. J. Huber, Robust Statistics. Wiley, 1981.

[31] E. P. Kim, D. J. Baker, S. Narayanan, D. L. Jones, and N. R. Shanbhag, "Low power and error resilient PN code acquisition filter via statistical error compensation," in IEEE Custom Integrated Circuits Conf. (CICC), 2011, pp. 1-4.

[32] E. P. Kim and N. R. Shanbhag, "Soft N-modular redundancy," IEEE Trans. Comput., vol. 61, no. 3, pp. 323-336, Mar. 2012.

[33] R. Abdallah and N. Shanbhag, "Robust and energy efficient multimedia systems via likelihood processing," IEEE Trans. on Multimedia, vol. 15, no. 2, pp. 257-267, 2013.

[34] N. Verma et al., "Enabling system-level platform resilience through embedded data-driven inference capabilities in electronic devices," in Int. Conf. on Acoustics, Speech and Signal Processing, 2012, pp. 5285-5288.
