IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 2, FEBRUARY 2013

Robust and Energy Efficient Multimedia Systems via Likelihood Processing

Rami A. Abdallah, Member, IEEE, and Naresh R. Shanbhag, Fellow, IEEE

Abstract—This paper presents likelihood processing (LP) for designing robust and energy-efficient multimedia systems in the presence of nanoscale non-idealities. LP exploits error statistics of the underlying hardware to compute the probability of a particular bit being a one or a zero. Multiple output observations are generated via either: 1) modular redundancy (MR), 2) estimation, or 3) exploiting spatio-temporal correlation. Energy efficiency and robustness of a 2D discrete-cosine transform (DCT) image codec employing LP is studied. Simulations in a commercial 45-nm CMOS process show that LP can tolerate greater component error probability than conventional and triple-MR (TMR)-based systems while achieving a peak signal-to-noise ratio (PSNR) of 30 dB at a pre-correction error rate of 20%. Furthermore, LP is able to achieve energy savings of 71% over TMR at a PSNR of 28 dB, while tolerating a pre-correction error rate of 4%.

Index Terms—Error resiliency, low-power design, media processing, robust design, stochastic computation, voltage overscaling.

I. INTRODUCTION

THE circuit fabric in sub-45nm process technologies is becoming increasingly unreliable due to process, voltage, and temperature (PVT) variations, leakage, soft errors, supply bounce, and noise [1]. The prevalent worst-case design philosophy leads to a large energy overhead, thereby reducing the benefits of technology scaling. Furthermore, the growing functional complexity of next-generation applications leads to a power problem. Worst-case designs address the reliability problem at the expense of power consumption, while nominal-case designs, though power efficient, result in a loss in manufacturing yield.

Error resiliency is an alternative design philosophy for achieving both energy efficiency and robustness in systems implemented in nanoscale CMOS technologies. Error-resilient systems are designed at the nominal PVT corner, and intermittent errors are corrected via circuit [2], logic [3]–[7], or architecture/system level [8]–[11] techniques.

The main challenge in error-resilient designs is to achieve robustness with low error-compensation overhead. A number of logic/microarchitecture level techniques for robust system design have been proposed. Design of reliable logic networks


from unreliable gates dates back to von Neumann [12] in the 1950s, where logic-level replication with majority voting was proposed to increase resiliency. More recently, logic-level approaches such as RAZOR [3], error detection sequential (EDS) [6], and Markov random field [2] have also been proposed, as well as stochastic logic [4]. These techniques either lead to a large overhead, limiting their energy efficiency, or are unable to compensate efficiently for high error rates.

These techniques miss the opportunity presented by a large class of next-generation applications where a relaxed definition of "correctness" is adopted. Emerging applications [13], especially those in media processing, employ statistical performance metrics, such as signal-to-noise ratio (SNR) in video compression, bit error rate (BER) in data communications, and probability of detection in face/target recognition, which admit an expanded set of acceptable outputs. This expansion of the acceptable output set enables the use of statistical error compensation techniques. There will always be a class of critical applications, such as those in finance/banking and flight control systems, where the set of acceptable outputs is much narrower and N-modular redundancy (NMR) is typically employed to achieve robustness.

Recently, a number of statistical error compensation techniques referred to as stochastic computing techniques [11] have emerged. These techniques exploit the expanded output space in emerging applications, along with statistical inference techniques such as estimation and detection, to compensate for errors. Techniques such as algorithmic noise tolerance (ANT) [8] and stochastic sensor network-on-a-chip (SSNOC) [14] implicitly employ error statistics of the architectural components, while soft NMR [15] does so explicitly.

In this paper, we propose the technique of likelihood processing (LP), which also explicitly employs error statistics for error compensation. It does so by computing the likelihood, i.e., the ratio of the probability of an output bit being a '1' vs. the probability of it being a '0'. We demonstrate the energy and robustness benefits of LP in the design of a 2D discrete-cosine transform (2D-DCT) codec, a widely used image processing kernel, in a commercial 45-nm CMOS process. The energy-robustness trade-off in LP is evaluated in the presence of voltage overscaling (VOS) [8], whereby the supply voltage is reduced below the critical supply voltage, i.e., the minimum supply voltage required to avoid timing violations. VOS offers a well-quantified trade-off between energy and error rates and can be employed to emulate timing violations due to PVT variations as well. Preliminary results on the benefits of LP have appeared in [16].

The paper is organized as follows: Section II presents a unified framework for error-resiliency techniques including LP.



Fig. 1. Computational error model: (a) additive error model, (b) sample error statistics, and (c) measured error PMF of a 20-bit output filter IC in 45-nm CMOS.

Section III formally describes the algorithmic basis for LP, its architecture, and presents a motivational example. Section IV demonstrates the benefits of LP in terms of robustness and energy efficiency in the design of a 2D-DCT image codec. Finally, Section V concludes with a discussion of future work.

II. A UNIFIED FRAMEWORK FOR ERROR RESILIENCY

This section presents a unified framework for describing error-resiliency techniques, including both conventional and statistical techniques, in order to relate the proposed LP technique to existing work.

The error model for an arbitrary computational kernel M (see Fig. 1(a)) is given by:

$y = y_o + \eta$, with $\eta = \eta_h + \eta_e$    (1)

where $y$ is the $B$-bit observed output, $y_o$ is the correct (error-free) output, $\eta_h$ and $\eta_e$ are the hardware and estimation errors, respectively, and $\eta$ is the composite error. Though $y_o$ can belong to a set of acceptable outputs, we take a conservative approach and assume that the cardinality of such a set is unity, i.e., there is one ideal/error-free value for $y_o$. The set of all possible outputs of M is referred to as the output space $\mathcal{Y}$. Note that, without any loss of generality, we employ a weighted number representation, e.g., two's complement, for the output and error signals, which is quite appropriate for media kernels, as these tend to employ arithmetic operations quite extensively. This does not preclude the use of other number representations, both weighted and non-weighted.

In this paper, the hardware errors $\eta_h$ arise due to timing violations. These errors are typically large in magnitude because the arithmetic operations in DSP kernels are least-significant-bit (LSB) first. This is reflected in the sample probability mass function (PMF) in Fig. 1(b). On the other hand, the estimation error $\eta_e$ is small in magnitude (see Fig. 1(b)) because it can arise as a difference between $y_o$ and an internal signal in block M, or perhaps the output of another lower-complexity block that tries to approximate/estimate the output of M, i.e., an estimator. The topic of error characterization of a computational block is an interesting one in its own right, and can be accomplished in many ways, both off-line as well as via in situ calibration using typical inputs. Characterization of timing errors for computational blocks in the form of PMFs has been addressed in [17]. Fig. 1(c) shows a measured error PMF obtained from a voltage-overscaled IC in a 45-nm CMOS process.
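To make the error model in (1) concrete, the following Python sketch (ours, not part of the paper; the PMF values and the name sample_output are illustrative) draws a composite error from a tabulated PMF such as the ones in Fig. 1(b)-(c) and adds it to the error-free output:

```python
import random

def sample_output(y_o, error_pmf):
    """Return an observed output y = y_o + eta, where the composite error eta
    is drawn from a tabulated PMF given as {error_value: probability}."""
    values = list(error_pmf.keys())
    probs = list(error_pmf.values())
    eta = random.choices(values, weights=probs, k=1)[0]
    return y_o + eta

# Illustrative PMF: the output is correct 80% of the time (eta = 0) and is
# otherwise hit by large-magnitude errors, as is typical of timing violations.
error_pmf = {0: 0.80, -128: 0.07, 128: 0.07, -64: 0.03, 64: 0.03}
print([sample_output(100, error_pmf) for _ in range(3)])
```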

Fig. 2. Existing architectural-level error-resiliency techniques: (a) NMR, (b) algorithmic noise tolerance (ANT), (c) stochastic sensor network-on-chip (SSNOC), and (d) soft NMR.

We now describe existing error-resiliency techniques (see Fig. 2) using the following definitions:
1) Observation vector $\mathbf{y} = [y_1, y_2, \ldots, y_N]$, where $y_k = y_o + \eta_{h,k} + \eta_{e,k}$ for $k = 1, \ldots, N$.
2) Decision rule $\hat{y} = f(\mathbf{y})$, where $\hat{y}$ is the corrected/final output.

For example, NMR (see Fig. 2(a)) employs N-way replication, i.e., $y_k = y_o + \eta_{h,k}$ and $\eta_{e,k} = 0$, to generate the observation vector, and employs majority or other forms of voting strategies as the decision rule. NMR is effective if the hardware errors $\eta_{h,k}$ are independent, and is described as follows:
1) $\mathbf{y} = [y_1, \ldots, y_N]$, where $y_k = y_o + \eta_{h,k}$ for $k = 1, \ldots, N$.
2) $\hat{y} = \mathrm{majority}(y_1, \ldots, y_N)$, where $\mathrm{majority}(\cdot)$ is the majority operator.
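A minimal sketch of the NMR decision rule above (illustrative, not the paper's implementation); it simply returns the most frequent observation and is therefore effective only when a majority of the N replicas agree on the exact value:

```python
from collections import Counter

def nmr_vote(observations):
    """NMR/TMR decision rule: majority vote over the N replica outputs."""
    value, _count = Counter(observations).most_common(1)[0]
    return value

print(nmr_vote([100, 228, 100]))  # two of three replicas agree -> 100
```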

Algorithmic noise tolerance (ANT) [8] employs an estimator block (see Fig. 2(b)), which is a low-complexity version of the M-block, to generate an estimate of the error-free output $y_o$. The estimator block is designed to be free of hardware errors, and thus its output is corrupted by the estimation error $\eta_e$ only. ANT relies on the difference in the statistics of the hardware error $\eta_h$ of the M-block and the estimation error $\eta_e$, which can be made small (see Fig. 1(b)), to detect and correct for $\eta_h$. Thus, ANT is described as:
1) $\mathbf{y} = [y_1, y_2]$, where $y_1 = y_o + \eta_h$ and $y_2 = y_o + \eta_e$.
2) $\hat{y} = y_1$ if $|y_1 - y_2| < \tau$, and $\hat{y} = y_2$ otherwise,
where $\tau$ is a user-defined threshold that maximizes an application-level performance metric.
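The ANT decision rule reduces to a single comparison against the threshold; a small sketch under the same illustrative assumptions (the threshold value is made up):

```python
def ant_correct(y_main, y_est, tau):
    """ANT: keep the error-prone main output unless it deviates from the
    hardware error-free (but approximate) estimate by more than tau."""
    return y_main if abs(y_main - y_est) < tau else y_est

print(ant_correct(100, 97, tau=16))  # small deviation -> keep main output (100)
print(ant_correct(228, 97, tau=16))  # large deviation -> fall back to estimate (97)
```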

The stochastic sensor network-on-chip (SSNOC) [14] in Fig. 2(c) decomposes the M-block into a set of statistically similar estimator blocks, and employs robust estimation techniques for error compensation, such as the median or the Huber estimator, which relies on penalizing outliers [18]. Thus, SSNOC is described as:
1) $\mathbf{y} = [y_1, \ldots, y_N]$, where $y_k = y_o + \eta_{h,k} + \eta_{e,k}$ for $k = 1, \ldots, N$.
2) $\hat{y} = f(\mathbf{y})$, e.g., $\hat{y} = \mathrm{median}(y_1, \ldots, y_N)$ or the Huber estimate.
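A sketch of a median-based SSNOC fusion rule (one of the robust estimators mentioned above; the Huber estimator would replace the median with an iteratively reweighted mean):

```python
import statistics

def ssnoc_fuse(sensor_outputs):
    """Robust fusion of statistically similar estimator outputs: the median
    suppresses the (assumed rare) large hardware-error outliers."""
    return statistics.median(sensor_outputs)

print(ssnoc_fuse([97, 99, 228, 101]))  # 100.0; the outlier 228 is rejected
```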

Soft NMR [15] (see Fig. 2(d)) employs the same observation vector as NMR. However, unlike NMR, soft NMR exploits the hardware error PMF to implement a decision rule based on the maximum-likelihood (ML) principle to enhance system robustness. Soft NMR is described as:
1) $\mathbf{y} = [y_1, \ldots, y_N]$, where $y_k = y_o + \eta_{h,k}$ for $k = 1, \ldots, N$.
2) $\hat{y} = \arg\max_{\tilde{y}} \prod_{k=1}^{N} P_{\eta_h}(y_k - \tilde{y})$, where $P_{\eta_h}(\cdot)$ is the hardware error PMF.

Fig. 3. The proposed technique: likelihood processing (LP).

Our proposed technique LP (see Fig. 3) consists of a computational block generating an N-element observation vector, a likelihood generator (LG), and a slicer. Fig. 4 shows that the block can be designed via one or more of the following techniques: 1) replication, 2) estimation, and 3) exploiting inherent signal correlations in M. In the latter case, intermediate signals from M are employed to generate the observations, thereby avoiding any hardware replication (see Fig. 4(c)). For example, adjacent pixels in image/video processing applications have correlated values, thereby providing multiple observations with low overhead.

The LG-block in Fig. 3 employs the composite error PMF to compute the a-posteriori probability (APP) ratio for each output bit $y_{o,i}$, $i = 1, \ldots, B$, of the $B$-bit output. This soft information provides a measure of the confidence/reliability of each output bit. For example, our confidence in bit $y_{o,i}$ being a '1' increases with the numerical value of the APP ratio, and vice versa for bit $y_{o,i}$ being a '0'. The slicer in Fig. 3 thresholds the APP ratio to obtain a hard estimate $\hat{y}_{o,i}$. In this paper, we consider only hard decisions for simplicity and ignore the additional improvement available by exploiting soft information further. The LP technique can be described as:
1) $\mathbf{y} = [y_1, \ldots, y_N]$, where $y_k = y_o + \eta_{h,k} + \eta_{e,k}$ for $k = 1, \ldots, N$.

2) $\hat{y}_{o,i} = 1$ if $\gamma_i > 1$, and $\hat{y}_{o,i} = 0$ otherwise, for $i = 1, \ldots, B$, where $\gamma_i$ is the APP ratio of bit $i$ computed by the LG-block.

We employ the notation LPNx-(B) to denote the LP technique operating on an observation vector of length N and generating a B-bit error-compensated output. The character 'x' in LPNx-(B) denotes the architectural setup of the block generating the observations and can take the values 'r', 'e', or 'c', corresponding to the three architectural choices: replication, estimation, or correlation, respectively.

Fig. 4. Techniques to generate the observation vector: (a) replication, (b) estimation, and (c) spatio-temporal correlation.

Section III describes the LP framework in detail and shows how to generate the APP ratio for each output bit.

III. THE PROPOSED TECHNIQUE: LIKELIHOOD PROCESSING (LP)

In this section, we present LP in its most general form, illustrate it through an example, and then describe an efficient architecture.

A. The LP Algorithm

Consider the computational block M in Fig. 1(a), whose correct output $y_o$ is represented with $B$ bits and manifests an output space $\mathcal{Y}$. Furthermore, employing one of the observability-enhancing techniques in Fig. 4, an observation vector $\mathbf{y} = [y_1, \ldots, y_N]$ is generated, where $y_k = y_o + \eta_k$. The error PMFs $P_{\eta_k}(\cdot)$, $k = 1, \ldots, N$, are assumed to be known, as in the case of soft NMR [15]. For each output bit $y_{o,i}$, we need to compute the APP ratio $\gamma_i$ defined as follows:

$\gamma_i = \dfrac{P(y_{o,i} = 1 \mid \mathbf{y})}{P(y_{o,i} = 0 \mid \mathbf{y})}$    (2)

In practice, computing the log-domain APP ratio $\lambda_i$ simplifies the algorithm and implementation, and is given by:

$\lambda_i = \log_2 \gamma_i = \log_2 P(y_{o,i} = 1 \mid \mathbf{y}) - \log_2 P(y_{o,i} = 0 \mid \mathbf{y})$    (3)

Thus, our confidence in $y_{o,i}$ being a '1' is very high if $\lambda_i \gg 0$, and vice versa for $y_{o,i}$ being a '0'.


Applying the Bayes rule to (2), we obtain:

$P(y_{o,i} = b \mid \mathbf{y}) = \dfrac{P(\mathbf{y}, y_{o,i} = b)}{P(\mathbf{y})}$    (4)

where $P(\mathbf{y})$ is the probability of observing the vector $\mathbf{y}$ and $b \in \{0, 1\}$. Substituting (4) in (2), we obtain:

$\gamma_i = \dfrac{P_i^{(1)}}{P_i^{(0)}}$    (5)

where we denote $P_i^{(b)} = P(\mathbf{y}, y_{o,i} = b)$ for $b = 0$ or 1. For each output bit $i$, the probabilities $P_i^{(b)}$ are generated by a bit-level to word-level mapping in the output space as follows:

$P_i^{(b)} = \sum_{\tilde{y} \in \mathcal{Y}_i^{(b)}} P(\mathbf{y} \mid y_o = \tilde{y})\, P(y_o = \tilde{y})$    (6)

where $\mathcal{Y}_i^{(b)} = \{\tilde{y} \in \mathcal{Y} : \tilde{y}_i = b\}$, i.e., the set of all possible outputs that have the $i$th bit equal to $b$, and $P(y_o = \tilde{y})$ is the prior probability, i.e., the distribution of the error-free output $y_o$.

We assume that the errors are independent in order to reduce the memory complexity inherent in the storage of joint error PMFs. Errors such as soft errors are inherently independent. On the other hand, errors such as timing errors are typically dependent across multiple observations. Since timing errors are data and architecture dependent, one can obtain independent errors by: 1) employing different schedules per processing element, which in effect scrambles the input data sequence, and 2) employing different architectures for the processing elements. These techniques have been shown to be effective in generating independent errors [19]. Assuming independent errors, the conditional PMF in (6) can be written as:

$P(\mathbf{y} \mid y_o = \tilde{y}) = \prod_{k=1}^{N} P(y_k \mid y_o = \tilde{y}) = \prod_{k=1}^{N} P_{\eta_k}(y_k - \tilde{y})$    (7)

Therefore, substituting (7) into (6) results in

$P_i^{(b)} = \sum_{\tilde{y} \in \mathcal{Y}_i^{(b)}} P(y_o = \tilde{y}) \prod_{k=1}^{N} P_{\eta_k}(y_k - \tilde{y})$    (8)

Therefore, substituting (8) in (5) provides the final expression for $\gamma_i$ as follows:

$\gamma_i = \dfrac{\sum_{\tilde{y} \in \mathcal{Y}_i^{(1)}} P(y_o = \tilde{y}) \prod_{k=1}^{N} P_{\eta_k}(y_k - \tilde{y})}{\sum_{\tilde{y} \in \mathcal{Y}_i^{(0)}} P(y_o = \tilde{y}) \prod_{k=1}^{N} P_{\eta_k}(y_k - \tilde{y})}$    (9)

Next, we illustrate the LP algorithm with an example.
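The expression in (9) maps directly to a search over the output space; the following sketch (ours; the LSB-first bit indexing, the dictionary-based PMFs, and the function names are assumptions for illustration) computes the exact per-bit APP ratios and the sliced output:

```python
def lp_bit_app_ratios(obs, error_pmfs, prior, B):
    """Exact LP per (9): for each bit i, accumulate prior * product of
    per-observation error probabilities over candidates with bit i = 1 (num)
    and bit i = 0 (den). error_pmfs[k] maps an error value to its probability."""
    num = [0.0] * B
    den = [0.0] * B
    for y_tilde in range(2 ** B):                 # search the output space
        p = prior.get(y_tilde, 0.0)
        for k, y_k in enumerate(obs):             # independence assumption (7)
            p *= error_pmfs[k].get(y_k - y_tilde, 0.0)
        for i in range(B):                        # bit-level/word-level map (6)
            if (y_tilde >> i) & 1:
                num[i] += p
            else:
                den[i] += p
    return [n / d if d > 0 else float("inf") for n, d in zip(num, den)]

def lp_decide(obs, error_pmfs, prior, B):
    """Slicer: set bit i of the corrected output to 1 when gamma_i > 1."""
    gammas = lp_bit_app_ratios(obs, error_pmfs, prior, B)
    return sum(1 << i for i, g in enumerate(gammas) if g > 1.0)

pmf = {0: 0.9, 8: 0.05, -8: 0.05}           # illustrative error PMF
prior = {y: 1 / 16 for y in range(16)}      # uniform prior over a 4-bit output
print(lp_decide([5, 5, 13], [pmf] * 3, prior, B=4))  # -> 5
```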

Fig. 5. An example of LP: (a) 2-bit output erroneous computational block and (b) a 2-bit sample error PMF.

B. Motivational Example

Consider the computation kernel M in Fig. 5(a) with a 2-bit output corrupted by a 2-bit error according to the PMF in Fig. 5(b), where the pre-correction error rate (component probability of error) is $p_e = P(\eta \neq 0)$, i.e., the percentage of clock-cycles during which the final output is in error. Assume that we rely on a single output observation, i.e., $N = 1$ and $\mathbf{y} = [y_1]$. We refer to this form of likelihood processing as LP1r-(2), i.e., LP with $N = 1$ and $B = 2$.

The LG-block computes $P_i^{(b)}$ in (6) by considering all possible outputs as follows:

(10)

Assuming all outputs are equally likely to occur, the priors in (10) are equal. Employing the error PMF in Fig. 5(b), we write (10) as

(11)

For a given pre-correction error rate, (11) yields the probabilities of the first bit being a '1' and a '0'; if the resulting APP ratio of the first bit is greater than one, the slicer in Fig. 3 sets that bit to '1'. Repeating the computation for the second bit yields its decision as well. In this way, the LG-block generates the final sliced error-compensated output, which may differ from the M-block output whenever, given the knowledge of the error PMF, there is a high probability that the observation is in error.

Next consider LP3r-(2), i.e., we have three identical copies of the M-block in Fig. 5(a) generating the observation vector $\mathbf{y} = [y_1, y_2, y_3]$, followed by a three-input LG-block and a slicer. Without any loss of generality, we assume


that $y_1$, $y_2$, and $y_3$ are corrupted by independent identically distributed (iid) errors $\eta_1$, $\eta_2$, and $\eta_3$, respectively. In other words, the errors $\eta_1$, $\eta_2$, and $\eta_3$ are independent and follow the same error PMF in Fig. 5(b). If two of the three observations are corrupted by the same highly probable error value, then TMR selects the erroneous value via a majority vote. On the other hand, a smart voter with knowledge of the error statistics would realize that the remaining observation is the correct output, because the agreeing pair is better explained by a likely error value (see Fig. 5(b)). For example, employing LP3r-(2), the computation of $P_i^{(b)}$ in (6) follows (10), with the product of three error-PMF terms per candidate output. Assuming equal priors, the resulting APP ratios determine both output bits, and the corrected output $\hat{y}$ differs from the majority vote whenever the error statistics make that choice more likely.

The effectiveness of LP over the conventional design can be calculated for this example by injecting errors at the output of the 2-bit computational kernel(s) according to the PMF in Fig. 5(b) at various pre-correction error rates $p_e$. LP employs the LG-block to generate the bit probabilities from the observations, followed by a slicer to produce a hard estimate $\hat{y}$, while the conventional design directly uses the single observation and TMR uses a majority vote. The system correctness metric, defined as the probability that the final output equals the error-free output, i.e., $P(\hat{y} = y_o)$, is employed to compare LP to the conventional design.

Fig. 6. System correctness of a 2-bit output system at different pre-correction error rates $p_e$.

From Fig. 6, we observe that LP3r-(2) outperforms TMR for all values of $p_e$. Second, system correctness for both LP1r-(2) and LP3r-(2) increases with $p_e$ at high error rates. This unusual outcome occurs because LP understands that the observations are unreliable for high values of $p_e$, and thus tends to choose outputs that do not belong to the observation set. Note that at high $p_e$, system correctness for TMR falls below even LP1r-(2) and the conventional system, because the probability of two or more identical errors becomes larger, and hence the majority voter selects the wrong values more often. On the other hand, LP exploits the knowledge of the error distribution in Fig. 5(b), i.e., different error magnitudes have different error probabilities, to correct for errors.
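Since the numeric PMF of Fig. 5(b) is not reproduced here, the following self-contained sketch uses an assumed 2-bit error PMF (unequal probabilities for the nonzero error values, wrapping modulo 4) to estimate the system correctness $P(\hat{y} = y_o)$ of TMR and LP3r-(2) by Monte Carlo, in the spirit of the comparison plotted in Fig. 6:

```python
import random
from collections import Counter

B = 2
OUTPUTS = list(range(2 ** B))

def make_pmf(p_e):
    """Assumed 2-bit error PMF: correct w.p. 1 - p_e, otherwise one of the
    nonzero error values with unequal probabilities (errors wrap modulo 4)."""
    return {0: 1.0 - p_e, 1: 0.2 * p_e, 2: 0.7 * p_e, 3: 0.1 * p_e}

def lp_decide(obs, pmf):
    """Exact per-bit LP with a uniform prior over the four candidate outputs."""
    out = 0
    for i in range(B):
        p1 = p0 = 0.0
        for y_t in OUTPUTS:
            p = 1.0
            for y_k in obs:
                p *= pmf[(y_k - y_t) % 4]
            if (y_t >> i) & 1:
                p1 += p
            else:
                p0 += p
        if p1 > p0:
            out |= 1 << i
    return out

def tmr_vote(obs, pmf):
    return Counter(obs).most_common(1)[0][0]

def correctness(p_e, voter, trials=20000):
    pmf, hits = make_pmf(p_e), 0
    vals, wts = list(pmf.keys()), list(pmf.values())
    for _ in range(trials):
        y_o = random.choice(OUTPUTS)
        obs = [(y_o + random.choices(vals, weights=wts)[0]) % 4 for _ in range(3)]
        hits += (voter(obs, pmf) == y_o)
    return hits / trials

for p_e in (0.1, 0.3, 0.5):
    print(p_e, correctness(p_e, tmr_vote), correctness(p_e, lp_decide))
```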

C. The N-Input Likelihood Generator (LG-Block) Architecture

We use log-domain processing of probabilities to simplify the implementation of the LG-block in Fig. 3. Taking the logarithm (base 2) of (8), we obtain:

$\log_2 P_i^{(b)} = \log_2 \sum_{\tilde{y} \in \mathcal{Y}_i^{(b)}} 2^{\,\log_2 P(y_o = \tilde{y}) + \sum_{k=1}^{N} \log_2 P_{\eta_k}(y_k - \tilde{y})}$    (12)

We use the log-max approximation [20] to simplify (12). The log-max approximation is given by:

$\log_2 \left( \sum_j 2^{a_j} \right) = \max_j a_j + c$    (13)

where $c$ is a correction term. Thus, ignoring the correction term in (13), (12) can be written as:

$\log_2 P_i^{(b)} \approx \max_{\tilde{y} \in \mathcal{Y}_i^{(b)}} m(\tilde{y})$    (14)

where $m(\tilde{y})$ is referred to as the word-metric and is defined as:

$m(\tilde{y}) = \log_2 P(y_o = \tilde{y}) + \sum_{k=1}^{N} \log_2 P_{\eta_k}(y_k - \tilde{y})$    (15)

From (5) and (14), the log-APP ratio for bit $i$ is computed as follows:

$\lambda_i \approx \max_{\tilde{y} \in \mathcal{Y}_i^{(1)}} m(\tilde{y}) - \max_{\tilde{y} \in \mathcal{Y}_i^{(0)}} m(\tilde{y})$    (16)

where $\lambda_i = \log_2 \gamma_i$, and $m(\tilde{y})$ is given by (15).

The LG architecture implementing (16) is shown in Fig. 7.
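A software analogue of the folded LG-processor's data flow, under the same illustrative assumptions as the earlier sketches: each iteration corresponds to one candidate output fed to the metric unit, and the two running maxima per bit play the role of the compare-select units in (16):

```python
import math

FLOOR = 2.0 ** -40   # assumed floor for zero-probability PMF entries (avoids log2(0))

def log_app_ratios(obs, error_pmfs, prior, B):
    """Max-log LP per (14)-(16): lambda_i is the difference of the maximum
    word-metrics over candidates with bit i = 1 and with bit i = 0."""
    max1 = [-math.inf] * B
    max0 = [-math.inf] * B
    for y_tilde in range(2 ** B):                          # one candidate per cycle
        m = math.log2(prior.get(y_tilde, 0.0) or FLOOR)            # prior LUT
        for k, y_k in enumerate(obs):                              # metric unit (15)
            m += math.log2(error_pmfs[k].get(y_k - y_tilde, 0.0) or FLOOR)
        for i in range(B):                                         # CS units
            if (y_tilde >> i) & 1:
                max1[i] = max(max1[i], m)
            else:
                max0[i] = max(max0[i], m)
    return [a - b for a, b in zip(max1, max0)]   # lambda_i; slice at 0

pmf = {0: 0.9, 8: 0.05, -8: 0.05}
prior = {y: 1 / 16 for y in range(16)}
print(log_app_ratios([5, 5, 13], [pmf] * 3, prior, B=4))
```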

The look-up tables (LUTs) in Fig. 7 store the output prior $P(y_o)$ and the error PMFs $P_{\eta_k}(\cdot)$. The LG-block generates the log-APP ratios after $2^B$ clock-cycles. In each clock cycle, a specific candidate output $\tilde{y}$ is fed into the metric unit (MU). The MU compares $\tilde{y}$ to each of the observations and generates the word-metric $m(\tilde{y})$. For each bit $i$, there are two recursive compare-select (CS) units to keep track of the maximum values of $m(\tilde{y})$ computed over $\mathcal{Y}_i^{(1)}$ and $\mathcal{Y}_i^{(0)}$, respectively, according to (16).

Fig. 7. A 1-parallel (folded) LG-processor architecture (MU: metric unit and CS2: 2-operand compare-select unit).

TABLE I. Algorithmic complexity of an L-parallel LG-processor (ADD2: 2-input addition, and CS2: 2-input compare-select operation).

1) Complexity and Power Overhead: The complexity of the LG-block depends only on the output precision $B$ and the number of observations $N$, and is independent of other parameters of the main M-block. Thus, as the complexity of M increases, the LP overhead constitutes a smaller portion of the total system complexity, resulting in higher energy and robustness benefits. In this paper, we illustrate LP at the level of a DCT codec. Since multimedia applications employ processing kernels that have similar or higher complexity compared with the DCT, the LP energy and robustness benefits that will be demonstrated on the DCT codec are expected to be similar or higher in other multimedia kernels.

The LG-architecture in Fig. 7 needs $2^B$ clock-cycles to compute the log-APP ratios $\lambda_i$, $i = 1, \ldots, B$. Parallelization by a factor of $L$ reduces the number of clock-cycles to $2^B/L$ but increases hardware complexity. An algorithmic complexity estimate of an L-parallel LG-block operating on $B$ bits is shown in Table I. The storage of the prior and error PMFs ($P(y_o)$ and $P_{\eta_k}$) grows as $O(2^B)$ entries, where it is assumed that the PMF values are quantized to a fixed number of bits.

D. Low-Complexity LP Architectures

The complexity and power overhead of LP can be reduced significantly by probabilistic activation and bit-subgrouping, as explained next.

1) Probabilistic Activation: The LG-block in LP can be activated only when there is a large difference between the observations $y_k$, which indicates the presence of a large error (see the activation block in Fig. 8). Assuming hardware errors are large and independent across the observations, the LG-block activation factor $P_a$ is given by:

$P_a = 1 - \prod_{k=1}^{N} (1 - p_k)$    (17)

where $p_k$ is the probability of hardware error in $y_k$.

Fig. 8. A bit-subgrouped LG-processor architecture with a probabilistic activation module.

2) Bit-Subgrouping: Since the output search space is exponential in the number of output bits $B$, the complexity and power of the LG-block can be reduced via bit-subgrouping, as shown in Fig. 8. In bit-subgrouping, the $B$-bit output is divided into $G$ subgroups with precisions $B_1, \ldots, B_G$, respectively, such that $\sum_{g=1}^{G} B_g = B$. Then, LP is applied independently on each subgroup of $B_g$ bits. With bit-subgrouping, LPNx-(B) is denoted as LPNx-$(B_1, \ldots, B_G)$. Bit-subgrouping significantly reduces the output search space, and thus the LP overhead. For example, if the $B$-bit output is uniformly divided into $G$ subgroups each with $B/G$ bits, $G$ smaller LG-blocks are needed instead of a single one, thereby reducing the storage and computational complexity of the LG-block from $O(2^B)$ to $O(G \cdot 2^{B/G})$. However, as $G$ increases, the system-level correctness or robustness of LP will be reduced, since bit-subgrouping ignores error correlations between adjacent subgroups of bits.
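Two small sketches of the overhead-reduction knobs just described (illustrative; the (3, 5) width ordering below assumes LSB-first subgroups, whereas the paper's (5,3) notation lists the MSB group first):

```python
def activation_factor(error_probs):
    """Activation factor of (17): probability that at least one of the N
    observations is in error, assuming large, independent hardware errors."""
    p_all_correct = 1.0
    for p in error_probs:
        p_all_correct *= (1.0 - p)
    return 1.0 - p_all_correct

def split_bits(word, group_widths):
    """Bit-subgrouping: split a B-bit word into subgroups of the given widths
    (LSB-first), so LP can be run independently on each smaller group."""
    groups = []
    for w in group_widths:
        groups.append(word & ((1 << w) - 1))
        word >>= w
    return groups

print(activation_factor([0.04, 0.04, 0.04]))   # ~0.115
print(split_bits(0b10110101, (3, 5)))          # [0b101, 0b10110] = [5, 22]
```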

IV. SIMULATION RESULTS

We demonstrate the benefits of LP in terms of robustness and energy efficiency in the design of a two-dimensional inverse discrete-cosine transform (2D-IDCT) image codec subject to PVT errors. The DCT-IDCT transform (see Fig. 9(a)) is applied on 256 × 256 8-bit pixel images, stored initially in memory (Mem), in blocks of 8 × 8 pixels using Chen's algorithm [21]. Each 2D transform employs two 1D transforms: the first is applied column-wise on the input block, and the second is applied row-wise on the output of the first. Transposition memory (TM) is used to swap the data between rows and columns. The quantizer (Q) and inverse quantizer (Q⁻¹) employ the JPEG quantization table for compression. Only the receiver computational kernels (Q⁻¹ and IDCT blocks) are subject to hardware errors. The error-free DCT-IDCT codec achieves a peak signal-to-noise ratio (PSNR) of 33 dB, where the PSNR is defined as

$\mathrm{PSNR} = 10 \log_{10} \left( \dfrac{255^2}{\mathrm{MSE}} \right)$    (18)

where MSE is the mean-squared error between the original and the decoded image pixels.

Fig. 9. An 8-bit output 2D-DCT/IDCT codec: (a) single codec, (b) replication set-up, (c) estimation set-up, and (d) spatial correlation set-up.
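For reference, a small sketch of the PSNR computation in (18) for 8-bit images, with the images given as flat sequences of pixel values (illustrative helper, not the paper's code):

```python
import math

def psnr(ref, test, peak=255.0):
    """PSNR = 10*log10(peak^2 / MSE) between a reference and a test image."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

print(round(psnr([100, 120, 130], [102, 119, 133]), 2))  # 41.44
```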

A. System and Architecture Set-Up

We employ three different setups to generate multiple observations of the 8-bit output pixel in order to detect and correct hardware errors in the 2D-IDCT block:

1) Replication (see Fig. 9(b)): the 2D-IDCT kernel is replicated to provide exact estimates of the output corrupted by hardware errors only. Such a setup is typical of robust general computing systems and can be employed in applications where complexity can be traded for increased robustness. Data, architecture, and scheduling diversity can be employed to ensure error independence across redundant kernels [19]. For example, the main processing elements in the replicated IDCT blocks (adders) can use different architectures such as ripple-carry, carry-save, or carry-bypass adders. This will ensure spatial error independence due to the strong dependence of error statistics on hardware architecture [17].

2) Estimation (see Fig. 9(c)): a reduced-precision redundancy (RPR) estimator of the 2D-IDCT is employed in parallel with the main 2D-IDCT. While the main block is designed to operate on 8-b pixel inputs, the RPR estimator is designed to operate on the three most-significant bits (MSBs) of the pixel value, and thus it is of lower complexity, allowing it to be designed hardware error-free at low overhead. The estimator output is corrupted by estimation error only.

3) Spatial correlation (see Fig. 9(d)): application-level data correlations are exploited to generate multiple observations at very low overhead, thereby avoiding approximate or full replication overhead. In the IDCT output, pixels in adjacent rows have similar values and have independent errors, since the 1D-IDCT is applied row-wise on the image in the final step of the 2D-IDCT. Thus, pixels in adjacent rows are employed to generate multiple observations. For example, a 4-element observation vector for the pixel at a given row and column is generated by choosing that pixel together with pixels in the same column of adjacent rows (a sketch of this construction is given at the end of this subsection). Thus, pixels in the observation vector other than the current pixel are corrupted by both estimation and hardware errors, while the current pixel is corrupted by hardware errors only.

Error correction mechanisms, such as NMR (Fig. 2(a)), ANT (Fig. 2(b)), and soft NMR (Fig. 2(d)), and the proposed technique LP (Fig. 3) are employed to process the observation vector and correct for errors in each setup. All error detection and correction blocks are operated at the critical supply voltage to ensure correct operation while consuming minimum power. Different bit-subgroupings of the 8-bit output (see Fig. 8) are applied to study the trade-off between LP complexity overhead and performance. In addition, a probabilistic activation module for the LG-block (see Fig. 8) is employed to decrease the LP power overhead. The error and prior PMFs are quantized to 8 bits before being stored by the LG-block. The complexity of the different blocks in the error-compensated 2D-IDCTs is shown in Table II. The power overhead of LP compared to a single 2D-IDCT block is shown in Fig. 10. When the 256-parallel LG-processor is fully activated, its power amounts to 70% of the power of a single 2D-IDCT block. Probabilistic activation decreases the LP power overhead further with every decade decrease in the pre-correction error rate, while folding the LG-processor from fully-parallel to 1-parallel and bit-subgrouping from (8) to (5,3) reduce the LP power overhead as well.
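A sketch of the spatial-correlation observation-vector construction referred to above (the exact neighboring-pixel choice is our assumption for illustration: the current pixel plus the co-located pixels of the previous rows):

```python
def corr_observations(image, r, c, n_obs=3):
    """Build an n_obs-element observation vector for pixel (r, c) from the
    current pixel and the same-column pixels of the previous rows; the
    neighboring pixels act as estimators of the current pixel."""
    rows = [max(r - k, 0) for k in range(n_obs)]   # clamp at the image border
    return [image[row][c] for row in rows]

img = [[10, 12], [11, 13], [50, 14]]   # pixel (2, 0) hit by a large hardware error
print(corr_observations(img, 2, 0))    # [50, 11, 10]
```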

B. Error Characterization and Simulation Procedure

Likelihood processing requires statistical characterization of the output errors of the computation engine (DCT-IDCT codec). Accurate modeling of errors increases resiliency at the expense of increased storage requirements and search space for LP. Generally, hardware errors are a function of the PVT settings, the input space, and the architecture. We showed in [17] that timing error statistics are relatively independent of the input statistics and are a strong function of the architecture. Hence, a training input data-set can be employed to statistically characterize the error statistics of the M-block. This captures the dependence of hardware errors on the architecture, indirectly considers the dependence on the input statistics, and provides the LG-block with a good estimate of the actual output errors. This training phase can be performed either off-chip or on-chip.

We evaluate the robustness and energy efficiency of the proposed soft-output technique under timing violations caused by PVT variations. As mentioned earlier, VOS is employed to emulate these timing violations using a 45-nm TI CMOS process. Keeping the frequency fixed, the supply voltage is reduced beyond the critical design voltage so that intermittent timing errors appear.

Fig. 10. LP power overhead percentage compared with a single 2D-IDCT block under different LG-processor configurations and pre-correction error rates.

TABLE II. Gate complexity (normalized to NAND2) of building blocks in error-compensated 2D-IDCT architectures.

The simulation methodology involves two steps:

1) Training phase: an error PMF is obtained via an RTL simulation of the M-block employing a training input data-set as follows:
• Step 1: circuit simulations are employed to characterize the worst-case delay vs. supply voltage relationship of the gate library.
• Step 2: at each supply voltage, a gate-level netlist of the M-block is simulated using gate delays obtained from Step 1 and employing the training data-set, while the frequency of operation is fixed to meet timing constraints only at the critical voltage. This step generates the erroneous output $y$ in (1).
• Error PMFs are obtained at each supply voltage by comparing $y$ and $y_o$ as shown in (1).

2) Operational phase: the M-block operates under VOS on an actual data-set, which is different from the training set, and exhibits data-dependent errors. LP employs the error PMF obtained during the training phase at the corresponding supply voltage for error compensation.
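The training phase amounts to building a histogram of the error η = y − y_o over the training data-set at each supply voltage; a minimal sketch (names are illustrative):

```python
from collections import Counter

def estimate_error_pmf(erroneous_outputs, golden_outputs):
    """Estimate the error PMF from paired erroneous (VOS) and error-free outputs
    produced by simulating the M-block on a training data-set at a given V_dd."""
    errors = [y - y_o for y, y_o in zip(erroneous_outputs, golden_outputs)]
    total = len(errors)
    return {eta: n / total for eta, n in Counter(errors).items()}

print(estimate_error_pmf([100, 228, 97, 100], [100, 100, 97, 100]))
# {0: 0.75, 128: 0.25}
```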

Fig. 11(a) shows the pre-correction error rate at the output of the 2D-IDCT as the supply voltage is reduced from 1.2 V to 0.6 V. Each point on the curve is characterized by an error PMF. For example, error PMFs for the 8-bit output 2D-IDCT at 1.1 V and 1.0 V are shown in Fig. 11(b) and (c), respectively. As the voltage is reduced, more spread in error values is observed because more circuit paths begin to fail to meet the timing constraints.

Fig. 11. VOS errors in the 2D-IDCT: (a) pre-correction error rate (component probability of error) vs. supply voltage, and output error PMFs at (b) 1.1 V and (c) 1.0 V.

In addition to error statistics, LP requires knowledge of the prior PMF $P(y_o)$. Training data can be employed to characterize it. Here, we assume the prior is uniform in order to make our results less dependent on the application context.

C. System Performance and Robustness

The PSNR of the DCT-IDCT codec output under replication (Fig. 9(b)) is shown in Fig. 12 for different pre-correction error rates (probability of output error for a single 2D-IDCT) corresponding to different supply voltages. Fig. 11(a) is employed to relate the error rate to the supply voltage. Fig. 12(a) shows that LP3r-(8) can tolerate a substantially higher pre-correction error rate than the conventional (uncompensated) single 2D-IDCT, TMR, and soft TMR in order to achieve a PSNR of 30 dB. Interestingly, as seen in Fig. 12(a), LP2r-(8), i.e., LP with dual-MR (DMR), behaves close to or even better than TMR at high error rates, i.e., LP with a replication factor of two outperforms TMR. This is unlike conventional DMR, which can only detect errors but not correct them. The effect of bit-subgrouping on LP3r performance is studied in Fig. 12(b). Bit-subgrouping of LP3r-(8) to LP3r-(5,3), where one sub-LP operates on the 5 MSBs and the second on the remaining 3 LSBs, minimally affects the robustness of LP3r-(8). As bit-subgrouping increases in order to decrease LP overhead, the robustness benefits of LP over TMR reduce, as expected. This is because more error correlations across adjacent bits are being ignored. However, even with 1-bit subgrouping LP still outperforms TMR.

Fig. 12. System robustness of the 2D DCT-IDCT codec under replication: (a) comparison to other error-resilient techniques without bit-subgrouping and (b) performance with bit-subgrouping.

The PSNR of the DCT-IDCT codec under the estimation setup (Fig. 9(c)) is shown in Fig. 13(a) for different pre-correction error rates. The 3-bit RPR estimator is not subject to VOS. The estimator alone achieves a PSNR of 22.2 dB. LP2e-(8) processes the two outputs of the main block and the estimator similarly to ANT (see Fig. 9(c)), and achieves a robustness enhancement compared to both the uncompensated single 2D-IDCT and ANT at a PSNR of 30 dB. LP2e-(8) robustness benefits reduce with bit-subgrouping (LP2e-(5,3)), and LP becomes less efficient than ANT.

Fig. 13. System robustness of the 2D DCT-IDCT codec using (a) estimation and (b) spatial correlation.

The PSNR of the DCT-IDCT codec under the spatial-correlation setup (Fig. 9(d)), where adjacent row pixels are used as estimators, is shown in Fig. 13(b). Only LP with (5,3) bit-subgrouping is shown. Similarly to the case of replication, simulations confirm that bit-subgrouping under the correlation setup from (8) to (5,3) shows negligible loss in performance. In Fig. 13(b), LP3c-(5,3) uses pixels in the two adjacent rows as estimators for the current pixel and achieves an increase in robustness at a PSNR of 30 dB compared to the conventional system. This level of robustness is similar to that achieved by TMR. However, the latter has two additional M-blocks (IDCTs) compared

Fig. 14. Sample codec output images: (a) original image, (b) error-free IDCT, (c) erroneous single IDCT, (d) majority-vote TMR, (e) LP3c-(5,3), (f) ANT, (g) LP3r-(5,3), and (h) LP2e-(8).

to LP3c-(5,3). Note that majority voting does not apply to the spatial-correlation setup, since the multiple observations are corrupted by estimation errors even when they are hardware error-free. Using two (current and previous-row) instead of three adjacent pixels, LP2c-(5,3) shows worse performance than LP3c-(5,3), since it has a smaller number of estimators. In fact, it behaves worse than the conventional design at low error rates, because there estimation errors dominate hardware errors and LP2c-(5,3) is not able to determine which of the two outputs is correct. LP4c-(5,3) performance also degrades compared to LP3c-(5,3) because LP4c-(5,3) employs pixels that are spatially farther apart than LP3c-(5,3), leading to higher estimation error. Thus, the performance of LP under the correlation setup depends upon the relative contribution of estimation and hardware errors to the PSNR.

The perceptual quality of a sample image under the different error-compensation techniques is shown in Fig. 14, where the underlying hardware has the same pre-correction error rate. TMR achieves only a 5 dB improvement in PSNR over the conventional system, thereby failing to improve the image quality significantly. LP3r-(5,3) achieves much better image quality with a PSNR of 28 dB. This result corresponds to a 14 dB improvement in PSNR over the conventional system. Using the RPR-estimator setup, LP2e-(8) achieves the best image quality (PSNR of 31 dB) with just a few noticeable errors in the image. Avoiding any form of redundancy and using only signal correlations, LP3c-(5,3) achieves relatively much better quality than TMR, with a 9 dB instead of a 5 dB improvement in PSNR over the conventional system.

Therefore, we see that LP provides a tremendous increase in robustness to circuit errors at the same level of application metric (output quality) compared to conventional systems. This translates to a significant improvement in the application metric at the same degree of unreliability in the circuit fabric. Under exact replication and estimation (approximate replication), LP outperforms existing error-resiliency techniques such as NMR and ANT in terms of robustness and application metric. It can also provide robustness and quality levels similar to those of NMR and ANT while avoiding any form of exact or approximate replication.


Fig. 15. LP power savings in a 45-nm TI CMOS process employing a fully-parallel LG-processor: (a) replication, (b) estimation, and (c) spatial-correlation setups.

D. Power Savings

HSPICE is used to estimate the leakage and switching power consumption of the gate library at different supply voltages in the 45-nm TI CMOS process. The total power for the computational kernel (2D-IDCT) and the error-compensation blocks (majority voter, soft voter, RPR estimator, and LG-processors) is obtained by summing up the individual power of the constituent gates under the various architectural setups (replication, estimation, and correlation), taking into consideration the gate switching activity. The gate switching activity is obtained from the Value Change Dump (VCD) file generated by the gate-level netlist simulation. This is particularly important for the LG-processor, which is activated only when a large difference is observed among the observations (see Fig. 10(b)). Here, we assume that all LG-processors are fully parallelized, i.e., each 8-bit LG-block employs a 256-parallel version of the LG-processor in Fig. 7, so that it requires only one clock-cycle to generate the output APP ratios. Fig. 15 shows the power consumption at each PSNR in Figs. 12(a), 13(a), and 13(b) for the three different setups. The conventional architecture is error-free only at a PSNR of 33 dB.

In the case of replication (see Fig. 15(a)), LP3r-(5,3) achieves 15% power savings compared to TMR over a wide range of PSNR. These power savings are achieved with additional robustness to hardware errors, i.e., a higher component probability of error (pre-correction error rate), at the same PSNR, as illustrated in Fig. 12(a). We can trade this additional robustness for further power savings. For example, at a PSNR of 28 dB, LP2r-(8) tolerates the same level of pre-correction error rate as TMR in Fig. 12(a) but achieves power savings of 35% (see Fig. 15(a)). Bit-subgrouping of LP from (8) to (5,3) increases the power savings at a given PSNR due to the reduced complexity overhead. However, LP3r-(8) achieves slightly better robustness at the same PSNR compared to LP3r-(5,3), as illustrated in Fig. 12(b).

In the case of the estimation setup (see Fig. 15(b)), LP2e-(8) power savings compared to the conventional design range between 10% and 27% for different PSNR and are slightly better than the ANT-based design. This is in addition to the improvement in robustness over the conventional design. Note that bit-subgrouping in the estimation setup results in additional power overhead at the same PSNR, since its robustness loss in Fig. 13(a) compared to LP2e-(8) is significant compared to the logic complexity savings it provides.

In the spatial-correlation setup (see Fig. 15(c)), LP3c-(5,3) achieves 15% power savings compared to the conventional design. If we want to trade the increased robustness provided by LP for additional power savings, TMR would have to employ two more IDCT modules to achieve robustness similar to LP3c-(5,3) at 28 dB of PSNR (see Fig. 13(b)). In this case, the power savings of LP3c-(5,3) amount to 71% compared to TMR, since LP3c-(5,3) uses two fewer IDCT modules.

All this shows the energy efficiency of LP over conventional techniques. In addition, if folding of the LG-processor is employed, the reported power savings in this section will increase at the expense of extra latency, as previously illustrated in Fig. 10. These savings are expected to be similar or even enhanced across different multimedia applications, since most of them employ processing blocks with similar or higher complexity than a DCT codec.

V. CONCLUSIONS AND FUTURE WORK

We presented a stochastic computing technique referred to as likelihood processing to design robust and energy-efficient systems by exploiting error statistics at the bit level. Techniques from detection and estimation theory, in particular the maximum a-posteriori (MAP) rule, are employed to generate reliability information, or a confidence level, on each output bit, enabling the correction of errors in an optimal probabilistic sense. Simulations of a DCT image codec show that LP improves on existing reliable system design techniques such as TMR, as well as on stochastic computing techniques such as ANT and soft NMR. Energy savings of up to 71% and substantial robustness benefits are illustrated.

This work opens up a number of interesting problems to explore, such as: 1) studying the impact of error PMF profiles on robustness and bit-subgrouping, 2) engineering desirable error PMFs, and 3) developing iterative/turbo versions of the proposed technique where different likelihood processors exchange their APP ratios to achieve better robustness.

ACKNOWLEDGMENT

The authors acknowledge the support of the Gigascale Systems Research Center (GSRC), one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity.

REFERENCES

[1] T. Karnik, S. Borkar, and V. De, "Sub-90 nm technologies: Challenges and opportunities for CAD," in Proc. 2002 IEEE/ACM Int. Conf. Computer-Aided Design (ICCAD '02), 2002, pp. 203–206.

[2] R. I. Bahar, J. Mundy, and J. Chen, "A probabilistic-based design methodology for nanoscale computation," in Proc. 2003 IEEE/ACM Int. Conf. Computer-Aided Design (ICCAD '03), Washington, DC, 2003, pp. 480–486.

[3] T. Austin, D. Blaauw, T. Mudge, and K. Flautner, "Making typical silicon matter with RAZOR," IEEE Comput., vol. 37, no. 3, pp. 57–65, Mar. 2004.

[4] W. Qian, M. D. Riedel, K. Bazargan, and D. J. Lilja, "The synthesis of combinational logic to generate probabilities," in Proc. 2009 Int. Conf. Computer-Aided Design (ICCAD '09), 2009, pp. 367–374.

[5] M. R. Choudhury and K. Mohanram, "Approximate logic circuits for low overhead, non-intrusive concurrent error detection," in Proc. Conf. Design, Automation and Test in Europe (DATE '08), 2008, pp. 903–908.

[6] J. Tschanz, K. Bowman, C. Wilkerson, S.-L. Lu, and T. Karnik, "Resilient circuits: Enabling energy-efficient performance and reliability," in Proc. 2009 Int. Conf. Computer-Aided Design (ICCAD '09), 2009, pp. 71–73.

[7] M. M. Nisar and A. Chatterjee, "Guided probabilistic checksums for error control in low-power digital filters," IEEE Trans. Comput., vol. 60, no. 9, pp. 1313–1326, Sep. 2011.

[8] R. Hegde and N. R. Shanbhag, "Soft digital signal processing," IEEE Trans. Very Large Scale Integr. Syst., vol. 9, no. 6, pp. 813–823, Dec. 2001.

[9] F. J. Kurdahi, A. M. Eltawil, Y.-H. Park, R. N. Kanj, and S. R. Nassif, "System-level SRAM yield enhancement," in Proc. 7th Int. Symp. Quality Electronic Design (ISQED '06), 2006, pp. 179–184.

[10] G. Karakonstantis, G. Panagopoulos, and K. Roy, "HERQULES: System level cross-layer design exploration for efficient energy-quality trade-offs," in Proc. 16th ACM/IEEE Int. Symp. Low Power Electronics and Design (ISLPED '10), 2010, pp. 117–122.

[11] N. R. Shanbhag, R. A. Abdallah, R. Kumar, and D. L. Jones, "Stochastic computation," in Proc. 47th Design Automation Conf. (DAC '10), 2010, pp. 859–864.

[12] J. von Neumann, "Probabilistic logics and the synthesis of reliable organisms from unreliable components," in Automata Studies. Princeton, NJ: Princeton Univ. Press, 1956.

[13] J. M. Rabaey, D. Burke, K. Lutz, and J. Wawrzynek, "Workloads of the future," IEEE Design Test Comput., vol. 25, no. 4, pp. 358–365, 2008.

[14] G. V. Varatkar, S. Narayanan, N. R. Shanbhag, and D. L. Jones, "Stochastic networked computation," IEEE Trans. Very Large Scale Integr. Syst., vol. 18, no. 10, pp. 1421–1432, 2010.

[15] E. P. Kim and N. R. Shanbhag, "Soft N-modular redundancy," IEEE Trans. Comput., vol. 61, no. 3, pp. 323–336, Mar. 2012.

[16] R. A. Abdallah and N. R. Shanbhag, "Robust and energy-efficient DSP systems via output probability processing," in Proc. IEEE Int. Conf. Computer Design (ICCD '10), 2010, pp. 38–44.

[17] R. A. Abdallah, Y.-H. Lee, and N. R. Shanbhag, "Timing error statistics for energy-efficient robust DSP systems," in Proc. Conf. Design, Automation and Test in Europe (DATE), 2011, pp. 1–4.

[18] P. J. Huber, Robust Statistics. New York: Wiley, 1981, vol. 1.

[19] R. A. Abdallah, Y.-H. Lee, and N. R. Shanbhag, "Engineering of error statistics for energy-efficient robust DSP systems," in Proc. IEEE Workshop Silicon Errors in Logic-System Effects (SELSE), Mar. 2011.

[20] J. Erfanian, S. Pasupathy, and G. Gulak, "Reduced complexity symbol detectors with parallel structure for ISI channels," IEEE Trans. Commun., vol. 42, no. 2/3/4, pp. 1661–1671, Feb./Mar./Apr. 1994.

[21] W.-H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., vol. 25, no. 9, pp. 1004–1009, 1977.

Rami A. Abdallah (M'12) received the B.Eng. degree with the highest distinction from the American University of Beirut (AUB), Beirut, Lebanon, in 2006, and the M.S. and Ph.D. degrees from the University of Illinois at Urbana-Champaign in 2008 and 2012, respectively, all in electrical and computer engineering. Since August 2006, he has been a Research Assistant at the Coordinated Science Laboratory. During the summers of 2007, 2008, and 2009, he was with Texas Instruments, Inc., Dallas, in the applications and systems R&D labs, where he was involved in the design of communication receivers for Long Term Evolution (LTE) and WiMAX and in on-chip DC-DC conversion. His research interests are in the design of integrated circuits and systems for communications, digital signal processing, and general purpose computing.

Dr. Abdallah was on the Dean's honor list at AUB from 2002 to 2006, and was selected among the world's top students to participate in the Research Science Institute at the Massachusetts Institute of Technology (MIT), Cambridge, in 2001. He received the Charli S. Korban outstanding undergraduate award in 2006, the HKN Honor Society Scholarship award in 2009, the Mac Van Valkenburg outstanding researcher award in 2012, and the 2012 Yi-Min Wang and Pi-Yu Chung outstanding student award.

Naresh R. Shanbhag (F'06) received his B.Tech. from the Indian Institute of Technology, New Delhi (1988), M.S. from Wright State University (1990), and his Ph.D. degree from the University of Minnesota (1993), all in Electrical Engineering. From 1993 to 1995, he worked at AT&T Bell Laboratories at Murray Hill, NJ, where he was the lead chip architect for AT&T's 51.84 Mb/s transceiver chips over twisted-pair wiring for Asynchronous Transfer Mode (ATM)-LAN and very high-speed digital subscriber line (VDSL) chip-sets. In August 1995, he joined the University of Illinois at Urbana-Champaign, where he is presently the Jack S. Kilby Professor of Electrical and Computer Engineering. He was on a sabbatical leave of absence at the National Taiwan University in Fall 2007. His research interests are in the design of robust and energy-efficient integrated circuits and systems for communications, including VLSI architectures for error-control coding and equalization, noise-tolerant integrated circuit design, error-resilient architectures and systems, and system-assisted mixed-signal design. He has more than 200 publications in this area and holds ten US patents. He is also a co-author of the research monograph Pipelined Adaptive Digital Filters (Norwell, MA: Kluwer, 1994).

Dr. Shanbhag received the 2010 Richard Newton GSRC Industrial Impact Award, the 2006 IEEE Journal of Solid-State Circuits Best Paper Award, the 2001 IEEE Transactions on VLSI Best Paper Award, the 1999 IEEE Leon K. Kirchmayer Best Paper Award, the 1999 Xerox Faculty Award, the Distinguished Lecturership from the IEEE Circuits and Systems Society in 1997, the National Science Foundation CAREER Award in 1996, and the 1994 Darlington Best Paper Award from the IEEE Circuits and Systems Society.

Dr. Shanbhag served as an Associate Editor for the IEEE Transactions on Circuits and Systems: Part II (1997–1999) and the IEEE Transactions on VLSI (1999–2002 and 2009–2011), respectively. He is the General co-Chair of the 2012 IEEE International Symposium on Low-Power Design (ISLPED), was the Technical Program co-Chair of the 2010 ISLPED, and served on the technical program committee (wireline subcommittee) of the International Solid-State Circuits Conference (ISSCC) from 2007 to 2011. Dr. Shanbhag has been leading a research theme on Alternative Computational Models in the Post-Silicon Era in the DOD and Semiconductor Research Corporation (SRC) sponsored Microelectronics Advanced Research Corporation (MARCO) center under their Focus Center Research Program (FCRP) since 2006.

In 2000, Dr. Shanbhag co-founded and served as the chief technology officer of Intersymbol Communications, Inc., a venture-funded fabless semiconductor start-up that provides DSP-enhanced mixed-signal ICs for electronic dispersion compensation of OC-192 optical links. In 2007, Intersymbol Communications, Inc., was acquired by Finisar Corporation, Inc.