Reliability levels for fault-tolerant linear processing using real number error correction



G.R.Redinbo

Indexing terms: Fault-tolerant systems, Real number error correction, Reliability, Convolutional codes

Abstract: Fault-tolerant linear processing systems using real number codes, as in algorithm-based fault tolerance methodologies, can be extended to include error correction for correcting intermittent errors in the processed output data. The reliability function for a protected system containing correction is considered under simple but realistic assumptions on the arrival of failures in both the normal-processing and corrector subassemblies. An error-correcting procedure employing a real convolutional code is described briefly, and a system diagram of a corrector subsystem containing self-checking comparators for detecting its internal failures is presented. Failures are modelled as arriving according to Poisson processes in both parts of the fault-tolerant system. Formulas for the reliability of the protected system are given and an efficient lower bound is developed. This bound depends only on broad details such as relative areas used in a VLSI design of the processing and corrector parts. Computational methods employing computer algebra packages are discussed, and some typical reliability curves for two configurations demonstrate the dramatic improvement which error correction introduces.

1 Introduction

Processing subsystems may be protected through the use of real number codes such as in algorithm-based fault tolerance methodologies. When intermittent failures affect only a few processed data samples, error correction can be an effective alternative to recomputing many samples after restart. This paper considers error correction when intermittent failures are present in high-speed linear processing systems protected by real convolutional codes, calculating the reliability function under simple but realistic assumptions concerning the relative area sizes of VLSI realisations. Linear processing systems, especially ones using block-processing techniques and supporting many data channels simultaneously, are important in communication and control systems. A common view of linear processing considers it as transforming an input-data sample stream into an output-data flow.

© IEE, 1996. IEE Proceedings online no. 19960769. Paper first received 1st May 1995 and in revised form 17th June 1996. The author is with the Department of Electrical and Computer Engineering, University of California, Davis, CA 95616, USA.

Linear processing systems may be protected by coupling algorithm-based fault tolerance techniques with real number convolutional codes [1, 2]. In this approach, comparable parity samples are generated in two ways: one parity stream uses the linear processing outputs while the other stream employs the input samples directly and efficiently without necessarily producing the linear system’s outputs at any intermediate point. The differences of these streams of parity values form a syndrome stream, whose values indicate possible data errors. The syndrome stream is processed by a time-varying Kalman stochastic estimator to produce correction values [3]. This paper studies the improvement in the reliability function when error correction is employed in a protected system.

Fig. 1 Extensions of algorithm-based fault tolerance to correction of errors in linear processing (blocks: linear processing, parity calculation, simplified parity generation, detection thresholds, corrector)

The basic correction operation applied to the outputs from a linear processing system is shown in Fig. 1. The two parity streams, one computed from the outputs and the other directly from the input stream, are compared against a threshold as indicated by the block labelled ‘detection thresholds’. When comparable parities differ above a threshold, the corrector estimates the values of errors caused by failures in the ‘linear processing’ part. The two parity-generating subassemblies depicted in Fig. 1 compute the parity values associated with a systematic real convolutional code. The nature of a systematic convolutional code introduces a block processing feature in these parity-generating processes. For a rate k/n code, there are (n - k) parity values produced for each group of k samples appearing at the parity-subassemblies’ inputs. The redundancy factor (n - k) is generally selected as 1 for efficiency purposes, meaning that parity-generating subassemblies operate at 1/kth of the rate of the major processing task. On the other hand, the parity-generating operations involve a memory of data samples, called the constraint length of the code, imbuing the convolutional code with error-detecting and correcting capabilities.

IEE Proc.-Comput. Digit. Tech., Vol. 143, No. 6, November 1996

Fig. 2 Protected detector/corrector system (main processing; detector and corrector subassemblies). TSCC = totally self-checking comparator

The corrector subassembly in Fig. 1 implements a fixed-lag, smoothing Kalman estimator determining the location and values of any errors added to the linear processing outputs by failures. These estimates tolerate errors and roundoff noise. Since its underlying hardware can also be affected by intermittent failures, a protected corrector system must be considered, one in which failures in subassemblies are also detected. The extra protection involves regenerating parity samples after the correction to determine a consistent set of output and parity values [3]. A protected system is outlined in Fig. 2, where the various subassemblies associated with a protected corrector system are detailed. The overhead of the detector/corrector should be within 25-40% of the main processing system. The use of real convolutional codes ensures low overhead because of the lower parity-generating rates in regions B through E. As noted above, all subassemblies operate at 1/kth rate because of the nature of the systematic convolutional code [3].

The various regions in the protected detector/corrector represent a partitioning of the subassemblies for error-detection purposes. Region A is the fundamental processing that was to be protected in the first place. On the other hand, region B consists of the parity-generation process employing the outputs from region A. F denotes a parity-generating matrix, dictated by the systematic real convolutional code. These parity values are contained in a vector y’(t). The present k values in b(t), a state vector, are denoted by b,(t) so that the corrector determines a stochastic estimate vector labelled as b,(t | t + N) in Fig. 2. Region C contains the direct parity-generating process using the original inputs to produce comparable values y”(t) without duplicating the linear processing function.

The difference of the two parity values in vectors y’(t) and y”(t) is the syndrome vector d(t) = y’(t) - y”(t), which is input to the corrector whenever any of its components exceeds a properly chosen threshold. The threshold detector gives a valid 1-out-of-2 codeword if all syndrome components of d(t) are suitably small, below a threshold. This detector, basically a totally self-checking comparator (TSCC) [4, 5], gives an invalid 1-out-of-2 codeword when either it has failed internally or any d(t) component has exceeded the threshold, suggesting that corrective action should be initiated. On the other hand, the totally self-checking comparator TSCC1 is a redundant check on the formation of syndrome vector d(t). Totally self-checking comparators for real number codes are discussed more fully in Appendix 8.1.
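The two parity paths and the threshold test can be illustrated with a minimal sketch. It is a block-wise simplification, not the paper's convolutional code or its Kalman corrector: the matrices `H` (region A) and `F` (region B), the threshold value, and all function names are hypothetical choices made only for illustration. The direct path (region C) uses the precomputed product `F H`, so the parities are obtained from the inputs without forming the processing outputs.

```python
# Block-wise stand-in for the two parity paths; H, F and the threshold are hypothetical.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

H = [[1.0, 2.0], [0.5, -1.0]]   # hypothetical linear processing (region A)
F = [[1.0, 1.0]]                # hypothetical parity-generating matrix (region B)
FH = matmul(F, H)               # direct parity path from the inputs (region C)

def syndrome(x, output_error=None):
    """Return d = y' - y'' for one input block, optionally injecting an output error."""
    out = matvec(H, x)
    if output_error is not None:
        out = [o + e for o, e in zip(out, output_error)]
    y1 = matvec(F, out)   # parity computed from the processing outputs, y'(t)
    y2 = matvec(FH, x)    # parity computed directly from the inputs, y''(t)
    return [p - q for p, q in zip(y1, y2)]

def needs_correction(d, threshold=1e-9):
    """True when any syndrome component exceeds the (hypothetical) threshold."""
    return any(abs(c) > threshold for c in d)

x = [3.0, -4.0]
clean = syndrome(x)
faulty = syndrome(x, output_error=[0.0, 0.25])
```

In this toy case an error added to one output sample appears unchanged in the single syndrome component, which is what triggers the corrector.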

Region D of the protected system outlined in Fig. 2 is the corrector and a delay to hold the output data for the amount of the fixed lag used by the Kalman smoothing estimator. The following region E represents a final check on the corrector, delay and subtraction operation used for correction. Parity values are generated from these corrected values, ones that are comparable with similar parity values computed directly from the input values properly delayed. The comparator TSCC2 indicates whether the correction was valid. However, this TSCC, as with the other two shown, also indicates when it has an internal failure [4, 5]. It is demonstrated in Appendix 8.1 how this overall configuration, including the TSCC devices, will not permit a single failed subsystem to go undetected.

The error-correcting capabilities of the protected sys- tem rely on the structure of the real convolutional code. Generally, the real codes are derived by using binary convolutional codes treating the 0 and 1 digits as integers in the real field. It is known [6], that this code-construction approach produces real codes which have distance structures, and consequently error-cor- recting performance, as good as, if not better than, the original binary codes over the binary field. Neverthe- less, such convolutional codes have limitations on the number and location of errors inherited from the base binary convolutional code [7]. For example, most codes require a guard band, a period of samples following a group of errors during which no additional errors can occur without possibly overloading the capabilities of the code.


2 Statistical model: VLSI-realised protected system

The protected detector/corrector system providing error correction for intermittent failures in the linear-processing operations can be realised in VLSI technology where the major part of the layout would be devoted to the linear-processing function. Furthermore, the part of the implementation supporting the detection and correction aspects of the overall system, along with the totally self-checking comparators, represents a much smaller area because of its lower sampling rates, permitting time-shared computations. Design goals generally indicate acceptable overhead requirements are from 25% to 40% of the overall area devoted to the supporting roles. An important feature of the reliability analysis below is that the exact overhead costs can be treated as variables. The lower sampling rate needed for parity values as dictated by the real convolutional code allows this lower area for the correcting parts. The relative areas are indicated in Fig. 3, where the small overlay boxes labelled memory and processor signify the purposes of these areas. Also, a small area is allocated for the joint control section.

Fig. 3 Relative representation of VLSI-realisation areas. λ = basic failure rate per unit area; a = λA_proc, processing failure rate for processing area A_proc; b = λA_corctr, detector/corrector failure rate; proc. = processor subsystem; mem. = memory. Joint control subsystem is hard core with very small failure rate because of special design and small area

An appealing model for faults is based on area, assuming that they occur according to a Poisson arrival process with a uniform failure rate related to area. The random variable $N_t$, the number of failures occurring during a time interval (0, t] in an area $A_{region}$, is governed by the Poisson probability mass function:

$$P\{N_t = n\} = \frac{(\lambda A_{region} t)^n}{n!} \exp(-\lambda A_{region} t), \quad n = 0, 1, 2, \ldots; \qquad \lambda = \text{failure rate per area} \tag{1}$$

As noted in Fig. 3, the arrival of failures in the original linear-processing part of the realisation has a rate $a = \lambda A_{proc}$, where $A_{proc}$ denotes the relative area devoted to these operations. Likewise, the failures in the detection and correction portions are modelled by a Poisson process with rate $b = \lambda A_{corctr}$, and the arrival of failures in these two disjoint areas is considered statistically independent. There is also a part of the design that supports the critically important control actions needed for any digital system and, while this disjoint area is small, it can be designed with special care to have high reliability. This part is considered ‘hardcore’ [8], and its reliability is little changed by adding the control for the detection and correction parts because of their simplicity. Because of the statistical independence of the failure processes associated with disjoint areas, the contribution to the reliability function in either case is a multiplicative scaling function [9]. The reliability for the joint control section will be denoted by $R_{cont}(t)$, but the central focus for the remainder of the paper is on the reliability of the processing and correcting parts.

The reliability of the part devoted to linear processing may be written in a straightforward manner since it depends only on the area utilised by the processing function:

$$R_{proc}(t) = P\{N_t = 0\} = \exp(-at) \tag{2}$$

where arrival rate $a = \lambda A_{proc}$. This part has a greater area exposed to failures because of the larger functionality required.
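As a numerical illustration of the Poisson mass function and of eqn. 2, the following sketch evaluates the unprotected reliability for a = 200 fits over 1000 hours, the first case considered in Section 4. The function names are illustrative only, and the conversion 1 fit = one failure per 10^9 hours is the standard definition rather than something stated in the text.

```python
import math

FIT_PER_HOUR = 1e-9  # 1 fit = one failure per 10**9 hours (standard definition)

def poisson_pmf(n, rate, t):
    """P{N_t = n} for a Poisson process with arrival rate `rate` over (0, t]."""
    mean = rate * t
    return mean ** n / math.factorial(n) * math.exp(-mean)

def reliability_unprotected(a, t):
    """R(t) = P{N_t = 0} = exp(-a t), the form of eqn. 2."""
    return math.exp(-a * t)

a = 200 * FIT_PER_HOUR          # processing failure rate, per hour
t = 1000.0                      # mission time, hours
r = reliability_unprotected(a, t)
```

Here `r` coincides with `poisson_pmf(0, a, t)`, since the unprotected part is reliable only if no failure arrives at all.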

3 Reliability analyses for error-correction situation

The reliability $R_{code}(t)$ of the protected system outlined in Fig. 2 can be viewed in terms of processing-subassembly events conditioned on the arrival of a number of failures during interval (0, t]. The ‘tolerable’ event denotes all failures which are properly corrected. Then the reliability $R_{code}(t)$ may be decomposed using conditional probability functions and the Poisson mass function of eqn. 1:

$$R_{code}(t) = \sum_{n=0}^{+\infty} P(\text{tolerable} \mid N_t = n)\, P\{N_t = n\} \tag{3}$$

The counting random variable $N_t$ refers to the failures in the processing part and is governed by the failure rate a. However, it is well known [10] that, for a fixed number of arrivals, event $\{N_t = n\}$, the ordered failure times, labelled here by $\{t_1, t_2, \ldots, t_n\}$, have a conditional joint density function given as

$$f_{t_1, t_2, \ldots, t_n}(\xi_1, \xi_2, \ldots, \xi_n \mid N_t = n) = \begin{cases} \dfrac{n!}{t^n} & 0 < \xi_1 \le \xi_2 \le \cdots \le \xi_n \le t \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

This permits each conditional probability function needed in eqn. 3 to be written as an integration over a proper region of n-dimensional space:

$$P(\text{tolerable} \mid N_t = n) = \int \cdots \int_{\text{tolerable region } \{\xi_1, \xi_2, \ldots, \xi_n\}} P(\text{tolerable} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n)\, f_{t_1, t_2, \ldots, t_n}(\xi_1, \xi_2, \ldots, \xi_n \mid N_t = n)\, d\xi_n\, d\xi_{n-1} \cdots d\xi_1 \tag{5}$$

Certain features of the code and several of its parameters enter the expression for the important conditional probability functions, $P(\text{tolerable} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n)$. From a processing viewpoint, the ordered failure-arrival times $\{\xi_1, \xi_2, \ldots, \xi_n\}$ effectively appear on the boundaries of subblocks of size k because of the inherent block processing at a sampling rate lowered by code parameter k. On the other hand, the fundamental processing rate, denoted here by $f_p$, determines the data-sampling period $T_p = 1/f_p$. Also, convolutional codes require a guard band of (mk) samples following a subblock of k where errors occur. Hence, for a ‘tolerable’ event, the failure-arrival times each have to exceed this guard band.


The corrector part of the overall system has failures which arrive statistically independently of those in the processing part. Its design will not overload when a single failure arrives outside the guard band required by arrivals in the processing part. However, if multiple failures occur, the actual design details determine whether they will lead to an overload in the correcting subassembly. Therefore, without knowing all the explicit design details, it is difficult to predict when incorrect behaviour results from multiple failures. Nev- ertheless, there are many multiple failures, properly spaced in time, which will not overload the corrector. It is possible then to get a lower bound to the probability associated with corrector overload without knowing all the design details.

Fig. 4 Possible arrival-time lines for two subsystems (processor subsystem, corrector subsystem): processing-failure arrivals in interval (0, t]

Fig. 5 Possible arrival-time lines for two subsystems: events associated with {failure 1, corrected}

Fig. 6 Possible arrival-time lines for two subsystems: events associated with {failure 2, corrected}

Fig. 7 Possible arrival-time lines for two subsystems: events near end of interval

Some typical parameters of a realistic real convolutional code may be given for the circumstances of a 33 MHz processing rate ($T_p$ = 30 ns processing period). A rate 5/6 Berlekamp-Preparata-Massey burst-correcting convolutional code has constraint length of M = 60 data samples (k = 5; n = 6; m = 12) [7] (Note 1). Hence, the guard band during which no additional failures in either subassembly, processing or correcting, should occur is given by the parameter c = M × $T_p$ = 1.8 µs. These values will be used in the examples computed in Section 4.

Note 1: The notation in [7] for the constraint parameter m is one less than here by choice of definition.
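The guard-band arithmetic can be checked directly. The sketch below uses only the code parameters quoted above; the variable names are illustrative, and the conversion to hours is added here because the reliability variable t is later measured in hours.

```python
# Code parameters quoted in the text: rate k/n = 5/6 code with m = 12.
k, n, m = 5, 6, 12
sample_period_ns = 30                    # T_p, processing sample period in nanoseconds

M = m * k                                # constraint length in data samples
guard_band_ns = M * sample_period_ns     # c = M * T_p, in nanoseconds
guard_band_us = guard_band_ns / 1000     # the same guard band in microseconds
guard_band_hours = guard_band_ns * 1e-9 / 3600   # and in hours, the unit used for t

parity_per_block = n - k                 # parity samples produced per k data samples
```

The guard band is many orders of magnitude shorter than the mission times considered, which is why closely spaced failures are so unlikely under the Poisson model.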

The development of the important conditional probability functions in eqn. 5 will focus on the ordered arrival times $t_1, t_2, \ldots, t_n$, with respective outcomes $\xi_1, \xi_2, \ldots, \xi_n$. Complicating matters, there are possible failures occurring in the corrector subassembly during the interval (0, t]. These are statistically independent, as mentioned earlier, and are described by counting variable $Q_t$ with probability mass function based on failure-arrival rate $b = \lambda A_{corctr}$:

$$P\{Q_t = k\} = \frac{(bt)^k}{k!} \exp(-bt), \quad k = 0, 1, 2, \ldots; \qquad b = \lambda A_{corctr} \tag{6}$$

One way to visualise the situation is by means of time lines as shown in Figs. 4-7, where the upper line shows arrival times for failures in the processing subassembly of the system. They are reflected to the bottom line depicting as yet unspecified events in the corrector subassembly. However, failure arrivals in the corrector are governed by the Poisson process, eqn. 6.
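The two independent time lines can be mimicked by a small simulation. This is only a sketch: the rates below are hypothetical and far larger than the fit-level values of the paper, chosen so that a few arrivals actually appear, and `poisson_arrivals` is an illustrative helper that relies on the standard fact that a Poisson process has independent exponential inter-arrival times.

```python
import random

def poisson_arrivals(rate, t, rng):
    """Ordered arrival times in (0, t] of a Poisson process, built from exponential gaps."""
    times, now = [], 0.0
    while True:
        now += rng.expovariate(rate)   # exponential inter-arrival time with mean 1/rate
        if now > t:
            return times
        times.append(now)

rng = random.Random(1996)              # fixed seed so the run is repeatable
# Hypothetical rates per hour for illustration only.
proc_times = poisson_arrivals(rate=0.02, t=1000.0, rng=rng)    # upper line: processing part
corr_times = poisson_arrivals(rate=0.005, t=1000.0, rng=rng)   # lower line: corrector part
```

Drawing the two lists from separate calls reflects the assumed statistical independence of failures in the two disjoint areas.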

The inner conditional probability functions inside eqn. 5 may be expanded further using a form of chain rule and events of the type $\{\text{failure } i \text{ corrected}\}$ designating a situation where a failure at outcome time $\xi_i$ is corrected and the system is still functioning properly.

$$\begin{aligned} P(\text{tolerable} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n) &= P(\{\text{failure 1 corrected}\}, \{\text{failure 2 corrected}\}, \ldots, \{\text{failure } n \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n) \\ &= P(\{\text{failure 2 corrected}\}, \ldots, \{\text{failure } n \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}) \\ &\quad \times P(\{\text{failure 1 corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n) \end{aligned} \tag{7}$$

The chain rule employed identifies events as $P(AB \mid C) = P(B \mid AC)\, P(A \mid C)$ where

$$A \leftarrow \{\text{failure 1 corrected}\}, \quad B \leftarrow (\{\text{failure 2 corrected}\}, \{\text{failure 3 corrected}\}, \ldots, \{\text{failure } n \text{ corrected}\}), \quad C \leftarrow (\xi_1, \xi_2, \ldots, \xi_n, N_t = n) \tag{8}$$

The last term in eqn. 7 will be evaluated shortly after all similar terms are exposed by repeatedly applying the chain rule to the first term on the right.

$$\begin{aligned} &P(\{\text{failure 2 corrected}\}, \{\text{failure 3 corrected}\}, \ldots, \{\text{failure } n \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}) \\ &\quad = P(\{\text{failure 3 corrected}\}, \{\text{failure 4 corrected}\}, \ldots, \{\text{failure } n \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}, \{\text{failure 2 corrected}\}) \\ &\qquad \times P(\{\text{failure 2 corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}) \end{aligned} \tag{9}$$

Again, the last term can be evaluated more fully, but the conditional probability just to the right can be expanded by a chain rule. Finally, the last step in the expansions by chain rules is

$$\begin{aligned} &P(\{\text{failure } n-1 \text{ corrected}\}, \{\text{failure } n \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}, \{\text{failure 2 corrected}\}, \ldots, \{\text{failure } n-2 \text{ corrected}\}) \\ &\quad = P(\{\text{failure } n \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}, \ldots, \{\text{failure } n-2 \text{ corrected}\}, \{\text{failure } n-1 \text{ corrected}\}) \\ &\qquad \times P(\{\text{failure } n-1 \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}, \{\text{failure 2 corrected}\}, \ldots, \{\text{failure } n-2 \text{ corrected}\}) \end{aligned} \tag{10}$$

The last term on the right-hand side of eqn. 7 is nonzero over a certain interval and is the product of two fundamental probabilities concerning failures in the corrector subsystem. This is shown in Fig. 5. The arrival time $\xi_1$ for the first failure in the processing part must occur early enough before the end of the interval, $\xi_1 \le t - nc$. However, there cannot be a failure in the corrector part during interval $(\xi_1, \xi_1 + c]$, the guard band. Furthermore, the corrector subassembly cannot be overloaded before time $\xi_1$ because then it would produce incorrect results. The memoryless property of the counting processes allows these probabilities to be multiplied:

$$P(\{\text{failure 1 corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n) = P(\overline{\text{corctr fail}}(\xi_1, \xi_1 + c))\, P(\overline{\text{corctr ovrld}}(0, \xi_1)); \quad \xi_1 \le t - nc \tag{11a}$$

$$P(\overline{\text{corctr fail}}(\xi_1, \xi_1 + c)) = \underbrace{P(Q_{\xi_1 + c} - Q_{\xi_1} = 0)}_{\text{no corrector failures during constraint length}} = \exp(-bc) \tag{11b}$$

Here the overbar denotes the complementary event (no failure, no overload). The situation of no overload in the corrector subassembly before the first failure in the processing subsection, denoted as outcome $\xi_1$ in Fig. 5, is very difficult to evaluate even if all intricate circuit design details are known. However, it is possible to determine a reasonable lower bound on the probability of no overload in the corrector part, $P(\overline{\text{corctr ovrld}}(0, \xi_1))$. This is discussed in Appendix 8.2 where complete equations are developed. The lower bound which will be used here corresponds to the event ‘up to a single corrector failure before $\xi_1$’ which may be evaluated as explained in Appendix 8.2:

$$P(Q_{\xi_1} - Q_0 \le 1) = \exp(-b\xi_1)[1 + b\xi_1] \tag{12}$$

Then the important conditional probability at the end of eqn. 7 is bounded below as

$$P(\{\text{failure 1 corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n) \ge \underbrace{P(Q_{\xi_1 + c} - Q_{\xi_1} = 0)}_{\text{no corrector failures during constraint length}}\; \underbrace{P(Q_{\xi_1} - Q_0 \le 1)}_{\text{up to single corrector failure before } \xi_1} = \exp(-bc) \exp(-b\xi_1)[1 + b\xi_1]; \quad \xi_1 \le t - nc \tag{13}$$

If the reliability value computed using this lower bound shows acceptable performance, the system will have at least this good a reliability. The ease and robustness of these calculations are a virtue of this approach since none of the fine design details are required. Only the relative areas of the two subassemblies which impact parameters a and b are needed.
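A short numerical sketch of this first-failure bound follows. The two factors are the standard Poisson probabilities of zero arrivals in the guard band and of at most one arrival before the first processing failure; the function names are illustrative, b = 40 fits and c = 1.8 µs come from the example parameters, and the arrival time of 500 hours is an arbitrary hypothetical value.

```python
import math

def p_no_failure_in_guard(b, c):
    """P(no corrector failure during a guard band of length c) = exp(-b c)."""
    return math.exp(-b * c)

def p_at_most_one_failure(b, length):
    """P(at most one corrector failure in an interval of given length) = exp(-b L)(1 + b L)."""
    return math.exp(-b * length) * (1.0 + b * length)

def first_failure_bound(b, c, xi1):
    """Lower bound on P({failure 1 corrected} | ...): product of the two factors above."""
    return p_no_failure_in_guard(b, c) * p_at_most_one_failure(b, xi1)

b = 40e-9        # corrector failure rate: 40 fits expressed per hour
c = 5e-10        # guard band of 1.8 microseconds expressed in hours
xi1 = 500.0      # hypothetical arrival time of the first processing failure, hours
value = first_failure_bound(b, c, xi1)
```

Because b·ξ is tiny for fit-level rates, the bound stays extremely close to 1, consistent with the large reliability gains reported later.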

The conditional probability of only event $\{\text{failure 2 corrected}\}$ on the extreme right-hand side of eqn. 9 may be evaluated in a similar manner, with Fig. 6 showing the regions after outcome $\xi_2$ which must have a guard space and the region before $\xi_2$ in the corrector part which can contain no failure overload.

$$P(\{\text{failure 2 corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}) = P(\overline{\text{corctr fail}}(\xi_2, \xi_2 + c))\, P(\overline{\text{corctr ovrld}}(\xi_1 + c, \xi_2)); \quad \xi_1 + c \le \xi_2 \le t - (n - 1)c \tag{14}$$

A lower bound on this quantity, which will be used in computing the overall reliability, is easily given:

$$\underbrace{P(Q_{\xi_2 + c} - Q_{\xi_2} = 0)}_{\text{no corrector failures during constraint length}}\; \underbrace{P(Q_{\xi_2} - Q_{\xi_1 + c} \le 1)}_{\text{up to single corrector failure before } \xi_2} = \exp(-bc) \exp\{-b(\xi_2 - \xi_1 - c)\}\{1 + b(\xi_2 - \xi_1 - c)\}; \quad \xi_1 + c \le \xi_2 \le t - (n - 1)c \tag{15}$$

The generic ith conditional probability can be evaluated in the same fashion, leading to a lower bound

$$P(\{\text{failure } i \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}, \{\text{failure 2 corrected}\}, \ldots, \{\text{failure } i-1 \text{ corrected}\}) \ge \exp(-bc) \exp\{-b(\xi_i - \xi_{i-1} - c)\}[1 + b(\xi_i - \xi_{i-1} - c)]; \quad \xi_{i-1} + c \le \xi_i \le t - (n - i + 1)c \tag{16}$$
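The per-failure bounds and their spacing constraints can be combined in a short sketch. It multiplies the bound for failure 1 (gap measured from 0) with the generic bound for failures 2 to n (gap measured from the end of the previous guard band), and returns 0 when the arrival times violate the guard-band spacing. The function names are illustrative, and the factor for the final stretch after the last guard band is deliberately left out.

```python
import math

def spacing_ok(xis, c, t):
    """Check xi_{i-1} + c <= xi_i <= t - (n - i + 1) c, taking 0 as the start for i = 1."""
    n = len(xis)
    prev_end = 0.0
    for i, xi in enumerate(xis, start=1):
        if not (prev_end <= xi <= t - (n - i + 1) * c):
            return False
        prev_end = xi + c          # end of the guard band following failure i
    return True

def correction_bound(xis, b, c, t):
    """Product of the per-failure lower bounds; 0 outside the tolerable region."""
    if not spacing_ok(xis, c, t):
        return 0.0
    bound, prev_end = 1.0, 0.0
    for xi in xis:
        gap = xi - prev_end        # corrector exposure since the previous guard band ended
        bound *= math.exp(-b * c) * math.exp(-b * gap) * (1.0 + b * gap)
        prev_end = xi + c
    return bound
```

For example, with c = 1 and t = 10, arrival times [1.0, 3.0] are acceptably spaced, whereas [1.0, 1.5] place the second failure inside the first guard band and give a zero bound.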

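The final conditional probability in the chain rule, eqn. 10, can be considered and accounts for the interval $((\xi_n + c), t]$ as shown in Fig. 7:

$$\begin{aligned} &P(\{\text{failure } n \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure 1 corrected}\}, \{\text{failure 2 corrected}\}, \ldots, \{\text{failure } n-1 \text{ corrected}\}) \\ &\quad = P(\overline{\text{corctr fail}}(\xi_n, \xi_n + c))\, P(\overline{\text{corctr ovrld}}(\xi_{n-1} + c, \xi_n)) \times P(\overline{\text{corctr ovrld}}(\xi_n + c, t)) \end{aligned} \tag{17}$$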
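As a worked illustration of how these factors enter the integral of eqn. 5, consider the single-failure case n = 1. The sketch below assumes, by analogy with the ‘up to a single corrector failure’ bound used for the earlier intervals, that the final no-overload factor on $(\xi_1 + c, t]$ is bounded below by $\exp\{-b(t - c - \xi_1)\}[1 + b(t - c - \xi_1)]$; this assumption is made here for illustration and is not taken from the surviving text. With the conditional density $1/t$ for a single arrival and $s = t - c$:

$$\begin{aligned} P(\text{tolerable} \mid N_t = 1) &\ge \int_0^{t-c} e^{-bc}\, e^{-b\xi_1}(1 + b\xi_1)\, e^{-b(t - c - \xi_1)}[1 + b(t - c - \xi_1)]\, \frac{1}{t}\, d\xi_1 \\ &= \frac{e^{-bt}}{t} \int_0^{s} (1 + b\xi_1)(1 + b(s - \xi_1))\, d\xi_1 \\ &= \frac{e^{-bt}}{t} \left[ s + b s^2 + \frac{b^2 s^3}{6} \right] \end{aligned}$$

The exponents combine to $-bt$ because the guard band and the two exposure intervals together cover the whole of (0, t]. As $b \to 0$ and $c \to 0$ the bound tends to 1, as expected when the corrector never fails and the guard band is negligible.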


argument (a + b) have failure arrival

Integration limits of the nested integrals: $\xi_1$ up to $t - nc$; $\xi_2$ from $\xi_1 + c$ to $t - (n - 1)c$; ...; $\xi_i$ from $\xi_{i-1} + c$ to $t - (n - i + 1)c$; ...; $\xi_n$ from $\xi_{n-1} + c$.

in the calculations.

360

processing and corrector parts of the total system, respectively. They are both small numbers, e.g. of the order of [...] or below, and the scaling factor [...] sum becomes vanishingly small quickly with [...]. Therefore, the summation need only extend over the first few terms. The parameter c, the guard band of the code, is of the order of microseconds, whereas the variable t is in hours, up to thousands of [...] values of the interlocking recursions, index k, are computed iteratively and the summation upper limit in eqn. 20 is designated as ‘upper’.

An example which processes data at 33 MHz, a sample period of 30 ns, is used to compute typical reliability curves. The parameters for a burst-correcting Berlekamp-Preparata-Massey convolutional code which [...] information symbols yielding one parity sample for every five processed samples employs a memory of 60 samples. Then the guard band c = 1.8 µs. The reliability function $R_{code}(t)$ for t out to 1000 hours was computed using Mathematica. Two functions for different intensities of the failures in the processing part were calculated. The first uses a = 200 fits (2 × 10^-7 per hour) while the corrector part was allowed from 20 to 100 fits. These values for parameter b correspond to overhead areas of 20 [...] respectively. The number [...] in the sum [...] plots the results (eqn. 2), with the $R_{code}(t)$ shown in Fig. 9. Logarithmic scales are used.

Figs. 8 and 9 (legend): abscissa t in hours, 0 to 1000; a = processing-section failure rate; b = corrector-section failure rate = 40, 60, 80, 100 fits; T = processing sample period = 30 ns


Comparing the scales on the two ordinates shows how dramatically the reliabilities resulting from correction differ from the unprotected processing part.

Fig. 10 Example comparisons of reliabilities for coded (correcting) and uncoded systems with equal processing-area exposures: comparisons for processing system with 2000 fits, reliability of uncoded processing system only. a = processing-section failure rate = 2 × 10^-6 = 2000 fits; b = corrector-section failure rate = 400, 600, 800, 1000 fits; T = processing sample period = 30 ns

Fig. 11 Example comparisons of reliabilities for coded (correcting) and uncoded systems with equal processing-area exposures: comparisons for processing system with 2000 fits, reliability $R_{code}(t)$ of total protected system. a = processing-section failure rate = 2 × 10^-6 = 2000 fits; b = corrector-section failure rate = 400, 600, 800, 1000 fits; T = processing sample period = 30 ns

The curves in Figs. 10 and 11 were computed in a similar way using different values for a and b. The failure intensity of the processing part is now 2000 fits and the relative area used for the corrector subassembly varied from 20 to 50%, with b values as shown in the figures. These reliability values are less than those for the first set of parameters, as would be expected, but error correction, even with the much larger area included, still produces great improvements. In both cases the increases on the log scale are from three to four orders of magnitude.
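The magnitudes involved for the unprotected processing part can be illustrated numerically. The sketch below assumes that the uncoded part is lost at its first Poisson failure arrival, so that ln R(t) = -at, and that failure intensity is proportional to VLSI area, so that the corrector overhead is b/a. Both assumptions and the function names are illustrative only.

```python
import math


def uncoded_log_reliability(rate_fits: float, t_hours: float) -> float:
    """Natural log of exp(-a*t), assuming loss at the first Poisson failure arrival."""
    a_per_hour = rate_fits * 1e-9  # assumption: 1 fit = 1e-9 failures per hour
    return -a_per_hour * t_hours


def overhead_fraction(corrector_fits: float, processing_fits: float) -> float:
    """Relative corrector area, assuming failure intensity proportional to area."""
    return corrector_fits / processing_fits


for a_fits in (200, 2000):
    ln_r = uncoded_log_reliability(a_fits, 1000.0)
    print(f"a = {a_fits} fits: ln R(1000 h) = {ln_r:.4f}, R = {math.exp(ln_r):.6f}")

for b_fits in (400, 600, 800, 1000):
    print(f"b = {b_fits} fits: overhead = {overhead_fraction(b_fits, 2000):.0%}")
```

Under these assumptions the b values of 400 to 1000 fits reproduce the 20 to 50% overhead range quoted for the second parameter set.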

Numerous combinations of parameters for various intervals (0, t] were used in computations. For example, curves out to 10 000 hours have the same shape as that given in Fig. 8. Also, the upper limit for the sum in eqn. 20 was extended to 35 showing no change in the curves for Rcode(t). However, these calculations did present a significant load on the workstation, requiring 30 min in some cases.

5 Summary

A fault-tolerant linear processing system which includes real number error correction based on real convolutional codes and is protected by additional checks to verify proper behaviour is described. The overall configuration can correct intermittent errors arising in the main processing part or detect at least one subsystem failure in the corrector part. The reliability performance is [...] efficient lower bound [...] VLSI relative areas [...] processing and corrector [...] statistically independent [...] Poisson process. [...] relying on conditional [...] levels are easily computed [...] algebra packages [...] comparisons between uncoded and coded configurations are dramatically different, even when the extra area for the correction parts is included.

6 Acknowledgment

This research was supported in part by grants N000214-95-1-1190 and NSF MIP-92-15957. This work was performed while the author was on sabbatical leave at The Reliability Laboratory, ETH (Swiss Federal Institute of Technology), Zurich, Switzerland.

7 References

1 REDINBO, G.R., and ZAGAR, B.G.: ‘Modifying real convolutional codes for protecting digital filtering systems’, IEEE Trans., 1993, IT-39

2 GUILLORY, S.S., MARTIN, J.A., REDINBO, G.R., and ZAGAR, B.G.: ‘Fault-tolerant design methods of VLSI digital filter implementations’, in MOSCOVITZ, H.S., and BRODERSEN, R.W. (Eds.): ‘VLSI signal processing III’ (IEEE Press, 1989), chap. 35

3 REDINBO, G.R.: ‘Optimum Kalman detector/corrector for fault-tolerant linear processing’. Digest of papers, Twenty-third international symposium on Fault-tolerant computing, Toulouse, France, 1993, pp. 299-308

4 JOHNSON, B.W.: ‘Design and analysis of fault-tolerant digital systems’ (Addison-Wesley Publishing Co., 1989)

5 WAKERLY, J.: ‘Error detecting codes, self-checking circuits and applications’ (North-Holland, 1978)

6 NAIR, V.S.S., and ABRAHAM, J.A.: ‘Real number codes for fault-tolerant matrix operations on processor arrays’, IEEE Trans., 1990, C-39, pp. 426-435

7 LIN, S., and COSTELLO, D.J., Jr.: ‘Error control coding: fundamentals and applications’ (Prentice-Hall, 1983)

8 SIEWIOREK, D.P., and SWARZ, R.S.: ‘Reliable computer systems: design and evaluation’ (Digital Press, Bedford, Mass., 1992), 2nd ed.

9 TRIVEDI, K.S.: ‘Probability and statistics with reliability, queuing, and computer science applications’ (Prentice-Hall, Inc., 1982)

10 BARLOW, R.E., and PROSCHAN, F.: ‘Statistical theory of reliability and life testing’ (To Begin With, Silver Spring, Maryland, 1981)

11 BIROLINI, A.: ‘Quality and reliability of technical systems: theory, practice, management’ (Springer-Verlag, 1994)

8 Appendices

8.1 Discussion of protected detector/corrector configuration

The protected system approach is outlined in Fig. 2, which also shows fault regions associated with groups of operations. These regions will aid in the discussions. The totally self-checking comparators (TSCC), shown in two places and used inside the threshold detector, are self-checking and fault-secure by design [4, 5], each comparing the respective components of parity vectors which should be identical. TSCC1 is concerned in part with the (n - k) components of y′(t), which are computed directly from the linear processing outputs. It compares them with the related deterministic-parity values in y″(t), which are determined directly from the input values. Each pair of respective components of these vectors is compared independently, employing a small threshold Δ to account for roundoff errors considered acceptable for normal calculations (see Fig. 12). The binary-output variables should form a legitimate 1-out-of-2 code word when respective values are within Δ of each other, as given by the equations in Fig. 12.

Fig. 12 Component of real-number totally self-checking comparators
Produces proper 1-out-of-2 binary code if two comparable values are within tolerance magnitude
s_i(t) = y″_i(t) - y′_i(t), i = 0, 1, ..., (n - k - 1)
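The behaviour of one comparator component can be modelled in software. The sketch below is only a functional model under stated assumptions: the binary encoding, in which (1, 0) is the valid 1-out-of-2 word and (1, 1) or (0, 0) are invalid words, is hypothetical because the equations of Fig. 12 are not fully legible here, and the self-checking property of the hardware under its own internal faults is not modelled. All function names are illustrative.

```python
from typing import Sequence, Tuple


def compare_component(y_direct: float, y_from_outputs: float,
                      delta: float) -> Tuple[int, int]:
    """Return a binary pair for one parity component.

    Hypothetical encoding: (1, 0) when the difference lies within +/- delta,
    (1, 1) when it exceeds +delta, (0, 0) when it is below -delta.
    """
    s = y_direct - y_from_outputs          # difference of the two parity values
    above_lower = 1 if s >= -delta else 0  # 1 when s is not below -delta
    above_upper = 1 if s > delta else 0    # 1 when s exceeds +delta
    return (above_lower, above_upper)


def is_valid_pair(pair: Tuple[int, int]) -> bool:
    """A 1-out-of-2 word is valid when exactly one bit is set."""
    return pair[0] != pair[1]


def compare_vectors(y_direct: Sequence[float], y_from_outputs: Sequence[float],
                    delta: float) -> bool:
    """Software stand-in for concentrating all pairs: True only if every pair is valid."""
    if len(y_direct) != len(y_from_outputs):
        raise ValueError("parity vectors must have the same length")
    return all(is_valid_pair(compare_component(u, v, delta))
               for u, v in zip(y_direct, y_from_outputs))


print(compare_component(1.0, 1.0005, 1e-3))              # within tolerance
print(compare_vectors([1.0, 2.0], [1.0, 2.5], 1e-6))     # one component mismatched
```

A two-bit output per component is used so that both an excessive positive and an excessive negative difference map to words outside the 1-out-of-2 code.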

The resulting (n - k) binary pairs may be concentrated further into a single 1-out-of-2 code using standard fault tolerance design methodology [4, 5]. According to this well-known theory, any failure in a single operation within the TSCC or concentrator will produce an invalid code word. Consequently, the TSCC subassemblies constitute their own fault regions. Similarly, the components of the syndrome vector d(t), which cause corrective actions to the output when they are sufficiently far from zero, can be checked against zero values by the same type of comparators, denoted by the item threshold detector in Fig. 2.

Because it is difficult to predict the location of errors in a linear processing function output stream from internal failures, even if all implementation details are known, burst-correcting convolutional codes are recommended. A burst code corrects any group of errors as long as they span a short part of the codeword and there is an adequate guard band. There are two well-known classes of burst-correcting convolutional codes: the Berlekamp-Preparata-Massey and the Iwadare codes [7]. Typically, these codes correct a burst of length up to k followed by a guard band of (mk) error-free positions, where the rate of the code is k/n and the constraint length is (m + 1)k, thus defining parameter m.
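The burst and guard-band condition can be illustrated with a small checker. This is a simplified sketch, assuming that error positions are indexed within a single stream and ignoring any alignment of bursts to block boundaries that a particular code may require. The function names and the example values k = 4, n = 5, m = 3 are hypothetical.

```python
from typing import Dict, Iterable, List


def code_parameters(k: int, n: int, m: int) -> Dict[str, float]:
    """Parameters named in the text for a rate k/n burst-correcting code."""
    return {
        "rate": k / n,
        "constraint_length": (m + 1) * k,
        "burst_length": k,
        "guard_band": m * k,
    }


def split_into_bursts(error_positions: Iterable[int], guard_band: int) -> List[List[int]]:
    """Group errors so that separate bursts have at least guard_band clean positions between them."""
    bursts: List[List[int]] = []
    for p in sorted(set(error_positions)):
        if bursts and p - bursts[-1][-1] - 1 < guard_band:
            bursts[-1].append(p)   # too close to the previous error: same burst
        else:
            bursts.append([p])     # enough error-free positions: new burst
    return bursts


def is_correctable(error_positions: Iterable[int], k: int, m: int) -> bool:
    """True when every burst spans at most k positions (simplified criterion)."""
    bursts = split_into_bursts(error_positions, m * k)
    return all(b[-1] - b[0] + 1 <= k for b in bursts)


print(code_parameters(4, 5, 3))
print(is_correctable([10, 11, 13], k=4, m=3))   # one burst of span 4
print(is_correctable([10, 15], k=4, m=3))       # errors too far apart for one burst, too close for two
```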

The protection of each fault region in Fig. 2 will be examined. The goal is to detect the failure of any single subassembly by having it produce at least one invalid 1-out-of-2 code word at a TSCC unit. The underlying assumption is that only one failure exists at a time, multiple failures being much less probable. When a failure exists in a region, all other parts of the system function normally, except for roundoff noise inherent in normal computations.

If the linear processing operation fails, denoted in Fig. 2 as region A, the real number convolutional code will detect the onset of the resulting errors through TSCC1, indicating a mismatch in the two parity streams. The remainder of the system, and the corrector subassembly in particular, operates correctly so the output of TSCC2 will indicate proper behaviour. The threshold detector notes that the values in d(t) are indicating correction. When there is a failure in region B, the parity-computation operation, TSCC1 will detect the error and the threshold detector will indicate corrective action from the d(t) values, which in this case will make changes to a basically correct output. Under these circumstances, the direct parity calculations in y″(t) are legitimate. Thus, when the falsely activated correction to the output stream is made, the resulting corrected outputs now contain errors due to the incorrect subtraction. These outputs have nonzero values subtracted and the resulting recomputed parities -y″(t), one output group to TSCC2, can never match the correct parities y″(t). Both TSCC units indicate errors and the output is unreliable.

A similar situation occurs when region C contains a failure, i.e. the direct parity calculations producing y″(t) are incorrect. TSCC1 will indicate an error and the syndrome vector d(t) causes the corrector to subtract nonzero values incorrectly from the otherwise correct output stream. The values subtracted in each instance represent amounts which are added to the correct outputs to produce the observed incorrect parities y″(t). Hence, the parities at the upper input to TSCC2 are the negative values of the incorrect y″(t) values, because of the linearity of the parity-generation process. The differences inside TSCC2 are then doubled in size according to the equations for r_i(t) and s_i(t) in Fig. 12. Both TSCC units indicate errors and the corrected output is untrustworthy.

When the corrector hardware contains a failure (region D), TSCC2 will receive unmatching parities and will detect a failure, while TSCC1 shows correct behaviour. The output is unreliable. On the other hand, TSCC2 will also detect any failure in region E, the final parity-computation and delay storage, but TSCC1 indicates proper performance. In this case, the output is correct and can be used since the syndromes d(t) have all components within threshold levels.

When one of the two TSCC units fails, all other subassemblies are functioning properly. For example, if TSCC1 fails, TSCC2 will show the outputs correct since the computation of d(t) values is part of region D and the threshold detector will indicate no corrective action. Likewise, if TSCC2 fails and indicates an error, the threshold detector shows no correction underway and the output may be trusted.
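The single-failure cases discussed above can be collected into a lookup table. The sketch below records only the TSCC1 and TSCC2 indications and the stated usability of the output; it assumes that a failed TSCC unit emits an invalid code word, that the corrected output in case A counts as usable, and it omits the threshold-detector state which the text also uses. The names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Indication(NamedTuple):
    tscc1_error: bool    # TSCC1 emits an invalid 1-out-of-2 word
    tscc2_error: bool    # TSCC2 emits an invalid 1-out-of-2 word
    output_usable: bool  # whether the text treats the output as usable


# One entry per single-failure case described in the text.
SINGLE_FAILURE_CASES: Dict[str, Indication] = {
    "A: linear processing": Indication(True, False, True),
    "B: parity computation": Indication(True, True, False),
    "C: direct parity calculation": Indication(True, True, False),
    "D: corrector hardware": Indication(False, True, False),
    "E: final parity and delay storage": Indication(False, True, True),
    "TSCC1 itself": Indication(True, False, True),
    "TSCC2 itself": Indication(False, True, True),
}


def candidate_cases(tscc1_error: bool, tscc2_error: bool) -> List[str]:
    """Cases whose comparator indications match the observed ones."""
    return [name for name, ind in SINGLE_FAILURE_CASES.items()
            if ind.tscc1_error == tscc1_error and ind.tscc2_error == tscc2_error]


def output_usable(tscc1_error: bool, tscc2_error: bool) -> bool:
    """Conservative rule: usable only if every matching case leaves the output usable."""
    return all(SINGLE_FAILURE_CASES[name].output_usable
               for name in candidate_cases(tscc1_error, tscc2_error))


print(candidate_cases(True, True))
print(output_usable(False, True))
```

With the comparator indications alone, cases D and E cannot be separated, so the conservative rule treats that observation as unusable.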

8.2 Impact of multiple failures in corrector subassembly

There is a product term in each of the constituent conditional probabilities, arising from the chain rule expansion of $P(\text{tolerable} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n)$, which is concerned with the possible overload of the corrector between arrivals of failures in the processing part. For example, the probability $P(\overline{\text{corctr ovrld}}(0, \xi_1))$ in eqn. 11 relates to the event that the corrector parts are not overloaded by their failures during the interval $(0, \xi_1)$, where $\xi_1$ denotes the processing part's first failure-arrival time. Similar terms of the general form $P(\overline{\text{corctr ovrld}}(\xi_{i-1} + c, \xi_i))$ appear in subsequent conditional probabilities:

$$P\big(\{\text{failure } i \text{ corrected}\} \mid \xi_1, \xi_2, \ldots, \xi_n, N_t = n, \{\text{failure } 1 \text{ corrected}\}, \{\text{failure } 2 \text{ corrected}\}, \ldots, \{\text{failure } i-1 \text{ corrected}\}\big)$$

The overload situation during interval $(0, \xi_1)$ is typical of all other intervals and so it will be analysed first. This analysis is easily extended to these intervals because of the memoryless property of the failure-arrival processes.

Since the failures in the corrector parts also follow a Poisson process, it is possible to expand the $P(\overline{\text{corctr ovrld}}(0, \xi))$ term using the probabilities conditioned on the number of failures occurring in the interval $(0, \xi)$. The subscript 1 on $\xi_1$ has been temporarily suppressed for convenience.

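$$P\big(\overline{\text{corctr ovrld}}(0, \xi)\big) = \sum_{k=0}^{+\infty} \exp(-b\xi)\, \frac{(b\xi)^k}{k!}\, P\big(\overline{\text{corctr ovrld}}(0, \xi) \mid Q_\xi = k\big) \tag{22}$$

The ordered arrival times of the event $(Q_\xi = k)$ will be denoted by $\{x_1, x_2, \ldots, x_k\}$ and have a density function as mentioned earlier (see [10]):

$$f(x_1, x_2, \ldots, x_k) = \frac{k!}{\xi^k}, \quad 0 \le x_1 \le x_2 \le \cdots \le x_k \le \xi, \quad k = 0, 1, 2, \ldots \tag{23}$$

Then the inner conditional probabilities of eqn. 22 may be expressed by integrals over the proper region:

$$P\big(\overline{\text{corctr ovrld}}(0, \xi) \mid Q_\xi = k\big) = \underset{\substack{\text{region where corrector} \\ \text{functions properly}}}{\int\!\!\int \cdots \int} f(x_1, x_2, \ldots, x_k) \times P\big(\overline{\text{corctr ovrld}}(0, \xi) \mid x_1, x_2, \ldots, x_k, Q_\xi = k\big)\, dx_k\, dx_{k-1} \cdots dx_1 \tag{24}$$

The inner conditional probabilities $P(\overline{\text{corctr ovrld}}(0, \xi) \mid x_1, x_2, \ldots, x_k, Q_\xi = k)$ describe the statistical behaviour of the corrector under its failures at the specified times $(x_1, x_2, \ldots, x_k)$, and they are difficult to evaluate even if all circuit design details are known. Nevertheless, some general statements can be made.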

There can be no overload in the corrector part if no failures occur, since the corrector is designed properly:

$$P\big(\overline{\text{corctr ovrld}}(0, \xi) \mid Q_\xi = 0\big) = 1 \tag{25}$$

Furthermore, since during the interval $(0, \xi)$ there are no failures in the processing part, the system will not produce incorrect results even if one failure occurs in the corrector. Such a single failure will be detected, but the overall system is still producing valid data. Hence the conditional probability $P(\overline{\text{corctr ovrld}}(0, \xi) \mid x_1, Q_\xi = 1) = 1$ which, when coupled with eqns. 23 and 24 for index k = 1, yields a value

$$P\big(\overline{\text{corctr ovrld}}(0, \xi) \mid Q_\xi = 1\big) = \int_{x=0}^{\xi} 1 \cdot \frac{1}{\xi}\, dx = 1 \tag{26}$$

On the other hand, many multiple-failure patterns permit nonzero values for the conditional probabilities inside the integrals of eqn. 24. A lower bound is determined by considering only those failure arrival times for which no overload can occur because all interarrival times are sufficiently separated, allowing the corrector time to recover. These conditions are achieved by requiring a guard band of length at least c between each failure's arrival in the corrector subassembly:

$$P\big(\overline{\text{corctr ovrld}}(0, \xi) \mid Q_\xi = k\big) \ge \underset{\substack{\text{region where failures} \\ \text{no closer than } c}}{\int\!\!\int \cdots \int} f(x_1, x_2, \ldots, x_k)\, dx_k\, dx_{k-1} \cdots dx_1 \tag{27}$$

The integral for this region is easily expressed, permitting evaluation by inductive techniques:

$$\underset{\substack{\text{region where failures} \\ \text{no closer than } c}}{\int\!\!\int \cdots \int} f(x_1, x_2, \ldots, x_k)\, dx_k\, dx_{k-1} \cdots dx_1 = \left(\frac{\xi - kc}{\xi}\right)^k, \quad kc \le \xi \tag{28}$$

These comments lead immediately to a general lower bound to the probability $P(\overline{\text{corctr ovrld}}(0, \xi))$ for the interval of length $\xi$:

$$P\big(\overline{\text{corctr ovrld}}(0, \xi)\big) \ge \exp(-b\xi) + b\xi \exp(-b\xi) + \exp(-b\xi) \sum_{k=2}^{\lfloor \xi / c \rfloor} \frac{[b(\xi - kc)]^k}{k!} \tag{29}$$

After only a few numerical calculations, it is easy to see that the summation term becomes very small because of the smallness of b coupled with the effects of the exponent k. The text uses the first two terms of eqn. 29, which represent at most one additional failure in the corrector during the interarrival failure times in the processing part.
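The smallness of the summation term can be checked numerically. The sketch below assumes that the k-th summation term has the form [b(ξ - kc)]^k / k!, as suggested by eqn. 31, and truncates the sum after a fixed number of terms, which keeps the result a lower bound because every term is nonnegative. The function name, the truncation limit and the example values (b = 100 fits, c = 1.8 µs, ξ = 1000 h) are illustrative choices.

```python
import math
from typing import Tuple


def no_overload_lower_bound(b: float, xi: float, c: float,
                            max_terms: int = 20) -> Tuple[float, float]:
    """Return (two-term part, truncated summation part) of the lower bound.

    b  : corrector failure intensity per hour
    xi : interval length in hours
    c  : guard band in hours
    """
    if b < 0 or xi < 0 or c <= 0:
        raise ValueError("require b >= 0, xi >= 0 and c > 0")
    weight = math.exp(-b * xi)
    two_term = weight * (1.0 + b * xi)            # k = 0 and k = 1 contributions
    k_limit = min(max_terms, int(xi // c))        # no term beyond floor(xi / c)
    tail = 0.0
    for k in range(2, k_limit + 1):
        usable = max(0.0, xi - k * c)             # guard against rounding below zero
        tail += (b * usable) ** k / math.factorial(k)
    return two_term, weight * tail


b = 100e-9                 # 100 fits expressed per hour
c = 1.8e-6 / 3600.0        # 1.8 microseconds expressed in hours
two_term, tail = no_overload_lower_bound(b, 1000.0, c)
print(f"two-term part = {two_term:.12f}, summation part = {tail:.3e}")
```

For these values the summation part is many orders of magnitude below the two-term part, which is consistent with retaining only the first two terms.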

The general case of determining a generic probability $P(\overline{\text{corctr ovrld}}(\xi_{i-1} + c, \xi_i))$, which is needed in every conditional probability of interest in the text, may be evaluated by employing formulas similar to those developed above. The memoryless property of the Poisson process governing failures in the corrector part allows the following identity:

$$P\big(\overline{\text{corctr ovrld}}(\xi_{i-1} + c, \xi_i)\big) = P\big(\overline{\text{corctr ovrld}}(0, \xi_i - \xi_{i-1} - c)\big) \tag{30}$$

This immediately translates to a lower bound for each generic term:

$$P\big(\overline{\text{corctr ovrld}}(\xi_{i-1} + c, \xi_i)\big) \ge \exp\{-b(\xi_i - \xi_{i-1} - c)\}\{1 + b(\xi_i - \xi_{i-1} - c)\} + \exp\{-b(\xi_i - \xi_{i-1} - c)\} \sum_{k=2}^{\lfloor (\xi_i - \xi_{i-1})/c \rfloor - 1} \frac{[b\{\xi_i - \xi_{i-1} - (k+1)c\}]^k}{k!} \tag{31}$$
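The shift from a generic interval to one starting at zero can be illustrated for a list of processing-failure arrival times. The sketch below keeps only the first two terms of the bound, as the text does, and treats the first interval as (0, ξ1) and each later interval as (ξ_{i-1} + c, ξ_i). The function names and the example arrival times are hypothetical, and an interval that would have negative length is given the trivial bound 0 as a conservative placeholder.

```python
import math
from typing import List, Sequence


def two_term_bound(length: float, b: float) -> float:
    """exp(-b*L) * (1 + b*L) for an interval of length L; 0.0 if L is negative."""
    if length < 0.0:
        return 0.0  # next processing failure arrived inside the guard band
    return math.exp(-b * length) * (1.0 + b * length)


def per_interval_bounds(arrivals: Sequence[float], b: float, c: float) -> List[float]:
    """Two-term lower bounds for the intervals between processing-part failures.

    arrivals : increasing processing-failure arrival times in hours
    b        : corrector failure intensity per hour
    c        : guard band in hours
    """
    bounds: List[float] = []
    for i, xi in enumerate(arrivals):
        start = 0.0 if i == 0 else arrivals[i - 1] + c   # memoryless shift of the interval
        bounds.append(two_term_bound(xi - start, b))
    return bounds


b = 100e-9              # 100 fits expressed per hour
c = 1.8e-6 / 3600.0     # 1.8 microseconds expressed in hours
for value in per_interval_bounds([100.0, 400.0, 1000.0], b, c):
    print(f"{value:.12f}")
```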
