Probabilistic carry state estimate for improved asynchronous adder performance


W.F. Wallace, S.S. Dlay and O.R. Hinton

Abstract: The paper presents a new type of simple adder, suitable for asynchronous digital circuits and implementation in VLSI technology, which has speed and/or area advantages over existing designs. It is based on the concept of predicting the carry from the least to the most significant half of a 32 or 64 bit adder in such a way that it has a high probability of being correct, while introducing only a low area overhead from the required early completion control circuitry. Detailed design and simulation of the adder at the gate level is presented, together with its evaluation by comparing detailed performance with equivalent ripple, carry lookahead, and carry select designs. Since the objective is improved asynchronous circuits, it is average rather than worst-case delays that are the significant measures. In comparison to other adder networks it is demonstrated that by using the important metrics of area, speed and delay–area product the proposed adder can outperform the 32 bit and 64 bit adders cited in the literature. Delay–area product results show that the proposed approach gives a saving of over 14% and 24% on the carry-select lookahead schemes for 32 bit and 64 bit adders respectively.

1 Introduction

Increasing interest in asynchronous circuits has provided opportunities for new methods of reducing delay times in arithmetic logic units (ALUs). Delay time for synchronous operation is limited to the worst-case delay, while for asynchronous operation it is the average delay time that is the significant metric, provided early completion can be detected and used by surrounding circuitry.

The simplest way of increasing the speed of a synchronous ripple adder is to reduce the length of the carry path. In the carry select adder [1, 2] (CSA) (Fig. 1) this is achieved by splitting the data word into two parts and performing the addition of the most significant (MS) and least significant (LS) parts concurrently. The MS adder is duplicated to cover the two possibilities for the carry from the LS adder of a one or zero, and the result is multiplexed depending on the true state of the LS adder carry output. The disadvantage of the CSA approach is the significant additional circuit complexity, although attempts to reduce the additional area of the second MS adder [3] in the CSA and the use of dynamic logic [4] have resulted in some area and speed savings.
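The carry select principle described above can be illustrated with a short behavioural model. This is only a sketch in Python, not the gate-level circuit evaluated in the paper; the function name `carry_select_add` and the default 32 bit width are assumptions made for illustration.

```python
def carry_select_add(a: int, b: int, width: int = 32) -> int:
    """Behavioural model of a two-block carry select adder (illustrative only)."""
    half = width // 2
    mask = (1 << half) - 1

    # The LS block adds the low halves and produces the real carry.
    ls_total = (a & mask) + (b & mask)
    ls_sum, ls_carry = ls_total & mask, ls_total >> half

    # Two MS blocks work concurrently, one for each possible carry-in value.
    a_ms, b_ms = (a >> half) & mask, (b >> half) & mask
    ms_if_carry0 = (a_ms + b_ms) & mask
    ms_if_carry1 = (a_ms + b_ms + 1) & mask

    # The multiplexer selects the MS result that matches the real LS carry.
    ms_sum = ms_if_carry1 if ls_carry else ms_if_carry0
    return (ms_sum << half) | ls_sum
```

The duplicated MS addition and the final selection are the sources of the area overhead noted in the text.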

For asynchronous operation, it is possible to introduce logic for the detection of early completion of the adder for a number of design types. Although this results in variable completion times, this is quite acceptable for asynchronous application. While adders of this type can provide improved performance, under some conditions they do not perform as well overall, and the circuitry for detection of early completion can still impose a significant cost overhead if the detection hierarchy is allowed to extend [5].

© IEE, 2001

IEE Proceedings online no. 20010781

DOI: 10.1049/ip-cdt:20010781

Paper first received 13th September 2000 and in revised form 2nd October 2001

The authors are with the Department of Electrical and Electronic Engineering, University of Newcastle Upon Tyne, Newcastle Upon Tyne NE1 7RU, England

IEE Proc.-Comput. Digit. Tech., Vol. 148, No. 6, November 2001

In the design proposed in this paper, called the ‘estimated carry’ structure (ESTC), we use a CSA structure but one of the MS adders and the multiplexer are dispensed with completely. The circuit uses only the MS bits of the LS half of the operands to predict the carry from the LS to the MS half of the adder. Since the probability of a correct prediction turns out to be 0.75, parallel addition can be used for the LS and MS halves of the data for 75% of additions, thereby reducing carry ripple time. Incorrectly predicted carries can be detected from the actual LS adder carry, at which point the adder is reconfigured from two N/2 bit parallel adders to a single N bit adder to produce the correct MS sum. The majority of operations are therefore completed in parallel mode, hence making full use of the asynchronous advantage of reducing the overall average time delay for the adder. Essential to the effectiveness of this design is the fact that the additional control circuitry for detecting early completion is shown to add only a modest overhead in circuit area and speed.
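The behaviour of the estimated carry structure can be sketched functionally as follows. This is a minimal Python model under stated assumptions, not the circuit itself: the name `estc_add` and the returned `fast` flag are hypothetical, and timing is represented only by whether the estimate was correct.

```python
def estc_add(a: int, b: int, width: int = 32):
    """Return (sum, fast) for a behavioural estimated carry addition.

    fast is True when the estimated carry equals the true LS carry, i.e.
    when the two half-width additions could complete concurrently.
    """
    half = width // 2
    mask = (1 << half) - 1
    a_ls, b_ls = a & mask, b & mask
    a_ms, b_ms = (a >> half) & mask, (b >> half) & mask

    # Estimated carry: AND of the most significant bits of the LS halves.
    estimated = (a_ls >> (half - 1)) & (b_ls >> (half - 1)) & 1

    # The LS addition and the speculative MS addition proceed in parallel.
    ls_total = a_ls + b_ls
    true_carry = ls_total >> half
    ms_sum = (a_ms + b_ms + estimated) & mask

    fast = estimated == true_carry
    if not fast:
        # Misprediction: the MS half is re-evaluated with the true carry,
        # which corresponds to operating as a single full-width adder.
        ms_sum = (a_ms + b_ms + true_carry) & mask

    return (ms_sum << half) | (ls_total & mask), fast
```

For example, `estc_add(0xFFFF, 1)` mispredicts, because the top bits of the LS halves are 1 and 0 while the true LS carry is 1, so the slow path is taken.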

Fig. 1 Carry select adder (blocks: LS adder; MS adder with carry = 0; MS adder with carry = 1; output multiplexer selected by the LS carry)

2 Detailed operation

Speculative completion [6] adders can use multiple worst-case delay lines to provide overall delay timing. A similar method to [7] is proposed, as shown in Fig. 2, but containing only one delay line with a bypass to provide two delay times, as shown in Fig. 3. The shorter delay, with the second part of the delay line bypassed, provides the time for completion of the addition for the case where the estimated carry equals the true carry, $C_{15}^{e} = C_{15}^{t}$. The longer delay of the full delay line is used for the case where $C_{15}^{e} \neq C_{15}^{t}$. It can be seen that the inclusion of a demultiplexer, rather than a multiplexer, allows reuse of the delay elements.

Consider first the detailed operation of a 32 bit ESTC adder, with two input operands A and B, and later consider the extension to a 64 bit adder. The control circuit for the 32 bit ESTC adder is given in Fig. 4. This circuit is designed to generate the correct carry. At the outset the signals labelled DONE and $C_{15}^{t}$ are initialised to zero and a pulsed input is applied to the input signal labelled START PULSE. The START PULSE sets an RS flip-flop which is then held in the set state until a DONE signal is returned when the addition is complete, at which point the flip-flop is reset and a new set of data can be applied to the adder. This prevents spurious carry states from being generated prior to the next carry state evaluation during the next addition cycle.

Fig. 2 Speculative completion delay line (labels: long delay, short delay, demux)

The 32 bit ESTC adder is split into two 16 bit halves, the LS ($A_0$ to $A_{15}$ and $B_0$ to $B_{15}$) and MS ($A_{16}$ to $A_{31}$ and $B_{16}$ to $B_{31}$). For the LS adder, it is apparent that, for the following two conditions, the true state of the carry output ($C_{15}^{t}$) is dependent only on the state of $A_{15}$ and $B_{15}$ of the operands:

$$C_{15}^{t} = \begin{cases} 0 & \text{if } A_{15} = B_{15} = 0 \\ 1 & \text{if } A_{15} = B_{15} = 1 \end{cases} \qquad (1)$$

Therefore, looking at Fig. 4 in more detail, while the two 16 bit adders are working in parallel the control unit generates the true carry $C_{15}$. The START PULSE places a high signal on the output of the NOR gate that feeds the SELECT DELAY subcircuit. The SELECT DELAY is the control for the MULTIPLEXER and determines whether the delay D1 or D1 and D2 should be used. In this particular case the delay D1 is used. The estimated carry is given by the signal labelled $C_{15}^{e}$ and this is also shown in eqn. 2. Finally, the DONE signal is asserted after a delay of D1, MULTIPLEXER and OR gate.

Fig. 3 Completion delay line (labels: start, D1, Mux, D2, done; short delay when correction to the estimated carry is not required, long delay when correction to the estimated carry is required)

Fig. 4 Carry and completion control circuit (size of inverters indicates delay ratio). Labelled signals: start pulse, select delay, D1 (16 bit adder delay), multiplexer, D2 (16 bit adder delay), done, $A_{15}$, $B_{15}$, $C_{15}^{t}$, $C_{15}^{e}$, with $C_{15} = (A_{15} \wedge B_{15}) \vee C_{15}^{t}$


For the condition $A_{15} \neq B_{15}$, $C_{15}^{t}$ is dependent on the carry from previous bit positions. If the estimated carry ($C_{15}^{e}$) is taken to be zero for this condition, then for randomly generated inputs, $C_{15}^{e} = C_{15}^{t}$ for 50% of cases. Using this strategy, the following functions can easily be derived:

$$C_{15}^{e} = A_{15} \wedge B_{15} \qquad (2)$$

$$P(C_{15}^{e} = C_{15}^{t}) = P(A_{15} = B_{15}) + P\big((A_{15} \neq B_{15}) \wedge (C_{15}^{t} = 0)\big) = 0.75 \qquad (3)$$

$$P(C_{15}^{e} \neq C_{15}^{t}) = P\big((A_{15} \neq B_{15}) \wedge (C_{15}^{t} = 1)\big) = 0.25 \qquad (4)$$

where $P(C_{15}^{e} = C_{15}^{t})$ is the probability of the estimated and true carry being equal, and $P(C_{15}^{e} \neq C_{15}^{t})$ is the probability of the estimated and true carry being unequal.
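The probabilities in eqns. 3 and 4 can be checked by exhaustive enumeration on a reduced word length. The following Python sketch is an illustration rather than the MATLAB procedure used in the paper; the function names are hypothetical, the operands are assumed to be uniformly distributed, and the carry into bit 0 is assumed to be zero. Enumerating all 16 bit operand pairs is slow in pure Python, so a narrower LS block is used.

```python
from fractions import Fraction


def true_carry(a: int, b: int, width: int) -> int:
    """Carry out of a width-bit addition with zero carry-in."""
    return ((a + b) >> width) & 1


def estimated_carry(a: int, b: int, width: int) -> int:
    """Estimate of eqn. 2: AND of the most significant operand bits."""
    msb = width - 1
    return (a >> msb) & (b >> msb) & 1


def match_probability(width: int) -> Fraction:
    """Exact probability that the estimate equals the true carry."""
    size = 1 << width
    hits = sum(
        estimated_carry(a, b, width) == true_carry(a, b, width)
        for a in range(size)
        for b in range(size)
    )
    return Fraction(hits, size * size)
```

Under these assumptions the exact value for a w bit block is 3/4 + 2^-(w+1), so the figure of 0.75 quoted in eqn. 3 is a very close approximation for w = 16.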

When $C_{15}^{e} \neq C_{15}^{t}$, it is $C_{15}^{t}$ that must be applied to the MS adder, when it is available. This occurs after the LS adder has completed. In this case, the time taken is the same as for the addition of a full 32 bit adder, plus a small delay incurred in detecting the state and applying the corrected value. The logic for this is shown in Fig. 4. The signal that appears at $C_{15}$ is the true carry $C_{15}^{t}$. If $C_{15}^{t}$ is low then the subcircuit SELECT DELAY disables D2 and asserts the DONE signal and in addition the carry is propagated to $C_{15}$. Therefore the DONE signal is set after a total delay of D1, MULTIPLEXER and OR gate.

However, if $C_{15}^{t}$ is high then this value is available only after the LS 16 bit adder has finished and this value will propagate through to $C_{15}$. In addition, SELECT DELAY enables D2 and the timing has been optimised so that the narrow START PULSE and the $C_{15}^{t}$ signal arrive simultaneously at the MULTIPLEXER. Therefore the result is available after a delay of D1, MULTIPLEXER, D2 and one OR gate.

Therefore, for 75% of random data operations the time taken for the 32 bit adder is the 16 bit adder delay and the additional delay of the MULTIPLEXER and one gate. For 25% of random data operations the time taken is a delay of D1, MULTIPLEXER, D2 and one gate.
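The two cases above can be combined into an expected completion time. As a worked expression, writing $D_1$ and $D_2$ for the delays D1 and D2, and introducing the symbols $t_{\mathrm{mux}}$ and $t_{\mathrm{or}}$ (assumed here, not used in the paper) for the MULTIPLEXER and OR gate delays, the average delay for random data would be:

$$\bar{T} = 0.75\,(D_1 + t_{\mathrm{mux}} + t_{\mathrm{or}}) + 0.25\,(D_1 + t_{\mathrm{mux}} + D_2 + t_{\mathrm{or}}) = D_1 + t_{\mathrm{mux}} + t_{\mathrm{or}} + 0.25\,D_2$$

On this reading, only a quarter of the second delay element contributes to the average, which is the source of the asynchronous speed advantage.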


3 Simulations

3.1 Method

Simulations have been conducted for three different types of adder with two variations for each type: first with simple 4 bit ripple adder elements, and second with 4 bit carry lookahead elements. The simulations for the 32 bit adders were carried out in PSPICE using parameters from a 0.125 μm CMOS technology. For each of the eight conditions shown in Table 1, a matching input data vector was applied and circuit timing obtained from the PSPICE output. The probability functions, in eqns. 3 and 4, were specified by analysing more than 10 000 random data sets, which were generated using MATLAB.

Table 1: Carry logic truth table

| A15 | B15 | C14 | Ctrue | Cestimate = A15·B15 | Resulting operation |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 2 × 16 bit concurrent |
| 0 | 0 | 1 | 0 | 0 | 2 × 16 bit concurrent |
| 0 | 1 | 0 | 0 | 0 | 2 × 16 bit concurrent |
| 0 | 1 | 1 | 1 | 0 | 32 bit |
| 1 | 0 | 0 | 0 | 0 | 2 × 16 bit concurrent |
| 1 | 0 | 1 | 1 | 0 | 32 bit |
| 1 | 1 | 0 | 1 | 1 | 2 × 16 bit concurrent |
| 1 | 1 | 1 | 1 | 1 | 2 × 16 bit concurrent |

Fig. 5 32 bit adders (blocks: LS adder with inputs A_LS, B_LS and output S_LS; MS adder with inputs A_MS, B_MS and output S_MS; carry control driven by A15, B15 and Ct; delay control with start pulse, delay D1, Mux, delay D2, OR and done)

Fig. 6 64 bit adder (two 32 bit ESTC adders of the form shown in Fig. 5, handling A(0,31), B(0,31), S(0,31) and A(32,63), B(32,63), S(32,63), with an additional control block driven by A31, B31 and Ct)

Performance results for 64 bit circuits have been estimated by analytic evaluation of the circuit operation. Two 32 bit ESTC adders (Fig. 5) are used with an additional control and delay circuit as shown in Fig. 6; there is an increased level of complexity since four combinations of the delays are needed. This new circuit is therefore known as the ESTC/ESTC adder. The delay through this adder will now depend on the states of the three carries between the 16 bit adder sections (Table 2). The best case is now four 16 bit concurrent operations. Starting from the probabilities of a short delay (D1) of 75% and a long delay (D2) of 25%, the probabilities of all eight carry conditions can be calculated (Table 2), and hence also the average delay time.
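The calculation of the average delay can be sketched as follows. This Python fragment is illustrative only: it copies the absolute delays from Table 2 and assumes that the three carry estimates are independent, each correct with probability 0.75; the names `DELAY_PS`, `condition_probability` and `average_delay_ps` are hypothetical.

```python
from itertools import product

P_CORRECT = 0.75  # probability that a single carry estimate is correct (eqn. 3)

# Absolute delays in ps, copied from Table 2 and keyed by (C_AB, C_BC, C_CD),
# where 1 denotes a correct carry estimate and 0 an incorrect one.
DELAY_PS = {
    (0, 0, 0): 582, (0, 0, 1): 315, (0, 1, 0): 469, (0, 1, 1): 315,
    (1, 0, 0): 469, (1, 0, 1): 315, (1, 1, 0): 357, (1, 1, 1): 202,
}


def condition_probability(state) -> float:
    """Probability of one carry condition, assuming independent estimates."""
    p = 1.0
    for correct in state:
        p *= P_CORRECT if correct else 1.0 - P_CORRECT
    return p


def average_delay_ps() -> float:
    """Probability-weighted mean of the delays for all eight conditions."""
    return sum(
        condition_probability(state) * DELAY_PS[state]
        for state in product((0, 1), repeat=3)
    )
```

With these inputs the weighted sum is approximately 291.8 ps, in agreement with the average of 292 ps reported in Table 2.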

Table 2: 64 bit operation timing

| CAB | CBC | CCD | Delay timing | Probability | Absolute delay, ps | Contribution to delay, ps |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | DL + DL + DC | 0.016 | 582 | 9 |
| 0 | 0 | 1 | DL + DC | 0.047 | 315 | 15 |
| 0 | 1 | 0 | DL + DS + DC | 0.047 | 469 | 22 |
| 0 | 1 | 1 | DL + DC | 0.141 | 315 | 44 |
| 1 | 0 | 0 | DL + DS + DC | 0.047 | 469 | 22 |
| 1 | 0 | 1 | DL + DC | 0.141 | 315 | 44 |
| 1 | 1 | 0 | DS + DS + DC | 0.141 | 357 | 50 |
| 1 | 1 | 1 | DS + DC | 0.422 | 202 | 85 |
| Average delay | | | | | | 292 |

DC = delay due to additional control; DS and DL = completion times respectively for adders A and B.

0 = incorrect carry estimate; 1 = correct carry estimate.

3.2 32 bit adders

Results obtained for the 32 bit adders are listed in Table 3, and are also presented graphically in Fig. 7, where improved designs tend towards the x-y origin.

Table 3: Transistor count and delays for the simulated 32 bit adders

| Device | Delay, ps | Number of transistors | Delay × number of transistors (normalised) |
|---|---|---|---|
| 64 bit CS/CS | 173 | 9120 | 0.263 |
| 64 bit CS/CLA | 255 | 5664 | 0.241 |
| 64 bit ESTC/ESTC | 292 | 3712 | 0.181 |
| 64 bit R/CLA | 406 | 3360 | 0.912 |
| 64 bit R | 2128 | 2816 | 1.0 |
| 32 bit CS/CLA | 121 | 2832 | 0.229 |
| 32 bit CS/R | 166 | 1756 | 0.195 |
| 32 bit ESTC/CLA | 203 | 1680 | 0.228 |
| 32 bit R/CLA | 584 | 2424 | 0.945 |
| 32 bit ESTC/R | 627 | 1530 | 0.641 |
| 32 bit R | 1064 | 1408 | 1.0 |

Key: CS carry select, ESTC estimated carry, R ripple carry, CLA carry lookahead, ESTC/CLA estimated carry with carry lookahead elements, CS/CS carry select with carry select elements, CS/R carry select with ripple carry elements, CS/CLA carry select with carry lookahead elements, ESTC/ESTC estimated carry with estimated carry elements, ESTC/R estimated carry with ripple carry elements, R/CLA ripple carry with carry lookahead elements. Delay × number of transistors is normalised to the 64 bit and 32 bit ripple adders.

Fig. 7 Delay plotted against number of transistors (x axis: delay time, ps; y axis: number of transistors; one point for each of the 32 bit and 64 bit adders of Table 3, using the same key)

It can be seen that for the three adders based on 4 bit ripple elements, the 32 bit ripple adder (R/R) requires the smallest area but is very slow, as opposed to the carry select adder (CS/R) which is relatively fast but uses a large additional area. The estimated carry adder (ESTC/R) provides a considerable speed advantage of 41% over the R/R for a size increase of only 8%. The ESTC/R, while only 7% slower than the CS/R, uses 27% less area.

As expected, the three adders based on 4 bit carry lookahead (CLA) elements show an immediate speed advantage over those using ripple elements. The 32 bit ripple/carry lookahead adder (R/CLA) is again the slowest of the group, but is still much faster than the CS/R and uses less area. The CS/CLA is the fastest, but at the expense of greatest chip area. The estimated carry adder (ESTC/CLA) is 18% faster than the R/CLA, for an increase of only 4% in area, and is 27% slower than the CS/CLA, but uses 38% less area.

The above results are based on 32 bit random data inputs which gave 75% of additions completed in the shorter time. Data that will more realistically model the data input to the adder [7, 8] could be expected to increase the number of operations that occur in the shorter delay time to approximately 80%. This is based on MATLAB simulations that modified the random data input to the adder to include 20% of additions with small operands not resulting in a carry from the LS adder. The effect of using realistic data rather than purely random data is to further improve the average delay times given in Table 3.
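The figure of approximately 80% is consistent with a simple mixture calculation. Assuming that the small-operand additions are always predicted correctly (their LS halves produce no carry and, with bit 15 clear, the estimate is also zero) and that the remaining 80% of additions behave as random data, the probability of completing in the shorter time is:

$$P_{\mathrm{short}} = 0.20 \times 1 + 0.80 \times 0.75 = 0.80$$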

Other results reported in the literature have shown that multiplexer-based carry networks are widely used for high performance carry propagation circuits [9–11]. Among these is the tree-based conditional carry network which is the fastest adder [9], and uses a tree-based structure which requires a large number of multiplexers, some of which are themselves needed to drive a large number of other multiplexers. Although fast, with a delay of 2.5 ns (0.6 μm CMOS technology), it uses the most hardware, with a gate count of 257. In contrast the proposed 32 bit estimated carry adder has a delay of 203 ps and a comparable gate count of 324 in its configuration with carry lookahead elements. Therefore the adder is simple, relatively fast and area efficient for VLSI realisation.

3.3 64 bit adders

The completion and delay circuits have been redesigned to cater for all the various carry conditions, as explained earlier. Based on the area and delays determined from the 32 bit adder design and simulation, the average delay for the 64 bit adder is found to be 292 ps and comparisons with carry select and ripple carry adders are given in Fig. 7. In the case of the traditional CS/CS adder, the increased hierarchy in moving to 64 bits clearly causes a substantial increase in area, while this is not found in the new ESTC/ESTC approach. While the estimated carry adder has 66% of the speed in this configuration it uses only 40% of the area of the carry select adder. Using two carry select adders with ripple between reduces the area used but the performance of the estimated carry adder is then 98% of the speed with only 80% of the area of that configuration. Against the ripple adder the performance of the estimated carry adder is a 64% increase in speed for only a 9.5% increase in area.

3.4 Delay–area product

Table 3 gives results for the product of delay and number of transistors in which the product is normalised to that of the ripple adder. The delay–area product has been used as a measure because in most implementations it is the area and speed tradeoff that characterises the adders. In addition other researchers [11] have reported similar measures for their design evaluations and have used the product of delay and power. The results can be divided into two separate categories for the 32 and 64 bit adders; for the 32 bit adders they are divided into carry ripple and carry lookahead and for the 64 bit adders into carry ripple and carry select.

Table 3 shows that for the three adders based on the 32 bit ripple adder, R, the ESTC adder gives savings of greater than 35% over the ripple adder and over 30% over the carry select ripple adder (CS/R). The results for the adders based on the carry lookahead scheme show that the estimated carry adder gives a saving of over 80% over the ripple adder. Additionally, there is a saving of over 14% on the carry select (CS/CLA) and the ripple carry lookahead (R/CLA) schemes.

The results for the 64 bit adders in Table 3 show that the savings made are much greater than for the 32 bit adders. The figures show that the 64 bit ESTC adder (64 bit ESTC/ESTC) gives a saving of 82% and 73% over the 64 bit ripple adder and the 64 bit ripple carry lookahead respectively. Furthermore it outperforms the adders based on the carry select-carry select scheme and the carry select-carry lookahead by over 31% and 24% respectively.
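The delay–area comparison for the 64 bit carry select and estimated carry adders can be reproduced from the delay and transistor counts in Table 3. The Python sketch below is illustrative only; it uses the transistor count as the area measure, as the paper does, and the names `ADDERS_64`, `delay_area`, `normalised` and `saving` are hypothetical.

```python
# (average delay in ps, number of transistors) for 64 bit adders, from Table 3.
ADDERS_64 = {
    "CS/CS": (173, 9120),
    "CS/CLA": (255, 5664),
    "ESTC/ESTC": (292, 3712),
    "R": (2128, 2816),
}


def delay_area(name: str) -> int:
    """Delay-area product, with the transistor count standing in for area."""
    delay_ps, transistors = ADDERS_64[name]
    return delay_ps * transistors


def normalised(name: str, reference: str = "R") -> float:
    """Delay-area product relative to the reference (ripple) adder."""
    return delay_area(name) / delay_area(reference)


def saving(name: str, baseline: str) -> float:
    """Fractional reduction in delay-area product relative to a baseline."""
    return 1.0 - delay_area(name) / delay_area(baseline)
```

With these figures, `normalised("ESTC/ESTC")` rounds to 0.181, and the savings of the ESTC/ESTC adder over the CS/CS and CS/CLA adders evaluate to roughly 31% and 25%, consistent with the values quoted above.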

In summary, the aim of a designer is to design the best adders in terms of speed and area and the results suggest that the gains to be made using the ESTC approach increase with increasing word length, as would be expected. Furthermore, the results demonstrate that there are substantial performance gains to be made by adopting the approach and methodology of estimated carry adders for asynchronous adders.

4 Conclusions

It has been shown that by designing for the statistical probability of a carry being in a particular state, a 32 bit adder can be constructed which, for the majority of additions, will operate as a 16 bit parallel adder. Gate level simulations demonstrate a minimum saving in time of 18%, or in area of 38%, compared with other relevant adder configurations. In realistic circuits, restricting the levels of hierarchy [12] typically minimises the overheads of completion detection. The gains of the estimated carry adder are greater with the simpler circuit designs, as the additional control delay is proportionally lower compared to the overall delay time (Fig. 7 group 1). It is therefore likely that the estimated carry adder would become the first choice replacement for the ripple adder, showing a significant speed increase with a very low overhead.

It has also been shown that the advantages of using the ESTC approach increase as the word length increases, because the control overhead does not increase significantly with word length. The new 64 bit estimated carry adder uses only 40% of the area of a traditional carry select version, while it achieves 66% of the speed. Furthermore, these designs offer another option in the tradeoff between area and speed.

The overall results demonstrate that the estimated carry adder is simple, fast and area-efficient and is suitable for implementation using VLSI technology. In addition, when compared to other adder networks in terms of area, speed and delay–area product it can outperform most of the 32 bit adders that are available. Furthermore, as the word length gets larger, say 64 bits and higher, then the ESTC adder outperforms other adders when using the effective metric of delay–area product. Finally, it is envisaged that as the word length increases the savings to be made by use of the estimated carry adder will be significantly greater than by use of conventional approaches.

5 Acknowledgments

The work reported in this paper was carried out under EPSRC grant no. 96308130.

6 References

1 FREEMAN, R.B.: ‘Checked carry select adder’, IBM Tech. Disclosure Bull., 1970, 13, (6), pp. 1504–1505

2 UYA, M., KANEKO, K., and YASUI, J.: ‘A CMOS floating point multiplier’, IEEE J. Solid-State Circuits, 1984, SC-19, (5), pp. 697–702

3 CHANG, T.Y., and HSIAO, M.J.: ‘Carry select adder using single ripple carry adder’, Electron. Lett., 1998, 34, (22), pp. 2101–2102

4 DEGLORIA, A., and OLIVIER, M.: ‘Statistical carry lookahead adders’, IEEE Trans. Comput., 1996, 45, (3), pp. 340–347

5 KINNIMENT, D.J.: ‘An evaluation of asynchronous addition’, IEEE Trans. VLSI Syst., 1996, 4, (1), pp. 137–140

6 NORWICK, S.M.: ‘Design of a low latency asynchronous adder using speculative completion’, IEE Proc., Comput. Digit. Tech., 1996, 29, (5), pp. 301–307

7 NORWICK, S.M., YUN, K.Y., DEEREL, P.A., and DOOPLY, A.: ‘Speculative completion for the design of high performance asynchronous adders ASYNC-97’, 1997, pp. 210–223

8 GARSIDE, J.D.: ‘A CMOS VLSI implementation of an asynchronous ALU’. IFIP Proceedings Asynchronous Design Methodologies, 1993, Manchester, UK, pp. 181–192

9 KUO-HSING, C., SHU-MIN, C., and SHUN-WEN, C.: ‘The improvement of conditional sum adder for low power applications’. Proceedings 11th Annual IEEE International ASIC Conference, 1998, pp. 131–134

10 HIROSHI, M., YASUNOBU, N., HIROAKI, S., HIROYUKI, M., HIROFUMI, S., and KOICHIRO, M.: ‘An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture’, IEEE J. Solid-State Circuits, 1996, 31, (6), pp. 773–783

11 KESHAB, K.P.: ‘Low-energy CSMT carry generators and binary adders’, IEEE Trans. VLSI Syst., 1999, 7, (4), pp. 450–462

12 JOHNSON, D., and AKELLA, V.: ‘Design and analysis of asynchronous adders’, IEE Proc., Comput. Digit. Tech., 1998, 145, (1), pp. 1–7

