A Low-Power High-Radix Serial-Parallel Multiplier



Danny Crookes, Richard M. Jiang, The School of Electrical Engineering, Electronics & Computer Science,

Queen’s University Belfast, Belfast BT7 1NN, UK Email: [email protected]

Abstract: In this paper, we introduce a novel high-radix binary signed digit (BSD) serial-parallel multiplier suitable for low-power high-speed multiplication. The proposed N-bit×N-bit radix-16 serial-parallel multiplier can reduce the number of partial-product accumulation cycles to as few as N/4, and eliminates most of the inversion operations that consume power when a conventional multiplier generates its partial products. Unlike other high-radix methods, the pre-multiplication in the new algorithm employs a BSD method which requires no extra adder, and thus removes the extra addition delay which hinders other high-radix algorithms.

I. INTRODUCTION

The high-speed multiplier circuit is a critical component of today’s microprocessors, image/video processors, broadband processors, crypto processors, and many other digital ICs because of its high cost in circuit latency, power consumption and silicon area[1]. Emerging wireless multimedia applications in particular demand low-power capability in their real-time high-speed computation.

Multiplication is the process of adding a number of partial products. Thus, integer multiplication can be readily implemented in serial-parallel mode by using an accumulator to add these partial products. In an n-bit×n-bit radix-2 multiplication algorithm, each bit produces one partial product, and the complete accumulation is performed in n cycles. The modified Booth algorithm[2] (Booth-2) uses two bits to produce one partial product, and thus needs n/2 cycles to finish one multiplication.

Radix-4 Booth recoding (Booth-2) can therefore only halve the number of partial products relative to radix-2. Higher-radix Booth algorithms can reduce the number of partial products further, but they require extra adders in the premultiplication stage to generate the partial products, which increases both power consumption and critical path delay. Thus, high-radix multiplication is not always appropriate for practical applications[3-4].

Another problem[5-7] is that the Booth-2 algorithm is not very power-efficient even though it halves the number of partial products. With Booth-2, partial products have a 50% probability of being -1× or -2× the multiplicand. These inverting operations introduce many transitions into the adder tree. For example, if the 32-bit multiplicand Y is “1”, -Y and -2×Y switch all of the bits up to the most significant bit, causing 32 transitions in each operation, and require an extra “add 1” operation for each negation with a possible carry propagation from the LSB to the MSB. Booth’s algorithm causes a lot of switching activity even for small-number multiplications such as 1×2 and 0×2, so it is not a good algorithm for low-power applications.

In many DSP applications, constant multipliers based on the Residue Number System, Canonical Signed Digit (CSD) arithmetic, the CORDIC algorithm, or Distributed Arithmetic (DA) are widely adopted as low-power alternatives to the Booth multiplier. However, these constant-based multipliers lack flexibility, and the constant coefficients are usually no longer programmable after being mapped into hardware. This inflexibility limits their use in ASIC or FPGA implementations where general-purpose multiplication is required.

Up to now, most generic serial-parallel multipliers in previous publications have depended on the radix-4 Booth algorithm. In this paper, we introduce an efficient low-power high-radix algorithm which can reduce the number of partial products to as few as one-fourth of the operand length. Also, by exploiting an internal signed digit representation, we can reduce the power consumption by avoiding the unnecessary inverting operations which occur in the conventional Booth multiplier.

II. BOOTH SERIAL-PARALLEL MULTIPLIER

In multiplication, the n-bit and m-bit input binary operands X and Y can be expressed as:

$$X = \sum_{i=0}^{n-1} x_i 2^i, \qquad Y = \sum_{j=0}^{m-1} y_j 2^j \tag{1}$$

If we split X into r-bit units (digits), then we have,

$$X = \sum_{k=0}^{n/r} (u_k)\, 2^{rk}, \qquad u_k = x_{rk+r-1}\, x_{rk+r-2} \ldots x_{rk} \tag{2}$$

Thus, the multiplication between X and Y is,

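$$
\begin{aligned}
X \times Y &= \sum_{k=0}^{n/r} (u_k)\, 2^{rk} \sum_{j=0}^{m-1} y_j 2^j \\
&= \sum_{k=0}^{n/r} (u_k Y)\, 2^{rk} \\
&= \sum_{k=0}^{n/r} P_k\, 2^{rk}, \qquad P_k = u_k Y
\end{aligned}
\tag{3}
$$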

Therefore, the multiplication can be treated as the accumulation of the n/r partial products P_k. This is called radix-2^r multiplication. It reduces the number of partial products to n/r. We now consider r=3 (radix-8) and r=4 (radix-16).

If we were to choose r=3, the accumulation would require n/3 partial products. The precalculation of u_k×Y can be decomposed into negation and shift operations based on two basic elements Y×1 and Y×3. All P_k can be generated by negating and shifting these two basic elements.
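The digit splitting of Eq. (2) and the accumulation of Eq. (3) can be sketched in a few lines of Python. This is only a behavioural illustration under stated assumptions: it uses plain unsigned r-bit digits exactly as Eq. (2) is written (a Booth recoder would use signed digits instead), and the function names `split_digits` and `radix_multiply` are hypothetical, not taken from the paper.

```python
def split_digits(x, n, r):
    """Split an unsigned n-bit integer x into r-bit digits u_k, least significant first."""
    count = -(-n // r)               # ceil(n / r) digits cover all n bits
    mask = (1 << r) - 1
    return [(x >> (r * k)) & mask for k in range(count)]


def radix_multiply(x, y, n, r):
    """Accumulate the partial products P_k = u_k * Y, one per cycle, as in Eq. (3)."""
    acc = 0
    for k, u_k in enumerate(split_digits(x, n, r)):
        p_k = u_k * y                # premultiplication: the costly step at high radix
        acc += p_k << (r * k)        # each partial product carries weight 2**(r*k)
    return acc
```

For a 16-bit X, `split_digits(x, 16, 4)` returns four digits, so the loop in `radix_multiply` runs four times instead of the sixteen iterations a radix-2 accumulation would need.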

1-4244-1342-7/07/$25.00 ©2007 IEEE

A conventional calculation of 3×Y in radix-8 uses an adder to calculate 2×Y+Y. 2×Y can be easily obtained by shifting Y one bit left. However, the extra adder will considerably increase the critical path delay as well as the power consumption.

In radix-16, the set of basic elements needed to precalculate the partial products is {Y×1, Y×3, Y×5 and Y×7}. This conventionally requires three addition operations, one for each of Y×3, Y×5 and Y×7. This in turn has implications for power and area.

Because of the cost in extra adders, high-radix multiplication is not popular in practice, though it reduces the number of partial products. How to tackle the complexity of additions in premultiplication is a key issue for any high-radix multiplication algorithm.

In this paper, we propose an arithmetic-free high-radix premultiplication serial-parallel multiplier architecture by exploiting binary signed digit (BSD) arithmetic. In the following sections, we will first review background knowledge about BSD arithmetic[8-10], and then introduce our new algorithm.

III. USING BSD ARITHMETIC

A. BSD Number Representation

BSD numbers belong to a more general class of representations known as signed digit number representations. A binary signed digit has three values: {-1, 0, 1}. BSD used to be popular because it enabled LSB-first multiplication. However, it has more recently fallen out of favour because of the extra logic needed to represent numbers (typically 2 bits per binary digit).

One reason why BSD numbers are of great interest is their ability to represent a single number in many different ways. For example, Table I lists several different, yet equivalent, ways to represent the decimal values 1, 3, 5, and 7 as a 4-bit BSD word.

TABLE I. BINARY SIGNED DIGIT EXPRESSION

When using the BSD digit set, two bits in CMOS threshold logic are required to represent the three distinct digits in the BSD domain. There are a few possible coding schemes to map three BSD digits to four states represented by two binary digits.

Among various definitions, some expressions can be defined as unequal-weight (UW), for example, using “11” for -1, “01” for 1, and “00” for 0. Here, the signed digit can be computed by x+ – 2x–, and x– has a higher weight than x+.

Others are equal-weight (EW) expressions. Table II is a typical EW definition for BSD expression. Here, the two bits are coded with equal weight and the signed digit can be computed by x+ – x–. “-1” is expressed by “10”, “1” is expressed by “01”, and “0” is expressed by “00” or “11”. With the BSD coding described in Table II, the negation of a BSD digit can be achieved simply by exchanging the x+ and x– bits, which neatly avoids the signal inversion needed in binary arithmetic. In this paper, we use the equal-weight (EW) expression to define the binary signed digits.
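A small Python sketch may help make the EW coding concrete. It assumes the two-bit codes are written with the x– bit first and the x+ bit second (inferred from “10” meaning -1 and “01” meaning 1), and it stores a whole BSD word as two ordinary integers holding the x+ bits and the x– bits. The names `digit_value`, `word_value` and `negate` are hypothetical.

```python
def digit_value(x_minus, x_plus):
    """Value of one EW-coded digit: x+ minus x-."""
    return x_plus - x_minus


def word_value(plus, minus):
    """Value of a BSD word whose x+ bits form `plus` and whose x- bits form `minus`.

    Summing (x+_i - x-_i) * 2**i over all digit positions gives plus - minus.
    """
    return plus - minus


def negate(plus, minus):
    """Negate a BSD word by exchanging its x+ and x- bit vectors (no inversion)."""
    return minus, plus
```

For example, the word with `plus = 0b1000` and `minus = 0b0001` has value 7, and `negate` turns it into a word of value -7 purely by rewiring, which is the property the proposed multiplier relies on.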

TABLE II. BINARY SIGNED DIGIT IN EW CODING

B. Binary Signed Digit Addition

An advantage of using the BSD numbers in summation is that a carry-free addition rule can be used by exploiting the flexibility in representing numbers[10-12]. The main goal of formulating a carry-free addition rule is to guarantee that the carry-out of a BSD full adder is made independent of the carry-in. In other words, a parallel array of adders can add two N–bit words without carry propagation. A carry-free addition rule for a BSD FA is shown in Table 3.

In the table, x_i and y_i denote the ith digits of the two input words to add. Similarly, x_{i-1} and y_{i-1} denote the (i-1)th digits of the same two words. The symbol c_i denotes the carry-out of the ith BSD FA, and the symbol s_i denotes the intermediate sum of the ith digit, which is obtained without relying on the incoming carry c_{i-1}. Table III lists all nine possible combinations of different pairs of {x_i, y_i} and {x_{i-1}, y_{i-1}} and their intermediate results {c_i, s_i}. We can see from the table that the outputs c_i and s_i have no dependence on the previous carry c_{i-1}. The final sum of the ith digit is obtained by adding the intermediate sum s_i and the incoming carry c_{i-1}.
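The following Python sketch illustrates one widely used carry-free BSD addition rule. Because the contents of Table III are not reproduced in this text, the rule coded here is an assumption (the textbook rule in which the choice of {c_i, s_i} depends only on whether both lower digits are non-negative), and the paper's table may differ in detail. Digits are stored least significant first, and the function names are hypothetical.

```python
def bsd_value(digits):
    """Integer value of a BSD word given as digits in {-1, 0, 1}, LSB first."""
    return sum(d << i for i, d in enumerate(digits))


def bsd_add(x, y):
    """Add two BSD words without carry propagation; returns n+1 digits, LSB first."""
    n = max(len(x), len(y))
    x = list(x) + [0] * (n - len(x))
    y = list(y) + [0] * (n - len(y))
    c = [0] * n                      # carry-out of each position
    s = [0] * n                      # intermediate sum of each position
    for i in range(n):               # every position is independent of c[i-1]
        p = x[i] + y[i]
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if p == 2:
            c[i], s[i] = 1, 0
        elif p == 1:
            c[i], s[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p == 0:
            c[i], s[i] = 0, 0
        elif p == -1:
            c[i], s[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:                        # p == -2
            c[i], s[i] = -1, 0
    # Final digit i is s[i] plus the incoming carry c[i-1]; it always stays in {-1, 0, 1}.
    return [s[0]] + [s[i] + c[i - 1] for i in range(1, n)] + [c[n - 1]]
```

The design point is that `c[i]` and `s[i]` are computed from digits at positions i and i-1 only, so in hardware all positions can be evaluated in parallel and the delay does not grow with the word length.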

TABLE III. CARRY-FREE BSD ADDITION RULE

An apparent drawback of BSD arithmetic is its normal requirement for two bits to represent each binary digit. Most previous reports[9] show that radix-4 BSD arithmetic is not effective in comparison to the binary multiplier. Higher radix algorithms can save area costs, but as we have seen, higher-radix arithmetic has the disadvantage of the delay and area cost of extra adders in its premultiplication stage.

In order to supply a remedy to the above problem, we present an arithmetic-free high-radix premultiplication algorithm, which is based on binary signed digit (BSD) arithmetic.


IV. PROPOSED RADIX-16 BSD MULTIPLICATION

Conventional radix-16 multiplication is not popular because it requires extra adders to precompute Y×3, Y×5, and Y×7, which are the basic shared elements used to form the precomputed partial products PPi. We note first that:

$$
\begin{aligned}
PP_3 &= Y \times 3 = Y \times 4 - Y \\
PP_5 &= Y \times 5 = Y \times 4 + Y \\
PP_7 &= Y \times 7 = Y \times 8 - Y
\end{aligned}
\tag{4}
$$

A neat way to avoid the need for additions or subtractions in the calculation of these three elements, based on the BSD representation in Table II, is to exploit the representation of a BSD number as effectively an implicit subtraction (of x+ and x–). We therefore use the following method to form the shared elements directly:

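$$
\begin{aligned}
PP_3 &= Y \times 3 = 4 \times Y - Y \;\rightarrow\; PP_3^{+} = 4Y,\; PP_3^{-} = Y \\
PP_7 &= Y \times 7 = 8 \times Y - Y \;\rightarrow\; PP_7^{+} = 8Y,\; PP_7^{-} = Y
\end{aligned}
\tag{5}
$$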

It is a little more complex to calculate PP5. The standard sign changing algorithm[3] states:

$$-Y = \overline{Y} + 1 \tag{6}$$

For example, “0111”–“0001” can be transformed as “00111”+“11110”+“00001”, which has a result of “00110” once the carry out of the top bit is discarded. The first bit of each operand is the sign extension. Thus, for the calculation of PP5, we have

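$$
\begin{aligned}
PP_5 &= Y \times 5 = 4 \times Y - (\overline{Y} + 1) \\
&\rightarrow\; PP_5^{+} = 4Y,\; PP_5^{-} = \overline{Y},\ \text{with correction } {-1}
\end{aligned}
\tag{7}
$$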

The extra error correction “-1” can be accommodated in the subsequent adder tree by assigning a “-1” to the first bit of carry in one of the signed digit adders in the adder tree. Thus with the above assignment, we can see that no extra adder is needed to get the shared elements for both radix-8 and radix-16 multiplication. Most elements can be obtained by hardwired connections. The only exception is that an inversion is required in calculating the element 5×Y.
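The assignments in Eqs. (5) and (7) can be checked numerically with a short Python sketch. It is a behavioural check only: Python's `~y` is the two's-complement inversion of Y (so `~y == -y - 1`), shifts stand in for hardwired connections, and the function name `shared_elements` is hypothetical.

```python
def shared_elements(y):
    """Map each odd multiple m to (plus, minus, correction) with m*y == plus - minus + correction."""
    return {
        1: (y, 0, 0),
        3: (y << 2, y, 0),        # 3Y = 4Y - Y
        5: (y << 2, ~y, -1),      # 5Y = 4Y - (~Y) - 1, using -Y = ~Y + 1
        7: (y << 3, y, 0),        # 7Y = 8Y - Y
    }
```

None of the four entries uses an addition or subtraction: each is a shifted copy of Y, Y itself, or the inverted Y, and the subtraction is implied by placing one word on the x+ bits and the other on the x– bits.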

The delay of the premultiplication in the proposed high-radix multiplication is equal to one inverter gate delay, which is insignificant in comparison to the delay in the subsequent accumulation stage. The area cost is also quite low: the only gate logic required is the inverter used in the generation of PP5.

While the proposed new radix-16 algorithm benefits from a fourfold reduction in the number of partial products, it also avoids many of the inverting operations which occur in Booth-type multipliers. In Booth-2, approximately half the partial products are negative and would be generated by inverting, according to the sign changing algorithm (Eq.(6)). In our new algorithm, the negation operation need only exchange PPk+ and PPk−, thus avoiding an inversion in every negation. As we have seen, the above radix-16 algorithm needs one inversion in generating the basic element 5×Y, which on average occurs in only one in eight partial products. Therefore, the proposed algorithm can save many of the signal transitions occurring in Booth binary multipliers.

For example, to perform a 31-bit×31-bit multiplication, Booth-2 must generate 16 partial products, and will on average have to do 8 inversions. The above radix-16 algorithm must generate only 8 partial products, and will statistically need only 1 inversion in so doing. Furthermore, as the radix-8 algorithm does not need PP5, it will require no inversion operations. Thus, the proposed high-radix algorithm should significantly reduce the power consumption resulting from bit transitions, which is a major problem of the Booth multiplier as addressed in many publications [3-7].
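The one-in-eight figure can be reproduced by enumerating every recoding window. The sketch below assumes the standard signed Booth recoding, in which the digit for group k is formed from the r bits of the group plus the top bit of the group below, and it assumes all window patterns are equally likely. The recoding formula is not spelled out in this text, so it is an assumption here, and the names `booth_digit` and `fraction_of_windows` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def booth_digit(bits):
    """Signed radix-2**r Booth digit for bits = (x_{rk-1}, x_{rk}, ..., x_{rk+r-1})."""
    r = len(bits) - 1
    body = sum(b << i for i, b in enumerate(bits[1:r]))   # x_{rk} .. x_{rk+r-2}
    return bits[0] + body - (bits[r] << (r - 1))          # top bit has negative weight


def fraction_of_windows(r, predicate):
    """Fraction of the 2**(r+1) equally likely windows whose digit satisfies predicate."""
    windows = list(product((0, 1), repeat=r + 1))
    hits = sum(1 for w in windows if predicate(booth_digit(w)))
    return Fraction(hits, len(windows))
```

Under these assumptions `fraction_of_windows(4, lambda d: abs(d) == 5)` evaluates to 1/8, and the same query with `r = 3` evaluates to 0, since radix-8 digits never reach magnitude 5.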

Figure 1. Generation of Partial Products Pk

The final accumulation of partial products can be done using a carry-free BSD adder. As described above, a BSD adder can perform the addition of two BSD numbers in a left-to-right or carry-free scheme which requires no carry propagation. This saves further power consumption in the accumulation stage. The implementation of a BSD adder tree is fast and regular and is well suited for VLSI.

Figure 2. Accumulation of radix-16 BSD partial products

Fig. 2 shows the accumulator structure of the radix-16 BSD multiplier. For a 16-bit multiplication, four partial products are generated in sequence, so the whole accumulation needs 4 cycles.
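A behavioural Python model of this serial accumulation is sketched below. It is not the circuit of Fig. 2: ordinary integer addition stands in for the carry-free BSD adder, the signed radix-16 digits (in the range -8 to 8, least significant first) are taken as given, and the helper names `partial_product` and `accumulate` are hypothetical. Each partial product is kept as an x+ word, an x– word and a small correction, and negation is done by swapping the two words.

```python
def partial_product(d, y):
    """Return (plus, minus, correction) with d*y == plus - minus + correction, -8 <= d <= 8."""
    m = abs(d)
    if m == 0:
        return 0, 0, 0
    shift = (m & -m).bit_length() - 1        # power-of-two factor of |d|
    odd = m >> shift                         # shared element needed: 1, 3, 5 or 7
    if odd == 1:
        plus, minus, corr = y, 0, 0
    elif odd == 5:
        plus, minus, corr = y << 2, ~y, -1   # the only element that needs an inversion
    else:                                    # 3 -> 4Y - Y, 7 -> 8Y - Y
        plus, minus, corr = y << (2 if odd == 3 else 3), y, 0
    plus, minus, corr = plus << shift, minus << shift, corr << shift
    if d < 0:                                # negation: exchange the x+ and x- words
        plus, minus, corr = minus, plus, -corr
    return plus, minus, corr


def accumulate(digits, y):
    """Accumulate one radix-16 partial product per cycle and convert the result to binary."""
    acc_plus = acc_minus = acc_corr = 0
    for k, d in enumerate(digits):           # one cycle per digit
        plus, minus, corr = partial_product(d, y)
        acc_plus += plus << (4 * k)
        acc_minus += minus << (4 * k)
        acc_corr += corr << (4 * k)
    return acc_plus - acc_minus + acc_corr   # stands in for the final BSD-to-binary step
```

As a usage example, the digit list `[5, -3, 7, 0]` represents 5 - 3×16 + 7×256 = 1749, so `accumulate([5, -3, 7, 0], y)` returns 1749×y after four loop iterations.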

The final conversion of a BSD number to a normal binary (NB) number limits the speed of the multiplication process because a carry-propagation adder is used for the final conversion. Some attempts have been reported to find a better BSD-to-NB conversion method[10]. These methods speed up the already established conversion technique by employing a fast carry-rippling process of the binary adder in the BSD domain. In this paper, we use on-the-fly conversion as the final output stage if the output is required in binary format.

V. RESULTS AND DISCUSSION

Compared to a conventional Booth-based serial-parallel multiplier, the proposed new algorithm has three main advantages. First, by exploiting BSD, the high-radix algorithm reduces power consumption by avoiding unnecessary inverting operations. Second, it reduces the number of partial products to one quarter (for radix-16). Third, it can use a carry-free BSD adder to accumulate the partial products, which is better than binary adders, which suffer from LSB-to-MSB carry propagation.

Table IV is a comparison between a Booth-2 serial-parallel multiplier and the proposed radix-16 BSD serial-parallel multiplier.

TABLE IV. COMPARISON OF N-BIT×N-BIT MULTIPLICATION

The power estimation of the algorithms can be done at RTL level. The dynamic power dissipation of a multiplier circuit can be described by the following equation[7]:

$$P_{dynamic} = \tfrac{1}{2} \times C_{node} \times N_{dynamic} \times V_{dd}^{2}, \qquad N_{dynamic} = N_{node} \times N_{f} \tag{8}$$

where Cnode is the average capacitance of each node in the circuit, Vdd is the supply voltage, Ndynamic is the number of signal transitions from ‘0’ to ‘1’ in the multiplication, Nnode is the number of nodes in the whole circuit, and Nf is the average number of transitions per node in the computation. While Cnode and Vdd are mainly dependent on the CMOS technology used to fabricate the circuit, RTL-level power estimation can be based on counting Ndynamic over all RTL-level signal switching activities. We designed two 16-bit serial-parallel multipliers, one using the proposed radix-16 algorithm and the other using a conventional Booth-2 method, and compared their power consumption index Ndynamic. Because the switching activity is determined by the algorithm itself, such a technology-free comparison should be fair for algorithm evaluation.
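Counting Ndynamic for a single signal can be sketched as follows. This is a minimal illustration, not the RTL simulation flow used for the results: it assumes a signal is available as a list of successive integer values of a fixed bit width, and the names `rising_transitions` and `dynamic_power` are hypothetical.

```python
def rising_transitions(trace, width):
    """Count 0 -> 1 bit transitions between consecutive values of a width-bit signal."""
    mask = (1 << width) - 1
    total = 0
    for prev, curr in zip(trace, trace[1:]):
        rising = ~prev & curr & mask          # bits that were 0 and are now 1
        total += bin(rising).count("1")
    return total


def dynamic_power(n_dynamic, c_node, v_dd):
    """Evaluate Eq. (8): P = 1/2 * C_node * N_dynamic * V_dd**2."""
    return 0.5 * c_node * n_dynamic * v_dd ** 2
```

Summing `rising_transitions` over every signal in a design gives a technology-independent index that can be compared between two algorithms; `dynamic_power` then scales it by technology-dependent constants.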

Figure 3. RTL-level Power Estimation: new algorithm compared with Booth-2 Serial-Parallel Multiplier

The inputs X and Y are produced by a random number generator in Matlab. We generated two X and Y data series for benchmark simulation. Each series has 64 pairs of X and Y. All X values in both series are 16-bit random numbers. The Y values are 16-bit random numbers in the first data series, and small integers less than 256 in the second data series. Fig. 3 compares the power consumption estimated at RTL level.

The results show that the reduction in inversion operations in generating partial products, to only one-eighth of that of the Booth-2 method, is reflected in the predicted reduction in power consumption. It is also noted that when the Y values are small random numbers, the ratio of signal switching activities Ndynamic can be reduced further, to as little as 55%. In Booth-2, even 2×X can cost a lot of signal inverting operations. On the other hand, the new algorithm saves power by avoiding inverting those zero bits. Since small-number multiplication is the most frequent operation in processors and programs, the new algorithm can be expected to outperform conventional multiplication further.

It should be noted that bit transitions are not the only, or indeed the most significant, measure of power consumption. The additional communication needed with BSD has implications for power consumption, and detailed power estimation studies are currently being carried out.

VI. CONCLUSIONS

In conclusion, the proposed high-radix BSD serial-parallel multiplier can reduce the number of partial product cycles in serial-parallel accumulation to N/4. It has no addition in its high-radix premultiplication, and no carry propagation in its BSD accumulator. The RTL simulation shows that 45% of the power consumption can in some circumstances be saved by our multiplier in comparison with a Booth-2 serial-parallel multiplier. Although there have been several improvements on Booth-2, the proposed algorithm has potential for power reduction for low-power microprocessors, DSP processors, and other SoC applications.

REFERENCES

[1] J. Hennessy and D. A. Patterson, “Computer Architecture: A Quantitative Approach”, Third Edition, Morgan Kaufmann Publishers, 2003;

[2] A.D. Booth, “A Signed Binary Multiplication Technique”, Quarterly J. Mechanics and Applied Mathematics, vol. 4, no. 2, 1951, p.236;

[3] R. F. Woods, S. E. McQuillan, J. Dowling, J. V. McCanny, “High performance DSP ASIC for multiply, divide and square root”, Fifth Annual IEEE International ASIC Conference and Exhibit, 21-25 Sept. 1992, p.209;

[4] Wen-Chang Yeh and Chein-Wei Jen, “High-Speed Booth Encoded Parallel Multiplier Design”, IEEE Transactions on Computers, Vol. 49, No. 7, JULY 2000, p.692;

[5] D. Crookes, M. Jiang, “Using signed digit arithmetic for low-power multiplication”, Electronics Letters, Vol. 43, Issue 11, May 24, 2007, pp.613 – 614;

[6] Hesham A. Al-Twaijry and Michael J. Flynn, “Technology Scaling Effects on Multipliers”, IEEE Transactions on Computers, Vol.47, No.11, Nov. 1998, p.1201;

[7] Yijun Liu, Steve Furber, “The Design of a Low Power Asynchronous Multiplier”, ISLPED’04, August 9–11, 2004, Newport Beach, California, USA;

[8] A. Avizienis,“Signed-digit number representations for fast parallel arithmetic,” IRE Trans. Electron. Comput., vol. EC-10, Sept. 1961, p.380;

[9] Yun Kim, Bang-Sup Song, John Grosspietsch, and Steven F. Gillig, “A Carry-Free 54×54b Multiplier Using Equivalent Bit Conversion Algorithm”, IEEE J. Solid-State Circuits, Vol. 36, No. 10, Oct. 2001, p.1538;

[10] M. D. Ercegovac and T. Lang, “On-the-fly conversion of redundant into conventional representations,” IEEE Transactions on Computers, July 1987, pp. 895–897;
