Abdullah Aldahami (11074595) March 12, 2010


1. Introduction

2. Background

3. Proposed Multiplier Design

a. System Overview

b. Fixed Point Multiplier

c. Master Control Unit

d. Rounding and output formulation

4. Verification and Synthesis Results

5. Conclusion


The paper presents a fully parallel Decimal64 floating point (FP) multiplier compliant with IEEE Std 754-2008 for floating point arithmetic.

The proposed multiplier uses novel methods to achieve low latency. The design is based on a previously published fixed point multiplier that uses a novel BCD–4221 recoding of the decimal digits to improve the area and latency of the partial product generation and the partial product reduction tree.


Decimal arithmetic has received increased attention in the last decade because of its growing need in many commercial applications and database systems where binary arithmetic is not sufficient. These applications need their arithmetic operations to be executed in decimal format because the mapping between decimal and binary numbers is inexact: some decimal fractions, such as 0.1, cannot be represented exactly in binary format with limited precision.
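The representation problem can be seen in a few lines of Python. This is only an illustration: the standard `decimal` module stands in for decimal hardware here, and the variable names are chosen for this example.

```python
from decimal import Decimal

# Binary floating point: 0.1 has no exact finite binary expansion,
# so the rounding errors of three additions become visible.
binary_sum = 0.1 + 0.1 + 0.1
binary_exact = (binary_sum == 0.3)

# Decimal arithmetic: 0.1 is stored exactly as 1 x 10^-1.
decimal_sum = Decimal("0.1") + Decimal("0.1") + Decimal("0.1")
decimal_exact = (decimal_sum == Decimal("0.3"))

print(binary_exact, decimal_exact)
```

The binary comparison fails while the decimal one holds, which is the behaviour that financial and database workloads cannot tolerate.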

Decimal floating point (DFP) units are expected to be embedded in many processor cores to perform decimal operations faster than software packages and with higher accuracy.

The paper introduces a decimal floating point multiplier based on a radix-10 fixed point multiplier. That fixed point multiplier achieves an efficient implementation by generating the partial products in parallel and then reducing them with a novel carry save addition (CSA) tree, which leaves the result in Carry Save (CS) format.


Decimal multiplication performs the computation:

(P=A×B)

Where A is the multiplicand, B is the multiplier, and P is the product. It is assumed that A and B are each n digits long, so P is at most 2n digits long and must be rounded to fit in a limited precision of n digits.

Several approaches to decimal multiplication have been proposed. The simplest and most straightforward one is to iterate over the digits of the multiplier B and, based on the value of the current digit, add the corresponding multiple of the multiplicand A to an intermediate product. In this approach the multiplicand multiples 2A through 9A must be generated, which costs a large area and delay. The following equation represents this approach to decimal multiplication:

$X_{i+1} = (X_i + A \cdot B_i) \cdot 10^{-1}$

Where $X$ is the partial product, $X_0 = 0$, and $0 \le i \le n-1$.
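The recurrence can be traced in software. The sketch below is a behavioural model only, not the hardware datapath; it assumes B_0 is the least significant digit of B, and the function name `iterative_multiply` is made up for this example. Exact fractions are used so that the division by 10 loses nothing.

```python
from fractions import Fraction

def iterative_multiply(a: int, b: int, n: int) -> int:
    """Digit-serial decimal multiply following X_{i+1} = (X_i + A*B_i) * 10^-1."""
    x = Fraction(0)                      # X_0 = 0
    for i in range(n):                   # 0 <= i <= n-1
        b_i = (b // 10**i) % 10          # i-th digit of B, least significant first
        x = (x + a * b_i) / 10           # add the selected multiple, shift right one digit
    # X_n equals A*B scaled by 10^-n; undo the scaling to recover the integer product.
    return int(x * 10**n)

product = iterative_multiply(123, 456, 3)
```

For A = 123 and B = 456 the intermediate values are 73.8, 68.88 and 56.088, and rescaling the last one by 10^3 gives the product 56088.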


Another approach is to generate a reduced set of multiples, called secondary multiples, and to form any missing multiple by adding two multiples from this secondary set, selected by the value of the current digit of the multiplier B.

This approach replaces the direct generation of all eight multiples (2A through 9A) with one additional addition. The following equation represents this approach to decimal multiplication:

$X_{i+1} = (X_i + A \cdot B'_i + A \cdot B''_i) \cdot 10^{-1}$

Where $A \cdot B'_i$ and $A \cdot B''_i$ are the secondary multiples which together equal the proper primary multiple.
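A small model of the secondary multiple idea is sketched below. The particular secondary set {0, A, 2A, 4A, 5A} and the digit-to-pair table are assumptions made for this example, since the slides do not say which set is used. The sketch also accumulates with left shifts instead of the right-shifting recurrence, which gives the same product up to the 10^-n scaling.

```python
# Assumed secondary set {0, 1, 2, 4, 5} (times A): every digit 0..9 is the
# sum of two members, so 3A, 6A, 7A, 8A and 9A never have to be generated.
SECONDARY_PAIRS = {
    0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (2, 1), 4: (4, 0),
    5: (5, 0), 6: (5, 1), 7: (5, 2), 8: (4, 4), 9: (5, 4),
}

def multiply_with_secondary_multiples(a: int, b: int, n: int) -> int:
    """Multiply using only the secondary multiples plus one addition per digit."""
    multiples = {k: k * a for k in (0, 1, 2, 4, 5)}   # the only precomputed multiples
    product = 0
    for i in range(n):
        digit = (b // 10**i) % 10                     # current multiplier digit
        first, second = SECONDARY_PAIRS[digit]        # B'_i and B''_i
        product += (multiples[first] + multiples[second]) * 10**i
    return product

result = multiply_with_secondary_multiples(123, 456, 3)
```

The trade-off is visible in the loop body: fewer multiples are prepared up front, at the price of adding two terms per multiplier digit instead of one.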

The fixed point multiplication operation consists of three main stages: generation of the partial products, reduction of the partial products to two operands, and a final carry propagate addition.
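The three stages can be modelled as follows. This is a sketch under simplifying assumptions: digits are plain integers in little-endian lists rather than BCD–4221 codes, and the reduction uses a generic decimal 3:2 carry-save step, not the specific tree of the paper. All function names are hypothetical.

```python
def to_digits(value: int, width: int) -> list:
    """Little-endian list of decimal digits."""
    return [(value // 10**k) % 10 for k in range(width)]

def from_digits(digits: list) -> int:
    return sum(d * 10**k for k, d in enumerate(digits))

def csa_3_to_2(x: list, y: list, z: list) -> tuple:
    """Decimal 3:2 carry-save step: three vectors in, sum and carry vectors out."""
    width = len(x)
    sums = [(x[k] + y[k] + z[k]) % 10 for k in range(width)]
    # Each carry moves one position up. The carry out of the top digit is
    # dropped; it is always zero here because the total fits in 2n digits.
    carries = [0] + [(x[k] + y[k] + z[k]) // 10 for k in range(width - 1)]
    return sums, carries

def fixed_point_multiply(a: int, b: int, n: int) -> int:
    width = 2 * n
    # Stage 1: one weighted partial product per multiplier digit.
    partials = [to_digits(a * ((b // 10**i) % 10) * 10**i, width) for i in range(n)]
    # Stage 2: reduce to two vectors without propagating carries.
    while len(partials) > 2:
        s, c = csa_3_to_2(*partials[:3])
        partials = partials[3:] + [s, c]
    while len(partials) < 2:
        partials.append([0] * width)
    # Stage 3: a single carry propagate addition of the two vectors.
    return from_digits(partials[0]) + from_digits(partials[1])

stage_result = fixed_point_multiply(123, 456, 3)
```

For 123 × 456 the reduction leaves the vectors 45088 and 11000, and only the final addition has to propagate carries to reach 56088.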


An IEEE Std 754-2008 DFP number contains a sign bit, an integer significand with a precision of n digits, and a biased exponent. The value of a finite DFP number is:

$D = (-1)^{\text{sign}} \times \text{Significand} \times 10^{E - \text{Bias}}$

Where E is the biased non-negative integer exponent.

The biased exponents in this paper, represented by E, relate to the exponents of IEEE Std 754-2008 by:

$E = e + \text{Bias}$

Where e is the unbiased exponent defined in IEEE Std 754-2008. The significand can be encoded either in binary or in Densely Packed Decimal (DPD), which IEEE Std 754-2008 refers to as the decimal encoding. The exponent must be in the range [Emin, Emax], after biasing.
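The value formula can be checked numerically. The sketch below assumes the decimal64 format, whose exponent bias is 398 when the significand is viewed as an integer; the function name `dfp_value` is made up for this example and exact fractions are used for the result.

```python
from fractions import Fraction

DECIMAL64_BIAS = 398  # decimal64 exponent bias (integer significand view)

def dfp_value(sign: int, significand: int, biased_exponent: int,
              bias: int = DECIMAL64_BIAS) -> Fraction:
    """D = (-1)^sign * Significand * 10^(E - Bias), evaluated exactly."""
    return (-1) ** sign * significand * Fraction(10) ** (biased_exponent - bias)

# Unbiased exponent e = -1, so E = e + Bias = 397: the value is -0.1.
minus_one_tenth = dfp_value(1, 1, 397)
```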


The IEEE Std 754-2008 defines the significand of the decimal floating point number as a non-normalized significand. Thus, the decimal floating-point number may have redundant representations.

For example, the value $320 \times 10^{24}$ may be represented as $320 \times 10^{24}$, $32 \times 10^{25}$, or $3200 \times 10^{23}$. This set of representations for a given decimal floating-point number is called a cohort.
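A cohort can be enumerated by moving factors of ten between the significand and the exponent. The sketch below is a simplified model: it works on unbiased exponents, assumes a precision of n = 16 digits (decimal64), ignores the exponent range limits, and the function name `cohort` is hypothetical.

```python
def cohort(significand: int, exponent: int, n: int = 16) -> list:
    """All (significand, exponent) pairs with the same value and at most n digits."""
    if significand == 0:
        raise ValueError("zero is not handled in this sketch")
    # Strip trailing zeros to reach the member with the largest exponent.
    while significand % 10 == 0:
        significand //= 10
        exponent += 1
    members = []
    # Append zeros one at a time until the significand no longer fits in n digits.
    while significand < 10**n:
        members.append((significand, exponent))
        significand *= 10
        exponent -= 1
    return members

members = cohort(320, 24)
```

With 16 digits of precision, the value in the example above has 15 cohort members, from 32 × 10^25 down to the one whose significand is 32 followed by 14 zeros.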

Because of this possibility of multiple representations, IEEE Std 754-2008 defines a preferred exponent for each arithmetic operation, which for multiplication is:

$PE = E_a + E_b - \text{Bias}$

Where $E_a$ and $E_b$ are the biased exponents of the multiplicand and multiplier operands, respectively.
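The bias is subtracted once because each operand exponent already contains it, so a plain sum would count it twice. A minimal numeric check, assuming the decimal64 bias of 398 and a hypothetical helper name:

```python
BIAS = 398  # decimal64 exponent bias, assumed for this example

def preferred_exponent(ea: int, eb: int, bias: int = BIAS) -> int:
    """PE = Ea + Eb - Bias, with all three exponents in biased form."""
    return ea + eb - bias

# (2 x 10^1) * (3 x 10^2) = 6 x 10^3
ea = 1 + BIAS
eb = 2 + BIAS
pe = preferred_exponent(ea, eb)
unbiased_pe = pe - BIAS
```

Here the unbiased preferred exponent comes out as 3, matching 6 × 10^3.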


a. System Overview

The multiplier reads the two operands and extracts the sign, exponent and significand. The significand is decoded from a densely-packed decimal (DPD) encoding into Binary Coded Decimal (BCD).

The proposed multiplier contains two main paths: the significand path, which generates the output significand of the product, and the exponent path, which generates the output exponent of the product and the corresponding flags. The block diagram of the proposed design is shown in Fig. 1. The fixed point multiplier (FPM) starts operating as soon as the decoded significands become available.


b. Fixed Point Multiplier

The fixed point multiplier consists of three main blocks. The partial product generation block generates multiples of the multiplicand based on the multiplier digits. The CSA reduction tree reduces the generated partial products to two vectors (CS format) to be added. The decimal carry propagation adder adds the two outputs of the CSA tree, generating an intermediate product. The intermediate product is aligned according to the shift amount generated by the master control (MC) unit, and the shifted product is then rounded to fit in the required precision of n digits.

Fig. 2 illustrates a block diagram of the multiplier encoding and the multiple sets selection units.


The output two vectors from the CSA tree are introduced to a novel decimal carry propagation adder to generate the intermediate product. The proposed decimal carry propagation adder is illustrated in Fig. 3.

The carry propagation is based on a Kogge-Stone tree that aims to propagate the carry faster.
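How a Kogge-Stone prefix tree speeds up decimal carry propagation can be sketched in software. This is a generic decimal parallel-prefix adder written for illustration, not the adder of Fig. 3: a digit position generates a carry when its digit sum is at least 10 and propagates one when the sum is exactly 9. The function name and the little-endian digit lists are assumptions of this example.

```python
def kogge_stone_decimal_add(x: list, y: list) -> tuple:
    """Add two equal-length little-endian digit lists; returns (digits, carry_out)."""
    width = len(x)
    # Per-digit generate and propagate signals.
    g = [x[k] + y[k] >= 10 for k in range(width)]
    p = [x[k] + y[k] == 9 for k in range(width)]
    # Prefix levels: the span covered by each (G, P) pair doubles every pass,
    # so all carries are known after about log2(width) levels.
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for k in range(dist, width):
            new_g[k] = g[k] or (p[k] and g[k - dist])
            new_p[k] = p[k] and p[k - dist]
        g, p = new_g, new_p
        dist *= 2
    # g[k] is now the carry out of digit k; the carry into digit k is g[k-1].
    carry_in = [False] + g[:-1]
    digits = [(x[k] + y[k] + carry_in[k]) % 10 for k in range(width)]
    return digits, int(g[-1])

# 999 + 001: the carry ripples through every digit.
sum_digits, carry_out = kogge_stone_decimal_add([9, 9, 9], [1, 0, 0])
```

The point of the structure is that the carry into every digit is available after a logarithmic number of levels, instead of rippling through all 2n digit positions.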


c. Master Control Unit

The exponent of the intermediate product, the shift amount and the sticky counter are calculated in the master control unit in parallel with the fixed-point multiplication. The decimal point is assumed to be in the middle of the intermediate product, which is 2n digits long. Thus n digits of the intermediate product are to the right of the decimal point.

This increases the exponent of the intermediate product (EIP) by n. The EIP is calculated using this equation:

$EIP = E_a + E_b - \text{Bias} + n$
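The extra n compensates for reading the 2n-digit product as a number with n fractional digits. The check below assumes decimal64 parameters (n = 16, bias 398); the function names are hypothetical.

```python
from fractions import Fraction

BIAS = 398  # decimal64 exponent bias, assumed
N = 16      # decimal64 precision in digits, assumed

def intermediate_exponent(ea: int, eb: int, n: int = N, bias: int = BIAS) -> int:
    """EIP = Ea + Eb - Bias + n."""
    return ea + eb - bias + n

def value_from_intermediate(product: int, ea: int, eb: int,
                            n: int = N, bias: int = BIAS) -> Fraction:
    # Decimal point in the middle: the 2n-digit integer product is read as product / 10^n.
    fraction = Fraction(product, 10**n)
    return fraction * Fraction(10) ** (intermediate_exponent(ea, eb, n, bias) - bias)

# (25 x 10^1) * (4 x 10^-3) = 100 x 10^-2 = 1
ea, eb = BIAS + 1, BIAS - 3
value = value_from_intermediate(25 * 4, ea, eb)
```

The value recovered from the point-in-the-middle view is the same as the one obtained from the integer product with the preferred exponent.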

A sticky counter generated by the MC contains the number of digits that must be ORed to generate the sticky bit. To improve the latency of the multiplier, a novel sticky bit generation unit is developed that generates the sticky bit in parallel with the shifter. The sticky counter is used to generate a bit vector of length 2n; the vector has 1’s in the bits corresponding to the digits that will be ORed to generate the sticky bit, and 0’s in the other bits.
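The mask idea can be sketched as follows. The sketch assumes that the digits feeding the sticky bit are the least significant ones of the intermediate product and that index 0 is the least significant digit; the slides do not state the bit ordering, and both function names are made up for this example.

```python
def sticky_mask(sticky_count: int, n: int) -> list:
    """2n-entry 0/1 vector with 1's at the low-order positions (assumed ordering)."""
    return [1 if k < sticky_count else 0 for k in range(2 * n)]

def sticky_bit(product_digits: list, mask: list) -> int:
    """OR together every masked digit; any nonzero masked digit sets the bit."""
    return int(any(m and d != 0 for d, m in zip(product_digits, mask)))

# Intermediate product 056088 as little-endian digits, two digits under the mask.
digits = [8, 8, 0, 6, 5, 0]
bit = sticky_bit(digits, sticky_mask(2, 3))
```

Because the mask depends only on the counter, it can be prepared while the shifter is still working, which is where the latency saving comes from.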


The sticky counter (SC) is calculated early, as it also depends on the leading zeros in the multiplier and multiplicand:

$SC = \max(0,\; n - (LZ_a + LZ_b))$

Where $LZ_a$ and $LZ_b$ are the numbers of leading zeros in the multiplicand and multiplier, respectively.

The sticky counter is decremented twice. This ensures that the round and guard digits are not included in the sticky bit generation.
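A numeric sketch of the counter follows. It assumes that leading zeros are counted in an n-digit significand field and that the two decrements are clamped at zero; the clamping and the function names are assumptions of this example, not details given in the slides.

```python
def leading_zeros(significand: int, n: int) -> int:
    """Number of leading zero digits of a significand held in an n-digit field."""
    return n if significand == 0 else n - len(str(significand))

def sticky_counter(sig_a: int, sig_b: int, n: int) -> tuple:
    """Returns (SC, SC after two decrements), with SC = max(0, n - (LZa + LZb))."""
    lz_total = leading_zeros(sig_a, n) + leading_zeros(sig_b, n)
    sc = max(0, n - lz_total)
    adjusted = max(0, sc - 2)   # two decrements, clamped at zero (assumption)
    return sc, adjusted

full = sticky_counter(10**15, 10**15, 16)   # both operands use all 16 digits
short = sticky_counter(123, 456, 16)        # 13 leading zeros in each operand
```

Two full-width operands give SC = 16, while short operands give SC = 0, because their product already fits in n digits and nothing is shifted out.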


d. Rounding and output formulation

The rounder takes (n+1) digits from the shifted intermediate product (SIP) and the sticky bit. The proposed multiplier supports the five rounding directions listed in IEEE Std 754-2008: Round to Nearest, Ties to Even (RNE); Round to Nearest, Ties Away from Zero (RNA); Round to Positive Infinity (RPI); Round to Negative Infinity (RNI); and Round Toward Zero (RTZ). It also supports two other rounding directions: Round Away from Zero (RAZ) and Round to Nearest, Ties Toward Zero (RNZ).
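The seven rounding directions can be summarised as a decision on whether to increment the n-digit magnitude. The sketch below follows the usual definitions of these modes and is not a transcription of Table I; it assumes the (n+1)-th digit acts as the round digit, and the function name and argument layout are chosen for this example.

```python
def round_increment(mode: str, sign: int, lsd: int, round_digit: int, sticky: bool) -> bool:
    """True if the truncated n-digit magnitude must be incremented by one.

    sign: 0 for positive, 1 for negative; lsd: least significant kept digit;
    round_digit: first discarded digit; sticky: OR of all remaining discarded digits.
    """
    inexact = round_digit != 0 or bool(sticky)
    above_half = round_digit > 5 or (round_digit == 5 and bool(sticky))
    tie = round_digit == 5 and not sticky
    if mode == "RNE":
        return above_half or (tie and lsd % 2 == 1)
    if mode == "RNA":
        return above_half or tie
    if mode == "RNZ":
        return above_half
    if mode == "RPI":
        return inexact and sign == 0
    if mode == "RNI":
        return inexact and sign == 1
    if mode == "RTZ":
        return False
    if mode == "RAZ":
        return inexact
    raise ValueError(f"unknown rounding mode: {mode}")

# A tie (round digit 5, nothing after it) behind an even digit.
tie_cases = {m: round_increment(m, 0, 2, 5, False) for m in ("RNE", "RNA", "RNZ")}
```

On an exact tie behind an even digit, only RNA increments; RNE and RNZ keep the truncated value.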

Table I illustrates the rounding scheme used in our DFP multiplier design.


The multiplier is modeled using RTL VHDL and then it is functionally verified using FPGEN test cases supplied by IBM.

For a large number of random test cases, the expected results are generated using the DecNumber library, which implements the General Decimal Arithmetic Specification in ANSI C.

The multiplier is synthesized using TSMC 0.18 μm technology. The design is synthesized for a large number of pipeline stages to explore latency, area, and delay tradeoffs. This synthesis was performed using the retiming feature. The synthesis results are given in Table II. The proposed DFP multiplier has a low latency and a small area. Fig. 4 illustrates the relation between the area × delay product of our design and area × delay product of the design versus the number of pipelined stages.


The combinational design shows about 17% improvement in area × delay product over the parallel decimal floating point multiplier.

For Decimal128, our decimal floating point multiplier has a delay of 10 ns with an area of 2.4901265 mm². The proposed floating point multiplier for Decimal64 is hardware verified by integrating it into a NIOS II processor on an Altera Cyclone II FPGA development kit.


Here, a fully parallel, pipelined decimal floating point multiplier is presented. Several enhancements are used to improve the latency, such as the use of a parallel fixed point multiplier, the generation of the sticky bit in parallel with the shifter, and the use of a fast decimal carry propagation adder.

The multiplier is synthesized in 0.18 μm technology and pipelined for different numbers of stages. The multiplier shows very good performance with respect to delay and area. The multiplier is hardware verified through Altera Cyclone II FPGA testing.
